Integrative analysis of omics data for cancer research

Over the last several years, it has become increasingly common in cancer molecular biology and other fields to have multi-level descriptions of a set of clinical biological samples, including the genome, DNA methylome, transcriptome and miRNAome. In addition, new levels of molecular description have been appearing at a rapid pace, such as various types of histone modification profiles, profiling of the conformational properties of chromosomes and topologically associating domains (Hi-C technology), and quantification of the proteome using mass spectrometry-based techniques. More recently, we have seen the rise of single-cell technologies, which allow the molecular characterization of thousands of single cells (currently mostly the transcriptome, but other types of profiles will certainly become available in the near future). This rapid accumulation of data poses new challenges for our group. The main question is how to exploit the multiple data levels so that they complement each other and allow better prediction of temporal evolution or clinical outcome than any single data type alone.

We believe that the key to achieving this goal is to create consistent mathematical models that connect multiple levels of cell functioning, such that each data type provides information for its own layer of model parameters. Creating such a complete mathematical model from first principles is tremendously challenging. To accelerate progress, we suggest a number of pragmatic phenomenological approaches, based on machine learning and statistical methods, that describe certain biological processes in a more abstract fashion and thereby lump many detailed parameters together. We are currently exploring several such approaches; deconvolution of omics data into mutually independent processes and multiplex network-based clustering are two very promising directions.

We exploit the growing volume of multi-level profiles available through The Cancer Genome Atlas (TCGA) consortium, namely the Pan-Cancer dataset, which currently comprises more than 30 solid cancer types. The TCGA database serves two purposes for us: a) as a large-scale test case for the data integration methods that we suggest; and b) as an estimate of the expected background distribution of omics profiles to which a new dataset can be compared. In this way, exploiting TCGA can increase the statistical power of analyses of proprietary data. Importantly, the pan-cancer nature of the TCGA collection makes it possible to distinguish universal mechanisms of cancer progression (such as proliferation) from tissue-specific ones (such as re-activated differentiation programs), which is crucial for improving our understanding of cancer biology.
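
As an illustration of purpose b), the sketch below scores a new profile against a reference cohort, feature by feature. It is a minimal example under stated assumptions, not the group's actual procedure: the function name `background_scores`, the choice of a median/MAD-based robust z-score together with an empirical percentile, and the tiny synthetic arrays are all hypothetical and stand in for real TCGA-derived matrices.

```python
import numpy as np


def background_scores(reference, new_profile):
    """Compare one new profile to a reference cohort, feature by feature.

    reference   : array of shape (n_reference_samples, n_features)
    new_profile : array of shape (n_features,)
    Returns (robust z-scores, empirical percentiles), each of shape (n_features,).
    """
    reference = np.asarray(reference, dtype=float)
    new_profile = np.asarray(new_profile, dtype=float)

    # Robust location and scale of the background distribution.
    median = np.median(reference, axis=0)
    mad = np.median(np.abs(reference - median), axis=0)
    scale = 1.4826 * mad          # MAD rescaled to match the SD of a Gaussian
    scale[scale == 0] = np.nan    # features with no spread cannot be scored

    z = (new_profile - median) / scale

    # Fraction of reference samples at or below the new value.
    percentile = (reference <= new_profile).mean(axis=0)
    return z, percentile


# Synthetic stand-in for a reference cohort: 5 samples, 2 features.
reference_cohort = np.array([[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]])
new_sample = np.array([3, 100])

z_scores, percentiles = background_scores(reference_cohort, new_sample)
print(z_scores, percentiles)
```

In this toy case the first feature of the new sample sits at the cohort median, while the second lies far above every reference value. With a real cohort, a per-tissue reference would probably be more appropriate than a pooled one for tissue-specific features.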

We apply data integration methods to the study of pediatric cancers. The group has long-term expertise in working with multi-level descriptions of Ewing sarcoma, and in recent years new large cohorts of Ewing sarcoma patients have been profiled for exon mutations, miRNAome and DNA methylation. These data will certainly feed our work on developing integrative methodologies. In addition, the M5 (Mathematical Modeling of Molecular Mechanisms of Medulloblastoma) project has given us early access to a high-quality multi-level description of a cohort of medulloblastoma patients, including quantification of the proteome and phosphoproteome at genome-wide scale. We aim to improve the stratification of medulloblastoma patients using multi-level tumoral profiles, and to obtain insights into the biological mechanisms driving this stratification.

We decipher the properties of the tumoral microenvironment from multi-level descriptions of tumoral samples. We exploit new publicly available single-cell datasets of both the tumoral and stromal components of the tumoral microenvironment in order to better characterize the mutual influence of tumoral and stromal cells, with a particular focus on the role of immune-related processes. With this general perspective in mind, we investigate the role of cancer-associated fibroblasts (CAFs) in shaping the immune response within tumors. In addition to public data, we have early access, through our collaborators, to FACS-sorted and single-cell data on CAFs in breast and ovarian cancer, as well as ATAC-seq profiles of different CAF subtypes. These data will potentially allow us to infer the transcriptional programs shaping the function of CAFs inside the tumoral microenvironment, including their immunogenic function.

Our group has developed a number of pioneering approaches for integrative omics data analysis, in which several types of data are combined within one computational method.

We developed a number of advanced methods for NGS data analysis that help to better interpret sequencing results. Control-FREEC is a continuation of the successful FREEC pipeline for assessing copy number profiles, and includes the detection of LOH profiles from sequencing data (Boeva et al, Bioinformatics, 2012a). The Nebula web server, based on the Galaxy open-source platform, was developed for user-friendly analysis of ChIP-seq data, including de novo discovery of sequence motifs (Boeva et al, Bioinformatics, 2012b). The SV-Bay tool was developed for the analysis of paired-end data in order to detect structural variants in the genome while taking copy number changes into account (Iakovishina et al, Bioinformatics, 2016). The HMCan and HMCan-diff tools were developed to quantify chromatin modifications in cancer while taking copy number changes into account (Ashoor et al, Bioinformatics, 2013; Ashoor et al, Nucleic Acids Res., 2017).

We investigated the global connections between the expression of miRNAs and mRNAs and their functional effect. In particular, we suggested new methods for quantifying the regulatory effect of miRNAs on the transcriptome and applied them to triple-negative breast cancer and Ewing sarcoma (Martignetti et al, BMC Genomics, 2015; Martignetti et al, PLoS One, 2012). In addition, kinetic signatures of miRNA action were suggested in order to connect the dynamical properties of miRNA, mRNA and protein molecules (Morozova et al, RNA, 2012; Zinovyev et al, Adv Exp Med Biol., 2013).

We developed a novel methodology of integrative data analysis based on an existing statistical method of matrix factorization, Independent Component Analysis (ICA) (Kairov et al, BMC Genomics, 2017; Zinovyev et al, Biochem Biophys Res Commun., 2013). The methodology was successfully applied to deconvolute the bladder cancer transcriptome and to compare the revealed biological mechanisms with those of other cancer types (Biton et al, Cell Rep., 2014, see also Figure below). This provided insights into the molecular subtyping of bladder cancer and suggested a particular role for a differentiation-related transcription factor (PPARG), which was experimentally validated. The ICA-based analysis involved the integration of multiple omics data types, including copy number profiles and histopathological imaging data.
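
To make the idea of ICA-based deconvolution concrete, the following sketch recovers hidden signals from observed mixtures using a plain NumPy version of the symmetric FastICA fixed-point iteration with a tanh nonlinearity. It is a toy illustration under stated assumptions and not the pipeline used in the cited publications: the function names (`whiten`, `sym_decorrelate`, `fast_ica`), the choice of nonlinearity, the number of components and the synthetic data are all hypothetical. Rows of the input play the role of observed profiles, and columns play the role of measured features.

```python
import numpy as np


def whiten(X, n_components):
    """Center each row of X, keep the top principal directions and rescale
    them to unit variance. X has shape (n_profiles, n_features)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    cov = Xc @ Xc.T / Xc.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    eigvals = eigvals[-n_components:]
    eigvecs = eigvecs[:, -n_components:]
    K = (eigvecs / np.sqrt(eigvals)).T          # whitening matrix
    return K @ Xc, K


def sym_decorrelate(W):
    """Return (W W^T)^(-1/2) W, which makes the rows of W orthonormal."""
    eigvals, eigvecs = np.linalg.eigh(W @ W.T)
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.T @ W


def fast_ica(X, n_components, n_iter=500, tol=1e-8, seed=0):
    """Symmetric FastICA with a tanh nonlinearity.

    Returns (components, unmixing) with components = unmixing @ centered X.
    """
    Z, K = whiten(X, n_components)
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.standard_normal((n_components, n_components)))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        # Fixed-point update: E[z g(w.z)] - E[g'(w.z)] w for every row w of W.
        W_new = G @ Z.T / Z.shape[1] - np.diag((1.0 - G**2).mean(axis=1)) @ W
        W_new = sym_decorrelate(W_new)
        # Converged when each new row is (anti)parallel to the previous one.
        change = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0))
        W = W_new
        if change < tol:
            break
    return W @ Z, W @ K


# Synthetic example: three non-Gaussian hidden signals mixed into six profiles.
rng = np.random.default_rng(1)
n_features = 5000
sources = np.vstack([
    rng.laplace(size=n_features),
    rng.uniform(-1.0, 1.0, size=n_features),
    rng.exponential(size=n_features),
])
mixing = rng.standard_normal((6, 3))
profiles = mixing @ sources

components, unmixing = fast_ica(profiles, n_components=3)

# Absolute correlation between each true signal (rows) and each estimate (columns).
match = np.abs(np.corrcoef(sources, components)[:3, 3:])
print(match.round(2))
```

ICA recovers components only up to order, sign and scale, which is why the comparison above uses absolute correlations rather than direct equality. On real omics matrices, the number of components has to be chosen and the stability of the components across runs assessed, which this sketch does not attempt.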