Difference between revisions of "CSB Assignment Week 6"
m |
m (→MARA) |
||
(11 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
<div id="CSB"> | <div id="CSB"> | ||
<div class="b1"> | <div class="b1"> | ||
− | Assignments for Week 6 | + | Assignments for Week 6<br/> |
+ | <span style="font-size: 70%">Gene Regulatory Networks revisited</span> | ||
</div> | </div> | ||
+ | <table style="width:100%;"><tr> | ||
+ | <td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[CSB_Assignment_Week_5|< Assignment 5]]</td> | ||
+ | <td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[CSB_Assignment_Week_7|Assignment 7 >]]</td> | ||
+ | </tr></table> | ||
{{Active}} | {{Active}} | ||
+ | <!-- | ||
Exercises for this week relate to this week's lecture.<br /> | Exercises for this week relate to this week's lecture.<br /> | ||
Pre-reading for this week will prepare next week's lecture.<br /> | Pre-reading for this week will prepare next week's lecture.<br /> | ||
Exercises and pre-reading will be topics on next week's quiz. | Exercises and pre-reading will be topics on next week's quiz. | ||
+ | --> | ||
+ | |||
+ | __TOC__ | ||
+ | |||
+ | |||
+ | ==Context== | ||
+ | |||
+ | One of the interesting parts of the Mogrify workflow is the use of a network weighting method, based on STRING and GRN networks - the ''network-based sphere of influence''. The idea behind this is that effects of genes '''propagate''' a certain distance through networks. Such network-based analytics are systems biology methods ''par excellence''. In our workflow, transcription factors are ranked, based upon how many differentially expressed genes they are associated with. | ||
+ | |||
+ | From your iGraph tutorial, you will recall that networks can be constructed from adjacency matrices, or from edge lists. Whatever the source is: if we want to build a network, we need to define the nodes, and we need to define when to posit edges between the nodes. This seems quite straightforward for STRING - we can download the whole database as an edge-list. TFs are nodes, the neighbourhood of one node is quickly determined from the edges provided by STRING, and we can easily evaluate the DESeq results for each neighbour. But how is the MARA network constructed? | ||
+ | |||
+ | |||
+ | ===MARA=== | ||
+ | |||
+ | You will recall that we had a long discussion last Tuesday about MARA. Rackham ''et al''. state: "MARA provides protein-DNA interactions for transcription factors with known binding sites in the promoter regions of a gene." (online Methods, Step 3.){{#pmid:26780608|rackham_2016}} Nodes are presumably genes. But what exactly are the edges? The Rackham paper does not say. The initial iteration of the Ontoscope workflow assumed that a MARA edge list was available for download. But matters are more complicated. | ||
+ | |||
+ | The MARA algorithm was described in detail in 2009{{#pmid:19377474|FANTOM_2009}}, in the paper's <span class="PDFlink">[http://www.nature.com/ng/journal/v41/n5/extref/ng.375-S1.pdf Supplementary Information]</span>. Fundamentally, known TF binding sites are sought in the promoter regions of differentially expressed genes. The construction of the Weight Matrices to identify the binding sites is an involved procedure to begin with. But the core of the procedure is to identify "motif activities" - i.e. the contribution of a single TF/motif interaction to the observed expression change in a sample. The procedure is complex and is not described in enough detail to be reproduced. The end result is a z-value, which could be interpreted as a probability that the expression change is actually due to a particular TF. | ||
+ | |||
+ | :The core network was constructed by first selecting all predicted regulatory interactions (z-value at least 1.5) between core motifs and promoters that are associated with a gene which is a TF that in turn is associated with a core motif. This set of predicted regulatory interactions was then filtered by choosing only interactions that have independent experimental support of at least one of the following types. 1) The regulatory interaction has been reported in the literature 2) There is a ChIP-chip experiment in which binding of one of the TFs associated with the motif to the promoter of the target gene has been reported. 3) In our siRNA experiments the target promoter is observed to be perturbed in expression (B-statistic larger than zero) after knockdown of a TF associated with the motif. | ||
+ | |||
+ | The bottom line for this is: it seems implausible to reproduce this procedure within the limited scope of our project. We could make use of the [https://ismara.unibas.ch/fcgi/mara ISMARA server]{{#pmid:24515121|balwierz_2014}}. However, this requires upload of whole expression profiles to the SIB servers and extensive postprocessing of results. By all means - we should pursue this, not least to be able to compare individual results, but it should '''not''' be on the critical path of our project. | ||
+ | |||
+ | Moreover - and I consider this a big downside to the procedure - the MARA network is constructed separately for each DESeq result set; it cannot be precomputed. | ||
+ | |||
+ | '''Finally, you should note that there is a certain tautology in using expression data to predict a network, and then using that network to explain the expression data. These cannot be considered informationally orthogonal.''' | ||
+ | |||
+ | ===Alternatives=== | ||
+ | But is it necessary to use MARA? We note that the careful quantitative analysis of motif activities is not actually used, other than to define network edges. These edges are not considered "weighted" edges in the GRN (Gene Regulatory Network) graph. Why not work with static graphs based on ENCODE data or similar instead, and rely on the differential expression of the neighbourhood to provide the correct ranking? Is MARA really better? | ||
+ | |||
+ | ;Here is where you come in. We will analyze and evaluate the procedures that are currently available to build TF target lists or GRNs. | ||
+ | Here is a short, recent overview of the methods: | ||
+ | {{#pmid: 25937810}} | ||
+ | |||
+ | |||
+ | And here are recent papers in the field. | ||
+ | |||
+ | {{#pmid: 24137002}} | ||
+ | (Note: Bioconductor package available.) | ||
+ | |||
+ | {{#pmid: 24511376}} | ||
+ | (Note: [http://wiki.c2b2.columbia.edu/califanolab/index.php/Software/ARACNE source code available].) | ||
+ | |||
+ | {{#pmid: 25791631}} | ||
+ | |||
+ | {{#pmid: 25904632}} | ||
+ | |||
+ | {{#pmid: 25979476}} | ||
+ | |||
+ | {{#pmid: 26056275}} | ||
+ | |||
+ | {{#pmid: 26066708}} | ||
+ | |||
+ | {{#pmid: 26164700}} | ||
+ | (Note: MARA authors) | ||
+ | |||
+ | {{#pmid: 26235087}} | ||
+ | |||
+ | {{#pmid: 26393364}} | ||
+ | |||
+ | {{#pmid: 26424082}} | ||
+ | |||
+ | {{#pmid: 26586801}} | ||
+ | |||
+ | {{#pmid: 26823190}} | ||
+ | |||
+ | {{#pmid: 26862054}} | ||
+ | |||
+ | {{#pmid: 26864687}} | ||
+ | (Note: [https://github.com/omranian/inference-of-GRN-using-Fused-LASSO R code available].) | ||
+ | |||
+ | {{#pmid: 26888907}} | ||
+ | |||
− | |||
− | |||
− | |||
<!-- {{#lst:Cytoscape|exercises_I}} --> | <!-- {{#lst:Cytoscape|exercises_I}} --> | ||
+ | |||
+ | ==Analyzing GRN construction== | ||
+ | |||
+ | {{task|1= | ||
+ | |||
+ | * Choose one of the papers cited here that provides an exact computational procedure for building a TF target list or a GRN from public data<ref>You ''may'' choose a different paper if you e-mail me the reference AND I approve.</ref>. | ||
+ | * Email me on Monday which paper you have chosen. | ||
+ | * Analyze the approach with an SPN diagram and enough annotation that you could design the algorithm. | ||
+ | * Bring your diagram and annotation to class on Tuesday. Refer to [[Eval_Sessions#Assigned_material|the marking rubrics for Assigned Material]] for how to make this an excellent piece of work. Also: "late rules" like last time: same day but not in class: marks * 0.5, next day: marks * 0.2, the day after: marks * 0.1 - then 0. The diagrams will be marked by me for a maximum of six marks. No quiz. | ||
+ | |||
+ | In class, I would like to compare and contrast approaches. Can yours replace MARA for our purposes? Let's discuss... | ||
+ | |||
+ | }} | ||
+ | |||
+ | <!-- {{#lst:Interactome|reading}} --> | ||
− | + | {{Vspace}} | |
+ | {{#lst:CSB_Assignment_Week_1|assignment_footer}} | ||
− | |||
+ | <table style="width:100%;"><tr> | ||
+ | <td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[CSB_Assignment_Week_5|< Assignment 5]]</td> | ||
+ | <td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[CSB_Assignment_Week_7|Assignment 7 >]]</td> | ||
+ | </tr></table> | ||
[[Category:Computational_Systems_Biology]] | [[Category:Computational_Systems_Biology]] | ||
</div> | </div> |
Latest revision as of 15:07, 7 March 2016
Assignments for Week 6
Gene Regulatory Networks revisited
Note! This assignment is currently active. All significant changes will be announced on the mailing list.
Context
One of the interesting parts of the Mogrify workflow is the use of a network weighting method, based on STRING and GRN networks - the network-based sphere of influence. The idea behind this is that effects of genes propagate a certain distance through networks. Such network-based analytics are systems biology methods par excellence. In our workflow, transcription factors are ranked, based upon how many differentially expressed genes they are associated with.
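To make the idea of propagation over "a certain distance" concrete, here is a minimal sketch in Python. It is not the Mogrify weighting: the 1/distance weight, the function names and the toy network are assumptions chosen only for illustration. Each TF collects the differentially expressed genes within a fixed number of steps, and nearer genes count for more.

```python
from collections import deque


def sphere_of_influence(edges, source, max_dist):
    """Return {node: distance} for all nodes within max_dist steps of source."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)  # edges are treated as undirected

    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_dist:
            continue  # do not expand beyond the chosen radius
        for nb in adjacency.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    del dist[source]  # the TF itself is not part of its own neighbourhood
    return dist


def influence_score(edges, tf, de_genes, max_dist=2):
    """Sum 1/distance over differentially expressed genes in reach.

    The 1/distance weight is a hypothetical choice for this sketch.
    """
    reach = sphere_of_influence(edges, tf, max_dist)
    return sum(1.0 / d for gene, d in reach.items() if gene in de_genes)


# Toy chain: TF1 - A - B - C - TF2, with A and B differentially expressed.
toy_edges = [("TF1", "A"), ("A", "B"), ("B", "C"), ("TF2", "C")]
toy_de = {"A", "B"}
toy_scores = {tf: influence_score(toy_edges, tf, toy_de) for tf in ("TF1", "TF2")}
```

In this toy chain TF1 reaches both differentially expressed genes within two steps, while TF2 reaches only B, so TF1 ranks higher. Changing `max_dist` changes how far effects are allowed to propagate.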
From your iGraph tutorial, you will recall that networks can be constructed from adjacency matrices, or from edge lists. Whatever the source is: if we want to build a network, we need to define the nodes, and we need to define when to posit edges between the nodes. This seems quite straightforward for STRING - we can download the whole database as an edge-list. TFs are nodes, the neighbourhood of one node is quickly determined from the edges provided by STRING, and we can easily evaluate the DESeq results for each neighbour. But how is the MARA network constructed?
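As a reminder of how the two representations relate, the following Python sketch converts a small symmetric adjacency matrix into an edge list, looks up the direct neighbours of each TF, and counts how many of them are differentially expressed. The node names, the matrix and the set of differentially expressed genes are made up for illustration; with real data the edge list would come from a STRING download and the gene set from the DESeq results.

```python
def edges_from_adjacency(names, matrix):
    """List each undirected edge once from a symmetric 0/1 adjacency matrix."""
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle only
            if matrix[i][j]:
                edges.append((names[i], names[j]))
    return edges


def neighbours(edges, node):
    """Return the set of nodes that share an edge with node."""
    result = set()
    for a, b in edges:
        if a == node:
            result.add(b)
        elif b == node:
            result.add(a)
    return result


def rank_by_de_neighbours(edges, tfs, de_genes):
    """Rank TFs by the number of differentially expressed direct neighbours."""
    counts = {tf: len(neighbours(edges, tf) & de_genes) for tf in tfs}
    # highest count first; ties are broken alphabetically
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical four-node network.
toy_names = ["TF1", "TF2", "G1", "G2"]
toy_matrix = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
]
toy_edge_list = edges_from_adjacency(toy_names, toy_matrix)
toy_ranking = rank_by_de_neighbours(toy_edge_list, ["TF1", "TF2"], {"G1", "G2"})
```

Both representations carry the same information; the edge list is simply more compact for sparse networks such as STRING.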
MARA
You will recall that we had a long discussion last Tuesday about MARA. Rackham et al. state: "MARA provides protein-DNA interactions for transcription factors with known binding sites in the promoter regions of a gene." (online Methods, Step 3.)[1] Nodes are presumably genes. But what exactly are the edges? The Rackham paper does not say. The initial iteration of the Ontoscope workflow assumed that a MARA edge list was available for download. But matters are more complicated.
The MARA algorithm was described in detail in 2009[2], in the paper's Supplementary Information. Fundamentally, known TF binding sites are sought in the promoter regions of differentially expressed genes. The construction of the Weight Matrices to identify the binding sites is an involved procedure to begin with. But the core of the procedure is to identify "motif activities" - i.e. the contribution of a single TF/motif interaction to the observed expression change in a sample. The procedure is complex and is not described in enough detail to be reproduced. The end result is a z-value, which could be interpreted as a probability that the expression change is actually due to a particular TF.
- The core network was constructed by first selecting all predicted regulatory interactions (z-value at least 1.5) between core motifs and promoters that are associated with a gene which is a TF that in turn is associated with a core motif. This set of predicted regulatory interactions was then filtered by choosing only interactions that have independent experimental support of at least one of the following types. 1) The regulatory interaction has been reported in the literature 2) There is a ChIP-chip experiment in which binding of one of the TFs associated with the motif to the promoter of the target gene has been reported. 3) In our siRNA experiments the target promoter is observed to be perturbed in expression (B-statistic larger than zero) after knockdown of a TF associated with the motif.
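The selection rule quoted above is the one part of the procedure that is easy to sketch. The Python below takes the z-values as given (computing them is the hard part) and applies only the filter: z-value of at least 1.5, a target that is itself a TF associated with a core motif, and at least one of the three kinds of experimental support. The record layout, field names and toy values are assumptions for illustration and are not taken from the FANTOM data files.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interaction:
    """One predicted motif -> promoter interaction (hypothetical record layout)."""
    motif: str
    target: str
    z_value: float
    in_literature: bool = False
    chip_chip: bool = False
    sirna_b_statistic: Optional[float] = None  # None: no knockdown data


def has_experimental_support(ix):
    """True if at least one of the three evidence types applies."""
    sirna = ix.sirna_b_statistic is not None and ix.sirna_b_statistic > 0
    return ix.in_literature or ix.chip_chip or sirna


def core_network(interactions, core_tf_genes, z_cutoff=1.5):
    """Keep interactions that pass the z cutoff, target a core TF gene,
    and have independent experimental support."""
    return [
        ix for ix in interactions
        if ix.z_value >= z_cutoff
        and ix.target in core_tf_genes
        and has_experimental_support(ix)
    ]


# Toy data: only the first and third interactions should survive.
toy_interactions = [
    Interaction("M1", "TFB", 2.0, in_literature=True),
    Interaction("M1", "TFC", 1.2, chip_chip=True),            # z too low
    Interaction("M2", "TFB", 1.5, sirna_b_statistic=0.8),
    Interaction("M2", "GENE_X", 3.0, in_literature=True),     # target is not a core TF
    Interaction("M3", "TFC", 2.5, sirna_b_statistic=-0.3),    # no support
]
toy_core_edges = [
    (ix.motif, ix.target)
    for ix in core_network(toy_interactions, core_tf_genes={"TFB", "TFC"})
]
```

Note that the surviving edges carry no weight: once an interaction passes the filter, its z-value is no longer used.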
The bottom line for this is: it seems implausible to reproduce this procedure within the limited scope of our project. We could make use of the ISMARA server[3]. However, this requires upload of whole expression profiles to the SIB servers and extensive postprocessing of results. By all means - we should pursue this, not least to be able to compare individual results, but it should not be on the critical path of our project.
Moreover - and I consider this a big downside to the procedure - the MARA network is constructed separately for each DESeq result set; it cannot be precomputed.
Finally, you should note that there is a certain tautology in using expression data to predict a network, and then using that network to explain the expression data. These cannot be considered informationally orthogonal.
Alternatives
But is it necessary to use MARA? We note that the careful quantitative analysis of motif activities is not actually used, other than to define network edges. These edges are not considered "weighted" edges in the GRN (Gene Regulatory Network) graph. Why not work with static graphs based on ENCODE data or similar instead, and rely on the differential expression of the neighbourhood to provide the correct ranking? Is MARA really better?
- Here is where you come in. We will analyze and evaluate the procedures that are currently available to build TF target lists or GRNs.
Here is a short, recent overview of the methods:
Liu (2015) Reverse Engineering of Genome-wide Gene Regulatory Networks from Gene Expression Data. Curr Genomics 16:3-22. (pmid: 25937810) |
[ PubMed ] [ DOI ] Transcriptional regulation plays vital roles in many fundamental biological processes. Reverse engineering of genome-wide regulatory networks from high-throughput transcriptomic data provides a promising way to characterize the global scenario of regulatory relationships between regulators and their targets. In this review, we summarize and categorize the main frameworks and methods currently available for inferring transcriptional regulatory networks from microarray gene expression profiling data. We overview each of strategies and introduce representative methods respectively. Their assumptions, advantages, shortcomings, and possible improvements and extensions are also clarified and commented. |
And here are recent papers in the field.
Diez et al. (2014) Systematic identification of transcriptional regulatory modules from protein-protein interaction networks. Nucleic Acids Res 42:e6. (pmid: 24137002) |
[ PubMed ] [ DOI ] Transcription factors (TFs) combine with co-factors to form transcriptional regulatory modules (TRMs) that regulate gene expression programs with spatiotemporal specificity. Here we present a novel and generic method (rTRM) for the reconstruction of TRMs that integrates genomic information from TF binding, cell type-specific gene expression and protein-protein interactions. rTRM was applied to reconstruct the TRMs specific for embryonic stem cells (ESC) and hematopoietic stem cells (HSC), neural progenitor cells, trophoblast stem cells and distinct types of terminally differentiated CD4(+) T cells. The ESC and HSC TRM predictions were highly precise, yielding 77 and 96 proteins, of which ∼75% have been independently shown to be involved in the regulation of these cell types. Furthermore, rTRM successfully identified a large number of bridging proteins with known roles in ESCs and HSCs, which could not have been identified using genomic approaches alone, as they lack the ability to bind specific DNA sequences. This highlights the advantage of rTRM over other methods that ignore PPI information, as proteins need to interact with other proteins to form complexes and perform specific functions. The prediction and experimental validation of the co-factors that endow master regulatory TFs with the capacity to select specific genomic sites, modulate the local epigenetic profile and integrate multiple signals will provide important mechanistic insights not only into how such TFs operate, but also into abnormal transcriptional states leading to disease. |
(Note: Bioconductor package available.)
Jang et al. (2013) hARACNe: improving the accuracy of regulatory model reverse engineering via higher-order data processing inequality tests. Interface Focus 3:20130011. (pmid: 24511376) |
[ PubMed ] [ DOI ] A key goal of systems biology is to elucidate molecular mechanisms associated with physiologic and pathologic phenotypes based on the systematic and genome-wide understanding of cell context-specific molecular interaction models. To this end, reverse engineering approaches have been used to systematically dissect regulatory interactions in a specific tissue, based on the availability of large molecular profile datasets, thus improving our mechanistic understanding of complex diseases, such as cancer. In this paper, we introduce high-order Algorithm for the Reconstruction of Accurate Cellular Network (hARACNe), an extension of the ARACNe algorithm for the dissection of transcriptional regulatory networks. ARACNe uses the data processing inequality (DPI), from information theory, to detect and prune indirect interactions that are unlikely to be mediated by an actual physical interaction. Whereas ARACNe considers only first-order indirect interactions, i.e. those mediated by only one extra regulator, hARACNe considers a generalized form of indirect interactions via two, three or more other regulators. We show that use of higher-order DPI resulted in significantly improved performance, based on transcription factor (TF)-specific ChIP-chip data, as well as on gene expression profile following RNAi-mediated TF silencing. |
(Note: source code available.)
Blatti et al. (2015) Integrating motif, DNA accessibility and gene expression data to build regulatory maps in an organism. Nucleic Acids Res 43:3998-4012. (pmid: 25791631) |
[ PubMed ] [ DOI ] Characterization of cell type specific regulatory networks and elements is a major challenge in genomics, and emerging strategies frequently employ high-throughput genome-wide assays of transcription factor (TF) to DNA binding, histone modifications or chromatin state. However, these experiments remain too difficult/expensive for many laboratories to apply comprehensively to their system of interest. Here, we explore the potential of elucidating regulatory systems in varied cell types using computational techniques that rely on only data of gene expression, low-resolution chromatin accessibility, and TF-DNA binding specificities ('motifs'). We show that static computational motif scans overlaid with chromatin accessibility data reasonably approximate experimentally measured TF-DNA binding. We demonstrate that predicted binding profiles and expression patterns of hundreds of TFs are sufficient to identify major regulators of ∼200 spatiotemporal expression domains in the Drosophila embryo. We are then able to learn reliable statistical models of enhancer activity for over 70 expression domains and apply those models to annotate domain specific enhancers genome-wide. Throughout this work, we apply our motif and accessibility based approach to comprehensively characterize the regulatory network of fruitfly embryonic development and show that the accuracy of our computational method compares favorably to approaches that rely on data from many experimental assays. |
Medina-Rivera et al. (2015) RSAT 2015: Regulatory Sequence Analysis Tools. Nucleic Acids Res 43:W50-6. (pmid: 25904632) |
[ PubMed ] [ DOI ] RSAT (Regulatory Sequence Analysis Tools) is a modular software suite for the analysis of cis-regulatory elements in genome sequences. Its main applications are (i) motif discovery, appropriate to genome-wide data sets like ChIP-seq, (ii) transcription factor binding motif analysis (quality assessment, comparisons and clustering), (iii) comparative genomics and (iv) analysis of regulatory variations. Nine new programs have been added to the 43 described in the 2011 NAR Web Software Issue, including a tool to extract sequences from a list of coordinates (fetch-sequences from UCSC), novel programs dedicated to the analysis of regulatory variants from GWAS or population genomics (retrieve-variation-seq and variation-scan), a program to cluster motifs and visualize the similarities as trees (matrix-clustering). To deal with the drastic increase of sequenced genomes, RSAT public sites have been reorganized into taxon-specific servers. The suite is well-documented with tutorials and published protocols. The software suite is available through Web sites, SOAP/WSDL Web services, virtual machines and stand-alone programs at http://www.rsat.eu/. |
Nicolle et al. (2015) CoRegNet: reconstruction and integrated analysis of co-regulatory networks. Bioinformatics 31:3066-8. (pmid: 25979476) |
[ PubMed ] [ DOI ] UNLABELLED: CoRegNet is an R/Bioconductor package to analyze large-scale transcriptomic data by highlighting sets of co-regulators. Based on a transcriptomic dataset, CoRegNet can be used to: reconstruct a large-scale co-regulatory network, integrate regulation evidences such as transcription factor binding sites and ChIP data, estimate sample-specific regulator activity, identify cooperative transcription factors and analyze the sample-specific combinations of active regulators through an interactive visualization tool. In this study CoRegNet was used to identify driver regulators of bladder cancer. AVAILABILITY: CoRegNet is available at http://bioconductor.org/packages/CoRegNet CONTACT: remy.nicolle@issb.genopole.fr or mohamed.elati@issb.genopole.fr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
Jiang et al. (2015) Inference of transcriptional regulation in cancers. Proc Natl Acad Sci U.S.A 112:7731-6. (pmid: 26056275) |
[ PubMed ] [ DOI ] Despite the rapid accumulation of tumor-profiling data and transcription factor (TF) ChIP-seq profiles, efforts integrating TF binding with the tumor-profiling data to understand how TFs regulate tumor gene expression are still limited. To systematically search for cancer-associated TFs, we comprehensively integrated 686 ENCODE ChIP-seq profiles representing 150 TFs with 7484 TCGA tumor data in 18 cancer types. For efficient and accurate inference on gene regulatory rules across a large number and variety of datasets, we developed an algorithm, RABIT (regression analysis with background integration). In each tumor sample, RABIT tests whether the TF target genes from ChIP-seq show strong differential regulation after controlling for background effect from copy number alteration and DNA methylation. When multiple ChIP-seq profiles are available for a TF, RABIT prioritizes the most relevant ChIP-seq profile in each tumor. In each cancer type, RABIT further tests whether the TF expression and somatic mutation variations are correlated with differential expression patterns of its target genes across tumors. Our predicted TF impact on tumor gene expression is highly consistent with the knowledge from cancer-related gene databases and reveals many previously unidentified aspects of transcriptional regulation in tumor progression. We also applied RABIT on RNA-binding protein motifs and found that some alternative splicing factors could affect tumor-specific gene expression by binding to target gene 3'UTR regions. Thus, RABIT (rabit.dfci.harvard.edu) is a general platform for predicting the oncogenic role of gene expression regulators. |
Han et al. (2015) TRRUST: a reference database of human transcriptional regulatory interactions. Sci Rep 5:11432. (pmid: 26066708) |
[ PubMed ] [ DOI ] The reconstruction of transcriptional regulatory networks (TRNs) is a long-standing challenge in human genetics. Numerous computational methods have been developed to infer regulatory interactions between human transcriptional factors (TFs) and target genes from high-throughput data, and their performance evaluation requires gold-standard interactions. Here we present a database of literature-curated human TF-target interactions, TRRUST (transcriptional regulatory relationships unravelled by sentence-based text-mining, http://www.grnpedia.org/trrust), which currently contains 8,015 interactions between 748 TF genes and 1,975 non-TF genes. A sentence-based text-mining approach was employed for efficient manual curation of regulatory interactions from approximately 20 million Medline abstracts. To the best of our knowledge, TRRUST is the largest publicly available database of literature-curated human TF-target interactions to date. TRRUST also has several useful features: i) information about the mode-of-regulation; ii) tests for target modularity of a query TF; iii) tests for TF cooperativity of a query target; iv) inferences about cooperating TFs of a query TF; and v) prioritizing associated pathways and diseases with a query TF. We observed high enrichment of TF-target pairs in TRRUST for top-scored interactions inferred from high-throughput data, which suggests that TRRUST provides a reliable benchmark for the computational reconstruction of human TRNs. |
Pemberton-Ross et al. (2015) ARMADA: Using motif activity dynamics to infer gene regulatory networks from gene expression data. Methods 85:62-74. (pmid: 26164700) |
[ PubMed ] [ DOI ] Analysis of gene expression data remains one of the most promising avenues toward reconstructing genome-wide gene regulatory networks. However, the large dimensionality of the problem prohibits the fitting of explicit dynamical models of gene regulatory networks, whereas machine learning methods for dimensionality reduction such as clustering or principal component analysis typically fail to provide mechanistic interpretations of the reduced descriptions. To address this, we recently developed a general methodology called motif activity response analysis (MARA) that, by modeling gene expression patterns in terms of the activities of concrete regulators, accomplishes dramatic dimensionality reduction while retaining mechanistic biological interpretations of its predictions (Balwierz, 2014). Here we extend MARA by presenting ARMADA, which models the activity dynamics of regulators across a time course, and infers the causal interactions between the regulators that drive the dynamics of their activities across time. We have implemented ARMADA as part of our ISMARA webserver, ismara.unibas.ch, allowing any researcher to automatically apply it to any gene expression time course. To illustrate the method, we apply ARMADA to a time course of human umbilical vein endothelial cells treated with TNF. Remarkably, ARMADA is able to reproduce the complex observed motif activity dynamics using a relatively small set of interactions between the key regulators in this system. In addition, we show that ARMADA successfully infers many of the key regulatory interactions known to drive this inflammatory response and discuss several novel interactions that ARMADA predicts. In combination with ISMARA, ARMADA provides a powerful approach to generating plausible hypotheses for the key interactions between regulators that control gene expression in any system for which time course measurements are available. |
(Note: MARA authors)
Gitter & Bar-Joseph (2016) The SDREM Method for Reconstructing Signaling and Regulatory Response Networks: Applications for Studying Disease Progression. Methods Mol Biol 1303:493-506. (pmid: 26235087) |
[ PubMed ] [ DOI ] The Signaling and Dynamic Regulatory Events Miner (SDREM) is a powerful computational approach for identifying which signaling pathways and transcription factors control the temporal cellular response to a stimulus. SDREM builds end-to-end response models by combining condition-independent protein-protein interactions and transcription factor binding data with two types of condition-specific data: source proteins that detect the stimulus and changes in gene expression over time. Here we describe how to apply SDREM to study human diseases, using epidermal growth factor (EGF) response impacting neurogenesis and Alzheimer's disease as an example. |
Narang et al. (2015) Automated Identification of Core Regulatory Genes in Human Gene Regulatory Networks. PLoS Comput Biol 11:e1004504. (pmid: 26393364) |
[ PubMed ] [ DOI ] Human gene regulatory networks (GRN) can be difficult to interpret due to a tangle of edges interconnecting thousands of genes. We constructed a general human GRN from extensive transcription factor and microRNA target data obtained from public databases. In a subnetwork of this GRN that is active during estrogen stimulation of MCF-7 breast cancer cells, we benchmarked automated algorithms for identifying core regulatory genes (transcription factors and microRNAs). Among these algorithms, we identified K-core decomposition, pagerank and betweenness centrality algorithms as the most effective for discovering core regulatory genes in the network evaluated based on previously known roles of these genes in MCF-7 biology as well as in their ability to explain the up or down expression status of up to 70% of the remaining genes. Finally, we validated the use of K-core algorithm for organizing the GRN in an easier to interpret layered hierarchy where more influential regulatory genes percolate towards the inner layers. The integrated human gene and miRNA network and software used in this study are provided as supplementary materials (S1 Data) accompanying this manuscript. |
Liu et al. (2015) RegNetwork: an integrated database of transcriptional and post-transcriptional regulatory networks in human and mouse. Database (Oxford) 2015:. (pmid: 26424082) |
[ PubMed ] [ DOI ] Transcriptional and post-transcriptional regulation of gene expression is of fundamental importance to numerous biological processes. Nowadays, an increasing amount of gene regulatory relationships have been documented in various databases and literature. However, to more efficiently exploit such knowledge for biomedical research and applications, it is necessary to construct a genome-wide regulatory network database to integrate the information on gene regulatory relationships that are widely scattered in many different places. Therefore, in this work, we build a knowledge-based database, named 'RegNetwork', of gene regulatory networks for human and mouse by collecting and integrating the documented regulatory interactions among transcription factors (TFs), microRNAs (miRNAs) and target genes from 25 selected databases. Moreover, we also inferred and incorporated potential regulatory relationships based on transcription factor binding site (TFBS) motifs into RegNetwork. As a result, RegNetwork contains a comprehensive set of experimentally observed or predicted transcriptional and post-transcriptional regulatory relationships, and the database framework is flexibly designed for potential extensions to include gene regulatory networks for other organisms in the future. Based on RegNetwork, we characterized the statistical and topological properties of genome-wide regulatory networks for human and mouse, we also extracted and interpreted simple yet important network motifs that involve the interplays between TF-miRNA and their targets. In summary, RegNetwork provides an integrated resource on the prior information for gene regulatory relationships, and it enables us to further investigate context-specific transcriptional and post-transcriptional regulatory interactions based on domain-specific experimental data. Database URL: http://www.regnetworkweb.org. |
Kulakovskiy et al. (2016) HOCOMOCO: expansion and enhancement of the collection of transcription factor binding sites models. Nucleic Acids Res 44:D116-25. (pmid: 26586801) |
[ PubMed ] [ DOI ] Models of transcription factor (TF) binding sites provide a basis for a wide spectrum of studies in regulatory genomics, from reconstruction of regulatory networks to functional annotation of transcripts and sequence variants. While TFs may recognize different sequence patterns in different conditions, it is pragmatic to have a single generic model for each particular TF as a baseline for practical applications. Here we present the expanded and enhanced version of HOCOMOCO (http://hocomoco.autosome.ru and http://www.cbrc.kaust.edu.sa/hocomoco10), the collection of models of DNA patterns, recognized by transcription factors. HOCOMOCO now provides position weight matrix (PWM) models for binding sites of 601 human TFs and, in addition, PWMs for 396 mouse TFs. Furthermore, we introduce the largest up to date collection of dinucleotide PWM models for 86 (52) human (mouse) TFs. The update is based on the analysis of massive ChIP-Seq and HT-SELEX datasets, with the validation of the resulting models on in vivo data. To facilitate a practical application, all HOCOMOCO models are linked to gene and protein databases (Entrez Gene, HGNC, UniProt) and accompanied by precomputed score thresholds. Finally, we provide command-line tools for PWM and diPWM threshold estimation and motif finding in nucleotide sequences. |
Affeldt et al. (2016) 3off2: A network reconstruction algorithm based on 2-point and 3-point information statistics. BMC Bioinformatics 17 Suppl 2:12. (pmid: 26823190)
[ PubMed ] [ DOI ] BACKGROUND: The reconstruction of reliable graphical models from observational data is important in bioinformatics and other computational fields applying network reconstruction methods to large, yet finite datasets. The main network reconstruction approaches are either based on Bayesian scores, which enable the ranking of alternative Bayesian networks, or rely on the identification of structural independencies, which correspond to missing edges in the underlying network. Bayesian inference methods typically require heuristic search strategies, such as hill-climbing algorithms, to sample the super-exponential space of possible networks. By contrast, constraint-based methods, such as the PC and IC algorithms, are expected to run in polynomial time on sparse underlying graphs, provided that a correct list of conditional independencies is available. Yet, in practice, conditional independencies need to be ascertained from the available observational data, based on adjustable statistical significance levels, and are not robust to sampling noise from finite datasets. RESULTS: We propose a more robust approach to reconstruct graphical models from finite datasets. It combines constraint-based and Bayesian approaches to infer structural independencies based on the ranking of their most likely contributing nodes. In a nutshell, this local optimization scheme and corresponding 3off2 algorithm iteratively "take off" the most likely conditional 3-point information from the 2-point (mutual) information between each pair of nodes. Conditional independencies are thus derived by progressively collecting the most significant indirect contributions to all pairwise mutual information. The resulting network skeleton is then partially directed by orienting and propagating edge directions, based on the sign and magnitude of the conditional 3-point information of unshielded triples. The approach is shown to outperform both constraint-based and Bayesian inference methods on a range of benchmark networks. The 3off2 approach is then applied to the reconstruction of the hematopoiesis regulation network based on recent single cell expression data and is found to retrieve more experimentally ascertained regulations between transcription factors than with other available methods. CONCLUSIONS: The novel information-theoretic approach and corresponding 3off2 algorithm combine constraint-based and Bayesian inference methods to reliably reconstruct graphical models, despite inherent sampling noise in finite datasets. In particular, experimentally verified interactions as well as novel predicted regulations are established on the hematopoiesis regulatory networks based on single cell expression data.
Ruyssinck et al. (2016) Netter: re-ranking gene network inference predictions using structural network properties. BMC Bioinformatics 17:76. (pmid: 26862054)
[ PubMed ] [ DOI ] BACKGROUND: Many algorithms have been developed to infer the topology of gene regulatory networks from gene expression data. These methods typically produce a ranking of links between genes with associated confidence scores, after which a certain threshold is chosen to produce the inferred topology. However, the structural properties of the predicted network do not resemble those typical for a gene regulatory network, as most algorithms only take into account connections found in the data and do not include known graph properties in their inference process. This lowers the prediction accuracy of these methods, limiting their usability in practice. RESULTS: We propose a post-processing algorithm which is applicable to any confidence ranking of regulatory interactions obtained from a network inference method which can use, inter alia, graphlets and several graph-invariant properties to re-rank the links into a more accurate prediction. To demonstrate the potential of our approach, we re-rank predictions of six different state-of-the-art algorithms using three simple network properties as optimization criteria and show that Netter can improve the predictions made on both artificially generated data as well as the DREAM4 and DREAM5 benchmarks. Additionally, the DREAM5 E.coli. community prediction inferred from real expression data is further improved. Furthermore, Netter compares favorably to other post-processing algorithms and is not restricted to correlation-like predictions. Lastly, we demonstrate that the performance increase is robust for a wide range of parameter settings. Netter is available at http://bioinformatics.intec.ugent.be. CONCLUSIONS: Network inference from high-throughput data is a long-standing challenge. In this work, we present Netter, which can further refine network predictions based on a set of user-defined graph properties. Netter is a flexible system which can be applied in unison with any method producing a ranking from omics data. It can be tailored to specific prior knowledge by expert users but can also be applied in general uses cases. Concluding, we believe that Netter is an interesting second step in the network inference process to further increase the quality of prediction.
Omranian et al. (2016) Gene regulatory network inference using fused LASSO on multiple data sets. Sci Rep 6:20533. (pmid: 26864687)
[ PubMed ] [ DOI ] Devising computational methods to accurately reconstruct gene regulatory networks given gene expression data is key to systems biology applications. Here we propose a method for reconstructing gene regulatory networks by simultaneous consideration of data sets from different perturbation experiments and corresponding controls. The method imposes three biologically meaningful constraints: (1) expression levels of each gene should be explained by the expression levels of a small number of transcription factor coding genes, (2) networks inferred from different data sets should be similar with respect to the type and number of regulatory interactions, and (3) relationships between genes which exhibit similar differential behavior over the considered perturbations should be favored. We demonstrate that these constraints can be transformed in a fused LASSO formulation for the proposed method. The comparative analysis on transcriptomics time-series data from prokaryotic species, Escherichia coli and Mycobacterium tuberculosis, as well as a eukaryotic species, mouse, demonstrated that the proposed method has the advantages of the most recent approaches for regulatory network inference, while obtaining better performance and assigning higher scores to the true regulatory links. The study indicates that the combination of sparse regression techniques with other biologically meaningful constraints is a promising framework for gene regulatory network reconstructions.
(Note: R code available.)
Zerbino et al. (2016) Ensembl regulation resources. Database (Oxford) 2016:. (pmid: 26888907)
[ PubMed ] [ DOI ] New experimental techniques in epigenomics allow researchers to assay a diversity of highly dynamic features such as histone marks, DNA modifications or chromatin structure. The study of their fluctuations should provide insights into gene expression regulation, cell differentiation and disease. The Ensembl project collects and maintains the Ensembl regulation data resources on epigenetic marks, transcription factor binding and DNA methylation for human and mouse, as well as microarray probe mappings and annotations for a variety of chordate genomes. From this data, we produce a functional annotation of the regulatory elements along the human and mouse genomes with plans to expand to other species as data becomes available. Starting from well-studied cell lines, we will progressively expand our library of measurements to a greater variety of samples. Ensembl's regulation resources provide a central and easy-to-query repository for reference epigenomes. As with all Ensembl data, it is freely available at http://www.ensembl.org, from the Perl and REST APIs and from the public Ensembl MySQL database server at ensembldb.ensembl.org. Database URL: http://www.ensembl.org.
Analyzing GRN construction
Task:
- Choose one of the papers cited here that provides an exact computational procedure for building a TF target list or a GRN from public data[4].
- Email me on Monday which paper you have chosen.
- Analyze the approach with an SPN diagram and enough annotation that you could design the algorithm.
- Bring your diagram and annotation to class on Tuesday. Refer to the marking rubrics for Assigned Material for how to make this an excellent piece of work. The "late rules" are the same as last time: same day but not in class: marks * 0.5; next day: marks * 0.2; the day after: marks * 0.1; after that: 0. The diagrams will be marked by me for a maximum of six marks. No quiz.
In class, I would like to compare and contrast approaches. Can yours replace MARA for our purposes? Let's discuss...
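When comparing approaches, it may help to keep in mind what the result has to plug into: in our workflow, transcription factors are ranked by how many differentially expressed genes they are associated with. The following is only a minimal sketch in Python, assuming that the method you chose yields a TF-to-target edge list and that a set of differentially expressed genes is available (e.g. from DESeq). The function name `rank_tfs` and all TF and gene labels are hypothetical, and the count covers first neighbours only; it does not implement the weighting of the network-based sphere of influence.

```python
from collections import defaultdict


def rank_tfs(edges, de_genes):
    """Rank TFs by the number of their direct targets that are differentially expressed.

    edges    -- iterable of (tf, target) pairs (hypothetical edge-list format)
    de_genes -- set of differentially expressed gene identifiers
    """
    targets = defaultdict(set)
    for tf, target in edges:
        targets[tf].add(target)

    # Count the overlap between each TF's targets and the DE gene set.
    counts = {tf: len(tgts & de_genes) for tf, tgts in targets.items()}

    # Highest count first; ties are broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data, invented for illustration only.
edges = [
    ("TF_A", "g1"), ("TF_A", "g2"), ("TF_A", "g3"),
    ("TF_B", "g2"), ("TF_B", "g4"),
    ("TF_C", "g5"),
]
de_genes = {"g1", "g2", "g4"}

ranking = rank_tfs(edges, de_genes)
for tf, n in ranking:
    print(tf, n)
```

If the procedure in your paper cannot produce something equivalent to `edges`, that is already one answer to the question of whether it can replace MARA for our purposes.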
- That is all.
Footnotes and references
- ↑
Rackham et al. (2016) A predictive computational framework for direct reprogramming between human cell types. Nat Genet 48:331-5. (pmid: 26780608) [ PubMed ] [ DOI ] Transdifferentiation, the process of converting from one cell type to another without going through a pluripotent state, has great promise for regenerative medicine. The identification of key transcription factors for reprogramming is currently limited by the cost of exhaustive experimental testing of plausible sets of factors, an approach that is inefficient and unscalable. Here we present a predictive system (Mogrify) that combines gene expression data with regulatory network information to predict the reprogramming factors necessary to induce cell conversion. We have applied Mogrify to 173 human cell types and 134 tissues, defining an atlas of cellular reprogramming. Mogrify correctly predicts the transcription factors used in known transdifferentiations. Furthermore, we validated two new transdifferentiations predicted by Mogrify. We provide a practical and efficient mechanism for systematically implementing novel cell conversions, facilitating the generalization of reprogramming of human cells. Predictions are made available to help rapidly further the field of cell conversion.
- ↑
Suzuki et al. (2009) The transcriptional network that controls growth arrest and differentiation in a human myeloid leukemia cell line. Nat Genet 41:553-62. (pmid: 19377474) [ PubMed ] [ DOI ] Using deep sequencing (deepCAGE), the FANTOM4 study measured the genome-wide dynamics of transcription-start-site usage in the human monocytic cell line THP-1 throughout a time course of growth arrest and differentiation. Modeling the expression dynamics in terms of predicted cis-regulatory sites, we identified the key transcription regulators, their time-dependent activities and target genes. Systematic siRNA knockdown of 52 transcription factors confirmed the roles of individual factors in the regulatory network. Our results indicate that cellular states are constrained by complex networks involving both positive and negative regulatory interactions among substantial numbers of transcription factors and that no single transcription factor is both necessary and sufficient to drive the differentiation process.
- ↑
Balwierz et al. (2014) ISMARA: automated modeling of genomic signals as a democracy of regulatory motifs. Genome Res 24:869-84. (pmid: 24515121) [ PubMed ] [ DOI ] Accurate reconstruction of the regulatory networks that control gene expression is one of the key current challenges in molecular biology. Although gene expression and chromatin state dynamics are ultimately encoded by constellations of binding sites recognized by regulators such as transcriptions factors (TFs) and microRNAs (miRNAs), our understanding of this regulatory code and its context-dependent read-out remains very limited. Given that there are thousands of potential regulators in mammals, it is not practical to use direct experimentation to identify which of these play a key role for a particular system of interest. We developed a methodology that models gene expression or chromatin modifications in terms of genome-wide predictions of regulatory sites and completely automated it into a web-based tool called ISMARA (Integrated System for Motif Activity Response Analysis). Given only gene expression or chromatin state data across a set of samples as input, ISMARA identifies the key TFs and miRNAs driving expression/chromatin changes and makes detailed predictions regarding their regulatory roles. These include predicted activities of the regulators across the samples, their genome-wide targets, enriched gene categories among the targets, and direct interactions between the regulators. Applying ISMARA to data sets from well-studied systems, we show that it consistently identifies known key regulators ab initio. We also present a number of novel predictions including regulatory interactions in innate immunity, a master regulator of mucociliary differentiation, TFs consistently disregulated in cancer, and TFs that mediate specific chromatin modifications.
- ↑ You may choose a different paper if you e-mail me the reference AND I approve.
- Ask, if things don't work for you!
- If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.
- Do consider how to ask your questions so that a meaningful answer is possible. The following two links are required reading:
- How to create a Minimal, Complete, and Verifiable example on Stack Overflow
- How to make a great R reproducible example