Computational Systems Biology Main Page

From "A B C"
Revision as of 14:56, 24 March 2015 by Boris (talk | contribs)

Computational Systems Biology

Course Wiki for BCB420 (Computational Systems Biology) and JTB2020 (Applied Bioinformatics).

This is our main tool to coordinate information, activities and projects in the University of Toronto's computational systems biology course BCB420. If you are not one of our students, you can still browse this site; however, only users with a login account can edit or contribute material. If you are here because you are interested in general aspects of bioinformatics or computational biology, you may want to review the Wikipedia article on bioinformatics, or visit Wikiomics. Contact boris.steipe(at)utoronto.ca with any questions you may have.



BCB420 / JTB2020

These are the course pages for BCB420H (Computational Systems Biology). Welcome, you'll feel right at home here.


These are also the course pages for JTB2020H (Applied Bioinformatics). How come? Why is JTB2020 not the graduate equivalent of BCB410 (Applied Bioinformatics)? Let me explain. When this course was conceived as a required part of the (then so-called) Collaborative PhD Program in Proteomics and Bioinformatics in 2003, there was an urgent need to bring graduate students to a minimal level of computer skills and programming; prior experience was virtually nonexistent. Fortunately, the field has changed and the Program has changed, and now our graduate students are usually quite competent at least in some practical aspects of computational biology. Not uniformly, however, and the wide disparity of previous experience has made it increasingly difficult to provide course offerings that address students' needs. JTB2020 therefore shares its lecture components with the BCB420 course, and there is a large range of topics in Applied Bioinformatics that are covered by students in self-study and discussion with the lecturer, customized to their actual needs.


The 2015 course...

This year's course will be very different from previous years' courses. In previous years we have worked with a structured, lecture-style format. This year we will be pursuing a wholly problem-oriented format. This is the plan:

  • We'll identify an interesting challenge in computational systems biology
  • We'll formulate an approach to this challenge as a project
  • We'll define the resources we need: data sources, algorithms, and programming and collaboration support
  • We'll define students' roles in the project according to their skills and experience
  • Then we will implement the project.



Organization

Dates
BCB420/JTB2020 is a Winter Term course.
Lectures: Wednesdays, 14:00 to 16:00. (Classes start at 10 minutes past the hour.)
Exam: None for this course.


Location
MS 4279 (Medical Sciences Building).


Departmental information
For BCB420 see the BCB420 Biochemistry Department Course Web page.
For JTB2020 see the JTB2020 Course Web page for general information.


Submissions
This is an electronic-submission-only course, but if you must print material, consider printing double-sided. Learn how at the Print-Double-Sided Student Initiative.


Recommended textbooks

Depending on your background, various levels of textbooks may be suitable. I will bring my evaluation copies to class so you can decide what may work for you.
Understanding Bioinformatics (Zvelebil & Baum) is a decent general introduction to many aspects of bioinformatics. It was published in 2007, and an updated version is urgently needed. Still, some of the basics (like the algorithm for optimal sequence alignment) don't change. (Amazon) (Indigo) (ABE books)
Practical Bioinformatics (Agostino) covers some of the material of the BCH441 exercises. Expect a no-nonsense introduction to the most basic material. I have my pet peeves about this book (as I have for many others, e.g. why in the world do they still teach CLUSTAL when all available studies demonstrate it to be the least accurate MSA algorithm by a margin?), but if you haven't taken BCH441, this may serve you well. And if you did take BCH441, it may consolidate some ideas that I wasn't clear about. (Amazon) (Indigo) (ABE books)
If you are aware of recent good textbooks, or have your own opinions about these or other books, let me know.





Grading and Activities

Activity                            Weight: BCB420       Weight: JTB2020
                                    (Undergraduates)     (Graduates)
3 In-class introductory quizzes     18 marks (3 x 6)     12 marks (3 x 4)
9 Course objective evaluations      27 marks (9 x 3)     18 marks (9 x 2)
Class project contributions         40 marks             40 marks
"Classroom" participation           15 marks             15 marks
Project Manuscript Draft            -                    15 marks
Total                               100 marks            100 marks


A note on marking

I do not adjust marks towards a target mean and variance (i.e. there will be no "belling" of grades). I feel strongly that such "normalization" detracts from a collaborative and mutually supportive learning environment. If your classmate gets a great mark because you helped him with a difficult concept, this should never have the effect that it brings down your mark through class average adjustments. Collaborate as much as possible; it is a great way to learn.


Prerequisites

You must have taken an introductory bioinformatics course as a prerequisite, or otherwise acquired the necessary knowledge. Therefore I expect familiarity with the material of my BCH441 course. If you have not taken BCH441, please update your knowledge and skills before the course starts. I will not make accommodations for lack of prerequisites. Please check the syllabus for this course below to find whether you need to catch up on additional material, and peruse this site to find the information you may need. A (non-exhaustive) overview of topics and useful links is linked here.


Course Objectives

Building Software

Understand principles of software design and implementation in a collaborative environment.

This objective is implicit in students' project participation.

Gene Lists

Understand sources of and types of gene lists, gene IDs.

Gene IDs and gene lists are in many respects the raw material from which we construct bioinformatics. Here are two articles to set the stage:


BioDBnet is a data-warehouse at the US National Cancer Institute.

Mudunuri et al. (2009) bioDBnet: the biological database network. Bioinformatics 25:555-6. (pmid: 19129209)

SUMMARY: bioDBnet is an online web resource that provides interconnected access to many types of biological databases. It has integrated many of the most commonly used biological databases and in its current state has 153 database identifiers (nodes) covering all aspects of biology including genes, proteins, pathways and other biological concepts. bioDBnet offers various ways to work with these databases including conversions, extensive database reports, custom navigation and has various tools to enhance the quality of the results. Importantly, the access to bioDBnet is updated regularly, providing access to the most recent releases of each individual database. AVAILABILITY: http://biodbnet.abcc.ncifcrf.gov.


The Molecular Signatures Database (MSigDB) collects examples of gene sets: lists of gene identifiers with a shared property. The paper discusses V3.0; the database has grown since then.

Liberzon et al. (2011) Molecular signatures database (MSigDB) 3.0. Bioinformatics 27:1739-40. (pmid: 21546393)

MOTIVATION: Well-annotated gene sets representing the universe of the biological processes are critical for meaningful and insightful interpretation of large-scale genomic data. The Molecular Signatures Database (MSigDB) is one of the most widely used repositories of such sets. RESULTS: We report the availability of a new version of the database, MSigDB 3.0, with over 6700 gene sets, a complete revision of the collection of canonical pathways and experimental signatures from publications, enhanced annotations and upgrades to the web site. AVAILABILITY AND IMPLEMENTATION: MSigDB is freely available for non-commercial use at http://www.broadinstitute.org/msigdb.

The currently available gene sets are described here: http://www.broadinstitute.org/gsea/msigdb/collections.jsp
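
To get a feeling for what a gene set looks like as data, here is a minimal sketch of reading gene sets from tab-separated text in the GMT layout (set name, description, then gene symbols) in which MSigDB collections are commonly distributed. The set names, URLs and the helper names `parse_gmt` and `sets_containing` are made up for illustration and are not taken from the paper.

```python
# Hypothetical two-line GMT snippet; set names and URLs are invented for illustration.
GMT_TEXT = (
    "SET_A\thttp://example.org/SET_A\tTP53\tMDM2\tCDKN1A\n"
    "SET_B\thttp://example.org/SET_B\tMDM2\tATM\n"
)


def parse_gmt(text):
    """Return {set_name: set_of_gene_symbols} from GMT-formatted text."""
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        fields = line.split("\t")
        # fields[0] is the set name, fields[1] a description or URL,
        # everything after that is a gene identifier.
        gene_sets[fields[0]] = {g for g in fields[2:] if g}
    return gene_sets


def sets_containing(gene, gene_sets):
    """Names of all gene sets that list the given gene, sorted alphabetically."""
    return sorted(name for name, genes in gene_sets.items() if gene in genes)


gene_sets = parse_gmt(GMT_TEXT)
print(sets_containing("MDM2", gene_sets))
```

Storing each gene set as a Python `set` makes membership tests and intersections between sets cheap, which is usually what we want to do with them next.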


 

Reading response (~ 1/2 page; due April 2.): We have discussed pairwise interactions for systems discovery. How would you use gene lists like those contained in the MSigDB as an additional information source?

 


Co-expression

Understand the use of expression information to infer co-regulation.

A principle we sometimes call "guilt by association" states that genes that have similar features have a functional relationship. Applied to expression levels, the inference is: if the expression levels of two genes are correlated, we may assume that they are co-regulated. And if nature has evolved them to be co-regulated, the selective advantage is derived from a shared function. To put this into practice, one needs to find a suitable set of experiments for the expression profiles, one needs to assess multiple experiments in a common frame of reference, and one needs to calculate correlation in a meaningful way. Two databases have recently been published that store such co-expression values; it is useful to compare and contrast their approaches.

Williams (2015) Database of Gene Co-Regulation (dGCR): A Web Tool for Analysing Patterns of Gene Co-regulation across Publicly Available Expression Data. J Genomics 3:29-35. (pmid: 25628763)

The database of Gene Co-Regulation (dGCR) is a web tool for the analysis of gene relationships based on correlated patterns of gene expression over publicly available transcriptional data. The motivation behind dGCR is that genes whose expression patterns correlate across many experiments tend to be co-regulated and hence share biological function. In addition to revealing functional connections between individual gene pairs, extended sets of co-regulated genes can also be assessed for enrichment of gene ontology classes and interaction pathways. This functionality provides an insight into the biological function of the query gene itself. The dGCR web tool extends the range of expression data curated by existing co-regulation databases and provides additional insights into gene function through the analysis of pathways, gene ontology classes and co-regulation modules.

Fahrenbach et al. (2014) The CO-Regulation Database (CORD): a tool to identify coordinately expressed genes. PLoS ONE 9:e90408. (pmid: 24599084)

BACKGROUND: Meta-analysis of gene expression array databases has the potential to reveal information about gene function. The identification of gene-gene interactions may be inferred from gene expression information but such meta-analysis is often limited to a single microarray platform. To address this limitation, we developed a gene-centered approach to analyze differential expression across thousands of gene expression experiments and created the CO-Regulation Database (CORD) to determine which genes are correlated with a queried gene. RESULTS: Using the GEO and ArrayExpress database, we analyzed over 120,000 group by group experiments from gene microarrays to determine the correlating genes for over 30,000 different genes or hypothesized genes. CORD output data is presented for sample queries with focus on genes with well-known interaction networks including p16 (CDKN2A), vimentin (VIM), MyoD (MYOD1). CDKN2A, VIM, and MYOD1 all displayed gene correlations consistent with known interacting genes. CONCLUSIONS: We developed a facile, web-enabled program to determine gene-gene correlations across different gene expression microarray platforms. Using well-characterized genes, we illustrate how CORD's identification of co-expressed genes contributes to a better understanding a gene's potential function. The website is found at http://cord-db.org.
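
The simplest way to "calculate correlation" between two expression profiles is the Pearson correlation coefficient. The sketch below is only that textbook measure applied to invented numbers; it makes no claim about how dGCR or CORD compute their values, and the gene names and profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical expression values of three genes over four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}

r_ab = pearson(profiles["geneA"], profiles["geneB"])  # rises together
r_ac = pearson(profiles["geneA"], profiles["geneC"])  # exactly opposite trend
print(round(r_ab, 3), round(r_ac, 3))
```

Note that the anti-correlated pair is just as informative about a shared regulatory context as the correlated one; whether to use the signed or the absolute value is one of the choices that makes correlation "meaningful" or not.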

Quiz (March 25): Brief quiz on this paper. Understand the methods.

 


Molecular Interaction

Understand the use of interaction data to infer contribution to common function.

Interaction databases provide some of the best evidence for functional relationships between biomolecules, but using them productively can be challenging. First of all, we are leaving the paradigm of individual molecules and lists of molecules behind, and entering the world of graphs and networks. Secondly, interaction databases have historically struggled to maintain their data to a common standard, and the source data can be of widely varying reliability. As a result, the overlap between different databases has been embarrassingly low, and integration efforts that simply take the superset of all reported interactions suffer from too many false positives. A good introduction to the topic is here:

Orchard (2012) Molecular interaction databases. Proteomics 12:1656-62. (pmid: 22611057)

Molecular interaction databases are playing an ever more important role in our understanding of the biology of the cell. An increasing number of resources exist to provide these data and many of these have adopted the controlled vocabularies and agreed-upon standardised data formats produced by the Molecular Interaction workgroup of the Human Proteome Organization Proteomics Standards Initiative (HUPO PSI-MI). Use of these standards allows each resource to establish PSI Common QUery InterfaCe (PSICQUIC) service, making data from multiple resources available to the user in response to a single query. This cooperation between databases has been taken a stage further, with the establishment of the International Molecular Exchange (IMEx) consortium which aims to maximise the curation power of numerous data resources, and provide the user with a non-redundant, consistently annotated set of interaction data.
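
To make the remarks about database overlap concrete, here is a small sketch that compares two interaction lists. The two "databases" and their protein names are invented; the Jaccard index and the "reported by at least two sources" filter are simply common choices for illustration, not a method prescribed by the paper.

```python
from collections import Counter


def normalize(pairs):
    """Treat interactions as undirected: (A, B) and (B, A) are the same edge."""
    return {frozenset(pair) for pair in pairs if pair[0] != pair[1]}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def supported_by(sources, min_sources=2):
    """Edges reported by at least min_sources of the given edge sets."""
    counts = Counter(edge for source in sources for edge in source)
    return {edge for edge, n in counts.items() if n >= min_sources}


# Hypothetical interaction lists from two databases.
db1 = normalize([("A", "B"), ("B", "C"), ("C", "D")])
db2 = normalize([("B", "A"), ("C", "D"), ("D", "E"), ("E", "F")])

superset = db1 | db2                 # everything anyone ever reported
consensus = supported_by([db1, db2]) # only edges both sources agree on
print(len(superset), len(consensus), jaccard(db1, db2))
```

The superset maximizes coverage at the price of false positives; the consensus does the opposite. Real integration efforts sit somewhere in between and weight edges by the quality of their evidence.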

Quiz (April 1): Brief quiz on this paper. Understand the issues in curating and storing interaction data.

 


Function prediction from network data assumes some functions have been annotated and the network will guide which functions to transfer to un-annotated nodes. Two major approaches have been proposed: diffusion-based approaches and clustering. We will discuss clustering elsewhere; here is a recent example of diffusion approaches.

Ma et al. (2014) Integrative approaches for predicting protein function and prioritizing genes for complex phenotypes using protein interaction networks. Brief Bioinformatics 15:685-98. (pmid: 23788799)

With the rapid development of biotechnologies, many types of biological data including molecular networks are now available. However, to obtain a more complete understanding of a biological system, the integration of molecular networks with other data, such as molecular sequences, protein domains and gene expression profiles, is needed. A key to the use of networks in biological studies is the definition of similarity among proteins over the networks. Here, we review applications of similarity measures over networks with a special focus on the following four problems: (i) predicting protein functions, (ii) prioritizing genes related to a phenotype given a set of seed genes that have been shown to be related to the phenotype, (iii) prioritizing genes related to a phenotype by integrating gene expression profiles and networks and (iv) identification of false positives and false negatives from RNAi experiments. Diffusion kernels are demonstrated to give superior performance in all these tasks, leading to the suggestion that diffusion kernels should be the primary choice for a network similarity metric over other similarity measures such as direct neighbors and shortest path distance.
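
To illustrate the idea of diffusion on a network, the sketch below implements a random walk with restart from a set of annotated "seed" nodes. This is one simple member of the diffusion family and is not the specific diffusion kernel evaluated in the review; the toy network, the restart probability of 0.3 and the function name are assumptions chosen for illustration.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that starts at the seeds.

    adj     : dict mapping each node to the set of its neighbours (undirected)
    seeds   : nodes with known annotation; the walker jumps back to them
    restart : probability of jumping back to a seed at every step
    """
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            if not adj[n]:
                # An isolated node keeps its own probability mass.
                nxt[n] += (1.0 - restart) * p[n]
                continue
            share = (1.0 - restart) * p[n] / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Hypothetical linear network A - B - C - D with A as the only annotated node.
network = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
scores = random_walk_with_restart(network, seeds={"A"})
for node in sorted(scores):
    print(node, round(scores[node], 3))
```

Nodes with a high score are "close" to the seeds in the diffusion sense, and are candidates for inheriting the seeds' annotation. Unlike shortest-path distance, the score also rewards nodes that are reachable from the seeds along many alternative routes.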

 

GO

Understand the use of GO and GOA databases, and how to compute semantic similarity.

The notion of "function" is notoriously difficult to compute with, and the most successful approach to date is contributed by the Gene Ontology (GO) Consortium. GO is an ontology of concepts, organized in a DAG (Directed Acyclic Graph: a hierarchical data structure like a tree, but nodes can have more than one parent). Actually GO comprises three ontologies for (i) biological processes, (ii) cellular components and (iii) molecular functions. Gene Ontology Annotation (GOA) is a database curated by UniProt, which annotates the UniProtKB proteins with GO terms. The collection of GO terms for a protein is presumed to reflect its function. Visit these sites for a brief introduction.
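
As a minimal sketch of what "like a tree, but nodes can have more than one parent" means in practice, the toy DAG below stores each term's parents and collects all ancestors of a term. The term names are placeholders, not real GO terms, and the function name is hypothetical.

```python
# Toy ontology: each term maps to the set of its direct parents.
PARENTS = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "c": {"a", "b"},  # two parents: this is what makes it a DAG, not a tree
    "d": {"c"},
}


def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("d", PARENTS)))
```

Because an annotation to a term implies all of its ancestors, this kind of upward traversal is the first step of most computations on GO annotations.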

To work with GO, we need a somewhat deeper understanding of the principles. The discussion of changes to the ontology is a useful start.

Huntley et al. (2014) Understanding how and why the Gene Ontology and its annotations evolve: the GO within UniProt. Gigascience 3:4. (pmid: 24641996)

The Gene Ontology Consortium (GOC) is a major bioinformatics project that provides structured controlled vocabularies to classify gene product function and location. GOC members create annotations to gene products using the Gene Ontology (GO) vocabularies, thus providing an extensive, publicly available resource. The GO and its annotations to gene products are now an integral part of functional analysis, and statistical tests using GO data are becoming routine for researchers to include when publishing functional information. While many helpful articles about the GOC are available, there are certain updates to the ontology and annotation sets that sometimes go unobserved. Here we describe some of the ways in which GO can change that should be carefully considered by all users of GO as they may have a significant impact on the resulting gene product annotations, and therefore the functional description of the gene product, or the interpretation of analyses performed on GO datasets. GO annotations for gene products change for many reasons, and while these changes generally improve the accuracy of the representation of the underlying biology, they do not necessarily imply that previous annotations were incorrect. We additionally describe the quality assurance mechanisms we employ to improve the accuracy of annotations, which necessarily changes the composition of the annotation sets we provide. We use the Universal Protein Resource (UniProt) for illustrative purposes of how the GO Consortium, as a whole, manages these changes.


For this course, one question is particularly important: to analyze whether two genes collaborate. This can usually not be directly inferred from their annotations, but the similarity of their annotated GO terms is an important indicator. There are many ways to compute such semantic similarity. Here is a recent paper that proposes a new measure and compares it to previous approaches:

Zhang & Lai (2015) Semantic similarity measurement between gene ontology terms based on exclusively inherited shared information. Gene 558:108-17. (pmid: 25550042)

Quantifying the semantic similarities between pairs of terms in the Gene Ontology (GO) structure can help to explore the functional relationships between biological entities. A common approach to this problem is to measure the information they have in common based on the information content of their common ancestors. However, many studies have their limitations in measuring the information two GO terms share. This study presented a new measurement, exclusively inherited shared information (EISI) that captured the information shared by two terms based on an intuitive observation on the multiple inheritance relationships among the terms in the GO graph. EISI was derived from the information content of the exclusively inherited common ancestors (EICAs), which were screened from the common ancestors according to the attribute of their direct children. The effectiveness of EISI was evaluated against some state-of-the-art measurements on both artificial and real datasets, it produced more relevant results with experts' scores on the artificial dataset, and supported the prior knowledge of gene function in pathways on the Saccharomyces genome database (SGD). The promising features of EISI are the following: (1) it provides a more effective way to characterize the semantic relationship between two GO terms by taking into account multiple common ancestors related, and (2) can quickly detect all EICAs with time complexity of O(n), which is much more efficient than other methods based on disjunctive common ancestors. It is a promising alternative to multiple inheritance based methods for practical applications on large-scale dataset. The algorithm EISI was implemented in Matlab and is freely available from http://treaton.evai.pl/EISI/.
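
The "common approach" mentioned in the abstract can be sketched with Resnik's classic measure: the similarity of two terms is the information content (IC) of their most informative common ancestor, where the IC of a term is the negative log of the fraction of genes annotated to it or to any of its descendants. This is explicitly not the EISI measure of the paper; the toy ontology, the annotations and all names are invented for illustration.

```python
import math

# Toy ontology (term -> direct parents) and toy annotations (gene -> terms).
PARENTS = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "c": {"a"},
    "d": {"a"},
    "e": {"b"},
}
ANNOTATIONS = {"g1": {"c"}, "g2": {"d"}, "g3": {"e"}, "g4": {"b"}}


def term_and_ancestors(term):
    """The term itself plus everything above it in the DAG."""
    found = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def information_content():
    """IC(t) = log(N / n_t), n_t = genes annotated to t or any descendant."""
    counts = {term: 0 for term in PARENTS}
    for terms in ANNOTATIONS.values():
        implied = set()
        for term in terms:
            implied |= term_and_ancestors(term)
        for term in implied:
            counts[term] += 1
    total = len(ANNOTATIONS)
    return {t: math.log(total / c) for t, c in counts.items() if c > 0}


def resnik(term1, term2, ic):
    """IC of the most informative common ancestor of the two terms."""
    common = term_and_ancestors(term1) & term_and_ancestors(term2)
    return max(ic[t] for t in common if t in ic)


ic = information_content()
print(round(resnik("c", "d", ic), 3), round(resnik("c", "e", ic), 3))
```

Terms "c" and "d" share the fairly specific parent "a" and therefore score higher than "c" and "e", which share only the uninformative root. The paper's criticism of such measures concerns exactly the question of which common ancestors should count when terms have multiple parents.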

Quiz (April 1): Brief quiz on this paper. Understand the principle of "semantic similarity".

 

Pathways

Understand the contents of pathway databases, and their use for gene-pair annotation.

Pathways are the classical paradigm to organize biochemistry into a meaningful framework: metabolic pathways came first, and later additions include regulatory/signalling pathways and developmental pathways. In a sense such pathways should correlate with a notion of systems as collaborating entities, or at least form the cores of such systems. But how to exploit this information is not trivial, since pathways are also just conceptual entities: paths in much larger, multiply interconnected networks. One of the classic databases in this field is KEGG; it contains both signalling and metabolic pathways. MetaCyc/BioCyc probably has the current lead in breadth of reactions but is metabolic only. Reactome is excellently curated by the EBI and contains metabolic and signalling pathways, but is human only.

I have not found a good, current paper that utilizes database-scale pathway information for the discovery of broad principles. But here is a good, relatively recent overview of MetaCyc/BioCyc to set the stage.

Caspi et al. (2014) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res 42:D459-71. (pmid: 24225315)

The MetaCyc database (MetaCyc.org) is a comprehensive and freely accessible database describing metabolic pathways and enzymes from all domains of life. MetaCyc pathways are experimentally determined, mostly small-molecule metabolic pathways and are curated from the primary scientific literature. MetaCyc contains >2100 pathways derived from >37,000 publications, and is the largest curated collection of metabolic pathways currently available. BioCyc (BioCyc.org) is a collection of >3000 organism-specific Pathway/Genome Databases (PGDBs), each containing the full genome and predicted metabolic network of one organism, including metabolites, enzymes, reactions, metabolic pathways, predicted operons, transport systems and pathway-hole fillers. Additions to BioCyc over the past 2 years include YeastCyc, a PGDB for Saccharomyces cerevisiae, and 891 new genomes from the Human Microbiome Project. The BioCyc Web site offers a variety of tools for querying and analysis of PGDBs, including Omics Viewers and tools for comparative analysis. New developments include atom mappings in reactions, a new representation of glycan degradation pathways, improved compound structure display, better coverage of enzyme kinetic data, enhancements of the Web Groups functionality, improvements to the Omics viewers, a new representation of the Enzyme Commission system and, for the desktop version of the software, the ability to save display states.
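
One very simple reading of "gene-pair annotation" from pathway data is co-membership: for every pair of genes, record the pathways in which both appear. The sketch below does just that on invented data; the pathway and gene names are placeholders, and real pathway databases would first need to be mapped to a common gene identifier.

```python
from itertools import combinations

# Hypothetical pathway memberships (pathway name -> member genes).
PATHWAYS = {
    "pathway_1": {"G1", "G2", "G3"},
    "pathway_2": {"G2", "G3", "G4"},
}


def pair_annotations(pathways):
    """Map each co-occurring gene pair (sorted tuple) to its shared pathways."""
    shared = {}
    for name, genes in pathways.items():
        for a, b in combinations(sorted(genes), 2):
            shared.setdefault((a, b), set()).add(name)
    return shared


pairs = pair_annotations(PATHWAYS)
for pair in sorted(pairs):
    print(pair, sorted(pairs[pair]))
```

Co-membership ignores the order of steps within a pathway and treats large and small pathways alike, so it is a starting point rather than a finished measure.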


 

Reading response (~ 1/2 page; due April 2.): Regulatory pathways are usually named according to key proteins they are organized around. Can you think of a better way?

 



Graph features

Understand the analysis of graphs and computation of graph features.

Graph theory is the most important theoretical framework for systems biology. Here is an introduction with a perspective on biological networks.

Pavlopoulos et al. (2011) Using graph theory to analyze biological networks. BioData Min 4:10. (pmid: 21527005)

Understanding complex systems often requires a bottom-up analysis towards a systems biology approach. The need to investigate a system, not only as individual components but as a whole, emerges. This can be done by examining the elementary constituents individually and then how these are connected. The myriad components of a system and their interactions are best characterized as networks and they are mainly represented as graphs where thousands of nodes are connected with thousands of vertices. In this article we demonstrate approaches, models and methods from the graph theory universe and we discuss ways in which they can be used to reveal hidden properties and features of a network. This network profiling combined with knowledge extraction will help us to better understand the biological significance of the system.
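
Two of the most basic graph features are the degree of a node and its local clustering coefficient, i.e. the fraction of pairs of its neighbours that are themselves connected. The sketch below computes both for a toy undirected graph stored as an adjacency dictionary; the graph and the function names are made up for illustration.

```python
from itertools import combinations


def build_graph(edges):
    """Undirected graph as a dict: node -> set of neighbouring nodes."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def degree(adj, node):
    """Number of edges attached to the node."""
    return len(adj[node])


def local_clustering(adj, node):
    """Fraction of neighbour pairs that are directly connected (0 if degree < 2)."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


# Toy graph: a triangle A-B-C with one extra node D attached to C.
graph = build_graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
for node in sorted(graph):
    print(node, degree(graph, node), round(local_clustering(graph, node), 3))
```

Node C has the highest degree but the lowest clustering coefficient among the triangle nodes, because its extra neighbour D is not connected to the others.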

Quiz (March 25): Brief quiz on this paper. Understand basic concepts and terms.

 

Graphs, Pathways, and Networks

Understand the representation of interaction data in systems biology as graphs.

Here Nataša Pržulj develops an analysis of the interaction network topology.

Janjić & Pržulj (2014) The topology of the growing human interactome data. J Integr Bioinform 11:238. (pmid: 24953453)

We have long moved past the one-gene–one-function concept originally proposed by Beadle and Tatum back in 1941; but the full understanding of genotype–phenotype relations still largely relies on the analysis of static, snapshot-like, interaction data sets. Here, we look at what global patterns can be uncovered if we simply trace back the human interactome network over the last decade of protein-protein interaction (PPI) screening. We take a purely topological approach and find that as the human interactome is getting denser, it is not only gaining in structure (in terms of now being better fit by structured network models than before), but also there are patterns in the way in which it is growing: (a) newly added proteins tend to get linked to existing proteins in the interactome that are not know to interact; and (b) new proteins tend to link to already well connected proteins. Moreover, the alignment between human and yeast interactomes spanning over 40% of yeast’s proteins — that are involved in regulation of transcription, RNA splicing and other cellcycle-related processes—suggests the existence of a part of the interactome which remains topologically and functionally unaffected through evolution. Furthermore, we find a small sub-network, specific to the “core” of the human interactome and involved in regulation of transcription and cancer development, whose wiring has not changed within the human interactome over the last 10 years of interacome data acquisition. Finally, we introduce a generalisation of the clustering coefficient of a network as a new measure called the cycle coefficient, and use it to show that PPI networks of human and model organisms are wired in a tight way which forbids the occurrence large cycles.



 

Reading response (~ 1/2 page; due April 2.): Sketch the relationship between network topology and "system".

 



Graph clustering

Understand the principles and application of modern graph-clustering algorithms.

Cluster theory is a powerful approach to structure data. The basic idea is simple: define clusters as subsets that share more of a certain property within a set than between sets. Putting this into practice, however, is non-trivial: everything depends on the precise definition of the property we are using to organize the data, and what we mean precisely by "within" and "between". Applying the notion of clusters to graphs has its own set of theoretical challenges: in this case we are clustering topological relations, not object attributes. But the implications are profound and range from an improved understanding of biological network structure to a consistent strategy for function annotation. And perhaps biological systems discovery.

Trivodaliev et al. (2014) Exploring function prediction in protein interaction networks via clustering methods. PLoS ONE 9:e99755. (pmid: 24972109)

Complex networks have recently become the focus of research in many fields. Their structure reveals crucial information for the nodes, how they connect and share information. In our work we analyze protein interaction networks as complex networks for their functional modular structure and later use that information in the functional annotation of proteins within the network. We propose several graph representations for the protein interaction network, each having different level of complexity and inclusion of the annotation information within the graph. We aim to explore what the benefits and the drawbacks of these proposed graphs are, when they are used in the function prediction process via clustering methods. For making this cluster based prediction, we adopt well established approaches for cluster detection in complex networks using most recent representative algorithms that have been proven as efficient in the task at hand. The experiments are performed using a purified and reliable Saccharomyces cerevisiae protein interaction network, which is then used to generate the different graph representations. Each of the graph representations is later analysed in combination with each of the clustering algorithms, which have been possibly modified and implemented to fit the specific graph. We evaluate results in regards of biological validity and function prediction performance. Our results indicate that the novel ways of presenting the complex graph improve the prediction process, although the computational complexity should be taken into account when deciding on a particular approach.
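
One widely used way to put numbers on "more within than between" for graphs is Newman-Girvan modularity: for each cluster, compare the fraction of edges that fall inside it with the fraction expected if edges were placed at random with the same node degrees. The sketch below scores a given partition of a toy graph; it is offered as an illustration of the principle, not as the method used in the paper, and the graph and names are invented.

```python
def modularity(edges, communities):
    """Newman-Girvan modularity Q of a partition of an undirected graph.

    edges       : list of (node, node) pairs, each edge listed once
    communities : list of node sets that together cover every node exactly once
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph without edges")
    membership = {node: i for i, comm in enumerate(communities) for node in comm}
    internal = [0] * len(communities)    # edges with both ends in the community
    degree_sum = [0] * len(communities)  # summed degree of the community's nodes
    for a, b in edges:
        ca, cb = membership[a], membership[b]
        degree_sum[ca] += 1
        degree_sum[cb] += 1
        if ca == cb:
            internal[ca] += 1
    return sum(
        internal[i] / m - (degree_sum[i] / (2 * m)) ** 2
        for i in range(len(communities))
    )


# Toy graph: two triangles joined by the single edge C-D.
EDGES = [
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("D", "E"), ("E", "F"), ("D", "F"),
    ("C", "D"),
]
q_two = modularity(EDGES, [{"A", "B", "C"}, {"D", "E", "F"}])
q_one = modularity(EDGES, [{"A", "B", "C", "D", "E", "F"}])
print(round(q_two, 4), round(q_one, 4))
```

Splitting the graph into its two triangles scores higher than lumping everything into one cluster, which matches intuition. Clustering algorithms that optimize modularity search for the partition with the highest such score.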


 

Reading response (~ 1/2 page; due April 2.): What is your preferred approach to validate the "systems" we discover through clustering? Why?

 





In depth...


Resources

Course related


Contents related

