Computational Systems Biology Main Page


Computational Systems Biology

Course Wiki for BCB420 (Computational Systems Biology) and JTB2020 (Applied Bioinformatics).

This is our main tool to coordinate information, activities and projects in University of Toronto's computational systems biology course BCB420. If you are not one of our students, you can still browse this site; however, only users with a login account can edit or contribute material. If you are here because you are interested in general aspects of bioinformatics or computational biology, you may want to review the Wikipedia article on bioinformatics, or visit Wikiomics. Contact boris.steipe(at)utoronto.ca with any questions you may have.



BCB420 / JTB2020

These are the course pages for BCB420H (Computational Systems Biology). Welcome, you'll feel right at home here.


These are also the course pages for JTB2020H (Applied Bioinformatics). How come? Why is JTB2020 not the graduate equivalent of BCB410 (Applied Bioinformatics)? Let me explain. When this course was conceived as a required part of the (then so-called) Collaborative PhD Program in Proteomics and Bioinformatics in 2003, there was an urgent need to bring graduate students to a minimal level of computer skills and programming; prior experience was virtually nonexistent. Fortunately, the field has changed and the Program has changed, and now our graduate students are usually quite competent at least in some practical aspects of computational biology. Not uniformly, however, and the wide disparity of previous experience has made it increasingly difficult to provide course offerings that address students' needs. JTB2020 therefore shares its lecture components with the BCB420 course, and there is a large range of topics in Applied Bioinformatics that are covered by students in self-study and discussion with the lecturer, customized to their actual needs.


The 2015 course...

This year's course will be very different from previous years' courses. In previous years we have worked with a structured, lecture-style format. This year we will be pursuing a wholly problem-oriented format. This is the plan:

  • We'll identify an interesting challenge in computational systems biology
  • We'll formulate an approach to this challenge as a project
  • We'll define the resources we need - data sources, algorithms, programming- and collaboration support
  • We'll define students' roles in the project according to their skills and experience
  • Then we will implement the project.



Organization

First lecture this term: Today, January 12, 2016 at 16:00 (4 pm), MSB 4279.

Attendance in person at the first lecture is mandatory. You will be penalized with two participation marks for non-attendance.[1]

This is as silly as it is unfortunately necessary - I can't get this course started effectively if not all students are present, and past experience in this regard has been poor. In this class we will coordinate the organization of the course, sign you up to the mailing list and the Student Wiki, and discuss the syllabus for this term.



CAUTION:

Until discussed in class today ALL material on this page is to be considered highly preliminary!



Dates
BCB420/JTB2020 is a Winter Term course.
Lectures: Tuesdays, 16:00 to 18:00. (Classes start at 10 minutes past the hour.)
Exam: None for this course.


Location
MS 4279 (Medical Sciences Building).


Departmental information
For BCB420 see the BCB420 Biochemistry Department Course Web page.
For JTB2020 see the JTB2020 Course Web page for general information.


Submissions
This is an electronic-submission-only course, but if you must print material, consider printing double-sided. Learn how at the Print-Double-Sided Student Initiative.


Recommended textbooks

Depending on your background, various levels of textbooks may be suitable. I will bring my evaluation copies to class so you can decide what may work for you.
Understanding Bioinformatics (Zvelebil & Baum) is a decent general introduction to many aspects of bioinformatics. It was published in 2007; an updated version is urgently needed. Still, some of the basics (like the algorithm for optimal sequence alignment) don't change. (Amazon) (Indigo) (ABE books)
Practical Bioinformatics (Agostino) covers some of the material of the BCH441 exercises. Expect a no-nonsense introduction to the very most basic stuff. I have my pet peeves about this book (as I have for many others, e.g. why in the world do they still teach CLUSTAL when all available studies demonstrate it to be the least accurate MSA algorithm by a margin???), but if you haven't taken BCH441, this may serve you well. And if you did take BCH441, it may consolidate some ideas that I wasn't clear about. (Amazon) (Indigo) (ABE books)
If you are aware of recent good textbooks, or have your own opinions about these or other books, let me know.





Grading and Activities

Activity                           Weight: BCB420 (Undergraduates)   Weight: JTB2020 (Graduates)
3 In-class introductory quizzes    18 marks (3 x 6)                  12 marks (3 x 4)
9 Course objective evaluations     27 marks (9 x 3)                  18 marks (9 x 2)
Class project contributions        40 marks                          40 marks
"Classroom" participation          15 marks                          15 marks
Project Manuscript Draft           (none)                            15 marks
Total                              100 marks                         100 marks


A note on marking

I do not adjust marks towards a target mean and variance (i.e. there will be no "belling" of grades). I feel strongly that such "normalization" detracts from a collaborative and mutually supportive learning environment. If your classmate gets a great mark because you helped him with a difficult concept, this should never have the effect that it brings down your mark through class average adjustments. Collaborate as much as possible; it is a great way to learn.


Prerequisites

You must have taken an introductory bioinformatics course as a prerequisite, or otherwise acquired the necessary knowledge. Therefore I expect familiarity with the material of my BCH441 course. If you have not taken BCH441, please update your knowledge and skills before the course starts. I will not make accommodations for lack of prerequisites. Please check the syllabus for this course below to find whether you need to catch up on additional material, and peruse this site to find the information you may need. A (non-exhaustive) overview of topics and useful links is linked here.


Course Objectives

Building Software

Understand principles of software design and implementation in a collaborative environment.

This objective is implicit in students' project participation.

Gene Lists

Understand sources of and types of gene lists, gene IDs.

Gene IDs and gene lists are in many respects the raw material from which we construct bioinformatics. Here are two articles to set the stage:


BioDBnet is a data-warehouse at the US National Cancer Institute.

Mudunuri et al. (2009) bioDBnet: the biological database network. Bioinformatics 25:555-6. (pmid: 19129209)



The Molecular Signatures Database (MSigDB) collects examples of gene sets: lists of gene identifiers with a shared property. The paper discusses V3.0; the database has grown since then.

Liberzon et al. (2011) Molecular signatures database (MSigDB) 3.0. Bioinformatics 27:1739-40. (pmid: 21546393)


The currently available gene sets are described here: http://www.broadinstitute.org/gsea/msigdb/collections.jsp


 

Reading response (~ 1/2 page; due April 2.): We have discussed pairwise interactions for systems discovery. How would you use gene lists like those contained in the MSigDB as an additional information source?

 


Co-expression

Understand the use of expression information to infer co-regulation.

A principle we sometimes call "guilt by association" states that genes that have similar features have a functional relationship. Applied to expression levels, the inference is: if the expression levels of two genes are correlated, we may assume that they are co-regulated. And if nature has evolved them to be co-regulated, the selective advantage is derived from a shared function. To put this into practice, one needs to find a suitable set of experiments for the expression profiles, one needs to assess multiple experiments in a common frame of reference, and one needs to calculate correlation in a meaningful way. Two databases have recently been published that store such co-expression values; it is useful to compare and contrast their approaches.

Williams (2015) Database of Gene Co-Regulation (dGCR): A Web Tool for Analysing Patterns of Gene Co-regulation across Publicly Available Expression Data. J Genomics 3:29-35. (pmid: 25628763)


Fahrenbach et al. (2014) The CO-Regulation Database (CORD): a tool to identify coordinately expressed genes. PLoS ONE 9:e90408. (pmid: 24599084)


Quiz (March 25): Brief quiz on this paper. Understand the methods.
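
The co-expression inference in this section can be sketched in a few lines. This is a minimal illustration only, assuming Pearson correlation as the measure and a fixed threshold of 0.9; the gene names and expression values are made up, and neither dGCR nor CORD is claimed to work this way.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    covariance = sum(a * b for a, b in zip(dx, dy))
    norm = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / norm


def coexpressed_pairs(profiles, threshold=0.9):
    """Return (gene, gene, r) for every gene pair with r >= threshold."""
    genes = sorted(profiles)
    pairs = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = pearson(profiles[g], profiles[h])
            if r >= threshold:
                pairs.append((g, h, r))
    return pairs


# Hypothetical expression values for three made-up genes in five experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],   # rises with geneA
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as geneA rises
}

pairs = coexpressed_pairs(profiles)
for g, h, r in pairs:
    print(f"{g} ~ {h}: r = {r:.3f}")
```

Only the geneA/geneB pair passes the threshold here. Whether strongly anti-correlated pairs such as geneA/geneC should also count as evidence of co-regulation is a design decision this sketch leaves open.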

 


Molecular Interaction

Understand the use of interaction data to infer contribution to common function.

Interaction databases provide some of the best evidence for functional relationships between biomolecules, but to use them productively can be challenging. First of all, we are leaving the paradigm of individual molecules and lists of molecules behind, and entering the world of graphs and networks. Secondly, interaction databases have historically struggled to maintain their data to a common standard, and the source data can be of widely varying reliability. As a result, the overlap between different databases has been embarrassingly low, and integration efforts that simply take the superset of all reported interactions suffer from too many false positives. A good introduction to the topic is here:

Orchard (2012) Molecular interaction databases. Proteomics 12:1656-62. (pmid: 22611057)


Quiz (April 1): Brief quiz on this paper. Understand the issues in curating and storing interaction data.
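
To make the overlap problem concrete, here is a small sketch that compares two interaction lists. The protein identifiers and both "databases" are hypothetical, and the Jaccard index is just one convenient way to express overlap; real databases would first require mapping identifiers to a common namespace.

```python
def normalize(interactions):
    """Store each interaction as an unordered pair, so (A, B) equals (B, A)."""
    return {frozenset(pair) for pair in interactions}


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical interaction lists from two made-up databases.
db1 = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
db2 = [("P2", "P1"), ("P4", "P5"), ("P3", "P4"), ("P1", "P5")]

edges1 = normalize(db1)
edges2 = normalize(db2)

superset = edges1 | edges2    # everything reported anywhere
consensus = edges1 & edges2   # only what both sources report

overlap = jaccard(edges1, edges2)
print(f"superset: {len(superset)}, consensus: {len(consensus)}, Jaccard: {overlap:.2f}")
```

The superset keeps five interactions while the consensus keeps two; the trade-off between coverage and false positives sits between these two extremes.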

 


Function prediction from network data assumes some functions have been annotated and the network will guide which functions to transfer to un-annotated nodes. Two major approaches have been proposed: diffusion-based approaches and clustering. We will discuss clustering elsewhere; here is a recent example of diffusion approaches.

Ma et al. (2014) Integrative approaches for predicting protein function and prioritizing genes for complex phenotypes using protein interaction networks. Brief Bioinformatics 15:685-98. (pmid: 23788799)

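
As a rough illustration of what "diffusion" means here, the sketch below implements a random walk with restart on a toy network. This is one common diffusion scheme, chosen for simplicity; it is not claimed to be the method of the paper above. The node names, the restart probability of 0.3 and the convergence tolerance are all assumptions.

```python
def build_graph(edges):
    """Build an undirected adjacency map from a list of node pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-9, max_iter=1000):
    """Score every node by its proximity to the annotated seed nodes."""
    if not seeds or not set(seeds) <= set(adj):
        raise ValueError("seeds must be a non-empty subset of the graph nodes")
    nodes = sorted(adj)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(start)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed ...
        nxt = {n: restart * start[n] for n in nodes}
        # ... otherwise it moves to a uniformly chosen neighbour.
        for n in nodes:
            share = (1.0 - restart) * score[n] / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        change = sum(abs(nxt[n] - score[n]) for n in nodes)
        score = nxt
        if change < tol:
            break
    return score


# Hypothetical chain of interactions; only protein "A" carries the annotation.
adj = build_graph([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")])
seeds = {"A"}
scores = random_walk_with_restart(adj, seeds)

# Rank un-annotated nodes: higher score means a stronger case for transfer.
ranked = sorted((n for n in adj if n not in seeds), key=lambda n: -scores[n])
print(ranked)
```

In this toy chain the ranking simply follows distance from the seed; on a real network, degree and alternative paths change the picture, which is the point of using diffusion rather than plain shortest paths.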

 

GO

Understand the use of GO and GOA databases, and how to compute semantic similarity.

The notion of "function" is notoriously difficult to compute with, and the most successful approach to date is contributed by the Gene Ontology (GO) Consortium. GO is an ontology of concepts, organized in a DAG (Directed Acyclic Graph: a hierarchical data-structure like a tree, but nodes can have more than one parent). Actually, GO comprises three ontologies for (i) biological processes, (ii) cellular components and (iii) molecular functions. Gene Ontology Annotation (GOA) is a database curated by UniProt, which annotates the UniProtKB proteins with GO terms. The collection of GO terms for a protein is presumed to reflect its function. Visit these sites for a brief introduction.

To work with GO, we need a somewhat deeper understanding of the principles. The discussion of changes to the ontology is a useful start.

Huntley et al. (2014) Understanding how and why the Gene Ontology and its annotations evolve: the GO within UniProt. Gigascience 3:4. (pmid: 24641996)



For this course, one question is particularly important: to analyze whether two genes collaborate. This can usually not be directly inferred from their annotations, but the similarity of their annotated GO terms is an important indicator. There are many ways to compute such semantic similarity. Here is a recent paper that proposes a new measure and compares it to previous approaches:

Zhang & Lai (2015) Semantic similarity measurement between gene ontology terms based on exclusively inherited shared information. Gene 558:108-17. (pmid: 25550042)


Quiz (April 1): Brief quiz on this paper. Understand the principle of "semantic similarity".
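
To fix ideas about DAGs and semantic similarity, here is a deliberately simple sketch. The term names are invented (they are not real GO identifiers), and the similarity is a plain Jaccard index over ancestor sets. It is not the measure proposed by Zhang & Lai, nor any of the information-content measures they compare against; it only shows the principle that terms sharing more of their ancestry are more similar.

```python
# Hypothetical ontology: each term maps to the set of its direct parents.
parents = {
    "root": set(),
    "T1": {"root"},
    "T2": {"root"},
    "T3": {"T1", "T2"},   # two parents: this makes the structure a DAG, not a tree
    "T4": {"T1"},
    "T5": {"T3"},
}


def ancestors(term, parents):
    """Return the term itself plus every term reachable through parent links."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])
    return seen


def term_similarity(a, b, parents):
    """Jaccard index of the two ancestor sets (1.0 for identical terms)."""
    anc_a = ancestors(a, parents)
    anc_b = ancestors(b, parents)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


print(sorted(ancestors("T5", parents)))
print(term_similarity("T3", "T5", parents))   # parent and child: high
print(term_similarity("T4", "T5", parents))   # share only T1 and root: lower
```

Comparing two genes rather than two terms needs one more step, since each gene carries a set of annotated terms; how to aggregate the pairwise term similarities is itself a modelling choice.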

 

Pathways

Understand the contents of pathway databases, and their use for gene-pair annotation.

Pathways are the classical paradigm to organize biochemistry into a meaningful framework. Metabolic pathways came first; later additions include regulatory/signalling pathways and developmental pathways. In a sense such pathways should correlate with a notion of systems as collaborating entities, or at least form the cores of such systems. But how to exploit this information is not trivial, since pathways are also just conceptual entities: paths in much larger, multiply interconnected networks. One of the classic databases in this field is KEGG; it contains both signalling and metabolic pathways. MetaCyc/BioCyc probably has the current lead in breadth of reactions but is metabolic only. Reactome is excellently curated by the EBI and contains metabolic and signalling pathways, but is human only.

I have not found a good, current paper that utilizes database-scale pathway information for the discovery of broad principles. But here is a good, relatively recent overview of MetaCyc/BioCyc to set the stage.

Caspi et al. (2014) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res 42:D459-71. (pmid: 24225315)



 

Reading response (~ 1/2 page; due April 2.): Regulatory pathways are usually named according to key proteins they are organized around. Can you think of a better way?

 



Graph features

Understand the analysis of graphs and computation of graph features.

Graph theory is the most important theoretical framework for systems biology. Here is an introduction with a perspective on biological networks.

Pavlopoulos et al. (2011) Using graph theory to analyze biological networks. BioData Min 4:10. (pmid: 21527005)


Quiz (March 25): Brief quiz on this paper. Understand basic concepts and terms.
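
A few of the basic graph features can be computed with nothing but the standard library. The sketch below assumes an undirected, unweighted graph stored as an adjacency map; the five-node network is made up for illustration.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from a list of node pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree(adj, node):
    """Number of neighbours of a node."""
    return len(adj[node])


def clustering_coefficient(adj, node):
    """Fraction of possible links among a node's neighbours that exist."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
    return 2.0 * links / (k * (k - 1))


def shortest_path_length(adj, source, target):
    """Breadth-first search; returns None if target is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


# Hypothetical network: a triangle A-B-C with a tail C-D-E.
adj = build_graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

print(degree(adj, "C"))                     # neighbours A, B and D
print(clustering_coefficient(adj, "C"))     # one of three possible links exists
print(shortest_path_length(adj, "A", "E"))  # A -> C -> D -> E
```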

 

Graphs, Pathways, and Networks

Understand the representation of interaction data in systems biology as graphs.

Here Nataša Pržulj develops an analysis of the interaction network topology.

Janjić & Pržulj (2014) The topology of the growing human interactome data. J Integr Bioinform 11:238. (pmid: 24953453)




 

Reading response (~ 1/2 page; due April 2.): Sketch the relationship between network topology and "system".

 



Graph clustering

Understand the principles and application of modern graph-clustering algorithms.

Cluster theory is a powerful approach to structure data. The basic idea is simple: define clusters as subsets that share more of a certain property within a set than between sets. Putting this into practice, however, is non-trivial: everything depends on the precise definition of the property we are using to organize the data, and what we mean precisely by "within" and "between". Applying the notion of clusters to graphs has its own set of theoretical challenges: in this case we are clustering topological relations, not object attributes. But the implications are profound and range from an improved understanding of biological network structure to a consistent strategy for function annotation. And perhaps biological systems discovery.

Trivodaliev et al. (2014) Exploring function prediction in protein interaction networks via clustering methods. PLoS ONE 9:e99755. (pmid: 24972109)



 

Reading response (~ 1/2 page; due April 2.): What is your preferred approach to validate the "systems" we discover through clustering? Why?
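
One way to make "within" versus "between" precise for graphs is Newman's modularity, which compares the fraction of edges inside each cluster with the fraction expected from node degrees alone. The sketch below scores two partitions of a made-up network; modularity is my choice for illustration, and other quality functions are equally legitimate.

```python
def modularity(edges, clusters):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over clusters of (e_c / m) - (d_c / 2m)^2, where m is the number
    of edges, e_c the number of edges inside cluster c, and d_c the summed
    degree of the nodes in cluster c.
    """
    m = len(edges)
    cluster_of = {node: i for i, members in enumerate(clusters) for node in members}
    within = [0] * len(clusters)
    degree_sum = [0] * len(clusters)
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            within[cu] += 1
    return sum(w / m - (d / (2 * m)) ** 2 for w, d in zip(within, degree_sum))


# Hypothetical network: two triangles joined by a single edge C-D.
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("D", "E"), ("E", "F"), ("D", "F"),
         ("C", "D")]

good = modularity(edges, [{"A", "B", "C"}, {"D", "E", "F"}])
bad = modularity(edges, [{"A", "D"}, {"B", "E"}, {"C", "F"}])
print(f"triangles as clusters: Q = {good:.3f}")
print(f"arbitrary pairs:       Q = {bad:.3f}")
```

The partition that follows the two triangles scores higher than the arbitrary one. A score like this says nothing by itself about biological meaning, which is why validation remains a separate question.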

 





In depth...


Resources

Course related


Contents related




 

Notes

  1. Only in case you are sick will you be excused. But in that case you must contact me before class.