Computational Systems Biology Main Page

From "A B C"
Jump to navigation Jump to search

Computational Systems Biology

Course Wiki for BCB420 (Computational Systems Biology) and JTB2020 (Applied Bioinformatics).


 

This is our main tool to coordinate information, activities and projects in the University of Toronto's computational systems biology course BCB420. If you are not one of our students, this site is unlikely to be useful. If you are here because you are interested in general aspects of bioinformatics or computational biology, you may want to review the Wikipedia article on bioinformatics, or visit Wikiomics. Contact boris.steipe(at)utoronto.ca with any questions you may have.


 

Please do not forget to sign up for the Final Oral Exams.

BCB420 / JTB2020

These are the course pages for BCB420H (Computational Systems Biology). Welcome, you're in the right place.

These are also the course pages for JTB2020H (Applied Bioinformatics). How come? Why is JTB2020 not the graduate equivalent of BCB410 (Applied Bioinformatics)? Let me explain. When this course was conceived in 2003 as a required part of the (then so-called) Collaborative PhD Program in Proteomics and Bioinformatics, there was an urgent need to bring graduate students to a minimal level of computer skills and programming; prior experience was virtually nonexistent. Fortunately, the field has changed, and our current graduate students are usually quite competent in at least some practical aspects of computational biology. In this course we profit from the rich and diverse problem-domain knowledge our graduate students bring, while bringing everyone up to a level of competence in the practical, computational aspects.


The 2018 course...

In this course we pursue a task in the computational systems biology of human genes in a project-oriented format. This will proceed in three phases:

  • First, we will review basic computational skills and bioinformatics knowledge to bring everyone to the same level. In all likelihood you will need to start with these tasks well in advance of the actual lectures. This phase will end with a comprehensive quiz in week 3;
  • Next we'll focus on data integration and the definition of features. As an example, we will integrate gene expression data from different experiments into a common set of features; a minimal R sketch of this idea follows this list. Each student will contribute data from one experiment. The results of this phase will be the topic of our First Oral Exam;
  • Finally, we will each adopt a biological "system" in human cells and use machine learning methods to attempt to refine its gene membership and assign roles to its member genes. The results will form the basis of our Final Oral Exam;
  • There are several meta-skills that you will pick up "on the side"; these include time management, working according to the best practices of reproducible research in a collaborative environment on GitHub, report writing, and keeping a scientific lab journal.
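
As a taste of the phase-two integration step, here is a minimal R sketch that combines two toy expression matrices on their shared gene identifiers. All values, gene symbols and sample names below are invented for illustration; the real class data will of course require platform-specific preprocessing and careful normalization.

  # Minimal sketch: combine two toy expression matrices on their shared genes.
  # (Values, gene symbols and sample names are invented for illustration only.)
  set.seed(420)
  exprA <- matrix(rnorm(12), nrow = 4,
                  dimnames = list(c("TP53", "MYC", "EGFR", "BRCA1"),
                                  paste0("A_sample", 1:3)))
  exprB <- matrix(rnorm(15), nrow = 5,
                  dimnames = list(c("MYC", "EGFR", "BRCA1", "KRAS", "PTEN"),
                                  paste0("B_sample", 1:3)))

  common   <- intersect(rownames(exprA), rownames(exprB))  # shared feature set
  combined <- cbind(exprA[common, ], exprB[common, ])      # one matrix over common genes

  # Per-study normalization would follow here; column-wise scale() is the
  # simplest stand-in for illustration.
  combinedScaled <- scale(combined)
  head(combinedScaled)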



Organization

Dates
BCB420/JTB2020 is a Winter Term course.
Lectures: Tuesdays, 16:00 to 18:00. (Classes start at 10 minutes past the hour.)
Final Exam: None for this course.


Location
MS 4279 (Medical Sciences Building).


Departmental information
For BCB420 see the BCB420 Biochemistry Department Course Web page.
For JTB2020 see the JTB2020 Course Web page for general information.


Prerequisites and Preparation

This course has formal prerequisites of BCH441H1 (Bioinformatics) or CSB472H1 (Computational Genomics and Bioinformatics). I have no way of knowing what is being taught in CSB472, and no way of confirming how much you remember from any of your previous courses, like BCH441 or BCB410. Moreover, there are many alternative ways to become familiar with important course content. Thus I generally enforce prerequisites only very weakly, and you should not assume that having taken any particular combination of courses will have prepared you sufficiently. What I try to do instead is make the contents of the course very explicit. If your preparation is lacking, you will have to expend a very significant amount of effort. This is certainly possible, but whether you will succeed will depend on your motivation and aptitude.

The course requires (i) a solid understanding of molecular biology, (ii) solid, introductory level knowledge of bioinformatics, (iii) a working knowledge of the R programming language.


 

The prerequisite material for this course covers the contents of the 2017 BCH441 course:

  • <command>-Click to open the Bioinformatics Learning Units Map in a new tab; scale for detail.
A knowledge network map of the bioinformatics learning units.
  • Open the Bioinformatics Knowledge Network Map and get an overview of the material. You should confidently be able to execute the tasks in the four Integrator Units.
  • If you have taken an earlier version of BCH441 (pre-2017), you will need to work through many of the units, since a great deal of new material has been added.
  • If you have taken BCH441 in 2017, most of the material will be familiar. You will need to review some of the units and familiarize yourself more with the R programming aspects.
  • If you have not taken BCH441, you will need to work through the material rather carefully.


 

The preparatory material for BCB420 is linked from the BCB420-specific map below. It covers a subset of the BCH441 material and will be the subject of our first Quiz in the third week of class. We will hold a mock-quiz in the second week.




The "Knowledge Network"

Supporting learning units for this course are organized in a "Knowledge Network" of self-contained units that can be worked on according to students' individual needs and timing. Here is the detailed map. It contains links to all of the units.


  • <command>-Click to open the Learning Units Map in a new tab; scale for detail.
A map of the BCB420 learning units.
  • Hover over a learning unit to see its keywords.
  • Click on a learning unit to open the associated page.
  • The nodes of the learning unit network are colour-coded:
    • Live units are green.
    • Units under development are light green. These are still in progress.
    • Stubs (placeholders) are pale. These still need basic contents.
    • Milestone units are blue. These collect a number of prerequisites to simplify the network.
    • Integrator units are red. These embody the main goals of the course.
    • Units that require revision are pale orange.
  • Units that have a black border have deliverables that can be submitted for credit. Visit the node for details.
  • Arrows point from a prerequisite unit to a unit that builds on its contents.

(Many new units will be added to the map as the course progresses; reload the map frequently.)


 

Navigating the course

Everything starts with the following three units:

This should be the first learning unit you work with, since your Course Journal, as well as all other deliverables, will be kept on a Wiki. This unit includes an introduction to authoring Wikitext and the structure of Wikis, in particular how different pages live in separate "Namespaces". The unit also covers the standard markup conventions - "Wikitext markup", the same conventions that are used on Wikipedia - as well as some extensions that are specific to our Course and Student Wiki. We also discuss page categories that help keep a Wiki organized, licensing under a Creative Commons Attribution license, and how to add licenses and other page components through template codes.


Keeping a journal is an essential task in a laboratory. To practice keeping a technical journal, you will document your activities as you work through the material of the course. A significant part of your term grade will be given for this Course Journal. This unit introduces components and best practices for lab and course journals and includes a wiki-source template to begin your own journal on the Student Wiki.


In parallel with your other work, you will maintain an Insights! page on which you collect valuable insights and learning experiences from the course. Through this you ask yourself: what does this material mean for the field, and for myself?


  • Once you have completed these three units, get started immediately on the Introduction-to-R units. You need time and practice, practice, practice[1] to acquire the programming skills you will need for the course.
  • Whenever you want to take a break from studying R, work through the other preparatory units.

At the end of our preparatory phase (after week 2) we will hold a comprehensive, non-trivial quiz on the preparatory units and on R basics.



 

Grading and Activities

 
Activity                                                                    Weight: BCB420 (Undergraduates)   Weight: JTB2020 (Graduates)
Self-evaluation and feedback session on preparatory material ("Quiz"[2])   20 marks                          20 marks
First Oral Exam (Feb. 15/16)                                                20 marks                          15 marks
Final Oral Exam (Mar. 29/30)                                                30 marks                          25 marks
Journal                                                                     25 marks                          25 marks
Insights                                                                     5 marks                           5 marks
Manuscript Draft                                                                -                             10 marks
Total                                                                      100 marks                         100 marks


 


Marks adjustments

I do not adjust marks towards a target mean and variance (i.e. there will be no "belling" of grades). I feel strongly that such "normalization" detracts from a collaborative and mutually supportive learning environment. If your classmate gets a great mark because you helped them with a difficult concept, this should never have the effect that it brings down your mark through class-average adjustments. Collaborate as much as possible; it is a great way to learn. But do keep it honest and carefully consider our rules on Plagiarism and Academic Misconduct.


 

Timetable and syllabus

Syllabus and activities in progress for the 2018 Winter Term ...


 


Week 1 - In class: Tuesday, January 9
  • No class meeting this day!
  • Preparations I
  • Important dates
  • Grading
  • Organization
  • Signup to mailing list and Student Wiki.


 


Week 2 - In class: Tuesday, January 16
  • First class meeting
  • Review of preparatory materials (you should have worked through all of the materials in preparation for class).
  • Practice quiz on preparations (not for credit)
  • Preparations II
  • Defining the class projects


 


Week 3 - In class: Tuesday, January 23
  • First Quiz
  • Data Integration I
  • Data sources and workflows
  • Development principles
  • Writing R packages
  • Collaboration tools


 


Week 4 - In class: Tuesday, January 30
  • ...
  • Data Integration II


 


Week 5 - In class: Tuesday, February 6
  • ...
  • Data Integration III


 


Week 6 - In class: Tuesday, February 13
  • Finish Data Integration tasks
  • Discuss and adopt Systems tasks


 


Reading Week - Tuesday, February 20
  • No class meeting - Reading Week
  • Systems readings


 


Week 7 - In class: Tuesday, February 27
  • Summary of gene expression data results and perspectives for expression normalization
  • Summary of categorical feature tasks and perspectives for feature dimension reduction
  • Outline of changed course objectives
  • Current Genome Analysis
Reuter et al. (2018) The Personal Genome Project Canada: findings from whole genome sequences of the inaugural 56 participants. CMAJ 190:E126-E136. (pmid: 29431110)

PubMed ] [ DOI ] BACKGROUND: The Personal Genome Project Canada is a comprehensive public data resource that integrates whole genome sequencing data and health information. We describe genomic variation identified in the initial recruitment cohort of 56 volunteers. METHODS: Volunteers were screened for eligibility and provided informed consent for open data sharing. Using blood DNA, we performed whole genome sequencing and identified all possible classes of DNA variants. A genetic counsellor explained the implication of the results to each participant. RESULTS: Whole genome sequencing of the first 56 participants identified 207 662 805 sequence variants and 27 494 copy number variations. We analyzed a prioritized disease-associated data set (n = 1606 variants) according to standardized guidelines, and interpreted 19 variants in 14 participants (25%) as having obvious health implications. Six of these variants (e.g., in BRCA1 or mosaic loss of an X chromosome) were pathogenic or likely pathogenic. Seven were risk factors for cancer, cardiovascular or neurobehavioural conditions. Four other variants - associated with cancer, cardiac or neurodegenerative phenotypes - remained of uncertain significance because of discrepancies among databases. We also identified a large structural chromosome aberration and a likely pathogenic mitochondrial variant. There were 172 recessive disease alleles (e.g., 5 individuals carried mutations for cystic fibrosis). Pharmacogenomics analyses revealed another 3.9 potentially relevant genotypes per individual. INTERPRETATION: Our analyses identified a spectrum of genetic variants with potential health impact in 25% of participants. When also considering recessive alleles and variants with potential pharmacologic relevance, all 56 participants had medically relevant findings. Although access is mostly limited to research, whole genome sequencing can provide specific and novel information with the potential of major impact for health care.

Reading notes ...

The first order of reading a paper is of course to actually understand its terminology and methods. Terms that are not familiar need to be looked up. Some less common concepts I see in the text include: trait; allele; mosaic loss; haplotype and diplotype; semidominant and codominant; homo- and heterozygous; SNV, indel, CNV and SV; pharmacogenes; re-identification; WGS sequencing; ancestry determination, ROH analysis and relationship inference; GRCh37/hg19.

The paper itself is quite straightforward, but the point here is the methodology: you need to be familiar with the standards and information resources that are being used - the background and context of the study. I have listed some items that came up (but this list is not exhaustive): the project website and the data that the authors make available there (fastq? sam/bam? vcf?); the Genome Reference Consortium; the Clinical Genomic Database; the Human Gene Mutation Database; ClinVar; the Genome Aggregation Database (gnomAD); the ACMG guidelines. Much reference data has come from the 1000 Genomes Project; additional sequencing was provided by Complete Genomics. A number of local institutions have participated in this - and as a UofT BCB student I hope you would be familiar with what these are: Deep Genomics; DNAstack; the McLaughlin Centre; TCAG; Ontario Genomics ...
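
If you retrieve one of the VCF files the project makes available, a few lines of base R are enough for a first look at the variant calls. This is only a sketch, not the authors' pipeline: the file name below is a placeholder, and the SNV/indel classification is deliberately naive (Bioconductor's VariantAnnotation package is the tool of choice for serious work).

  # Sketch: a first look at a VCF file using base R only.
  # "PGPC_example.vcf" is a placeholder name - substitute the file you downloaded.
  vcfFile <- "PGPC_example.vcf"

  # VCF meta-lines start with "##"; the column header line starts with "#CHROM".
  lines <- readLines(vcfFile)
  body  <- lines[!grepl("^##", lines)]
  vcf   <- read.table(text = body, header = TRUE, sep = "\t",
                      comment.char = "", stringsAsFactors = FALSE)
  colnames(vcf)[1] <- "CHROM"              # repair the mangled "#CHROM" header

  # Crude classification of simple SNVs vs. everything else from REF/ALT lengths.
  isSNV <- nchar(vcf$REF) == 1 & nchar(vcf$ALT) == 1
  table(ifelse(isSNV, "SNV", "indel/other"))
  table(vcf$CHROM)                         # variant counts per chromosome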

Much detail may be published in supplementary material and this often includes data that can be used for further studies. The supplementary material definitely needs to be read carefully. Could you reproduce the findings given the methods and data that is available? How?

The actual results need to be critically evaluated: What do the figures tell you? What are the final conclusions of the authors? Are they supported by data? Relevant? Was the study successful regarding its objectives, and are the objectives valid for the field, for the population at large, and for yourself?

After assessing the information that is presented, you should draw some conclusion from your reading: is this good work? Why, or why not? Can you anticipate what the next steps should be - this is what your future grant proposals or project ideas (if you were to write any) would be about. What would you improve and what additional analyses could/should be done?


Finally, it's probably useful to document your activities in your journal.


 


Week 8 - In class: Tuesday, March 6
  • In-class discussion of Personal Genome Project Canada paper
    • Focus on analysis pipeline and tools
  • Specifications for programming tasks
  • Current Population Scale Genome Analysis
Dolle et al. (2017) Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes. Genome Res 27:300-309. (pmid: 27986821)

PubMed ] [ DOI ] We are rapidly approaching the point where we have sequenced millions of human genomes. There is a pressing need for new data structures to store raw sequencing data and efficient algorithms for population scale analysis. Current reference-based data formats do not fully exploit the redundancy in population sequencing nor take advantage of shared genetic variation. In recent years, the Burrows-Wheeler transform (BWT) and FM-index have been widely employed as a full-text searchable index for read alignment and de novo assembly. We introduce the concept of a population BWT and use it to store and index the sequencing reads of 2705 samples from the 1000 Genomes Project. A key feature is that, as more genomes are added, identical read sequences are increasingly observed, and compression becomes more efficient. We assess the support in the 1000 Genomes read data for every base position of two human reference assembly versions, identifying that 3.2 Mbp with population support was lost in the transition from GRCh37 with 13.7 Mbp added to GRCh38. We show that the vast majority of variant alleles can be uniquely described by overlapping 31-mers and show how rapid and accurate SNP and indel genotyping can be carried out across the genomes in the population BWT. We use the population BWT to carry out nonreference queries to search for the presence of all known viral genomes and discover human T-lymphotropic virus 1 integrations in six samples in a recognized epidemiological distribution.

Reading notes ...

You are now familiar with the basic vocabulary of genome-scale sequence analysis, and I see only a small number of additional terms: "read pairs" and "phase". The main focus of interest here is the workflows and algorithms that support population-scale genome analysis. Much is based on the BWT and the FM-index for data management and retrieval, and on (coloured) de Bruijn graphs (in Cortex; tips? unitigs?) for error correction; you need to understand these algorithms and how they are used. Bloom filters (and sequence Bloom trees) are an alternative worth keeping in mind. Workflows described in the paper include error correction, k-mer retrieval, and application to reference and non-reference mapping. You should think about how one would do GWAS-type queries, how CNV/SV assessment would work, and how one would construct actual assemblies. Part of this hinges on the question of exactly how CNV/SV variations are stored in GRCh37/38. The need to store quality data is a big concern, and the concept of "controlled loss of base qualities" is important.
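
To build some intuition for the Burrows-Wheeler transform that the population BWT extends, here is a naive R implementation for a short string. This is illustration only: real indexes are constructed via suffix arrays, not by sorting all rotations, and the string below is arbitrary.

  # Naive Burrows-Wheeler transform of a short string (illustration only).
  bwt <- function(s) {
    s <- paste0(s, "$")                    # sentinel marks the end of the string
    chars <- strsplit(s, "")[[1]]
    n <- length(chars)
    # build all cyclic rotations of the string
    rotations <- vapply(seq_len(n), function(i) {
      paste(chars[c(i:n, seq_len(i - 1))], collapse = "")
    }, character(1))
    sorted <- sort(rotations)
    # the BWT is the last column of the sorted rotation list
    paste(substr(sorted, n, n), collapse = "")
  }

  bwt("GATTACA")  # identical substrings end up adjacent in the transform,
                  # which is what makes millions of similar reads so compressible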

Once again, a number of interesting resources and standards are mentioned: the CRAM format (interesting discussion here), GENCODE annotations, Genome in a Bottle, the GATK toolkit and SAMtools, the KRAKEN classifier, and Smalt.

Important content is also found in the supplementary data.

As always: document your activities in your journal.



 


Week 9 - In class: Tuesday, March 13
  • In-class discussion of Population Genome Analysis paper
    • Focus on BWT and population BWT
  • Specifications for programming tasks
  • Wednesday March 14: BCB420S drop date
  • Current Single Cell Transcriptional Profile Analysis
Yuzwa et al. (2017) Developmental Emergence of Adult Neural Stem Cells as Revealed by Single-Cell Transcriptional Profiling. Cell Rep 21:3970-3986. (pmid: 29281841)

PubMed ] [ DOI ] Adult neural stem cells (NSCs) derive from embryonic precursors, but little is known about how or when this occurs. We have addressed this issue using single-cell RNA sequencing at multiple developmental time points to analyze the embryonic murine cortex, one source of adult forebrain NSCs. We computationally identify all major cortical cell types, including the embryonic radial precursors (RPs) that generate adult NSCs. We define the initial emergence of RPs from neuroepithelial stem cells at E11.5. We show that, by E13.5, RPs express a transcriptional identity that is maintained and reinforced throughout their transition to a non-proliferative state between E15.5 and E17.5. These slowly proliferating late embryonic RPs share a core transcriptional phenotype with quiescent adult forebrain NSCs. Together, these findings support a model wherein cortical RPs maintain a core transcriptional identity from embryogenesis through to adulthood and wherein the transition to a quiescent adult NSC occurs during late neurogenesis.

Reading notes ...

Focus on the analysis pipeline including: PCA, t-SNE, clustering tools in Seurat (DBSCAN, k-means), the developmental trajectory analysis, MSTs and the Waterfall method, hierarchical clustering, Cyclone, Venn diagrams and Panther classification.
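
The PCA and clustering steps of such a pipeline can be mimicked on a toy matrix with base R alone. Seurat wraps these steps in its own functions; the counts, the spiked-in subpopulation and all parameter choices below are invented for illustration.

  # Base-R stand-in for the PCA + clustering core of a single-cell pipeline.
  set.seed(420)
  nGenes <- 200; nCells <- 90
  counts <- matrix(rpois(nGenes * nCells, lambda = 5), nrow = nGenes)
  counts[1:20, 31:60] <- counts[1:20, 31:60] + rpois(20 * 30, lambda = 10)  # fake subpopulation

  logExpr <- log2(t(counts) + 1)              # cells in rows, genes in columns
  pca <- prcomp(logExpr, scale. = TRUE)       # PCA on log-transformed counts

  clusters <- kmeans(pca$x[, 1:10], centers = 2, nstart = 20)$cluster
  table(clusters)                             # the spiked-in cells should separate

  plot(pca$x[, 1], pca$x[, 2], col = clusters,
       xlab = "PC1", ylab = "PC2", main = "Toy single-cell PCA")
  # t-SNE would typically be run on the top PCs, e.g. with the Rtsne package.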

As always: document your activities in your journal.


 


Week 10 - In class: Tuesday, March 20
  • In-class discussion of Single Cell Transcriptomics paper
    • Focus on data quality, processing pipeline, PCA and tSNE
  • Specifications for programming tasks
  • Current Network Analysis
Mezlini & Goldenberg (2017) Incorporating networks in a probabilistic graphical model to find drivers for complex human diseases. PLoS Comput Biol 13:e1005580. (pmid: 29023450)

PubMed ] [ DOI ] Discovering genetic mechanisms driving complex diseases is a hard problem. Existing methods often lack power to identify the set of responsible genes. Protein-protein interaction networks have been shown to boost power when detecting gene-disease associations. We introduce a Bayesian framework, Conflux, to find disease associated genes from exome sequencing data using networks as a prior. There are two main advantages to using networks within a probabilistic graphical model. First, networks are noisy and incomplete, a substantial impediment to gene discovery. Incorporating networks into the structure of a probabilistic models for gene inference has less impact on the solution than relying on the noisy network structure directly. Second, using a Bayesian framework we can keep track of the uncertainty of each gene being associated with the phenotype rather than returning a fixed list of genes. We first show that using networks clearly improves gene detection compared to individual gene testing. We then show consistently improved performance of Conflux compared to the state-of-the-art diffusion network-based method Hotnet2 and a variety of other network and variant aggregation methods, using randomly generated and literature-reported gene sets. We test Hotnet2 and Conflux on several network configurations to reveal biases and patterns of false positives and false negatives in each case. Our experiments show that our novel Bayesian framework Conflux incorporates many of the advantages of the current state-of-the-art methods, while offering more flexibility and improved power in many gene-disease association scenarios.
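
The propagation idea that underlies diffusion methods such as HotNet2 can be sketched as a random walk with restart on a toy adjacency matrix. The genes, edges, seed scores and restart parameter below are invented; this is neither the Conflux nor the HotNet2 implementation, just the core intuition.

  # Random walk with restart on a toy gene network (invented genes and edges).
  genes <- c("G1", "G2", "G3", "G4", "G5")
  A <- matrix(0, 5, 5, dimnames = list(genes, genes))
  edges <- rbind(c("G1","G2"), c("G2","G3"), c("G3","G4"), c("G4","G5"), c("G2","G4"))
  A[edges] <- 1; A[edges[, 2:1]] <- 1        # undirected adjacency matrix

  W  <- sweep(A, 2, colSums(A), "/")         # column-normalized transition matrix
  p0 <- c(1, 0, 0, 0, 1); p0 <- p0 / sum(p0) # seed scores (say, mutated genes G1 and G5)
  r  <- 0.5                                  # restart probability

  p <- p0
  for (i in 1:100) {                         # iterate to (approximate) convergence
    p <- (1 - r) * (W %*% p) + r * p0
  }
  round(setNames(as.vector(p), genes), 3)    # propagated scores over the network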


 


Week 11 - In class: Tuesday, March 27
  • In-class discussion of Network Analysis paper
    • Focus on network models (Hotnet2), graphical models (Conflux) - background and implementation.
  • Specifications for programming tasks



 


Week 12 - In class: Tuesday, April 3
  • No class meeting this day
  • Deadline for computational tasks to be documented in journal
  • Deadline for all remaining course deliverables



 




Resources

Course related


 
Miller et al. (2011) Strategies for aggregating gene expression data: the collapseRows R function. BMC Bioinformatics 12:322. (pmid: 21816037)

PubMed ] [ DOI ] BACKGROUND: Genomic and other high dimensional analyses often require one to summarize multiple related variables by a single representative. This task is also variously referred to as collapsing, combining, reducing, or aggregating variables. Examples include summarizing several probe measurements corresponding to a single gene, representing the expression profiles of a co-expression module by a single expression profile, and aggregating cell-type marker information to de-convolute expression data. Several standard statistical summary techniques can be used, but network methods also provide useful alternative methods to find representatives. Currently few collapsing functions are developed and widely applied. RESULTS: We introduce the R function collapseRows that implements several collapsing methods and evaluate its performance in three applications. First, we study a crucial step of the meta-analysis of microarray data: the merging of independent gene expression data sets, which may have been measured on different platforms. Toward this end, we collapse multiple microarray probes for a single gene and then merge the data by gene identifier. We find that choosing the probe with the highest average expression leads to best between-study consistency. Second, we study methods for summarizing the gene expression profiles of a co-expression module. Several gene co-expression network analysis applications show that the optimal collapsing strategy depends on the analysis goal. Third, we study aggregating the information of cell type marker genes when the aim is to predict the abundance of cell types in a tissue sample based on gene expression data ("expression deconvolution"). We apply different collapsing methods to predict cell type abundances in peripheral human blood and in mixtures of blood cell lines. Interestingly, the most accurate prediction method involves choosing the most highly connected "hub" marker gene. Finally, to facilitate biological interpretation of collapsed gene lists, we introduce the function userListEnrichment, which assesses the enrichment of gene lists for known brain and blood cell type markers, and for other published biological pathways. CONCLUSIONS: The R function collapseRows implements several standard and network-based collapsing methods. In various genomic applications we provide evidence that both types of methods are robust and biologically relevant tools.
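
The collapsing strategy that Miller et al. found to give the best between-study consistency - keeping, for each gene, the probe with the highest average expression - can be mimicked in a few lines of base R. Probe IDs, gene symbols and expression values below are invented; use collapseRows itself (in the WGCNA package) for real work.

  # Collapse multiple probes per gene by keeping the probe with the highest mean.
  set.seed(1)
  expr <- matrix(rnorm(8 * 4, mean = 8), nrow = 8,
                 dimnames = list(paste0("probe", 1:8), paste0("sample", 1:4)))
  probe2gene <- c(probe1 = "TP53", probe2 = "TP53", probe3 = "MYC", probe4 = "MYC",
                  probe5 = "MYC",  probe6 = "EGFR", probe7 = "EGFR", probe8 = "PTEN")

  probeMeans <- rowMeans(expr)
  keep <- tapply(names(probeMeans), probe2gene[names(probeMeans)],
                 function(p) p[which.max(probeMeans[p])])

  geneExpr <- expr[unlist(keep), , drop = FALSE]
  rownames(geneExpr) <- names(keep)          # one row per gene
  geneExpr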

Chang et al. (2013) Meta-analysis methods for combining multiple expression profiles: comparisons, statistical characterization and an application guideline. BMC Bioinformatics 14:368. (pmid: 24359104)

PubMed ] [ DOI ] BACKGROUND: As high-throughput genomic technologies become accurate and affordable, an increasing number of data sets have been accumulated in the public domain and genomic information integration and meta-analysis have become routine in biomedical research. In this paper, we focus on microarray meta-analysis, where multiple microarray studies with relevant biological hypotheses are combined in order to improve candidate marker detection. Many methods have been developed and applied in the literature, but their performance and properties have only been minimally investigated. There is currently no clear conclusion or guideline as to the proper choice of a meta-analysis method given an application; the decision essentially requires both statistical and biological considerations. RESULTS: We performed 12 microarray meta-analysis methods for combining multiple simulated expression profiles, and such methods can be categorized for different hypothesis setting purposes: (1) HS(A): DE genes with non-zero effect sizes in all studies, (2) HS(B): DE genes with non-zero effect sizes in one or more studies and (3) HS(r): DE gene with non-zero effect in "majority" of studies. We then performed a comprehensive comparative analysis through six large-scale real applications using four quantitative statistical evaluation criteria: detection capability, biological association, stability and robustness. We elucidated hypothesis settings behind the methods and further apply multi-dimensional scaling (MDS) and an entropy measure to characterize the meta-analysis methods and data structure, respectively. CONCLUSIONS: The aggregated results from the simulation study categorized the 12 methods into three hypothesis settings (HS(A), HS(B), and HS(r)). Evaluation in real data and results from MDS and entropy analyses provided an insightful and practical guideline to the choice of the most suitable method in a given application. All source files for simulation and real data are available on the author's publication website.
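
One classical approach in this family, Fisher's method for combining per-study p-values for each gene, takes only a few lines of R. The p-values below are invented, and whether Fisher's method is the right choice depends on the hypothesis setting - which is exactly the point of the paper.

  # Fisher's method: combine per-gene p-values across studies (toy p-values).
  pvals <- rbind(geneA = c(0.04, 0.03, 0.20),   # rows: genes, columns: studies
                 geneB = c(0.50, 0.61, 0.45),
                 geneC = c(0.01, 0.002, 0.03))

  fisherCombine <- function(p) {
    stat <- -2 * sum(log(p))                    # chi-squared with 2k degrees of freedom
    pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
  }

  apply(pvals, 1, fisherCombine)                # combined p-value per gene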

Thompson et al. (2016) Cross-platform normalization of microarray and RNA-seq data for machine learning applications. PeerJ 4:e1621. (pmid: 26844019)

PubMed ] [ DOI ] Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log 2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.
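
Quantile normalization, one of the baselines TDM is compared against, can be written out for a small matrix in a few lines. The toy data below are invented; for real data use an established implementation such as limma::normalizeQuantiles or the preprocessCore package.

  # Minimal quantile normalization of a toy expression matrix.
  set.seed(2)
  expr <- cbind(array1 = rexp(6, 1/8), array2 = rexp(6, 1/12), rnaseq = rexp(6, 1/5))

  quantileNormalize <- function(m) {
    ranks   <- apply(m, 2, rank, ties.method = "min")
    sorted  <- apply(m, 2, sort)
    refDist <- rowMeans(sorted)                 # reference distribution: mean of sorted columns
    out <- apply(ranks, 2, function(r) refDist[r])
    dimnames(out) <- dimnames(m)
    out
  }

  round(quantileNormalize(expr), 2)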


 


 

Notes

  1. It's practice!
  2. I call these activities Quiz sessions for brevity, however they are not quizzes in the usual sense, since they rely on self-evaluation and immediate feedback.