BCB410
BCB410H1F - 2017
Objectives and Participants
The "Applied Bioinformatics" course is offered as a part of the BCB Program curriculum to ensure that our students know enough about application issues in the field to be able to put their knowledge into practice in a research lab setting. This is to support the Specialist Program goal: to prepare students for graduate studies in the discipline.
As a required course in the BCB curriculum, BCB410 assumes the prerequisites and goals of fourth-year students in the BCB Specialist Program. Other students may be permitted to enrol on a case-by-case basis, but they may need to catch up on prerequisites in computer science or life-science courses that BCB students have taken at this point. Generally speaking, this is an advanced course that presupposes familiarity with programming principles, algorithm analysis, and methods of modern systems biology, together with introductory knowledge of linear algebra, graph theory, information theory, statistics, and molecular, structural, and cellular biology. The varying topics will be discussed at a highly technical level that is likely to be useful only for students who plan to integrate much of this material into their actual practice.
In this course we will build contents for a knowledge network in applied bioinformatics.
Knowledge Network
- This year's course will focus on "Data Science" in bioinformatics.
- Command-click to open the Knowledge Network in a new tab, then scale it for detail.
- Hover over a learning unit to see its keywords.
- Click on a learning unit to open the associated page.
- The nodes of the learning unit network are colour-coded:
  - Live units are green.
  - Units under development are light green. These are still in progress.
  - Stubs (placeholders) are pale. These still need basic contents.
  - Milestone units are blue. These collect a number of prerequisites to simplify the network.
  - Integrator units are red. These embody the main goals of the course.
  - Units that will be developed by students for this course are yellow.
  - Units that require revision are pale orange.
  - Units that have a black border have deliverables that are designed to be submitted for credit (not relevant for this course).
- Arrows point from a prerequisite unit to a unit that requires it.
For reference, and for links to other information sources, consider the Knowledge Network for the BCH441 - Bioinformatics course.
Sources
- Consider the following sources before deciding whether a unit is suitable for you, or find additional sources.
- Two general introductions to the field are here:
Libbrecht & Noble (2015) Machine learning applications in genetics and genomics. Nat Rev Genet 16:321-32. (pmid: 25948244)
The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.
Marx (2013) Biology: The big challenges of big data. Nature 498:255-60. (pmid: 23765498)
Two courses that I have taught this year contain code that can be adapted for many of the units we are constructing. Both courses are on GitHub and can be accessed there, or downloaded as RStudio projects.
- BCH2024 (2017) - Biological Data Analysis with R
- A graduate "Focussed Topics" course. In this course we analyzed a high-resolution time series of yeast cell-cycle expression profiles. This data is well suited for the kind of tasks we are focussing on. I think we should additionally aim to annotate the data with GOA (Gene Ontology Annotation). The course proceeded through six units: 1-Data; 2-Features; 3-Modelling; 4-Graphs; 5-Clustering; 6-MachineLearning. The cell-cycle expression data is described in the following paper:
Pramila et al. (2006) The Forkhead transcription factor Hcm1 regulates chromosome segregation genes and fills the S-phase gap in the transcriptional circuitry of the cell cycle. Genes Dev 20:2266-78. (pmid: 16912276)
Transcription patterns shift dramatically as cells transit from one phase of the cell cycle to another. To better define this transcriptional circuitry, we collected new microarray data across the cell cycle of budding yeast. The combined analysis of these data with three other cell cycle data sets identifies hundreds of new highly periodic transcripts and provides a weighted average peak time for each transcript. Using these data and phylogenetic comparisons of promoter sequences, we have identified a late S-phase-specific promoter element. This element is the binding site for the forkhead protein Hcm1, which is required for its cell cycle-specific activity. Among the cell cycle-regulated genes that contain conserved Hcm1-binding sites, there is a significant enrichment of genes involved in chromosome segregation, spindle dynamics, and budding. This may explain why Hcm1 mutants show 10-fold elevated rates of chromosome loss and require the spindle checkpoint for viability. Hcm1 also induces the M-phase-specific transcription factors FKH1, FKH2, and NDD1, and two cell cycle-specific transcriptional repressors, WHI5 and YHP1. As such, Hcm1 fills a significant gap in our understanding of the transcriptional circuitry that underlies the cell cycle.
- Exploratory Data Analysis with R - A Canadian Bioinformatics Workshop (2017)
- This two-day workshop was targeted to graduate students, postdocs and PIs who have little experience with programming but significant data analysis needs. The workshop had an online "Introduction to R" as a prerequisite, equivalent to the R introduction units in our knowledge network. Here are the individual RStudio projects (on GitHub).
- The Wikipedia article on Cluster analysis is a decent first introduction.
- We could spend an entire course just working through The Elements of Statistical Learning. And we would have a lot of fun. But this is not a statistics course. Read this anyway.
- Apparently Healy's Data Visualization for Social Science contains many good ideas. Online. Have a look.
- Grolemund & Wickham have a new book, R for Data Science. We need to have the talk about the tidyverse and whether it's actually a good idea[1].
- What's everybody talking about in the field? This.
Organization
Details for the 2017 course will be discussed in our first class session, Wednesday, September 13, at 10:00 in Bahen BA025.
It is imperative that you attend the first class session in person. Do not enrol in this course if you can't attend the first class session.
Dates and Location
Classes meet Wednesdays between 10:00 and 12:00 in BA025 (Bahen Centre) throughout the Fall Term. Classes start at 10 minutes past the hour.
Coordinator
Office hours
(Virtual) face-to-face meetings are by appointment, if required. However, we will be able to resolve almost all issues by e-mail. You will find that discussions by e-mail are both more efficient and more effective than meetings. Moreover, e-mail discussions leave you with a document trail of what was discussed, can contain links to information sources, and make it easier to share points of general interest with the class.
Contact
Contact within the class is easiest via the Google Group that you will subscribe to at the beginning of class.
Phases
We will work in four phases:
- You will design a learning unit and draft its contents;
- The class will work through the unit;
- We will go through "Code reviews" of the material;
- You will respond to the review and improve the material.
Marking
Activity | Weight
Initial design of your unit | 20 marks
Participation in review panels | 4 × 10 marks
Final version of unit | 20 marks
Journals | 15 marks
Insights! | 5 marks
Total | 100 marks
What makes an excellent grade? See here.
Notes
- [1] cf. "The tidyverse curse" for some thoughts on problems with the tidyverse philosophy, and "My aversion to pipes" for why the %>% operator may not be as good an idea as everybody seems to think it is.