Difference between revisions of "Computational Systems Biology Main Page"

From "A B C"
 
(142 intermediate revisions by the same user not shown)
Line 8: Line 8:
 
</div>
 
</div>
  
<small>'''This is our main tool to coordinate information, activities and projects in University of Toronto's computational systems biology course BCB420'''. If you are not one of our students, you can still browse this site; however, only users with a login account can edit or contribute material. If you are here because you are interested in general aspects of bioinformatics or computational biology, you may want to review the [http://en.wikipedia.org/wiki/Bioinformatics Wikipedia article on bioinformatics], or visit [http://www.openwetware.org/wiki/Wikiomics Wikiomics]. Contact boris.steipe(at)utoronto.ca with any questions you may have.</small>
+
{{Vspace}}
  
 +
<small>'''This is our main tool to coordinate information, activities and projects in University of Toronto's computational systems biology course BCB420'''. If you are not one of our students, this site is unlikely to be useful. If you are here because you are interested in general aspects of bioinformatics or computational biology, you may want to review the [http://en.wikipedia.org/wiki/Bioinformatics Wikipedia article on bioinformatics], or visit [http://www.openwetware.org/wiki/Wikiomics Wikiomics]. Contact boris.steipe(at)utoronto.ca with any questions you may have.</small>
 +
 +
 +
{{Vspace}}
 +
 +
<div class="alert">
 +
 +
If you are enrolled in this course but have not been subscribed to the mailing list, or do not have an account on the Student Wiki, please contact me immediately.
 +
 +
</div>
 +
 +
{{Vspace}}
  
 
__TOC__
 
__TOC__
  
 +
 +
{{Vspace}}
 +
 +
{{Vspace}}
  
 
== BCB420 / JTB2020 ==
 
== BCB420 / JTB2020 ==
  
These are the course pages for '''BCB420H (Computational Systems Biology)'''. Welcome, you'll feel right at home here.
+
These are the course pages for '''BCB420H (Computational Systems Biology)'''. Welcome, you're in the right place.
  
 +
These are also the course pages for '''JTB2020H (Applied Bioinformatics)'''. How come? Why is JTB2020 not the graduate equivalent of [[Applied Bioinformatics Main Page|BCB410 (<u>Applied Bioinformatics</u>)]]? Let me explain. When this course was conceived as a required part of the (then so called) ''Collaborative PhD Program in Proteomics and Bioinformatics'' in 2003, there was an urgent need to bring graduate students to a minimal level of computer skills and programming; prior experience was virtually nonexistent. Fortunately, the field has changed and our current graduate students are usually quite competent at least in some practical aspects of computational biology. In this course we profit from the rich and diverse knowledge of the problem-domain our graduate students have, while bringing everyone up to a level of competence in the practical, computational aspects.
  
These are also the course pages for '''JTB2020H (Applied Bioinformatics)'''. How come? Why is JTB2020 not the graduate equivalent of [[Applied Bioinformatics Main Page|BCB410 (<u>Applied Bioinformatics</u>)]]? Let me explain. When this course was conceived as a required part of the (then so called) ''Collaborative PhD Program in Proteomics and Bioinformatics'' in 2003, there was an urgent need to bring graduate students to a minimal level of computer skills and programming; prior experience was virtually nonexistent. Fortunately, the field has changed and the Program has changed, and now our graduate students are usually quite competent at least in some practical aspects of computational biology. Not uniformly however, and the wide disparity of previous experience has made it increasingly difficult to provide course offerings that address students' needs. JTB2020 therefore shares its lecture components with the '''BCB420''' course, and there is a large range of topics in [[Applied Bioinformatics Main Page|Applied Bioinformatics]] that are covered by students in self-study and discussion with the lecturer, customized to their actual needs.
 
  
 +
;The 2019 course...
  
;The 2015 course...
+
In this course we explore the systems biology of human genes with computational means in a project-oriented format. This will proceed in three phases:
 +
* '''Foundations''' first: we will review basic computational skills and bioinformatics knowledge to bring everyone to the same level. In all likelihood you will need to start with these tasks well in advance of the actual lectures. This phase will include a comprehensive quiz on prerequisite material in week 3. We will explore data-sources and you will choose one data-source for which you will develop import code and document it in an R markdown document within an R package;
 +
* Next we'll focus on '''Biocuration''': the expertise-informed collection, integration and annotation of biological data. We will each choose a molecular "system" to work on, and define an ontology and data-model in which to annotate our system's components, their roles, and their relationships. The outcome of your curation task (together with your data script) will define the scope of this course's {{Oral-Test}};
 +
* Finally, we will develop tools for '''Exploratory Data Analysis''' in computational systems biology. We will jointly develop code for a team-authored R package where everyone contributes one mini workflow for data preparation, exploration and interpretation. Your code contributions to the package will be assessed;
 +
* There are several meta-skills that you will pick up "on the side": these include time management; working according to best practice of reproducible research in a collaborative environment on GitHub; report writing; and keeping a scientific lab journal.
  
This year's course will be very different from previous years' courses. In previous years we have worked with a structured, lecture-style format. This year we will be pursuing a wholly problem-oriented format. This is the plan:
+
<!--
* We'll identify an interesting challenge in computational systems biology
+
2018:
* We'll formulate an approach to this challenge as a project
+
... focus on data integration and definition of features. As an example, we will integrate gene expression data from different experiments into a common set of features. Each student will contribute data from one experiment.
* We'll define the resources we need - data sources, algorithms, programming- and collaboration support
+
... we will each adopt a biological "system" in human cells and use machine learning methods to attempt to refine its gene membership and assign roles to its member genes;
* We'll define students' roles in the project according to their skills and experience
 
* Then we will implement the project.
 
  
Every week will have a set of general and specific tasks. The general tasks will include background reading, installing software and familiarizing yourselves with websites and tools. These will be the topics of the weekly short quizzes. The rest of the classroom time will be dedicated to discussion of progress on your specific tasks, open discussion of any arising issues, and definition of next week's tasks. All students are requested to be familiar with the entire breadth of the project and through this cover the individual course objectives that are detailed below.
+
2017:
 +
Every week will have a set of general and specific tasks. The general tasks will include background reading, installing software and familiarizing yourselves with websites and tools. These will be the topics of the weekly short quizzes. The rest of the classroom time will be dedicated to discussion of progress on your specific tasks, open discussion of any arising issues, and definition of next week's tasks. All students are requested to be familiar with the entire breadth of the project and through this cover the individual course objectives that are detailed below. -->
  
  
 
<section begin=CSB_main_organization />
 
<section begin=CSB_main_organization />
 
===Organization===
 
===Organization===
<div class="alert">
+
<!-- div class="alert">
 
+
Attendance in person at the first lecture is mandatory. You will lose three participation marks if you are not present in person.<ref>Only in case you are sick will you be excused. But in that case you '''must''' contact me before class.</ref>
First lecture this term: Wednesday, January 7. 2015 at 14:00 (2 pm), MSB 4279.
 
 
 
<small>We will coordinate the organization of the course, sign you up to mailing list and Student Wiki, and discuss the (significant!) syllabus changes for this term.</small>
 
  
Do not miss the first lecture - your input will be important and there is no good way for you to make up if you are not present.
+
<small>This may seem silly but it is unfortunately necessary - I can't get this course started effectively if you are not present when we work out the organization of the course, sign you up to mailing list and Student Wiki, and discuss the syllabus for this term.</small>
  
<!-- [[CSB Assignment Week 10|'''Assignment 10''']] is active.<br />
+
Second lecture and Quiz 1 / 2: Tuesday, January 24. 2017 at 16:00 (4 pm), MSB 3278.
Ninth (and final) quiz in class: Wednesday, March 19., 14:00. Please don't forget your red pen for marking.<br /> -->
 
  
</div>
+
</div -->
  
  
 
;Dates
 
;Dates
 
:BCB420/JTB2020 is a Winter Term course.
 
:BCB420/JTB2020 is a Winter Term course.
:Lectures: Wednesdays, 14:00 to 16:00. (Classes start at 10 minutes past the hour.)
+
:Lectures: Tuesdays, 16:00 to 18:00. (Classes start at 10 minutes past the hour.)  
:Exam: None for this course.
+
:'''Note: there will be three open-ended collaborative planning sessions that may go well into the night. Attendance and participation are mandatory.'''
 +
:Final Exam: None for this course.
 +
 
 +
;Events
 +
* Tuesday, January 8 2019: Course officially begins. No class meeting. Get started on preparatory material (well in advance actually).
 +
* Tuesday, January 15: First class meeting. Mock-quiz for preparatory material.
 +
* Tuesday, January 22: First live quiz on preparatory material. Later: open-ended session on data import.
 +
* Tuesday, February 5: Open-ended session on system curation.
 +
* Tuesday, March 12: Open-ended session on exploratory data analysis.
  
  
 
;Location
 
;Location
:[http://map.utoronto.ca/utsg/building/005 MS&nbsp;4279] (Medical Sciences Building).
+
:[http://map.utoronto.ca/utsg/building/005 '''MS&nbsp;3278'''] (Medical Sciences Building).
  
  
Line 64: Line 88:
 
:For JTB2020 see the [http://biochemistry.utoronto.ca/courses/jtb-2020h/ JTB2020 Course Web page] for general information.
 
:For JTB2020 see the [http://biochemistry.utoronto.ca/courses/jtb-2020h/ JTB2020 Course Web page] for general information.
  
 +
<section end=CSB_main_organization />
 +
 +
{{Vspace}}
  
;Submissions
+
====Prerequisites and Preparation====
:This is an '''electronic submission only''' course; but if you must print material, you might consider printing double-sided. Learn how, at the [http://printdoublesided.sa.utoronto.ca/ Print-Double-Sided Student Initiative].
 
  
 +
This course has formal prerequisites of [[Bioinformatics_Main_Page|BCH441H1 (Bioinformatics)]] or [https://csb.utoronto.ca/undergraduate-studies/undergraduate-courses/undergraduate-course-level-400/ CSB472H1 (Computational Genomics and Bioinformatics)]. I have no way of knowing what is being taught in CSB472, and no way of confirming how much you remember from any of your previous courses, like BCH441 or BCB410. Moreover there are many alternative ways to become familiar with important course contents. Thus I generally enforce course-prerequisites only very weakly and you should not assume at all that having taken any particular combination of courses will have prepared you sufficiently. Instead I make the contents of the course very explicit. If your preparation is lacking, you will have to expend a very significant amount of effort. This is certainly possible, but whether you will succeed will depend on your motivation and aptitude.
  
====Recommended textbooks ====
+
The course requires (i) a solid understanding of molecular biology, (ii) solid, introductory level knowledge of bioinformatics, (iii) a good working knowledge of the '''R''' programming language. 
  
: Depending on your background, various levels of textbooks may be suitable. I will bring my evaluation copies to class so you can decide what may work for you.
+
{{Smallvspace}}
  
: [http://www.garlandscience.com/product/isbn/9780815340249 '''Understanding Bioinformatics''' (Zvelebil & Baum)] is a decent general introduction to many aspects of bioinformatics. It was published in 2007, an updated version is urgently needed. Still, some of the basics (like the algorithm for optimal sequence alignment) don't change. <small>[http://www.amazon.ca/Understanding-Bioinformatics-Marketa-J-Zvelebil/dp/0815340249 (Amazon)] [http://www.chapters.indigo.ca/books/Understanding-Bioinformatics-Marketa-J-Zvelebil-Jeremy-O-Baum/9780815340249-item.html (Indigo)] [http://www.abebooks.com/servlet/SearchResults?isbn=9780815340249 (ABE books)]</small> 
+
The '''prerequisite material''' for this course includes the contents of [[Bioinformatics_Main_Page|'''the 2018 BCH441 course''']]:
  
: [http://www.garlandscience.com/product/isbn/9780815344568 '''Practical Bioinformatics''' (Agostino)] covers some of the material of the BCH441 exercises. Expect a no-nonsense introduction to the very most basic stuff. I have my pet peeves about this book (as I have for many others, eg. why in the world do they still teach CLUSTAL when all available studies demonstrate it to be the least accurate MSA algorithm '''by a margin'''???), but if you haven't taken BCH441, this may serve you well. And if you did take BCH441, it may consolidate some ideas that I wasn't clear about. <small>[http://www.amazon.ca/Practical-Bioinformatics-Michael-Agostino/dp/0815344562 (Amazon)] [http://www.chapters.indigo.ca/books/Practical-Bioinformatics-Michael-Agostino/9780815344568-item.html (Indigo)] [http://www.abebooks.com/servlet/SearchResults?isbn=9780815344568 (ABE books)]</small>
+
* <command>-Click to open the Bioinformatics Learning Units Map in a new tab, scale for detail.
 +
[[File:ABC-units_map.svg|thumb|500px|none|link=http://steipe.biochemistry.utoronto.ca/abc/assets/ABC-units_map.svg|'''A knowledge network map of the bioinformatics learning units.''']]
 +
* Open the [http://steipe.biochemistry.utoronto.ca/abc/assets/ABC-units_map.svg Bioinformatics Knowledge Network Map] and get an overview of the material. You should confidently be able to execute the tasks in the four <span style="background-color: #e19fa7; border:solid 2px #000000;">&nbsp;&nbsp;Integrator&nbsp;Units&nbsp;&nbsp;</span>.
 +
* If you have taken BCH441 before, please note that many of the units have undergone significant revisions and material has been added. You will need to review the material and familiarize yourself more with the R programming aspects.
 +
* If you have not taken BCH441, you will need to work through the material rather carefully. Estimate at least three weeks of time and get started immediately.
  
: If you are aware of recent good textbooks, or have your own opinions about these or other books, let me know.
+
{{Smallvspace}}
  
 +
A minimal subset of the bioinformatics knowledge you need to begin work in BCB420 is linked from the BCB420-specific map below. To ensure everyone is adequately prepared, we will hold a ''Quiz'' on the <span style="background-color: #b3dbce;">&nbsp;&nbsp;Live&nbsp;units&nbsp;&nbsp;</span> on that map in the third week of class. We will hold a mock-quiz on the material in the second week (our first class meeting) so everyone knows what to expect.
  
 +
* <command>-Click to open the BCB420 Preparation Learning Units Map in a new tab, scale for detail.
 +
[[File:BCB420-Units.svg|thumb|500px|none|link=http://steipe.biochemistry.utoronto.ca/abc/assets/BCB420-Units.svg|'''A map of preparatory BCB420 learning units.''']]
 +
* Hover over a learning unit to see its keywords.
 +
* Click on a learning unit to open the associated page.
 +
* The nodes of the learning unit network are colour-coded:
 +
**<span style="background-color: #b3dbce;">&nbsp;&nbsp;Live&nbsp;units&nbsp;&nbsp;</span> are green
 +
**<span style="background-color: #d9ead5;">&nbsp;&nbsp;Units&nbsp;under&nbsp;development&nbsp;&nbsp;</span> are light green. These are still in progress.
 +
**<span style="background-color: #f2fafa;">&nbsp;&nbsp;Stubs&nbsp;&nbsp;</span> (placeholders) are pale. These still need basic contents.
 +
**<span style="background-color: #97bed5;">&nbsp;&nbsp;Milestone&nbsp;units&nbsp;&nbsp;</span> are blue. These collect a number of prerequisites to simplify the network.
 +
**<span style="background-color: #e19fa7;">&nbsp;&nbsp;Integrator&nbsp;units&nbsp;&nbsp;</span> are red. These embody the main goals of the course. These units are '''not''' for evaluation in BCB420.
 +
*Arrows point from a prerequisite unit to a unit that builds on its contents.
  
<section end=CSB_main_organization />
 
  
 +
{{Vspace}}
  
 
<section begin=CSB_main_grading />
 
<section begin=CSB_main_grading />
  
===Grading and Activities===
+
===Grading, Activities, Deliverables===
 
 
Exercises for the course will be linked from Assignment Pages. I expect everyone to complete them; however, there will be no required submissions for exercises. Exercise-related questions as well as pre-reading related questions will be part of the weekly quizzes. Don't expect to do well on the quizzes unless you have done the exercises and completed the pre-reading carefully. This course demands a lot of your discipline and time-management. A large portion of your grade will be contributed by the [[CSB_Open_project|'''Open Project''']]. JTB2020 students will also complete a number of [[APB_Exercises|'''Applied Bioinformatics Exercises''']]. Deliverables for the course will be completed well before end-of-term crunch time and there will be no final exam.
 
  
 +
{{Vspace}}
 +
For details of the deliverables, see below.
 +
{{Smallvspace}}
  
<table>
+
<table cellpadding="5">
  
 
<tr class="sh">
 
<tr class="sh">
Line 102: Line 146:
  
 
<tr class="s2">
 
<tr class="s2">
<td>[[CSB_Quizzes|'''9 In-class quizzes''']]</td>
+
<td>[[Eval_Sessions|'''Self-evaluation and Feedback session on preparatory material''']]("''Quiz''"<ref>I call these activities ''Quiz'' sessions for brevity, however they are not quizzes in the usual sense, since they rely on self-evaluation and immediate feedback.</ref>)</td>
<td>54 marks <small>(9 x 6)</small></td>
+
<td>20 marks</td>
<td>36 marks <small>(9 x 4)</small></td>
+
<td>15 marks</td>
 
</tr>
 
</tr>
  
 
<tr class="s1">
 
<tr class="s1">
<td>[[CSB_Class Project 2015|'''Class project contributions''']]</td>
+
<td>'''{{Oral-Test}}''' (March 7/8)</td>
 
<td>30 marks</td>
 
<td>30 marks</td>
 
<td>30 marks</td>
 
<td>30 marks</td>
Line 114: Line 158:
  
 
<tr class="s2">
 
<tr class="s2">
<td>[[CSB_Participation|'''"Classroom" participation''']]</td>
+
<td>Collaborative software task and participation</td>
<td>16 marks</td>
+
<td>20 marks</td>
<td>16 marks</td>
+
<td>15 marks</td>
 +
</tr>
 +
 
 +
<tr class="s1">
 +
<td>'''[[FND-Journal|Journal]]'''</td>
 +
<td>25 marks</td>
 +
<td>25 marks</td>
 +
</tr>
 +
 
 +
<tr class="s2">
 +
<td>'''[[ABC-Insights|Insights]]'''</td>
 +
<td>5 marks</td>
 +
<td>5 marks</td>
 
</tr>
 
</tr>
  
 
<tr class="s1">
 
<tr class="s1">
<td>[[APB_Exercises|'''Applied Bioinformatics Exercises''']]</td>
+
<td>'''Pull request reviews'''</td>
 
<td>&nbsp;</td>
 
<td>&nbsp;</td>
<td>18 marks</td>
+
<td>10 marks</td>
 
</tr>
 
</tr>
  
Line 135: Line 191:
 
</table>
 
</table>
  
 +
{{Vspace}}
 +
 +
'''We are covering a lot of ground in this course, and all deliverables feed into a collaborative project. Everyone's continuous, active participation is essential for making this a success: for you personally and for the class as a team.'''
 +
 +
{{Vspace}}
 +
 +
====Getting started====
 +
{{Smallvspace}}
 +
 +
Everything starts with the following four units:
 +
*[[FND-Wiki_editing|Introduction to editing Wiki pages]] (Optional if you have taken BCH441 or BCB410.)
 +
:{{#lst:FND-Wiki_editing|abstract}}
 +
 +
*[[FND-Journal|Your Course Journal]] (Mandatory - your Journals will be assessed. Note that the "rules" have changed - study the unit carefully and read the [[ABC-Rubrics#Course_Journal|evaluation rubrics]].)
 +
:{{#lst:FND-Journal|abstract}}
 +
 +
*[[ABC-Plagiarism|The "Plagiarism Unit"]] (Mandatory - must be the first entry in your Journal.)
 +
:{{#lst:ABC-Plagiarism|abstract}}
 +
 +
*[[ABC-Insights|The "insights!" page]] (Mandatory - your "insights!" pages will be assessed.)
 +
:{{#lst:ABC-Insights|abstract}}
 +
 +
Once you have completed these four units, get started '''immediately''' on the Introduction-to-R units. You need time and practice, practice, practice<ref>[https://tapas.io/episode/923459 It's practice!]</ref> to acquire the programming skills you need for the course. Whenever you want to take a break from studying R, continue with the other preparatory units.
 +
 +
{{Vspace}}
 +
 +
====Part I: Foundations and Data====
 +
 +
{{Vspace}}
 +
 +
<small>Don't forget to document your work in your Journal!</small>
 +
 +
{{Smallvspace}}
 +
 +
Your level of preparedness will be assessed in a "mock quiz" in week two, after which you have one more week to fill in gaps before our Quiz in week three. With that out of the way, we will look at different data sources that are useful in systems biology, including gene-level annotations and collections of experimental data, relationship data like physical and epistatic interactions, and systems-level data like metabolic or regulatory pathways. Each of you will select one data-source in our first open-ended session and then work on the following deliverables:
 +
* a brief summary page on the Student Wiki: the page needs to be named according to the pattern: <code><nowiki>User:&lt;your_name&gt;/BCB420-2019-Data_&lt;your_data_resource&gt;</nowiki></code> and contain the category tag: <code><nowiki>[[Category:BCH420-2019_Data_project]]</nowiki></code>.
 +
* an R package derived from [https://github.com/hyginn/rpt '''rpt'''],
 +
** hosted on GitHub,
 +
** named according to the pattern <code><nowiki>BCB420.2019.&lt;your_data_resource&gt;</nowiki></code><ref>According to [https://cran.r-project.org/doc/manuals/r-release/R-exts.html#The-DESCRIPTION-file "Writing R Extensions"]: "The mandatory ‘Package’ field gives the name of the package. This should contain only (ASCII) letters, numbers and dot, have at least two characters and start with a letter and not end in a dot." Deviating from this will result in a package check <span style="color:#EE0000;">error.</span></ref>,
 +
** containing an R markdown page that describes and annotates code for
 +
*** importing the chosen data in platform-independent function calls (see the footnote for details and restrictions)<ref>Note: the repository '''absolutely must not''' contain any datafile of more than 1Mb in size! Rather it must contain clear instructions how to download the data. Packages that violate the size limitations will not be evaluated. The code you write shall expect the data in a sister-directory of your working directory which is called <code>data</code>. For example, if I were to store a datafile by the name <code>STRING_90.dat</code>, my code would construct the path to it in a platform independent way as <code>file.path("..", "data", "STRING_90.dat")</code>.</ref>,
 +
*** and cleaning it up where necessary,
 +
*** and normalizing its identifiers to HUGO gene symbols,
 +
** and containing sample data for our defined reference dataset of genes,
 +
** and containing a report on the data statistics,
 +
** and containing code to validate the import process,
 +
** and containing the (provided) function to display the markdown file.
 +
Required: a user needs to be able to use the information you provided to understand the semantics of the data, import the data, purify it where necessary, and associate it with HUGO IDs in an R data frame. They should be able to use the data as a feature in a machine learning protocol without further preprocessing steps.
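
The path convention described in the footnote above, together with the identifier mapping step, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the course's model solution: the helper names <code>dataPath()</code> and <code>toSymbols()</code>, the two-column lookup table and its made-up source IDs are all hypothetical; only the call <code>file.path("..", "data", "STRING_90.dat")</code> is taken from the text.

```r
# Hypothetical helper: build the path to a datafile in the sister
# directory "data" in a platform-independent way.
dataPath <- function(fileName, dataDir = file.path("..", "data")) {
  file.path(dataDir, fileName)
}

# Hypothetical helper: map source identifiers to gene symbols using a
# two-column lookup table (columns "ID" and "sym"). IDs that can not be
# mapped come back as NA, so they can be counted and reported rather
# than silently dropped.
toSymbols <- function(ids, map) {
  map$sym[match(ids, map$ID)]
}

# Toy lookup table: the source IDs are made up for illustration only.
idMap <- data.frame(ID  = c("ENSP01", "ENSP02"),
                    sym = c("BRCA2", "TP53"),
                    stringsAsFactors = FALSE)

myPath  <- dataPath("STRING_90.dat")
symbols <- toSymbols(c("ENSP02", "ENSP99", "ENSP01"), idMap)
```

Returning <code>NA</code> for unmapped identifiers is a design choice for this sketch: it makes it easy to include the number of unmapped IDs in the report on data statistics.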
 +
 +
To illustrate the requirements with a model solution, I have provided an [http://steipe.biochemistry.utoronto.ca/abc/students/index.php/User:Boris/BCB420-2019-Data_STRING example project page '''here'''], which links to a GitHub repository with the corresponding package. Studying this with some care will probably clarify many questions.
 +
 +
<div class="note">
 +
;Note
 +
:*If your data refers to chromosomal coordinates in any way, you '''must''' ensure the coordinates are from GRCh38 (hg38)<ref>For different approaches to convert from one to the other see [https://www.biostars.org/p/65558/ '''this thread''' on Biostars].</ref>
 +
:*Your chosen database will not always be the best choice of data source: often you can achieve your objective faster through Ensembl/BioMart. See [http://useast.ensembl.org/Homo_sapiens/Transcript/PDB?_format=HTML;db=core;g=ENSG00000139618;genomic=off;output=fasta;param=cdna;r=13:32889611-32973347;strand=feature;t=ENST00000380152 this sample annotation of BRCA2] for examples of what data is available.
 +
</div>
 +
 +
{{Vspace}}
 +
 +
====Database choices====
 +
 +
{{Smallvspace}}
 +
 +
Here are the chosen (or assigned) databases. Follow the link in the "Note" column for details:
 +
 +
{{Smallvspace}}
 +
 +
<table>
 +
<tr>
 +
<td>Name</td>
 +
<td>DB</td>
 +
<td>Note</td>
 +
</tr>
 +
<tr class="s2">
 +
<td>Edouard Al-chami</td>
 +
<td>[https://www.ncbi.nlm.nih.gov/geo/ GEO (stimulus)]</td>
 +
<td>&nbsp;<ref>Cell response to external stimuli (eg. heat, salt, insulin, chemokines ...): Find ~ 20 high-coverage experimental data sets, define the pipeline to download and process the sets into a common data structure, apply quantile normalization. Result: an expression vector for each gene.</ref></td>
 +
</tr>
 +
<tr class="s1">
 +
<td>Emily Ayala</td>
 +
<td>[https://www.gencodegenes.org/ Gene models]</td>
 +
<td>&nbsp;<ref>Find gene models (exons and chromosomal coordinates) for each gene. Possible sources are Gencode v29 GTF or Gff3 files, or exons from biomart. Result: for each gene, a set of chromosomal start/end coordinates for the principal isoform as defined by APPRIS.</ref></td>
 +
</tr>
 +
<tr class="s2">
 +
<td>Deus Bajaj</td>
 +
<td>EGGNOG</td>
 +
<td>&nbsp;</td>
 +
</tr>
 +
<tr class="s1">
 +
<td>Cathy Cha</td>
 +
<td>[https://www.ncbi.nlm.nih.gov/geo/ GEO (tissues)]</td>
 +
<td>&nbsp;<ref>Differential expression in tissues (eg. brain, epithelium, muscles ...): Find ~ 20 high-coverage experimental data sets, define the pipeline to download and process the sets into a common data structure, apply quantile normalization. Result: an expression vector for each gene.</ref></td>
 +
</tr>
 +
<tr class="s2">
 +
<td>Nada Elnour</td>
 +
<td>Human Protein Atlas</td>
 +
<td>&nbsp;<ref>Find subcellular localization for each gene. Result: for each gene, the subcellular localizations it is associated with.</ref></td>
 +
</tr>
 +
<tr class="s1">
 +
<td>Chantal Ho</td>
 +
<td>[https://www.ncbi.nlm.nih.gov/geo/ GEO (diseases)]</td>
 +
<td>&nbsp;<ref>Differential expression in disease states (eg. diabetes, hypertension, RA,  ...): Find ~ 20 high-coverage experimental data sets, define the pipeline to download and process the sets into a common data structure, apply quantile normalization. Result: an expression vector for each gene.</ref></td>
 +
</tr>
 +
<tr class="s2">
 +
<td>Edward Ho</td>
 +
<td>Cosmic</td>
 +
<td>&nbsp;</td>
 +
</tr>
 +
<tr class="s1">
 +
<td>Sapir Labes</td>
 +
<td>GWAS</td>
 +
<td>&nbsp;</td>
 +
</tr>
 +
<tr class="s2">
 +
<td>Judy Lee</td>
 +
<td>PDB</td>
 +
<td>&nbsp;<ref>Find PDB structures of human proteins. Possible data sources: Biomart? PDB? NCBI's MMDB? If structures overlap, report only the best representative. This is a set of feature annotations for each gene that includes start and stop coordinates. You must validate the coordinates, i.e. make sure that the annotated residue numbers map accurately to the actual sequence associated with the HGNC symbol.</ref></td>
 +
</tr>
 +
<tr class="s1">
 +
<td>Tina Lee</td>
 +
<td>Pfam</td>
 +
<td>&nbsp;<ref>Obtain annotations via Ensembl/biomart. This is a set of feature annotations for each gene that includes start and stop coordinates. You must validate the coordinates, i.e. make sure that the annotated residue numbers map accurately to the actual sequence associated with the HGNC symbol.</ref></td>
 +
</tr>
 +
<tr class="s2">
 +
<td>Jian Bin Lin</td>
 +
<td>GEO</td>
 +
<td>&nbsp;</td>
 +
</tr>
 +
<tr class="s1">
 +
<td>Matthew Mcneil</td>
 +
<td>COSMIC and GEO</td>
 +
<td>&nbsp;<ref>Tissue specific correlations of expression levels. Result: for each gene ... ???  Question: how are differentially spliced genes handled?</ref></td>
 +
</tr>
 +
<tr class="s2">
 +
<td>Gabriela Morgenshtern</td>
 +
<td>Awesome (or PANTHER)</td>
 +
<td>&nbsp;</td>
 +
</tr>
 +
<tr class="s1">
 +
<td>Yoonsik Park</td>
 +
<td>[https://reactome.org/ Reactome pathways]</td>
 +
<td>&nbsp;</td>
 +
</tr>
 +
<tr class="s2">
 +
<td>Alesandro Rigido</td>
 +
<td>[http://software.broadinstitute.org/gsea/msigdb/index.jsp MsigDB]</td>
 +
<td>&nbsp;<ref>For a selected set of MSigDB sets compute co-occurrence probability of genes: how often do they co-occur in the same MSig Set? This is a network-type result. Output will be two HGNC symbols and one probability for each queried pair.  Don't precompute all 1e9 possible pairs, but conceptualize this as a tool that queries a compact datastructure with the probabilities, e.g. a boolean matrix with one set-annotation per column (for each gene TRUE if present in the set, FALSE if not present) that compares two row-vectors for each query.</ref></td>
 +
</tr>
 +
<tr class="s1">
 +
<td>Fan Shen</td>
 +
<td>SMART</td>
 +
<td>&nbsp;</td>
 +
</tr>
 +
<tr class="s2">
 +
<td>Rachel Silverstein</td>
 +
<td>Human Phenotype Ontology</td>
 +
<td>&nbsp;</td>
 +
</tr>
 +
<tr class="s1">
 +
<td>Yiqiu Tang</td>
 +
<td>[https://www.omim.org/ OMIM]</td>
 +
<td>&nbsp;<ref>Gene phenotype associations. For each gene, the set of phenotypes it is associated with.</ref></td>
 +
</tr>
 +
<tr class="s2">
 +
<td>Denitsa Vasileva</td>
 +
<td>[https://www.ebi.ac.uk/GOA GO annotations]</td>
 +
<td>&nbsp;<ref>For each gene, the set of GO terms it is annotated to.</ref></td>
 +
</tr>
 +
<tr class="s1">
 +
<td>Rachel Woo</td>
 +
<td>Human Protein Atlas</td>
 +
<td>&nbsp;<ref>Tissue Data: tissue level expression vector. Result: for each gene ... ??? Question: how are differentially spliced genes handled?</ref></td>
 +
</tr>
 +
<tr class="s2">
 +
<td>Alison Wu</td>
 +
<td>[https://thebiogrid.org/ BioGRID]</td>
 +
<td>&nbsp;<ref>Process genetic interactions only. Result: edge list (Weighted? Directed?)</ref></td>
 +
</tr>
 +
<tr class="s1">
 +
<td>Yufei Yang</td>
 +
<td>[http://gtrd.biouml.org/ GTRD]</td>
 +
<td>&nbsp;<ref>ChipSeq verified TF binding sites in gene promoter regions. Result: for each gene, a list of transcription factors that target its promoter region.</ref></td>
 +
</tr>
 +
<tr class="s2">
 +
<td>Yin Yin</td>
 +
<td>[http://proteincomplexes.org/ huMAP]</td>
 +
<td>&nbsp;<ref>Protein complexes. Result: for each gene, all complexes (if any) it has been annotated to.</ref></td>
 +
</tr>
 +
<tr class="s1">
 +
<td>Han Zhang</td>
 +
<td>[http://hintdb.hgc.jp/htp/index.html HitPredict]</td>
 +
<td>&nbsp;<ref>Weighted interaction graph. Result: edge list with weights.</ref></td>
 +
</tr>
 +
<tr class="s2">
 +
<td>Xindi Zhang</td>
 +
<td>[http://mips.helmholtz-muenchen.de/corum/ CORUM]</td>
 +
<td>&nbsp;<ref>Protein complexes. Result: for each gene, all complexes (if any) it has been annotated to.</ref></td>
 +
</tr>
 +
<tr class="s1">
 +
<td>Yuhan Zhang</td>
 +
<td>ENCODE</td>
 +
<td>&nbsp;</td>
 +
</tr>
 +
<tr class="s2">
 +
<td>Liwen Zhuang</td>
 +
<td>Human Disease Ontology</td>
 +
<td>&nbsp;</td>
 +
</tr>
 +
</table>
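
The note on the MSigDB data source above describes a query tool built on a boolean membership matrix. A minimal sketch of that idea follows. It is written in Python purely for illustration (the course packages themselves are written in R), the gene symbols, set names and memberships are made-up toy data, and it assumes that "co-occurrence probability" means the fraction of the selected sets that contain both genes; other definitions are possible.

```python
# Hypothetical toy data: one row per gene, one column per gene set.
# True means the gene is a member of that set.
SETS = ["SET_A", "SET_B", "SET_C", "SET_D"]
MEMBERSHIP = {
    "TP53":  [True,  True,  False, True],
    "BRCA1": [True,  False, False, True],
    "MYC":   [False, True,  False, False],
}


def co_occurrence(gene_a, gene_b, membership=MEMBERSHIP):
    """Return (gene_a, gene_b, p) for one queried pair.

    p is the fraction of sets that contain both genes (an assumed
    definition). Nothing is precomputed: each query only compares
    the two row-vectors of the queried genes.
    """
    row_a = membership[gene_a]
    row_b = membership[gene_b]
    both = sum(1 for a, b in zip(row_a, row_b) if a and b)
    return gene_a, gene_b, both / len(row_a)


print(co_occurrence("TP53", "BRCA1"))
```

With real data, the matrix has one row per HGNC symbol and one column per selected set, so memory use grows with genes times sets rather than with the number of gene pairs.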
 +
 +
 +
 +
Contact me with any questions you may have.
 +
 +
{{Vspace}}
 +
 +
====Part II: Biocuration====
 +
 +
"Systems" are concepts and working with systems requires expert knowledge. To explore the practice of expert curation of molecular systems, each of you will select one system in our second open-ended session and report on its components, its function(s) and its architecture. To start off:
 +
* Choose a system from the [http://steipe.biochemistry.utoronto.ca/abc/students/index.php/BCB420_2019_Biocuration_table '''GO term table''' on the Student Wiki], confirm your choice with me and replace the "N.N." in the table with your name.
 +
* Explore the term on AmiGO, and explore the linked "seed-genes" on UniProt.
 +
* In PubMed, find recent reviews or other manuscripts that discuss the system and its context. Make sure you have not overlooked important literature; this will be part of your evaluation. If there is no suitable literature available, your GO term is not a suitable choice.
 +
* Get an overview of your system and how it relates to the GO term you start out from.
 +
* Define the system well and define a five-letter code as a shorthand notation for the system, as discussed in class.
 +
;Note
 +
:A GO term is not a system, nor is the set of GOA annotated genes a complete description of the system's members. A system may overlap the component/function/process described in a GO term to a large degree, but the term is not informed or constrained by our "system" definition. We use GO terms as a first approximation to system functions, and we use GOA to define "seed" genes as a starting point that may help us build out the system description. However, a system's roles include the creation, maintenance, destruction, and potentially recycling of components, and these roles are not always included in either the literature or the GO terms themselves.
 +
 +
{{Smallvspace}}
 +
 +
Read the [[Systems_curation|notes on curating a biological system]].
 +
 +
{{Smallvspace}}
 +
 +
{{#lst:Systems_curation|deliverables}}
 +
 +
;Deliverables&#58; Form
 +
<section begin=curation_form />
 +
* Create a '''project page''' on the Student Wiki named according to the pattern: <code><nowiki>User:&lt;your_name&gt;/BCB420-2019-System_&lt;your_system_code&gt;</nowiki></code>;
 +
* add the category tag: <code><nowiki>[[Category:BCH420-2019_Curation_project]]</nowiki></code>;
 +
* add the <code><nowiki>{{CC-BY}}</nowiki></code> template;
 +
* summarize your "seed" information (follow the model [http://steipe.biochemistry.utoronto.ca/abc/students/index.php/User:Boris/BCB420-2019-System_PHALY#Stage_1:_The_System_seed for the '''PHALY''' system]);
 +
* as you are annotating your system, ensure all components have a SyRO role defined, and the evidence source and evidence code have been entered;
 +
* the system data needs to be included in the page in a [https://jsonlint.com/ '''valid'''(!) JSON file], in an expansible section of text.<ref>Note: you '''must''' include line breaks with your JSON data! Data that has everything on one line will '''not''' be accepted.</ref>
 +
<section end=curation_form />
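
The two JSON requirements above (valid syntax, and line breaks in the data) can be checked before you paste the data into your page. The following is a minimal sketch in Python, given only as an illustration: the function name <code>check_system_json</code> and the sample data are hypothetical, and the check treats "has line breaks" as "spans at least two lines", which is an assumption about how the requirement is applied.

```python
import json


def check_system_json(text):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    try:
        json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    except json.JSONDecodeError as err:
        problems.append(f"invalid JSON: {err.msg} (line {err.lineno})")
    # Assumed reading of the line-break rule: more than one line of text.
    if len(text.strip().splitlines()) < 2:
        problems.append("JSON is on a single line; add line breaks")
    return problems


sample = '{\n  "code": "PHALY"\n}'
print(check_system_json(sample))
```

If you generate the JSON programmatically, pretty-printing it (for example with <code>json.dumps(data, indent=2)</code> in Python) produces the line breaks automatically.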
 +
 +
 +
Both your data import script and your curated system model will be assessed in the Oral Test.
 +
 +
{{Vspace}}
 +
 +
====Part III: Exploration====
 +
 +
At the end of Parts I and II we will have data available and annotated systems that induce relations on the data. Using this information, we can formulate tools for exploratory data analysis (EDA): isolating and evaluating features, looking at correlations, identifying patterns in networks,
 +
clustering data etc. Each of you will select one EDA workflow in our third open-ended session for which to build a tool in a jointly authored R package.  Your deliverables are:
 +
* a project page on the Student Wiki that contains a specification of your tool;
 +
* an implementation of your tool as part of a jointly authored R package under continuous integration;
 +
* a vignette in the package that describes the tool and includes sample code for which the data is also provided in the package.
 +
 +
Your deliverables will be evaluated together with your participation in constructing the package.
 +
 +
;Deliverables&#58; Form
 +
<section begin=exploration_form />
 +
* On the Student Wiki -
 +
** Create a '''project page''' on the Student Wiki named according to the pattern: <code><nowiki>User:&lt;your_name&gt;/BCB420-2019-ExploratorySystemsAnalysis</nowiki></code>;
 +
** add the category tag: <code><nowiki>[[Category:BCH420-2019_Exploration_project]]</nowiki></code>;
 +
** add the <code><nowiki>{{CC-BY}}</nowiki></code> template;
 +
** summarize the objectives of your exploration tool in terms of input, output, and interpretation;
 +
** write a specification for your exploration tool;
 +
** summarize example results.
 +
 +
* On GitHub -
 +
** Fork the project [https://github.com/hyginn/BCB420.2019.ESA <code>BCB420.2019.ESA</code>];
 +
** Develop your code as a package function;
 +
** Write a vignette;
 +
** Make sure your changes pass without errors, warnings or notes;
 +
** Submit a pull request by Monday, March 25.
 +
** Address comments from the pull-request review before Tuesday, April 2.
 +
 +
The code is considered "submitted" when it passes the continuous integration checks, all pull-request reviews have been addressed, and your branch has been merged into the <code>BCB420.2019.ESA</code> package.
 +
 +
{{Vspace}}
 +
 +
===Extensions for term work===
 +
{{Smallvspace}}
 +
Extensions for term work in this course are subject to Faculty regulations and will only be considered within the framework determined by the Faculty policies.
 +
 +
<!-- cf. http://www.artsci.utoronto.ca/faculty-staff/teacher-info/academic-handbook-for-instructors/sections-6-8 -->
 +
 +
* '''Regular Submissions'''
 +
::It is Faculty policy to require assessments to be "fair, equitable and reasonable". In order to be equitable, granting extensions requires the student to demonstrate that the need for the extension is due to unavoidable circumstances that go significantly beyond what was expected of the rest of the class. In general, "official" documentation will be required: UofT Verification of Illness or Injury Form, Student Health or Disability Related Certificate, a College Registrar’s Letter, or an Accessibility Services Letter.
 +
 +
* '''Signing up for the oral tests.'''
 +
::The dates for the '''{{Oral-Test}}''' have been announced at the beginning of the term on this syllabus. If you fail to sign up for a slot, or if you fail to show up at the scheduled time, we apply the Faculty policy for a missed Midterm Test: "if the reasons for missing your test are ''acceptable'' to the instructor, a make-up opportunity should be offered to the student where ''practicable''." '''"Acceptable"''' reasons will be considered
 +
::* if they are justified,
 +
::* if the consideration is "fair, equitable and reasonable", and
 +
::* if the reason is documented through one of the four types of "official" documentation: UofT Verification of Illness or Injury Form, Student Health or Disability Related Certificate, a College Registrar’s Letter, or an Accessibility Services Letter.
 +
::Scope for a '''"practicable"''' make-up opportunity for the Oral Test will be limited.
 +
 +
* '''Submissions due on the {{LastdateSpring}}.'''
 +
::Since the course does not have a final exam, the Faculty requires grades to be marked, collated and submitted a few days after the {{LastdateSpring}}. Therefore I cannot normally grant extensions beyond this date. The Faculty allows so-called ''informal extensions'' to be granted "in extraordinary circumstances"; in those cases too, the requirement to be "fair, equitable and reasonable" will apply, i.e. you would need to demonstrate that the need for the extension was due to unavoidable circumstances that go significantly beyond what was expected of the rest of the class, and submit "official" documentation to me. In that case, (i) we would determine an adjusted submission date, (ii) I would initially submit a mark of 0 for the missing submissions, and (iii) I would submit an amended mark after that date, if appropriate. Note that the Faculty requires that such extensions do not go beyond a few days after the end of the Final Examination Period. If you require an extension beyond that date, you need to submit a ''formal petition'' through your College Registrar.
 +
 +
 +
{{Vspace}}
 +
 +
===Late penalties===
 +
{{Smallvspace}}
 +
 +
Late penalties will be applied according to the following formula: <code>(marks achieved) * 0.5^(fractional days late)</code>. However, material submitted 3.0 or more days late (72 hours or more) will be marked zero. Note: this does not apply to material due before the Oral Test (see there).
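
To make the formula concrete, here is a minimal sketch in Python. It is an illustration only, not the marking script: the function name <code>late_penalty</code> is hypothetical, and it assumes that a submission exactly 72 hours late already counts as zero, following the "72 hours or more" wording.

```python
def late_penalty(marks, days_late):
    """Apply marks * 0.5 ** days_late, with a hard cutoff at 3.0 days.

    days_late is fractional, e.g. 0.5 for twelve hours late.
    """
    if days_late <= 0:
        return float(marks)   # on time: no penalty
    if days_late >= 3.0:
        return 0.0            # 72 hours or more: marked zero (assumed inclusive)
    return marks * 0.5 ** days_late


for days in (0, 0.5, 1, 2, 3):
    print(days, round(late_penalty(80, days), 1))
```

In other words, each full day halves the mark: 80 marks become 40 after one day and 20 after two days.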
 +
 +
{{Vspace}}
 +
 +
===Copyright and Licensing===
 +
{{Smallvspace}}
 +
 +
We follow [https://en.wikipedia.org/wiki/Free_and_open-source_software '''FOSS'''] principles in this course. You automatically own copyright to all material you prepare. All material must be licensed for free re-use, under the condition of fair attribution. In practice:
 +
 +
'''All pages''' that you place on the Student Wiki must include a <code><nowiki>{{CC-BY}}</nowiki></code> tag. All '''documentation''' within GitHub pages that you prepare for this course must include a [https://creativecommons.org/choose/results-one?license_code=by&amp;jurisdiction=&amp;version=4.0&amp;lang=en Creative Commons  License - Attribution (CC-BY), v. 4.0 or later]. All '''code''' submitted for this course must be licensed under the <code>MIT</code> software license. Unlicensed submissions will have marks deducted and may be removed from the Wiki.
 +
 +
{{Vspace}}
 +
 +
====Academic integrity====
 +
 +
Our rules on [[ABC-Plagiarism|'''Plagiarism and Academic Misconduct''' are clearly spelled out in this learning unit]]. This unit is part of our course prerequisites, and everyone documents in their course journal that they have worked through the unit and understood it. Consequences of having to report to the [http://www.artsci.utoronto.ca/osai Office of Student Academic Integrity (OSAI)] for plagiarism, misrepresentation or falsification include an indelible failing mark on the transcript, a delay in graduation, or not being able to complete your POSt. Please take extra time to clearly understand the requirements, and define for '''yourself''' what they mean for every aspect of '''your''' work.
 +
 +
{{Vspace}}
 +
 +
====Marks adjustments====
 +
 +
I do not adjust marks towards a target mean and variance (i.e. there will be no "belling" of grades). I feel strongly that such "normalization" detracts from a collaborative and mutually supportive learning environment. If your classmate gets a great mark because you helped them with a difficult concept, this should never have the effect that it brings down your mark through class average adjustments. Collaborate as much as possible; it is a great way to learn. <small>But do keep it honest and carefully consider our rules on [[ABC-Plagiarism|Plagiarism and Academic Misconduct]].</small>
 +
 +
{{Vspace}}
 +
 +
== Timetable and contents details ==
 +
 +
 +
<div class="alert">
 +
Note: The general outline of the course as described above is current for the 2019 Winter Term. Filling in the activity details below is still in progress.
 +
</div>
 +
 +
{{Vspace}}
 +
<div class="note">
 +
Note: Click on the "▽" - symbol to see details for each week's activities.
 +
</div>
 +
 +
 +
{{Vspace}}
 +
 +
=== Part I: Foundations ===
 +
{{Smallvspace}}
 +
 +
<table width="90%" align="center">
 +
<tr class="sh">
 +
<td class="sc" width="5%">'''Week'''</td>
 +
<td class="sc" width="33%">'''In class: Tuesday, January 8 2019'''</td>
 +
<td class="sc">'''This week's activities'''</td>
 +
</tr>
 +
 +
 +
<tr class="s1">
 +
<td class="sc">1</td>
 +
<td class="sc">
 +
* '''No class meeting this day!'''
 +
</td>
 +
<td class="sc">
 +
* To prepare before next meeting ...
 +
:* study or review ABC learning unit material
 +
:* start or update your User page on the Student Wiki
 +
:* start your course journal
 +
----
 +
{{Smallvspace}}
 +
<span class="mw-customtoggle-Notes01" style="vertical-align:bottom;">Details ... &nbsp;▽△</span>
 +
<div class="mw-collapsible mw-collapsed" id="mw-customcollapsible-Notes01"><small>
 +
* You are not submitting learning units for credit, thus you should be able to progress quickly through the material up to the <span style="background-color: #97bed5;">&nbsp;&nbsp;Milestone&nbsp;units&nbsp;&nbsp;</span>. But do not skip units.
 +
* If you have worked with the ABC-units RStudio project before, you need to '''pull''' the most recent version from the GitHub repository. Update it from time to time, since the code will change. If you have not worked with this RStudio project before, make sure you work through the "Introduction to R" units in detail and with great care.
 +
* Your course journal must contain the following category tag: <code><nowiki>[[Category:BCB420-2019_Journal]]</nowiki></code>.
 +
* Your User page must contain the following category tag: <code><nowiki>[[Category:BCB420-2019]]</nowiki></code>.
 +
 +
</small></div>
 +
 +
</td>
 +
</tr>
 +
</table>
 +
 +
 +
{{Vspace}}
 +
 +
 +
<table width="90%" align="center">
 +
<tr class="sh">
 +
<td class="sc" width="5%">'''Week'''</td>
 +
<td class="sc" width="33%">'''In class: Tuesday, January 15 2019'''</td>
 +
<td class="sc">'''This week's activities'''</td>
 +
</tr>
 +
 +
<tr class="s1">
 +
<td class="sc">2</td>
 +
<td class="sc">
 +
*First class meeting
 +
* Review of preparatory materials (you should have worked through all of the materials in preparation for class).
 +
* Practice quiz on preparations (not for credit)
 +
* Course overview and Q&A
 +
</td>
 +
<td class="sc">
 +
* Follow up from class meeting ...
 +
:* Patch any gaps in your preparation until next Tuesday (Live Quiz!)
 +
:* Carefully study the [[BIN-SYS-Concepts|'''Systems Concepts''' unit]]
 +
----
 +
* To prepare before next meeting ...
 +
:* Get an overview of [https://github.com/hyginn/rpt '''the rpt package'''] so you can ask questions next week.
 +
:* Review data sources, you will need to choose one to work on.
 +
:* Review requirements for your ''data source deliverable''. Make sure you can work from it and discuss it in class.
 +
----
 +
{{Smallvspace}}
 +
<span class="mw-customtoggle-Notes02" style="vertical-align:bottom;">Details ... &nbsp;▽△</span>
 +
<div class="mw-collapsible mw-collapsed" id="mw-customcollapsible-Notes02"><small>
 +
 +
; In progress ...
 +
 +
* You need a GitHub account and you need to have your RStudio client set up to pull from and push to GitHub-hosted projects. See [https://github.com/hyginn/rpt '''the rpt package'''] for details.
 +
* Data: our goal is to make data available that can be used for the annotation of curated biological systems. Data types that interest us in principle include:
 +
** Component annotations: sequence, structure, function (GO), localization ...
 +
** Component dynamics (time, space, virtual dimensions): expression profiles, modification dynamics,  ...
 +
** Relationships: protein-protein interaction data, metabolic and regulatory pathways, functional associations (STRING),  ...
 +
** Perturbations: cancer genomes, epistatic effects,  ...
 +
** Phenotypes: OMIM, Navigome  ...
 +
** Expert curated sets: MSigDB  ...
 +
 +
To be well prepared, you need to understand the various categories of data that are available, and you need to have narrowed your choice to two or three datasets that you know fulfill the requirements.
 +
 +
Read:
 +
{{#pmid:30522862}}
 +
 +
</small></div>
 +
 +
</td>
 +
</tr>
 +
</table>
 +
 +
 +
{{Vspace}}
 +
 +
 +
<table width="90%" align="center">
 +
<tr class="sh">
 +
<td class="sc" width="5%">'''Week'''</td>
 +
<td class="sc" width="33%">'''In class: Tuesday, January 22 2019'''</td>
 +
<td class="sc">'''This week's activities'''</td>
 +
</tr>
 +
 +
<tr class="s1">
 +
<td class="sc">3</td>
 +
<td class="sc" style="background-color: #ffbf00;">
 +
'''Open ended session:'''
 +
----
 +
* Preparations review Q & A
 +
* Quiz
 +
----
 +
*'''Choosing a dataset to define an import workflow ... '''
 +
</td>
 +
<td class="sc">
 +
* Follow up from class meeting ...
 +
:* Data import
 +
:* Analyze your data source
 +
:* Define cleanup and normalization needs
 +
----
 +
* To prepare before next meeting ...
 +
:* create a project page on the Student Wiki
 +
:* study your database and figure out how the information it provides is related to the system data model
 +
:* define your requirements
 +
:* create a package based on [https://github.com/hyginn/rpt '''rpt''']
 +
:* begin writing your workflow as a "literate programming" document
 +
----
 +
{{Smallvspace}}
 +
<span class="mw-customtoggle-Notes03" style="vertical-align:bottom;">Details ... &nbsp;▽△</span>
 +
<div class="mw-collapsible mw-collapsed" id="mw-customcollapsible-Notes03"><small>
 +
* Understand the context:
 +
** What data is available? Explore your database and be sure to understand the semantics of the data.
 +
** How is your data going to support systems annotations? Study the systems data model in the [https://github.com/hyginn/BCB420-2019-resources resources project]
 +
** How are you going to present your data?
 +
*** The [https://github.com/hyginn/rpt <tt>'''rpt'''</tt> package]: read the <tt>README</tt> and understand how it helps you construct your own R package.
 +
*** Markdown: work through the [[RPR-Literate_programming|'''Literate Programming''' unit]] to get an idea in principle, but note the difference between <tt>.Rmd</tt> and <tt>.md</tt> documents. (We are using <tt>.md</tt> here, which is simpler.)
 +
*** Study the [https://github.com/hyginn/BCB420.2019.STRING sample solution well.] Understand what parts of this are relevant for your project, which ones are not, and what parts you may need that are not in the sample solution. 
 +
* Get started:
 +
** Define your requirements. Define how you are going to download the source data, what the results data should look like, and how you are going to construct the results. Identify ambiguities, cleanup needs, possibilities for validation.
 +
** Start a project page on the Student Wiki, write your requirements in point form
 +
** Start building your package. Follow the instructions in the [https://github.com/hyginn/rpt <tt>'''rpt'''</tt> package]. Push the result to GitHub.
 +
** Link to your package from your project page.
 +
** Draft an outline of your workflow in your <tt>README.md</tt> document. Commit and push to GitHub.
 +
* Communicate: whenever questions come up, post on the list.
 +
{{Smallvspace}}
 +
* Don't forget your Journal!
 +
 +
 +
</small></div>
 +
 +
</td>
 +
</tr>
 +
</table>
 +
 +
 +
{{Vspace}}
 +
 +
 +
<table width="90%" align="center">
 +
<tr class="sh">
 +
<td class="sc" width="5%">'''Week'''</td>
 +
<td class="sc" width="33%">'''In class: Tuesday, January 29 2019'''</td>
 +
<td class="sc">'''This week's activities'''</td>
 +
</tr>
 +
 +
<tr class="s1">
 +
<td class="sc">4</td>
 +
<td class="sc">
 +
* Normalizing gene names
 +
* Validating datasets
 +
* Scaling transformations
 +
* Intro of test dataset
 +
* Reproducible research aspects
 +
 +
 +
</td>
 +
<td class="sc">
 +
* Follow up from class meeting ...
 +
:* solve any normalization issues your dataset may have
 +
:* Get your ORCID IDs
 +
:* Prepare your data that relates to the test set
 +
:* Include scaling code, where indicated
 +
----
 +
* To prepare before next meeting ...
 +
:* work through literate programming
 +
:* finalize package
 +
:* validate correctness
 +
:* document
 +
:* "''Release''" your package before {{Data-package-deadline}}.
 +
:* Review systems theory
 +
:* Intro to BioCuration
 +
----
 +
{{Smallvspace}}
 +
<span class="mw-customtoggle-Notes04" style="vertical-align:bottom;">Details ... &nbsp;▽△</span>
 +
<div class="mw-collapsible mw-collapsed" id="mw-customcollapsible-Notes04"><small>
 +
* TBD
 +
* ...
 +
* ...
 +
 +
 +
</small></div>
 +
 +
</td>
 +
</tr>
 +
</table>
 +
 +
 +
 +
{{Vspace}}
 +
 +
=== Part II: Curation ===
 +
 +
{{Smallvspace}}
 +
 +
 +
 +
<table width="90%" align="center">
 +
<tr class="sh">
 +
<td class="sc" width="5%">'''Week'''</td>
 +
<td class="sc" width="33%">'''In class: Tuesday, February 5 2019'''</td>
 +
<td class="sc">'''This week's activities'''</td>
 +
</tr>
 +
 +
<tr class="s1">
 +
<td class="sc">5</td>
 +
<td class="sc" style="background-color: #ffbf00;">
 +
'''Open ended session:'''
 +
----
 +
*Systems concepts
 +
*A systems ontology
 +
*A systems data model
 +
*Biocuration
 +
----
 +
*'''Choosing your system for a systems curation project ...'''
 +
</td>
 +
<td class="sc">
 +
* Follow up from class meeting ...
 +
:* Choose your system
 +
----
 +
* To prepare before next meeting ...
 +
:* Begin your project page
 +
:* Define observables
 +
:* Begin exploring your system
 +
:* Start drafting a systems architecture
 +
----
 +
{{Smallvspace}}
 +
<span class="mw-customtoggle-Notes05" style="vertical-align:bottom;">Details ... &nbsp;▽△</span>
 +
<div class="mw-collapsible mw-collapsed" id="mw-customcollapsible-Notes05"><small>
 +
{{#lst:Computational_Systems_Biology_Main_Page|curation_form}}
 +
* draft a '''hand drawn sketch''' of the system architecture (cf. {{PDFlink|[http://steipe.biochemistry.utoronto.ca/abc/assets/BIN-SYS-Concepts.pdf "Systems Concepts"]}} <small>(this is the file that was assigned as required reading in Week 2)</small>);
 +
* write down a '''list of observables''' for your system, i.e. the relationship of the '''data''' we explored in Part I to the system:
 +
** What features do you expect to find for a gene that occurs in the system? (Annotation-type data)
 +
** What features do you expect to be shared by two genes that occur in your system? (Network-type data)
 +
** What features do you expect to be enriched for all genes in your system, or a defined subset? (Set/enrichment-type data)
 +
{{Smallvspace}}
 +
* Don't forget to write your Journal as you explore your system!
 +
 +
 +
</small></div>
 +
 +
</td>
 +
</tr>
 +
</table>
 +
 +
 +
{{Vspace}}
 +
 +
 +
<table width="90%" align="center">
 +
<tr class="sh">
 +
<td class="sc" width="5%">'''Week'''</td>
 +
<td class="sc" width="33%">'''In class: Tuesday, February 12 2019'''</td>
 +
<td class="sc">'''This week's activities'''</td>
 +
</tr>
 +
 +
<tr class="s1">
 +
<td class="sc">6</td>
 +
<td class="sc">
 +
* Class was canceled due to an ice storm
 +
</td>
 +
<td class="sc">
 +
* Follow up from class meeting ...
 +
:* ...
 +
:* ...
 +
----
 +
* To prepare before next meeting ...
 +
:* ...
 +
:* ...
 +
----
 +
{{Smallvspace}}
 +
<span class="mw-customtoggle-Notes06" style="vertical-align:bottom;">Details ... &nbsp;▽△</span>
 +
<div class="mw-collapsible mw-collapsed" id="mw-customcollapsible-Notes06"><small>
 +
* TBD
 +
* ...
 +
* ...
 +
 +
 +
</small></div>
 +
 +
</td>
 +
</tr>
 +
</table>
 +
 +
 +
{{Vspace}}
 +
 +
 +
<table width="90%" align="center">
 +
<tr class="sh">
 +
<td class="sc" width="5%">'''Week'''</td>
 +
<td class="sc" width="33%">'''In class: Tuesday, February 19 2019'''</td>
 +
<td class="sc">'''This week's activities'''</td>
 +
</tr>
 +
 +
<tr class="s1">
 +
<td class="sc">–</td>
 +
<td class="sc">
 +
* No class meeting - Reading Week
 +
</td>
 +
<td class="sc">
 +
* To prepare during reading week ...
 +
:* Start your project page on the Student Wiki;
 +
:* draft a hand drawn sketch of the system architecture;
 +
:* draft a list of system observables;
 +
For details see [[Computational_Systems_Biology_Main_Page#Part_II:_Biocuration|the "Biocuration" deliverables]] (above).
 +
 +
 +
<!--
 +
----
 +
 +
{{Smallvspace}}
 +
<span class="mw-customtoggle-NotesReadingWeek" style="vertical-align:bottom;">Details ... &nbsp;▽△</span>
 +
<div class="mw-collapsible mw-collapsed" id="mw-customcollapsible-NotesReadingWeek"><small>
 +
* TBD
 +
* ...
 +
* ...
 +
 +
 +
</small></div>
 +
-->
 +
 +
</td>
 +
</tr>
 +
</table>
 +
 +
 +
{{Vspace}}
 +
 +
 +
<table width="90%" align="center">
 +
<tr class="sh">
 +
<td class="sc" width="5%">'''Week'''</td>
 +
<td class="sc" width="33%">'''In class: Tuesday, February 26 2019'''</td>
 +
<td class="sc">'''This week's activities'''</td>
 +
</tr>
  
;A note on marking
+
<tr class="s1">
It is not my policy to adjust marks towards a target mean and variance (i.e. there will be no "belling" of grades). I feel strongly that such "normalization" detracts from a collaborative and mutually supportive learning environment. If your classmate gets a great mark because you helped him with a difficult concept, this should never have the effect that it brings down your mark through class average adjustments. Collaborate as much as possible; it is a great way to learn. <small>I may however adjust marks if I phrase questions ambiguously on quizzes.</small>
+
<td class="sc">7</td>
<section end=CSB_main_grading />
+
<td class="sc">
 +
* Milestone report: (major progress: you should be nearly done)
 +
</td>
 +
<td class="sc">
 +
* Follow up from class meeting ...
 +
:* ...
 +
:* ...
 +
----
 +
* To prepare before next meeting ...
 +
:* ...
 +
:* ...
 +
----
 +
{{Smallvspace}}
 +
<span class="mw-customtoggle-Notes07" style="vertical-align:bottom;">Details ... &nbsp;▽△</span>
 +
<div class="mw-collapsible mw-collapsed" id="mw-customcollapsible-Notes07"><small>
 +
* TBD
 +
* ...
 +
* ...
  
== Prerequisites ==
 
  
You must have taken an introductory bioinformatics course as a prerequisite, or otherwise acquired the necessary knowledge. Therefore I expect familiarity with the material of my [[Bioinformatics_Main_Page|BCH441 course]]. If you have not taken BCH441, please update your knowledge and skills '''before the course starts'''. I will not make accommodations for lack of prerequisites. Please check the syllabus for this course below to find whether you need to catch up on additional material, and peruse this site to find the information you may need. A (non-exhaustive) overview of topics and useful links is [[CSB prerequisites|linked '''here''']].
+
</small></div>
  
==Exercises and Pre-reading==
+
</td>
 +
</tr>
 +
</table>
  
All course units will have associated exercises that are topics for the following week's quiz.
 
  
All course units will have assigned pre-readings that are topics for the current week's quiz.
+
{{Vspace}}
  
  
 +
<table width="90%" align="center">
 +
<tr class="sh">
 +
<td class="sc" width="5%">'''Week'''</td>
 +
<td class="sc" width="33%">'''In class: Tuesday, March 5 2019'''</td>
 +
<td class="sc">'''This week's activities'''</td>
 +
</tr>
  
 +
<tr class="s1">
 +
<td class="sc">8</td>
 +
<td class="sc">
 +
* Milestone III: report (final)
 +
* A brief overview of Exploratory Data Analysis (EDA) for Systems Biology <small>(overview of materials and outline how to study)</small>
 +
* Data model of systems data for a shared package
 +
</td>
 +
<td class="sc">
 +
* Follow up from class meeting ...
 +
:* Finalize curation report
 +
:* Validate
 +
----
 +
* To prepare before next meeting ...
 +
:* Curation project deadline
 +
:* Prepare for Oral Tests: March 7/8
 +
:* Study introduction to EDA materials
 +
----
 +
{{Smallvspace}}
 +
<span class="mw-customtoggle-Notes08" style="vertical-align:bottom;">Details ... &nbsp;▽△</span>
 +
<div class="mw-collapsible mw-collapsed" id="mw-customcollapsible-Notes08"><small>
 +
* TBD
 +
* ...
 +
* ...
 +
 +
 +
</small></div>
 +
 +
</td>
 +
</tr>
 +
</table>
 +
 +
 +
{{Vspace}}
 +
 +
=== Part III: Exploration ===
 +
 +
{{Smallvspace}}
 +
 +
 +
<table width="90%" align="center">
 +
<tr class="sh">
 +
<td class="sc" width="5%">'''Week'''</td>
 +
<td class="sc" width="33%">'''In class: Tuesday, March 12 2019'''</td>
 +
<td class="sc">'''This week's activities'''</td>
 +
</tr>
 +
 +
<tr class="s1">
 +
<td class="sc">9</td>
 +
<td class="sc" style="background-color: #ffbf00;">
 +
'''Open ended session:'''
 +
----
 +
* Exploratory Data Analysis of Systems data
 +
----
 +
* <code>rptPlus</code> and <code>rptTeam</code>
 +
* Contributing to a team-authored package on GitHub: forks, branches, pull-requests and Continuous Integration
 +
*'''Choose your workflow for a team-authored systems EDA package ...'''
 +
</td>
 +
<td class="sc">
 +
* Follow up from class meeting ...
 +
:* Study <code>rptPlus</code> and <code>rptTeam</code> documentation
 +
:* ...
 +
----
 +
* To prepare before next meeting ...
 +
:* ...
 +
:* ...
 +
----
 +
{{Smallvspace}}
 +
<span class="mw-customtoggle-Notes09" style="vertical-align:bottom;">Details ... &nbsp;▽△</span>
 +
<div class="mw-collapsible mw-collapsed" id="mw-customcollapsible-Notes09"><small>
 +
* TBD
 +
* ...
 +
* ...
 +
 +
 +
</small></div>
 +
 +
</td>
 +
</tr>
 +
</table>
 +
 +
 +
{{Vspace}}
 +
 +
 +
<table width="90%" align="center">
 +
<tr class="sh">
 +
<td class="sc" width="5%">'''Week'''</td>
 +
<td class="sc" width="33%">'''In class: Tuesday, March 19 2019'''</td>
 +
<td class="sc">'''This week's activities'''</td>
 +
</tr>
 +
 +
<tr class="s1">
 +
<td class="sc">10</td>
 +
<td class="sc">
 +
* Vignettes
 +
* ...
 +
</td>
 +
<td class="sc">
 +
* Follow up from class meeting ...
 +
:* ...
 +
:* ...
 +
----
 +
* To prepare before next meeting ...
 +
:* ...
 +
:* ...
 +
----
 +
{{Smallvspace}}
 +
<span class="mw-customtoggle-Notes10" style="vertical-align:bottom;">Details ... &nbsp;▽△</span>
 +
<div class="mw-collapsible mw-collapsed" id="mw-customcollapsible-Notes10"><small>
 +
* TBD
 +
* ...
 +
* ...
 +
 +
 +
</small></div>
 +
 +
</td>
 +
</tr>
 +
</table>
 +
 +
 +
{{Vspace}}
 +
 +
 +
<table width="90%" align="center">
 +
<tr class="sh">
 +
<td class="sc" width="5%">'''Week'''</td>
 +
<td class="sc" width="33%">'''In class: Tuesday, March 26 2019'''</td>
 +
<td class="sc">'''This week's activities'''</td>
 +
</tr>
 +
 +
<tr class="s1">
 +
<td class="sc">11</td>
 +
<td class="sc">
 +
* ...
 +
</td>
 +
<td class="sc">
 +
* Follow up from class meeting ...
 +
:* ...
 +
:* ...
 +
----
 +
* To prepare before next meeting ...
 +
:* ...
 +
:* ...
 +
----
 +
{{Smallvspace}}
 +
<span class="mw-customtoggle-Notes11" style="vertical-align:bottom;">Details ... &nbsp;▽△</span>
 +
<div class="mw-collapsible mw-collapsed" id="mw-customcollapsible-Notes11"><small>
 +
* TBD
 +
* ...
 +
* ...
 +
 +
 +
</small></div>
 +
 +
</td>
 +
</tr>
 +
</table>
 +
 +
 +
{{Vspace}}
 +
 +
 +
<table width="90%" align="center">
 +
<tr class="sh">
 +
<td class="sc" width="5%">'''Week'''</td>
 +
<td class="sc" width="33%">'''In class: Tuesday, April 2 2019'''</td>
 +
<td class="sc">'''This week's activities'''</td>
 +
</tr>
 +
 +
<tr class="s1">
 +
<td class="sc">12</td>
 +
<td class="sc">
 +
* No class meeting this day
 +
* Deadline for computational tasks to be documented in journal
 +
* Deadline for all remaining course deliverables
 +
</td>
 +
<td class="sc">
 +
'''NA'''
 +
{{Smallvspace}}
 +
<span class="mw-customtoggle-Notes12" style="vertical-align:bottom;">Details ... &nbsp;▽△</span>
 +
<div class="mw-collapsible mw-collapsed" id="mw-customcollapsible-Notes12"><small>
 +
* TBD
 +
* ...
 +
* ...
 +
 +
 +
</small></div>
 +
 +
</td>
 +
</tr>
 +
</table>
 +
 +
 +
{{Vspace}}
 +
 +
 +
<!--
 +
== Course Objectives ==
 +
 +
 +
=== Building Software ===
 +
;Understand principles of software design and implementation in a collaborative environment.
 +
 +
This objective is implicit in students' project participation.
 +
 +
=== Gene Lists ===
 +
;Understand sources of and types of gene lists, gene IDs.
 +
 +
Gene IDs and gene lists are in many respects the raw material from which we construct bioinformatics. Here are two articles to set the stage:
 +
 +
 +
'''BioDBnet''' is a data-warehouse at the US National Cancer Institute.
 +
{{#pmid: 19129209}}
 +
* The database is significant for the large number of source databases it integrates. Have a look at the network of entities and links: http://biodbnet.abcc.ncifcrf.gov/dbInfo/netGraph.php
 +
* The database has a list of all identifier types that explains their semantics. This is very useful: http://biodbnet.abcc.ncifcrf.gov/dbInfo/netNodes.php
 +
 +
 +
 +
The '''Molecular Signatures Database (MSigDB)''' collects examples of ''gene sets'': lists of gene identifiers with a shared property. The paper discusses V3.0; the database has grown since then.
 +
{{#pmid: 21546393}}
 +
The currently available gene sets are described here: http://www.broadinstitute.org/gsea/msigdb/collections.jsp
 +
 +
 +
&nbsp;
 +
 +
<div style="background-color: #FFD4C8; margin:10px; padding: 10px; width: 78%;">
 +
'''Reading response''' (~ 1/2 page; due April 2.): We have discussed pairwise interactions for systems discovery. How would you use gene lists like those contained in the MSigDB as an additional information source?
 +
</div>
 +
 +
&nbsp;
 +
 +
 +
=== Co-expression ===
 +
;Understand the use of expression information to infer co-regulation.
 +
 +
A principle we sometimes call "guilt by association" states that genes that have similar features have a functional relationship. Applied to expression levels, the inference is: if the expression levels of two genes are correlated, we may assume that they are co-regulated. And if nature has evolved them to be co-regulated, the selective advantage is derived from a shared function. To put this into practice, one needs to find a suitable set of experiments for the expression profiles, one needs to assess multiple experiments in a common frame of reference, and one needs to calculate correlation in a meaningful way. Two databases have recently been published that store such co-expression values; it is useful to compare and contrast their approaches.
 +
{{#pmid: 25628763}}
 +
{{#pmid: 24599084}}
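The correlation step described above can be sketched in a few lines. This is a minimal illustration in Python, assuming each gene's expression profile is a list of values measured across the same experiments; the gene names and values are invented, and Pearson correlation is only one common choice of measure, not necessarily the one used by the databases cited above.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("constant profile: correlation is undefined")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical profiles: one value per experiment, same experiment order.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],   # rises with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],   # falls as geneA rises
}

r_ab = pearson(profiles["geneA"], profiles["geneB"])
r_ac = pearson(profiles["geneA"], profiles["geneC"])
print(f"geneA vs geneB: {r_ab:.3f}")
print(f"geneA vs geneC: {r_ac:.3f}")
```

A strongly positive value is the kind of evidence that would suggest co-regulation; whether the experiments are comparable at all is a separate question that the code does not address.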
 +
<div style="background-color: #FFD4C8; margin:10px; padding: 10px; width: 78%;">
 +
'''Quiz''' (March 25): Brief quiz on this paper. Understand the methods.
 +
</div>
 +
 +
&nbsp;
 +
 +
 +
 +
=== Molecular Interaction ===
 +
;Understand the use of interaction data to infer contribution to common function.
 +
 +
Interaction databases provide some of the best evidence for functional relationships between biomolecules, but using them productively can be challenging. First of all, we are leaving the paradigm of individual molecules and lists of molecules behind, and entering the world of graphs and networks. Secondly, interaction databases have historically struggled to maintain their data to a common standard, and the source data can be of widely varying reliability. As a result, the overlap between different databases has been embarrassingly low, and integration efforts that simply take the superset of all reported interactions suffer from too many false positives. A good introduction to the topic is here:
 +
{{#pmid: 22611057}}
 +
<div style="background-color: #FFD4C8; margin:10px; padding: 10px; width: 78%;">
 +
'''Quiz''' (April 1): Brief quiz on this paper. Understand the issues in curating and storing interaction data.
 +
</div>
 +
 +
&nbsp;
 +
 +
 +
Function prediction from network data assumes that some functions have been annotated and that the network will guide which functions to transfer to un-annotated nodes. Two major approaches have been proposed: diffusion-based approaches and clustering. We will discuss clustering elsewhere; here is a recent example of a diffusion approach.
 +
{{#pmid: 23788799}}
 +
 +
&nbsp;
 +
 +
=== GO ===
 +
;Understand the use of GO and GOA databases, and how to compute semantic similarity.
 +
 +
The notion of "function" is notoriously difficult to compute with, and the most successful approach to date is contributed by the [http://geneontology.org/ '''Gene Ontology (GO) Consortium''']. GO is an ontology of concepts, organized in a DAG (Directed Acyclic Graph: a hierarchical data-structure like a tree, but nodes can have more than one parent). Actually, GO comprises three ontologies: for (i) biological processes, (ii) cellular components and (iii) molecular functions. [http://www.ebi.ac.uk/GOA '''Gene Ontology Annotation (GOA)'''] is a database curated by UniProt, which annotates the UniProtKB proteins with GO terms. The collection of GO terms for a protein is presumed to reflect its function. Visit these sites for a brief introduction.
 +
 +
To work with GO, we need a somewhat deeper understanding of the principles. The discussion of changes to the ontology is a useful start.
 +
{{#pmid: 24641996}}
 +
 +
 +
For this course, one question is particularly important: to analyze whether two genes collaborate. This can usually not be directly inferred from their annotations, but the similarity of their annotated GO terms is an important indicator. There are many ways to compute such ''semantic similarity''. Here is a recent paper that proposes a new measure and compares it to previous approaches:
 +
{{#pmid: 25550042}}
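To make the idea of semantic similarity on a DAG concrete, here is a minimal sketch in Python. It assumes the ontology is given as a mapping from each term to its parents, and it uses the Jaccard index of the two terms' ancestor sets as the similarity. That is one of the simplest conceivable measures, chosen here only for illustration; it is not the measure proposed in the paper above, and the term names are invented.

```python
def ancestors(term, parents):
    """Return the term itself plus every term reachable through parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def jaccard_similarity(term_a, term_b, parents):
    """Shared ancestors divided by all ancestors of either term."""
    anc_a = ancestors(term_a, parents)
    anc_b = ancestors(term_b, parents)
    return len(anc_a & anc_b) / len(anc_a | anc_b)

# Hypothetical toy ontology (child: list of parents). T3 has two parents,
# which is what distinguishes a DAG from a tree.
toy_parents = {
    "root": [],
    "T1": ["root"],
    "T2": ["root"],
    "T3": ["T1", "T2"],
    "T4": ["T1"],
}

print(jaccard_similarity("T3", "T4", toy_parents))
print(jaccard_similarity("T3", "T3", toy_parents))
```

Measures used in practice typically also weight terms by how informative they are, which this sketch deliberately ignores.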
 +
<div style="background-color: #FFD4C8; margin:10px; padding: 10px; width: 78%;">
 +
'''Quiz''' (April 1): Brief quiz on this paper. Understand the principle of "semantic similarity".
 +
</div>
 +
 +
&nbsp;
 +
 +
=== Pathways ===
 +
;Understand the contents of pathway databases, and their use for gene-pair annotation.
 +
 +
Pathways are the classical paradigm to organize biochemistry into a meaningful framework: metabolic pathways came first, and later additions include regulatory/signalling pathways and developmental pathways. In a sense, such pathways should correlate with a notion of systems as collaborating entities, or at least form the cores of such systems. But how to exploit this information is not trivial, since pathways are also just conceptual entities: paths in much larger, multiply interconnected networks. One of the classic databases in this field is KEGG, which contains both signalling and metabolic pathways. MetaCyc/BioCyc probably has the current lead in breadth of reactions but is metabolic only. Reactome is excellently curated by the EBI and contains metabolic and signalling pathways, but is human only.
 +
 +
I have not found a good, current paper that utilizes database-scale pathway information for the discovery of broad principles. But here is a good, relatively recent overview of MetaCyc/BioCyc to set the stage.
 +
 +
{{#pmid: 24225315}}
 +
 +
 +
&nbsp;
 +
 +
<div style="background-color: #FFD4C8; margin:10px; padding: 10px; width: 78%;">
 +
'''Reading response''' (~ 1/2 page; due April 2.): Regulatory pathways are usually named according to key proteins they are organized around. Can you think of a better way?
 +
</div>
 +
 +
&nbsp;
 +
 +
 +
=== Graph features ===
 +
;Understand the analysis of graphs and computation of graph features.
 +
 +
Graph theory is the most important theoretical framework for systems biology. Here is an introduction with a perspective on biological networks.
 +
{{#pmid: 21527005}}
 +
<div style="background-color: #FFD4C8; margin:10px; padding: 10px; width: 78%;">
 +
'''Quiz''' (March 25): Brief quiz on this paper. Understand basic concepts and terms.
 +
</div>
 +
 +
&nbsp;
 +
 +
=== Graphs, Pathways, and Networks ===
 +
;Understand the representation of interaction data in systems biology as graphs.
 +
 +
Here Nataša Pržulj develops an analysis of the interaction network '''topology'''.
 +
{{#pmid: 24953453}}
 +
{{PDFlink|[http://local.biochemistry.utoronto.ca/steipe/abc/CourseMaterials/BCB420/jib-238.pdf|(PDF link here)]}}
 +
 +
 +
&nbsp;
 +
 +
<div style="background-color: #FFD4C8; margin:10px; padding: 10px; width: 78%;">
 +
'''Reading response''' (~ 1/2 page; due April 2.): Sketch the relationship between network topology and "system".
 +
</div>
 +
 +
&nbsp;
 +
 +
=== Graph clustering ===
 +
;Understand the principles and application of modern graph-clustering algorithms.
 +
 +
Cluster theory is a powerful approach to structure data. The basic idea is simple: define clusters as subsets that share more of a certain property within a set than between sets. To put this into practice, however, is non-trivial: everything depends on the precise definition of the property we are using to organize the data, and on what we mean precisely by "within" and "between". Applying the notion of clusters to graphs has its own set of theoretical challenges: in this case we are clustering topological relations, not object attributes. But the implications are profound and range from an improved understanding of biological network structure to a consistent strategy for function annotation, and perhaps to biological systems discovery.
 +
 +
{{#pmid: 24972109}}
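The "within" versus "between" idea can be illustrated for graphs with a very small sketch. The Python code below assumes an undirected graph given as a list of edges and a proposed assignment of nodes to clusters, and it simply counts edges that stay inside a cluster and edges that cross between clusters. It is not a clustering algorithm and not the method of the paper above; the toy graph and both partitions are invented.

```python
def edge_counts(edges, membership):
    """Count edges inside clusters and edges that cross between clusters."""
    within = 0
    between = 0
    for u, v in edges:
        if membership[u] == membership[v]:
            within += 1
        else:
            between += 1
    return within, between

# Invented toy graph: two triangles joined by a single bridging edge (c, d).
toy_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]

# A partition that follows the triangles, and one that cuts across them.
good_partition = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
poor_partition = {"a": 1, "d": 1, "b": 2, "e": 2, "c": 3, "f": 3}

print(edge_counts(toy_edges, good_partition))
print(edge_counts(toy_edges, poor_partition))
```

A partition with many within-cluster edges and few between-cluster edges matches the intuitive definition; real graph-clustering methods differ mainly in how they formalize and optimize that contrast.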
 +
 +
 +
&nbsp;
 +
 +
<div style="background-color: #FFD4C8; margin:10px; padding: 10px; width: 78%;">
 +
'''Reading response''' (~ 1/2 page; due April 2.): What is your preferred approach to validate the "systems" we discover through clustering? Why?
 +
</div>
 +
 +
&nbsp;
 +
 +
 +
-->
 +
 +
 +
 +
 +
<!--
 
== Timetable and syllabus ==
 
== Timetable and syllabus ==
  
  
 
;Subject to change on short notice...
 
;Subject to change on short notice...
 +
 +
<div class="alert">
 +
Under construction...
 +
</div>
 +
  
 
<table>
 
<table>
Line 172: Line 1,306:
  
 
<tr class="s1">
 
<tr class="s1">
<td class="sc">1</td>
+
<td class="sc">[[BCB420_Week01_Tasks|'''1''']]</td>
 
<td class="sc">Jan.&nbsp;6&nbsp;&ndash;&nbsp;12</td>
 
<td class="sc">Jan.&nbsp;6&nbsp;&ndash;&nbsp;12</td>
 
<td class="sc">
 
<td class="sc">
Line 183: Line 1,317:
 
</tr>
 
</tr>
  
<!-- ===================    THEME    ===================  -->
+
 
 +
 
 
<tr><td colspan="5" class="sp">&nbsp;</td></tr>
 
<tr><td colspan="5" class="sp">&nbsp;</td></tr>
 
<tr class="st">
 
<tr class="st">
Line 195: Line 1,330:
 
<td class="sc">'''Assignment'''</td>
 
<td class="sc">'''Assignment'''</td>
 
</tr>
 
</tr>
<!-- ===================    /THEME  ===================  -->
+
 
 +
 
  
 
<tr class="s1">
 
<tr class="s1">
<td class="sc">2</td>
+
<td class="sc">[[BCB420_Week02_Tasks|'''2''']]</td>
 
<td class="sc">Jan. 13 - 19</td>
 
<td class="sc">Jan. 13 - 19</td>
 
<td class="sc">
 
<td class="sc">
Line 212: Line 1,348:
  
 
<tr class="s1">
 
<tr class="s1">
<td class="sc">3</td>
+
<td class="sc">[[BCB420_Week03_Tasks|'''3''']]</td>
 
<td class="sc">Jan. 20 - 26</td>
 
<td class="sc">Jan. 20 - 26</td>
 
<td class="sc">
 
<td class="sc">
Line 235: Line 1,371:
  
  
<!-- ===================    THEME    ===================  -->
+
 
 +
 
 
<tr><td colspan="5" class="sp">&nbsp;</td></tr>
 
<tr><td colspan="5" class="sp">&nbsp;</td></tr>
 
<tr class="st">
 
<tr class="st">
Line 248: Line 1,385:
 
<td class="sc">'''Assignment'''</td>
 
<td class="sc">'''Assignment'''</td>
 
</tr>
 
</tr>
<!-- ===================    /THEME  ===================  -->
+
 
 +
 
  
 
<tr class="s2">
 
<tr class="s2">
Line 277: Line 1,415:
 
</tr>
 
</tr>
  
<!-- ===================    THEME    ===================  -->
+
 
 
<tr><td colspan="5" class="sp">&nbsp;</td></tr>
 
<tr><td colspan="5" class="sp">&nbsp;</td></tr>
 
<tr class="st">
 
<tr class="st">
Line 290: Line 1,428:
 
<td class="sc">'''Assignment'''</td>
 
<td class="sc">'''Assignment'''</td>
 
</tr>
 
</tr>
<!-- ===================    /THEME  ===================  -->
+
 
  
 
<tr class="s1">
 
<tr class="s1">
Line 326: Line 1,464:
 
</tr>
 
</tr>
  
<!-- ===================    THEME    ===================  -->
+
 
 +
 
 
<tr><td colspan="5" class="sp">&nbsp;</td></tr>
 
<tr><td colspan="5" class="sp">&nbsp;</td></tr>
 
<tr class="st">
 
<tr class="st">
Line 339: Line 1,478:
 
<td class="sc">'''Assignment'''</td>
 
<td class="sc">'''Assignment'''</td>
 
</tr>
 
</tr>
<!-- ===================    /THEME  ===================  -->
+
 
 +
 
  
 
<tr class="s1">
 
<tr class="s1">
Line 380: Line 1,520:
 
</tr>
 
</tr>
  
<!-- ===================    THEME    ===================  -->
+
 
 +
 
 
<tr><td colspan="5" class="sp">&nbsp;</td></tr>
 
<tr><td colspan="5" class="sp">&nbsp;</td></tr>
 
<tr class="st">
 
<tr class="st">
Line 393: Line 1,534:
 
<td class="sc">'''Assignment'''</td>
 
<td class="sc">'''Assignment'''</td>
 
</tr>
 
</tr>
<!-- ===================    /THEME  ===================  -->
+
 
 +
 
  
 
<tr class="s1">
 
<tr class="s1">
Line 422: Line 1,564:
 
</table>
 
</table>
  
==In depth...==
 
* [[Glossary]]
 
* [[Mutation Data Matrices]]
 
* {{WP|List_of_standard_amino_acids|Amino acids}}
 
<!-- * [[Database Identifiers]] -->
 
  
 +
 +
-->
  
 
== Resources ==
 
== Resources ==
  
 
;Course related
 
;Course related
*[[CSB prerequisites|Prerequisites for the course]]
+
*[http://steipe.biochemistry.utoronto.ca/abc/students '''Student Wiki''']
*The [http://groups.google.com/group/bcb420_2013 Course Google Group].
+
*The [https://groups.google.com/forum/#!forum/bcb420-2019 Course Google Group].
*[[Netiquette]] for the Group mailing list
+
*[[FND-Netiquette|Netiquette]] for the Group mailing list
  
 +
{{Smallvspace}}
  
;Contents related
+
{{#pmid:21816037}}
*The '''[[VMD]]''' tutorial
+
{{#pmid:24359104}}
*A '''[[Stereo Vision]]''' tutorial
+
{{#pmid:26844019}}
  
 +
{{Smallvspace}}
  
 
<table width="100%" padding="10" border="1">
 
<table width="100%" padding="10" border="1">
Line 453: Line 1,594:
  
  
 +
&nbsp;
 +
 +
== Notes ==
 +
 +
<references/>
 +
 +
 +
 +
&nbsp;
 
[[Category:Computational Systems Biology]]
 
[[Category:Computational Systems Biology]]
 
</div>
 
</div>

Latest revision as of 11:23, 2 April 2019

Computational Systems Biology

Course Wiki for BCB420 (Computational Systems Biology) and JTB2020 (Applied Bioinformatics).


 

This is our main tool to coordinate information, activities and projects in University of Toronto's computational systems biology course BCB420. If you are not one of our students, this site is unlikely to be useful. If you are here because you are interested in general aspects of bioinformatics or computational biology, you may want to review the Wikipedia article on bioinformatics, or visit Wikiomics. Contact boris.steipe(at)utoronto.ca with any questions you may have.


 

If you are enrolled in this course but have not been subscribed to the mailing list, or do not have an account on the Student Wiki, please contact me immediately.


 


 


 

BCB420 / JTB2020

These are the course pages for BCB420H (Computational Systems Biology). Welcome, you're in the right place.

These are also the course pages for JTB2020H (Applied Bioinformatics). How come? Why is JTB2020 not the graduate equivalent of BCB410 (Applied Bioinformatics)? Let me explain. When this course was conceived as a required part of the (then so called) Collaborative PhD Program in Proteomics and Bioinformatics in 2003, there was an urgent need to bring graduate students to a minimal level of computer skills and programming; prior experience was virtually nonexistent. Fortunately, the field has changed and our current graduate students are usually quite competent at least in some practical aspects of computational biology. In this course we profit from the rich and diverse knowledge of the problem-domain our graduate students have, while bringing everyone up to a level of competence in the practical, computational aspects.


The 2019 course...

In this course we explore the systems biology of human genes with computational means in a project-oriented format. This will proceed in three phases:

  • Foundations first: we will review basic computational skills and bioinformatics knowledge to bring everyone to the same level. In all likelihood you will need to start with these tasks well in advance of the actual lectures. This phase will include a comprehensive quiz on prerequisite material in week 3. We will explore data-sources and you will choose one data-source for which you will develop import code and document it in an R markdown document within an R package;
  • Next we'll focus on Biocuration: the expertise-informed collection, integration and annotation of biological data. We will each choose a molecular "system" to work on, and define an ontology and data-model in which to annotate our system's components, their roles, and their relationships. The outcome of your curation task (together with your data script) will define the scope of this course's Oral Test;
  • Finally, we will develop tools for Exploratory Data Analysis in computational systems biology. We will jointly develop code for a team-authored R package where everyone contributes one mini workflow for data preparation, exploration and interpretation. Your code contributions to the package will be assessed;
  • There are several meta-skills that you will pick up "on the side". These include time management; working according to best practice of reproducible research in a collaborative environment on GitHub; report writing; and keeping a scientific lab journal.



Organization

Dates
BCB420/JTB2020 is a Winter Term course.
Lectures: Tuesdays, 16:00 to 18:00. (Classes start at 10 minutes past the hour.)
Note: there will be three open-ended collaborative planning sessions that may go well into the night. Attendance and participation are mandatory.
Final Exam: None for this course.
Events
  • Tuesday, January 8 2019: Course officially begins. No class meeting. Get started on preparatory material (well in advance actually).
  • Tuesday, January 15: First class meeting. Mock-quiz for preparatory material.
  • Tuesday, January 22: First live quiz on preparatory material. Later: open ended session on data import
  • Tuesday, February 5: Open ended session on system curation
  • Tuesday, March 12: Open ended session on exploratory data analysis


Location
MS 3278 (Medical Sciences Building).


Departmental information
For BCB420 see the BCB420 Biochemistry Department Course Web page.
For JTB2020 see the JTB2020 Course Web page for general information.



 

Prerequisites and Preparation

This course has formal prerequisites of BCH441H1 (Bioinformatics) or CSB472H1 (Computational Genomics and Bioinformatics). I have no way of knowing what is being taught in CSB472, and no way of confirming how much you remember from any of your previous courses, like BCH441 or BCB410. Moreover there are many alternative ways to become familiar with important course contents. Thus I generally enforce course-prerequisites only very weakly and you should not assume at all that having taken any particular combination of courses will have prepared you sufficiently. Instead I make the contents of the course very explicit. If your preparation is lacking, you will have to expend a very significant amount of effort. This is certainly possible, but whether you will succeed will depend on your motivation and aptitude.

The course requires (i) a solid understanding of molecular biology, (ii) solid, introductory level knowledge of bioinformatics, (iii) a good working knowledge of the R programming language.


 

The prerequisite material for this course includes the contents of the 2018 BCH441 course:

  • <command>-Click to open the Bioinformatics Learning Units Map in a new tab, scale for detail.
A knowledge network map of the bioinformatics learning units.
  • Open the Bioinformatics Knowledge Network Map and get an overview of the material. You should confidently be able to execute the tasks in the four   Integrator Units  .
  • If you have taken BCH441 before, please note that many of the units have undergone significant revisions and material has been added. You will need to review the material and familiarize yourself more with the R programming aspects.
  • If you have not taken BCH441, you will need to work through the material rather carefully. Estimate at least three weeks of time and get started immediately.


 

A minimal subset of the bioinformatics knowledge you need to begin work in BCB420 is linked from the BCB420-specific map below. To ensure everyone is adequately prepared, we will hold a Quiz on the   Live units   on that map in the third week of class. We will hold a mock-quiz on the material in the second week (our first class meeting) so everyone knows what to expect.

  • <command>-Click to open the BCB420 Preparation Learning Units Map in a new tab, scale for detail.
A map of preparatory BCB420 learning units.
  • Hover over a learning unit to see its keywords.
  • Click on a learning unit to open the associated page.
  • The nodes of the learning unit network are colour-coded:
    •   Live units   are green
    •   Units under development   are light green. These are still in progress.
    •   Stubs   (placeholders) are pale. These still need basic contents.
    •   Milestone units   are blue. These collect a number of prerequisites to simplify the network.
    •   Integrator units   are red. These embody the main goals of the course. These units are not for evaluation in BCB420.
  • Arrows point from a prerequisite unit to a unit that builds on its contents.


 


Grading, Activities, Deliverables

 

For details of the deliverables, see below.

 
Activity                                                                    Weight: BCB420 (Undergraduates)   Weight: JTB2020 (Graduates)
Self-evaluation and feedback session on preparatory material ("Quiz"[1])    20 marks                          15 marks
Oral Test (March 7/8)                                                       30 marks                          30 marks
Collaborative software task and participation                               20 marks                          15 marks
Journal                                                                     25 marks                          25 marks
Insights                                                                    5 marks                           5 marks
Pull request reviews                                                        -                                 10 marks
Total                                                                       100 marks                         100 marks


 

We are covering a lot of ground in this course, and all deliverables feed into a collaborative project. Everyone's continuous, active participation is essential for making this a success: for you personally and for the class as a team.


 

Getting started

 

Everything starts with the following four units:

This should be the first learning unit you work with, since your Course Journal will be kept on a Wiki, as well as all other deliverables. This unit includes an introduction to authoring Wikitext and the structure of Wikis, in particular how different pages live in separate "Namespaces". The unit also covers the standard markup conventions - "Wikitext markup" - the same conventions that are used on Wikipedia - as well as some extensions that are specific to our Course- and Student Wiki. We also discuss page categories that help keep a Wiki organized, licensing under a Creative Commons Attribution license, and how to add licenses and other page components through template codes.


Keeping a journal is an essential task in a laboratory. To practice keeping a technical journal, you will document your activities as you are working through the material of the course. A significant part of your term grade will be given for this Course Journal. This unit introduces components and best practice for lab- and course journals and includes a wiki-source template to begin your own journal on the Student Wiki.


Academic Integrity is a promise that scholars and scientists world-wide give each other, that we will uphold, protect, and promote ethical and practical standards for our work. Its most basic values are proclaimed as honesty, trust, fairness, respect, responsibility, and courage. These are simple ideas, but in order to give them meaning we need to discuss how these values get translated to the details of our everyday work. Unfortunately, this important topic is often compressed to discussing cheating and plagiarism, to managing procedures to detect dishonesty, and to threatening sanctions. It is overlooked that those are just the manifestations of much deeper problems, and focussing on those symptoms alone perpetuates a stereotyped us-versus-them mentality of educators and students alike that is much more likely to make the problem worse than to solve it. The key to counter this lies in a proper understanding of academic integrity as a relational value, and respect as its foundation.

Discussing academic integrity in the abstract is of limited use. The challenge is to put the concepts into practice in every aspect of this course, and this is not a question of behaviour, but of attitude. The attitude needs to be reflected in the choice of teaching materials, in the care in their preparation, in the attitude of impartiality and reproducibility we bring to our experiments, in mutual trust in class, in fairness in assessments, and honesty in assignments. One everyday issue is attribution, and we operate a Full Disclosure Policy for attribution in this course. This means everything that is not one's own, original idea must be identified and properly attributed. Neither I nor you are already perfect in this, but I trust we can come together as a learning community to educate each other and improve.


In parallel with your other work, you will maintain an insights! page on which you collect valuable insights and learning experiences of the course. Through this you ask yourself: what does this material mean, for the field and for myself?


Once you have completed these four units, get started immediately on the Introduction-to-R units. You need time and practice, practice, practice[2] to acquire the programming skills you need for the course. Whenever you want to take a break from studying R, continue with the other preparatory units.


 

Part I: Foundations and Data

 

Don't forget to document your work in your Journal!


 

Your level of preparedness will be assessed in a "mock quiz" in week two, after which you have one more week to fill in gaps before our Quiz in week three. With that out of the way, we will look at different data sources that are useful in systems biology, including gene-level annotations and collections of experimental data, relationship data like physical and epistatic interactions, and systems-level data like metabolic or regulatory pathways. Each of you will select one data-source in our first open-ended session and then work on the following deliverables:

  • a brief summary page on the Student Wiki: the page needs to be named according to the pattern: User:<your_name>/BCB420-2019-Data_<your_data_resource> and contain the category tag: [[Category:BCH420-2019_Data_project]].
  • an R package derived from rpt,
    • hosted on GitHub,
    • named according to the pattern BCB420.2019.<your_data_resource>[3],
    • containing an R markdown page that describes and annotates code for
      • importing the chosen data in platform-independent function calls (see the footnote for details and restrictions)[4],
      • and cleaning it up where necessary,
      • and normalizing its identifiers to HUGO gene symbols,
    • and containing sample data for our defined reference dataset of genes,
    • and containing a report on the data statistics,
    • and containing code to validate the import process,
    • and containing the (provided) function to display the markdown file.
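The import, normalization and validation steps listed above can be sketched as follows. The deliverable itself is an R package; this Python sketch only illustrates the logic, under stated assumptions: records arrive as (source identifier, value) pairs, a lookup table from source identifiers to gene symbols is already available, and all identifiers, symbols and values shown are placeholders invented for the example.

```python
def normalize_to_symbols(records, id_to_symbol):
    """Group values by gene symbol; collect source IDs that cannot be mapped."""
    mapped = {}
    unmapped = []
    for source_id, value in records:
        symbol = id_to_symbol.get(source_id)
        if symbol is None:
            unmapped.append(source_id)      # keep these for the data report
        else:
            mapped.setdefault(symbol, []).append(value)
    return mapped, unmapped

def coverage(mapped, reference_symbols):
    """Fraction of the reference gene set that received at least one value."""
    reference = set(reference_symbols)
    if not reference:
        raise ValueError("reference gene set is empty")
    return len(reference & set(mapped)) / len(reference)

# Placeholder data: "SRC_*" stands for the source database's own identifiers,
# "GENE_*" for gene symbols. None of these are real identifiers.
records = [("SRC_1", 0.5), ("SRC_2", 1.2), ("SRC_1", 0.7), ("SRC_9", 3.0)]
id_to_symbol = {"SRC_1": "GENE_A", "SRC_2": "GENE_B"}
reference_genes = ["GENE_A", "GENE_B", "GENE_C"]

mapped, unmapped = normalize_to_symbols(records, id_to_symbol)
print(mapped)
print(unmapped)
print(coverage(mapped, reference_genes))
```

Reporting the unmapped identifiers and the coverage of the reference gene set is one simple way to document data statistics and to check that an import behaved as expected.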

Required: a user needs to be able to use the information you provided to understand the semantics of the data, import the data, purify it where necessary, and associate it with HUGO IDs in an R data frame. They should be able to use the data as a feature in a machine learning protocol without further preprocessing steps.

To illustrate the requirements with a model solution, I have provided an example project page here, which links to a GitHub repository with the corresponding package. Studying this with some care will probably clarify many questions.

Note
  • If your data refers to chromosomal coordinates in any way, you must ensure the coordinates are from GRCh38 (hg38)[5]
  • Your chosen database will not always be the best choice of data source: often you can achieve your objective faster through Ensembl/BioMart. See this sample annotation of BRCA2 for examples of what data is available.


 

Database choices

 

Here are the chosen (or assigned) databases. Follow the link in the "Note" column for details:


 
Name DB Note
Edouard Al-chami GEO (stimulus)  [6]
Emily Ayala Gene models  [7]
Deus Bajaj EGGNOG  
Cathy Cha GEO (tissues)  [8]
Nada Elnour Human Protein Atlas  [9]
Chantal Ho GEO (diseases)  [10]
Edward Ho Cosmic  
Sapir Labes GWAS  
Judy Lee PDB  [11]
Tina Lee Pfam  [12]
Jian Bin Lin GEO  
Matthew Mcneil COSMIC and GEO  [13]
Gabriela Morgenshtern Awesome (or PANTHER)  
Yoonsik Park Reactome pathways  
Alesandro Rigido MsigDB  [14]
Fan Shen SMART  
Rachel Silverstein Human Phenotype Ontology  
Yiqiu Tang OMIM  [15]
Denitsa Vasileva GO annotations  [16]
Rachel Woo Human Protein Atlas  [17]
Alison Wu BioGRID  [18]
Yufei Yang GTRD  [19]
Yin Yin huMAP  [20]
Han Zhang HitPredict  [21]
Xindi Zhang CORUM  [22]
Yuhan Zhang Encode  
Liwen Zhuang Human Disease Ontology  


Contact me with any questions you may have.


 

Part II: Biocuration

"Systems" are concepts and working with systems requires expert knowledge. To explore the practice of expert curation of molecular systems, each of you will select one system in our second open-ended session and report on its components, its function(s) and its architecture. To start off:

  • Choose a system from the GO term table on the Student Wiki, confirm your choice with me and replace the "N.N." in the table with your name.
  • Explore the term on AmiGO, and explore the linked "seed-genes" on UniProt.
  • In PubMed, find recent reviews or other manuscripts that discuss the system and its context. Make sure you have not overlooked important literature, this will be part of your evaluation. If there is no suitable literature available, your GO term is not a suitable choice.
  • Get an overview of your system and how it relates to the GO term you start out from.
  • Define the system well and define a five-letter code as a shorthand notation of the system, as discussed in class.
Note
A GO term is not a system, nor is the set of GOA annotated genes a complete description of the system's members. A system may overlap the component/function/process described in a GO term to a large degree, but the term is not informed or constrained by our "system" definition. We use GO terms as a first approximation to system functions, and we use GOA to define "seed" genes as a starting point that may help us build out the system description. However, a system's roles include the creation, maintenance, destruction, and potentially recycling of components, and these roles are not always included in either the literature or the GO terms themselves.


 

Read the notes on curating a biological system.


 


 
General goal: System Architecture

A system architecture describes the system’s behaviour in terms of its subsystems and their relationships, given its context, within its boundaries.


 
Deliverables: Contents
  • A structured description of the system, including its name, definition, description, associated GO terms, an initial set of computationally defined genes it contains, and references to a seed set of literature articles that will be used for curation;
  • A description of concepts of importance. This includes the biological context, and background knowledge about the components.
  • An enumeration of components from:
    • literature review;
    • direct annotation, i.e. genes discovered because they have been annotated with a relationship to the system, in a database such as UniProt, NCBI-Protein or any of the three GO ontologies represented in GOA (GO annotations);
    • network and pathway annotation, i.e. genes discovered in the network neighbourhood of system components, in a database like STRING or IntAct, or in pathways such as KEGG or Reactome;
    • phenotype and behaviour, i.e. genes annotated to a related phenotype in OMIM or the GWAS catalog;
    • ... each with a note on the type and quality of evidence that supports their inclusion.
  • Completion of role annotation: each component has one role annotated to it (list components more than once if several distinct roles relate to the same, or overlapping entities); list roles that are expected, or required, but have no components associated with them.
  • A system architecture sketch that integrates the system information;
  • A formatted set of system data, ready to be imported into a system database.


Deliverables: Form
  • Create a project page on the Student Wiki named according to the pattern: User:<your_name>/BCB420-2019-System_<your_system_code>;
  • add the category tag: [[Category:BCH420-2019_Curation_project]];
  • add the {{CC-BY}} template;
  • summarize your "seed" information (follow the model for the PHALY system);
  • as you are annotating your system, ensure all components have a SyRO role defined, and the evidence source and evidence code have been entered;
  • the system data needs to be included in the page in a valid(!) JSON file, in an expansible section of text.[23]
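
The requirements above (valid JSON, line breaks in the data, and a role, evidence source and evidence code for every component) can be checked mechanically before you paste the data into your page. The following is a minimal Python sketch of such a check, not the course's validator: the field names `components`, `role`, `evidenceSource` and `evidenceCode`, the gene symbols and the values are all hypothetical placeholders, and the real names must be taken from the system data model.

```python
import json

# Hypothetical field names, chosen for illustration only; use the names
# defined by the course's system data model instead.
REQUIRED_FIELDS = ("role", "evidenceSource", "evidenceCode")


def check_system_json(text):
    """Return a list of problems found in the JSON text (empty if none)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg} (line {err.lineno}, column {err.colno})"]

    problems = []
    if "\n" not in text.strip():
        problems.append("data is on a single line; add line breaks")
    if not isinstance(data, dict):
        return problems + ["top level is not a JSON object"]
    # Assumes every component is itself a JSON object.
    for i, component in enumerate(data.get("components", [])):
        for field in REQUIRED_FIELDS:
            if not component.get(field):
                problems.append(f"component {i}: missing '{field}'")
    return problems


# Toy data with placeholder values; the second component lacks an evidence code.
system = {
    "code": "PHALY",
    "components": [
        {"symbol": "GENE1", "role": "example role",
         "evidenceSource": "example source", "evidenceCode": "EXAMPLE"},
        {"symbol": "GENE2", "role": "example role",
         "evidenceSource": "example source"},
    ],
}

pretty = json.dumps(system, indent=2)         # indent=2 writes one field per line
print(check_system_json(pretty))              # reports the missing evidence code
print(check_system_json(json.dumps(system)))  # additionally reports the single line
print(check_system_json('{"code": "PHALY",}'))  # trailing comma: not valid JSON
```

Writing the data with an indent, as in `json.dumps(system, indent=2)`, produces the line breaks that the note on JSON data asks for.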


Both your data import script and your curated system model will be assessed in the Oral Test.


 

Part III: Exploration

At the end of Parts I and II we will have data available and annotated systems that induce relations on the data. Using this information, we can formulate tools for exploratory data analysis (EDA): isolating and evaluating features, looking at correlations, identifying patterns in networks, clustering data etc. Each of you will select one EDA workflow in our third open-ended session for which to build a tool in a jointly authored R package. Your deliverables are:

  • a project page on the student Wiki that contains a specification of your tool;
  • an implementation of your tool as part of a jointly authored R package under continuous integration;
  • a Vignette in the package that describes the tool and includes sample code for which the data is also provided in the package.

Your deliverables will be evaluated together with your participation in constructing the package.

Deliverables: Form
  • On the Student Wiki -
    • Create a project page on the Student Wiki named according to the pattern: User:<your_name>/BCB420-2019-ExploratorySystemsAnalysis;
    • add the category tag: [[Category:BCH420-2019_Exploration_project]];
    • add the {{CC-BY}} template;
    • summarize the objectives of your exploration tool in terms of input, output, and interpretation;
    • write a specification for your exploration tool;
    • summarize example results.
  • On GitHub -
    • Fork the project BCB420.2019.ESA;
    • Develop your code as a package function;
    • Write a vignette;
    • Make sure your changes pass without errors, warnings or notes;
    • Submit a pull request by Monday, March 25.
    • Address comments from the pull-request review before Tuesday, April 2.

The code is considered "submitted" when it passes the continuous integration checks, all pull-request reviews have been addressed, and your branch has been merged into the BCB420.2019.ESA package.


 

Extensions for term work

 

Extensions for term work in this course are subject to Faculty regulations and will only be considered within the framework determined by the Faculty policies.


  • Regular Submissions
It is Faculty policy to require assessments to be "fair, equitable and reasonable". In order to be equitable, granting extensions requires the student to demonstrate that the need for the extension is due to unavoidable circumstances that go significantly beyond what was expected of the rest of the class. In general, "official" documentation will be required: UofT Verification of Illness or Injury Form, Student Health or Disability Related Certificate, a College Registrar’s Letter, or an Accessibility Services Letter.
  • Signing up for the oral tests.
The dates for the Oral Test have been announced at the beginning of the term on this syllabus. If you fail to sign up for a slot, or if you fail to show up at the scheduled time, we apply the Faculty policy for a missed Midterm Test: "if the reasons for missing your test are acceptable to the instructor, a make-up opportunity should be offered to the student where practicable." "Acceptable" reasons will be considered
  • if they are justified,
  • if the consideration is "fair, equitable and reasonable", and
  • if the reason is documented through one of the four types of "official" documentation: UofT Verification of Illness or Injury Form, Student Health or Disability Related Certificate, a College Registrar’s Letter, or an Accessibility Services Letter.
Scope for a "practicable" make-up opportunity for the Oral Test will be limited.
  • Submissions due on the last day to submit course work in the Spring term (Tuesday, April 2 2019).
Since the course does not have a final exam, the Faculty requires grades to be marked, collated and submitted a few days after the last day to submit course work in the Spring term (Tuesday, April 2 2019). Therefore, I cannot normally grant extensions beyond this date. The Faculty allows so-called informal extensions to be granted "in extraordinary circumstances"; in those cases too, the requirement to be "fair, equitable and reasonable" will apply, i.e. you would need to demonstrate that the need for the extension was due to unavoidable circumstances that go significantly beyond what was expected of the rest of the class, and submit "official" documentation to me. In that case, (i) we would determine an adjusted submission date, (ii) I would initially submit a mark of 0 for the missing submissions, and (iii) I would submit an amended mark, after that date, if appropriate. Note that the Faculty requires that such extensions don't go beyond a few days after the end of the Final Examination Period. If you require an extension beyond that date, you need to submit a formal petition through your College Registrar.


 

Late penalties

 

Late penalties will be applied according to the following formula: (marks achieved) * 0.5^(fractional days late). However, material submitted more than 3.0 days late (72 hours or more) will be marked zero. Note: this does not apply to material due before the Oral Test (see there).
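
To make the formula concrete, here is a small Python sketch of the calculation. It is an illustration only, not the marking script. It assumes that lateness is measured in hours and converted to fractional days, and that a submission exactly 72 hours late already counts as zero, following the parenthetical "72 hours or more".

```python
def late_penalty(marks, hours_late):
    """Return the mark that remains after the late penalty is applied."""
    if hours_late <= 0:
        return marks                  # on time: no penalty
    days_late = hours_late / 24.0     # fractional days late
    if days_late >= 3.0:              # assumption: exactly 72 h is already zero
        return 0.0
    return marks * 0.5 ** days_late   # the mark halves with every full day


for hours in (0, 6, 24, 48, 72):
    print(f"{hours:>3} h late: {late_penalty(80.0, hours):.2f} of 80 marks")
```

A submission that earned 80 marks keeps 40 if it is exactly one day late and 20 if it is exactly two days late.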


 

Copyright and Licensing

 

We follow FOSS (Free and Open Source Software) principles in this course. You automatically own copyright to all material you prepare. All material must be licensed for free re-use, under the condition of fair attribution. In practice:

All pages that you place on the Student Wiki must include a {{CC-BY}} tag. All documentation within GitHub pages that you prepare for this course must include a Creative Commons License - Attribution (CC-BY), v. 4.0 or later. All code submitted for this course must be licensed under the MIT software license. Unlicensed submissions will have marks deducted and may be removed from the Wiki.


 

Academic integrity

Our rules on Plagiarism and Academic Misconduct are clearly spelled out in this learning unit. This unit is part of our course prerequisites, and everyone documents in their course journal that they have worked through the unit and understood it. Consequences of having to report to the Office of Student Academic Integrity (OSAI) for plagiarism, misrepresentation or falsification include an indelible failing mark on the transcript, a delay in graduation, or not being able to complete your POSt. Please take extra time to clearly understand the requirements, and define for yourself what they mean for every aspect of your work.


 

Marks adjustments

I do not adjust marks towards a target mean and variance (i.e. there will be no "belling" of grades). I feel strongly that such "normalization" detracts from a collaborative and mutually supportive learning environment. If your classmate gets a great mark because you helped them with a difficult concept, this should never have the effect that it brings down your mark through class average adjustments. Collaborate as much as possible; it is a great way to learn. But do keep it honest and carefully consider our rules on Plagiarism and Academic Misconduct.


 

Timetable and contents details

Note: The general outline of the course as described above is current for the 2019 Winter Term. Filling in the activity details below is still in progress.


 



 

Part I: Foundations

 
Week 1, in class: Tuesday, January 8 2019. This week's activities:
  • No class meeting this day!
  • To prepare before next meeting ...
    • study or review ABC learning unit material
    • start or update your User page on the Student Wiki
    • start your course journal

 

Details ...

  • You are not submitting learning units for credit, thus you should be able to progress quickly through the material up to the Milestone units. But do not skip units.
  • If you have worked with the ABC-units RStudio project before, you need to pull the most recent version from the GitHub repository. Update it from time to time; the code will change. If you have not worked with this RStudio project before, make sure you work through the "Introduction to R" units in detail and with great care.
  • Your course journal must contain the following category tag: [[Category:BCB420-2019_Journal]].
  • Your User page must contain the following category tag: [[Category:BCB420-2019]].


 


Week 2, in class: Tuesday, January 15 2019. This week's activities:
  • First class meeting
  • Review of preparatory materials (you should have worked through all of the materials in preparation for class).
  • Practice quiz on preparations (not for credit)
  • Course overview and Q&A
  • Follow up from class meeting ...

  • To prepare before next meeting ...
    • Get an overview of the rpt package so you can ask questions next week.
    • Review data sources; you will need to choose one to work on.
    • Review requirements for your data source deliverable. Make sure you can work from it and discuss it in class.

 

Details ...

In progress ...
  • You need a GitHub account and you need to have your RStudio client set up to pull from and push to GitHub hosted projects. See the rpt package for details.
  • Data: our goal is to make data available that can be used for the annotation of curated biological systems. Data types that interest us in principle include:
    • Component annotations: sequence, structure, function (GO), localization ...
    • Component dynamics (time, space, virtual dimensions): expression profiles, modification dynamics, ...
    • Relationships: protein-protein interaction data, metabolic and regulatory pathways, functional associations (STRING), ...
    • Perturbations: cancer genomes, epistatic effects, ...
    • Phenotypes: OMIM, Navigome ...
    • Expert curated sets: MSigDB ...

To be well prepared, you need to understand the various categories of data that are available and to have narrowed your choice to two or three datasets that you know fulfill the requirements.

Read:

Grabowski & Rappsilber (2019) A Primer on Data Analytics in Functional Genomics: How to Move from Data to Insight?. Trends Biochem Sci 44:21-32. (pmid: 30522862)

High-throughput methodologies and machine learning have been central in developing systems-level perspectives in molecular biology. Unfortunately, performing such integrative analyses has traditionally been reserved for bioinformaticians. This is now changing with the appearance of resources to help bench-side biologists become skilled at computational data analysis and handling large omics data sets. Here, we show an entry route into the field of omics data analytics. We provide information about easily accessible data sources and suggest some first steps for aspiring computational data analysts. Moreover, we highlight how machine learning is transforming the field and how it can help make sense of biological data. Finally, we suggest good starting points for self-learning and hope to convince readers that computational data analysis and programming are not intimidating.


 


Week 3, in class: Tuesday, January 22 2019. This week's activities:

Open-ended session:


  • Preparations review Q & A
  • Quiz

  • Choosing a dataset to define an import workflow ...
  • Follow up from class meeting ...
    • Data import
    • Analyze your data source
    • Define cleanup and normalization needs

  • To prepare before next meeting ...
    • create a project page on the Student Wiki
    • study your database and figure out how the information it provides is related to the system data model
    • define your requirements
    • create a package based on rpt
    • begin writing your workflow as a "literate programming" document

 

Details ...

  • Understand the context:
    • What data is available? Explore your database and be sure to understand the semantics of the data.
    • How is your data going to support systems annotations? Study the systems data model in the resources project.
    • How are you going to present your data?
      • The rpt package: read the README and understand how it helps you construct your own R package.
      • Markdown: work through the Literate Programming unit to get an idea in principle, but note the difference between .Rmd and .md documents (We are doing .md here, this is simpler.)
      • Study the sample solution well. Understand what parts of this are relevant for your project, which ones are not, and what parts you may need that are not in the sample solution.
  • Get started:
    • Define your requirements. Define how you are going to download the source data, what the results data should look like, and how you are going to construct the results. Identify ambiguities, cleanup needs, possibilities for validation.
    • Start a project page on the Student Wiki, write your requirements in point form
    • Start building your package. Follow the instructions in the rpt package. Push the result to GitHub.
    • Link to your package from your project page.
    • Draft an outline of your workflow in your README.md document. Commit and push to GitHub.
  • Communicate: whenever questions come up, post on the list.
 
  • Don't forget your Journal!



 


Week 4, in class: Tuesday, January 29 2019. This week's activities:
  • Normalizing gene names
  • Validating datasets
  • Scaling transformations
  • Intro of test dataset
  • Reproducible research aspects


  • Follow up from class meeting ...
    • solve any normalization issues your dataset may have
    • Get your ORCID IDs
    • Prepare your data that relates to the test set
    • Include scaling code, where indicated

  • To prepare before next meeting ...
    • work through literate programming
    • finalize package
    • validate correctness
    • document
    • "Release" your package before Tuesday, February 5 2019 at 16:00[24].
    • Review systems theory
    • Intro to BioCuration

 

Details ...

  • TBD
  • ...
  • ...




 

Part II: Curation

 


Week 5, in class: Tuesday, February 5 2019. This week's activities:

Open-ended session:


  • Systems concepts
  • A systems ontology
  • A systems data model
  • Biocuration

  • Choosing your system for a systems curation project ...
  • Follow up from class meeting ...
    • Choose your system

  • To prepare before next meeting ...
    • Begin your project page
    • Define observables
    • Begin exploring your system
    • Start drafting a systems architecture

 

Details ...

  • Create a project page on the Student Wiki named according to the pattern: User:<your_name>/BCB420-2019-System_<your_system_code>;
  • add the category tag: [[Category:BCH420-2019_Curation_project]];
  • add the {{CC-BY}} template;
  • summarize your "seed" information (follow the model for the PHALY system);
  • as you are annotating your system, ensure all components have a SyRO role defined, and the evidence source and evidence code have been entered;
  • the system data needs to be included in the page in a valid(!) JSON file, in an expansible section of text.[25]
  • draft a hand drawn sketch of the system architecture (cf. "Systems Concepts"; this is the file that was assigned as required reading in Week 2);
  • write down a list of observables for your system, i.e. the relationship of the data we explored in Part I to the system:
    • What features do you expect to find for a gene that occurs in the system? (Annotation-type data)
    • What features do you expect to be shared by two genes that occur in your system? (Network-type data)
    • What features do you expect to be enriched for all genes in your system, or a defined subset? (Set/enrichment-type data)
 
  • Don't forget to write your Journal as you explore your system!



 


Week 6, in class: Tuesday, February 12 2019. This week's activities:
  • Class was canceled due to an ice storm
  • Follow up from class meeting ...
    • ...
    • ...

  • To prepare before next meeting ...
    • ...
    • ...

 

Details ...

  • TBD
  • ...
  • ...



 


Week In class: Tuesday, February 19 2019 This week's activities
  • No class meeting - Reading Week
  • To prepare during reading week ...
    • Start your project page on the Student Wiki;
    • draft a hand drawn sketch of the system architecture;
    • draft a list of system observables;

For details see the "Biocuration" deliverables (above).



 


Week 7, in class: Tuesday, February 26 2019. This week's activities:
  • Milestone report: (major progress: you should be nearly done)
  • Follow up from class meeting ...
    • ...
    • ...

  • To prepare before next meeting ...
    • ...
    • ...

 

Details ...

  • TBD
  • ...
  • ...



 


Week 8, in class: Tuesday, March 5 2019. This week's activities:
  • Milestone III: report (final)
  • A brief overview of Exploratory Data Analysis (EDA) for Systems Biology (overview of materials and outline how to study)
  • Data model of systems data for a shared package
  • Follow up from class meeting ...
    • Finalize curation report
    • Validate

  • To prepare before next meeting ...
    • Curation project deadline
    • Prepare for Oral Tests: March 7/8
    • Study introduction to EDA materials

 

Details ...

  • TBD
  • ...
  • ...



 

Part III: Exploration

 


Week 9, in class: Tuesday, March 12 2019. This week's activities:

Open-ended session:


  • Exploratory Data Analysis of Systems data

  • rptPlus and rptTeam
  • Contributing to a team-authored package on GitHub: forks, branches, pull-requests and Continuous Integration
  • Choose your workflow for a team-authored systems EDA package ...
  • Follow up from class meeting ...
    • Study rptPlus and rptTeam documentation
    • ...

  • To prepare before next meeting ...
    • ...
    • ...

 

Details ...

  • TBD
  • ...
  • ...



 


Week 10, in class: Tuesday, March 19 2019. This week's activities:
  • Vignettes
  • ...
  • Follow up from class meeting ...
    • ...
    • ...

  • To prepare before next meeting ...
    • ...
    • ...

 

Details ...

  • TBD
  • ...
  • ...



 


Week 11, in class: Tuesday, March 26 2019. This week's activities:
  • ...
  • Follow up from class meeting ...
    • ...
    • ...

  • To prepare before next meeting ...
    • ...
    • ...

 

Details ...

  • TBD
  • ...
  • ...



 


Week 12, in class: Tuesday, April 2 2019. This week's activities:
  • No class meeting this day
  • Deadline for computational tasks to be documented in journal
  • Deadline for all remaining course deliverables

NA

 

Details ...

  • TBD
  • ...
  • ...



 




Resources

Course related


 
Miller et al. (2011) Strategies for aggregating gene expression data: the collapseRows R function. BMC Bioinformatics 12:322. (pmid: 21816037)

BACKGROUND: Genomic and other high dimensional analyses often require one to summarize multiple related variables by a single representative. This task is also variously referred to as collapsing, combining, reducing, or aggregating variables. Examples include summarizing several probe measurements corresponding to a single gene, representing the expression profiles of a co-expression module by a single expression profile, and aggregating cell-type marker information to de-convolute expression data. Several standard statistical summary techniques can be used, but network methods also provide useful alternative methods to find representatives. Currently few collapsing functions are developed and widely applied. RESULTS: We introduce the R function collapseRows that implements several collapsing methods and evaluate its performance in three applications. First, we study a crucial step of the meta-analysis of microarray data: the merging of independent gene expression data sets, which may have been measured on different platforms. Toward this end, we collapse multiple microarray probes for a single gene and then merge the data by gene identifier. We find that choosing the probe with the highest average expression leads to best between-study consistency. Second, we study methods for summarizing the gene expression profiles of a co-expression module. Several gene co-expression network analysis applications show that the optimal collapsing strategy depends on the analysis goal. Third, we study aggregating the information of cell type marker genes when the aim is to predict the abundance of cell types in a tissue sample based on gene expression data ("expression deconvolution"). We apply different collapsing methods to predict cell type abundances in peripheral human blood and in mixtures of blood cell lines. Interestingly, the most accurate prediction method involves choosing the most highly connected "hub" marker gene. Finally, to facilitate biological interpretation of collapsed gene lists, we introduce the function userListEnrichment, which assesses the enrichment of gene lists for known brain and blood cell type markers, and for other published biological pathways. CONCLUSIONS: The R function collapseRows implements several standard and network-based collapsing methods. In various genomic applications we provide evidence that both types of methods are robust and biologically relevant tools.

Chang et al. (2013) Meta-analysis methods for combining multiple expression profiles: comparisons, statistical characterization and an application guideline. BMC Bioinformatics 14:368. (pmid: 24359104)

BACKGROUND: As high-throughput genomic technologies become accurate and affordable, an increasing number of data sets have been accumulated in the public domain and genomic information integration and meta-analysis have become routine in biomedical research. In this paper, we focus on microarray meta-analysis, where multiple microarray studies with relevant biological hypotheses are combined in order to improve candidate marker detection. Many methods have been developed and applied in the literature, but their performance and properties have only been minimally investigated. There is currently no clear conclusion or guideline as to the proper choice of a meta-analysis method given an application; the decision essentially requires both statistical and biological considerations. RESULTS: We performed 12 microarray meta-analysis methods for combining multiple simulated expression profiles, and such methods can be categorized for different hypothesis setting purposes: (1) HS(A): DE genes with non-zero effect sizes in all studies, (2) HS(B): DE genes with non-zero effect sizes in one or more studies and (3) HS(r): DE gene with non-zero effect in "majority" of studies. We then performed a comprehensive comparative analysis through six large-scale real applications using four quantitative statistical evaluation criteria: detection capability, biological association, stability and robustness. We elucidated hypothesis settings behind the methods and further apply multi-dimensional scaling (MDS) and an entropy measure to characterize the meta-analysis methods and data structure, respectively. CONCLUSIONS: The aggregated results from the simulation study categorized the 12 methods into three hypothesis settings (HS(A), HS(B), and HS(r)). Evaluation in real data and results from MDS and entropy analyses provided an insightful and practical guideline to the choice of the most suitable method in a given application. All source files for simulation and real data are available on the author's publication website.

Thompson et al. (2016) Cross-platform normalization of microarray and RNA-seq data for machine learning applications. PeerJ 4:e1621. (pmid: 26844019)

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.


 


 

Notes

  1. I call these activities Quiz sessions for brevity, however they are not quizzes in the usual sense, since they rely on self-evaluation and immediate feedback.
  2. It's practice!
  3. According to "Writing R Extensions": "The mandatory ‘Package’ field gives the name of the package. This should contain only (ASCII) letters, numbers and dot, have at least two characters and start with a letter and not end in a dot." Deviating from this will result in a package check error.
  4. Note: the repository absolutely must not contain any datafile of more than 1 MB in size! Rather it must contain clear instructions on how to download the data. Packages that violate the size limitations will not be evaluated. The code you write shall expect the data in a sister-directory of your working directory which is called data. For example, if I were to store a datafile by the name STRING_90.dat, my code would construct the path to it in a platform-independent way as file.path("..", "data", "STRING_90.dat").
  5. For different approaches to convert from one to the other see this thread on Biostars.
  6. Cell response to external stimuli (e.g. heat, salt, insulin, chemokines ...): Find ~ 20 high-coverage experimental data sets, define the pipeline to download and process the sets into a common data structure, apply quantile normalization. Result: an expression vector for each gene.
  7. Find gene models (exons and chromosomal coordinates) for each gene. Possible sources are Gencode v29 GTF or Gff3 files, or exons from biomart. Result: for each gene, a set of chromosomal start/end coordinates for the principal isoform as defined by APPRIS.
  8. Differential expression in tissues (e.g. brain, epithelium, muscles ...): Find ~ 20 high-coverage experimental data sets, define the pipeline to download and process the sets into a common data structure, apply quantile normalization. Result: an expression vector for each gene.
  9. Find subcellular localization for each gene. Result: for each gene, the subcellular localizations it is associated with.
  10. Differential expression in disease states (e.g. diabetes, hypertension, RA, ...): Find ~ 20 high-coverage experimental data sets, define the pipeline to download and process the sets into a common data structure, apply quantile normalization. Result: an expression vector for each gene.
  11. Find PDB structures of human proteins. Possible data sources: Biomart? PDB? NCBI's MMDB? If structures overlap, report only the best representative. This is a set of feature annotations for each gene that includes start and stop coordinates. You must validate the coordinates, i.e. make sure that the annotated residue numbers map accurately to the actual sequence associated with the HGNC symbol.
  12. Obtain annotations via Ensembl/biomart. This is a set of feature annotations for each gene that includes start and stop coordinates. You must validate the coordinates, i.e. make sure that the annotated residue numbers map accurately to the actual sequence associated with the HGNC symbol.
  13. Tissue specific correlations of expression levels. Result: for each gene ... ??? Question: how are differentially spliced genes handled?
  14. For a selected set of MSigDB sets compute co-occurrence probability of genes: how often do they co-occur in the same MSigDB set? This is a network-type result. Output will be two HGNC symbols and one probability for each queried pair. Don't precompute all 1e9 possible pairs, but conceptualize this as a tool that queries a compact data structure with the probabilities, e.g. a boolean matrix with one set-annotation per column (for each gene TRUE if present in the set, FALSE if not present) that compares two row-vectors for each query.
  15. Gene phenotype associations. For each gene, the set of phenotypes it is associated with.
  16. For each gene, the set of GO terms it is annotated to.
  17. Tissue Data: tissue level expression vector. Result: for each gene ... ??? Question: how are differentially spliced genes handled?
  18. Process genetic interactions only. Result: edge list (Weighted? Directed?)
  19. ChIP-seq verified TF binding sites in gene promoter regions. Result: for each gene, the list of transcription factors that target its promoter region.
  20. Protein complexes. Result: for each gene, all complexes (if any) it has been annotated to.
  21. Weighted interaction graph. Result: edge list with weights.
  22. Protein complexes. Result: for each gene, all complexes (if any) it has been annotated to.
  23. Note: you must include line breaks with your JSON data! Data that has everything on one line will not be accepted.
  24. Note: late-penalties apply.
  25. Note: you must include line breaks with your JSON data! Data that has everything on one line will not be accepted.
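
The query-time design described in note 14 (a boolean membership matrix with one row per gene and one column per set, compared row against row for each query) can be sketched as follows. This is an illustrative Python sketch only, and the course deliverable itself is an R package. The set names and gene labels are toy placeholders, not MSigDB data, and the definition of the "probability" is an assumption: here it is the number of sets that contain both genes divided by the number of sets that contain at least one of them. Dividing by the total number of sets instead would be an equally defensible choice.

```python
def build_membership(gene_sets):
    """Build a boolean matrix as a dict: one row per gene, one column per set."""
    set_names = sorted(gene_sets)
    genes = sorted({g for members in gene_sets.values() for g in members})
    # matrix[gene][j] is True if the gene is a member of set_names[j]
    matrix = {g: [g in gene_sets[name] for name in set_names] for g in genes}
    return matrix, set_names


def cooccurrence(matrix, gene_a, gene_b):
    """Compare two row vectors: sets with both genes / sets with either gene."""
    row_a, row_b = matrix[gene_a], matrix[gene_b]
    both = sum(a and b for a, b in zip(row_a, row_b))
    either = sum(a or b for a, b in zip(row_a, row_b))
    return both / either if either else 0.0


# Toy input: hypothetical set names and gene labels, for illustration only.
toy_sets = {
    "SET_1": {"A", "B", "C"},
    "SET_2": {"A", "B"},
    "SET_3": {"B", "D"},
    "SET_4": {"C", "D"},
}

matrix, set_names = build_membership(toy_sets)
for pair in (("A", "B"), ("C", "D"), ("A", "D")):
    print(pair, round(cooccurrence(matrix, *pair), 3))
```

Only the membership matrix is stored. Each pair is evaluated when it is queried, so nothing is precomputed for the pairs that are never asked about.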