 
<div id="ABC">
<div style="padding:5px; border:4px solid #000000; background-color:#e19fa7; font-size:300%; font-weight:400; color: #000000; width:100%;">
 
Integrator Unit: Categorical Features

<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#e19fa7; font-size:30%; font-weight:200; color: #000000; ">
 
<b>Deliverables:</b><br />
<section begin=deliverables />
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-integrator" -->
*<b>Integrator unit</b>: Deliverables can be submitted for course marks. See below for details.
 
 
<section end=deliverables />
 
<!-- ============================  -->
 
<section begin=prerequisites />
 
<b>Prerequisites:</b><br />
<!-- included from "./data/ABC-unit_components.txt", section: "notes-prerequisites" -->
This unit builds on material covered in the following prerequisite units:<br />
 
 
<!-- *[[APB-Data-Preparation]] -->
 
*[[BIN-FUNC-Annotation|BIN-FUNC-Annotation (Function Annotation)]]
{{SLEEP}}

{{Smallvspace}}
  
 
=== Evaluation ===
<!-- included from "./components/ABC-INT-Categorical_features.components.txt", section: "evaluation" -->
 
 
Your progress and outcomes of this "Integrator Unit" will be one of the topics of the first oral exam for BCB420/JTB2020. That oral exam will be worth 20% of your term grade.<ref>Note: oral exams will focus on the content of Integrator Units, but will also cover material that leads up to it. All exams in this course are cumulative.</ref>

*Work through the tasks described below.
 
* Your task will involve submitting documentation on a sub-page of the [http://steipe.biochemistry.utoronto.ca/abc/students/index.php/BCB420_2018_Tasks '''Teams and Tasks page''']. (Details below.) This documentation will be jointly authored, and I expect every team member to be able to speak to all of it.

<!-- * Your task will involve submitting code to the [https://github.com/hyginn/zu ''zu'' R package]. Ensure that your team's submissions are complete and pass package checks with zero errors, zero warnings and zero notes.-->
* Schedule an oral exam (if you haven't done so already) by editing <span class="highlight">the [http://steipe.biochemistry.utoronto.ca/abc/students/index.php/Signup-Oral_exams_2017 '''signup page on the Student Wiki''']. You</span> must have signed up for an exam slot before 20:00 on the day before your exam.<ref>For clarification: You sign up for only '''one''' oral exam for February.</ref>
 
* Your work must be complete before 20:00 on the day before your exam.
  
  
 
== Contents ==
<!-- included from "./components/ABC-INT-Categorical_features.components.txt", section: "contents" -->
 
  
 
Most interesting data that describes function in living cells is not numerical but categorical. Moreover, it is data with large numbers of categories - "'''high cardinality categorical data'''". Such data is problematic for machine learning, for reasons of both principle and practicality. It is sparse, and its many dimensions of noise make overfitting a strong concern, in particular if we do not have very large numbers of examples in our training sets. It suffers from the "curse of dimensionality", i.e. in high-dimensional space all examples come to look similarly similar - or similarly different. And the data structures that hold such data may become impractically large, and model training may take impractically long. In this integrator unit we will download and prepare different types of categorical data, in order to explore later how to use feature engineering to optimize it for machine learning tasks.
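To see why cardinality matters, consider a toy one-hot encoding (invented gene and domain IDs - not taken from any of the sources below): every category becomes a column, and almost every entry is zero.

<source lang="R">
# Toy illustration: one-hot encoding high-cardinality categorical data.
genes   <- c("g1", "g2", "g3")          # invented gene IDs
domains <- sprintf("IPR%06d", 1:5)      # invented category IDs
set.seed(112358)
ann <- lapply(genes, function(g) sample(domains, 2))  # two categories per gene
# one column per category, one row per gene
oneHot <- t(vapply(ann, function(x) as.integer(domains %in% x),
                   integer(length(domains))))
rownames(oneHot) <- genes
colnames(oneHot) <- domains
oneHot                         # 3 x 5 here; ~20,000 x many thousands in practice
sum(oneHot) / length(oneHot)   # fraction of non-zero entries: already only 0.4
</source>

With a real annotation source the matrix would have tens of thousands of mostly-zero columns - the sparsity referred to above.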
  
 
Your tasks as a team are:
* <strike>to choose one of the datasets of interest for human systems biology specified below</strike>;
 
* to download the source data;
 
* to determine the necessary steps for processing and to distribute the necessary tasks among the team members;
  
  
;<strike>To begin, you need to choose - as a team - one of the following five data sources of categorical functional data. Rank the units in terms of preference (or write "random"). Your team spokesperson should eMail me your team's preferences. I will assign the choices.</strike> (Done)
  
 
{{Vspace}}
  
==='''(1) Graph data mining on STRING (Aaardvark)'''===
 
{{Smallvspace}}
The [https://string-db.org/ '''STRING database'''] publishes a network of gene nodes and edges that represent functional interactions: these integrate various experimental observations and computational inference, such as protein-protein interactions and literature data mining - or they can be decomposed by individual categories of evidence. A summary score is given as the probability that an edge is functionally relevant. Network data mining for ML features is a very interesting topic in and of itself; here we will simply take the neighbours of a gene as categorical features that describe its environment. '''Task:''' using a suitable score cutoff, produce a table of STRING neighbours for each human gene that is defined in our HUGO symbol table. Upload the annotation for the <code>miniHUGOsymbols</code> list to your documentation.
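A minimal sketch of the aggregation step (the <code>combined_score</code> column name follows the STRING links download; the edge table here is a toy stand-in, with protein IDs assumed to be already mapped to HUGO symbols):

<source lang="R">
# Sketch: aggregate STRING neighbours per gene, above a score cutoff.
edges <- data.frame(protein1 = c("TNFRSF4", "TNFRSF4", "CD40LG"),
                    protein2 = c("TRAF2",   "TNFSF4",  "CD40"),
                    combined_score = c(970, 985, 450),
                    stringsAsFactors = FALSE)
CUTOFF <- 900                    # STRING scores are probabilities * 1000
sel <- edges$combined_score >= CUTOFF
neighbours <- tapply(edges$protein2[sel], edges$protein1[sel],
                     paste, collapse = "|")
neighbours["TNFRSF4"]            # "TRAF2|TNFSF4"
</source>

The low-scoring CD40LG edge is dropped by the cutoff; with the real download the same three lines scale to the full edge list.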
  
 
Example row:
 
{{Smallvspace}}
  
==='''(2) GO and GOA (Chihuahua)'''===
 
{{Smallvspace}}
[https://www.ebi.ac.uk/GOA Gene Ontology Annotations ('''GOA''')] provide the cornerstone of functional annotations for genes. Build a pipeline to annotate each HUGO symbol with the Gene Ontology (GO) terms found in the relevant GOA tables. '''Task:''' produce a table of GO terms annotated for each human gene defined in our HUGO symbol table. Do this separately for the three GO ontologies. Upload the annotation for the <code>miniHUGOsymbols</code> list to your documentation.
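A sketch of the per-ontology split (in GAF files the GO ID is in column 5 and the aspect - F, P, or C - in column 9; the rows below are toy stand-ins already reduced to three columns):

<source lang="R">
# Sketch: annotate symbols with GO IDs, separately per ontology aspect.
goa <- data.frame(symbol = c("TNFRSF4", "TNFRSF4", "TNFRSF4"),
                  goID   = c("GO:0005031", "GO:0007165", "GO:0009897"),
                  aspect = c("F", "P", "C"),
                  stringsAsFactors = FALSE)
BP <- goa[goa$aspect == "P", ]   # one subset per ontology: MF, BP, CC
collapseGO <- function(df) tapply(df$goID, df$symbol, paste, collapse = "|")
collapseGO(BP)["TNFRSF4"]        # "GO:0007165"
</source>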
  
 
Example header and row:
 
{{Smallvspace}}
  
==='''(3) MSigDB sets (Cricket)'''===
 
{{Smallvspace}}
The Broad Institute hosts an expert-curated database of gene sets: [http://software.broadinstitute.org/gsea/msigdb/collections.jsp '''MSigDB''' - the Molecular Signature Database]. '''Task:''' download the data and build a pipeline to annotate all HUGO gene symbols with all of the gene sets that contain them. Annotate the <code>miniHUGOsymbols</code> list and upload it to your documentation; also test how your pipeline will scale to the full dataset of more than 17,000 gene sets.
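MSigDB distributes sets in GMT format: one tab-separated line per set - set name, a description/URL field, then the member symbols. A sketch of inverting set-to-genes into gene-to-sets (toy lines, invented set IDs and URLs):

<source lang="R">
# Sketch: invert GMT gene sets (set -> genes) into gene -> sets.
# gmtLines stands in for readLines() on the actual GMT download.
gmtLines <- c("M0001\thttp://example.org/M0001\tTNFRSF4\tTRAF2",
              "M0002\thttp://example.org/M0002\tTNFRSF4\tCD40")
fields  <- strsplit(gmtLines, "\t")
setIDs  <- vapply(fields, `[`, character(1), 1)    # first field: set name
members <- lapply(fields, function(x) x[-(1:2)])   # drop name and URL
gene2set <- tapply(rep(setIDs, lengths(members)), unlist(members),
                   paste, collapse = "|")
gene2set["TNFRSF4"]   # "M0001|M0002"
</source>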
  
 
Example row:
 
{{Smallvspace}}
  
==='''(4) Enrichment (Owl)'''===
 
{{Smallvspace}}
A common aspect of systems biology wet-lab experiments is that they produce a ''set-of-genes'' result: genes that co-precipitate, genes that are co-regulated, genes that are phosphorylated by the same kinase, etc. Enrichment algorithms ask what such genes have in common, i.e. which features appear more frequently in the set than one would expect in a randomly chosen set of genes. Any type of annotation can be chosen, but existing packages usually use GO annotations. Candidate tools include [http://bioconductor.org/packages/release/bioc/html/topGO.html topGO], and other tools in the [http://bioconductor.org/packages/release/BiocViews.html#___GeneSetEnrichment '''Gene Set Enrichment''' biocView]<ref>Note: GSEA (Gene Set Enrichment Analysis) is '''not''' the same as gene feature enrichment.</ref>. '''Task:''' build a pipeline that takes as input a set of HUGO symbols - such as the sets derived from the MSigDB above - and outputs an annotation of enriched GO terms for each of the set elements. Develop this for the <code>miniHUGOsymbols</code> list and a few other gene sets, and upload the results for the <code>miniHUGOsymbols</code> list to your documentation.
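At the core of such tools is a contingency-table test: is an annotation more frequent in the set than expected from the background? A sketch with invented counts (real pipelines such as topGO additionally handle the GO graph structure and multiple-testing correction):

<source lang="R">
# Sketch: is an annotation enriched in a gene set, relative to background?
# All counts are invented for illustration.
nSet           <- 20     # genes in the set of interest
nSetAnn        <- 8      # ... of which carry the annotation
nBackground    <- 20000  # genes in the background
nBackgroundAnn <- 300    # ... of which carry the annotation
cont <- matrix(c(nSetAnn,                    # in set,     annotated
                 nSet - nSetAnn,             # in set,     not annotated
                 nBackgroundAnn - nSetAnn,   # not in set, annotated
                 nBackground - nSet - (nBackgroundAnn - nSetAnn)),  # neither
               nrow = 2)
fisher.test(cont, alternative = "greater")$p.value  # very small: enriched
</source>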
  
 
Example row:
 
{{Smallvspace}}
  
==='''(5) InterPro (Python)'''===
 
{{Smallvspace}}
[https://www.ebi.ac.uk/interpro/protein/P43489 '''InterPro''' provides] rich sequence and domain annotation - and the domain composition of a protein is a categorical feature set. [https://www.ebi.ac.uk/interpro/download.html '''Download'''] of InterPro data is available. '''Task:''' produce a table of InterPro domains in each human gene as defined by our HUGO symbol table. Upload the annotation for the <code>miniHUGOsymbols</code> list to your documentation. Example row:
  
 
  TNFRSF4 IPR034022|IPR001368|IPR001368
 
{{Vspace}}
  
==Process Details==

{{Smallvspace}}

Below are the general details for the tasks as they apply to four of the five alternatives. "Enrichment" is different because it does not produce a set of category annotations, but a pipeline to produce data for such a set. Apply the details in spirit.
  
{{Vspace}}
  
=== Study your Data Source ===
  
 
{{Smallvspace}}
 
{{task|1=
  
{{Smallvspace}}
 
  
 
<div class="mw-collapsible mw-collapsed" data-expandtext="More&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%; border: solid 1px #BBBBBB; padding: 10px; spacing: 10px;">
;Navigate to the source database and clarify the format and semantics of the data ...
 
<div class="mw-collapsible-content" style="padding:10px;">
For example: you need to understand what the data is, and how it can annotate individual genes. You need to understand which gene identifiers are used and how they map to HUGO symbols. You need to know what download formats are available, and what the downloads contain. You also need to understand how the data was curated, if and when it is being updated, what its copyright status is, and what the reference citation is. <small>(This list is not necessarily exhaustive.)</small>
 
</div>
</div>
 
{{Smallvspace}}
  
}}
{{Vspace}}

=== Download your Data ===

{{Smallvspace}}

{{task|1=

{{Smallvspace}}

<div class="mw-collapsible mw-collapsed" data-expandtext="More&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%; border: solid 1px #BBBBBB; padding: 10px; spacing: 10px;">
;In your process script, develop code to download the source data; then assess what you have and load it into R ...
 
<div class="mw-collapsible-content" style="padding:10px;">
* If possible, identify both a URL for the most recent version of the downloadable data (for future updates), and a stable URL for the exact version you have used (best practice for reproducible research).
* Some of the datasets require registration: in that case it may be better to download manually, rather than pass credentials in a script. If you feel you '''must''' pass login credentials in a download script, ask me about best practice for that.
* Once your data is downloaded, uncompress and untar it. Identify the structure of the data. Usually this will be a tsv or csv plain-text file, possibly with column headers, possibly with meta-information. Identify the columns you need.
* Read your data into R. You need to know whether your data has headers, and how many extra lines need to be skipped. I highly recommend the <code>readr</code> package functions - they are very flexible and much faster than R's standard functions, and they allow you to read only the columns you actually need. The only downside (if indeed it is one) is that they produce "tibbles", not data frames; you may need to convert your results later.

{{Smallvspace}}

<small>(You are encouraged to discuss questions about these processes on the mailing list and to share experiences!)</small>
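A sketch of the reading step with <code>readr</code> (the input file is a stand-in written on the fly; adjust <code>skip</code>, the column names, and the column types to your actual source's layout):

<source lang="R">
# Sketch: read only the needed columns of a (placeholder) download.
library(readr)   # install.packages("readr") if needed
# write a tiny stand-in file so the sketch is self-contained
tmp <- tempfile(fileext = ".tsv")
writeLines(c("# meta information line",
             "TNFRSF4\tIPR034022\t900",
             "CD40\tIPR001368\t850"), tmp)
raw <- read_tsv(tmp, skip = 1, col_names = c("symbol", "id", "score"),
                col_types = cols(symbol = col_character(),
                                 id     = col_character(),
                                 score  = col_integer()))
raw <- as.data.frame(raw)   # convert the tibble if a data frame is needed
str(raw)
</source>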
 
</div>
</div>
 
{{Smallvspace}}
  
}}
{{Vspace}}

=== Map identifiers ===

{{Smallvspace}}

{{task|1=

{{Smallvspace}}

<div class="mw-collapsible mw-collapsed" data-expandtext="More&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%; border: solid 1px #BBBBBB; padding: 10px; spacing: 10px;">
;Develop code to map source IDs to HUGO symbols ...
 
<div class="mw-collapsible-content" style="padding:10px;">
There are two points to consider:
* Map source data IDs to HUGO symbols. If your source data does not include HUGO symbols, you may need to map ... e.g. STRING uses Ensembl IDs, but they also have a table of mappings for download.
* Once you have HUGO symbols, you may need to update them, since they may contain previous symbols or aliases. This is the same process that you went through when you worked with your expression datasets.

{{Smallvspace}}

<small>(Contact me in case this process does not seem completely straightforward and you need advice.)</small>
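A minimal sketch of both mapping steps with toy lookup tables (the real tables are the source's ID-mapping download and the HUGO reference data; OX40 is a previous alias of TNFRSF4):

<source lang="R">
# Sketch: source IDs -> symbols, then update outdated symbols.
idMap <- data.frame(sourceID = c("ID001", "ID002", "ID003"),  # toy IDs
                    symbol   = c("TNFRSF4", "OX40", "CD40"),
                    stringsAsFactors = FALSE)
prevToCurrent <- c(OX40 = "TNFRSF4")       # alias -> current symbol
myIDs <- c("ID002", "ID003", "ID999")
sym <- idMap$symbol[match(myIDs, idMap$sourceID)]
# update aliases / previous symbols where a mapping exists
upd <- prevToCurrent[sym]
sym[!is.na(upd)] <- upd[!is.na(upd)]
sym   # "TNFRSF4" "CD40" NA - the NA flags an unmapped source ID
</source>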
 
</div>
</div>
 
{{Smallvspace}}
  
}}
{{Vspace}}

=== Define your data structure ===

{{Smallvspace}}

{{task|1=

{{Smallvspace}}

<div class="mw-collapsible mw-collapsed" data-expandtext="More&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%; border: solid 1px #BBBBBB; padding: 10px; spacing: 10px;">
;Define the data structure to hold the category data ...
 
<div class="mw-collapsible-content" style="padding:10px;">
* Your data structure should be a data frame with two columns of character data. <small>(The "GO and GOA" structure will have four columns.)</small>
* The first column should contain the HUGO symbols and be named <code>symbol</code>.
* The rownames should be set to the HUGO symbols. You can therefore access specific rows with syntax like<br>
:<code>sel <- which(cats$symbol == "TNFRSF4")</code> or <br>
:<code>cats["TNFRSF4", "signatures"]</code>, and the latter is much faster.
* The category column should be named <code>neighbours</code> (STRING), <code>MF</code>, <code>BP</code>, and <code>CC</code> (GOA), <code>signatures</code> (MSigDB), or <code>domains</code> (InterPro).
* Each element of the category column should be a string of category identifiers separated with a "{{!}}" (pipe) character. Consider the following code:
<source lang="R">
myCats <- c("M1739", "M5947", "M18255", "M13664", "M1644", "M4248")
(x <- paste0(myCats, collapse = "|"))
strsplit(x, "\\|")  # Note: this produces a list - use unlist() or [[1]] if required
</source>

{{Smallvspace}}
 
</div>
</div>

{{Smallvspace}}

}}
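Putting the conventions together, a toy version of the final structure (MSigDB flavour; the set IDs are invented):

<source lang="R">
# Sketch: the target data structure, with rowname-based access.
cats <- data.frame(symbol     = c("TNFRSF4", "CD40"),
                   signatures = c("M0001|M0002", "M0002"),
                   stringsAsFactors = FALSE)
rownames(cats) <- cats$symbol
cats["TNFRSF4", "signatures"]                           # "M0001|M0002"
unlist(strsplit(cats["TNFRSF4", "signatures"], "\\|"))  # back to a vector
</source>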
  
 
{{Vspace}}
  
=== Build your category table ===

{{Smallvspace}}

{{task|1=

{{Smallvspace}}

Building the actual table is going to be specific for each dataset. Try this first with a small number of entries, to get a sense of how long the full dataset will take. If it appears to take longer than an hour or so, contact me: we can either optimize the code, split the task, or defer it. '''Minimally, you must annotate the 20 genes in the <code>miniHUGOsymbols</code> set.'''
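One way to get that sense of scale is to time a small sample and extrapolate (<code>annotate()</code> is a placeholder for your actual per-gene annotation step):

<source lang="R">
# Sketch: estimate the full runtime from a small sample.
annotate <- function(sym) { Sys.sleep(0.001); sym }   # stand-in function
sample100 <- sprintf("gene%03d", 1:100)               # toy symbols
t100 <- system.time(invisible(lapply(sample100, annotate)))["elapsed"]
t100 * 20000 / 100   # projected seconds for ~20,000 symbols
</source>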
  
 
{{Smallvspace}}
 
{{Vspace}}
  
=== Assess your results ===
  
 
{{Smallvspace}}
 
{{task|1=
  
{{Smallvspace}}

<div class="mw-collapsible mw-collapsed" data-expandtext="More&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%; border: solid 1px #BBBBBB; padding: 10px; spacing: 10px;">
;Develop code to compile annotation statistics ...
<div class="mw-collapsible-content" style="padding:10px;">
Minimally, you should report:
* Number of source data items;
* Number of unique genes in the source data;
* Number of source genes that could not be mapped to HUGO symbols;
* Number and percentage of HUGO symbols annotated;
* Number of unique annotations (i.e. the cardinality of the categories).

{{Smallvspace}}
<small>(This list is not exhaustive and I expect that you will be able to define additional, informative statistics.)</small>
</div>
</div>
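A sketch of how such statistics might be computed from a toy mapping result (all names and values are stand-ins):

<source lang="R">
# Sketch: annotation coverage statistics (toy values throughout).
sourceSymbols <- c("TNFRSF4", "OX40", "CD40", "CD40")  # as found in source
mapped  <- c("TNFRSF4", "TNFRSF4", NA, "CD40")         # after HUGO mapping
allHUGO <- sprintf("gene%05d", 1:20000)                # reference symbols
nItems       <- length(sourceSymbols)                  # source data items
nUnique      <- length(unique(sourceSymbols))          # unique source genes
nUnmapped    <- sum(is.na(mapped))                     # could not be mapped
nAnnotated   <- length(unique(mapped[!is.na(mapped)]))
pctAnnotated <- 100 * nAnnotated / length(allHUGO)
c(nItems, nUnique, nUnmapped, nAnnotated, pctAnnotated)  # 4 3 1 2 0.01
</source>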
 
  
 
{{Smallvspace}}
  
* Can you validate that your results are correct and complete?
 
  
 
{{Smallvspace}}
  
<div class="note">Discuss with me how to proceed if a full annotation can't be computed with reasonable resources.</div>
 
  
 
{{Smallvspace}}
  
 
{{Vspace}}
  
=== Refactor your script ===

{{Smallvspace}}

{{task|1=
  
 
{{Smallvspace}}
  
<div class="mw-collapsible mw-collapsed" data-expandtext="More&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%; border: solid 1px #BBBBBB; padding: 10px; spacing: 10px;">
;When you are done, review and refactor your script ...
<div class="mw-collapsible-content" style="padding:10px;">
* Does the script include all the required information items suggested by the script template in the ''zu'' repository?
* Does your code conform to the [[RPR-Coding_style|'''Coding style requirements for this course''']]?
* Is the script fully commented?
* Did you load all required libraries?
* Did you comment and justify all parameters?
* Are there any "magic numbers" left in the code?
* Is the code parsimonious in its activities and efficient in its program flow (e.g. it doesn't grow large data structures dynamically, takes advantage of vectorized functions wherever possible, and doesn't perform activities inside of loops that don't depend on the loop variable)?
  
* Next, transform the data with QN. The process is motivated and described in Taroni (2017), but once again there may be parameters to respect and we need a class-consensus on how to do this correctly. Coordinate as above.
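For intuition, the QN transformation itself can be sketched in a few lines of base R. This is an illustration under stated assumptions (ties broken by order of occurrence), not the agreed class implementation; in practice a vetted implementation such as <tt>preprocessCore::normalize.quantiles()</tt> from Bioconductor may be what the consensus settles on.

```r
# Quantile normalization, minimal sketch: replace each column's values with
# the mean of the sorted columns at the corresponding rank, so that all
# columns end up with the same distribution. ties.method = "first" is an
# illustrative choice; the class consensus may differ.
quantileNormalize <- function(m) {
  referenceDistribution <- rowMeans(apply(m, 2, sort))  # mean of sorted columns
  ranks <- apply(m, 2, rank, ties.method = "first")     # rank within each column
  qn <- apply(ranks, 2, function(r) referenceDistribution[r])
  dimnames(qn) <- dimnames(m)
  return(qn)
}

m <- matrix(c(5, 2, 3,
              4, 1, 6), ncol = 2)
quantileNormalize(m)   # both columns now share the distribution 1.5, 3.5, 5.5
```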
+
{{Smallvspace}}
 +
<small>(This list is '''not''' exhaustive.)</small>
 +
</div>
 +
</div>
  
* The final result of your script needs to be a dataframe with two numeric columns, named <tt>&lt;GSET-ID&gt;.ctrl</tt> and <tt>&lt;GSET-ID&gt;.test</tt>, all rows of the HUGO symbols must exist in the exact order of the HUGO symbol reference vector, and the HUGO symbols must be defined as rownames of the dataframe. I expect that you have actually produced such a dataset and have it available on your computer for reference. '''Do not upload this data to Github.'''
+
{{Smallvspace}}
  
* If your script does not produce a data set according to these exact specifications, this '''must''' be clearly stated in the script.
+
<div class="note>
 +
For the evaluation of your results: Form matters!
 +
</div>
  
 
{{Smallvspace}}
 
{{Smallvspace}}

{{Vspace}}

=== Document your work ===

{{Smallvspace}}

The steps above conclude the actual data preparation. Be prepared to answer the following questions:

{{task|1=

* What are the control and test conditions of the dataset?

* Why is the dataset of interest to our systems assessment task?

* Were there expression values that were not unique for specific genes? How did you handle these?

* Were there expression values that could not be mapped to current HUGO symbols?

* How many outliers were removed, and how many datapoints were imputed?

* How did you handle replicates?

* What is the final coverage of your dataset?

{{Smallvspace}}

* Make sure your script contains the complete workflow, is fully commented, and contains all essential elements suggested by the script template<ref>Refer to the script template <tt>inst/extdata/scripts/scriptTemplate.R</tt> in the <tt>zu</tt> project repository.</ref>. This is a collaborative project - form matters.

{{Smallvspace}}

<div class="mw-collapsible mw-collapsed" data-expandtext="More&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%; border: solid 1px #BBBBBB; padding: 10px; spacing: 10px;">

;Every category task has a subpage for documentation linked from the teams and tasks page ...

<div class="mw-collapsible-content" style="padding:10px;">

Documentation needs to be brief and complete, and must describe the final result unambiguously. You don't need to document your activities here - this goes into your respective Course Journals. Describe your results, include your <code>miniHUGOsymbols</code> annotations, and include your complete script (with proper GeSHi highlighting). An [[FND-CSC-SPN|'''SPN diagram''']] may help to make the dataflow clear.

Form matters.

</div>

</div>

}}

{{Vspace}}

== Notes ==

{{Smallvspace}}

<references />

{{Vspace}}

----

{{Vspace}}

<b>If in doubt, ask!</b> If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.

----

{{Vspace}}
 
<div class="about">

<b>Author:</b><br />
:Boris Steipe &lt;boris.steipe@utoronto.ca&gt;

<b>Created:</b><br />
:2018-02-01

<b>Modified:</b><br />
:2018-02-03

<b>Version:</b><br />
:1.2.1

<b>Version history:</b><br />

*1.2.1 Sleeping ...
*1.2 Tasks are assigned to teams
*1.1 Added task details
*1.0 Category tasks defined
*0.1 New unit under development

</div>

{{CC-BY}}

[[Category:ABC-units]]

{{INTEGRATOR}}

{{SLEEP}}

{{EVAL}}

</div>

<!-- [END] -->

Latest revision as of 01:44, 23 September 2020

Integrator Unit: Categorical Features

(Integrator unit: collect categorical features for human genes)


 


Abstract:

This page integrates material from the learning units and defines a task for defining and downloading categorical feature sets for human genes.


Deliverables:

  • Integrator unit: Deliverables can be submitted for course marks. See below for details.

  Prerequisites:
    This unit builds on material covered in the following prerequisite units:


     


    This page is not currently being maintained since it is not part of active learning sections.


     



     


    Evaluation

    Your progress on and outcomes of this "Integrator Unit" will be one of the topics of the first oral exam for BCB420/JTB2020. That oral exam will be worth 20% of your term grade[1].

    • Work through the tasks described below.
    • Note that there are several tasks that need to be coordinated with your teammates and classmates. This is necessary to ensure the feature sets can be merged in the second phase of the course. Be sure to begin this coordination process in time.
    • Remember to document your work in your journal concurrently with your progress. Journal entries that are uploaded in bulk at the end of your work will not be considered evidence of ongoing engagement. Note that this is a team task, and your contribution to the task must be clearly documented in your journal for evaluation.
    • Your task will involve submitting documentation on a sub-page of the Teams and Tasks page. (Details below) This documentation will be jointly authored and I expect every team member to be able to speak to all of it.
    • Schedule an oral exam (if you haven't done so already) by editing the signup page on the Student Wiki. You must have signed up for an exam slot before 20:00 on the day before your exam.[2]
    • Your work must be complete before 20:00 on the day before your exam.


     

    Contents

    Most interesting data that describes function in living cells is not numerical but categorical. Moreover, it is data with large numbers of categories - "high-cardinality categorical data". Such data is problematic for machine learning for reasons of both principle and practicality. It is sparse, and its many dimensions of noise make overfitting a strong concern, in particular if we do not have very large numbers of examples in our training sets. It suffers from the "curse of dimensionality", i.e. in high-dimensional spaces all examples come to look equally similar - or equally different. And the datastructures that hold such data may become impractically large, and model training may take impractically long. In this integrator unit we will download and prepare different types of categorical data, to explore later how feature engineering can optimize it for machine learning tasks.

    Your tasks as a team are:

    • to choose one of the datasets of interest for human systems biology specified below;
    • to download the source data;
    • to determine the necessary steps for processing and to distribute the necessary tasks among the team members;
    • to format the data for use as categorical features as specified below;
    • to document what you have achieved, including your scripts and tools and one annotated subset of 20 genes.


    To begin, you need to choose - as a team - one of the following five data sources of categorical functional data. Rank the units in terms of preference (or write "random"). Your team spokesperson should email me your team's preferences. I will assign the choices. (Done)


     

    (1) Graph data mining on STRING (Aaardvark)

     

    The STRING database publishes a network of gene nodes and edges that represent functional interactions: these integrate various experimental observations and computational inference, such as protein-protein interactions and literature data mining - or they can be decomposed by individual categories of evidence. A summary score is given as a probability of an edge to be functionally relevant. Network data mining for ML features is a very interesting topic in and of itself; here we will simply take the neighbours of a gene as categorical features that describe its environment. Task: using a suitable score cutoff, produce a table of STRING neighbours for each human gene that is defined in our HUGO symbol table. Upload the annotation for the miniHUGOsymbols list to your documentation.

    Example row:

    TNFRSF4 TNFRSF9|CTLA4|TNFSF4|TRAF5|IL2|IL2RA|FOXP3
    


     

    (2) GO and GOA (Chihuahua)

     

    Gene Ontology Annotations (GOA) provide the cornerstone of functional annotations for genes. Build a pipeline to annotate each HUGO symbol with the Gene Ontology (GO) terms found in the relevant GOA tables. Task: produce a table of GO terms annotated for each human gene defined in our HUGO symbol table. Do this separately for the three GO ontologies. Upload the annotation for the miniHUGOsymbols list to your documentation.

    Example header and row:

    symbol MF  BP  CC
    TNFRSF4 GO:0001618|GO:0005031|GO:0005515 GO:0006954|GO:0006955|GO:0007275  GO:0005886|GO:0005887|GO:0009986
    


     

    (3) MSigDB sets (Cricket)

     

    The Broad Institute hosts an expert-curated database of gene sets: MSigDB - the Molecular Signature Database. Task: download the data and build a pipeline to annotate all HUGO gene symbols with all of the gene sets that contain them. Annotate the miniHUGOsymbols list and upload it to your documentation; also test how your pipeline will scale to the full dataset of more than 17,000 gene sets.

    Example row:

    TNFRSF4 M1739|M5947|M18255|M13664|M1644|M4248
    


     

    (4) Enrichment (Owl)

     

    A common aspect of systems biology wet-lab experiments is that they produce a set-of-genes result: genes that co-precipitate, genes that are co-regulated, genes that are phosphorylated by the same kinase, and so on. Enrichment algorithms ask: what do such genes have in common, i.e. what feature appears more frequently in the set than one would expect in a randomly chosen set of genes. Any type of annotation can be chosen, but existing packages usually use GO annotations. Candidate tools include topGO, and other tools in the Gene Set Enrichment biocView[3]. Task: build a pipeline that takes as input a set of HUGO symbols - such as the sets derived from the MSigDB above - and outputs an annotation of enriched GO terms for each of the set elements. Develop this for the miniHUGOsymbols list and a few other gene sets and upload the results for the miniHUGOsymbols list to your documentation.

    Example row:

    TNFRSF4 GO:0097190|GO:0051024|GO:0033209|GO:0032496
    


     

    (5) InterPro (Python)

     

    InterPro provides rich sequence and domain annotation - and the domain composition of a protein is a categorical feature set. Download of InterPro data is available. Task: produce a table of InterPro domains in each human gene as defined by our HUGO symbol table. Upload the annotation for the miniHUGOsymbols list to your documentation. Example row:

    TNFRSF4 IPR034022|IPR001368|IPR001368
    


     
    Do not upload your full datasets to the Github repository!


     

    Process Details

     

    Below are the general details for the tasks as they apply to four of the five alternatives. "Enrichment" is different because it does not produce a set of category annotations, but a pipeline to produce data for such a set. Apply the details in spirit.


     

    Study your Data Source

     

    Task:

     
    Navigate to the source database and clarify the format and semantics of the data ...

    For example: you need to understand what the data is, and how it can annotate individual genes. You need to understand which gene identifiers are used and how they map to the HUGO symbols. You need to know what download formats are available, and what the downloads contain. You also need to understand how the data was curated, if and when it is being updated, what its copyright status is, and what the reference citation is. (This list is not necessarily exhaustive.)


     


     


     

    Download your Data

     

    Task:

     
    In your process script, develop code to download the source data; then assess what you have and load it into R ...
    • If possible, identify both a URL for the most recent version of the downloadable data (for future updates), and a stable URL for the exact version you have used (best practice for reproducible research);
    • Some of the datasets require registration: in that case it may be better to manually download, rather than pass credentials in a script. If you feel you must pass login credentials in a download script, ask me about best-practice for that.
    • Once your data is downloaded, uncompress and untar it. Identify the structure of the data. Usually this will be a tsv or csv plain text file, possibly with column headers, possibly with meta information. Identify the columns you need.
    • Read your data into R. You need to know whether your data has headers, and how many extra lines need to be skipped. I highly recommend using the readr package functions - they are very flexible and much faster than R's standard functions. readr functions allow you to only read the columns you actually need. The only downside (if indeed that is one) is that they produce "tibbles", not data frames. You may need to convert your results later.
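The download-and-read step might look like the following sketch. The URL, file name, and column names are placeholders for whatever your source actually provides; `read_tsv()` and `cols_only()` are real readr functions, but check the column names and types against your file.

```r
# Sketch of the download-and-read step. In your script you would first fetch
# the source file, e.g.:
#   download.file(dataURL, destfile = localFile, mode = "wb")
# where dataURL is the stable, versioned URL of your data source.
# Here we fake the download with a tiny two-column TSV so the read step runs.
library(readr)

localFile <- file.path(tempdir(), "source_data.tsv")
writeLines(c("geneID\tcategory\textra",
             "TNFRSF4\tM1739\tx",
             "CTLA4\tM4248\ty"), localFile)

# Read only the columns we actually need; the column names here are
# hypothetical. read_tsv() also reads .gz-compressed files directly, and
# skip = n drops meta-information lines if your file has them.
rawData <- read_tsv(localFile,
                    col_types = cols_only(geneID   = col_character(),
                                          category = col_character()))

rawData <- as.data.frame(rawData)   # readr returns a tibble; convert if needed
```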


     

    (You are encouraged to discuss questions about these processes on the mailing list and share experiences!)


     


     

    Map identifiers

     

    Task:

     
    Develop code to map source IDs to HUGO symbols ...

    There are two points to consider:

    • Map source data IDs to HUGO symbols. If your source data does not include HUGO symbols, you may need to map ... e.g. STRING uses Ensembl IDs, but they also have a table of mappings for download.
    • Once you have HUGO symbols, you may need to update them as they may contain previous symbols or aliases. This is the same process that you have gone through when you worked with your expression data sets.
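The lookup itself is conveniently done with a named vector built from a two-column mapping table. The table and IDs below are made up for illustration; STRING, for example, distributes a real mapping table with its downloads.

```r
# Sketch: map source IDs to HUGO symbols with a named lookup vector.
# "idMap" is a hypothetical mapping table (the source IDs are invented).
idMap <- data.frame(sourceID = c("ID0001", "ID0002", "ID0003"),
                    symbol   = c("TNFRSF4", "CTLA4", "IL2"),
                    stringsAsFactors = FALSE)

src2sym <- setNames(idMap$symbol, idMap$sourceID)  # named vector for fast lookup

sourceIDs <- c("ID0002", "ID9999")       # one mappable ID, one that is not
(symbols  <- src2sym[sourceIDs])         # unmapped IDs come back as NA

# Remember to subsequently update outdated symbols and aliases to current
# HUGO symbols, as you did for your expression data sets.
```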


     

    (Contact me in case this process does not seem completely straightforward and you need advice.)


     


     

    Define your datastructure

     

    Task:

     
    Define the datastructure to hold the category data
    • Your datastructure should be a dataframe with two columns of character data. (The "GO and GOA" structure will have four columns.)
    • The first column should contain the HUGO symbols and be named symbol.
    • The rownames should also be set to the HUGO symbols. Then you can access specific rows with syntax like
    sel <- which(cats$symbol == "TNFRSF4") or
    cats["TNFRSF4", "signatures"], and the latter is much faster.
    • The category column should be named neighbours (STRING), MF, BP, and CC (GOA), signatures (MSigDB), or domains (InterPro).
    • Each element of the category column should be a string of category identifiers separated with a "|" (pipe) character. Consider the following code:
    myCats <- c("M1739", "M5947", "M18255", "M13664", "M1644", "M4248")
    (x <- paste0(myCats, collapse = "|"))
    strsplit(x, "\\|")   # Note: this produces a list - use unlist() or [[1]] if required
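Putting these specifications together, the datastructure could be initialized as below. The symbol and signature IDs are taken from the MSigDB example row above; use the appropriate column name(s) for your own task.

```r
# Sketch: assemble the category dataframe as specified above. "signatures"
# is the MSigDB column name - substitute neighbours / MF, BP, CC / domains
# for the other tasks. Values are from the example rows in this unit.
cats <- data.frame(symbol     = "TNFRSF4",
                   signatures = "M1739|M5947|M18255|M13664|M1644|M4248",
                   stringsAsFactors = FALSE)
rownames(cats) <- cats$symbol

cats["TNFRSF4", "signatures"]                           # row access by symbol
unlist(strsplit(cats["TNFRSF4", "signatures"], "\\|"))  # back to a vector
```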


     


     


     

    Build your category table

     

    Task:

     

    Building the actual table is going to be specific for each set. Try this first with a small number of entries to get a sense of how long the full dataset will take. If it appears to take longer than an hour or so, contact me. We can either optimize the code, or split the task, or defer it. Minimally, you must annotate the 20 genes in the miniHUGOsymbols set.
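Whatever the source, the core aggregation step usually amounts to collapsing all category IDs observed for a gene into one pipe-separated string. A base-R sketch, with hypothetical column names and toy values:

```r
# Sketch: collapse a long table of (symbol, categoryID) pairs into one
# pipe-separated annotation string per symbol. Column names and values
# are illustrative, not from any real download.
longData <- data.frame(symbol     = c("TNFRSF4", "TNFRSF4", "CTLA4"),
                       categoryID = c("M1739",   "M5947",   "M4248"),
                       stringsAsFactors = FALSE)

collapsed <- tapply(longData$categoryID,
                    longData$symbol,
                    paste0, collapse = "|")

collapsed["TNFRSF4"]   # "M1739|M5947"
```

Because `tapply()` works on the whole table at once, this scales much better than growing strings row by row inside a loop.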


     


     

    Assess your results

     

    Task:

     
    Develop code to compile annotation statistics ...

    Minimally you should report -

    • Number of source data items;
    • Number of unique genes in the source data;
    • Number of source genes that could not be mapped to HUGO symbols;
    • Number and percentage of HUGO symbols annotated;
    • Number of unique annotations (i.e. cardinality of categories).

    (This list is not exhaustive and I expect that you will be able to define additional, informative statistics.)
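Most of these statistics fall out of simple vectorized counts. A sketch on a toy long-format table (all names and values are hypothetical stand-ins for your actual source data and reference vector):

```r
# Sketch: annotation statistics from a toy long-format table of
# (sourceID, symbol, categoryID). All names and values are illustrative.
longData <- data.frame(sourceID   = c("ID0001", "ID0001", "ID0002", "ID9999"),
                       symbol     = c("TNFRSF4", "TNFRSF4", "CTLA4", NA),
                       categoryID = c("M1739", "M5947", "M4248", "M0001"),
                       stringsAsFactors = FALSE)
HUGOsymbols <- c("TNFRSF4", "CTLA4", "IL2")   # stand-in for the reference vector

nItems       <- nrow(longData)                        # source data items
nGenes       <- length(unique(longData$sourceID))     # unique source genes
nUnmapped    <- length(unique(longData$sourceID[is.na(longData$symbol)]))
nAnnotated   <- length(unique(na.omit(longData$symbol)))
pctAnnotated <- 100 * nAnnotated / length(HUGOsymbols)
cardinality  <- length(unique(longData$categoryID))   # unique annotations

c(items = nItems, genes = nGenes, unmapped = nUnmapped,
  annotated = nAnnotated, percent = pctAnnotated, categories = cardinality)
```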


     
    • Can you validate that your results are correct and complete?


     
    Discuss with me how to proceed if a full annotation can't be computed with reasonable resources.


     


     


     

    Refactor your script

     

    Task:

     
    When you are done, review and refactor your script ...
    • Does the script include all the required information items suggested by the script template in the zu repository?
    • Does your code conform to the Coding style requirements for this course?
    • Is the script fully commented?
    • Did you load all required libraries?
    • Did you comment and justify all parameters?
    • Are there any "magic numbers" left in the code?
    • Is the code parsimonious in its activities and efficient in its program flow (e.g. it doesn't grow large datastructures dynamically, takes advantage of vectorized functions wherever possible, and doesn't perform activities inside of loops that don't depend on the loop variable)?


     

    (This list is not exhaustive.)


     

    For the evaluation of your results: Form matters!


     


     

    Document your work

     

    Task:

     
    Every category task has a subpage for documentation linked from the teams and tasks page ...

    Documentation needs to be brief and complete, and must describe the final result unambiguously. You don't need to document your activities here - this goes into your respective Course Journals. Describe your results, include your miniHUGOsymbols annotations, and include your complete script (with proper GeSHi highlighting). An SPN diagram may help to make the dataflow clear.

    Form matters.


     


     

    Further reading, links and resources

    • HUGO Gene Nomenclature Committee - the authoritative information source for gene symbols. Includes search functions for synonyms, aliases, and other information, as well as downloadable data.


    Notes

     
    1. Note: oral exams will focus on the content of Integrator Units, but will also cover material that leads up to it. All exams in this course are cumulative.
    2. For clarification: You sign up for only one oral exam for February.
    3. Note GSEA (Gene Set Enrichment Analysis) is not the same as gene feature enrichment.


     


     


    About ...
     
    Author:

    Boris Steipe <boris.steipe@utoronto.ca>

    Created:

    2018-02-01

    Modified:

    2018-02-03

    Version:

    1.2.1

    Version history:

    • 1.2.1 Sleeping ...
    • 1.2 Tasks are assigned to teams
    • 1.1 Added task details
    • 1.0 Category tasks defined
    • 0.1 New unit under development

    This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.
