Difference between revisions of "ABC-INT-Expression data"
Line 3: | Line 3: | ||
Integration Unit: Expression Data | Integration Unit: Expression Data | ||
<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#e19fa7; font-size:30%; font-weight:200; color: #000000; "> | <div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#e19fa7; font-size:30%; font-weight:200; color: #000000; "> | ||
− | (Integrator unit: select and normalize expression data) | + | (Integrator unit: select, clean and normalize expression data) |
</div> | </div> | ||
</div> | </div> | ||
Line 66: | Line 66: | ||
<div class="reference-box">[https://www.biorxiv.org/content/early/2017/03/21/118349 Taroni & Greene (2017)] Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously (BioRχiv '''doi:''' https://doi.org/10.1101/118349) </div> | <div class="reference-box">[https://www.biorxiv.org/content/early/2017/03/21/118349 Taroni & Greene (2017)] Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously (BioRχiv '''doi:''' https://doi.org/10.1101/118349) </div> | ||
− | We need expression datasets - | + | Our ultimate goal is to explore machine learning approaches to evaluate systems membership of genes. For this, we need features that annotate genes, are suitable for machine learning, '''and are informative regarding the function of the gene'''. Expression profiles have great potential for this, since genes that collaborate are often (although not always) co-regulated - either directly, by being part of the same gene regulatory pathways, or indirectly by being similarly responsive to environmental conditions or other stimuli. In order to build "good" features, the data need to be of good quality, and informative for our purpose. We need expression datasets - |
* with good coverage; | * with good coverage; | ||
* not much older than ten years (quality!); | * not much older than ten years (quality!); | ||
Line 73: | Line 73: | ||
* mapped to unique human gene identifiers. | * mapped to unique human gene identifiers. | ||
− | As the result of this task, you should prepare a script that will produce one reference and one experimental data set for human genes (from the same experiment). | + | As the result of this task, you should prepare a script that will produce one reference and one experimental ''feature'' data set for human genes (from the same experiment). |
− | To avoid mistakes in praparing the dataset, discuss with your team members, or post questions on the mailing list. You are encouraged to discuss strategies with anyone '''however the script you submit must be entirely your own and you must not copy code (apart from the script template) from elsewhere. | + | To avoid mistakes in praparing the dataset, discuss your with your team members, or post questions on the mailing list. You are encouraged to discuss strategies with anyone '''however the script you submit must be entirely your own and you must not copy code (apart from the script template) from elsewhere'''. |
+ | |||
+ | Read the entire set of requirements and parameters carefully before you begin. | ||
{{Vspace}} | {{Vspace}} | ||
Line 84: | Line 86: | ||
{{task|1= | {{task|1= | ||
− | *Navigate to [https://www.ncbi.nlm.nih.gov/geo/browse/ '''the GEO expression dataset browser'''] and select an expression dataset | + | *Navigate to [https://www.ncbi.nlm.nih.gov/geo/browse/ '''the GEO expression dataset browser'''] and select an expression dataset you will work with. |
+ | {{Vspace}} | ||
− | <div class="mw-collapsible mw-collapsed" data-expandtext=" | + | <div class="mw-collapsible mw-collapsed" data-expandtext="More ▽" data-collapsetext="Hide △" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;"> |
;1 – Choose a dataset of native, healthy human cells or tissue ... | ;1 – Choose a dataset of native, healthy human cells or tissue ... | ||
<div class="mw-collapsible-content" style="padding:10px;"> | <div class="mw-collapsible-content" style="padding:10px;"> | ||
− | If your dataset is not native, healthy human tissue, you will not receive credit. | + | If your dataset is not native, healthy human tissue, you will not receive credit. An exception can be made if you feel that you have discovered an experiment on different cells that will be particularly useful regardless. If so, contact me for special permission. |
</div> | </div> | ||
</div> | </div> | ||
− | <div class="mw-collapsible mw-collapsed" data-expandtext=" | + | {{Smallvspace}} |
− | ;2 – interesting ... | + | |
+ | <div class="mw-collapsible mw-collapsed" data-expandtext="More ▽" data-collapsetext="Hide △" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;"> | ||
+ | ;2 – Choose an interesting experiment ... | ||
<div class="mw-collapsible-content" style="padding:10px;"> | <div class="mw-collapsible-content" style="padding:10px;"> | ||
− | + | If we are to use these features to assess system membership, their expression response to the experimental conditions must reflect some biological property. Ideally, this will be a physiological response of some sort, disease states may be less suited to this question. It is your task to reflect on this question and choose accordingly. | |
</div> | </div> | ||
</div> | </div> | ||
− | <div class="mw-collapsible mw-collapsed" data-expandtext=" | + | {{Smallvspace}} |
− | ;3 – coverage ... | + | |
+ | <div class="mw-collapsible mw-collapsed" data-expandtext="More ▽" data-collapsetext="Hide △" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;"> | ||
+ | ;3 – Make sure the coverage is as complete as possible ... | ||
<div class="mw-collapsible-content" style="padding:10px;"> | <div class="mw-collapsible-content" style="padding:10px;"> | ||
− | + | Experiments that measure expression for only a small subset of genes are not suitable. | |
</div> | </div> | ||
</div> | </div> | ||
− | <div class="mw-collapsible mw-collapsed" data-expandtext=" | + | {{Smallvspace}} |
− | ;4 – | + | |
+ | <div class="mw-collapsible mw-collapsed" data-expandtext="More ▽" data-collapsetext="Hide △" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;"> | ||
+ | ;4 – Choose high-quality experiments ... | ||
<div class="mw-collapsible-content" style="padding:10px;"> | <div class="mw-collapsible-content" style="padding:10px;"> | ||
− | + | The experiments should be performed with technical replicates (the more the better), and you will average the replicates as you prepare the feature data set. It also should be performed with mature experimental platforms, according to best-practice procedures; therefore we should choose recent experiments (not older than ten years). As above, contact me for special permission if you want to deviate from this requirement. | |
</div> | </div> | ||
</div> | </div> | ||
− | + | {{Vspace}} | |
− | |||
− | |||
− | |||
− | |||
− | |||
* Claim the dataset on the [http://steipe.biochemistry.utoronto.ca/abc/students/index.php/ABC-INT-Expression_datasets '''dataset signup page of the Student Wiki''']. | * Claim the dataset on the [http://steipe.biochemistry.utoronto.ca/abc/students/index.php/ABC-INT-Expression_datasets '''dataset signup page of the Student Wiki''']. | ||
− | + | {{Smallvspace}} | |
}} | }} | ||
Line 138: | Line 142: | ||
* Develop your code in an R script that you submit as part of this task. | * Develop your code in an R script that you submit as part of this task. | ||
− | <div class="mw-collapsible mw-collapsed" data-expandtext=" | + | <div class="mw-collapsible mw-collapsed" data-expandtext="More ▽" data-collapsetext="Hide △" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;"> |
;999 – Download ... | ;999 – Download ... | ||
<div class="mw-collapsible-content" style="padding:10px;"> | <div class="mw-collapsible-content" style="padding:10px;"> | ||
Line 145: | Line 149: | ||
</div> | </div> | ||
− | <div class="mw-collapsible mw-collapsed" data-expandtext=" | + | <div class="mw-collapsible mw-collapsed" data-expandtext="More ▽" data-collapsetext="Hide △" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;"> |
;999 – Assess ... | ;999 – Assess ... | ||
<div class="mw-collapsible-content" style="padding:10px;"> | <div class="mw-collapsible-content" style="padding:10px;"> | ||
Line 152: | Line 156: | ||
</div> | </div> | ||
− | <div class="mw-collapsible mw-collapsed" data-expandtext=" | + | <div class="mw-collapsible mw-collapsed" data-expandtext="More ▽" data-collapsetext="Hide △" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;"> |
;999 – Map ... | ;999 – Map ... | ||
<div class="mw-collapsible-content" style="padding:10px;"> | <div class="mw-collapsible-content" style="padding:10px;"> | ||
Line 160: | Line 164: | ||
− | <div class="mw-collapsible mw-collapsed" data-expandtext=" | + | <div class="mw-collapsible mw-collapsed" data-expandtext="More ▽" data-collapsetext="Hide △" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;"> |
;999 – Clean ... | ;999 – Clean ... | ||
<div class="mw-collapsible-content" style="padding:10px;"> | <div class="mw-collapsible-content" style="padding:10px;"> | ||
Line 167: | Line 171: | ||
</div> | </div> | ||
− | <div class="mw-collapsible mw-collapsed" data-expandtext=" | + | <div class="mw-collapsible mw-collapsed" data-expandtext="More ▽" data-collapsetext="Hide △" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;"> |
;999 – Impute ... | ;999 – Impute ... | ||
<div class="mw-collapsible-content" style="padding:10px;"> | <div class="mw-collapsible-content" style="padding:10px;"> | ||
Line 174: | Line 178: | ||
</div> | </div> | ||
− | <div class="mw-collapsible mw-collapsed" data-expandtext=" | + | <div class="mw-collapsible mw-collapsed" data-expandtext="More ▽" data-collapsetext="Hide △" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;"> |
;999 – Average ... | ;999 – Average ... | ||
<div class="mw-collapsible-content" style="padding:10px;"> | <div class="mw-collapsible-content" style="padding:10px;"> | ||
Line 195: | Line 199: | ||
* | * | ||
− | <div class="mw-collapsible mw-collapsed" data-expandtext=" | + | <div class="mw-collapsible mw-collapsed" data-expandtext="More ▽" data-collapsetext="Hide △" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;"> |
;999 – Clean ... | ;999 – Clean ... | ||
<div class="mw-collapsible-content" style="padding:10px;"> | <div class="mw-collapsible-content" style="padding:10px;"> | ||
Line 252: | Line 256: | ||
:Boris Steipe <boris.steipe@utoronto.ca> | :Boris Steipe <boris.steipe@utoronto.ca> | ||
<b>Created:</b><br /> | <b>Created:</b><br /> | ||
− | :2018-01- | + | :2018-01-26 |
<b>Modified:</b><br /> | <b>Modified:</b><br /> | ||
:2017-11-01 | :2017-11-01 |
Revision as of 19:31, 26 January 2018
Integration Unit: Expression Data
(Integrator unit: select, clean and normalize expression data)
Abstract:
This page integrates material from the learning units and defines a task for selecting and normalizing human expression data.
Deliverables:
- Integrator unit: Deliverables can be submitted for course marks. See below for details.
Prerequisites:
This unit builds on material covered in the following prerequisite units:
Evaluation
Your progress and outcomes of this "Integrator Unit" will be one of the topics of the first oral exam for BCB420/JTB2020. That oral exam will be worth 20% of your term grade.[1]
- Work through the tasks described below. Remember to document your work in your journal concurrently with your progress. Journal entries that are uploaded in bulk at the end of your work will not be considered evidence of ongoing engagement.
- Part of your task will involve writing an R script. Place your script code in a subpage of your User page on the Student Wiki and link to it from your Journal.
- Schedule an oral exam by editing the signup page on the Student Wiki. You must have signed up for an exam slot before 20:00 on the day before your exam.
- Your work must be complete before 20:00 on the day before your exam.
Contents
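Your task is to select an expression dataset that is suitable for use as a "feature" for human genes in machine learning. Currently, expression data are collected with microarrays and from RNA-seq experiments. If we want to combine data from different experiments in one computational analysis, we need to consider very carefully how to prepare comparable values.
To begin, please read the following paper:
- Taroni & Greene (2017) Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously (BioRχiv doi: https://doi.org/10.1101/118349)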
Our ultimate goal is to explore machine learning approaches to evaluate systems membership of genes. For this, we need features that annotate genes, are suitable for machine learning, and are informative regarding the function of the gene. Expression profiles have great potential for this, since genes that collaborate are often (although not always) co-regulated - either directly, by being part of the same gene regulatory pathways, or indirectly by being similarly responsive to environmental conditions or other stimuli. In order to build "good" features, the data need to be of good quality, and informative for our purpose. We need expression datasets -
- with good coverage;
- not much older than ten years (quality!);
- with sufficient numbers of replicates;
- collected under interesting conditions;
- mapped to unique human gene identifiers.
As the result of this task, you should prepare a script that will produce one reference and one experimental feature data set for human genes (from the same experiment).
To avoid mistakes in preparing the dataset, discuss with your team members, or post questions on the mailing list. You are encouraged to discuss strategies with anyone; however, the script you submit must be entirely your own and you must not copy code (apart from the script template) from elsewhere.
Read the entire set of requirements and parameters carefully before you begin.
Select an Expression Data Set
Task:
- Navigate to the GEO expression dataset browser and select an expression dataset you will work with.
- 1 – Choose a dataset of native, healthy human cells or tissue ...
If your dataset is not native, healthy human tissue, you will not receive credit. An exception can be made if you feel that you have discovered an experiment on different cells that will be particularly useful regardless. If so, contact me for special permission.
- 2 – Choose an interesting experiment ...
If we are to use these features to assess system membership, their expression response to the experimental conditions must reflect some biological property. Ideally, this will be a physiological response of some sort; disease states may be less suited to this question. It is your task to reflect on this question and choose accordingly.
- 3 – Make sure the coverage is as complete as possible ...
Experiments that measure expression for only a small subset of genes are not suitable.
- 4 – Choose high-quality experiments ...
The experiments should be performed with technical replicates (the more the better), and you will average the replicates as you prepare the feature data set. They should also be performed with mature experimental platforms, according to best-practice procedures; therefore we should choose recent experiments (not older than ten years). As above, contact me for special permission if you want to deviate from this requirement.
- Claim the dataset on the dataset signup page of the Student Wiki.
Clean the data and map to HUGO symbols
Task:
- Develop your code in an R script that you submit as part of this task.
- 999 – Download ...
YYY
- 999 – Assess ...
YYY
- 999 – Map ...
YYY
- 999 – Clean ...
YYY
- 999 – Impute ...
YYY
- 999 – Average ...
YYY
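The details of these steps are not yet filled in on this page. As a minimal sketch only, the following base R code shows one possible way to average replicates and to collapse several probes that map to the same gene symbol. The toy matrix, the condition labels, the placeholder symbols (GENEA, GENEB, GENEC) and the function names averageReplicates() and collapseBySymbol() are all assumptions made for illustration; they are not part of the script template. Taking the mean of duplicate probes is only one of several defensible choices.

```r
# Toy data (hypothetical): 4 probes x 4 samples, two replicates per condition.
expr <- matrix(c( 2,  4,  6,  8,
                  1,  3,  5,  7,
                 10, 10, 20, 20,
                  3,  5, NA,  9),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("p1", "p2", "p3", "p4"),
                               c("ctrl.1", "ctrl.2", "test.1", "test.2")))
condition <- c("control", "control", "test", "test")  # one label per column
symbol    <- c("GENEA", "GENEA", "GENEB", "GENEC")    # placeholder symbols, one per probe

# Average the replicate columns within each condition, ignoring missing values.
averageReplicates <- function(x, groups) {
  sapply(split(seq_len(ncol(x)), groups),
         function(idx) rowMeans(x[, idx, drop = FALSE], na.rm = TRUE))
}

# Collapse probes that map to the same symbol by taking their mean
# (an assumption; other strategies, such as keeping one probe, are possible).
collapseBySymbol <- function(x, symbols) {
  apply(x, 2, function(col) tapply(col, symbols, mean))
}

avg      <- averageReplicates(expr, condition)  # probes x conditions
features <- collapseBySymbol(avg, symbol)       # symbols x conditions
print(features)
```

The result has one row per symbol and one column per condition, which corresponds to the "one reference and one experimental feature data set" shape asked for above. Whatever strategy you choose for duplicates and missing values, record it in your script comments, since the questions in the final section ask about it.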
Apply Quantile Normalization (QN)
Task:
- 999 – Clean ...
YYY
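For orientation, quantile normalization can be sketched in base R as follows: sort each column, average the sorted values across columns to obtain a reference distribution, then replace each value by the reference value of its rank. This is an illustrative sketch, not the prescribed implementation for this task. The function name quantileNormalize() and the toy matrix are assumptions, ties are broken by order of appearance (other tie rules give slightly different results), and missing values are assumed to have been handled beforehand. Bioconductor packages such as preprocessCore (normalize.quantiles()) provide established implementations.

```r
# Quantile-normalize the columns of a numeric matrix using base R only.
quantileNormalize <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x), nrow(x) > 1, !anyNA(x))
  # Rank of each value within its column; ties broken by order of appearance.
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Sort every column, then average across columns, row by row,
  # to get the shared reference distribution.
  reference <- unname(rowMeans(apply(x, 2, sort)))
  # Replace each value by the reference value that has the same rank.
  out <- apply(ranks, 2, function(r) reference[r])
  dimnames(out) <- dimnames(x)
  out
}

# Toy example (hypothetical values): 4 genes x 3 samples.
x <- matrix(c(5, 2, 3, 4,
              4, 1, 4, 2,
              3, 4, 6, 8),
            nrow = 4,
            dimnames = list(paste0("g", 1:4), c("A", "B", "C")))
qn <- quantileNormalize(x)
print(qn)
```

After normalization every column contains the same set of values, only in a different order, so the sample distributions are identical while the within-sample ranking of genes is preserved.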
Interpret, and document
- Be prepared to answer the following questions
Task:
- What are the control and test conditions of the dataset?
- Why is the dataset of interest to our systems assessment task?
- Were there expression values that were not unique for specific genes? How did you handle these?
- Were there expression values that could not be mapped to current HUGO symbols?
- How many outliers were removed, how many datapoints were imputed?
- How did you handle replicates?
- What is the final coverage of your dataset?
- Make sure your script contains the complete workflow, is fully commented, and contains all essential elements suggested by the script template[2]. This is a collaborative project - form matters.
Notes
If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2018-01-26
Modified:
- 2017-11-01
Version:
- 1.0
Version history:
- 1.0 First live version
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.