Difference between revisions of "ABC-INT-Expression data"

From "A B C"
m (Created page with "<div id="ABC"> <div style="padding:5px; border:1px solid #000000; background-color:#e19fa7; font-size:300%; font-weight:400; color: #000000; width:100%;"> Integration Unit: Ex...")
 
m
Line 2: Line 2:
 
<div style="padding:5px; border:1px solid #000000; background-color:#e19fa7; font-size:300%; font-weight:400; color: #000000; width:100%;">
 
<div style="padding:5px; border:1px solid #000000; background-color:#e19fa7; font-size:300%; font-weight:400; color: #000000; width:100%;">
 
Integration Unit: Expression Data
 
Integration Unit: Expression Data
<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#e19fa7; font-size:30%; font-weight:200; color: #000000; width:100%;">
+
<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#e19fa7; font-size:30%; font-weight:200; color: #000000; ">
 
(Integrator unit: select and normalize expression data)
 
(Integrator unit: select and normalize expression data)
 
</div>
 
</div>
Line 49: Line 49:
 
=== Evaluation ===
 
=== Evaluation ===
 
<!-- included from "./components/ABC-INT-Expression_data.components.txt", section: "evaluation" -->
 
<!-- included from "./components/ABC-INT-Expression_data.components.txt", section: "evaluation" -->
Your progress and outcomes of this "Integrator Unit" will be one of a small number of topics of the first oral exam for BCB420/JTB2020. That oral exam will be worth 20 % of your term grade.<ref>Note: oral exams will focus on the content of Integrator Units, but will also cover material that leads up to it. All exams in this course are cumulative.</ref>.
+
Your progress and outcomes of this "Integrator Unit" will be one of the topics of the first oral exam for BCB420/JTB2020. That oral exam will be worth 20% of your term grade.<ref>Note: oral exams will focus on the content of Integrator Units, but will also cover material that leads up to it. All exams in this course are cumulative.</ref>.
* Work through the tasks described below. Remember to document your work in your journal.
+
* Work through the tasks described below. Remember to document your work in your journal '''concurrently''' with your progress. Journal entries that are uploaded in bulk at the end of your work will not be considered evidence of ongoing engagement.
* Part of your task will involve writing an R script, place that code in a subpage of your User page on the Student Wiki and link to it from your Journal.
+
* Part of your task will involve writing an R script. Place your script code in a subpage of your User page on the Student Wiki and link to it from your Journal.
* Your work must be complete before 21:00 on the day before your exam.
 
 
* Schedule an oral exam by editing the [http://steipe.biochemistry.utoronto.ca/abc/students/index.php/Signup-Oral_exams_2017 '''signup page on the Student Wiki''']. You must have signed-up for an exam slot before 20:00 on the day before your exam.
 
* Schedule an oral exam by editing the [http://steipe.biochemistry.utoronto.ca/abc/students/index.php/Signup-Oral_exams_2017 '''signup page on the Student Wiki''']. You must have signed-up for an exam slot before 20:00 on the day before your exam.
{{Smallvspace}}
+
* Your work must be complete before 20:00 on the day before your exam.
 +
 
 +
{{Vspace}}
  
 
== Contents ==
 
== Contents ==
 
<!-- included from "./components/ABC-INT-Expression_data.components.txt", section: "contents" -->
 
<!-- included from "./components/ABC-INT-Expression_data.components.txt", section: "contents" -->
  
Your task is to select an expression dataset that is suitable for use as features for human genes in machine learning. Currently, expression data are collected with microarrays and from RNAseq experiments. If we want to use different experiments in a computational experiment, we need to consider very carefully how to prepare comparable values.
+
Your task is to select an expression dataset that is suitable for use as a "feature" for human genes in machine learning. Currently, expression data are collected with microarrays and from RNAseq experiments. If we want to use different experiments in a computational experiment, we need to consider very carefully how to prepare comparable values.
  
 
To begin, please read the following paper:
 
To begin, please read the following paper:
Line 82: Line 83:
  
 
{{task|1=
 
{{task|1=
*
+
 
 +
*Navigate to [https://www.ncbi.nlm.nih.gov/geo/browse/ '''the GEO expression dataset browser'''] and select an expression dataset for you to transform into a feature set.
 +
 
 +
 
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="Evaluation criteria&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;">
 +
;1 – Choose a dataset of native, healthy human cells or tissue ...
 +
<div class="mw-collapsible-content" style="padding:10px;">
 +
If your dataset is not native, healthy human tissue, you will not receive credit. If you feel that you have discovered an experiment on different cells that will be exceptionally useful regardless, contact me for special permission.
 +
</div>
 +
</div>
 +
 
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="Evaluation criteria&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;">
 +
;2 – interesting ...
 +
<div class="mw-collapsible-content" style="padding:10px;">
 +
YYY
 +
</div>
 +
</div>
 +
 
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="Evaluation criteria&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;">
 +
;3 – coverage ...
 +
<div class="mw-collapsible-content" style="padding:10px;">
 +
YYY
 +
</div>
 +
</div>
 +
 
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="Evaluation criteria&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;">
 +
;4 – replicates ...
 +
<div class="mw-collapsible-content" style="padding:10px;">
 +
YYY
 +
</div>
 +
</div>
 +
 
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="Evaluation criteria&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;">
 +
;5 – recent ...
 +
<div class="mw-collapsible-content" style="padding:10px;">
 +
YYY
 +
</div>
 +
</div>
 +
 
 +
* Claim the dataset on the [http://steipe.biochemistry.utoronto.ca/abc/students/index.php/ABC-INT-Expression_datasets '''dataset signup page of the Student Wiki'''].
 +
 
 +
 
 +
 
 
}}
 
}}
  
 
{{Vspace}}
 
{{Vspace}}
  
=== Clean it and impute missing data ===
+
=== Clean the data and map to HUGO symbols ===
  
 
{{Smallvspace}}
 
{{Smallvspace}}
  
 
{{task|1=
 
{{task|1=
*
+
 
 +
* Develop your code in an R script that you submit as part of this task.
 +
 
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="Evaluation criteria&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;">
 +
;999 – Download ...
 +
<div class="mw-collapsible-content" style="padding:10px;">
 +
YYY
 +
</div>
 +
</div>
 +
 
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="Evaluation criteria&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;">
 +
;999 – Assess ...
 +
<div class="mw-collapsible-content" style="padding:10px;">
 +
YYY
 +
</div>
 +
</div>
 +
 
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="Evaluation criteria&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;">
 +
;999 – Map ...
 +
<div class="mw-collapsible-content" style="padding:10px;">
 +
YYY
 +
</div>
 +
</div>
 +
 
 +
 
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="Evaluation criteria&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;">
 +
;999 – Clean ...
 +
<div class="mw-collapsible-content" style="padding:10px;">
 +
YYY
 +
</div>
 +
</div>
 +
 
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="Evaluation criteria&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;">
 +
;999 – Impute ...
 +
<div class="mw-collapsible-content" style="padding:10px;">
 +
YYY
 +
</div>
 +
</div>
 +
 
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="Evaluation criteria&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;">
 +
;999 – Average ...
 +
<div class="mw-collapsible-content" style="padding:10px;">
 +
YYY
 +
</div>
 +
</div>
 +
 
  
 
}}
 
}}
Line 104: Line 192:
  
 
{{task|1=
 
{{task|1=
 +
 
*
 
*
* Post a script that will download the dataset and perform all required operations.
+
 
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="Evaluation criteria&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;">
 +
;999 – Clean ...
 +
<div class="mw-collapsible-content" style="padding:10px;">
 +
YYY
 +
</div>
 +
</div>
  
  
 
}}
 
}}
  
=== Interpret ===
+
=== Interpret, and document ===
  
 
{{Smallvspace}}
 
{{Smallvspace}}
Line 117: Line 212:
  
 
{{task|1=
 
{{task|1=
* What is the coverage of your dataset?
+
* What are the control and test conditions of the dataset?
 
+
* Why is the dataset of interest to our systems assessment task?
 +
* Were there expression values that were not unique for specific genes? How did you handle these?
 +
* Were there expression values that could not be mapped to current HUGO symbols?
 +
* How many outliers were removed, how many datapoints were imputed?
 +
* How did you handle replicates?
 +
* What is the final coverage of your dataset?
 +
{{Smallvspace}}
 +
* Make sure your script contains the complete workflow, is fully commented, and contains all essential elements suggested by the script template<ref>Refer to the script template <tt>inst/extdata/scripts/scriptTemplate.R</tt> in the _zu_ project repository.</ref>. This is a collaborative project - form matters.
  
 
}}
 
}}
 
 
 
  
 
{{Vspace}}
 
{{Vspace}}
 
 
 
 
  
 
== Notes ==
 
== Notes ==
Line 135: Line 230:
 
<!-- included from "./data/ABC-unit_components.txt", section: "notes" -->
 
<!-- included from "./data/ABC-unit_components.txt", section: "notes" -->
 
<references />
 
<references />
== Further reading, links and resources ==
 
<!-- {{#pmid: 19957275}} -->
 
<!-- {{WWW|WWW_GMOD}} -->
 
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 
  
 
{{Vspace}}
 
{{Vspace}}

Revision as of 17:55, 26 January 2018

Integration Unit: Expression Data

(Integrator unit: select and normalize expression data)


 


Abstract:

This page integrates material from the learning units and defines a task for selecting and normalizing human expression data.


Deliverables:

  • Integrator unit: Deliverables can be submitted for course marks. See below for details.

Prerequisites:
This unit builds on material covered in the following prerequisite units:


 



 



 


Evaluation

Your progress and outcomes of this "Integrator Unit" will be one of the topics of the first oral exam for BCB420/JTB2020. That oral exam will be worth 20% of your term grade.[1]

  • Work through the tasks described below. Remember to document your work in your journal concurrently with your progress. Journal entries that are uploaded in bulk at the end of your work will not be considered evidence of ongoing engagement.
  • Part of your task will involve writing an R script. Place your script code in a subpage of your User page on the Student Wiki and link to it from your Journal.
  • Schedule an oral exam by editing the signup page on the Student Wiki. You must have signed up for an exam slot before 20:00 on the day before your exam.
  • Your work must be complete before 20:00 on the day before your exam.


 

Contents

Your task is to select an expression dataset that is suitable for use as a "feature" for human genes in machine learning. Currently, expression data are collected with microarrays and from RNAseq experiments. If we want to use different experiments in a computational experiment, we need to consider very carefully how to prepare comparable values.

To begin, please read the following paper:

Taroni & Greene (2017) Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously (BioRχiv doi: https://doi.org/10.1101/118349)

We need expression datasets -

  • with good coverage;
  • not much older than ten years (quality!);
  • with sufficient numbers of replicates;
  • collected under interesting conditions;
  • mapped to unique human gene identifiers.

As the result of this task, you should prepare a script that will produce one reference and one experimental data set for human genes (from the same experiment).

To avoid mistakes in preparing the dataset, discuss with your team members, or post questions on the mailing list. You are encouraged to discuss strategies with anyone; however, the script you submit must be entirely your own and you must not copy code (apart from the script template) from elsewhere.


 

Select an Expression Data Set

 

Task:


1 – Choose a dataset of native, healthy human cells or tissue ...

If your dataset is not native, healthy human tissue, you will not receive credit. If you feel that you have discovered an experiment on different cells that will be exceptionally useful regardless, contact me for special permission.

2 – interesting ...

YYY

3 – coverage ...

YYY

4 – replicates ...

YYY

5 – recent ...

YYY


 

Clean the data and map to HUGO symbols

 

Task:

  • Develop your code in an R script that you submit as part of this task.
999 – Download ...

YYY

999 – Assess ...

YYY

999 – Map ...

YYY


999 – Clean ...

YYY

999 – Impute ...

YYY

999 – Average ...

YYY
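
The cleaning steps named above (map, impute, average) can be sketched on a toy matrix. This is a minimal illustration in base R, not the required solution: the probe IDs, the symbols GENEA to GENEC, the column naming scheme "condition.replicate", and the choice of row-mean imputation are all assumptions made for the example. Note that a row mean mixes control and test values, so a real script may prefer to impute within each condition.

```r
# Toy expression matrix: rows are probes, columns are samples.
# All values and names are hypothetical.
expr <- matrix(c(5.1, 5.3, 7.9, 8.1,
                 2.0,  NA, 2.4, 2.2,
                 6.0, 6.2, 6.1, 5.9,
                 3.0, 3.2, 9.0, 9.4),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("p1", "p2", "p3", "p4"),
                               c("ctrl.1", "ctrl.2", "test.1", "test.2")))

# Hypothetical probe-to-symbol map: p1 and p3 map to the same symbol.
probe2sym <- c(p1 = "GENEA", p2 = "GENEB", p3 = "GENEA", p4 = "GENEC")

# Impute: replace each NA with the mean of the observed values in its row.
# Keep a count so that the number of imputed datapoints can be reported.
nImputed <- sum(is.na(expr))
imputed <- expr
for (i in seq_len(nrow(imputed))) {
  sel <- is.na(imputed[i, ])
  if (any(sel) && !all(sel)) {
    imputed[i, sel] <- mean(imputed[i, !sel])
  }
}

# Map: collapse probes that share a symbol by averaging their rows.
sym <- probe2sym[rownames(imputed)]
sums <- rowsum(imputed, group = sym)        # per-symbol column sums
counts <- table(sym)[rownames(sums)]        # number of probes per symbol
bySymbol <- sums / as.vector(counts)        # per-symbol column means

# Average: take the mean over replicates of each condition.
# The condition is the column name with the ".<replicate>" suffix removed.
cond <- sub("\\.[0-9]+$", "", colnames(bySymbol))
avg <- sapply(unique(cond), function(cn) {
  rowMeans(bySymbol[, cond == cn, drop = FALSE])
})

print(avg)
```

The result `avg` has one row per symbol and one column per condition, which matches the "one reference and one experimental data set" that the script is asked to produce.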


 


Apply Quantile Normalization (QN)

 

Task:

999 – Clean ...

YYY
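
As an illustration of what quantile normalization does, here is a minimal base R sketch. The function name `quantileNormalize` and the toy values are hypothetical, and the sketch assumes a numeric matrix without missing values (impute first) and breaks ties by order of appearance. Tested implementations exist in Bioconductor, for example `normalize.quantiles()` in the preprocessCore package; this sketch only shows the principle.

```r
# Quantile normalization: force every column (sample) to have the same
# empirical distribution, namely the mean of the sorted columns.
quantileNormalize <- function(m) {
  stopifnot(is.matrix(m), is.numeric(m), !anyNA(m))
  ranks   <- apply(m, 2, rank, ties.method = "first")  # rank within each column
  sorted  <- apply(m, 2, sort)                         # each column sorted ascending
  refDist <- rowMeans(sorted)                          # reference distribution
  out <- apply(ranks, 2, function(r) refDist[r])       # substitute values by rank
  dimnames(out) <- dimnames(m)
  out
}

# Toy data: four genes, three samples (hypothetical values, no ties).
m <- matrix(c(5, 2, 3, 4,
              4, 1, 3, 2,
              3, 4, 6, 8),
            nrow = 4,
            dimnames = list(c("g1", "g2", "g3", "g4"),
                            c("s1", "s2", "s3")))

qn <- quantileNormalize(m)
print(qn)
```

After normalization, every column contains the same set of values; only their assignment to genes differs between samples. The rank order of genes within each sample is preserved.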

Interpret, and document

 
Be prepared to answer the following questions:

Task:

  • What are the control and test conditions of the dataset?
  • Why is the dataset of interest to our systems assessment task?
  • Were there expression values that were not unique for specific genes? How did you handle these?
  • Were there expression values that could not be mapped to current HUGO symbols?
  • How many outliers were removed, how many datapoints were imputed?
  • How did you handle replicates?
  • What is the final coverage of your dataset?
 
  • Make sure your script contains the complete workflow, is fully commented, and contains all essential elements suggested by the script template[2]. This is a collaborative project - form matters.
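
Several of the questions above ask for counts that the script can record as it runs. The following base R sketch shows one way to do that bookkeeping. The symbol vectors are hypothetical stand-ins (a real script would load the current list of HUGO symbols and the dataset's own annotation), and "coverage" is taken here to mean the fraction of reference symbols for which the dataset provides at least one value, which is an assumption about the intended definition.

```r
# Hypothetical reference set of current HUGO symbols.
refSymbols <- c("GENEA", "GENEB", "GENEC", "GENED", "GENEE")

# Hypothetical symbols from the dataset annotation: "OLDNAME1" cannot be
# mapped to the reference set, and "GENEA" occurs twice.
dataSymbols <- c("GENEA", "GENEB", "GENEC", "OLDNAME1", "GENEA")

mapped <- dataSymbols %in% refSymbols

# Rows that could not be mapped to a current symbol.
nUnmapped <- sum(!mapped)

# Rows that are not unique for a gene (second and later occurrences).
nDuplicated <- sum(duplicated(dataSymbols[mapped]))

# Coverage: fraction of reference symbols represented in the dataset.
coverage <- length(unique(dataSymbols[mapped])) / length(refSymbols)

cat(sprintf("unmapped: %d, non-unique: %d, coverage: %.1f%%\n",
            nUnmapped, nDuplicated, 100 * coverage))
```

Recording these numbers in the script, rather than computing them by hand afterwards, keeps the documented answers consistent with the data the script actually produces.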


 

Notes

  1. Note: oral exams will focus on the content of Integrator Units, but will also cover material that leads up to it. All exams in this course are cumulative.
  2. Refer to the script template inst/extdata/scripts/scriptTemplate.R in the _zu_ project repository.


 




 

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.



 

About ...
 
Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2018-01-25

Modified:

2017-11-01

Version:

1.0

Version history:

  • 1.0 First live version

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License.