Latest revision as of 01:44, 23 September 2020
Integrator Unit: Expression Data
(Integrator unit: select, clean and normalize expression data)
Abstract:
This page integrates material from the learning units and defines a task for selecting and normalizing human expression data.
Deliverables:
- Integrator unit: Deliverables can be submitted for course marks. See below for details.
Prerequisites:
This unit builds on material covered in the following prerequisite units:
- RPR-GEO2R (GEO2R)
Evaluation
Your progress and outcomes of this "Integrator Unit" will be one of the topics of the first oral exam for BCB420/JTB2020. That oral exam will be worth 20% of your term grade.[1]
- Work through the tasks described below.
- Note that there are several tasks that need to be coordinated with your teammates and classmates. This is necessary to ensure the feature sets can be merged in the second phase of the course. Be sure to begin this coordination process in time.
- Remember to document your work in your journal concurrently with your progress. Journal entries that are uploaded in bulk at the end of your work will not be considered evidence of ongoing engagement.
- Your task will involve writing an R script. Place your script code in a subpage of your User page on the Student Wiki[2] and link to the page from your Journal.
- Schedule an oral exam by editing the signup page on the Student Wiki. You must have signed up for an exam slot before 20:00 on the day before your exam.
- Your work must be complete before 20:00 on the day before your exam.
Contents
Your task is to select an expression dataset that is suitable for use as a "feature" for human genes in machine learning. Currently, expression data are collected with microarrays and from RNAseq experiments. If we want to use different experiments in a computational experiment, we need to consider very carefully how to prepare comparable values.
To begin, please read Taroni & Greene (2017), listed under Further reading below.
Our ultimate goal is to explore machine learning approaches to evaluate systems membership of genes. For this, we need features that annotate genes, are suitable for machine learning, and are informative regarding the function of the gene. Expression profiles have great potential for this, since genes that collaborate are often (although not always) co-regulated - either directly by being part of the same gene regulatory pathways, or indirectly by being similarly responsive to environmental conditions or other stimuli. In order to build "good" features, the data need to be of good quality, and informative for our purpose. We need expression datasets -
- with good coverage;
- not much older than ten years (quality!);
- with sufficient numbers of replicates;
- collected under interesting conditions;
- mapped to unique human gene identifiers.
For this integrator unit you will prepare a script that will produce one reference and one experimental feature data set for human genes (from the same experiment).
To avoid mistakes in preparing the dataset, discuss your approach with your team members, or post questions on the mailing list. You are encouraged to discuss strategies with anyone; however, the script you submit must be entirely your own and you must not copy code (apart from the script template) from elsewhere.
Read the entire set of requirements and parameters carefully before you begin. I have posted sample code that covers some of the aspects in the ./inst/scripts/ directory of the zu project/package repository.
Select an Expression Data Set
Task:
- Navigate to the GEO expression dataset search page and select an expression dataset you will work with.
- 1 – Choose a dataset of native, healthy human cells or tissue ...
- 2 – Choose an interesting experiment ...
- 3 – Make sure the coverage is as complete as possible ...
- 4 – Choose high-quality experiments ...
- Claim the dataset on the dataset signup page of the Student Wiki.
Clean the data and map to HUGO symbols
Task:
- Develop your code in an R script that you submit as part of this task. The script should implement the following workflow:
- 1 – Download the data ...
- 2 – Assess ...
- 3 – Map ...
- 4 – Clean ...
- 5 – Average ...
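The map, clean, and average steps can be sketched in base R on a toy dataset. Everything here is invented for illustration (the probe IDs, the probe-to-symbol map, the column layout); a real script would read the expression matrix from the GEO download and derive the mapping from the platform annotation.

```r
# Toy expression matrix: 4 probes x 4 samples (2 control and 2 test
# replicates). Probe IDs and values are invented for illustration.
expr <- matrix(c(10.1,  9.9, 12.0, 11.8,
                  5.2,  5.4,  5.1,  5.3,
                  8.0,  8.2,  9.5,  9.7,
                  7.7,  7.5,  7.6,  7.8),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("1200_at", "1234_at", "278_at", "112358_at"),
                               c("ctrl.1", "ctrl.2", "test.1", "test.2")))

# Hypothetical probe-to-symbol map; NA marks a probe that could not be
# mapped to a current HUGO symbol.
probe2sym <- c("1200_at"   = "DDD",
               "1234_at"   = NA,
               "278_at"    = "AAA",
               "112358_at" = "MMM")

# Average: collapse the replicate columns into one estimate per condition.
# rowMeans() is the simplest choice; a trimmed mean or a median is more
# robust to outliers.
df <- data.frame(sym  = probe2sym[rownames(expr)],
                 ctrl = rowMeans(expr[, c("ctrl.1", "ctrl.2")]),
                 test = rowMeans(expr[, c("test.1", "test.2")]))
```

This yields the intermediate sym/ctrl/test dataframe; probes that could not be mapped carry NA in the sym column and are resolved or dropped during cleaning.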
Apply Quantile Normalization (QN)
Task:
- Next, transform the data with QN. The process is motivated and described in Taroni (2017), but once again there may be parameters to respect and we need a class-consensus on how to do this correctly. Coordinate as above.
- The final result of your script needs to be a dataframe with two numeric columns, named <GSET-ID>.ctrl and <GSET-ID>.test. Its rows must correspond exactly to the HUGO symbol reference vector, in the same order, and the HUGO symbols must be defined as rownames of the dataframe.
- Example ...
- I expect that you have actually produced such a dataset and have it available on your computer for reference. Do not upload this data to GitHub.
- If your script does not produce a data set according to these exact specifications, this must be clearly stated in the script.
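Both steps can be illustrated in a few lines of base R. The quantileNormalize() helper below is a minimal re-implementation of the idea behind preprocessCore::normalize.quantiles(), which is what you would more likely use in practice; the symbols, GSE ID, and values are invented.

```r
# Toy symbol-level matrix after cleaning and averaging (values invented).
m <- cbind(ctrl = c(AAA = 65.6, DDD = 69.4, MMM = 46.8),
           test = c(63.1, 65.1, 58.1))

# Minimal quantile normalization: replace each value by the mean of the
# sorted columns at its rank, so that all columns share one distribution.
quantileNormalize <- function(x) {
  refDist <- rowMeans(apply(x, 2, sort))          # the common distribution
  ranks   <- apply(x, 2, rank, ties.method = "first")
  qn      <- apply(ranks, 2, function(r) refDist[r])
  rownames(qn) <- rownames(x)
  qn
}
mQN <- quantileNormalize(m)

# Reorder against the reference vector; symbols without data get NA rows.
HUGOSymbols <- c("AAA", "BBB", "DDD", "MMM")      # hypothetical reference
idx <- match(HUGOSymbols, rownames(mQN))
out <- data.frame(GSE01234.ctrl = mQN[idx, "ctrl"],
                  GSE01234.test = mQN[idx, "test"],
                  row.names = HUGOSymbols)
```

The helper ignores the tie-handling details that normalize.quantiles() treats more carefully; it is a sketch of the principle, not a replacement for the package function.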
Interpret, and document
The steps above conclude the actual data preparation. Be prepared to answer the following questions:
Task:
- What are the control and test conditions of the dataset?
- Why is the dataset of interest to our systems assessment task?
- Were there expression values that were not unique for specific genes? How did you handle these?
- Were there expression values that could not be mapped to current HUGO symbols?
- How many outliers were removed, how many datapoints were imputed?
- How did you handle replicates?
- What is the final coverage of your dataset?
- Make sure your script contains the complete workflow, is fully commented, and contains all essential elements suggested by the script template[4]. This is a collaborative project - form matters.
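Several of the numbers asked for above fall straight out of the final dataframe. A sketch, using an invented dataframe built to the specification:

```r
# Hypothetical final dataframe (symbols and values invented); BBB had no
# measured values in this experiment.
df <- data.frame(GSE01234.ctrl = c(65.6, NA, 69.4, 46.8),
                 GSE01234.test = c(63.1, NA, 65.1, 58.1),
                 row.names = c("AAA", "BBB", "DDD", "MMM"))

# Coverage: how many reference symbols actually have data?
nCovered <- sum(stats::complete.cases(df))
coverage <- nCovered / nrow(df)
sprintf("Coverage: %d of %d symbols (%.1f%%)", nCovered, nrow(df), 100 * coverage)
# -> "Coverage: 3 of 4 symbols (75.0%)"
```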
Further reading, links and resources
- Taroni & Greene (2017) Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously (BioRχiv doi: https://doi.org/10.1101/118349)
- Quantile Normalization is provided in the preprocessCore Bioconductor package:
Bolstad et al. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19:185-93. (pmid: 12538238)
- RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR: Bioconductor workflow for RNAseq differential expression analysis with edgeR.
- RNA-seq workflow: gene-level exploratory analysis and differential expression: Bioconductor workflow for RNAseq differential expression analysis with DESeq2.
- HUGO Gene Nomenclature Committee - the authoritative information source for gene symbols. Includes search functions for synonyms, aliases, and other information, as well as downloadable data.
- Good discussion of current microarray normalization strategies, as well as a proposal for how to apply QN to case/control datasets:
Cheng et al. (2016) CrossNorm: a novel normalization strategy for microarray data in cancers. Sci Rep 6:18898. (pmid: 26732145)
- Quackenbush's paper is now old, but an often-cited standard reference in the field:
Quackenbush (2002) Microarray data normalization and transformation. Nat Genet 32 Suppl:496-501. (pmid: 12454644)
Notes
- ↑ Note: oral exams will focus on the content of Integrator Units, but will also cover material that leads up to it. All exams in this course are cumulative.
- ↑ Use the appropriate GeSHi code markup
- ↑ See this conversation on cross-validated for example.
- ↑ Refer to the script template inst/extdata/scripts/scriptTemplate.R in the zu project repository.
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2018-01-26
Modified:
- 2017-11-01
Version:
- 1.1.1
Version history:
- 1.1.1 Sleeping ...
- 1.1 Some resources and further reading added
- 1.0 First live version
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.
This page is not currently being maintained since it is not part of active learning sections.