Difference between revisions of "ABC-INT-Expression data"

Revision as of 23:14, 26 January 2018

Integration Unit: Expression Data

(Integrator unit: select, clean and normalize expression data)

Abstract:

This page integrates material from the learning units and defines a task for selecting and normalizing human expression data.

Deliverables:

Integrator unit: Deliverables can be submitted for course marks. See below for details.

Prerequisites:
This unit builds on material covered in the following prerequisite units:

RPR-GEO2R (GEO2R)

Work through the tasks described below. Remember to document your work in your journal concurrently with your progress. Journal entries that are uploaded in bulk at the end of your work will not be considered evidence of ongoing engagement.
Part of your task will involve writing an R script. Place your script code in a subpage of your User page on the Student Wiki and link to it from your Journal.
Schedule an oral exam by editing the signup page on the Student Wiki. You must have signed up for an exam slot before 20:00 on the day before your exam.
Your work must be complete before 20:00 on the day before your exam.

Your task is to select an expression dataset that is suitable for use as a "feature" for human genes in machine learning. Currently, expression data are collected with microarrays and from RNAseq experiments. If we want to combine data from different experiments in one computational analysis, we need to consider very carefully how to prepare comparable values.

To begin, please read the following paper:

Taroni & Greene (2017) Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously (BioRχiv doi: https://doi.org/10.1101/118349)

Our ultimate goal is to explore machine learning approaches to evaluate systems membership of genes. For this, we need features that annotate genes, are suitable for machine learning, and are informative regarding the function of the gene. Expression profiles have great potential for this, since genes that collaborate are often (although not always) co-regulated - either directly, by being part of the same gene regulatory pathways, or indirectly by being similarly responsive to environmental conditions or other stimuli. In order to build "good" features, the data need to be of good quality, and informative for our purpose. We need expression datasets -

with good coverage;
not much older than ten years (quality!);
with sufficient numbers of replicates;
collected under interesting conditions;
mapped to unique human gene identifiers.

As the result of this task, you should prepare a script that will produce one reference and one experimental feature data set for human genes (from the same experiment).

To avoid mistakes in preparing the dataset, discuss your approach with your team members, or post questions on the mailing list. You are encouraged to discuss strategies with anyone; however, the script you submit must be entirely your own and you must not copy code (apart from the script template) from elsewhere.

Read the entire set of requirements and parameters carefully before you begin.

Select an Expression Data Set

Task:

Navigate to the GEO expression dataset browser and select an expression dataset you will work with.

1 – Choose a dataset of native, healthy human cells or tissue ...
2 – Choose an interesting experiment ...
3 – Make sure the coverage is as complete as possible ...
4 – Choose high-quality experiments ...

Claim the dataset on the dataset signup page of the Student Wiki.

Clean the data and map to HUGO symbols

Task:

Develop your code in an R script that you submit as part of this task.

1 – Download the data ...
2 – Assess ...
3 – Map ...
999 – Clean ...
999 – Impute ...
999 – Average ...
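
For the download step, the revision text asks that the dataset not be re-downloaded on every run of the script. One possible strategy is to cache the downloaded object in a local file and read it back on later runs. The following is a minimal sketch under that assumption: `fetchCached`, `cacheFile` and `fetchFun` are hypothetical names and are not part of GEOquery or of the script template.

```r
# Sketch: run an expensive download only when no local copy exists.
# fetchCached(), cacheFile and fetchFun are illustrative names (assumptions).
fetchCached <- function(cacheFile, fetchFun) {
  if (file.exists(cacheFile)) {
    return(readRDS(cacheFile))     # reuse the local copy
  }
  x <- fetchFun()                  # expensive step, e.g. a GEO download
  saveRDS(x, file = cacheFile)     # store the object for the next run
  x
}

# Possible usage with GEOquery (the accession is a placeholder, not a
# real dataset; replace it with the dataset you claimed):
# gset <- fetchCached("myDataset.rds",
#                     function() GEOquery::getGEO("GSExxxxx", GSEMatrix = TRUE))
```

Passing the download as a function keeps the caching logic independent of GEOquery, so it can be checked without network access. Deleting the cache file forces a fresh download.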

Apply Quantile Normalization (QN)

Task:

999 – Clean ...
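
The page does not prescribe an implementation of quantile normalization. As a hedged illustration of the idea only, the base R sketch below forces every column (sample) of a numeric matrix onto the same distribution: the mean of the sorted columns. The function name `quantileNormalize` is hypothetical, the matrix is assumed to have no missing values, and tied values receive a target interpolated at their average rank, which is a simplification. Bioconductor's preprocessCore package offers `normalize.quantiles()` as an established alternative.

```r
# Sketch: quantile-normalize the columns of a numeric matrix.
# Assumes no missing values and at least two rows.
quantileNormalize <- function(m) {
  stopifnot(is.matrix(m), is.numeric(m), !anyNA(m), nrow(m) > 1)
  sorted <- apply(m, 2, sort)      # sort each column independently
  target <- rowMeans(sorted)       # mean of the k-th smallest values
  out <- apply(m, 2, function(col) {
    r <- rank(col, ties.method = "average")
    # look up the target value at each rank; fractional ranks (ties)
    # are linearly interpolated between neighbouring target values
    approx(seq_along(target), target, xout = r)$y
  })
  dimnames(out) <- dimnames(m)
  out
}

# Small made-up matrix: 4 genes (rows) x 3 samples (columns)
m <- matrix(c(5, 2, 3, 4,
              4, 1, 3, 2,
              3, 4, 6, 8), nrow = 4)
qn <- quantileNormalize(m)
```

After normalization every column contains the same set of values, while the rank order of genes within each sample is unchanged.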

Interpret, and document

Be prepared to answer the following questions

Task:

What are the control and test conditions of the dataset?
Why is the dataset of interest to our systems assessment task?
Were there expression values that were not unique for specific genes? How did you handle these?
Were there expression values that could not be mapped to current HUGO symbols?
How many outliers were removed, how many datapoints were imputed?
How did you handle replicates?
What is the final coverage of your dataset?

Make sure your script contains the complete workflow, is fully commented, and contains all essential elements suggested by the script template^[2]. This is a collaborative project - form matters.

Notes

[1] Note: oral exams will focus on the content of Integrator Units, but will also cover material that leads up to it. All exams in this course are cumulative.
[2] Refer to the script template inst/extdata/scripts/scriptTemplate.R in the _zu_ project repository.

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.

About ...

Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2018-01-26

Modified:

2017-11-01

Version:

1.0

Version history:

1.0 First live version

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.


@@ Line 143: / Line 143: @@
 <div class="mw-collapsible mw-collapsed" data-expandtext="More&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;">
-;999 – Download ...
+;1 – Download the data ...
 <div class="mw-collapsible-content" style="padding:10px;">
-YYY
+Do not work on manually downloaded data, but use the GEOquery Bioconductor package. (Obviously, do not re-download the dataset every time you run the script, but figure out a strategy to download only when necessary.) Sample code is in the R code associated with the '''[[RPR-GEO2R]]''' learning unit.
 </div>
 </div>
 <div class="mw-collapsible mw-collapsed" data-expandtext="More&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;">
-;999 – Assess ...
+;2 – Assess ...
 <div class="mw-collapsible-content" style="padding:10px;">
-YYY
+Compute some overview statistics to assess data quality for the control and test conditions in your dataset.
 </div>
 </div>
 <div class="mw-collapsible mw-collapsed" data-expandtext="More&nbsp;▽" data-collapsetext="Hide&nbsp;△" style="width:67%;border: solid 1px #BBBBBB;padding: 10px;spacing: 10px;">
-;999 – Map ...
+;3 – Map ...
 <div class="mw-collapsible-content" style="padding:10px;">
-YYY
+Your dataset probably does not contain HUGO gene symbols as row identifiers, so you need to map rows to symbols. How? Figure that out in collaboration with your team and the rest of the class. It is crucial that everyone maps to the same gene symbols. '''I have prepared a reference set of symbols in the ''zu'' repository. It is in <tt>inst/extdata</tt> and the corresponding script is in <tt>inst/scripts</tt>.''' But you will need to figure out how to handle unmapped rows (these are likely outdated aliases of current symbols, or possibly deprecated ORFs). You also need to figure out what to do with rows that map to more than one symbol (perhaps the sequenced fragment was not unique, or the microarray spot hybridizes to more than one gene). Finally, you need to figure out what to do with multiple rows that map to the same symbol (these could be splice variants, or probes that are designed as internal controls). In any case: you, your team, and the entire class have to reach a consensus on how these situations will be handled correctly, and your code must implement that consensus. Simply removing these cases from the dataset is not satisfactory, and if your code does not correctly implement the consensus approach you may not receive credit.
 </div>
 </div>
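
The mapping step above leaves the handling of duplicate and unmapped rows to a class consensus. To make the problem concrete, the sketch below shows one possible policy, which is an assumption and not that consensus: rows that map to the same symbol are averaged, and unmapped rows are set aside for follow-up instead of being silently dropped. `collapseBySymbol`, `exprs` and `symbols` are hypothetical names, the gene symbols in the example are placeholders, and rows that map to more than one symbol are not addressed here.

```r
# Sketch of ONE possible policy, not the class consensus.
# exprs:   numeric matrix, one row per probe/ID, one column per sample
# symbols: character vector, one entry per row of exprs; NA = unmapped
collapseBySymbol <- function(exprs, symbols) {
  stopifnot(is.matrix(exprs), nrow(exprs) == length(symbols))
  mapped <- !is.na(symbols)
  # rowsum() adds up rows per symbol; dividing by the number of rows
  # per symbol turns the sums into means
  sums   <- rowsum(exprs[mapped, , drop = FALSE], group = symbols[mapped])
  counts <- as.vector(table(symbols[mapped])[rownames(sums)])
  list(bySymbol = sums / counts,
       unmapped = exprs[!mapped, , drop = FALSE])  # kept for follow-up
}

# Made-up example: probes p1 and p2 map to the same placeholder symbol,
# p4 could not be mapped.
e <- matrix(c(1, 3, 10, 7,
              2, 4, 20, 9), nrow = 4,
            dimnames = list(c("p1", "p2", "p3", "p4"), c("s1", "s2")))
sym <- c("AAA", "AAA", "BBB", NA)
res <- collapseBySymbol(e, sym)
```

Returning the unmapped rows separately makes it easy to report how many rows could not be mapped, and to retry them against aliases or deprecated identifiers.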
@@ Line 167: / Line 167: @@
 ;999 – Clean ...
 <div class="mw-collapsible-content" style="padding:10px;">
-YYY
+There are two considerations you need to go through ...
 </div>
 </div>
