Difference between revisions of "ABC-INT-Categorical features"

Revision as of 02:48, 3 February 2018

Integrator Unit: Categorical Features

(Integrator unit: collect categorical features for human genes)

Abstract:

This page integrates material from the learning units and sets a task: defining and downloading categorical feature sets for human genes.

Deliverables:

Integrator unit: Deliverables can be submitted for course marks. See below for details.

Prerequisites:
This unit builds on material covered in the following prerequisite units:

BIN-FUNC-Annotation (Function Annotation)

Caution!

This unit is under development. There is some content here, but it is incomplete and/or may change significantly: links may lead nowhere, the content is likely to be rearranged, and objectives, deliverables etc. may be incomplete or missing. Do not work with this material until it is updated to "live" status.

Work through the tasks described below.
Note that there are several tasks that need to be coordinated with your teammates and classmates. This is necessary to ensure the feature sets can be merged in the second phase of the course. Be sure to begin this coordination process in time.
Remember to document your work in your journal concurrently with your progress. Journal entries that are uploaded in bulk at the end of your work will not be considered evidence of ongoing engagement. Note that this is a team task, and your contribution to the task must be clearly documented in your journal for evaluation.
Your task will involve submitting documentation on a sub-page of the Teams and Tasks page (details below). This documentation will be jointly authored and I expect every team member to be able to speak to all of it.
Your task will involve submitting code to the zu R package. Ensure that your team's submissions are complete and pass package checks with zero errors, zero warnings and zero notes.
Schedule an oral exam (if you haven't done so already) by editing the signup page on the Student Wiki. You must have signed up for an exam slot before 20:00 on the day before your exam.^[2]
Your work must be complete before 20:00 on the day before your exam.

Most interesting data that describes function in living cells is not numerical but categorical. Moreover, it is data with large numbers of categories: "high cardinality categorical data". Such data is problematic for machine learning for reasons of both principle and practicality. It is sparse, and the many dimensions of noise make overfitting a strong concern, in particular if we do not have very large numbers of examples in our training sets. The data suffers from the "curse of dimensionality", i.e. in a high-dimensional space all examples appear about equally similar to, or different from, each other. In addition, the data structures that hold such data may become impractically large, and model training may take impractically long. In this integrator unit we will download and prepare different types of categorical data, to explore later how feature engineering can optimize it for machine learning tasks.
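To make the sparsity argument concrete, here is a minimal sketch, written in Python purely as illustration. The gene and category names are invented placeholders, and `one_hot_density` is a hypothetical helper, not part of any package. It counts what fraction of cells would be non-zero if every category became its own column in a one-hot encoded table.

```python
def one_hot_density(annotations):
    """Return (rows, columns, fraction of non-zero cells) of a one-hot table.

    annotations: dict mapping a gene symbol to a list of category labels.
    """
    # Every distinct category becomes one column of the encoded table.
    categories = {c for cats in annotations.values() for c in cats}
    n_rows, n_cols = len(annotations), len(categories)
    # A cell is non-zero only where a gene carries that category.
    non_zero = sum(len(set(cats)) for cats in annotations.values())
    return n_rows, n_cols, non_zero / (n_rows * n_cols)


# Toy data: invented placeholder names, for illustration only.
toy = {
    "GENE_A": ["cat1", "cat2"],
    "GENE_B": ["cat3"],
    "GENE_C": ["cat4", "cat5"],
    "GENE_D": ["cat1", "cat6"],
}
rows, cols, density = one_hot_density(toy)
print(rows, cols, round(density, 3))
```

Even with only six categories, fewer than one third of the cells in this toy table are filled. The filled fraction shrinks further as the number of categories grows while the number of annotations per gene stays small.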

Your tasks as a team are

to choose a dataset of interest for human systems biology;
to download the source data;
to transform it into categorical features;
to submit your scripts and tools to the zu package;
to document what you have achieved.

To begin, you need to choose - as a team - one of the following five data sources of categorical functional data:

(1) Graph data mining on STRING

The STRING database publishes a network of gene nodes and edges that represent functional interactions. These integrate various experimental observations and computational inferences, such as protein-protein interactions and literature data mining, or they can be decomposed by individual categories of evidence. A summary score is given as the probability that an edge is functionally relevant. Network data mining for ML features is a very interesting topic in and of itself; here we will simply take the neighbours of a gene as categorical features that describe its environment. Task: using a suitable score cutoff, produce a table of STRING neighbours for each human gene that is defined in our HUGO symbol table. Upload the annotation for the miniHUGOsymbols list to your documentation.

Example row:

TNFRSF4 TNFRSF9|CTLA4|TNFSF4|TRAF5|IL2|IL2RA|FOXP3
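The core of such a pipeline could look roughly like the following sketch (Python, illustration only). It assumes the edges have already been mapped from STRING protein identifiers to HUGO symbols, which is a separate step not shown here, and that scores are on an integer 0 to 1000 scale. The function names, the toy scores and the cutoff of 900 are all invented for the example; choosing a suitable cutoff is part of the task.

```python
from collections import defaultdict


def string_neighbours(edges, cutoff):
    """Collect, for each gene, the neighbours joined by an edge scoring >= cutoff.

    edges: iterable of (symbol_a, symbol_b, score) tuples (assumed input shape).
    """
    neighbours = defaultdict(set)
    for a, b, score in edges:
        if score >= cutoff and a != b:
            # Treat the network as undirected: record both directions.
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {gene: sorted(nb) for gene, nb in neighbours.items()}


def format_row(symbol, values):
    """One table row: symbol, a tab, then '|'-separated feature values."""
    return symbol + "\t" + "|".join(values)


# Toy edge list; the scores are invented for illustration.
toy_edges = [
    ("TNFRSF4", "TNFSF4", 999),
    ("TNFRSF4", "TRAF5", 950),
    ("TNFRSF4", "IL2", 720),
    ("IL2", "IL2RA", 999),
]
table = string_neighbours(toy_edges, cutoff=900)
row = format_row("TNFRSF4", table["TNFRSF4"])
print(row)
```

Using sets makes the result independent of whether the source lists each edge once or in both directions.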

(2) GO and GOA

Gene Ontology Annotations provide the cornerstone of functional annotations for genes. Build a pipeline to annotate each HUGO symbol with the GO terms found in the relevant GOA tables. Task: produce a table of GO terms annotated for each human gene as defined by our HUGO symbol table. Do this separately for the three GO ontologies. Upload the annotation for the miniHUGOsymbols list to your documentation.

Example header and row:

symbol MF  BP  CC
TNFRSF4 GO:0001618|GO:0005031|GO:0005515 GO:0006954|GO:0006955|GO:0007275  GO:0005886|GO:0005887|GO:0009986
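A minimal sketch of the parsing step follows (Python, illustration only). It assumes GAF-style input as we understand the format: tab-separated lines, comment lines starting with "!", the gene symbol in column 3, the qualifier in column 4, the GO ID in column 5 and the aspect code (F, P or C) in column 9. Please verify this against the format documentation of the file you actually download. The helper names and the placeholder ID GO:0000000 are invented; annotations qualified with NOT are skipped.

```python
ASPECTS = {"F": "MF", "P": "BP", "C": "CC"}


def parse_gaf(lines, symbols):
    """Collect GO IDs per symbol, separately for the three ontologies."""
    table = {s: {"MF": set(), "BP": set(), "CC": set()} for s in symbols}
    for line in lines:
        if line.startswith("!"):          # header / comment line
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9:                    # malformed line: skip
            continue
        symbol, qualifier, go_id, aspect = f[2], f[3], f[4], f[8]
        if symbol not in table or "NOT" in qualifier.split("|"):
            continue
        if aspect in ASPECTS:
            table[symbol][ASPECTS[aspect]].add(go_id)
    return table


def go_row(symbol, table):
    """Row in the layout: symbol, MF, BP, CC (tab-separated, IDs joined by '|')."""
    cols = ["|".join(sorted(table[symbol][a])) for a in ("MF", "BP", "CC")]
    return "\t".join([symbol] + cols)


def toy_gaf_line(symbol, qualifier, go_id, aspect):
    """Build a toy line with only the columns used above filled in."""
    fields = ["-"] * 9
    fields[2], fields[3], fields[4], fields[8] = symbol, qualifier, go_id, aspect
    return "\t".join(fields)


toy = [
    "!toy header line",
    toy_gaf_line("TNFRSF4", "enables", "GO:0005031", "F"),
    toy_gaf_line("TNFRSF4", "involved_in", "GO:0006954", "P"),
    toy_gaf_line("TNFRSF4", "located_in", "GO:0005886", "C"),
    toy_gaf_line("TNFRSF4", "NOT|enables", "GO:0000000", "F"),  # placeholder ID
]
table = parse_gaf(toy, ["TNFRSF4"])
row = go_row("TNFRSF4", table)
print(row)
```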

(3) MSigDB sets

The Broad Institute hosts an expert-curated database of gene sets: MSigDB, the Molecular Signatures Database. Task: download the data and build a pipeline to annotate all HUGO gene symbols with all of the gene sets that contain them. Annotate the miniHUGOsymbols list and upload the result to your documentation, then test how the pipeline scales to the full dataset of more than 17,000 gene sets.

Example row:

TNFRSF4 M1739|M5947|M18255|M13664|M1644|M4248
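The task amounts to inverting a "set contains genes" relation into a "gene is in sets" relation. Here is a sketch (Python, illustration only), assuming GMT-style input: one gene set per line, tab-separated, with the set identifier first, a description second and the member symbols after that. The set names SET_A to SET_C and the function name are invented. Whether the first column holds the identifiers used in the example row above depends on the file you download.

```python
from collections import defaultdict


def invert_gmt(lines, symbols):
    """Map each wanted symbol to the sorted list of gene sets that contain it."""
    wanted = set(symbols)
    gene_sets = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:               # need identifier, description, >= 1 gene
            continue
        set_id, members = fields[0], fields[2:]
        # Only touch genes we care about; set() guards against repeated members.
        for gene in set(members) & wanted:
            gene_sets[gene].append(set_id)
    return {gene: sorted(ids) for gene, ids in gene_sets.items()}


# Toy input with invented set names.
toy_gmt = [
    "SET_A\tna\tTNFRSF4\tIL2",
    "SET_B\tna\tIL2RA\tIL2",
    "SET_C\tna\tTNFRSF4",
]
result = invert_gmt(toy_gmt, ["TNFRSF4", "IL2"])
row = "TNFRSF4\t" + "|".join(result["TNFRSF4"])
print(row)
```

The input is read in a single pass, so run time grows with the total number of set members, which is the quantity to watch when scaling up to the full collection.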

(4) Enrichment

A common aspect of systems biology wet-lab experiments is that they produce a set-of-genes result: genes that co-precipitate, genes that are co-regulated, genes that are phosphorylated by the same kinase, etc. Enrichment algorithms ask what such genes have in common, i.e. which features appear more frequently in the set than one would expect in a randomly chosen set of genes. Any type of annotation can be chosen, but existing packages usually use GO annotations. Candidate tools include topGO and other tools in the Gene Set Enrichment biocView^[3]. Task: build a pipeline that takes as input a set of HUGO symbols, such as the sets derived from MSigDB above, and outputs an annotation of enriched GO terms for each of the set elements. Develop this for the miniHUGOsymbols list and a few other gene sets, and upload the results for the miniHUGOsymbols list to your documentation.

Example row:

TNFRSF4 GO:0097190|GO:0051024|GO:0033209|GO:0032496
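To illustrate the question "more frequent than expected in a random set", the following sketch (Python, illustration only) uses a plain hypergeometric over-representation test. This is a stand-in chosen for simplicity; it is not the algorithm of topGO or of any other package, and it applies no multiple-testing correction. Gene names g0 to g9, terms T1 and T2, the function names and the threshold of 0.05 are all invented.

```python
from math import comb


def hypergeom_p(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total


def enriched_terms(gene_set, annotations, alpha=0.05):
    """Return {term: p} for terms over-represented in gene_set.

    annotations: dict symbol -> set of terms; its keys define the universe.
    """
    universe = set(annotations)
    sample = set(gene_set) & universe
    N, n = len(universe), len(sample)
    result = {}
    for term in {t for g in sample for t in annotations[g]}:
        K = sum(1 for g in universe if term in annotations[g])
        k = sum(1 for g in sample if term in annotations[g])
        p = hypergeom_p(k, n, K, N)
        if p <= alpha:
            result[term] = p
    return result


# Toy universe: T2 is on every gene, T1 only on g0, g1 and g2.
annotations = {f"g{i}": {"T2"} for i in range(10)}
for g in ("g0", "g1", "g2"):
    annotations[g].add("T1")

gene_set = ["g0", "g1", "g2"]
enriched = enriched_terms(gene_set, annotations)
# Annotate each set element with the enriched terms it carries.
per_gene = {g: sorted(t for t in annotations[g] if t in enriched) for g in gene_set}
print(enriched, per_gene)
```

T2 is carried by every gene in the universe, so finding it in the set is uninformative and its p-value is 1. T1 is carried by exactly the three genes in the set and is reported as enriched.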

(5) InterPro

InterPro provides rich sequence and domain annotation, and the domain composition of a protein is a categorical feature set. InterPro data is available for download. Task: produce a table of InterPro domains in each human gene as defined by our HUGO symbol table. Upload the annotation for the miniHUGOsymbols list to your documentation.

Example row:

TNFRSF4 IPR034022|IPR001368|IPR001368
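A sketch of the final assembly step follows (Python, illustration only). It assumes the InterPro matches have already been mapped from protein accessions to HUGO symbols and reduced to (symbol, InterPro ID, start position) tuples. That input shape, the function name and the start positions in the toy data are invented. Domains are listed in sequence order and repeats are kept, so that a domain occurring twice in a protein appears twice in the row.

```python
from collections import defaultdict


def domain_composition(matches):
    """Map each symbol to its InterPro IDs, ordered by start position.

    matches: iterable of (symbol, ipr_id, start) tuples (assumed input shape).
    """
    by_gene = defaultdict(list)
    for symbol, ipr_id, start in matches:
        by_gene[symbol].append((start, ipr_id))
    # Sorting the (start, id) pairs orders the domains along the sequence.
    return {s: [ipr for _, ipr in sorted(hits)] for s, hits in by_gene.items()}


# Toy matches; the start positions are invented for illustration.
toy_matches = [
    ("TNFRSF4", "IPR001368", 60),
    ("TNFRSF4", "IPR034022", 20),
    ("TNFRSF4", "IPR001368", 110),
]
domains = domain_composition(toy_matches)
row = "TNFRSF4\t" + "|".join(domains["TNFRSF4"])
print(row)
```

If repeats should instead be collapsed to a set of distinct domains, that is a design decision to document; the two encodings yield different features.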

Do not upload your full datasets to the GitHub repository!

Process details TBC...

Notes

[1] Note: oral exams will focus on the content of Integrator Units, but will also cover material that leads up to it. All exams in this course are cumulative.
[2] For clarification: You sign up for only one oral exam for February.
[3] Note: GSEA (Gene Set Enrichment Analysis) is not the same as gene feature enrichment.

Further reading, links and resources

Taroni & Greene (2017) Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously (BioRχiv doi: https://doi.org/10.1101/118349)

Quantile Normalization is provided in the preprocessCore Bioconductor package:

Bolstad et al. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19:185-93. (pmid: 12538238)


RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR: Bioconductor workflow for RNA-seq differential expression analysis with edgeR.

RNA-seq workflow: gene-level exploratory analysis and differential expression: Bioconductor workflow for RNA-seq differential expression analysis with DESeq2.

HUGO Gene Nomenclature Committee - the authoritative information source for gene symbols. Includes search functions for synonyms, aliases and other information, as well as downloadable data.

Good discussion of current microarray normalization strategies, as well as a proposal how to apply QN to case/control datasets:

Cheng et al. (2016) CrossNorm: a novel normalization strategy for microarray data in cancers. Sci Rep 6:18898. (pmid: 26732145)


Quackenbush's paper is now old, but it remains an often-cited standard reference in the field:

Quackenbush (2002) Microarray data normalization and transformation. Nat Genet 32 Suppl:496-501. (pmid: 12454644)


If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.

About ...

Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2018-02-01

Modified:

2018-02-01

Version:

0.1

Version history:

0.1 New unit under development

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.


@@ Line 65: / Line 65: @@
 <!-- included from "./components/ABC-INT-Categorical_features.components.txt", section: "contents" -->
-Most interesting data that describes function in living cells is not numerical, but categorical. And it is data with large numbers of categories: this is "high cardinality categorical data". Such data is problematic for machine learning for reasons of principle, and practicality. Such data is sparse, and the many dimensions of noise make overfitting of data easy, in particular if we do not have very large numbers of examples in our training sets. The data suffers from the "curse of dimensionality", i.e. all examples look similarly similar or different. And the datastructures that hold such data may become impractically large, and model training may take impractically long times. In this integrator unit we will download and prepare different types of categorical data to explore later how to use feature engineering to optimize it for machine learning tasks.
+Most interesting data that describes function in living cells is not numerical, but categorical. Moreover, it is data with large numbers of categories - "'''high cardinality categorical data'''". Such data is problematic for machine learning for reasons of principle, and practicality. Such data is sparse, and the many dimensions of noise make overfitting of data a strong concern, in particular if we do not have very large numbers of examples in our training sets. The data suffers from the "curse of dimensionality", i.e. all examples look similarly similar or different. And the datastructures that hold such data may become impractically large, and model training may take impractically long. In this integrator unit we will download and prepare different types of categorical data to explore later how to use feature engineering to optimize it for machine learning tasks.
 Your tasks as a team are
 * to choose a dataset of interest for human systems biology;
-* to download the souyrce data;
+* to download the source data;
 * to transform it into categorical features;
 * to submit your scripts and tools to the ''zu'' package;
@@ Line 77: / Line 77: @@
-==== (1) Graph data mining on STRING====
+=== (1) Graph data mining on STRING===
 {{Smallvspace}}
-The [https://string-db.org/ '''STRING database'''] publishes a network of gene nodes and edges that represent functional interactions: these integrate various experimental observations and computational inference, such as protein-protein interactions and literature data mining - or they can be decomposed by individual data items. A summary score is given as a probability of an edge to be functionally relevant. Network data mining for ML features is a very interesting topic in and of itself, here we will simply take the neighbours of a gene as categorical features that describe its environment. Task: using a suitable score cutoff, produce a table of STRING neigbours for each human gene that is defined in our HUGO symbol table. Upload the annotation for the <tt>miniHUGOsymbols</tt> list to your documentation. Example row:
+The [https://string-db.org/ '''STRING database'''] publishes a network of gene nodes and edges that represent functional interactions: these integrate various experimental observations and computational inference, such as protein-protein interactions and literature data mining - or they can be decomposed by individual categories of evidence. A summary score is given as a probability of an edge to be functionally relevant. Network data mining for ML features is a very interesting topic in and of itself, here we will simply take the neighbours of a gene as categorical features that describe its environment. Task: using a suitable score cutoff, produce a table of STRING neigbours for each human gene that is defined in our HUGO symbol table. Upload the annotation for the <tt>miniHUGOsymbols</tt> list to your documentation.
+Example row:
   TNFRSF4 TNFRSF9|CTLA4|TNFSF4|TRAF5|IL2|IL2RA|FOXP3
@@ Line 86: / Line 88: @@
 {{Smallvspace}}
-==== (2) GO and GOA====
+=== (2) GO and GOA===
 {{Smallvspace}}
-Gene Ontology Annotations provide the cornerstone of functional annotations for genes. Build a pipeline to annotate each HUGO symbol with the GO terms found in the relevant GOA tables. Task: produce a table of GO terms annotated for each human gene as defined by our HUGO symbol table. Do this separately for the three GO ontologies. Upload the annotation for the <tt>miniHUGOsymbols</tt> list to your documentation. Example header and row:
+Gene Ontology Annotations provide the cornerstone of functional annotations for genes. Build a pipeline to annotate each HUGO symbol with the GO terms found in the relevant GOA tables. Task: produce a table of GO terms annotated for each human gene as defined by our HUGO symbol table. Do this separately for the three GO ontologies. Upload the annotation for the <tt>miniHUGOsymbols</tt> list to your documentation.
+Example header and row:
   symbol MF  BP  CC
@@ Line 95: / Line 99: @@
 {{Smallvspace}}
-==== (3) MSigDB sets====
+=== (3) MSigDB sets===
 {{Smallvspace}}
 The Broad Institute hosts an expert-curated database of gene sets: [http://software.broadinstitute.org/gsea/msigdb/collections.jsp '''MSigDB''' - the Molecular Signature Database]. Task: download the data and build a pipeline to annotate all HUGO gene symbols with all of the gene sets that contain them. Annotate the <tt>miniHUGOsymbols</tt> list and upload that to your documentation and test how the pipeline scales to the full dataset of more than 17,000 gene sets.
 Example row:
@@ Line 104: / Line 109: @@
 {{Smallvspace}}
-==== (4) Enrichment====
+=== (4) Enrichment===
 {{Smallvspace}}
 A common aspect of systems biology wet-lab experiments is that they produce a ''set-of-genes'' result: genes that co-precipitate, genes that are co-regulated, genes that are phosporylated by the same kinase, etc. etc. Enrichment algorithms ask: what do such genes have in common, i.e. what feature appears more frequently in the set than one would expect in a randomly chosen set of genes. Any type of annotation can be chosen, but existing packages usually use GO annotations. Candidate tools include [http://bioconductor.org/packages/release/bioc/html/topGO.html topGO], and other tools in the [http://bioconductor.org/packages/release/BiocViews.html#___GeneSetEnrichment '''Gene Set Enrichment''' biocView]<ref>Note GSEA (Gene Set Enrichment Analysis) is '''not''' the same as gene feature enrichment.</ref>. Task: build a pipeline that takes as input a set of HUGO symbols - such as the sets derived from the MSigDB above, and outputs an annotation of enriched GO terms for each of the set elements. Develop this for the <tt>miniHUGOsymbols</tt> list and a few other gene sets and upload the results for the <tt>miniHUGOsymbols</tt> list to your documentation.
 Example row:
@@ Line 113: / Line 119: @@
 {{Smallvspace}}
-==== (5) Interpro====
+=== (5) Interpro===
 {{Smallvspace}}
 [https://www.ebi.ac.uk/interpro/protein/P43489 '''InterPro''' provides] rich sequence and domain annotation - and the domain composition of a protein is a categorical feature set. [https://www.ebi.ac.uk/interpro/download.html '''Download'''] of InterPro data is available. Task: produce a table of InterPro domains in each human gene as defined by our HUGO symbol table. Upload the annotation for the <tt>miniHUGOsymbols</tt> list to your documentation. Example row:
@@ Line 123: / Line 129: @@
 {{Vspace}}
-<div class="note">'''Do not upload your full datasets to the Github repository!'''</ref>
+<div class="note">'''Do not upload your full datasets to the Github repository!'''</div>
 {{Vspace}}
