CSB Assignment Week 2
Revision as of 01:18, 22 January 2013
Assignments for Week 2
Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.
Exercises for this week relate to this week's lecture.
Pre-reading for this week will prepare next week's lectures.
Exercises and pre-reading will both be topics on next week's quiz.
Exercises
In this set of exercises we dive into practical work with GO: at first via the AmiGO browser, and then via Bioconductor.
AmiGO
AmiGO is a GO browser developed by the Gene Ontology consortium and hosted on their website.
AmiGO - Gene products
Task:
- Navigate to the GO homepage.
- Enter SOX2 into the search box to initiate a search for the human SOX2 transcription factor (WP, HUGO) (as gene or protein name).
- There are a number of hits in various organisms: sulfhydryl oxidases and (sex determining region Y)-box genes. Check to see the various ways by which you could filter and restrict the results.
- Select Homo sapiens as the species filter and set the filter. Note that this still does not give you a unique hit, but ...
- ... you can identify the Transcription factor SOX-2 and follow its gene product information link. Study the information on that page.
- Later, we will need Entrez Gene IDs. The GOA information page provides these as GeneID in the External references section. Note the SOX2 GeneID down. With the same approach, find and record the Gene IDs of (a) the functionally related Oct4 (POU5F1) protein, (b) the human cell-cycle transcription factor E2F1, (c) the human bone morphogenetic protein-4 transforming growth factor BMP4, (d) the human UDP glucuronosyltransferase 1 family protein 1, UGT1A1, an enzyme that is differentially expressed in some cancers, and (e) as a positive control, SOX2's interaction partner NANOG, which we would expect to be annotated as functionally similar to both Oct4 and SOX2.
AmiGO - Associations
GO annotations for a protein are called associations.
Task:
- Open the associations information page for the human SOX2 protein via the link in the right column in a separate tab. Study the information on that page.
- Note that you can filter the associations by ontology and evidence code. You have read about the three GO ontologies in your previous assignment, but you should also be familiar with the evidence codes. Click on any of the evidence links to access the Evidence code definition page and study the definitions of the codes. Make sure you understand which codes point to experimental observation, which codes denote computational inference, and which say that the evidence is someone's opinion (TAS, IC etc.). Note: it is good practice, but regrettably not a universally implemented standard, to clearly document database semantics and keep definitions associated with database entries easily accessible, as GO is doing here. You won't find this everywhere, but as a user please feel encouraged to complain to the database providers if you come across a database where the semantics are not clear. Seriously: opaque semantics make database annotations useless.
- There are many associations (around 60) and a good way to select which ones to pursue is to follow the most specific ones. Set IDA as a filter and among the returned terms select GO:0035019 (somatic stem cell maintenance) in the Biological Process ontology. Follow that link.
- Study the information available on that page and through the tabs on the page, especially the graph view.
- In the Inferred Tree View tab, find the genes annotated to this GO term for Homo sapiens. There should be about 55. Click on the number behind the term. The resulting page will give you all human proteins that have been annotated with this particular term. Note that the great majority of these are annotated via the IEA evidence code.
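The distinction between evidence-code groups drawn in the task above can be made concrete with a small base-R sketch. The grouping follows the categories on the GO evidence code definition page as far as we are aware, but it is not exhaustive, so check it against the current definitions. The `assoc` table is hypothetical illustration data, not real AmiGO output, and `classify_evidence()` is a helper name invented here.

```r
# Evidence-code groups (assumed to match the GO definition page;
# not exhaustive, verify against the current documentation).
evidence_groups <- list(
  experimental  = c("EXP", "IDA", "IPI", "IMP", "IGI", "IEP"),
  computational = c("ISS", "ISO", "ISA", "ISM", "IGC", "IBA",
                    "IBD", "IKR", "IRD", "RCA"),
  author        = c("TAS", "NAS"),
  curator       = c("IC", "ND"),
  electronic    = c("IEA")
)

# Map a vector of evidence codes to group names; unknown codes give NA.
classify_evidence <- function(codes) {
  lookup <- unlist(lapply(names(evidence_groups), function(g) {
    setNames(rep(g, length(evidence_groups[[g]])), evidence_groups[[g]])
  }))
  unname(lookup[codes])
}

# Hypothetical associations, for illustration only.
assoc <- data.frame(
  term     = c("term_A", "term_B", "term_C", "term_D"),
  evidence = c("IDA", "IEA", "TAS", "ISS"),
  stringsAsFactors = FALSE
)
assoc$group <- classify_evidence(assoc$evidence)

# Keep only associations backed by experimental observation.
experimental_only <- assoc[assoc$group == "experimental", ]
print(assoc)
print(experimental_only)
```

Filtering on a group rather than on a single code is one way to ask "which annotations rest on an experiment?" without enumerating codes by hand each time.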
Semantic similarity
A good, recent overview of ontology-based functional annotation is found in the following article. This is not a formal reading assignment, but do familiarize yourself with section 3: Derivation of Semantic Similarity between Terms in an Ontology as an introduction to the code-based annotations below.
Gan et al. (2013) From ontology to semantic similarity: calculation of ontology-based semantic similarity. ScientificWorldJournal 2013:793091. (pmid: 23533360)
Advances in high-throughput experimental techniques in the past decade have enabled the explosive increase of omics data, while effective organization, interpretation, and exchange of these data require standard and controlled vocabularies in the domain of biological and biomedical studies. Ontologies, as abstract description systems for domain-specific knowledge composition, hence receive more and more attention in computational biology and bioinformatics. Particularly, many applications relying on domain ontologies require quantitative measures of relationships between terms in the ontologies, making it indispensable to develop computational methods for the derivation of ontology-based semantic similarity between terms. Nevertheless, with a variety of methods available, how to choose a suitable method for a specific application becomes a problem. With this understanding, we review a majority of existing methods that rely on ontologies to calculate semantic similarity between terms. We classify existing methods into five categories: methods based on semantic distance, methods based on information content, methods based on properties of terms, methods based on ontology hierarchy, and hybrid methods. We summarize characteristics of each category, with emphasis on basic notions, advantages and disadvantages of these methods. Further, we extend our review to software tools implementing these methods and applications using these methods.
The Bioconductor project hosts the GOSemSim package for semantic similarity.
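To get a feeling for what a graph-based term similarity computes, here is a minimal base-R sketch in the spirit of the measure by Wang et al., one of the measures GOSemSim offers. It is not the GOSemSim implementation. The ontology is a hypothetical six-term toy DAG, every edge is treated as an is_a edge, and the edge weight of 0.8 is an assumption (the value usually quoted for is_a relationships). The function names `s_values()` and `wang_sim()` are invented for this sketch.

```r
# Toy ontology: each term lists its direct parents (hypothetical).
#   root -> A -> C
#        -> A -> D
#        -> B -> E
parents <- list(
  root = character(0),
  A = "root", B = "root",
  C = "A", D = "A", E = "B"
)

# S-values: contribution of each ancestor to the semantics of `term`.
# The term itself scores 1; each step towards the root multiplies by w,
# and the best (maximum) path is kept if several paths exist.
s_values <- function(term, parents, w = 0.8) {
  s <- setNames(1, term)
  frontier <- term
  while (length(frontier) > 0) {
    next_frontier <- character(0)
    for (node in frontier) {
      for (p in parents[[node]]) {
        candidate <- w * s[[node]]
        if (is.na(s[p]) || candidate > s[[p]]) {
          s[p] <- candidate
          next_frontier <- c(next_frontier, p)
        }
      }
    }
    frontier <- unique(next_frontier)
  }
  s
}

# Similarity: shared S-values relative to the total S-values of both terms.
wang_sim <- function(a, b, parents, w = 0.8) {
  sa <- s_values(a, parents, w)
  sb <- s_values(b, parents, w)
  common <- intersect(names(sa), names(sb))
  sum(sa[common] + sb[common]) / (sum(sa) + sum(sb))
}

print(wang_sim("C", "C", parents))  # identical terms
print(wang_sim("C", "D", parents))  # siblings sharing parent A
print(wang_sim("C", "E", parents))  # terms sharing only the root
```

Siblings under a common parent score higher than terms that share only the root, which is the behaviour to keep in mind when interpreting the numbers returned in the exercise below.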
Task:
- Work through the following R-code. If you have problems, discuss them on the mailing list. Don't go through the code mechanically but make sure you are clear about what it does.
# GOsemanticSimilarity.R
# GO semantic similarity example
# B. Steipe for BCB420, January 2014
setwd("~/your-R-project-directory")
# GOSemSim is an R-package in the bioconductor project. It is not installed via
# the usual install.packages() command (via CRAN) but via an installation script
# that is run from the bioconductor Website.
source("http://bioconductor.org/biocLite.R")
biocLite("GOSemSim")
library(GOSemSim)
# This loads the library and starts the Bioconductor environment.
# You can get an overview of functions by executing ...
browseVignettes()
# ... which will open a listing in your Web browser. Open the
# introduction to GOSemSim PDF. As the introduction suggests,
# now is a good time to execute ...
help(GOSemSim)
# The simplest function is to measure the semantic similarity of two GO
# terms. For example, SOX2 was annotated with GO:0035019 (somatic stem cell
# maintenance), QSOX2 was annotated with GO:0045454 (cell redox homeostasis),
# and Oct4 (POU5F1) with GO:0009786 (regulation of asymmetric cell division),
# among other associations. Let's calculate these similarities.
goSim("GO:0035019", "GO:0009786", ont="BP", measure="Wang")
goSim("GO:0035019", "GO:0045454", ont="BP", measure="Wang")
# Fair enough. Two numbers. Clearly we would appreciate an idea of the values
# that high similarity and low similarity can take. But in any case -
# we are really less interested in the similarity of GO terms - these
# are a function of how the Ontology was constructed. We are more
# interested in the functional similarity of our genes, and these
# have a number of GO terms associated with them.
# GOSemSim provides the functions ...
?geneSim()
?mgeneSim()
# ... to compute these values. Refer to the vignette for details, in
# particular, consider how multiple GO terms are combined, and how to
# keep/drop evidence codes.
# Here is a pairwise similarity example: the gene IDs are the ones you
# have recorded previously. Note that this will download a package
# of GO annotations - you might not want to do this on a low-bandwidth
# connection.
geneSim("6657", "5460", ont = "BP", measure="Wang", combine = "BMA")
# Another number. And the list of GO terms that were considered.
# Your task: use the mgeneSim() function to calculate the similarities
# between all six proteins for which you have recorded the GeneIDs
# previously (SOX2, POU5F1, E2F1, BMP4, UGT1A1 and NANOG) in the
# biological process ontology.
# This will run for some time. On my machine, half an hour or so.
# Do the results correspond to your expectations?
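The script above asks you to consider how multiple GO terms are combined into one gene-level score. As a hedged illustration, the following base-R sketch shows one common reading of the best-match average (BMA) strategy: average the best match of every term of gene 1 against gene 2 together with the best match of every term of gene 2 against gene 1. The similarity matrix is made up for illustration, and `bma()` is a helper written here, not a GOSemSim function. Consult the vignette for the package's exact definition.

```r
# Best-match average over a term-by-term similarity matrix.
# Rows: GO terms of gene 1; columns: GO terms of gene 2.
bma <- function(sim_matrix) {
  row_best <- apply(sim_matrix, 1, max, na.rm = TRUE)
  col_best <- apply(sim_matrix, 2, max, na.rm = TRUE)
  (sum(row_best) + sum(col_best)) / (length(row_best) + length(col_best))
}

# Hypothetical term similarities (not computed from real annotations).
term_sim <- matrix(
  c(0.9, 0.2, 0.1,
    0.3, 0.7, 0.4),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("g1_t1", "g1_t2"), c("g2_t1", "g2_t2", "g2_t3"))
)

print(bma(term_sim))   # best-match average
print(max(term_sim))   # "max" strategy: single best pair
print(mean(term_sim))  # "avg" strategy: mean over all pairs
```

Comparing the three numbers shows why the choice matters: the maximum is driven by one well-matching pair of terms, the plain average is pulled down by the many unrelated pairs, and the best-match average sits in between.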
Pre-reading
In week 3, we will discuss various aspects of working with genome-scale data sets. For many experimental approaches, the ultimate outcome is a list of genes and the challenge is how to infer information from what such lists have in common:
Kim (2012) Chapter 8: Biological knowledge assembly and interpretation. PLoS Comput Biol 8:e1002858. (pmid: 23300429)
Most methods for large-scale gene expression microarray and RNA-Seq data analysis are designed to determine the lists of genes or gene products that show distinct patterns and/or significant differences. The most challenging and rate-limiting step, however, is to determine what the resulting lists of genes and/or transcripts biologically mean. Biomedical ontology and pathway-based functional enrichment analysis is widely used to interpret the functional role of tightly correlated or differentially expressed genes. The groups of genes are assigned to the associated biological annotations using Gene Ontology terms or biological pathways and then tested if they are significantly enriched with the corresponding annotations. Unlike previous approaches, Gene Set Enrichment Analysis takes quite the reverse approach by using pre-defined gene sets. Differential co-expression analysis determines the degree of co-expression difference of paired gene sets across different conditions. Outcomes in DNA microarray and RNA-Seq data can be transformed into the graphical structure that represents biological semantics. A number of biomedical annotation and external repositories including clinical resources can be systematically integrated by biological semantics within the framework of concept lattice analysis. This array of methods for biological knowledge assembly and interpretation has been developed during the past decade and clearly improved our biological understanding of large-scale genomic data from the high-throughput technologies.