Difference between revisions of "CSB Ontologies"
m (→Exercises) |
|||
Line 107: | Line 107: | ||
==Exercises== | ==Exercises== | ||
<section begin=exercises /> | <section begin=exercises /> | ||
− | + | ===AmiGO=== | |
+ | |||
+ | [http://amigo.geneontology.org/cgi-bin/amigo/go.cgi '''AmiGO'''] is a [http://www.geneontology.org/ '''GO'''] browser developed by the Gene Ontology consortium and hosted on their website. | ||
+ | |||
+ | ====Gene products==== | ||
+ | {{task|1= | ||
+ | # Navigate to the [http://www.geneontology.org/ '''GO'''] homepage. | ||
+ | # Enter <code>E2F1</code> into the search box to initiate a search for the human {{WP|E2F1}} transcription factor. | ||
+ | # The number of hits is not very large, but check to see the various ways by which you could filter and restrict the results. | ||
+ | # Open the gene product information page for the human protein via the [http://amigo.geneontology.org/cgi-bin/amigo/gp-details.cgi?gp=UniProtKB:Q01094 link in the left column] in a separate tab. Study the information on that page and note down the UniprotKB accession number. | ||
+ | # With the same approach, find and record the UniprotKB ID's (''a'') of the functionally related [http://www.yeastgenome.org/cgi-bin/locus.fpl?locus=mbp1 yeast '''MBP1''' protein], (''b'') as a negative control, the functionally unrelated human {{WP|Myelin basic protein|'''MBP''' (myelin basic protein)}}, and (''c'') as a positive control, E2F1's human interaction partner TFDP1, which we would expect to be annotated as functionally similar to both E2F1 and MBP1. | ||
+ | }} | ||
+ | |||
+ | |||
+ | ====Associations==== | ||
+ | GO annotations for a protein are called ''associations''. | ||
+ | |||
+ | {{task|1= | ||
+ | # Open the ''associations'' information page for the human E2F1 protein via the [http://amigo.geneontology.org/cgi-bin/amigo/gp-assoc.cgi?gp=UniProtKB:Q01094 link in the right column] in a separate tab. Study the information on that page. | ||
+ | # Note that you can filter the associations by ontology and evidence code. You have read about the three GO ontologies in your previous assignment, but you should also be familiar with the evidence codes. Click on any of the evidence links to access the Evidence code definition page and study the [http://www.geneontology.org/GO.evidence.shtml definitions of the codes]. '''Make sure you understand which codes point to experimental observation, and which codes denote computational inference, or say that the evidence is someone's opinion (TAS, IC ''etc''.).''' <small>Note: it is good practice - but regrettably not a universally implemented standard - to clearly document database semantics and keep definitions associated with database entries easily accessible, as GO is doing here. You won't find this everywhere, but as a user, please feel encouraged to complain to the database providers if you come across a database where the semantics are not clear. Seriously: opaque semantics make database annotations useless.</small> | ||
+ | # One of the ''most specific'' associated terms on the page is for <code>GO:0000085</code> - the [http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0000085&session_id=1379amigo1358806334 G2 phase of the mitotic cell cycle] in the '''Biological Process''' ontology. Follow that link. | ||
+ | # Study the information available on that page. Look at the information available through the tabs on the page, especially the graph view. Then see how you can filter the gene product counts for the various levels of the hierarchy by species. Restrict the lineage to <code>H. sapiens</code>. | ||
+ | # Click on [http://amigo.geneontology.org/cgi-bin/amigo/term-assoc.cgi?term=GO:0051319&speciesdb=all&taxid=9606 the number behind the '''Is_a''' relationship of the G2 phase]. The resulting page will give you all human proteins that have been annotated with this particular term. | ||
+ | }} | ||
+ | |||
+ | |||
+ | ===Semantic similarity=== | ||
+ | |||
+ | A good overview of semantic similarity measures is found in the following article. This is not a formal reading assignment, but download the article, browse over it and familiarize yourself with the measures that are discussed in the ''background'' and ''topology based clustering'' sections. | ||
+ | |||
+ | {{#pmid:21078182}} | ||
+ | |||
+ | |||
+ | GO tools and resources are curated by the [http://neurolex.org/wiki/Category:Resource:Gene_Ontology_Tools "Neuroscience lexicon"]. | ||
+ | |||
+ | |||
+ | {{task|1= | ||
+ | # Navigate to the [http://neurolex.org/wiki/Category:Resource:Gene_Ontology_Tools '''Neurolex''' Gene Ontology tools resource]. | ||
+ | # Scroll to the section on "Semantic Similarity", and on the way, familiarize yourself vaguely with the wealth of available tools. <small>''nb.'' - the list is not even complete, e.g. most '''R''' and '''Bioconductor''' packages are not included here.</small> | ||
+ | # This list is in a bit of a sorry state. Most of the tools noted there do not compute online semantic similarities for proteins. The ones that do, mostly don't work. The one that does is not directly linked from the list, has its own stated parameters only incompletely supported. But give it a try anyway: navigate to [http://lasige.di.fc.ul.pt/webtools/proteinon/ Proteinon] | ||
+ | # Select "compute protein semantic similarity", use "Measure: simGIC" and "GO type: Biological process". Check to ''ignore IEA'' (you remember what these are, right?). Enter your four UniProt IDs in the correct format (comma separated) and '''run''' the computation. | ||
+ | # Interpret the similarity score table. Does it correspond to your expectations? | ||
+ | |||
+ | |||
+ | }} | ||
+ | ; | ||
:'''A: Gene identifiers''' | :'''A: Gene identifiers''' | ||
Line 146: | Line 191: | ||
#Contrast this with the ontology for human MBP, specifically the axon ensheathment process. | #Contrast this with the ontology for human MBP, specifically the axon ensheathment process. | ||
<section end=exercises /> | <section end=exercises /> | ||
− | |||
− | |||
==References== | ==References== |
Revision as of 00:29, 22 January 2013
Ontologies for Computational Systems Biology
Poorly structured data can be integrated via ontologies. This is especially important for phenotype and "function" data. The primary example is the Gene Ontology (GO). Other examples include the Disease Ontology, OMIM and WikiGene.
Introduction
Harris (2008) Developing an ontology. Methods Mol Biol 452:111-24. (pmid: 18563371)
Hackenberg & Matthiesen (2010) Algorithms and methods for correlating experimental results with annotation databases. Methods Mol Biol 593:315-40. (pmid: 19957156)
GO
The Gene Ontology project is the most influential contributor to the definition of function in computational biology and the use of GO terms and GO annotations is ubiquitous.
GO: the Gene Ontology project
du Plessis et al. (2011) The what, where, how and why of gene ontology--a primer for bioinformaticians. Brief Bioinformatics 12:723-35. (pmid: 21330331)
The GO actually comprises three separate ontologies:
- Molecular function
- ...
- Biological Process
- ...
- Cellular component
- ...
GO terms
GO terms comprise the core of the information in the ontology: a carefully crafted definition of a term in any of GO's separate ontologies.
GO relationships
The nature of the relationships is as much a part of the ontology as the terms themselves. GO uses three categories of relationships:
- is a
- part of
- regulates
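The three relationship types can be pictured as labelled edges in a directed graph. The following is only a minimal sketch under stated assumptions: the term names (prefixed `T:`) and the edges between them are hypothetical placeholders, not real GO terms, and the choice to follow only `is_a` and `part_of` when collecting ancestors is one common convention rather than a rule stated on this page.

```python
from collections import defaultdict

# Hypothetical toy edges: (child term, relationship, parent term).
# Real GO identifiers look like "GO:0000085"; these names are made up.
EDGES = [
    ("T:specific_phase", "is_a", "T:phase"),
    ("T:phase", "part_of", "T:cycle"),
    ("T:cycle", "is_a", "T:process"),
    ("T:phase_regulator", "regulates", "T:phase"),
]


def build_graph(edges):
    """Map each term to its list of (relationship, parent) pairs."""
    graph = defaultdict(list)
    for child, relationship, parent in edges:
        graph[child].append((relationship, parent))
    return graph


def ancestors(graph, term, follow=("is_a", "part_of")):
    """Collect all terms reachable from `term` via the given relationships."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relationship, parent in graph.get(current, []):
            if relationship in follow and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


graph = build_graph(EDGES)
print(sorted(ancestors(graph, "T:specific_phase")))
```

Whether `regulates` edges should be followed depends on the question being asked: a regulator of a process is not itself an instance or a part of that process, which is why the sketch leaves them out by default.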
GO annotations
The GO terms are conceptual in nature, and while they represent our interpretation of biological phenomena, they do not intrinsically represent biological objects, such as specific genes or proteins. In order to link molecules with these concepts, the ontology is used to annotate genes. The annotation project is referred to as GOA.
Dimmer et al. (2007) Methods for gene ontology annotation. Methods Mol Biol 406:495-520. (pmid: 18287709)
GO evidence codes
Annotations can be made according to literature data or computational inference and it is important to note how an annotation has been justified by the curator to evaluate the level of trust we should have in the annotation. GO uses evidence codes to make this process transparent. When computing with the ontology, we may want to filter (exclude) particular terms in order to avoid tautologies: for example if we were to infer functional relationships between homologous genes, we should exclude annotations that have been based on the same inference or similar, and compute only with the actual experimental data.
The following evidence codes are in current use; if you want to exclude inferred annotations you would restrict the codes you use to the experimental ones: EXP, IDA, IPI, IMP, IEP, and perhaps IGI, although the interpretation of genetic interactions can require assumptions.
- Automatically-assigned Evidence Codes
  - IEA: Inferred from Electronic Annotation
- Curator-assigned Evidence Codes
  - Experimental Evidence Codes
    - EXP: Inferred from Experiment
    - IDA: Inferred from Direct Assay
    - IPI: Inferred from Physical Interaction
    - IMP: Inferred from Mutant Phenotype
    - IGI: Inferred from Genetic Interaction
    - IEP: Inferred from Expression Pattern
  - Computational Analysis Evidence Codes
    - ISS: Inferred from Sequence or Structural Similarity
    - ISO: Inferred from Sequence Orthology
    - ISA: Inferred from Sequence Alignment
    - ISM: Inferred from Sequence Model
    - IGC: Inferred from Genomic Context
    - IBA: Inferred from Biological aspect of Ancestor
    - IBD: Inferred from Biological aspect of Descendant
    - IKR: Inferred from Key Residues
    - IRD: Inferred from Rapid Divergence
    - RCA: Inferred from Reviewed Computational Analysis
  - Author Statement Evidence Codes
    - TAS: Traceable Author Statement
    - NAS: Non-traceable Author Statement
  - Curator Statement Evidence Codes
    - IC: Inferred by Curator
    - ND: No biological Data available
For further details, see the Guide to GO Evidence Codes and the GO Evidence Code Decision Tree.
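Filtering by evidence code before computing with annotations could look roughly like the sketch below. It assumes annotations are available as simple (gene product, term, evidence code) tuples; that layout, the function name `filter_by_evidence` and the example records (`geneA`, `T:1` and so on) are all hypothetical and only serve to illustrate the filter.

```python
# Experimental evidence codes as listed above; IGI is kept separate because
# interpreting genetic interactions can require additional assumptions.
EXPERIMENTAL_STRICT = frozenset({"EXP", "IDA", "IPI", "IMP", "IEP"})
EXPERIMENTAL = EXPERIMENTAL_STRICT | {"IGI"}


def filter_by_evidence(annotations, allowed=EXPERIMENTAL):
    """Keep only annotations whose evidence code is in `allowed`."""
    return [a for a in annotations if a[2] in allowed]


# Hypothetical records: (gene product, term, evidence code).
ANNOTATIONS = [
    ("geneA", "T:1", "IDA"),
    ("geneA", "T:2", "IEA"),
    ("geneB", "T:1", "ISS"),
    ("geneB", "T:3", "IGI"),
    ("geneC", "T:2", "TAS"),
]

for record in filter_by_evidence(ANNOTATIONS):
    print(record)
```

Passing `allowed=EXPERIMENTAL_STRICT` additionally drops the IGI record, which is the more conservative choice when inferring functional relationships between homologous genes.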
GO tools
For many projects, the simplest approach will be to download the GO ontology itself. It is a well constructed, easily parseable file that is well suited for computation. For details, see Computing with GO on this wiki.
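As a rough illustration of how little code such parsing needs, here is a minimal sketch that assumes the ontology is distributed in the OBO flat-file format, with `[Term]` stanzas of `tag: value` lines. The sample text is written for this example and modelled on the two terms used in the exercises; it is not a verbatim excerpt of the real file, and the function name `parse_obo_terms` is hypothetical. Only the `id`, `name`, `namespace` and `is_a` tags are handled.

```python
# Illustrative stanzas written for this sketch, not copied from the GO file.
SAMPLE = """\
format-version: 1.2

[Term]
id: GO:0000085
name: G2 phase of mitotic cell cycle
namespace: biological_process
is_a: GO:0051319 ! G2 phase

[Term]
id: GO:0051319
name: G2 phase
namespace: biological_process

[Typedef]
id: part_of
name: part of
"""


def parse_obo_terms(text):
    """Return {term id: {"id", "name", "namespace", "is_a": [...]}}."""
    terms = {}
    current = None  # the [Term] stanza being filled, or None outside one
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            current = {"is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "id":
            current["id"] = value
            terms[value] = current
        elif key == "is_a":
            # Drop the trailing "! human readable name" comment.
            current["is_a"].append(value.split("!")[0].strip())
        elif key in ("name", "namespace"):
            current[key] = value
    return terms


terms = parse_obo_terms(SAMPLE)
for term_id, term in terms.items():
    print(term_id, term["name"], term["is_a"])
```

For real work an established parser is probably the safer choice; the point here is only that the stanza structure maps directly onto a dictionary.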
Introductory reading
Exercises
AmiGO
AmiGO is a GO browser developed by the Gene Ontology consortium and hosted on their website.
Gene products
Task:
- Navigate to the GO homepage.
- Enter E2F1 into the search box to initiate a search for the human E2F1 transcription factor.
- The number of hits is not very large, but check to see the various ways by which you could filter and restrict the results.
- Open the gene product information page for the human protein via the link in the left column in a separate tab. Study the information on that page and note down the UniProtKB accession number.
- With the same approach, find and record the UniProtKB IDs (a) of the functionally related yeast MBP1 protein, (b) as a negative control, the functionally unrelated human MBP (myelin basic protein), and (c) as a positive control, E2F1's human interaction partner TFDP1, which we would expect to be annotated as functionally similar to both E2F1 and MBP1.
Associations
GO annotations for a protein are called associations.
Task:
- Open the associations information page for the human E2F1 protein via the link in the right column in a separate tab. Study the information on that page.
- Note that you can filter the associations by ontology and evidence code. You have read about the three GO ontologies in your previous assignment, but you should also be familiar with the evidence codes. Click on any of the evidence links to access the Evidence code definition page and study the definitions of the codes. Make sure you understand which codes point to experimental observation, and which codes denote computational inference, or say that the evidence is someone's opinion (TAS, IC etc.). Note: it is good practice - but regrettably not a universally implemented standard - to clearly document database semantics and keep definitions associated with database entries easily accessible, as GO is doing here. You won't find this everywhere, but as a user, please feel encouraged to complain to the database providers if you come across a database where the semantics are not clear. Seriously: opaque semantics make database annotations useless.
- One of the most specific associated terms on the page is for GO:0000085 - the G2 phase of the mitotic cell cycle in the Biological Process ontology. Follow that link.
- Study the information available on that page. Look at the information available through the tabs on the page, especially the graph view. Then see how you can filter the gene product counts for the various levels of the hierarchy by species. Restrict the lineage to H. sapiens.
- Click on the number behind the Is_a relationship of the G2 phase (http://amigo.geneontology.org/cgi-bin/amigo/term-assoc.cgi?term=GO:0051319&speciesdb=all&taxid=9606). The resulting page will give you all human proteins that have been annotated with this particular term.
Semantic similarity
A good overview of semantic similarity measures is found in the following article. This is not a formal reading assignment, but download the article, browse through it and familiarize yourself with the measures that are discussed in the background and topology based clustering sections.
Jain & Bader (2010) An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology. BMC Bioinformatics 11:562. (pmid: 21078182)
GO tools and resources are curated by the "Neuroscience lexicon".
Task:
- Navigate to the Neurolex Gene Ontology tools resource.
- Scroll to the section on "Semantic Similarity", and on the way, familiarize yourself vaguely with the wealth of available tools. nb. - the list is not even complete, e.g. most R and Bioconductor packages are not included here.
- This list is in a bit of a sorry state. Most of the tools listed there do not compute semantic similarities for proteins online, and most of those that do, don't work. The one that does work is not directly linked from the list and only incompletely supports its own stated parameters. But give it a try anyway: navigate to ProteInOn.
- Select "compute protein semantic similarity", use "Measure: simGIC" and "GO type: Biological process". Check to ignore IEA (you remember what these are, right?). Enter your four UniProt IDs in the correct format (comma separated) and run the computation.
- Interpret the similarity score table. Does it correspond to your expectations?
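To make the score table easier to interpret, it helps to see what a simGIC-style measure computes: the summed information content (IC) of the terms two proteins share, divided by the summed IC of all terms annotated to either protein. The sketch below is a minimal illustration under stated assumptions, not ProteInOn's implementation. The term names and IC values are hypothetical, and each term set is assumed to already include the ancestors of the directly annotated terms.

```python
def sim_gic(terms_a, terms_b, ic):
    """simGIC-style score: IC of shared terms over IC of all terms."""
    shared = terms_a & terms_b
    union = terms_a | terms_b
    denominator = sum(ic[t] for t in union)
    if denominator == 0:
        return 0.0  # no informative terms on either side
    return sum(ic[t] for t in shared) / denominator


# Hypothetical information content values; in practice IC is usually
# derived from how rarely a term is used in annotations (rarer = higher).
IC = {"T:root": 0.0, "T:cycle": 1.0, "T:g1s": 3.0, "T:myelin": 4.0}

# Hypothetical term sets, ancestors included.
PROTEIN_A = {"T:root", "T:cycle", "T:g1s"}
PROTEIN_B = {"T:root", "T:cycle"}
PROTEIN_C = {"T:root", "T:myelin"}

print(sim_gic(PROTEIN_A, PROTEIN_B, IC))  # shares one informative term
print(sim_gic(PROTEIN_A, PROTEIN_C, IC))  # shares only the root
```

Because the root term carries no information, two proteins that share nothing but the root score zero, which is the behaviour one would hope for from a negative control.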
- A: Gene identifiers
- Navigate to the Saccharomyces Genome Database and search for the gene name mbp1 using the search box. Review the information available on the result page. Find, and note down the UniProt ID.
- For comparison, review the gene information of the functionally related human E2F1 transcription factor at the NCBI. Here too, find, and note down the UniProt ID.
- To compare functional similarity, find the IDs of a protein of related, and of unrelated function in UniProt.
- Find the UniProt ID of E2F1's human interaction partner TFDP1, which we would expect to be annotated as functionally similar to both E2F1 and MBP1;
- also find the UniProt ID of human MBP (myelin basic protein), which is functionally unrelated.
- B: Semantic similarity scores
Next, we compute the semantic similarity of these genes. The GO database lists a number of tools for this task (http://www.geneontology.org/GO.tools_by_type.semantic_similarity.shtml).
- Navigate to the ProteInOn site at Lisbon University in Portugal - the online tool to compute GO-based semantic similarity that was discussed in last week's reading assignment. Select "compute protein semantic similarity", use "Measure: simGIC" and "GO type: Biological process". Enter your four UniProt IDs in the correct format and run the computation.
- Interpret the similarity score table. Does it correspond to your expectations?
- C: Graphical view of the ontology
Finally, we'll use the GO's AmiGO browser to compare the genes graphically.
- Navigate to the AmiGO search interface, select "genes or proteins" and enter MBP1. Filter the results by the correct species and restrict the results to the biological process ontology.
- This should return the GO annotation page for the yeast Mbp1 protein. Follow the "5 term associations" in the header bar.
- Click on "view in tree" for the GO term GO:0000083.
- This shows you the ontology of the term in text form, including the number of genes annotated to each term. In the right hand box you should find a link that you can follow for a graphical view.
- In a separate window, repeat the process for human E2F1 (choose the most specific term, i.e. the one that refers to the gene's role in the G1/S transition - GO:0000082).
- Roughly compare the two ontologies.
- Contrast this with the ontology for human MBP, specifically the axon ensheathment process.
References
Further reading and resources
Alvarez & Yan (2011) A graph-based semantic similarity measure for the gene ontology. J Bioinform Comput Biol 9:681-95. (pmid: 22084008)
Schriml et al. (2012) Disease Ontology: a backbone for disease semantic integration. Nucleic Acids Res 40:D940-6. (pmid: 22080554)
Evelo et al. (2011) Answering biological questions: querying a systems biology database for nutrigenomics. Genes Nutr 6:81-7. (pmid: 21437033)
Jain & Bader (2010) An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology. BMC Bioinformatics 11:562. (pmid: 21078182)
Oti et al. (2009) The biological coherence of human phenome databases. Am J Hum Genet 85:801-8. (pmid: 20004759)
Bastos et al. (2011) Application of gene ontology to gene identification. Methods Mol Biol 760:141-57. (pmid: 21779995)
Groth et al. (2007) PhenomicDB: a new cross-species genotype/phenotype resource. Nucleic Acids Res 35:D696-9. (pmid: 16982638)
Gene Ontology Consortium (2010) The Gene Ontology in 2010: extensions and refinements. Nucleic Acids Res 38:D331-5. (pmid: 19920128)
Gene Ontology Consortium (2012) The Gene Ontology: enhancements for 2011. Nucleic Acids Res 40:D559-64. (pmid: 22102568)
Sauro & Bergmann (2008) Standards and ontologies in computational systems biology. Essays Biochem 45:211-22. (pmid: 18793134)