Difference between revisions of "CSB Ontologies"

From "A B C"
Jump to navigation Jump to search
Line 120: Line 120:
 
}}
 
}}
  
====Associations====
+
====AmiGO - Associations====
 
GO annotations for a protein are called ''associations''.
 
GO annotations for a protein are called ''associations''.
  
 
{{task|1=
 
{{task|1=
 
# Open the ''associations'' information page for the human E2F1 protein via the [http://amigo.geneontology.org/cgi-bin/amigo/gp-assoc.cgi?gp=UniProtKB:Q01094 link in the right column] in a separate tab. Study the information on that page.
 
# Open the ''associations'' information page for the human E2F1 protein via the [http://amigo.geneontology.org/cgi-bin/amigo/gp-assoc.cgi?gp=UniProtKB:Q01094 link in the right column] in a separate tab. Study the information on that page.
# Note that you can filter the associations by ontology and evidence code. You have read about the three GO ontologies in your previous assignment, but you should also be familiar with the evidence codes. Click on any of the evidence links to access the Evidence code definition page and study the [http://www.geneontology.org/GO.evidence.shtml definitions of the codes]. '''Make sure you understand which codes point to experimental observation, and which codes denote computational inference, or say that the evidence is someone's opinion (TAS, IC ''etc''.).''' <small>Note: it is good practice - but regrettably not universally implemented standard - to clearly document database semantics and keep definitions associated with database entries easily accesible, as GO is doing here. You won't find this everywhere, but as a user, please feel encouraged to complain to the database providers if you come across a database where the semantics are not clear. Seriously: opaque semantics make database annotations useless.</small>   
+
# Note that you can filter the associations by ontology and evidence code. You have read about the three GO ontologies in your previous assignment, but you should also be familiar with the evidence codes. Click on any of the evidence links to access the Evidence code definition page and study the [http://www.geneontology.org/GO.evidence.shtml definitions of the codes]. '''Make sure you understand which codes point to experimental observation, and which codes denote computational inference, or say that the evidence is someone's opinion (TAS, IC ''etc''.).''' <small>Note: it is good practice - but regrettably not universally implemented standard - to clearly document database semantics and keep definitions associated with database entries easily accesible, as GO is doing here. You won't find this everywhere, but as a user please feel encouraged to complain to the database providers if you come across a database where the semantics are not clear. Seriously: opaque semantics make database annotations useless.</small>   
 
# One of the ''most specific'' associated terms on the page is for <code>GO:0000085</code> - the [http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0000085&session_id=1379amigo1358806334 G2 phase of the mitotic cell cycle] in the '''Biological Process''' ontology. Follow that link.
 
# One of the ''most specific'' associated terms on the page is for <code>GO:0000085</code> - the [http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0000085&session_id=1379amigo1358806334 G2 phase of the mitotic cell cycle] in the '''Biological Process''' ontology. Follow that link.
 
# Study the information available on that page. Look at the information available through the tabs on the page, especially the graph view. Then see how you can filter the gene product counts for the various levels of the hierarchy by species. Restrict the lineage to <code>H. sapiens</code>.
 
# Study the information available on that page. Look at the information available through the tabs on the page, especially the graph view. Then see how you can filter the gene product counts for the various levels of the hierarchy by species. Restrict the lineage to <code>H. sapiens</code>.
# Click on [http://amigo.geneontology.org/cgi-bin/amigo/term-assoc.cgi?term=GO:0051319&speciesdb=all&taxid=9606 the number behind the '''Is_a''' relationship of the G2 phase. The resulting page will give you all human proteins that have been annotated with this particular term.
+
# Click on [http://amigo.geneontology.org/cgi-bin/amigo/term-assoc.cgi?term=GO:0051319&speciesdb=all&taxid=9606 the number behind the '''Is_a''' relationship of the G2 phase]. The resulting page will give you all human proteins that have been annotated with this particular term.
 
}}
 
}}
  
Line 144: Line 144:
 
{{task|1=
 
{{task|1=
 
# Navigate to the [http://neurolex.org/wiki/Category:Resource:Gene_Ontology_Tools '''Neurolex''' Gene Ontology tools resource].
 
# Navigate to the [http://neurolex.org/wiki/Category:Resource:Gene_Ontology_Tools '''Neurolex''' Gene Ontology tools resource].
# Scroll to the section on "Semantic Similarity", and on the way, familiarize yourself vaguely with the wealth of available tools. <small>''nb.'' - the list is not even complete, e.g. most '''R''' and '''bioconductor''' packages are not included here.
+
# Scroll to the section on "Semantic Similarity", and on the way, familiarize yourself vaguely with the wealth of available tools. <small>''nb.'' - the list is not even complete, e.g. most '''R''' and '''bioconductor''' packages are not included here.</small>
# This list is in a bit of a sorry state. Most of the tools noted there do not compute online semantic similarities for proteins. The ones that do, mostly don't work. The one that does is not directly linked from the list, has its own stated parameters only incompletely supported. But give it a try anyway: navigate to [http://lasige.di.fc.ul.pt/webtools/proteinon/ Proteinon]
+
# This list is in a bit of a sorry state. Most of the tools noted there do not compute online semantic similarities for proteins. The ones that do, mostly don't work. The one that does is not directly linked from the list, has its own stated parameters only incompletely supported. But give it a try anyway: navigate to [http://lasige.di.fc.ul.pt/webtools/proteinon/ ProteInOn]
# Select "compute protein semantic similarity", use "Measure: simGIC" and "GO type: Biological process". Check to ''ignore IEA'' (you remember what these are, right?). Enter your four UniProt IDs in the correct format (comma separated) and '''run''' the computation.
 
# Interpret the similarity score table. Does it correspond to your expectations?
 
 
 
 
 
}}
 
;
 
:'''A: Gene identifiers'''
 
 
 
 
 
# Navigate to the [http://www.yeastgenome.org/ ''Saccharomyces'' Genome Database] and search for the gene name '''mbp1''' using the search box. Review the information available on the result page. Find, and note down the UniProt ID.
 
# For comparison, review the gene information of the functionally related human [http://www.ncbi.nlm.nih.gov/gene/1869 E2F1 transcription factor] at the NCBI. Here too, find, and note down the UniProt ID.
 
# To compare functional similarity, find the IDs of a protein of related, and of unrelated function in Uniprot.
 
## Find the UniProt ID of E2F1's human interaction partner TFDP1, which we would expect to be annotated as functionally similar to both E2F1 and MBP1;
 
## also find the UniProt ID  of human MBP (myelin basic protein), which is functionally unrelated.
 
 
 
 
<!--
 
<!--
 
UniProt IDS:
 
UniProt IDS:
Line 169: Line 154:
 
   P39678, Q01094, Q14186, P02686
 
   P39678, Q01094, Q14186, P02686
 
-->
 
-->
:'''B: Semantic similarity scores'''
+
# Select "compute protein semantic similarity", use "Measure: simGIC" and "GO type: Biological process". Check to ''ignore IEA'' (you remember what these are, right?). Enter your four UniProt IDs in the correct format (comma separated) and '''run''' the computation.
 
 
 
 
Next, we compute the semantic similarity of these two genes. The GO database lists a number of tools for this task (http://www.geneontology.org/GO.tools_by_type.semantic_similarity.shtml).
 
 
 
# Navigate to the [http://xldb.di.fc.ul.pt/tools/proteinon/ ProteInOn] site at Lisbon University in Portugal - the online tool to compute GO-based semantic similarity that was discussed in last weeks reading assignment. Select "compute protein semantic similarity", use "Measure: simGIC" and "GO type: Biological process". Enter your four UniProt IDs in the correct format and '''run''' the computation.
 
 
# Interpret the similarity score table. Does it correspond to your expectations?
 
# Interpret the similarity score table. Does it correspond to your expectations?
  
  
:'''C: Graphical view of the ontology'''
+
}}
 
 
 
 
Finally, we'll use the GO's AmiGO browser to compare the genes graphically.
 
# Navigate to the AmiGO search interface, select "genes or proteins" and enter MBP1. Filter the results by the correct species and restrict the results to the biological process ontology.
 
#This should return the GO annotation page for the yeast Mbp1 protein. Follw the "5 term associations" in the header bar.
 
#Click on "view in tree" for the GO term GO:0000083.
 
#This shows you the ontology of the term in text form, including the number of genes annotated to each term. In the right hand box you should find a link that you can follow for a graphical view.
 
#In a separate window, repeat the process for human E2F1 (choose the most specific term, i.e. the one that refers to the gene's role in the G1/S transition - GO:0000082).
 
#Roughly compare the two ontologies.
 
#Contrast this with the ontology for human MBP, specifically the axon ensheathment process.
 
 
<section end=exercises />
 
<section end=exercises />
  

Revision as of 01:00, 22 January 2013

Ontologies for Computational Systems Biology


This page is a placeholder, or under current development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


Poorly structured data can be integrated via ontologies. This is especially important for phenotype and "function" data. The primary example is the Gene Ontology (GO). Other examples include the Disease Ontology, OMIM and WikiGene.



Introduction

Harris (2008) Developing an ontology. Methods Mol Biol 452:111-24. (pmid: 18563371)

PubMed ] [ DOI ] In recent years, biological ontologies have emerged as a means of representing and organizing biological concepts, enabling biologists, bioinformaticians, and others to derive meaning from large datasets.This chapter provides an overview of formal principles and practical considerations of ontology construction and application. Ontology development concepts are illustrated using examples drawn from the Gene Ontology (GO) and other OBO ontologies.

Hackenberg & Matthiesen (2010) Algorithms and methods for correlating experimental results with annotation databases. Methods Mol Biol 593:315-40. (pmid: 19957156)

PubMed ] [ DOI ] An important procedure in biomedical research is the detection of genes that are differentially expressed under pathologic conditions. These genes, or at least a subset of them, are key biomarkers and are thought to be important to describe and understand the analyzed biological system (the pathology) at a molecular level. To obtain this understanding, it is indispensable to link those genes to biological knowledge stored in databases. Ontological analysis is nowadays a standard procedure to analyze large gene lists. By detecting enriched and depleted gene properties and functions, important insights on the biological system can be obtained. In this chapter, we will give a brief survey of the general layout of the methods used in an ontological analysis and of the most important tools that have been developed.

GO

The Gene Ontology project is the most influential contributor to the definition of function in computational biology and the use of GO terms and GO annotations is ubiquitous.

GO: the Gene Ontology project


link ] [ page ] Ontologies are important tools to organize and compute with non-standardized information, such as gene annotations. The Gene Ontology project (GO) constructs ontologies for gene and gene product attributes across numerous species. Three major ontologies are being developed: molecular process, biological function and cellular location. Each includes terms, their definition, and their relationships. In addition, genes and gene products are being been annotated with their GO terms and the type of evidence that underlies the annotation. A number of tools such as the AmiGO browser are available to analyse relationships, construct ontologies and curate annotations. Data can be freely downloaded in formats that are convenient for computation.
size=200px
du Plessis et al. (2011) The what, where, how and why of gene ontology--a primer for bioinformaticians. Brief Bioinformatics 12:723-35. (pmid: 21330331)

PubMed ] [ DOI ] With high-throughput technologies providing vast amounts of data, it has become more important to provide systematic, quality annotations. The Gene Ontology (GO) project is the largest resource for cataloguing gene function. Nonetheless, its use is not yet ubiquitous and is still fraught with pitfalls. In this review, we provide a short primer to the GO for bioinformaticians. We summarize important aspects of the structure of the ontology, describe sources and types of functional annotations, survey measures of GO annotation similarity, review typical uses of GO and discuss other important considerations pertaining to the use of GO in bioinformatics applications.

The GO actually comprises three separate ontologies:

Molecular function
...


Biological Process
...


Cellular component
...


GO terms

GO terms comprise the core of the information in the ontology: a carefully crafted definition of a term in any of GO's separate ontologies.


GO relationships

The nature of the relationships is as much a part of the ontology as the terms themselves. GO uses three categories of relationships:

  • is a
  • part of
  • regulates


GO annotations

The GO terms are conceptual in nature, and while they represent our interpretation of biological phenomena, they do not intrinsically represent biological objects, such a specific genes or proteins. In order to link molecules with these concepts, the ontology is used to annotate genes. The annotation project is referred to as GOA.

Dimmer et al. (2007) Methods for gene ontology annotation. Methods Mol Biol 406:495-520. (pmid: 18287709)

PubMed ] [ DOI ] The Gene Ontology (GO) is an established dynamic and structured vocabulary that has been successfully used in gene and protein annotation. Designed by biologists to improve data integration, GO attempts to replace the multiple nomenclatures used by specialised and large biological knowledgebases. This chapter describes the methods used by groups to create new GO annotations and how users can apply publicly available GO annotations to enhance their datasets.


GO evidence codes

Annotations can be made according to literature data or computational inference and it is important to note how an annotation has been justified by the curator to evaluate the level of trust we should have in the annotation. GO uses evidence codes to make this process transparent. When computing with the ontology, we may want to filter (exclude) particular terms in order to avoid tautologies: for example if we were to infer functional relationships between homologous genes, we should exclude annotations that have been based on the same inference or similar, and compute only with the actual experimental data.

The following evidence codes are in current use; if you want to exclude inferred anotations you would restrict the codes you use to the ones shown in bold: EXP, IDA, IPI, IMP, IEP, and perhaps IGI, although the interpretation of genetic interactions can require assumptions.

Automatically-assigned Evidence Codes
  • IEA: Inferred from Electronic Annotation
Curator-assigned Evidence Codes
  • Experimental Evidence Codes
    • EXP: Inferred from Experiment
    • IDA: Inferred from Direct Assay
    • IPI: Inferred from Physical Interaction
    • IMP: Inferred from Mutant Phenotype
    • IGI: Inferred from Genetic Interaction
    • IEP: Inferred from Expression Pattern
  • Computational Analysis Evidence Codes
    • ISS: Inferred from Sequence or Structural Similarity
    • ISO: Inferred from Sequence Orthology
    • ISA: Inferred from Sequence Alignment
    • ISM: Inferred from Sequence Model
    • IGC: Inferred from Genomic Context
    • IBA: Inferred from Biological aspect of Ancestor
    • IBD: Inferred from Biological aspect of Descendant
    • IKR: Inferred from Key Residues
    • IRD: Inferred from Rapid Divergence
    • RCA: inferred from Reviewed Computational Analysis
  • Author Statement Evidence Codes
    • TAS: Traceable Author Statement
    • NAS: Non-traceable Author Statement
  • Curator Statement Evidence Codes
    • IC: Inferred by Curator
    • ND: No biological Data available

For further details, see the Guide to GO Evidence Codes and the GO Evidence Code Decision Tree.


 

GO tools

For many projects, the simplest approach will be to download the GO ontology itself. It is a well constructed, easily parseable file that is well suited for computation. For details, see Computing with GO on this wiki.

Introductory reading



Exercises

AmiGO

AmiGO is a GO browser developed by the Gene Ontology consortium and hosted on their website.

AmiGO - Gene products

Task:

  1. Navigate to the GO homepage.
  2. Enter E2F1 into the search box to initiate a search for the human E2F1 transcription factor.
  3. The number of hits is not very large, but check to see the various ways by which you could filter and restrict the results.
  4. Open the gene product information page for the human protein via the link in the left column in a separate tab. Study the information on that page and note down the UniprotKB accession number.
  5. With the same approach, find and record the UniprotKB ID's (a) of the functionally related yeast MBP1 protein, (b) as a negative control, the functionally unrelated human MBP (myelin basic protein), and (c) as a positive control, E2F1's human interaction partner TFDP1, which we would expect to be annotated as functionally similar to both E2F1 and MBP1.

AmiGO - Associations

GO annotations for a protein are called associations.

Task:

  1. Open the associations information page for the human E2F1 protein via the link in the right column in a separate tab. Study the information on that page.
  2. Note that you can filter the associations by ontology and evidence code. You have read about the three GO ontologies in your previous assignment, but you should also be familiar with the evidence codes. Click on any of the evidence links to access the Evidence code definition page and study the definitions of the codes. Make sure you understand which codes point to experimental observation, and which codes denote computational inference, or say that the evidence is someone's opinion (TAS, IC etc.). Note: it is good practice - but regrettably not universally implemented standard - to clearly document database semantics and keep definitions associated with database entries easily accesible, as GO is doing here. You won't find this everywhere, but as a user please feel encouraged to complain to the database providers if you come across a database where the semantics are not clear. Seriously: opaque semantics make database annotations useless.
  3. One of the most specific associated terms on the page is for GO:0000085 - the G2 phase of the mitotic cell cycle in the Biological Process ontology. Follow that link.
  4. Study the information available on that page. Look at the information available through the tabs on the page, especially the graph view. Then see how you can filter the gene product counts for the various levels of the hierarchy by species. Restrict the lineage to H. sapiens.
  5. Click on the number behind the Is_a relationship of the G2 phase. The resulting page will give you all human proteins that have been annotated with this particular term.


Semantic similarity

A good overview of semantic similarity measures is found in the following article. This is not a formal reading assignment, but download the article, browse over it and familiarize yourself with the measures that are discussed in the background and topology based clustering sections.

Jain & Bader (2010) An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology. BMC Bioinformatics 11:562. (pmid: 21078182)

PubMed ] [ DOI ] BACKGROUND: Semantic similarity measures are useful to assess the physiological relevance of protein-protein interactions (PPIs). They quantify similarity between proteins based on their function using annotation systems like the Gene Ontology (GO). Proteins that interact in the cell are likely to be in similar locations or involved in similar biological processes compared to proteins that do not interact. Thus the more semantically similar the gene function annotations are among the interacting proteins, more likely the interaction is physiologically relevant. However, most semantic similarity measures used for PPI confidence assessment do not consider the unequal depth of term hierarchies in different classes of cellular location, molecular function, and biological process ontologies of GO and thus may over-or under-estimate similarity. RESULTS: We describe an improved algorithm, Topological Clustering Semantic Similarity (TCSS), to compute semantic similarity between GO terms annotated to proteins in interaction datasets. Our algorithm, considers unequal depth of biological knowledge representation in different branches of the GO graph. The central idea is to divide the GO graph into sub-graphs and score PPIs higher if participating proteins belong to the same sub-graph as compared to if they belong to different sub-graphs. CONCLUSIONS: The TCSS algorithm performs better than other semantic similarity measurement techniques that we evaluated in terms of their performance on distinguishing true from false protein interactions, and correlation with gene expression and protein families. We show an average improvement of 4.6 times the F1 score over Resnik, the next best method, on our Saccharomyces cerevisiae PPI dataset and 2 times on our Homo sapiens PPI dataset using cellular component, biological process and molecular function GO annotations.


GO tools and resources are curated by the "Neuroscience lexicon".


Task:

  1. Navigate to the Neurolex Gene Ontology tools resource.
  2. Scroll to the section on "Semantic Similarity", and on the way, familiarize yourself vaguely with the wealth of available tools. nb. - the list is not even complete, e.g. most R and bioconductor packages are not included here.
  3. This list is in a bit of a sorry state. Most of the tools noted there do not compute online semantic similarities for proteins. The ones that do, mostly don't work. The one that does is not directly linked from the list, has its own stated parameters only incompletely supported. But give it a try anyway: navigate to ProteInOn
  4. Select "compute protein semantic similarity", use "Measure: simGIC" and "GO type: Biological process". Check to ignore IEA (you remember what these are, right?). Enter your four UniProt IDs in the correct format (comma separated) and run the computation.
  5. Interpret the similarity score table. Does it correspond to your expectations?


References


Further reading and resources

Alvarez & Yan (2011) A graph-based semantic similarity measure for the gene ontology. J Bioinform Comput Biol 9:681-95. (pmid: 22084008)

PubMed ] [ DOI ] Existing methods for calculating semantic similarities between pairs of Gene Ontology (GO) terms and gene products often rely on external databases like Gene Ontology Annotation (GOA) that annotate gene products using the GO terms. This dependency leads to some limitations in real applications. Here, we present a semantic similarity algorithm (SSA), that relies exclusively on the GO. When calculating the semantic similarity between a pair of input GO terms, SSA takes into account the shortest path between them, the depth of their nearest common ancestor, and a novel similarity score calculated between the definitions of the involved GO terms. In our work, we use SSA to calculate semantic similarities between pairs of proteins by combining pairwise semantic similarities between the GO terms that annotate the involved proteins. The reliability of SSA was evaluated by comparing the resulting semantic similarities between proteins with the functional similarities between proteins derived from expert annotations or sequence similarity. Comparisons with existing state-of-the-art methods showed that SSA is highly competitive with the other methods. SSA provides a reliable measure for semantics similarity independent of external databases of functional-annotation observations.

Schriml et al. (2012) Disease Ontology: a backbone for disease semantic integration. Nucleic Acids Res 40:D940-6. (pmid: 22080554)

PubMed ] [ DOI ] The Disease Ontology (DO) database (http://disease-ontology.org) represents a comprehensive knowledge base of 8043 inherited, developmental and acquired human diseases (DO version 3, revision 2510). The DO web browser has been designed for speed, efficiency and robustness through the use of a graph database. Full-text contextual searching functionality using Lucene allows the querying of name, synonym, definition, DOID and cross-reference (xrefs) with complex Boolean search strings. The DO semantically integrates disease and medical vocabularies through extensive cross mapping and integration of MeSH, ICD, NCI's thesaurus, SNOMED CT and OMIM disease-specific terms and identifiers. The DO is utilized for disease annotation by major biomedical databases (e.g. Array Express, NIF, IEDB), as a standard representation of human disease in biomedical ontologies (e.g. IDO, Cell line ontology, NIFSTD ontology, Experimental Factor Ontology, Influenza Ontology), and as an ontological cross mappings resource between DO, MeSH and OMIM (e.g. GeneWiki). The DO project (http://diseaseontology.sf.net) has been incorporated into open source tools (e.g. Gene Answers, FunDO) to connect gene and disease biomedical data through the lens of human disease. The next iteration of the DO web browser will integrate DO's extended relations and logical definition representation along with these biomedical resource cross-mappings.

Evelo et al. (2011) Answering biological questions: querying a systems biology database for nutrigenomics. Genes Nutr 6:81-7. (pmid: 21437033)

PubMed ] [ DOI ] The requirement of systems biology for connecting different levels of biological research leads directly to a need for integrating vast amounts of diverse information in general and of omics data in particular. The nutritional phenotype database addresses this challenge for nutrigenomics. A particularly urgent objective in coping with the data avalanche is making biologically meaningful information accessible to the researcher. This contribution describes how we intend to meet this objective with the nutritional phenotype database. We outline relevant parts of the system architecture, describe the kinds of data managed by it, and show how the system can support retrieval of biologically meaningful information by means of ontologies, full-text queries, and structured queries. Our contribution points out critical points, describes several technical hurdles. It demonstrates how pathway analysis can improve queries and comparisons for nutrition studies. Finally, three directions for future research are given.

Jain & Bader (2010) An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology. BMC Bioinformatics 11:562. (pmid: 21078182)

PubMed ] [ DOI ] BACKGROUND: Semantic similarity measures are useful to assess the physiological relevance of protein-protein interactions (PPIs). They quantify similarity between proteins based on their function using annotation systems like the Gene Ontology (GO). Proteins that interact in the cell are likely to be in similar locations or involved in similar biological processes compared to proteins that do not interact. Thus the more semantically similar the gene function annotations are among the interacting proteins, more likely the interaction is physiologically relevant. However, most semantic similarity measures used for PPI confidence assessment do not consider the unequal depth of term hierarchies in different classes of cellular location, molecular function, and biological process ontologies of GO and thus may over-or under-estimate similarity. RESULTS: We describe an improved algorithm, Topological Clustering Semantic Similarity (TCSS), to compute semantic similarity between GO terms annotated to proteins in interaction datasets. Our algorithm, considers unequal depth of biological knowledge representation in different branches of the GO graph. The central idea is to divide the GO graph into sub-graphs and score PPIs higher if participating proteins belong to the same sub-graph as compared to if they belong to different sub-graphs. CONCLUSIONS: The TCSS algorithm performs better than other semantic similarity measurement techniques that we evaluated in terms of their performance on distinguishing true from false protein interactions, and correlation with gene expression and protein families. We show an average improvement of 4.6 times the F1 score over Resnik, the next best method, on our Saccharomyces cerevisiae PPI dataset and 2 times on our Homo sapiens PPI dataset using cellular component, biological process and molecular function GO annotations.

Oti et al. (2009) The biological coherence of human phenome databases. Am J Hum Genet 85:801-8. (pmid: 20004759)

PubMed ] [ DOI ] Disease networks are increasingly explored as a complement to networks centered around interactions between genes and proteins. The quality of disease networks is heavily dependent on the amount and quality of phenotype information in phenotype databases of human genetic diseases. We explored which aspects of phenotype database architecture and content best reflect the underlying biology of disease. We used the OMIM-based HPO, Orphanet, and POSSUM phenotype databases for this purpose and devised a biological coherence score based on the sharing of gene ontology annotation to investigate the degree to which phenotype similarity in these databases reflects related pathobiology. Our analyses support the notion that a fine-grained phenotype ontology enhances the accuracy of phenome representation. In addition, we find that the OMIM database that is most used by the human genetics community is heavily underannotated. We show that this problem can easily be overcome by simply adding data available in the POSSUM database to improve OMIM phenotype representations in the HPO. Also, we find that the use of feature frequency estimates--currently implemented only in the Orphanet database--significantly improves the quality of the phenome representation. Our data suggest that there is much to be gained by improving human phenome databases and that some of the measures needed to achieve this are relatively easy to implement. More generally, we propose that curation and more systematic annotation of human phenome databases can greatly improve the power of the phenotype for genetic disease analysis.

Bastos et al. (2011) Application of gene ontology to gene identification. Methods Mol Biol 760:141-57. (pmid: 21779995)

PubMed ] [ DOI ] Candidate gene identification deals with associating genes to underlying biological phenomena, such as diseases and specific disorders. It has been shown that classes of diseases with similar phenotypes are caused by functionally related genes. Currently, a fair amount of knowledge about the functional characterization can be found across several public databases; however, functional descriptors can be ambiguous, domain specific, and context dependent. In order to cope with these issues, the Gene Ontology (GO) project developed a bio-ontology of broad scope and wide applicability. Thus, the structured and controlled vocabulary of terms provided by the GO project describing the biological roles of gene products can be very helpful in candidate gene identification approaches. The method presented here uses GO annotation data in order to identify the most meaningful functional aspects occurring in a given set of related gene products. The method measures this meaningfulness by calculating an e-value based on the frequency of annotation of each GO term in the set of gene products versus the total frequency of annotation. Then after selecting a GO term related to the underlying biological phenomena being studied, the method uses semantic similarity to rank the given gene products that are annotated to the term. This enables the user to further narrow down the list of gene products and identify those that are more likely of interest.

Groth et al. (2007) PhenomicDB: a new cross-species genotype/phenotype resource. Nucleic Acids Res 35:D696-9. (pmid: 16982638)

PubMed ] [ DOI ] Phenotypes are an important subject of biomedical research for which many repositories have already been created. Most of these databases are either dedicated to a single species or to a single disease of interest. With the advent of technologies to generate phenotypes in a high-throughput manner, not only is the volume of phenotype data growing fast but also the need to organize these data in more useful ways. We have created PhenomicDB (freely available at http://www.phenomicdb.de), a multi-species genotype/phenotype database, which shows phenotypes associated with their corresponding genes and grouped by gene orthologies across a variety of species. We have enhanced PhenomicDB recently by additionally incorporating quantitative and descriptive RNA interference (RNAi) screening data, by enabling the usage of phenotype ontology terms and by providing information on assays and cell lines. We envision that integration of classical phenotypes with high-throughput data will bring new momentum and insights to our understanding. Modern analysis tools under development may help exploiting this wealth of information to transform it into knowledge and, eventually, into novel therapeutic approaches.

Gene Ontology Consortium (2010) The Gene Ontology in 2010: extensions and refinements. Nucleic Acids Res 38:D331-5. (pmid: 19920128)

PubMed ] [ DOI ] The Gene Ontology (GO) Consortium (http://www.geneontology.org) (GOC) continues to develop, maintain and use a set of structured, controlled vocabularies for the annotation of genes, gene products and sequences. The GO ontologies are expanding both in content and in structure. Several new relationship types have been introduced and used, along with existing relationships, to create links between and within the GO domains. These improve the representation of biology, facilitate querying, and allow GO developers to systematically check for and correct inconsistencies within the GO. Gene product annotation using GO continues to increase both in the number of total annotations and in species coverage. GO tools, such as OBO-Edit, an ontology-editing tool, and AmiGO, the GOC ontology browser, have seen major improvements in functionality, speed and ease of use.

Gene Ontology Consortium (2012) The Gene Ontology: enhancements for 2011. Nucleic Acids Res 40:D559-64. (pmid: 22102568)

PubMed ] [ DOI ] The Gene Ontology (GO) (http://www.geneontology.org) is a community bioinformatics resource that represents gene product function through the use of structured, controlled vocabularies. The number of GO annotations of gene products has increased due to curation efforts among GO Consortium (GOC) groups, including focused literature-based annotation and ortholog-based functional inference. The GO ontologies continue to expand and improve as a result of targeted ontology development, including the introduction of computable logical definitions and development of new tools for the streamlined addition of terms to the ontology. The GOC continues to support its user community through the use of e-mail lists, social media and web-based resources.

Sauro & Bergmann (2008) Standards and ontologies in computational systems biology. Essays Biochem 45:211-22. (pmid: 18793134)

PubMed ] [ DOI ] With the growing importance of computational models in systems biology there has been much interest in recent years to develop standard model interchange languages that permit biologists to easily exchange models between different software tools. In the present chapter two chief model exchange standards, SBML (Systems Biology Markup Language) and CellML are described. In addition, other related features including visual layout initiatives, ontologies and best practices for model annotation are discussed. Software tools such as developer libraries and basic editing tools are also introduced, together with a discussion on the future of modelling languages and visualization tools in systems biology.