Expected Preparations:

  • BIN-FUNC-GO
  • FND-STA-Information_theory

  The units listed above are part of this course and contain important preparatory material.

Keywords: Semantic similarity of terms in ontologies; using GO and GOA with R

Objectives:

This unit will …

  • … introduce the concept of semantic similarity;

  • … demonstrate how to compute semantic similarity and GO term enrichment in R.

Outcomes:

After working through this unit you …

  • … are familiar with the idea of “semantic similarity”;

  • … can load a Bioconductor model-organism annotation database, calculate GO term semantic similarities between genes, and discover potentially collaborating genes from significantly enriched GO terms in a gene set.


Deliverables:

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.

Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.

Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.


Evaluation:

NA: This unit is not evaluated for course marks.

Contents

This unit introduces the concept of “semantic similarity” between GO terms, a fundamental measure that makes it possible to compare and categorize genes by their function. We also introduce Bioconductor functions to put this into practice.
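
To make the idea concrete, here is a minimal sketch in base R of one common family of measures: information-content based similarity between terms (Resnik and Lin). It is not taken from the course script. The toy ontology, the term names, and the annotation counts are all made up for illustration, and each gene is assumed to be annotated to exactly one term. Real analyses would work with GO and GOA data through Bioconductor packages.

```r
# Toy ontology (hypothetical): each term lists its direct parents.
parents <- list(
  root          = character(0),
  metabolic     = "root",
  transport     = "root",
  glycolysis    = "metabolic",
  tca_cycle     = "metabolic",
  ion_transport = "transport"
)

# Hypothetical number of genes annotated directly to each term.
directCounts <- c(root = 0, metabolic = 10, transport = 20,
                  glycolysis = 5, tca_cycle = 5, ion_transport = 10)

# Return a term together with all of its ancestors.
ancestors <- function(term) {
  out <- term
  for (p in parents[[term]]) {
    out <- union(out, ancestors(p))
  }
  out
}

# A gene annotated to a term is implicitly annotated to all ancestors
# of that term, so we propagate the counts upward.
cumCounts <- setNames(numeric(length(parents)), names(parents))
for (term in names(parents)) {
  anc <- ancestors(term)
  cumCounts[anc] <- cumCounts[anc] + directCounts[[term]]
}

# Information content: rarely used (specific) terms are more informative.
IC <- -log2(cumCounts / cumCounts[["root"]])

# Resnik: IC of the most informative common ancestor.
simResnik <- function(t1, t2) {
  common <- intersect(ancestors(t1), ancestors(t2))
  max(IC[common])
}

# Lin: Resnik similarity normalized by the IC of both terms.
simLin <- function(t1, t2) {
  2 * simResnik(t1, t2) / (IC[[t1]] + IC[[t2]])
}

simResnik("glycolysis", "tca_cycle")      # share the ancestor "metabolic"
simResnik("glycolysis", "ion_transport")  # share only the root
simLin("glycolysis", "tca_cycle")
```

In this sketch, two terms that share only the root have a Resnik similarity of zero, whereas terms that meet at a more specific, less frequently annotated ancestor score higher.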

Task…

 

A good, recent overview of ontology-based functional annotation is found in the following article. This is not a formal reading assignment, but do familiarize yourself with section 3: Derivation of Semantic Similarity between Terms in an Ontology as an introduction to the code-based annotations below.

Gan, Mingxin, Xue Dou, and Rui Jiang. (2013). “From ontology to semantic similarity: calculation of ontology-based semantic similarity”. TheScientificWorldJournal 2013:793091.
[PMID: 23533360] [DOI: 10.1155/2013/793091]

Advances in high-throughput experimental techniques in the past decade have enabled the explosive increase of omics data, while effective organization, interpretation, and exchange of these data require standard and controlled vocabularies in the domain of biological and biomedical studies. Ontologies, as abstract description systems for domain-specific knowledge composition, hence receive more and more attention in computational biology and bioinformatics. Particularly, many applications relying on domain ontologies require quantitative measures of relationships between terms in the ontologies, making it indispensable to develop computational methods for the derivation of ontology-based semantic similarity between terms. Nevertheless, with a variety of methods available, how to choose a suitable method for a specific application becomes a problem. With this understanding, we review a majority of existing methods that rely on ontologies to calculate semantic similarity between terms. We classify existing methods into five categories: methods based on semantic distance, methods based on information content, methods based on properties of terms, methods based on ontology hierarchy, and hybrid methods. We summarize characteristics of each category, with emphasis on basic notions, advantages and disadvantages of these methods. Further, we extend our review to software tools implementing these methods and applications using these methods.

 

Task…

  • Open RStudio and load the ABC-units R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
  • Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included. This ensures that your data and code remain up to date whenever we update the project or fix bugs.
  • Type init() if requested.
  • Open the file BIN-FUNC-Semantic_similarity.R and follow the instructions.

     

    Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of the code.
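
As a point of orientation for the enrichment part of this unit, the following is a minimal sketch of how GO term enrichment in a gene set is commonly assessed with a hypergeometric test. It uses only base R and is not the code from the course script, which relies on Bioconductor functions. All counts and the extra p-values are hypothetical.

```r
# Hypothetical numbers, for illustration only.
nUniverse <- 6000  # annotated genes in the background (e.g. the genome)
nTerm     <- 120   # background genes annotated to the GO term of interest
nSet      <- 50    # genes in our gene set
nOverlap  <- 8     # genes in our set that are annotated to the term

# Overlap we would expect if the gene set were a random sample.
expectedOverlap <- nSet * nTerm / nUniverse

# Probability of observing nOverlap or more annotated genes by chance,
# drawing nSet genes without replacement: P(X >= nOverlap).
pEnrich <- phyper(nOverlap - 1, nTerm, nUniverse - nTerm, nSet,
                  lower.tail = FALSE)

# The same test expressed as a one-sided Fisher's exact test on the
# 2 x 2 table (in set / not in set) x (has term / lacks term).
tab <- matrix(c(nOverlap,         nSet - nOverlap,
                nTerm - nOverlap, nUniverse - nTerm - nSet + nOverlap),
              nrow = 2)
pFisher <- fisher.test(tab, alternative = "greater")$p.value

# Many GO terms are usually tested at once, so p-values need to be
# corrected for multiple testing. The two extra values are made up.
pValues <- c(pEnrich, 0.04, 0.5)
pAdjusted <- p.adjust(pValues, method = "BH")
```

A term is then reported as enriched if its adjusted p-value falls below a chosen threshold. The choice of background set matters: it should contain only genes that could have ended up in the gene set.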

Further Reading

Wu, Xiaomei et al. (2013). “Improving the measurement of semantic similarity between gene ontology terms and gene products: insights from an edge- and IC-based hybrid method”. PLoS ONE 8(5):e66745.
[PMID: 23741529] [DOI: 10.1371/journal.pone.0066745]

BACKGROUND: Explicit comparisons based on the semantic similarity of Gene Ontology terms provide a quantitative way to measure the functional similarity between gene products and are widely applied in large-scale genomic research via integration with other models. Previously, we presented an edge-based method, Relative Specificity Similarity (RSS), which takes the global position of relevant terms into account. However, edge-based semantic similarity metrics are sensitive to the intrinsic structure of GO and simply consider terms at the same level in the ontology to be equally specific nodes, revealing the weaknesses that could be complemented using information content (IC).

Alvarez, Marco A and Changhui Yan. (2011). “A graph-based semantic similarity measure for the gene ontology”. Journal of Bioinformatics and Computational Biology 9(6):681–95.
[PMID: 22084008] [DOI: 10.1142/s0219720011005641]

Existing methods for calculating semantic similarities between pairs of Gene Ontology (GO) terms and gene products often rely on external databases like Gene Ontology Annotation (GOA) that annotate gene products using the GO terms. This dependency leads to some limitations in real applications. Here, we present a semantic similarity algorithm (SSA), that relies exclusively on the GO. When calculating the semantic similarity between a pair of input GO terms, SSA takes into account the shortest path between them, the depth of their nearest common ancestor, and a novel similarity score calculated between the definitions of the involved GO terms. In our work, we use SSA to calculate semantic similarities between pairs of proteins by combining pairwise semantic similarities between the GO terms that annotate the involved proteins. The reliability of SSA was evaluated by comparing the resulting semantic similarities between proteins with the functional similarities between proteins derived from expert annotations or sequence similarity. Comparisons with existing state-of-the-art methods showed that SSA is highly competitive with the other methods. SSA provides a reliable measure for semantics similarity independent of external databases of functional-annotation observations.

Jain, Shobhit and Gary D Bader. (2010). “An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology”. BMC Bioinformatics 11:562.
[PMID: 21078182] [DOI: 10.1186/1471-2105-11-562]

BACKGROUND: Semantic similarity measures are useful to assess the physiological relevance of protein-protein interactions (PPIs). They quantify similarity between proteins based on their function using annotation systems like the Gene Ontology (GO). Proteins that interact in the cell are likely to be in similar locations or involved in similar biological processes compared to proteins that do not interact. Thus the more semantically similar the gene function annotations are among the interacting proteins, more likely the interaction is physiologically relevant. However, most semantic similarity measures used for PPI confidence assessment do not consider the unequal depth of term hierarchies in different classes of cellular location, molecular function, and biological process ontologies of GO and thus may over-or under-estimate similarity.

Questions, comments

If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.

Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.

Page ID: BIN-FUNC-Semantic_similarity

Author:
Boris Steipe ( <boris.steipe@utoronto.ca> )
Created:
2017-08-05
Last modified:
2022-09-14
Version:
1.1
Version History:
–  1.1 2020 Maintenance
–  1.0 First live version
–  0.1 First stub
Tagged with:
–  Unit
–  Live
–  Has lecture slides
–  Links to R course project
–  Has further reading

 

[END]