BIN-FUNC-Semantic similarity
Measuring "Semantic Similarity" in Ontologies
Keywords: Semantic similarity of terms in ontologies, using GO and GOA with R
Contents
This unit is under development. There is some contents here but it is incomplete and/or may change significantly: links may lead to nowhere, the contents is likely going to be rearranged, and objectives, deliverables etc. may be incomplete or missing. Do not work with this material until it is updated to "live" status.
Abstract
...
This unit ...
Prerequisites
You need to complete the following units before beginning this one:
Objectives
...
Outcomes
...
Deliverables
- Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
- Journal: Document your progress in your course journal.
- Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation
Evaluation: NA
- This unit is not evaluated for course marks.
Contents
Task:
- Read the introductory notes on quantifying how similar the "meaning" of two terms in the Gene Ontology is.
Semantic similarity
A good, recent overview of ontology based functional annotation is found in the following article. This is not a formal reading assignment, but do familiarize yourself with section 3: Derivation of Semantic Similarity between Terms in an Ontology as an introduction to the code-based annotations below.
Gan et al. (2013) From ontology to semantic similarity: calculation of ontology-based semantic similarity. ScientificWorldJournal 2013:793091. (pmid: 23533360) |
[ PubMed ] [ DOI ] Advances in high-throughput experimental techniques in the past decade have enabled the explosive increase of omics data, while effective organization, interpretation, and exchange of these data require standard and controlled vocabularies in the domain of biological and biomedical studies. Ontologies, as abstract description systems for domain-specific knowledge composition, hence receive more and more attention in computational biology and bioinformatics. Particularly, many applications relying on domain ontologies require quantitative measures of relationships between terms in the ontologies, making it indispensable to develop computational methods for the derivation of ontology-based semantic similarity between terms. Nevertheless, with a variety of methods available, how to choose a suitable method for a specific application becomes a problem. With this understanding, we review a majority of existing methods that rely on ontologies to calculate semantic similarity between terms. We classify existing methods into five categories: methods based on semantic distance, methods based on information content, methods based on properties of terms, methods based on ontology hierarchy, and hybrid methods. We summarize characteristics of each category, with emphasis on basic notions, advantages and disadvantages of these methods. Further, we extend our review to software tools implementing these methods and applications using these methods. |
Practical work with GO: bioconductor.
The bioconductor project hosts the GOSemSim package for semantic similarity.
Task:
- Work through the following R-code. If you have problems, discuss them on the mailing list. Don't go through the code mechanically but make sure you are clear about what it does.
# GOsemanticSimilarity.R
# GO semantic similarity example
# B. Steipe for BCB420, January 2014
setwd("~/your-R-project-directory")
# GOSemSim is an R-package in the bioconductor project. It is not installed via
# the usual install.packages() comand (via CRAN) but via an installation script
# that is run from the bioconductor Website.
source("http://bioconductor.org/biocLite.R")
biocLite("GOSemSim")
library(GOSemSim)
# This loads the library and starts the Bioconductor environment.
# You can get an overview of functions by executing ...
browseVignettes()
# ... which will open a listing in your Web browser. Open the
# introduction to GOSemSim PDF. As the introduction suggests,
# now is a good time to execute ...
help(GOSemSim)
# The simplest function is to measure the semantic similarity of two GO
# terms. For example, SOX2 was annotated with GO:0035019 (somatic stem cell
# maintenance), QSOX2 was annotated with GO:0045454 (cell redox homeostasis),
# and Oct4 (POU5F1) with GO:0009786 (regulation of asymmetric cell division),
# among other associations. Lets calculate these similarities.
goSim("GO:0035019", "GO:0009786", ont="BP", measure="Wang")
goSim("GO:0035019", "GO:0045454", ont="BP", measure="Wang")
# Fair enough. Two numbers. Clearly we would appreciate an idea of the values
# that high similarity and low similarity can take. But in any case -
# we are really less interested in the similarity of GO terms - these
# are a function of how the Ontology was constructed. We are more
# interested in the functional similarity of our genes, and these
# have a number of GO terms associated with them.
# GOSemSim provides the functions ...
?geneSim()
?mgeneSim()
# ... to compute these values. Refer to the vignette for details, in
# particular, consider how multiple GO terms are combined, and how to
# keep/drop evidence codes.
# Here is a pairwise similarity example: the gene IDs are the ones you
# have recorded previously. Note that this will download a package
# of GO annotations - you might not want to do this on a low-bandwidth
# connection.
geneSim("6657", "5460", ont = "BP", measure="Wang", combine = "BMA")
# Another number. And the list of GO terms that were considered.
# Your task: use the mgeneSim() function to calculate the similarities
# between all six proteins for which you have recorded the GeneIDs
# previously (SOX2, POU5F1, E2F1, BMP4, UGT1A1 and NANOG) in the
# biological process ontology.
# This will run for some time. On my machine, half an hour or so.
# Do the results correspond to your expectations?
Further reading, links and resources
Wu et al. (2013) Improving the measurement of semantic similarity between gene ontology terms and gene products: insights from an edge- and IC-based hybrid method. PLoS ONE 8:e66745. (pmid: 23741529) |
[ PubMed ] [ DOI ] BACKGROUND: Explicit comparisons based on the semantic similarity of Gene Ontology terms provide a quantitative way to measure the functional similarity between gene products and are widely applied in large-scale genomic research via integration with other models. Previously, we presented an edge-based method, Relative Specificity Similarity (RSS), which takes the global position of relevant terms into account. However, edge-based semantic similarity metrics are sensitive to the intrinsic structure of GO and simply consider terms at the same level in the ontology to be equally specific nodes, revealing the weaknesses that could be complemented using information content (IC). RESULTS AND CONCLUSIONS: Here, we used the IC-based nodes to improve RSS and proposed a new method, Hybrid Relative Specificity Similarity (HRSS). HRSS outperformed other methods in distinguishing true protein-protein interactions from false. HRSS values were divided into four different levels of confidence for protein interactions. In addition, HRSS was statistically the best at obtaining the highest average functional similarity among human-mouse orthologs. Both HRSS and the groupwise measure, simGIC, are superior in correlation with sequence and Pfam similarities. Because different measures are best suited for different circumstances, we compared two pairwise strategies, the maximum and the best-match average, in the evaluation. The former was more effective at inferring physical protein-protein interactions, and the latter at estimating the functional conservation of orthologs and analyzing the CESSM datasets. In conclusion, HRSS can be applied to different biological problems by quantifying the functional similarity between gene products. The algorithm HRSS was implemented in the C programming language, which is freely available from http://cmb.bnu.edu.cn/hrss. |
Gan et al. (2013) From ontology to semantic similarity: calculation of ontology-based semantic similarity. ScientificWorldJournal 2013:793091. (pmid: 23533360) |
[ PubMed ] [ DOI ] Advances in high-throughput experimental techniques in the past decade have enabled the explosive increase of omics data, while effective organization, interpretation, and exchange of these data require standard and controlled vocabularies in the domain of biological and biomedical studies. Ontologies, as abstract description systems for domain-specific knowledge composition, hence receive more and more attention in computational biology and bioinformatics. Particularly, many applications relying on domain ontologies require quantitative measures of relationships between terms in the ontologies, making it indispensable to develop computational methods for the derivation of ontology-based semantic similarity between terms. Nevertheless, with a variety of methods available, how to choose a suitable method for a specific application becomes a problem. With this understanding, we review a majority of existing methods that rely on ontologies to calculate semantic similarity between terms. We classify existing methods into five categories: methods based on semantic distance, methods based on information content, methods based on properties of terms, methods based on ontology hierarchy, and hybrid methods. We summarize characteristics of each category, with emphasis on basic notions, advantages and disadvantages of these methods. Further, we extend our review to software tools implementing these methods and applications using these methods. |
Alvarez & Yan (2011) A graph-based semantic similarity measure for the gene ontology. J Bioinform Comput Biol 9:681-95. (pmid: 22084008) |
[ PubMed ] [ DOI ] Existing methods for calculating semantic similarities between pairs of Gene Ontology (GO) terms and gene products often rely on external databases like Gene Ontology Annotation (GOA) that annotate gene products using the GO terms. This dependency leads to some limitations in real applications. Here, we present a semantic similarity algorithm (SSA), that relies exclusively on the GO. When calculating the semantic similarity between a pair of input GO terms, SSA takes into account the shortest path between them, the depth of their nearest common ancestor, and a novel similarity score calculated between the definitions of the involved GO terms. In our work, we use SSA to calculate semantic similarities between pairs of proteins by combining pairwise semantic similarities between the GO terms that annotate the involved proteins. The reliability of SSA was evaluated by comparing the resulting semantic similarities between proteins with the functional similarities between proteins derived from expert annotations or sequence similarity. Comparisons with existing state-of-the-art methods showed that SSA is highly competitive with the other methods. SSA provides a reliable measure for semantics similarity independent of external databases of functional-annotation observations. |
Jain & Bader (2010) An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology. BMC Bioinformatics 11:562. (pmid: 21078182) |
[ PubMed ] [ DOI ] BACKGROUND: Semantic similarity measures are useful to assess the physiological relevance of protein-protein interactions (PPIs). They quantify similarity between proteins based on their function using annotation systems like the Gene Ontology (GO). Proteins that interact in the cell are likely to be in similar locations or involved in similar biological processes compared to proteins that do not interact. Thus the more semantically similar the gene function annotations are among the interacting proteins, more likely the interaction is physiologically relevant. However, most semantic similarity measures used for PPI confidence assessment do not consider the unequal depth of term hierarchies in different classes of cellular location, molecular function, and biological process ontologies of GO and thus may over-or under-estimate similarity. RESULTS: We describe an improved algorithm, Topological Clustering Semantic Similarity (TCSS), to compute semantic similarity between GO terms annotated to proteins in interaction datasets. Our algorithm, considers unequal depth of biological knowledge representation in different branches of the GO graph. The central idea is to divide the GO graph into sub-graphs and score PPIs higher if participating proteins belong to the same sub-graph as compared to if they belong to different sub-graphs. CONCLUSIONS: The TCSS algorithm performs better than other semantic similarity measurement techniques that we evaluated in terms of their performance on distinguishing true from false protein interactions, and correlation with gene expression and protein families. We show an average improvement of 4.6 times the F1 score over Resnik, the next best method, on our Saccharomyces cerevisiae PPI dataset and 2 times on our Homo sapiens PPI dataset using cellular component, biological process and molecular function GO annotations. |
Yu et al. (2010) GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26:976-8. (pmid: 20179076) |
[ PubMed ] [ DOI ] SUMMARY: The semantic comparisons of Gene Ontology (GO) annotations provide quantitative ways to compute similarities between genes and gene groups, and have became important basis for many bioinformatics analysis approaches. GOSemSim is an R package for semantic similarity computation among GO terms, sets of GO terms, gene products and gene clusters. Four information content (IC)- and a graph-based methods are implemented in the GOSemSim package, multiple species including human, rat, mouse, fly and yeast are also supported. The functions provided by the GOSemSim offer flexibility for applications, and can be easily integrated into high-throughput analysis pipelines. AVAILABILITY: GOSemSim is released under the GNU General Public License within Bioconductor project, and freely available at http://bioconductor.org/packages/2.6/bioc/html/GOSemSim.html. |
w
Notes
Self-evaluation
If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2017-08-05
Modified:
- 2017-08-05
Version:
- 0.1
Version history:
- 0.1 First stub
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.