Interaction prediction

Revision as of 13:43, 6 December 2012 by Boris



This page is a placeholder, or under current development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


Once an interaction has been experimentally determined in a (model) organism, under what conditions can we infer that the homologues of the interaction partners also interact? As in comparative sequence analysis, we need to consider the degree of sequence similarity, but other aspects also come into play. And as in function prediction, predictions may also be made from first principles.



 

Introductory reading

Zinman et al. (2011) Biological interaction networks are conserved at the module level. BMC Syst Biol 5:134. (pmid: 21861884)

BACKGROUND: Orthologous genes are highly conserved between closely related species and biological systems often utilize the same genes across different organisms. However, while sequence similarity often implies functional similarity, interaction data is not well conserved even for proteins with high sequence similarity. Several recent studies comparing high throughput data including expression, protein-protein, protein-DNA, and genetic interactions between close species show conservation at a much lower rate than expected. RESULTS: In this work we collected comprehensive high-throughput interaction datasets for four model organisms (S. cerevisiae, S. pombe, C. elegans, and D. melanogaster) and carried out systematic analyses in order to explain the apparent lower conservation of interaction data when compared to the conservation of sequence data. We first showed that several previously proposed hypotheses only provide a limited explanation for such lower conservation rates. We combined all interaction evidences into an integrated network for each species and identified functional modules from these integrated networks. We then demonstrate that interactions that are part of functional modules are conserved at much higher rates than previous reports in the literature, while interactions that connect between distinct functional modules are conserved at lower rates. CONCLUSIONS: We show that conservation is maintained between species, but mainly at the module level. Our results indicate that interactions within modules are much more likely to be conserved than interactions between proteins in different modules. This provides a network based explanation to the observed conservation rates that can also help explain why so many biological processes are well conserved despite the lower levels of conservation for the interactions of proteins participating in these processes. Accompanying website: http://www.sb.cs.cmu.edu/CrossSP.


 

Contents

The concept of Interologs.
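
As a minimal sketch of the idea, an interolog transfer takes interactions observed in a source organism and proposes the corresponding pairs in a target organism wherever both partners have a homologue there. The function name `transfer_interologs`, the ortholog table and all protein identifiers below are hypothetical and chosen only for illustration; real pipelines would also filter the candidates by sequence similarity.

```python
from itertools import product


def transfer_interologs(source_interactions, ortholog_map):
    """Propose target-organism interactions from source-organism interactions.

    source_interactions: iterable of (protein_a, protein_b) pairs.
    ortholog_map: dict mapping a source protein to a list of target orthologs.
    Returns a set of unordered target pairs, stored as sorted tuples.
    """
    predicted = set()
    for a, b in source_interactions:
        # Every combination of orthologs of a and b is a candidate interolog.
        for ta, tb in product(ortholog_map.get(a, ()), ortholog_map.get(b, ())):
            # Sort each pair so that (x, y) and (y, x) are counted once.
            predicted.add(tuple(sorted((ta, tb))))
    return predicted


# Hypothetical toy data: two yeast interactions and a partial ortholog table.
yeast_ppi = [("yA", "yB"), ("yB", "yC")]
yeast_to_human = {"yA": ["hA1", "hA2"], "yB": ["hB"]}  # yC has no ortholog

print(sorted(transfer_interologs(yeast_ppi, yeast_to_human)))
```

The second yeast interaction yields no prediction because one partner has no ortholog in the table; only pairs with homologues on both sides can be transferred.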


   

Further reading and resources

Maetschke et al. (2012) Gene Ontology-driven inference of protein-protein interactions using inducers. Bioinformatics 28:69-75. (pmid: 22057159)

MOTIVATION: Protein-protein interactions (PPIs) are pivotal for many biological processes and similarity in Gene Ontology (GO) annotation has been found to be one of the strongest indicators for PPI. Most GO-driven algorithms for PPI inference combine machine learning and semantic similarity techniques. We introduce the concept of inducers as a method to integrate both approaches more effectively, leading to superior prediction accuracies. RESULTS: An inducer (ULCA) in combination with a Random Forest classifier compares favorably to several sequence-based methods, semantic similarity measures and multi-kernel approaches. On a newly created set of high-quality interaction data, the proposed method achieves high cross-species prediction accuracies (Area under the ROC curve ≤ 0.88), rendering it a valuable companion to sequence-based methods. AVAILABILITY: Software and datasets are available at http://bioinformatics.org.au/go2ppi/ CONTACT: m.ragan@uq.edu.au.

Clark et al. (2011) Using coevolution to predict protein-protein interactions. Methods Mol Biol 781:237-56. (pmid: 21877284)

Bioinformatic methods to predict protein-protein interactions (PPI) via coevolutionary analysis have positioned themselves to compete alongside established in vitro methods, despite a lack of understanding for the underlying molecular mechanisms of the coevolutionary process. Investigating the alignment of coevolutionary predictions of PPI with experimental data can focus the effective scope of prediction and lead to better accuracies. A new rate-based coevolutionary method, MMM, preferentially finds obligate interacting proteins that form complexes, conforming to results from studies based on coimmunoprecipitation coupled with mass spectrometry. Using gold-standard databases as a benchmark for accuracy, MMM surpasses methods based on abundance ratios, suggesting that correlated evolutionary rates may yet be better than coexpression at predicting interacting proteins. At the level of protein domains, coevolution is difficult to detect, even with MMM, except when considering small-scale experimental data involving proteins with multiple domains. Overall, these findings confirm that coevolutionary methods can be confidently used in predicting PPI, either independently or as drivers of coimmunoprecipitation experiments.

Poupon & Janin (2010) Analysis and prediction of protein quaternary structure. Methods Mol Biol 609:349-64. (pmid: 20221929)

The quaternary structure (QS) of a protein is determined by measuring its molecular weight in solution. The data have to be extracted from the literature, and they may be missing even for proteins that have a crystal structure reported in the Protein Data Bank (PDB). The PDB and other databases derived from it report QS information that either was obtained from the depositors or is based on an analysis of the contacts between polypeptide chains in the crystal, and this frequently differs from the QS determined in solution. The QS of a protein can be predicted from its sequence using either homology or threading methods. However, a majority of the proteins with less than 30% sequence identity have different QSs. A model of the QS can also be derived by docking the subunits when their 3D structure is independently known, but the model is likely to be incorrect if large conformation changes take place when the oligomer assembles.

Seidl & Schultz (2009) Evolutionary flexibility of protein complexes. BMC Evol Biol 9:155. (pmid: 19583842)

BACKGROUND: Proteins play a key role in cellular life. They do not act alone but are organised in complexes. Throughout the life of a cell, complexes are dynamic in their composition due to attachments and shared components. Experimental and computational evidence indicate that consecutive addition and secondary losses of components played a major role in the evolution of some complexes, mostly without affecting the core function. Here, we analysed in a large scale approach whether this flexibility in evolution is only limited to a distinct number of complexes or represents a more general trend. RESULTS: Focussing on human protein complexes, we based our analysis on a manually curated dataset from HPRD. In total, 1,060 complexes with 6,136 proteins from 2,187 unique genes were considered. We computed interologs in 25 different species and predicted the composition of complexes. Over the analysed species, the composition of most complexes was highly flexible and only 25% of all genes were never lost. Even if one component was lost at a particular point in time, the fraction of observed second, independent losses of additional components was high (75% of all complexes affected). Still, loss of whole complexes happened rarely. This biological signal deviated significantly from random models. We exemplified this trend on the anaphase promoting complex (APC) where a core is highly conserved throughout all metazoans, but flexibility in certain components is observable. CONCLUSION: Consecutive additions and losses of distinct units is a fundamental process in the evolution of protein complexes. These evolutionary events affecting genes coding for units in human protein complexes showed a significantly different phylogenetic pattern compared to randomly selected genes. Determination of taxon specific attachments or losses might be linked to specific cellular or morphological features. Thus, protein complexes contain not only structural and functional, but also evolutionary cores.

Saeed & Deane (2008) An assessment of the uses of homologous interactions. Bioinformatics 24:689-95. (pmid: 18042554)

MOTIVATION: Protein-protein interactions have proved to be a valuable starting point for understanding the inner workings of the cell. Computational methodologies have been built which both predict interactions and use interaction datasets in order to predict other protein features. Such methods require gold standard positive (GSP) and negative (GSN) interaction sets. Here we examine and demonstrate the usefulness of homologous interactions in predicting good quality positive and negative interaction datasets. RESULTS: We generate GSP interaction sets as subsets from experimental data using only interaction and sequence information. We can therefore produce sets for several species (many of which at present have no identified GSPs). Comprehensive error rate testing demonstrates the power of the method. We also show how the use of our datasets significantly improves the predictive power of algorithms for interaction prediction and function prediction. Furthermore, we generate GSN interaction sets for yeast and examine the use of homology along with other protein properties such as localization, expression and function. Using a novel method to assess the accuracy of a negative interaction set, we find that the best single selector for negative interactions is a lack of co-function. However, an integrated method using all the characteristics shows significant improvement over any current method for identifying GSN interactions. The nature of homologous interactions is also examined and we demonstrate that interologs are found more commonly within species than across species. CONCLUSION: GSP sets built using our homologous verification method are demonstrably better than standard sets in terms of predictive ability. We can build such GSP sets for several species. When generating GSNs we show a combination of protein features and lack of homologous interactions gives the highest quality interaction sets.
AVAILABILITY: GSP and GSN datasets for all the studied species can be downloaded from http://www.stats.ox.ac.uk/~deane/HPIV.

Brown & Jurisica (2007) Unequal evolutionary conservation of human protein interactions in interologous networks. Genome Biol 8:R95. (pmid: 17535438)

BACKGROUND: Protein-protein interaction (PPI) networks have been transferred between organisms using interologs, allowing model organisms to supplement the interactomes of higher eukaryotes. However, the conservation of various network components has not been fully explored. Unequal conservation of certain network components may limit the ability to fully expand the target interactomes using interologs. RESULTS: In this study, we transfer high quality human interactions to lower eukaryotes, and examine the evolutionary conservation of individual network components. When human proteins are mapped to yeast, we find a strong positive correlation (r = 0.50, P = 3.9 × 10^-4) between evolutionary conservation and the number of interacting proteins, which is also found when mapped to other model organisms. Examining overlapping PPI networks, Gene Ontology (GO) terms, and gene expression data, we are able to demonstrate that protein complexes are conserved preferentially, compared to transient interactions in the network. Despite the preferential conservation of complexes, and the fact that the human interactome comprises an abundance of transient interactions, we demonstrate how transferring human PPIs to yeast augments this well-studied protein interaction network, using the coatomer complex and replisome as examples. CONCLUSION: Human proteins, like yeast proteins, show a correlation between the number of interacting partners and evolutionary conservation. The preferential conservation of proteins with higher degree leads to enrichment in protein complexes when interactions are transferred between organisms using interologs.

Mika & Rost (2006) Protein-protein interactions more conserved within species than across species. PLoS Comput Biol 2:e79. (pmid: 16854211)

Experimental high-throughput studies of protein-protein interactions are beginning to provide enough data for comprehensive computational studies. Today, about ten large data sets, each with thousands of interacting pairs, coarsely sample the interactions in fly, human, worm, and yeast. Another about 55,000 pairs of interacting proteins have been identified by more careful, detailed biochemical experiments. Most interactions are experimentally observed in prokaryotes and simple eukaryotes; very few interactions are observed in higher eukaryotes such as mammals. It is commonly assumed that pathways in mammals can be inferred through homology to model organisms, e.g. the experimental observation that two yeast proteins interact is transferred to infer that the two corresponding proteins in human also interact. Two pairs for which the interaction is conserved are often described as interologs. The goal of this investigation was a large-scale comprehensive analysis of such inferences, i.e. of the evolutionary conservation of interologs. Here, we introduced a novel score for measuring the overlap between protein-protein interaction data sets. This measure appeared to reflect the overall quality of the data and was the basis for our two surprising results from our large-scale analysis. Firstly, homology-based inferences of physical protein-protein interactions appeared far less successful than expected. In fact, such inferences were accurate only for extremely high levels of sequence similarity. Secondly, and most surprisingly, the identification of interacting partners through sequence similarity was significantly more reliable for protein pairs within the same organism than for pairs between species. Our analysis underlined that the discrepancies between different datasets are large, even when using the same type of experiment on the same organism. This reality considerably constrains the power of homology-based transfer of interactions. In particular, the experimental probing of interactions in distant model organisms has to be undertaken with some caution. More comprehensive images of protein-protein networks will require the combination of many high-throughput methods, including in silico inferences and predictions. http://www.rostlab.org/results/2006/ppi_homology/

Yu et al. (2004) Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res 14:1107-18. (pmid: 15173116)

Proteins function mainly through interactions, especially with DNA and other proteins. While some large-scale interaction networks are now available for a number of model organisms, their experimental generation remains difficult. Consequently, interolog mapping--the transfer of interaction annotation from one organism to another using comparative genomics--is of significant value. Here we quantitatively assess the degree to which interologs can be reliably transferred between species as a function of the sequence similarity of the corresponding interacting proteins. Using interaction information from Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and Helicobacter pylori, we find that protein-protein interactions can be transferred when a pair of proteins has a joint sequence identity >80% or a joint E-value <10^-70. (These "joint" quantities are the geometric means of the identities or E-values for the two pairs of interacting proteins.) We generalize our interolog analysis to protein-DNA binding, finding such interactions are conserved at specific thresholds between 30% and 60% sequence identity depending on the protein family. Furthermore, we introduce the concept of a "regulog"--a conserved regulatory relationship between proteins across different species. We map interologs and regulogs from yeast to a number of genomes with limited experimental annotation (e.g., Arabidopsis thaliana) and make these available through an online database at http://interolog.gersteinlab.org. Specifically, we are able to transfer approximately 90,000 potential protein-protein interactions to the worm. We test a number of these in two-hybrid experiments and are able to verify 45 overlaps, which we show to be statistically significant.
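
The "joint" quantities in the abstract above can be illustrated with a short sketch. It assumes only what the abstract states: the joint identity and joint E-value are geometric means over the two homologous protein pairs, and transfer is considered reliable above 80% joint identity or below a joint E-value of 10^-70. The function names, parameter names and example numbers are hypothetical and are not taken from the paper's software.

```python
import math


def joint_identity(identity_a, identity_b):
    """Geometric mean of the two percent identities (one per homologous pair)."""
    return math.sqrt(identity_a * identity_b)


def joint_evalue(evalue_a, evalue_b):
    """Geometric mean of two E-values, computed in log space.

    Working with log10 avoids underflow when both E-values are very small.
    An E-value reported as 0 is treated as giving a joint E-value of 0.
    """
    if evalue_a == 0 or evalue_b == 0:
        return 0.0
    return 10 ** ((math.log10(evalue_a) + math.log10(evalue_b)) / 2)


def is_transferable(identity_a, identity_b, evalue_a, evalue_b,
                    min_identity=80.0, max_evalue=1e-70):
    """Apply the thresholds quoted in the abstract (either criterion suffices)."""
    return (joint_identity(identity_a, identity_b) > min_identity
            or joint_evalue(evalue_a, evalue_b) < max_evalue)


# Hypothetical example: partner A is 90% identical to its homologue, partner B 72%.
print(joint_identity(90.0, 72.0))
print(is_transferable(90.0, 72.0, 1e-40, 1e-35))
```

Note that the geometric mean penalises an unbalanced pair: 64% and 100% identity give a joint identity of exactly 80%, which does not exceed the threshold, whereas the arithmetic mean of 82% would.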