Function prediction



This page is a placeholder, or is currently under development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


Sequencing a genome is far easier than determining the function of its gene products in the laboratory. To annotate genomes, the function of gene products therefore has to be predicted...



Introductory reading

Erdin et al. (2011) Protein function prediction: towards integration of similarity metrics. Curr Opin Struct Biol 21:180-8. (pmid: 21353529)

Genomic centers discover increasingly many protein sequences and structures, but not necessarily their full biological functions. Thus, currently, less than one percent of proteins have experimentally verified biochemical activities. To fill this gap, function prediction algorithms apply metrics of similarity between proteins on the premise that those sufficiently alike in sequence, or structure, will perform identical functions. Although high sensitivity is elusive, network analyses that integrate these metrics together hold the promise of rapid gains in function prediction specificity.
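The premise stated in this abstract, that proteins sufficiently alike in sequence are assumed to share a function, can be illustrated with a minimal sketch. It is not the method of any cited paper: the function names, the toy sequences, the labels, and the 0.4 identity threshold are all hypothetical, and the sequences are assumed to be pre-aligned to equal length.

```python
from typing import Dict, Optional


def percent_identity(a: str, b: str) -> float:
    """Fraction of identical, non-gap positions in two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


def transfer_annotation(
    query: str,
    annotated: Dict[str, str],
    threshold: float = 0.4,  # hypothetical cutoff, chosen only for illustration
) -> Optional[str]:
    """Return the label of the most similar annotated sequence, or None.

    `annotated` maps an aligned sequence to its known function label.
    No label is transferred when the best identity falls below `threshold`.
    """
    best_label, best_score = None, 0.0
    for seq, label in annotated.items():
        score = percent_identity(query, seq)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


# Toy data: sequences and labels are invented for this example.
annotated = {
    "MKTAYIAKQR": "kinase",
    "GGGGGGGGGG": "structural",
}
print(transfer_annotation("MKTAYIAKQK", annotated))
print(transfer_annotation("WWWWWWWWWW", annotated))
```

Returning `None` below the threshold reflects the caveat in the abstract: similarity-based transfer only makes a prediction when the evidence is strong enough, which limits sensitivity.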


Contents

  • Definition of function
  • Data sources
  • Strategies
  • Annotation
  • Examples


Exercises



References



Further reading and resources

Gruber & Kroth (2014) Deducing intracellular distributions of metabolic pathways from genomic data. Methods Mol Biol 1083:187-211. (pmid: 24218217)

In the recent years, a large number of genomes from a variety of different organisms have been sequenced. Most of the sequence data has been publicly released and can be assessed by interested users. However, this wealth of information is currently underexploited by scientists not directly involved in genome annotation. This is partially because sequencing, assembly, and automated annotation can be done much faster than the identification, classification, and prediction of the intracellular localization of the gene products. This part of the annotation process still largely relies on manual curation and addition of contextual information. Users of genome databases who are unfamiliar with the types of data available from (whole) genomes might therefore find themselves either overwhelmed by the vast amount and multiple layers of data or dissatisfied with less-than-meaningful analyses of the data. In this chapter we present procedures and approaches to identify and characterize gene models of enzymes involved in metabolic pathways based on their similarity to known sequences. Furthermore we describe how to predict the subcellular location of the proteins using publicly available prediction servers and how to interpret the obtained results. The strategies we describe are generally applicable to organisms with primary plastids such as land plants or green algae. Additionally, we describe strategies suitable for those groups of algae with secondary plastids (for instance diatoms), which are characterized by a different cellular topology and a larger number of intracellular compartments compared to plants.

Yadav & Jayaraman (2012) Structure based function prediction of proteins using fragment library frequency vectors. Bioinformation 8:953-6. (pmid: 23144557)

The function of the protein is primarily dictated by its structure. Therefore it is far more logical to find the functional clues of the protein in its overall 3-dimensional fold or its global structure. In this paper, we have developed a novel Support Vector Machines (SVM) based prediction model for functional classification and prediction of proteins using features extracted from its global structure based on fragment libraries. Fragment libraries have been previously used for ab initio modelling of proteins and protein structure comparisons. The query protein structure is broken down into a collection of short contiguous backbone fragments and this collection is discretized using a library of fragments. The input feature vector is a frequency vector that counts the number of each library fragment in the collection of fragments by all-to-all fragment comparisons. SVM models were trained and optimised for obtaining the best 10-fold Cross validation accuracy for classification. As an example, this method was applied for prediction and classification of Cell Adhesion molecules (CAMs). Thirty-four different fragment libraries with sizes ranging from 4 to 400 and fragment lengths ranging from 4 to 12 were used for obtaining the best prediction model. The best 10-fold CV accuracy of 95.25% was obtained for library of 400 fragments of length 10. An accuracy of 87.5% was obtained on an unseen test dataset consisting of 20 CAMs and 20 NonCAMs. This shows that protein structure can be accurately and uniquely described using 400 representative fragments of length 10.

Wass et al. (2012) CombFunc: predicting protein function using heterogeneous data sources. Nucleic Acids Res 40:W466-70. (pmid: 22641853)

Only a small fraction of known proteins have been functionally characterized, making protein function prediction essential to propose annotations for uncharacterized proteins. In recent years many function prediction methods have been developed using various sources of biological data from protein sequence and structure to gene expression data. Here we present the CombFunc web server, which makes Gene Ontology (GO)-based protein function predictions. CombFunc incorporates ConFunc, our existing function prediction method, with other approaches for function prediction that use protein sequence, gene expression and protein-protein interaction data. In benchmarking on a set of 1686 proteins CombFunc obtains precision and recall of 0.71 and 0.64 respectively for gene ontology molecular function terms. For biological process GO terms precision of 0.74 and recall of 0.41 is obtained. CombFunc is available at http://www.sbg.bio.ic.ac.uk/combfunc.

Alvarez & Yan (2012) A new protein graph model for function prediction. Comput Biol Chem 37:6-10. (pmid: 22381922)

As several structural proteomic projects are producing an increasing number of protein structures with unknown function, methods that can reliably predict protein functions from protein structures are in urgent need. In this paper, we present a method to explore the clustering patterns of amino acids on the 3-dimensional space for protein function prediction. First, amino acid residues on a protein structure are clustered into spatial groups using hierarchical agglomerative clustering, based on the distance between them. Second, the protein structure is represented using a graph, where each node denotes a cluster of amino acids. The nodes are labeled with an evolutionary profile derived from the multiple alignment of homologous sequences. Then, a shortest-path graph kernel is used to calculate similarities between the graphs. Finally, a support vector machine using this graph kernel is used to train classifiers for protein function prediction. We applied the proposed method to two separate problems, namely, prediction of enzymes and prediction of DNA-binding proteins. In both cases, the results showed that the proposed method outperformed other state-of-the-art methods.

Cui et al. (2011) Phylogenetically informed logic relationships improve detection of biological network organization. BMC Bioinformatics 12:476. (pmid: 22172058)

BACKGROUND: A "phylogenetic profile" refers to the presence or absence of a gene across a set of organisms, and it has been proven valuable for understanding gene functional relationships and network organization. Despite this success, few studies have attempted to search beyond just pairwise relationships among genes. Here we search for logic relationships involving three genes, and explore its potential application in gene network analyses. RESULTS: Taking advantage of a phylogenetic matrix constructed from the large orthologs database Roundup, we invented a method to create balanced profiles for individual triplets of genes that guarantee equal weight on the different phylogenetic scenarios of coevolution between genes. When we applied this idea to LAPP, the method to search for logic triplets of genes, the balanced profiles resulted in significant performance improvement and the discovery of hundreds of thousands more putative triplets than unadjusted profiles. We found that logic triplets detected biological network organization and identified key proteins and their functions, ranging from neighbouring proteins in local pathways, to well separated proteins in the whole pathway, and to the interactions among different pathways at the system level. Finally, our case study suggested that the directionality in a logic relationship and the profile of a triplet could disclose the connectivity between the triplet and surrounding networks. CONCLUSION: Balanced profiles are superior to the raw profiles employed by traditional methods of phylogenetic profiling in searching for high order gene sets. Gene triplets can provide valuable information in detection of biological network organization and identification of key genes at different levels of cellular interaction.

Good et al. (2011) Mining the Gene Wiki for functional genomic knowledge. BMC Genomics 12:603. (pmid: 22165947)

BACKGROUND: Ontology-based gene annotations are important tools for organizing and analyzing genome-scale biological data. Collecting these annotations is a valuable but costly endeavor. The Gene Wiki makes use of Wikipedia as a low-cost, mass-collaborative platform for assembling text-based gene annotations. The Gene Wiki is comprised of more than 10,000 review articles, each describing one human gene. The goal of this study is to define and assess a computational strategy for translating the text of Gene Wiki articles into ontology-based gene annotations. We specifically explore the generation of structured annotations using the Gene Ontology and the Human Disease Ontology. RESULTS: Our system produced 2,983 candidate gene annotations using the Disease Ontology and 11,022 candidate annotations using the Gene Ontology from the text of the Gene Wiki. Based on manual evaluations and comparisons to reference annotation sets, we estimate a precision of 90-93% for the Disease Ontology annotations and 48-64% for the Gene Ontology annotations. We further demonstrate that this data set can systematically improve the results from gene set enrichment analyses. CONCLUSIONS: The Gene Wiki is a rapidly growing corpus of text focused on human gene function. Here, we demonstrate that the Gene Wiki can be a powerful resource for generating ontology-based gene annotations. These annotations can be used immediately to improve workflows for building curated gene annotation databases and knowledge-based statistical analyses.

Pellegrini (2012) Using phylogenetic profiles to predict functional relationships. Methods Mol Biol 804:167-77. (pmid: 22144153)

Phylogenetic profiling involves the comparison of phylogenetic data across gene families. It is possible to construct phylogenetic trees, or related data structures, for specific gene families using a wide variety of tools and approaches. Phylogenetic profiling involves the comparison of this data to determine which families have correlated or coupled evolution. The underlying assumption is that in certain cases these couplings may allow us to infer that the two families are functionally related: that is their function in the cell is coupled. Although this technique can be applied to noncoding genes, it is more commonly used to assess the function of protein coding genes. Examples of proteins that are functionally related include subunits of protein complexes, or enzymes that perform consecutive steps along biochemical pathways. We hypothesize the deletion of one of the families from a genome would then indirectly affect the function of the other. Dozens of different implementations of the phylogenetic profiling technique have been developed over the past decade. These range from the first simple approaches that describe phylogenetic profiles as binary vectors to the most complex ones that attempt to model the coevolution of protein families on a phylogenetic tree. We discuss a set of these implementations and present the software and databases that are available to perform phylogenetic profiling.

Mohammed & Guda (2011) Computational Approaches for Automated Classification of Enzyme Sequences. J Proteomics Bioinform 4:147-152. (pmid: 22114367)

Determining the functional role(s) of enzymes is very important to build the metabolic blueprint of an organism and to identify the potential roles enzymes may play in metabolic and disease pathways. With exponential growth in gene and protein sequence data, it is not feasible to experimentally characterize the function(s) of all enzymes. Alternatively, computational methods can be used to annotate the enormous amount of unannotated enzyme sequences. For function prediction and classification of enzymes, features based on amino acid composition, sequence and structural properties, domain composition and specific peptide information have been widely used by different computational approaches. Each feature space has its own merits and limitations on the overall prediction accuracy. Prediction accuracy improves when machine-learning methods are used to classify enzymes. Given the incomplete and unbalanced nature of annotations in biological databases, ensemble methods or methods that bank on a combination of orthogonal features are more desirable for achieving higher accuracy and coverage in enzyme classification. In this review article, we systematically describe all the features and methods used thus far for enzyme class prediction. To the authors' knowledge, this review represents the most exhaustive description of methods used for computational prediction of enzyme classes.

Chi & Hou (2011) An iterative approach of protein function prediction. BMC Bioinformatics 12:437. (pmid: 22074332)

BACKGROUND: Current approaches of predicting protein functions from a protein-protein interaction (PPI) dataset are based on an assumption that the available functions of the proteins (a.k.a. annotated proteins) will determine the functions of the proteins whose functions are unknown yet at the moment (a.k.a. un-annotated proteins). Therefore, the protein function prediction is a mono-directed and one-off procedure, i.e. from annotated proteins to un-annotated proteins. However, the interactions between proteins are mutual rather than static and mono-directed, although functions of some proteins are unknown for some reasons at present. That means when we use the similarity-based approach to predict functions of un-annotated proteins, the un-annotated proteins, once their functions are predicted, will affect the similarities between proteins, which in turn will affect the prediction results. In other words, the function prediction is a dynamic and mutual procedure. This dynamic feature of protein interactions, however, was not considered in the existing prediction algorithms. RESULTS: In this paper, we propose a new prediction approach that predicts protein functions iteratively. This iterative approach incorporates the dynamic and mutual features of PPI interactions, as well as the local and global semantic influence of protein functions, into the prediction. To guarantee predicting functions iteratively, we propose a new protein similarity from protein functions. We adapt new evaluation metrics to evaluate the prediction quality of our algorithm and other similar algorithms. Experiments on real PPI datasets were conducted to evaluate the effectiveness of the proposed approach in predicting unknown protein functions. CONCLUSIONS: The iterative approach is more likely to reflect the real biological nature between proteins when predicting functions. A proper definition of protein similarity from protein functions is the key to predicting functions iteratively. The evaluation results demonstrated that in most cases, the iterative approach outperformed non-iterative ones with higher prediction quality in terms of prediction precision, recall and F-value.

Erdin et al. (2011) Protein function prediction: towards integration of similarity metrics. Curr Opin Struct Biol 21:180-8. (pmid: 21353529)

Genomic centers discover increasingly many protein sequences and structures, but not necessarily their full biological functions. Thus, currently, less than one percent of proteins have experimentally verified biochemical activities. To fill this gap, function prediction algorithms apply metrics of similarity between proteins on the premise that those sufficiently alike in sequence, or structure, will perform identical functions. Although high sensitivity is elusive, network analyses that integrate these metrics together hold the promise of rapid gains in function prediction specificity.

Janga et al. (2011) Network-based function prediction and interactomics: the case for metabolic enzymes. Metab Eng 13:1-10. (pmid: 20654726)

As sequencing technologies increase in power, determining the functions of unknown proteins encoded by the DNA sequences so produced becomes a major challenge. Functional annotation is commonly done on the basis of amino-acid sequence similarity alone. Long after sequence similarity becomes undetectable by pair-wise comparison, profile-based identification of homologs can often succeed due to the conservation of position-specific patterns, important for a protein's three dimensional folding and function. Nevertheless, prediction of protein function from homology-driven approaches is not without problems. Homologous proteins might evolve different functions and the power of homology detection has already started to reach its maximum. Computational methods for inferring protein function, which exploit the context of a protein in cellular networks, have come to be built on top of homology-based approaches. These network-based functional inference techniques provide both a first hand hint into a protein's functional role and offer complementary insights to traditional methods for understanding the function of uncharacterized proteins. Most recent network-based approaches aim to integrate diverse kinds of functional interactions to boost both coverage and confidence level. These techniques not only promise to solve the moonlighting aspect of proteins by annotating proteins with multiple functions, but also increase our understanding on the interplay between different functional classes in a cell. In this article we review the state of the art in network-based function prediction and describe some of the underlying difficulties and successes. Given the volume of high-throughput data that is being reported the time is ripe to employ these network-based approaches, which can be used to unravel the functions of the uncharacterized proteins accumulating in the genomic databases.

Longhi et al. (2010) Conformational disorder. Methods Mol Biol 609:307-25. (pmid: 20221927)

In recent years it was shown that a large number of proteins are either fully or partially disordered. Intrinsically disordered proteins are ubiquitous proteins that fulfill essential biological functions while lacking a stable 3D structure. Despite the large abundance of disorder, disordered regions are still poorly detected. The identification of disordered regions facilitates the functional annotation of proteins and is instrumental in delineating boundaries of protein domains amenable to crystallization. This chapter focuses on the methods currently employed for predicting disorder and identifying regions involved in induced folding.

Syed & Yona (2009) Enzyme function prediction with interpretable models. Methods Mol Biol 541:373-420. (pmid: 19381539)

Enzymes play central roles in metabolic pathways, and the prediction of metabolic pathways in newly sequenced genomes usually starts with the assignment of genes to enzymatic reactions. However, genes with similar catalytic activity are not necessarily similar in sequence, and therefore the traditional sequence similarity-based approach often fails to identify the relevant enzymes, thus hindering efforts to map the metabolome of an organism. Here we study the direct relationship between basic protein properties and their function. Our goal is to develop a new tool for functional prediction (e.g., prediction of Enzyme Commission number), which can be used to complement and support other techniques based on sequence or structure information. In order to define this mapping we collected a set of 453 features and properties that characterize proteins and are believed to be related to structural and functional aspects of proteins. We introduce a mixture model of stochastic decision trees to learn the set of potentially complex relationships between features and function. To study these correlations, trees are created and tested on the Pfam classification of proteins, which is based on sequence, and the EC classification, which is based on enzymatic function. The model is very effective in learning highly diverged protein families or families that are not defined on the basis of sequence. The resulting tree structures highlight the properties that are strongly correlated with structural and functional aspects of protein families, and can be used to suggest a concise definition of a protein family.

Skolnick & Brylinski (2009) FINDSITE: a combined evolution/structure-based approach to protein function prediction. Brief Bioinformatics 10:378-91. (pmid: 19324930)

A key challenge of the post-genomic era is the identification of the function(s) of all the molecules in a given organism. Here, we review the status of sequence and structure-based approaches to protein function inference and ligand screening that can provide functional insights for a significant fraction of the approximately 50% of ORFs of unassigned function in an average proteome. We then describe FINDSITE, a recently developed algorithm for ligand binding site prediction, ligand screening and molecular function prediction, which is based on binding site conservation across evolutionary distant proteins identified by threading. Importantly, FINDSITE gives comparable results when high-resolution experimental structures as well as predicted protein models are used.

Emes (2008) Inferring function from homology. Methods Mol Biol 453:149-68. (pmid: 18712301)

Modern molecular biology approaches often result in the accumulation of abundant biological sequence data. Ideally, the function of individual proteins predicted using such data would be determined experimentally. However, if a gene of interest has no predictable function or if the amount of data is too large to experimentally assess individual genes, bioinformatics techniques may provide additional information to allow the inference of function. This chapter proposes a pipeline of freely available Web-based tools to analyze protein-coding DNA sequences of unknown function. Accumulated information obtained during each step of the pipeline is used to build a testable hypothesis of function. The basis and use of sequence similarity methods of homologue detection are described, with emphasis on BLAST and PSI-BLAST. Annotation of gene function through protein domain detection using SMART and Pfam, and the potential for comparison to whole genome data are discussed.

Date (2007) Estimating protein function using protein-protein relationships. Methods Mol Biol 408:109-27. (pmid: 18314580)

Many newly identified gene products from completely sequenced genomes are difficult to characterize in the absence of sequence homology to known proteins. In such a scenario, the context of the proteins' functional associations can be used for annotation; overrepresented functional linkages with a certain class of proteins or members of a pathway allow putative function assignments based on the "guilt-by-association" principle. Two computational functional genomics methods, phylogenetic profiling and identification of Rosetta stone linkages, are described in this chapter, which allow assessment of functional linkages between proteins, consequently facilitating annotation. Phylogenetic profiling involves measuring similarity between profiles that describe the presence or absence of a protein in a set of reference genomes, whereas Rosetta stone fusion sequences help link two or more independently transcribed and translated proteins. Both methods can be applied to investigate functional associations between individual proteins, and can also be extended to reconstruct the genome-wide network of functional linkages by querying the entire protein complement of an organism.
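
Several of the abstracts above describe phylogenetic profiles as binary presence/absence vectors that are compared for similarity. As a minimal sketch of that idea only: the gene names, profiles, the 0.8 cutoff, and the choice of the Jaccard index as the similarity measure are assumptions made for illustration, not the method of any paper listed here.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def jaccard(p: Sequence[int], q: Sequence[int]) -> float:
    """Jaccard similarity of two binary presence/absence profiles."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same reference genomes")
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def linked_pairs(
    profiles: Dict[str, Sequence[int]],
    cutoff: float = 0.8,  # hypothetical cutoff, for illustration only
) -> List[Tuple[str, str]]:
    """Return gene pairs whose profiles are at least `cutoff` similar."""
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(profiles), 2)
        if jaccard(profiles[g1], profiles[g2]) >= cutoff
    ]


# Toy profiles over five reference genomes (1 = present, 0 = absent).
profiles = {
    "geneA": (1, 0, 1, 1, 0),
    "geneB": (1, 0, 1, 1, 0),
    "geneC": (0, 1, 0, 0, 1),
}
print(linked_pairs(profiles))
```

Pairs returned by `linked_pairs` are candidates for a functional link under the "guilt-by-association" principle; real implementations typically also account for the phylogenetic relatedness of the reference genomes, which this sketch ignores.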