Function prediction

From "A B C"
Revision as of 23:29, 26 January 2012 by Boris (talk | contribs)
Jump to navigation Jump to search

Function prediction


This page is a placeholder, or under current development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


Sequencing genomes is much, much easier than determining the function of gene products in the laboratory. In order to annotate genomes, the function of gene products has to be predicted...



Introductory reading

Erdin et al. (2011) Protein function prediction: towards integration of similarity metrics. Curr Opin Struct Biol 21:180-8. (pmid: 21353529)

PubMed ] [ DOI ] Genomic centers discover increasingly many protein sequences and structures, but not necessarily their full biological functions. Thus, currently, less than one percent of proteins have experimentally verified biochemical activities. To fill this gap, function prediction algorithms apply metrics of similarity between proteins on the premise that those sufficiently alike in sequence, or structure, will perform identical functions. Although high sensitivity is elusive, network analyses that integrate these metrics together hold the promise of rapid gains in function prediction specificity.


Contents

  • Definition of function
  • Data sources
  • Strategies
  • Examples


Exercises



References



Further reading and resources

Date (2007) Estimating protein function using protein-protein relationships. Methods Mol Biol 408:109-27. (pmid: 18314580)

PubMed ] [ DOI ] Many newly identified gene products from completely sequenced genomes are difficult to characterize in the absence of sequence homology to known proteins. In such a scenario, the context of the proteins' functional associations can be used for annotation; overrepresented functional linkages with a certain class of proteins or members of a pathway allow putative function assignments based on the "guilt-by-association" principle. Two computational functional genomics methods, phylogenetic profiling and identification of Rosetta stone linkages, are described in this chapter, which allow assessment of functional linkages between proteins, consequently facilitating annotation. Phylogenetic profiling involves measuring similarity between profiles that describe the presence or absence of a protein in a set of reference genomes, whereas Rosetta stone fusion sequences help link two or more independently transcribed and translated proteins. Both methods can be applied to investigate functional associations between individual proteins, and can also be extended to reconstruct the genome-wide network of functional linkages by querying the entire protein complement of an organism.

Emes (2008) Inferring function from homology. Methods Mol Biol 453:149-68. (pmid: 18712301)

PubMed ] [ DOI ] Modern molecular biology approaches often result in the accumulation of abundant biological sequence data. Ideally, the function of individual proteins predicted using such data would be determined experimentally. However, if a gene of interest has no predictable function or if the amount of data is too large to experimentally assess individual genes, bioinformatics techniques may provide additional information to allow the inference of function. This chapter proposes a pipeline of freely available Web-based tools to analyze protein-coding DNA sequences of unknown function. Accumulated information obtained during each step of the pipeline is used to build a testable hypothesis of function. The basis and use of sequence similarity methods of homologue detection are described, with emphasis on BLAST and PSI-BLAST. Annotation of gene function through protein domain detection using SMART and Pfam, and the potential for comparison to whole genome data are discussed.

Skolnick & Brylinski (2009) FINDSITE: a combined evolution/structure-based approach to protein function prediction. Brief Bioinformatics 10:378-91. (pmid: 19324930)

PubMed ] [ DOI ] A key challenge of the post-genomic era is the identification of the function(s) of all the molecules in a given organism. Here, we review the status of sequence and structure-based approaches to protein function inference and ligand screening that can provide functional insights for a significant fraction of the approximately 50% of ORFs of unassigned function in an average proteome. We then describe FINDSITE, a recently developed algorithm for ligand binding site prediction, ligand screening and molecular function prediction, which is based on binding site conservation across evolutionary distant proteins identified by threading. Importantly, FINDSITE gives comparable results when high-resolution experimental structures as well as predicted protein models are used.

Syed & Yona (2009) Enzyme function prediction with interpretable models. Methods Mol Biol 541:373-420. (pmid: 19381539)

PubMed ] [ DOI ] Enzymes play central roles in metabolic pathways, and the prediction of metabolic pathways in newly sequenced genomes usually starts with the assignment of genes to enzymatic reactions. However, genes with similar catalytic activity are not necessarily similar in sequence, and therefore the traditional sequence similarity-based approach often fails to identify the relevant enzymes, thus hindering efforts to map the metabolome of an organism.Here we study the direct relationship between basic protein properties and their function. Our goal is to develop a new tool for functional prediction (e.g., prediction of Enzyme Commission number), which can be used to complement and support other techniques based on sequence or structure information. In order to define this mapping we collected a set of 453 features and properties that characterize proteins and are believed to be related to structural and functional aspects of proteins. We introduce a mixture model of stochastic decision trees to learn the set of potentially complex relationships between features and function. To study these correlations, trees are created and tested on the Pfam classification of proteins, which is based on sequence, and the EC classification, which is based on enzymatic function. The model is very effective in learning highly diverged protein families or families that are not defined on the basis of sequence. The resulting tree structures highlight the properties that are strongly correlated with structural and functional aspects of protein families, and can be used to suggest a concise definition of a protein family.

Janga et al. (2011) Network-based function prediction and interactomics: the case for metabolic enzymes. Metab Eng 13:1-10. (pmid: 20654726)

PubMed ] [ DOI ] As sequencing technologies increase in power, determining the functions of unknown proteins encoded by the DNA sequences so produced becomes a major challenge. Functional annotation is commonly done on the basis of amino-acid sequence similarity alone. Long after sequence similarity becomes undetectable by pair-wise comparison, profile-based identification of homologs can often succeed due to the conservation of position-specific patterns, important for a protein's three dimensional folding and function. Nevertheless, prediction of protein function from homology-driven approaches is not without problems. Homologous proteins might evolve different functions and the power of homology detection has already started to reach its maximum. Computational methods for inferring protein function, which exploit the context of a protein in cellular networks, have come to be built on top of homology-based approaches. These network-based functional inference techniques provide both a first hand hint into a proteins' functional role and offer complementary insights to traditional methods for understanding the function of uncharacterized proteins. Most recent network-based approaches aim to integrate diverse kinds of functional interactions to boost both coverage and confidence level. These techniques not only promise to solve the moonlighting aspect of proteins by annotating proteins with multiple functions, but also increase our understanding on the interplay between different functional classes in a cell. In this article we review the state of the art in network-based function prediction and describe some of the underlying difficulties and successes. Given the volume of high-throughput data that is being reported the time is ripe to employ these network-based approaches, which can be used to unravel the functions of the uncharacterized proteins accumulating in the genomic databases.

Erdin et al. (2011) Protein function prediction: towards integration of similarity metrics. Curr Opin Struct Biol 21:180-8. (pmid: 21353529)

PubMed ] [ DOI ] Genomic centers discover increasingly many protein sequences and structures, but not necessarily their full biological functions. Thus, currently, less than one percent of proteins have experimentally verified biochemical activities. To fill this gap, function prediction algorithms apply metrics of similarity between proteins on the premise that those sufficiently alike in sequence, or structure, will perform identical functions. Although high sensitivity is elusive, network analyses that integrate these metrics together hold the promise of rapid gains in function prediction specificity.

Chi & Hou (2011) An iterative approach of protein function prediction. BMC Bioinformatics 12:437. (pmid: 22074332)

PubMed ] [ DOI ] BACKGROUND: Current approaches of predicting protein functions from a protein-protein interaction (PPI) dataset are based on an assumption that the available functions of the proteins (a.k.a. annotated proteins) will determine the functions of the proteins whose functions are unknown yet at the moment (a.k.a. un-annotated proteins). Therefore, the protein function prediction is a mono-directed and one-off procedure, i.e. from annotated proteins to un-annotated proteins. However, the interactions between proteins are mutual rather than static and mono-directed, although functions of some proteins are unknown for some reasons at present. That means when we use the similarity-based approach to predict functions of un-annotated proteins, the un-annotated proteins, once their functions are predicted, will affect the similarities between proteins, which in turn will affect the prediction results. In other words, the function prediction is a dynamic and mutual procedure. This dynamic feature of protein interactions, however, was not considered in the existing prediction algorithms. RESULTS: In this paper, we propose a new prediction approach that predicts protein functions iteratively. This iterative approach incorporates the dynamic and mutual features of PPI interactions, as well as the local and global semantic influence of protein functions, into the prediction. To guarantee predicting functions iteratively, we propose a new protein similarity from protein functions. We adapt new evaluation metrics to evaluate the prediction quality of our algorithm and other similar algorithms. Experiments on real PPI datasets were conducted to evaluate the effectiveness of the proposed approach in predicting unknown protein functions. CONCLUSIONS: The iterative approach is more likely to reflect the real biological nature between proteins when predicting functions. A proper definition of protein similarity from protein functions is the key to predicting functions iteratively. The evaluation results demonstrated that in most cases, the iterative approach outperformed non-iterative ones with higher prediction quality in terms of prediction precision, recall and F-value.

Pellegrini (2012) Using phylogenetic profiles to predict functional relationships. Methods Mol Biol 804:167-77. (pmid: 22144153)

PubMed ] [ DOI ] Phylogenetic profiling involves the comparison of phylogenetic data across gene families. It is possible to construct phylogenetic trees, or related data structures, for specific gene families using a wide variety of tools and approaches. Phylogenetic profiling involves the comparison of this data to determine which families have correlated or coupled evolution. The underlying assumption is that in certain cases these couplings may allow us to infer that the two families are functionally related: that is their function in the cell is coupled. Although this technique can be applied to noncoding genes, it is more commonly used to assess the function of protein coding genes. Examples of proteins that are functionally related include subunits of protein complexes, or enzymes that perform consecutive steps along biochemical pathways. We hypothesize the deletion of one of the families from a genome would then indirectly affect the function of the other. Dozens of different implementations of the phylogenetic profiling technique have been developed over the past decade. These range from the first simple approaches that describe phylogenetic profiles as binary vectors to the most complex ones that attempt to model to the coevolution of protein families on a phylogenetic tree. We discuss a set of these implementations and present the software and databases that are available to perform phylogenetic profiling.