Sequence analysis

From "A B C"
Jump to navigation Jump to search

Sequence analysis


This page is a placeholder, or under current development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


Summary ...



 

Contents

Examples

Motifs

Sigrist et al. (2010) PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res 38:D161-6. (pmid: 19858104)

PubMed ] [ DOI ] PROSITE consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles to identify them. It is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of these profiles and patterns by providing additional information about functionally and/or structurally critical amino acids. PROSITE is largely used for the annotation of domain features of UniProtKB/Swiss-Prot entries. Among the 983 (DNA-binding) domains, repeats and zinc fingers present in Swiss-Prot (release 57.8 of 22 September 2009), 696 ( approximately 70%) are annotated with PROSITE descriptors using information from ProRule. In order to allow better functional characterization of domains, PROSITE developments focus on subfamily specific profiles and a new profile building method giving more weight to functionally important residues. Here, we describe AMSA, an annotated multiple sequence alignment format used to build a new generation of generalized profiles, the migration of ScanProsite to Vital-IT, a cluster of 633 CPUs, and the adoption of the Distributed Annotation System (DAS) to facilitate PROSITE data integration and interchange with other sources. The latest version of PROSITE (release 20.54, of 22 September 2009) contains 1308 patterns, 863 profiles and 869 ProRules. PROSITE is accessible at: http://www.expasy.org/prosite/.

Disorder

Deng et al. (2012) A comprehensive overview of computational protein disorder prediction methods. Mol Biosyst 8:114-21. (pmid: 21874190)

PubMed ] [ DOI ] Over the past decade there has been a growing acknowledgement that a large proportion of proteins within most proteomes contain disordered regions. Disordered regions are segments of the protein chain which do not adopt a stable structure. Recognition of disordered regions in a protein is of great importance for protein structure prediction, protein structure determination and function annotation as these regions have a close relationship with protein expression and functionality. As a result, a great many protein disorder prediction methods have been developed so far. Here, we present an overview of current protein disorder prediction methods including an analysis of their advantages and shortcomings. In order to help users to select alternative tools under different circumstances, we also evaluate 23 disorder predictors on the benchmark data of the most recent round of the Critical Assessment of protein Structure Prediction (CASP) and assess their accuracy using several complementary measures.

Kozlowski & Bujnicki (2012) MetaDisorder: a meta-server for the prediction of intrinsic disorder in proteins. BMC Bioinformatics 13:111. (pmid: 22624656)

PubMed ] [ DOI ] BACKGROUND: Intrinsically unstructured proteins (IUPs) lack a well-defined three-dimensional structure. Some of them may assume a locally stable structure under specific conditions, e.g. upon interaction with another molecule, while others function in a permanently unstructured state. The discovery of IUPs challenged the traditional protein structure paradigm, which stated that a specific well-defined structure defines the function of the protein. As of December 2011, approximately 60 methods for computational prediction of protein disorder from sequence have been made publicly available. They are based on different approaches, such as utilizing evolutionary information, energy functions, and various statistical and machine learning methods. RESULTS: Given the diversity of existing intrinsic disorder prediction methods, we decided to test whether it is possible to combine them into a more accurate meta-prediction method. We developed a method based on arbitrarily chosen 13 disorder predictors, in which the final consensus was weighted by the accuracy of the methods. We have also developed a disorder predictor GSmetaDisorder3D that used no third-party disorder predictors, but alignments to known protein structures, reported by the protein fold-recognition methods, to infer the potentially structured and unstructured regions. Following the success of our disorder predictors in the CASP8 benchmark, we combined them into a meta-meta predictor called GSmetaDisorderMD, which was the top scoring method in the subsequent CASP9 benchmark. CONCLUSIONS: A series of disorder predictors described in this article is available as a MetaDisorder web server at http://iimcb.genesilico.pl/metadisorder/. Results are presented both in an easily interpretable, interactive mode and in a simple text format suitable for machine processing.


Signal peptides

Bendtsen et al. (2004) Improved prediction of signal peptides: SignalP 3.0. J Mol Biol 340:783-95. (pmid: 15223320)

PubMed ] [ DOI ] We describe improvements of the currently most popular method for prediction of classically secreted proteins, SignalP. SignalP consists of two different predictors based on neural network and hidden Markov model algorithms, where both components have been updated. Motivated by the idea that the cleavage site position and the amino acid composition of the signal peptide are correlated, new features have been included as input to the neural network. This addition, combined with a thorough error-correction of a new data set, have improved the performance of the predictor significantly over SignalP version 2. In version 3, correctness of the cleavage site predictions has increased notably for all three organism groups, eukaryotes, Gram-negative and Gram-positive bacteria. The accuracy of cleavage site prediction has increased in the range 6-17% over the previous version, whereas the signal peptide discrimination improvement is mainly due to the elimination of false-positive predictions, as well as the introduction of a new discrimination score for the neural network. The new method has been benchmarked against other available methods. Predictions can be made at the publicly available web server

Emanuelsson et al. (2007) Locating proteins in the cell using TargetP, SignalP and related tools. Nat Protoc 2:953-71. (pmid: 17446895)

PubMed ] [ DOI ] Determining the subcellular localization of a protein is an important first step toward understanding its function. Here, we describe the properties of three well-known N-terminal sequence motifs directing proteins to the secretory pathway, mitochondria and chloroplasts, and sketch a brief history of methods to predict subcellular localization based on these sorting signals and other sequence properties. We then outline how to use a number of internet-accessible tools to arrive at a reliable subcellular localization prediction for eukaryotic and prokaryotic proteins. In particular, we provide detailed step-by-step instructions for the coupled use of the amino-acid sequence-based predictors TargetP, SignalP, ChloroP and TMHMM, which are all hosted at the Center for Biological Sequence Analysis, Technical University of Denmark. In addition, we describe and provide web references to other useful subcellular localization predictors. Finally, we discuss predictive performance measures in general and the performance of TargetP and SignalP in particular.

Petersen et al. (2011) SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods 8:785-6. (pmid: 21959131)

PubMed ] [ DOI ]


Secondary Structure

Pirovano & Heringa (2010) Protein secondary structure prediction. Methods Mol Biol 609:327-48. (pmid: 20221928)

PubMed ] [ DOI ] While the prediction of a native protein structure from sequence continues to remain a challenging problem, over the past decades computational methods have become quite successful in exploiting the mechanisms behind secondary structure formation. The great effort expended in this area has resulted in the development of a vast number of secondary structure prediction methods. Especially the combination of well-optimized/sensitive machine-learning algorithms and inclusion of homologous sequence information has led to increased prediction accuracies of up to 80%. In this chapter, we will first introduce some basic notions and provide a brief history of secondary structure prediction advances. Then a comprehensive overview of state-of-the-art prediction methods will be given. Finally, we will discuss open questions and challenges in this field and provide some practical recommendations for the user.

Coiled coils


Rackham et al. (2010) The evolution and structure prediction of coiled coils across all genomes. J Mol Biol 403:480-93. (pmid: 20813113)

PubMed ] [ DOI ] Coiled coils are α-helical interactions found in many natural proteins. Various sequence-based coiled-coil predictors are available, but key issues remain: oligomeric state and protein-protein interface prediction and extension to all genomes. We present SpiriCoil (http://supfam.org/SUPERFAMILY/spiricoil), which is based on a novel approach to the coiled-coil prediction problem for coiled coils that fall into known superfamilies: hundreds of hidden Markov models representing coiled-coil-containing domain families. Using whole domains gives the advantage that sequences flanking the coiled coils help. SpiriCoil performs at least as well as existing methods at detecting coiled coils and significantly advances the state of the art for oligomer state prediction. SpiriCoil has been run on over 16 million sequences, including all completely sequenced genomes (more than 1200), and a resulting Web interface supplies data downloads, alignments, scores, oligomeric state classifications, three-dimensional homology models and visualisation. This has allowed, for the first time, a genomewide analysis of coiled-coil evolution. We found that coiled coils have arisen independently de novo well over a hundred times, and these are observed in 16 different oligomeric states. Coiled coils in almost all oligomeric states were present in the last universal common ancestor of life. The vast majority of occasions that individual coiled coils have arisen de novo were before the last universal common ancestor of life; we do, however, observe scattered instances throughout subsequent evolutionary history, mostly in the formation of the eukaryote superkingdom. Coiled coils do not change their oligomeric state over evolution and did not evolve from the rearrangement of existing helices in proteins; coiled coils were forged in unison with the fold of the whole protein.

spiricoil

Transmembrane Helices

Wang et al. (2012) Improving transmembrane protein consensus topology prediction using inter-helical interaction. Biochim Biophys Acta 1818:2679-86. (pmid: 22683598)

PubMed ] [ DOI ] Alpha helix transmembrane proteins (αTMPs) represent roughly 30% of all open reading frames (ORFs) in a typical genome and are involved in many critical biological processes. Due to the special physicochemical properties, it is hard to crystallize and obtain high resolution structures experimentally, thus, sequence-based topology prediction is highly desirable for the study of transmembrane proteins (TMPs), both in structure prediction and function prediction. Various model-based topology prediction methods have been developed, but the accuracy of those individual predictors remain poor due to the limitation of the methods or the features they used. Thus, the consensus topology prediction method becomes practical for high accuracy applications by combining the advances of the individual predictors. Here, based on the observation that inter-helical interactions are commonly found within the transmembrane helixes (TMHs) and strongly indicate the existence of them, we present a novel consensus topology prediction method for αTMPs, CNTOP, which incorporates four top leading individual topology predictors, and further improves the prediction accuracy by using the predicted inter-helical interactions. The method achieved 87% prediction accuracy based on a benchmark dataset and 78% accuracy based on a non-redundant dataset which is composed of polytopic αTMPs. Our method derives the highest topology accuracy than any other individual predictors and consensus predictors, at the same time, the TMHs are more accurately predicted in their length and locations, where both the false positives (FPs) and the false negatives (FNs) decreased dramatically. The CNTOP is available at: http://ccst.jlu.edu.cn/JCSB/cntop/CNTOP.html.

Hennerdal & Elofsson (2011) Rapid membrane protein topology prediction. Bioinformatics 27:1322-3. (pmid: 21493661)

PubMed ] [ DOI ] UNLABELLED: State-of-the-art methods for topology of α-helical membrane proteins are based on the use of time-consuming multiple sequence alignments obtained from PSI-BLAST or other sources. Here, we examine if it is possible to use the consensus of topology prediction methods that are based on single sequences to obtain a similar accuracy as the more accurate multiple sequence-based methods. Here, we show that TOPCONS-single performs better than any of the other topology prediction methods tested here, but ~6% worse than the best method that is utilizing multiple sequence alignments. AVAILABILITY AND IMPLEMENTATION: TOPCONS-single is available as a web server from http://single.topcons.net/ and is also included for local installation from the web site. In addition, consensus-based topology predictions for the entire international protein index (IPI) is available from the web server and will be updated at regular intervals.

Briesemeister et al. (2010) YLoc--an interpretable web server for predicting subcellular localization. Nucleic Acids Res 38:W497-502. (pmid: 20507917)

PubMed ] [ DOI ] Predicting subcellular localization has become a valuable alternative to time-consuming experimental methods. Major drawbacks of many of these predictors is their lack of interpretability and the fact that they do not provide an estimate of the confidence of an individual prediction. We present YLoc, an interpretable web server for predicting subcellular localization. YLoc uses natural language to explain why a prediction was made and which biological property of the protein was mainly responsible for it. In addition, YLoc estimates the reliability of its own predictions. YLoc can, thus, assist in understanding protein localization and in location engineering of proteins. The YLoc web server is available online at www.multiloc.org/YLoc.

Punta et al. (2007) Membrane protein prediction methods. Methods 41:460-74. (pmid: 17367718)

PubMed ] [ DOI ] We survey computational approaches that tackle membrane protein structure and function prediction. While describing the main ideas that have led to the development of the most relevant and novel methods, we also discuss pitfalls, provide practical hints and highlight the challenges that remain. The methods covered include: sequence alignment, motif search, functional residue identification, transmembrane segment and protein topology predictions, homology and ab initio modeling. In general, predictions of functional and structural features of membrane proteins are improving, although progress is hampered by the limited amount of high-resolution experimental information available. While predictions of transmembrane segments and protein topology rank among the most accurate methods in computational biology, more attention and effort will be required in the future to ameliorate database search, homology and ab initio modeling.


Localization

Yu et al. (2010) PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes. Bioinformatics 26:1608-15. (pmid: 20472543)

PubMed ] [ DOI ] MOTIVATION: PSORTb has remained the most precise bacterial protein subcellular localization (SCL) predictor since it was first made available in 2003. However, the recall needs to be improved and no accurate SCL predictors yet make predictions for archaea, nor differentiate important localization subcategories, such as proteins targeted to a host cell or bacterial hyperstructures/organelles. Such improvements should preferably be encompassed in a freely available web-based predictor that can also be used as a standalone program. RESULTS: We developed PSORTb version 3.0 with improved recall, higher proteome-scale prediction coverage, and new refined localization subcategories. It is the first SCL predictor specifically geared for all prokaryotes, including archaea and bacteria with atypical membrane/cell wall topologies. It features an improved standalone program, with a new batch results delivery system complementing its web interface. We evaluated the most accurate SCL predictors using 5-fold cross validation plus we performed an independent proteomics analysis, showing that PSORTb 3.0 is the most accurate but can benefit from being complemented by Proteome Analyst predictions. AVAILABILITY: http://www.psort.org/psortb (download open source software or use the web interface). CONTACT: psort-mail@sfu.ca SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Briesemeister et al. (2010) YLoc--an interpretable web server for predicting subcellular localization. Nucleic Acids Res 38:W497-502. (pmid: 20507917)

PubMed ] [ DOI ] Predicting subcellular localization has become a valuable alternative to time-consuming experimental methods. Major drawbacks of many of these predictors is their lack of interpretability and the fact that they do not provide an estimate of the confidence of an individual prediction. We present YLoc, an interpretable web server for predicting subcellular localization. YLoc uses natural language to explain why a prediction was made and which biological property of the protein was mainly responsible for it. In addition, YLoc estimates the reliability of its own predictions. YLoc can, thus, assist in understanding protein localization and in location engineering of proteins. The YLoc web server is available online at www.multiloc.org/YLoc.

Repeats

Heger & Holm (2000) Rapid automatic detection and alignment of repeats in protein sequences. Proteins 41:224-37. (pmid: 10966575)

PubMed ] [ <224::aid-prot70>3.0.co;2-z DOI ] Many large proteins have evolved by internal duplication and many internal sequence repeats correspond to functional and structural units. We have developed an automatic algorithm, RADAR, for segmenting a query sequence into repeats. The segmentation procedure has three steps: (i) repeat length is determined by the spacing between suboptimal self-alignment traces; (ii) repeat borders are optimized to yield a maximal integer number of repeats, and (iii) distant repeats are validated by iterative profile alignment. The method identifies short composition biased as well as gapped approximate repeats and complex repeat architectures involving many different types of repeats in the query sequence. No manual intervention and no prior assumptions on the number and length of repeats are required. Comparison to the Pfam-A database indicates good coverage, accurate alignments, and reasonable repeat borders. Screening the Swissprot database revealed 3,000 repeats not annotated in existing domain databases. A number of these repeats had been described in the literature but most were novel. This illustrates how in times when curated databases grapple with ever increasing backlogs, automatic (re)analysis of sequences provides an efficient way to capture this important information.

http://www.ebi.ac.uk/Tools/pfa/radar/

Pellegrini et al. (2012) Ab initio detection of fuzzy amino acid tandem repeats in protein sequences. BMC Bioinformatics 13 Suppl 3:S8. (pmid: 22536906)

PubMed ] [ DOI ] BACKGROUND: Tandem repetitions within protein amino acid sequences often correspond to regular secondary structures and form multi-repeat 3D assemblies of varied size and function. Developing internal repetitions is one of the evolutionary mechanisms that proteins employ to adapt their structure and function under evolutionary pressure. While there is keen interest in understanding such phenomena, detection of repeating structures based only on sequence analysis is considered an arduous task, since structure and function is often preserved even under considerable sequence divergence (fuzzy tandem repeats). RESULTS: In this paper we present PTRStalker, a new algorithm for ab-initio detection of fuzzy tandem repeats in protein amino acid sequences. In the reported results we show that by feeding PTRStalker with amino acid sequences from the UniProtKB/Swiss-Prot database we detect novel tandemly repeated structures not captured by other state-of-the-art tools. Experiments with membrane proteins indicate that PTRStalker can detect global symmetries in the primary structure which are then reflected in the tertiary structure. CONCLUSIONS: PTRStalker is able to detect fuzzy tandem repeating structures in protein sequences, with performance beyond the current state-of-the art. Such a tool may be a valuable support to investigating protein structural properties when tertiary X-ray data is not available.

Biegert & Söding (2008) De novo identification of highly diverged protein repeats by probabilistic consistency. Bioinformatics 24:807-14. (pmid: 18245125)

PubMed ] [ DOI ] MOTIVATION: An estimated 25% of all eukaryotic proteins contain repeats, which underlines the importance of duplication for evolving new protein functions. Internal repeats often correspond to structural or functional units in proteins. Methods capable of identifying diverged repeated segments or domains at the sequence level can therefore assist in predicting domain structures, inferring hypotheses about function and mechanism, and investigating the evolution of proteins from smaller fragments. RESULTS: We present HHrepID, a method for the de novo identification of repeats in protein sequences. It is able to detect the sequence signature of structural repeats in many proteins that have not yet been known to possess internal sequence symmetry, such as outer membrane beta-barrels. HHrepID uses HMM-HMM comparison to exploit evolutionary information in the form of multiple sequence alignments of homologs. In contrast to a previous method, the new method (1) generates a multiple alignment of repeats; (2) utilizes the transitive nature of homology through a novel merging procedure with fully probabilistic treatment of alignments; (3) improves alignment quality through an algorithm that maximizes the expected accuracy; (4) is able to identify different kinds of repeats within complex architectures by a probabilistic domain boundary detection method and (5) improves sensitivity through a new approach to assess statistical significance. AVAILABILITY: Server: http://toolkit.tuebingen.mpg.de/hhrepid; Executables: ftp://ftp.tuebingen.mpg.de/pub/protevo/HHrepID

http://toolkit.lmb.uni-muenchen.de/hhrepid

Integrated tools

Rost et al. (2004) The PredictProtein server. Nucleic Acids Res 32:W321-6. (pmid: 15215403)

PubMed ] [ DOI ] PredictProtein (http://www.predictprotein.org) is an Internet service for sequence analysis and the prediction of protein structure and function. Users submit protein sequences or alignments; PredictProtein returns multiple sequence alignments, PROSITE sequence motifs, low-complexity regions (SEG), nuclear localization signals, regions lacking regular structure (NORS) and predictions of secondary structure, solvent accessibility, globular regions, transmembrane helices, coiled-coil regions, structural switch regions, disulfide-bonds, sub-cellular localization and functional annotations. Upon request fold recognition by prediction-based threading, CHOP domain assignments, predictions of transmembrane strands and inter-residue contacts are also available. For all services, users can submit their query either by electronic mail or interactively via the World Wide Web.

Cong & Grishin (2012) MESSA: MEta-Server for protein Sequence Analysis. BMC Biol 10:82. (pmid: 23031578)

PubMed ] [ DOI ] BACKGROUND: Computational sequence analysis, that is, prediction of local sequence properties, homologs, spatial structure and function from the sequence of a protein, offers an efficient way to obtain needed information about proteins under study. Since reliable prediction is usually based on the consensus of many computer programs, meta-severs have been developed to fit such needs. Most meta-servers focus on one aspect of sequence analysis, while others incorporate more information, such as PredictProtein for local sequence feature predictions, SMART for domain architecture and sequence motif annotation, and GeneSilico for secondary and spatial structure prediction. However, as predictions of local sequence properties, three-dimensional structure and function are usually intertwined, it is beneficial to address them together. RESULTS: We developed a MEta-Server for protein Sequence Analysis (MESSA) to facilitate comprehensive protein sequence analysis and gather structural and functional predictions for a protein of interest. For an input sequence, the server exploits a number of select tools to predict local sequence properties, such as secondary structure, structurally disordered regions, coiled coils, signal peptides and transmembrane helices; detect homologous proteins and assign the query to a protein family; identify three-dimensional structure templates and generate structure models; and provide predictive statements about the protein's function, including functional annotations, Gene Ontology terms, enzyme classification and possible functionally associated proteins. We tested MESSA on the proteome of Candidatus Liberibacter asiaticus. Manual curation shows that three-dimensional structure models generated by MESSA covered around 75% of all the residues in this proteome and the function of 80% of all proteins could be predicted. AVAILABILITY: MESSA is free for non-commercial use at http://prodata.swmed.edu/MESSA/

Ooi et al. (2009) ANNIE: integrated de novo protein sequence annotation. Nucleic Acids Res 37:W435-40. (pmid: 19389726)

PubMed ] [ DOI ] Function prediction of proteins with computational sequence analysis requires the use of dozens of prediction tools with a bewildering range of input and output formats. Each of these tools focuses on a narrow aspect and researchers are having difficulty obtaining an integrated picture. ANNIE is the result of years of close interaction between computational biologists and computer scientists and automates an essential part of this sequence analytic process. It brings together over 20 function prediction algorithms that have proven sufficiently reliable and indispensable in daily sequence analytic work and are meant to give scientists a quick overview of possible functional assignments of sequence segments in the query proteins. The results are displayed in an integrated manner using an innovative AJAX-based sequence viewer. ANNIE is available online at: http://annie.bii.a-star.edu.sg. This website is free and open to all users and there is no login requirement.

Quick2D analysis - incorporating PSIPRED, JNET, Coils, MEMSAT-SVM, IUPRED etc.

   

Further reading and resources