Expected Preparations:

  [RPR]
Scripting_data_downloads
  [FND]
Homology
 
  The units listed above are part of this course and contain important preparatory material.  

Keywords: Domain discovery by multiple sequence alignment; HMMER algorithm; Domain databases: InterPro, SMART, CDART; Annotation of sequences

Objectives:

This unit will …

  • … introduce the concept of domains in proteins and discuss the use of domains in sequence analysis;

  • … demonstrate key databases and services that are available for domain analysis on the Web;

  • … go through an exercise in domain annotation using R.

Outcomes:

After working through this unit you …

  • … are familiar with key databases and services for domain annotation;

  • … can store domain annotations as features in your protein database;

  • … are able to use R’s plot function for the production of generic data driven graphics.


Deliverables:

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.

Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.

Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.


Evaluation:

NA: This unit is not evaluated for course marks.

Contents

This unit introduces the observation that evolution composes higher-order functions from domains that are folding units, functional units, and units of inheritance. It then covers some of the databases and services that support discovery and analysis of domains, and guides you through an exercise in domain annotation.

Task…

 

InterPro

 

The InterPro protein families and domain database is a large, curated collection of domain definitions and domain annotations hosted by the EBI. Here, we use a resource that was inherited from Pfam, a resource that pioneered the discovery of protein domains from multiple sequence alignments, and their representation as Hidden Markov Models (a technique that can define a probability that a given sequence is part of the family that the model was trained on).
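To make the idea of a probabilistic family model more concrete, here is a minimal sketch in R. It is not HMMER and not a full profile HMM: it uses match states only (no insert or delete states), a made-up toy alignment, and assumes a uniform background frequency. The names toyMSA, profile, and scoreSeq are hypothetical and do not appear in the course code.

```r
# Toy alignment of a four-residue motif (made-up data, for illustration only)
toyMSA <- c("ACDE", "ACDE", "ACEE", "GCDE", "ACDD")

AA <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]
msaMat <- do.call(rbind, strsplit(toyMSA, ""))  # rows: sequences, cols: positions

# Position-specific probabilities, with a pseudocount of 1 per amino acid
profile <- sapply(seq_len(ncol(msaMat)), function(j) {
  counts <- table(factor(msaMat[, j], levels = AA))
  (as.numeric(counts) + 1) / (nrow(msaMat) + length(AA))
})
rownames(profile) <- AA

# Log-odds score (in bits) of a sequence against the profile,
# relative to an assumed uniform background frequency of 1/20
scoreSeq <- function(s, prof, bg = 1 / 20) {
  x <- strsplit(s, "")[[1]]
  stopifnot(length(x) == ncol(prof), all(x %in% rownames(prof)))
  p <- prof[cbind(match(x, rownames(prof)), seq_along(x))]
  sum(log2(p / bg))
}

scoreSeq("ACDE", profile)   # resembles the alignment: positive score
scoreSeq("WWWW", profile)   # unlike the alignment: negative score
```

A positive score means the sequence is more probable under the family model than under the background; a real profile HMM additionally models insertions and deletions and uses a realistic background distribution.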

Task…

  • Retrieve the MYSPE UniProt ID from your Journal, or by issuing the following R commands:
pName <- sprintf("MBP1_%s", biCode(MYSPE))
sel <- which(myDB$protein$name == pName)
myDB$protein$UniProtID[sel]
  • Navigate to the InterPro database. We will retrieve domain annotations for your Mbp1 orthologue. You can search the database by FASTA sequence, ID, or keyword.
  • Open the Search by text tab.
  • Enter the UniProt ID into the search field and click Search.
  • Study the annotations. Export the annotations as a JSON file, download the file into your .data/ folder, and name it Mbp1MYSPE_Pfam.JSON.
  • Are all expected domains present? (APSES or KilA-N domain, disordered segments, Ankyrin domains, coiled coil, … others ?)
  • “Disorder”, “low complexity”, and “coiled coil” annotations are not based on alignments, but on sequence analysis algorithms. Visit Finn et al. (2014) and read how these regions are defined (see footnote 1).
  • Study the (well curated!) family annotations for:

 

SMART domain annotation

 

The SMART database at the EMBL in Heidelberg integrates a number of feature detection tools including Pfam / InterPro domain annotation and its own, HMM based SMART domain database. You can search by sequence, or by accession number and retrieve domain annotations and more.

 

Visual comparison of domain annotations in R

 

The versatile plotting functions of R allow us to compare domain annotations. The distribution of segments that are annotated as “low-complexity”, and are presumably disordered, is particularly interesting. These are functional features that are often not alignable, since there is no selective pressure on sequence similarity; they may have arisen from convergent evolution, or diverged while maintaining average composition rather than specific sequence. Sequence alignment is, after all, based on amino acid pair scores that have been optimized to detect amino acids that behave similarly in the same context of folded proteins.
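What "maintaining average composition" means can be illustrated by measuring how biased the amino acid composition is along a sequence. The following is a minimal sketch, assuming a simple sliding-window Shannon entropy; it is not the SEG algorithm used by the annotation services, and the sequence toySeq as well as the function name windowEntropy are made up for illustration.

```r
# Shannon entropy (bits) of the amino acid composition in a sliding window
windowEntropy <- function(s, w = 12) {
  x <- strsplit(s, "")[[1]]
  stopifnot(length(x) >= w)
  starts <- seq_len(length(x) - w + 1)
  sapply(starts, function(i) {
    p <- table(x[i:(i + w - 1)]) / w   # residue frequencies in this window
    -sum(p * log2(p))
  })
}

# Made-up sequence: a serine-rich stretch flanked by mixed sequence
toySeq <- paste0("MKVLDERTGWHA", "SSSSNSSSSQSS", "PIFYCRDEKLMV")

H <- windowEntropy(toySeq, w = 12)
which.min(H)   # start position of the most compositionally biased window
```

Windows with low entropy are dominated by a few residue types. Two such segments can have the same composition, and thus the same entropy, without being alignable position by position.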

In the following code tutorial, we create a plot similar to the CDD and SMART displays. It is based on the SMART domain annotations of the reference species in our protein database.
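Before you open the script, the basic idea of such a display can be sketched with base R graphics. This is a minimal sketch and not the code in BIN-FUNC-Domain_annotation.R: the data frame toyAnn, the lengths in toyLen, the colours, and the function plotDomains are all hypothetical, and the coordinates are made up rather than taken from SMART or from myDB.

```r
# Hypothetical annotations: one row per annotated feature (made-up coordinates)
toyAnn <- data.frame(
  protein = c("MBP1_A", "MBP1_A", "MBP1_A", "MBP1_B", "MBP1_B"),
  feature = c("KilA-N", "low complexity", "ANK", "KilA-N", "ANK"),
  start   = c(20, 150, 400, 25, 380),
  end     = c(105, 190, 430, 110, 410),
  stringsAsFactors = FALSE
)
toyLen  <- c(MBP1_A = 830, MBP1_B = 790)   # made-up sequence lengths
featCol <- c("KilA-N" = "firebrick",
             "low complexity" = "grey60",
             "ANK" = "steelblue")

plotDomains <- function(ann, len, cols) {
  n <- length(len)
  # empty plot frame: x is sequence position, y has one row per protein
  plot(0, 0, type = "n", xlim = c(0, max(len)), ylim = c(0.5, n + 0.5),
       xlab = "sequence position", ylab = "", yaxt = "n", bty = "n")
  axis(2, at = seq_len(n), labels = names(len), las = 1, tick = FALSE)
  for (i in seq_len(n)) {
    # backbone line for the full-length protein
    segments(1, i, len[i], i, lwd = 2)
    sel <- ann$protein == names(len)[i]
    # one coloured box per annotated feature
    rect(ann$start[sel], i - 0.2, ann$end[sel], i + 0.2,
         col = cols[ann$feature[sel]], border = NA)
  }
  legend("topright", legend = names(cols), fill = cols, bty = "n", cex = 0.8)
  invisible(n)   # number of proteins drawn
}

plotDomains(toyAnn, toyLen, featCol)
```

Drawing into an empty frame with segments() and rect() keeps every graphical element under explicit control, which is what makes the generic plot() function suitable for data-driven graphics of this kind.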

Task…

  • Open RStudio and load the ABC-units R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
  • Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included. This ensures that your data and code remain up to date when we update, or fix bugs.
  • Type init() if requested.
  • Open the file BIN-FUNC-Domain_annotation.R and follow the instructions.

     

    Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of the code.

After you have worked through this code, your plot should look similar to this one:

SMART domain annotations for Mbp1 proteins for the ten reference fungi. Plot produced by code discussed in BIN-FUNC-Domain_annotation.R from annotation data stored in myDB.

 

CDART

 

The CDART database (Conserved Domain Architecture Retrieval Tool) finds proteins that have a domain architecture similar to that of a query. This has the potential to find homologous and functionally related proteins that are far too dissimilar to be detected with sequence similarity searches.
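The notion of comparing proteins by domain architecture rather than by sequence can be sketched in a few lines of R. This is a minimal sketch under simplifying assumptions and not CDART's actual scoring: it compares only the sets of domain types (Jaccard similarity) and ignores domain order and copy number. The architectures in arch and the function archSim are made up for illustration.

```r
# Hypothetical domain architectures: domain names in N- to C-terminal order
arch <- list(
  query = c("KilA-N", "ANK", "ANK"),
  protA = c("KilA-N", "ANK", "ANK"),
  protB = c("KilA-N", "ANK", "ANK", "Atrophin"),
  protC = c("SH3", "Kinase")
)

# Jaccard similarity of the sets of domain types in two architectures
archSim <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Compare every other protein to the query and rank by similarity
sims <- sapply(arch[-1], archSim, b = arch$query)
sort(sims, decreasing = TRUE)
```

Two proteins can score highly here even if their sequences have diverged beyond what a sequence similarity search detects, as long as each domain is still recognized by its profile.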

Task…

  • Retrieve the MYSPE RefSeq ID from your Journal or by issuing the following R commands:
pName <- sprintf("MBP1_%s", biCode(MYSPE))
sel <- which(myDB$protein$name == pName)
myDB$protein$RefSeqID[sel]
  • Navigate to CDART.
  • Paste your Mbp1 protein ID and click Submit.
  • Note that the first page of the results list (which is very long: more than 2,000 pages) shows proteins that contain both KilA-N and Ankyrin domains. However, a few other domains are found as well (Atrophin, GNVR, SMV_N), and this raises the intriguing possibility that the MBP1_MYSPE protein might contain some or all of these as well, although the sequence similarity may be too low to detect this outright.
  • Study the domain annotations for

 

Further Reading

El-Gebali, Sara et al. (2019). “The Pfam protein families database in 2019”. Nucleic Acids Research 47(D1):D427–D432.
[PMID: 30357350] [DOI: 10.1093/nar/gky995]

The last few years have witnessed significant changes in Pfam (https://pfam.xfam.org). The number of families has grown substantially to a total of 17,929 in release 32.0. New additions have been coupled with efforts to improve existing families, including refinement of domain boundaries, their classification into Pfam clans, as well as their functional annotation. We recently began to collaborate with the RepeatsDB resource to improve the definition of tandem repeat families within Pfam. We carried out a significant comparison to the structural classification database, namely the Evolutionary Classification of Protein Domains (ECOD) that led to the creation of 825 new families based on their set of uncharacterized families (EUFs). Furthermore, we also connected Pfam entries to the Sequence Ontology (SO) through mapping of the Pfam type definitions to SO terms. Since Pfam has many community contributors, we recently enabled the linking between authorship of all Pfam entries with the corresponding authors’ ORCID identifiers. This effectively permits authors to claim credit for their Pfam curation and link them to their ORCID record.

Letunic, Ivica and Peer Bork. (2018). “20 years of the SMART protein domain annotation resource”. Nucleic Acids Research 46(D1):D493–D496.
[PMID: 29040681] [DOI: 10.1093/nar/gkx922]

SMART (Simple Modular Architecture Research Tool) is a web resource (http://smart.embl.de) for the identification and annotation of protein domains and the analysis of protein domain architectures. SMART version 8 contains manually curated models for more than 1300 protein domains, with approximately 100 new models added since our last update article (1). The underlying protein databases were synchronized with UniProt (2), Ensembl (3) and STRING (4), doubling the total number of annotated domains and other protein features to more than 200 million. In its 20th year, the SMART analysis results pages have been streamlined again and its information sources have been updated. SMART’s vector based display engine has been extended to all protein schematics in SMART and rewritten to use the latest web technologies. The internal full text search engine has been redesigned and updated, resulting in greatly increased search speed.

Geer, Lewis Y et al. (2002). “CDART: protein homology by domain architecture”. Genome Research 12(10):1619–23.
[PMID: 12368255] [DOI: 10.1101/gr.278202]

The Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the NCBI Entrez Protein Database based on domain architecture, defined as the sequential order of conserved domains in proteins. The algorithm finds protein similarities across significant evolutionary distances using sensitive protein domain profiles rather than by direct sequence similarity. Proteins similar to a query protein are grouped and scored by architecture. Relying on domain profiles allows CDART to be fast, and, because it relies on annotated functional domains, informative. Domain profiles are derived from several collections of domain definitions that include functional annotation. Searches can be further refined by taxonomy and by selecting domains of interest. CDART is available at http://www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi.

Yang, Mingzhang et al. (2020). “NCBI’s Conserved Domain Database and Tools for Protein Domain Analysis”. Current Protocols in Bioinformatics 69(1):e90.
[PMID: 31851420] [DOI: 10.1002/cpbi.90]

The Conserved Domain Database (CDD) is a freely available resource for the annotation of sequences with the locations of conserved protein domain footprints, as well as functional sites and motifs inferred from these footprints. It includes protein domain and protein family models curated in house by CDD staff, as well as imported from a variety of other sources. The latest CDD release (v3.17, April 2019) contains more than 57,000 domain models, of which almost 15,000 were curated by CDD staff. The CDD curation effort increases coverage and provides finer-grained classifications of common and widely distributed protein domain families, for which a wealth of functional and structural data have become available. The CDD maintains both live search capabilities and an archive of pre-computed domain annotations for a selected subset of sequences tracked by the NCBI’s Entrez protein database. These can be retrieved or computed for a single sequence using CD-Search or in bulk using Batch CD-Search, or computed via standalone RPS-BLAST plus the rpsbproc software package. The CDD can be accessed via https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. The three protocols listed here describe how to perform a CD-Search (Basic Protocol 1), a Batch CD-Search (Basic Protocol 2), and a Standalone RPS-BLAST and rpsbproc (Basic Protocol 3). © 2019 The Authors. Basic Protocol 1: CD-search Basic Protocol 2: Batch CD-search Basic Protocol 3: Standalone RPS-BLAST and rpsbproc.

Questions, comments

If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.

Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.

References

Page ID: BIN-FUNC-Domain_annotation

Author:
Boris Steipe ( <boris.steipe@utoronto.ca> )
Created:
2017-08-05
Last modified:
2022-12-06
Version:
1.2
Version History:
–  1.2 202: Pfam now hosted by InterPro
–  1.1 2020 Updates, fixed Pfam/SMART annotation mix-up
–  1.0 Live version
–  0.1 First stub, import of 2016 material
Tagged with:
–  Unit
–  Live
–  Has lecture slides
–  Has R code examples
–  Links to R course project
–  Contains images
–  Has further reading

 

[END]


  1. “Disorder” comes from IUPred predictions, “low complexity regions” are predicted by SEG, and “coiled coils” are predicted according to a scale developed by Rob Russell and Rune Linding; their coils server appears defunct but a similar algorithm (due to A. Lupas) is hosted at SIB.