BIN-FUNC-Domain annotation
Domain Annotation
Keywords: Domain discovery by multiple sequence alignment; HMMER algorithm; Domain databases: Pfam, SMART, CDART; Annotation of sequences
Abstract
This unit introduces the observation that evolution composes higher-order functions from domains that are folding units, functional units, and units of inheritance. It then covers some of the databases and services that support discovery and analysis of domains, and guides you through an exercise in domain annotation.
Prerequisites
You need to complete the following units before beginning this one:
- RPR-Scripting_data_downloads (Scripting Data Downloads)
- FND-Homology (Concepts and Consequences of Homology)
Objectives
This unit will ...
- ... introduce the concept of domains in proteins and discuss the use of domains in sequence analysis;
- ... demonstrate key databases and services that are available for domain analysis on the Web;
- ... go through an exercise in domain annotation using R.
Outcomes
After working through this unit you ...
- ... are familiar with key databases and services for domain annotation;
- ... can store domain annotations as features in your protein database;
- ... are able to use R's plot function for the production of generic data driven graphics.
Deliverables
- Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
- Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
- Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Contents
Task:
- Read the introductory notes on how domain annotations support the annotation of gene function.
Pfam
The Pfam protein domain family database is a large, curated collection of domain definitions and domain annotations. Domains are discovered from multiple sequence alignments, and represented as Hidden Markov Models (a probabilistic technique that can assign a probability that a given sequence is part of the family the model was trained on).
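To illustrate the idea of scoring a sequence against a family model, here is a minimal sketch in R. It is a deliberate simplification: it builds a position-specific probability matrix from a toy alignment and computes a log-odds score, but it has no insert or delete states, so it is not a full Hidden Markov Model and not what HMMER or Pfam actually compute. The alignment, the uniform background frequency, the pseudocount, and the function names buildProfile() and scoreSeq() are all assumptions made for this example.

```r
# Toy alignment of an invented 5-residue motif (NOT a real Pfam family)
aln <- c("ACDEF",
         "ACDEY",
         "ACEEF",
         "SCDEF")

aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

# Build a position-specific probability matrix (rows: residues,
# columns: alignment positions), smoothed with a pseudocount.
buildProfile <- function(aln, alphabet, pseudo = 1) {
  m <- do.call(rbind, strsplit(aln, ""))   # one row per sequence
  prof <- vapply(seq_len(ncol(m)), function(j) {
    counts <- as.numeric(table(factor(m[, j], levels = alphabet)))
    (counts + pseudo) / (sum(counts) + pseudo * length(alphabet))
  }, numeric(length(alphabet)))
  rownames(prof) <- alphabet
  prof
}

# Log-odds score (in bits) of a sequence against the profile,
# relative to a uniform background (an assumption of this sketch).
scoreSeq <- function(s, prof, bg = 1 / nrow(prof)) {
  x <- strsplit(s, "")[[1]]
  stopifnot(length(x) == ncol(prof), all(x %in% rownames(prof)))
  p <- prof[cbind(match(x, rownames(prof)), seq_along(x))]
  sum(log2(p / bg))
}

prof <- buildProfile(aln, aa)
scoreSeq("ACDEF", prof)   # resembles the family: positive score
scoreSeq("WWWWW", prof)   # unrelated composition: negative score
```

A sequence that matches the preferred residue at each position scores higher than one that does not. A real profile HMM additionally models insertions and deletions and uses realistic background frequencies.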
Task:
- Retrieve the MYSPE UniProt ID from your Journal, or by issuing the following R commands:
pName <- sprintf("MBP1_%s", biCode(MYSPE))
sel <- which(myDB$protein$name == pName)
myDB$protein$UniProtID[sel]
- Navigate to the Pfam database.
- Search for your Mbp1 protein by entering the ID into the search field and clicking Go.
- Study the annotations. Download the annotations as a JSON file and name it Mbp1MYSPE_Pfam.JSON.
- Are all expected domains present? (APSES or KilA-N domain, disordered segments, Ankyrin domains, coiled coil, more ?)
- "disorder", "low complexity", and "coiled coil" annotations are not based on alignments, but on sequence analysis algorithms. Visit Finn et al. (2014) and read how these regions are defined[1].
- Study the family annotations for:
- Note that neither the domain definitions on the sequence nor the descriptions are identical to Pfam.
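The point that "low complexity" annotations come from sequence analysis rather than from alignments can be made concrete with a small sketch. The code below computes the Shannon entropy of the residue composition in a sliding window; windows with low entropy are compositionally biased. This is only a simplified stand-in for the idea, not the SEG algorithm itself. The test sequence is invented, and the function names windowEntropy() and entropyProfile() and the window width are assumptions of this example.

```r
# Shannon entropy (in bits) of the residue composition of one window
windowEntropy <- function(x) {
  p <- table(x) / length(x)   # only residues that occur, so all p > 0
  -sum(p * log2(p))
}

# Entropy of every window of width w along a sequence string
entropyProfile <- function(s, w = 12) {
  x <- strsplit(s, "")[[1]]
  stopifnot(length(x) >= w)
  vapply(seq_len(length(x) - w + 1),
         function(i) windowEntropy(x[i:(i + w - 1)]),
         numeric(1))
}

# Invented sequence: a mixed segment followed by a Ser/Asn-rich run
s <- paste0("MKVLAEDGHTRPWQ", "SSSSNSSSNSSSSS")
h <- entropyProfile(s)

round(h, 2)   # high values at the start, low values in the S/N-rich run
```

Plotting h against the window start position, for example with plot(h, type = "l"), shows where the composition becomes biased. Where to place a cutoff is a judgment call that this sketch does not try to settle.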
SMART domain annotation
The SMART database at the EMBL in Heidelberg integrates a number of feature detection tools, including Pfam domain annotation and its own HMM-based SMART domain database. You can search by sequence or by accession number, and retrieve domain annotations and more.
SMART search
Task:
- Access the SMART database at http://smart.embl-heidelberg.de/
- Click the link to access SMART in the normal mode.
- Paste the MYSPE Mbp1 UniProtID into the Sequence ID or ACC field.
- Check all the boxes for:
- outlier homologues (also including homologues in the PDB structure database)
- PFAM domains (domains defined by sequence similarity in the PFAM database)
- signal peptides (using Gunnar von Heijne's SignalP 4.0 server at the Technical University in Lyngby, Denmark)
- internal repeats (using the programs ariadne and prospero at the Wellcome Trust Centre for Human Genetics at Oxford University, England)
- Click on Sequence SMART to run the search and annotation. (In case you get an error like: "Sorry, your entry seems to have no SMART domain ...", try again with the actual sequence instead of the accession number.)
Study the results.
- Study the (well curated!) family annotations for:
- Note down the following information so you can enter the annotation in the protein database for MYSPE:
- From the section on "Confidently predicted domains ..."
- The start and end coordinates of the KilA-N domain (...according to SMART, not Pfam, in case the two differ).
- All start and end coordinates of low complexity segments
- All start and end coordinates of ANK (Ankyrin) domains
- Start and end coordinates of coiled coil domain(s). (I expect only one.)
- Start and end coordinates of AT hook domain(s). (I expect that some, but not all, Mbp1 orthologues have one.)
- From the section on "Features NOT shown ..."
- All start and end coordinates of low complexity segments for which the Reason is "overlap".
- Any start and end coordinates of overlapping coiled coil segments.
- I expect all other annotations - besides the overlapping KilA-N domain defined by Pfam - to arise from the succession of ankyrin domains that the proteins have, both Pfam_ANK.. domains, as well as internal repeats, OR to be excluded because they did not exceed the significance threshold. However, if there are other features I have not mentioned here, please let me know.
- From the section on "Outlier homologues ..."
- Start and end coordinates of a PDB:1SW6|B annotation (if you have one): this is a region of sequence similarity to a protein for which the 3D structural coordinates are known.
- Of course there should also be annotations to the structure of 1BM8 / 1MB1 and/or 1L3G - all of which are structures of the Mbp1 APSES domain that we have already annotated as an "APSES fold" feature previously. And there will be BLAST annotations to Ankyrin domains. We will not annotate these separately either.
- Follow the links to the database entries for the information so you know what these domains and features are.
Next we'll enter the features into our database, so we can compare them with the annotations that I have prepared from SMART annotations of Mbp1 orthologues from the ten reference fungi.
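Before entering coordinates into a database it helps to collect them in one table and check them. The following is a minimal sketch of that step. The column names, the function validateFeatures(), the sequence length, and every coordinate are invented for illustration; they are not the schema of myDB and not real SMART results, so replace them with the values you noted down.

```r
# Hypothetical feature table: all names and coordinates are invented
myFeatures <- data.frame(
  protein = "MBP1_MYSPE",
  feature = c("KilA-N", "low complexity", "ANK", "ANK", "coiled coil"),
  start   = c( 20, 230, 370, 500, 640),
  end     = c(105, 250, 400, 530, 670),
  source  = "SMART",
  stringsAsFactors = FALSE
)

# Check that coordinates are complete, ordered, and within the sequence;
# return the table sorted by start position.
validateFeatures <- function(ft, seqLength) {
  ok <- !is.na(ft$start) & !is.na(ft$end) &
        ft$start >= 1 & ft$start <= ft$end & ft$end <= seqLength
  if (!all(ok)) {
    stop("Invalid coordinates in row(s): ",
         paste(which(!ok), collapse = ", "))
  }
  ft[order(ft$start), ]
}

# seqLength = 800 is an assumed protein length for this example
checked <- validateFeatures(myFeatures, seqLength = 800)
checked$length <- checked$end - checked$start + 1
checked
```

Catching a swapped start and end, or a coordinate beyond the end of the sequence, is much easier at this stage than after the annotation has been merged into the database.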
Visual comparison of domain annotations in R
The versatile plotting functions of R allow us to compare domain annotations. The distribution of segments that are annotated as "low-complexity", and presumably disordered, is particularly interesting: these are functional features that are often not alignable, since there is no selective pressure on sequence similarity. They may have arisen from convergent evolution, or diverged while maintaining average composition rather than specific sequence. Sequence alignment is, after all, based on amino acid pair scores that have been optimized to detect amino acids that behave similarly in the same context of folded proteins.
In the following code tutorial, we create a plot similar to the CDD and SMART displays. It is based on the SMART domain annotations of the reference species in our protein database.
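To give a first impression of how such a plot can be built from base R graphics, here is a minimal sketch. It is not the code from the course script: the function plotDomains(), the protein names, the colours, and all coordinates are invented for this example. Each protein is drawn as a horizontal line, and each annotated feature as a coloured rectangle on top of it.

```r
# Invented example data: two hypothetical proteins and their features
demo <- data.frame(
  protein = c("PROT_A", "PROT_A", "PROT_B", "PROT_B"),
  feature = c("KilA-N", "ANK",    "KilA-N", "ANK"),
  start   = c( 20, 370,  35, 410),
  end     = c(105, 400, 120, 440),
  stringsAsFactors = FALSE
)
lens <- c(PROT_A = 800, PROT_B = 700)            # assumed sequence lengths
cols <- c("KilA-N" = "#BB4444", "ANK" = "#4488BB")

plotDomains <- function(ft, seqLengths, colours) {
  prot <- names(seqLengths)
  oldPar <- par(mar = c(4, 6, 1, 1))             # room for protein labels
  on.exit(par(oldPar))
  # empty plot frame that spans the longest sequence
  plot(0, 0, type = "n",
       xlim = c(0, max(seqLengths)), ylim = c(0.5, length(prot) + 0.5),
       xlab = "sequence position", ylab = "", yaxt = "n")
  axis(2, at = seq_along(prot), labels = prot, las = 1, cex.axis = 0.8)
  # one backbone line per protein
  segments(1, seq_along(prot), seqLengths, seq_along(prot))
  # one rectangle per annotated feature
  y <- match(ft$protein, prot)
  rect(ft$start, y - 0.3, ft$end, y + 0.3,
       col = unname(colours[ft$feature]), border = NA)
  legend("bottomright", legend = names(colours), fill = colours,
         cex = 0.8, bty = "n")
  invisible(y)                                   # row index of each feature
}

plotDomains(demo, lens, cols)
```

The same pattern extends to more proteins and more feature types by adding rows to the data frame and entries to the colour vector.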
Task:
- Open RStudio and load the ABC-units R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
- Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
- Type init() if requested.
- Open the file BIN-FUNC-Domain_annotation.R and follow the instructions.
Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.
After you have worked through this code, your plot should look similar to this one:
CDART
The CDART database (Conserved Domain Architecture Retrieval Tool) finds proteins that have a domain architecture similar to that of a query. This has the potential to find homologous and functionally related proteins that are far too dissimilar to be detected with sequence similarity searches.
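The notion of a domain architecture, that is, the sequential order of domains along a protein, can be sketched in a few lines of R. The example below derives an architecture from a feature table, collapses tandem repeats of the same domain, and compares the sets of domains in two proteins. It only illustrates the concept and does not reproduce how CDART groups or scores its hits. The function names, the coordinates, and the placeholder domain "DomainX" are invented for this example.

```r
# Domain names ordered by their start coordinate
architecture <- function(ft) {
  ft$feature[order(ft$start)]
}

# Collapse tandem repeats, so that ANK-ANK-ANK is treated like ANK
collapseRepeats <- function(a) {
  rle(a)$values
}

# Jaccard similarity of the domain sets of two architectures
domainJaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Invented feature tables for a query and a hypothetical database hit
query <- data.frame(feature = c("ANK", "KilA-N", "ANK"),
                    start   = c(370, 20, 500),
                    stringsAsFactors = FALSE)
hit   <- data.frame(feature = c("KilA-N", "ANK", "ANK", "ANK", "DomainX"),
                    start   = c(15, 300, 340, 420, 600),
                    stringsAsFactors = FALSE)

qArch <- collapseRepeats(architecture(query))
hArch <- collapseRepeats(architecture(hit))

paste(qArch, collapse = " - ")
paste(hArch, collapse = " - ")
domainJaccard(qArch, hArch)
```

Two proteins can share an architecture even when their sequences have diverged beyond what a pairwise alignment detects, because the comparison is made on domain labels rather than on residues.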
Task:
- Retrieve the MYSPE RefSeq ID from your Journal or by issuing the following R commands:
pName <- sprintf("MBP1_%s", biCode(MYSPE))
sel <- which(myDB$protein$name == pName)
myDB$protein$RefSeqID[sel]
- Navigate to CDART.
- Paste your Mbp1 protein ID and click Submit.
- Note that the first page of the (very long!) results list shows proteins that contain both KilA-N and Ankyrin domains. However, a few other domains are found as well (Atrophin, GNVR, SMV_N), and this raises the intriguing possibility that the MBP1_MYSPE protein might contain some or all of these as well, although the sequence similarity may be too low to detect this outright.
- Study the domain annotations for
Further reading, links and resources
Finn et al. (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 44:D279-85. (pmid: 26673716)
In the last two years the Pfam database (http://pfam.xfam.org) has undergone a substantial reorganisation to reduce the effort involved in making a release, thereby permitting more frequent releases. Arguably the most significant of these changes is that Pfam is now primarily based on the UniProtKB reference proteomes, with the counts of matched sequences and species reported on the website restricted to this smaller set. Building families on reference proteomes sequences brings greater stability, which decreases the amount of manual curation required to maintain them. It also reduces the number of sequences displayed on the website, whilst still providing access to many important model organisms. Matches to the full UniProtKB database are, however, still available and Pfam annotations for individual UniProtKB sequences can still be retrieved. Some Pfam entries (1.6%) which have no matches to reference proteomes remain; we are working with UniProt to see if sequences from them can be incorporated into reference proteomes. Pfam-B, the automatically-generated supplement to Pfam, has been removed. The current release (Pfam 29.0) includes 16 295 entries and 559 clans. The facility to view the relationship between families within a clan has been improved by the introduction of a new tool.
Letunic & Bork (2018) 20 years of the SMART protein domain annotation resource. Nucleic Acids Res 46:D493-D496. (pmid: 29040681)
SMART (Simple Modular Architecture Research Tool) is a web resource (http://smart.embl.de) for the identification and annotation of protein domains and the analysis of protein domain architectures. SMART version 8 contains manually curated models for more than 1300 protein domains, with approximately 100 new models added since our last update article (1). The underlying protein databases were synchronized with UniProt (2), Ensembl (3) and STRING (4), doubling the total number of annotated domains and other protein features to more than 200 million. In its 20th year, the SMART analysis results pages have been streamlined again and its information sources have been updated. SMART's vector based display engine has been extended to all protein schematics in SMART and rewritten to use the latest web technologies. The internal full text search engine has been redesigned and updated, resulting in greatly increased search speed.
Geer et al. (2002) CDART: protein homology by domain architecture. Genome Res 12:1619-23. (pmid: 12368255)
The Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the NCBI Entrez Protein Database based on domain architecture, defined as the sequential order of conserved domains in proteins. The algorithm finds protein similarities across significant evolutionary distances using sensitive protein domain profiles rather than by direct sequence similarity. Proteins similar to a query protein are grouped and scored by architecture. Relying on domain profiles allows CDART to be fast, and, because it relies on annotated functional domains, informative. Domain profiles are derived from several collections of domain definitions that include functional annotation. Searches can be further refined by taxonomy and by selecting domains of interest. CDART is available at http://www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi.
Notes
- ↑ "Disorder" comes from IUPred predictions, "low complexity regions" are predicted by SEG, and "coiled coils" are predicted according to a scale developed by Rob Russell and Rune Linding; their coils server appears defunct but a similar algorithm (due to A. Lupas) is hosted at SIB.
Self-evaluation
If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2017-08-05
Modified:
- 2017-11-13
Version:
- 1.0
Version history:
- 1.0 Live version
- 0.1 First stub, import of 2016 material
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.