Domain Annotation

Contents
InterPro
SMART domain annotation
- SMART search
- Visual comparison of domain annotations in R
CDART
Further Reading
Questions, comments
References

Expected Preparations:

	[RPR] Scripting_data_downloads		[FND] Homology
	The units listed above are part of this course and contain important preparatory material.

Keywords: Domain discovery by multiple sequence alignment; HMMER algorithm; Domain databases: InterPro, SMART, CDART; Annotation of sequences

Objectives:

This unit will …

… introduce the concept of domains in proteins and discuss the use of domains in sequence analysis;
… demonstrate key databases and services that are available for domain analysis on the Web;
… go through an exercise in domain annotation using R.

Outcomes:

After working through this unit you …

… are familiar with key databases and services for domain annotation;
… can store domain annotations as features in your protein database;
… are able to use R’s plot function for the production of generic data driven graphics.

Deliverables:

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.

Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.

Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Evaluation:

NA: This unit is not evaluated for course marks.

This unit introduces the observation that evolution composes higher-order functions from domains that are folding units, functional units, and units of inheritance. It then covers some of the databases and services that support discovery and analysis of domains, and guides you through an exercise in domain annotation.

Task…

Read the introductory notes on how domain annotations support the annotation of gene function (PDF).

InterPro

The InterPro protein families and domain database is a large, curated collection of domain definitions and domain annotations hosted by the EBI. Here, we use a resource that was inherited from Pfam, a database that pioneered the discovery of protein domains from multiple sequence alignments and their representation as Hidden Markov Models (a technique that can assign a probability that a given sequence is part of the family that the model was trained on).
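To make the idea of "a probability that a sequence is part of the family" a little more concrete, here is a minimal sketch in R. It is deliberately simplified: a real profile HMM (as used by HMMER) also models insertions and deletions with transition probabilities, whereas this toy model has only match positions. The four-letter alphabet, the emission probabilities, and the function name logOddsScore() are all invented for illustration.

```r
# Toy position-specific model: three alignment columns, four-letter alphabet.
# All probabilities are invented for illustration.
alphabet <- c("A", "C", "D", "E")
emit <- matrix(c(0.70, 0.10, 0.10, 0.10,    # column 1 prefers A
                 0.10, 0.10, 0.70, 0.10,    # column 2 prefers D
                 0.10, 0.10, 0.10, 0.70),   # column 3 prefers E
               nrow = 3, byrow = TRUE,
               dimnames = list(NULL, alphabet))

# Null model: every residue is equally likely.
background <- c(A = 0.25, C = 0.25, D = 0.25, E = 0.25)

# Log-odds score (in bits) of sequence s under the model vs. the background.
# (hypothetical helper, not part of any package)
logOddsScore <- function(s, emit, background) {
  aa <- strsplit(s, "")[[1]]
  stopifnot(length(aa) == nrow(emit), all(aa %in% colnames(emit)))
  # pick emit[i, aa[i]] for every position i
  pModel <- emit[cbind(seq_along(aa), match(aa, colnames(emit)))]
  pNull  <- background[aa]
  sum(log2(pModel / pNull))
}

scoreFamily <- logOddsScore("ADE", emit, background)  # matches the model
scoreOther  <- logOddsScore("CCC", emit, background)  # does not match
```

A positive score means the sequence is more probable under the family model than under the background; a negative score means the opposite. Domain databases report such scores together with E-values that account for the size of the database searched.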

Task…

Retrieve the MYSPE UniProt ID from your Journal, or by issuing the following R commands:

pName <- sprintf("MBP1_%s", biCode(MYSPE))
sel <- which(myDB$protein$name == pName)
myDB$protein$UniProtID[sel]

Navigate to the InterPro database. We will retrieve domain annotations for your Mbp1 orthologue. You can search the database by FASTA sequence, ID, or keyword.
Open the Search by text tab.
Enter the UniProt ID into the search field and click Search.
Study the annotations. Export the annotations as a JSON file, download the file into your .data/ folder, and name it Mbp1MYSPE_Pfam.JSON.
Are all expected domains present? (APSES or KilA-N domain, disordered segments, Ankyrin domains, coiled coil, … others ?)
“disorder”, “low complexity”, and “coiled coil” annotations are not based on alignments, but on sequence analysis algorithms. Visit Finn et al. (2014) and read how these regions are defined¹.
Study the (well curated!) family annotations for:
- the KilA-N domain and the domains linked from that page;
- the Ankyrin repeat

SMART domain annotation

The SMART database at the EMBL in Heidelberg integrates a number of feature detection tools, including Pfam / InterPro domain annotation and its own HMM-based SMART domain database. You can search by sequence or by accession number, and retrieve domain annotations and more.

SMART search

Task…

Access the SMART database at http://smart.embl-heidelberg.de/
Click the link to access SMART in the normal mode.
Paste the MYSPE Mbp1 UniProtID into the Sequence ID or ACC field.
Check all the boxes for:
- outlier homologues (also including homologues in the PDB structure database)
- Pfam domains (domains defined by sequence similarity in the Pfam database)
- signal peptides (using Gunnar von Heijne’s SignalP 4.0 server at the Technical University in Lyngby, Denmark)
- internal repeats (using the programs ariadne and prospero at the Wellcome Trust Centre for Human Genetics at Oxford University, England)
Click on Sequence SMART to run the search and annotation. (In case you get an error like: “Sorry, your entry seems to have no SMART domain …”, try again with the actual sequence instead of the accession number.)
Study the family annotations for:
- the KilA-N domain
- the Ankyrin repeat domain(s)
Note that neither the domain definitions on the sequence nor the descriptions are identical to the InterPro annotations.
Note down the following information so you can enter the annotation in the protein database for MYSPE:
- From the section on “Confidently predicted domains …”
  - The start and end coordinates of the KilA-N domain (…according to SMART, not Pfam, in case the two differ).
  - All start and end coordinates of low complexity segments
  - All start and end coordinates of ANK (Ankyrin) domains
  - Start and end coordinates of coiled coil domain(s). I expect only one.
  - Start and end coordinates of AT hook domain(s). I expect some, but not all, Mbp1 orthologues to have one.
- From the section on “Features NOT shown …”
  - All start and end coordinates of low complexity segments for which the Reason is “overlap”.
  - Any start and end coordinates of overlapping coiled coil segments.
  - I expect all other annotations - besides the overlapping KilA-N domain defined by Pfam - to arise from the succession of ankyrin domains that the proteins have, both Pfam_ANK.. domains, as well as internal repeats, OR to be excluded because they did not exceed the significance threshold. However, if there are other features I have not mentioned here, please let me know.
- From the section on “Outlier homologues …”
  - Start and end coordinates of a PDB:1SW6|B annotation (if you have one): this is a region of sequence similarity to a protein for which the 3D structural coordinates are known.
  - Of course there should also be annotations to the structure of 1BM8 / 1MB1 and/or 1L3G, all of which are structures of the Mbp1 APSES domain that we have already annotated as an “APSES fold” feature previously. And there will be BLAST annotations to Ankyrin domains. We will not annotate these separately either.
Follow the links to the database entries for the information so you know what these domains and features are.

Next we’ll enter the features into our database, so we can compare them with the annotations that I have prepared from SMART annotations of Mbp1 orthologues from the ten reference fungi.
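Before entering anything, it can help to collect the coordinates you noted down in one small table and check them for obvious slips. The following is only a sketch: the coordinates are invented, the object name myFeatures is hypothetical, and this plain data frame is not the schema of the course database (the course script shows how features are actually stored in myDB).

```r
# Hypothetical coordinates, for illustration only. Replace them with the
# values you noted down from the SMART results for your MBP1_MYSPE.
myFeatures <- data.frame(
  feature = c("KilA-N", "low complexity", "ANK", "ANK", "coiled coil"),
  start   = c( 20, 230, 370, 500, 630),
  end     = c(105, 260, 400, 530, 690),
  stringsAsFactors = FALSE
)

# Sanity checks: coordinates are 1-based and every end is at or after its start.
stopifnot(all(myFeatures$start >= 1),
          all(myFeatures$end >= myFeatures$start))

# Feature length in residues (both endpoints are included).
myFeatures$length <- myFeatures$end - myFeatures$start + 1

# Sort by position along the sequence for easier reading.
myFeatures <- myFeatures[order(myFeatures$start), ]
myFeatures
```

Checking that start is not larger than end catches the most common transcription error when copying coordinates by hand.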

Visual comparison of domain annotations in R

The versatile plotting functions of R allow us to compare domain annotations. The distribution of segments that are annotated as “low-complexity”, presumably disordered, is particularly interesting: these are functional features that are often not alignable, since there is no selective pressure on sequence similarity. They may have arisen from convergent evolution, or they may have diverged while maintaining average composition rather than specific sequence. Sequence alignment is, after all, based on amino acid pair scores that have been optimized to detect amino acids that behave similarly in the same context of folded proteins.
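What "low complexity" means can be illustrated with a few lines of R. The sketch below computes the Shannon entropy of the amino acid composition in a sliding window; windows with few distinct residues have low entropy. This is a simplified stand-in, not the SEG algorithm itself, and the example sequence, the window width, the threshold, and the function name windowEntropy() are all assumptions made for illustration.

```r
# Shannon entropy (bits) of residue composition in each window of width w.
# (hypothetical helper; a simplification, not the SEG algorithm)
windowEntropy <- function(s, w = 12) {
  aa <- strsplit(s, "")[[1]]
  n <- length(aa)
  stopifnot(n >= w)
  H <- numeric(n - w + 1)
  for (i in seq_along(H)) {
    p <- table(aa[i:(i + w - 1)]) / w   # residue frequencies in this window
    H[i] <- -sum(p * log2(p))           # only observed residues contribute
  }
  H
}

# Invented example: a mixed N-terminal part followed by a serine-rich stretch.
s <- paste0("MKTAYIAKQRQISFVKSHFSRQ", strrep("S", 14), "NNNNQQQQ")
H <- windowEntropy(s, w = 12)

# Window start positions whose entropy falls below an arbitrary threshold.
lowStart <- which(H < 1.5)
```

The entropy is 0 for a window that contains a single residue type and reaches log2(12), about 3.58 bits, for a window of twelve different residues. Note that two windows with the same composition but different residue order get the same value: the measure reflects composition, not specific sequence.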

In the following code tutorial, we create a plot similar to the CDD and SMART displays. It is based on the SMART domain annotations of the reference species in our protein database.

Task…

Open RStudio and load the ABC-units R project. If you have loaded it before, choose File ▹ Recent projects ▹ ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
Choose Tools ▹ Version Control ▹ Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included. This ensures that your data and code remain up to date when we update, or fix bugs.
Type init() if requested.
Open the file BIN-FUNC-Domain_annotation.R and follow the instructions.

Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.

After you have worked through this code, your plot should look similar to this one:

SMART domain annotations for Mbp1 proteins for the ten reference fungi. Plot produced by code discussed in BIN-FUNC-Domain_annotation.R from annotation data stored in myDB.
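The principle behind such a plot is simple: set up an empty coordinate system, draw one line per protein, and draw one rectangle per annotated feature. Here is a minimal sketch of that idea using only base R graphics. It is not the code from the course script; the protein names, lengths, coordinates, colours, and the function name plotDomains() are invented for illustration.

```r
# Hypothetical annotations for two made-up proteins.
ann <- data.frame(
  protein = c("MBP1_AAAAA", "MBP1_AAAAA", "MBP1_AAAAA",
              "MBP1_BBBBB", "MBP1_BBBBB"),
  feature = c("KilA-N", "ANK", "ANK", "KilA-N", "ANK"),
  start   = c( 20, 370, 500,  15, 350),
  end     = c(105, 400, 530, 100, 380),
  stringsAsFactors = FALSE
)
protLen <- c(MBP1_AAAAA = 830, MBP1_BBBBB = 780)   # sequence lengths
featCol <- c("KilA-N" = "#DD5533", "ANK" = "#3377BB")

# Draw one backbone line per protein and one rectangle per feature.
plotDomains <- function(ann, protLen, featCol) {
  n <- length(protLen)
  op <- par(mar = c(4, 8, 1, 1))        # wide left margin for the labels
  on.exit(par(op))
  plot(0, 0, type = "n",
       xlim = c(0, max(protLen)), ylim = c(0.5, n + 0.5),
       xlab = "sequence position", ylab = "", yaxt = "n", bty = "n")
  axis(2, at = seq_len(n), labels = names(protLen), las = 1, tick = FALSE)
  segments(1, seq_len(n), protLen, seq_len(n), lwd = 2, col = "grey50")
  y <- match(ann$protein, names(protLen))   # row of each feature
  rect(ann$start, y - 0.2, ann$end, y + 0.2,
       col = featCol[ann$feature], border = NA)
  legend("topright", legend = names(featCol), fill = featCol, bty = "n")
  invisible(y)
}

plotDomains(ann, protLen, featCol)
```

Because rect() and segments() are vectorized, no loop over features is needed: the data frame columns are passed directly as coordinates.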

CDART

The CDART database (Conserved Domain Architecture Retrieval Tool) finds proteins that have a domain architecture similar to that of a query. This has the potential to find homologous and functionally related proteins that are far too dissimilar to be detected with sequence similarity searches.
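The notion of "similar domain architecture" can be sketched in a few lines of R. Here an architecture is represented as the ordered vector of domain names along a protein, and a protein counts as a hit if it contains the query's domains in the same order. This is only an illustration of the concept: CDART itself works from conserved domain profile hits and scores architectures, which this sketch does not attempt. The protein names, architectures, and the helper functions collapse() and hasArchitecture() are invented.

```r
# Hypothetical architectures: ordered domain names per protein.
arch <- list(
  protA = c("KilA-N", "ANK", "ANK"),
  protB = c("KilA-N", "ANK", "ANK", "Atrophin"),
  protC = c("ANK", "ANK", "SH3"),
  protD = c("ANK", "KilA-N")
)

# Collapse runs of the same domain: KilA-N ANK ANK becomes KilA-N ANK.
collapse <- function(d) {
  d[c(TRUE, d[-1] != d[-length(d)])]
}

# TRUE if the domains in q occur in d in the same order (gaps allowed).
hasArchitecture <- function(q, d) {
  i <- 1
  for (x in d) {
    if (i <= length(q) && x == q[i]) i <- i + 1
  }
  i > length(q)
}

query <- collapse(arch$protA)
isHit <- vapply(arch,
                function(d) hasArchitecture(query, collapse(d)),
                logical(1))
hits <- names(arch)[isHit]
hits   # "protA" "protB"
```

protD contains both domains but in the opposite order, so it is not reported. This is the sense in which an architecture is more than a list of domains.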

Task…

Retrieve the MYSPE Refseq ID from your Journal or by issuing the following R commands:

pName <- sprintf("MBP1_%s", biCode(MYSPE))
sel <- which(myDB$protein$name == pName)
myDB$protein$RefSeqID[sel]

Navigate to CDART.
Paste your Mbp1 protein ID and click Submit.
Note that the first page of the very long results list (more than 2,000 pages) shows proteins that contain both KilA-N and Ankyrin domains. However, a few other domains are found as well (Atrophin, GNVR, SMV_N), and this raises the intriguing possibility that the MBP1_MYSPE protein might contain some or all of these as well, although the sequence similarity may be too low to detect this outright.
Study the domain annotations for
- the KilA-N domain
- the Ankyrin repeat domain(s)

Further Reading

The last few years have witnessed significant changes in Pfam (https://pfam.xfam.org). The number of families has grown substantially to a total of 17,929 in release 32.0. New additions have been coupled with efforts to improve existing families, including refinement of domain boundaries, their classification into Pfam clans, as well as their functional annotation. We recently began to collaborate with the RepeatsDB resource to improve the definition of tandem repeat families within Pfam. We carried out a significant comparison to the structural classification database, namely the Evolutionary Classification of Protein Domains (ECOD) that led to the creation of 825 new families based on their set of uncharacterized families (EUFs). Furthermore, we also connected Pfam entries to the Sequence Ontology (SO) through mapping of the Pfam type definitions to SO terms. Since Pfam has many community contributors, we recently enabled the linking between authorship of all Pfam entries with the corresponding authors’ ORCID identifiers. This effectively permits authors to claim credit for their Pfam curation and link them to their ORCID record.

Letunic, Ivica and Peer Bork. (2018). “20 years of the SMART protein domain annotation resource”. Nucleic Acids Research 46(D1):D493–D496.
[PMID: 29040681] [DOI: 10.1093/nar/gkx922]

Abstract …

SMART (Simple Modular Architecture Research Tool) is a web resource (http://smart.embl.de) for the identification and annotation of protein domains and the analysis of protein domain architectures. SMART version 8 contains manually curated models for more than 1300 protein domains, with approximately 100 new models added since our last update article (1). The underlying protein databases were synchronized with UniProt (2), Ensembl (3) and STRING (4), doubling the total number of annotated domains and other protein features to more than 200 million. In its 20th year, the SMART analysis results pages have been streamlined again and its information sources have been updated. SMART’s vector based display engine has been extended to all protein schematics in SMART and rewritten to use the latest web technologies. The internal full text search engine has been redesigned and updated, resulting in greatly increased search speed.

Geer, Lewis Y et al. (2002). “CDART: protein homology by domain architecture”. Genome Research 12(10):1619–23.
[PMID: 12368255] [DOI: 10.1101/gr.278202]

Abstract …

The Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the NCBI Entrez Protein Database based on domain architecture, defined as the sequential order of conserved domains in proteins. The algorithm finds protein similarities across significant evolutionary distances using sensitive protein domain profiles rather than by direct sequence similarity. Proteins similar to a query protein are grouped and scored by architecture. Relying on domain profiles allows CDART to be fast, and, because it relies on annotated functional domains, informative. Domain profiles are derived from several collections of domain definitions that include functional annotation. Searches can be further refined by taxonomy and by selecting domains of interest. CDART is available at http://www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi.

Yang, Mingzhang et al. (2020). “NCBI’s Conserved Domain Database and Tools for Protein Domain Analysis”. Current Protocols in Bioinformatics 69(1):e90.
[PMID: 31851420] [DOI: 10.1002/cpbi.90]

Abstract …

The Conserved Domain Database (CDD) is a freely available resource for the annotation of sequences with the locations of conserved protein domain footprints, as well as functional sites and motifs inferred from these footprints. It includes protein domain and protein family models curated in house by CDD staff, as well as imported from a variety of other sources. The latest CDD release (v3.17, April 2019) contains more than 57,000 domain models, of which almost 15,000 were curated by CDD staff. The CDD curation effort increases coverage and provides finer-grained classifications of common and widely distributed protein domain families, for which a wealth of functional and structural data have become available. The CDD maintains both live search capabilities and an archive of pre-computed domain annotations for a selected subset of sequences tracked by the NCBI’s Entrez protein database. These can be retrieved or computed for a single sequence using CD-Search or in bulk using Batch CD-Search, or computed via standalone RPS-BLAST plus the rpsbproc software package. The CDD can be accessed via https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. The three protocols listed here describe how to perform a CD-Search (Basic Protocol 1), a Batch CD-Search (Basic Protocol 2), and a Standalone RPS-BLAST and rpsbproc (Basic Protocol 3). © 2019 The Authors. Basic Protocol 1: CD-search Basic Protocol 2: Batch CD-search Basic Protocol 3: Standalone RPS-BLAST and rpsbproc.

Questions, comments

If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.

Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.

References

[END]

¹ “Disorder” comes from IUPred predictions, “low complexity regions” are predicted by SEG, and “coiled coils” are predicted according to a scale developed by Rob Russell and Rune Linding; their coils server appears defunct but a similar algorithm (due to A. Lupas) is hosted at SIB.

Domain Annotation

Boris Steipe
