Expected Preparations:
Keywords: Domain discovery by multiple sequence alignment; HMMER algorithm; Domain databases: InterPro, SMART, CDART; Annotation of sequences
Objectives:
This unit will …
Outcomes:
After working through this unit you …
Deliverables:
Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.
Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation: NA: This unit is not evaluated for course marks.
This unit introduces the observation that evolution composes higher-order functions from domains that are folding units, functional units, and units of inheritance. It then covers some of the databases and services that support discovery and analysis of domains, and guides you through an exercise in domain annotation.
Task…
The InterPro protein families and domain database is a large, curated collection of domain definitions and domain annotations hosted by the EBI. Here, we use a resource that was inherited from Pfam, a resource that pioneered the discovery of protein domains from multiple sequence alignments, and their representation as Hidden Markov Models (a technique that can define a probability that a given sequence is part of the family that the model was trained on).
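To make the idea of a probability of family membership more concrete, here is a minimal sketch of a position-specific profile in base R. It is not HMMER and not a full profile HMM: it has no insert or delete states, and it scores against a uniform background. The alignment, the function names `profileFromAln()` and `scoreSeq()`, and the pseudocount are all assumptions made for illustration only.

```r
# Toy alignment (invented for illustration), one string per aligned sequence
aln <- c("KRTV", "KRSV", "RRTV", "KKTI")
aa  <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

# Column-wise amino acid probabilities, with a pseudocount so that
# unobserved residues do not get probability zero
profileFromAln <- function(aln, alphabet, pseudo = 1) {
  m <- do.call(rbind, strsplit(aln, ""))
  p <- sapply(seq_len(ncol(m)), function(j) {
    counts <- table(factor(m[, j], levels = alphabet))
    (as.numeric(counts) + pseudo) / (nrow(m) + pseudo * length(alphabet))
  })
  rownames(p) <- alphabet
  p
}

# Log-odds score (in bits) of an ungapped sequence of matching length,
# relative to a uniform background distribution
scoreSeq <- function(s, prof) {
  x <- strsplit(s, "")[[1]]
  stopifnot(length(x) == ncol(prof))
  bg <- 1 / nrow(prof)
  sum(log2(prof[cbind(match(x, rownames(prof)), seq_along(x))] / bg))
}

prof <- profileFromAln(aln, aa)
scoreSeq("KRTV", prof)   # resembles the alignment: positive score
scoreSeq("WWWW", prof)   # unlike the alignment: negative score
```

A real profile HMM adds states for insertions and deletions and uses a realistic background model, but the principle is the same: a sequence that matches the column preferences of the training alignment scores higher than one that does not.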
Task…
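# Construct the name of the MYSPE Mbp1 protein from the species' bicode
pName <- sprintf("MBP1_%s", biCode(MYSPE))
# Find the row of this protein in the protein table of the database
sel <- which(myDB$protein$name == pName)
# Retrieve its UniProt ID
myDB$protein$UniProtID[sel]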
.data/ folder, and name it Mbp1MYSPE_Pfam.JSON.
The SMART database at the EMBL in Heidelberg integrates a number of feature detection tools, including Pfam / InterPro domain annotation and its own HMM-based SMART domain database. You can search by sequence or by accession number, and retrieve domain annotations and more.
Task…
Access the SMART database at http://smart.embl-heidelberg.de/
Click the link to access SMART in the normal mode.
Paste the MYSPE Mbp1 UniProtID into the Sequence ID or ACC field.
Check all the boxes for:
Click on Sequence SMART to run the search and annotation. (In case you get an error like: “Sorry, your entry seems to have no SMART domain …”, try again with the actual sequence instead of the accession number.)
Study the family annotations for:
Note that neither the domain definitions on the sequence nor the descriptions are identical to the InterPro annotations.
Note down the following information so you can enter the annotation in the protein database for MYSPE:
Follow the links to the database entries for the information so you know what these domains and features are.
Next we’ll enter the features into our database, so we can compare them with the annotations that I have prepared from SMART annotations of Mbp1 orthologues from the ten reference fungi.
The versatile plotting functions of R allow us to compare domain annotations. The distribution of segments that are annotated as “low-complexity”, presumably disordered, is particularly interesting: these are functional features that are often not alignable, since there is no selective pressure on sequence similarity. They may have arisen from convergent evolution, or they may have diverged while maintaining average composition rather than specific sequence. Sequence alignment is, after all, based on amino acid pair scores that have been optimized to detect amino acids that behave similarly in the same context of folded proteins.
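As a rough illustration of what “low complexity” means in terms of composition, the following sketch computes the Shannon entropy of the amino acid composition in sliding windows. This is not the SEG algorithm that SMART uses; the sequence, the window width, the cutoff, and the function names are assumptions chosen only to show the principle.

```r
# Shannon entropy (bits) of the residue composition of one window
windowEntropy <- function(x) {
  p <- table(x) / length(x)
  -sum(p * log2(p))
}

# Entropy of every window of width w along a sequence string
slidingEntropy <- function(s, w = 12) {
  x <- strsplit(s, "")[[1]]
  stopifnot(length(x) >= w)
  starts <- seq_len(length(x) - w + 1)
  sapply(starts, function(i) windowEntropy(x[i:(i + w - 1)]))
}

# Invented example: a mixed-composition segment followed by a Q/N-rich run
s <- "MKTAYIAKQRDEGHLWQQQQQNNQQQQNQQQQ"
H <- slidingEntropy(s, w = 12)

# Start positions of windows that fall below an arbitrary entropy cutoff
lowCx <- which(H < 1.5)
```

Windows in the Q/N-rich run have low entropy because only two residue types occur in them, regardless of their exact order. That is the sense in which such segments can keep their average composition while their specific sequence diverges.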
In the following code tutorial, we create a plot similar to the CDD and SMART displays. It is based on the SMART domain annotations of the reference species in our protein database.
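Before working through the course script, it may help to see the basic idea in isolation. The sketch below draws each protein as a horizontal line and each annotated feature as a coloured rectangle, using base R graphics only. The table `ann`, the lengths in `len`, the protein names, and the function `plotDomains()` are hypothetical and are not the structure of `myDB` or the code in the course script.

```r
# Hypothetical annotation table (invented coordinates, not real SMART output)
ann <- data.frame(
  protein = c("MBP1_AAAAA", "MBP1_AAAAA", "MBP1_BBBBB"),
  feature = c("KilA-N", "ANK", "KilA-N"),
  start   = c(20, 390, 25),
  end     = c(105, 420, 110),
  stringsAsFactors = FALSE
)
len <- c(MBP1_AAAAA = 830, MBP1_BBBBB = 790)   # invented sequence lengths

plotDomains <- function(ann, len) {
  # one colour per feature type
  cols <- setNames(hcl.colors(length(unique(ann$feature)), "Set 2"),
                   unique(ann$feature))
  # empty plot frame: x is sequence position, y is one row per protein
  plot(NULL, xlim = c(0, max(len)), ylim = c(0.5, length(len) + 0.5),
       xlab = "sequence position", ylab = "", yaxt = "n")
  axis(2, at = seq_along(len), labels = names(len), las = 1, cex.axis = 0.6)
  # backbone line for each protein
  segments(1, seq_along(len), len, seq_along(len))
  # one rectangle per annotated feature, on the row of its protein
  y <- match(ann$protein, names(len))
  rect(ann$start, y - 0.2, ann$end, y + 0.2, col = cols[ann$feature])
  legend("topright", legend = names(cols), fill = cols, cex = 0.7)
  invisible(y)
}

plotDomains(ann, len)
```

Drawing all proteins on a common position axis makes it easy to see whether features of the same type fall at comparable positions in different sequences.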
Task…
Open the ABC-units R project. If you have loaded it before, choose File ▹ Recent projects ▹ ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
Type init() if requested.
Open the file BIN-FUNC-Domain_annotation.R and follow the instructions.
Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.
After you have worked through this code, your plot should look similar to this one:
SMART domain annotations for Mbp1 proteins for the ten reference fungi. Plot produced by code discussed in BIN-FUNC-Domain_annotation.R from annotation data stored in myDB.
The CDART database (Conserved Domain Architecture Retrieval Tool) finds proteins that have a domain architecture similar to that of a query. This has the potential to find homologous and functionally related proteins that are far too dissimilar to be detected with sequence similarity searches.
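The notion of comparing proteins by domain architecture rather than by sequence can be sketched in a few lines of R. This is not CDART's algorithm: it simply represents each architecture as an ordered vector of domain names and compares the sets of domain types with a Jaccard index. The protein names, the architectures, and the function `archJaccard()` are assumptions made for illustration.

```r
# Hypothetical architectures: domain names in N- to C-terminal order
arch <- list(
  query = c("KilA-N", "ANK", "ANK"),
  protA = c("KilA-N", "ANK", "ANK"),
  protB = c("KilA-N", "ANK"),
  protC = c("Pkinase", "SH2")
)

# Jaccard similarity of the sets of domain types in two architectures
archJaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Compare every other protein to the query and rank by similarity
sim <- sapply(arch[-1], archJaccard, b = arch$query)
sort(sim, decreasing = TRUE)
```

Because only domain content is compared, two proteins can score as similar even if their sequences have diverged beyond what an alignment-based search would detect. A set-based measure like this one ignores domain order and copy number, which a more careful comparison would take into account.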
Task…
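# Construct the name of the MYSPE Mbp1 protein from the species' bicode
pName <- sprintf("MBP1_%s", biCode(MYSPE))
# Find the row of this protein in the protein table of the database
sel <- which(myDB$protein$name == pName)
# Retrieve its RefSeq ID
myDB$protein$RefSeqID[sel]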
El-Gebali, Sara et al. (2019). “The Pfam protein families database in 2019”. Nucleic Acids Research 47(D1):D427–D432. [PMID: 30357350] [DOI: 10.1093/nar/gky995]
Letunic, Ivica and Peer Bork. (2018). “20 years of the SMART protein domain annotation resource”. Nucleic Acids Research 46(D1):D493–D496. [PMID: 29040681] [DOI: 10.1093/nar/gkx922]
Geer, Lewis Y et al. (2002). “CDART: protein homology by domain architecture”. Genome Research 12(10):1619–23. [PMID: 12368255] [DOI: 10.1101/gr.278202]
Yang, Mingzhang et al. (2020). “NCBI’s Conserved Domain Database and Tools for Protein Domain Analysis”. Current Protocols in Bioinformatics 69(1):e90. [PMID: 31851420] [DOI: 10.1002/cpbi.90]
If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.
Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.
[END]
“Disorder” comes from IUPred predictions, “low complexity regions” are predicted by SEG, and “coiled coils” are predicted according to a scale developed by Rob Russell and Rune Linding; their coils server appears defunct but a similar algorithm (due to A. Lupas) is hosted at SIB.