The units listed above are part of this
course and contain important preparatory material.
Keywords: The EBI databases and services; UniProt
Objectives:
This unit will …
… introduce EBI Search and the UniProt Knowledgebase as a linking
hub to EBI databases and services;
… demonstrate how to navigate from a generic search to a specific
record in UniProt and what information is linked from there;
… explore the contents of some associated databases.
Outcomes:
After working through this unit you …
… can find the UniProt ID record for the Mbp1 homologue you found
in MYSPE;
… are familiar with the EBI databases and other information items
that are linked from the UniProt page;
… can confidently retrieve key information about a
protein.
Deliverables:
Time management: Before you begin,
estimate how long it will take you to complete this unit. Then, record
in your course journal: the number of hours you estimated, the number of
hours you worked on the unit, and the amount of time that passed between
start and completion of this unit.
Journal: Document your progress in
your Course
Journal. Some tasks may ask you to include specific items in your
journal. Don’t overlook these.
Insights: If you find something
particularly noteworthy about this unit, make a note in your insights!
page.
Evaluation:
Material based on this learning unit can be submitted for
formative feedback. To submit:
Create a new document in your shared Google drive folder.
Include a (CC) license at the end of your document, as instructed at
the beginning of the course.
When you are done with everything, go to the
Assignments page on Quercus and open the first
Feedback Unit that you have not submitted yet. Paste the URL of
your report document into the form, and click on Submit
Assignment. Your link can be submitted only once and cannot be
edited. Also: do not edit your document after it has been
submitted.
Contents
The EBI hosts some of the world’s most
important bioinformatics databases and services. This learning unit
explores them in the context of our search for information on yeast Mbp1
and its homologue in MYSPE.
The EBI (European
Bioinformatics Institute) is one of the two largest, international
providers of data for genomics and molecular biology (the NCBI is the
other). It organizes a cutting-edge program of data management at the
largest scale, with a special focus on data integration and services; it
makes data, services, and educational resources freely and openly
available over the Internet; and it runs significant in-house research
projects.
In this unit we explore some of the offerings of the EBI that can
contribute to our objective of studying a particular gene in an organism
of interest.
Task…
Read the introductory article on EBI databases and services:
Cook, Charles E et al. (2020). “The European Bioinformatics Institute
in 2020: building a global infrastructure of interconnected data
resources for the life sciences”. Nucleic Acids Research 48(D1):D17–D23.
[PMID: 31701143] [DOI: 10.1093/nar/gkz1033]
Data resources at the European Bioinformatics Institute (EMBL-EBI, https://www.ebi.ac.uk/)
archive, organize and provide added-value analysis of research data
produced around the world. This year’s update for EMBL-EBI focuses on
data exchanges among resources, both within the institute and with a
wider global infrastructure. Within EMBL-EBI, data resources exchange
data through a rich network of data flows mediated by automated systems.
This network ensures that users are served with as much information as
possible from any search and any starting point within EMBL-EBI’s
websites. EMBL-EBI data resources also exchange data with hundreds of
other data resources worldwide and collectively are a key component of a
global infrastructure of interconnected life sciences data resources. We
also describe the BioImage Archive, a deposition database for raw images
derived from primary research that will supply data for future
knowledgebases that will add value through curation of primary image
data. We also report a new release of the PRIDE database with an
improved technical infrastructure, a new API, a new webpage, and
improved data exchange with UniProt and Expression Atlas. Training is a
core mission of EMBL-EBI and in 2018 our training team served more
users, both in-person and through web-based programmes, than ever
before.
One of the most important EBI databases is UniProt, a curated, highly
cross-referenced database of protein knowledge.
Task…
Read the overview article on UniProt updates:
UniProt Consortium. (2019). “UniProt: a worldwide hub of protein
knowledge”. Nucleic Acids Research 47(D1):D506–D515.
[PMID: 30395287] [DOI: 10.1093/nar/gky1049]
The UniProt Knowledgebase is a collection of sequences and annotations
for over 120 million proteins across all branches of life. Detailed
annotations extracted from the literature by expert curators have been
collected for over half a million of these proteins. These annotations
are supplemented by annotations provided by rule based automated
systems, and those imported from other resources. In this article we
describe significant updates that we have made over the last 2 years to
the resource. We have greatly expanded the number of Reference Proteomes
that we provide and in particular we have focussed on improving the
number of viral Reference Proteomes. The UniProt website has been
augmented with new data visualizations for the subcellular localization
of proteins as well as their structure and interactions. UniProt
resources are available under a CC-BY (4.0) license via the web at https://www.uniprot.org/.
EBI Search
Task…
Remember to document your activities as
lab-notes on your Wiki.
The resulting page has entries for a wide variety of information
items, with just enough information for each to allow you to evaluate
whether it is relevant to your interests.
Access the help page for
EBI search from the link at the top of the page,
and familiarize yourself with its capabilities and what query types are
available.
Back on the search-results page, look for the P39678
(MBP1_YEAST) entry and click on the link. This is the
UniProtKB entry for the yeast Mbp1 protein - the main
annotation hub that links to the EBI’s other data resources.
Explore the page and review the following information items (record
your explorations in your journal):
Status: Click on “Reviewed” and note that this
protein is a Swiss-Prot entry. Swiss-Prot is a hand-curated database of
proteins, annotated at the highest level. An alternative value could be
“TrEMBL” (Translated EMBL Database), which contains automatic translations
from genome sequencing projects or other incidental submissions. Then follow the
link from “Experimental evidence”. Note what evidence types for the
existence of a protein are available and what they mean.
Function: follow the i information
link from the word “Function” and review the types of information that
are available in this section. Follow the link to the complete
GO annotation and look at the evidence codes for these terms. GO has
its own learning unit, but do note the meaning of the terms you see
here, and the proportion of term annotations that are contributed from
experimental evidence (IDA, IPI, IMP), those that are contributed from
computational analysis (IBA, IEA), and the relative proportion of manual
versus automatic annotation. This will give you a somewhat more
fine-grained idea of how such annotations are generated in the first place
and perhaps how reliable they are.
Names & Taxonomy: follow the i
information link from the section name and review the types of
information that are available in this section. Note that the taxonomic
identifier and lineage are available directly from this page. You will
also find a link to SGD. Not all organisms have community-curated
databases available, but if such a model organism database exists, it is
often the best available resource to support research on the organism’s
genes. Examples include SGD
(yeast), FlyBase (Drosophila),
WormBase (Caenorhabditis), Fugu, MGI (mouse), TAIR (Arabidopsis) etc. Note that
humans are not considered “model organisms”.
Subcellular location: follow the i
information link from the section name and review the types of
information that are available in this section.
PTM / Processing (Post Translational Modification):
follow the i information link from the section name and
review the types of information that are available in this section.
Follow the link into iPTMnet,
note the detail of annotation that is available, and especially, note
that there are links to the literature(!) for every single annotation.
Carefully making the evidence chain for each annotation explicit is the
hallmark of a high-quality database!
Interaction: follow the i
information link from the section name and review the types of
information that are available in this section. Interaction databases
have their own learning unit, but note that the number of recorded
interactions in BioGRID and IntAct is very different. Follow the link to
IntAct,
click the check-box for the first dozen or so genes, and click on the
mRNA Expression data among the list of “Actions for
selection”. What does the result page display? Is this a Cargo Cult
page? Why or why not? Discuss on the board.
Structure: follow the i
information link from the section name and review the types of
information that are available in this section. Note the detailed
annotation of secondary structure. Also note that there are three
structures available, and key information such as resolution and
coverage is listed in the table, so you can immediately decide which of
the structures is most useful.
Family & Domains: follow the i
information link from the section name and review the types of
information that are available in this section. In particular, note that
there is one APSES domain annotated (this is a domain that is a bit
larger than the KilA-N domain that is annotated in other domain
databases for the Mbp1 protein), and there are two Ankyrin domains.
Follow the link to view the protein in
InterPro, the EBI’s protein family database and look at the
annotations. One would expect that the domain structure is conserved
between orthologues, such as MBP1_YEAST and
MBP1_MYSPE.
Sequence: follow the i information
link from the section name and review the types of information that are
available in this section. Note that cross-references into RefSeq are
supplied for both the protein (NP_010227) and the mRNA
(NM_001180115).
Similar proteins: follow the i
information link from the section name and review the types of
information that are available in this section. UniProt invests
significant resources into defining clusters of similar sequences and
these are very useful tools to find variants and homologues. Follow the
link to the 90%
identity cluster (click on the ClusterID), and check the
boxes for all full-length sequences. Then click on the align option. On
the resulting alignment page, check the “similarity” box to highlight
sequence variants. All of these are yeast strains, but the sequence is
not identical for some of them.
Follow the i information links for the remaining
sections and review the types of information that are
available.
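The comparison of evidence codes suggested in the Function item can also be scripted once you have copied the codes from the GO annotation page. The following is a minimal Python sketch, not part of the course materials: the function names and the example list of codes are made up for illustration, and the grouping simply follows the codes named above (IDA, IPI, IMP as experimental; IBA, IEA as computational). Treating IEA as the only automatically assigned code among these is an assumption that you should check against the GO documentation.

```python
from collections import Counter

# Grouping taken from the codes named in the Function item.
EXPERIMENTAL = {"IDA", "IPI", "IMP"}
COMPUTATIONAL = {"IBA", "IEA"}
# Assumption: of the codes above, only IEA is assigned without curator review.
AUTOMATIC = {"IEA"}


def classify(code):
    """Assign a GO evidence code to a coarse source category."""
    if code in EXPERIMENTAL:
        return "experimental"
    if code in COMPUTATIONAL:
        return "computational"
    return "other"


def proportions(labels):
    """Return the fraction of items carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def evidence_summary(codes):
    """Summarize a list of evidence codes by source and by curation mode."""
    by_source = proportions(classify(c) for c in codes)
    by_mode = proportions(
        "automatic" if c in AUTOMATIC else "manual" for c in codes
    )
    return by_source, by_mode


# Hypothetical input: replace with the codes you read off the annotation page.
example_codes = ["IDA", "IMP", "IEA", "IEA", "IBA", "IPI", "IEA", "TAS"]
by_source, by_mode = evidence_summary(example_codes)
for label, fraction in sorted({**by_source, **by_mode}.items()):
    print(f"{label}: {fraction:.1%}")
```

Codes that are in neither group are reported as "other" rather than being dropped, so that the fractions always refer to the complete list.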
Assume we want to verify whether the N-terminal start codon is
correctly annotated. Where is the link to the DNA sequence in its
genomic context? What is the location of the annotated transcript? What
is the sequence of the 30 upstream nucleotides? How can you download it
as a FASTA formatted text file? Can’t find it? Ask on the mailing
list!
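Once you have the genomic sequence and the coordinates of the annotated transcript, extracting the upstream nucleotides and writing them as FASTA is a matter of string slicing. Here is a minimal Python sketch with hypothetical function names and a toy sequence (not real yeast data). It assumes 1-based, inclusive coordinates that are given on the forward strand; verify this convention for whichever database you download from.

```python
def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def upstream(chromosome, start, end, strand, n=30):
    """Return up to n nucleotides 5' of a feature, read in the feature's direction.

    start and end are assumed to be 1-based, inclusive, forward-strand
    coordinates with start <= end; strand is "+" or "-".
    """
    if strand == "+":
        first = max(0, start - 1 - n)
        return chromosome[first:start - 1]
    # On the reverse strand, "upstream" lies to the right of the feature.
    return reverse_complement(chromosome[end:end + n])


def to_fasta(header, sequence, width=60):
    """Format a sequence as FASTA text with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Toy data only: a 12-nucleotide "chromosome" with a feature at positions 7..9.
toy = "AAACCCGGGTTT"
print(to_fasta("toy_upstream_plus", upstream(toy, 7, 9, "+", n=3)), end="")
print(to_fasta("toy_upstream_minus", upstream(toy, 7, 9, "-", n=3)), end="")
```

Near the ends of a chromosome fewer than n nucleotides may be available; the sketch then returns the shorter sequence instead of raising an error.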
As you see, this is a very well engineered page that makes a rich set
of annotations available. It links to EBI internal databases wherever
possible, but it generously provides links to many other databases as
well. In general, the provided links are realized through URLs that make
it easy to script access to their targets. This is in contrast to the
NCBI, whose cross-references are less open.
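To illustrate what scripted access through URLs can look like, here is a minimal Python sketch. It assumes that UniProt serves entries at URLs of the form https://rest.uniprot.org/uniprotkb/ACCESSION.FORMAT; that pattern is not taken from this unit, so check the current UniProt help pages before relying on it. The function names are made up for illustration, and only fetch_fasta() needs network access.

```python
from urllib.request import urlopen

# Assumed base URL of the UniProt REST service (verify before use).
UNIPROT_REST = "https://rest.uniprot.org/uniprotkb"


def uniprot_url(accession, fmt="fasta"):
    """Build the URL of a UniProtKB record in the requested format."""
    return f"{UNIPROT_REST}/{accession}.{fmt}"


def parse_fasta(text):
    """Split single-record FASTA text into (header, sequence)."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not FASTA-formatted text")
    return lines[0][1:], "".join(line.strip() for line in lines[1:])


def fetch_fasta(accession):
    """Download one record and parse it (requires network access)."""
    with urlopen(uniprot_url(accession), timeout=30) as response:
        return parse_fasta(response.read().decode("utf-8"))


# P39678 is the accession of MBP1_YEAST used in this unit.
print(uniprot_url("P39678"))
```

Keeping URL construction separate from the download means that the URL can be inspected, or pasted into a browser, before any request is sent.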
Original Information and Annotation Transfer
Task…
In the BIN-Storing_data
unit you found the protein in MYSPE that is most similar to yeast
Mbp1, and you recorded its RefSeq ID. What is its UniProt
ID? What information is available via UniProt?
Access the UniProt ID
mapping service to retrieve the UniProt ID for the protein. Paste
the RefSeq ID and choose RefSeq Protein as the
From: option and UniProtKB as the
To: option.
If the mapping works, the UniProt ID will be in the
Entry: column of the table that is returned.
Record the ID, and click on it to navigate to the UniProt entry
page.
Sometimes the mapping does not work and does not return a result.
Most likely, UniProt contains the sequence, but for some reason, the
mapping service does not know. If this happens, you can work around the
problem as follows.
(Yes, UniProt runs its own BLAST version, and that searches UniProt
databases, not GenBank.)
Paste the sequence into the search form and run BLAST.
… if the sequence is in UniProt, you will get the top hit with 100%
sequence identity. If you still can’t find a UniProt ID for your
sequence, contact me.
Navigate to the UniProt page for MBP1_MYSPE. Explore the links that
go out from the page and contrast this with what you have found for
MBP1_YEAST. Assess which resources are independently useful, and which
resources merely recapitulate information that relates to yeast Mbp1,
the protein that you originally searched with. The goal is to develop a
sense for where a page like this one collects original information, and
where it merely acts as a record of annotation transfer.
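The ID mapping step of this task can in principle be scripted as well. The following Python sketch only prepares the request and prints it; nothing is sent unless you call submit(). The endpoint URL, the form field names, and the database names RefSeq_Protein and UniProtKB are assumptions about the UniProt REST service, not something stated in this unit, so verify them against the UniProt help pages. The service may also expect a RefSeq ID with its version suffix.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint of the UniProt ID mapping service (verify before use).
IDMAPPING_RUN = "https://rest.uniprot.org/idmapping/run"


def build_mapping_request(refseq_ids):
    """Prepare, but do not send, a POST request mapping RefSeq protein IDs."""
    form = {
        "from": "RefSeq_Protein",  # assumed name of the source database
        "to": "UniProtKB",         # assumed name of the target database
        "ids": ",".join(refseq_ids),
    }
    return Request(IDMAPPING_RUN, data=urlencode(form).encode("ascii"), method="POST")


def submit(request):
    """Send the request and return the job ID from the reply (requires network).

    Assumes that the service answers with JSON containing a "jobId" field.
    """
    with urlopen(request, timeout=30) as response:
        return json.load(response)["jobId"]


# NP_010227 is the RefSeq ID of yeast Mbp1; substitute your MYSPE RefSeq ID.
request = build_mapping_request(["NP_010227"])
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

If the service works as assumed, it replies with a job identifier and the results have to be retrieved in a second request; that part is deliberately left out of the sketch.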
Report topic
Task…
The goal of this short report is to develop a sense for how
bioinformatics resources support questions of biological or medical
interest. In the BIN-Storing_data
unit you found the protein in MYSPE that is most similar to yeast
Mbp1. Navigate to the UniProt page for this protein. Explore the links
that go out from the page to other databases and resources.
From these resources, choose one that appears to contain particularly
useful information, and describe a plausible scenario of how it would be
used to answer a research question in a laboratory. What is the
available data? What types of questions can it help answer? How would
you interpret the annotations it supports?
Then submit your report as a formative feedback
assignment.
Questions, comments
If in doubt, ask! If anything about this content is
not clear to you, do not proceed but ask for clarification. If you have
ideas about how to make this material better, let’s hear them. We are
aiming to compile a list of FAQs for all learning units, and your
contributions will count towards your participation marks.
Improve this page! If you have questions or
comments, please post them on the Quercus Discussion board with a
subject line that includes the name of the unit.
If you find the EBI URL (https://www.ebi.ac.uk/) hard to remember, consider the
acronyms: ebi.ac.uk – European
Bioinformatics Institute /
ACademic domains / United
Kingdom.