Databases and services at the EBI

Contents
- EBI Search
- Original Information and Annotation Transfer
Report topic
Questions, comments
References

Expected Preparations:

	[BIN] Databases
	The units listed above are part of this course and contain important preparatory material.

Keywords: The EBI databases and services; UniProt

Objectives:

This unit will …

… introduce EBI Search and the UniProt Knowledgebase as a linking hub to EBI databases and services;
… demonstrate how to navigate from a generic search to a specific record in UniProt and what information is linked from there;
… explore the contents of some associated databases.

Outcomes:

After working through this unit you …

… can find the UniProt ID record for the Mbp1 homologue you found in MYSPE;
… are familiar with the EBI databases and other information items that are linked from the UniProt page;
… can confidently retrieve key information about a protein.

Deliverables:

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.

Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.

Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Evaluation:

Material based on this learning unit can be submitted for formative feedback. To submit:

Create a new document in your shared Google drive folder.
Call your document BIN-EBI-<your name>-2022
Write a short report on the topic defined below.
Include a (CC) license at the end of your document, as instructed at the beginning of the course.
When you are done with everything, go to the Assignments page on Quercus and open the first Feedback Unit that you have not submitted yet. Paste the URL of your report document into the form, and click on Submit Assignment. Your link can be submitted only once and not edited. Also: do not edit your document after it has been submitted.

The EBI hosts some of the world’s most important bioinformatics databases and services. This learning unit explores them in the context of our search for information on yeast Mbp1 and its homologue in MYSPE.

The EBI (European Bioinformatics Institute) is one of the two largest international providers of data for genomics and molecular biology (the NCBI is the other). It runs a cutting-edge, large-scale data management program with a special focus on data integration and services; it makes data, services, and educational resources freely and openly available over the Internet; and it runs significant in-house research projects.

In this unit we explore some of the offerings of the EBI that can contribute to our objective of studying a particular gene in an organism of interest.

Task…

Read the introductory article on EBI databases and services:

Cook, Charles E et al. (2020). “The European Bioinformatics Institute in 2020: building a global infrastructure of interconnected data resources for the life sciences”. Nucleic Acids Research 48(D1):D17–D23.
[PMID: 31701143] [DOI: 10.1093/nar/gkz1033]

Abstract …

Data resources at the European Bioinformatics Institute (EMBL-EBI, https://www.ebi.ac.uk/) archive, organize and provide added-value analysis of research data produced around the world. This year’s update for EMBL-EBI focuses on data exchanges among resources, both within the institute and with a wider global infrastructure. Within EMBL-EBI, data resources exchange data through a rich network of data flows mediated by automated systems. This network ensures that users are served with as much information as possible from any search and any starting point within EMBL-EBI’s websites. EMBL-EBI data resources also exchange data with hundreds of other data resources worldwide and collectively are a key component of a global infrastructure of interconnected life sciences data resources. We also describe the BioImage Archive, a deposition database for raw images derived from primary research that will supply data for future knowledgebases that will add value through curation of primary image data. We also report a new release of the PRIDE database with an improved technical infrastructure, a new API, a new webpage, and improved data exchange with UniProt and Expression Atlas. Training is a core mission of EMBL-EBI and in 2018 our training team served more users, both in-person and through web-based programmes, than ever before.

One of the most important EBI databases is UniProt, a curated, highly cross-referenced database of protein knowledge.

EBI Search

Task…

Remember to document your activities as lab-notes on your Wiki.

Access the EBI website at https://www.ebi.ac.uk/¹
In the search bar, enter mbp1 and hit Return.
The resulting page has entries for a wide variety of information items, with just enough information for each to allow you to evaluate whether it is relevant to your interests.
Access the help page for EBI search from the link at the top of the page, and familiarize yourself with its capabilities and what query types are available.
Back on the search-results page, look for the P39678 (MBP1_YEAST) entry and click on the link. This is the UniProtKB entry for the yeast Mbp1 protein - the main annotation hub that links to the EBI’s other data resources.
Explore the page, in particular the following information items (record your explorations in your journal):
1. Status: Click on “Reviewed” and note that this protein is a Swiss-Prot entry. Swiss-Prot is a hand-curated database of proteins, annotated at the highest level. The alternative is “TrEMBL” (Translated EMBL Database), which holds automatic translations from genome sequencing projects or other incidental submissions. Then follow the link from “Experimental evidence”. Note what evidence types for the existence of a protein are available and what they mean.
2. Function: follow the i information link from the word “Function” and review the types of information that are available in this section. Follow the link to the complete GO annotation and look at the evidence codes for these terms. GO has its own learning unit, but do note the meaning of the terms you see here, and the proportion of term annotations that are contributed from experimental evidence (IDA, IPI, IMP), those that are contributed from computational analysis (IBA, IEA), and the relative proportion of manual versus automatic annotation. This will give you a somewhat more fine-grained idea how such annotations are generated in the first place and perhaps how reliable they are.
3. Names & Taxonomy: follow the i information link from the section name and review the types of information that are available in this section. Note that the taxonomic identifier and lineage are available directly from this page. You will also find a link to SGD - not all organisms have community-curated databases available, but if such a model organism database exists, it is often the best available resource to support research on the organism’s genes. Examples include SGD (yeast), FlyBase (Drosophila), WormBase (Caenorhabditis), Fugu, MGI (mouse), TAIR (Arabidopsis) etc. Note that humans are not considered “model organisms”.
4. Subcellular location: follow the i information link from the section name and review the types of information that are available in this section.
5. PTM / Processing (Post Translational Modification): follow the i information link from the section name and review the types of information that are available in this section. Follow the link into iPTMnet, note the detail of annotation that is available, and especially, note that there are links to the literature(!) for every single annotation. Carefully making the evidence chain for each annotation explicit is the hallmark of a high-quality database!
6. Interaction: follow the i information link from the section name and review the types of information that are available in this section. Interaction databases have their own learning unit, but note that the number of recorded interactions in BioGRID and IntAct is very different. Follow the link to IntAct, click the check-box for the first dozen or so genes, and click on the mRNA Expression data among the list of “Actions for selection”. What does the result page display? Is this a Cargo Cult page? Why or why not? Discuss on the board.
7. Structure: follow the i information link from the section name and review the types of information that are available in this section. Note the detailed annotation of secondary structure. Also note that there are three structures available, and key information such as resolution and coverage is listed in the table, so you can immediately decide which of the structures is most useful.
8. Family & Domains: follow the i information link from the section name and review the types of information that are available in this section. In particular, note that there is one APSES domain annotated (this is a domain that is a bit larger than the KilA-N domain that is annotated in other domain databases for the Mbp1 protein), and there are two Ankyrin domains. Follow the link to view the protein in InterPro, the EBI’s protein family database and look at the annotations. One would expect that the domain structure is conserved between orthologues, such as MBP1_YEAST and MBP1_MYSPE.
9. Sequence: follow the i information link from the section name and review the types of information that are available in this section. Note that cross-references into RefSeq are supplied for both the protein (NP_010227) and the mRNA (NM_001180115).
10. Similar proteins: follow the i information link from the section name and review the types of information that are available in this section. UniProt invests significant resources into defining clusters of similar sequences and these are very useful tools to find variants and homologues. Follow the link to the 90% identity cluster (click on the ClusterID), and check the boxes for all full-length sequences. Then click on the align option. On the resulting alignment page, check the “similarity” box to highlight sequence variants. All of these are yeast strains, but the sequence is not identical for some of them.
Follow the i information links for the remaining section names and review the types of information that are available.
Assume we want to verify whether the N-terminal start codon is correctly annotated. Where is the link to the DNA sequence in its genomic context? What is the location of the annotated transcript? What is the sequence of the 30 upstream nucleotides? How can you download it as a FASTA formatted text file? Can’t find it? Ask on the mailing list!
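
The tally of evidence codes asked for in step 2 can also be done in a few lines of code. The following Python sketch is only an illustration: the grouping covers just the codes named in step 2 (IDA, IPI, IMP as experimental; IBA, IEA as computational; IEA as the automatically assigned one), and the list `codes` is made-up example input, not the actual annotation of Mbp1.

```python
from collections import Counter

# Grouping restricted to the evidence codes named in step 2.
EXPERIMENTAL = {"IDA", "IPI", "IMP"}
COMPUTATIONAL = {"IBA", "IEA"}
AUTOMATIC = {"IEA"}  # assigned without curator review


def summarize_evidence(codes):
    """Return the fraction of annotations in each evidence category."""
    counts = Counter(code.upper() for code in codes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no evidence codes supplied")

    def share(group):
        return sum(counts[c] for c in group) / total

    return {
        "experimental": share(EXPERIMENTAL),
        "computational": share(COMPUTATIONAL),
        "automatic": share(AUTOMATIC),
        # Everything that is not IEA is counted as manually assigned.
        "manual": 1 - share(AUTOMATIC),
    }


# Made-up input for illustration only; replace with the codes you recorded.
codes = ["IDA", "IDA", "IMP", "IPI", "IBA", "IBA", "IEA", "IEA"]
summary = summarize_evidence(codes)
```

If you record codes that are not in the three sets, extend the sets first; otherwise those codes only show up in the "manual" share.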

As you see, this is a very well engineered page that makes a rich set of annotations available. It links to EBI internal databases wherever possible, but it generously provides links to many other databases as well. In general, the provided links are realized through URLs that make it easy to script access to their targets. This is in contrast to the NCBI, whose cross-references are less open.
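
To illustrate what scripted access through such URLs can look like, here is a minimal Python sketch. It assumes that UniProt entries can be retrieved from `https://rest.uniprot.org/uniprotkb/<accession>.<format>`; the base URL, the list of formats, and the function names `uniprot_url` and `fetch_entry` are assumptions made for this example and are not part of this unit's material.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the UniProt REST service.
UNIPROT_REST = "https://rest.uniprot.org/uniprotkb"


def uniprot_url(accession, fmt="fasta"):
    """Build the URL of a UniProtKB entry in the requested format."""
    allowed = {"fasta", "txt", "json", "xml"}
    if fmt not in allowed:
        raise ValueError(f"unsupported format: {fmt}")
    return f"{UNIPROT_REST}/{quote(accession)}.{fmt}"


def fetch_entry(accession, fmt="fasta", timeout=30):
    """Download an entry as text (needs network access)."""
    with urlopen(uniprot_url(accession, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only builds the URL; nothing is downloaded at this point.
url = uniprot_url("P39678")
```

Calling `fetch_entry("P39678")` would then return the FASTA record of yeast Mbp1 as a string, provided the service is reachable and the URL scheme is still valid.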

Original Information and Annotation Transfer

Task…

In the BIN-Storing_data unit you found the protein in MYSPE that is most similar to yeast Mbp1, and you recorded its RefSeq ID. What is its UniProt ID? What information is available via UniProt?

Access the UniProt ID mapping service to retrieve the UniProt ID for the protein. Paste the RefSeq ID and choose RefSeq Protein as the From: option and UniProtKB as the To: option.

If the mapping works, the UniProt ID will be in the Entry: column of the table that is being returned. Record the ID, and click on it to navigate to the UniProt entry page.
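
The same mapping can in principle be requested from a script. The Python sketch below assumes that UniProt offers an ID mapping service at `https://rest.uniprot.org/idmapping` with `run`, `status`, and `details` endpoints, that the database names are `RefSeq_Protein` and `UniProtKB`, and that results arrive as JSON with `from` and `to` fields. All of these, as well as the function names, are assumptions for illustration and should be checked against the UniProt documentation before use.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://rest.uniprot.org/idmapping"  # assumed service location


def mapping_form(refseq_ids):
    """Encode the form fields for a RefSeq Protein -> UniProtKB job."""
    fields = {
        "from": "RefSeq_Protein",
        "to": "UniProtKB",
        "ids": ",".join(refseq_ids),
    }
    return urlencode(fields).encode("ascii")


def extract_accessions(results_json):
    """Map each submitted ID to the UniProt accessions it resolved to."""
    mapped = {}
    for hit in results_json.get("results", []):
        target = hit["to"]
        # The target may be a plain accession or a full entry record.
        if isinstance(target, dict):
            target = target.get("primaryAccession")
        mapped.setdefault(hit["from"], []).append(target)
    return mapped


def map_ids(refseq_ids, poll_seconds=3, max_polls=20):
    """Submit a job, poll until it is done, return the mapping (network)."""
    with urlopen(Request(f"{API}/run", data=mapping_form(refseq_ids))) as r:
        job_id = json.load(r)["jobId"]
    for _ in range(max_polls):
        with urlopen(f"{API}/status/{job_id}") as r:
            status = json.load(r)
        if "results" in status or "failedIds" in status:
            return extract_accessions(status)
        state = status.get("jobStatus")
        if state == "FINISHED":
            with urlopen(f"{API}/details/{job_id}") as r:
                results_url = json.load(r)["redirectURL"]
            with urlopen(results_url) as r:
                return extract_accessions(json.load(r))
        if state not in ("NEW", "RUNNING"):
            raise RuntimeError(f"mapping job ended as {state}")
        time.sleep(poll_seconds)
    raise TimeoutError("mapping job did not finish in time")
```

An empty dictionary returned by `map_ids` corresponds to the case where the web form returns no result, and the workaround described in the next section applies.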

What could possibly go wrong? …

Sometimes the mapping does not work and does not return a result. Most likely, UniProt contains the sequence, but for some reason, the mapping service does not know. If this happens, you can work around the problem as follows.

Load the RefSeq protein page
View the protein as FASTA and copy the sequence.
Open the UniProt BLAST page http://www.uniprot.org/blast/

(Yes, UniProt runs its own BLAST version, and that searches UniProt databases, not GenBank.)
Paste the sequence into the search form and run BLAST.

… if the sequence is in UniProt, you will get the top hit with 100% sequence identity. If you still can’t find a UniProt ID for your sequence, contact me.
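
When you copy a sequence from a FASTA view, the header line and the line breaks come along with it. The following Python sketch shows one way to separate the header from the sequence before pasting. The function name `read_fasta` is made up for this example, and the record shown is a placeholder, not the real MYSPE protein.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Placeholder record: the ID and the sequence are invented for illustration.
example = """>XP_000000.1 hypothetical protein [MYSPE]
MACDEFGHIK
LMNPQRSTVW
"""
header, sequence = read_fasta(example)[0]
```

Comparing `sequence` with the sequence of the top BLAST hit using `==` is a quick check of whether the hit is truly identical over its full length.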

Navigate to the UniProt page for MBP1_MYSPE. Explore the links that go out from the page and contrast this with what you have found for MBP1_YEAST. Assess which resources are independently useful, and which resources merely recapitulate information that relates to yeast Mbp1, the protein that you originally searched with. The goal is to develop a sense for where a page like this one collects original information, and where it merely acts as a record of annotation transfer.
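
A first, rough overview of the outgoing links can be obtained by comparing which databases each entry cross-references. The Python sketch below assumes that both entries are available as parsed UniProt JSON records and that cross-references are stored under a `uniProtKBCrossReferences` key with a `database` field. The function names are made up, and the two toy records contain invented IDs and an invented selection of databases.

```python
def xref_databases(entry):
    """Return the set of database names cross-referenced by an entry."""
    return {x["database"] for x in entry.get("uniProtKBCrossReferences", [])}


def compare_xrefs(reference, query):
    """Split cross-referenced databases into shared and entry-specific."""
    ref_dbs, query_dbs = xref_databases(reference), xref_databases(query)
    return {
        "shared": sorted(ref_dbs & query_dbs),
        "only_reference": sorted(ref_dbs - query_dbs),
        "only_query": sorted(query_dbs - ref_dbs),
    }


# Toy records for illustration; IDs and database lists are invented.
yeast = {"uniProtKBCrossReferences": [
    {"database": "SGD", "id": "X1"},
    {"database": "PDB", "id": "X2"},
    {"database": "InterPro", "id": "X3"},
    {"database": "RefSeq", "id": "X4"},
]}
myspe = {"uniProtKBCrossReferences": [
    {"database": "InterPro", "id": "Y1"},
    {"database": "RefSeq", "id": "Y2"},
    {"database": "EMBL", "id": "Y3"},
]}
comparison = compare_xrefs(yeast, myspe)
```

Such a comparison only shows which databases are linked. Whether a linked record holds original information or merely transferred annotation still has to be judged by reading it.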

¹ If you find this URL hard to remember, consider the acronyms: ebi.ac.uk – European Bioinformatics Institute / ACademic domains / United Kingdom

Questions, comments

If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.

Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.

Databases and services at the EBI

Boris Steipe
