BIN-EBI

From "A B C"
Revision as of 11:07, 23 September 2020 by Boris

Databases and services at the EBI

(The EBI databases and services, UniProt)


 


Abstract:

The EBI hosts some of the world's most important bioinformatics databases and services. This learning unit explores them in the context of our search for information on yeast Mbp1 and its homologue in MYSPE.


Objectives:
This unit will ...

  • ... introduce EBI Search and the UniProt Knowledgebase as a linking hub to EBI databases and services;
  • ... demonstrate how to navigate from a generic search to a specific record in UniProt and what information is linked from there;
  • ... explore the contents of some associated databases.

Outcomes:
After working through this unit you ...

  • ... can find the UniProt ID record for the Mbp1 homologue you found in MYSPE;
  • ... are familiar with the EBI databases and other information items that are linked from the UniProt page;
  • ... can confidently retrieve key information about a protein.

Deliverables:

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

  • Prerequisites:
    This unit builds on material covered in the following prerequisite units:


     



     



     


    Evaluation

    This learning unit can be evaluated for a maximum of 5 marks. There are several options for submission. Choose one option, then ...

    1. Create a new page on the student Wiki as a subpage of your User Page.
    2. Put all of your writing to submit on this one page.
    3. When you are done with everything, go to the Quercus Assignments page and open the first Learning Unit that you have not submitted yet. Paste the URL of your Wiki page into the form, and click on Submit Assignment.

    Do not change your Wiki page after you have submitted your assignment, until it has been graded.

    You can only submit either this unit or the NCBI Learning unit for marking, not both.
    Short Report option
    In the BIN-Storing_data unit you have found the protein in MYSPE that is most similar to yeast Mbp1. Navigate to the UniProt page for this protein. Explore the links that go out from the page.
    1. Create a new page on the student Wiki as a subpage of your User Page.
    2. For three linked resources, describe a plausible scenario in which each could be used to answer a research question in a laboratory. The goal of this short report is to develop a sense for how bioinformatics resources support questions of biological or medical interest. Refer to the "General" section of the marking rubrics for aspects of the report that will be evaluated.
    3. When you are done, submit the link to your page via Quercus as described above.


    Option to write a "Self-Evaluation Question"
    You can submit a "Self-Evaluation Question" for at most one of your assignments.
    Write a "Self-evaluation Question" (and a model solution) that explores a significant, non-trivial aspect of studying how to work with EBI resources within this learning unit. Ensure that the question is feasible, given the existing content of the unit - or coordinate an extension of the contents with your instructor. Ensure your question pursues a high-level learning goal: it should allow others to demonstrate understanding, critical analysis, and/or the capacity to integrate and synthesize knowledge, not merely test memorization. Ensure that your question is specific, not ambiguous, vague or tangential to the contents. Ensure you are testing valuable knowledge and skills, not Cargo Cult. Apply the marking rubrics in spirit to satisfy yourself of the quality of your contribution. Obviously, details of evaluation will vary with the question. Use the format and code templates that you find on the Self evaluation questions page - but don't assume those examples are already models of excellent contributions. Note: assume that approximately the same amount of work is expected for all evaluation options. Consequently, the standard to achieve an excellent mark for this option will be high.
    1. Create a new page on the student Wiki as a subpage of your User Page. Develop your question there.
    2. When you are done, submit the link to your page via Quercus as described above.

    Contents

    The EBI (European Bioinformatics Institute) is one of the two largest international providers of data for genomics and molecular biology (the NCBI is the other). It organizes a cutting-edge program of data management at the largest scale, with a special focus on data integration and services; it makes data, services, and educational resources freely and openly available over the Internet; and it runs significant in-house research projects.

    In this unit we explore some of the offerings of the EBI that can contribute to our objective of studying a particular gene in an organism of interest.


    Task:

    • Read the introductory article on EBI databases and services:
    Cook et al. (2020) The European Bioinformatics Institute in 2020: building a global infrastructure of interconnected data resources for the life sciences. Nucleic Acids Res 48:D17-D23. (pmid: 31701143)

    Data resources at the European Bioinformatics Institute (EMBL-EBI, https://www.ebi.ac.uk/) archive, organize and provide added-value analysis of research data produced around the world. This year's update for EMBL-EBI focuses on data exchanges among resources, both within the institute and with a wider global infrastructure. Within EMBL-EBI, data resources exchange data through a rich network of data flows mediated by automated systems. This network ensures that users are served with as much information as possible from any search and any starting point within EMBL-EBI's websites. EMBL-EBI data resources also exchange data with hundreds of other data resources worldwide and collectively are a key component of a global infrastructure of interconnected life sciences data resources. We also describe the BioImage Archive, a deposition database for raw images derived from primary research that will supply data for future knowledgebases that will add value through curation of primary image data. We also report a new release of the PRIDE database with an improved technical infrastructure, a new API, a new webpage, and improved data exchange with UniProt and Expression Atlas. Training is a core mission of EMBL-EBI and in 2018 our training team served more users, both in-person and through web-based programmes, than ever before.


     

    One of the most important EBI databases is UniProt, a curated, highly cross-referenced database of protein knowledge.

    Task:

    • Read the overview article on UniProt updates:
    UniProt Consortium (2019) UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47:D506-D515. (pmid: 30395287)

    The UniProt Knowledgebase is a collection of sequences and annotations for over 120 million proteins across all branches of life. Detailed annotations extracted from the literature by expert curators have been collected for over half a million of these proteins. These annotations are supplemented by annotations provided by rule based automated systems, and those imported from other resources. In this article we describe significant updates that we have made over the last 2 years to the resource. We have greatly expanded the number of Reference Proteomes that we provide and in particular we have focussed on improving the number of viral Reference Proteomes. The UniProt website has been augmented with new data visualizations for the subcellular localization of proteins as well as their structure and interactions. UniProt resources are available under a CC-BY (4.0) license via the web at https://www.uniprot.org/.


     

    EBI Search

    Task:
    Remember to document your activities as lab-notes on your Wiki.

    1. Access the EBI website at https://www.ebi.ac.uk/ [1]
    2. In the search bar, enter mbp1 and hit Return.
    3. The resulting page has entries for a wide variety of information items, with just enough information for each to allow you to evaluate whether it is relevant to your interests.
    4. Access the help page for EBI search from the link at the top of the page, and familiarize yourself with its capabilities and what query types are available.
    5. Back on the search-results page, look for the P39678 (MBP1_YEAST) entry and click on the link. This is the UniProtKB entry for the yeast Mbp1 protein - the main annotation hub that links to the EBI's other data resources.
    6. Explore the page and examine the following information items (record your explorations in your journal):
      1. Status: Click on "Reviewed" and note that this protein is a Swiss-Prot entry. Swiss-Prot is a hand-curated database of proteins, annotated at the highest level. The alternative is "TrEMBL" (Translated EMBL Database), which holds automatic translations from genome sequencing projects or other incidental submissions. Then follow the link from "Experimental evidence". Note what evidence types for the existence of a protein are available and what they mean.
      2. Function: follow the i information link from the word "Function" and review the types of information that are available in this section. Follow the link to the complete GO annotation and look at the evidence codes for these terms. GO has its own learning unit, but do note the meaning of the terms you see here, and the proportion of term annotations that are contributed from experimental evidence (IDA, IPI, IMP), those that are contributed from computational analysis (IBA, IEA), and the relative proportion of manual versus automatic annotation. This will give you a somewhat more fine-grained idea how such annotations are generated in the first place and perhaps how reliable they are.
      3. Names & Taxonomy: follow the i information link from the section name and review the types of information that are available in this section. Note that the taxonomic identifier and lineage are available directly from this page. You will also find a link to SGD - not all organisms have community-curated databases available, but if such a model organism database exists it is often the best available resource to support research on the organism's genes. Examples include SGD (yeast), FlyBase (Drosophila), WormBase (Caenorhabditis), Fugu, MGI (mouse), TAIR (Arabidopsis) etc. Note that humans are not considered "model organisms".
      4. Subcellular location: follow the i information link from the section name and review the types of information that are available in this section.
      5. PTM / Processing (Post Translational Modification): follow the i information link from the section name and review the types of information that are available in this section. Follow the link into iPTMnet, note the detail of annotation that is available, and especially, note that there are links to the literature(!) for every single annotation. Carefully making the evidence chain for each annotation explicit is the hallmark of a high-quality database!
      6. Interaction: follow the i information link from the section name and review the types of information that are available in this section. Interaction databases have their own learning unit, but note that the number of recorded interactions in BioGrid and IntAct is very different. Follow the link to IntAct, click the check-box for the first dozen or so genes, and click on the mRNA Expression data among the list of "Actions for selection". What does the result page display? Is this a Cargo Cult page? Why or why not? Discuss on the mailing list.
      7. Structure: follow the i information link from the section name and review the types of information that are available in this section. Note the detailed annotation of secondary structure. Also note that there are three structures available, and key information such as resolution and coverage are listed in the table, so you can immediately decide which of the structures is most useful.
      8. Family & Domains: follow the i information link from the section name and review the types of information that are available in this section. In particular, note that there is one APSES domain annotated (this is a domain that is a bit larger than the KilA-N domain that is annotated in other domain databases for the Mbp1 protein), and there are two Ankyrin domains. Follow the link to view the protein in InterPro, the EBI's protein family database and look at the annotations. One would expect that the domain structure is conserved between orthologues, such as MBP1_YEAST and MBP1_MYSPE.
      9. Sequence: follow the i information link from the section name and review the types of information that are available in this section. Note that cross-references into RefSeq are supplied for both the protein (NP_010227) and the mRNA (NM_001180115).
      10. Similar proteins: follow the i information link from the section name and review the types of information that are available in this section. UniProt invests significant resources into defining clusters of similar sequences and these are very useful tools to find variants and homologues. Follow the link to the 90% identity cluster (click on the ClusterID), and check the boxes for all full-length sequences. Then click on the align option. On the resulting alignment page, check the "similarity" box to highlight sequence variants. All of these are yeast strains, but the sequence is not identical for some of them.
      11. Follow the i information links for the remaining section names and review the types of information that are available.
    7. Assume we want to verify whether the N-terminal start codon is correctly annotated. Where is the link to the DNA sequence in its genomic context? What is the location of the annotated transcript? What is the sequence of the 30 upstream nucleotides? How can you download it as a FASTA formatted text file? Can't find it? Ask on the mailing list!
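    Once the genomic sequence and the coordinates of the annotated coding region are known, pulling out the upstream nucleotides is a small slicing exercise. The following is a minimal sketch, not part of the original unit: the function names are hypothetical, and it assumes 1-based, inclusive forward-strand coordinates, which is how genome browsers usually report them. For a gene on the reverse strand, the upstream region lies after the end coordinate and has to be reverse-complemented.

```python
"""Sketch: extract the nucleotides upstream of an annotated start codon."""

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream(chrom: str, cds_start: int, cds_end: int, strand: str, n: int = 30) -> str:
    """Return the n nucleotides 5' of the start codon, read on the coding strand.

    cds_start and cds_end are 1-based, inclusive, forward-strand coordinates
    with cds_start <= cds_end (an assumption about the coordinate convention).
    """
    if strand == "+":
        first = cds_start - n  # 1-based position of the first upstream base
        if first < 1:
            raise ValueError("not enough sequence upstream of the start codon")
        return chrom[first - 1:cds_start - 1]
    if strand == "-":
        last = cds_end + n  # 1-based position of the last base to include
        if last > len(chrom):
            raise ValueError("not enough sequence upstream of the start codon")
        return reverse_complement(chrom[cds_end:last])
    raise ValueError("strand must be '+' or '-'")


def to_fasta(header: str, seq: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"
```

    Writing the result of to_fasta() to a text file gives a FASTA-formatted record of the upstream region. The coordinates themselves still have to come from the genomic annotation that the task asks you to locate.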

    As you see, this is a very well engineered page that makes a rich set of annotations available. It links to EBI internal databases wherever possible, but it generously provides links to many other databases as well. In general, the provided links are realized through URLs that make it easy to script access to their targets. This is in contrast to the NCBI, whose cross-references are less open.
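    To illustrate what "scriptable URLs" means in practice, here is a minimal sketch that builds request URLs from an accession number or a search term. The URL patterns are assumptions based on the REST interfaces that UniProt and EBI Search offer at the time of writing, not something this unit prescribes; check the current service documentation before relying on them. The function names, and the "uniprot" domain name used as default, are hypothetical choices for this example.

```python
"""Sketch: build scriptable URLs for UniProt entries and EBI Search queries."""
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumed base URLs of the public REST interfaces; verify before use.
UNIPROT_REST = "https://rest.uniprot.org/uniprotkb"
EBI_SEARCH_REST = "https://www.ebi.ac.uk/ebisearch/ws/rest"


def uniprot_entry_url(accession: str, fmt: str = "fasta") -> str:
    """URL of a UniProtKB entry in the requested format (e.g. fasta, txt, json)."""
    return f"{UNIPROT_REST}/{quote(accession)}.{fmt}"


def ebi_search_url(query: str, domain: str = "uniprot", size: int = 10) -> str:
    """URL of an EBI Search query against one search domain, returning JSON."""
    params = urlencode({"query": query, "format": "json", "size": size})
    return f"{EBI_SEARCH_REST}/{quote(domain)}?{params}"


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Download a URL and return the body as text (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

    With network access, fetch_text(uniprot_entry_url("P39678")) should return the FASTA record of yeast Mbp1, provided the assumed URL pattern is still valid.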


    Original Information and Annotation Transfer

    Task:
    In the BIN-Storing_data unit you have found the protein in MYSPE that is most similar to yeast Mbp1, and you have recorded its RefSeq ID. What is its UniProt ID? What information is available via UniProt?

    • Access the UniProt ID mapping service to retrieve the UniProt ID for the protein. Paste the RefSeq ID and choose RefSeq Protein as the From: option and UniProtKB as the To: option.
    If the mapping works, the UniProt ID will be in the Entry: column of the table that is being returned. Record the ID, and click on it to navigate to the UniProt entry page.

    What could possibly go wrong?

    Sometimes the mapping does not work and does not return a result. Most likely, UniProt contains the sequence, but for some reason, the mapping service does not know. If this happens, you can work around the problem as follows.

    1. Load the RefSeq protein page.
    2. View the protein as FASTA and copy the sequence.
    3. Open the UniProt BLAST page http://www.uniprot.org/blast/
       (Yes, UniProt runs its own BLAST version, and that searches UniProt databases, not GenBank.)
    4. Paste the sequence into the search form and run BLAST.

    ... if the sequence is in UniProt, you will get the top hit with 100% sequence identity. If you still can't find a UniProt ID for your sequence, contact me.

    • Navigate to the UniProt page for MBP1_MYSPE. Explore the links that go out from the page and contrast this to what you found for MBP1_YEAST. Assess which resources are independently useful, and which resources merely recapitulate information that relates to yeast Mbp1, the protein that you originally searched with. The goal is to develop a sense for where a page like this one collects original information, and where it merely acts as a record of annotation transfer.
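    The RefSeq-to-UniProt mapping described in the first step of this task can also be requested from a script. The sketch below is an assumption-laden illustration rather than the unit's method: it assumes that UniProt's ID mapping service accepts a POST to an idmapping/run endpoint, reports progress under idmapping/status/<jobId>, and uses the database names RefSeq_Protein and UniProtKB. The function names are hypothetical, and the shape of the returned JSON is also an assumption.

```python
"""Sketch: map RefSeq protein IDs to UniProtKB accessions via a REST service."""
import json
import time
from urllib.parse import urlencode
from urllib.request import Request, urlopen

IDMAPPING = "https://rest.uniprot.org/idmapping"  # assumed base URL


def build_mapping_request(ids, source="RefSeq_Protein", target="UniProtKB"):
    """Build (but do not send) the POST request that submits a mapping job."""
    form = urlencode({"from": source, "to": target, "ids": ",".join(ids)})
    return Request(f"{IDMAPPING}/run", data=form.encode("ascii"), method="POST")


def extract_pairs(results):
    """Turn a decoded results document into (query_id, accession) pairs.

    Assumes each result has a "from" field and a "to" field that is either
    an accession string or an entry object with a "primaryAccession" key.
    """
    pairs = []
    for item in results.get("results", []):
        to = item["to"]
        accession = to["primaryAccession"] if isinstance(to, dict) else to
        pairs.append((item["from"], accession))
    return pairs


def map_ids(ids, poll_seconds=3.0, max_polls=20):
    """Submit a mapping job and poll until results arrive (needs network)."""
    with urlopen(build_mapping_request(ids)) as response:
        job_id = json.load(response)["jobId"]
    for _ in range(max_polls):
        # A finished job is assumed to redirect to its results document.
        with urlopen(f"{IDMAPPING}/status/{job_id}") as response:
            status = json.load(response)
        if "results" in status:
            return extract_pairs(status)
        if status.get("jobStatus") not in ("NEW", "RUNNING"):
            raise RuntimeError(f"mapping job failed: {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("mapping job did not finish in time")
```

    An empty list from map_ids() corresponds to the case described above in which the mapping service returns no result; the BLAST workaround then still applies.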



    Notes

    1. If you find this URL hard to remember, consider the acronyms:
      ebi.ac.uk
      EBI: European Bioinformatics Institute
      AC: ACademic domains
      UK: United Kingdom


     


    About ...
     
    Author:

    Boris Steipe <boris.steipe@utoronto.ca>

    Created:

    2017-08-05

    Modified:

    2020-09-20

    Version:

    1.1

    Version history:

    • 1.1 2020 updates and revised marking
    • 1.0 First live version
    • 0.1 First stub

    This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License.