Difference between revisions of "BIN-EBI"
Revision as of 02:44, 29 October 2017
Databases and services at the EBI
Keywords: The EBI databases and services, UniProt
Abstract
The EBI hosts some of the world's most important bioinformatics databases and services. This learning unit explores them in the context of our search for information on yeast Mbp1 and its homologue in MYSPE.
This unit ...
Prerequisites
You need to complete the following units before beginning this one:
Objectives
This unit will ...
- ... introduce EBI Search and the UniProt Knowledgebase as a linking hub to EBI databases and services;
- ... demonstrate how to navigate from a generic search to a specific record in UniProt and what information is linked from there;
- ... explore the contents of some associated databases.
Outcomes
After working through this unit you ...
- ... can find the UniProt ID record for the Mbp1 homologue you found in MYSPE;
- ... are familiar with the EBI databases and other information items that are linked from the UniProt page;
- ... can confidently retrieve key information about a protein.
Deliverables
- Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
- Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
- Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation
This learning unit can be evaluated for a maximum of 6 marks. If you want to submit tasks for this unit for credit you have the following options. If you have any questions about these options, discuss on the mailing list.
- Short Report option
- If you submit both BIN-NCBI and BIN-EBI for evaluation, you can choose this option for only one of the two.
- In the BIN-Storing_data unit you have found the protein of MYSPE that is most similar to yeast Mbp1. Navigate to the UniProt page for this protein. Explore the links that go out from the page. Assess which resources are independently useful, and which resources merely recapitulate information that relates to yeast Mbp1, the protein that you originally searched with.
- Create a new page on the student Wiki as a subpage of your User Page.
- Write a short report on your findings. The goal of this short report is to develop a sense for where a page like this one collects original information, and where it merely acts as a record of annotation transfer. Refer to the "General" section of the marking rubrics for aspects of the report that will be evaluated.
- When you are done with everything, add the following category tag to the page:
[[Category:EVAL-BIN-EBI]]
- Do not change your submission page after this tag has been added. The page will be marked and the category tag will be removed by the instructor.
- Quiz option
- Open the signup-page for the quiz for this unit (linked from here) and add your name. Your name must be signed up by 12:00 of the day of the Quiz to ensure copies of the quiz are available for all participants.
- Quizzes will be written in class, back-to-back if there is more than one quiz scheduled. We may begin at any time. We will have an open-ended Q&A session before the quiz. You can't take the quiz if you are not present in class when the question sheets are handed out, so don't be late. Once all scheduled quizzes are written, we will discuss and mark them. You will mark your own quiz. All marking must be done with a red pen - so you must bring a red pen to class in order to participate. The mark you give yourself may be revised by the instructor after spot-checking quizzes. If this is necessary, you will be notified. You must mark your quiz correctly and honestly - don't get into trouble with academic integrity rules: it will be an academic offence if you mark questions as correct that were discussed in class and should have been marked incorrect. When in doubt, ask.
- Option to write a "Self-Evaluation Question"
- If you submit both BIN-NCBI and BIN-EBI for evaluation, you can choose this option for only one of the two.
- Write a "Self-evaluation Question" (with a model solution) that explores a significant, non-trivial aspect of studying how to work with EBI resources within this learning unit. Ensure that the question is feasible, given the existing content of the unit, or coordinate an extension of the contents with your instructor. Ensure your question pursues a high-level learning goal: it should allow others to demonstrate understanding, critical analysis, and/or the capacity to integrate and synthesize knowledge, not merely test memorization. Ensure that your question is specific, not ambiguous, vague or tangential to the contents. Ensure you are testing valuable knowledge and skills, not Cargo Cult. Apply the marking rubrics in spirit to satisfy yourself of the quality of your contribution. Obviously, details of evaluation will vary with the question. Use the format and code templates that you find on the Self evaluation questions page, but don't assume those examples are already models of excellent contributions. Note: assume that approximately the same amount of work is expected for all evaluation options. Consequently, the standard of excellence for this option will be quite high.
- Create a new page on the student Wiki as a subpage of your User Page. Develop your question there.
- When you are done with developing this content, add the following category tag to the page:
[[Category:EVAL-BIN-EBI]]
- Do not change your submission page after this tag has been added. The page will be marked and the category tag will be removed by the instructor.
Contents
The EBI (European Bioinformatics Institute) is one of the two largest international providers of data for genomics and molecular biology (the NCBI is the other). It organizes a cutting-edge program of data management at the largest scale, with a special focus on data integration and services; it makes data, services, and educational resources freely and openly available over the Internet; and it runs significant in-house research projects.
In this unit we explore some of the offerings of the EBI that can contribute to our objective of studying a particular gene in an organism of interest.
Task:
- Read the introductory article on EBI databases and services:
Cook et al. (2016) The European Bioinformatics Institute in 2016: Data growth and integration. Nucleic Acids Res 44:D20-6. (pmid: 26673705)
New technologies are revolutionising biological research and its applications by making it easier and cheaper to generate ever-greater volumes and types of data. In response, the services and infrastructure of the European Bioinformatics Institute (EMBL-EBI, www.ebi.ac.uk) are continually expanding: total disk capacity increases significantly every year to keep pace with demand (75 petabytes as of December 2015), and interoperability between resources remains a strategic priority. Since 2014 we have launched two new resources: the European Variation Archive for genetic variation data and EMPIAR for two-dimensional electron microscopy data, as well as a Resource Description Framework platform. We also launched the Embassy Cloud service, which allows users to run large analyses in a virtual environment next to EMBL-EBI's vast public data resources.
One of the most important EBI databases is UniProt, a curated, highly cross-referenced database of protein knowledge.
Task:
- Read the overview article on UniProt updates:
The UniProt Consortium (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res 45:D158-D169. (pmid: 27899622)
The UniProt knowledgebase is a large resource of protein sequences and associated detailed annotation. The database contains over 60 million sequences, of which over half a million sequences have been curated by experts who critically review experimental and predicted data for each protein. The remainder are automatically annotated based on rule systems that rely on the expert curated knowledge. Since our last update in 2014, we have more than doubled the number of reference proteomes to 5631, giving a greater coverage of taxonomic diversity. We implemented a pipeline to remove redundant highly similar proteomes that were causing excessive redundancy in UniProt. The initial run of this pipeline reduced the number of sequences in UniProt by 47 million. For our users interested in the accessory proteomes, we have made available sets of pan proteome sequences that cover the diversity of sequences for each species that is found in its strains and sub-strains. To help interpretation of genomic variants, we provide tracks of detailed protein information for the major genome browsers. We provide a SPARQL endpoint that allows complex queries of the more than 22 billion triples of data in UniProt (http://sparql.uniprot.org/). UniProt resources can be accessed via the website at http://www.uniprot.org/.
EBI Search
Task:
Remember to document your activities as lab-notes on your Wiki.
- Access the EBI website at https://www.ebi.ac.uk/ [1]
- In the search bar, enter mbp1 and click on the looking-glass icon.
- The resulting page has entries for a wide variety of information items, with just enough information for each to allow you to evaluate whether it is relevant to your interests.
- Access the help page for EBI search from the link at the top of the page, and familiarize yourself with its capabilities and what query types are available.
- Back on the search-results page, look for the P39678 (MBP1_YEAST) entry and click on the link. This is the UniProtKB entry for the yeast Mbp1 protein - the main annotation hub that links to the EBI's other data resources.
- Explore the page and explore the following information items (record your explorations in your journal):
- Status: Click on "Reviewed" and note that this protein is a Swiss-Prot entry. Swiss-Prot is a hand-curated database of proteins, annotated at the highest level. An alternative value could be "TrEMBL" (Translated EMBL Database), which holds automatic translations from genome sequencing projects or other incidental submissions. Then follow the link from "Experimental evidence". Note what evidence types for the existence of a protein are available and what they mean.
- Function: follow the i information link from the word "Function" and review the types of information that are available in this section. Follow the link to the complete GO annotation and look at the evidence codes for these terms. GO has its own learning unit, but do note the meaning of the terms you see here, and the proportion of term annotations that are contributed from experimental evidence (IDA, IPI, IMP), those that are contributed from computational analysis (IBA, IEA), and the relative proportion of manual versus automatic annotation. This will give you a somewhat more fine-grained idea how such annotations are generated in the first place and perhaps how reliable they are.
- Names & Taxonomy: follow the i information link from the section name and review the types of information that are available in this section. Note that the taxonomic identifier and lineage are available directly from this page. You will also find a link to SGD - not all organisms have community curated databases available, but if such a model organism database exists it is often the best available resource to support research on the organism's genes. Examples include SGD (yeast), FlyBase (Drosophila), WormBase (Caenorhabditis), Fugu, MGI (mouse), TAIR (Arabidopsis) etc. Note that humans are not considered "model organisms".
- Subcellular location: follow the i information link from the section name and review the types of information that are available in this section.
- PTM / Processing (Post Translational Modification): follow the i information link from the section name and review the types of information that are available in this section. Follow the link into iPTMnet, note the detail of annotation that is available, and especially, note that there are links to the literature(!) for every single annotation. Carefully making the evidence chain for each annotation explicit is the hallmark of a high-quality database!
- Interaction: follow the i information link from the section name and review the types of information that are available in this section. Interaction databases have their own learning unit, but note that the number of recorded interactions in BioGrid and IntAct is very different. Follow the link to IntAct, click the check-box for the first dozen or so genes, and click on the mRNA Expression data among the list of "Actions for selection". What does the result page display? Is this a Cargo Cult page? Why or why not? Discuss on the mailing list.
- Structure: follow the i information link from the section name and review the types of information that are available in this section. Note the detailed annotation of secondary structure. Also note that there are three structures available, and key information such as resolution and coverage are listed in the table, so you can immediately decide which of the structures is most useful.
- Family & Domains: follow the i information link from the section name and review the types of information that are available in this section. In particular, note that there is one APSES domain annotated (this is a domain that is a bit larger than the KilA-N domain that is annotated in other domain databases for the Mbp1 protein), and there are two Ankyrin domains. Follow the link to view the protein in InterPro, the EBI's protein family database, and look at the annotations. One would expect that the domain structure is conserved between orthologues, such as MBP1_YEAST and MBP1_MYSPE.
- Sequence: follow the i information link from the section name and review the types of information that are available in this section. Note that cross-references into RefSeq are supplied for both the protein (NP_010227) and the mRNA (NM_001180115).
- Similar proteins: follow the i information link from the section name and review the types of information that are available in this section. UniProt invests significant resources into defining clusters of similar sequences and these are very useful tools to find variants and homologues. Follow the link to the 90% identity cluster (click on the ClusterID), and check the boxes for all full-length sequences. Then click on the align option. On the resulting alignment page, check the "similarity" box to highlight sequence variants. All of these are yeast strains, but the sequence is not identical for some of them.
- Follow the i information links for the remaining section names and review the types of information that are available.
- Assume we want to verify whether the N-terminal start codon is correctly annotated. Where is the link to the DNA sequence in its genomic context? What is the location of the annotated transcript? What is the sequence of the 30 upstream nucleotides? How can you download it as a FASTA formatted text file? Can't find it? Ask on the mailing list!
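Once you have located the genomic sequence and the coordinates of the annotated transcript, the last task above reduces to slicing a string and writing it out as FASTA. The following is only a minimal sketch of that step, under stated assumptions: the gene lies on the forward strand, the sequence is already available as a plain string, and the start coordinate is 1-based. The function names, the toy sequence and the coordinate are hypothetical illustrations and do not come from any database.

```python
import textwrap


def upstream_region(genomic_seq, cds_start, n=30):
    """Return up to n nucleotides immediately upstream of a forward-strand CDS.

    cds_start is the 1-based position of the first base of the start codon.
    Fewer than n bases are returned if the CDS starts near the sequence start.
    """
    if cds_start < 1 or cds_start > len(genomic_seq):
        raise ValueError("cds_start is outside the sequence")
    first = max(0, cds_start - 1 - n)  # 0-based index of the first upstream base
    return genomic_seq[first:cds_start - 1]


def to_fasta(header, seq, width=60):
    """Format a sequence as a FASTA record with fixed-width sequence lines."""
    lines = textwrap.wrap(seq, width) or [""]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Toy example: a made-up sequence, NOT real yeast DNA.
toy_genome = "ACGT" * 10 + "ATGTCTAACCAA"  # 40 upstream bases, then a CDS
toy_upstream = upstream_region(toy_genome, cds_start=41, n=30)
toy_record = to_fasta("toy_upstream_30", toy_upstream)
print(toy_record, end="")
```

For a gene on the reverse strand you would instead take the bases that follow the annotated end coordinate and reverse-complement them; that case is deliberately left out of this sketch.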
As you see, this is a very well engineered page that makes a rich set of annotations available. It links to EBI internal databases wherever possible, but it generously provides links to many other databases as well. In general, the provided links are realized through URLs that make it easy to script access to their targets. This is in contrast to the NCBI, whose cross-references are less open.
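To illustrate what "easy to script" can mean in practice, here is a small sketch that builds entry URLs from an accession number without making any network request. The URL pattern (base address, followed by the accession and an optional format suffix) is an assumption that reflects how UniProt entry pages were commonly addressed when this unit was written; it may have changed since, so check it against the current UniProt help pages before relying on it. The function name is hypothetical.

```python
from urllib.parse import quote

# ASSUMPTION: entry pages are addressed as <base>/<accession>[.<format>].
# Verify this pattern against the current UniProt documentation.
UNIPROT_BASE = "http://www.uniprot.org/uniprot/"


def uniprot_url(accession, fmt=None):
    """Build the URL of a UniProt entry, optionally with a format suffix.

    accession: a UniProt accession such as "P39678"
    fmt:       optional suffix such as "fasta" or "txt" (assumed to be supported)
    """
    acc = quote(accession.strip(), safe="")  # escape anything unsafe in a URL
    if fmt is None:
        return UNIPROT_BASE + acc
    return UNIPROT_BASE + acc + "." + fmt


entry_page = uniprot_url("P39678")
entry_fasta = uniprot_url("P39678", "fasta")
print(entry_page)
print(entry_fasta)
```

Because the accession is the only variable part of such a URL, the same function can be applied to a whole list of accessions, for example to the ID you find for MBP1_MYSPE.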
Original Information and Annotation Transfer
Task:
In the BIN-Storing_data unit you have found the protein of MYSPE that is most similar to yeast Mbp1, and you have recorded its RefSeq ID. What is its UniProt ID? What information is available via UniProt?
- Access the UniProt ID mapping service to retrieve the UniProt ID for the protein. Paste the RefSeq ID and choose RefSeq Protein as the From: option and UniProtKB as the To: option.
- If the mapping works, the UniProt ID will be in the Entry: column of the table that is being returned. Record the ID, and click on it to navigate to the UniProt entry page.
What could possibly go wrong?
Sometimes the mapping does not work and does not return a result. Most likely, UniProt contains the sequence, but for some reason, the mapping service does not know. If this happens, you can work around the problem as follows.
1. Load the RefSeq protein page.
2. View the protein as FASTA and copy the sequence.
3. Open the UniProt BLAST page http://www.uniprot.org/blast/ (Yes, UniProt runs its own BLAST version, and that searches UniProt databases, not GenBank.)
4. Paste the sequence into the search form and run BLAST.
If the sequence is in UniProt, you will get the top hit with 100% sequence identity. If you still can't find a UniProt ID for your sequence, contact me.
- Navigate to the UniProt page for MBP1_MYSPE. Explore the links that go out from the page and contrast this to what you found for MBP1_YEAST. Assess which resources are independently useful, and which resources merely recapitulate information that relates to yeast Mbp1, the protein that you originally searched with. The goal is to develop a sense for where a page like this one collects original information, and where it merely acts as a record of annotation transfer.
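One way to make the contrast between original information and annotation transfer more concrete is to tally the GO evidence codes you recorded for each of the two proteins. The sketch below groups codes exactly as this unit does: IDA, IPI and IMP count as experimental; IBA and IEA count as computational; everything else is lumped into "other". The function name and the example code lists are hypothetical, so replace them with the codes you actually find on the two UniProt pages.

```python
from collections import Counter

# Grouping taken from this unit; any code not listed here is counted as "other".
EXPERIMENTAL = {"IDA", "IPI", "IMP"}
COMPUTATIONAL = {"IBA", "IEA"}
CATEGORIES = ("experimental", "computational", "other")


def evidence_profile(codes):
    """Return the fraction of annotations in each evidence category."""
    counts = Counter()
    for code in codes:
        code = code.strip().upper()
        if code in EXPERIMENTAL:
            counts["experimental"] += 1
        elif code in COMPUTATIONAL:
            counts["computational"] += 1
        else:
            counts["other"] += 1
    total = sum(counts.values())
    if total == 0:
        return {category: 0.0 for category in CATEGORIES}
    return {category: counts[category] / total for category in CATEGORIES}


# Hypothetical code lists, for illustration only (NOT real annotation data).
well_studied = ["IDA", "IMP", "IPI", "IEA"]
transferred = ["IEA", "IEA", "IBA", "IEA"]

profile_a = evidence_profile(well_studied)
profile_b = evidence_profile(transferred)
print(profile_a)
print(profile_b)
```

A profile that is dominated by computational codes suggests that most of the page was filled in by annotation transfer rather than by experiments on that particular protein.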
Further reading, links and resources
Notes
- ↑ If you find this URL hard to remember, consider the acronyms:
- ebi.ac.uk
- EBI: European Bioinformatics Institute
- AC: ACademic domains
- UK: United Kingdom
Self-evaluation
If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2017-08-05
Modified:
- 2017-10-03
Version:
- 1.0
Version history:
- 1.0 First live version
- 0.1 First stub
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.