Difference between revisions of "BIN-NCBI"

From "A B C"
Line 40: Line 40:
 
<!-- included from "ABC-unit_components.wtxt", section: "notes-prerequisites" -->
 
<!-- included from "ABC-unit_components.wtxt", section: "notes-prerequisites" -->
 
You need to complete the following units before beginning this one:
 
You need to complete the following units before beginning this one:
*[[BIN-Databases]]
+
*[[BIN-Databases|BIN-Databases (Bioinformatics Databases)]]
  
 
{{Vspace}}
 
{{Vspace}}
Line 73: Line 73:
 
=== Evaluation ===
 
=== Evaluation ===
 
<!-- included from "../components/BIN-NCBI.components.wtxt", section: "evaluation" -->
 
<!-- included from "../components/BIN-NCBI.components.wtxt", section: "evaluation" -->
<!-- included from "ABC-unit_components.wtxt", section: "eval-INT-TBD" -->
+
This learning unit can be evaluated for a maximum of 6 marks. If you want to submit tasks for this unit for credit you have the following options:
<b>Evaluation: Integrated Unit</b><br />
+
 
:This unit should be submitted for evaluation for a maximum of 10 marks. Details TBD.
+
; Short Report option
 +
:In the [[BIN-Storing_data]] unit you have found the protein of YFO that is most similar to yeast Mbp1, in YFO. Navigate to the NCBI Protein page for the RefSeq entry of this protein. Explore the links that go out from the page. Assess which resources are independently useful, and which resources merely recapitulate information that relates to yeast Mbp1, the protein that you originally searched with.
 +
:#Create a new page on the student Wiki as a subpage of your User Page.
 +
:#Write a short report on your findings. The goal of this short report is to develop a sense where a page like this one collects original information, and where it merely acts as a record of annotation transfer. Refer to [[ABC-Rubrics|the '''"General" section of the marking rubrics''']] for aspects of the report that will be evaluated.
 +
:# When you are done with everything, add the following category tag to the page:
 +
::<code><nowiki>[[Category:EVAL-BIN-NCBI]]</nowiki></code>
 +
:'''Do not''' change your submission page after this tag has been added. The page will be marked and the category tag will be removed by the instructor.
 +
 
 +
<!--
 +
; Tasks submission option
 +
:# Create a new page on the student Wiki as a subpage of your User Page.
 +
:# There are a number of tasks in which you are explicitly asked to submit code or other text for credit. Put all of these submissions on this one page.
 +
:# When you are done with everything, add the following category tag to the page:
 +
::<code><nowiki>[[Category:EVAL-BIN-NCBI]]</nowiki></code>
 +
:'''Do not''' change your submission page after this tag has been added. The page will be marked and the category tag will be removed by the instructor.
 +
-->
 +
 
 +
; Quiz option
 +
: Open the [http://steipe.biochemistry.utoronto.ca/abc/students/index.php/Signup-BIN-NCBI_Quiz  '''signup-page for the quiz for this unit (linked from here)'''] and add your name. Your name must be signed up by 12:00 of the day of the Quiz to ensure copies of the quiz are available for all participants.
 +
:include("ABC-unit_components.wtxt", section = "quiz-mechanics")
 +
 
 +
<!--
 +
; R-code option
 +
:Submit code according to the following requirements. Make sure your code is documented.
 +
-->
 +
 
 +
; Option to write a "Self-Evaluation Question"
 +
: Write a "Self-evaluation Question" that explores a significant, non-trivial aspect of studying how to work with NCBI resources within this learning unit. Ensure that the question is feasible, given the existing content of the unit - or coordinate an extension of the contents with your instructor. Ensure your question pursues a high-level learning goal: it should allow others to demonstrate understanding, critical analysis, and/or the capacity to integrate and synthesize knowledge, not merely test memorization. Ensure that your question is specific, not ambiguous, vague or tangential to the contents. Ensure you are testing '''valuable''' knowledge and skills, not Cargo Cult. Apply the [[ABC-Rubrics| '''marking rubrics''']] in spirit to satisfy yourself of the quality of your contribution. Obviously, details of evaluation will vary with the question. Use the format that you find on other learning unit pages, e.g. [['''here''']] - but don't assume those questions are models of excellent contributions. Of course the question won't be complete without your model solution.
 +
:#Create a new page on the student Wiki as a subpage of your User Page. Develop your question there.
 +
:#When you are done with developing this contents, add the following category tag to the page:
 +
::<code><nowiki>[[Category:EVAL-BIN-NCBI]]</nowiki></code>
 +
:'''Do not''' change your submission page after this tag has been added. The page will be marked and the category tag will be removed by the instructor.
 +
 
 +
 
 +
 
 +
 
 +
-->
 +
 
  
 
{{Vspace}}
 
{{Vspace}}
Line 85: Line 122:
 
<!-- included from "../components/BIN-NCBI.components.wtxt", section: "contents" -->
 
<!-- included from "../components/BIN-NCBI.components.wtxt", section: "contents" -->
  
 +
The [http://www.ncbi.nlm.nih.gov/guide/sitemap/ '''NCBI''' (National Center for Biotechnology Information)] is the largest international provider of data for genomics and molecular biology. With its annual budget of several hundred million dollars, it organizes a challenging program of data management at the largest scale, it makes its data freely and openly available over the Internet, worldwide, and it runs significant in-house research projects.
  
{{Task|1=
+
In this unit we explore some of the offerings of the NCBI that can contribute to our objective of studying a particular gene in an organism of interest.
*Read the introductory notes on {{ABC-PDF|BIN-NCBI|public databases and services at the US National Center for Biotechnology Information (NCBI)}}.
 
}}
 
  
  
 +
{{Task|1=
  
 +
*Read the introductory article on NCBI database resources:
 +
{{#pmid:27899561}}
  
The [http://www.ncbi.nlm.nih.gov/guide/sitemap/ '''NCBI''' (National Center for Biotechnology Information)] is the largest international provider of data for genomics and molecular biology. With its annual budget of several hundred million dollars, it organizes a challenging program of data management at the largest scale, it makes its data freely and openly available over the Internet, worldwide, and it runs significant in-house research projects.
+
}}
 
 
In thi unit we explore some of the offerings of the NCBI that can contribute to our objective of studying a particular gene in an organism of interest.
 
  
 
{{Vspace}}
 
{{Vspace}}
Line 103: Line 140:
  
 
{{task|1=
 
{{task|1=
<small>Remember to '''document''' your activities as lab-notes on your Wiki.</small>
+
<small>Remember to '''document''' your activities as lab-notes on your Wiki.</small>
  
 
# Access the '''NCBI''' website at http://www.ncbi.nlm.nih.gov/ <ref>If you find this URL hard to remember, consider the acronyms:<br />
 
# Access the '''NCBI''' website at http://www.ncbi.nlm.nih.gov/ <ref>If you find this URL hard to remember, consider the acronyms:<br />
Line 117: Line 154:
  
  
The result page of your search in "All Databases" is the "Global Query Result Page" of the Entrez system. If you follow the "Protein" link, you get taken to the more than 530 sequences in the NCBI Protein database that contain the keyword "mbp1". But when you look more closely at the results, you see that the result is quite non-specific: searching only by keyword retrieves a multiubiquitin chain binding protein in ''Arabidopsis'', bacterial mannose binding proteins, a ''Saccharomyces'' protein (perhaps one that we are actually interested in), maltose binding proteins, myelin basic proteins - and much more. There must be a more specific way to search, and indeed there is. Time to read up on the Entrez system.
+
The result page of your search in "All Databases" is the "Global Query Result Page" of the Entrez system. If you follow the "Protein" link, you get taken to the more than 610 sequences in the NCBI Protein database that contain the keyword "mbp1". But when you look more closely at the results, you see that the result is quite non-specific: searching only by keyword retrieves a multiubiquitin chain binding protein in ''Arabidopsis'', myrosinase binding proteins, bacterial mannose binding proteins, a ''Saccharomyces'' protein (perhaps one that we are actually interested in), maltose binding proteins, myelin basic proteins - and much more. There must be a more specific way to search, and indeed there is. Time to read up on the Entrez system.
  
  
Line 132: Line 169:
 
## How to restrict a search to a particular organism.
 
## How to restrict a search to a particular organism.
  
Don't skip this part, you should know the more common options and how to find the others. It would be great to have a synopsis of the important fields for reference, wouldn't it? Why don't you go and make one: I have put a template page on the Student Wiki ([http://steipe.biochemistry.utoronto.ca/abc/students/index.php/Entrez '''A synopsis of Entrez codes''']). Contributors and editors welcome!
+
And you should know that these filters are in part database specific, i.e. not all of them will work in all databases.
 +
 
 +
Don't skip this part, you should know the more common options and how to find the others. It would be great to have a synopsis of the important fields for reference, wouldn't it? We have started building one on the Student Wiki ([http://steipe.biochemistry.utoronto.ca/abc/students/index.php/Entrez '''A synopsis of Entrez codes''']). Currently, I think it lacks structure, and examples. Contributors and editors welcome!
 
}}
 
}}
  
Line 159: Line 198:
  
  
This finds two proteins. Follow the link to the result <code>CAA98618.1</code>&mdash;a data record in Genbank Flat File (GFF) format<ref>If there is only a single match, you will be been taken directly to the page.</ref>. The database identifier <code>CAA98618.1</code> tells you that this is a record in the GenPept database. There are actually several, identical versions of this sequence in the NCBI's holdings. A link to [http://www.ncbi.nlm.nih.gov/protein/1431055?report=ipg "Identical Proteins"] near the top of the record shows you what these are:
+
This finds two entries in the Protein database. Follow the link to the result <code>CAA98618.1</code>, a data record in GenBank flat file format<ref>If there is only a single match, you will be taken directly to the page.</ref>. The database identifier <code>CAA98618.1</code> tells you that this is a record in the GenPept database. There are actually several, identical versions of this sequence in the NCBI's holdings. A link to [https://www.ncbi.nlm.nih.gov/ipg/258763 "Identical Protein Groups" Database] near the top of the record shows you what these are:
  
  
Line 165: Line 204:
  
  
* there are seven records for which the source is [http://www.insdc.org/ the INSDC], these are archival entries, submitted by independent yeast genome research projects;
+
* there are several records for which the source is [http://www.insdc.org/ the INSDC], these are archival entries, submitted by independent yeast genome research projects;
  
* there two entries in the '''RefSeq''' database linking to the same protein: [http://www.ncbi.nlm.nih.gov/protein/NP_010227.1 <code>NP_010227.1</code>]. One is derived from genome sequence, the other from mRNA. This RefSeq entry is the preferred version of the sequence for us to work with. RefSeq is a curated, non-redundant database which solves a number of problems of archival databases. You can recognize RefSeq identifiers &ndash; they always look like NP_12345.1, NM_12345.1, XP_12345.1, NC_12345.1 etc. This reflects whether the sequence is protein, mRNA or genomic, and inferred or obtained through experimental evidence. The RefSeq ID <code>NP_010227.1</code> actually appears twice, once linked to its genomic sequence, and once to its mRNA.
+
* there are two entries in the '''RefSeq''' database linking to the same protein: [http://www.ncbi.nlm.nih.gov/protein/NP_010227.1 <code>NP_010227.1</code>]. One is derived from genome sequence, the other from mRNA. This RefSeq entry is the preferred version of a sequence for our purposes. RefSeq is a curated, non-redundant database which solves a number of problems of archival databases. You can recognize RefSeq identifiers &ndash; they always look like NP_12345.1, NM_12345.1, XP_12345.1, NC_12345.1 etc. This reflects whether the sequence is protein, mRNA or genomic, and inferred or obtained through experimental evidence.
  
* there is a '''SwissProt''' sequence [http://www.ncbi.nlm.nih.gov/protein/P39678.1 <code>P39678.1</code>]<ref>Actually the "real" SwissProt identifier would be patterned like <code>MBP1_YEAST</code>. <code>P39678</code> is the corresponding UniProt identifier.</ref>. This link is kind of a big deal. It's a cross-reference into [http://www.uniprot.org/uniprot/P39678 '''UniProt'''], the huge protein sequence database maintained by the [http://www.ebi.ac.uk/ '''EBI''' (European Bioinformatics Institute)], which is the NCBI's counterpart in Europe. SwissProt entries have the highest annotation standard overall and are expertly curated. Many Webservices that we will encounter, work with UniProt ID's (e.g. <code>P39678.1</code>), rather than RefSeq. But it used to be until recently that the two databases did not link to each other, mostly for reasons of funding politics. It's great to see that this divide has now been overcome.
+
* there is a '''SwissProt''' sequence [http://www.ncbi.nlm.nih.gov/protein/P39678.1 <code>P39678.1</code>]<ref>Actually the "real" SwissProt identifier would be patterned like <code>MBP1_YEAST</code>. <code>P39678</code> is the corresponding UniProt identifier.</ref>. This link is kind of a big deal. It's a cross-reference into [http://www.uniprot.org/uniprot/P39678 '''UniProt'''], the huge protein sequence database maintained by the [http://www.ebi.ac.uk/ '''EBI''' (European Bioinformatics Institute)], which is the NCBI's counterpart in Europe. SwissProt entries have the highest annotation standard overall and are expertly curated. Many Webservices work with UniProt ID's (e.g. <code>P39678.1</code>), rather than NCBI IDs such as a RefSeq ID. But it used to be until recently that the two databases did not link to each other, mostly for reasons of funding politics. It's great to see that this divide has now been overcome.
  
 
       <!-- Column 1 end -->
 
       <!-- Column 1 end -->
Line 177: Line 216:
  
  
*Note that the entries of the same sequence in different yeast strains. These don't '''have''' to be identical, they just happen to be. Sometimes we find identical sequences in quite divergent species. Therefore I would not actually consider [http://www.ncbi.nlm.nih.gov/protein/EIW11153.1 <code>EIW11153.1</code>], [http://www.ncbi.nlm.nih.gov/protein/AJU86440.1 <code>AJU86440.1</code>], [http://www.ncbi.nlm.nih.gov/protein/AJU58508.1 <code>AJU58508.1</code>], and [http://www.ncbi.nlm.nih.gov/protein/AJU61971.1 <code>AJU61971.1</code>] to be identical proteins, although they have the same sequence.
+
*Note that while all of these entries come from ''Saccharomyces cerevisiae'', they have been sequenced in different yeast strains. Thus they don't '''have''' to be identical (except for the fact that this is a table of identical sequences); such sequences might be slightly different because the strains are not genetically identical. And sometimes we find identical sequences in quite divergent species. Therefore I would not actually consider [http://www.ncbi.nlm.nih.gov/protein/EIW11153.1 <code>EIW11153.1</code>], [http://www.ncbi.nlm.nih.gov/protein/AJU86440.1 <code>AJU86440.1</code>], [http://www.ncbi.nlm.nih.gov/protein/AJU58508.1 <code>AJU58508.1</code>], and [http://www.ncbi.nlm.nih.gov/protein/AJU61971.1 <code>AJU61971.1</code>] to be identical proteins, although they have the same sequence.
  
  
Line 189: Line 228:
 
# Note down the RefSeq ID and the UniProt (SwissProt) ID of Mbp1 in your journal.
 
# Note down the RefSeq ID and the UniProt (SwissProt) ID of Mbp1 in your journal.
 
# Follow the link to the RefSeq entry [http://www.ncbi.nlm.nih.gov/protein/NP_010227.1 <code>NP_010227.1</code>].
 
# Follow the link to the RefSeq entry [http://www.ncbi.nlm.nih.gov/protein/NP_010227.1 <code>NP_010227.1</code>].
# Explore the page and follow these links (note the contents in your journal):
+
# Explore the page and explore these links (note the contents in your journal):
 
## Under "Analyze this Sequence": [http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi?INPUT_TYPE=live&SEQUENCE=NP_010227.1 Identify Conserved Domains]
 
## Under "Analyze this Sequence": [http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi?INPUT_TYPE=live&SEQUENCE=NP_010227.1 Identify Conserved Domains]
 
## Under "Protein 3D Structure": [http://www.ncbi.nlm.nih.gov/protein?Db=structure&DbFrom=protein&Cmd=Link&LinkName=protein_structure&LinkReadableName=Structure&IdsFromResult=6320147 See all 3 structures...]
 
## Under "Protein 3D Structure": [http://www.ncbi.nlm.nih.gov/protein?Db=structure&DbFrom=protein&Cmd=Link&LinkName=protein_structure&LinkReadableName=Structure&IdsFromResult=6320147 See all 3 structures...]
Line 216: Line 255:
  
 
# Return back to the [http://www.ncbi.nlm.nih.gov/protein/NP_010227.1 '''MBP1''' RefSeq record].
 
# Return back to the [http://www.ncbi.nlm.nih.gov/protein/NP_010227.1 '''MBP1''' RefSeq record].
#  Find the [http://www.ncbi.nlm.nih.gov/pubmed?LinkName=protein_pubmed_weighted&from_uid=1431055 '''PubMed'''] link under '''Related information''' in the right-hand margin and explore it. "PubMed (Weighted)" applies a weighting algorithm to find broadly relevant information - an example of literature data mining. PubMed(weighted) appears to give a pretty good overview of systems-biology type, cross-sectional and functional information.
+
#  Find the [https://www.ncbi.nlm.nih.gov/pubmed?LinkName=protein_pubmed&from_uid=6320147 '''PubMed'''] link under '''Related information''' in the right-hand margin and explore it. These are links that are directly related to the NP_010227 sequence in the database.
 +
#  Next follow the link to [https://www.ncbi.nlm.nih.gov/pubmed?LinkName=protein_pubmed_weighted&from_uid=6320147  "PubMed (Weighted)"] which applies a weighting algorithm to find broadly relevant information - an example of literature data mining. PubMed(weighted) appears to give a pretty good overview of systems-biology type, cross-sectional and functional information.
  
 
But it does not find '''all''' Mbp1 related literature.
 
But it does not find '''all''' Mbp1 related literature.
  
 
# On any of the PubMed pages open the '''Advanced''' query page and study the keywords that apply to PubMed searches. These are actually quite important and useful to remember.  Make yourself familiar with the section on [http://www.ncbi.nlm.nih.gov/books/NBK3827/ '''Search field descriptions and tags'''] in the PubMed help document, (in particular <tt>[DP]</tt>, <tt>[AU]</tt>, <tt>[TI]</tt>, and <tt>[TA]</tt>), how you use the ''History'' to combine searches, and the use of <tt>AND</tt>, <tt>OR</tt>, <tt>NOT</tt> and brackets. Understand how you can restrict a search to ''reviews'' only, and what the link to '''Related citations...''' is useful for<ref>A good way to consolidate your knowledge is to summarize it for everyone on the Entrez page of the Student Wiki, or enhance the information you find there.</ref>.
 
# On any of the PubMed pages open the '''Advanced''' query page and study the keywords that apply to PubMed searches. These are actually quite important and useful to remember.  Make yourself familiar with the section on [http://www.ncbi.nlm.nih.gov/books/NBK3827/ '''Search field descriptions and tags'''] in the PubMed help document, (in particular <tt>[DP]</tt>, <tt>[AU]</tt>, <tt>[TI]</tt>, and <tt>[TA]</tt>), how you use the ''History'' to combine searches, and the use of <tt>AND</tt>, <tt>OR</tt>, <tt>NOT</tt> and brackets. Understand how you can restrict a search to ''reviews'' only, and what the link to '''Related citations...''' is useful for<ref>A good way to consolidate your knowledge is to summarize it for everyone on the Entrez page of the Student Wiki, or enhance the information you find there.</ref>.
# Now find publications from anywhere in PubMed with Mbp1 '''in the title'''. In the result list, follow the links for the two ''Biochemistry'' papers, by Taylor ''et al.'' (2000) and by Deleeuw ''et al.'' (2008). Download the PDFs, we will need them later.
+
# Now find publications from anywhere in PubMed with Mbp1 '''in the title'''. In the result list, follow the links for the two ''Biochemistry'' papers, by Taylor ''et al.'' (2000) and by Deleeuw ''et al.'' (2008). Download the PDFs, these manuscripts will be needed in a later unit.
  
 
}}
 
}}

Revision as of 23:10, 2 October 2017

The NCBI Database and Services


 

Keywords:  The NCBI databases and services


 



 


Caution!

This unit is under development. There is some content here, but it is incomplete and/or may change significantly: links may lead nowhere, the content is likely going to be rearranged, and objectives, deliverables etc. may be incomplete or missing. Do not work with this material until it is updated to "live" status.


 


Abstract

...


 


This unit ...

Prerequisites

You need to complete the following units before beginning this one:


 


Objectives

...


 


Outcomes

...


 


Deliverables

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.


 


Evaluation

This learning unit can be evaluated for a maximum of 6 marks. If you want to submit tasks for this unit for credit you have the following options:

Short Report option
In the BIN-Storing_data unit you found the protein in YFO that is most similar to yeast Mbp1. Navigate to the NCBI Protein page for the RefSeq entry of this protein. Explore the links that go out from the page. Assess which resources are independently useful, and which resources merely recapitulate information that relates to yeast Mbp1, the protein that you originally searched with.
  1. Create a new page on the student Wiki as a subpage of your User Page.
  2. Write a short report on your findings. The goal of this short report is to develop a sense of where a page like this one collects original information, and where it merely acts as a record of annotation transfer. Refer to the "General" section of the marking rubrics for aspects of the report that will be evaluated.
  3. When you are done with everything, add the following category tag to the page:
[[Category:EVAL-BIN-NCBI]]
Do not change your submission page after this tag has been added. The page will be marked and the category tag will be removed by the instructor.


Quiz option
Open the signup-page for the quiz for this unit (linked from here) and add your name. Your name must be signed up by 12:00 of the day of the Quiz to ensure copies of the quiz are available for all participants.
include("ABC-unit_components.wtxt", section = "quiz-mechanics")


Option to write a "Self-Evaluation Question"
Write a "Self-evaluation Question" that explores a significant, non-trivial aspect of studying how to work with NCBI resources within this learning unit. Ensure that the question is feasible, given the existing content of the unit - or coordinate an extension of the contents with your instructor. Ensure your question pursues a high-level learning goal: it should allow others to demonstrate understanding, critical analysis, and/or the capacity to integrate and synthesize knowledge, not merely test memorization. Ensure that your question is specific, not ambiguous, vague or tangential to the contents. Ensure you are testing valuable knowledge and skills, not Cargo Cult. Apply the marking rubrics in spirit to satisfy yourself of the quality of your contribution. Obviously, details of evaluation will vary with the question. Use the format that you find on other learning unit pages, e.g. '''here''' - but don't assume those questions are models of excellent contributions. Of course the question won't be complete without your model solution.
  1. Create a new page on the student Wiki as a subpage of your User Page. Develop your question there.
  2. When you are done with developing this content, add the following category tag to the page:
[[Category:EVAL-BIN-NCBI]]
Do not change your submission page after this tag has been added. The page will be marked and the category tag will be removed by the instructor.



-->


 


Contents

The NCBI (National Center for Biotechnology Information) is the largest international provider of data for genomics and molecular biology. With its annual budget of several hundred million dollars, it organizes a challenging program of data management at the largest scale, it makes its data freely and openly available over the Internet, worldwide, and it runs significant in-house research projects.

In this unit we explore some of the offerings of the NCBI that can contribute to our objective of studying a particular gene in an organism of interest.


Task:

  • Read the introductory article on NCBI database resources:
NCBI Resource Coordinators (2017) Database Resources of the National Center for Biotechnology Information. Nucleic Acids Res 45:D12-D17. (pmid: 27899561)

The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts for published life science journals. The Entrez system provides search and retrieval operations for most of these data from 37 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. New resources released in the past year include iCn3D, MutaBind, and the Antimicrobial Resistance Gene Reference Database; and resources that were updated in the past year include My Bibliography, SciENcv, the Pathogen Detection Project, Assembly, Genome, the Genome Data Viewer, BLAST and PubChem. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.


 

Entrez

Task:
Remember to document your activities as lab-notes on your Wiki.

  1. Access the NCBI website at http://www.ncbi.nlm.nih.gov/ [1]
  2. In the search bar, enter mbp1 and click Search.
  3. On the resulting page, look for the Protein section and click on the link. What do you find?


The result page of your search in "All Databases" is the "Global Query Result Page" of the Entrez system. If you follow the "Protein" link, you get taken to the more than 610 sequences in the NCBI Protein database that contain the keyword "mbp1". But when you look more closely at the results, you see that the result is quite non-specific: searching only by keyword retrieves a multiubiquitin chain binding protein in Arabidopsis, myrosinase binding proteins, bacterial mannose binding proteins, a Saccharomyces protein (perhaps one that we are actually interested in), maltose binding proteins, myelin basic proteins - and much more. There must be a more specific way to search, and indeed there is. Time to read up on the Entrez system.


Task:

  1. Navigate to the Entrez Help Page and read about the Entrez system, especially about:
    1. Boolean operators,
    2. wildcards,
    3. limits, and
    4. filters.
  2. You should minimally understand:
    1. How to search by keyword;
    2. How to search by gene or protein name;
    3. How to restrict a search to a particular organism.

You should also know that these filters are partly database-specific, i.e. not all of them work in all databases.

Don't skip this part: you should know the more common options and how to find the others. It would be great to have a synopsis of the important fields for reference, wouldn't it? We have started building one on the Student Wiki (A synopsis of Entrez codes). Currently, I think it lacks structure and examples. Contributors and editors welcome!


Keyword and organism searches are pretty universal, but apart from that, each NCBI database has its own set of specific fields. You can access the keywords via the Advanced Search interface of any of the database pages.
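The field-restricted searches described above can also be composed in code. The following is a minimal Python sketch, not part of the original unit: the helper names `fielded_term`, `build_query` and `esearch_url` are hypothetical, and the field names `Protein Name` and `Organism` are assumed to be valid for the Protein database (check the Advanced Search page of other databases before reusing them). It only builds a query string and a request URL for the `esearch` E-utility; it does not contact the NCBI.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities "esearch" endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded_term(value: str, field: str) -> str:
    """Return one Entrez term restricted to a field, e.g. Mbp1[Protein Name]."""
    # Multi-word values are quoted so that they are searched as a phrase.
    if " " in value:
        value = f'"{value}"'
    return f"{value}[{field}]"


def build_query(*terms: str, operator: str = "AND") -> str:
    """Join fielded terms with an Entrez Boolean operator (AND, OR, NOT)."""
    return f" {operator} ".join(terms)


def esearch_url(database: str, query: str, retmax: int = 20) -> str:
    """Build (but do not send) an esearch request URL for one database."""
    params = {"db": database, "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


query = build_query(
    fielded_term("Mbp1", "Protein Name"),
    fielded_term("Saccharomyces cerevisiae", "Organism"),
)
print(query)
print(esearch_url("protein", query))
```

Keeping the query construction separate from the URL makes it easy to paste the same query string into the search field of the web interface and compare the results.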


 

Protein Sequence


 

Task:
With this knowledge we can restrict the search to proteins called "Mbp1" that occur in Baker's Yeast. Return to the Global Search page and in the search field, type:

Mbp1[protein name] AND
"Saccharomyces cerevisiae"[organism]


This finds two entries in the Protein database. Follow the link to the result CAA98618.1, a data record in GenBank flat file format[2]. The database identifier CAA98618.1 tells you that this is a record in the GenPept database. There are actually several, identical versions of this sequence in the NCBI's holdings. A link to "Identical Protein Groups" Database near the top of the record shows you what these are:


Some of the sequences represent duplicate entries of the same gene (Mbp1) in the same strain (S288c) of the same species (S. cerevisiae). In particular:


  • there are several records for which the source is the INSDC, these are archival entries, submitted by independent yeast genome research projects;
  • there are two entries in the RefSeq database linking to the same protein: NP_010227.1. One is derived from genome sequence, the other from mRNA. This RefSeq entry is the preferred version of a sequence for our purposes. RefSeq is a curated, non-redundant database which solves a number of problems of archival databases. You can recognize RefSeq identifiers – they always look like NP_12345.1, NM_12345.1, XP_12345.1, NC_12345.1 etc. This reflects whether the sequence is protein, mRNA or genomic, and inferred or obtained through experimental evidence.
  • there is a SwissProt sequence P39678.1[3]. This link is kind of a big deal. It's a cross-reference into UniProt, the huge protein sequence database maintained by the EBI (European Bioinformatics Institute), which is the NCBI's counterpart in Europe. SwissProt entries have the highest annotation standard overall and are expertly curated. Many web services work with UniProt IDs (e.g. P39678.1) rather than NCBI IDs such as a RefSeq ID. Until recently, however, the two databases did not link to each other, mostly for reasons of funding politics. It's great to see that this divide has now been overcome.


  • Note that while all of these entries come from Saccharomyces cerevisiae, they have been sequenced in different yeast strains. Thus they don't have to be identical (except for the fact that this is a table of identical sequences); such sequences might be slightly different because the strains are not genetically identical. And sometimes we find identical sequences in quite divergent species. Therefore I would not actually consider EIW11153.1, AJU86440.1, AJU58508.1, and AJU61971.1 to be identical proteins, although they have the same sequence.
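The RefSeq prefix convention mentioned in the list above lends itself to a small lookup. This is an illustrative Python sketch only: the function name `describe_accession` is hypothetical, the descriptions are a paraphrase of the NCBI's prefix conventions, and only the four prefixes named in the text are covered (RefSeq defines more).

```python
# Meaning of the RefSeq accession prefixes named in the text.
# Assumption: short paraphrased descriptions, not the NCBI's official wording.
REFSEQ_PREFIXES = {
    "NP_": "protein, known record",
    "XP_": "protein, predicted (model) record",
    "NM_": "mRNA, known record",
    "NC_": "genomic, complete molecule such as a chromosome",
}


def describe_accession(accession: str) -> str:
    """Describe an identifier by its RefSeq prefix, if it has one."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "not one of the RefSeq prefixes listed here")


for identifier in ("NP_010227.1", "CAA98618.1", "P39678.1"):
    print(identifier, "->", describe_accession(identifier))
```

Identifiers without an underscore after the second character, such as the GenPept and UniProt IDs in the loop, fall through to the default message.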


Note all the .1 suffixes of the sequence identifiers. These are version numbers. Two observations:

  1. It's great that version numbers are now used throughout the NCBI database. This is good database engineering practice because it's really important for reproducible research that updates to database records are possible, but recognizable. When working with data you always must provide for the possibility of updates, and manage the changes transparently and explicitly. Proper versioning should be a part of all data models. In fact, the NCBI is currently phasing out its internal unique identifiers – the GI number – in favour of accession-number.version IDs.
  2. When searching, or for general use, you can (and should) omit the version number, i.e. use NP_010227 or P39678, not NP_010227.1 or P39678.1. This way the database system will resolve the identifier to the most current, highest version number (unless you want the older one, of course).
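
If you handle identifiers in a script, separating the accession from its version is a one-line string operation. The sketch below is only an illustration of the accession.version convention; the function names `split_accession` and `unversioned` are hypothetical.

```python
def split_accession(identifier):
    """Split 'accession.version' into (accession, version); version is None if absent."""
    accession, dot, version = identifier.partition(".")
    if dot and version.isdigit():
        return accession, int(version)
    # No numeric ".version" suffix: return the identifier unchanged.
    return identifier, None


def unversioned(identifier):
    """Return the accession without its version suffix, for use in searches."""
    return split_accession(identifier)[0]


for identifier in ["NP_010227.1", "P39678.1", "P39678"]:
    print(identifier, "->", unversioned(identifier))
```

Keeping the version as a separate value, rather than discarding it, lets you record exactly which version of a record an analysis was based on.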


Task:

  1. Note down the RefSeq ID and the UniProt (SwissProt) ID of Mbp1 in your journal.
  2. Follow the link to the RefSeq entry NP_010227.1.
  3. Explore the page and explore these links (note the contents in your journal):
    1. Under "Analyze this Sequence": Identify Conserved Domains
    2. Under "Protein 3D Structure": See all 3 structures...
    3. Under "Pathways for the MBP1 gene": Cell cycle - yeast
    4. Under "Related information": Proteins with Similar Sequence

As we see, this is a good start page to explore all kinds of databases at the NCBI via cross-references.
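
The same cross-references can also be requested programmatically through the NCBI E-utilities. The following sketch only builds an ELink request URL from the protein database to PubMed and does not contact the server. It assumes the documented `dbfrom`, `db` and `id` parameters of `elink.fcgi`; the function name `elink_url` is made up for this example. The version suffix is kept here on the assumption that E-utilities expects accession.version identifiers for sequence databases.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def elink_url(accession, dbfrom="protein", db="pubmed"):
    """Build an ELink URL that asks for records in `db` linked to `accession` in `dbfrom`."""
    params = {"dbfrom": dbfrom, "db": db, "id": accession}
    return EUTILS_BASE + "elink.fcgi?" + urlencode(params)


print(elink_url("NP_010227.1"))
```

Pasting the printed URL into a browser should return an XML list of linked PubMed IDs, but check the current E-utilities documentation and usage guidelines before sending requests from a script.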



 

PubMed

Arguably one of the most important databases in the life sciences is PubMed and this is a good time to look at PubMed in a bit more detail.


Task:

  1. Return to the MBP1 RefSeq record.
  2. Find the PubMed link under Related information in the right-hand margin and explore it. These are links that are directly related to the NP_010227 sequence in the database.
  3. Next follow the link to "PubMed (Weighted)", which applies a weighting algorithm to find broadly relevant information - an example of literature data mining. PubMed (Weighted) appears to give a pretty good overview of systems-biology type, cross-sectional and functional information.

But it does not find all Mbp1-related literature.

  1. On any of the PubMed pages open the Advanced query page and study the keywords that apply to PubMed searches. These are actually quite important and useful to remember. Make yourself familiar with the section on Search field descriptions and tags in the PubMed help document (in particular [DP], [AU], [TI], and [TA]), with how you use the History to combine searches, and with the use of AND, OR, NOT and brackets. Understand how you can restrict a search to reviews only, and what the link to Related citations... is useful for[4].
  2. Now find publications from anywhere in PubMed with Mbp1 in the title. In the result list, follow the links for the two Biochemistry papers, by Taylor et al. (2000) and by Deleeuw et al. (2008). Download the PDFs; these manuscripts will be needed in a later unit.
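
Field tags and Boolean operators combine into a single query string, which you can type into the PubMed search box or send to the E-utilities ESearch service. The sketch below assembles such a query and the corresponding request URL without contacting the server. The helper names `tagged`, `all_of` and `esearch_url` are hypothetical, and the restriction to the journal Biochemistry is just one possible way to narrow the title search.

```python
from urllib.parse import urlencode

# ESearch endpoint of the NCBI E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, tag):
    """Attach a PubMed field tag to a term, e.g. tagged('Mbp1', 'TI') gives 'Mbp1[TI]'."""
    return f"{term}[{tag}]"


def all_of(*clauses):
    """Combine clauses with AND, wrapped in brackets so they can be nested."""
    return "(" + " AND ".join(clauses) + ")"


def esearch_url(query, retmax=20):
    """Build an ESearch URL for a PubMed query; urlencode escapes brackets and spaces."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Mbp1 in the title, published in the journal Biochemistry.
query = all_of(tagged("Mbp1", "TI"), tagged("Biochemistry", "TA"))
print(query)
print(esearch_url(query))
```

The first printed line is the query as you would enter it on the PubMed page; the second is the same query as a URL-encoded ESearch request.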


 


 


Further reading, links and resources

 


Notes

  1. If you find this URL hard to remember, consider the acronyms:
    ncbi.nlm.nih.gov
    NCBI: National Center for Biotechnology Information
    NLM: National Library of Medicine
    NIH: National Institutes of Health
    GOV: the US GOVernment top-level domain
  2. If there is only a single match, you will be taken directly to the page.
  3. Actually the "real" SwissProt identifier would be patterned like MBP1_YEAST. P39678 is the corresponding UniProt identifier.
  4. A good way to consolidate your knowledge is to summarize it for everyone on the Entrez page of the Student Wiki, or enhance the information you find there.


 


Self-evaluation

 



 




 

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.



 

About ...
 
Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2017-08-05

Version:

0.1

Version history:

  • 0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.