BIN-NCBI
The NCBI Database and Services
Keywords: The NCBI databases and services
This unit is under development. There is some content here, but it is incomplete and/or may change significantly: links may lead nowhere, the content is likely to be rearranged, and objectives, deliverables etc. may be incomplete or missing. Do not work with this material until it is updated to "live" status.
Abstract
...
This unit ...
Prerequisites
You need to complete the following units before beginning this one:
Objectives
...
Outcomes
...
Deliverables
- Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
- Journal: Document your progress in your course journal.
- Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation
Evaluation: Integrated Unit
- This unit should be submitted for evaluation for a maximum of 10 marks. Details TBD.
Contents
Task:
- Read the introductory notes on public databases and services at the US National Center for Biotechnology Information (NCBI).
The NCBI (National Center for Biotechnology Information) is the largest international provider of data for genomics and molecular biology. With its annual budget of several hundred million dollars, it organizes a challenging program of data management at the largest scale, it makes its data freely and openly available over the Internet, worldwide, and it runs significant in-house research projects.
In this unit we explore some of the offerings of the NCBI that can contribute to our objective of studying a particular gene in an organism of interest.
Entrez
Task:
Remember to document your activities as lab-notes on your Wiki.
- Access the NCBI website at http://www.ncbi.nlm.nih.gov/ [1]
- In the search bar, enter mbp1 and click Search.
- On the resulting page, look for the Protein section and click on the link. What do you find?
The result page of your search in "All Databases" is the "Global Query Result Page" of the Entrez system. If you follow the "Protein" link, you get taken to the more than 530 sequences in the NCBI Protein database that contain the keyword "mbp1". But when you look more closely at the results, you see that the result is quite non-specific: searching only by keyword retrieves a multiubiquitin chain binding protein in Arabidopsis, bacterial mannose binding proteins, a Saccharomyces protein (perhaps one that we are actually interested in), maltose binding proteins, myelin basic proteins - and much more. There must be a more specific way to search, and indeed there is. Time to read up on the Entrez system.
Task:
- Navigate to the Entrez Help Page and read about the Entrez system, especially about:
- Boolean operators,
- wildcards,
- limits, and
- filters.
- You should minimally understand:
- How to search by keyword;
- How to search by gene or protein name;
- How to restrict a search to a particular organism.
Don't skip this part: you should know the more common options and how to find the others. It would be great to have a synopsis of the important fields for reference, wouldn't it? Why don't you go and make one: I have put a template page on the Student Wiki (A synopsis of Entrez codes). Contributors and editors welcome!
Keyword and organism searches are pretty universal, but apart from that, each NCBI database has its own set of specific search fields. You can look up the fields of a database via the Advanced Search interface on any of the database pages.
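If you later want to compose such searches in a script rather than typing them, the pattern is simple: a term, optionally followed by a field name in square brackets, with terms joined by Boolean operators. Here is a minimal Python sketch of the three kinds of search listed above. The helper names tagged() and all_of() are made up for this illustration; [Protein Name] and [Organism] are field names you can look up for the Protein database in its Advanced Search interface.

```python
def tagged(term: str, field: str) -> str:
    """Restrict a search term to one Entrez field, e.g. Mbp1[Protein Name]."""
    # Quote multi-word terms so they are searched as a phrase.
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{field}]"


def all_of(*terms: str) -> str:
    """Combine search terms with the Boolean operator AND."""
    return " AND ".join(terms)


# 1. Plain keyword search: no field restriction at all.
keyword_query = "mbp1"

# 2. Search by protein name only.
name_query = tagged("Mbp1", "Protein Name")

# 3. The same search, restricted to one organism.
organism_query = all_of(name_query, tagged("Saccharomyces cerevisiae", "Organism"))

for query in (keyword_query, name_query, organism_query):
    print(query)
```

The strings that are printed can be pasted into the search bar as they are.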
Protein Sequence
Task:
With this knowledge we can restrict the search to proteins called "Mbp1" that occur in Baker's Yeast. Return to the Global Search page and in the search field, type:
Mbp1[protein name] AND "Saccharomyces cerevisiae"[organism]
This finds two proteins. Follow the link to the result CAA98618.1, a data record in GenBank flat file format[2]. The database identifier CAA98618.1 tells you that this is a record in the GenPept database. There are actually several identical versions of this sequence in the NCBI's holdings. A link to "Identical Proteins" near the top of the record shows you what these are:
Some of the sequences represent duplicate entries of the same gene (Mbp1) in the same strain (S288c) of the same species (S. cerevisiae). In particular:
- there are seven records for which the source is the INSDC; these are archival entries, submitted by independent yeast genome research projects;
- there are two entries in the RefSeq database linking to the same protein, NP_010227.1: one is linked to its genomic sequence, the other to its mRNA. This RefSeq entry is the preferred version of the sequence for us to work with. RefSeq is a curated, non-redundant database which solves a number of problems of archival databases. You can recognize RefSeq identifiers by their pattern: they always look like NP_12345.1, NM_12345.1, XP_12345.1, NC_12345.1 etc. The prefix reflects whether the sequence is protein, mRNA or genomic, and whether it was inferred or obtained through experimental evidence.
- there is a SwissProt sequence P39678.1[3]. This link is kind of a big deal. It's a cross-reference into UniProt, the huge protein sequence database maintained by a consortium that includes the EBI (European Bioinformatics Institute), which is the NCBI's counterpart in Europe. SwissProt entries have the highest annotation standard overall and are expertly curated. Many Web services that we will encounter work with UniProt IDs (e.g. P39678.1) rather than RefSeq IDs. Until recently, however, the two databases did not link to each other, mostly for reasons of funding politics. It's great to see that this divide has now been overcome.
- Note the entries for the same sequence in different yeast strains. These don't have to be identical, they just happen to be. Sometimes we find identical sequences in quite divergent species. Therefore I would not actually consider EIW11153.1, AJU86440.1, AJU58508.1, and AJU61971.1 to be identical proteins, although they have the same sequence.
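Since the RefSeq prefix carries meaning, it can be read off mechanically. The following Python sketch does this for a handful of common prefixes. The function name describe_refseq() and the table are assumptions for illustration only, and the table is not exhaustive; the rule that prefixes beginning with X mark computationally predicted records follows the usual RefSeq convention.

```python
import re

# Molecule type for some common RefSeq prefixes (not an exhaustive list).
MOLECULE_BY_PREFIX = {
    "NP": "protein",
    "XP": "protein",
    "NM": "mRNA",
    "XM": "mRNA",
    "NR": "non-coding RNA",
    "XR": "non-coding RNA",
    "NC": "genomic",
    "NG": "genomic",
}

# Two capital letters, an underscore, digits, and an optional ".version".
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def describe_refseq(identifier):
    """Return a small description of a RefSeq identifier, or None if it is not one."""
    match = REFSEQ_PATTERN.match(identifier)
    if match is None:
        return None
    prefix = match.group(1)
    molecule = MOLECULE_BY_PREFIX.get(prefix)
    if molecule is None:
        return None  # a prefix this sketch does not know about
    return {
        "prefix": prefix,
        "molecule": molecule,
        "predicted": prefix.startswith("X"),  # X... = model (predicted) record
    }


for identifier in ("NP_010227.1", "CAA98618.1", "P39678.1"):
    print(identifier, describe_refseq(identifier))
```

Only the first of the three identifiers is recognized; the GenPept and SwissProt identifiers do not follow the RefSeq pattern, so the function returns None for them.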
Note all the .1 suffixes of the sequence identifiers. These are version numbers. Two observations:
- It's great that version numbers are now used throughout the NCBI databases. This is good database engineering practice because it's really important for reproducible research that updates to database records are possible, but recognizable. When working with data you always must provide for the possibility of updates, and manage the changes transparently and explicitly. Proper versioning should be a part of all data models. In fact, the NCBI is currently phasing out its internal unique identifiers (the GI numbers) in favour of accession-number.version IDs.
- When searching, or for general use, you can (and should) omit the version number, i.e. use NP_010227 or P39678, not NP_010227.1 or P39678.1. This way the database system will resolve the identifier to the most current, highest version number (unless you want the older one, of course).
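If you keep identifiers in your notes or scripts, separating the accession from its version is a one-line job. A minimal Python sketch follows; the function name split_accession() is made up for this illustration.

```python
def split_accession(identifier):
    """Split 'accession.version' into (accession, version).

    The version is returned as an int, or as None if the identifier
    carries no numeric version suffix.
    """
    identifier = identifier.strip()
    accession, dot, version = identifier.partition(".")
    if dot and version.isdigit():
        return accession, int(version)
    return identifier, None


versioned = ["NP_010227.1", "P39678.1", "CAA98618.1"]

# Keep only the accession part for searching.
unversioned = [split_accession(identifier)[0] for identifier in versioned]
print(unversioned)
```

For reproducibility, record the full versioned identifier in your journal, and use the unversioned form when you search.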
Task:
- Note down the RefSeq ID and the UniProt (SwissProt) ID of Mbp1 in your journal.
- Follow the link to the RefSeq entry NP_010227.1.
- Explore the page and follow these links (note the contents in your journal):
- Under "Analyze this Sequence": Identify Conserved Domains
- Under "Protein 3D Structure": See all 3 structures...
- Under "Pathways for the MBP1 gene": Cell cycle - yeast
- Under "Related information": Proteins with Similar Sequence
As we see, this is a good start page to explore all kinds of databases at the NCBI via cross-references.
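Everything we did on the Web pages in this section can also be requested through the NCBI's programmatic interface, the Entrez Programming Utilities (E-utilities). The Python sketch below only builds the request URLs for the search and for the RefSeq record; it does not contact the server. The function names are assumptions for this illustration. The endpoints esearch.fcgi and efetch.fcgi and the parameters db, term, retmax, id, rettype and retmode are part of the E-utilities; that an unversioned accession is accepted as id is assumed here, in line with the advice on version numbers above.

```python
from urllib.parse import urlencode

# Base URL of the Entrez Programming Utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build a URL that searches one Entrez database with a query string."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db, accession, rettype="gp"):
    """Build a URL that retrieves one record as text.

    For the protein database, rettype "gp" requests the GenPept flat file
    and "fasta" requests the sequence in FASTA format.
    """
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


search_url = esearch_url(
    "protein", 'Mbp1[protein name] AND "Saccharomyces cerevisiae"[organism]'
)
record_url = efetch_url("protein", "NP_010227")
fasta_url = efetch_url("protein", "NP_010227", rettype="fasta")

print(search_url)
print(record_url)
print(fasta_url)
```

You can paste the printed URLs into a browser to see what the server returns. If you script such requests, read the NCBI's usage guidelines first.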
PubMed
Arguably one of the most important databases in the life sciences is PubMed and this is a good time to look at PubMed in a bit more detail.
Task:
- Return to the MBP1 RefSeq record.
- Find the PubMed link under Related information in the right-hand margin and explore it. "PubMed (Weighted)" applies a weighting algorithm to find broadly relevant information, an example of literature data mining. It appears to give a pretty good overview of systems-biology type, cross-sectional and functional information, but it does not find all Mbp1-related literature.
- On any of the PubMed pages, open the Advanced query page and study the keywords that apply to PubMed searches. These are actually quite important and useful to remember. Familiarize yourself with the section on Search field descriptions and tags in the PubMed help document (in particular [DP], [AU], [TI], and [TA]), with how you use the History to combine searches, and with the use of AND, OR, NOT and brackets. Understand how you can restrict a search to reviews only, and what the link to Related citations... is useful for[4].
- Now find publications from anywhere in PubMed with Mbp1 in the title. In the result list, follow the links for the two Biochemistry papers, by Taylor et al. (2000) and by Deleeuw et al. (2008). Download the PDFs; we will need them later.
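To see how the PubMed tags and Boolean operators combine, here is a small Python sketch that assembles a few query strings for this task. The helper names are made up for this illustration. [TI], [TA], [AU] and [DP] are the tags named above; [PT] (publication type) is an addition that is not mentioned in this unit, so check it against the PubMed help document before you rely on it.

```python
def field(term, tag):
    """Attach a PubMed search field tag to a term, e.g. Mbp1[TI]."""
    return f"{term}[{tag}]"


def all_of(*terms):
    """Combine terms with AND."""
    return " AND ".join(terms)


def any_of(*terms):
    """Combine terms with OR; parentheses keep the group together."""
    return "(" + " OR ".join(terms) + ")"


# Mbp1 in the title.
in_title = field("Mbp1", "TI")

# ... published in the journal Biochemistry.
in_journal = all_of(in_title, field("Biochemistry", "TA"))

# ... by either of two authors.
by_author = all_of(in_title, any_of(field("Taylor", "AU"), field("Deleeuw", "AU")))

# ... published in a range of years (colon-separated range).
in_years = all_of(in_title, field("2000:2008", "DP"))

# ... reviews only (assumption: publication type tag [PT]).
reviews_only = all_of(in_title, field("Review", "PT"))

for query in (in_title, in_journal, by_author, in_years, reviews_only):
    print(query)
```

Paste the printed strings into the PubMed search bar and compare the result lists with what you get from the Advanced query page.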
Further reading, links and resources
Notes
- ↑ If you find this URL hard to remember, consider the acronyms:
- ncbi.nlm.nih.gov
- NCBI: National Center for Biotechnology Information
- NLM: National Library of Medicine
- NIH: National Institutes of Health
- GOV: the US GOVernment top-level domain
- ↑ If there is only a single match, you will be taken directly to the page.
- ↑ Actually the "real" SwissProt identifier would be patterned like MBP1_YEAST. P39678 is the corresponding UniProt identifier.
- ↑ A good way to consolidate your knowledge is to summarize it for everyone on the Entrez page of the Student Wiki, or enhance the information you find there.
Self-evaluation
If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2017-08-05
Modified:
- 2017-08-05
Version:
- 0.1
Version history:
- 0.1 First stub
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.