Expected Preparations:
Keywords: The NCBI databases and services
Objectives:
This unit will …
Outcomes:
After working through this unit you …
Deliverables:
Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.
Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation: Material based on this learning unit can be submitted for formative feedback. To submit:
The NCBI hosts some of the world’s most important bioinformatics databases and services. This learning unit explores them in the context of our search for information on yeast Mbp1 and its homologue in MYSPE.
The NCBI (National Center for Biotechnology Information) is one of the two largest, international providers of data for genomics and molecular biology (the EBI is the other). With its annual budget of several hundred million dollars, it organizes a challenging program of data management at the largest scale, it makes its data freely and openly available over the Internet, worldwide, and it runs significant in-house research projects.
In this unit we explore some of the offerings of the NCBI that can contribute to our objective of studying a particular gene in an organism of interest.
Task…
Sayers, Eric W et al. (2020). “Database resources of the National Center for Biotechnology Information”. Nucleic Acids Research 48(D1):D9–D16. [PMID: 31602479] [DOI: 10.1093/nar/gkz899]
Task…
Remember to document your activities as lab-notes on your Wiki.
mbp1 and click Search.
The result page of your search in “All Databases” is the “Global Query Result Page” of the Entrez system. If you follow the “Protein” link, you get taken to the more than 4,000 sequences in the NCBI Protein database that contain the keyword “mbp1”. But when you look more closely at the results, you see that the result is quite non-specific: searching only by keyword retrieves a multiubiquitin chain binding protein in Arabidopsis, myrosinase binding proteins, bacterial mannose binding proteins, a Saccharomyces protein (perhaps one that we are actually interested in), maltose binding proteins, myelin basic proteins, and much more. There must be a more specific way to search, and indeed there is. Time to read up on the Entrez system.
Task…
And you should know that these filters are in part database specific, i.e. not all of them will work in all databases.
Don’t skip this part, you should know the more common options and how to find the others. It would be great to have a synopsis of the important fields for reference, wouldn’t it? We have started building one on the Student Wiki (A synopsis of Entrez codes). Currently, I think it lacks structure, and examples. Contributors and editors welcome!
Keyword and organism searches are pretty universal, but apart from that, each NCBI database has its own set of specific fields. You can access the keywords via the Advanced Search interface of any of the database pages.
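To make the idea of field-restricted searching concrete, here is a minimal sketch that assembles a field-tagged query into an NCBI E-utilities esearch URL. The field tags [Protein Name] and [Organism], the query string itself, and the helper name buildEsearchUrl are illustrative assumptions; they are not necessarily the query this unit asks you to type. The snippet only builds the URL and does not contact the NCBI.

```javascript
// Base address of the E-utilities "esearch" service.
const EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";

// Hypothetical helper: build an esearch URL for one database and one query.
// URLSearchParams takes care of encoding spaces, quotes and square brackets.
function buildEsearchUrl(db, term) {
  const params = new URLSearchParams({ db: db, term: term, retmode: "json" });
  return EUTILS_BASE + "?" + params.toString();
}

// Assumed example query: restrict by protein name AND by organism.
const proteinTerm = "Mbp1[Protein Name] AND Saccharomyces cerevisiae[Organism]";
const proteinUrl = buildEsearchUrl("protein", proteinTerm);

console.log(proteinUrl);
```

Because the field tags are in part database specific, the same term may behave differently if you change the db parameter; check the Advanced Search page of the database in question for the fields it supports.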
Task…
With this knowledge we can restrict the search to proteins called “Mbp1” that occur in Baker’s Yeast. Return to the Global Search page and in the search field, type:
This finds three entries in the Protein database. Follow the link to the result CAA98618.1, a data record in GenBank flat file format [1]. The database identifier CAA98618.1 tells you that this is a record in the GenPept database. There are actually several identical versions of this sequence in the NCBI’s holdings. A link to the “Identical Protein Groups” database near the top of the record shows you what these are:
Some of the sequences represent duplicate entries of the same gene (Mbp1) in the same strain (S288c) of the same species (S. cerevisiae). In particular:
there are several records for which the source is the INSDC; these are archival entries, submitted by independent yeast genome research projects;
there are two entries in the RefSeq database linking to the same protein: NP_010227.1. One is derived from genome sequence, the other from mRNA. This RefSeq entry is the preferred version of a sequence for our purposes. RefSeq is a curated, non-redundant database which solves a number of problems of archival databases. You can recognize RefSeq identifiers: they always look like NP_12345.1, NM_12345.1, XP_12345.1, NC_12345.1 etc. The prefix reflects whether the sequence is protein, mRNA or genomic, and whether it was inferred or obtained through experimental evidence.
there is a SwissProt sequence P39678.1 [2]. This link is kind of a big deal. It’s a cross-reference into UniProt, the huge protein sequence database maintained by the EBI (European Bioinformatics Institute), which is the NCBI’s counterpart in Europe. SwissProt entries have the highest annotation standard overall and are expertly curated. Many Web services work with UniProt IDs (e.g. P39678.1), rather than NCBI IDs such as a RefSeq ID. Until recently, however, the two databases did not link to each other, mostly for reasons of funding politics. It’s great to see that this divide has now been overcome.
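The RefSeq prefix convention mentioned in the list above can be captured in a small lookup table. This is only a sketch: it covers just the four prefixes named in the text, the descriptions are my paraphrase of what each prefix stands for, and the function name describeRefSeq is hypothetical.

```javascript
// Assumed, incomplete table: only the prefixes named in the text.
const REFSEQ_PREFIXES = {
  NP: "protein (curated record)",
  XP: "protein (predicted by computational annotation)",
  NM: "mRNA (curated record)",
  NC: "genomic (complete molecule, e.g. a chromosome)",
};

// Return a description for a RefSeq-style identifier such as "NP_010227.1",
// or null if the identifier does not follow the two-letter-underscore pattern
// or its prefix is not in the table above.
function describeRefSeq(id) {
  const match = /^([A-Z]{2})_\d+(\.\d+)?$/.exec(id);
  if (match === null) {
    return null;
  }
  const prefix = match[1];
  return Object.prototype.hasOwnProperty.call(REFSEQ_PREFIXES, prefix)
    ? REFSEQ_PREFIXES[prefix]
    : null;
}

console.log(describeRefSeq("NP_010227.1")); // a RefSeq protein identifier
console.log(describeRefSeq("CAA98618.1"));  // a GenPept identifier: no RefSeq prefix
```

A check like this is handy when you collect identifiers from several sources and want to see at a glance which ones are RefSeq records.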
Note that while all of these entries come from Saccharomyces cerevisiae, they have been sequenced in different yeast strains. Thus they would not have to be identical (except for the fact that this is a table of identical sequences): such related sequences might be slightly different because the strains are, after all, not genetically identical. And sometimes we find identical sequences in quite divergent species. Therefore I would not actually consider EIW11153.1, AJU86440.1, AJU58508.1, and AJU61971.1 to be identical proteins, although they have the same sequence.
Note all the .1 suffixes of the sequence identifiers. These are version numbers. Two observations:
1. It’s great that version numbers are now used throughout the NCBI databases. This is good database engineering practice because it’s really important for reproducible research that updates to database records are possible, but recognizable. When working with data you must always provide for the possibility of updates, and manage the changes transparently and explicitly. Proper versioning should be a part of all datamodels. In fact, the NCBI has recently phased out its internal unique identifiers (the GI number) in favour of accession-number.version IDs everywhere.
2. When searching, or for general use, you can (and should) omit the version number, i.e. use NP_010227 or P39678, not NP_010227.1 or P39678.1. This way the database system will resolve the identifier to the most current, highest version number (unless you want the older one, of course).
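If you handle identifiers in your own scripts, the accession-number.version pattern is easy to take apart. The following is a minimal sketch with a hypothetical function name, splitAccession; it assumes that the version is whatever follows the last dot and consists of digits only.

```javascript
// Split an identifier such as "NP_010227.1" into accession and version.
// Identifiers without a numeric ".N" suffix get version null.
function splitAccession(id) {
  const trimmed = id.trim();
  const match = /^(.+)\.(\d+)$/.exec(trimmed);
  if (match === null) {
    return { accession: trimmed, version: null };
  }
  return { accession: match[1], version: Number(match[2]) };
}

// Use the bare accession for searching, keep the version for your records.
const parts = splitAccession("NP_010227.1");
console.log(parts.accession, parts.version);
```

Recording the version you actually retrieved in your journal, while searching with the unversioned accession, gives you both the current record and a reproducible trail.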
Task…
NP_010227.1.
As we see, this is a good start page to explore all kinds of databases at the NCBI via cross-references.
Arguably one of the most important databases in the life sciences is PubMed and this is a good time to look at PubMed in a bit more detail.
Task…
But it does not find all Mbp1 related literature.
[DP], [AU], [TI], and [TA]), how you use the History to combine searches, and the use of AND, OR, NOT and brackets. Understand how you can restrict a search to reviews only, and what the link to Related citations… is useful for [3].
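Field tags, Boolean operators and brackets combine mechanically, which the following sketch tries to illustrate. The helper names tag and combine are hypothetical, the search terms and the date range are arbitrary examples, and the review[PT] publication-type filter is my assumption about one way to restrict a search to reviews; verify it against the PubMed help pages before relying on it.

```javascript
// Attach a field tag to a term, e.g. tag("Mbp1", "TI") gives "Mbp1[TI]".
function tag(term, field) {
  return term + "[" + field + "]";
}

// Join sub-queries with one Boolean operator and wrap the result in
// brackets, so that the grouping stays explicit when queries are nested.
function combine(operator, ...parts) {
  return "(" + parts.join(" " + operator + " ") + ")";
}

// Title mentions Mbp1, title mentions yeast or Saccharomyces,
// published in an (arbitrary) date range, reviews only.
const organism = combine("OR", tag("yeast", "TI"), tag("Saccharomyces", "TI"));
const pubmedQuery = combine(
  "AND",
  tag("Mbp1", "TI"),
  organism,
  tag("2000:2020", "DP"),
  tag("review", "PT")
);

console.log(pubmedQuery);
```

The printed string can be pasted into the PubMed search field. Building it from parts makes it easier to see which bracket belongs to which operator than typing a long query by hand.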
PubMed usually includes links to full-text articles, but these are often behind a paywall, even though we have access through our library system (one of the top three in the world incidentally). Here is a bookmarklet (a portmanteau of “bookmark” and “applet”) to seamlessly redirect from a paywall page to full access through our library’s “my access” system. The key is to apply a bit of code that “rewrites” the original URL.
javascript:(function(){var url=window.location.href;var re=/\/\/([^\/]+)\/(.*$)/;var match=url.match(re);var newURL="http://"+match[1]+".myaccess.library.utoronto.ca/"+match[2];window.location.href=newURL;})();void 0
No line breaks!
Then try it. Go to the following article from outside the university network …
http://science.sciencemag.org/content/303/5659/788.long
… you should see the abstract but you can’t view the full text without being an AAAS member. Then click on your bookmarklet. It should automatically rewrite the URL, take you to the UofT login screen, and take you to a page with full access to the article.
I hope you find this as useful as I do. The strategy lends itself to other nice ideas.
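To see what the URL rewriting does without installing anything in your browser, here is the same idea as a plain function that you can run and test. It is a sketch: the function name rewriteForProxy is hypothetical, and the regular expression (capture the host after "//", then the rest of the path) is my reading of what the bookmarklet intends.

```javascript
// Proxy suffix as used in the bookmarklet above.
const PROXY_SUFFIX = ".myaccess.library.utoronto.ca";

// Hypothetical helper: insert the proxy suffix after the host name.
// Returns null if the input does not look like "scheme://host/path".
function rewriteForProxy(url) {
  // Group 1: host (everything between "//" and the next "/").
  // Group 2: the remainder of the URL after that "/".
  const match = url.match(/\/\/([^\/]+)\/(.*)$/);
  if (match === null) {
    return null;
  }
  return "http://" + match[1] + PROXY_SUFFIX + "/" + match[2];
}

console.log(rewriteForProxy("http://science.sciencemag.org/content/303/5659/788.long"));
```

Unlike the one-line bookmarklet, this version returns null instead of failing when the page address has an unexpected shape, which is the kind of check you would want before adapting the strategy to other sites.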
Task…
In the BIN-Storing_data unit you have found the protein of MYSPE that is most similar to yeast Mbp1. Navigate to the NCBI Protein page for the RefSeq entry of this protein. Explore the links that go out from the page. Assess which resources are independently useful, and which resources merely recapitulate information that relates to yeast Mbp1, the protein that you originally searched with. The goal is to develop a sense for where a page like this one collects original information, and where it merely acts as a record of annotation transfer.
Task…
The goal of this short report is to develop a sense for how bioinformatics resources support questions of biological or medical interest. In the BIN-Storing_data unit you have found that protein of MYSPE that is most similar to yeast Mbp1. Navigate to the Genbank page for this protein. Explore the links that go out from the page to other databases and resources.
From these resources, choose one that appears to contain particularly useful information, and describe a plausible scenario how it would be used to answer a research question in a laboratory. What is the available data? What types of questions can it help answer? How would you interpret the annotations it supports?
Then submit your report as a formative feedback assignment.
If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.
Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.
[END]
[1] If there is only a single match, you will be taken directly to the page.↩︎
[2] Actually the “real” SwissProt identifier would be patterned like MBP1_YEAST. P39678 is the corresponding UniProt identifier.↩︎
[3] A good way to consolidate your knowledge is to summarize it for everyone on the Entrez page of the Student Wiki, or enhance the information you find there.↩︎