Difference between revisions of "BIO Assignment Week 2"
Revision as of 04:08, 21 September 2012
Assignment for Week 2
Scenario, Databases, Search and Retrieve
Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.
Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.
The Scenario
Baker's yeast, Saccharomyces cerevisiae, is perhaps the most important model organism. It is a eukaryote that has been studied genetically and biochemically in great detail for many decades, and it is easily manipulated with high-throughput experimental methods. We will use information from this model organism to study the conservation of function and sequence in other fungi whose genomes have been completely sequenced; the assignments are an exercise in model-organism reasoning: the transfer of knowledge from one, well-studied organism to others.
This and the following assignments will revolve around a transcription factor that plays an important role in the regulation of the cell cycle: Mbp1 is a key component of the MBF complex (Mbp1/Swi6). This complex regulates gene expression at the crucial G1/S-phase transition of the mitotic cell cycle and has been shown to bind to the regulatory regions of more than a hundred target genes. It is therefore a DNA binding protein that acts as a control switch for a key cellular process.
One would speculate that such central control machinery would be conserved in other fungi, and it will be your task in these assignments to collect evidence on whether related molecular components are present in some of the newly sequenced fungal genomes. Throughout the assignments we will use freely available tools to conduct bioinformatics investigations of sequences, structures and relationships that may ultimately answer questions such as:
- Do related proteins exist in other organisms?
- What functional features can we detect in the related proteins?
- Do we have evidence that they may bind to similar sequence motifs?
- Do we believe they may function in a similar way?
Task:
Access the information page on Mbp1 at the Saccharomyces Genome Database and read the summary paragraph on the protein's function!
(If you would like to brush up on the concepts mentioned above, you could study the corresponding chapter in Lodish's Molecular Cell Biology and/or read Nobel laureate Paul Nurse's review of the key concepts of the eukaryotic cell cycle. It is not strictly necessary to understand the details of the yeast cell cycle to complete the assignments, but it's obviously more satisfying to work with concepts that actually make some sense.)
For reference, this is the FASTA formatted sequence of Mbp1 from Saccharomyces cerevisiae:
>gi|6320147|ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c]
MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGF
GKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDGSASPPPAPKHHHASKVDRKKAIRSASTSAIMET
KRNNKKAEENQFQSSKILGNPTAAPRKRGRPVGSTRGSRRKLGVNLQRSQSDMGFPRPAIPNSSISTTQL
PSIRSTMGPQSPTLGILEEERHDSRQQQPQQNNSAQFKEIDLEDGLSSDVEPSQQLQQVFNQNTGFVPQQ
QSSLIQTQQTESMATSVSSSPSLPTSPGDFADSNPFEERFPGGGTSPIISMIPRYPVTSRPQTSDINDKV
NKYLSKLVDYFISNEMKSNKSLPQVLLHPPPHSAPYIDAPIDPELHTAFHWACSMGNLPIAEALYEAGTS
IRSTNSQGQTPLMRSSLFHNSYTRRTFPRIFQLLHETVFDIDSQSQTVIHHIVKRKSTTPSAVYYLDVVL
SKIKDFSPQYRIELLLNTQDKNGDTALHIASKNGDVVFFNTLVKMGALTTISNKEGLTANEIMNQQYEQM
MIQNGTNQHVNSSNTDLNIHVNTNNIETKNDVNSMVIMSPVSPSDYITYPSQIATNISRNIPNVVNSMKQ
MASIYNDLHEQHDNEIKSLQKTLKSISKTKIQVSLKTLEVLKESSKDENGEAQTNDDFEILSRLQEQNTK
KLRKRLIRYKRLIKQKLEYRQTVLLNKLIEDETQATTNNTVEKDNNTLERLELAQELTMLQLQRKNKLSS
LVKKFEDNAKIHKYRRIIREGTEMNIEEVDSSLDVILQTLIANNNKNKGAEQIITISNANSHA
I have highlighted the protein's APSES domain (also known as a KilA-N domain), which is the DNA binding element of the sequence. Of course, such coloring is not part of the actual FASTA file which contains only a header and sequence letters.
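Since a FASTA file is just a header line followed by sequence lines, it is easy to take apart in R. The following is only a minimal sketch using base R: the helper name parseFasta is hypothetical (it is not part of the course material), and the record is abbreviated to its first sequence line for illustration.

```r
# Minimal sketch: split a FASTA record into header and sequence (base R only).
# The record below is abbreviated to the first sequence line of Mbp1.
fasta <- c(
  ">gi|6320147|ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c]",
  "MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGF"
)

# hypothetical helper, introduced here for illustration only
parseFasta <- function(lines) {
  header   <- sub("^>", "", lines[1])          # drop the leading ">"
  residues <- paste(lines[-1], collapse = "")  # join all sequence lines
  list(header = header, residues = residues)
}

rec <- parseFasta(fasta)

# the header fields are separated by "|"; the fourth one is the RefSeq ID
fields   <- strsplit(rec$header, "|", fixed = TRUE)[[1]]
refSeqID <- fields[4]
print(refSeqID)
```

Note that this assumes the older "gi|...|ref|...|" header layout shown above; headers in other layouts would need a different way of extracting the ID.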
Choosing YFO (Your Favourite Organism)
The first task is to choose a species in which to conduct your explorations.
Many fungal genomes have been sequenced and more are added each year. For the purposes of the course assignments, we need a species
- that has transcription factors containing APSES domains;
- whose genome has been completely sequenced;
- for which records exist in the RefSeq database, NCBI's unique sequence collection.
Next, I would like to assign species from this list randomly to each student, but I'd also like to avoid having to make a fresh table of assignments every year.
Here is R code to accomplish this:
Task:
- Read, try to understand and then execute the following R code.
pickSpecies <- function(ID) {
    # this function randomly picks a fungal species
    # from a list. It is seeded by a student ID. Therefore
    # the pick is random, but reproducible.

    # first, define a list of species:
    Species <- c(
        "Ajellomyces dermatitidis (AJEDE)",
        "Arthroderma gypseum (ARTGY)",
        "Ashbya gossypii (ASHGO)",
        "Aspergillus clavatus (ASPCL)",
        "Aspergillus flavus (ASPFL)",
        "Botryotinia fuckeliana (BOTFU)",
        "Candida glabrata (CANGL)",
        "Chaetomium globosum (CHAGL)",
        "Clavispora lusitaniae (CLALU)",
        "Coccidioides immitis (COCIM)",
        "Coprinopsis cinerea (COPCI)",
        "Debaryomyces hansenii (DEBHA)",
        "Gibberella zeae (GIBZE)",
        "Kluyveromyces lactis (KLULA)",
        "Komagataella pastoris (KOMPA)",
        "Laccaria bicolor (LACBI)",
        "Lachancea thermotolerans (LACTH)",
        "Lodderomyces elongisporus (LODEL)",
        "Magnaporthe oryzae (MAGOR)",
        "Malassezia globosa (MALGL)",
        "Meyerozyma guilliermondii (MEYGU)",
        "Nectria haematococca (NECHA)",
        "Neosartorya fischeri (NEOFI)",
        "Paracoccidioides brasiliensis (PARBR)",
        "Penicillium chrysogenum (PENCH)",
        "Puccinia graminis (PUCGR)",
        "Pyrenophora teres (PYRTE)",
        "Scheffersomyces stipitis (SCHST)",
        "Schizophyllum commune (SCHCO)",
        "Phaeosphaeria nodorum (PHANO)",
        "Schizosaccharomyces japonicus (SCHJA)",
        "Sclerotinia sclerotiorum (SCLSC)",
        "Talaromyces stipitatus (TALST)",
        "Trichophyton rubrum (TRIRU)",
        "Uncinocarpus reesii (UNCRE)",
        "Vanderwaltozyma polyspora (VANPO)",
        "Verticillium albo-atrum (VERAL)",
        "Yarrowia lipolytica (YARLI)",
        "Zygosaccharomyces rouxii (ZYGRO)"
    )
    l <- length(Species)   # number of elements in the list
    set.seed(ID)           # seed the random number generator
                           # with the student ID
    i <- runif(1, 0, 1)    # pick one random number between 0 and 1
    i <- l * i             # multiply with number of elements
    i <- ceiling(i)        # round up to nearest integer
    choice <- Species[i]   # pick the i'th element from list
    return(choice)
}
- Execute the function pickSpecies() with your student ID as its parameter. Example:
> pickSpecies(991234567)
[1] "Candida glabrata (CANGL)"
- Note down the species name and its five letter abbreviation. Use this species whenever this or future assignments refer to YFO.
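If you would like to convince yourself that a seeded pick really is reproducible, you can try the same idea on a short vector. This is only a sketch: the function name pickOne and the toy items are made up for illustration and are not part of the assignment.

```r
# Sketch of the seeded-pick idea on a short, made-up vector.
# pickOne() is a hypothetical name; it mirrors the steps in pickSpecies().
pickOne <- function(seed, items) {
  set.seed(seed)                                 # same seed -> same random stream
  i <- ceiling(length(items) * runif(1, 0, 1))   # index between 1 and length(items)
  items[i]
}

toy <- c("A", "B", "C", "D")                     # hypothetical items

first  <- pickOne(991234567, toy)
second <- pickOne(991234567, toy)
identical(first, second)                         # TRUE: the pick is reproducible
```

One thing to be aware of: set.seed() expects a value that fits into R's integer range, so an unusually long ID would have to be shortened before it can be used as a seed.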
Keeping a notebook
NCBI databases
Let us explore some of the offerings of the NCBI that can contribute to our objective of studying a particular gene in a newly sequenced organism.
Entrez
Task:
Remember to document your activities.
- Access the NCBI website at http://www.ncbi.nlm.nih.gov/
- In the search bar, enter mbp1 and click Search.
- On the resulting page, look for the Protein section and click on it. What do you find?
The result page of your search in "All Databases" is the "Global Query Result Page" of the Entrez system. If you follow the "Protein" link, you get taken to the 200 or so sequences in the NCBI Protein database. But looking at that page, you see that the result is quite non-specific: searching only by gene name retrieves an Arabidopsis protein, a Saccharomyces protein (presumably one that we might be interested in), Maltose Binding Proteins - and much more. There must be a more specific way to search, and indeed there is. Time to read up on the Entrez system.
Task:
- Navigate to the Entrez Help Page and read about the Entrez system, especially about:
- Boolean operators,
- wildcards,
- limits, and
- filters.
- You should minimally understand:
- How to search by keyword;
- How to search by gene or protein name;
- How to restrict a search to a particular organism.
Don't skip this part: you don't need to know the options by heart, but you should know they exist and how to find them.
Keyword and organism searches are pretty universal, but apart from that, each NCBI database has its own set of specific fields. You can access them via the Advanced Search interface of any of the database pages.
Task:
Now try the search for Mbp1 in Baker's Yeast alone. Return to the Global Search page and enter:
Mbp1[protein name] AND "Saccharomyces cerevisiae"[organism]
This should find one and only one protein. Follow the link into the protein database: since there is only one record, the link takes you directly to the result, a data record in GenBank flat file format, rather than to a list of hits as before. Explore the record and familiarize yourself with the information that is there.
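The fielded query used above does not have to be typed into a web form. As a hedged sketch, here is how the same query string could be packed into a URL for the esearch service of the NCBI E-utilities. The E-utilities are not covered in this assignment, so treat this as an optional aside; no network request is made, the code only assembles the URL.

```r
# Sketch: assemble an E-utilities "esearch" URL for the fielded Entrez query.
# Assumption: programmatic access via E-utilities is wanted at all;
# the assignment itself only uses the web interface.
base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
term <- 'Mbp1[protein name] AND "Saccharomyces cerevisiae"[organism]'

# URLencode() (from the utils package) escapes spaces, quotes and brackets
query <- paste0(base,
                "?db=protein",
                "&term=", URLencode(term, reserved = TRUE))
print(query)
```

Pasting the printed URL into a browser should return the matching record IDs as XML, which is one way to check that a query is as specific as you intended.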
All well and good - but didn't we want to find RefSeq entries, since that is expected to be the database of unique, curated sequence records? I can't tell you why the RefSeq result was not listed among the search results. But I can at least tell you how to find it:
Task:
- In the right-hand margin of the record, you will find a section of Identical proteins ...: click on See all... to list them all. Among these, find the entry with an accession number like NP_123456. This is a RefSeq ID. Follow the link.
- Explore the resulting page. You will notice that the information elements are not identical, even though these are sequence records for one and the same yeast gene product, in two similar databases, at the same data provider!
- Note down the RefSeq ID; you will probably need it later on.
All well and good, and the Mbp1 protein is going to accompany us throughout the term—but we were actually trying to find related proteins in YFO. Let's give that a try.
Task:
- Again in the right-hand margin, find the section on Related Information and follow the link to Related Sequences. There are many: more than 8,000, actually. That is definitely more than you would like to browse through to find the sequences in YFO. Let's use a filter on these results.
- Click on the Advanced link to access the search history that brought you here. Since you have read the Entrez help page, you should understand that you can type something like
#4 AND "Magnaporthe grisea"[organism]
... or whatever your search-history number and the name of YFO require.
You should find a handful of genes, all of them in YFO. If you find none, or hundreds, you did something wrong. Ask on the mailing list and make sure to fix the problem.
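The filter expression is easy to get wrong by hand, in particular the quotes around the species name. As a small sketch, the string can be assembled in R from the species label that pickSpecies() returns. Both helper names below are hypothetical, and the history number 4 is simply the one from the example above; yours may differ.

```r
# Sketch: build the history-filter expression for YFO.
# Both helpers are hypothetical and introduced for illustration only.

# strip the five-letter code, e.g. "Candida glabrata (CANGL)" -> "Candida glabrata"
speciesName <- function(label) {
  sub(" \\(.*\\)$", "", label)
}

# combine a search-history number with an organism restriction
restrictToOrganism <- function(historyNumber, organism) {
  sprintf('#%d AND "%s"[organism]', historyNumber, organism)
}

yfo    <- speciesName("Candida glabrata (CANGL)")
filter <- restrictToOrganism(4, yfo)
print(filter)
```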
This is one way to find related sequences: by accessing precomputed results at the NCBI. We will, however, explore much more principled approaches in a later assignment. Let's leave the sequence searches for the moment, and explore other information on Yeast Mbp1 that may be useful for annotating sequences in YFO.
PubMed
Arguably one of the most important databases in the life sciences is PubMed and this is a good time to look at PubMed in a bit more detail than you might have done previously.
Task:
- Return to the MBP1 RefSeq record. If you have already closed it, simply enter the RefSeq ID into the search field for a Protein database search and find it again.
- Find the PubMed links under Related information in the right-hand margin and explore them. One will take you only to information related to the actual RefSeq record, the others find more broadly related information. PubMed(weighted) appears to give a pretty good overview of systems-biology type, cross-sectional and functional information. But none of the searches finds all Mbp1-related literature.
- Again, enter the Advanced query page and study the keywords that apply to PubMed searches. These are actually quite important and useful to remember. Make yourself familiar with the section on Search field descriptions and tags in the PubMed help document (in particular [DP], [AU], [TI], and [TA]), how you use the History to combine searches, and the use of AND, OR, NOT and brackets. Understand how you can restrict a search to reviews only, and what the link to Related citations... is useful for.
- Now find publications with Mbp1 in the title. In the result list, follow the links for the two Biochemistry papers by Taylor et al. (2000) and by Deleeuw et al. (2008). Download the PDFs; we will need them later.
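To see how the field tags combine, it can help to write a few queries out explicitly. The following sketch only builds query strings that you could paste into the PubMed search field. The helper tagTerm is hypothetical, and the [PT] (publication type) tag used for the review filter is not among the tags listed above, so check it against the PubMed help before relying on it.

```r
# Sketch: compose PubMed queries from field tags.
# tagTerm() is a hypothetical helper that appends a tag in square brackets.
tagTerm <- function(value, tag) {
  paste0(value, "[", tag, "]")
}

# Mbp1 in the title
titleOnly <- tagTerm("Mbp1", "TI")

# narrow down to one author, journal and publication year
taylor2000 <- paste(titleOnly,
                    tagTerm("Taylor", "AU"),
                    tagTerm("Biochemistry", "TA"),
                    tagTerm("2000", "DP"),
                    sep = " AND ")

# restrict to reviews (assumes the publication-type tag [PT])
reviewsOnly <- paste(titleOnly, tagTerm("review", "PT"), sep = " AND ")

print(taylor2000)
print(reviewsOnly)
```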
Sequence retrieval
Cross-reference
Structure search
Visit the RCSB PDB website at http://www.pdb.org/ , explore the database and familiarize yourself with its contents.
- Look for the "Getting started" page and explore the page.
- Explore the links on the "Education" page to see where you might fill in gaps in your knowledge of structural molecular biology, such as the Biological Units tutorial; read up on one or two of the excellent Molecule of the Month articles, such as the TATA binding protein (July 2005).
- From the homepage, search for the yeast Mbp1 protein (by keyword) and explore the information that is available in one of the entries that was retrieved.
Structure retrieval
Visualize in VMD
VMD
Task:
- Access the VMD page.
- Install the program as per the instructions in the section: "Installing VMD".
- In the tutorial section work through
- Part 1 (Introduction), and
- Part 2 (Working with a single molecule).
Stereo vision (1 mark)
Task:
Access the Stereo Vision tutorial and practice viewing molecular structures in stereo.
Practice at least ...
- two times daily,
- for 3-5 minutes each session.
Keep up your practice throughout the course. Stereo viewing will be required in the final exam, but more importantly, it is a wonderful skill that will greatly support any activity of yours related to structural molecular biology. Practice with different molecules and try out different colours and renderings.
Note: do not go through your practice sessions mechanically. If you are not making any progress with stereo vision, contact me so we can help you get on the right track.
R
The R statistics environment and programming language is an exceptionally well engineered, free (as in free speech) and free (as in free beer) platform for data manipulation and analysis. A large number of functions are included by default, a very large number of additional, community-generated analysis modules can simply be imported from dedicated sites (e.g. the Bioconductor project for molecular biology data) or via the CRAN network, and whatever function is not available can be easily programmed. The ability to filter and manipulate data to prepare it for analysis is an absolute requirement in research-centric fields such as ours, where the strategies for analysis are constantly shifting and prepackaged solutions become obsolete almost faster than they can be developed. Besides numerical analysis, R has very powerful and flexible functions for plotting graphical output.
R is not a main focus of the course, but an important tool I would like you to pick up "on the side".
Task:
- Access the R tutorial on this site.
- Work through the sections Installation, User interface, and Packages.