Difference between revisions of "BIO Assignment Week 2"

Revision as of 04:08, 21 September 2012

Assignment for Week 2
Scenario, Databases, Search and Retrieve

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.

The Scenario

Baker's yeast, Saccharomyces cerevisiae, is perhaps the most important model organism. It is a eukaryote that has been studied genetically and biochemically in great detail for many decades, and it is easily manipulated with high-throughput experimental methods. We will use information from this model organism to study the conservation of function and sequence in other fungi whose genomes have been completely sequenced; the assignments are an exercise in model-organism reasoning: the transfer of knowledge from one, well-studied organism to others.

This and the following assignments will revolve around a transcription factor that plays an important role in the regulation of the cell cycle: Mbp1 is a key component of the MBF complex (Mbp1/Swi6). This complex regulates gene expression at the crucial G1/S-phase transition of the mitotic cell cycle and has been shown to bind to the regulatory regions of more than a hundred target genes. It is therefore a DNA binding protein that acts as a control switch for a key cellular process.

One would speculate that such central control machinery would be conserved in other fungi, and it will be your task in these assignments to collect evidence on whether related molecular components are present in some of the newly sequenced fungal genomes. Throughout the assignments we will use freely available tools to conduct bioinformatics investigations of sequences, structures and relationships that may ultimately answer questions such as:

Do related proteins exist in other organisms?
What functional features can we detect in the related proteins?
Do we have evidence that they may bind to similar sequence motifs?
Do we believe they may function in a similar way?

Task:
Access the information page on Mbp1 at the Saccharomyces Genome Database and read the summary paragraph on the protein's function!

(If you would like to brush up on the concepts mentioned above, you could study the corresponding chapter in Lodish's Molecular Cell Biology and/or read Nobel laureate Paul Nurse's review of the key concepts of the eukaryotic cell cycle. It is not strictly necessary to understand the details of the yeast cell cycle to complete the assignments, but it's obviously more satisfying to work with concepts that actually make some sense.)

For reference, this is the FASTA formatted sequence of Mbp1 from Saccharomyces cerevisiae:

>gi|6320147|ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c]
MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGF
GKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDGSASPPPAPKHHHASKVDRKKAIRSASTSAIMET
KRNNKKAEENQFQSSKILGNPTAAPRKRGRPVGSTRGSRRKLGVNLQRSQSDMGFPRPAIPNSSISTTQL
PSIRSTMGPQSPTLGILEEERHDSRQQQPQQNNSAQFKEIDLEDGLSSDVEPSQQLQQVFNQNTGFVPQQ
QSSLIQTQQTESMATSVSSSPSLPTSPGDFADSNPFEERFPGGGTSPIISMIPRYPVTSRPQTSDINDKV
NKYLSKLVDYFISNEMKSNKSLPQVLLHPPPHSAPYIDAPIDPELHTAFHWACSMGNLPIAEALYEAGTS
IRSTNSQGQTPLMRSSLFHNSYTRRTFPRIFQLLHETVFDIDSQSQTVIHHIVKRKSTTPSAVYYLDVVL
SKIKDFSPQYRIELLLNTQDKNGDTALHIASKNGDVVFFNTLVKMGALTTISNKEGLTANEIMNQQYEQM
MIQNGTNQHVNSSNTDLNIHVNTNNIETKNDVNSMVIMSPVSPSDYITYPSQIATNISRNIPNVVNSMKQ
MASIYNDLHEQHDNEIKSLQKTLKSISKTKIQVSLKTLEVLKESSKDENGEAQTNDDFEILSRLQEQNTK
KLRKRLIRYKRLIKQKLEYRQTVLLNKLIEDETQATTNNTVEKDNNTLERLELAQELTMLQLQRKNKLSS
LVKKFEDNAKIHKYRRIIREGTEMNIEEVDSSLDVILQTLIANNNKNKGAEQIITISNANSHA

I have highlighted the protein's APSES domain (also known as a KilA-N domain), which is the DNA binding element of the sequence. Of course, such coloring is not part of the actual FASTA file which contains only a header and sequence letters.
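As a minimal sketch of how little structure a FASTA record has, the following R code splits a record into its header and its sequence. The function name `parseFasta` and the variable names are illustrative assumptions, not part of the course material, and only the header and the first two sequence lines of the record above are used to keep the example short.

```r
# Hypothetical helper (not from the course material): split one FASTA
# record, given as a character vector of lines, into header and sequence.
parseFasta <- function(lines) {
  lines <- lines[nzchar(lines)]                   # drop empty lines
  stopifnot(startsWith(lines[1], ">"))            # a record starts with ">"
  list(
    header   = sub("^>", "", lines[1]),           # header without the ">"
    sequence = paste(lines[-1], collapse = "")    # join the sequence lines
  )
}

# Header and first two sequence lines of the Mbp1 record shown above.
fastaLines <- c(
  ">gi|6320147|ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c]",
  "MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGF",
  "GKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDGSASPPPAPKHHHASKVDRKKAIRSASTSAIMET"
)

rec <- parseFasta(fastaLines)
print(rec$header)
print(nchar(rec$sequence))   # number of residues in this excerpt
```

Line breaks inside the sequence carry no meaning, which is why the sketch simply concatenates all lines after the header.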

Choosing YFO (Your Favourite Organism)

The first task is to choose a species in which to conduct your explorations.

Many fungal genomes have been sequenced and more are added each year. For the purposes of the course assignments, we need a species

that has transcription factors containing APSES domains;
whose genome has been completely sequenced;
for which records exist in the RefSeq database, NCBI's unique sequence collection.

To prepare such a list of species, I have searched the NCBI's RefSeq database for proteins whose sequences are similar to the APSES domain of Mbp1 and compiled the names of organisms that contain them.

Next, I would like to assign species from this list randomly to each student, but I'd also like to avoid having to make a fresh table of assignments every year.

Here is R code to accomplish this:

Task:

Read, try to understand and then execute the following R-code.

pickSpecies <- function(ID) {
	# this function randomly picks a fungal species
	# from a list. It is seeded by a student ID. Therefore
	# the pick is random, but reproducible.
	
	# first, define a list of species:
	# (do not reorder this vector: the pick depends on element position)
	Species <- c(
		"Ajellomyces dermatitidis (AJEDE)",
		"Arthroderma gypseum (ARTGY)",
		"Ashbya gossypii (ASHGO)",
		"Aspergillus clavatus (ASPCL)",
		"Aspergillus flavus (ASPFL)",
		"Botryotinia fuckeliana (BOTFU)",
		"Candida glabrata (CANGL)",
		"Chaetomium globosum (CHAGL)",
		"Clavispora lusitaniae (CLALU)",
		"Coccidioides immitis (COCIM)",
		"Coprinopsis cinerea (COPCI)",
		"Debaryomyces hansenii (DEBHA)",
		"Gibberella zeae (GIBZE)",
		"Kluyveromyces lactis (KLULA)",
		"Komagataella pastoris (KOMPA)",
		"Laccaria bicolor (LACBI)",
		"Lachancea thermotolerans (LACTH)",
		"Lodderomyces elongisporus (LODEL)",
		"Magnaporthe oryzae (MAGOR)",
		"Malassezia globosa (MALGL)",
		"Meyerozyma guilliermondii (MEYGU)",
		"Nectria haematococca (NECHA)",
		"Neosartorya fischeri (NEOFI)",
		"Paracoccidioides brasiliensis (PARBR)",
		"Penicillium chrysogenum (PENCH)",
		"Puccinia graminis (PUCGR)",
		"Pyrenophora teres (PYRTE)",
		"Scheffersomyces stipitis (SCHST)",
		"Schizophyllum commune (SCHCO)",
		"Phaeosphaeria nodorum (PHANO)",
		"Schizosaccharomyces japonicus (SCHJA)",
		"Sclerotinia sclerotiorum (SCLSC)",
		"Talaromyces stipitatus (TALST)",
		"Trichophyton rubrum (TRIRU)",
		"Uncinocarpus reesii (UNCRE)",
		"Vanderwaltozyma polyspora (VANPO)",
		"Verticillium albo-atrum (VERAL)",
		"Yarrowia lipolytica (YARLI)",
		"Zygosaccharomyces rouxii (ZYGRO)"
		)
	l <- length(Species)    # number of elements in the list
	set.seed(ID)            # seed the random number generator
	                        # with the student ID
	i <- runif(1, 0, 1)     # pick one random number between 0 and 1
	i <- l * i              # multiply with number of elements
	i <- ceiling(i)         # round up to nearest integer
	choice <- Species[i]    # pick the i'th element from list
	return(choice)
}

Execute the function pickSpecies() with your student ID as its parameter. Example:

 > pickSpecies(991234567)
 [1] "Candida glabrata (CANGL)"

Note down the species name and its five letter abbreviation. Use this species whenever this or future assignments refer to YFO.
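The reason the pick is random but reproducible can be sketched in a few lines: seeding the random number generator with the same value always yields the same draw. The helper name `seededIndex` is a hypothetical stand-in that mirrors the index computation inside pickSpecies(); the value 39 is the number of entries in the species list above.

```r
# Hypothetical helper mirroring the index computation in pickSpecies():
# seed the generator, draw one uniform number, scale it to 1..n.
seededIndex <- function(seed, n) {
  set.seed(seed)                  # same seed -> same random stream
  ceiling(n * runif(1, 0, 1))     # round up to an index between 1 and n
}

first  <- seededIndex(991234567, 39)
second <- seededIndex(991234567, 39)

print(identical(first, second))   # TRUE: the same ID always gives the same index
```

Two students with different IDs will usually, but not necessarily, get different indices, since 39 species are shared among the whole class.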

Keeping a notebook

Consider it a part of your assignment to document your activities. This will be helpful, because the assignment is more or less integrated over the entire term, and later assignments will make use of earlier results. But it is also excellent practice for "real" research.

You should write your documentation like a lab notebook—not a formal lab report, but a point-form record of your actual activities. Write such documentation as notes to your (future) self. Obviously, since much of the work will be done on the Web, an electronic notebook makes more sense than a paper notebook.

For each task:

Write a header and give it a unique number.

It is useful to refer to the header number in later text.

State the objective.

In one brief sentence, restate what your task is supposed to achieve.

Document the procedure.

Note what you have done, as concisely as possible. Give enough information so that anyone could reproduce unambiguously what you have done: your future student, or even your future self.

Document your results.

You can distinguish different types of results -

- Static data does not change over time and it may be sufficient to note a reference to the result. For example, there is no need to copy a GenBank record into your documentation; it is sufficient to note the accession number or the GI number.
- Variable data can change over time. For example, the results of a BLAST search depend on the sequences in the database. A list of similar structures may change as new structures get solved. In principle you want to record such data, to be able to reproduce at a later time what your conclusions were based on. But be selective in what you record. For example, you should not paste the entire set of results of a BLAST search into your assignment, but only those matches that were important for your conclusions. Indiscriminate pasting of irrelevant information will make your notes unusable.
- Analysis results: the results of sequence analyses, alignments etc. in general get recorded in your documentation. Again: be selective. Record what is important.

Note your conclusions.

An analysis is not complete unless you conclude something from the results. (Remember what we said about "Cargo Cult Science". If there is no conclusion possible, your activities are quite pointless.) Are two sequences likely homologues, or not? Does your protein contain a signal-sequence or does it not? Is a binding site conserved, or not? The analysis gives you the data, in your conclusion you provide the interpretation of what the data means in the context of your objective. Sometimes your assignment task will ask you to elaborate on an analysis and conclusion. But this does not mean that when the assignment does not explicitly mention it, you don't need to interpret your data.

Prepare your images well

Don't paste uncompressed screendumps into your notes. Save images in a compressed file format. Then e.g. if you are using MSWord documents, use the Insert → Picture → From File ... function of MSWord to insert the image into your file.

Use the right image types.

In principle, images can be stored uncompressed as .tiff or .bmp, or compressed as .gif, .jpg or .png.

- .gif is useful for images with large, monochrome areas and sharp, high-contrast edges, because the LZW compression algorithm it uses works especially well on such data.
- .jpg (or .jpeg) is preferred for images with shades and halftones, such as the structure views you should prepare for several assignments. JPEG has excellent application support and is the most versatile general-purpose image file format currently in use.
- .tiff (or .tif) is preferred for archiving master copies of images in a lossless fashion. Use LZW compression for TIFF files if your system or application supports it.
- .png is an open-source alternative for lossless, compressed images. Application support is growing but still variable.
- .bmp is not really preferred for anything. It is bloated in its (default) uncompressed form and is used primarily because it is simple to code and ubiquitous on Windows computers.

Image dimensions and resolution: Stereo images should have equivalent points approximately 6 cm apart. It depends on your monitor how many pixels this corresponds to. The dimensions of an image are stated in pixels (width x height). My notebook screen has a native display resolution of 1440 x 900 pixels on 33.5 x 21 cm. Therefore a 6 cm separation on my notebook corresponds to ~260 pixels. However, on my desktop monitor, 260 pixels is 6.7 cm across. For the assignments: adjust your stereo images so they are approximately at the right separation and approximately 500 to 600 pixels across. Also, scale your molecules so they fill the available window and are not just dim blobs losing themselves in murky shadows.

Considerations for print (manuscripts etc.) are slightly different: for print output you can specify the output resolution in dpi (dots per inch). A typical print resolution is about 300 dpi: 6 cm separation at 300dpi is about 700 pixels. Print images should therefore be about three times as large in width and height as screen images.
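The arithmetic behind these numbers is a simple unit conversion, sketched below in R. The function name `cmToPixels` is an illustrative assumption; the inputs are the desktop monitor figures (260 pixels across 6.7 cm) and the 300 dpi print resolution quoted above. Measure your own display to get the value that applies to you.

```r
# Hypothetical helper: convert a length in cm to whole pixels,
# given a pixel density in pixels per cm.
cmToPixels <- function(cm, pixelsPerCm) {
  round(cm * pixelsPerCm)
}

desktopDensity <- 260 / 6.7     # desktop monitor: 260 pixels span 6.7 cm
printDensity   <- 300 / 2.54    # print: 300 dots per inch, 2.54 cm per inch

print(cmToPixels(6, desktopDensity))   # 233 pixels for 6 cm on that monitor
print(cmToPixels(6, printDensity))     # 709 pixels for 6 cm at 300 dpi
```

The same 6 cm separation therefore needs roughly three times as many pixels in print as on screen.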

Preparation of stereo views: When assignments require you to create molecular images, always create stereo views.

Keep your images uncluttered and expressive: Turn off the axes if they are not needed and scale the molecular model to fill the available space of your image well. Orient views so they illustrate a point you are trying to make. Emphasize residues that you are writing about with a contrasting coloring scheme. Add labels, where residue identities are not otherwise obvious. Turn off side-chains for residues that are not important. The more you practice these small details, the more efficient you will become in the use of your tools.

If you have technical difficulties, post your questions to the list and/or contact me.

NCBI databases

Let us explore some of the offerings of the NCBI that can contribute to our objective of studying a particular gene in a newly sequenced organism.

Entrez

Task:
Remember to document your activities.

Access the NCBI website at http://www.ncbi.nlm.nih.gov/
In the search bar, enter mbp1 and click Search.
On the resulting page, look for the Protein section and click on it. What do you find?

The result page of your search in "All Databases" is the "Global Query Result Page" of the Entrez system. If you follow the "Protein" link, you get taken to the 200 or so sequences in the NCBI Protein database. But looking at that page, you see that the result is quite non-specific: searching only by gene name retrieves an Arabidopsis protein, a Saccharomyces protein (presumably one that we might be interested in), Maltose Binding Proteins - and much more. There must be a more specific way to search, and indeed there is. Time to read up on the Entrez system.

Task:

Navigate to the Entrez Help Page and read about the Entrez system, especially about:
1. Boolean operators,
2. wildcards,
3. limits, and
4. filters.
You should minimally understand:
1. How to search by keyword;
2. How to search by gene or protein name;
3. How to restrict a search to a particular organism.

Don't skip this part. You don't need to know the options by heart, but you should know they exist and how to find them.

Keyword and organism searches are pretty universal, but apart from that, each NCBI database has its own set of specific fields. You can access them via the Advanced Search interface of any of the database pages.

Task:
Now try the search for Mbp1 in Baker's Yeast alone. Return to the Global Search page and enter:

Mbp1[protein name] AND "Saccharomyces cerevisiae"[organism]

This should find one and only one protein. Follow the link into the protein database: since this is only one record, the link takes you directly to the result, a data record in GenBank flat file format, rather than to a list of hits as before. Explore the record and familiarize yourself with the information that is there.
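The query above follows a regular pattern: each term is followed by a field name in square brackets, and terms are joined with a Boolean operator. As a rough sketch, such query strings can be assembled in R. The helper names `fieldTerm` and `combineAnd` are hypothetical; the code only builds the text you would paste into the search field and does not contact the NCBI.

```r
# Hypothetical helpers: build a fielded Entrez search term and join
# several terms with AND. Nothing here talks to the NCBI servers.
fieldTerm <- function(value, field, quoted = FALSE) {
  if (quoted) {
    value <- paste0('"', value, '"')   # quote multi-word phrases
  }
  paste0(value, "[", field, "]")
}

combineAnd <- function(...) {
  paste(..., sep = " AND ")
}

query <- combineAnd(
  fieldTerm("Mbp1", "protein name"),
  fieldTerm("Saccharomyces cerevisiae", "organism", quoted = TRUE)
)
print(query)
```

Writing the query this way makes it easy to swap in a different organism or field name without retyping the brackets and quotes.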

All well and good - but didn't we want to find RefSeq entries, since that is expected to be the database of unique, curated sequence records? I can't tell you why the RefSeq result was not listed among the search results. But I can at least tell you how to find it:

Task:

In the right-hand margin of the record, you will find a section of Identical proteins ...: click on See all... to list them all. Among these, find the entry with an accession number like NP_123456. This is a RefSeq ID. Follow the link.
Explore the resulting page. You will notice that the information elements are not identical, even though these are sequence records for one and the same yeast gene product, in two similar databases, at the same data provider!
Note down the RefSeq ID; you will probably need it later on.

All well and good, and the Mbp1 protein is going to accompany us throughout the term—but we were actually trying to find related proteins in YFO. Let's give that a try.

Task:

Again in the right hand margin, find the section on Related Information and follow the link to Related Sequences. There are many. More than 8,000 actually. Definitely more than you would like to browse through to find the sequences in YFO. Let's use a filter on these results.
Click on the Advanced link to access the search history that brought you here. Since you have read the Entrez page, you should be able to understand clearly that you can type something like

#4 AND "Magnaporthe grisea"[organism]

... or whatever your own search-history number and YFO name happen to be.

You should find a handful of genes, all of them in YFO. If you find none, or hundreds, you did something wrong. Ask on the mailing list and make sure to fix the problem.
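If you want to derive the filter from the string that pickSpecies() returned, a sketch like the following may help. It assumes the return value has the form "Genus species (CODE)", as in the example output "Candida glabrata (CANGL)", and it uses the history number 4 from the example above. The variable names are illustrative only, and your own history number will probably differ.

```r
# Example return value of pickSpecies(), taken from the text above.
picked <- "Candida glabrata (CANGL)"

# Strip the trailing " (CODE)" to get the binomial name ...
speciesName <- sub(" \\([A-Z]+\\)$", "", picked)
# ... and capture the five-letter code on its own.
speciesCode <- sub("^.*\\(([A-Z]+)\\)$", "\\1", picked)

historyNumber <- 4L   # assumption: replace with your own history number

filterQuery <- sprintf('#%d AND "%s"[organism]', historyNumber, speciesName)
print(speciesCode)
print(filterQuery)
```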

This is one way to find related sequences: by accessing precomputed results at the NCBI. We will however explore much more principled approaches in a later assignment. Let's leave the sequence searches for the moment, and explore other information on Yeast Mbp1 that may be useful for annotating sequences in YFO.

PubMed

Arguably one of the most important databases in the life sciences is PubMed and this is a good time to look at PubMed in a bit more detail than you might have done previously.

Task:

Return to the MBP1 RefSeq record. If you have already closed it, simply enter the RefSeq ID into the search field for a Protein database search and find it again.
Find the PubMed links under Related information in the right-hand margin and explore them. One will take you only to information related to the actual RefSeq record; the others find more broadly related information. PubMed (weighted) appears to give a pretty good overview of systems-biology type, cross-sectional and functional information. But none of these searches finds all Mbp1 related literature.
Again, enter the Advanced query page and study the keywords that apply to PubMed searches. These are actually quite important and useful to remember. Make yourself familiar with the section on Search field descriptions and tags in the PubMed help document, (in particular [DP], [AU], [TI], and [TA]), how you use the History to combine searches, and the use of AND, OR, NOT and brackets. Understand how you can restrict a search to reviews only, and what the link to Related citations... is useful for.
Now find publications with Mbp1 in the title. In the result list, follow the links for the two Biochemistry papers by Taylor et al. (2000) and by Deleeuw et al. (2008). Download the PDFs, we will need them later.
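PubMed field tags combine in the same way as the Entrez fields used earlier. The sketch below assembles a query from the tags named above ([TI], [AU], [DP], [TA]), using the details of the Taylor et al. (2000) Biochemistry paper as example values. It only builds the query text; whether that exact query retrieves only the intended paper is not verified here, and the variable names are illustrative.

```r
# Named vector: the names are PubMed field tags, the values are search terms.
terms <- c(TI = "Mbp1", AU = "Taylor", DP = "2000", TA = "Biochemistry")

# Append each tag in square brackets and join the terms with AND.
tagged <- paste0(terms, "[", names(terms), "]")
pubmedQuery <- paste(tagged, collapse = " AND ")
print(pubmedQuery)
```

Dropping elements from `terms` broadens the search; for example, keeping only the TI entry gives the plain title search the task asks for.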

Sequence retrieval

Cross-reference

Structure search

Visit the RCSB PDB website at http://www.pdb.org/ , explore the database and familiarize yourself with its contents.

Look for the "Getting started" page and explore the page.
Explore the links on the "Education" page to see where you might fill in gaps in your knowledge of structural molecular biology, such as the Biological Units tutorial; read up on one or two of the excellent Molecule of the Month articles, such as the TATA binding protein (July 2005).
From the homepage, search for the yeast Mbp1 protein (by keyword) and explore the information that is available in one of the entries that was retrieved.

Structure retrieval

Visualize in VMD

VMD

Task:

Access the VMD page.
Install the program as per the instructions in the section: "Installing VMD".
In the tutorial section work through
- Part 1 (Introduction), and
- Part 2 (Working with a single molecule).

Stereo vision (1 mark)

Task:

Access the Stereo Vision tutorial and practice viewing molecular structures in stereo.

Practice at least ...

two times daily,
for 3-5 minutes each session.

Keep up your practice throughout the course. Stereo viewing will be required in the final exam, but more importantly, it is a wonderful skill that will greatly support any activity of yours related to structural molecular biology. Practice with different molecules and try out different colours and renderings.

Note: do not go through your practice sessions mechanically. If you are not making any progress with stereo vision, contact me so we can help you on the right track.

R

The R statistics environment and programming language is an exceptionally well engineered, free (as in free speech) and free (as in free beer) platform for data manipulation and analysis. The number of functions that are included by default is large, there is a very large number of additional, community-generated analysis modules that can be simply imported from dedicated sites (e.g. the Bioconductor project for molecular biology data), or via the CRAN network, and whatever function is not available can be easily programmed. The ability to filter and manipulate data to prepare it for analysis is an absolute requirement in research-centric fields such as ours, where the strategies for analysis are constantly shifting and prepackaged solutions become obsolete almost faster than they can be developed. Besides numerical analysis, R has very powerful and flexible functions for plotting graphical output.

R is not a main focus of the course, but an important tool I would like you to pick up "on the side".

Task:

Access the R tutorial on this site.
Work through the sections Installation, User interface, and Packages.

@@ Line 222: / Line 222: @@
 Let us explore some of the offerings of the NCBI that can contribute to our objective of studying a particular gene in a newly sequenced organism.
-The NCBI administers some of the world's most important databases, such as GenBank. In this section you should
-*Explore the NCBI Web site, familiarize yourself with its key databases and explore the resources to become confident that you will find information that you are looking for.
-*Follow a protein's annotations into PubMed and familiarize yourself with PubMed's query syntax.
-*Explore the Entrez search page, and learn how to limit queries and restrict searches
@@ Line 234: / Line 229: @@
 <small>Remember to document your activities.</small>
-# Access the '''NCBI''' website at http://www.ncbi.nlm.nih.gov/ Look for the '''site-map''' and browse the contents of this large site; find which databases and services are hosted here. Expect to spend at least half an hour to familiarize yourself with the site.
+# Access the '''NCBI''' website at http://www.ncbi.nlm.nih.gov/
-# Access the '''Map viewer''' (under the '''Genomes''' section of the '''Databases''' division). Click on the link under ''Saccharomyces cerevisiae'' (Build 2.1) for a whole genome view, then click on the icon for chromosome IV for a more detailed view. Enter the region between 340,000 and 380,000 in the "Region shown" fields on the left. How many genes does this region contain? How many of these are protein genes?
+# In the search bar, enter <code>mbp1</code> and click '''Search'''.
-## Click on '''MBP1''' to follow the link to its Entrez Gene page. Study the contents of the page. If you are not clear what the sections show you, click on one of the question marks. If you are still not clear, ask on our mailing list.
+# On the resulting page, look for the '''Protein''' section and click on it. What do you find?
-### Follow the link to '''PubMed''' for this gene. You should find (at least) 27 publications. Click on the '''History''' tab to find the index of the query that got you here (eg. "#4"). Now search for those papers in your query that were published in 2008: enter <tt>#4 AND 2008[DP]</tt> into the search field and click "Go". Make yourself familiar with the [http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helppubmed.section.pubmedhelp.Search_Field_Descrip Search field descriptions and tags] (in particular <tt>[DP]</tt>, <tt>[AU]</tt>, <tt>[TI]</tt>, and <tt>[TA]</tt>), how you use the ''History'' to combine searches, and the use of <tt>AND</tt>, <tt>OR</tt>, <tt>NOT</tt> and brackets.
+}}
-## Back at the MapViewer pager, click on '''pr''' in the same row as the MBP1 gene to find a list of '''GenPept''' (protein) records for this gene. Follow the link to the '''RefSeq''' record for this protein: <tt>'''NP_010227'''</tt>. This is a flat-file record for the Mbp1 gene. Study the fields and the format. Then use the "Display" option in the header to show this protein sequence in a FASTA format, choose "send to ... Text" to get '''only''' the FASTA format. Make sure you understand the difference between GenBank/GenPept and RefSeq, between GI number, accession and locus (refer to the lecture slides as soon as they are posted).
-# In the header bar of the MapViewer, click on the link to '''Entrez'''. Enter <tt>mbp1</tt> into the search field of the Entrez page and click "GO".
-## Increase the relevance of returned items by '''restricting your search''' to a particular organism. Access and read the [http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.section.EntrezHelp.Entrez__the_Life_Sci Help pages for Entrez] and make sure you understand how to use limits and how to search in search field indexes. You will already have encountered similar concepts when you visited PubMed.
+The result page of your search in "All Databases" is the "Global Query Result Page" of the Entrez system. If you follow the "Protein" link, you get taken to the 200 or so sequences in the NCBI Protein database. But looking at that page, you see that the result is quite non-specific: searching only by gene name retrieves an ''Arabidopsis'' protein, a ''Saccharomyces'' protein (presumably one that we might be interested in), Maltose Binding Proteins - and much more. There must be a more specific way to search, and indeed there is. Time to read up on the Entrez system.
-##Enter: <tt>mbp1 AND "saccharomyces cerevisiae"[organism]</tt> into the Entrez search field and click "GO". Click on the CoreNucleotide link of the results.
-## The RefSeq record listed in the results contains the entire yeast chromosome IV (1.5 Mbp) which you probably don't want to explore unless you actually want to. The result is correct, since ''mbp1'' is one of the 787 genes annotated on that chromosome, but perhaps not what we had in mind when we queried for a nucleotide sequence of the ''mbp1'' gene. Check the results for a different record that contains only the ''mbp1'' gene's (full-length) nucleotide sequence. There are (as of this writing) two such records. Explore either one of the two, these are nucleotide sequences in the GenBank flat file format.
+{{task|1=
+# Navigate to the [http://www.ncbi.nlm.nih.gov/books/NBK3837/ Entrez Help Page] and read about the Entrez system, especially about:
+##Boolean operators,
+##wildcards,
+##limits, and
+##filters.
+# You should minimally understand:
+## How to search by keyword;
+## How to search by gene or protein name;
+## How to restrict a search to a particular organism.
+Don't skip this part, you don't need to know the options by heart, but you should know they exist and how to find them.
+}}
+Keyword and organism searches are pretty universal, but apart from that, each NCBI database has its own set of specific fields. You can access them via the Advanced Search interface of any of the database pages.
+{{task|1=
+Now try the search for Mbp1 in Baker's Yeast alone. Return to the Global Search page and enter:
+ Mbp1[protein name] AND "Saccharomyces cerevisiae"[organism]
+}}
+This should find one and only one protein. Follow the link into the protein database: since this is only one record, the link takes you directly to the result&mdash;a data record in GenBank flat file format, not to a list of hits, as before. Explore the record and familiarize yourself with the information that is there.
+All well and good - but didn't we want to find '''RefSeq''' entries, since that is expected to be the database of unique, curated sequence records? I can't tell you why the RefSeq result was not listed among the search results. But I can at least tell you how to find it:
+{{task|1=
+# In the right-hand margin of the record, you will find a section of '''Identical proteins ...''': click on '''See all...''' to list them all. Among these, find the entry with an accession number like <code>NP_123456</code>. This is a RefSeq ID. Follow the link.
+# Explore the resulting page. You will notice that the information elements are not identical, even though these are sequence records for one and the same yeast gene product, in two similar databases, at the same data provider!
+# Note down the RefSeq ID, you will probably need it later on.
+}}
+All well and good, and the Mbp1 protein is going to accompany us throughout the term&mdash;but we were actually trying to find related proteins in YFO. Let's give that a try.
+{{task|1=
+# Again in the right hand margin, find the section on '''Related Information''' and follow the link to '''Related Sequences'''. There are many. More than 8,000 actually. Definitely more than you would like to browse through to find the sequences in YFO. Let's use a filter on these results.
+# Click on the '''Advanced''' link to access the search history that brought you here. Since you have read the Entrez page, you should be able to understand clearly that you can type something like
+ #4 AND "Magnaporthe grisea"[organism]
+... or whatever your command-history number resp. YFO name suggests.
+You should find a handful of genes, all of them in YFO. If you find none, or hundreds, you did something wrong. Ask on the mailing list and make sure to fix the problem.
+}}
+This is '''one''' way to find related sequences: by accessing precomputed results at the NCBI. We will however explore much more principled approaches in a later assignment. Let's leave the sequence searches for the moment, and explore other information on Yeast Mbp1 that may be useful for annotating sequences in YFO.
+====PubMed====
+Arguably one of the most important databases in the life sciences is PubMed and this is a good time to look at PubMed in a bit more detail than you might have done previously.
+{{task|1=
+# Return back to the '''MBP1''' RefSeq record. If you have already closed it, simply enter the RefSeq ID into the search field for a Protein database search and find it again.
+#  Find the '''PubMed''' links under '''Related information''' in the right-hand margin and explore them. One will take you only to information related to the actual RefSeq record, the others find more broadly related information. PubMed(weighted) appears to give a pretty good overview of systems-biology type, cross-sectional and functional information. But neither of the searches finds '''all''' Mbp1 related literature.
+# Again, enter the '''Advanced''' query page and study the keywords that apply to PubMed searches. These are actually quite important and useful to remember.  Make yourself familiar with the section on [http://www.ncbi.nlm.nih.gov/books/NBK3827/ '''Search field descriptions and tags'''] in the PubMed help document, (in particular <tt>[DP]</tt>, <tt>[AU]</tt>, <tt>[TI]</tt>, and <tt>[TA]</tt>), how you use the ''History'' to combine searches, and the use of <tt>AND</tt>, <tt>OR</tt>, <tt>NOT</tt> and brackets. Understand how you can restrict a search to ''reviews'' only, and what the link to '''Related citations...''' is useful for.
+# Now find publications with Mbp1 '''in the title'''. In the result list, follow the links for the two Biochemistry papers by Taylor ''et al.'' (2000) and by Deleeuw ''et al.'' (2008). Download the PDFs, we will need them later.
 }}
