Difference between revisions of "BIO Assignment Week 2"

(2) Reformatted the document to provide an Entrez species selection command. With this string, NCBI search tools can be constrained to the set of species we are interested in. One could type this list by hand, or use the search/replace functions of a text editor on the original list. I used the following Perl one-liner, which I give here merely for your edification[1].

 perl -e 'while(<STDIN>){/^(.+?)\t/;print"\"$1\"[organism] OR \n"}' < genomes_overview.txt

... giving me the Entrez selection command (with over 400 species):

Revision as of 19:12, 18 September 2014

Assignment for Week 2
Scenario, Labnotes on the Wiki, R-functions, Databases, Search and Retrieve


Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.



 

The Scenario

Baker's yeast, Saccharomyces cerevisiae, is perhaps the most important model organism. It is a eukaryote that has been studied genetically and biochemically in great detail for many decades, and it is easily manipulated with high-throughput experimental methods. We will use information from this model organism to study the conservation of function and sequence in other fungi whose genomes have been completely sequenced; the assignments are an exercise in model-organism reasoning: the transfer of knowledge from one, well-studied organism to others.

This and the following assignments will revolve around a transcription factor that plays an important role in the regulation of the cell cycle: Mbp1 is a key component of the MBF complex (Mbp1/Swi6). This complex regulates gene expression at the crucial G1/S-phase transition of the mitotic cell cycle and has been shown to bind to the regulatory regions of more than a hundred target genes. It is therefore a DNA binding protein that acts as a control switch for a key cellular process.

One would speculate that such central control machinery would be conserved in other fungi, and it will be your task in these assignments to collect evidence on whether related molecular components are present in some of the newly sequenced fungal genomes. Throughout the assignments we will use freely available tools to conduct bioinformatics investigations of sequences, structures and relationships that may ultimately answer questions such as:

  • Do related proteins exist in other organisms?
  • What functional features can we detect in the related proteins?
  • Do we have evidence that they may bind to similar sequence motifs?
  • Do we believe they may function in a similar way?

Task:
Access the information page on Mbp1 at the Saccharomyces Genome Database and read the summary paragraph on the protein's function!

(If you would like to brush up on the concepts mentioned above, you could study the corresponding chapter in Lodish's Molecular Cell Biology and/or read Nobel laureate Paul Nurse's review of the key concepts of the eukaryotic cell cycle. It is not strictly necessary to understand the details of the yeast cell cycle to complete the assignments, but it's obviously more satisfying to work with concepts that actually make some sense.)

For reference, this is the FASTA formatted sequence of Mbp1 from Saccharomyces cerevisiae:

>gi|6320147|ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c]
MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGF
GKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDGSASPPPAPKHHHASKVDRKKAIRSASTSAIMET
KRNNKKAEENQFQSSKILGNPTAAPRKRGRPVGSTRGSRRKLGVNLQRSQSDMGFPRPAIPNSSISTTQL
PSIRSTMGPQSPTLGILEEERHDSRQQQPQQNNSAQFKEIDLEDGLSSDVEPSQQLQQVFNQNTGFVPQQ
QSSLIQTQQTESMATSVSSSPSLPTSPGDFADSNPFEERFPGGGTSPIISMIPRYPVTSRPQTSDINDKV
NKYLSKLVDYFISNEMKSNKSLPQVLLHPPPHSAPYIDAPIDPELHTAFHWACSMGNLPIAEALYEAGTS
IRSTNSQGQTPLMRSSLFHNSYTRRTFPRIFQLLHETVFDIDSQSQTVIHHIVKRKSTTPSAVYYLDVVL
SKIKDFSPQYRIELLLNTQDKNGDTALHIASKNGDVVFFNTLVKMGALTTISNKEGLTANEIMNQQYEQM
MIQNGTNQHVNSSNTDLNIHVNTNNIETKNDVNSMVIMSPVSPSDYITYPSQIATNISRNIPNVVNSMKQ
MASIYNDLHEQHDNEIKSLQKTLKSISKTKIQVSLKTLEVLKESSKDENGEAQTNDDFEILSRLQEQNTK
KLRKRLIRYKRLIKQKLEYRQTVLLNKLIEDETQATTNNTVEKDNNTLERLELAQELTMLQLQRKNKLSS
LVKKFEDNAKIHKYRRIIREGTEMNIEEVDSSLDVILQTLIANNNKNKGAEQIITISNANSHA

I have highlighted the protein's APSES domain (also known as a KilA-N domain), which is the DNA binding element of the sequence. Of course, such colouring is not part of the actual FASTA file which contains only a header and sequence letters. This is the domain we will focus on most in the following assignments.
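
Since a FASTA record is nothing more than a header line and sequence letters, it is easy to take apart in R. The following is a minimal sketch, not a general-purpose parser: the function name parseFasta is my own invention for illustration, and the input contains only the header and the first sequence line of the record above.

```r
# Minimal FASTA parser for a single record (a sketch, not a general parser).
# 'fastaText' holds the header and only the FIRST sequence line of the
# Mbp1 record shown above; the name parseFasta() is hypothetical.
fastaText <- c(
  ">gi|6320147|ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c]",
  "MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGF"
)

parseFasta <- function(lines) {
  # the header is the line that starts with ">"
  isHeader <- grepl("^>", lines)
  header <- sub("^>", "", lines[isHeader][1])
  # all remaining lines are sequence: join them and drop any whitespace
  sequence <- gsub("\\s", "", paste(lines[!isHeader], collapse = ""))
  list(header = header, sequence = sequence)
}

rec <- parseFasta(fastaText)
rec$header            # the annotation line without the leading ">"
nchar(rec$sequence)   # number of residues in the excerpt
```

To work with the full sequence, paste all sequence lines of the record into fastaText.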


Choosing YFO (Your Favourite Organism)

The first task is to choose a species in which to conduct your explorations.


Many fungal genomes have been sequenced and more are added each year. For the purposes of the course assignments, we need a species

  • that has transcription factors containing APSES domains;
  • whose genome has been completely sequenced;
  • for which records exist in the RefSeq database, NCBI's unique sequence collection.


To prepare such a list of species, I have searched the NCBI's RefSeq database for proteins whose sequences are similar to the APSES domain of Mbp1 and compiled the names of organisms that contain them.

 

Next, I would like to assign species from this list to each student. This process should be random, but reproducible.

Here is R code to accomplish this:

Task:


  • Read, try to understand and then execute the following R-code.
pickSpecies <- function(ID) {
	# this function randomly picks a fungal species
	# from a list. It is seeded by a student ID. Therefore
	# the pick is random, but reproducible.
	
	# first, define a list of species:
	Species <- c(
		"Ajellomyces dermatitidis  (AJEDE)",
		"Arthroderma benhamiae  (ARTBE)",
		"Arthroderma gypseum  (ARTGY)",
		"Ashbya gossypii  (ASHGO)",
		"Aspergillus clavatus  (ASPCL)",
		"Aspergillus fumigatus  (ASPFU)",
		"Aspergillus nidulans  (ASPNI)",
		"Aspergillus niger  (ASPNI)",
		"Aspergillus terreus  (ASPTE)",
		"Candida albicans  (CANAL)",
		"Candida dubliniensis  (CANDU)",
		"Candida glabrata  (CANGL)",
		"Candida orthopsilosis  (CANOR)",
		"Candida tropicalis  (CANTR)",
		"Chaetomium globosum  (CHAGL)",
		"Clavispora lusitaniae  (CLALU)",
		"Coccidioides immitis  (COCIM)",
		"Coccidioides posadasii  (COCPO)",
		"Debaryomyces hansenii  (DEBHA)",
		"Eremothecium cymbalariae  (ERECY)",
		"Kazachstania africana  (KAZAF)",
		"Kluyveromyces lactis  (KLULA)",
		"Komagataella pastoris  (KOMPA)",
		"Lachancea thermotolerans  (LACTH)",
		"Lodderomyces elongisporus  (LODEL)",
		"Magnaporthe oryzae  (MAGOR)",
		"Malassezia globosa  (MALGL)",
		"Meyerozyma guilliermondii  (MEYGU)",
		"Millerozyma farinosa  (MILFA)",
		"Myceliophthora thermophila  (MYCTH)",
		"Naumovozyma castellii  (NAUCA)",
		"Naumovozyma dairenensis  (NAUDA)",
		"Nectria haematococca  (NECHA)",
		"Neosartorya fischeri  (NEOFI)",
		"Neurospora crassa  (NEUCR)",
		"Paracoccidioides sp.  (PARSP)",
		"Puccinia graminis  (PUCGR)",
		"Pyrenophora teres  (PYRTE)",
		"Pyrenophora tritici-repentis  (PYRTR)",
		"Saccharomyces cerevisiae (SACCE)",
		"Saccharomyces cerevisiae  (SACCE)",
		"Scheffersomyces stipitis  (SCHST)",
		"Schizosaccharomyces japonicus  (SCHJA)",
		"Sclerotinia sclerotiorum  (SCLSC)",
		"Sordaria macrospora  (SORMA)",
		"Talaromyces marneffei  (TALMA)",
		"Talaromyces stipitatus  (TALST)",
		"Tetrapisispora blattae  (TETBL)",
		"Tetrapisispora phaffii  (TETPH)",
		"Thielavia terrestris  (THITE)",
		"Torulaspora delbrueckii  (TORDE)",
		"Trichophyton rubrum  (TRIRU)",
		"Trichophyton verrucosum  (TRIVE)",
		"Uncinocarpus reesii  (UNCRE)",
		"Vanderwaltozyma polyspora  (VANPO)",
		"Verticillium alfalfae  (VERAL)",
		"Yarrowia lipolytica  (YARLI)",
		"Zygosaccharomyces rouxii  (ZYGRO)",
		"Zymoseptoria tritici  (ZYMTR)"
		)
	l <- length(Species)    # number of elements in the list
	set.seed(ID)            # seed the random number generator
	                        # with the student ID
	i <- runif(1, 0, 1)     # pick one random number between 0 and 1
	i <- l * i              # multiply with number of elements
	i <- ceiling(i)         # round up to nearest integer
	choice <- Species[i]    # pick the i'th element from list
	return(choice)
}
  • Execute the function pickSpecies() with your student ID as its parameter. Example:
 > pickSpecies(991234567)
 [1] "Candida glabrata  (CANGL)"
  • Note down the species name and its five-letter label on your student Wiki page. Use this species whenever this or future assignments refer to YFO.
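
If you would like to convince yourself that a seeded pick really is reproducible, you can try the two ideas of the function, seeding and mapping a random number to an index, on a small example. This is only a sketch: the vector items and the function pickItem() are hypothetical stand-ins for the species list and pickSpecies().

```r
# A small demonstration of seeded, reproducible picks.
# 'items' and pickItem() are hypothetical stand-ins for illustration.
items <- c("alpha", "beta", "gamma")

pickItem <- function(seed) {
  set.seed(seed)                                  # same seed, same random stream
  i <- ceiling(length(items) * runif(1, 0, 1))    # index between 1 and length(items)
  items[i]
}

first  <- pickItem(991234567)
second <- pickItem(991234567)
identical(first, second)   # TRUE: the same seed gives the same pick
```

Calling pickItem() with a different seed may or may not give a different element, but repeating a call with the same seed always returns the same one.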


Task:

  • While you already have R open, access the R tutorial and work through the section on Simple commands. It is short, and should help you understand the code above.


 

Keeping a notebook on your Wiki

Consider it a part of your assignment to document your activities on your Wiki page. This will be helpful, because the assignment is more or less integrated over the entire term, and later assignments will make use of earlier results. But it is also excellent practice for "real" research.


 

NCBI databases

Let us explore some of the offerings of the NCBI that can contribute to our objective of studying a particular gene in a newly sequenced organism.


Entrez

Task:
Remember to document your activities.

  1. Access the NCBI website at http://www.ncbi.nlm.nih.gov/
  2. In the search bar, enter mbp1 and click Search.
  3. On the resulting page, look for the Protein section and click on it. What do you find?


The result page of your search in "All Databases" is the "Global Query Result Page" of the Entrez system. If you follow the "Protein" link, you are taken to the 200 or so sequences in the NCBI Protein database. But looking at that page, you see that the result is quite non-specific: searching only by gene name retrieves an Arabidopsis protein, a Saccharomyces protein (presumably one that we might be interested in), maltose-binding proteins, and much more. There must be a more specific way to search, and indeed there is. Time to read up on the Entrez system.


Task:

  1. Navigate to the Entrez Help Page and read about the Entrez system, especially about:
    1. Boolean operators,
    2. wildcards,
    3. limits, and
    4. filters.
  2. You should minimally understand:
    1. How to search by keyword;
    2. How to search by gene or protein name;
    3. How to restrict a search to a particular organism.

Don't skip this part. You don't need to know the options by heart, but you should know that they exist and how to find them.


Keyword and organism searches are pretty universal, but apart from that, each NCBI database has its own set of specific fields. You can access them via the Advanced Search interface of any of the database pages.
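
Organism restrictions combine with Boolean operators in a regular way, so a query for several species can be assembled rather than typed. The following R sketch shows one way to do this. The three species names are an arbitrary selection, and I make no claim about what the resulting query retrieves.

```r
# Build an Entrez organism restriction from species names (a sketch).
# 'species' is an arbitrary short selection; substitute your own names.
species <- c("Candida albicans", "Neurospora crassa", "Yarrowia lipolytica")

# quote each name and attach the [organism] field tag
terms <- paste0("\"", species, "\"[organism]")

# join the alternatives with OR and wrap them in parentheses
organismFilter <- paste0("(", paste(terms, collapse = " OR "), ")")

# combine with a protein-name term using AND
query <- paste("Mbp1[protein name] AND", organismFilter)
cat(query, "\n")
```

The parentheses matter: without them, the AND would apply only to the first organism term.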


Protein

 

Task:
Now try the search for Mbp1 in Baker's Yeast alone. Return to the Global Search page and enter:

Mbp1[protein name] AND "Saccharomyces cerevisiae"[organism]


This should find one and only one protein. Follow the link into the protein database: since this is only one record, the link takes you directly to the result, a data record in GenBank flat file format, rather than to a list of hits as before. Explore the record and familiarize yourself with the information that is there.

All well and good - but didn't we want to find RefSeq entries, since that is expected to be the database of unique, curated sequence records? I can't tell you why the RefSeq result was not listed among the search results. But I can at least tell you how to find it:


Task:

  1. In the right-hand margin of the record, you will find a section of Identical proteins ...: click on See all... to list them all. Among these, find the entry with an accession number like NP_123456. This is a RefSeq ID. Follow the link.
  2. Explore the resulting page. You will notice that the information elements are not identical, even though these are sequence records for one and the same yeast gene product, in two similar databases, at the same data provider!
  3. Note down the RefSeq ID; you will probably need it later on.


All well and good, and the Mbp1 protein is going to accompany us throughout the term—but we were actually trying to find related proteins in YFO. Let's give that a try.


Task:

  1. Again in the right-hand margin, find the section on Related Information and follow the link to Related Sequences. There are many: more than 8,000, actually. That is definitely more than you would like to browse through to find the sequences in YFO. Let's use a filter on these results.
  2. Click on the Advanced link to access the search history that brought you here. Since you have read the Entrez page, you should be able to understand clearly that you can type something like
#4 AND "Magnaporthe grisea"[organism]

... or whatever your command-history number and YFO name are.

You should find a handful of genes, all of them in YFO. If you find none, or hundreds, you did something wrong. Ask on the mailing list and make sure to fix the problem.


This is one way to find related sequences: by accessing precomputed results at the NCBI. We will however explore much more principled approaches in a later assignment. Let's leave the sequence searches for the moment, and explore other information on Yeast Mbp1 that may be useful for annotating sequences in YFO.

PubMed

Arguably one of the most important databases in the life sciences is PubMed and this is a good time to look at PubMed in a bit more detail than you might have done previously.


Task:

  1. Return to the MBP1 RefSeq record. If you have already closed it, simply enter the RefSeq ID into the search field for a Protein database search and find it again.
  2. Find the PubMed links under Related information in the right-hand margin and explore them. One will take you only to information related to the actual RefSeq record; the others find more broadly related information. PubMed (weighted) appears to give a pretty good overview of systems-biology type, cross-sectional and functional information. But none of these searches finds all Mbp1-related literature.
  3. Again, enter the Advanced query page and study the keywords that apply to PubMed searches. These are actually quite important and useful to remember. Make yourself familiar with the section on Search field descriptions and tags in the PubMed help document (in particular [DP], [AU], [TI], and [TA]), with how you use the History to combine searches, and with the use of AND, OR, NOT and brackets. Understand how you can restrict a search to reviews only, and what the link to Related citations... is useful for.
  4. Now find publications with Mbp1 in the title. In the result list, follow the links for the two Biochemistry papers by Taylor et al. (2000) and by Deleeuw et al. (2008). Download the PDFs; we will need them later.
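
The field tags mentioned above attach to individual search terms, and the tagged terms are then joined with Boolean operators. As a sketch of how such a query is put together, here is a small R helper. The function tagTerm() is hypothetical, and I have not verified what this particular query returns; it merely combines the tags with the details of the Taylor et al. paper as given above.

```r
# Compose a PubMed query from field-tagged terms (a sketch).
# tagTerm() is a hypothetical helper, not part of any package.
tagTerm <- function(term, tag) {
  # attach a PubMed field tag such as TI, AU, TA or DP to a search term
  paste0(term, "[", tag, "]")
}

parts <- c(
  tagTerm("Mbp1", "TI"),          # word in the title
  tagTerm("Taylor", "AU"),        # author name
  tagTerm("Biochemistry", "TA"),  # journal title abbreviation
  tagTerm("2000", "DP")           # date of publication
)

pubmedQuery <- paste(parts, collapse = " AND ")
cat(pubmedQuery, "\n")
```

Paste the resulting string into the PubMed search field, then drop or change individual terms to see how each tag narrows the result.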


Structure search

The search options in the PDB structure database are as sophisticated as those at the NCBI. For now, we will try a simple keyword search to get us started.


Task:

  1. Visit the RCSB PDB website at http://www.pdb.org/
  2. Briefly orient yourself regarding the database contents and its information offerings and services.
  3. Enter Mbp1 into the search field.
  4. In your journal, note down the PDB IDs for the three Saccharomyces cerevisiae Mbp1 transcription factor structures your search has retrieved.
  5. Click on one of the entries and explore the information and services linked from that page.

VMD

 

Task:

  1. Open VMD.
  2. Load one of the yeast Mbp1 fragment structures for which you have noted the PDB ID (simply enter the ID into the appropriate field of a File → New Molecule window).
  3. Display the protein in New Cartoon display and familiarize yourself with its topology of helices and strands.
  4. Using the sequence viewer window, identify the part of the sequence that corresponds to the APSES domain I have highlighted in the FASTA record above.
  5. Generate a stereo view that shows the molecule well, in which the APSES domain residues are red and the remaining residues are white.
  6. Save the image in your journal.

Stereo vision

Task:

Continue with your stereo practice.

Practice at least ...

  • two times daily,
  • for 3-5 minutes each session.

Keep up your practice throughout the course. Once again: do not go through your practice sessions mechanically. If you are not making constant progress in your practice sessions, contact me so we can help you get on the right track.


Modeling small molecules (optional)

As an optional part of the assignment, here is a small tutorial for modeling and visualizing "small-molecule" structures.


Defining a molecule

A number of public repositories make small molecule information available, such as PubChem at the NCBI, the ligand collection at the PDB, the ChEBI database at the European Bioinformatics Institute, or the NCI database browser at the US National Cancer Institute. One general way to export topology information from these services is to use SMILES strings—a shorthand notation for the composition and topology of chemical compounds.
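
To get a feeling for how much a SMILES string encodes, consider a SMILES string commonly given for caffeine (C8H10N4O2). Hydrogens are implicit, so only the heavy atoms appear. The R sketch below counts them. It assumes one-letter, upper-case element symbols and is therefore not a general SMILES parser; the function name countAtoms() is hypothetical.

```r
# Count heavy atoms in a SMILES string (a sketch for simple cases only).
# Assumes one-letter, upper-case element symbols: it would miscount
# two-letter elements such as Cl or Br and ignores aromatic lower-case atoms.
caffeine <- "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"

countAtoms <- function(smiles) {
  # pull out every upper-case letter, then tabulate the symbols
  symbols <- regmatches(smiles, gregexpr("[A-Z]", smiles))[[1]]
  table(symbols)
}

counts <- countAtoms(caffeine)
print(counts)
```

The digits in the string mark ring closures and the parentheses mark branches, which is how the linear notation captures the topology of the molecule.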


Task:

  1. Access each of the databases mentioned above.
  2. Enter "caffeine" as a search term.
  3. Explore the contents of the result, in particular note and copy the SMILES string for the compound.


Alternatively, you can sketch your own compound. Versions of Peter Ertl's Java Molecular Editor (JME) are offered on several websites (e.g. click on Transfer to Java Editor on an NCI results page), and PubChem offers this functionality via its Sketcher tool.

Task:

  1. Navigate to PubChem.
  2. Follow the link to Chemical structure search (in the right-hand menu).
  3. Click on the 3D conformer tab and on the Launch button to launch the molecular editor in its own window.
  4. Sketch the structure of caffeine. I find the editor quite intuitive but if you need help, just use the Help button in the editor.
  5. Save the SMILES string of your compound.
  6. Also export your result in SMILES format as a file.


Translating SMILES to structure

Online services exist to translate SMILES to (idealized) coordinates[2].


Task:

  1. Access the online SMILES translation service at the NCI.
  2. Paste a caffeine SMILES string into the form, choose the PDB radio button, click on Translate and download your file.
  3. Load the molecule in VMD.


That is all.


 

Links and resources

 


Footnotes and references

  1. If you are curious how this works, ask me.
  2. It should in principle be possible to read SMILES strings and many other 2D format outputs directly in VMD. VMD has a Babel plugin for this purpose. In principle it works like this: (1) download and install Babel; (2) find where the babel executable has been installed by typing which babel into a terminal; (3) launch VMD and set the correct environment variable by typing set env(VMDBABELBIN) /usr/local/bin/babel (or whatever your path is) into the VMD Tcl console; (4) use the File → New Molecule dialogue to locate your SMILES file, then select SMILES as a "Convert from:" option (accessed via the last entry of the menu of file types in the Molecule File Browser window). However, on my machine this runs, but gives coordinates that are scaled too large by a factor of 10. Therefore, no bonds are recognized and drawn between the atoms. I don't have the time right now to troubleshoot this; if anyone has a solution, I would be happy to learn of it.


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.