Difference between revisions of "BIO Assignment Week 2"

From "A B C"
Line 19: Line 19:
 
Baker's yeast, ''Saccharomyces cerevisiae'', is perhaps the most important {{WP|Model_organism|model organism}}. It is a eukaryote that has been studied genetically and biochemically in great detail for many decades, and it is easily manipulated with high-throughput experimental methods. We will use information from this model organism to study the conservation of function and sequence in other fungi whose genomes have been completely sequenced; the assignments are an exercise in model-organism reasoning: the transfer of knowledge from one, well-studied organism to others.  
 
Baker's yeast, ''Saccharomyces cerevisiae'', is perhaps the most important {{WP|Model_organism|model organism}}. It is a eukaryote that has been studied genetically and biochemically in great detail for many decades, and it is easily manipulated with high-throughput experimental methods. We will use information from this model organism to study the conservation of function and sequence in other fungi whose genomes have been completely sequenced; the assignments are an exercise in model-organism reasoning: the transfer of knowledge from one, well-studied organism to others.  
  
This and the following assignments will revolve around a transcription factor that plays an important role in the regulation of the cell cycle: '''Mbp1''' is a key component of the MBF complex (Mbp1/Swi6). This complex regulates gene expression at the crucial G1/S-phase transition of the mitotic cell cycle and has been shown to bind to the regulatory regions of more than a hundred target genes.
+
This and the following assignments will revolve around a {{WP|Transcription factor|transcription factor}} that plays an important role in the regulation of the cell cycle: '''Mbp1''' is a key component of the MBF complex (Mbp1/Swi6). This complex regulates gene expression at the crucial G1/S-phase transition of the mitotic cell cycle and has been shown to bind to the regulatory regions of more than a hundred target genes. It is therefore a DNA binding protein that acts as a control switch for a key cellular process.
  
 
One would speculate that such central control machinery would be conserved in other fungi and it will be your task in these assignments to collect evidence whether related molecular components are present in some of the newly sequenced fungal genomes. Throughout the assignments we will use freely available tools to conduct bioinformatics investigations of sequences, structures and relationships that may ultimately answer questions such as:
 
One would speculate that such central control machinery would be conserved in other fungi and it will be your task in these assignments to collect evidence whether related molecular components are present in some of the newly sequenced fungal genomes. Throughout the assignments we will use freely available tools to conduct bioinformatics investigations of sequences, structures and relationships that may ultimately answer questions such as:
Line 50: Line 50:
 
  LVKKFEDNAKIHKYRRIIREGTEMNIEEVDSSLDVILQTLIANNNKNKGAEQIITISNANSHA
 
  LVKKFEDNAKIHKYRRIIREGTEMNIEEVDSSLDVILQTLIANNNKNKGAEQIITISNANSHA
  
I have highlighted the protein's <span style="color:#DD0000;">'''APSES''' domain</span> (also known as a KilA-N domain), the DNA binding element of the sequence. Of course, such coloring is not part of the actual FASTA file which contains only a header and sequence letters.  
+
I have highlighted the protein's <span style="color:#DD0000;">'''APSES''' domain</span> (also known as a {{WP|KilA-N domain}}), which is the DNA binding element of the sequence. Of course, such coloring is not part of the actual {{WP|FASTA_format|FASTA}} file which contains only a header and sequence letters.  
  
  
 
===Choosing YFO (Your Favourite Organism)===
 
===Choosing YFO (Your Favourite Organism)===
 +
  
 
The first task is to choose a species in which to conduct your explorations.
 
The first task is to choose a species in which to conduct your explorations.
 +
  
 
Many fungal genomes have been sequenced and more are added each year. For the purposes of the course assignments, we need a species
 
Many fungal genomes have been sequenced and more are added each year. For the purposes of the course assignments, we need a species
Line 63: Line 65:
  
  
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand for details" data-collapsetext="Collapse">To prepare such a list of species, I have searched the NCBI's RefSeq database for proteins whose sequences are similar to the APSES domain of Mbp1.   
+
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand for details" data-collapsetext="Collapse">To prepare such a list of species, I have searched the NCBI's RefSeq database for proteins whose sequences are similar to the APSES domain of Mbp1 and compiled the names of organisms that contain them.   
 
<div class="mw-collapsible-content">
 
<div class="mw-collapsible-content">
 
# Performed a [http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&BLAST_PROGRAMS=blastp&PAGE_TYPE=BlastSearch&SHOW_DEFAULTS=on&LINK_LOC=blasthome PSI BLAST] search with the Mbp1 APSES domain sequence shown above. Restricted the search to the '''refseq_protein database''' and an '''Entrez query limit of fungi''' (taxid: 4751). This search was iterated a few times and retrieves all sequence-similar proteins from the RefSeq database - the result contains examples from all fungal species.
 
# Performed a [http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&BLAST_PROGRAMS=blastp&PAGE_TYPE=BlastSearch&SHOW_DEFAULTS=on&LINK_LOC=blasthome PSI BLAST] search with the Mbp1 APSES domain sequence shown above. Restricted the search to the '''refseq_protein database''' and an '''Entrez query limit of fungi''' (taxid: 4751). This search was iterated a few times and retrieves all sequence-similar proteins from the RefSeq database - the result contains examples from all fungal species.
 
# In the header of the results page, there is a link to '''[Taxonomy reports]'''. This contains a list of all hits, sorted by species. We can see the number of hits, but not whether the hits came from a genome sequence or have been contributed ''ad hoc'' as individual sequences. In the latter case, not all of the species' APSES domain proteins might be included in the RefSeq database.
 
# In the header of the results page, there is a link to '''[Taxonomy reports]'''. This contains a list of all hits, sorted by species. We can see the number of hits, but not whether the hits came from a genome sequence or have been contributed ''ad hoc'' as individual sequences. In the latter case, not all of the species' APSES domain proteins might be included in the RefSeq database.
# To confirm the sequencing status, navigated to the table of [http://www.ncbi.nlm.nih.gov/sutils/genom_table.cgi organisms available for genomic BLAST]. Clicked on the link to the eukaryotic genomes tree. For each species name in the taxonomy report, confirm that the species' genome sequence is available, has been annotated, and the protein sequences have been included in RefSeq (in that table, species for which this is true are marked with a red '''P''').
+
# To confirm the sequencing status, navigated to the table of [http://www.ncbi.nlm.nih.gov/sutils/genom_table.cgi organisms available for genomic BLAST]. Clicked on the link to the eukaryotic genomes tree. For each species name in the taxonomy report, confirmed that the species' genome sequence is available, has been annotated, and the protein sequences have been included in RefSeq (in that table, species for which this is true are marked with a red '''P''').
 
# I included only species with at least three hits in the search results.
 
# I included only species with at least three hits in the search results.
 +
 +
This is a fairly typical example of gathering information across different data sources.
 
</div>
 
</div>
 
</div>
 
</div>
Line 74: Line 78:
 
&nbsp;
 
&nbsp;
 
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand for details" data-collapsetext="Collapse">
 
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand for details" data-collapsetext="Collapse">
I would like to assign species from this list randomly to each student, but I'd also like to avoid having to make a fresh table of assignments every year. Here's an idea: we could use the student ID (a '''unique identifier''') to pick entries from the list! Indeed, the functions provided in '''R''' can easily be used to randomly but reproducibly choose an element from a list. Essentially we can write a function that creates a many-faced die, with a piece of text&mdash;the species' names&mdash;on every face. It may fall differently for different student IDs, but it will fall the same way every time the same ID is encountered.
+
Next, I would like to assign species from this list randomly to each student, but I'd also like to avoid having to make a fresh table of assignments every year.
<div class="mw-collapsible-content">
+
 
 +
<div class="mw-collapsible-content"> Here's an idea: we could use the student ID (a '''unique identifier''') to pick entries from the list! Indeed, the functions provided in '''R''' can easily be used to randomly but reproducibly choose an element from a list. Essentially we can write a function that creates a many-faced die, with a piece of text&mdash;the species' names&mdash;on every face. It may fall differently for different student IDs, but it will fall the same way every time the same ID is encountered.
 
This makes use of the fact that "random" numbers generated by a computer algorithm aren't really random: they are "pseudorandom", generated by a deterministic algorithm. Such an algorithm takes a number&mdash;a ''seed''&mdash;and mangles it until the result has no recognizable connection to the seed. For practical purposes the result is indistinguishable from a random number, except that if we use the same seed, we will always get the same result. So a random pick can be programmed with the following steps:
 
This makes use of the fact that "random" numbers generated by a computer algorithm aren't really random: they are "pseudorandom", generated by a deterministic algorithm. Such an algorithm takes a number&mdash;a ''seed''&mdash;and mangles it until the result has no recognizable connection to the seed. For practical purposes the result is indistinguishable from a random number, except that if we use the same seed, we will always get the same result. So a random pick can be programmed with the following steps:
 
# Create a list
 
# Create a list
Line 158: Line 163:
 
}}
 
}}
  
 +
 +
&nbsp;
 
===Keeping a notebook===
 
===Keeping a notebook===
  
==== Documentation: Lab-Notebook Style ====
+
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand for details" data-collapsetext="Collapse">Consider it a part of your assignment to document your activities. This will be helpful, because the assignment is more or less integrated over the entire term, and later assignments will make use of earlier results. But it is also excellent practice for "real" research.
</div>
+
<div class="mw-collapsible-content">
 
+
You should write your documentation like a lab notebook&mdash;not a formal lab report, but a point-form record of your actual activities. Write such documentation as notes to your (future) self. Obviously, since much of the work will be done on the Web, an electronic notebook makes more sense than a paper notebook.
As one part of your assignment, you should submit documentation of your activities. Do this like you would write a lab notebook. This is not intended to be a formal lab report, but a point-form record of your actual activities. Write such documentation as notes to yourself.
 
  
 
For each task:
 
For each task:
*;Write a header.
+
*;Write a header and give it a unique number.
:: Please use the same header number and text of the assignment and do not change the sequence of tasks given in the assignment. Keep distinct tasks in separate paragraphs.
+
:: It is useful to refer to the header number in later text.
  
 
*;State the objective.
 
*;State the objective.
:: In one brief sentence, restate what this task is to achieve.
+
:: In one brief sentence, restate what your task is supposed to achieve.
  
 
*;Document the procedure.
 
*;Document the procedure.
:: Note what you have done, as concisely as possible. Give enough information so that anyone could reproduce unambiguously what you have done.
+
:: Note what you have done, as concisely as possible. Give enough information so that anyone could reproduce unambiguously what you have done&mdash; your future student, or even your future self.
  
 
*;Document your results.
 
*;Document your results.
Line 179: Line 185:
  
 
**'''Static data''' does not change over time and it may be sufficient to note a reference to the result. For example, there is no need to copy a genbank record into your documentation, it is sufficient to note the accession number or the GI number.
 
**'''Static data''' does not change over time and it may be sufficient to note a reference to the result. For example, there is no need to copy a genbank record into your documentation, it is sufficient to note the accession number or the GI number.
**'''Variable data''' can change over time. For example the results of a BLAST search depend on the sequences in the database. A list of similar structures may change as new structures get solved. In principle you want to record such data, to be able to reproduce at a later time what your conclusions were based on. But be '''selective''' in what you record. For example you should not paste the entire set of results of a BLAST search into your assignment, but only those matches that were important for your conclusions. '''Indiscriminate pasting of irrelevant information may cause deduction of marks.'''
+
**'''Variable data''' can change over time. For example the results of a BLAST search depend on the sequences in the database. A list of similar structures may change as new structures get solved. In principle you want to record such data, to be able to reproduce at a later time what your conclusions were based on. But be '''selective''' in what you record. For example you should not paste the entire set of results of a BLAST search into your assignment, but only those matches that were important for your conclusions. '''Indiscriminate pasting of irrelevant information will make your notes unusable.'''
 
**'''Analysis results'''
 
**'''Analysis results'''
 
::The results of sequence analyses, alignments etc. in general get recorded in your documentation. Again: be selective. Record what is important.
 
::The results of sequence analyses, alignments etc. in general get recorded in your documentation. Again: be selective. Record what is important.
  
 
*;Note your conclusions.
 
*;Note your conclusions.
::An analysis is not complete unless you conclude something from the results. Are two sequences likely homologues, or not? Does your protein contain a signal-sequence or does it not? Is a binding site conserved, or not? The analysis gives you the data, in your '''conclusion''' you provide the interpretation of what the data means '''in the context of your objective'''. Sometimes we will ask you to elaborate on an analysis and conclusion. But this does not mean that when we do not ask, you don't need to interpret your data.
+
::An analysis is not complete unless you conclude something from the results. (Remember what we said about "Cargo Cult Science". If there is no conclusion possible, your activities are quite pointless.) Are two sequences likely homologues, or not? Does your protein contain a signal-sequence or does it not? Is a binding site conserved, or not? The analysis gives you the data, in your '''conclusion''' you provide the interpretation of what the data means '''in the context of your objective'''. Sometimes your assignment task will ask you to elaborate on an analysis and conclusion. But this does not mean that when the assignment does not explicitly mention it, you don't need to interpret your data.
 
 
 
 
==== Preparation of images ====
 
</div>
 
 
 
 
 
:Don't paste uncompressed screendumps into your assignment. Save images separately in a compressed file format. Then use the '''Insert &rarr; Picture &rarr; From File ...''' function of MSWord to insert the image into your file.
 
  
 +
*;Prepare your images well
 +
::Don't paste uncompressed screendumps into your assignment. Save images in a compressed file format. Then e.g. if you are using MSWord documents, use the '''Insert &rarr; Picture &rarr; From File ...''' function of MSWord to insert the image into your file.
  
;Image types.
+
*;Use the right image types.
:In principle, images can be stored uncompressed as <code>.tiff</code> or <code>.bmp</code>, or compressed as <code>.gif</code> or <code>.jpg</code> or <code>.png</code>. [http://en.wikipedia.org/wiki/JPEG <code>.gif</code>] is useful for images with large, monochrome areas and sharp, high-contrast edges due to the LZW compression algorithm it uses; '''[http://en.wikipedia.org/wiki/JPEG <code>.jpg</code> or <code>.jpeg</code>] is preferred for images with shades and halftones such as the structure views required in the course assignments,''' it has excellent application support and is the most versatile general purpose image file format currently in use; [http://en.wikipedia.org/wiki/Tagged_Image_File_Format <code>.tiff</code> or <code>.tif</code>] is preferred to archive master copies of images in a lossless fashion, use LZW compression for TIFF files if your system/application supports it; The [http://en.wikipedia.org/wiki/Portable_Network_Graphics <code>.png</code>] format is an [http://en.wikipedia.org/wiki/Open_source open source] alternative for lossless, compressed images. Application support is growing but still variable. [http://en.wikipedia.org/wiki/BMP_file_format <code>.bmp</code>] is not preferred for really anything, it is bloated in its (default) uncompressed form and primarily used only because it is simple to code.
+
::In principle, images can be stored uncompressed as <code>.tiff</code> or <code>.bmp</code>, or compressed as <code>.gif</code> or <code>.jpg</code> or <code>.png</code>. {{WP|GIF|<code>.gif</code>}} is useful for images with large, monochrome areas and sharp, high-contrast edges due to the LZW compression algorithm it uses. {{WP|JPEG|'''<code>.jpg</code>'''}} (or <code>.jpeg</code>) is preferred for images with shades and halftones such as the structure views you should prepare for your course assignments; JPEG has excellent application support and is the most versatile general purpose image file format currently in use. [http://en.wikipedia.org/wiki/Tagged_Image_File_Format <code>.tiff</code> or <code>.tif</code>] is preferred to archive master copies of images in a lossless fashion; use LZW compression for TIFF files if your system/application supports it. The [http://en.wikipedia.org/wiki/Portable_Network_Graphics <code>.png</code>] format is an [http://en.wikipedia.org/wiki/Open_source open source] alternative for lossless, compressed images; application support is growing but still variable. [http://en.wikipedia.org/wiki/BMP_file_format <code>.bmp</code>] is not preferred for really anything: it is bloated in its (default) uncompressed form and primarily used only because it is simple to code.
  
 
;Image dimensions and resolution
 
;Image dimensions and resolution

Revision as of 00:08, 21 September 2012

Assignment for Week 2
Scenario, Databases, Search and Retrieve

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.



 

The Scenario

Baker's yeast, Saccharomyces cerevisiae, is perhaps the most important model organism. It is a eukaryote that has been studied genetically and biochemically in great detail for many decades, and it is easily manipulated with high-throughput experimental methods. We will use information from this model organism to study the conservation of function and sequence in other fungi whose genomes have been completely sequenced; the assignments are an exercise in model-organism reasoning: the transfer of knowledge from one, well-studied organism to others.

This and the following assignments will revolve around a transcription factor that plays an important role in the regulation of the cell cycle: Mbp1 is a key component of the MBF complex (Mbp1/Swi6). This complex regulates gene expression at the crucial G1/S-phase transition of the mitotic cell cycle and has been shown to bind to the regulatory regions of more than a hundred target genes. It is therefore a DNA binding protein that acts as a control switch for a key cellular process.

One would speculate that such central control machinery would be conserved in other fungi and it will be your task in these assignments to collect evidence whether related molecular components are present in some of the newly sequenced fungal genomes. Throughout the assignments we will use freely available tools to conduct bioinformatics investigations of sequences, structures and relationships that may ultimately answer questions such as:

  • Do related proteins exist in other organisms?
  • What functional features can we detect in the related proteins?
  • Do we have evidence that they may bind to similar sequence motifs?
  • Do we believe they may function in a similar way?

Task:
Access the information page on Mbp1 at the Saccharomyces Genome Database and read the summary paragraph on the protein's function!

(If you would like to brush up on the concepts mentioned above, you could study the corresponding chapter in Lodish's Molecular Cell Biology and/or read Nobel laureate Paul Nurse's review of the key concepts of the eukaryotic cell cycle. It is not strictly necessary to understand the details of the yeast cell cycle to complete the assignments, but it's obviously more satisfying to work with concepts that actually make some sense.)

For reference, this is the FASTA formatted sequence of Mbp1 from Saccharomyces cerevisiae:

>gi|6320147|ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c]
MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGF
GKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDGSASPPPAPKHHHASKVDRKKAIRSASTSAIMET
KRNNKKAEENQFQSSKILGNPTAAPRKRGRPVGSTRGSRRKLGVNLQRSQSDMGFPRPAIPNSSISTTQL
PSIRSTMGPQSPTLGILEEERHDSRQQQPQQNNSAQFKEIDLEDGLSSDVEPSQQLQQVFNQNTGFVPQQ
QSSLIQTQQTESMATSVSSSPSLPTSPGDFADSNPFEERFPGGGTSPIISMIPRYPVTSRPQTSDINDKV
NKYLSKLVDYFISNEMKSNKSLPQVLLHPPPHSAPYIDAPIDPELHTAFHWACSMGNLPIAEALYEAGTS
IRSTNSQGQTPLMRSSLFHNSYTRRTFPRIFQLLHETVFDIDSQSQTVIHHIVKRKSTTPSAVYYLDVVL
SKIKDFSPQYRIELLLNTQDKNGDTALHIASKNGDVVFFNTLVKMGALTTISNKEGLTANEIMNQQYEQM
MIQNGTNQHVNSSNTDLNIHVNTNNIETKNDVNSMVIMSPVSPSDYITYPSQIATNISRNIPNVVNSMKQ
MASIYNDLHEQHDNEIKSLQKTLKSISKTKIQVSLKTLEVLKESSKDENGEAQTNDDFEILSRLQEQNTK
KLRKRLIRYKRLIKQKLEYRQTVLLNKLIEDETQATTNNTVEKDNNTLERLELAQELTMLQLQRKNKLSS
LVKKFEDNAKIHKYRRIIREGTEMNIEEVDSSLDVILQTLIANNNKNKGAEQIITISNANSHA

I have highlighted the protein's APSES domain (also known as a KilA-N domain), which is the DNA binding element of the sequence. Of course, such coloring is not part of the actual FASTA file which contains only a header and sequence letters.
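
To make the "header plus sequence letters" structure concrete, here is a minimal R sketch that splits a FASTA record into its two parts. It uses only the first two sequence lines of the record above as a shortened excerpt, and the variable names (fasta, isHeader, header, sequence) are illustrative assumptions, not part of the assignment.

```r
# A shortened excerpt of the Mbp1 record shown above (illustration only):
# one header line, followed by the first two lines of sequence.
fasta <- c(
    ">gi|6320147|ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c]",
    "MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGF",
    "GKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDGSASPPPAPKHHHASKVDRKKAIRSASTSAIMET"
)

# The header is the line that starts with ">"
isHeader <- grepl("^>", fasta)
header <- sub("^>", "", fasta[isHeader])

# All remaining lines are sequence; join them into one string
sequence <- paste(fasta[!isHeader], collapse = "")

print(header)
print(nchar(sequence))   # number of residues in the excerpt
```

Line breaks within the sequence carry no meaning in FASTA, which is why the sequence lines are simply concatenated.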


Choosing YFO (Your Favourite Organism)

The first task is to choose a species in which to conduct your explorations.


Many fungal genomes have been sequenced and more are added each year. For the purposes of the course assignments, we need a species

  • that has transcription factors with APSES domains;
  • whose genome has been completely sequenced;
  • for which records exist in the RefSeq database, NCBI's unique sequence collection.


To prepare such a list of species, I have searched the NCBI's RefSeq database for proteins whose sequences are similar to the APSES domain of Mbp1 and compiled the names of organisms that contain them.

 

Next, I would like to assign species from this list randomly to each student, but I'd also like to avoid having to make a fresh table of assignments every year.

Here is R code to accomplish this:

Task:


  • Read, try to understand and then execute the following R-code.
pickSpecies <- function(ID) {
	# this function randomly picks a fungal species
	# from a list. It is seeded by a student ID. Therefore
	# the pick is random, but reproducible.
	
	# first, define a list of species:
	Species <- c(
		"Ajellomyces dermatitidis (AJEDE)",
		"Arthroderma gypseum (ARTGY)",
		"Ashbya gossypii (ASHGO)",
		"Aspergillus clavatus (ASPCL)",
		"Aspergillus flavus (ASPFL)",
		"Botryotinia fuckeliana (BOTFU)",
		"Candida glabrata (CANGL)",
		"Chaetomium globosum (CHAGL)",
		"Clavispora lusitaniae (CLALU)",
		"Coccidioides immitis (COCIM)",
		"Coprinopsis cinerea (COPCI)",
		"Debaryomyces hansenii (DEBHA)",
		"Gibberella zeae (GIBZE)",
		"Kluyveromyces lactis (KLULA)",
		"Komagataella pastoris (KOMPA)",
		"Laccaria bicolor (LACBI)",
		"Lachancea thermotolerans (LACTH)",
		"Lodderomyces elongisporus (LODEL)",
		"Magnaporthe oryzae (MAGOR)",
		"Malassezia globosa (MALGL)",
		"Meyerozyma guilliermondii (MEYGU)",
		"Nectria haematococca (NECHA)",
		"Neosartorya fischeri (NEOFI)",
		"Paracoccidioides brasiliensis (PARBR)",
		"Penicillium chrysogenum (PENCH)",
		"Puccinia graminis (PUCGR)",
		"Pyrenophora teres (PYRTE)",
		"Scheffersomyces stipitis (SCHST)",
		"Schizophyllum commune (SCHCO)",
		"Phaeosphaeria nodorum (PHANO)",
		"Schizosaccharomyces japonicus (SCHJA)",
		"Sclerotinia sclerotiorum (SCLSC)",
		"Talaromyces stipitatus (TALST)",
		"Trichophyton rubrum (TRIRU)",
		"Uncinocarpus reesii (UNCRE)",
		"Vanderwaltozyma polyspora (VANPO)",
		"Verticillium albo-atrum (VERAL)",
		"Yarrowia lipolytica (YARLI)",
		"Zygosaccharomyces rouxii (ZYGRO)"
		)
	l <- length(Species)    # number of elements in the list
	set.seed(ID)            # seed the random number generator
	                        # with the student ID
	i <- runif(1, 0, 1)     # pick one random number between 0 and 1
	i <- l * i              # multiply with number of elements
	i <- ceiling(i)         # round up to nearest integer
	choice <- Species[i]    # pick the i'th element from list
	return(choice)
}
  • Execute the function pickSpecies() with your student ID as its parameter. Example:
 > pickSpecies(991234567)
 [1] "Candida glabrata (CANGL)"
  • Note down the species name and its five letter abbreviation. Use this species whenever this or future assignments refer to YFO.
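
The reason the pick is reproducible can be sketched in a few lines of R. The helper pickWithSeed() and the vector faces below are hypothetical names introduced only for this illustration; the sketch follows the same seed, draw, scale and round-up steps as pickSpecies(), but on a six-faced "die" rather than the species list.

```r
# Hypothetical helper (not part of the assignment code):
# seeds the generator, then picks one element of x.
pickWithSeed <- function(x, seed) {
    set.seed(seed)                            # same seed -> same number stream
    i <- ceiling(length(x) * runif(1, 0, 1))  # scale to 1..length(x), round up
    x[i]
}

faces <- c("one", "two", "three", "four", "five", "six")

a <- pickWithSeed(faces, 991234567)
b <- pickWithSeed(faces, 991234567)
print(identical(a, b))   # TRUE: the same seed always gives the same pick
```

Different seeds will often, but not always, give different picks: with only a limited number of faces, two different IDs can land on the same one.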


 

Keeping a notebook

Consider it a part of your assignment to document your activities. This will be helpful, because the assignment is more or less integrated over the entire term, and later assignments will make use of earlier results. But it is also excellent practice for "real" research.