Difference between revisions of "BIO Assignment 2 2011"

From "A B C"
 
(23 intermediate revisions by 2 users not shown)
Line 1: Line 1:
<!-- div style="padding: 5px; background: #FF4560;  border:solid 2px #000000;">
+
{{Template:Active}}
'''Note!'''
+
<!-- {{Template:Inactive}} -->
This assignment is currently inactive. Major and minor unannounced changes may be made at any time.
+
 
</div -->
+
 
 
&nbsp;
 
&nbsp;
 
 
&nbsp;
 
&nbsp;
  
Line 13: Line 12:
  
 
<div style="padding: 5px; background: #A6AFD0;  border:solid 1px #AAAAAA; font-size:200%;font-weight:bold;">
 
<div style="padding: 5px; background: #A6AFD0;  border:solid 1px #AAAAAA; font-size:200%;font-weight:bold;">
Assignment 2 - Search, retrieve and annotate
+
Assignment 2 (last: 2011) - Search, retrieve and annotate
 
</div>
 
</div>
  
'''Please note: This assignment is currently active. All changes will be announced on the course mailing list.'''
+
&nbsp;
 +
&nbsp;
  
  
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+
{{Template:Preparation|
Introduction
+
care=Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which important aspects have simply been overlooked and marks are unnecessarily lost. Sadly, we always get assignments back in which important aspects have simply been overlooked and marks are unnecessarily lost. If you did not notice that the above sentence was repeated, you are not reading carefully enough.|
</div>
+
num=2|
Baker's yeast, ''Saccharomyces cerevisiae'', is perhaps the most important model organism since it is a eukaryote that has been studied genetically and biochemically in great detail for many decades and it is easily manipulated with high-throughput experimental methods. We will use information from this model organism to study the conservation of function and sequence in other fungi whose genomes have been completely sequenced; the assignments are an exercise in model-organism reasoning: the transfer of knowledge from one, well-studied organism to others.
+
ord=second|
 +
due = Monday, October 24 at 12:00 noon (before the quiz)}}
  
This and the following assignments will revolve around a transcription factor that plays an important role in the regulation of the cell cycle: '''Mbp1''' is a key component of the MBF complex (Mbp1/Swi6). It regulates gene expression at the crucial G1/S-phase transition of the mitotic cell cycle and has been shown to bind to the regulatory regions of more than a hundred target genes.
 
  
One would speculate that such central control machinery would be conserved in other fungi and it will be your task in these assignments to collect evidence whether related molecular machinery is present in some of the newly sequenced fungal genomes. Throughout the assignments we will use freely available tools to conduct bioinformatics investigations of questions such as:
+
;Your documentation for the procedures you follow in this assignment will be worth 1 mark.
*Do homologous proteins exist in other organisms?
 
*Do we believe these may bind to similar sequence motifs?
 
*Do we believe they may function in a similar way?
 
*Do other organisms appear to have related systems?
 
  
Access the [http://db.yeastgenome.org/cgi-bin/locus.pl?locus=mbp1 information page on Mbp1] at the ''Saccharomyces'' Genome Database and read the summary paragraph on the protein's function!
 
  
(If you would like to brush up on the concepts mentioned above, you could study the corresponding chapter in [http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=mcb.chapter.3432 Lodish's Molecular Cell Biology]. It is not strictly necessary to understand the details of the yeast cell-cycle to complete the assignments, but recommended, since it's obviously more fun to work with concepts that actually make some sense.)
+
&nbsp;
 +
&nbsp;
  
In this particular assignment you will go on a search and retrieve mission for information and annotation of Mbp1 homologues in a fungal genome, using common public databases and Web resources.
 
  
  
 
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
 
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
Preparation, submission and due date
+
Introduction
 
</div>
 
</div>
 +
Baker's yeast, ''Saccharomyces cerevisiae'', is perhaps the most important [http://en.wikipedia.org/wiki/Model_organism model organism]. It is a eukaryote that has been studied genetically and biochemically in great detail for many decades, and it is easily manipulated with high-throughput experimental methods. We will use information from this model organism to study the conservation of function and sequence in other fungi whose genomes have been completely sequenced; the assignments are an exercise in model-organism reasoning: the transfer of knowledge from one, well-studied organism to others.
  
Read carefully. Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which important aspects have simply been overlooked and marks are unnecessarily lost. Sadly, we always get assignments back in which important aspects have simply been overlooked and marks are unnecessarily lost. If you did not notice that the above sentence was repeated, you are not reading carefully enough.
+
This and the following assignments will revolve around a transcription factor that plays an important role in the regulation of the cell cycle: '''Mbp1''' is a key component of the MBF complex (Mbp1/Swi6). This complex regulates gene expression at the crucial G1/S-phase transition of the mitotic cell cycle and has been shown to bind to the regulatory regions of more than a hundred target genes.
  
You are encouraged to discuss the assignments on the mailing list with any questions you may have. This is not a test. However, you must do the assignments yourself and clearly attribute and label all sources of information. Plagiarism will be considered academic misconduct, and this goes specifically for material that is copied from previous years' assignments.
+
One would speculate that such central control machinery would be conserved in other fungi and it will be your task in these assignments to collect evidence whether related molecular machinery is present in some of the newly sequenced fungal genomes. Throughout the assignments we will use freely available tools to conduct bioinformatics investigations of questions such as:
 +
*What functional features can we detect in Mbp1?
 +
*Do homologous proteins exist in other organisms?
 +
*Do we believe these homologues may bind to similar sequence motifs?
 +
*Do we believe they may function in a similar way?
 +
*Do other organisms appear to have related cell-cycle control systems?
  
  
 +
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 +
*Access the [http://db.yeastgenome.org/cgi-bin/locus.pl?locus=mbp1 information page on Mbp1] at the ''Saccharomyces'' Genome Database and read the summary paragraph on the protein's function!
 +
</div>
  
Prepare a Microsoft Word document with a title page that contains:
+
(If you would like to brush up on the concepts mentioned above, you could study the corresponding chapter in [http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=mcb.chapter.3432 Lodish's Molecular Cell Biology] and/or read Nobel laureate [http://www.cumc.columbia.edu/dept/eukaryotic/nurse.pdf Paul Nurse's review (pdf)] of the key concepts of the eukaryotic cycle. It is not strictly necessary to understand the details of the yeast cell-cycle to complete the assignments, but recommended, since it's obviously more fun to work with concepts that actually make some sense.)
*your full name
 
*your Student ID
 
*your e-mail address
 
*the organism name you have been assigned (see below)
 
  
Follow the steps outlined below. You are encouraged to  write your answers in short answer form or point form, '''like you would document an analysis in a laboratory notebook'''. However, you must
+
In this particular assignment you will go on a search and retrieve mission for information on yeast Mbp1, using common public databases and Web resources.
*document what you have done,
 
*note what Web sites and tools you have used,
 
*paste important data sequences, alignments, information etc.
 
  
'''If you do not document the process of your work, we will deduct marks.'''  Try to be concise, not wordy! Use your judgement: are you giving us enough information so we could exactly reproduce what you have done? If not, we will deduct marks.  Avoid RTF and unnecessary formatting. Do not paste screendumps. Keep the size of your submission below 1.5 MB.
 
  
Write your answers into separate paragraphs and give each its title. Save your document with a filename of:
+
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
<code>A2_{family name}.{given name}.doc</code>
+
==Search==
<small>(for example my first assignment would be named: A2_steipe.boris.doc - don't include the brackets this time, and don't switch the order of your given name and family name please!)</small>
+
</div>
  
Finally e-mail the document to [mailto:boris.steipe@utoronto.ca Boris Steipe] before the due date.
 
  
Your document must not contain macros. Please turn off and/or remove all macros from your Word document; we will disable macros, since they pose a security risk.
 
  
With the number of students in the course, we have to economize on processing the assignments. '''Thus we will not accept assignments that are not prepared as described above.''' If you have technical difficulties, contact me.
+
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
+
==Retrieve==
'''The due date for the assignment is Tuesday, October 9. at 10:00 in the morning.'''
 
 
 
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
 
Grading
 
 
</div>
 
</div>
  
Don't wait until the last day to find out there are problems! Assignments that are received past the due date will have one mark deducted and an additional mark for every full twelve hour period past the due date. Assignments received more than 5 days past the due date will not be assessed. If you need an extension, you '''must''' arrange this before the due date.
 
  
Marks are noted below in the section headings for each of the tasks. A total of 10 marks will be awarded if your assignment answers all of the questions. A total of 2 bonus marks (up to a maximum of 10 overall) can be awarded for particularly interesting findings or insightful comments. A total of 2 marks can be subtracted for lack of form or for glaring errors. The marks you receive will
+
Much useful information on yeast Mbp1 is compiled at the [http://db.yeastgenome.org/cgi-bin/locus.pl?locus=mbp1 SGD information page on Mbp1]. However, we don't always have the luxury of such precompiled information. Let's look at the protein and its features "the traditional way".
* count directly towards your final marks at the end of term, for BCH441 (undergraduates), or
 
* be divided by two for BCH1441 (graduates).
 
  
&nbsp;
 
&nbsp;
 
  
 +
<div style="padding: 5px; background: #EEEEEE;  border:solid 1px #AAAAAA;">
 +
*Navigate to the NCBI homepage (you probably have bookmarked it anyway) and enter <code>Mbp1 AND "saccharomyces cerevisiae"[organism]</code> as an Entrez query.
 +
*Click on '''Protein''' and find the RefSeq record for the protein sequence.
 +
*From the NCBI RefSeq record, obtain a FASTA sequence of the protein and paste it into your assignment.
 +
</div>
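The same search and retrieval can also be scripted through the NCBI E-utilities. The following is only a minimal sketch, not the required procedure: it assumes the public <code>esearch.fcgi</code> and <code>efetch.fcgi</code> endpoints, the helper names are made up for illustration, and it simply takes the first hit, which is not necessarily the RefSeq record. You still have to inspect the results and document what you did.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities (assumed to be reachable from your machine).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="protein", retmax=20):
    """Build an ESearch URL for an Entrez query string."""
    params = {"db": db, "term": term, "retmode": "json", "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_fasta_url(uid, db="protein"):
    """Build an EFetch URL that returns one record as FASTA text."""
    params = {"db": db, "id": uid, "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def parse_fasta(text):
    """Return (header, sequence) for the first record in a FASTA string."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; stop after the first
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    return header, "".join(chunks)


def fetch_first_fasta(term):
    """Network call: run the query, then fetch FASTA for the first hit."""
    with urlopen(esearch_url(term)) as response:
        ids = json.load(response)["esearchresult"]["idlist"]
    if not ids:
        return None
    with urlopen(efetch_fasta_url(ids[0])) as response:
        return parse_fasta(response.read().decode())


if __name__ == "__main__":
    # The Entrez query from the task above.
    query = 'Mbp1 AND "saccharomyces cerevisiae"[organism]'
    print(fetch_first_fasta(query))
```

If you use a script like this, paste the query string and the identifier of the record you finally chose into your documentation, just as you would for the Web interface.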
  
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
==Retrieve==
 
</div>
 
&nbsp;
 
&nbsp;
 
  
 +
There are several sources for functional domain annotations of proteins. The NCBI has the [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml Conserved Domain Database]; in Europe, the [http://smart.embl-heidelberg.de/ SMART database] provides such annotations. In terms of domains, both resources are very comparable, but SMART also analyses more general features such as low-complexity sequences and coiled coils. In order to use SMART, however, we need the '''Uniprot accession number''' that corresponds to the RefSeq identifier. In a rational world, one would wish that such important crossreferences would simply be provided by the NCBI ... well, we have been wishing this for many years now. Fortunately ID-mapping services exist.
  
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+
<div style="padding: 5px; background: #EEEEEE;">
=== The Genome (1 mark)===
+
*Navigate to the [http://www.uniprot.org/?tab=mapping UniProt ID-Mapping service]. Enter the RefSeq identifier for the yeast Mbp1 protein and retrieve the corresponding UniProtKB Accession number. If this does not work, try the same mapping at the [http://pir.georgetown.edu/pirwww/search/idmapping.shtml PIR ID-mapping service]. Note the Uniprot accession number you find. (Should this work equally on both sites?)
 
</div>
 
</div>
  
Access the [[Organism_list_2007| Organism list]] to retrieve an organism name for this assignment. Navigate to the NCBI homepage at http://www.ncbi.nlm.nih.gov . Enter the systematic name into the search field, select the '''taxonomy database''' and identify the organism that you have been assigned.
 
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*Record the taxonomy ID for the species or strain(s) that are associated with this organism name
 
</div>
 
  
Return to the NCBI home-page and navigate to the "Genomic Biology" section, continue with the link for '''Fungi''' under the Genome Projects Database. This should take you to a tabular view of ongoing and completed fungal genome sequencing projects. Find your organism name in this table. There may be one or more sequencing projects associated with this organism.
+
Now navigate to [http://www.uniprot.org '''Uniprot'''], enter the ID you have found into the search field and select [Sequence Clusters(UniRef)] as the database to search in. There should be two sequences in the '''[UniRef100 ... (100% identical)]''' cluster. Compare them. One of them is a highly annotated Swiss-Prot record, the other is practically unannotated data that has been imported from a "third party" to UniProt. Unfortunately, that one is the sequence that the ID mapping service had found. No cross-references to the NCBI are included with Swiss-Prot records, nor do NCBI RefSeq records cross-reference NCBI holding. I consider this a sorry state of affairs. Therefore most of us actually run BLAST searches to find equivalent sequences in other databases and this is the most wasteful way imaginable to address the problem.
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*Decide  which project is the most suitable one for analysis and record your decision. Report the strain and Taxonomy-ID for this organism.
 
  
 +
<div style="padding: 5px; background: #EEEEEE;">
 +
*Note down the SwissProt ID and the UniProtKB Accession Number for yeast Mbp1.
 
</div>
 
</div>
  
If you can't identify the criteria that make one project more or less useful for your task, or you don't know which ones are more or less important, you are of course welcome to discuss your questions on the list.
 
  
Click on the organism name to navigate to the Genome Project information page.
+
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
==Annotate==
*Comment briefly on the status of the data you are working with: include information such as whether the entire genome is available or only a partial sequence? How many chromosomes does this genome have? What is the status of its genome assembly and annotation? Has the mitochondrial genome been sequenced as well?
 
 
</div>
 
</div>
 
&nbsp;
 
&nbsp;
 +
&nbsp;
 +
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
  
=== APSES domain transcription factors (1 mark)===
+
=== ''Saccharomyces cerevisiae'' Mbp1 - domain annotations===
 
</div>
 
</div>
  
Mbp1 is a large multidomain protein; it binds DNA through a small domain called the APSES domain and many organisms have more than one transcription factor that has a domain homologous to other APSES domains. In the assignments, we will analyse how these APSES domains have evolved, to obtain a perspective on the evolution of regulatory systems in general. Accordingly we should first define an APSES domain sequence and then use it to find all its relatives in each target organism.
 
  
Use the NCBI Entrez system to search for the string "apses" in the "Conserved Domains" database and access the entry for the APSES domain. You should find a number of aligned sequences on that page, each with their own GI identifier.  
+
Now we can analyse Mbp1's domain in SMART, and use this information to annotate the sequence in detail.
 +
 
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
<div style="padding: 5px; background: #EEEEEE;">
*Identify the two sequences that come from ''Saccharomyces cerevisiae'' (the Mbp1 and Swi4 APSES domains).
+
*Navigate to the [http://smart.embl-heidelberg.de/ SMART database], use "Normal Mode", then in the "Sequence Analysis" form enter the UniProtKB yeast Mbp1 accession number, check the checkboxes for the additional analyses that SMART offers and carefully review the results. <small>By that I mean I might ask at some point what a particular section of the result means and how it is interpreted re. its biological significance. If parts are not obvious to you: &rarr; mailing list.</small>
 +
*In your assignment, in the actual, full-length sequence, highlight or otherwise clearly identify the features that SMART has annotated. Minimally you should include '''KilA-N''', '''low complexity''', '''coils''', and '''Ankyrin domains''', taking the sequence coordinates from the SMART annotations. Make sure you highlight the whole length of the feature and get the boundaries right. <small>If features overlap, you could e.g. highlight one red, the other green and the overlap yellow. Or underline one, format the other in italics... And don't forget to label '''what''' you have highlighted.</small>
 
</div>
 
</div>
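Getting feature boundaries right is easier if you let a few lines of code place the markers for you. The sketch below is one possible way to do this, not a required one: it assumes 1-based, inclusive coordinates as SMART reports them, and the sequence and the feature coordinates in the example are invented placeholders, '''not''' the real Mbp1 annotations.

```python
def feature_track(length, start, end, mark="*"):
    """Return a string of `length` characters with `mark` at the 1-based,
    inclusive positions start..end and '-' everywhere else."""
    if not (1 <= start <= end <= length):
        raise ValueError("feature coordinates out of range")
    return "-" * (start - 1) + mark * (end - start + 1) + "-" * (length - end)


def overlap(a, b):
    """Return the (start, end) range shared by two features, or None."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return (start, end) if start <= end else None


if __name__ == "__main__":
    # Placeholder sequence and coordinates, for illustration only.
    sequence = "MSNQIYSARYSGVDVYEFIHSTGSIMKRKK"
    features = {
        "feature A (hypothetical)": (3, 12),
        "feature B (hypothetical)": (10, 20),
    }
    print(sequence)
    for name, (start, end) in features.items():
        # One marker line per feature, aligned under the sequence.
        print(feature_track(len(sequence), start, end), name)
    print("overlap:", overlap(features["feature A (hypothetical)"],
                              features["feature B (hypothetical)"]))
```

The marker lines show at a glance where features overlap, which tells you where you need a third colour or a second kind of formatting in your document.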
  
Access the Genbank record for the ''Saccharomyces cerevisiae'' Mbp1 protein.
 
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
*Obtain the FASTA sequence for the whole, full-length protein, save it and paste it into your assignment.
 
</div>
 
 
 
Working from the APSES domain alignment in CDD, define the sequence of the entire APSES domain in the Mbp1 protein.
 
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
=== APSES (KilA-N) domains ===
*Save the sequence of the APSES domain in FASTA format (i.e. give it with an appropriate header) and paste it into your assignment. Comment if and how it is different from the sequence you find on the CDD page.
 
 
</div>
 
</div>
  
Navigate back to the Genome Project Database table for fungi and click on the '''"B"''' link next to your organism. This takes you to a page with a BLAST search form. Run a BLAST search with the full-length ''Saccharomyces cerevisiae'' Mbp1 protein sequence against the proteins of your organism only!
 
  
 +
As you see from the annotations, Mbp1 is a large protein comprising several domains; it binds DNA through a small domain called the APSES domain (this is reported as the KilA-N domain superfamily). Many organisms have transcription factors that have domains homologous to other APSES domains. Since we are interested in related proteins, and all functional relatives would be expected to share such a DNA binding domain, we should define this domain in more detail in order to be able to use it later to search for homologous proteins in diverse organisms.
 +
 
  
<!-- (In the case of Schizosaccharomyces pombe, it appears the link is not included on that page but you have to go through the main BLAST page and use the drop down selection in the "Options" field to limit your query to S. pombe). -->
+
Use the NCBI Entrez system to search for the string "apses" in the "Conserved Domains" database and access the entry for the KilA-N domain superfamily. You should find 10 aligned sequences on that page, each with their own GI identifier. To find the actual boundaries of the domain annotation, do the following:
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*Record the parameters you have used for the search and the relevant search results.
 
  
*List the accession numbers and names of all putative homologues. How many are there? How many (if any) do you expect? What do you conclude ?
+
<div style="padding: 5px; background: #EEEEEE;">
 +
*Navigate back to the RefSeq Protein record <small>(You did record that link in your documentation, right?).</small> In the right-hand menu, find the section "Related information" and click on CDD Search results.
 +
* Click on the colored box for one of the annotated domains to appreciate the level of detailed information that is available here.
 +
*Back in the CDD annotation window, click on the [+] next to KilA-N super family to access the actual alignment of the PFAM domain definition with the MBP1 sequence.
 +
*Check and record (e.g. by highlighting) whether the NCBI and the SMART definition of the APSES domain in Mbp1 coincide exactly. If they don't, explain briefly what that means.
 +
*Make sure you understand how the sequences displayed on the CDD page and the actual domain sequences differ. <small>Hint: not all sequences are displayed in their full-length.</small>
 
</div>
 
</div>
  
Run a second BLAST search using only the ''Saccharomyces cerevisiae'' Mbp1 APSES domain sequence.
 
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
*Record the parameters you have used for the search and the relevant search results.
 
  
*List the accession numbers and names of all putative homologues. How many are there? Are the results different from your previous search? How? What do you conclude ?
+
=== APSES domain structure ===
 
</div>
 
</div>
  
&nbsp;
+
We can expect that the structures of all homologous APSES domains should be similar, i.e. if the structure of one is known, we should be able to conclude the approximate three-dimensional structure of any APSES domain. Indeed, structural information ''is'' available for APSES domains!
  
;(Please contact me immediately in case you cannot find any significant alignments - you cannot continue with the assignment if you get stuck at this point.)
 
  
&nbsp;
+
There are several possible approaches you could pursue to identify candidate files:
  
 +
*there may be cross references to structures/PDB on any of the pages you have visited;
 +
*you could search the PDB itself for the keyword Mbp1;
 +
*you could use the domain sequence you have defined and:
 +
**BLAST it against the database of PDB sequences (select it as an option on the BLAST form);
 +
**perform an "advanced search" for similar sequences on the PDB Website.
  
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
  
==Align==
+
In any case, you should find more than one coordinate file that contains MBP1-like structures. Therefore you need to make a choice which one is the "best" for further analysis. Your choice of the "best" file for study could be based on:
 +
*experimental method (X-ray or NMR);
 +
*quality of the structure (resolution, refinement);
 +
*size (coverage) of the structure (number of amino acids for which structure coordinates have been determined).
 +
 
 +
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 +
*Record how you have identified the file you consider the "best" and what criteria you have used to define whether it is better suited for analysis than others.
 
</div>
 
</div>
&nbsp;
+
 
&nbsp;
 
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
=== Sequences and accession numbers (1 mark)===
+
 
 +
=== DNA binding site ===
 
</div>
 
</div>
  
Retrieve the entire protein sequences for those significant hits that you have found with the APSES domain search in your organism. The easiest way to do this is to click on the links on the BLAST results page. The NCBI does most of their internal cross-referencing with GI numbers, however these are less useful for crossreferencing to other databases.
 
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
The Mbp1 APSES domain has been shown to bind to DNA and the residues involved in DNA binding have been characterized ([http://www.ncbi.nlm.nih.gov/pubmed/10747782 Taylor ''et al.'' (2000) ''Biochemistry'' '''39''': 3943-3954] and [http://www.ncbi.nlm.nih.gov/pubmed/18491920 Deleeuw ''et al.'' (2008) ''Biochemistry'' '''47''': 6378-6385]). In particular the residues between 50-74 have been proposed to comprise the DNA recognition domain.
Find  and record
 
* the GI number
 
*the GenPept accession number(s) ('''NCBI'''),
 
*the RefSeq identifier(s) if available,
 
*and the Uniprot accession number(s) ('''EBI''')
 
*and as always, report what you have done to find this information.
 
</div>
 
  
You can find these database identifiers in some cases when the appropriate cross-references have been entered into the annotations. Alternatively you can run a BLAST search with a sequence to find the exact same sequence in the other database. Depending on what you are looking for, either search at the [http://www.ncbi.nlm.nih.gov/blast NCBI] or at the [http://www.ebi.ac.uk/blast/ EBI]. But (!) restrict the search to your assigned organism! You will have to figure out how to do that and of course report the parameters you have used. <small>''I know that this BLAST method appears to be a maximally inefficient way to retrieve a cross-reference for a sequence. However, frequently the databases are simply not providing the appropriate cross-references and the detour through a BLAST search is the only practical way to get them. Most unfortunate.''</small>
 
  
<small>'''Discovering''' a significantly better (or at least significantly more interesting) way to obtain the cross-references '''and''' being the first one to post it on the mailing list probably will merit a bonus point.</small>
+
<div style="padding: 5px; background: #FFCC99;">
 +
;Analysis (0.5 marks)
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+
* Using VMD, generate a parallel stereo view of the protein structure that clearly shows the proposed Mbp1 DNA recognition domain, distinctly coloured differently from the rest of the protein. Use a representation that includes the sidechains.
  
=== Sequence alignment (1 mark)===
+
* Generate a second VMD stereo image as above, but use a representation that emphasizes the secondary structure of the structure (tube or cartoon representation, colouring by structure).
</div>
 
  
Retrieve the FASTA sequence of the protein in your organism that you have found to be most similar to yeast Mbp1.
+
* Generate a third VMD stereo image  that shows three representations combined: (1) the backbone, (2) the sidechains of residues that presumably contact DNA, distinctly colored, and (3) a transparent surface of the entire protein. This image should show whether residues annotated as DNA binding form a contiguous binding interface. '''Note:''' VMD makes smart use of GPU capabilities of your computer. Try setting your graphics parameters to visualize with GLSL - your transparent surface may look '''much''' better.  
  
Use an online tool to generate an optimal full-length (global) alignment between the most-similar protein and ''S. cerevisiae'' Mbp1. (BLAST does not generate ''optimal'' alignments! Use the correct one of the EMBOSS tools instead.) You have to figure out where to find a Web service that does such alignments, which algorithm to use and how to define reasonable parameters for the alignment.
 
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
Include the images in your assignment but be careful not to exceed the width and size restrictions I have defined in the submission guidelines. Include your "selection statements" (e.g. <code>protein and not resid 23 to 34</code>) so that it is easy for you to reproduce what you have done. Also note any important parameters you have changed from the default.
*Report your procedure, parameters, alignment and results, and comment on the quality of the alignment. Is the protein a full-length homologue of Mbp1?
 
 
</div>
 
</div>
  
  
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
+
DNA binding interfaces are expected to comprise a number of positively charged amino acids, that might form salt-bridges with the phosphate backbone.
  
==Analyse==
 
</div>
 
&nbsp;
 
&nbsp;
 
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
=== Sequence annotation (2 marks)===
 
</div>
 
  
Annotate the amino acid sequence of your organism's Mbp1 homologue with the following online tools (most of these can be found via links on http://www.expasy.org/ ):
+
<div style="padding: 5px; background: #FFCC99;">
 +
;Analysis (1 mark)
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*predicted molecular weight
 
*presence of transmembrane helices (TMpred or TMHMM)
 
*presence of internal repeats
 
*presence of signal sequences (SignalP 3.0)
 
*prediction for localization of the protein (PSORT II - make sure you use the right psort program!)
 
*prediction of functional motifs and patterns (ScanProsite or InterproScan)
 
*coiled coils and leucine zippers ([http://2zip.molgen.mpg.de/index.html 2Zip server])
 
*RPS BLAST
 
  
Briefly state for each analysis your procedure, the important results (if not obvious from the output), what the results mean, and whether your results are consistent with your expectations about this protein.
+
*Report whether this is the case here and which residues might be included.
 +
*Do the DNA binding residues form a contiguous surface that is compatible with a binding interface? Justify your conclusions.
 
</div>
 
</div>
 +
<small>Be '''specific''' in your analysis: write exactly which residue does what. For example don't write
 +
:''there are many lysines...''
 +
but write something like
 +
:''K33, R35 and K76 form a patch of positively charged residues close to the C-terminus of the putative recognition helix.''
 +
This is the '''interpretation''' of results and therefore the '''most important step''' of your entire analysis.</small>
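
Listing residues in this specific form is easy to get wrong by hand. A minimal sketch of how you could generate such labels is shown below; it assumes that "positively charged" means lysine and arginine only (whether to include histidine is your call), that positions are 1-based and match the numbering you use elsewhere, and the example sequence is an invented placeholder rather than the Mbp1 sequence. Note that residue numbers in a coordinate file do not have to match positions in the full-length sequence.

```python
# Assumption: only K and R count as positively charged at physiological pH.
POSITIVE = frozenset("KR")


def charged_residue_labels(sequence, start, end, residues=POSITIVE):
    """Return labels such as 'K33' for every residue of the chosen type
    within the 1-based, inclusive range start..end of `sequence`."""
    return [
        f"{aa}{pos}"
        for pos, aa in enumerate(sequence, start=1)
        if start <= pos <= end and aa in residues
    ]


if __name__ == "__main__":
    # Placeholder sequence, for illustration only.
    toy_sequence = "MKARTKLGHRAAKE"
    print(charged_residue_labels(toy_sequence, 2, 10))
```

A list like this is only the starting point: you still have to check in the structure which of these sidechains actually point towards the proposed binding surface.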
  
&nbsp;
 
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
  
=== Homologous structure (1 mark)===
+
== More statistics with R==  
 
</div>
 
</div>
  
The presence of a conserved APSES domain demonstrates that the sequences of ''your'' protein and all other APSES domains are homologous. We can expect that the structures of all homologous APSES domains should be similar, i.e. if the structure of even only one is known, we should be able to conclude the approximate three-dimensional structure of any one of them. Indeed, structural information ''is'' available for APSES domains!
 
  
Identify and download the most appropriate coordinate file to study the structure, function and conservation of APSES domains from the PDB.
+
Time for a break:
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*Record how you have identified the file, what criteria you have used to define whether it is better suited for analysis than others, and paste the <tt>HEADER</tt>,  <tt>TITLE</tt>,  <tt>COMPND</tt> and  <tt>SOURCE</tt> records from the file into your assignment.
 
</div>
 
  
 +
<div style="padding: 5px; background: #FFCC99;">
 +
;Second step (0.5 marks)
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
  
=== DNA binding site (3 marks)===
+
Access the [http://www.cyclismo.org/tutorial/R/ Clarkson University R tutorial]. Work through part two of the tutorial (Data types).
 
</div>
 
</div>
  
The Mbp1 APSES domain has been shown to bind to DNA and the residues involved in DNA binding have been characterized ([http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=10747782 Taylor et al. (2000) Biochemistry 39: 3943-3954]). In particular, residues 50-74 have been proposed to comprise the DNA recognition domain.
 
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
<!-- div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
* Generate a parallel stereo view with VMD that clearly shows the Mbp1 DNA recognition domain residues distinctly coloured.
 
  
* Generate a second VMD stereo image that shows the secondary structure of the DNA recognition domain.
+
  == Onward: the Genome of Interest ==
 
+
</div -->
* Generate a third VMD stereo image that shows three representations combined: (1) the backbone, (2) the sidechains of residues that presumably contact DNA, distinctly colored, and (3) a transparent surface of the entire protein. This image should show whether residues annotated as DNA binding form a contiguous binding interface.
 
  
Paste the images into your assignment in a compressed format (not Windows BMP!): use medium resolution JPEG, PNG or LZW-compressed TIFF formats. Briefly(!) summarize the VMD forms and parameters you have used.
+
<!-- Up to now, we have looked at the model-organism gene to obtain a baseline of information we are interested in. To move on, we need to access the genome of an organism we are interested in. In this course, the organism of interest is assigned to you.
</div>
 
  
 +
The systematic name and strain of a fungus is listed with the [[Group project|project group]] that you have been assigned to. Navigate to the NCBI homepage &rarr; "Genomic Biology" &rarr; "Fungal Genomes Central" &rarr; "Genome Sequencing Projects". This should take you to a tabular view of ongoing and completed fungal genome sequencing projects. Find your organism name in this table. There may be one or more sequencing projects associated with the organism, but there should be only one project for the specific strain.
  
DNA binding interfaces are expected to comprise a number of positively charged amino acids, that might form salt-bridges with the phosphate backbone.  
+
Click on the organism name to navigate to the Genome Project information page.
  
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
*Report whether this is the case here and which residues might be included.
+
*Review the status of the data you are working with - such as
 +
**whether the entire genome is available or only a partial sequence;
 +
**How many chromosomes does this genome have?
 +
**What is the status of its genome assembly and annotation?
 +
**Has the mitochondrial genome been sequenced as well?
 +
**Why is this organism deemed important enough to be sequenced?
 +
</div>
  
*Do the DNA binding residues form a contiguous surface that is compatible with a binding interface?
+
&nbsp; -->
  
*Consider the surface-exposed residues that could form part of the DNA binding interface of Mbp1 (i.e. the cationic residues you have described above and the '''exposed''' sidechains in between): are they conserved between Mbp1 and your protein?
 
</div>
 
 
&nbsp;
 
  
 
<div style="padding: 5px; background: #D3D8E8;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #D3D8E8;  border:solid 1px #AAAAAA;">
Line 289: Line 250:
 
</div>
 
</div>
  
If you have any questions at all, don't hesitate to mail me at [mailto:boris.steipe@utoronto.ca boris.steipe@utoronto.ca] or post your question to the [mailto:bch441_2006@googlegroups.com Course Mailing List]
+
If you have any questions at all, don't hesitate to mail me at [mailto:boris.steipe@utoronto.ca boris.steipe@utoronto.ca] or post your question to the [mailto:bch441_2011@googlegroups.com Course Mailing List]

Latest revision as of 23:31, 21 September 2012

Note! This assignment is currently active. All significant changes will be announced on the mailing list.

 
 


   


   

Assignment 2 (last: 2011) - Search, retrieve and annotate

   


Preparation, submission and due date

Read carefully.
Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which important aspects have simply been overlooked and marks are unnecessarily lost. Sadly, we always get assignments back in which important aspects have simply been overlooked and marks are unnecessarily lost. If you did not notice that the above sentence was repeated, you are not reading carefully enough.

Review the guidelines for preparation and submission of BCH441 assignments.

The due date for the assignment is Monday, October 24 at 12:00 noon (before the quiz).

   


Your documentation for the procedures you follow in this assignment will be worth 1 mark.


   


Introduction

Baker's yeast, Saccharomyces cerevisiae, is perhaps the most important model organism. It is a eukaryote that has been studied genetically and biochemically in great detail for many decades, and it is easily manipulated with high-throughput experimental methods. We will use information from this model organism to study the conservation of function and sequence in other fungi whose genomes have been completely sequenced; the assignments are an exercise in model-organism reasoning: the transfer of knowledge from one, well-studied organism to others.

This and the following assignments will revolve around a transcription factor that plays an important role in the regulation of the cell cycle: Mbp1 is a key component of the MBF complex (Mbp1/Swi6). This complex regulates gene expression at the crucial G1/S-phase transition of the mitotic cell cycle and has been shown to bind to the regulatory regions of more than a hundred target genes.

One would speculate that such central control machinery would be conserved in other fungi and it will be your task in these assignments to collect evidence whether related molecular machinery is present in some of the newly sequenced fungal genomes. Throughout the assignments we will use freely available tools to conduct bioinformatics investigations of questions such as:

  • What functional features can we detect in Mbp1?
  • Do homologous proteins exist in other organisms?
  • Do we believe these homologues may bind to similar sequence motifs?
  • Do we believe they may function in a similar way?
  • Do other organisms appear to have related cell-cycle control systems?


 

  • Access the information page on Mbp1 at the Saccharomyces Genome Database and read the summary paragraph on the protein's function!

(If you would like to brush up on the concepts mentioned above, you could study the corresponding chapter in Lodish's Molecular Cell Biology and/or read Nobel laureate Paul Nurse's review (pdf) of the key concepts of the eukaryotic cell cycle. It is not strictly necessary to understand the details of the yeast cell cycle to complete the assignments, but it is recommended, since it's obviously more fun to work with concepts that actually make some sense.)

In this particular assignment you will go on a search and retrieve mission for information on yeast Mbp1, using common public databases and Web resources.


Search


Retrieve


Much useful information on yeast Mbp1 is compiled at the SGD information page on Mbp1. However, we don't always have the luxury of such precompiled information. Let's look at the protein and its features "the traditional way".


  • Navigate to the NCBI homepage (you probably have bookmarked it anyway) and enter Mbp1 AND "saccharomyces cerevisiae"[organism] as an Entrez query.
  • Click on Protein and find the RefSeq record for the protein sequence.
  • From the NCBI RefSeq record, obtain a FASTA sequence of the protein and paste it into your assignment.
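The steps above are done in the browser, but the same query and the pasted FASTA record can also be handled in a short script. The following is only a sketch: it assumes the public NCBI E-utilities "esearch" endpoint, it merely builds the query URL without contacting the server, and the FASTA record it parses is an artificial placeholder, not the Mbp1 sequence. The function names are made up for this illustration.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities "esearch" service (assumed here;
# the assignment itself uses the Entrez Web form).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_search_url(term, db="protein"):
    """Return an esearch URL for the given Entrez query term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


def parse_fasta(text):
    """Split a single FASTA record into (header, sequence)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    # The header is the first line without '>', the sequence is the rest, joined.
    return lines[0][1:], "".join(lines[1:])


# The query string is the one given in the assignment text.
query = 'Mbp1 AND "saccharomyces cerevisiae"[organism]'
url = entrez_search_url(query)

# Artificial placeholder record: identifier and residues are invented.
example = """>example_id placeholder protein
ACDEFGHIKL
MNPQRSTVWY
"""
header, seq = parse_fasta(example)
print(url)
print(header, len(seq))
```

Parsing the record before pasting it is a cheap way to confirm that the header and the full-length sequence were copied completely.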


There are several sources for functional domain annotations of proteins. The NCBI has the Conserved Domain Database; in Europe, the SMART database provides such annotations. In terms of domains, both resources are very comparable, but SMART also analyses more general features such as low-complexity sequences and coiled coils. In order to use SMART, however, we need the UniProt accession number that corresponds to the RefSeq identifier. In a rational world, one would wish that such important cross-references would simply be provided by the NCBI ... well, we have been wishing this for many years now. Fortunately, ID-mapping services exist.


  • Navigate to the UniProt ID-Mapping service. Enter the RefSeq identifier for the yeast Mbp1 protein and retrieve the corresponding UniProtKB Accession number. If this does not work, try the same mapping at the PIR ID-mapping service. Note the UniProt accession number you find. (Should this work equally on both sites?)


Now navigate to Uniprot, enter the ID you have found into the search field and select [Sequence Clusters(UniRef)] as the database to search in. There should be two sequences in the [UniRef100 ... (100% identical)] cluster. Compare them. One of them is a highly annotated Swiss-Prot record, the other is practically unannotated data that has been imported from a "third party" to UniProt. Unfortunately, that one is the sequence that the ID mapping service had found. No cross-references to the NCBI are included with Swiss-Prot records, nor do NCBI RefSeq records cross-reference NCBI holding. I consider this a sorry state of affairs. Therefore most of us actually run BLAST searches to find equivalent sequences in other databases and this is the most wasteful way imaginable to address the problem.


  • Note down the SwissProt ID and the UniProtKB Accession Number for yeast Mbp1.


Annotate

   


Saccharomyces cerevisiae Mbp1 - domain annotations


Now we can analyse Mbp1's domain in SMART, and use this information to annotate the sequence in detail.


  • Navigate to the SMART database, use "Normal Mode", then in the "Sequence Analysis" form enter the UniProtKB yeast Mbp1 accession number, check the checkboxes for the additional analyses that SMART offers and carefully review the results. By that I mean I might ask at some point what a particular section of the result means and how it is interpreted with respect to its biological significance. If parts are not obvious to you: → mailing list.
  • In your assignment, in the actual, full-length sequence, highlight or otherwise clearly identify the features that SMART has annotated. Minimally you should include KilA-N, low complexity, coils, and Ankyrin domains, taking the sequence coordinates from the SMART annotations. Make sure you highlight the whole length of the feature and get the boundaries right. If features overlap, you could e.g. highlight one red, the other green and the overlap yellow. Or underline one, format the other in italics... And don't forget to label what you have highlighted.
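Getting feature boundaries and overlaps right is easier to check when the coordinates are turned into a marker line that can be printed under the sequence. The following is a minimal sketch, assuming 1-based inclusive coordinates as they are usually reported in annotation tables. The sequence, the feature symbols and the coordinates are invented placeholders, not SMART results for Mbp1.

```python
def marker_line(seq_len, features):
    """Return one marker character per residue.

    features: list of (symbol, start, end) with 1-based inclusive coordinates.
    '.' marks unannotated residues and '*' marks residues that are
    covered by more than one feature.
    """
    line = ["."] * seq_len
    for symbol, start, end in features:
        if not (1 <= start <= end <= seq_len):
            raise ValueError(f"feature {symbol!r}: {start}-{end} is out of range")
        for i in range(start - 1, end):
            # First feature on this residue keeps its symbol, any further one gives '*'.
            line[i] = symbol if line[i] == "." else "*"
    return "".join(line)


# Artificial 20-residue placeholder sequence and hypothetical features.
sequence = "ACDEFGHIKLMNPQRSTVWY"
features = [("a", 3, 10), ("b", 8, 15)]
markers = marker_line(len(sequence), features)

print(sequence)
print(markers)
```

The marker line only helps you verify the boundaries; in the assignment itself the features still need to be highlighted and labelled in the full-length sequence.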


APSES (KilA-N) domains


As you see from the annotations, Mbp1 is a large protein comprising several domains; it binds DNA through a small domain called the APSES domain (this is reported as the KilA-N domain superfamily). Many organisms have transcription factors that have domains homologous to other APSES domains. Since we are interested in related proteins, and all functional relatives would be expected to share such a DNA binding domain, we should define this domain in more detail in order to be able to use it later to search for homologous proteins in diverse organisms.

Use the NCBI Entrez system to search for the string "apses" in the "Conserved Domains" database and access the entry for the KilA-N domain superfamily. You should find 10 aligned sequences on that page, each with their own GI identifier. To find the actual boundaries of the domain annotation, do the following:


  • Navigate back to the RefSeq Protein record (You did record that link in your documentation, right?). In the right-hand menu, find the section "Related information" and click on CDD Search results.
  • Click on the colored box for one of the annotated domains to appreciate the level of detailed information that is available here.
  • Back in the CDD annotation window, click on the [+] next to KilA-N super family to access the actual alignment of the PFAM domain definition with the MBP1 sequence.
  • Check and record (e.g. by highlighting) whether the NCBI and the SMART definition of the APSES domain in Mbp1 coincide exactly. If they don't, explain briefly what that means.
  • Make sure you understand how the sequences displayed on the CDD page and the actual domain sequences differ. Hint: not all sequences are displayed at their full length.
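Whether two domain definitions coincide can be stated precisely by comparing their start and end coordinates. The sketch below assumes both definitions are given as 1-based inclusive ranges on the same full-length sequence. The two ranges are hypothetical values for illustration only and need to be replaced with the boundaries you actually read from the CDD and SMART pages.

```python
def compare_ranges(first, second):
    """Compare two 1-based inclusive (start, end) domain definitions."""
    # Number of residues that both definitions include (0 if they are disjoint).
    shared = max(0, min(first[1], second[1]) - max(first[0], second[0]) + 1)
    return {
        "identical": first == second,
        "start_offset": second[0] - first[0],  # > 0: second starts later
        "end_offset": second[1] - first[1],    # > 0: second ends later
        "shared_residues": shared,
    }


# Hypothetical boundaries, NOT the real annotations for Mbp1.
cdd_range = (2, 100)
smart_range = (5, 92)
result = compare_ranges(cdd_range, smart_range)
print(result)
```

The numbers only describe how the definitions differ; explaining what a difference means is still the part the assignment asks you to write.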


APSES domain structure

We can expect that the structures of all homologous APSES domains should be similar, i.e. if the structure of one is known, we should be able to conclude the approximate three-dimensional structure of any APSES domain. Indeed, structural information is available for APSES domains!


There are several possible approaches you could pursue to identify candidate files:

  • there may be cross references to structures/PDB on any of the pages you have visited;
  • you could search the PDB itself for the keyword Mbp1;
  • you could use the domain sequence you have defined and:
    • BLAST it against the database of PDB sequences (select it as an option on the BLAST form);
    • perform an "advanced search" for similar sequences on the PDB Website.


In any case, you should find more than one coordinate file that contains Mbp1-like structures. Therefore you need to decide which one is the "best" for further analysis. Your choice of the "best" file for study could be based on:

  • experimental method (X-ray or NMR);
  • quality of the structure (resolution, refinement);
  • size (coverage) of the structure (number of amino acids for which structure coordinates have been determined).

 

 

  • Record how you have identified the file you consider the "best" and what criteria you have used to define whether it is better suited for analysis than others.
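One way to make the selection criteria explicit is to write them down as a sort order. The sketch below is only one possible prioritisation (experimental method first, then resolution, then coverage), and you may well argue for a different one. The candidate entries, including their identifiers and values, are invented placeholders and do not describe real PDB files.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    pdb_id: str                  # placeholder identifier, not a real PDB ID
    method: str                  # "X-RAY" or "NMR"
    resolution: Optional[float]  # in Angstrom; None where no resolution is reported
    residues: int                # residues with determined coordinates


def sort_key(c):
    """Assumed priority: crystal structures first, then lower resolution
    value (i.e. more detail), then larger coverage."""
    is_xray = c.method == "X-RAY"
    res = c.resolution if c.resolution is not None else float("inf")
    return (not is_xray, res, -c.residues)


# Hypothetical candidates for illustration only.
candidates = [
    Candidate("AAAA", "NMR", None, 120),
    Candidate("BBBB", "X-RAY", 2.1, 100),
    Candidate("CCCC", "X-RAY", 1.7, 95),
]
ranked = sorted(candidates, key=sort_key)
for c in ranked:
    print(c.pdb_id, c.method, c.resolution, c.residues)
```

Changing the order of the tuple in `sort_key` changes the priorities, which makes the trade-off between resolution and coverage visible and easy to justify in your documentation.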


DNA binding site


The Mbp1 APSES domain has been shown to bind to DNA and the residues involved in DNA binding have been characterized (Taylor et al. (2000) Biochemistry 39: 3943-3954 and Deleeuw et al. (2008) Biochemistry 47: 6378-6385). In particular, residues 50-74 have been proposed to comprise the DNA recognition domain.


Analysis (0.5 marks)
  • Using VMD, generate a parallel stereo view of the protein structure that clearly shows the proposed Mbp1 DNA recognition domain, distinctly coloured differently from the rest of the protein. Use a representation that includes the sidechains.
  • Generate a second VMD stereo image as above, but use a representation that emphasizes the secondary structure of the protein (tube or cartoon representation, colouring by structure).
  • Generate a third VMD stereo image that shows three representations combined: (1) the backbone, (2) the sidechains of residues that presumably contact DNA, distinctly colored, and (3) a transparent surface of the entire protein. This image should show whether residues annotated as DNA binding form a contiguous binding interface. Note: VMD makes smart use of GPU capabilities of your computer. Try setting your graphics parameters to visualize with GLSL - your transparent surface may look much better.


Include the images in your assignment but be careful not to exceed the width and size restrictions I have defined in the submission guidelines. Include your "selection statements" (e.g. protein and not resid 23 to 34) so that it is easy for you to reproduce what you have done. Also note any important parameters you have changed from the default.
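Selection statements are easier to keep consistent across the three images if they are generated from a single residue range. The following sketch just assembles text strings in VMD's atom selection syntax (the same syntax as the example above); it does not control VMD. The range 50 to 74 is the proposed recognition domain mentioned in the text, and the dictionary keys are labels made up for this illustration.

```python
def resid_range(first, last):
    """Return a VMD-style residue range expression."""
    return f"resid {first} to {last}"


# Proposed DNA recognition domain, residues 50 to 74 (from the text above).
site = resid_range(50, 74)

selections = {
    "recognition domain": f"protein and {site}",
    "rest of protein": f"protein and not {site}",
    # Lysine and arginine residues within the range, as candidate cationic contacts.
    "Lys/Arg in domain": f"protein and {site} and resname LYS ARG",
}

for label, text in selections.items():
    print(f"{label}: {text}")
```

The printed strings can be pasted into the "Selected Atoms" field of a VMD representation and copied into your documentation unchanged.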


DNA binding interfaces are expected to comprise a number of positively charged amino acids that might form salt-bridges with the phosphate backbone.


Analysis (1 mark)


  • Report whether this is the case here and which residues might be included.
  • Do the DNA binding residues form a contiguous surface that is compatible with a binding interface? Justify your conclusions.

Be specific in your analysis: write exactly which residue does what. For example don't write

there are many lysines...

but write something like

K33, R35 and K76 form a patch of positively charged residues close to the C-terminus of the putative recognition helix.

This is the interpretation of results and therefore the most important step of your entire analysis.
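To arrive at specific residue labels of the kind shown above, the positively charged residues in a range can be listed directly from the sequence. This is a minimal sketch that assumes the sequence numbering starts at 1 and matches the residue numbering in the coordinate file, which you should verify. The sequence is an invented placeholder, and restricting the search to lysine and arginine is an assumption of this example.

```python
def charged_labels(sequence, first=1, last=None, residues="KR"):
    """Return labels such as 'K33' for the given residue types
    within the 1-based inclusive range first..last."""
    last = len(sequence) if last is None else last
    return [
        f"{aa}{pos}"
        for pos, aa in enumerate(sequence, start=1)
        if first <= pos <= last and aa in residues
    ]


# Invented 10-residue placeholder, NOT part of the Mbp1 sequence.
seq = "MSKRAEKLLR"
labels = charged_labels(seq, 2, 9)
print(", ".join(labels))
```

Such a list only tells you which residues to look at; whether they are surface exposed and form a contiguous patch has to be judged from the structure.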


More statistics with R


Time for a break:


Second step (0.5 marks)


Access the Clarkson University R tutorial. Work through part two of the tutorial (Data types).



[End of assignment]

If you have any questions at all, don't hesitate to mail me at boris.steipe@utoronto.ca or post your question to the Course Mailing List.