Difference between revisions of "ABC-INT-Genome annotation"

From "A B C"
 
(14 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
<div id="ABC">
 
<div id="ABC">
<div style="padding:5px; border:1px solid #000000; background-color:#e19fa7; font-size:300%; font-weight:400; color: #000000; width:100%;">
+
<div style="padding:5px; border:4px solid #000000; background-color:#e19fa7; font-size:300%; font-weight:400; color: #000000; width:100%;">
 
Integrator Unit: Genome annotation
 
Integrator Unit: Genome annotation
 
<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#e19fa7; font-size:30%; font-weight:200; color: #000000; ">
 
<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#e19fa7; font-size:30%; font-weight:200; color: #000000; ">
Line 21: Line 21:
 
<b>Deliverables:</b><br />
 
<b>Deliverables:</b><br />
 
<section begin=deliverables />
 
<section begin=deliverables />
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-integrator" -->
+
<li><b>Integrator unit</b>: Deliverables can be submitted for course marks. See below for details.</li>
*<b>Integrator unit</b>: Deliverables can be submitted for course marks. See below for details.
 
 
<section end=deliverables />
 
<section end=deliverables />
 
<!-- ============================  -->
 
<!-- ============================  -->
Line 28: Line 27:
 
<section begin=prerequisites />
 
<section begin=prerequisites />
 
<b>Prerequisites:</b><br />
 
<b>Prerequisites:</b><br />
<!-- included from "./data/ABC-unit_components.txt", section: "notes-prerequisites" -->
+
This unit builds on material covered in the following prerequisite units:<br />
This unit builds on material covered in the following prerequisite units:
 
 
*[[BIN-FUNC-Annotation|BIN-FUNC-Annotation (Function Annotation)]]
 
*[[BIN-FUNC-Annotation|BIN-FUNC-Annotation (Function Annotation)]]
 
*[[BIN-Genome-Browsers|BIN-Genome-Browsers (Genome Browsers)]]
 
*[[BIN-Genome-Browsers|BIN-Genome-Browsers (Genome Browsers)]]
Line 49: Line 47:
  
 
=== Evaluation ===
 
=== Evaluation ===
<!-- included from "./components/ABC-INT-Genome_annotation.components.txt", section: "evaluation" -->
+
This "Integrator Unit" should be submitted for evaluation for a maximum of 13 marks if one of the written deliverables is chosen, resp. 24 marks if you choose this for your oral test<ref>Note: the oral test is cumulative. It will focus on the content of this unit but will also cover other material that leads up to it.</ref>.
This "Integrator Unit" should be submitted for evaluation for a maximum of 8 marks if one of the written deliverables is chosen, resp. 16 marks for the oral exam<ref>Note: the oral exam will focus on the unit content but will also cover other material that leads up to it.</ref>.
+
:Please note the evaluation types that are available as options for this unit.
:Please note the evaluation types that are available as options for this unit. Choose one evaluation type that you have not chosen for another Integrator Unit. (Each submitted Integrator Unit must be evaluated in a different way and one of your evaluations - but not your first one - must be an oral exam).
+
:Be mindful of the [[ABC-Rubrics| '''Marking rubrics''']].
 +
:If this is submitted for your oral test, please read the [[BCH441 Oral Test instructions|Oral test instructions]] before you begin.
 +
:If your submission includes R code, please read the [[BCH441 Code submisson instructions|Code submission instructions]] before you begin.
 +
 
 +
Once you have chosen an option ...
 +
<ol>
 +
<li>Create a new page on the student Wiki as a subpage of your User Page.</li>
 +
<li>Put all of your writing to submit on this one page.</li>
 +
 
 +
<li>When you are done with everything, go to the [https://q.utoronto.ca/courses/180416/assignments Quercus '''Assignments''' page] and open the appropriate '''Integrator Unit''' assignment. Paste the URL of your Wiki page into the form, and click on '''Submit Assignment'''.</li>
 +
</ol>
 +
 
 +
Your link can be submitted only once and not edited. But you may change your Wiki page at any time. However only the last version before the due date will be marked. All later edits will be silently ignored.
 +
 
 
{{Smallvspace}}
 
{{Smallvspace}}
<!--
+
 
 
;Report option
 
;Report option
* Work through the tasks described in the scenario.
+
* Work through the tasks described below.
* Document your results in a short report on a subpage of your User page on the Student Wiki. Describe your methods (R-code!) in an appendix;
+
* Document your results in a short technical report on a subpage of your User page on the Student Wiki. Describe your methods in your report to an appropriate level of detail that your analysis can be exactly reproduced. If you write R-code, include the code in your report;
* When you are done with everything, add the following category tag to the page:
+
* When you are done, submit the link to your page via Quercus as described above.
::<code><nowiki>[[Category:EVAL-INT-Genome_annotation]]</nowiki></code>
+
 
:'''Do not''' change your submission page after this tag has been added. The page will be marked and the category tag will be removed by the instructor.
 
 
{{Smallvspace}}
 
{{Smallvspace}}
-->
+
 
 +
<!--
 
;Interview option
 
;Interview option
 
: Identify a laboratory whose work includes genome annotation, or re-annotation. Get in touch with the PI, a postdoc or senior graduate student in the laboratory and interview them in person or by eMail. Find out
 
: Identify a laboratory whose work includes genome annotation, or re-annotation. Get in touch with the PI, a postdoc or senior graduate student in the laboratory and interview them in person or by eMail. Find out
Line 72: Line 83:
 
:* add information that may be required to understand the methodology;
 
:* add information that may be required to understand the methodology;
 
:* make sure that you have included important literature references.
 
:* make sure that you have included important literature references.
:* When you are done with everything, add the following category tag to the page:
+
:* When you are done with everything, add the following category tag '''to the end of page''':
::<code><nowiki>[[Category:EVAL-INT-Genome_annotation]]</nowiki></code>
+
::<code><nowiki>[[Category:EVAL-INT-Genome_annotation]]</nowiki></code>.
:'''Do not''' change your submission page after this tag has been added. The page will be marked and the category tag will be removed by the instructor.
+
 
 +
Once the page has been saved with this tag, it is considered "submitted".
 +
'''Do not''' change your submission after this tag has been added. The page will be marked and the category tag will be removed by the instructor.
 +
-->
 
{{Smallvspace}}
 
{{Smallvspace}}
 +
 
;Literature research option
 
;Literature research option
 
:This option requires that a primary publication is available for the MYSPE genome sequence; if there is none, this option is not available.
 
:This option requires that a primary publication is available for the MYSPE genome sequence; if there is none, this option is not available.
:* Write a report on the annotation methodology that was used for the MYSPE genome. Note: this is not a review, but a report. Think of a "whitepaper", not a publication. Write to a specialist technical audience - imagine collaborators who want to use the same methods - and be specific to provide actionable information.
+
:* Write a report on the annotation methodology that was used for the MYSPE genome. Note: this is not a review, but a report. Think of a "whitepaper", not a publication. Write to a specialist technical audience - imagine collaborators who want to use the same methods - and be specific to provide actionable information (links, instructions, resource requirements ...).
 +
:* Include a sketch of the workflow;
 
:* write your report on a subpage of your User page of the Student Wiki;
 
:* write your report on a subpage of your User page of the Student Wiki;
 
:* make sure that you have included all references and citations.
 
:* make sure that you have included all references and citations.
:* When you are done with everything, add the following category tag to the page:
+
:* the level of detail should be sufficient to allow an undergraduate project student to reproduce the analysis.
::<code><nowiki>[[Category:EVAL-INT-Genome_annotation]]</nowiki></code>
+
* When you are done, submit the link to your page via Quercus as described above.
:'''Do not''' change your submission page after this tag has been added. The page will be marked and the category tag will be removed by the instructor.
+
 
 
{{Smallvspace}}
 
{{Smallvspace}}
;Oral exam option
+
 
* Work through the tasks described in the scenario. Remember to document your work in your journal.
+
;Oral test option
* Part of your task will involve writing an R script, place that code in a subpage of your User page on the Student Wiki and link to it from your Journal. (Do not add an evaluation category tag to that code).
+
* Work through the tasks described below. Remember to document your work in your journal, but there is no need to format this specially as a report.
* Your work must be complete before 21:00 on the day before your exam.
+
* Describe your methods in your report to an appropriate level of detail that your analysis can be exactly reproduced. If you write R-code, include the code in your report;
* Schedule an oral exam by editing the [http://steipe.biochemistry.utoronto.ca/abc/students/index.php/Signup-Oral_exams_2017 '''signup page on the Student Wiki''']. Enter the unit that you are signing up for, and your name. You must have signed-up for an exam slot before 21:00 on the day before your exam.
+
* You should be prepared to explain and interpret your findings in the test.
 +
* Note that the work must be completed [[BCH441 Oral Test instructions| '''before''' your actual test date.]]
 +
 
 
{{Smallvspace}}
 
{{Smallvspace}}
 +
<!--
 
;Genome sequence analysis option
 
;Genome sequence analysis option
 
* Start a subpage of your User page on the Student Wiki to document your analysis;
 
* Start a subpage of your User page on the Student Wiki to document your analysis;
 
* Work through the tasks described in the scenario, download sequence data and develop an analysis script as required. Keep your script generic, so that you could easily adapt it to analyze a different gene. Keep careful Journal notes of your activities with your analysis.
 
* Work through the tasks described in the scenario, download sequence data and develop an analysis script as required. Keep your script generic, so that you could easily adapt it to analyze a different gene. Keep careful Journal notes of your activities with your analysis.
* When you are done with everything, add the following category tag to the page:
+
* ...
::<code><nowiki>[[Category:EVAL-INT-Genome_annotation]]</nowiki></code>
+
-->
:'''Do not''' change your submission page after this tag has been added. The page will be marked and the category tag will be removed by the instructor.
+
 
 
== Contents ==
 
== Contents ==
<!-- included from "./components/ABC-INT-Genome_annotation.components.txt", section: "contents" -->
+
 
 
{{Smallvspace}}
 
{{Smallvspace}}
 +
 
===Scenario===
 
===Scenario===
 +
 
{{Smallvspace}}
 
{{Smallvspace}}
You know that MYSPE has an Mbp1 orthologue. The key questions of functional genome annotation would be: does it work in the same way in MYSPE as in yeast? Does it have the same target genes? Is it regulated by orthologues to other yeast genes that imply the same feedback mechanisms and genetic regulatory circuits? Here we will try to deduce just one part of such questions: is the binding motif for Mbp1 conserved? If that is the case, we could automate the task to find genes that are potentially regulated by MBP1_MYSPE, if not, we would need to pursue a different strategy of binding site discovery.
 
  
Here is how we assess the conservation of the Mbp1 DNA binding motif in MYSPE, working from the orthologue of  Cdc6, a pre-replicative complex component:
+
You know that MYSPE has an Mbp1 orthologue. Key questions of functional genome annotation could be: does it work in the same way in MYSPE as in yeast? Does it have the same target genes? Is it regulated by orthologues to other yeast genes that imply the same feedback mechanisms and genetic regulatory circuits? Here we will try to deduce just one part of such questions: is the binding motif for Mbp1 conserved? If that is the case, we could automate the task to find genes that are potentially regulated by MBP1_MYSPE, if not, we would need to pursue a different strategy of binding site discovery.
* Find the MYSPE orthologue for yeast Cdc6.
+
 
* Fetch 500 nucleotides of upstream genome sequence. (Demonstrate that this is the correct sequence by showing the first 10 translated Cdc6 codons with your sequence.)
+
Here is how we assess the conservation of the Mbp1 DNA binding motif in MYSPE, working from the orthologue of  CDC6, a pre-replicative complex component that is one of Mbp1's target genes:
* The yeast Mbp1 canonical binding site is defined by the regular expression <tt>[AT]CGCG[AT]</tt>.
+
* Find the MYSPE orthologue for yeast CDC6 and '''document''' your search and result.
 +
* Fetch a contiguous segment of genome sequence: 500 nucleotides of upstream genome sequence plus the first thirty nucleotides of coding sequence. Use a method that will work at scale, given chromosomal coordinates: a link to the NCBI genome record as in the example below will be fine, similar links could be generated from UCSC or ensembl resources, or with a few lines of <code>biomart::</code> code. '''Manual selection and copy/paste from a sequence database record is not acceptable for this assignment'''.
 +
* Demonstrate that this is the correct sequence by showing and annotating the 530 nucleotides in your submission (refer to the example below for contents and formatting). Add the translation of the first 10 codons of the CDC6-orthologue to your annotation. Make sure that you are showing the correct reverse complement in case your orthologue is transcribed from the (-)-strand!<ref>Please note: if you can't demonstrate that you are working with the correct sequence, there is no point in continuing to search for putative binding motifs. Even if you would find one, that would be meaningless, because it would be in the wrong context. Please resist any temptation to edit or otherwise manipulate the sequence: that would be an academic offence. The sequence you show must be exactly the sequence you have downloaded from the database, and your links must work and produce exactly the correct sequence. If you can't get this to work, contact me to resolve the problem.</ref>
 +
* In your submission:
 +
** You must include the correct database identifiers on which you are basing your analysis, linked to their respective sources;
 +
** There must be a link to the genome sequence source (with chromosomal coordinates) and it must span exactly 530 nucleotides<ref>Be wary of off-by-one errors: the range <tt>10..20</tt> spans eleven nucleotides, not ten.</ref>;
 +
** There must be a link to the protein sequence and it must start with the translated amino acids;
 +
** The FASTA header of the downloaded nucleotide sequence must be included;
 +
** Upstream sequence must be listed in ten lines of 50 nucleotides each;
 +
** There must be ten codons on the next line;
 +
** The first of the ten codons must be the CDC6-orthologue start codon, and the translation must be shown;
 +
** The motifs you find and discuss must be indicated in the annotated sequence listing.
 +
* The yeast Mbp1 canonical binding site is defined by the regular expression <tt>"[AT]CGCG[AT]"</tt>. (Please review [[RPR-RegEx]] if you are not sure about the meaning of <tt>"["</tt> and <tt>"."</tt> in a regular expression.)
 
* Are there <tt>CGCG</tt> motifs present in your nucleotide sequence?
 
* Are there <tt>CGCG</tt> motifs present in your nucleotide sequence?
* Identify them using a regular expression search. You may find the following code useful:
+
* Identify them using a regular expression search. Refer to [[RPR-RegEx]] to review the use of <tt>gregexpr()</tt> and <tt>regmatches()</tt>. The following code-sample may get you started:
<source lang="R">
+
 
 +
<pre>
 
patt <- "..CGCG.."
 
patt <- "..CGCG.."
 
m <- gregexpr(patt, mySeq)
 
m <- gregexpr(patt, mySeq)
 
regmatches(mySeq, m)[[1]]
 
regmatches(mySeq, m)[[1]]
</source>
+
</pre>
* Are there <tt>[AT]CGCG</tt> or <tt>CGCG[AT]</tt> motifs? What about </tt>[AT]CGCG[AT]</tt>?
+
 
* Where are they located? Do they cluster? Are they arranged in a similar way as the yeast binding sites that you visited at UCSC?
+
* Are there <tt>[AT]CGCG</tt> or <tt>CGCG[AT]</tt> motifs? What about <tt>[AT]CGCG[AT]</tt>?
* Interpret your finding. Does this support or refute the idea that MBP1_MYSPE has the same DNA sequence binding specificity as MBP1_SACCEE?
+
* Where are the motifs located? Do they cluster? Are they arranged in a similar way as the yeast binding sites that you visited at UCSC?<ref>Just claiming "yes" or "no" is not sufficient to discuss a ''similar arrangement'': you need to give specifics, such as number of sites and their quality, distance to start, distance to each other, overlap ... etc.</ref>
 +
* Interpret your finding by contrasting your observation to the situation with yeast. Does your analysis support or refute the idea that the CDC6-orthologue in MYSPE is regulated by a transcription factor with the same DNA sequence binding specificity as MBP1_SACCE? Can you make an argument whether that transcription factor could or could not be the Mbp1-orthologue in MYSPE?
  
 
{{Vspace}}
 
{{Vspace}}
  
== Self-evaluation ==
+
=== Sample annotation ===
<!--
+
 
=== Question 1===
+
;(Demonstrating the required level of detail for a valid submission)
 +
 
 +
* MYSPE: ''Sporothrix schenckii'' ([https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?&id=1397361 1397361])
 +
* [https://www.yeastgenome.org/locus/S000003730 CDC6] ([https://www.ncbi.nlm.nih.gov/protein/NP_012341 NP_012341]) orthologue (by RBM): [https://www.ncbi.nlm.nih.gov/protein/XP_016592126 XP_016592126]<br/>
 +
<small>(coverage: 72%; E: 4e-27; ID: 26.08%)<br/>
 +
Reverse search in taxID:4932 finds NP_012341 as the top hit.</small>
 +
* [https://www.ncbi.nlm.nih.gov/protein/XP_016592126.1?report=fasta Protein FASTA of XP_016592126]
 +
* Translation-start <tt>ATG</tt>: range 1255377 .. 1255379
 +
* [https://www.ncbi.nlm.nih.gov/nuccore/NW_015971139.1?report=fasta&from=1254877&to=1255406 Link to Genomic sequence (FASTA)] (Range: 1254877..1255406)
  
Question ...
+
<pre>
 +
>ref|NW_015971139.1|:1254877-1255406 Sporothrix schenckii 1099-18 chromosome Unknown Cont38, whole genome shotgun sequence
  
<div class="toccolours mw-collapsible mw-collapsed" style="width:800px">
+
  5'-TCCACCAAACTAGTCGGGCGAGCTGAACTATGTCGTCCGCCATTTAAAGC
Answer ...
 
<div class="mw-collapsible-content">
 
Answer ...
 
  
</div>
+
    CCACTGTACGAATAGCGCAATACTGTAGACGACCGCACAGTGTATCTGTG
  </div>
 
  
  {{Vspace}}
+
    GCTAGTGTGCAAGCACGCGCCACGGCAGCTGGGCGGGTCTGGGGTCAATC
 +
                  =====x
 +
    CTCCCACGTACGCGTAAAACCGCCAACGCGTCCAGCAATGGCAGGGGTAA
 +
              ======
 +
    GTCAGTCGCGCTTTCTTCGCGTAAAGTGGTTCCTCTATTTGGCGCGCGCT
 +
          =====x
 +
    TCCTCATTAAATCTTGTACCTCCCTTGGCCACCATCTTGAACTTTCCTTC
  
-->
+
    GTGCTTTCCACGTTTGACTTCATTCCCTGTTACTTCCATTTTGTCCATTC
== Notes ==
 
<!-- included from "./components/ABC-INT-Genome_annotation.components.txt", section: "notes" -->
 
<!-- included from "./data/ABC-unit_components.txt", section: "notes" -->
 
<references />
 
== Further reading, links and resources ==
 
<!-- {{#pmid: 19957275}} -->
 
<!-- {{WWW|WWW_GMOD}} -->
 
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 
  
{{Vspace}}
+
    TTGCGACTGTCTATTCTTTCTTTGCGAGCATCTACGCATCTATCCATCGT
  
 +
    TCTTTCCGTTGTATGCATCTACGTCGCTGTTCTTGCCATTGCTTTACCCC
  
<!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_ask" -->
+
    TTTCTTTAAACCCTTCCTCCTTTGCTCTTTCCTCACCACACACTACAAAC
  
----
+
    ATG GTT GCT TCC TCG CTC GGA AAG CGG ATC.....      -3'
 +
      M  V  A  S  S  L  G  K  R  I  ...
 +
</pre>
  
{{Vspace}}
 
  
<b>If in doubt, ask!</b> If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
 
  
----
+
== Notes ==
 +
<references />
  
 
{{Vspace}}
 
{{Vspace}}
 +
  
 
<div class="about">
 
<div class="about">
Line 170: Line 210:
 
:2017-08-05
 
:2017-08-05
 
<b>Modified:</b><br />
 
<b>Modified:</b><br />
:2017-11-19
+
:2020-10-07
 
<b>Version:</b><br />
 
<b>Version:</b><br />
:1.0
+
:1.2
 
<b>Version history:</b><br />
 
<b>Version history:</b><br />
 +
*1.2 Edit policy update
 +
*1.1 2020 Updates; add example annotated sequence; sequence fetch must not be copy/paste.
 +
*1.0.1 Capitalize CDC6
 
*1.0 First live version
 
*1.0 First live version
 
*0.1 First stub
 
*0.1 First stub
 
</div>
 
</div>
[[Category:ABC-units]]
 
<!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_footer" -->
 
  
 
{{CC-BY}}
 
{{CC-BY}}
  
 +
[[Category:ABC-units]]
 +
{{INTEGRATOR}}
 +
{{LIVE}}
 +
{{EVAL}}
 
</div>
 
</div>
 
<!-- [END] -->
 
<!-- [END] -->

Latest revision as of 05:37, 7 October 2020

Integrator Unit: Genome annotation

(Integrator unit: annotate sequences in a genome)


 


Abstract:

This page assesses the learning units for data management and sequence analysis of genomic sequence data.


Deliverables:

  • Integrator unit: Deliverables can be submitted for course marks. See below for details.

Prerequisites:

    This unit builds on material covered in the following prerequisite units:

    • BIN-FUNC-Annotation (Function Annotation)
    • BIN-Genome-Browsers (Genome Browsers)



    Evaluation

    This "Integrator Unit" should be submitted for evaluation for a maximum of 13 marks if you choose one of the written deliverables, or 24 marks if you choose it for your oral test[1].

    Please note the evaluation types that are available as options for this unit.
    Be mindful of the Marking rubrics.
    If this is submitted for your oral test, please read the Oral test instructions before you begin.
    If your submission includes R code, please read the Code submission instructions before you begin.

    Once you have chosen an option ...

    1. Create a new page on the student Wiki as a subpage of your User Page.
    2. Put all of your writing to submit on this one page.
    3. When you are done with everything, go to the Quercus Assignments page and open the appropriate Integrator Unit assignment. Paste the URL of your Wiki page into the form, and click on Submit Assignment.

    Your link can be submitted only once and cannot be edited afterwards. You may, however, change your Wiki page at any time: only the last version saved before the due date will be marked, and all later edits will be silently ignored.


     
    Report option
    • Work through the tasks described below.
    • Document your results in a short technical report on a subpage of your User page on the Student Wiki. Describe your methods in enough detail that your analysis can be reproduced exactly. If you write R code, include the code in your report;
    • When you are done, submit the link to your page via Quercus as described above.


     


     
    Literature research option
    This option requires that a primary publication is available for the MYSPE genome sequence; if there is none, this option is not available.
    • Write a report on the annotation methodology that was used for the MYSPE genome. Note: this is not a review, but a report. Think of a "whitepaper", not a publication. Write to a specialist technical audience - imagine collaborators who want to use the same methods - and be specific to provide actionable information (links, instructions, resource requirements ...).
    • Include a sketch of the workflow;
    • write your report on a subpage of your User page of the Student Wiki;
    • make sure that you have included all references and citations.
    • the level of detail should be sufficient to allow an undergraduate project student to reproduce the analysis.
    • When you are done, submit the link to your page via Quercus as described above.


     
    Oral test option
    • Work through the tasks described below. Remember to document your work in your journal, but there is no need to format this specially as a report.
    • Describe your methods in enough detail that your analysis can be reproduced exactly. If you write R code, include the code in your documentation;
    • You should be prepared to explain and interpret your findings in the test.
    • Note that the work must be completed before your actual test date.


     

    Contents

     

    Scenario

     

    You know that MYSPE has an Mbp1 orthologue. Key questions of functional genome annotation could be: does it work in the same way in MYSPE as in yeast? Does it have the same target genes? Is it regulated by orthologues of other yeast genes, which would imply the same feedback mechanisms and genetic regulatory circuits? Here we will try to address just one part of these questions: is the binding motif for Mbp1 conserved? If it is, we could automate the task of finding genes that are potentially regulated by MBP1_MYSPE; if not, we would need to pursue a different strategy of binding site discovery.

    Here is how we assess the conservation of the Mbp1 DNA binding motif in MYSPE, working from the orthologue of CDC6, a pre-replicative complex component that is one of Mbp1's target genes:

    • Find the MYSPE orthologue for yeast CDC6 and document your search and result.
    • Fetch a contiguous segment of genome sequence: 500 nucleotides of upstream genome sequence plus the first thirty nucleotides of coding sequence. Use a method that will work at scale, given chromosomal coordinates: a link to the NCBI genome record as in the example below will be fine; similar links could be generated from UCSC or Ensembl resources, or with a few lines of biomaRt:: code. Manual selection and copy/paste from a sequence database record is not acceptable for this assignment.
    • Demonstrate that this is the correct sequence by showing and annotating the 530 nucleotides in your submission (refer to the example below for contents and formatting). Add the translation of the first 10 codons of the CDC6-orthologue to your annotation. Make sure that you are showing the correct reverse complement in case your orthologue is transcribed from the (-)-strand![2]
    • In your submission:
      • You must include the correct database identifiers on which you are basing your analysis, linked to their respective sources;
      • There must be a link to the genome sequence source (with chromosomal coordinates) and it must span exactly 530 nucleotides[3];
      • There must be a link to the protein sequence and it must start with the translated amino acids;
      • The FASTA header of the downloaded nucleotide sequence must be included;
      • Upstream sequence must be listed in ten lines of 50 nucleotides each;
      • There must be ten codons on the next line;
      • The first of the ten codons must be the CDC6-orthologue start codon, and the translation must be shown;
      • The motifs you find and discuss must be indicated in the annotated sequence listing.
    • The yeast Mbp1 canonical binding site is defined by the regular expression "[AT]CGCG[AT]". (Please review RPR-RegEx if you are not sure about the meaning of "[" and "." in a regular expression.)
    • Are there CGCG motifs present in your nucleotide sequence?
    • Identify them using a regular expression search. Refer to RPR-RegEx to review the use of gregexpr() and regmatches(). The following code-sample may get you started:
    patt <- "..CGCG.."
    m <- gregexpr(patt, mySeq)
    regmatches(mySeq, m)[[1]]
    
    • Are there [AT]CGCG or CGCG[AT] motifs? What about [AT]CGCG[AT]?
    • Where are the motifs located? Do they cluster? Are they arranged in a similar way as the yeast binding sites that you visited at UCSC?[4]
    • Interpret your finding by contrasting your observation to the situation with yeast. Does your analysis support or refute the idea that the CDC6-orthologue in MYSPE is regulated by a transcription factor with the same DNA sequence binding specificity as MBP1_SACCE? Can you make an argument whether that transcription factor could or could not be the Mbp1-orthologue in MYSPE?
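
    As a rough illustration of the motif questions above, the following R sketch reports the start positions of the core CGCG, the two half-extended variants, and the full [AT]CGCG[AT] motif. It uses a sliding window instead of gregexpr() so that overlapping hits (as in CGCGCG) are also reported. The helper findMotif() and the variable offset are hypothetical names, not part of the course material; the test string is the fourth line of the sample upstream listing (nucleotides 151..200 of the 530).

```r
# Hypothetical helper (not from the course material): 1-based start positions
# of every window of `width` characters in `s` that matches `patt`.
# Because every window is tested separately, overlapping hits are reported too.
findMotif <- function(s, patt, width) {
  starts  <- seq_len(nchar(s) - width + 1)
  windows <- substring(s, starts, starts + width - 1)
  starts[grepl(paste0("^", patt, "$"), windows)]
}

# Fourth line of the sample upstream listing
mySeq  <- "CTCCCACGTACGCGTAAAACCGCCAACGCGTCCAGCAATGGCAGGGGTAA"
offset <- 150   # nucleotides that precede this fragment in the 530-nt segment

core  <- findMotif(mySeq, "CGCG", 4)           # core motif
left  <- findMotif(mySeq, "[AT]CGCG", 5)       # extended on the 5' side
right <- findMotif(mySeq, "CGCG[AT]", 5)       # extended on the 3' side
full  <- findMotif(mySeq, "[AT]CGCG[AT]", 6)   # canonical yeast site

# Distance from the first motif nucleotide to the A of the start codon,
# which is nucleotide 501 of the 530-nucleotide segment.
distToATG <- 501 - (full + offset)
```

    Reporting positions relative to the start codon, not only the matched strings, gives the numbers needed to discuss clustering and spacing.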


     

    Sample annotation

    (Demonstrating the required level of detail for a valid submission)

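    • MYSPE: Sporothrix schenckii (1397361)
    • CDC6 (NP_012341) orthologue (by RBM): XP_016592126
      (coverage: 72%; E: 4e-27; ID: 26.08%)
      Reverse search in taxID:4932 finds NP_012341 as the top hit.
    • Protein FASTA of XP_016592126: https://www.ncbi.nlm.nih.gov/protein/XP_016592126.1?report=fasta
    • Translation-start ATG: range 1255377 .. 1255379
    • Link to Genomic sequence (FASTA) (Range: 1254877..1255406): https://www.ncbi.nlm.nih.gov/nuccore/NW_015971139.1?report=fasta&from=1254877&to=1255406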
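
    The coordinates of the sample (translation start at 1255377, range 1254877..1255406) can be checked with a few lines of arithmetic. This is a minimal sketch for a gene on the (+)-strand; it assumes the URL pattern of the sample link (report=fasta with from and to parameters). The variable names are illustrative only. A gene on the (-)-strand needs the mirrored calculation and a reverse complement.

```r
# Values taken from the sample annotation
accession <- "NW_015971139.1"
atgStart  <- 1255377L            # first nucleotide of the start codon

# 500 nt upstream plus the first 30 nt of coding sequence, (+)-strand
from <- atgStart - 500L
to   <- atgStart + 29L           # 30 coding nucleotides: atgStart .. atgStart + 29

# Inclusive ranges: the span is to - from + 1, not to - from
span <- to - from + 1L

# Assumed URL pattern, copied from the sample link
url <- sprintf(
  "https://www.ncbi.nlm.nih.gov/nuccore/%s?report=fasta&from=%d&to=%d",
  accession, from, to)
```

    Computing the range from the start coordinate, instead of reading it off a record by eye, is what makes the approach work at scale and guards against off-by-one errors.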

    >ref|NW_015971139.1|:1254877-1255406 Sporothrix schenckii 1099-18 chromosome Unknown Cont38, whole genome shotgun sequence
    
      5'-TCCACCAAACTAGTCGGGCGAGCTGAACTATGTCGTCCGCCATTTAAAGC
    
         CCACTGTACGAATAGCGCAATACTGTAGACGACCGCACAGTGTATCTGTG
    
         GCTAGTGTGCAAGCACGCGCCACGGCAGCTGGGCGGGTCTGGGGTCAATC
                       =====x
         CTCCCACGTACGCGTAAAACCGCCAACGCGTCCAGCAATGGCAGGGGTAA
                  ======
         GTCAGTCGCGCTTTCTTCGCGTAAAGTGGTTCCTCTATTTGGCGCGCGCT
              =====x
         TCCTCATTAAATCTTGTACCTCCCTTGGCCACCATCTTGAACTTTCCTTC
    
         GTGCTTTCCACGTTTGACTTCATTCCCTGTTACTTCCATTTTGTCCATTC
    
         TTGCGACTGTCTATTCTTTCTTTGCGAGCATCTACGCATCTATCCATCGT
    
         TCTTTCCGTTGTATGCATCTACGTCGCTGTTCTTGCCATTGCTTTACCCC
    
         TTTCTTTAAACCCTTCCTCCTTTGCTCTTTCCTCACCACACACTACAAAC
    
         ATG GTT GCT TCC TCG CTC GGA AAG CGG ATC.....      -3'
          M   V   A   S   S   L   G   K   R   I   ...
    
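
    The layout of the listing above can be produced programmatically. The sketch below assumes that the 530 nucleotides are already held in a single upper-case string; revComp() and layoutSegment() are hypothetical helper names. To stay self-contained it runs on a placeholder string (500 N characters followed by the thirty coding nucleotides of the sample), not on a downloaded sequence.

```r
# Hypothetical helper: reverse complement of an upper-case DNA string,
# for orthologues that are transcribed from the (-)-strand.
revComp <- function(s) {
  paste(rev(strsplit(chartr("ACGT", "TGCA", s), "")[[1]]), collapse = "")
}

# Hypothetical helper: split a 530-nt segment into ten upstream lines of
# 50 nucleotides, followed by one line of ten space-separated codons.
layoutSegment <- function(s) {
  stopifnot(nchar(s) == 530)
  upStarts <- seq(1, 451, by = 50)
  upstream <- substring(s, upStarts, upStarts + 49)
  cdStarts <- seq(501, 528, by = 3)
  codons   <- substring(s, cdStarts, cdStarts + 2)
  c(upstream, paste(codons, collapse = " "))
}

# Placeholder input, not real upstream sequence
demo    <- paste0(strrep("N", 500), "ATGGTTGCTTCCTCGCTCGGAAAGCGGATC")
listing <- layoutSegment(demo)
cat(listing, sep = "\n")
```

    The length check inside layoutSegment() stops early if the fetched segment does not span exactly 530 nucleotides.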


    Notes

    1. Note: the oral test is cumulative. It will focus on the content of this unit but will also cover other material that leads up to it.
    2. Please note: if you can't demonstrate that you are working with the correct sequence, there is no point in continuing to search for putative binding motifs. Even if you found one, it would be meaningless, because it would be in the wrong context. Please resist any temptation to edit or otherwise manipulate the sequence: that would be an academic offence. The sequence you show must be exactly the sequence you have downloaded from the database, and your links must work and produce exactly the correct sequence. If you can't get this to work, contact me to resolve the problem.
    3. Be wary of off-by-one errors: the range 10..20 spans eleven nucleotides, not ten.
    4. Just claiming "yes" or "no" is not sufficient to discuss a similar arrangement: you need to give specifics, such as number of sites and their quality, distance to start, distance to each other, overlap ... etc.


     


    About ...
     
    Author:

    Boris Steipe <boris.steipe@utoronto.ca>

    Created:

    2017-08-05

    Modified:

    2020-10-07

    Version:

    1.2

    Version history:

    • 1.2 Edit policy update
    • 1.1 2020 Updates; add example annotated sequence; sequence fetch must not be copy/paste.
    • 1.0.1 Capitalize CDC6
    • 1.0 First live version
    • 0.1 First stub

    This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.