ABC-INT-Genome annotation

Integrator Unit: Genome annotation

(Integrator unit: annotate sequences in a genome)

Abstract:

This page assesses the learning units for data management and sequence analysis of genomic sequence data.

Deliverables:

Integrator unit: Deliverables can be submitted for course marks. See below for details.

Prerequisites:
This unit builds on material covered in the following prerequisite units:

Evaluation

Create a new page on the student Wiki as a subpage of your User Page.
Put all of your writing to submit on this one page.
When you are done with everything, go to the Quercus Assignments page and open the appropriate Integrator Unit assignment. Paste the URL of your Wiki page into the form, and click on Submit Assignment.

After you have submitted your assignment, do not change your Wiki page until it has been graded.

Report option

Work through the tasks described below.
Document your results in a short technical report on a subpage of your User page on the Student Wiki. Describe your methods in enough detail that your analysis can be reproduced exactly. If you write R code, include the code in your report.
When you are done, submit the link to your page via Quercus as described above.

Literature research option

This option requires that a primary publication is available for the MYSPE genome sequence; if there is none, this option is not available.

Write a report on the annotation methodology that was used for the MYSPE genome. Note: this is a report, not a review. Think of a "whitepaper", not a publication. Write for a specialist technical audience (imagine collaborators who want to use the same methods) and be specific enough to provide actionable information (links, instructions, resource requirements ...).
Include a sketch of the workflow.
Write your report on a subpage of your User page on the Student Wiki.
Make sure that you have included all references and citations.
The level of detail should be sufficient to allow an undergraduate project student to reproduce the analysis.

When you are done, submit the link to your page via Quercus as described above.

Oral test option

Work through the tasks described below. Remember to document your work in your journal, but there is no need to format this specially as a report.
Describe your methods in enough detail that your analysis can be reproduced exactly. If you write R code, include the code.
You should be prepared to explain and interpret your findings in the test.
Note that the work must be completed before your actual test date.

Scenario

You know that MYSPE has an Mbp1 orthologue. Key questions of functional genome annotation could be: does it work in the same way in MYSPE as in yeast? Does it have the same target genes? Is it regulated by orthologues to other yeast genes that imply the same feedback mechanisms and genetic regulatory circuits? Here we will try to address just one part of these questions: is the binding motif for Mbp1 conserved? If it is, we could automate the task of finding genes that are potentially regulated by MBP1_MYSPE; if not, we would need to pursue a different strategy of binding site discovery.

Here is how we assess the conservation of the Mbp1 DNA binding motif in MYSPE, working from the orthologue of CDC6, a pre-replicative complex component that is one of Mbp1's target genes:

Find the MYSPE orthologue for yeast CDC6 and document your search and result.
Fetch 500 nucleotides of upstream genome sequence. (Demonstrate that this is the correct sequence by showing the first 10 translated CDC6 codons with your sequence.) Make sure that you reverse-complement the sequence if your orthologue is transcribed from the (-)-strand.^[2]
Demonstrate precisely that this is the correct sequence by including the same information as in the sample annotation below. In particular:
- There must be ten lines of 50 nucleotides each.
- There must be ten codons on the next line.
- Therefore the download link must span exactly 530 nucleotides.
- The first of the ten codons must be a start codon, and the translation must be shown.
- There must be a link to the genome sequence source (with chromosomal coordinates).
- There must be a link to the protein sequence, and it must start with the translated amino acids.
- The motifs you find and discuss must be indicated in the annotated sequence listing.
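
The coordinate arithmetic behind these requirements can be sketched in base R. This is only an illustration under stated assumptions, not part of the course code: the helper names upstreamRange(), revComp() and splitListing() are hypothetical, atgFirst is assumed to be the genomic coordinate of the A of the start codon, and the sequence is assumed to be a single upper-case string.

```r
# Hypothetical helpers (names are illustrative, not from the course material).

# Genomic range covering 500 upstream nucleotides plus the first ten codons.
# atgFirst: coordinate of the "A" of the start codon; strand: "+" or "-".
upstreamRange <- function(atgFirst, strand = "+", nUp = 500, nCodons = 10) {
  nCds <- 3 * nCodons
  if (strand == "+") {
    r <- c(from = atgFirst - nUp, to = atgFirst + nCds - 1)
  } else {
    # On the (-)-strand, "upstream" lies at higher coordinates.
    r <- c(from = atgFirst - nCds + 1, to = atgFirst + nUp)
  }
  r
}

# Reverse complement of an upper-case DNA string (A, C, G, T only).
revComp <- function(s) {
  comp <- chartr("ACGT", "TGCA", s)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# Split a 530-nucleotide string into ten lines of 50 and ten codons.
splitListing <- function(s) {
  stopifnot(nchar(s) == 530)
  list(upstream = substring(s, seq(1, 451, by = 50), seq(50, 500, by = 50)),
       codons   = substring(s, seq(501, 528, by = 3), seq(503, 530, by = 3)))
}

# Coordinates from the sample annotation: ATG at 1255377 on the (+)-strand.
r <- upstreamRange(1255377, "+")
r                                  # from 1254877, to 1255406
unname(r["to"] - r["from"] + 1)    # 530
```

For an orthologue on the (-)-strand, the fetched range would be passed through revComp() before splitting, so that the listing reads 5' to 3' in the direction of transcription.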
The yeast Mbp1 canonical binding site is defined by the regular expression "[AT]CGCG[AT]". (Please review RPR-RegEx if you are not sure about the meaning of "[" and "." in a regular expression.)
Are there CGCG motifs present in your nucleotide sequence?
Identify them using a regular expression search. Refer to RPR-RegEx to review the use of gregexpr() and regmatches(). The following code sample may get you started:

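# mySeq: your upstream sequence as a single upper-case character string
patt <- "..CGCG.."            # CGCG core plus two flanking nucleotides on each side
m <- gregexpr(patt, mySeq)    # start positions of all non-overlapping matches
regmatches(mySeq, m)[[1]]     # the matched substrings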

Are there [AT]CGCG or CGCG[AT] motifs? What about [AT]CGCG[AT]?
Where are the motifs located? Do they cluster? Are they arranged in a similar way as the yeast binding sites that you visited at UCSC?
Interpret your findings by contrasting your observations with the situation in yeast. Does your analysis support or refute the idea that MBP1_MYSPE has the same DNA sequence binding specificity as MBP1_SACCE?
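
One way to approach the motif questions in a single pass is to locate every CGCG core and then inspect its flanking nucleotides. The following is a sketch under stated assumptions: findCGCG() is a hypothetical helper, the input is assumed to be one upper-case string, and a zero-width lookahead (with perl = TRUE) is used so that overlapping cores such as those in CGCGCG are all reported, which a plain gregexpr() search would not do.

```r
# Hypothetical helper: locate all CGCG cores and classify their flanks.
findCGCG <- function(s) {
  # Zero-width lookahead reports every start position, including overlaps.
  pos <- as.integer(gregexpr("(?=CGCG)", s, perl = TRUE)[[1]])
  if (pos[1] == -1) {   # gregexpr() returns -1 if there is no match
    return(data.frame(start = integer(0), site = character(0),
                      class = character(0), stringsAsFactors = FALSE))
  }
  left  <- substring(s, pos - 1, pos - 1) %in% c("A", "T")   # [AT]CGCG
  right <- substring(s, pos + 4, pos + 4) %in% c("A", "T")   # CGCG[AT]
  data.frame(start = pos,                                    # position of the core
             site  = substring(s, pmax(pos - 1, 1), pos + 4),
             class = ifelse(left & right, "full",
                     ifelse(left | right, "half", "core")),
             stringsAsFactors = FALSE)
}

# Toy sequence (not real data): one full site and one half site.
toy  <- "TTACGCGTTTGCGCGAAA"
hits <- findCGCG(toy)
hits
diff(hits$start)   # spacing between consecutive cores, to judge clustering
```

With a 500-nucleotide upstream sequence, start - 501 gives the offset of each core relative to the A of the start codon, which makes it easier to compare the arrangement with the yeast sites.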

Sample annotation

(Demonstrating the required level of detail for a valid submission)

MYSPE: Sporothrix schenckii (1397361)
CDC6 (NP_012341) orthologue (by RBM): XP_016592126

(coverage: 72%; E: 4e-27; ID: 26.08%) (Reverse search in taxID:4932 finds NP_012341)

Protein FASTA of XP_016592126
ATG: 1255377 .. 1255379
Link to Genomic sequence (FASTA) (Range: 1254877..1255406)

>ref|NW_015971139.1|:1254877-1255406 Sporothrix schenckii 1099-18 chromosome Unknown Cont38, whole genome shotgun sequence

  5'-TCCACCAAACTAGTCGGGCGAGCTGAACTATGTCGTCCGCCATTTAAAGC

     CCACTGTACGAATAGCGCAATACTGTAGACGACCGCACAGTGTATCTGTG

     GCTAGTGTGCAAGCACGCGCCACGGCAGCTGGGCGGGTCTGGGGTCAATC
                   =====x
     CTCCCACGTACGCGTAAAACCGCCAACGCGTCCAGCAATGGCAGGGGTAA
              ======
     GTCAGTCGCGCTTTCTTCGCGTAAAGTGGTTCCTCTATTTGGCGCGCGCT
          =====x
     TCCTCATTAAATCTTGTACCTCCCTTGGCCACCATCTTGAACTTTCCTTC

     GTGCTTTCCACGTTTGACTTCATTCCCTGTTACTTCCATTTTGTCCATTC

     TTGCGACTGTCTATTCTTTCTTTGCGAGCATCTACGCATCTATCCATCGT

     TCTTTCCGTTGTATGCATCTACGTCGCTGTTCTTGCCATTGCTTTACCCC

     TTTCTTTAAACCCTTCCTCCTTTGCTCTTTCCTCACCACACACTACAAAC

     ATG GTT GCT TCC TCG CTC GGA AAG CGG ATC.....      -3'
      M   V   A   S   S   L   G   K   R   I   ...
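
The translation line of such a listing can be checked without any package. This sketch builds the standard genetic code in base R and translates the ten codons shown above; translateCodons() is a hypothetical helper, and the use of the standard code (NCBI translation table 1) is an assumption that should be confirmed for your MYSPE, since some fungi use an alternative nuclear code.

```r
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
bases <- c("T", "C", "A", "G")
grid  <- expand.grid(b3 = bases, b2 = bases, b1 = bases,
                     stringsAsFactors = FALSE)   # b3 varies fastest
aa <- strsplit(paste0("FFLLSSSSYY**CC*W", "LLLLPPPPHHQQRRRR",
                      "IIIMTTTTNNKKSSRR", "VVVVAAAADDEEGGGG"), "")[[1]]
names(aa) <- paste0(grid$b1, grid$b2, grid$b3)

# Hypothetical helper: translate a string of codons (whitespace is ignored).
translateCodons <- function(s) {
  s <- gsub("\\s", "", toupper(s))
  stopifnot(nchar(s) %% 3 == 0)
  codons <- substring(s, seq(1, nchar(s), by = 3), seq(3, nchar(s), by = 3))
  paste(aa[codons], collapse = "")
}

translateCodons("ATG GTT GCT TCC TCG CTC GGA AAG CGG ATC")   # "MVASSLGKRI"
```

The result can then be compared with the first ten residues of the linked protein FASTA record.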

Notes

[1] Note: the oral test is cumulative. It will focus on the content of this unit but will also cover other material that leads up to it.
[2] Please note: if you can't demonstrate that you are working with the correct sequence, there is no point in continuing to search for putative binding motifs. Even if you found one, it would be meaningless, because it would be in the wrong context. Please resist any temptation to edit or otherwise manipulate the sequence: that would be an academic offence. The sequence you show must be exactly the sequence you have downloaded from the database, and your links must work and produce exactly the correct sequence. If you can't get this to work, contact me to resolve the problem.

About ...

Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2020-10-01

Version:

1.1

Version history:

1.1 2020 Updates; add example annotated sequence.
1.0.1 Capitalize CDC6
1.0 First live version
0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.
