Integrator Unit: Genome annotation


Expected Preparations:

	[BIN-FUNC] Annotation		[BIN-GENOME] Genome_Browsers
	The units listed above are part of this course and contain important preparatory material.

Keywords: Integrator unit: annotate sequences in a genome

Objectives:

Outcomes:

Deliverables:

Integrator unit: Deliverables can be submitted for course marks. See below for details.

Evaluation:

Material based on this Integrator Unit can be submitted for summative feedback (course marks). It will be marked out of a maximum of 18 marks for a regular submission, or 36 marks if you choose this unit for your Oral Test ¹.

For your report:

Create a new document in your shared Google drive folder.
Call your document ABC-INT-Genome-annotation-<your name>-2022
Work through the tasks described below.
Document your work and your results. Write at a technical level, like a lab report, and include all details that are needed to make your work reproducible. Follow the additional instructions for submitting R code.
Include a (CC) license at the end of your document, as instructed at the beginning of the course.
When you are done with everything, go to the Assignments page on Quercus and open the appropriate Integrator Unit submission category. Paste the URL of your report document into the form, and click on Submit Assignment. Your link can be submitted only once and not edited. Also: do not edit your document after it has been submitted.

If you choose this unit for your Oral Test option:

Prepare your report as above.
Be prepared to discuss your findings during the test.
Make sure the report is submitted before your test date (cf. Oral Test instructions).

If your report includes R code …

… then please follow these additional instructions.

Submitted R code must be placed into a code appendix.
We will load the code for evaluation. Therefore the code in your appendix must be delimited with two special tags.
- The first line of your code must be
  # begin code
- The last line must be
  # end code
- Use only these tags and make sure they appear only once in your document.
The second line of your code must be a comment that identifies you as the author and states the title of your report.
You can run fetchGoogleDocRcode(<your doc URL>) to check that your code can be loaded into R.
The submitted code must be exactly the code that you have used to obtain your results. Altering the code in your documentation - even if only for cosmetic purposes - is an academic offence.
The reported results must be exactly the results your code has produced at the time you ran it. Altering the results you obtained - for any reason whatsoever - is an academic offence.
It must be possible to reproduce your exact results from your posted code. Therefore:
- Make sure your code is complete and can stand alone when we run it. All data and other assets must be included with the code.
- Do not create or delete files in your code.
- If you need special R packages, only packages from CRAN or Bioconductor may be used.
- Use set.seed(<some integer>) for reproducible randomness wherever appropriate (and verify that your results can be reproduced).
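
As an illustration of the layout described above, a code appendix might look like the following sketch. The author name, report title, seed value, and the shuffling step are placeholders chosen for this example; they are not part of the course instructions.

```r
# begin code
# Author: Jane Doe (hypothetical); Report: ABC-INT-Genome-annotation-Jane-Doe-2022

# All input data is defined in the code itself, so the appendix stands alone
# and no files need to be created or deleted.
mySeq <- "GTCAGTCGCGCTTTCTTCGCGTAAAGTGGTTCCTCTATTTGGCGCGCGCT"

# Fix the random seed before any step that uses randomness. The seed value is
# an arbitrary integer; what matters is that it is stated in the code.
set.seed(112358)
shuffled <- paste(sample(strsplit(mySeq, "")[[1]]), collapse = "")

# end code
```

Running the block twice in a fresh R session should give the same value of `shuffled` both times, which is the kind of check that the last instruction asks for.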

Please remember that code submissions have their own marking rubrics.

This page integrates concepts and methods for data management and sequence analysis of genomic sequence data.

Scenario

You know that MYSPE has an orthologue of yeast Mbp1. That’s very useful knowledge: yeast is a well studied model organism, and its target genes for most transcription factors have been experimentally determined. If MYSPE has regulatory genetic circuits that are conserved among fungi, you could perform functional genome annotation based on orthology to yeast genes. Thus you might ask questions like: does regulation work in the same way in MYSPE as in yeast? Does it have the same target genes? Are MYSPE target genes of MBP1_MYSPE co-regulated by orthologues to other yeast genes, which would imply conserved feedback mechanisms and genetic regulatory circuits?

Here we will try to answer just one part of such an inquiry: is the binding motif for yeast Mbp1 conserved in a MYSPE orthologue of an S. cerevisiae target gene? If that is the case, we could automate the task of finding genes that are potentially regulated by MBP1_MYSPE; if not, we would need to pursue a different strategy of binding site discovery.

Analysis and Documentation

Here is how we could develop an analysis of the conservation of the Mbp1 DNA binding motif in MYSPE manually.

Navigate to SGD (the Saccharomyces Genome Database).
Find the annotation page for MBP1 / YDL056W.
Look for the list of Mbp1 target genes, linked from the section on regulation.
Choose one target gene (but not CDC6). We will call this your “putative Mbp1 target” below.
Find the MYSPE orthologue for your putative Mbp1 target and document your search and result.
Fetch a contiguous segment of genome sequence of your putative Mbp1 target: 500 nucleotides of upstream genome sequence plus the first thirty nucleotides of coding sequence. Use a method that will work at scale, given chromosomal coordinates: a link to the NCBI genome record as in the example below will be fine; similar links could be generated from UCSC or Ensembl resources, or with a few lines of biomaRt:: code. Manual selection and copy/paste from a sequence database record is not acceptable for this task.
Demonstrate that this is the correct sequence by showing and annotating the 530 nucleotides in your submission (refer to the example below for contents and formatting). Add the translation of the first 10 codons of your putative Mbp1 target to your annotation. Make sure that you are showing the correct reverse complement in case your orthologue is transcribed from the (-)-strand!²
In your submission:
- You must include the correct database identifiers on which you are basing your analysis, linked to their respective sources;
- There must be a link to the genome sequence source (with chromosomal coordinates) and it must span exactly 530 nucleotides;³
- There must be a link to the protein sequence and it must start with the translated amino acids;
- The FASTA header of the downloaded nucleotide sequence must be included;
- Upstream sequence must be listed in ten lines of 50 nucleotides each;
- There must be ten codons on the next line;
- The first of the ten codons must be the start codon of your putative Mbp1 target, and the translation must be shown;
- The motifs you find and discuss must be indicated in the annotated sequence listing as in the example below.
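
The coordinate arithmetic behind the 530-nucleotide segment can be sketched in a few lines of R. This is a minimal sketch under stated assumptions, not the required method: `upstreamWindow()` and `efetchURL()` are hypothetical helper names, and the URL uses the NCBI E-utilities `efetch` parameters `seq_start`, `seq_stop` and `strand` as I understand them (1 for the plus strand, 2 for the minus strand), so please verify the result against the database record. The accession and start-codon position are those of the sample annotation at the end of this unit.

```r
# Window of `up` upstream nucleotides plus the first `cds` coding nucleotides.
# atgPos: forward-strand genome coordinate of the "A" of the start codon.
upstreamWindow <- function(atgPos, strand = "+", up = 500, cds = 30) {
  if (strand == "+") {
    from <- atgPos - up
    to   <- atgPos + cds - 1      # inclusive range, hence the -1
  } else {
    from <- atgPos - cds + 1      # on the (-)-strand the gene runs "leftwards"
    to   <- atgPos + up
  }
  c(from = from, to = to)
}

# Build an efetch URL for a FASTA sub-sequence (parameter names are assumptions
# based on the E-utilities documentation; check them before relying on this).
efetchURL <- function(accession, from, to, strand = "+") {
  sprintf(paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
                 "?db=nuccore&id=%s&seq_start=%d&seq_stop=%d&strand=%d",
                 "&rettype=fasta&retmode=text"),
          accession, from, to, if (strand == "+") 1L else 2L)
}

w <- upstreamWindow(1255377, strand = "+")
u <- efetchURL("NW_015971139.1", w[["from"]], w[["to"]], strand = "+")

# The window must span exactly 530 nucleotides.
stopifnot(w[["to"]] - w[["from"]] + 1 == 530)

# fasta <- readLines(u)   # uncomment to actually download the sequence
```

For the sample, the window comes out as 1254877..1255406, which matches the range given in the sample annotation. Because the same two functions work for any accession, position and strand, the approach scales to a whole list of target genes.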

The yeast Mbp1 canonical binding site is defined by the regular expression "[AT]CGCG[AT]". (Please review RPR-RegEx if you are not sure about the meaning of "[" and "." in a regular expression.) In your report note:

Are there CGCG motifs present in your nucleotide sequence?
Identify them using a regular expression search. Refer to RPR-RegEx to review the use of gregexpr() and regmatches(). The following code-sample may get you started:

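# mySeq: your 530-nucleotide sequence as a single character string
patt <- "..CGCG.."                # CGCG with two flanking characters on each side
m <- gregexpr(patt, mySeq)        # positions of all non-overlapping matches
regmatches(mySeq, m)[[1]]         # the matched substrings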

Are there [AT]CGCG or CGCG[AT] motifs? What about [AT]CGCG[AT]?
Where are the motifs located? Do they cluster? Are they arranged in a similar way as the yeast binding sites that you visited at UCSC?⁴
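
To answer the questions about location and clustering you need the positions of the sites, not only the matched strings. The following is one possible sketch. It uses a single 50-nucleotide line copied from the sample annotation at the end of this unit, and a Perl-style lookahead so that sites that overlap by a nucleotide are not missed; the variable names are illustrative only.

```r
# One 50-nucleotide line copied from the upstream region of the sample annotation.
s <- "CTCCCACGTACGCGTAAAACCGCCAACGCGTCCAGCAATGGCAGGGGTAA"

# The lookahead (?=...) makes every match zero-length, so the search resumes at
# the very next position and overlapping sites are all reported.
m <- gregexpr("(?=[AT]CGCG[AT])", s, perl = TRUE)[[1]]
pos <- as.integer(m)
pos <- pos[pos > 0]               # gregexpr() returns -1 if there is no match

# Extract the six nucleotides of each site from its start position.
sites <- vapply(pos, function(p) substr(s, p, p + 5), character(1))
data.frame(start = pos, end = pos + 5, site = sites)
```

In this line the pattern matches at positions 10 and 26. From the start positions you can compute the distance between sites and, once you add the offset of the line within the 530 nucleotides, the distance to the start codon.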

Interpretation

Additionally: Interpret your findings by contrasting your observations with the situation in yeast. Does your analysis support or refute the idea that your putative Mbp1 target in MYSPE is regulated by a transcription factor with the same DNA sequence binding specificity as MBP1_SACCE? Can you make an argument whether that transcription factor could or could not be the Mbp1 orthologue in MYSPE?

Automation?

Finally: Write, in pseudocode, the analysis workflow that you would need to automate the procedure for all yeast Mbp1 target genes, and to determine whether a yeast-like binding motif is enriched in the upstream regulatory sequences of potential MBP1_MYSPE targets.
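
Any statement about enrichment needs a background expectation to compare against. As a rough orientation only, the sketch below computes how many [AT]CGCG[AT] sites one would expect by chance in 500 nucleotides. It assumes independent positions and equal base frequencies, which real genomes do not have, and it uses a Poisson approximation; a real analysis would estimate the background from actual MYSPE sequence. All variable names are illustrative.

```r
motifWidth <- 6

# Probability that a given position starts a site: two positions allow 2 of 4
# bases ([AT]), four positions allow exactly 1 of 4 bases (C, G, C, G).
pSite <- (2/4) * (1/4)^4 * (2/4)

# Number of positions at which a 6-mer can start within 500 nucleotides.
nPos <- 500 - motifWidth + 1

expectedSites <- nPos * pSite                    # expected number of chance sites
pAtLeastOne   <- 1 - dpois(0, expectedSites)     # Poisson approximation
pAtLeastTwo   <- 1 - ppois(1, expectedSites)

round(c(expected = expectedSites, atLeastOne = pAtLeastOne,
        atLeastTwo = pAtLeastTwo), 3)
```

Note that the pattern [AT]CGCG[AT] is its own reverse complement, so a scan of one strand already finds the sites on both strands. Under these simplifying assumptions a single site in an upstream region is not surprising, which is why the number and arrangement of sites matter for the interpretation.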

Sample annotation

The annotation below is based on the Sporothrix schenckii orthologue of CDC6. CDC6 is a pre-replicative complex component that is one of Mbp1’s target genes, and it is highly conserved. This sample demonstrates the required formatting and level of detail for a valid submission.

MYSPE: Sporothrix schenckii (1397361)
CDC6 (NP_012341) orthologue (by RBM): XP_016592126
(coverage: 72%; E: 4e-27; ID: 26.08%)
Reverse search in taxID:4932 finds NP_012341 as the top hit.
Protein FASTA of XP_016592126
Translation-start ATG: range 1255377 .. 1255379
Link to Genomic sequence (FASTA) (Range: 1254877..1255406)

>ref|NW_015971139.1|:1254877-1255406 Sporothrix schenckii 1099-18 chromosome Unknown Cont38, whole genome shotgun sequence

  5'-TCCACCAAACTAGTCGGGCGAGCTGAACTATGTCGTCCGCCATTTAAAGC

     CCACTGTACGAATAGCGCAATACTGTAGACGACCGCACAGTGTATCTGTG

     GCTAGTGTGCAAGCACGCGCCACGGCAGCTGGGCGGGTCTGGGGTCAATC
                   =====x
     CTCCCACGTACGCGTAAAACCGCCAACGCGTCCAGCAATGGCAGGGGTAA
              ======
     GTCAGTCGCGCTTTCTTCGCGTAAAGTGGTTCCTCTATTTGGCGCGCGCT
          =====x
     TCCTCATTAAATCTTGTACCTCCCTTGGCCACCATCTTGAACTTTCCTTC

     GTGCTTTCCACGTTTGACTTCATTCCCTGTTACTTCCATTTTGTCCATTC

     TTGCGACTGTCTATTCTTTCTTTGCGAGCATCTACGCATCTATCCATCGT

     TCTTTCCGTTGTATGCATCTACGTCGCTGTTCTTGCCATTGCTTTACCCC

     TTTCTTTAAACCCTTCCTCCTTTGCTCTTTCCTCACCACACACTACAAAC

     ATG GTT GCT TCC TCG CTC GGA AAG CGG ATC.....      -3'
      M   V   A   S   S   L   G   K   R   I   ...
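
The codon line of the sample can be checked with a few lines of base R. This is a sketch that assumes the standard genetic code (NCBI translation table 1), which does not hold for all fungi (the CTG-clade yeasts are a known exception), so confirm which table applies to your MYSPE. `revComp()` and `translate()` are helper names made up for this example.

```r
# Standard genetic code in NCBI order: first base varies slowest, order T, C, A, G.
bases  <- c("T", "C", "A", "G")
codons <- paste0(rep(bases, each = 16),
                 rep(rep(bases, each = 4), times = 4),
                 rep(bases, times = 16))
geneticCode <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]
names(geneticCode) <- codons

# Reverse complement, needed if the gene is transcribed from the (-)-strand
# and the sequence was retrieved in forward-strand orientation.
revComp <- function(x) {
  paste(rev(strsplit(chartr("ACGT", "TGCA", x), "")[[1]]), collapse = "")
}

# Translate complete codons, starting at the first nucleotide of x.
translate <- function(x) {
  n <- nchar(x) %/% 3
  starts <- seq(1, by = 3, length.out = n)
  paste(geneticCode[substring(x, starts, starts + 2)], collapse = "")
}

# The last 30 nucleotides of the sample segment (the ten codons shown above).
cds30 <- "ATGGTTGCTTCCTCGCTCGGAAAGCGGATC"
translate(cds30)
```

For the sample, the translation is MVASSLGKRI, in agreement with the annotation line. The result should also equal the first ten amino acids of the linked protein record; if it does not, the coordinates or the strand are wrong.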

Questions, comments

If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.

Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.


¹ Note: the oral test is cumulative. It will focus on the content of this unit but will also cover other material that leads up to it.
² Please note: if you can’t demonstrate that you are working with the correct sequence, there is no point in continuing to search for putative binding motifs. Even if you found one, it would be meaningless, because it would be in the wrong context. Please resist any temptation to edit or otherwise manipulate the sequence: that would be an academic offence. The sequence you show must be exactly the sequence you have downloaded from the database, and your links must work and produce exactly the correct sequence. If you can’t get this to work, contact me to resolve the problem.
³ Be wary of off-by-one errors: the range 10..20 spans eleven nucleotides, not ten.
⁴ Just claiming “yes” or “no” is not sufficient to discuss a similar arrangement: you need to give specifics, such as number of sites and their quality, distance to start, distance to each other, overlap, etc.