BIO Assignment Week 10
Assignment for Week 10
Genome Browsers
Note! This assignment is currently active. All significant changes will be announced on the mailing list.
Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.
Introduction
Large-scale genome sequencing and annotation have made a wealth of information available that all relates to the same biological object: the DNA. This information, however, is of very different types. It includes:
- the actual sequence
- sequence variants (SNPs and CNVs)
- conservation between related species
- genes (with introns and exons)
- mRNAs
- expression levels
- regulatory features such as transcription factor binding sites
and more.
But since all of this information relates to specific chromosomal locations, displaying it in tracks alongside the chromosomal coordinates is a useful way to integrate and visualize the information, and to make it accessible. That is what Genome Browsers are for. Quite a number of such browsers exist, and most work on the same principle: server-hosted databases are queried through a Web interface, and the resulting data is displayed graphically in a Web browser window. The large data centres each have their own browsers, but arguably the best engineered, most informative and most widely used one is provided by the University of California Santa Cruz (UCSC) Genome Browser Project.
In this assignment you will explore some of the browsers and we will go through an exercise that relates fungal replication genes to human genes. We have previously focused a lot on Mbp1 homologs, but these have no clear equivalents in "higher" eukaryotes. However, one of the key target genes of Mbp1 is the cell cycle protein Cdc6, and CDC6 is universally conserved in eukaryotes and has a human homolog.
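The track principle described above can be illustrated with a minimal sketch: every annotation is an interval on the same coordinate system, and a browser view is simply the set of features that overlap the displayed window, grouped by track. The `Feature` class, the `features_in_window` function, and all feature names and coordinates below are made up for illustration; they are not taken from any real browser or genome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One annotation on a chromosome (hypothetical data model)."""
    track: str         # which track the feature belongs to, e.g. "genes"
    name: str
    start: int         # 1-based, inclusive
    end: int           # 1-based, inclusive
    strand: str = "."  # "+", "-", or "." if not applicable


def features_in_window(features, start, end):
    """Group the names of all features overlapping [start, end] by track."""
    view = {}
    for f in features:
        # Two closed intervals overlap if each starts before the other ends.
        if f.start <= end and f.end >= start:
            view.setdefault(f.track, []).append(f.name)
    return view


# Invented annotations on a single chromosome.
features = [
    Feature("genes", "geneA", 1000, 2500, "+"),
    Feature("genes", "geneB", 4000, 5200, "-"),
    Feature("TF binding", "siteA1", 850, 870),
    Feature("variants", "snp1", 1200, 1200),
    Feature("variants", "snp2", 6000, 6000),
]

# "Display" the window 800..3000, one line per track.
view = features_in_window(features, 800, 3000)
for track, names in view.items():
    print(f"{track}: {', '.join(names)}")
```

Real browsers use indexed databases rather than a linear scan, but the overlap test is the same idea.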
GBrowse
GBrowse - the Generic Genome Browser - is the browser developed by the Generic Model Organism Database (GMOD) project, which aims to make industry-strength bioinformatics tools and software available to the model organism community. One of the many databases that use GMOD tools is the Saccharomyces Genome Database (SGD).
Task:
In this task you will access the SGD GBrowse page for Cdc6 and explore some of the options.
- Navigate to the Saccharomyces Genome Database, enter Cdc6 into the site search field and on the result page click on GBrowse at the Chromosome location heading.
- Locate CDC6 (YJL194W) as a red bar in the graph. Note that the triangle at the end points in the direction of transcription.
- Note how you can click/hold the graph and slide it left and right, and how this changes the overview indicator that shows where on the chromosome the currently displayed window of sequence is located.
- Zoom in by selecting Show 5 kbp at the scroll/zoom controls.
- Click on the Select Tracks tab. This gives you access to a fine-grained selection of all tracks that have been created as genome annotations.
- Find the section for Transcription Factors. Click on the star next to TF ChIP chip to mark this experiment as a "favorite". Then click on Show Favorites Only at the top of the page. Finally check All on for the Transcription Factors track and click Back to browser.
This view shows you the ChIP-chip validated TF-binding sites in the upstream regulatory region of Cdc6. Note that Mbp1 is among them. Curiously, Swi6 is also listed there, but you know that Swi6 does not actually bind DNA directly. Instead, it forms complexes with the APSES domain transcription factors Mbp1 and Swi4 (the MBF and SBF complexes, respectively). Crosslinking of such a complex and immunoprecipitation with anti-Swi6 would certainly identify this region. You should be aware that an annotation of a protein in a ChIP-chip experiment is not the same as demonstrating a protein's direct physical interaction with DNA.
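As an aside, the systematic name YJL194W that you located in the graph itself encodes the gene's position: Y for yeast, a letter A to P for chromosomes I to XVI, L or R for the chromosome arm, a number counted outward from the centromere, and W or C for the Watson or Crick strand. The sketch below decodes that convention. The function name `parse_systematic_name` and the shape of the returned dictionary are choices made here for illustration, and the pattern only covers standard nuclear ORF names.

```python
import re

# Roman numerals for the sixteen nuclear chromosomes of S. cerevisiae.
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII",
         "IX", "X", "XI", "XII", "XIII", "XIV", "XV", "XVI"]


def parse_systematic_name(name):
    """Decode a yeast systematic ORF name such as 'YJL194W'."""
    m = re.fullmatch(r"Y([A-P])([LR])(\d{3})([WC])(-[A-Z])?", name.upper())
    if m is None:
        raise ValueError(f"not a standard nuclear ORF name: {name!r}")
    chrom_letter, arm, number, strand = m.group(1, 2, 3, 4)
    return {
        # 'A' is chromosome I, 'B' is chromosome II, and so on.
        "chromosome": ROMAN[ord(chrom_letter) - ord("A")],
        "arm": "left" if arm == "L" else "right",
        "orf_number": int(number),
        "strand": "Watson" if strand == "W" else "Crick",
    }


cdc6 = parse_systematic_name("YJL194W")
print(cdc6)
```

For CDC6 this gives chromosome X, left arm, Watson strand, which should agree with the location and the direction of the triangle in the browser view.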
NCBI Map Viewer
Task:
In this task you will locate and display a map view at the NCBI for the yeast Cdc6 gene.
- Navigate to the NCBI home page and follow the link to Genomes & maps in the left-hand menu.
- Click on the Tools tab and find the link to the Map Viewer.
- In the Fungi section, click on the latest "build" of the Saccharomyces cerevisiae genome. This takes you to an overview page of the status of the genome project. Each chromosome is linked to its map. If you did not know which chromosome to look for, you would need to search by keyword or gene name in the nucleotide database. For Cdc6, you remember from the task above that it is located on Chromosome X (i.e. the Roman numeral ten, not the "X chromosome"). You will arrive at the map view of the entire chromosome with the RefSeq accession number NC_001142.9. This large nucleotide record, which contains the entire chromosomal sequence, underlies the display.
- Enter Cdc6 into the Search field and click the Find in This View button. Then zoom in one level.
The resulting view shows you the location and orientation of the gene on the chromosome. A number of links to various NCBI databases are given for each gene. Note that this is primarily a tool for database crossreferencing, not for integrating and displaying annotations.
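The chromosome record behind this display can also be requested programmatically through the NCBI E-utilities. The following is a minimal sketch that only builds an `efetch` request URL for a slice of NC_001142.9 in FASTA format; it does not contact the server. The helper name `efetch_url` is made up here, and the coordinates 1 to 600 are arbitrary and do not correspond to CDC6.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, seq_start=None, seq_stop=None):
    """Build an efetch URL for a nucleotide record, optionally a sub-range."""
    params = {
        "db": "nuccore",      # nucleotide database
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    if seq_start is not None and seq_stop is not None:
        # seq_start and seq_stop restrict the returned sequence (1-based).
        params["seq_start"] = seq_start
        params["seq_stop"] = seq_stop
    return f"{EFETCH}?{urlencode(params)}"


# Arbitrary example range on yeast chromosome X.
url = efetch_url("NC_001142.9", 1, 600)
print(url)
```

Pasting the printed URL into a Web browser, or passing it to `urllib.request.urlopen`, should return the sequence. Please respect the NCBI usage guidelines if you script many requests.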
Ensembl
The EBI offers its own version of genome browsers through the Ensembl project. A large number of genomes have been annotated, cross-referenced and made available for viewing. The EBI has spent a lot of effort on automated curation of their genome offerings. The Ensembl offerings are therefore more comprehensive and complete than those of other sources. In particular, you will find a genome view for YFO.
Task:
In this task you will review the Ensembl view of the YFO ortholog of yeast CDC6.
- Navigate to the EnsemblFungi page (easy to find via Google).
- Select Saccharomyces cerevisiae from the species list.
- Search for Cdc6 as a search term.
- Click on CDC6 (YJL194W)
You will be taken to a browser view of the genome. Tracks can be switched on and off through the menu on the left-hand side.
- Find the link to Orthologues under the Fungal Compara section in the menu.
- In the resulting page, find the YFO orthologue and click on the Location link.
- On the Browser page, click on the cogwheel icon of the lower view to configure tracks.
- On the configuration page, click on Sequence in the menu and turn Contigs off and Translated sequence on. Click the checkmark in the top-right corner of the configuration window to return to the browser view.
- Zoom in until you see the display of the actual nucleotides and the six reading frames.
This is a very comprehensive offering in terms of sequences. However, Ensembl, too, offers little in terms of annotations of DNA elements, expression levels and the like. Nevertheless, since it is the only database that has YFO annotated, it would be the tool to go to if you were to compare syntenic regions or genomic context between different species.
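The six reading frames shown in the zoomed-in view can be reproduced with a few lines of code. The sketch below translates three frames of the given strand and three of its reverse complement, assuming the standard genetic code. The function names and the short example sequence are invented for illustration, and the frame labels (+1 to +3, -1 to -3) are one common convention that may not match the labels Ensembl uses.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Translate all six reading frames of a DNA sequence."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


# Made-up example sequence, not taken from any genome.
frames = six_frames("ATGGCCTAAGG")
for label in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(label, frames[label])
```

Stop codons appear as `*`. In a browser view, an uninterrupted stretch between stops in one frame is what suggests an open reading frame.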
The UCSC genome browser
The University of California Santa Cruz (UCSC) Genome Browser Project has the largest offering of annotation information. However, it is strictly model-organism oriented and you will probably not find YFO among its curated genomes. Nevertheless, if you are studying e.g. human genes, or yeast, the UCSC browser should be your first choice.
Task:
In this task you will explore the UCSC genome browser view of the yeast Cdc6 gene and its human orthologue. You will look at some of the very large number of tracks that are available for both and compare transcription factor binding regions.
- Navigate to the UCSC Genome Bioinformatics entry page and follow the link to the Genome Browser in the left-hand menu.
- From the available menus, access the S. cerevisiae information (Clade → other) and enter Cdc6 as the search term.
- Click on the link to the Cdc6 gene on chromosome X.
- Click on the button to zoom out 3x - we want to see the upstream regulatory region.
- In the subsection for Expression and Regulation, find the menu for Regulatory Code and select full; select hide for all other expression tracks. Click refresh.
Up to now, this looks very similar to the SGD genome browser.
- Open a second window, and access the UCSC Genome browser for the human genome. Search for CDC6 and click the link to the Homo sapiens cell division cycle 6 homolog (S. cerevisiae) (CDC6) on chromosome 17.
- Study the Genome Browser view of the CDC6 homolog.
- In particular, note the extensive functional annotations of DNA and the alignments of vertebrate syntenic regions that allow detailed genomic comparisons.
- Distinguish between exon and intron sequence.
- Note that the mammal Conservation track has high values for all of the exons, but not only for exons.
- Zoom out 1.5x and click/slide the gene to the right to view the upstream regulatory region.
- On the page, note the large number of available tracks that have been integrated into this view. Most of them are switched off. Find the Regulation section, and click on ENCODE Transcription Factor Binding Tracks to access the information page on where exactly this data originates from. Note that you can switch individual experiments on or off on this page, as well as setting the display format for all of the results. Leave all of the experiments checked, set the display to show and click the Submit button.
- Get a sense of the amount of information that is displayed here and note that all experiments agree on a regulatory region that ranges from about 1.5 kb upstream to 0.5 kb downstream of the transcription start.
- Go back to the ENCODE Transcription Factor Binding Tracks page, uncheck all of the data sources except for the ENCODE/Stanford/Yale/USC/Harvard ChIP-seq experiment (SYDH TFBS), set the format to full, Display mode: show and click Submit.
- The resulting tracks are an excellent view of the kind of information that is provided by ChIP-seq experiments in which bound transcription factors are crosslinked to the DNA, immuno-precipitated with transcription factor specific antibodies, and the co-precipitated DNA sequenced with high-throughput sequencing methods. Note that most sequence tags are found in a unimodal distribution close to the transcription start, but some TFs (e.g. Rad21) apparently have more than one binding site.
- Now scroll down to the track sections, hide the ENCODE TF binding data and show the full view of the TFBS conserved track - a consensus of human/mouse and rat annotated TF binding sites. Click on the small vertical bar in the V$E2F_02 row. This will take you to a detailed information page on this transcription factor, with cross-references to the databases.
Based on this kind of information, it should be straightforward to identify human transcription factors that potentially regulate human Cdc6 and determine - via sequence comparisons - whether any of them are homologous to any of the yeast transcription factors. Through a detailed analysis of existing systems, their regulatory components and the conservation of regulation, one can in principle establish functional equivalences across large evolutionary distances.
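Two observations from the task above lend themselves to a small worked example: the regulatory region from about 1.5 kb upstream to 0.5 kb downstream of the transcription start, and the pile-up of ChIP-seq tags near it. The sketch below computes such a window in a strand-aware way and then finds the position of maximal tag depth inside it. All function names, the transcription start coordinate and the tag intervals are hypothetical; they are not CDC6 data, and real peak callers are considerably more sophisticated.

```python
def regulatory_window(tss, strand, upstream=1500, downstream=500):
    """Return (start, end) of a window around a transcription start site.

    'Upstream' means lower coordinates on the + strand and higher
    coordinates on the - strand.
    """
    if strand == "+":
        return tss - upstream, tss + downstream
    if strand == "-":
        return tss - downstream, tss + upstream
    raise ValueError("strand must be '+' or '-'")


def coverage(tags, start, end):
    """Per-base tag depth over [start, end]; tags are inclusive (start, end)."""
    depth = [0] * (end - start + 1)
    for tag_start, tag_end in tags:
        # Clip each tag to the window before counting.
        for pos in range(max(tag_start, start), min(tag_end, end) + 1):
            depth[pos - start] += 1
    return depth


def summit(depth, start):
    """Return (position, depth) of the first position with maximal depth."""
    best = max(depth)
    return start + depth.index(best), best


# Hypothetical gene on the + strand with its transcription start at 10,000.
window = regulatory_window(10_000, "+")

# Invented 36-nt sequence tags: three overlap near the start, one lies apart.
tags = [(9990, 10025), (10000, 10035), (10010, 10045), (9000, 9035)]

depth = coverage(tags, *window)
peak = summit(depth, window[0])
print("window:", window)
print("summit position and depth:", peak)
```

A single summit corresponds to the unimodal distributions seen for most factors; a factor with more than one binding site would show several separated local maxima instead.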
The UCSC browser has a sometimes bewildering amount of information available. But its curators are aware of the need for educating users regarding the utility of their tools.
Task:
In this task you will access some of the tutorial information that UCSC provides.
- Return to the UCSC Genome Bioinformatics entry page and follow the link to Training in the left-hand menu.
- Follow the link to the OpenHelix UCSC tutorials (http://www.openhelix.com/ucsc).
- Download the Hands-on exercise PDF file and work through Exercise 2.
This exercise includes a number of interesting options to work with the UCSC data - the BLAT tool for genomic region alignment and the selective display of SNP annotations.
- Optional
- Work through Exercises 1 and 3 of the OpenHelix UCSC introduction.
- Access the OpenHelix ENCODE tutorial, download the Hands-on Exercises pdf and work through the exercises. Exercise 3 is particularly valuable, as it teaches you how to create results from complex intersections of queries.
- Study the User's guide to ENCODE paper linked below.
Links and resources
ENCODE Project Consortium (2011) A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol 9:e1001046. (pmid: 21526222)
Abstract: The mission of the Encyclopedia of DNA Elements (ENCODE) Project is to enable the scientific and medical communities to interpret the human genome sequence and apply it to understand human biology and improve health. The ENCODE Consortium is integrating multiple technologies and approaches in a collective effort to discover and define the functional elements encoded in the human genome, including genes, transcripts, and transcriptional regulatory regions, together with their attendant chromatin states and DNA methylation patterns. In the process, standards to ensure high-quality data have been implemented, and novel algorithms have been developed to facilitate analysis. Data and derived results are made available through a freely accessible database. Here we provide an overview of the project and the resources it is generating and illustrate the application of ENCODE data to interpret the human genome.
Footnotes and references
Ask, if things don't work for you!
- If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.
- Do consider how to ask your questions so that a meaningful answer is possible:
- How to create a Minimal, Complete, and Verifiable example on stackoverflow and ...
- How to make a great R reproducible example are required reading.