Difference between revisions of "BIO Assignment Week 9"

Latest revision as of 04:12, 13 December 2016

Assignment for Week 9
Genomics

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.

Introduction

Large-scale genome sequencing and annotation have made a wealth of information available that is all related to the same biological object: the DNA. The information, however, can be of very different types; it includes:

the actual sequence
sequence variants (SNPs and CNVs)
conservation between related species
genes (with introns and exons)
mRNAs
expression levels
regulatory features such as transcription factor binding sites

and much more.

Since all of this information relates to specific positions or ranges on the chromosome, displaying it alongside the chromosomal coordinates is a useful way to integrate and visualize it. We call such strips of annotation tracks and display them in genome browsers. Quite a number of such browsers exist and most work on the same principle: server-hosted databases are queried through a Web interface; the resulting data is displayed graphically in a Web browser window. The large data centres each have their own browsers, but arguably the best engineered, most informative and most widely used one is provided by the University of California Santa Cruz (UCSC) Genome Browser Project.
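To make the idea of tracks concrete, here is a minimal sketch in Python of how annotations anchored to chromosomal coordinates can be combined in a single view. It is not how the UCSC browser is implemented. The `Feature` class, the function names and all coordinates are hypothetical and chosen for illustration only. Coordinates are assumed to be 1-based and inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One annotated range on a chromosome (1-based, inclusive coordinates)."""
    chrom: str
    start: int
    end: int
    label: str


def features_in_view(track, chrom, view_start, view_end):
    """Return the features of one track that overlap the viewing window."""
    return [
        f for f in track
        if f.chrom == chrom and f.start <= view_end and f.end >= view_start
    ]


def render_view(tracks, chrom, view_start, view_end):
    """Collect, track by track, the labels of all features visible in the window."""
    return {
        name: [f.label for f in features_in_view(track, chrom, view_start, view_end)]
        for name, track in tracks.items()
    }


# Hypothetical toy data: two tracks that share one coordinate system.
tracks = {
    "genes": [
        Feature("chr1", 100, 900, "geneA"),
        Feature("chr1", 1500, 2400, "geneB"),
    ],
    "SNPs": [
        Feature("chr1", 250, 250, "snp1"),
        Feature("chr1", 1800, 1800, "snp2"),
        Feature("chr2", 300, 300, "snp3"),
    ],
}

print(render_view(tracks, "chr1", 200, 1000))
```

Because every track is indexed by the same coordinates, adding a new kind of annotation only means adding another entry to `tracks`; the overlap test stays the same.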

Compiling the data requires a massive annotation effort, which has not been completed for all genome-sequenced species. In particular, not all of our YFOs have been included in the major model-organism annotation efforts. The general strategy for analysis of a gene in YFO is thus to map it to homologous genes in model organisms. In this assignment you will explore the UCSC genome browser and we will go through an exercise that relates fungal replication genes to human genes. We have previously focused a lot on Mbp1 homologs, but these have no clear equivalents in "higher" eukaryotes. However, one of the key target genes of Mbp1 encodes the cell cycle protein Cdc6, which is well conserved in fungi and other eukaryotes and has a human homolog. Since, generally speaking, the annotation level for human genes is the highest, we will have a closer look at that gene.

The UCSC genome browser

The University of California Santa Cruz (UCSC) Genome Browser Project has the largest offering of annotation information. However, it is strictly model-organism oriented and you will probably not find YFO among its curated genomes. Nevertheless, if you are studying e.g. human or yeast genes, the UCSC browser will probably be your first choice.

Task:
In this task you will access the UCSC genome browser view of the human Cdc6 gene. You will explore some of the very large number of tracks that are available and study the transcription factor binding region.

Navigate to the UCSC Genome Bioinformatics entry page and follow the link to the Genome Browser in the "Our tools" section.
Click on the link to humans. Note that this is the hg38 assembly.
Enter CDC6 into the "Position/Search Term" field and click "Go". You should get a list of entries; click on the top link, the gene on chromosome 17: CDC6 (uc002huj.2) at chr17:40287633-40304657

Zoom out 1.5x to view the upstream regulatory region: the end of the adjacent WIPF2 gene should have just come into view on the left.
Study the Genome Browser view of the human CDC6 homolog.
1. In particular, note the extensive functional annotations of DNA and the alignments of vertebrate syntenic regions that allow detailed genomic comparisons.
2. Distinguish between exon and intron sequence.
3. Note that the mammal Conservation track has high values for all of the exons, but not only for exons.
4. Find more information on the "Layered H3K27Ac" track.

Note the large number of available tracks that have been integrated into this view. Most of them are switched off. Find the Regulation section, and follow the link to the "ORegAnno" information to see what that is about. Note that you can switch individual annotations on or off on this page, as well as set the display format for all of the results. Select the check-box only for "transcription factor binding site", set the "Display mode" to full and click submit.
Study this information and note:
1. There is a cluster of TFBS just upstream of the transcription initiation site.
2. This cluster coincides with the highest H3K27Ac density.
3. If you <control>-click (or right-click) on the top orange bar of this cluster, a contextual menu opens from which you can access the details page for OREG1791811 in a new window. Follow the link to the RBL2 transcription factor via ENST00000379935 ... from where you can access transcript, gene, expression, protein family, GO and other information.
Go back to the Genome Browser, set the ORegAnno track to "pack" and click "refresh".
Slide the SNP track to just beneath the RefSeq genes track that contains the introns and exons. You will notice that one of the SNPs is green, and two are red. Why? Set the "Common SNPs" track display mode to "pack" and click "refresh".
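The position string and the zoom step used in this task can also be worked through in code. The following Python sketch parses a position such as chr17:40287633-40304657 and widens the window by a given factor around its midpoint. Centring the zoom on the midpoint is an assumption made for illustration; the browser's own zoom may behave differently. The function names are hypothetical. The `db=hg38` parameter in the URL helper is added here and does not appear in the link given above.

```python
import re
from urllib.parse import urlencode


def parse_position(position):
    """Split a 'chrN:start-end' string into (chrom, start, end)."""
    m = re.fullmatch(r"(chr\w+):([\d,]+)-([\d,]+)", position)
    if m is None:
        raise ValueError(f"not a valid position string: {position!r}")
    start = int(m.group(2).replace(",", ""))
    end = int(m.group(3).replace(",", ""))
    return m.group(1), start, end


def zoom_out(start, end, factor):
    """Widen a 1-based inclusive window by `factor`, centred on its midpoint."""
    width = end - start + 1
    extra = round(width * factor) - width
    left = extra // 2
    # Never move the left edge before the first base of the chromosome.
    return max(1, start - left), end + (extra - left)


def browser_url(chrom, start, end, db="hg38"):
    """Build a Genome Browser link for a window (the `db` parameter is an assumption)."""
    query = urlencode({"db": db, "position": f"{chrom}:{start}-{end}"}, safe=":")
    return "http://genome.ucsc.edu/cgi-bin/hgTracks?" + query


chrom, start, end = parse_position("chr17:40287633-40304657")
new_start, new_end = zoom_out(start, end, 1.5)
print(browser_url(chrom, new_start, new_end))
```

Compare the window computed here with the coordinates the browser reports after you click "1.5x"; small differences are to be expected if the browser rounds or centres differently.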

Based on this kind of information, it should be straightforward to identify human transcription factors that potentially regulate human Cdc6 and determine - via sequence comparisons - whether any of them are homologous to any of the yeast transcription factors or factors in YFO. Through a detailed analysis of existing systems, their regulatory components and the conservation of regulation, one can in principle establish functional equivalences across large evolutionary distances.
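As a first step towards such an analysis, the binding sites that lie just upstream of a transcription start site can be picked out programmatically. The Python sketch below assumes that binding sites are available as (name, start, end) tuples. The site names, the coordinates and the 500 bp window are hypothetical and are not taken from the ORegAnno track.

```python
def upstream_sites(sites, tss, strand, window):
    """Return names of sites lying entirely within `window` bp upstream of the TSS.

    On the "+" strand upstream means smaller coordinates than the TSS;
    on the "-" strand upstream means larger coordinates.
    """
    hits = []
    for name, start, end in sites:
        if strand == "+":
            is_upstream = end < tss and start >= tss - window
        else:
            is_upstream = start > tss and end <= tss + window
        if is_upstream:
            hits.append(name)
    return hits


# Hypothetical binding sites around a hypothetical TSS at position 10000.
sites = [
    ("TF_A", 9700, 9720),
    ("TF_B", 9900, 9915),
    ("TF_C", 10400, 10420),
    ("TF_D", 8000, 8020),
]

print(upstream_sites(sites, 10000, "+", 500))
print(upstream_sites(sites, 10000, "-", 500))
```

The resulting list of factor names would be the starting point for the sequence comparisons against yeast or YFO proteins described above.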

Task:
Finally:

Print this page, but print the first page only.
With a red pen, mark and label the following four items on your print-out:
1. The first exon of CDC6.
2. The chromosomal coordinates of the current view.
3. The binding sites for the transcription factors that bind to the CDC6 promoter.
4. The locations of the missense-variant SNPs.
Write your name and Student number on this page and bring it to class to hand it in on Tuesday.

Links and resources

Wang et al. (2013) A brief introduction to web-based genome browsers. Brief Bioinformatics 14:131-43. (pmid: 22764121)

[ PubMed ] [ DOI ] Genome browser provides a graphical interface for users to browse, search, retrieve and analyze genomic sequence and annotation data. Web-based genome browsers can be classified into general genome browsers with multiple species and species-specific genome browsers. In this review, we attempt to give an overview for the main functions and features of web-based genome browsers, covering data visualization, retrieval, analysis and customization. To give a brief introduction to the multiple-species genome browser, we describe the user interface and main functions of the Ensembl and UCSC genome browsers using the human alpha-globin gene cluster as an example. We further use the MSU and the Rice-Map genome browsers to show some special features of species-specific genome browser, taking a rice transcription factor gene OsSPL14 as an example.

Sloan et al. (2016) ENCODE data at the ENCODE portal. Nucleic Acids Res 44:D726-32. (pmid: 26527727)

[ PubMed ] [ DOI ] The Encyclopedia of DNA Elements (ENCODE) Project is in its third phase of creating a comprehensive catalog of functional elements in the human genome. This phase of the project includes an expansion of assays that measure diverse RNA populations, identify proteins that interact with RNA and DNA, probe regions of DNA hypersensitivity, and measure levels of DNA methylation in a wide range of cell and tissue types to identify putative regulatory elements. To date, results for almost 5000 experiments have been released for use by the scientific community. These data are available for searching, visualization and download at the new ENCODE Portal (www.encodeproject.org). The revamped ENCODE Portal provides new ways to browse and search the ENCODE data based on the metadata that describe the assays as well as summaries of the assays that focus on data provenance. In addition, it is a flexible platform that allows integration of genomic data from multiple projects. The portal experience was designed to improve access to ENCODE data by relying on metadata that allow reusability and reproducibility of the experiments.

Pazin (2015) Using the ENCODE Resource for Functional Annotation of Genetic Variants. Cold Spring Harb Protoc 2015:522-36. (pmid: 25762420)

[ PubMed ] [ DOI ] This article illustrates the use of the Encyclopedia of DNA Elements (ENCODE) resource to generate or refine hypotheses from genomic data on disease and other phenotypic traits. First, the goals and history of ENCODE and related epigenomics projects are reviewed. Second, the rationale for ENCODE and the major data types used by ENCODE are briefly described, as are some standard heuristics for their interpretation. Third, the use of the ENCODE resource is examined. Standard use cases for ENCODE, accessing the ENCODE resource, and accessing data from related projects are discussed. Although the focus of this article is the use of ENCODE data, some of the same approaches can be used with data from other projects.

ENCODE Project Consortium (2011) A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol 9:e1001046. (pmid: 21526222)

[ PubMed ] [ DOI ] The mission of the Encyclopedia of DNA Elements (ENCODE) Project is to enable the scientific and medical communities to interpret the human genome sequence and apply it to understand human biology and improve health. The ENCODE Consortium is integrating multiple technologies and approaches in a collective effort to discover and define the functional elements encoded in the human genome, including genes, transcripts, and transcriptional regulatory regions, together with their attendant chromatin states and DNA methylation patterns. In the process, standards to ensure high-quality data have been implemented, and novel algorithms have been developed to facilitate analysis. Data and derived results are made available through a freely accessible database. Here we provide an overview of the project and the resources it is generating and illustrate the application of ENCODE data to interpret the human genome.

Zarrei et al. (2015) A copy number variation map of the human genome. Nat Rev Genet 16:172-83. (pmid: 25645873)

[ PubMed ] [ DOI ] A major contribution to the genome variability among individuals comes from deletions and duplications - collectively termed copy number variations (CNVs) - which alter the diploid status of DNA. These alterations may have no phenotypic effect, account for adaptive traits or can underlie disease. We have compiled published high-quality data on healthy individuals of various ethnicities to construct an updated CNV map of the human genome. Depending on the level of stringency of the map, we estimated that 4.8-9.5% of the genome contributes to CNV and found approximately 100 genes that can be completely deleted without producing apparent phenotypic consequences. This map will aid the interpretation of new CNV findings for both clinical and research applications.

Footnotes and references

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.

Do consider how to ask your questions so that a meaningful answer is possible:
- How to create a Minimal, Complete, and Verifiable example on stackoverflow and ...
- How to make a great R reproducible example are required reading.

< Assignment 8

Assignment 10 >

@@ Line 2: / Line 2: @@
 <div class="b1">
 Assignment for Week 9<br />
-<span style="font-size: 70%">Genome Analysis</span>
+<span style="font-size: 70%">Genomics</span>
 </div>
 <table style="width:100%;"><tr>
@@ Line 9: / Line 9: @@
 </tr></table>
-{{Template:Active}}
+{{Template:Inactive}}
 Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.
@@ Line 36: / Line 36: @@
 {{vspace}}
+<!--
 ==GBrowse==
 {{smallvspace}}
@@ Line 56: / Line 56: @@
 {{vspace}}
+-->
+<!--
 ==NCBI Map Viewer==
 {{smallvspace}}
@@ Line 74: / Line 75: @@
 {{vspace}}
+-->
+<!--
 ==Ensembl==
 {{smallvspace}}
@@ Line 104: / Line 106: @@
 {{vspace}}
+-->
 ==The UCSC genome browser==
 {{smallvspace}}
@@ Line 111: / Line 116: @@
 {{task|1=
-In this task you will access the UCSC genome browser view of the yeast Cdc6 gene and its human orthologue. You will explore some of the very large number of tracks that are available for both, and compare transcription factor binding regions.
+In this task you will access the UCSC genome browser view of the <!-- yeast Cdc6 gene and its human orthologue --> human Cdc6 gene. You will explore some of the very large number of tracks that are available and study the transcription factor binding region.
-# Navigate to the [http://genome.ucsc.edu/ '''UCSC''' Genome Bioinformatics entry page] and follow the link to the '''Genome Browser''' in the left-hand menu.
+# Navigate to the [http://genome.ucsc.edu/ '''UCSC''' Genome Bioinformatics entry page] and follow the link to the '''Genome Browser''' in the "Our tools" section.
+<!--
 # From the available menus, access the ''S. cerevisiae'' information ('''group &rarr; other''') and enter Cdc6 as the '''search term'''.
 # Click on the link to the [http://genome.ucsc.edu/cgi-bin/hgTracks?position=chrX:69338-70879&hgsid=311433759&sgdGene=pack&hgFind.matches=YJL194W, Cdc6 gene] on chromosome X.
@@ Line 122: / Line 128: @@
 # Open a second window, and access the UCSC Genome browser for the '''human genome'''. Search for CDC6 and click the link to [http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr17:38444146-38459413&hgsid=394751891_WVshDjZBOw5nRfbXOotacA9pGJn5&knownGene=pack&hgFind.matches=uc002huj.1, <code>Homo sapiens cell division cycle 6 (CDC6), mRNA</code>] on chromosome 17.
+-->
+# Click on the link to humans. Note that this is the hg38 assembly.
+# Enter CDC6 into the "Position/Search Term" field and click "Go". You should get a list of entries, click on the top link, the gene on chromosome 17: <tt>[http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr17:40287633-40304657&hgsid=570479629_xD9YY3QMJ4u2xrTagkgV7xJMqEen&knownGene=pack&hgFind.matches=uc002huj.2, CDC6 (uc002huj.2) at chr17:40287633-40304657]</tt>
+# Zoom out '''1.5x''' to view the upstream regulatory region: the end of the adjacent WIPF2 gene should have just come into view on the left.
 # Study the Genome Browser view of the human CDC6 homolog.
 ## In particular, note the extensive functional annotations of DNA and the alignments of vertebrate syntenic regions that allow detailed genomic comparisons.
 ## Distinguish between exon and intron sequence.
 ## Note that the mammal Conservation track has high values for all of the exons, but not only for exons.
-# Zoom out '''1.5x''' and click/slide the gene to the right to view the upstream regulatory region.
+## Find more information on the "Layered H3K27Ac" tract.
-# On the page, note the '''large''' number of available tracks that have been integrated into this view. Most of them are switched off. Find the '''Regulation''' section, and click on '''ENC TF Binding''' to access the information page on where this data originates from. Note that you can switch individual experiments on or off on this page, as well as setting the display format for all of the results. Set the selection for '''HAIB TFBS, SYDH TFBS,''' and '''UChicago TFBS''' to '''dense''', set the display to '''show''' and click the '''Submit''' button.
-# Get a sense of the amount of information that is displayed here and note that all experiments agree on a regulatory region that ranges from about 1.5kb upstream to 0.5 kb downstream of the transcription start.
+# Note the '''large''' number of available tracks that have been integrated into this view. Most of them are switched off. Find the '''Regulation''' section, and follow the link to the "ORegAnno" information to see what that is about. Note that you can switch individual annotations on or off on this page, as well as set the display format for all of the results. Select the check-box '''only''' for "transcription factor binding site" to be on, select the "Display mode" to '''full''' and click '''submit'''.
-# Go back to the '''ENCODE Transcription Factor Binding Tracks''' page uncheck all of the data sources except for the ENCODE/Stanford/Yale/USC/Harvard Chip-seq experiment (SYDH TFBS), set the format to '''full''', '''Display mode: show''' and click '''submit'''.
+# Study this information and note:
-# The resulting tracks are an excellent view of the kind of information that is provided by ChIP-seq experiments in which bound transcription factors are crosslinked to the DNA, immuno-precipitated with transcription factor specific antibodies, and the co-precipitated DNA sequenced with high-throughput sequencing methods. Note that most sequence tags are found in a unimodal distribution close to the transcription start.
+## There is a cluster of TFBS just upstream of the transcription initiation site.
-# Now scroll down to the track sections, '''hide''' the '''ENCODE TF binding data''' and show the '''full''' view of the '''TFBS conserved''' track - a consensus of human/mouse and rat annotated TF binding sites. Click on the small vertical bar in the <code>V$E2F_02</code> row, this will take you to a detailed information page on this transcription factor, with cross-references to the databases.
+## This cluster coincides with the highest H3K27Ac density.
+## If you &lt;control&gt;-click (right-click?) on the top orange bar of this cluster, a contextual menu opens from which you can access the details page for OREG1791811 in a new window. Follow the link to the RBL2 transcription factor via [http://useast.ensembl.org/Homo_sapiens/Transcript/Summary?g=ENSG00000103479;r=16:53445781-53491648;t=ENST00000379935 ENST00000379935] ... from where you can access transcript and gene and expression and protein family and GO and all other information.
+# Go back to the Genome Browser and set the ORegAnno tract to "pack" and click "refresh".
+# Slide the SNP track to just beneath the RefSeq genes track that contains the introns and exons. You will notice that one of the SNPs is green, and two are red. Why? Set the "Common SNPs" track display mode to "pack" and click "refresh".
 }}
-Based on this kind of information, it should be straightforward to identify human transcription factors that potentially regulate human Cdc6 and determine - via sequence comparisons - whether any of them are homologous to any of the yeast transcription factors. Through a detailed analysis of existing systems, their regulatory components and the conservation of regulation, one can in principle establish functional equivalences across large evolutionary distances.
+Based on this kind of information, it should be straightforward to identify human transcription factors that potentially regulate human Cdc6 and determine - via sequence comparisons - whether any of them are homologous to any of the yeast transcription factors or factors in YFO. Through a detailed analysis of existing systems, their regulatory components and the conservation of regulation, one can in principle establish functional equivalences across large evolutionary distances.
+<!--
 The UCSC browser has a sometimes bewildering amount of information available. But its curators are aware of the need for educating users regarding the utility of their tools.
@@ Line 155: / Line 169: @@
 * You can also work through the [http://www.nature.com/scitable/ebooks/guide-to-the-ucsc-genome-browser-16569863 Guide to the UCSC Genome Browser at "nature"] which gives an excellent, in-depth overview.
 * Study the ''User's guide to ENCODE'' paper linked below.
+-->
+{{task|1=
+Finally:
+# Print this page, but print the first page only.
+# With a red pen, mark and label the following four items on your print-out:
+## The first exon of CDC6.
+## The chromosomal coordinates of the current view.
+## The binding sites for the transcription factors that bind to the CDC6 promoter.
+## The locations of the missense-variant SNPs.
+# Write your name and Student number on this page and bring it to class to hand it in on Tuesday.
+}}
