Difference between revisions of "BIO Assignment Week 9"

From "A B C"
m
 
(17 intermediate revisions by the same user not shown)
Line 2: Line 2:
 
<div class="b1">
 
<div class="b1">
 
Assignment for Week 9<br />
 
Assignment for Week 9<br />
<span style="font-size: 70%">Protein Ligand Complex</span>
+
<span style="font-size: 70%">Genomics</span>
 
</div>
 
</div>
 +
<table style="width:100%;"><tr>
 +
<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_8|&lt;&nbsp;Assignment&nbsp;8]]</td>
 +
<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[BIO_Assignment_Week_10|Assignment&nbsp;10&nbsp;&gt;]]</td>
 +
</tr></table>
  
{{Template:Active}}
+
{{Template:Inactive}}
  
 
Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.  
 
Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.  
Line 12: Line 16:
 
__TOC__
 
__TOC__
  
 +
{{vspace}}
  
&nbsp;
 
 
==Introduction==
 
==Introduction==
 +
{{vspace}}
  
One of the really interesting questions we can discuss with reference to our model is how sequence variation might result in changed DNA recognition sites, and then lead to changed cognate DNA binding sequences. In order to address this, we would need to generate a plausible structural model for how DNA is bound to APSES domains.
+
Large-scale genome sequencing and annotation have made a wealth of information available, all of it related to the same biological object: the DNA. The information, however, can be of very different types. It includes:
 +
* the actual sequence
 +
* sequence variants (SNPs and CNVs)
 +
* conservation between related species
 +
* genes (with introns and exons)
 +
* mRNAs
 +
* expression levels
 +
* regulatory features such as transcription factor binding sites
 +
and much more.
  
Since there is currently no software available that would accurately model such a complex from first principles, we will base a model of a bound complex on homology modeling as well. This means we need to find a similar structure for which the position of bound DNA is known, then superimpose that structure with our model. This places the DNA molecule into the spatial context of the model we are studying. However, you may remember from the third assignment that the APSES domains in fungi seem to be a relatively small family. And there is no structure available of an APSES domain-DNA complex. How can we find a coordinate set of a structurally similar protein-DNA complex?
+
Since all of this information relates to specific positions or ranges on the chromosome, displaying it alongside the chromosomal coordinates is a useful way to integrate and visualize it. We call such strips of annotation ''tracks'' and display them in ''genome browsers''. Quite a number of such browsers exist and most work on the same principle: server-hosted databases are queried through a Web interface; the resulting data is displayed graphically in a Web browser window. The large data centres each have their own browsers, but arguably the best engineered, most informative and most widely used one is provided by the University of California Santa Cruz (UCSC) Genome Browser Project.
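The idea that every annotation is a range on chromosomal coordinates can be sketched in a few lines of base R. This is only an illustration, not how any genome browser is implemented: the <code>CDC6</code> range is the hg38 position that the UCSC browser reports for the human gene, all other rows are made-up placeholders, and <code>parsePosition()</code> and <code>featuresInWindow()</code> are hypothetical helper names.

<source lang="rsplus">
# Parse a browser-style position string such as "chr17:40287633-40304657".
parsePosition <- function(pos) {
  parts <- regmatches(pos, regexec("^([^:]+):([0-9]+)-([0-9]+)$", pos))[[1]]
  if (length(parts) != 4) stop("unrecognized position string: ", pos)
  list(chrom = parts[2],
       start = as.numeric(parts[3]),
       end   = as.numeric(parts[4]))
}

# Toy annotation table: one row per feature, one "track" label per row.
# Only the CDC6 coordinates are real; the other rows are placeholders.
features <- data.frame(
  track = c("gene", "exon", "TFBS", "gene"),
  name  = c("CDC6", "exampleExon", "exampleSite", "otherGene"),
  chrom = c("chr17", "chr17", "chr17", "chr2"),
  start = c(40287633, 40290000, 40287000, 1000),
  end   = c(40304657, 40290200, 40287020, 2000),
  stringsAsFactors = FALSE
)

# Return all features that overlap the requested window.
featuresInWindow <- function(features, pos) {
  w   <- parsePosition(pos)
  hit <- features$chrom == w$chrom &
         features$start <= w$end &
         features$end   >= w$start
  features[hit, ]
}

geneView <- featuresInWindow(features, "chr17:40287633-40304657")
wideView <- featuresInWindow(features, "chr17:40286000-40305000")

# Group the visible features by track, as a browser would stack them.
split(wideView$name, wideView$track)
</source>

Widening the window is all that "zooming out" means in this picture: the placeholder binding site lies just outside the gene's own range and only appears in the wider view.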
  
This assignment is based on the homology model you built. You will (1) identify similar structures of distantly related domains for which protein-DNA complexes are known, (2) assemble a hypothetical complex structure and (3) consider whether the available evidence allows you to distinguish between different modes of ligand binding.
+
Compiling the data requires a massive annotation effort, which has not been completed for all genome-sequenced species. In particular, not all of our YFOs have been included in the major model-organism annotation efforts. The general strategy for analysis of a gene in YFO is thus to map it to homologous genes in {{WP|Model organism|model organisms}}. In this assignment you will explore the UCSC genome browser and we will go through an exercise that relates fungal replication genes to human genes. We have previously focused a lot on Mbp1 homologs, but these have no clear equivalents in "higher" eukaryotes. However, one of the key target genes of Mbp1 is the cell cycle protein {{WP|Cdc6}}, which is well conserved in fungi and other eukaryotes and has a {{WP|CDC6|human homolog}}. Since, generally speaking, the annotation level for human genes is the highest, we will have a closer look at that gene.
  
==Modeling a DNA ligand==
+
{{vspace}}
 +
<!--
 +
==GBrowse==
 +
{{smallvspace}}
  
&nbsp;
+
[http://gmod.org/wiki/GBrowse '''GBrowse'''] - the Generic genome Browser - is the browser developed by the [http://gmod.org/wiki/Main_Page Generic Model Organism Database] project that aims to make industry-strength bioinformatics tools and software available for the model organism community. One of the many databases that uses GMod tools is [http://www.yeastgenome.org/ the Saccharomyces Genome Database] but you will find the browser in use on many different sites.
  
&nbsp;
+
{{task|1=
 
+
In this task you will access the SGD GBrowse page for Cdc6 and explore some of the options.
 
+
# Navigate to [http://www.yeastgenome.org/ the Saccharomyces Genome Database], enter Cdc6 into the site search field and on the result page, in the '''Sequence''' / '''Location''' box, click on the [http://browse.yeastgenome.org/fgb2/gbrowse/scgenome/?name=YJL194W '''View in GBrowse'''] link.
===Finding a similar protein-DNA complex===
+
# Locate CDC6 (YJL194W) as a red bar in the graph. Note that the triangle at the end points in the direction of transcription.
 
+
# Note how the shape of the cursor changes over different regions of the window. For example, you can click/hold the graph and slide it left and right (this changes the overview indicator that shows where on the chromosome the currently displayed window of sequence is located). You can click on and follow annotation information. You can also select a stretch of nucleotides and dump it as FASTA (hover over the ruler in the ''Details'' pane). It should be obvious how this could be useful, e.g. to study untranslated regions upstream of the start-codon to validate translation start sites.
 
+
# Zoom in by selecting '''Show 5 kbp''' at the scroll/zoom controls.
&nbsp;<br>
+
# Click on the '''Select Tracks''' tab at the top (next to the '''Browser''' tab). This gives you access to a fine-grained selection of all tracks that have been created as genome annotations.
 
+
# Find the section for '''Transcription Factors''' (a subsection of '''Transcription Regulation'''). Click on the star next to '''TF ChIP chip''' to mark this experiment as a "favorite". Then click on '''Show Favorites Only''' at the top of the page. Finally check '''All on''' for the '''Transcription Factors''' track and '''Back to browser'''.
Remember that homologous sequences can have diverged to the point where their sequence similarity is no longer recognizable, even though their structure may be quite well conserved. Thus if we could find similar structures in the PDB, these might provide us with some plausible hypotheses for how DNA is bound by APSES domains. We thus need a tool similar to BLAST, not for sequence alignment but for structure alignment: a kind of BLAST for structures. Just like with sequence searches, we might not want to search with the entire protein if what we are interested in is a subdomain that binds to DNA. Attempting to match all structural elements in addition to the ones we are actually interested in is likely to make the search less specific - we would find false positives that are similar to some irrelevant part of our structure. However, defining too small a subdomain would also lead to a loss of specificity: in the extreme it is easy to imagine that the search for e.g. a single helix would retrieve very many hits that would be quite meaningless.
+
}}
  
  
 +
This view shows you the ChIP-chip validated TF-binding sites in the upstream regulatory region of yeast Cdc6. Note that Mbp1 is among them. Curiously, Swi6 is also listed there - but you know that [http://www.yeastgenome.org/cgi-bin/locus.fpl?locus=YLR182W Swi6] does not actually bind DNA directly, but forms a complex with the APSES domain transcription factors Mbp1/Swi4 which form the [http://www.yeastgenome.org/cgi-bin/GO/goTerm.pl?goid=0030907 MBF] complex. However, crosslinking of the complex and immunoprecipitation with anti-Swi6 would certainly identify this region. You should be aware that an annotation of a protein in a ChIP-chip experiment is not the same as demonstrating a protein's physical interaction with DNA.
  
 +
{{vspace}}
 +
-->
 +
<!--
 +
==NCBI Map Viewer==
 +
{{smallvspace}}
  
 +
{{task|1=
  
At the '''NCBI''', [http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml VAST] is provided as a search tool for structural similarity search.  
+
In this task you will locate and display a map view at the NCBI for the yeast Cdc6 gene.
  
{{task|1=
+
# Navigate to the [http://www.ncbi.nlm.nih.gov/ '''NCBI''' home page] and follow the link to '''Genomes & maps''' in the left-hand menu.
# Navigate to the [http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml '''VAST'''] search interface page.
+
# Click on the '''Tools''' tab and find the link to the [http://www.ncbi.nlm.nih.gov/mapview/ '''Map Viewer''']
# Enter <code>1bm8</code> as the PDB ID to search for and click '''Go'''.
+
# In the '''Fungi''' section, click on the latest "build" of the ''Saccharomyces cerevisiae'' genome. This takes you to an overview page of the status of the Genome project. Each chromosome is linked to its map. If you did not know what chromosome to look for, you would need to search by keyword or gene name in the nucleotide database. Regarding Cdc6, you remember from the task above that it is located on [http://www.ncbi.nlm.nih.gov/projects/mapview/maps.cgi?taxid=4932&chr=X Chromosome X] (''i.e.'' the {{WP|Roman numerals|roman numeral}} ten, not the "X-Chromosome"). You will arrive at the actual mapview of the entire chromosome with the RefSeq accession number <code>NC_001142.9</code>. This large nucleotide record containing the entire chromosomal sequence underlies the display.
# Study the result.
+
# Enter '''Cdc6''' into the Search field and click the '''Find in This View''' button. Then zoom in a few levels.
 
}}
 
}}
  
  
You will see that VAST finds a large number of partially similar structures, but it would be almost impossibly tedious to find structures of protein-DNA complexes that are similar in the core of the interaction domain. It turns out that our search is not specific enough in two ways: we have structural elements in our PDB file that are unnecessary for the question at hand, and thus cause the program to find irrelevant matches. But if we constrain ourselves to just a single helix and strand (i.e. the 50-74 subdomain that has been implicated in DNA binding), the search will become too non-specific. Also, we have no good way to retrieve functional information from these hits: which ones are DNA-binding proteins that bind DNA through residues of this subdomain, and for which of them has the structure of a complex been solved? It seems we need to define our question more precisely.
+
The [http://www.ncbi.nlm.nih.gov/projects/mapview/maps.cgi?TAXID=4932&CHR=X&MAPS=cntg-r,genes%5B36220.54%3A43678.04%5D&QUERY=Cdc6&zoom=10 resulting view] shows you the location and orientation of the gene on the chromosome. A number of links to various NCBI databases are given for each gene. Note that this is primarily a tool for database crossreferencing, not for integrating and displaying annotations.
 
 
{{task|1=
 
# Open VMD and load the 1BM8 structure or your homology model.
 
# Display the backbone as a '''Trace''' (of CA atoms) and color by '''Index'''
 
# In the sequence viewer, highlight residues 50 to 74.
 
# In the representations window, find the yellow representation (with Color ID 4) that the sequence viewer has generated. Change the '''Drawing Method''' to '''NewCartoon'''.
 
# Now (using stereo), study the topology of the region. Focus on the helix at the N-terminus of the highlighted subdomain; it is preceded by a turn and another helix. This first helix makes interactions with the beta hairpin at the C-terminal end of the subdomain and is thus important for the orientation of these elements. (This is what is referred to as a helix-turn-helix motif, or HtH motif; it is very common in DNA-binding proteins.)
 
# Holding the shift key in the alignment viewer, extend your selection until you cover all of the first helix, and the residues that contact the beta hairpin. I think that the first residue of interest here is residue 33.
 
# Again holding the shift key, extend the selection at the C-terminus to include the residues of the beta hairpin to where they contact the helix at the N-terminus. I think that the last residue of interest here is residue 79.
 
# Study the topology and arrangement of this compact subdomain. It contains the DNA-binding elements and probably most of the interactions that establish its three-dimensional shape. This subdomain even has a name: it is a ''winged helix'' DNA binding motif, a member of a very large family of DNA-binding domains. I have linked a review by Gajiwala and Burley to the end of this page; note that their definition of a canonical winged helix motif is a bit larger than what we have here, with an additional helix at the N-terminus and a second "wing".
 
}}
 
  
 +
{{vspace}}
 +
-->
 +
<!--
 +
==Ensembl==
 +
{{smallvspace}}
  
Armed with this insight, we can attempt again to find meaningfully similar structures. At the '''EBI''' there are a number of very well designed structure analysis tools linked off the [http://www.ebi.ac.uk/Tools/structural.html '''Structural Analysis''' page]. As part of its MSD Services, [http://www.ebi.ac.uk/msd-srv/ssm/ '''PDBeFold'''] provides a convenient interface for structure searches for our purpose.
+
The EBI offers its own version of genome browsers through the Ensembl project. A large number of genomes have been annotated, cross-referenced and made available for viewing. The EBI has spent a lot of effort on automated curation of their genome offerings. '''The Ensembl offerings are therefore more comprehensive and complete than those of other sources'''. In particular, you may find a genome view for YFO. Use any other fungus if YFO is not present.
  
 
{{task|1=
 
{{task|1=
# Navigate to the [http://www.ebi.ac.uk/msd-srv/ssm/ '''PDBeFold'''] search interface page.
 
# Enter <code>1bm8</code> for the '''PDB code''' and choose '''Select domain''' from the drop down menu. Select Secondary Structure elements 4 to 7, i.e. those elements that span the range you have previously defined.
 
# Note that you can enter the lowest acceptable match % separately for query and target. This means: what percentage of secondary structure elements would need to be matched in either query or target to produce a hit. Keep that value at 80 for our query, since we would want to find structures with almost all of the elements of the winged helix motif. Set the  match to 10 % for the target, since we are interested in such domains even if they happen to be small subdomains of large proteins.
 
# Keep the '''Precision''' at '''normal'''. Precision and % query match could be relaxed if we wanted to find more structures.
 
#  Finally click on: '''Submit your query'''.
 
# On the results page, click on the number next to the top hit ''that is not one of our familiar Mbp1 structures'' to get a detailed view of the result. Most likely this is <code>1no7:b</code>, a herpesvirus capsid domain. Click on '''View Superposed'''. This will open a window with the structure coordinates superimposed in the Jmol molecular viewer. Control-click anywhere in the window area to open a menu of viewing options. Select '''Style &rarr; Stereographic &rarr; Wall-eyed viewing'''. Then study the superposition. You will note that the secondary structure elements match quite well, but clearly, we have not found a winged-helix domain. If you consider the '''secondary structure alignment''' in the results page, you will notice that there are a significant number of gaps between the elements.
 
# Go back to the table of results and re-sort the table by '''number of gaps'''. This will bring a different protein to the top (again, excepting Mbp1 structures). Explore its structure as well.
 
}}
 
 
 
All in all this appears to be well engineered software! It gives you many options to access result details for further processing. I think this may be useful. But for our problem, we would have to search through too many structures because, once again, we can't tell which ones of the hits are DNA binding domains, especially domains for which the structure of a complex has been solved.
 
  
 +
In this task you will review the Ensembl view of the YFO ortholog to yeast CDC6.
  
[[Image:A5_Mbp1_subdomain.jpg|frame|none|Stereo-view of a subdomain within the 1MB1 structure that includes residues 36 to 76. The color gradient ramps from blue (36) to green (76) and the "wing" is clearly seen as the green pair of beta-strands, extending to the right of the helix-turn-helix motif.]]
+
# Navigate to the [http://fungi.ensembl.org/index.html '''EnsemblFungi'''] page (easy to find via Google).
  
&nbsp;<br>
+
# Select ''Saccharomyces cerevisiae'' from the species list.
 +
# '''Search''' for  Cdc6 as a search term in the ''Search Saccharomyces cerevisiae ...'' field.
 +
# Click on [http://fungi.ensembl.org/Saccharomyces_cerevisiae/Gene/Summary?g=YJL194W;r=X:69338-70879;t=YJL194W;db=core CDC6 (YJL194W)]
  
APSES domains represent one branch of the tree of helix-turn-helix (HTH) DNA binding modules. (A review on HTH proteins is linked from the resources section at the bottom of this page). Winged Helix domains typically bind their cognate DNA with a "recognition helix" which precedes the beta hairpin and binds into the major groove; additional stabilizing interactions are provided by the edge of a beta-strand binding into the minor groove. This is good news: once we have determined that the APSES domain is actually an example of a larger group of transcription factors, we can compare our model to a structure of a protein-DNA complex. Superfamilies of such structural domains are compiled in the CATH database. Unfortunately CATH itself does not provide information about whether the structures have been determined as complexes. '''But''' we can search the PDB with CATH codes and restrict the results to complexes. Essentially, this should give us a list of all winged helix domains for which the structure of complexes with DNA have been determined. This works as follows:
+
You will be taken to a browser view of the genome. Tracks can be switched on and off through the menu on the left hand side.
  
{{task|1=
+
# Find the link to [http://fungi.ensembl.org/Saccharomyces_cerevisiae/Gene/Compara_Ortholog?db=core;g=YJL194W;r=X:69338-70879;t=YJL194W '''Orthologues'''] under the '''Fungal Compara''' section in the menu.
* For reference, access [http://www.cathdb.info/version/latest/superfamily/1.10.10.10 CATH domain superfamily 1.10.10.10]; this is the CATH classification code we will use to find protein-DNA complexes.
+
# In the resulting page, find the YFO orthologue and click on the link in the '''Location''' column.
 +
# On the Browser page, click on the cogwheel icon in the bottom left bar of the lower pane to configure tracks.
 +
# On the configuration page, in the '''Configure Region Image''' tab, click on '''Sequence and Assembly''' in the left-hand menu and click the (check)-boxes to turn '''Contigs''' off and '''Translated sequence''' on. Leave '''Sequence''' on. Click the checkmark in the top-right corner of the configuration window to close it and return to the browser view.
 +
# Zoom in until you see the display of the actual nucleotides and the six reading frames. This is a genome view of YFO at the actual nucleotide level.
  
# Navigate to the [http://www.pdb.org/ PDB home page] and follow the link to [http://www.pdb.org/pdb/search/advSearch.do Advanced Search]
 
# In the options menu for '''Choose a Query Type''' select '''Structure Features &rarr; CATH Classification Browser'''. A window will open that allows you to navigate down through the CATH tree. You can view the Class/Architecture/Topology names on the CATH page linked above. Click on '''the triangle icons''' (not the text) for '''Mainly Alpha &rarr; Orthogonal Bundle &rarr; ARC repressor mutant, subunit A''' then click on the link to '''winged helix repressor DNA binding domain'''. Or, just enter "winged helix" into the search field. This subquery should match more than 500 coordinate entries.
 
# Click on the '''(+)''' button behind '''Add search criteria''' to add an additional query. Select the option '''Structure Features &rarr; Macromolecule type'''. In the option menus that pop up, select '''Contains Protein&rarr;Yes, Contains DNA&rarr;Yes, Contains RNA&rarr;Ignore, Contains DNA/RNA hybrid&rarr;Ignore'''. This selects files that contain Protein-DNA complexes.
 
# Check the box below this subquery to '''Remove Similar Sequences at 90% identity''' and click on '''Submit Query'''. This query should retrieve more than 90 complexes.
 
# Scroll down to the beginning of the list of PDB codes and locate the '''Reports''' menu. Under the heading '''View''' select '''Gallery'''. This is a fast way to obtain an overview of the structures that have been returned. Adjust the number of '''Results''' to see all 90 of the images and choose '''Options&rarr;Resize medium'''.
 
# Finally we have a set of winged-helix domain/DNA complexes, for comparison. Scroll through the gallery and study how the protein binds DNA.
 
 
}}
 
}}
  
  
First of all you may notice that in fact not all of the structures are really different, despite having requested only to retrieve dissimilar sequences, and not all images show DNA. This appears to be a deficiency of the algorithm. But you can also easily recognize how in most of the structures the '''recognition helix inserts into the major groove of B-DNA''' (e.g. 1BC8, 1CF7) and the wing - if clearly visible at all in the image - appears to make accessory interactions with the DNA backbone. There is one exception: the structure 1DP7 shows how the human RFX1 protein binds DNA in a non-canonical way, through the beta-strands of the "wing". This is interesting since it suggests there is more than one way for winged helix domains to bind to DNA. We can therefore use structural superposition of '''your homology model''' and '''two of the winged-helix proteins''' to decide whether the canonical or the non-canonical mode of DNA binding seems to be more plausible for Mbp1 orthologues.
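Once corresponding residue pairs are fixed, superposition itself is a small least-squares problem. The following is a sketch of the Kabsch procedure in base R, under the assumption that you already have two matrices of paired CA coordinates (one row per residue, columns x, y, z). It is not the algorithm VMD uses to find the pairing, and it does not find the pairing for you; <code>superimpose()</code> is a hypothetical name and the coordinates are random test data.

<source lang="rsplus">
# Least-squares superposition of paired coordinates (Kabsch procedure).
# mobile, target: numeric matrices with identical dimensions, n rows x 3 columns.
superimpose <- function(mobile, target) {
  stopifnot(is.matrix(mobile), is.matrix(target),
            all(dim(mobile) == dim(target)), ncol(mobile) == 3)
  centreT <- colMeans(target)
  P <- sweep(mobile, 2, colMeans(mobile))  # centre both sets on the origin
  Q <- sweep(target, 2, centreT)
  s <- svd(t(P) %*% Q)                     # 3 x 3 covariance matrix
  d <- sign(det(s$v %*% t(s$u)))           # guard against a reflection
  R <- s$v %*% diag(c(1, 1, d)) %*% t(s$u) # optimal rotation
  fitted <- sweep(P %*% t(R), 2, centreT, "+")
  list(coords = fitted,
       rmsd   = sqrt(mean(rowSums((fitted - target)^2))))
}

# Test data: rotate and shift a random point set, then recover it.
set.seed(1)
target <- matrix(rnorm(30), ncol = 3)
a  <- pi / 6
Rz <- matrix(c(cos(a), sin(a), 0, -sin(a), cos(a), 0, 0, 0, 1), nrow = 3)
mobile <- sweep(target %*% t(Rz), 2, c(5, -2, 1), "+")

fit <- superimpose(mobile, target)
fit$rmsd
</source>

For real structures the RMSD will not be near zero; how large it is over the paired residues is one measure of how well the two domains agree.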
+
Ensembl provides a very comprehensive offering in terms of sequences, and it has a well thought-out and maintained [http://rest.ensemblgenomes.org/ REST API]. However, Ensembl too offers little in terms of annotations of DNA elements, expression levels and the like. Nevertheless, since it is the database with the largest number of species annotated, it would be the tool to go to if you were to compare syntenic regions or genomic context between different species.
  
 +
{{vspace}}
  
 +
-->
  
&nbsp;
+
==The UCSC genome browser==
 
+
{{smallvspace}}
===Preparation and superposition of a canonical complex===
 
 
 
&nbsp;<br>
 
 
 
The structure we shall use as a reference for the '''canonical binding mode''' is the Elk-1 transcription factor.
 
 
 
[[Image:A5_canonical_wHTH.jpg|frame|none|Stereo-view of the canonical DNA binding mode of the Winged Helix domain family. Shown here is the Elk-1 transcription factor - an ETS DNA binding domain - in complex with a high-affinity binding site (1DUX). Note how the "recognition helix" inserts into the major groove of the DNA molecule. The color gradient ramps from blue (34) to green (84). Note how the first helix of the "helix-turn-helix" architecture serves only to position the recognition helix and makes few interactions by itself.]]
 
  
The 1DUX coordinate-file contains two protein domains and two B-DNA dimers in one asymmetric unit. For simplicity, you should delete the second copy of the complex from the PDB file. (Remember that PDB files are simply text files that can be edited.)
+
The University of California Santa Cruz (UCSC) Genome Browser Project has the largest offering of annotation information. However, it is strictly model-organism oriented and you will probably not find YFO among its curated genomes. Nevertheless, if you are studying e.g. human genes, or yeast, the UCSC browser will probably be your first choice.
  
 
{{task|1=
 
{{task|1=
# Find the 1DUX structure in the image gallery and open the 1DUX structure explorer page in a separate window. Download the coordinates to your computer.
 
# Open the coordinate file in a text-editor (TextEdit or Notepad - '''NOT''' MS-Word!) and delete the coordinates for chains <code>D</code>,<code>E</code> and <code>F</code>; you may also delete all <code>HETATM</code> records and the <code>MASTER</code> record. Save the file with a different name, e.g. 1DUX_monomer.pdb .
 
# Open VMD and load your homology model. Turn off the axes, display the model as a Tube representation in stereo, and color it by Index. Then load your edited 1DUX file, display this coordinate set in a tube representation as well, and color it by ColorID in some color you like. It is important that you can distinguish easily which structure is which.
 
# You could use the Extensions&rarr;Analysis&rarr;RMSD calculator interface to superimpose the two structures '''IF''' you knew which residues correspond to each other. Sometimes it is useful to do exactly that: define exact correspondences between residue pairs and superimpose according to these selected pairs. For our purpose it is much simpler to use the Multiseq tool (and the structures are simple and small enough that the STAMP algorithm for structural alignment can define corresponding residue pairs automatically). Open the '''multiseq''' extension window, select the check-boxes next to both protein structures, and open the '''Tools&rarr;Stamp Structural Alignment''' interface.
 
# In the '''Stamp Alignment Options''' window, check the radio-button for ''Align the following ...'' '''Marked Structures''' and click on '''OK'''.
 
# In the '''Graphical Representations''' window, double-click on all "NewCartoon" representations for both molecules, to undisplay them.
 
# You should now see a superimposed tube model of your homology model and the 1DUX protein-DNA complex. You can explore it, display side-chains etc. and study some of the details of how a transcription factor recognizes and binds to its cognate DNA sequence. However, remember that your '''model''''s side-chain orientations have not been determined experimentally but inferred from the '''template''', and that the template's structure was determined in the absence of bound DNA ligand.
 
  
# Orient and scale your superimposed structures so that their structural similarity is apparent, and the recognition helix can be clearly seen inserting into the DNA major groove. You may want to keep a copy of the image for future reference. Consider which parts of the structure appear to superimpose best.  Note whether it is plausible that your '''model''' could bind a B-DNA double-helix in this orientation.
+
In this task you will access the UCSC genome browser view of the <!-- yeast Cdc6 gene and its human orthologue --> human Cdc6 gene. You will explore some of the very large number of tracks that are available and study the transcription factor binding region.
}}
 
  
&nbsp;<br>
+
# Navigate to the [http://genome.ucsc.edu/ '''UCSC''' Genome Bioinformatics entry page] and follow the link to the '''Genome Browser''' in the "Our tools" section.
&nbsp;
+
<!--
 +
# From the available menus, access the ''S. cerevisiae'' information ('''group &rarr; other''') and enter Cdc6 as the '''search term'''.
 +
# Click on the link to the [http://genome.ucsc.edu/cgi-bin/hgTracks?position=chrX:69338-70879&hgsid=311433759&sgdGene=pack&hgFind.matches=YJL194W, Cdc6 gene] on chromosome X.
 +
# Click on the button to zoom out '''3x''' - we want to see the upstream regulatory region.
 +
# In the subsection for '''Expression and Regulation''', find the menu for '''Regulatory Code''' and select '''full'''; select '''hide''' for all other expression tracks. Click '''refresh'''.
  
 +
Up to now, this looks very similar to the SGD genome browser.
  
===Preparation and superposition of a non-canonical complex===
+
# Open a second window, and access the UCSC Genome browser for the '''human genome'''. Search for CDC6 and click the link to [http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr17:38444146-38459413&hgsid=394751891_WVshDjZBOw5nRfbXOotacA9pGJn5&knownGene=pack&hgFind.matches=uc002huj.1, <code>Homo sapiens cell division cycle 6 (CDC6), mRNA</code>] on chromosome 17.
 
+
-->
 
+
# Click on the link to humans. Note that this is the hg38 assembly.
The structure displaying a non-canonical complex between a winged-helix domain and its cognate DNA binding site is the human Regulatory Factor X.
+
# Enter CDC6 into the "Position/Search Term" field and click "Go". You should get a list of entries, click on the top link, the gene on chromosome 17: <tt>[http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr17:40287633-40304657&hgsid=570479629_xD9YY3QMJ4u2xrTagkgV7xJMqEen&knownGene=pack&hgFind.matches=uc002huj.2, CDC6 (uc002huj.2) at chr17:40287633-40304657]</tt>
 
 
[[Image:A5_non-canonical_wHTH.jpg|frame|none|Stereo-view of a non-canonical wHTH-DNA complex, discovered with the structure of human Regulatory Factor X (hRFX) binding its cognate X-box DNA sequence (1DP7). Note how the helix that corresponds to the recognition helix in the canonical domain lies across the minor groove whereas the beta-"wing" inserts into the major groove. The color gradient ramps from blue (18) to green (68).]]
 
 
 
 
 
Before we can work with this, however, we have to fix an annoying problem. If you download and view the <code>1DP7</code> structure in VMD, you will notice that there is only a single strand of DNA! Where is the second strand of the double helix? It is not in the coordinate file, because it happens to be exactly equivalent to the first strand, rotated around a two-fold axis of symmetry in the crystal lattice. We need to download and work with the so-called '''Biological Assembly''' instead. But there is a problem related to the way the PDB stores replicates in biological assemblies. The PDB generates the additional chains as copies of the original and delineates them with <code>MODEL</code> and <code>ENDMDL</code> records, just like in a multi-structure NMR file. The chain IDs and the atom numbers are the same as the original. The PDB file thus contains the '''same molecule in two different orientations''', not '''two independent molecules'''. This is an important difference regarding how such molecules are displayed by VMD. '''If you try to use the biological unit file of the PDB, VMD does not recognize that there is a second molecule present and displays only one chain.''' And that looks exactly like the one we have seen before. We have to edit the file, extract the second DNA molecule, change its chain ID and then append it to the original 1DP7 structure...
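The file surgery just described can be sketched in base R without any extra package, since PDB files are fixed-width text and the chain ID sits in column 22 of every <code>ATOM</code> and <code>HETATM</code> record. This is only an illustration on a tiny made-up file, not a replacement for editing the real 1DP7 assembly: <code>makeAtom()</code> and <code>extractModelChain()</code> are hypothetical helper names, the toy records use simplified atom-name alignment, and the chain IDs <code>D</code> and <code>E</code> are assumptions that you should check against your own file.

<source lang="rsplus">
# Build one fixed-width coordinate record; the chain ID lands in column 22.
makeAtom <- function(serial, atom, res, chain, resSeq) {
  sprintf("ATOM  %5d %-4s %3s %s%4d    %8.3f%8.3f%8.3f",
          serial, atom, res, chain, resSeq, 0, 0, 0)
}

# Toy "biological assembly": the same chains repeated in two MODEL blocks.
toy <- c(
  "MODEL        1",
  makeAtom(1L, "CA", "GLY", "P", 1L),
  makeAtom(2L, "P",  "DA",  "D", 1L),
  "ENDMDL",
  "MODEL        2",
  makeAtom(1L, "CA", "GLY", "P", 1L),
  makeAtom(2L, "P",  "DA",  "D", 1L),
  makeAtom(3L, "P",  "DT",  "D", 2L),
  "ENDMDL",
  "END"
)

# Keep only the coordinate records of one chain in one MODEL block
# and give them a new chain ID.
extractModelChain <- function(lines, model, oldChain, newChain) {
  starts <- grep("^MODEL", lines)
  ends   <- grep("^ENDMDL", lines)
  ids    <- as.integer(trimws(sub("^MODEL", "", lines[starts])))
  first  <- starts[ids == model]
  if (length(first) != 1) stop("model ", model, " not found exactly once")
  last   <- min(ends[ends > first])
  block  <- lines[first:last]
  isAtom <- grepl("^(ATOM  |HETATM)", block)
  keep   <- block[isAtom & substr(block, 22, 22) == oldChain]
  substr(keep, 22, 22) <- newChain
  keep
}

dna2 <- extractModelChain(toy, model = 2, oldChain = "D", newChain = "E")
dna2
</source>

The returned records could then be pasted into the original coordinate file before its <code>END</code> line, which is what the manual steps in the task achieve.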
 
 
 
{{task|1=
 
# On the structure explorer page for 1DP7, select the option '''Download Files''' &rarr; '''PDB File'''.
 
# Also select the option '''Download Files''' &rarr; '''Biological Assembly'''.
 
# Uncompress the biological assembly file.
 
# Open the file in a text editor.
 
# Delete everything except the '''second DNA molecule'''. This comes after the <code>MODEL  2</code> line and has chain ID '''D'''. Keep the <code>TER</code> and <code>END</code> lines. Save this with a new filename (e.g. <code>1DP7_DNAonly.pdb</code>).
 
# Also delete all <code>HETATM</code> records for <code>HOH</code>, <code>PEG</code> and <code>EDO</code>, as well as the entire second protein chain and the <code>MASTER</code> record. The resulting file should only contain the DNA chain and its copy and one protein chain. Save the file with a new name, e.g. <code>1DP7_BDNA.PDB</code>.
 
# Use a procedure similar to [[BIO_Assignment_Week_8#R code: renumbering the model in the last assignment]] to change the chain ID.
 
 
 
<source lang="rsplus">
 
library(bio3d)   # provides read.pdb() and write.pdb()

PDBin <- "1DP7_DNAonly.pdb"
 
PDBout <- "1DP7_DNAnewChain.pdb"
 
  
pdb  <- read.pdb(PDBin)
+
# Zoom out '''1.5x''' to view the upstream regulatory region: the end of the adjacent WIPF2 gene should have just come into view on the left.
pdb$atom[,"chain"] <- "E"
+
# Study the Genome Browser view of the human CDC6 homolog.
write.pdb(pdb=pdb,file=PDBout)
+
## In particular, note the extensive functional annotations of DNA and the alignments of vertebrate syntenic regions that allow detailed genomic comparisons.
</source>
+
## Distinguish between exon and intron sequence.
 +
## Note that the mammal Conservation track has high values for all of the exons, but not only for exons.
 +
## Find more information on the "Layered H3K27Ac" track.
  
# Use your text-editor to open both the <code>1DP7.pdb</code> structure file and the  <code>1DP7_DNAnewChain.pdb</code>. Copy the DNA coordinates, paste them into the original file before the <code>END</code> line and save.
+
# Note the '''large''' number of available tracks that have been integrated into this view. Most of them are switched off. Find the '''Regulation''' section, and follow the link to the "ORegAnno" information to see what that is about. Note that you can switch individual annotations on or off on this page, as well as set the display format for all of the results. Select the check-box '''only''' for "transcription factor binding site" to be on, select the "Display mode" to '''full''' and click '''submit'''.
# Open the edited coordinate file with VMD. You should see '''one protein chain''' and a '''B-DNA double helix'''. (Actually, the B-DNA helix has a gap, because the R library did not read the BRDU nucleotide as DNA.) Switch to stereo viewing and spend some time to see how '''amazingly beautiful''' the complementarity between the protein and the DNA helix is (you might want to display ''protein'' and ''nucleic'' in separate representations and color the DNA chain by ''Position'' &rarr; ''Radial'' for clarity). In particular, appreciate how not all positively charged side chains contact the phosphate backbone: some penetrate into the helix and make detailed interactions with the nucleobases!
+
# Study this information and note:
# Then clear all molecules
+
## There is a cluster of TFBS just upstream of the transcription initiation site.
# In VMD, open '''Extensions&rarr;Analysis&rarr;MultiSeq'''. When you run MultiSeq for the first time, you will be asked for a directory in which to store metadata. You can use the default, or a directory of your choice; you may subsequently skip all steps that ask you to install "required" databases locally since we will not need them for this task.  
+
## This cluster coincides with the highest H3K27Ac density.
# Choose '''File&rarr;Import Data''', browse to your directory and load one by one:
+
## If you &lt;control&gt;-click (right-click?) on the top orange bar of this cluster, a contextual menu opens from which you can access the details page for OREG1791811 in a new window. Follow the link to the RBL2 transcription factor via [http://useast.ensembl.org/Homo_sapiens/Transcript/Summary?g=ENSG00000103479;r=16:53445781-53491648;t=ENST00000379935 ENST00000379935] ... from where you can access transcript and gene and expression and protein family and GO and all other information.
:: -Your model;
+
# Go back to the Genome Browser and set the ORegAnno track to "pack" and click "refresh".
:: -The 1DUX complex;
+
# Slide the SNP track to just beneath the RefSeq genes track that contains the introns and exons. You will notice that one of the SNPs is green, and two are red. Why? Set the "Common SNPs" track display mode to "pack" and click "refresh".
:: -The 1DP7 complex.  
 
# Mark all three protein chains by selecting the checkbox next to their name and choose '''Tools&rarr; STAMP structural alignment'''.
 
# '''Align''' the '''Marked Structures''', choose a '''scanscore''' of '''2''' and '''scanslide''' of '''5'''. Also choose '''Slow scan'''. You may have to play around with the settings to get the molecules to superimpose, but they '''can''' be superimposed quite well: at least the DNA-binding helices and the wings should line up.

# In the graphical representations window, double-click on the cartoon representations that MultiSeq has generated to undisplay them, and also undisplay the Tube representation of 1DUX. Then create a Tube representation for 1DP7, and select a Color by ColorID (a different color that you like). The resulting scene should look similar to the one you have created above, only with 1DP7 in place of 1DUX and colored differently.

# Orient and scale your superimposed structures so that their structural similarity is apparent, and the differences in binding elements are clear. Perhaps visualizing a solvent accessible surface of the DNA will help you understand the spatial requirements of the complex formation. You may want to keep a copy of the image for future reference. Note whether it is plausible that your '''model''' could bind a B-DNA double-helix in the "alternative" conformation.
 
 
}}
 
}}
  
  
&nbsp;
+
Based on this kind of information, it should be straightforward to identify human transcription factors that potentially regulate human Cdc6 and determine - via sequence comparisons - whether any of them are homologous to any of the yeast transcription factors or factors in YFO. Through a detailed analysis of existing systems, their regulatory components and the conservation of regulation, one can in principle establish functional equivalences across large evolutionary distances.
  
 
<!--
 
<!--
===Coloring by conservation===
+
The UCSC browser has a sometimes bewildering amount of information available. But its curators are aware of the need for educating users regarding the utility of their tools.
 
 
With the superimposed coordinates, you can begin to get a sense whether either or both binding modes could be appropriate for a protein-DNA complex in your Mbp1 orthologue. But these are geometrical criteria only, and the protein in your species may be flexible enough to adopt a different conformation in a complex, and different again from your model. A more powerful way to analyze such hypothetical complexes is to look at conservation patterns. With VMD, you can import a sequence alignment into the MultiSeq extension and color residues by conservation. The protocol below assumes:
 
 
 
*You have prealigned the reference Mbp1 proteins with your species' Mbp1 orthologue;
 
*You have saved the alignment in a CLUSTAL format.
 
 
 
You can use Jalview or any other MSA server to do so. You can even do this by hand - there should be few if any indels and the correct alignment is easy to see.
 
  
 
{{task|1=
 
{{task|1=
;Load the Mbp1 APSES alignment into MultiSeq.
 
 
:(A) In the MultiSeq Window, navigate to '''File &rarr; Import Data...'''; Choose "From Files" and Browse to the location of the alignment you have saved. The File navigation window gives you options which files to enable: choose to Enable <code>ALN</code> files (these are CLUSTAL formatted multiple sequence alignments).
 
:(B) Open the alignment file, click on '''Ok''' to import the data, it will take a short while to load. If the data can't be loaded, the file may have the wrong extension: .aln is required.
 
:(C) Find the Mbp1_SACCE sequence in the list, click on it and move it to the top of the Sequences list with your mouse (the list is not static, you can re-order the sequences in any way you like).
 
 
You will see that the 1MB1 sequence and the APSES domain sequence do not match: at the N-terminus the sequence that corresponds to the PDB structure has extra residues, and in the middle the APSES sequences may have gaps inserted.
 
 
;Bring the 1MB1 sequence in register with the APSES alignment.
 
:(A) MultiSeq supports typical text-editor selection mechanisms. Clicking on a residue selects it, clicking on a row selects the whole sequence. Dragging with the mouse selects several residues, shift-clicking selects ranges, and option-clicking toggles the selection on or off for individual residues. Using the mouse and/or the shift key as required, select the '''entire first column''' of the sequences you have imported.
 
:(B) Select '''Edit &rarr; Enable Editing... &rarr; Gaps only''' to allow changing indels.
 
:(C) Pressing the spacebar once should insert a gap character before the '''selected column''' in all sequences. Insert as many gaps as you need to align the beginning of sequences with the corresponding residues of 1MB1: <code>S I M ...</code>
 
:(D) Now insert as many gaps as you need into the structure sequence, to align it completely with the Mbp1_SACCE APSES domain sequence. Simply select residues in the sequence and use the space bar to insert gaps. (Note: I have noticed a bug that sometimes prevents slider or keyboard input to the MultiSeq window; it fails to regain focus after operations in a different window. I don't know whether this is a Mac related problem or a more general bug in MultiSeq. When this happens I quit VMD and restore the session from a saved state. It is a bit annoying but not mission-critical.)
 
:(E) When you are done, it may be prudent to save the state of your alignment. Use '''File &rarr; Save Session...'''
 
 
;Color by similarity
 
:(A) Use the '''View &rarr; Coloring &rarr; Sequence similarity &rarr; BLOSUM30''' option to color the residues in the alignment and structure. This clearly shows you where conserved and variable residues are located and allows you to analyze their structural context.
 
:(B) You can adjust the color scale in the usual way by navigating to '''VMD main &rarr; Graphics &rarr; Colors...''', choosing the Color Scale tab and adjusting the scale midpoint.
 
:(C) Navigate to the '''Representations''' window and apply the color scheme to your tube-and-sidechain representation: double-click on the NewCartoon representation to hide it and use '''User''' coloring of your ''Tube'' and ''Licorice'' representations to apply the sequence similarity color gradient that MultiSeq has calculated.
 
  
&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
+
In this task you will access some of the tutorial information that UCSC provides.
* Once you have colored the residues of your model by conservation, create another informative stereo-image and paste it into your assignment.
+
# Return to the [http://genome.ucsc.edu/ '''UCSC''' Genome Bioinformatics entry page] and follow the link to '''Training''' in the left-hand menu.
 +
# Follow the link to the [http://www.openhelix.com/ucsc '''OpenHelix UCSC tutorials'''].
 +
# Download the Hands-on exercise PDF file and work through '''Exercise 2''' (the rat leptin exercise).
 
}}
 
}}
  
&nbsp;
+
This exercise includes a number of interesting options to work with the UCSC data - the BLAT tool for genomic region alignment and the selective display of SNP annotations.
-->
 
 
 
== Interpretation==
 
<!--
 
Analysis of the ligand binding site:
 
 
 
* http://dnasite.limlab.ibms.sinica.edu.tw/
 
* http://proline.biochem.iisc.ernet.in/pocketannotate/
 
* http://www.biosolveit.de/PoseView/
 
 
 
*Comparison with seq2logo
 
{{#pmid: 19483101}}
 
*protedna server PMID: 19483101
 
* http://serv.csbb.ntu.edu.tw/ProteDNA/
 
* http://protedna.csie.ntu.edu.tw/
 
* Multi Harmony
 
{{#pmid: 20525785}}
 
  
 +
; Optional
 +
* Work through exercises one and three of the OpenHelix UCSC introduction.
 +
* Access the [http://www.openhelix.com/ENCODE2 OpenHelix '''ENCODE''' tutorial], download the '''Hands-on Exercises''' pdf and work through the exercises. Exercise 3 is particularly valuable, as it teaches you how to create results from complex intersections of queries.
 +
* You can also work through the [http://www.nature.com/scitable/ebooks/guide-to-the-ucsc-genome-browser-16569863 Guide to the UCSC Genome Browser at "nature"] which gives an excellent, in-depth overview.
 +
* Study the ''User's guide to ENCODE'' paper linked below.
 
-->
 
-->
  
 +
{{task|1=
  
 
+
Finally:
{{task|1=
+
# Print this page, but print the first page only.
# Spend some time studying the complex.
+
# With a red pen, mark and label the following four items on your print-out:
# You should clearly think about the following question: considering the position of the two DNA helices relative to the YFO structural model, which binding mode appears to be more plausible for protein-DNA interactions in the YFO Mbp1 APSES domains? Is it the canonical, or the non-canonical binding mode? Is there evidence that allows you to distinguish between the two modes?
+
## The first exon of CDC6.
# Before you quit VMD, save the "state" of your session so you can reload it later. We will look at residue conservation once we have built phylogenetic trees. In the main VMD window, choose '''File&rarr;Save State...'''.
+
## The chromosomal coordinates of the current view.
 +
## The binding sites for the transcription factors that bind to the CDC6 promoter.
 +
## The locations of the missense-variant SNPs.
 +
# Write your name and Student number on this page and bring it to class to hand it in on Tuesday.
 
}}
 
}}
  
<!--
 
== R code: conservation scores and sequence weighting==
 
-->
 
  
;That is all.
 
  
  
Line 248: Line 189:
  
 
== Links and resources ==
 
== Links and resources ==
{{#pmid: 10679470}}
+
{{#pmid: 22764121}}
 
+
{{smallvspace}}
 +
{{#pmid: 26527727}}
 +
{{#pmid: 25762420}}
 +
{{#pmid: 21526222}}
 +
{{smallvspace}}
 +
{{#pmid: 25645873}}
  
 
<!-- {{WWW|WWW_GMOD}} -->
 
<!-- {{WWW|WWW_GMOD}} -->
 
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
  
 +
{{vspace}}
  
&nbsp;
 
 
{{#lst:BIO_Assignment_Week_1|assignment_footer}}
 
{{#lst:BIO_Assignment_Week_1|assignment_footer}}
 +
 +
<table style="width:100%;"><tr>
 +
<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_8|&lt;&nbsp;Assignment&nbsp;8]]</td>
 +
<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[BIO_Assignment_Week_10|Assignment&nbsp;10&nbsp;&gt;]]</td>
 +
</tr></table>
  
  

Latest revision as of 04:12, 13 December 2016

Assignment for Week 9
Genomics


Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.



 

Introduction

 

Large scale genome sequencing and annotation has made a wealth of information available that is all related to the same biological objects: the DNA. The information however can be of very different types, it includes:

  • the actual sequence
  • sequence variants (SNPs and CNVs)
  • conservation between related species
  • genes (with introns and exons)
  • mRNAs
  • expression levels
  • regulatory features such as transcription factor bindings sites

and much more.

Since all of this information relates to specific positions or ranges on the chromosome, displaying it alongside the chromosomal coordinates is a useful way to integrate and visualize it. We call such strips of annotation tracks and display them in genome browsers. Quite a number of such browsers exist and most work on the same principle: server-hosted databases are queried through a Web interface; the resulting data is displayed graphically in a Web browser window. The large data centres each have their own browsers, but arguably the best engineered, most informative and most widely used one is provided by the University of California Santa Cruz (UCSC) Genome Browser Project.
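To make the idea of coordinate-anchored annotation concrete, here is a minimal sketch in R. It is an illustration only: the feature names and coordinates in the table are invented (they are not real annotations), and the helper featuresInRange() is a hypothetical name. The point is that very different kinds of information can share one table once each item is tied to a chromosome and a start/end position, and that a browser "view" is then just an overlap query.

```r
# Toy annotation table: every feature, whatever its type, is anchored to a
# chromosome and a start/end coordinate. All values are made up.
features <- data.frame(
  track = c("genes", "genes", "SNPs", "TFBS"),
  name  = c("exon_1", "exon_2", "snp_a", "tfbs_a"),
  chrom = c("chr1", "chr1", "chr1", "chr1"),
  start = c(1000, 2500, 1040, 900),
  end   = c(1200, 2700, 1040, 920),
  stringsAsFactors = FALSE
)

# Return all features that overlap the window [from, to] on one chromosome.
featuresInRange <- function(tab, chrom, from, to) {
  tab[tab$chrom == chrom & tab$start <= to & tab$end >= from, ]
}

# A "view" of chr1:950-1100 lists what each track would show in that window.
view <- featuresInRange(features, "chr1", 950, 1100)
print(view[order(view$track, view$start), ])
```

Real genome browsers use indexed databases rather than a linear scan, but the query they answer is the same.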

Compiling the data requires a massive annotation effort, which has not been completed for all genome-sequenced species. In particular, not all of our YFOs have been included in the major model-organism annotation efforts. The general strategy for analysis of a gene in YFO is thus to map it to homologous genes in model organisms. In this assignment you will explore the UCSC genome browser and we will go through an exercise that relates fungal replication genes to human genes. We have previously focused a lot on Mbp1 homologs, but these have no clear equivalents in "higher" eukaryotes. However, one of the key target genes of Mbp1 is the cell cycle protein Cdc6, which is well conserved in fungi and other eukaryotes and has a human homolog. Since the annotation level is generally highest for human genes, we will have a closer look at that gene.


 

The UCSC genome browser

 

The University of California Santa Cruz (UCSC) Genome Browser Project has the largest offering of annotation information. However, it is strictly model-organism oriented and you will probably not find YFO among its curated genomes. Nevertheless, if you are studying e.g. human genes, or yeast, the UCSC browser will probably be your first choice.

Task:
In this task you will access the UCSC genome browser view of the human Cdc6 gene. You will explore some of the very large number of tracks that are available and study the transcription factor binding region.

  1. Navigate to the UCSC Genome Bioinformatics entry page and follow the link to the Genome Browser in the "Our tools" section.
  2. Click on the link to humans. Note that this is the hg38 assembly.
  3. Enter CDC6 into the "Position/Search Term" field and click "Go". You should get a list of entries; click on the top link, the gene on chromosome 17: CDC6 (uc002huj.2) at chr17:40287633-40304657.
  4. Zoom out 1.5x to view the upstream regulatory region: the end of the adjacent WIPF2 gene should have just come into view on the left.
  5. Study the Genome Browser view of the human CDC6 homolog.
    1. In particular, note the extensive functional annotations of DNA and the alignments of vertebrate syntenic regions that allow detailed genomic comparisons.
    2. Distinguish between exon and intron sequence.
    3. Note that the mammal Conservation track has high values for all of the exons, but not only for exons.
    4. Find more information on the "Layered H3K27Ac" track.
  6. Note the large number of available tracks that have been integrated into this view. Most of them are switched off. Find the Regulation section, and follow the link to the "ORegAnno" information to see what that is about. Note that you can switch individual annotations on or off on this page, as well as set the display format for all of the results. Select the check-box only for "transcription factor binding site" to be on, set the "Display mode" to full and click submit.
  7. Study this information and note:
    1. There is a cluster of TFBS just upstream of the transcription initiation site.
    2. This cluster coincides with the highest H3K27Ac density.
    3. If you <control>-click (right-click?) on the top orange bar of this cluster, a contextual menu opens from which you can access the details page for OREG1791811 in a new window. Follow the link to the RBL2 transcription factor via ENST00000379935 ... from where you can access transcript and gene and expression and protein family and GO and all other information.
  8. Go back to the Genome Browser and set the ORegAnno track to "pack" and click "refresh".
  9. Slide the SNP track to just beneath the RefSeq genes track that contains the introns and exons. You will notice that one of the SNPs is green, and two are red. Why? Set the "Common SNPs" track display mode to "pack" and click "refresh".
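
The position string and the zoom step used in this task can be reproduced in a few lines of R. This is a hedged sketch, not part of the assignment: the function names parsePosition(), zoomOut() and browserURL() are hypothetical, and the details of the zoom (keeping the centre of the window fixed and rounding outward) are assumptions about how the browser behaves, so the resulting coordinates may differ slightly from what the browser displays. The URL pattern with "db" and "position" parameters is the commonly used form of a Genome Browser link.

```r
# Parse a UCSC-style position string such as "chr17:40287633-40304657".
parsePosition <- function(pos) {
  m <- regmatches(pos, regexec("^(chr[^:]+):([0-9,]+)-([0-9,]+)$", pos))[[1]]
  if (length(m) != 4) stop("not a valid position string: ", pos)
  list(chrom = m[2],
       start = as.numeric(gsub(",", "", m[3])),
       end   = as.numeric(gsub(",", "", m[4])))
}

# Widen the window by a factor while keeping its centre fixed.
# Assumption: rounding outward; the browser may round differently.
zoomOut <- function(p, factor = 1.5) {
  centre  <- (p$start + p$end) / 2
  half    <- (p$end - p$start + 1) * factor / 2
  p$start <- max(1, floor(centre - half))
  p$end   <- ceiling(centre + half)
  p
}

# Build a Genome Browser link for a position (hg38 assumed by default).
browserURL <- function(p, db = "hg38") {
  sprintf("https://genome.ucsc.edu/cgi-bin/hgTracks?db=%s&position=%s:%.0f-%.0f",
          db, p$chrom, p$start, p$end)
}

p <- parsePosition("chr17:40287633-40304657")   # the CDC6 view from step 3
z <- zoomOut(p, 1.5)                            # roughly the view after step 4
print(browserURL(z))
```

The extra sequence gained on the left of the window is where the upstream regulatory region of CDC6 comes into view.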


Based on this kind of information, it should be straightforward to identify human transcription factors that potentially regulate human Cdc6 and determine - via sequence comparisons - whether any of them are homologous to any of the yeast transcription factors or factors in YFO. Through a detailed analysis of existing systems, their regulatory components and the conservation of regulation, one can in principle establish functional equivalences across large evolutionary distances.


Task:
Finally:

  1. Print this page, but print the first page only.
  2. With a red pen, mark and label the following four items on your print-out:
    1. The first exon of CDC6.
    2. The chromosomal coordinates of the current view.
    3. The binding sites for the transcription factors that bind to the CDC6 promoter.
    4. The locations of the missense-variant SNPs.
  3. Write your name and Student number on this page and bring it to class to hand it in on Tuesday.



 

Links and resources

Wang et al. (2013) A brief introduction to web-based genome browsers. Brief Bioinformatics 14:131-43. (pmid: 22764121)

Genome browser provides a graphical interface for users to browse, search, retrieve and analyze genomic sequence and annotation data. Web-based genome browsers can be classified into general genome browsers with multiple species and species-specific genome browsers. In this review, we attempt to give an overview for the main functions and features of web-based genome browsers, covering data visualization, retrieval, analysis and customization. To give a brief introduction to the multiple-species genome browser, we describe the user interface and main functions of the Ensembl and UCSC genome browsers using the human alpha-globin gene cluster as an example. We further use the MSU and the Rice-Map genome browsers to show some special features of species-specific genome browser, taking a rice transcription factor gene OsSPL14 as an example.

 
Sloan et al. (2016) ENCODE data at the ENCODE portal. Nucleic Acids Res 44:D726-32. (pmid: 26527727)

The Encyclopedia of DNA Elements (ENCODE) Project is in its third phase of creating a comprehensive catalog of functional elements in the human genome. This phase of the project includes an expansion of assays that measure diverse RNA populations, identify proteins that interact with RNA and DNA, probe regions of DNA hypersensitivity, and measure levels of DNA methylation in a wide range of cell and tissue types to identify putative regulatory elements. To date, results for almost 5000 experiments have been released for use by the scientific community. These data are available for searching, visualization and download at the new ENCODE Portal (www.encodeproject.org). The revamped ENCODE Portal provides new ways to browse and search the ENCODE data based on the metadata that describe the assays as well as summaries of the assays that focus on data provenance. In addition, it is a flexible platform that allows integration of genomic data from multiple projects. The portal experience was designed to improve access to ENCODE data by relying on metadata that allow reusability and reproducibility of the experiments.

Pazin (2015) Using the ENCODE Resource for Functional Annotation of Genetic Variants. Cold Spring Harb Protoc 2015:522-36. (pmid: 25762420)

This article illustrates the use of the Encyclopedia of DNA Elements (ENCODE) resource to generate or refine hypotheses from genomic data on disease and other phenotypic traits. First, the goals and history of ENCODE and related epigenomics projects are reviewed. Second, the rationale for ENCODE and the major data types used by ENCODE are briefly described, as are some standard heuristics for their interpretation. Third, the use of the ENCODE resource is examined. Standard use cases for ENCODE, accessing the ENCODE resource, and accessing data from related projects are discussed. Although the focus of this article is the use of ENCODE data, some of the same approaches can be used with data from other projects.

ENCODE Project Consortium (2011) A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol 9:e1001046. (pmid: 21526222)

The mission of the Encyclopedia of DNA Elements (ENCODE) Project is to enable the scientific and medical communities to interpret the human genome sequence and apply it to understand human biology and improve health. The ENCODE Consortium is integrating multiple technologies and approaches in a collective effort to discover and define the functional elements encoded in the human genome, including genes, transcripts, and transcriptional regulatory regions, together with their attendant chromatin states and DNA methylation patterns. In the process, standards to ensure high-quality data have been implemented, and novel algorithms have been developed to facilitate analysis. Data and derived results are made available through a freely accessible database. Here we provide an overview of the project and the resources it is generating and illustrate the application of ENCODE data to interpret the human genome.

 
Zarrei et al. (2015) A copy number variation map of the human genome. Nat Rev Genet 16:172-83. (pmid: 25645873)

A major contribution to the genome variability among individuals comes from deletions and duplications - collectively termed copy number variations (CNVs) - which alter the diploid status of DNA. These alterations may have no phenotypic effect, account for adaptive traits or can underlie disease. We have compiled published high-quality data on healthy individuals of various ethnicities to construct an updated CNV map of the human genome. Depending on the level of stringency of the map, we estimated that 4.8-9.5% of the genome contributes to CNV and found approximately 100 genes that can be completely deleted without producing apparent phenotypic consequences. This map will aid the interpretation of new CNV findings for both clinical and research applications.


 


Footnotes and references


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.


