Difference between revisions of "BIO Assignment Week 10"

From "A B C"
 
(15 intermediate revisions by the same user not shown)
Line 2: Line 2:
 
<div class="b1">
 
<div class="b1">
 
Assignment for Week 10<br />
 
Assignment for Week 10<br />
<span style="font-size: 70%">Protein Ligand Complex</span>
+
<span style="font-size: 70%">Expression Analysis</span>
 
</div>
 
</div>
 +
<table style="width:100%;"><tr>
 +
<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_9|&lt;&nbsp;Assignment&nbsp;9]]</td>
 +
<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[BIO_Assignment_Week_11|Assignment&nbsp;11&nbsp;&gt;]]</td>
 +
</tr></table>
  
 
{{Template:Inactive}}
 
{{Template:Inactive}}
Line 14: Line 18:
  
 
&nbsp;
 
&nbsp;
 +
 
==Introduction==
 
==Introduction==
  
One of the really interesting questions we can discuss with reference to our homology model is how sequence variation might result in changed DNA recognition sites, and then lead to changed cognate DNA binding sequences. In order to address this, we would need to generate a plausible structural model for how DNA is bound to APSES domains.
+
The transcriptome is the set of a cell's mRNA molecules. It originates (mostly) from the genome, and it results (again, mostly) in the proteome. RNA that is {{WP|Transcription (genetics)|transcribed}} from the genome is not yet fit for translation but must be processed: {{WP|RNA splicing|splicing}} is ubiquitous<ref>Strictly speaking, splicing is a {{WP|Eukaryote|eukaryotic}} achievement; however, there are examples of splicing in {{WP|Prokaryote|prokaryotes}} as well.</ref> and in addition {{WP|RNA editing}} has been encountered in many species. Some authors therefore refer to the ''exome'' (the set of transcribed {{WP|exons}}) to indicate the actual coding sequence.
  
Since there is currently no software available that would reliably model such a complex from first principles<ref>''Rosetta'' may get the structure approximately right, ''Autodock'' may get the complex approximately right, but the coordinate changes involved in induced fit make the result unreliable, and we have no good way to validate whether the predicted complex is correct.</ref>, we will base a model of a bound complex on homology modelling as well. This means we need to find a similar structure for which the position of bound DNA is known, then superimpose that structure with our model. This places the DNA molecule into the spatial context of the model we are studying. However, you may remember from the third assignment that the APSES domains in fungi seem to be a relatively small family, and there is no structure available of an APSES domain-DNA complex. How can we find a coordinate set of a structurally similar protein-DNA complex?
+
'''Microarray technology''' (the quantitative, sequence-specific hybridization of labelled nucleotides in chip-format) was the first domain of "high-throughput biology". Today, it has largely been replaced by {{WP|RNA-Seq|'''RNA-seq'''}}: quantification of transcribed mRNA by high-throughput sequencing and mapping reads to genes. Quantifying gene expression levels in a tissue-, development-, or response-specific way has yielded detailed insight into cellular function at the molecular level, with recent results of single-cell sequencing experiments adding a new level of precision. But not all transcripts are mapped to genes: we increasingly realize that the transcriptome is not merely a passive buffer of expressed information on its way to be translated into proteins, but contains multiple levels of complex regulation through hybridization of small nuclear RNAs<ref>{{#pmid: 25565024}} {{#pmid: 21798102}}</ref>.
  
This assignment is based on the homology model you built. You will (1) identify similar structures of distantly related domains for which protein-DNA complexes are known, (2) assemble a hypothetical complex structure and (3) consider whether the available evidence allows you to distinguish between different modes of ligand binding.
+
In this assignment, we will look at differential expression of Mbp1 and its target genes.
  
==Modeling a DNA ligand==
 
  
 
&nbsp;
 
&nbsp;
  
&nbsp;
+
==GEO2R==
 
 
 
 
===Finding a similar protein-DNA complex===
 
 
 
 
 
&nbsp;<br>
 
 
 
Remember that homologous sequences can have diverged to the point where their sequence similarity is no longer recognizable, even though their structure may be quite well conserved. Thus if we could find similar structures in the PDB, these might provide us with some plausible hypotheses for how DNA is bound by APSES domains. We thus need a tool similar to BLAST, not for sequence alignment but for structure alignment: a kind of BLAST for structures. Just like with sequence searches, we might not want to search with the entire protein if what we are interested in is a subdomain that binds to DNA. Attempting to match all structural elements in addition to the ones we are actually interested in is likely to make the search less specific: we would find false positives that are similar to some irrelevant part of our structure. However, defining too small a subdomain would also lead to a loss of specificity: in the extreme, it is easy to imagine that the search for e.g. a single helix would retrieve very many hits that would be quite meaningless.
 
 
 
 
 
 
 
  
 +
<section begin=exercises />
  
At the '''NCBI''', [http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml VAST] is provided as a search tool for structural similarity search.  
+
In this exercise we will use the analysis facilities of the GEO database at the NCBI.
  
 
{{task|1=
 
{{task|1=
# Navigate to the [http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml '''VAST'''] search interface page.
 
# Enter <code>1bm8</code> as the PDB ID to search for and click '''Go'''.
 
# Follow the link to '''Related Structures'''.
 
# Study the result.
 
}}
 
  
 +
;First, we will search for relevant data sets on GEO, the NCBI's database for expression data.
 +
#Navigate to the entry page for [http://www.ncbi.nlm.nih.gov/gds/ '''GEO data sets'''].
 +
#Enter the following query in the usual Entrez query format: <code>"cell cycle"[ti] AND "saccharomyces cerevisiae"[organism]</code>.
 +
#You should get two datasets among the top hits that analyze wild-type yeast (W303a cells) across two cell-cycles after release from alpha-factor arrest. Choose the [http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2347 experiment with lower resolution] (13 samples).
 +
#On the linked GEO DataSet Browser page, follow the link to the [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE3635 Accession Viewer page: the "Reference series"].
 +
#Read about the experiment and samples, then follow the link to [http://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE3635 '''analyze with GEO2R''']
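The same Entrez query can also be prepared programmatically. The following is a minimal sketch in '''R''', assuming the NCBI E-utilities <code>esearch</code> endpoint and assuming <code>gds</code> as the Entrez database name for GEO DataSets. The variable names are illustrative, and the code only builds the request URL; it does not contact the server.

<source lang="rsplus">
# Sketch: build an E-utilities esearch URL for the GEO DataSets query above.
# Assumption: "gds" is the Entrez database name for GEO DataSets.
eUtilsBase <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
query <- '"cell cycle"[ti] AND "saccharomyces cerevisiae"[organism]'

# URLencode() is in the utils package. With reserved = TRUE it escapes
# reserved characters such as brackets, in addition to spaces and quotes,
# so that the whole query travels as a single URL parameter.
searchURL <- paste0(eUtilsBase,
                    "?db=gds",
                    "&term=", utils::URLencode(query, reserved = TRUE))
print(searchURL)
</source>

Pasting the printed URL into a browser should return the matching record IDs as XML, which is one way to check that a query is well formed before using it in a script.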
  
You will see that VAST finds more than 3,000 partially similar structures, but it would be almost impossibly tedious to manually search through the list for ''structures of protein DNA complexes'' that are ''similar to the interacting core of the APSES domain''. It turns out that our search is not specific enough in two ways: we have structural elements in our PDB file that are unnecessary for the question at hand, and thus cause the program to find irrelevant matches. But if we constrain ourselves to just a single helix and strand (i.e. the 50-74 subdomain that has been implicated in DNA binding), the search will become too non-specific. Also, we have no good way to retrieve functional information from these hits: which ones are DNA-binding proteins that bind DNA through residues of this subdomain, and for which the structure of a complex has been solved? It seems we need to define our question more precisely.
+
* View the [http://www.youtube.com/watch?v=EUPmGWS8ik0 '''GEO2R''' video tutorial] on youtube.
  
{{task|1=
+
;Now proceed to apply this to the yeast cell-cycle study:[[File:GSE3635_ValueDistribution.png|frame|right|Value distribution for the yeast cell-cycle experiment GSE3635. Experiments are grouped approximately into equivalent time-points on a cell cycle.]]
# Open VMD and load the 1BM8 structure or your YFO homology model.
 
# Display the backbone as a '''Trace''' (of CA atoms) and color by '''Index'''
 
# In the sequence viewer, highlight residues 50 to 74.
 
# In the representations window, find the yellow representation (with Color ID 4) that the sequence viewer has generated. Change the '''Drawing Method''' to '''NewCartoon'''.
 
# Now (using stereo), study the topology of the region. Focus on the helix at the N-terminus of the highlighted subdomain; it is preceded by a turn and another helix. This first helix makes interactions with the beta hairpin at the C-terminal end of the subdomain and is thus important for the orientation of these elements. (This is what is referred to as a helix-turn-helix motif, or HtH motif; it is very common in DNA-binding proteins.)
 
# Holding the shift key in the alignment viewer, extend your selection until you cover all of the first helix, and the residues that contact the beta hairpin. I think that the first residue of interest here is residue 33.
 
# Again holding the shift key, extend the selection at the C-terminus to include the residues of the beta hairpin to where they contact the helix at the N-terminus. I think that the last residue of interest here is residue 79.
 
# Study the topology and arrangement of this compact subdomain. It contains the DNA-binding elements and probably most of the interactions that establish its three-dimensional shape. This subdomain even has a name: it is a ''winged helix'' DNA binding motif, a member of a very large family of DNA-binding domains. (I have linked a review by Gajiwala and Burley to the end of this page; note that their definition of a canonical winged helix motif is a bit larger than what we have here, with an additional helix at the N-terminus and a second "wing".)
 
}}
 
  
 +
# '''Define groups''': the associated publication shows us that one cell-cycle takes almost exactly 60 minutes. Create timepoints T0, T1, T2, ... T5. Then associate the 0 and 60 min. samples with "T0"; 10 and 70 minutes get grouped as "T1"; 20 and 80 minutes are T2, etc. up to T5. The final sample does not get assigned.
 +
# Confirm that the '''Value distributions''' are unbiased by accessing the value distribution tab - overall, in such experiments, the bulk of the expression values should not change and thus means and quantiles of the expression levels should be about the same.
 +
# Your distribution should look like the image on the right: properly grouped into six categories, and unbiased regarding absolute expression levels and trends.
 +
# '''Look for differentially expressed genes''': open the GEO2R tab and click on '''Top 250'''.
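The grouping rule from the '''Define groups''' step can be written down compactly. This is only a sketch in '''R''', assuming that the 13 samples were taken every 10 minutes from 0 to 120 minutes; the variable names are illustrative and are not part of GEO2R.

<source lang="rsplus">
# Sketch: assign the samples of a 13-point time course to phase groups.
# Assumption: samples were taken every 10 minutes from 0 to 120 min.
sampleTimes <- seq(0, 120, by = 10)   # minutes after release
cyclePeriod <- 60                      # one cell cycle, in minutes

# Time points that are one period apart fall into the same group:
# 0 and 60 -> T0, 10 and 70 -> T1, ... 50 and 110 -> T5.
groups <- paste0("T", (sampleTimes %% cyclePeriod) / 10)

# The final sample (120 min) is left unassigned.
groups[sampleTimes == max(sampleTimes)] <- NA

print(data.frame(time = sampleTimes, group = groups))
</source>

Each of the six groups then contains exactly two samples, one from each cycle, which is what makes the within-group comparison meaningful.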
  
Armed with this insight, we can attempt again to find meaningfully similar structures. At the '''EBI''' there are a number of very well designed structure analysis tools linked off the [http://www.ebi.ac.uk/Tools/structural.html '''Structural Analysis''' page]. As part of its MSD Services, [http://www.ebi.ac.uk/msd-srv/ssm/ '''PDBeFold'''] provides a convenient interface for structure searches for our purpose.
+
;Analyze the results.
  
{{task|1=
+
# Examine the top hits. Click on a few of the gene names in the ''Gene.symbol'' column to view the expression profiles that tell you ''why'' the genes were found to be differentially expressed. What do you think? Is this what you would have expected for genes' responses to the cell-cycle? What seems to be the algorithm's notion of what "differentially expressed" means?
# Navigate to the [http://www.ebi.ac.uk/msd-srv/ssm/ '''PDBeFold'''] search interface page.
+
# Look for expected genes. Here are a few genes that are known to be differentially expressed in the cell-cycle as target genes of the MBF complex: <code>DSE1</code>, <code>DSE2</code>, <code>ERF3</code>, <code>HTA2</code>, <code>HTB2</code>, and <code>GAS3</code>. But what about the MBF complex proteins themselves: Mbp1 and Swi6?
# Enter <code>1bm8</code> for the '''PDB code''' and choose '''Select range''' from the drop down menu. Select the residues you have defined above<!-- Select Domain would be better but is currently broken :-( Secondary Structure elements 4 to 7 i.e. those elements that span the range you have previously defined.-->.
 
# Note that you can enter the lowest acceptable match % separately for query and target. This means: what percentage of secondary structure elements would need to be matched in either query or target to produce a hit. Keep that value at 80 for our query, since we would want to find structures with almost all of the elements of the winged helix motif. Set the  match to 10 % for the target, since we are interested in such domains even if they happen to be small subdomains of large proteins.
 
# Keep the '''Precision''' at '''normal'''. Precision and % query match could be relaxed if we wanted to find more structures.
 
#  Finally click on: '''Submit your query'''.
 
# On the results page, click on the index number (in the left-hand column) of the top hit '''that is not one of our familiar Mbp1 structures''' to get a detailed view of the result. Most likely this is <code>1wq2:a</code>, an enzyme. Click on '''View Superposed'''. This will open a window with the structure coordinates superimposed in the Jmol molecular viewer. Control-click anywhere in the window area to open a menu of viewing options. Select '''Style &rarr; Stereographic &rarr; Wall-eyed viewing'''. Select '''Trace''' as the rendering. Then study the superposition. You will note that the secondary structure elements match quite well, but does this mean we have a DNA-binding domain in this sulfite reductase?  
 
}}
 
  
 +
The notions of "differential expression" and "cell-cycle dependent expression" do not overlap completely. Significant differential expression is mathematically determined for genes that have low variance within groups and large differences between groups. This algorithm has no notion of any expectation you might have about the shape of the expression profile. All it finds are genes for which differential expression between some groups is statistically supported. The algorithm returns the top 250 of those. Consistency within groups is very important to the algorithm, whereas we might intuitively give more weight to conformance with our expectation of a cyclical pattern.
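To make the role of within-group variance concrete, here is a small sketch in '''R''' with two invented genes that differ by the same amount between two groups but have different scatter within the groups. It substitutes a plain Welch t-test from base R for whatever statistics GEO2R applies internally; all values and names are made up for illustration.

<source lang="rsplus">
# Sketch: two made-up genes with the same mean difference between groups
# but different within-group variance. All values are invented.
groups <- rep(c("A", "B"), each = 4)
geneConsistent <- c(1.0, 1.1, 0.9, 1.0,   2.0, 2.1, 1.9, 2.0)
geneNoisy      <- c(0.0, 2.0, 0.5, 1.5,   1.0, 3.0, 1.5, 2.5)

# Absolute Welch t-statistic for the difference between group A and group B.
tStat <- function(x, g) {
  tt <- t.test(x[g == "A"], x[g == "B"])
  abs(unname(tt$statistic))
}

tConsistent <- tStat(geneConsistent, groups)
tNoisy      <- tStat(geneNoisy, groups)
print(c(consistent = tConsistent, noisy = tNoisy))
</source>

Both genes change by one unit on average, but only the consistent one yields a large test statistic. A gene with a clear cyclical profile can therefore rank poorly if its two replicate time points disagree.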
  
All in all this appears to be well engineered software! It gives you many options to access result details for further processing. I think this can be put to very good use. But for our problem, we would have to search through too many structures because, once again, we can't tell which ones of the hits are DNA binding domains, especially domains for which the structure of a complex has been solved.
+
Let's see if we can group our time points differently to enhance the contrast between expression levels for cyclically expressed genes. Let's define only two groups: one set before and between the two cycles, one set at the peaks - and we'll omit some of the intermediate values.
  
 
+
# Remove all of your groups and define two groups only. Call them "A" and "B".
[[Image:A5_Mbp1_subdomain.jpg|frame|none|Stereo-view of a subdomain within the 1MB1 structure that includes residues 36 to 76. The color gradient ramps from blue (36) to green (76) and the "wing" is clearly seen as the green pair of beta-strands, extending to the right of the helix-turn-helix motif.]]
+
# Assign the samples for T = 0, 10, 60 and 70 min. to the "A" group. Assign the samples for 30, 40, 90, and 100 min. to the "B" group.
 
+
# Recalculate the '''Top 250''' differentially expressed genes (you might have to refresh the page to get the "Top 250" button back.) Which of the "known" MBF targets are now contained in the set? What about Mbp1 and Swi6?
&nbsp;<br>
+
# Finally: Let's compare the expression profiles for Mbp1, Swi6 and Swi4. It is not obvious that transcription factors are themselves under '''transcriptional''' control, as opposed to being expressed at a basal level and ''activated'' by phosphorylation or ligand binding. In a new page, navigate to the [http://www.ncbi.nlm.nih.gov/geoprofiles '''GEO profiles'''] page and enter <code>(Mbp1 OR Swi6 OR Swi4 OR Nrm1 OR Cln1 OR Clb6 OR Act1 OR Alg9) AND GSE3635</code>. (Nrm1, Cln1, and Clb6 are Mbp1 target genes. Act1 and Alg9 are beta-Actin and mannosyltransferase; these are often used as "housekeeping genes", i.e. genes with condition-independent expression levels, especially for qPCR studies, although Alg9 is also an Mbp1 target. We include them here as negative controls. GSE3635 is the ID of the GEO data set we have just studied.) You could have got similar results in the '''Profile graph''' tab of the GEO2R page. What do you find? What does this tell you? Would this information allow you to define groups that are even better suited for finding cyclically expressed genes?
 
+
# Click on the profile graph for Mbp1 and print out the page. Write your name and student number on the page. With a red pen, '''in one sentence''' describe the evidence you find '''on that page''' that allows us to conclude '''whether or not''' Mbp1 is a cell-cycle gene. You'll probably want to think for a moment what this question really means, how a cell-cycle gene could be defined, and what can be considered "evidence", before you write. I will mark your response for a maximum of four marks.
APSES domains represent one branch of the tree of helix-turn-helix (HTH) DNA binding modules. (A review on HTH proteins is linked from the resources section at the bottom of this page.) Winged Helix domains typically bind their cognate DNA with a "recognition helix" which precedes the beta hairpin and binds into the major groove; additional stabilizing interactions are provided by the edge of a beta-strand binding into the minor groove. This is good news: once we have determined that the APSES domain is actually an example of a larger group of transcription factors, we can compare our model to a structure of a protein-DNA complex. Superfamilies of such structural domains are compiled in the CATH database. Unfortunately CATH itself does not provide information about whether the structures have been determined as complexes. '''But''' we can search the PDB with CATH codes and restrict the results to complexes. Essentially, this should give us a list of all winged helix domains for which structures of complexes with DNA have been determined. This works as follows:
+
<!--
 
+
* Finally, review the '''R''' script for the GEO2R analysis in the '''R script''' tab. This code will run on your machine and make the expression analysis available. Once the datasets are loaded and prepared, you could - for example - perform a "real" time series analysis, calculate correlation coefficients with an idealized sine wave, or search for genes that are '''co-regulated''' with your genes of interest.
{{task|1=
+
-->
* For reference, access [http://www.cathdb.info/superfamily/1.10.10.10 CATH domain superfamily 1.10.10.10]; this is the CATH classification code we will use to find protein-DNA complexes. Click on '''Superfamily Superposition''' to get a sense of the structural core of the winged helix domain.
 
 
 
# Navigate to the [http://www.pdb.org/ PDB home page] and follow the link to [http://www.pdb.org/pdb/search/advSearch.do Advanced Search]
 
# In the options menu for '''Choose a Query Type''' select '''Structure Features &rarr; CATH Classification Browser'''. A window will open that allows you to navigate down through the CATH tree. You can view the Class/Architecture/Topology names on the CATH page linked above. Click on '''the triangle icons''' (not the text) for '''Mainly Alpha &rarr; Orthogonal Bundle &rarr; ARC repressor mutant, subunit A''' then click on the link to '''winged helix repressor DNA binding domain'''. Or, just enter "winged helix" into the search field. This subquery should match more than 550 coordinate entries.
 
# Click on the '''(+)''' button behind '''Add search criteria''' to add an additional query. Select the option '''Structure Features &rarr; Macromolecule type'''. In the option menus that pop up, select '''Contains Protein&rarr;Yes, Contains DNA&rarr;Yes, Contains RNA&rarr;Ignore, Contains DNA/RNA hybrid&rarr;Ignore'''. This selects files that contain Protein-DNA complexes.
 
# Check the box below this subquery to '''Remove Similar Sequences at 90% identity''' and click on '''Submit Query'''. This query should retrieve more than 100 complexes.
 
# Scroll down to the beginning of the list of PDB codes and locate the '''Reports''' menu. Under the heading '''View''' select '''Gallery'''. This is a fast way to obtain an overview of the structures that have been returned. Adjust the number of '''Results''' to see all 100 images and choose '''Options&rarr;Resize medium'''.
 
# Finally we have a set of winged-helix domain/DNA complexes, for comparison. Scroll through the gallery and study how the protein binds DNA.
 
 
}}
 
}}
 
+
<section end=exercises />
 
 
First of all you may notice that in fact not all of the structures are really different, despite having requested only to retrieve dissimilar sequences, and not all images show DNA. This appears to be a deficiency of the algorithm. But you can also easily recognize how in most of the structures the '''recognition helix inserts into the major groove of B-DNA''' (e.g. 1BC8, 1CF7) and the wing, if clearly visible at all in the image, appears to make accessory interactions with the DNA backbone. There is one exception: the structure 1DP7 shows how the human RFX1 protein binds DNA in a non-canonical way, through the beta-strands of the "wing". This is interesting since it suggests there is more than one way for winged helix domains to bind to DNA. We can therefore use structural superposition of '''your homology model''' and '''two of the winged-helix proteins''' to decide whether the canonical or the non-canonical mode of DNA binding seems to be more plausible for Mbp1 orthologues.
 
 
 
  
  
 
&nbsp;
 
&nbsp;
  
===Preparation and superposition of a canonical complex===
 
 
&nbsp;<br>
 
 
The structure we shall use as a reference for the '''canonical binding mode''' is the Elk-1 transcription factor.
 
 
[[Image:A5_canonical_wHTH.jpg|frame|none|Stereo-view of the canonical DNA binding mode of the Winged Helix domain family. Shown here is the Elk-1 transcription factor - an ETS DNA binding domain - in complex with a high-affinity binding site (1DUX). Note how the "recognition helix" inserts into the major groove of the DNA molecule. The color gradient ramps from blue (34) to green (84). Note how the first helix of the "helix-turn-helix" architecture serves only to position the recognition helix and makes few interactions by itself.]]
 
 
The 1DUX coordinate-file contains two protein domains and two B-DNA dimers in one asymmetric unit. For simplicity, you should delete the second copy of the complex from the PDB file. (Remember that PDB files are simply text files that can be edited.)
 
 
{{task|1=
 
# Find the 1DUX structure in the image gallery and open the 1DUX structure explorer page in a separate window. Download the coordinates to your computer.
 
# Open the coordinate file in a text-editor (TextEdit or Notepad - '''NOT''' MS-Word!) and delete the coordinates for chains <code>D</code>,<code>E</code> and <code>F</code>; you may also delete all <code>HETATM</code> records and the <code>MASTER</code> record. Save the file with a different name, e.g. 1DUX_monomer.pdb .
 
# Open VMD and load your homology model. Turn off the axes, display the model as a Tube representation in stereo, and color it by Index. Then load your edited 1DUX file, display this coordinate set in a tube representation as well, and color it by ColorID in some color you like. It is important that you can distinguish easily which structure is which.
 
# You could use the Extensions&rarr;Analysis&rarr;RMSD calculator interface to superimpose the two structures '''if''' you knew which residues correspond to each other. Sometimes it is useful to do exactly that: define exact correspondences between residue pairs and superimpose according to these selected pairs. For our purpose it is much simpler to use the Multiseq tool (and the structures are simple and small enough that the STAMP algorithm for structural alignment can define corresponding residue pairs automatically). Open the '''multiseq''' extension window, select the check-boxes next to both protein structures, and open the '''Tools&rarr;Stamp Structural Alignment''' interface.
 
# In the '''Stamp Alignment Options''' window, check the radio-button for ''Align the following ...'' '''Marked Structures''' and click on '''OK'''.
 
# In the '''Graphical Representations''' window, double-click on all "NewCartoon" representations for both molecules, to undisplay them.
 
# You should now see a superimposed tube model of your homology model and the 1DUX protein-DNA complex. You can explore it, display side-chains etc. and study some of the details of how a transcription factor recognizes and binds to its cognate DNA sequence. However, remember that your '''model''''s side-chain orientations have not been determined experimentally but inferred from the '''template''', and that the template's structure was determined in the absence of bound DNA ligand.
 
 
# Orient and scale your superimposed structures so that their structural similarity is apparent, and the recognition helix can be clearly seen inserting into the DNA major groove. You may want to keep a copy of the image for future reference. Consider which parts of the structure appear to superimpose best.  Note whether it is plausible that your '''model''' could bind a B-DNA double-helix in this orientation.
 
}}
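The manual editing of the 1DUX file described above can also be scripted. The following is a minimal sketch in base '''R''', relying on the fixed-column layout of PDB coordinate records (record name in columns 1 to 6, chain ID in column 22). The function names <code>filterPDB</code> and <code>mkAtom</code> are hypothetical helpers introduced here, and the example lines are made up rather than taken from the real file.

<source lang="rsplus">
# Sketch: drop chains D, E, F plus all HETATM and MASTER records from
# PDB-formatted lines, using only base R.
filterPDB <- function(lines, dropChains = c("D", "E", "F")) {
  recName <- trimws(substr(lines, 1, 6))   # record name: columns 1-6
  chainID <- substr(lines, 22, 22)         # chain ID: column 22
  isCoord <- recName %in% c("ATOM", "ANISOU", "TER")
  drop <- (isCoord & chainID %in% dropChains) |
          recName %in% c("HETATM", "MASTER")
  lines[!drop]
}

# Hypothetical helper that writes a truncated record with the chain ID
# in column 22, just to have something to filter.
mkAtom <- function(rec, serial, chain) {
  sprintf("%-6s%5d  CA  GLY %s%4d", rec, serial, chain, serial)
}

toy <- c(mkAtom("ATOM", 1L, "A"), mkAtom("ATOM", 2L, "C"),
         mkAtom("ATOM", 3L, "D"), mkAtom("HETATM", 4L, "C"),
         "MASTER", "END")
kept <- filterPDB(toy)
print(kept)
</source>

For the real file, one would read the coordinates with <code>readLines("1DUX.pdb")</code> and save the result with <code>writeLines(kept, "1DUX_monomer.pdb")</code>; it is still worth opening the output in a text editor to confirm that only one copy of the complex remains.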
 
 
&nbsp;<br>
 
&nbsp;
 
 
 
===Preparation and superposition of a non-canonical complex===
 
 
 
The structure displaying a non-canonical complex between a winged-helix domain and its cognate DNA binding site is the human Regulatory Factor X.
 
 
[[Image:A5_non-canonical_wHTH.jpg|frame|none|Stereo-view of a non-canonical wHTH-DNA complex, discovered with the structure of human Regulatory Factor X (hRFX) binding its cognate X-box DNA sequence (1DP7). Note how the helix that corresponds to the recognition helix in the canonical domain lies across the minor groove whereas the beta-"wing" inserts into the major groove. The color gradient ramps from blue (18) to green (68).]]
 
 
 
Before we can work with this, however, we have to fix an annoying problem. If you download and view the <code>1DP7</code> structure in VMD, you will notice that there is only a single strand of DNA! Where is the second strand of the double helix? It is not in the coordinate file, because it happens to be exactly equivalent to the first strand, rotated around a two-fold axis of symmetry in the crystal lattice. We need to download and work with the so-called '''Biological Assembly''' instead. But there is a problem related to the way the PDB stores replicates in biological assemblies. The PDB generates the additional chains as copies of the original and delineates them with <code>MODEL</code> and <code>ENDMDL</code> records, just like in a multi-structure NMR file. The chain IDs and the atom numbers are the same as the original. The PDB file thus contains the '''same molecule in two different orientations''', not '''two independent molecules'''. This is an important difference regarding how such molecules are displayed by VMD. '''If you try to use the biological unit file of the PDB, VMD does not recognize that there is a second molecule present and displays only one chain.''' And that looks exactly like the one we have seen before. We have to edit the file, extract the second DNA molecule, change its chain ID and then append it to the original 1DP7 structure<ref>My apologies if this is tedious. '''But''' in the real world, we encounter such problems a lot and I would be remiss not to use this opportunity to let you practice how to fix the issue that could otherwise be a roadblock in a project of yours.</ref>...
 
 
{{task|1=
 
# On the structure explorer page for 1DP7, select the option '''Download Files''' &rarr; '''PDB File'''.
 
# Also select the option '''Download Files''' &rarr; '''Biological Assembly'''.
 
# Uncompress the biological assembly file.
 
# Open the file in a text editor.
 
# Delete everything except the '''second DNA molecule'''. This comes after the <code>MODEL  2</code> line and has chain ID '''D'''. Keep the <code>TER</code> and <code>END</code> lines. Save this with a new filename (e.g. <code>1DP7_DNAonly.pdb</code>).
 
# Also delete all <code>HETATM</code> records for <code>HOH</code>, <code>PEG</code> and <code>EDO</code>, as well as the entire second protein chain and the <code>MASTER</code> record. The resulting file should only contain the DNA chain and its copy and one protein chain. Save the file with a new name, eg. <code>1DP7_BDNA.PDB</code>.
 
# Use a procedure similar to the one [[BIO_Assignment_Week_8#R code: renumbering the model|in the last assignment]] to change the chain ID.
 
 
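<source lang="rsplus">
# read.pdb() and write.pdb() are provided by the bio3d package
library(bio3d)

PDBin  <- "1DP7_DNAonly.pdb"
PDBout <- "1DP7_DNAnewChain.pdb"

pdb <- read.pdb(PDBin)
pdb$atom[ , "chain"] <- "E"   # relabel every atom of the copied strand as chain E
write.pdb(pdb = pdb, file = PDBout)
</source>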
 
 
# Use your text-editor to open both the <code>1DP7.pdb</code> structure file and the  <code>1DP7_DNAnewChain.pdb</code>. Copy the DNA coordinates, paste them into the original file before the <code>END</code> line and save.
 
# Open the edited coordinate file with VMD. You should see '''one protein chain''' and a '''B-DNA double helix'''. (Actually, the BDNA helix has a gap, because the R-library did not read the BRDU nucleotide as DNA). Switch to stereo viewing and spend some time to see how '''amazingly beautiful''' the complementarity between the protein and the DNA helix is (you might want to display ''protein'' and ''nucleic'' in separate representations and color the DNA chain by ''Position'' &rarr; ''Radial'' for clarity) ... in particular, appreciate how not all positively charged side chains contact the phosphate backbone, but some penetrate into the helix and make detailed interactions with the nucleobases!
 
# Then clear all molecules
 
# In VMD, open '''Extensions&rarr;Analysis&rarr;MultiSeq'''. When you run MultiSeq for the first time, you will be asked for a directory in which to store metadata. You can use the default, or a directory of your choice; you may subsequently skip all steps that ask you to install "required" databases locally since we will not need them for this task.
 
# Choose '''File&rarr;Import Data''', browse to your directory and load one by one:
 
:: -Your model;
 
:: -The 1DUX complex;
 
:: -The 1DP7 complex.
 
# Mark all three protein chains by selecting the checkbox next to their name and choose '''Tools&rarr; STAMP structural alignment'''.
 
# '''Align''' the '''Marked Structures''', choose a '''scanscore''' of '''2''' and '''scanslide''' of '''5'''. Also choose '''Slow scan'''. You may have to play around with the settings to get the molecules to superimpose, but they '''can''' be superimposed quite well: at least the DNA-binding helices and the wings should line up.
 
# In the graphical representations window, double-click on the cartoon representations that multiseq has generated to undisplay them, also undisplay the Tube representation of 1DUX. Then create a Tube representation for 1DP7, and select a Color by ColorID (a different color that you like). The resulting scene should look similar to the one you have created above, only with 1DP7 in place of 1DUX and colored differently.
 
# Orient and scale your superimposed structures so that their structural similarity is apparent, and the differences in binding elements are clear. Perhaps visualizing a solvent accessible surface of the DNA will help understand the spatial requirements of the complex formation. You may want to keep a copy of the image for future reference. Note whether it is plausible that your '''model''' could bind a B-DNA double-helix in the "alternative" conformation.
 
}}
 
 
 
&nbsp;
 
  
 
<!--
 
<!--
===Coloring by conservation===
+
==Co-Expression==
 
 
With the superimposed coordinates, you can begin to get a sense whether either or both binding modes could be appropriate for a protein-DNA complex in your Mbp1 orthologue. But these are geometrical criteria only, and the protein in your species may be flexible enough to adopt a different conformation in a complex, and different again from your model. A more powerful way to analyze such hypothetical complexes is to look at conservation patterns. With VMD, you can import a sequence alignment into the MultiSeq extension and color residues by conservation. The protocol below assumes:
 
 
 
*You have prealigned the reference Mbp1 proteins with your species' Mbp1 orthologue;
 
*You have saved the alignment in a CLUSTAL format.
 
  
You can use Jalview or any other MSA server to do so. You can even do this by hand - there should be few if any indels and the correct alignment is easy to see.
 
  
 
{{task|1=
 
{{task|1=
;Load the Mbp1 APSES alignment into MultiSeq.
 
  
:(A) In the MultiSeq Window, navigate to '''File &rarr; Import Data...'''; Choose "From Files" and Browse to the location of the alignment you have saved. The File navigation window gives you options which files to enable: choose to Enable <code>ALN</code> files (these are CLUSTAL formatted multiple sequence alignments).
+
[http://coxpresdb.jp/ '''COXPRESdb'''] is a well-curated database of pre-calculated co-expression profiles for model organisms. Expression values across a large number of published experiments on the same platform are compared via their coefficient of correlation. Highly correlated genes are either co-regulated, or one gene influences the expression level of the other.
:(B) Open the alignment file, click on '''Ok''' to import the data, it will take a short while to load. If the data can't be loaded, the file may have the wrong extension: .aln is required.
 
:(C) Find the Mbp1_SACCE sequence in the list, click on it and move it to the top of the Sequences list with your mouse (the list is not static, you can re-order the sequences in any way you like).
 
  
You will see that the 1MB1 sequence and the APSES domain sequence do not match: at the N-terminus the sequence that corresponds to the PDB structure has extra residues, and in the middle the APSES sequences may have gaps inserted.
+
* Navigate to [http://coxpresdb.jp/ '''COXPRESdb'''].
 +
* Enter <code>Mbp1</code> as a "gene alias" in the search field.
 +
* Click on the link to the coexpressed gene list. Do any of the "known" target genes appear here? How do you interpret this result?
  
;Bring the 1MB1 sequence in register with the APSES alignment.
+
Unfortunately, the support for yeast genes is very limited. COXPRESdb is, however, an excellent resource for studying the genes of higher eukaryotes, especially human genes. You might want to consider it for its additional capabilities for your "systems" term project. Refer to the [http://coxpresdb.jp/help/movie/ YouTube tutorials for details].
:(A) MultiSeq supports typical text-editor selection mechanisms. Clicking on a residue selects it, clicking on a row selects the whole sequence. Dragging with the mouse selects several residues, shift-clicking selects ranges, and option-clicking toggles the selection on or off for individual residues. Using the mouse and/or the shift key as required, select the '''entire first column''' of the sequences you have imported.
 
:(B) Select '''Edit &rarr; Enable Editing... &rarr; Gaps only''' to allow changing indels.  
 
:(C) Pressing the spacebar once should insert a gap character before the '''selected column''' in all sequences. Insert as many gaps as you need to align the beginning of sequences with the corresponding residues of 1MB1: <code>S I M ...</code>
 
:(D) Now insert as many gaps as you need into the structure sequence, to align it completely with the Mbp1_SACCE APSES domain sequence. Simply select residues in the sequence and use the space bar to insert gaps. (Note: I have noticed a bug that sometimes prevents slider or keyboard input to the MultiSeq window; it fails to regain focus after operations in a different window. I don't know whether this is a Mac related problem or a more general bug in MultiSeq. When this happens I quit VMD and restore the session from a saved state. It is a bit annoying but not mission-critical.)
 
:(E) When you are done, it may be prudent to save the state of your alignment. Use '''File &rarr; Save Session...'''
 
  
;Color by similarity
 
:(A) Use the '''View &rarr; Coloring &rarr; Sequence similarity &rarr; BLOSUM30''' option to color the residues in the alignment and structure. This clearly shows you where conserved and variable residues are located and allows you to analyze their structural context.
 
:(B) You can adjust the color scale in the usual way by navigating to '''VMD main &rarr; Graphics &rarr; Colors...''', choosing the Color Scale tab and adjusting the scale midpoint.
 
:(C) Navigate to the '''Representations''' window and apply the color scheme to your tube-and-sidechain representation: double-click on the NewCartoon representation to hide it and use '''User''' coloring of your ''Tube'' and ''Licorice'' representations to apply the sequence similarity color gradient that MultiSeq has calculated.
 
 
&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
 
* Once you have colored the residues of your model by conservation, create another informative stereo-image and paste it into your assignment.
 
 
}}
 
}}
  
&nbsp;
 
-->
 
 
== Interpretation==
 
<!--
 
Analysis of the ligand binding site:
 
 
* http://dnasite.limlab.ibms.sinica.edu.tw/
 
* http://proline.biochem.iisc.ernet.in/pocketannotate/
 
* http://www.biosolveit.de/PoseView/
 
 
*Comparison with seq2logo
 
{{#pmid: 19483101}}
 
*protedna server PMID: 19483101
 
* http://serv.csbb.ntu.edu.tw/ProteDNA/
 
* http://protedna.csie.ntu.edu.tw/
 
* Multi Harmony
 
{{#pmid: 20525785}}
 
  
 +
{{Vspace}}
 
-->
 
-->
  
 +
==Further reading and resources==
  
 +
{{#pmid: 25392420}}
 +
{{#pmid: 23193258}}
  
{{task|1=
 
# Spend some time studying the complex.
 
# Recapitulate in your mind how we have arrived at this comparison, in particular, how this was possible even though the sequence similarity between these proteins is low - none of these winged helix domains came up as a result of our previous BLAST search in the PDB.
 
# Think carefully about the following question: considering the position of the two DNA helices relative to the YFO structural model, which binding mode appears to be more plausible for protein-DNA interactions in the YFO Mbp1 APSES domains? Is it the canonical, or the non-canonical binding mode? Is there evidence that allows you to distinguish between the two modes?
 
# Before you quit VMD, save the "state" of your session so you can reload it later. We will look at residue conservation once we have built phylogenetic trees. In the main VMD window, choose '''File&rarr;Save State...'''.
 
}}
 
  
<!--
+
<!--  
== R code: conservation scores and sequence weighting==
+
{{#pmid: 23846655}}
 +
{{#pmid: 23377968}}
 +
{{#pmid: 23258890}}
 +
{{#pmid: 21925324}}
 +
{{#pmid: 21627854}}
 +
{{#pmid: 21468988}}
 +
{{#pmid: 21097893}}
 +
{{#pmid: 21071405}}
 +
{{#pmid: 20652519}}
 +
{{#pmid: 20523743}}
 +
{{#pmid: 18953035}}
 +
{{#pmid: 17940530}}
 +
{{#pmid: 17449815}}
 +
{{#pmid: 16888359}} 
 
-->
 
-->
 
;That is all.
 
 
 
&nbsp;
 
 
== Links and resources ==
 
{{#pmid: 10679470}}
 
{{#pmid: 15808743}}
 
  
  
Line 259: Line 128:
 
&nbsp;
 
&nbsp;
 
{{#lst:BIO_Assignment_Week_1|assignment_footer}}
 
{{#lst:BIO_Assignment_Week_1|assignment_footer}}
 +
 +
<table style="width:100%;"><tr>
 +
<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_9|&lt;&nbsp;Assignment&nbsp;9]]</td>
 +
<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[BIO_Assignment_Week_11|Assignment&nbsp;11&nbsp;&gt;]]</td>
 +
</tr></table>
  
  

Latest revision as of 04:12, 13 December 2016

Assignment for Week 10
Expression Analysis

< Assignment 9 Assignment 11 >

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.



 

Introduction

The transcriptome is the set of a cell's mRNA molecules. The transcriptome originates from the genome, mostly, that is, and it results in the proteome, again: mostly. RNA that is transcribed from the genome is not yet fit for translation but must be processed: splicing is ubiquitous[1] and in addition RNA editing has been encountered in many species. Some authors therefore refer to the exome (the set of transcribed exons) to indicate the actual coding sequence.

Microarray technology (the quantitative, sequence-specific hybridization of labelled nucleotides in chip format) was the first domain of "high-throughput biology". Today, it has largely been replaced by RNA-seq: quantification of transcribed mRNA by high-throughput sequencing and mapping reads to genes. Quantifying gene expression levels in a tissue-, development-, or response-specific way has yielded detailed insight into cellular function at the molecular level, with recent results of single-cell sequencing experiments adding a new level of precision. But not all transcripts are mapped to genes: we increasingly realize that the transcriptome is not merely a passive buffer of expressed information on its way to be translated into proteins, but contains multiple levels of complex regulation through hybridization of small nuclear RNAs[2].

In this assignment, we will look at differential expression of Mbp1 and its target genes.


 

GEO2R


In this exercise we will use the analysis facilities of the GEO database at the NCBI.

Task:

First, we will search for relevant data sets on GEO, the NCBI's database for expression data.
  1. Navigate to the entry page for GEO data sets.
  2. Enter the following query in the usual Entrez query format: "cell cycle"[ti] AND "saccharomyces cerevisiae"[organism].
  3. You should get two datasets among the top hits that analyze wild-type yeast (W303a cells) across two cell-cycles after release from alpha-factor arrest. Choose the experiment with lower resolution (13 samples).
  4. On the linked GEO DataSet Browser page, follow the link to the Accession Viewer page: the "Reference series".
  5. Read about the experiment and samples, then follow the link to analyze with GEO2R.
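
The Entrez query from step 2 can also be written as a request URL for the NCBI E-utilities. The sketch below only builds the URL and does not contact the server. It assumes the `esearch` endpoint and `db=gds` (the Entrez name for GEO DataSets); the helper name `geo_search_url` and the `retmax` default are hypothetical choices made for illustration.

```python
from urllib.parse import urlencode

# Assumed base URL of the NCBI E-utilities esearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def geo_search_url(term, retmax=20):
    """Build (but do not send) an esearch URL for GEO DataSets (db=gds)."""
    params = {"db": "gds", "term": term, "retmax": retmax}
    # urlencode percent-encodes quotes and brackets and turns spaces into '+'.
    return ESEARCH + "?" + urlencode(params)


# The query string exactly as entered in the web form in step 2.
query = '"cell cycle"[ti] AND "saccharomyces cerevisiae"[organism]'
url = geo_search_url(query)
print(url)
```

Pasting the printed URL into a browser should return the matching record IDs as XML, which is a convenient way to keep a reproducible record of the search.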
Now proceed to apply this to the yeast cell-cycle study.
Figure: Value distribution for the yeast cell-cycle experiment GSE3635. Experiments are grouped approximately into equivalent time-points on a cell cycle.
  1. Define groups: the associated publication shows us that one cell-cycle takes almost exactly 60 minutes. Create timepoints T0, T1, T2, ... T5. Then associate the 0 and 60 min. samples with "T0"; 10 and 70 minutes get grouped as "T1"; 20 and 80 minutes are T2, etc. up to T5. The final sample does not get assigned.
  2. Confirm that the Value distributions are unbiased by accessing the value distribution tab - overall, in such experiments, the bulk of the expression values should not change and thus means and quantiles of the expression levels should be about the same.
  3. Your distribution should look like the image on the right: properly grouped into six categories, and unbiased regarding absolute expression levels and trends.
  4. Look for differentially expressed genes: open the GEO2R tab and click on Top 250.
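
The grouping rule from step 1 amounts to taking each sample's time modulo the cycle length. The following is a minimal sketch of that rule, assuming the 13 samples were taken every 10 minutes from 0 to 120 min (the spacing is inferred from the time points named in this assignment, not read from the GEO record); the function name `assign_group` is hypothetical.

```python
def assign_group(minutes, period=60, step=10, last=120):
    """Return the group label for a sample, or None for the unassigned final sample."""
    if minutes >= last:
        return None
    # Samples one full cycle apart (e.g. 10 and 70 min) share a label.
    return "T%d" % ((minutes % period) // step)


# Assumed sampling scheme: 13 samples at 0, 10, ..., 120 min.
timepoints = list(range(0, 130, 10))
groups = {t: assign_group(t) for t in timepoints}

for t, label in groups.items():
    print(t, label)
```

Each of T0 to T5 then holds exactly two samples, one from each cycle, which is what allows a within-group variance to be estimated at all.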
Analyze the results.
  1. Examine the top hits. Click on a few of the gene names in the Gene.symbol column to view the expression profiles that tell you why the genes were found to be differentially expressed. What do you think? Is this what you would have expected for genes' responses to the cell-cycle? What seems to be the algorithm's notion of what "differentially expressed" means?
  2. Look for expected genes. Here are a few genes that are known to be differentially expressed in the cell-cycle as target genes of the MBF complex: DSE1, DSE2, ERF3, HTA2, HTB2, and GAS3. But what about the MBF complex proteins themselves: Mbp1 and Swi6?

The notions of "differential expression" and "cell-cycle dependent expression" do not overlap completely. Significant differential expression is mathematically determined for genes that have low variance within groups and large differences between groups. This algorithm has no notion of any expectation you might have about the shape of the expression profile. All it finds are genes for which differential expression between some groups is statistically supported. The algorithm returns the top 250 of those. Consistency within groups is very important, while we intuitively might be giving more weight to conformance to our expectations of a cyclical pattern.
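
To make "low variance within groups and large differences between groups" concrete, here is a small sketch using a plain one-way ANOVA F-statistic. This is swapped in for illustration only; it is not the statistic GEO2R actually computes, and the two toy genes below are invented values, not data from GSE3635.

```python
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F: between-group variance divided by within-group variance."""
    values = [x for g in groups for x in g]
    grand = mean(values)
    k, n = len(groups), len(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups) / (k - 1)
    within = sum((x - mean(g)) ** 2 for g in groups for x in g) / (n - k)
    return between / within


# Two invented genes with identical group means (1.05, 2.05, 3.05)
# but very different consistency within each group.
consistent = [[1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
noisy = [[0.5, 1.6], [1.4, 2.7], [2.2, 3.9]]

print(f_statistic(consistent), f_statistic(noisy))
```

Both toy genes change by the same amount between groups, yet the consistent one scores far higher. Nothing in the calculation asks whether the profile is cyclical.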

Let's see if we can group our time points differently to enhance the contrast between expression levels for cyclically expressed genes. Let's define only two groups: one set before and between the two cycles, one set at the peaks - and we'll omit some of the intermediate values.

  1. Remove all of your groups and define two groups only. Call them "A" and "B".
  2. Assign the samples for T = 0, 10, 60, and 70 min. to the "A" group. Assign the samples for 30, 40, 90, and 100 min. to the "B" group.
  3. Recalculate the Top 250 differentially expressed genes (you might have to refresh the page to get the "Top 250" button back.) Which of the "known" MBF targets are now contained in the set? What about Mbp1 and Swi6?
  4. Finally: Let's compare the expression profiles for Mbp1, Swi6 and Swi4. It is not obvious that transcription factors are themselves under transcriptional control, as opposed to being expressed at a basal level and activated by phosphorylation or ligand binding. In a new page, navigate to the GEO Profiles page and enter (Mbp1 OR Swi6 OR Swi4 OR Nrm1 OR Cln1 OR Clb6 OR Act1 OR Alg9) AND GSE3635. (Nrm1, Cln1, and Clb6 are Mbp1 target genes. Act1 and Alg9 are beta-Actin and mannosyltransferase; these are often used as "housekeeping genes", i.e. genes with condition-independent expression levels, especially for qPCR studies, although Alg9 is also an Mbp1 target. We include them here as negative controls. GSE3635 is the ID of the GEO data set we have just studied.) You could have got similar results in the Profile graph tab of the GEO2R page. What do you find? What does this tell you? Would this information allow you to define groups that are even better suited for finding cyclically expressed genes?
  5. Click on the profile graph for Mbp1 and print out the page. Write your name and student number on the page. With a red pen, in one sentence describe the evidence you find on that page that allows us to conclude whether or not Mbp1 is a cell-cycle gene. You'll probably want to think for a moment what this question really means, how a cell-cycle gene could be defined, and what can be considered "evidence", before you write. I will mark your response for a maximum of four marks.
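
The rationale for the two-group design in steps 1 to 3 can be illustrated with an idealized profile. The sketch below assumes a perfectly cyclical gene with a 60-minute period whose expression is lowest at release (t = 0) and highest mid-cycle; the shape and the phase are assumptions made for illustration, and real profiles will be noisier and may peak at other times.

```python
import math


def profile(t, period=60.0):
    """Idealized cyclic expression at time t (minutes): lowest at t = 0, highest mid-cycle."""
    return -math.cos(2 * math.pi * t / period)


group_a = [0, 10, 60, 70]     # before and between the two cycles
group_b = [30, 40, 90, 100]   # around the two peaks


def group_mean(times):
    """Mean idealized expression over the given sample times."""
    return sum(profile(t) for t in times) / len(times)


contrast = group_mean(group_b) - group_mean(group_a)
print(round(contrast, 3))
```

For a gene in phase with this idealized curve, the difference between the two group means is large relative to the full amplitude of 2, which is why such genes should rise in the ranking. A gene that peaks at a different phase would gain much less from this particular grouping.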


 


Further reading and resources

Okamura et al. (2015) COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res 43:D82-6. (pmid: 25392420)

The COXPRESdb (http://coxpresdb.jp) provides gene coexpression relationships for animal species. Here, we report the updates of the database, mainly focusing on the following two points. For the first point, we added RNAseq-based gene coexpression data for three species (human, mouse and fly), and largely increased the number of microarray experiments to nine species. The increase of the number of expression data with multiple platforms could enhance the reliability of coexpression data. For the second point, we refined the data assessment procedures, for each coexpressed gene list and for the total performance of a platform. The assessment of coexpressed gene list now uses more reasonable P-values derived from platform-specific null distribution. These developments greatly reduced pseudo-predictions for directly associated genes, thus expanding the reliability of coexpression data to design new experiments and to discuss experimental results.

Barrett et al. (2013) NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res 41:D991-5. (pmid: 23193258)

The Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) is an international public repository for high-throughput microarray and next-generation sequence functional genomic data sets submitted by the research community. The resource supports archiving of raw data, processed data and metadata which are indexed, cross-linked and searchable. All data are freely available for download in a variety of formats. GEO also provides several web-based tools and strategies to assist users to query, analyse and visualize data. This article reports current status and recent database developments, including the release of GEO2R, an R-based web application that helps users analyse GEO data.




 


Footnotes and references

  1. Strictly speaking, splicing is a eukaryotic achievement; however, there are examples of splicing in prokaryotes as well.
  2. (2015) The noncoding explosion. Nat Struct Mol Biol 22:1. (pmid: 25565024)


    Jarvis & Robertson (2011) The noncoding universe. BMC Biol 9:52. (pmid: 21798102)



 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.


