Difference between revisions of "BIO Assignment Week 8"

From "A B C"
 
(22 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
<div id="BIO">
 
<div id="BIO">
 
<div class="b1">
 
<div class="b1">
Assignment for Week 7<br />
+
Assignment for Week 8<br />
 
<span style="font-size: 70%">Predictions: Homology Modeling</span>
 
<span style="font-size: 70%">Predictions: Homology Modeling</span>
 
</div>
 
</div>
 
<table style="width:100%;"><tr>
 
<table style="width:100%;"><tr>
<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_6|&lt;&nbsp;Assignment&nbsp;6]]</td>
+
<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_7|&lt;&nbsp;Assignment&nbsp;7]]</td>
<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[BIO_Assignment_Week_8|Assignment&nbsp;8&nbsp;&gt;]]</td>
+
<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[BIO_Assignment_Week_9|Assignment&nbsp;9&nbsp;&gt;]]</td>
 
</tr></table>
 
</tr></table>
  
 
{{Template:Inactive}}
 
{{Template:Inactive}}
  
Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.  
+
Concepts and activities (and reading, if applicable) for this assignment will be topics on the next quiz.  
  
  
Line 19: Line 19:
  
  
<div style="padding: 15px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
 
;How could the search for ultimate truth have revealed so hideous and visceral-looking an object?
 
::''<small>Max Perutz (on his first glimpse of the Hemoglobin structure)</small>''
 
</div>
 
&nbsp;
 
&nbsp;
 
 
Where is the hidden beauty in structure, and where, the "ultimate truth"? In the previous assignments we have studied sequence conservation in APSES family domains and we have discovered homologues in all fungal species. This is an ancient protein family that had already duplicated to several paralogues at the time the cenancestor of all fungi lived, more than 600,000,000 years ago, in the [http://www.ucmp.berkeley.edu/fungi/fungifr.html Vendian period] of the Proterozoic era of Precambrian times.
 
  
In order to understand how specific residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to study an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. Explanations of a protein's observed properties and functions can't rely on the general fact that it binds DNA; we need to consider details in terms of specific residues and their spatial arrangement. In particular, it would be interesting to correlate the conservation patterns of key residues with their potential to make specific DNA binding interactions. Unfortunately, no APSES domain structure in complex with bound DNA has been solved up to now, and the experimental evidence we have considered in Assignment 2 ([http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=10747782 Taylor ''et al.'', 2000]) is not sufficient to unambiguously define the details of how a DNA double helix might be bound. Moreover, at least two distinct modes of DNA binding are known for proteins of the winged-helix superfamily, of which the APSES domain is a member.
+
In order to understand how specific residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to study an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. Explanations of a protein's observed properties and functions can't rely on the general fact that it binds DNA; we need to consider details in terms of specific residues and their spatial arrangement. In particular, it would be interesting to correlate the conservation patterns of key residues with their potential to make specific DNA binding interactions. Unfortunately, the experimental evidence we have considered in Assignment 2 ([http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=10747782 Taylor ''et al.'', 2000]) is not sufficient to unambiguously define the details of how a DNA double helix might be bound. Moreover, at least two distinct modes of DNA binding are known for proteins of the winged-helix superfamily, of which the APSES domain is a member.
  
In this and the following assignment you will (1) construct a molecular model of the APSES domain from the Mbp1 orthologue in your assigned species, (2) identify similar structures of distantly related domains for which protein-DNA complexes are known, (3) assemble a hypothetical complex structure and (4) consider whether the available evidence allows you to distinguish between different modes of ligand binding.
+
In this assignment you will construct a molecular model of the APSES domain from the Mbp1 RBM orthologue in your assigned species.
  
 
For what follows, please remember this terminology:
 
For what follows, please remember this terminology:
Line 39: Line 31:
 
:The protein whose structure you are using as a guide to build the model.
 
:The protein whose structure you are using as a guide to build the model.
 
;Model
 
;Model
:The structure that results from the modeling process. It has the '''Target sequence''' and is similar to the '''Template structure'''.
+
:The structure that results from the modelling process. It has the '''Target sequence''' and is similar to the '''Template structure'''.
 
&nbsp;
 
&nbsp;
  
Line 46: Line 38:
  
 
&nbsp;
 
&nbsp;
==Warm-up: a minimal change==
+
 
Minimal changes to structure models can be done directly in Chimera. This illustrates the principle of full-scale modeling quite nicely. For an example, let us consider the residue <code>A&nbsp;42</code> of the 1BM8 structure. It is oriented towards the core of the protein, but most other Mbp1 orthologs have a larger amino acid in this position, <code>V</code>, or even <code>I</code>.
+
&nbsp;
 +
 
 +
 
 +
 
 +
==A Point Mutation==
 +
 
 +
To illustrate how homology modelling works in principle, let's consider changing the sequence of a single amino acid, based on a structural template.
 +
 
 +
Such minimal changes to structure models can be done directly in Chimera. Let us consider the residue <code>A&nbsp;42</code> of the 1BM8 structure. It is oriented towards the core of the protein, but most other Mbp1 orthologs have a larger amino acid in this position, <code>V</code>, or even <code>I</code>.
  
 
{{task|1=
 
{{task|1=
Line 64: Line 64:
 
If you want to go over this in more detail, check the video tutorial on YouTube published by the NIAID bioinformatics group [http://www.youtube.com/watch?v=bcXMexN6hjY '''here''']. I would also encourage you to go over [http://www.youtube.com/watch?v=eJkrvr-xeXY '''Part 2 of the video tutorial'''] that discusses how to check for and resolve (by energy minimization) steric clashes. But do remember that it is not clear whether energy minimization will make your structure more correct in the sense of a smaller overall RMSD with the real, mutated protein.
 
If you want to go over this in more detail, check the video tutorial on YouTube published by the NIAID bioinformatics group [http://www.youtube.com/watch?v=bcXMexN6hjY '''here''']. I would also encourage you to go over [http://www.youtube.com/watch?v=eJkrvr-xeXY '''Part 2 of the video tutorial'''] that discusses how to check for and resolve (by energy minimization) steric clashes. But do remember that it is not clear whether energy minimization will make your structure more correct in the sense of a smaller overall RMSD with the real, mutated protein.
  
What we have done here with one residue is exactly the way homology modeling works with entire sequences. Let's now build a homology model for YFO Mbp1.
+
What we have done here with one residue is exactly the way homology modelling works with entire sequences. The homology modelling program simply changes '''all''' amino acids to the residues of the '''target sequence''', based on the '''template structure'''. Let's now build a homology model for YFO Mbp1.
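
As an aside, the bookkeeping behind "changing all amino acids" can be sketched in a few lines of '''R'''. This is only an illustration under stated assumptions: the alignment is ungapped, the two strings are the first 40 residues of the SACCE and ASPNI APSES domains, and the function name <code>listSubstitutions()</code> is hypothetical. A real modelling program also has to place side chains and deal with indels, which this sketch does not attempt.

<source lang="R">
# Aligned, ungapped sequences (illustration only).
template <- "QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAA"  # SACCE
target   <- "NVYSATYSSVPVYEFKIGTDSVMRRRSDDWINATHILKVA"  # ASPNI

# Hypothetical helper: list every position at which the template
# residue would have to be replaced to obtain the target sequence.
listSubstitutions <- function(template, target) {
    stopifnot(nchar(template) == nchar(target))
    tpl <- strsplit(template, "")[[1]]
    tgt <- strsplit(target, "")[[1]]
    changed <- which(tpl != tgt)
    data.frame(position = changed,
               from = tpl[changed],
               to = tgt[changed],
               stringsAsFactors = FALSE)
}

subs <- listSubstitutions(template, target)
head(subs)

# Applying all substitutions to the template sequence
# reproduces the target sequence.
model <- strsplit(template, "")[[1]]
model[subs$position] <- subs$to
model <- paste(model, collapse = "")
identical(model, target)
</source>

Note that the positions are indices into the aligned strings, not PDB residue numbers.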
 +
 
 +
 
 +
&nbsp;
  
 
==Preparation==
 
==Preparation==
 +
 +
* We need to define our '''Target sequence''';
 +
* find a suitable structural '''Template'''; and
 +
* build a '''Model'''.
 +
  
 
===Target sequence===
 
===Target sequence===
The first step of homology modelling is to determine which sequence to model. We have determined the putative orthologue with conserved function in YFO by reciprocal best match with ''Saccharomyces cerevisiae'' Mbp1. Your sequence was initially found with an APSES domain search in YFO and the alignments with the yeast sequence are straightforward for the most part.
 
  
There are two exceptions however: the alignments of the '''ASPFU''' gene XP_754232 and the '''CAPCO''' gene XP_007722875 are both missing part of the domain's N-terminus. This is odd, because it may imply that the APSES domain of these genes is not properly folded. When such surprising alignment results occur, you '''must''' consider whether there could be an error in the published sequence, perhaps stemming from an erroneous gene model. This is not absolutely germane to this assignment, so I have placed the process into the collapsible section below - optional reading. However it may be useful for you to understand what the issue is here and how to address it.
+
We have encountered the PDB <code>1BM8</code> structure before: it is the APSES domain of ''Saccharomyces cerevisiae'' Mbp1. This is a useful template to model the DNA binding domain of your RBM match. But what exactly is the aligned region of the APSES domain? We could use several approaches to define the APSES domain:
 +
 
 +
* we could use the Biostrings package to calculate a pairwise sequence alignment with the <code>1BM8</code> sequence, like we did previously for the full-length sequences. This would give us the domain boundaries.
 +
* we could calculate a multiple sequence alignment that includes the <code>1BM8</code> sequence. This would also allow us to infer domain boundaries, actually in all sequences in our database at once. But we have found previously that such multiple sequence alignments are quite sensitive to un-alignable regions, of which we have quite a few in the full-length sequences. We do need an MSA, but we need to restrict the length of the sequences we align to a reasonable region.
 +
* we could access the domain annotations at [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml CDD] or at the [http://smart.embl-heidelberg.de/ SMART Database], but both have interfaces that are difficult to use computationally, and have other issues: NCBI does not recognize APSES domains, only the smaller KilA-N domain, and SMART sometimes does not find APSES domains in our sequences.
 +
* the most straightforward approach of course is to use the annotation that you already have produced for the APSES domain in <tt>MBP1_&lt;YFO&gt;</tt>. You should be able to simply take the MBP1_SACCE sequence and the one for YFO from the <tt>APSES.mfa</tt> file.
 +
 
 +
This is the 1BM8 sequence:
 +
>SACCE
 +
QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI
 +
  LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
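
If you keep your sequences in a multi-FASTA file, a few lines of base '''R''' are enough to pull out one record. The following is a minimal sketch, assuming plain FASTA text in which every header line is followed by at least one sequence line; the helper name <code>parseFasta()</code> is hypothetical and not part of the course scripts.

<source lang="R">
# Hypothetical helper: turn FASTA-formatted lines into a named
# character vector with one element per record.
parseFasta <- function(lines) {
    lines <- trimws(lines)                 # drop leading/trailing blanks
    lines <- lines[nchar(lines) > 0]       # drop empty lines
    isHeader <- grepl("^>", lines)
    recordID <- cumsum(isHeader)           # record number for each line
    ids <- sub("^>", "", lines[isHeader])
    ids <- sub(" .*$", "", ids)            # keep the first word only
    seqs <- tapply(lines[!isHeader], recordID[!isHeader],
                   paste, collapse = "")
    setNames(as.character(seqs), ids)
}

fasta <- c(">SACCE",
           "QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI",
           "  LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF")

apses <- parseFasta(fasta)
names(apses)
nchar(apses["SACCE"])
</source>

To read from a file instead, pass the result of <code>readLines()</code> on that file to the function.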
 +
 
 +
 
 +
<!--
 +
 
 +
{{task|1=
 +
 
 +
* In our case it seems the best results are had when searching the [http://prosite.expasy.org/prosite.html Prosite] database with the [http://prosite.expasy.org/scanprosite/ ScanProsite] interface.
 +
 
 +
Let's have a first look at ScanProsite, using the yeast Mbp1 sequence. We need the UniProt ID to search Prosite. With your protein database loaded in a fresh '''R''' session, type
 +
 
 +
<source lang="RSplus">
 +
# (commands indented, to align their components and
 +
# help you understand their relationship)
 +
 
 +
      refDB$protein$uniProtID
 +
                              which(refDB$protein$name == "MBP1")
 +
      refDB$protein$uniProtID[which(refDB$protein$name == "MBP1")]
 +
uID <- refDB$protein$uniProtID[which(refDB$protein$name == "MBP1")]
 +
uID
 +
</source>
 +
 
 +
* Navigate to [http://prosite.expasy.org/scanprosite/ ScanProsite], paste the UniprotID for yeast Mbp1 into the text field, select '''Table''' output for STEP 3, and '''START THE SCAN'''.
 +
 
 +
You should see four feature hits: the APSES domain, and three ankyrin domain sequences that partially overlap. We could copy and paste the start and end numbers and IDs but that would be lame. Let's get them directly from Prosite instead, because we will want to fetch a few of these. Prosite does not have a nice API like UniProt, but the principles of using '''R''''s <code>httr</code> package to send POST requests and retrieve the results are the same. Getting data informally from Web pages is called '''screenscraping''' and is really a life-saving skill. The first step to capture the data from this page via screenscraping is to look into the HTML code of the page.
 +
 
 +
(I am writing this section from the perspective of the Chrome browser - I don't think other browsers have all of the functionality that I am describing here. You may need to install Chrome to try this...)
 +
 
 +
* Use the menu and access '''View''' &rarr; '''Developer''' &rarr; '''View Source'''. Scroll through the page. You should easily be able to identify the data table. That's fair enough: each of the lines contains the UniProt ID and we should be able to identify them. But how do we send the request to get this page in the first place?
 +
 
 +
*Use the browser's back button, and again: '''View''' &rarr; '''Developer''' &rarr; '''View Source'''. This is the page that accepts user input in a so-called <code>form</code> via several different types of elements: "radio-buttons", a "text-box", "check-boxes", a "drop down menu" and a "submit" button. We need to figure out what each of the values are so that we can construct a valid <code>POST</code> request. If we get them wrong, in the wrong order, or have parts missing, it is likely that the server will simply ignore our request. These elements are much harder to identify than the lines of feature information, and it's really easy to get them wrong, miss something and get no output. But Chrome has a great tool to help us: it allows you to see the exact, assembled <code>POST</code> header that it sent to the Prosite server!
 +
 
 +
* On the ScanProsite page, open '''View''' &rarr; '''Developer''' &rarr; '''Developer Tools''' in the Chrome menu. '''Then''' click again on '''START THE SCAN'''. The Developer Tools page will show you information about what just happened in the transaction it negotiated to retrieve the results page. Click on the '''Network''' tab, and then on the top element: <code>PSScan.cgi</code>. This contains the form data. Then click on the '''Headers''' tab and scroll down until you see the '''Request Payload'''. This has all the required <code>POST</code> elements nicely spelled out. No guesswork required. What worked from the browser should work the same way from an '''R''' script. Analogous to our UniProt fetch code, we create a <code>POST</code> query:
 +
 
 +
<source lang="RSplus">
 +
 
 +
URL <- "http://prosite.expasy.org/cgi-bin/prosite/PSScan.cgi"
 +
response <- POST(URL,
 +
                body = list(meta = "opt1",
 +
                            meta1_protein = "opt1",
 +
                            seq = "P39678",
 +
                            skip = "on",
 +
                            output = "tabular"))
 +
# Note how the list-elements correspond to the page header's
 +
# Request Payload. We include everything but the value of the
 +
# submit button (which is for display only) in our POST
 +
# request.
 +
 
 +
# Send off this request, and you should have a response in a few
 +
# seconds.
 +
 
 +
# The text contents of the response is available with the
 +
# content() function:
 +
content(response, "text")
 +
 
 +
# ... should show you the same as the page contents that
 +
# you have seen in the browser. Now we need to extract
 +
# the data from the page: we need regular expressions, but
 +
# only simple ones. First, we strsplit() the response into
 +
# individual lines, since each of our data elements is on
 +
# its own line. We simply split on the "\\n" newline character.
 +
 
 +
lines <- unlist(strsplit(content(response, "text"), "\\n"))
 +
head(lines)
 +
 
 +
# Now we define a query pattern for the lines we want:
 +
# we can use the uID, bracketed by two "|" pipe
 +
# characters:
 +
 
 +
pattern <- paste("\\|", uID, "\\|", sep="")
 +
 
 +
# ... and select only the lines that match this
 +
# pattern:
 +
 
 +
lines <- lines[grep(pattern, lines)]
 +
lines
 +
 
 +
# ... captures the four lines of output.
 +
 
 +
# Now we break the lines apart
 +
# into tokens: this is another application of
 +
# strsplit(), but this time we split either on
 +
# "pipe" characters, "|" OR on tabs "\t". Look at the
 +
# regex "\\t|\\|" in the strsplit() call:
 +
 
 +
strsplit(lines[1], "\\t|\\|")
 +
 
 +
# Its parts are (\\t)=tab (|)=or (\\|)=pipe.
 +
# Both "t" and "|" need to be escaped with a backslash.
 +
# "t" has to be escaped because we want to match a tab (\t),
 +
# not the literal character "t". And "|" has to be escaped
 +
# because we mean the literal pipe character, not its
 +
# usual (special) meaning OR. Thus sometimes the backslash
 +
# turns a special meaning off, and sometimes it turns a
 +
# special meaning on. Unfortunately there's no easy way
 +
# to tell - you just need to remember the characters - or
 +
# have a reference handy. The special characters are
 +
# (){}[]^$?*+.|&-  ... and some of them have different
 +
# meanings depending on where in the regex they are. 
 +
 
 +
# Let's put the tokens into named slots of a vector.
 +
 
 +
features <- list()
 +
for (line in lines) {
 +
    tokens <- unlist(strsplit(line, "\\t|\\|"))
 +
    features <- rbind(features, c(uID  =  tokens[2],
 +
                                  start =  tokens[4],
 +
                                  end  =  tokens[5],
 +
                                  psID  =  tokens[6],
 +
                                  psName = tokens[7]))
 +
}
 +
features
 +
</source>
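
The tokenizing step can also be tried in isolation, without contacting the server. The sketch below is an alternative to the loop above that collects the tokens in a data frame, so that the start and end positions are stored as integers. It rests on assumptions: the two sample lines are made up (including their coordinates) and merely imitate the layout that the token indices in the loop above imply, and the function name <code>parseFeatureLines()</code> is hypothetical.

<source lang="R">
# Hypothetical helper: split each result line on tabs or pipes and
# collect the tokens of interest as one data frame row per line.
parseFeatureLines <- function(lines) {
    rows <- lapply(lines, function(line) {
        tokens <- unlist(strsplit(line, "\\t|\\|"))
        data.frame(uID    = tokens[2],
                   start  = as.integer(tokens[4]),
                   end    = as.integer(tokens[5]),
                   psID   = tokens[6],
                   psName = tokens[7],
                   stringsAsFactors = FALSE)
    })
    do.call(rbind, rows)
}

# Made-up lines for illustration; the coordinates are not real results.
sampleLines <- c("sp|P39678|MBP1_YEAST\t5\t111\tPS51299\tHTH_APSES",
                 "sp|P39678|MBP1_YEAST\t394\t423\tPS50088\tANK_REPEAT")

features <- parseFeatureLines(sampleLines)
features
features$end - features$start + 1    # feature lengths
</source>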
 +
 
 +
This forms the base of a function that collects the features automatically from a PrositeScan result. We still need to do a bit more on the database part, but this is mostly bookkeeping:
 +
 
 +
* We need to put the feature annotations into a database table and link them to a protein ID and to a description of the feature itself.
 +
* We need a function that extracts feature sequences in FASTA format.
 +
* And, since we are changing the structure of the database, we need a way to migrate your old database contents to a newer version.
 +
 
 +
I don't think much new can be learned from this, so I have written those functions and put them into <code>dbUtilities.R</code>. But you can certainly learn something from having a look at the code of
 +
 
 +
*<code>fetchPrositeFeatures()</code>
 +
*<code>addFeatureToDB()</code>
 +
*<code>getFeatureFASTA()</code>
 +
 
 +
Also, have a quick look back at our [[BIO_Assignment_Week_3#The_Protein_datamodel|database schema:]] this update has implemented the proteinFeature and the feature table. Do you remember what they were good for?
 +
 
 +
Time for a database update. You must be up to date with the latest version of <code>dbUtilities.R</code> for this to work. When you are, execute the following steps:
 +
 
 +
<source lang="R">
 +
 
 +
updateVerifiedFile("363ffbae3ff21ba80aa4fbf90dcc75164dbf10f8")
 +
 
 +
# Make a backup copy of your protein database.
 +
# Load your protein database. Then merge the data in your database
 +
# with the updated reference database. (Obviously, substitute the
 +
# actual filename in the placeholder strings below. And don't type
 +
# the angled brackets!)
 +
 
 +
<my-new-database> <- mergeDB(<my-old-database>, refDB)
  
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand to read about gene model correction" data-collapsetext="Collapse">
+
# check that this has worked:
;Correcting the ASPFU Mbp1 gene model.
+
str(<my-new-database>)
  
 +
# and save your database.
  
<div class="mw-collapsible-content">
+
save(<my-new-database>, file="<my-DB-filename.02>.RData")
An alignment of APSES domain sequences shows the shortened N-terminus of the ASPFU and the CAPCO protein, relative to SACCE and e.g. the closely related ''Aspergillus nidulans'', ASPNI:
 
APSES domains:
 
Mbp1_SACCE  QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAA...
 
Mbp1_ASPNI  NVYSATYSSVPVYEFKIGTDSVMRRRSDDWINATHILKVA...
 
Mbp1_ASPFU  ----------------------MRRRGDDWINATHILKVA...
 
Mbp1_CAPCO  ----------------------MRRRSDDWVNATHILKVA...
 
  
We analyse this for the ASPFU gene.
+
# Now, for each of your proteins, add the domain annotations to
 +
# the database. You could write a loop to do this but it's probably
 +
# better to check the results of each annotation before committing
 +
# it to the database. So just paste each UniProt ID as the argument of
 +
# the function fetchPrositeFeatures(), execute and repeat.
  
Working from the possibility that this may be a gene model error - e.g. a false translational start, a frameshift due to a sequencing error, or an erroneously modelled intron, we check whether the translation of the genomic sequence supports the presence of the expected amino acids. This is easily done running TBLASTN - BLASTing the protein query against the six reading frames of the ASPFU genome. We find the following:
 
  
 +
features <- fetchPrositeFeatures(<one-of-my-proteins-uniProt-IDs>)
 +
refDB <- addFeatureToDB(refDB, features)
  
Aspergillus fumigatus Af293 chromosome 3, whole genome shotgun sequence
+
# When you are done, save your database.
Sequence ID: ref|NC_007196.1|Length: 4079167Number of Matches: 2
+
</source>
[...]
 
Query  10      VDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILE ...
 
                V VYEF    S+M+R+ DDW+NATHILK A F K  RTRILE ...
 
Sbjct  3691193  VPVYEFKVDGESVMRRRGDDWINATHILKVAGFDKPARTRILE ...
 
  
Indeed, there is sequence upstream of the gene's published translation start that matches well with our query! But where is the correct translation start? For that we need to look at the actual nucleotide sequence and translate it. Remember: BLAST is a '''local''' sequence alignment algorithm and it won't retrieve everything that matches to our query, just the best matching segment. ASPFU chromosome 3 is over 4 megabases long, so let us try to obtain only the region we are actually interested in: downstream of bases 3691193, let's say 3691100 (make sure this offset is divisible by three, to stay in the same reading frame) and upstream to, say, 3691372.
+
Finally, we can create a sequence selection of APSES domains
 +
from our reference proteins. The function <code>getFeatureFasta()</code>
  
#At the [http://www.ncbi.nlm.nih.gov/genome/browse/ '''NCBI genome project site'''] we search for ''aspergillus fumigatus''.
+
* accepts a feature name such as <code>"HTH_APSES"</code>;
#At the [http://www.ncbi.nlm.nih.gov/genome/18 '''''aspergillus fumigatus''''' '''genome project site'''] we click on chromosome 3 to access the map viewer.
+
* finds the corresponding feature ID;
#Hovering over the ''Download/View sequence'' link shows us how an URL to access sequence data is structured:
+
* finds all matching entries in the proteinFeature table;
<nowiki>http://www.ncbi.nlm.nih.gov/projects/mapview/seq_reg.cgi?taxid=746128&chr=3&from=1&to=4079167</nowiki>
+
* looks up the start and end position of each feature;
:We can easily adapt this to the sequence range we need ...
+
* fetches the corresponding substring from the sequence entries;
<ol start="4">
+
* adds a meaningful header line; and
<li>... and follow: http://www.ncbi.nlm.nih.gov/nuccore/NC_007196.1?from=3691003&to=3691243&report=fasta to yield:
+
* writes everything to output.
</ol>
 
>gi|71025130:3691003-3691243 Aspergillus fumigatus Af293 chromosome 3, whole genome shotgun sequence
 
ACGGTTTGCGGAGACGGGCATTATGGCGGCGGTGGATTTCTCAAAAATCTATTCTGCTACATACAGCAGC
 
GTAAGTCTCTTCTAATTGCGTATCTCTGTTTTCCCTACAGCCTCAAATTTTCCCCAATGCCTCTTTCCAT
 
CCATTTTGCCCCTTCCTTCGCCGCGAAGCCAATCTAACGCAGTTCAATAGGTTCCAGTTTACGAGTTCAA
 
AGTCGATGGCGAAAGTGTTATGCGCCGACGA
 
  
 +
... so that you can simply execute:
  
<ol start="5">
+
<source lang="R">
<li>To translate this, we navigate to any of the [http://bips.u-strasbg.fr/EMBOSS/ '''EMBOSS''' tools servers] and use "remap" - we want to see the translation matched to the nucleotide sequence. We turn restriction sites off, translate all three forward frames and paste and manually align the SACCE Mbp1 sequence into the output to see what we expect and what we got. I have selected only the frame(s) that actually give a match, and I have pasted the homologous CAPCO and SACCE sequences (lower case) to demonstrate their similarity:
+
cat(getFeatureFasta(<my-new-database>, "HTH_APSES"))
</ol>
+
</source>
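
To see how the steps listed for the feature extraction fit together, here is a minimal sketch with a toy database. It is '''not''' the actual function from <code>dbUtilities.R</code>: the function name <code>sketchFeatureFasta()</code>, the table and column names, and the toy sequences are all assumptions made for illustration.

<source lang="R">
# Toy database (invented data; table and column names are assumptions).
toyDB <- list(
    protein = data.frame(ID = 1:2,
                         name = c("TOY1", "TOY2"),
                         sequence = c("MAAVDFSKIYSATYSS",
                                      "MSNQIYSARYSGVDVY"),
                         stringsAsFactors = FALSE),
    feature = data.frame(ID = 10L,
                         name = "HTH_APSES",
                         stringsAsFactors = FALSE),
    proteinFeature = data.frame(proteinID = 1:2,
                                featureID = c(10L, 10L),
                                start = c(9L, 5L),
                                end = c(16L, 12L))
)

# Hypothetical sketch: name -> feature ID -> annotated ranges ->
# substrings -> FASTA text.
sketchFeatureFasta <- function(db, featureName) {
    fID <- db$feature$ID[db$feature$name == featureName]
    hits <- db$proteinFeature[db$proteinFeature$featureID %in% fID, ]
    out <- character()
    for (i in seq_len(nrow(hits))) {
        p <- db$protein[db$protein$ID == hits$proteinID[i], ]
        s <- substr(p$sequence, hits$start[i], hits$end[i])
        header <- sprintf(">%s    %s %d:%d",
                          p$name, featureName,
                          hits$start[i], hits$end[i])
        out <- c(out, header, s)
    }
    paste0(paste(out, collapse = "\n"), "\n")
}

fastaText <- sketchFeatureFasta(toyDB, "HTH_APSES")
cat(fastaText)
</source>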
ASPFU    ACGGTTTGCGGAGACGGGCATTATGGCGGCGGTGGATTTCTCAAAAATCTATTCTGCTACATACAGCAGC
 
                                                                       
 
ASPFU      R  F  A  E  T  G  I  M  A  A  V  D  F  S  K  I  Y  S  A  T  Y  S  S 
 
CAPCO                          m  -  a  f  d  -  k  e  i  y  s  a  t  y  s  n 
 
SACCE                          m  s  -  -  -  -  n  q  i  y  s  a  r  y  s  g
 
 
         
 
ASPFU    GTAAGTCTCTTCTAATTGCGTATCTCTGTTTTCCCTACAGCCTCAAATTTTCCCCAATGCCTCTTTCCAT
 
 
 
ASPFU    V  S  L  F  *  ...
 
CAPCO    v  a  -  -    ...
 
SACCE    v  d  -  -    ...
 
         
 
ASPFU    CCATTTTGCCCCTTCCTTCGCCGCGAAGCCAATCTAACGCAGTTCAATAGGTTCCAGTTTACGAGTTCAA
 
                                                              ...  V  Y  E  F  K
 
CAPCO                                                        ...  v  y  e  l  k
 
SACCE                                                        ...  v  y  e  f  i
 
         
 
ASPFU      AGTCGATGGCGAAAGTGTTATGCGCCGACGAGGCGATGATTGGATCAATGCTACACATATTCTTAAA
 
 
ASPFU      V  D  G  E  S  V  M  R  R  R  G  D  D  W  I  N  A  T  H  I  L  K ...
 
CAPCO      v  a  g  d  h  i  m  r  r  r  s  d  d  w  v  n  a  t  h  i  l  k ...
 
SACCE      h  s  t  g  s  i  m  k  r  k  k  d  d  w  v  n  a  t  h  i  l  k ...
 
  
 +
Here are the first five sequences from that result:
  
:This clearly shows us that there is N-terminal sequence that ought to be added to the gene model, upstream of the reported translational start of <tt>MRRR...</tt>. The sequences thus most likely begin as follows:
+
<source lang="text">
 +
>CC1G_01306_COPCI    HTH_APSES 6:112
 +
IFKATYSGIPVYEMMCKGVAVMRRRSDSWLNATQILKVAGFDKPQRTRVLEREVQKGEHE
 +
KVQGGYGKYQGTWIPLERGMQLAKQYNCEHLLRPIIEFTPAAKSPPL
 +
>CNBB4890_CRYNE    HTH_APSES 17:123
 +
IYKATYSGVPVYEMVCRDVAVMRRRSDAYLNATQILKVAGFDKPQRTRVLEREVQKGEHE
 +
KVQGGYGKYQGTWIPIERGLALAKQYGVEDILRPIIDYVPTSVSPPP
 +
>COCMIDRAFT_338_BIPOR    HTH_APSES 9:115
 +
IYSATYSNVPVYECNVNGHHVMRRRADDWINATHILKVADYDKPARTRILEREVQKGVHE
 +
KVQGGYGKYQGTWIPLEEGRGLAERNGVLDKMRAIFDYVPGDRSPPP
 +
>WALSEDRAFT_68476_WALME    HTH_APSES 83:192
 +
IYSAVYSGVGVYEAMIRGIAVMRRRADGYMNATQILKVAGVDKGRRTKILEREILAGLHE
 +
KIQGGYGKYQGTWIPFERGRELALQYGCDHLLAPIFDFNPSVMQPSAGRS
 +
>PGTG_08863_PUCGR    HTH_APSES 90:196
 +
IYKATYSGVPVLEMPCEGIAVMRRRSDSWLNATQILKVAGFDKPQRTRVLEREIQKGTHE
 +
KIQGGYGKYQGTWVPLDRGIDLAKQYGVDHLLSALFNFQPSSNESPP
 +
[...]
 +
</source>
  
ASPFU  MAAVDFSKIYSATYSSVSLFVYEFKVDGE-----SVMRRRGDDWINATHILK...
 
CAPCO  ma-fd-keiysatysnva--vyelkvagd-----himrrrsddwvnathilk...
 
SACCE  ms----nqiysarysgvd--ysgvdvyefihstgsimkrkkddwvnathilk...
 
  
The fact that the truncated N-terminus appears in both closely '''related''' genes and species suggests that what we see here is a mis-annotated intron. The take-home lesson is: if your retrieved protein sequence does not conform to your expectations, it may be worthwhile to follow up with the actual nucleotide sequence.
+
At the bottom of these sequences, you should see the APSES sequences from
 +
YFO, '''in particular the Mbp1 RBM sequence from YFO'''. Email me if you have trouble getting to that stage.
  
</div>
+
We'll need to align these sequences with the template...
</div>
 
  
 +
}}
  
&nbsp;
+
-->
  
 
===Template choice and template sequence===
 
===Template choice and template sequence===
  
  
The [http://swissmodel.expasy.org/ SWISS-MODEL] server provides several different options for constructing homology models. The easiest option requires only a target sequence as input. In this mode the program will automatically choose suitable templates and create an input alignment. I would argue however that that is not the best way to use such a service: template choice and alignment both may be significantly influenced by biochemical reasoning, and an automated algorithm cannot make the necessary decisions. Should you use a structure of reduced resolution that however has a ligand bound? Should you move an indel from an active site to a loop region even though the sequence similarity score might be less? Questions like that may yield answers that are counter to the best choices an automated algorithm could make. But Swiss Model is flexible and allows us to upload an explicit alignment between target and template. Please note: the model you will produce is "easy" - the sequence similarity is high and there are no indels to consider, the automated mode would have done just as well. But the strategy we pursue here is suitable also for much more difficult problems. The automated strategy probably is not.
+
The [http://swissmodel.expasy.org/ SWISS-MODEL] server provides several different options for constructing homology models. The easiest option requires only a target sequence as input. In this mode the program will automatically choose suitable templates and create an input alignment. I would argue however that that is not the best way to use such a service: template choice and alignment both may be significantly influenced by biochemical reasoning, and an automated algorithm cannot make the necessary decisions. Should you use a structure of reduced resolution that however has a ligand bound? Should you move an indel from an active site to a loop region even though the sequence similarity score might be less? Questions like that may yield answers that are different from the best choices an automated algorithm could make. But Swiss Model is flexible and allows us to upload an explicit alignment between target and template. Please note: the model you will produce is "easy" - the sequence similarity is high and there are no indels to consider, the automated mode would have done just as well. But the strategy we pursue here is suitable also for much more difficult problems. The automated strategy probably is not.
  
Template choice is the first step. Often more than one related structure can be found in the PDB. We have touched on principles of selecting template structures in the lectures; please refer to the [[Template_choice_principles|template choice principles]] page on this Wiki where I have reviewed the principles and discussed more details and alternatives. One can either search the PDB itself through its '''Advanced Search''' interface; for example one can search for sequence similarity with a BLAST search, or search for structural similarity by accessing structures according to their CATH or SCOP classification. But the BLAST search is probably the method of choice: after all, the most important measure of the probability of success for homology modeling is sequence similarity.
+
Template choice is the first step. Often more than one related structure can be found in the PDB. The degree of sequence identity is the most important criterion, but there are many other factors to consider. Please refer to the [[Template_choice_principles|template choice principles]] page on this Wiki where I discuss more details and alternatives. To find related structures, you can search the PDB itself through its '''Advanced Search''' interface; for example one can search for sequence similarity with a BLAST search, or search for structural similarity by accessing structures according to their CATH or SCOP classification. But the BLAST search is probably the method of choice: after all, the most important measure of the probability of success for homology modelling is sequence similarity.
  
In [[BIO_Assignment_Week_3#Search_input|Assignment 3]], you have defined the extent of the APSES domain in yeast Mbp1. In [[BIO_Assignment_Week_6|Assignment 6]], you have used PSI-BLAST to search for APSES domains in YFO. In [[BIO_Assignment_Week_7|Assignment 7]] you have confirmed by ''Reciprocal Best Match'' which of these APSES domain sequences is the closest related orthologue to yeast Mbp1. This sequence is the best candidate for having a conserved function similar to yeast Mbp1. Therefore, this sequence is the one you will model: it is called the '''target''' for the homology modeling procedure. In the same assignment you have also computed a multiple sequence alignment that includes the sequence of  Mbp1 with YFO.
 
  
Defining a '''template''' means finding a PDB coordinate set that has sufficient sequence similarity to your '''target''' that you can build a model based on that '''template'''. In  [[BIO_Assignment_Week_2#Structure_search|Assignment 2]] you have used a keyword search at the PDB to find "Mbp1" structures - but some of these structures were not homologs: keyword searches are notoriously unreliable. To find suitable PDB structures, we will perform a BLAST search at the PDB instead.
+
Defining a '''template''' means finding a PDB coordinate set that has sufficient sequence similarity to your '''target''' that you can build a model based on that '''template'''. To find suitable PDB structures, we will perform a BLAST search at the PDB.
  
  
Line 181: Line 322:
  
 
{{task|1=
 
{{task|1=
# Retrieve your YFO Mbp1-like APSES domain sequence. You can find the domain boundaries for the yeast protein in the [[Reference annotation yeast Mbp1|Mbp1 annotation reference page]], and you can get the aligned sequence from your Jalview alignment, or simply recompute it with the <code>needle</code> program of the EMBOSS suite. This YFO sequence is your '''target''' sequence.
+
# Retrieve your '''aligned''' YFO's Mbp1 RBM APSES domain sequence from the <tt>APSES.mfa</tt> selection you have prepared for the phylogeny assignment. This YFO sequence is your '''target''' sequence.
 
# Navigate to the [http://www.pdb.org/pdb/home/home.do PDB].
 
# Navigate to the [http://www.pdb.org/pdb/home/home.do PDB].
 
# Click on '''Advanced''' to enter the advanced search interface.
 
# Click on '''Advanced''' to enter the advanced search interface.
Line 213: Line 354:
 
# click: '''Create report'''.
 
# click: '''Create report'''.
  
Unfortunately you don't get the E-values into the report, and those should strongly influence your final decision. However in our case the sequences and therefore the E-values of the top three hits are all the same. Neither of the structures has a bound DNA ligand, but the experimental methods and structure quality are different. Two of the sequences have a longer chain-length ... but those are only disordered residues (otherwise these would be better suited templates; regrettably, you'd need to check that in the ''real world'', there is no automatic tool to evaluate disorder and its effects on template choice). In my opinion that leaves pretty much only one unambiguous choice: 1BM8. In case you don't agree, please let me know.
+
Unfortunately you don't get the E-values into the report, and those should strongly influence your final decision. However in our case the sequences and therefore the E-values of the top three hits are all the same. And there is a new structure from January 2015, with a lower resolution. Some of the sequences have a longer chain-length ... but those are only disordered residues (otherwise these would be better suited templates; regrettably, you'd need to check that in the ''real world'', there is no automatic tool to evaluate disorder and its effects on template choice). In my opinion that leaves pretty much only one unambiguous choice for our template: 1BM8.  
  
 
;Finally: Click on the 1BM8 ID to navigate to the structure page for the '''template''' and save the FASTA sequence to your computer. This is '''the template sequence'''.
 
;Finally: Click on the 1BM8 ID to navigate to the structure page for the '''template''' and save the FASTA sequence to your computer. This is '''the template sequence'''.
Line 221: Line 362:
  
 
&nbsp;
 
&nbsp;
 
 
  
 
===Sequence numbering===
 
===Sequence numbering===
Line 274: Line 413:
  
 
{{task|1=
 
{{task|1=
Choose on of the following options to align your '''target''' and '''template''' sequence.
+
Choose one of the following options to align your '''target''' and '''template''' sequence. Make sure your '''template''' sequence is included, i.e. the FASTA sequence of 1BM8.
  
  
 
;In Jalview...
 
;In Jalview...
* Load your Jalview project with aligned APSES domain sequences or recreate it from the Mbp1 orthologue sequences from the [[Reference Mbp1 orthologues (all fungi)|'''Mbp1 protein orthologs page''']] that I prepared for Assignment 7. Include the sequence of your '''template protein''' and re-align.
+
* Load your APSES domain sequences plus the 1BM8 sequence in Jalview. Include the sequence of your '''template protein''' and align using Muscle.
 
* Delete all sequences you no longer need, i.e. keep only the APSES domains of the '''target''' (from your species) and the '''template''' (from the PDB) and choose '''Edit &rarr; Remove empty columns'''. This is your '''input alignment'''. 
 
* Delete all sequences you no longer need, i.e. keep only the APSES domains of the '''target''' (from your species) and the '''template''' (from the PDB) and choose '''Edit &rarr; Remove empty columns'''. This is your '''input alignment'''. 
 
* Choose '''File&rarr;Output to textbox&rarr;FASTA''' to obtain the aligned sequences. They should both have exactly the same length, i.e. N- or C- termini have to be padded by hyphens if the original sequences had different length. Save the sequences in a text-file.
 
* Choose '''File&rarr;Output to textbox&rarr;FASTA''' to obtain the aligned sequences. They should both have exactly the same length, i.e. N- or C- termini have to be padded by hyphens if the original sequences had different length. Save the sequences in a text-file.
Line 285: Line 424:
 
;Using a different MSA program
 
;Using a different MSA program
 
* Copy the FASTA formatted sequences of the Mbp1 proteins in the reference  species from the [[Reference APSES domains (reference species)|'''Reference APSES domain page''']].
 
* Copy the FASTA formatted sequences of the Mbp1 proteins in the reference  species from the [[Reference APSES domains (reference species)|'''Reference APSES domain page''']].
* Access e.g. the MSA tools page at the EBI.  
+
* Access the [http://www.ebi.ac.uk/Tools/msa/ '''MSA tools page at the EBI'''].  
 
* Paste the Mbp1 sequence set, your '''target''' sequence and the '''template''' sequence into the input form.
 
* Paste the Mbp1 sequence set, your '''target''' sequence and the '''template''' sequence into the input form.
*Run the alignment and save the output.
+
*Run an alignment (I like T-coffee) and save the output.
  
  
;Using the EMBOSS explorer
+
;Using the '''R''' bioconductor [[BIO_Assignment_Week_4#Computing_an_MSA_in_R|MSA package that you used previously]].
* Use the <code>needle</code> tool for the alignment  ... but remember that pairwise alignments will only be suitable in case the alignment is absolutely unambiguous (such as here) . If there are any indels, an MSA will give much more reliable information.
+
Refer back to the page if you are lacking notes how to go about this.
 
 
 
 
;By hand
 
APSES domains are strongly conserved and have few if any indels. You could also simply align by hand.
 
 
 
* Copy the CLUSTAL formatted reference alignment of the Mbp1 proteins in the reference species from the [[Reference APSES domains (reference species)|'''Reference APSES domain page''']].
 
* Open a new file in a text editor.
 
* Paste the Mbp1 sequence set, your '''target''' sequence and the '''template''' sequence into the file.
 
*Align by hand, replace all spaces with hyphens and save the output.
 
 
}}
 
}}
  
Line 312: Line 442:
 
  AVHVAVYSGVEVYECFIKGVSVMRRRRDSWLNATQILKVADFDKPQRTRV
 
  AVHVAVYSGVEVYECFIKGVSVMRRRRDSWLNATQILKVADFDKPQRTRV
 
  LERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPILSL
 
  LERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPILSL
 +
 +
 +
In this case, there are no indels and therefore no hyphens - in your case there may be.
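
Before uploading, it can help to confirm that the two aligned sequences really have the same length and to see how similar they are. The following is a minimal sketch in base '''R'''; the two toy sequences and all variable names are made up for illustration and are not real APSES domain sequences.

<source lang="rsplus">
# Hypothetical aligned sequences (toy data, NOT real APSES domains)
target   <- "MKV-IYSAKYSGVDVYEF"
template <- "MQVNIYSARYSGVDVYEF"

# Both rows of an alignment must have the same number of columns
stopifnot(nchar(target) == nchar(template))

# Split the strings into single characters
tChars <- strsplit(target,   "")[[1]]
pChars <- strsplit(template, "")[[1]]

# Columns in which neither sequence has a gap character
aligned <- tChars != "-" & pChars != "-"

nGaps    <- sum(!aligned)   # columns that contain a hyphen
pctIdent <- 100 * sum(tChars[aligned] == pChars[aligned]) / sum(aligned)

nGaps
pctIdent
</source>

If the length check fails, pad the shorter sequence with hyphens at the appropriate terminus before you paste the alignment into the form.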
  
  
 
&nbsp;
 
&nbsp;
 +
  
 
==Homology model==
 
==Homology model==
  
  
&nbsp;
+
The alignment defines the residue by residue relationship between '''target''' and '''template''' sequence. All we need to do now is to change every residue of the template to the target sequence.
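
To make this residue by residue relationship concrete, here is a small base '''R''' sketch that lists which template positions would have to be changed. The sequences are toy data, and the assumption that the first template residue carries the number 4 is for illustration only.

<source lang="rsplus">
# Hypothetical aligned sequences (toy data)
target     <- "MKVIYSAKYSG"
template   <- "MQVIYSARYSG"
firstResno <- 4L   # assumed residue number of the first template residue

tChars <- strsplit(target,   "")[[1]]
pChars <- strsplit(template, "")[[1]]
stopifnot(length(tChars) == length(pChars))

# Only non-gap template columns consume a residue number
resno <- rep(NA_integer_, length(pChars))
resno[pChars != "-"] <- firstResno + seq_len(sum(pChars != "-")) - 1L

# Table of all columns, then keep those where the residue changes
changes <- data.frame(resno    = resno,
                      template = pChars,
                      target   = tChars,
                      stringsAsFactors = FALSE)
changes <- changes[changes$template != changes$target, ]
changes
</source>

Each row of <code>changes</code> is one sidechain that the modelling program has to replace; all other residues can be copied from the template.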
  
  
Line 326: Line 460:
 
&nbsp;<br>
 
&nbsp;<br>
  
Access the Swissmodel server at '''http://swissmodel.expasy.org''' and click on '''Start Modelling'''. Then, under the '''Supported Inputs''', click on '''Target-Template Alignment'''.
+
Access the Swissmodel server at '''http://swissmodel.expasy.org''' and click on the '''Start Modelling''' button. Under the '''Supported Inputs''', choose '''Target-Template Alignment'''.
  
 
{{task|1=
 
{{task|1=
*Paste your alignment for target and model into the form field. Click on the question mark next to "Supported Inputs" if you are not sure about the format. SwissModel will analyse the sequences and ask you to identify target and template. The YFO sequence is your target. The 1BM8 sequence is the template.
+
*Paste the aligned sequences of the YFO target and the 1BM8 template into the form field. SwissModel will analyse the sequences and ask you to identify target and template. The YFO sequence is your target. The 1BM8 sequence is the template.
  
* Click '''Validate Target Template Alignment''' and check that the returned alignment is correct.
+
* Click '''Validate Target Template Alignment''' and check that the returned alignment is correct. All non-identical residues are shown in light-grey.
  
*Click '''Build Model''' to start the modeling process.
+
*Click '''Build Model''' to start the modeling process. This will take about a minute or so.
  
* The resulting page returns information about the resulting model. Mouse over the '''Model 01''', open the '''PDB file''' and save the coordinates to your computer. Read the information on what is being returned by the server (click on the question mark icon). Study the quality measures.
+
* The resulting page returns information about the resulting model and its quality. You can rotate the model in the window on the right with the mouse. Regions that have a reddish hue have lower quality scores, i.e. they were harder to model or could not be modelled well with good geometry. Hovering the mouse over parts of the structure highlights the respective region of the sequence alignment.
 +
 
 +
*Mouse over the '''Model 01''' dropdown menu (under the icon of the template structure), and choose the '''PDB file'''. Note that the B-factor column of the coordinate section contains the QMEAN scores (between 0 and 1) that the server has calculated. Higher is better. Save the PDB file on your computer.
 +
 
 +
* Open the [http://swissmodel.expasy.org/docs/help SwissModel documentation] in a new tab and read about the modelling process. There are a number of important technical details that help to understand what the computed coordinates of your model mean. Pay special attention to the '''GMQE''' and '''QMEAN''' quality scores.
  
 
* Also save:
 
* Also save:
Line 343: Line 481:
 
}}
 
}}
  
==Model analysis==
+
 
 +
==Model interpretation==
 +
 
 +
 
 +
We have spent a significant amount of time preparing data for the analysis, and in practice it usually turns out that way: the preparation of data occupies the greatest part of our efforts. The actual computational analysis is generally quite fast. And, unfortunately, the '''interpretation of results''' is often somewhat neglected. Don't be that way. Data does not explain itself. The interpretation of your computational results is the most important part.
 +
 
 +
We will look at our homology model with two different questions:
 +
 
 +
* Can we define the DNA binding residues?
 +
* Can we tell which residues are conserved for functional reasons, rather than for structural reasons?
 +
 
  
 
&nbsp;
 
&nbsp;
Line 355: Line 503:
  
 
*What is the residue number of the first residue in the '''model'''? What should it be, based on the alignment? If the putative DNA binding region was reported to be residues 50-74 in the Mbp1 protein, which residues of your '''model''' correspond to that region?
 
*What is the residue number of the first residue in the '''model'''? What should it be, based on the alignment? If the putative DNA binding region was reported to be residues 50-74 in the Mbp1 protein, which residues of your '''model''' correspond to that region?
 +
 +
That's not easy to tell. But it should be.
 +
 
}}
 
}}
 
<!-- discuss flagging of loops - setting of B-factor to 99.0 phps. ANOLEA vs. Gromos ... packing vs. energy? -->
 
  
  
 
===R code: renumbering the model ===
 
===R code: renumbering the model ===
  
As you have seen, SwissModel numbers the first residue "1" and does not keep the numbering of the template. We should renumber the model so we can compare the model and the template with the same residue numbers. Fortunately there is a very useful R package that will help us with that.
+
As you have seen above, SwissModel numbers the first residue "1" and does not keep the numbering of the template. We should renumber the model so we can compare the model and the template with the same residue numbers. (An alternative would be to renumber the model to correspond to the sequence it came from. Remember that we have only excised a domain from the full-length sequence.) Carefully doing this by hand will take you a bit less than an hour. Fortunately there is a very useful R package that will help: '''bio3d'''.
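
The arithmetic behind such a renumbering is just a constant offset, provided the alignment has no indels. A minimal sketch in base '''R''', assuming that the first residue of the model corresponds to residue 4 of the template (the value used in the script below); the variable names are made up for illustration.

<source lang="rsplus">
iFirst     <- 4   # assumed: number the first residue should have
modelFirst <- 1   # SwissModel starts numbering at 1
offset     <- iFirst - modelFirst

# A region given in template numbering ...
regionTemplate <- 50:74

# ... and the same residues in the original model numbering
regionModel <- regionTemplate - offset
range(regionModel)
</source>

If your own alignment contains indels, a single offset is no longer sufficient and the correspondence has to be read from the alignment column by column.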
  
 
{{task|1=
 
{{task|1=
# Navigate to the [http://thegrantlab.org/bio3d/index.php '''bio3D'''] home page. '''bio3d''' is not available for installation via CRAN, but needs to be installed from source. Instructions for the different platforms are here http://thegrantlab.org/bio3d/tutorials/installing-bio3d Follow the instructions and install '''bio3d''' for '''R''' on your platform.
+
# Navigate to the [http://thegrantlab.org/bio3d/index.php '''bio3D'''] home page. '''bio3d''' has recently been made available via CRAN - previously it had to be compiled from source. 
  
# Explore and execute the following '''R''' script. I am assuming that your model is in your working directory, change paths and filenames as required.
+
 
 +
 
 +
# Explore and execute the following '''R''' script. I am assuming that your model is in your <code>PROJECTDIR</code> folder, change paths and filenames as required.
  
 
<source lang="rsplus">
 
<source lang="rsplus">
# renumberPDB.R
 
  
# This is a simple renumbering script that uses the bio3D
+
setwd(PROJECTDIR)
# package. We simply set the first residue number to what it
+
PDB_INFILE      <- "YFOmodel.pdb"
# should be and renumber all residues based on the first one.
+
PDB_OUTFILE    <- "YFOmodelRenumbered.pdb"
# The script assumes your input PDBfile is in your working
 
# directory.
 
  
# To run this, you must have installed the bio3D R package; instructions
 
# are here: http://thegrantlab.org/bio3d/tutorials/installing-bio3d
 
  
setwd("~/my/working/directory")
+
# The bio3d package provides functions for working with
PDBin      <- "YFO_model.pdb"
+
# protein structures in R
PDBout    <- "YFO_model_ren.pdb"
+
if (!require(bio3d, quietly=TRUE)) {
 +
install.packages("bio3d")
 +
library(bio3d)
 +
}
  
first <- 4  # residue number that the first residue should have
+
# == Read the YFO pdb file
 +
 
 +
iFirst <- 4  # residue number for the first residue
 
   
 
   
# ================================================
+
YFOmodel <- read.pdb(PDB_INFILE) # read the PDB file into a list
#    Read coordinate file
 
# ================================================
 
 
# read PDB file using bio3D function read.pdb()
 
library(bio3d)
 
pdb  <- read.pdb(PDBin) # read the PDB file into a list
 
  
pdb            # examine the information
+
YFOmodel          # examine the information
pdb$atom[1,]   # get information for the first atom
+
YFOmodel$atom[1,] # get information for the first atom
  
# you can explore ?read.pdb and study the examples.
+
# Explore ?read.pdb and study the examples.
  
# ================================================
+
# == Modify residue numbers for each atom
#   Change residue numbers
+
resNum <- as.numeric(YFOmodel$atom[ , "resno"])
# ================================================
+
resNum 
 +
resNum <- resNum - resNum[1] + iFirst  # add offset
 +
YFOmodel$atom[ , "resno"] <- resNum  # replace old numbers with new
  
 +
# check result
 +
YFOmodel$atom[ , "resno"]
 +
YFOmodel$atom[1, ]
  
resNum <- as.numeric(pdb$atom[,"resno"])  # get residue numbers for all atoms
+
# == Write output to file
resNum <- resNum + (first - resNum[1])         # calculate offset
+
write.pdb(pdb = YFOmodel, file = PDB_OUTFILE)
pdb$atom[,"resno"] <- resNum            # replace old numbers with new
 
pdb$atom[1,]                                  # check result
 
  
 +
# Done. Open the PDB file you have written in a text editor
 +
# and confirm that this has worked.
  
# ================================================
 
#    Write output to file
 
# ================================================
 
 
write.pdb(pdb=pdb,file=PDBout)
 
 
# Done. Open the PDB file you have written in a text editor and confirm
 
# that this has worked.
 
  
 
</source>
 
</source>
Line 425: Line 567:
  
 
&nbsp;
 
&nbsp;
 +
  
 
===First visualization===
 
===First visualization===
Line 438: Line 581:
 
# Hide the ribbon and choose '''backbone only &rarr; full'''. You will note that the backbone of the two structures is virtually identical.  
 
# Hide the ribbon and choose '''backbone only &rarr; full'''. You will note that the backbone of the two structures is virtually identical.  
 
# Next, choose '''Actions &rarr; Atoms/Bonds &rarr; show''' to display the two molecules in a stick style and note how the sidechains have been modeled. Note especially how sidechain coordinates have been guessed where the template had shorter sidechains than the target. It may be clearer if you hide H-atoms: '''Select &rarr; Chemistry &rarr; Element &rarr; H''' and '''Actions &rarr; Atoms/Bonds &rarr; hide'''.
 
# Next, choose '''Actions &rarr; Atoms/Bonds &rarr; show''' to display the two molecules in a stick style and note how the sidechains have been modeled. Note especially how sidechain coordinates have been guessed where the template had shorter sidechains than the target. It may be clearer if you hide H-atoms: '''Select &rarr; Chemistry &rarr; Element &rarr; H''' and '''Actions &rarr; Atoms/Bonds &rarr; hide'''.
# Display only residue 50 to 74 to focus on the putative helix-turn-helix domain. Choose '''Favourites &rarr; Sequence''', select the residues for one model, then '''Select &rarr; Invert (selected model)''' and '''Actions &rarr; Atoms/Bonds &rarr; hide'''.
+
# Display only residues 50 to 74 to focus on the putative helix-turn-helix domain. You can drag your mouse in the '''Favourites &rarr; Sequence''' window to select the range, then '''Select &rarr; Invert (selected model)''' and '''Actions &rarr; Atoms/Bonds &rarr; hide'''. Or you can use Chimera's commandline: <code>~display</code> to undisplay everything, <code>show #:50-74</code> to show this residue range for all models. 
# Study the result. A model of the HTH domain of YFO Mbp1.
+
# Study the result: a model of the HTH subdomain of YFO's RBM to Mbp1.
 
}}
 
}}
  
&nbsp;<br>
+
 
&nbsp;<br>
+
&nbsp;
  
 
==Coloring the model by energy ==
 
==Coloring the model by energy ==
  
SwissModel calculates energies for each residue of the model with a molecular mechanics forcefield. The SwissModel modeling summary page contains a plot of these energies as a function of sequence number like. The values - between 0.0 and 1.0 - are stored in the PDB files B-factor field.
+
SwissModel calculates energies for each residue of the model with a molecular mechanics forcefield. The SwissModel modeling summary page contains a plot of these energies as a function of sequence number. The values - between 0.0 and 1.0 - are stored in the PDB file's B-factor field.
  
  
 
{{task|1=
 
{{task|1=
# Back in Chimera, use the model panel to '''close''' the 1BM8 structure.
+
# Back in Chimera, use the model panel to '''close''' the 1BM8 structure. Select all and show Atoms, bonds to view the entire model structure.
 
# Choose '''Tools &rarr; Depiction &rarr; Render by attribute''' and select '''attributes of atoms''', '''Attribute: bfactor''', check '''color atoms''' and click '''OK'''.
 
# Choose '''Tools &rarr; Depiction &rarr; Render by attribute''' and select '''attributes of atoms''', '''Attribute: bfactor''', check '''color atoms''' and click '''OK'''.
# Study the result: It seems that residues in the core of the protein have better energies than residues at the surface. Why could that be the case?
+
# Study the result: It seems that residues in the core of the protein have better energies (higher values) than residues at the surface. Why could that be the case?
 
}}
 
}}
  
Study the options of this window a bit, rendering by attribute is a powerful way to store and depict all manners of information with the molecule. Simply write a little R script that uses bio3D to replace the B-factor or occupancy values with any value you might be interested in: energies, conservation scores, information ... whatever. The rewnder this property to map it on the 3D structure of your molecule. If you want to experience with this a bit, you could apply the information scores from the previous assignment to your model, using a script that is easy to derive from the renumbering R-script you have studied above.
+
Study the options of this window a bit: rendering by attribute is a powerful way to store and depict all manner of information with the molecule. You can simply write a little R script that uses bio3D to replace the B-factor or occupancy values with any value you might be interested in: energies, conservation scores, information ... whatever. Then render this property to map it on the 3D structure of your molecule...
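
The core of such a script is a lookup from residue number to score. Here is a minimal sketch in base '''R''' that uses a small mock atom table in place of a real coordinate file, so that it runs without any package. The column names <code>resno</code> and <code>b</code> follow the atom table that '''bio3d''' returns (an assumption you should verify for your installed version), and all values are invented.

<source lang="rsplus">
# Mock atom table standing in for the atom records of a PDB file
# (invented values; a real table would come from read.pdb())
atoms <- data.frame(resno = c(4, 4, 4, 5, 5, 6),
                    elety = c("N", "CA", "C", "N", "CA", "N"),
                    b     = 0,
                    stringsAsFactors = FALSE)

# Hypothetical per-residue scores (e.g. conservation), named by residue number
scores <- c("4" = 0.9, "5" = 0.2, "6" = 0.5)

# Look up the score for each atom's residue and store it in the B-factor column
atoms$b <- unname(scores[as.character(atoms$resno)])
atoms
</source>

With a real structure you would apply the same assignment to the atom table of the object you have read, write the file back out, and then use '''Render by attribute''' on the B-factor as described above.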
  
  
==Introduction==
+
&nbsp;
 
 
One of the really interesting questions we can discuss with reference to our homology model is how sequence variation might result in changed DNA recognition sites, and then lead to changed cognate DNA binding sequences. In order to address this, we would need to generate a plausible structural model for how DNA is bound to APSES domains.
 
 
 
Since there is currently no software available that would reliably model such a complex from first principles<ref>''Rosetta'' may get the structure approximately right, ''Autodock'' may get the complex approximately right, but the coordinate changes involved in induced fit makes the result unreliable - and we have no good way to validate whether the predicted complex is correct. </ref>, we will base a model of  a bound complex on homology modelling as well. This means we need to find a similar structure for which the position of bound DNA is known, then superimpose that structure with our model. This places the DNA molecule into the spatial context of the model we are studying. However, you may remember from the third assignment that the APSES domains in fungi seem to be a relatively small family. And there is no structure available of an APSES domain-DNA complex. How can we find a coordinate set of a structurally similar protein-DNA complex?
 
  
This assignment is based on the homology model you built. You will (1) identify similar structures of distantly related domains for which protein-DNA complexes are known, (2) assemble a hypothetical complex structure and (3) consider whether the available evidence allows you to distinguish between different modes of ligand binding.
 
 
==Modeling a DNA ligand==
 
  
 
&nbsp;
 
&nbsp;
  
&nbsp;
+
==Modelling DNA binding==
  
 +
One of the really interesting questions we can discuss with reference to our homology model is how sequence variation might result in changed DNA recognition sites, and then lead to changed cognate DNA binding sequences. In order to address this, we would need to generate a plausible structural model for how DNA is bound to APSES domains.
  
===Finding a similar protein-DNA complex===
+
Since there is currently no software available that would reliably model such a complex from first principles<ref>''Rosetta'' may get the structure approximately right, ''Autodock'' may get the complex approximately right, but the coordinate changes involved in induced fit make the result unreliable - and we have no good way to validate whether the predicted complex is correct. </ref>, we will base a model of a bound complex on homology modelling as well. This means we need to find a similar structure for which the position of bound DNA is known, then superimpose that structure with our model. This places the DNA molecule into the spatial context of the model we are studying. It so happens that early in 2015 an APSES domain structure with bound DNA was published. You probably noticed it as a result of the PDB BLAST search: [http://www.rcsb.org/pdb/explore/explore.do?structureId=4UX5 '''4UX5'''], from the ''Magnaporthe oryzae'' Mbp1 orthologue PCG2<ref>{{#pmid: 25550425}}</ref>.
  
  
&nbsp;<br>
+
<!-- But can we also find (and align) distant relatives based purely on '''structural similarity''', ideally a protein-DNA complex? -->
  
Remember that homologous sequences can have diverged to the point where their sequence similarity is no longer recognizable, however their structure may be quite well conserved. Thus if we could find similar structures in the PDB, these might provide us with some plausible hypotheses for how DNA is bound by APSES domains. We thus need a tool similar to BLAST, not for the purpose of sequence alignment, but for structure alignment. A kind of BLAST for structures. Just like with sequence searches, we might not want to search with the entire protein if what we are interested in is a subdomain that binds to DNA. Attempting to match all structural elements in addition to the ones we are actually interested in is likely to make the search less specific - we would find false positives that are similar to some irrelevant part of our structure. However, defining too small a subdomain would also lead to a loss of specificity: in the extreme it is easy to imagine that the search for e.g. a single helix would retrieve very many hits that would be quite meaningless.
 
  
 +
===A homologous protein/DNA complex structure===
  
 
 
 
At the '''NCBI''', [http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml VAST] is provided as a search tool for structural similarity search.
 
  
 
{{task|1=
 
{{task|1=
# Navigate to the [http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml '''VAST'''] search interface page.
 
# Enter <code>1bm8</code> as the PDB ID to search for and click '''Go'''.
 
# Follow the link to '''Related Structures'''.
 
# Study the result.
 
}}
 
 
 
You will see that VAST finds more than 3,000 partially similar structures, but it would be almost impossibly tedious to manually search through the list for ''structures of protein DNA complexes'' that are ''similar to the interacting core of the APSES domain''. It turns out that our search is not specific enough in two ways: we have structural elements in our PDB file that are unnecessary for the question at hand, and thus cause the program to find irrelevant matches. But if we constrain ourselves to just a single helix and strand (i.e. the 50-74 subdomain that has been implicated in DNA binding), the search will become too non-specific. Also we have no good way to retrieve functional information from these hits: which ones are DNA-binding proteins that bind DNA through residues of this subdomain, and for which the structure of a complex has been solved? It seems we need to define our question more precisely.
 
 
{{task|1=
 
# Open VMD and load the 1BM8 structure or your YFO homology model.
 
# Display the backbone as a '''Trace''' (of CA atoms) and color by '''Index'''
 
# In the sequence viewer, highlight residues 50 to 74.
 
# In the representations window, find the yellow representation (with Color ID 4) that the sequence viewer has generated. Change the '''Drawing Method''' to '''NewCartoon'''.
 
# Now (using stereo), study the topology of the region. Focus on the helix at the N-terminus of the highlighted subdomain,  it is preceded by a turn and another helix. This first helix makes interactions with the beta hairpin at the C-terminal end of the subdomain and is thus important for the orientation of these elements. (This is what is referred to as a helix-turn-helix motif, or HtH motif, it is very common in DNA-binding proteins.)
 
# Holding the shift key in the alignment viewer, extend your selection until you cover all of the first helix, and the residues that contact the beta hairpin. I think that the first residue of interest here is residue 33.
 
# Again holding the shift key, extend the selection at the C-terminus to include the residues of the beta hairpin to where they contact the helix at the N-terminus. I think that the last residue of interest here is residue 79.
 
# Study the topology and arrangement of this compact subdomain. It contains the DNA-binding elements and probably most of the interactions that establish its three-dimensional shape. This subdomain even has a name: it is a ''winged helix'' DNA binding motif, a member of a very large family of DNA-binding domains. I have linked a review by Gajiwala and Burley to the end of this page; note that their definition of a canonical winged helix motif is a bit larger than what we have here, with an additional helix at the N-terminus and a second "wing".
 
}}
 
  
 +
; The PCG2 / DNA complex
  
Armed with this insight, we can attempt again to find meaningfully similar structures.  At the '''EBI''' there are a number of very well designed structure analysis tools linked off the [http://www.ebi.ac.uk/Tools/structural.html '''Structural Analysis''' page]. As part of its MSD Services, [http://www.ebi.ac.uk/msd-srv/ssm/ '''PDBeFold'''] provides a convenient interface for structure searches for our purpose
+
* Open Chimera and load the '''<code>4UX5</code>''' structure. Spend some time exploring it. There are two domains of identical sequence bound to one DNA molecule. The first question I would have is whether the two molecules bind to the same DNA motif - the CGCG core of the "MCB-box", and whether the observed protein:DNA interfaces are actually with the cognate sequence, or whether one (or both) proteins are non-specific complexes. The conditions under which proteins crystallize can be harsh, and physiological function under these conditions is not guaranteed.<ref>This particular crystal structure however was crystallized from a Tris-buffer with 50mM NaCl at pH 8.0 - comparatively gentle conditions actually.</ref> Indeed, Liu ''et al.'' (2015) report that at low concentrations a 1:1 complex is formed and the 2:1 Protein:DNA complex only forms at high concentrations. Figure 3. of their paper shows that the detailed contacts between protein and DNA are in fact '''not''' identical.
  
{{task|1=
+
* Without taking this question too far, let's get a quick view of the comparison by duplicating one domain of the structure and superimposing it on the other. The authors feel that chain <code>A</code> represents the tighter, more specific mode of interaction; so we will duplicate chain <code>B</code> and superpose the copy on <code>A</code>.
# Navigate to the [http://www.ebi.ac.uk/msd-srv/ssm/ '''PDBeFold'''] search interface page.
 
# Enter <code>1bm8</code> for the '''PDB code''' and choose '''Select range''' from the drop down menu. Select the residues you have defined above<!-- Select Domain would be better but is currently broken :-( Secondary Structure elements 4 to 7 i.e. those elements that span the range you have previously defined.-->.
 
# Note that you can enter the lowest acceptable match % separately for query and target. This means: what percentage of secondary structure elements would need to be matched in either query or target to produce a hit. Keep that value at 80 for our query, since we would want to find structures with almost all of the elements of the winged helix motif. Set the  match to 10 % for the target, since we are interested in such domains even if they happen to be small subdomains of large proteins.
 
# Keep the '''Precision''' at '''normal'''. Precision and % query match could be relaxed if we wanted to find more structures.
 
#  Finally click on: '''Submit your query'''.
 
# On the results page, click on the index number (in the left-hand column) of the top hit '''that is not one of our familiar Mbp1 structures''' to get a detailed view of the result. Most likely this is <code>1wq2:a</code>, an enzyme. Click on '''View Superposed'''. This will open a window with the structure coordinates superimposed in the Jmol molecular viewer. Control-click anywhere in the window area to open a menu of viewing options. Select '''Style &rarr; Stereographic &rarr; Wall-eyed viewing'''. Select '''Trace''' as the rendering. Then study the superposition. You will note that the secondary structure elements match quite well, but does this mean we have a DNA-binding domain in this sulfite reductase?
 
}}
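To make the two match thresholds concrete, here is a minimal Python sketch of the arithmetic as it is described in the task above: the percentage of matched secondary structure elements (SSEs) is computed separately for the query and for the target. The function names and the SSE counts are hypothetical and chosen only for illustration; this is not PDBeFold's own code.

```python
def sse_match_percent(matched: int, total: int) -> float:
    """Percentage of secondary structure elements (SSEs) that were matched."""
    if total <= 0:
        raise ValueError("total number of SSEs must be positive")
    return 100.0 * matched / total


def passes_thresholds(matched, query_sses, target_sses,
                      min_query=80.0, min_target=10.0):
    """True if a hit satisfies both the query and the target match threshold."""
    query_pct = sse_match_percent(matched, query_sses)
    target_pct = sse_match_percent(matched, target_sses)
    return query_pct >= min_query and target_pct >= min_target


# Hypothetical hit: a 4-SSE query, fully matched, inside a 30-SSE target.
print(passes_thresholds(4, 4, 30))  # True: 100 % of query, 13.3 % of target
# Only 3 of 4 query SSEs matched: 75 % is below the 80 % query threshold.
print(passes_thresholds(3, 4, 30))  # False
# Fully matched query inside a 50-SSE target: 8 % is below the 10 % target threshold.
print(passes_thresholds(4, 4, 50))  # False
```

Seen this way, the low target threshold simply allows a small motif to count as a hit even when it is embedded in a much larger protein.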
 
  
 +
* In Chimera, open the '''Favorites''' &rarr; '''Model Panel''' and use the '''copy/combine''' button to create a copy of the <code>4UX5</code> model. Call it <code>test</code>.
 +
* '''Select''' chain B of the <code>test</code> model, then use '''Select''' &rarr; '''Invert (selected models)''' to apply the selection to everything in the <code>test</code> model '''except''' chain B.
 +
* Use '''Actions''' &rarr; '''Atoms/Bonds''' &rarr; '''delete''' to remove everything ''but'' Chain B.
 +
* Select and colour the chain red.
 +
 * Back on the Model Panel, select both models and use the '''match...''' dialogue to open a '''MatchMaker''' dialogue window. Choose the radio button to match two specific chains and select <code>4UX5</code> chain A as the '''Reference chain''', <code>test</code> chain B as the '''Chain to match'''. Click '''Apply'''.
  
All in all this appears to be well engineered software! It gives you many options to access result details for further processing. I think this can be put to very good use. But for our problem, we would have to search through too many structures because, once again, we can't tell which ones of the hits are DNA binding domains, especially domains for which the structure of a complex has been solved.
+
You will see that the superimposed structures are very similar, that the main difference is in the orientation of the disordered C-terminus, but also that there is a structural difference between the two structures around Gly 84 which inserts into the minor groove of the double helix.
  
 +
* Select one of the residues of that loop in chain A by &lt;control&gt;-clicking on it and use '''Action''' &rarr; '''Set pivot''' to set the centre of rotation to that residue: this makes it easier to visualize the binding situation when you make the molecules larger.
  
[[Image:A5_Mbp1_subdomain.jpg|frame|none|Stereo-view of a subdomain within the 1MB1 structure that includes residues 36 to 76. The color gradient ramps from blue (36) to green (76) and the "wing" is clearly seen as the green pair of beta-strands, extending to the right of the helix-turn-helix motif.]]
+
* Select residues 81 to 87 (sequence <code>VQGGYGKY</code>) in both chains, turn their ribbon display off, and display this range as "sticks".
 +
* Select '''nucleic acid''' in the '''structure''' submenu and turn ribbons and nucleotide objects off to display the DNA as sticks as well. Colour the DNA by element.
 +
 * Study the situation. Focus on Gly 84.A, especially the interaction of its carbonyl oxygen, which hydrogen bonds to the N2 atom of G8 in chain D. Gln 89.A hydrogen bonds to the N2 atom of G8 in chain C. Gly 84 and Gln 82 thus recognize a G:C C:G pair. In the B chain, Gly 84.B does not contact the DNA well, since it contacts residues of chain A, especially Gln 82.A. The carbonyl atom of Gly 84.B hydrogen bonds to Gln 89.B, and therefore Gln 89.B is not available to contact nucleotide bases. What do you think<ref>Besides the coordinate difference between the chains, if chain B were indeed representative of a DNA "scanning" conformation, perhaps one should expect that the local DNA structure that chain B binds to is structurally closer to canonical B-DNA than the DNA binding interface of chain A...</ref>? It seems to me that a crucial interaction for the cognate sequence is contributed by Guanine 8.
 +
* Finally, use the Model Panel to select <code>test</code> and '''close''' it.
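When judging contacts such as the one between the Gly 84 carbonyl oxygen and a guanine N2 atom, it helps to remember that a hydrogen bond is usually inferred from the distance between donor and acceptor atoms. The following is a minimal Python sketch of that check. The coordinates are made up for illustration and are '''not''' taken from 4UX5, and the 3.5 Å cutoff is a common rule of thumb that I am assuming here, not a value given in this assignment.

```python
import math


def distance(a, b):
    """Euclidean distance between two (x, y, z) coordinates, in Angstrom."""
    return math.dist(a, b)


def plausible_hbond(donor_xyz, acceptor_xyz, cutoff=3.5):
    """Distance-only heuristic: donor and acceptor within `cutoff` Angstrom.

    The cutoff is an assumed rule of thumb; bond angles are ignored.
    """
    return distance(donor_xyz, acceptor_xyz) <= cutoff


# Hypothetical coordinates, for illustration only.
n2_atom = (0.0, 0.0, 0.0)         # stand-in for a guanine N2 (donor)
carbonyl_o = (1.0, 2.0, 2.0)      # stand-in for a backbone O (acceptor)
far_atom = (3.0, 4.0, 0.0)

print(distance(n2_atom, carbonyl_o))         # 3.0
print(plausible_hbond(n2_atom, carbonyl_o))  # True
print(plausible_hbond(n2_atom, far_atom))    # False, 5.0 Angstrom apart
```

A distance test alone cannot confirm a hydrogen bond, since geometry matters too, but it is a quick way to decide which contacts deserve a closer look in the viewer.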
  
&nbsp;<br>
 
  
APSES domains represent one branch of the tree of helix-turn-helix (HTH) DNA binding modules. (A review on HTH proteins is linked from the resources section at the bottom of this page). Winged Helix domains typically bind their cognate DNA with a "recognition helix" which precedes the beta hairpin and binds into the major groove; additional stabilizing interactions are provided by the edge of a beta-strand binding into the minor groove. This is good news: once we have determined that the APSES domain is actually an example of a larger group of transcription factors, we can compare our model to a structure of a protein-DNA complex. Superfamilies of such structural domains are compiled in the CATH database. Unfortunately CATH itself does not provide information about whether the structures have been determined as complexes. '''But''' we can search the PDB with CATH codes and restrict the results to complexes. Essentially, this should give us a list of all winged helix domains for which the structure of complexes with DNA have been determined. This works as follows:
 
 
{{task|1=
 
* For reference, access [http://www.cathdb.info/superfamily/1.10.10.10 CATH domain superfamily 1.10.10.10]; this is the CATH classification code we will use to find protein-DNA complexes. Click on '''Superfamily Superposition''' to get a sense of the structural core of the winged helix domain.
 
 
# Navigate to the [http://www.pdb.org/ PDB home page] and follow the link to [http://www.pdb.org/pdb/search/advSearch.do Advanced Search]
 
# In the options menu for '''Choose a Query Type''' select '''Structure Features &rarr; CATH Classification Browser'''. A window will open that allows you to navigate down through the CATH tree. You can view the Class/Architecture/Topology names on the CATH page linked above. Click on '''the triangle icons''' (not the text) for '''Mainly Alpha &rarr; Orthogonal Bundle &rarr; ARC repressor mutant, subunit A''' then click on the link to '''winged helix repressor DNA binding domain'''. Or, just enter "winged helix" into the search field. This subquery should match more than 550 coordinate entries.
 
# Click on the '''(+)''' button behind '''Add search criteria''' to add an additional query. Select the option '''Structure Features &rarr; Macromolecule type'''. In the option menus that pop up, select '''Contains Protein&rarr;Yes, Contains DNA&rarr;Yes, Contains RNA&rarr;Ignore, Contains DNA/RNA hybrid&rarr;Ignore'''. This selects files that contain Protein-DNA complexes.
 
# Check the box below this subquery to '''Remove Similar Sequences at 90% identity''' and click on '''Submit Query'''. This query should retrieve more than 100 complexes.
 
# Scroll down to the beginning of the list of PDB codes and locate the '''Reports''' menu. Under the heading '''View''' select '''Gallery'''. This is a fast way to obtain an overview of the structures that have been returned. Adjust the number of '''Results''' to see all 100 images and choose '''Options&rarr;Resize medium'''.
 
# Finally we have a set of winged-helix domain/DNA complexes, for comparison. Scroll through the gallery and study how the protein binds DNA.
 
 
}}
 
}}
 
 
First of all you may notice that in fact not all of the structures are really different, despite having requested only to retrieve dissimilar sequences, and not all images show DNA. This appears to be a deficiency of the algorithm. But you can also easily recognize how in most of the structures the '''recognition helix inserts into the major groove of B-DNA''' (e.g. 1BC8, 1CF7) and the wing - if clearly visible at all in the image - appears to make accessory interactions with the DNA backbone. There is one exception: the structure 1DP7 shows how the human RFX1 protein binds DNA in a non-canonical way, through the beta-strands of the "wing". This is interesting since it suggests there is more than one way for winged helix domains to bind to DNA. We can therefore use structural superposition of '''your homology model''' and '''two of the winged-helix proteins''' to decide whether the canonical or the non-canonical mode of DNA binding seems to be more plausible for Mbp1 orthologues.
 
  
  
Line 548: Line 650:
 
&nbsp;
 
&nbsp;
  
===Preparation and superposition of a canonical complex===
+
===Superimposing your model===
  
&nbsp;<br>
+
Both your homology model and the template structure provide valuable information:
 +
* The template structure shows how conserved the structure is at the protein/DNA interface. You have seen what subtle differences can give rise to a sequence specific complex and a non-specific binding mode. For Mbp1 we know that the APSES domain binds to the same cognate DNA sequence as PCG2. Since your model structure is heavily biased towards the template, evaluating the template in the context of a real protein/DNA complex allows you to judge which binding residues appear to be conserved and possibly modelled in an orientation that is productive for binding.
  
The structure we shall use as a reference for the '''canonical binding mode''' is the Elk-1 transcription factor.
+
* The model structure maps sequence variation into that context: are the crucial residues for sequence specific binding conserved?
 
 
[[Image:A5_canonical_wHTH.jpg|frame|none|Stereo-view of the canonical DNA binding mode of the Winged Helix domain family. Shown here is the Elk-1 transcription factor - an ETS DNA binding domain - in complex with a high-affinity binding site (1DUX). Note how the "recognition helix" inserts into the major groove of the DNA molecule. The color gradient ramps from blue (34) to green (84). Note how the first helix of the "helix-turn-helix" architecture serves only to position the recognition helix and makes few interactions by itself.]]
 
 
 
The 1DUX coordinate-file contains two protein domains and two B-DNA dimers in one asymmetric unit. For simplicity, you should delete the second copy of the complex from the PDB file. (Remember that PDB files are simply text files that can be edited.)
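Because PDB files are fixed-column text, the same edit can also be scripted. The following is a minimal Python sketch, assuming that the second copy of the complex uses chain IDs D, E and F (as stated in the task that follows) and that the chain identifier sits in column 22 of each coordinate record, as the PDB format specifies. The function name is hypothetical; doing the edit by hand in a text editor is equally valid.

```python
def strip_second_copy(lines, drop_chains=("D", "E", "F")):
    """Return PDB lines without the given chains, HETATM and MASTER records."""
    kept = []
    for line in lines:
        record = line[:6].strip()
        if record in ("HETATM", "MASTER"):
            continue  # hetero atoms (e.g. water) and the bookkeeping record
        # Column 22 (index 21) holds the chain identifier.
        if (record in ("ATOM", "ANISOU", "TER")
                and len(line) > 21 and line[21] in drop_chains):
            continue
        kept.append(line)
    return kept


# Tiny made-up example, not real 1DUX coordinates.
example = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "ATOM      2  N   MET D   1      17.340  14.430   2.614  1.00  9.67           N",
    "HETATM    3  O   HOH A 101      10.000  10.000  10.000  1.00 20.00           O",
    "END",
]
print(len(strip_second_copy(example)))  # 2: the chain A atom and END remain
```

To apply this to a downloaded file, read its lines, pass them through the function and write the result under a new name, so that the original file stays untouched.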
 
  
 
{{task|1=
 
{{task|1=
# Find the 1DUX structure in the image gallery and open the 1DUX structure explorer page in a separate window. Download the coordinates to your computer.
 
# Open the coordinate file in a text-editor (TextEdit or Notepad - '''NOT''' MS-Word!) and delete the coordinates for chains <code>D</code>,<code>E</code> and <code>F</code>; you may also delete all <code>HETATM</code> records and the <code>MASTER</code> record. Save the file with a different name, e.g. 1DUX_monomer.pdb .
 
# Open VMD and load your homology model. Turn off the axes, display the model as a Tube representation in stereo, and color it by Index. Then load your edited 1DUX file, display this coordinate set in a tube representation as well, and color it by ColorID in some color you like. It is important that you can distinguish easily which structure is which.
 
# You could use the Extensions&rarr;Analysis&rarr;RMSD calculator interface to superimpose the two structures '''IF''' you knew which residues correspond to each other. Sometimes it is useful to do exactly that: define exact correspondences between residue pairs and superimpose according to these selected pairs. For our purpose it is much simpler to use the Multiseq tool (and the structures are simple and small enough that the STAMP algorithm for structural alignment can define corresponding residue pairs automatically). Open the '''multiseq''' extension window, select the check-boxes next to both protein structures, and open the '''Tools&rarr;Stamp Structural Alignment''' interface.
 
# In the '''Stamp Alignment Options''' window, check the radio-button for ''Align the following ...'' '''Marked Structures''' and click on '''OK'''.
 
# In the '''Graphical Representations''' window, double-click on all "NewCartoon" representations for both molecules, to undisplay them.
 
# You should now see a superimposed tube model of your homology model and the 1DUX protein-DNA complex. You can explore it, display side-chains etc. and study some of the details of how a transcription factor recognizes and binds to its cognate DNA sequence. However, remember that your '''model''''s side-chain orientations have not been determined experimentally but inferred from the '''template''', and that the template's structure was determined in the absence of bound DNA ligand.
 
  
# Orient and scale your superimposed structures so that their structural similarity is apparent, and the recognition helix can be clearly seen inserting into the DNA major groove. You may want to keep a copy of the image for future reference. Consider which parts of the structure appear to superimpose best.  Note whether it is plausible that your '''model''' could bind a B-DNA double-helix in this orientation.
+
* Start by loading your model and the 1BM8 structure into your Chimera session. Select all, turn all ribbons off, and set all atoms to stick representation. Then select H atoms by element and '''hide''' them.
}}
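The idea of superimposing "according to selected pairs" rests on a simple quantity: the root mean square deviation (RMSD) over corresponding atoms. The following minimal Python sketch computes the RMSD for coordinates that are already paired and already superimposed; it does '''not''' perform the optimal fit itself, which is what the RMSD calculator or STAMP do for you. The function name and the coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of paired (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Made-up example: every atom of the second set is shifted by 1 Angstrom in z.
set_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
set_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(set_a, set_b))  # 1.0
```

The value depends entirely on which residues are paired, which is why defining the correspondences is the hard part of the problem.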
 
  
&nbsp;<br>
+
* We need to visualize and evaluate differences in binding between different proteins and for me it works well to colour everything by element, and give the carbon atoms some identifying, distinct colour. This is best achieved through the Chimera command line that you can turn on with the little "computer" icon on the left-hand side of the graphics window. Have a look at the [https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/framecommand.html Chimera Users guide], and choose '''select''' to learn how Chimera's selection syntax works.
&nbsp;
+
* Open the Model Panel to check which protein has which Chimera-internal model number. Then you can use the following selection syntax. Instead of the model numbers, I will type <code>&lt;YFO&gt;</code>, <code>&lt;4ux5&gt;</code>, and <code>&lt;1BM8&gt;</code> - you will certainly know by now that these are placeholder labels and you need to replace them with the numbers <code>0</code>, <code>1</code>, and <code>2</code> instead.
  
 +
:* To colour the DNA carbon atoms white, type:<br />
 +
::<code>color white #&lt;4ux5&gt;:.C,.D & C</code>
  
===Preparation and superposition of a non-canonical complex===
+
:* To colour the 4ux5 A chain carbon atoms grey, type:<br />
 +
 ::<code>color #878795 #&lt;4ux5&gt;:.A & C</code>  <small>Note: the color values after the first hash are rgb triplets in the hexadecimal numbering system - exactly like in '''R'''.</small>
  
 +
:* To undisplay the 4ux5 B chain, type:<br />
 +
::<code>~display #&lt;4ux5&gt;:.B</code> <small>Note: this is the tilde character, not a hyphen or minus sign.</small>
  
The structure displaying a non-canonical complex between a winged-helix domain and its cognate DNA binding site is the human Regulatory Factor X.
+
:* To colour the YFO model carbon atoms a pale reddish color, type:<br />
 +
::<code>color #b06268 #&lt;YFO&gt; & C</code>
  
[[Image:A5_non-canonical_wHTH.jpg|frame|none|Stereo-view of a non-canonical wHTH-DNA complex, discovered with the structure of human Regulatory Factor X (hRFX) binding its cognate X-box DNA sequence (1DP7). Note how the helix that corresponds to the recognition helix in the canonical domain lies across the minor groove whereas the beta-"wing" inserts into the major groove. The color gradient ramps from blue (18) to green (68).]]
+
:* To colour the 1BM8 structure carbon atoms a pale greenish color, type:<br />
 +
::<code>color #92b098 #&lt;1BM8&gt; & C</code>
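As a quick check on the hexadecimal colour values used in the commands above, here is a minimal Python sketch that decodes such a code into its red, green and blue components on the 0 to 255 scale. The function name is hypothetical; the colour codes are the ones from the commands.

```python
def hex_to_rgb(code):
    """Decode a colour such as '#878795' into an (r, g, b) tuple of 0-255 values."""
    code = code.lstrip("#")
    if len(code) != 6:
        raise ValueError("expected six hexadecimal digits")
    # Each pair of hex digits encodes one colour channel.
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


print(hex_to_rgb("#878795"))  # (135, 135, 149): grey with a slight blue tint
print(hex_to_rgb("#b06268"))  # (176, 98, 104): pale reddish
print(hex_to_rgb("#92b098"))  # (146, 176, 152): pale greenish
```

The dominant channel tells you the hue at a glance: red is largest in the second code, green in the third.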
  
 +
* Ready? Let's superimpose the chains.
 +
** Select all models in the Model Panel and click on '''match'''.
 +
** Set 4ux5 Chain A as the Reference chain.
 +
** Select YFO as a '''Chain to match''', select the button for specific reference and specific match, and click '''Apply'''.
 +
** Repeat this with 1BM8 as the match chain.
  
Before we can work with this however, we have to fix an annoying problem. If you download and view the <code>1DP7</code> structure in VMD, you will notice that there is only a single strand of DNA! Where is the second strand of the double helix? It is not in the coordinate file, because it happens to be exactly equivalent to the first strand, rotated around a two-fold axis of symmetry in the crystal lattice. We need to download and work with the so-called '''Biological Assembly''' instead.  But there is a problem related to the way the PDB stores replicates in biological assemblies. The PDB generates the additional chains as copies of the original and delineates them with <code>MODEL</code> and <code>ENDMDL</code> records, just like in a multi-structure NMR file. The chain IDs and the atom numbers are the same as the original. The PDB file thus contains the '''same molecule in two different orientations''', not '''two independent molecules'''. This is an important difference regarding how such molecules are displayed by VMD. '''If you try to use the biological unit file of the PDB, VMD does not recognize that there is a second molecule present and displays only one chain.''' And that looks exactly like the one we have seen before. We have to edit the file, extract the second DNA molecule, change its chain ID and then append it to the original 1DP7 structure<ref>My apologies if this is tedious. '''But''' in the real world, we encounter such problems a lot and I would be remiss not to use this opportunity to let you practice how to fix the issue that could otherwise be a roadblock in a project of yours.</ref>...
+
* Easy. Now enlarge the binding site. Remember that 4ux5 and 1bm8 are independently determined crystal structures, whereas YFO was modelled on 1bm8 and is expected to be '''very''' similar to it. To give you some guidance on what you should focus on, select the 4ux5 residue 84 CA atom and display it as '''Ball & Stick'''. You can also repeat the '''Action''' &rarr; '''Set pivot''' step in case the pivot has shifted.
  
{{task|1=
+
* Study the scene. This is where stereo vision will help '''a lot'''.
# On the structure explorer page for 1DP7, select the option '''Download Files''' &rarr; '''PDB File'''.
 
# Also select the option '''Download Files''' &rarr; '''Biological Assembly'''.
 
# Uncompress the biological assembly file.
 
# Open the file in a text editor.
 
# Delete everything except the '''second DNA molecule'''. This comes after the <code>MODEL  2</code> line and has chain ID '''D'''. Keep the <code>TER</code> and <code>END</code> lines. Save this with a new filename (e.g. <code>1DP7_DNAonly.pdb</code>).
 
# Also delete all <code>HETATM</code> records for <code>HOH</code>, <code>PEG</code> and <code>EDO</code>, as well as the entire second protein chain and the <code>MASTER</code> record. The resulting file should only contain the DNA chain and its copy and one protein chain. Save the file with a new name, eg. <code>1DP7_BDNA.PDB</code>.
 
# Use a similar procedure as [[BIO_Assignment_Week_8#R code: renumbering the model in the last assignment]] to change the  chain ID.
 
  
<source lang="rsplus">
+
* What do you think? Is this what you expected? Can you explain what you see? Was the modelling process successful?
library(bio3d)  # provides read.pdb() and write.pdb()

PDBin <- "1DP7_DNAonly.pdb"
 
PDBout <- "1DP7_DNAnewChain.pdb"
 
 
 
pdb  <- read.pdb(PDBin)
 
pdb$atom[,"chain"] <- "E"
 
write.pdb(pdb=pdb,file=PDBout)
 
</source>
 
 
 
# Use your text-editor to open both the <code>1DP7.pdb</code> structure file and the  <code>1DP7_DNAnewChain.pdb</code>. Copy the DNA coordinates, paste them into the original file before the <code>END</code> line and save.
 
# Open the edited coordinate file with VMD. You should see '''one protein chain''' and a '''B-DNA double helix'''. (Actually, the B-DNA helix has a gap, because the R-library did not read the BRDU nucleotide as DNA). Switch to stereo viewing and spend some time to see how '''amazingly beautiful''' the complementarity between the protein and the DNA helix is (you might want to display ''protein'' and ''nucleic'' in separate representations and color the DNA chain by ''Position'' &rarr; ''Radial'' for clarity) ... in particular, appreciate how not all positively charged side chains contact the phosphate backbone, but some penetrate into the helix and make detailed interactions with the nucleobases!
 
# Then clear all molecules
 
# In VMD, open '''Extensions&rarr;Analysis&rarr;MultiSeq'''. When you run MultiSeq for the first time, you will be asked for a directory in which to store metadata. You can use the default, or a directory of your choice; you may subsequently skip all steps that ask you to install "required" databases locally since we will not need them for this task.
 
# Choose '''File&rarr;Import Data''', browse to your directory and load one by one:
 
:: -Your model;
 
:: -The 1DUX complex;
 
:: -The 1DP7 complex.
 
# Mark all three protein chains by selecting the checkbox next to their name and choose '''Tools&rarr; STAMP structural alignment'''.
 
# '''Align''' the '''Marked Structures''', choose a '''scanscore''' of '''2''' and '''scanslide''' of '''5'''. Also choose '''Slow scan'''. You may have to play around with the settings to get the molecules to superimpose: but they '''can''' be superimposed quite well - at least the DNA-binding helices and the wings should line up.
 
# In the graphical representations window, double-click on the cartoon representations that multiseq has generated to undisplay them, also undisplay the Tube representation of 1DUX. Then create a Tube representation for 1DP7, and select a Color by ColorID (a different color that you like). The resulting scene should look similar to the one you have created above, only with 1DP7 in place of 1DUX and colored differently.
 
# Orient and scale your superimposed structures so that their structural similarity is apparent, and the differences in binding elements are clear. Perhaps visualizing a solvent accessible surface of the DNA will help you understand the spatial requirements of the complex formation. You may want to keep a copy of the image for future reference. Note whether it is plausible that your '''model''' could bind a B-DNA double-helix in the "alternative" conformation.
 
}}
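The two text-level operations in the task above, extracting the copy stored under <code>MODEL 2</code> and giving it a new chain ID, can be sketched in a few lines of Python. This is a minimal illustration, assuming only what is stated above: the copy is delimited by <code>MODEL</code> and <code>ENDMDL</code> records, and the chain identifier sits in column 22 of each coordinate record. The function names and the example lines are hypothetical; the example does not contain real 1DP7 coordinates.

```python
def extract_model(lines, model_number):
    """Return ATOM/HETATM/TER lines found between 'MODEL n' and 'ENDMDL'."""
    inside = False
    selected = []
    for line in lines:
        record = line[:6].strip()
        if record == "MODEL":
            fields = line.split()
            inside = len(fields) > 1 and int(fields[1]) == model_number
        elif record == "ENDMDL":
            inside = False
        elif inside and record in ("ATOM", "HETATM", "TER"):
            selected.append(line)
    return selected


def relabel_chain(lines, old, new):
    """Replace chain ID `old` with `new` in column 22 (index 21) of each line."""
    relabelled = []
    for line in lines:
        if len(line) > 21 and line[21] == old:
            line = line[:21] + new + line[22:]
        relabelled.append(line)
    return relabelled


# Made-up biological assembly with the same DNA atom stored in two models.
example = [
    "MODEL        1",
    "ATOM      1  P    DG D   1      10.000  10.000  10.000  1.00 20.00           P",
    "ENDMDL",
    "MODEL        2",
    "ATOM      1  P    DG D   1     -10.000 -10.000  10.000  1.00 20.00           P",
    "ENDMDL",
]
second_copy = relabel_chain(extract_model(example, 2), "D", "E")
print(len(second_copy))    # 1
print(second_copy[0][21])  # E
```

Once the copy carries its own chain ID, a viewer can treat it as an independent molecule, which is exactly what the manual editing procedure achieves.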
 
  
 +
<!-- I see that the model is very good regarding the global fold, but completely different in the binding loop. This is not expected. -->
  
&nbsp;
+
* Now turn the display of 4ux5 chain B back on and turn chain A off instead. Then superimpose the 1BM8 template and your model on Chain B.
  
<!--
 
===Coloring by conservation===
 
  
With the superimposed coordinates, you can begin to get a sense whether either or both binding modes could be appropriate for a protein-DNA complex in your Mbp1 orthologue. But these are geometrical criteria only, and the protein in your species may be flexible enough to adopt a different conformation in a complex, and different again from your model. A more powerful way to analyze such hypothetical complexes is to look at conservation patterns. With VMD, you can import a sequence alignment into the MultiSeq extension and color residues by conservation. The protocol below assumes
+
* Again, focus on the binding region. What do you think of that? What would you have expected? Do you see a difference? What does this all mean?
  
*You have prealigned the reference Mbp1 proteins with your species' Mbp1 orthologue;
 
*You have saved the alignment in a CLUSTAL format.
 
  
You can use Jalview or any other MSA server to do so. You can even do this by hand - there should be few if any indels and the correct alignment is easy to see.
+
}}
  
{{task|1=
 
;Load the Mbp1 APSES alignment into MultiSeq.
 
  
:(A) In the MultiSeq Window, navigate to '''File &rarr; Import Data...'''; Choose "From Files" and Browse to the location of the alignment you have saved. The File navigation window gives you options which files to enable: choose to Enable <code>ALN</code> files (these are CLUSTAL formatted multiple sequence alignments).
+
Nb. I haven't seen this before and I am completely intrigued by the results. In fact, I think I understand the protein much, much better now through this exercise. I'm very pleased how this turned out.
:(B) Open the alignment file, click on '''Ok''' to import the data, it will take a short while to load. If the data can't be loaded, the file may have the wrong extension: .aln is required.
 
:(C) find the Mbp1_SACCE sequence in the list, click on it and move it to the top of the Sequences list with your mouse (the list is not static, you can re-order the sequences in any way you like).  
 
  
You will see that the 1MB1 sequence and the APSES domain sequence do not match: at the N-terminus the sequence that corresponds to the PDB structure has extra residues, and in the middle the APSES sequences may have gaps inserted.
 
  
;Bring the 1MB1 sequence in register with the APSES alignment.
 
:(A)MultiSeq supports typical text-editor selection mechanisms. Clicking on a residue selects it, clicking on a row selects the whole sequence. Dragging with the mouse selects several residues, shift-clicking selects ranges, and option-clicking toggles the selection on or off for individual residues. Using the mouse and/or the shift key as required, select the '''entire first column''' of the sequences you have imported. 
 
:(B) Select '''Edit &rarr; Enable Editing... &rarr; Gaps only''' to allow changing indels.
 
:(C) Pressing the spacebar once should insert a gap character before the '''selected column''' in all sequences. Insert as many gaps as you need to align the beginning of sequences with the corresponding residues of 1MB1: <code>S I M ...</code>
 
:(D) Now insert as many gaps as you need into the structure sequence, to align it completely with the Mbp1_SACCE APSES domain sequence. (Simply select residues in the sequence and use the space bar to insert gaps. (Note: I have noticed a bug that sometimes prevents slider or keyboard input to the MultiSeq window; it fails to regain focus after operations in a different window. I don't know whether this is a Mac related problem or a more general bug in MultiSeq. When this happens I quit VMD and restore the session from a saved state. It is a bit annoying but not mission-critical.)
 
:(E) When you are done, it may be prudent to save the state of your alignment. Use '''File &rarr; Save Session...'''
 
 
;Color by similarity
 
:(A) Use the '''View &rarr; Coloring &rarr; Sequence similarity &rarr; BLOSUM30''' option to color the residues in the alignment and structure. This clearly shows you where conserved and variable residues are located and allows to analyze their structural context.
 
:(B) You can adjust the color scale in the usual way by navigating to '''VMD main &rarr; Graphics &rarr; Colors...''', choosing the Color Scale tab and adjusting the scale midpoint.
 
:(C) Navigate to the '''Representations''' window and apply the color scheme to your tube-and-sidechain representation: double-click on the NewCartoon representation to hide it and use '''User''' coloring of your ''Tube'' and ''Licorice'' representations to apply the sequence similarity color gradient that MultiSeq has calculated.
 
 
&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
 
* Once you have colored the residues of your model by conservation, create another informative stereo-image and paste it into your assignment.
 
}}
 
  
 
&nbsp;
 
&nbsp;
-->
 
 
== Interpretation==
 
<!--
 
Analysis of the ligand binding site:
 
 
* http://dnasite.limlab.ibms.sinica.edu.tw/
 
* http://proline.biochem.iisc.ernet.in/pocketannotate/
 
* http://www.biosolveit.de/PoseView/
 
 
*Comparison with seq2logo
 
{{#pmid: 19483101}}
 
*protedna server PMID: 19483101
 
* http://serv.csbb.ntu.edu.tw/ProteDNA/
 
* http://protedna.csie.ntu.edu.tw/
 
* Multi Harmony
 
{{#pmid: 20525785}}
 
 
-->
 
 
 
 
{{task|1=
 
# Spend some time studying the complex.
 
# Recapitulate in your mind how we have arrived at this comparison, in particular, how this was possible even though the sequence similarity between these proteins is low - none of these winged helix domains came up as a result of our previous BLAST search in the PDB.
 
# You should clearly think about the following question: considering the position of the two DNA helices relative to the YFO structural model, which binding mode appears to be more plausible for protein-DNA interactions in the YFO Mbp1 APSES domains? Is it the canonical, or the non-canonical binding mode? Is there evidence that allows you to distinguish between the  two modes?
 
# Before you quit VMD, save the "state" of your session so you can reload it later. We will look at residue conservation once we have built phylogenetic trees. In the main VMD window, choose '''File&rarr;Save State...'''.
 
}}
 
 
<!--
 
== R code: conservation scores and sequence weighting==
 
-->
 
  
 
== Links and resources ==
 
== Links and resources ==
{{#pmid: 22407712}}
 
 
  
 
:* [http://www.wwpdb.org/documentation/format23/v2.3.html '''PDB file format'''] (see the Coordinate Section if you are unsure about chain identifiers)
 
:* [http://www.wwpdb.org/documentation/format23/v2.3.html '''PDB file format'''] (see the Coordinate Section if you are unsure about chain identifiers)
Line 697: Line 715:
  
  
;Reference sequences
+
<!-- ;Reference sequences
  
 
:* [[Reference Mbp1 orthologues (all fungi)|'''Mbp1 ortholog sequences (all fungi)''']]
 
:* [[Reference Mbp1 orthologues (all fungi)|'''Mbp1 ortholog sequences (all fungi)''']]
 
+
-->
  
 
<!-- {{#pmid: 19957275}} -->
 
<!-- {{#pmid: 19957275}} -->
Line 711: Line 729:
  
 
<table style="width:100%;"><tr>
 
<table style="width:100%;"><tr>
<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_6|&lt;&nbsp;Assignment&nbsp;6]]</td>
+
<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_7|&lt;&nbsp;Assignment&nbsp;7]]</td>
<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[BIO_Assignment_Week_8|Assignment&nbsp;8&nbsp;&gt;]]</td>
+
<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[BIO_Assignment_Week_9|Assignment&nbsp;9&nbsp;&gt;]]</td>
 
</tr></table>
 
</tr></table>
  

Latest revision as of 21:23, 4 December 2016

Assignment for Week 8
Predictions: Homology Modeling

< Assignment 7 Assignment 9 >

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on the next quiz.


Introduction

In order to understand how specific residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to study an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. Explanations of a protein's observed properties and functions can't rely on the general fact that it binds DNA; we need to consider details in terms of specific residues and their spatial arrangement. In particular, it would be interesting to correlate the conservation patterns of key residues with their potential to make specific DNA binding interactions. Unfortunately, the experimental evidence we have considered in Assignment 2 (Taylor et al., 2000) is not sufficient to unambiguously define the details of how a DNA double helix might be bound. Moreover, at least two distinct modes of DNA binding are known for proteins of the winged-helix superfamily, of which the APSES domain is a member.

In this assignment you will construct a molecular model of the APSES domain from the Mbp1 RBM orthologue in your assigned species.

For what follows, please remember this terminology:

Target
The protein that you are planning to model.
Template
The protein whose structure you are using as a guide to build the model.
Model
The structure that results from the modelling process. It has the Target sequence and is similar to the Template structure.

 

A brief overview article on the construction and use of homology models is linked from the resource section at the bottom of this page. That section also contains links to other sites and resources you might find useful or interesting.


 

 


A Point Mutation

To illustrate how homology modelling works in principle, let's consider changing a single amino acid in the sequence, based on a structural template.

Such minimal changes to structure models can be done directly in Chimera. Let us consider the residue A 42 of the 1BM8 structure. It is oriented towards the core of the protein, but most other Mbp1 orthologs have a larger amino acid in this position, V, or even I.

Task:

  1. Open 1BM8 in Chimera, hide the ribbons and show all atoms as a stick model.
  2. Color the protein white.
  3. Open the sequence window and select A 42. Color it red. Choose Actions → Set pivot. Then study how nicely the alanine sidechain fits into the cavity formed by its surrounding residues.
  4. To emphasize this better, hide the solvent molecules and select only the protein atoms. Display them as a sphere model to better appreciate the packing, i.e. the van der Waals contacts we discussed in class. Use the Favorites → Side view panel to move the clipping plane and see a section through the protein. Study the packing; in particular, note that the additional methyl groups of a valine or isoleucine would not have enough space in the structure. Then restore the clipping planes so you can see the whole molecule.
  5. Let's simplify the view: choose Actions → Atoms/Bonds → backbone only → chain trace. Then select A 42 again in the sequence window and choose Actions → Atoms/Bonds → show.
  6. Add the surrounding residues: choose Select → Zone.... In the window, see that the box is checked that selects all atoms at a distance of less than 5 Å to the current selection, and check the lower box to select the whole residue of any atom that matches the distance cutoff criterion. Click OK and choose Actions → Atoms/Bonds → show.
  7. Select A 42 again: left-click (control click) on any atom of the alanine to select the atom, then up-arrow to select the entire residue. Now let's mutate this residue to isoleucine.
  8. Choose Tools → Structure Editing → Rotamers and select ILE as the rotamer type. Click OK, a window will pop up that shows you the possible rotamers for isoleucine together with their database-derived probabilities; you can select them in the window and cycle through them with your arrow keys. But note that the probabilities are very different - and thus show you high-energy and low-energy rotamers to choose from. Therefore, unless you have compelling reasons to do otherwise, try to find the highest-probability rotamer that may fit. This is where your stereo viewing practice becomes important, if not essential. It is really, really hard to do this reasonably in a 2D image! It becomes quite obvious in 3D. Btw: I find such "quantitative" work - where the real distances are important - easier in orthographic than in perspective view (cf. the Camera panel).
  9. I find that the first rotamer is actually not such a bad fit. The CD atom comes close to the sidechains of I 25 and L 96. But we can assume that these are somewhat mobile and can accommodate a denser packing, because - as you can easily verify in your Jalview alignment - it is NOT the case that sequences that have I 42, have a smaller residue in position 25 and/or 96. So let's accept the most frequent ILE rotamer by selecting it in the rotamer window and clicking OK (while existing side chain(s): replace is selected).
  10. Done.

If you want to go over this in more detail, check the video tutorial on YouTube published by the NIAID bioinformatics group here. I would also encourage you to go over Part 2 of the video tutorial that discusses how to check for and resolve (by energy minimization) steric clashes. But do remember that it is not clear whether energy minimization will make your structure more correct in the sense of a smaller overall RMSD with the real, mutated protein.

What we have done here with one residue is exactly the way homology modeling works with entire sequences. The homology modelling program simply changes all amino acids to the residues of the target sequence, based on the template structure. Let's now build a homology model for YFO Mbp1.


 

Preparation

  • We need to define our Target sequence;
  • find a suitable structural Template; and
  • build a Model.


Target sequence

We have encountered the PDB 1BM8 structure before: the APSES domain of Saccharomyces cerevisiae Mbp1. This is a useful template to model the DNA binding domain of your RBM match. But what exactly is the aligned region of the APSES domain? We could use several approaches to define the APSES domain:

  • we could use the Biostrings package to calculate a pairwise sequence alignment with the 1BM8 sequence, like we did previously for the full-length sequences. This would give us the domain boundaries.
  • we could calculate a multiple sequence alignment, while including the 1BM8 sequence. This would also allow us to infer domain boundaries, actually in all sequences in our database at once. But we have found previously that such multiple sequence alignments are quite sensitive to un-alignable regions of which we have quite a few in the full length sequences. We do need an MSA, but we do need to restrict the length of the sequences we align to a reasonable region.
  • we could access the domain annotations at CDD or at the SMART Database, but both have interfaces that are difficult to use computationally, and have other issues: NCBI does not recognize APSES domains, only the smaller KilA-N domain, and SMART sometimes does not find APSES domains in our sequences.
  • the most straightforward approach of course is to use the annotation that you already have produced for the APSES domain in MBP1_<YFO>. You should be able to simply take the MBP1_SACCE sequence and the one for YFO from the APSES.mfa file.

This is the 1BM8 sequence:

>SACCE
QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI
LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF


Template choice and template sequence

The SWISS-MODEL server provides several different options for constructing homology models. The easiest option requires only a target sequence as input. In this mode the program will automatically choose suitable templates and create an input alignment. I would argue however that that is not the best way to use such a service: template choice and alignment both may be significantly influenced by biochemical reasoning, and an automated algorithm cannot make the necessary decisions. Should you use a structure of reduced resolution that however has a ligand bound? Should you move an indel from an active site to a loop region even though the sequence similarity score might be less? Questions like that may yield answers that are different from the best choices an automated algorithm could make. But Swiss Model is flexible and allows us to upload an explicit alignment between target and template. Please note: the model you will produce is "easy" - the sequence similarity is high and there are no indels to consider, the automated mode would have done just as well. But the strategy we pursue here is suitable also for much more difficult problems. The automated strategy probably is not.

Template choice is the first step. Often more than one related structure can be found in the PDB. The degree of sequence identity is the most important criterion, but there are many other factors to consider. Please refer to the template choice principles page on this Wiki where I discuss more details and alternatives. To find related structures, you can search the PDB itself through its Advanced Search interface; for example one can search for sequence similarity with a BLAST search, or search for structural similarity by accessing structures according to their CATH or SCOP classification. But the BLAST search is probably the method of choice: after all, the most important measure of the probability of success for homology modelling is sequence similarity.


Defining a template means finding a PDB coordinate set that has sufficient sequence similarity to your target that you can build a model based on that template. To find suitable PDB structures, we will perform a BLAST search at the PDB.




Task:

  1. Retrieve your aligned YFO's Mbp1 RBM APSES domain sequence from the APSES.mfa selection you have prepared for the phylogeny assignment. This YFO sequence is your target sequence.
  2. Navigate to the PDB.
  3. Click on Advanced to enter the advanced search interface.
  4. Open the menu to Choose a Query Type:
  5. Find the Sequence features section and choose Sequence (BLAST...)
  6. Paste your target sequence into the Sequence field, select not to mask low-complexity regions and Submit Query. Since the E-value is set rather high by default, you will get a number of low-confidence hits as well as the actual homologs, these have very low E-values.

All hits that are homologs are potentially suitable templates, but some are more suitable than others. Consider how the coordinate sets differ and which features would make each more or less suitable for creating a homology model: you should consider ...

  • sequence similarity to your target
  • size of expected model (= length of alignment)
  • presence or absence of ligands
  • experimental method and quality of the data set

Sequence similarity is the most important, but we can have the PDB tabulate the other features concisely for this task.

  1. There is a menu to create Reports: select customizable table.
  2. Select (at least) the following information items:
       Structure Summary
         • Experimental Method
       Sequence
         • Chain Length
       Ligands
         • Ligand Name
       Biological details
         • Macromolecule Name
       Refinement Details
         • Resolution
         • R Work
         • R free
  3. Click: Create report.

Unfortunately you don't get the E-values into the report, and those should strongly influence your final decision. However in our case the sequences and therefore the E-values of the top three hits are all the same. And there is a new structure from January 2015, with a lower resolution. Some of the sequences have a longer chain-length ... but those are only disordered residues (otherwise these would be better suited templates; regrettably, you'd need to check that in the real world, there is no automatic tool to evaluate disorder and its effects on template choice). In my opinion that leaves pretty much only one unambiguous choice for our template: 1BM8.

Finally
Click on the 1BM8 ID to navigate to the structure page for the template and save the FASTA sequence to your computer. This is the template sequence.


 

Sequence numbering

 

It is not at all straightforward how to number sequences in such a project. A "natural" numbering starts with the start-codon of the full length protein and goes sequentially from there. However, this does not map exactly to other numbering schemes we have encountered. As you know the first residue of the APSES domain (as defined by CDD) is not Residue 1 of the Mbp1 protein. The first residue of the 1BM8 FASTA file (one of the related PDB structures) is the fourth residue of the Mbp1 protein. The first residue in the structure is GLN 4, therefore Q is the first residue in a FASTA sequence derived from the coordinate section of the PDB file (the ATOM records). In the 1MB1 structure, the original N-terminal amino acids are present in the molecule, therefore they are present in the FASTA file which starts with MSNQIY..., but they are disordered in the structure and no coordinates are present for M and S. A sequence derived explicitly from the coordinates is therefore different from the reported FASTA sequence, which is really bad because that is what the modeling program has to work with ... and so on. It can get complicated. You need to remember: a sequence number is not absolute, but assigned in a particular context and you need to be careful how to do this.

Fortunately, the numbering for the residues in the coordinate section of our template structure corresponds not to its FASTA sequence, but to the numbering of the gene. Otherwise we would need to renumber the sequence (e.g. by using the bio3d R package). If we did not do this, the sequence numbers in the model might not correspond to the sequence numbers of our target.
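
The offset between a domain's own numbering and the numbering of the full-length protein can be found by locating the domain sequence within the longer one. The following is a minimal base-R sketch, not part of the original assignment. It assumes that the sequence begins with MSN followed by the 1BM8 sequence, as described for the 1MB1 FASTA file above; `fullLength` is only this N-terminal stretch, not the complete Mbp1 protein, and all variable names are illustrative.

```r
# Domain sequence as listed for the 1BM8 FASTA file
domain <- paste0("QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI",
                 "LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF")

# Assumption: an N-terminal stretch standing in for the full-length protein
fullLength <- paste0("MSN", domain)

# Position of the first and last domain residue in full-length numbering
iFirst <- as.integer(regexpr(domain, fullLength, fixed = TRUE))
iLast  <- iFirst + nchar(domain) - 1

cat(sprintf("Domain spans residues %d to %d\n", iFirst, iLast))
```

An exact string match like this only works when the domain sequence is contained verbatim in the longer sequence; for a homologue you would need an alignment instead.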


 


The input alignment

The sequence alignment between target and template is the single most important factor that determines the quality of your model. No comparative modeling process will repair an incorrect alignment; it is useful to think of a homology model as a three-dimensional map of a sequence alignment rather than as a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these just because they are convenient. You should take advantage of the much more sophisticated alignment methods available. Analysis of wrong models can't be expected to produce right results.

The best possible alignment is usually constructed from a multiple sequence alignment that includes at least the target and template sequence and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. Your alignment should have been carefully reviewed by you and wherever required, manually adjusted to move insertions or deletions between target and template out of the secondary structure elements of the template structure.

In most of the Mbp1 orthologues, we do not observe indels in the APSES domain regions. Evolutionary pressure on the APSES domains has selected against indels in the more than 600 million years these sequences have evolved independently in their respective species. To obtain an alignment between the template sequence and the target sequence from your species, proceed as follows.


 

Task:
Choose one of the following options to align your target and template sequence. Make sure your template sequence is included, i.e. the FASTA sequence of 1BM8.


In Jalview...
  • Load your APSES domain sequences plus the 1BM8 sequence in Jalview. Include the sequence of your template protein and align using Muscle.
  • Delete all sequences you no longer need, i.e. keep only the APSES domains of the target (from your species) and the template (from the PDB) and choose Edit → Remove empty columns. This is your input alignment.
  • Choose File → Output to textbox → FASTA to obtain the aligned sequences. They should both have exactly the same length, i.e. N- or C-termini have to be padded by hyphens if the original sequences had different lengths. Save the sequences in a text-file.


Using a different MSA program
  • Copy the FASTA formatted sequences of the Mbp1 proteins in the reference species from the Reference APSES domain page.
  • Access the MSA tools page at the EBI.
  • Paste the Mbp1 sequence set, your target sequence and the template sequence into the input form.
  • Run an alignment (I like T-coffee) and save the output.


Using the R bioconductor MSA package that you used previously.

Refer back to the page if you are lacking notes how to go about this.


Whatever method you use: the result should be a two sequence alignment in multi-FASTA format, that was constructed from a number of supporting sequences and that contains your aligned target and template sequence. This is your input alignment for the homology modeling server. For a Schizosaccharomyces pombe model, which I am using as an example here, it looks like this:

>1BM8_A 
QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI
LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
>Mbp1_SCHPO 2-100 NP_593032
AVHVAVYSGVEVYECFIKGVSVMRRRRDSWLNATQILKVADFDKPQRTRV
LERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPILSL


In this case, there are no indels and therefore no hyphens - in your case there may be.
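
Before submitting an input alignment it is worth checking that both rows have the same number of columns, and looking at how many columns are gapped or identical. The following is a minimal base-R sketch using the two example sequences above; the variable names are illustrative and the calculation is a simple column count, not the scoring that any alignment program or server uses.

```r
# The two rows of the example input alignment (template and target)
template <- paste0("QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI",
                   "LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF")
target   <- paste0("AVHVAVYSGVEVYECFIKGVSVMRRRRDSWLNATQILKVADFDKPQRTRV",
                   "LERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPILSL")

# Every row of an alignment must have the same number of columns
stopifnot(nchar(template) == nchar(target))

tpl <- strsplit(template, "")[[1]]
tgt <- strsplit(target, "")[[1]]

isGap       <- tpl == "-" | tgt == "-"   # columns with a hyphen in either row
isIdentical <- tpl == tgt & !isGap       # identical, ungapped columns

nGaps       <- sum(isGap)
pctIdentity <- 100 * sum(isIdentical) / sum(!isGap)

cat(sprintf("%d columns, %d gapped, %.1f%% identity\n",
            length(tpl), nGaps, pctIdentity))
```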


 


Homology model

The alignment defines the residue by residue relationship between target and template sequence. All we need to do now is to change every residue of the template to the corresponding residue of the target sequence.


SwissModel

 

Access the Swissmodel server at http://swissmodel.expasy.org and click on the Start Modelling button. Under the Supported Inputs, choose Target-Template Alignment.

Task:

  • Paste the aligned sequences of the YFO target and the 1BM8 template into the form field. SwissModel will analyse the sequences and ask you to identify target and template. The YFO sequence is your target. The 1BM8 sequence is the template.
  • Click Validate Target Template Alignment and check that the returned alignment is correct. All non-identical residues are shown in light-grey.
  • Click Build Model to start the modeling process. This will take about a minute or so.
  • The resulting page returns information about the resulting model and its quality. You can rotate the model in the window on the right with the mouse. Regions that have a reddish hue have lower quality scores, i.e. they were harder to model or could not be modelled well with good geometry. Hovering the mouse over parts of the structure highlights the respective region of the sequence alignment.
  • Mouse over the Model 01 dropdown menu (under the icon of the template structure), and choose the PDB file. Note that the B-factor column of the coordinate section contains the QMEAN scores (between 0 and 1) that the server has calculated. Higher is better. Save the PDB file on your computer.
  • Open the SwissModel documentation in a new tab and read about the modelling process. There are a number of important technical details that help to understand what the computed coordinates of your model mean; you should pay special attention to the GMQE and QMEAN quality scores.
  • Also save:
    • The output page as pdf (for reference)
    • The modeling report (as pdf)


Model interpretation

We have spent a significant amount of time preparing data for the analysis, and in practice it usually turns out that way: the preparation of data occupies the greatest part of our efforts. The actual computational analysis is generally quite fast. And, unfortunately, the interpretation of results is often somewhat neglected. Don't be that way. Data does not explain itself. The interpretation of your computational results is the most important part.

We will look at our homology model with two different questions:

  • Can we define the DNA binding residues?
  • Can we tell which residues are conserved for functional reasons, rather than for structural reasons?


   

The PDB file

 

Task:
Open your model coordinates in a text-editor (make sure you view the PDB file in a fixed-width font (like "courier") so all the columns line up correctly) and consider the following questions:

  • What is the residue number of the first residue in the model? What should it be, based on the alignment? If the putative DNA binding region was reported to be residues 50-74 in the Mbp1 protein, which residues of your model correspond to that region?

That's not easy to tell. But it should be.


R code: renumbering the model

As you have seen above, SwissModel numbers the first residue "1" and does not keep the numbering of the template. We should renumber the model so we can compare the model and the template with the same residue numbers. (An alternative would be to renumber the model to correspond to the sequence it came from. Remember that we have only excised a domain from the full-length sequence.) Carefully doing this by hand will take you a bit less than an hour. Fortunately there is a very useful R package that will help: bio3d.

Task:

  1. Navigate to the bio3d home page. bio3d has recently been made available via CRAN - previously it had to be compiled from source.
  2. Explore and execute the following R script. I am assuming that your model is in your PROJECTDIR folder, change paths and filenames as required.
setwd(PROJECTDIR)
PDB_INFILE      <- "YFOmodel.pdb"
PDB_OUTFILE     <- "YFOmodelRenumbered.pdb"


# The bio3d package provides functions for working with 
# protein structures in R 
if (!require(bio3d, quietly=TRUE)) { 
	install.packages("bio3d")
	library(bio3d)
}

# == Read the YFO pdb file

iFirst <- 4  # residue number for the first residue
 
YFOmodel <- read.pdb(PDB_INFILE) # read the PDB file into a list

YFOmodel           # examine the information
YFOmodel$atom[1,]  # get information for the first atom

# Explore ?read.pdb and study the examples.

# == Modify residue numbers for each atom
resNum <- as.numeric(YFOmodel$atom[ , "resno"])
resNum
resNum <- resNum - resNum[1] + iFirst  # add offset
YFOmodel$atom[ , "resno"] <- resNum    # replace old numbers with new

# check result
YFOmodel$atom[ , "resno"]
YFOmodel$atom[1, ]

# == Write output to file
# (PDB_OUTFILE was defined at the top of the script)
write.pdb(pdb = YFOmodel, file = PDB_OUTFILE)

# Done. Open the PDB file you have written in a text editor
# and confirm that this has worked.
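
The offset arithmetic in the script above can be tried on a toy vector without a PDB file. This is a minimal sketch: `resNum` here is a made-up vector of per-atom residue numbers, and `regionMbp1` uses the residue range 50 to 74 that was mentioned earlier as the putative DNA binding region.

```r
iFirst <- 4                          # number of the first template residue
resNum <- c(1, 1, 1, 2, 2, 3, 3, 3)  # toy per-atom residue numbers, starting at 1

# Same calculation as in the script: shift so the first residue becomes iFirst
renumbered <- resNum - resNum[1] + iFirst
renumbered

# The reverse mapping: Mbp1 residues 50 to 74, expressed in the numbering
# that a model has when its first residue is numbered 1
regionMbp1  <- c(50, 74)
regionModel <- regionMbp1 - iFirst + 1
regionModel
```

Subtracting `resNum[1]` first means the calculation does not depend on the model actually starting at residue 1.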


 


First visualization

 

Since a homology model inherits its structural details from the template, your model of the YFO sequence should look very similar to the original 1BM8 structure.

Task:

  1. Start Chimera and load the model coordinates that you have just renumbered.
  2. From the PDB, also load the template structure. (Use File → Fetch by ID ...)
  3. In the Favorites → Model Panel window you can switch between the two molecules.
  4. Hide the ribbon and choose backbone only → full. You will note that the backbone of the two structures is virtually identical.
  5. Next, choose Actions → Atoms/Bonds → show to display the two molecules in a stick style and note how the sidechains have been modeled. Note especially how sidechain coordinates have been guessed where the template had shorter sidechains than the target. It may be clearer if you hide H-atoms: Select → Chemistry → Element → H and Actions → Atoms/Bonds → hide
  6. Display only residues 50 to 74 to focus on the putative helix-turn-helix domain. You can drag your mouse in the Favorites → Sequence window to select the range, then Select → Invert (selected model) and Actions → Atoms/Bonds → hide. Or you can use Chimera's commandline: ~display to undisplay everything, show #:50-74 to show this residue range for all models.
  7. Study the result: a model of the HTH subdomain of YFO's RBM to Mbp1.


 

Coloring the model by energy

SwissModel calculates energies for each residue of the model with a molecular mechanics forcefield. The SwissModel modeling summary page contains a plot of these energies as a function of sequence number. The values - between 0.0 and 1.0 - are stored in the PDB file's B-factor field.
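
Because the values sit in a fixed-width field of each ATOM record, they can also be read with nothing more than base R. The following is a minimal sketch, assuming the standard PDB column layout (residue number in columns 23 to 26, B-factor field in columns 61 to 66). The three ATOM records are made-up toy data built by a hypothetical helper `atomLine()`, not lines from a real model.

```r
# Hypothetical helper: build an ATOM record in the fixed-width PDB layout
atomLine <- function(serial, name, resName, chain, resSeq, x, y, z, occ, b) {
  sprintf("ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f",
          serial, name, resName, chain, resSeq, x, y, z, occ, b)
}

# Toy records (coordinates and scores are invented for illustration)
pdbLines <- c(
  atomLine(1L, " N  ", "ALA", "A", 4L, 11.104, 6.134, -6.504, 1, 0.50),
  atomLine(2L, " CA ", "ALA", "A", 4L, 11.639, 6.071, -5.147, 1, 0.60),
  atomLine(3L, " N  ", "VAL", "A", 5L, 12.000, 7.000, -4.000, 1, 0.80)
)

atoms  <- pdbLines[grepl("^ATOM", pdbLines)]
resSeq <- as.integer(substr(atoms, 23, 26))  # columns 23-26: residue number
bFac   <- as.numeric(substr(atoms, 61, 66))  # columns 61-66: B-factor field

# Average the per-atom values for each residue
perResidue <- tapply(bFac, resSeq, mean)
perResidue
```

With a real file you would replace the toy records with `readLines()` on your model's PDB file.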


Task:

  1. Back in Chimera, use the model panel to close the 1BM8 structure. Select all and show Atoms, bonds to view the entire model structure.
  2. Choose Tools → Depiction → Render by attribute and select attributes of atoms, Attribute: bfactor, check color atoms and click OK.
  3. Study the result: It seems that residues in the core of the protein have better energies (higher values) than residues at the surface. Why could that be the case?

Study the options of this window a bit; rendering by attribute is a powerful way to store and depict all manner of information with the molecule. You can simply write a little R script that uses bio3d to replace the B-factor or occupancy values with any value you might be interested in: energies, conservation scores, information ... whatever. Then render this property to map it on the 3D structure of your molecule...
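
To illustrate the idea, here is a minimal sketch that overwrites the B-factor field with per-residue scores. Note the assumptions: it works directly on the text of the ATOM records in base R rather than through bio3d (where the same values are held in the b column of the atom table), the helper functions `atomLine()` and `setBfactor()` are hypothetical, and both the toy records and the scores are invented.

```r
# Hypothetical helper: build an ATOM record in the fixed-width PDB layout
atomLine <- function(serial, name, resName, chain, resSeq, x, y, z, occ, b) {
  sprintf("ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f",
          serial, name, resName, chain, resSeq, x, y, z, occ, b)
}

pdbLines <- c(
  atomLine(1L, " N  ", "ALA", "A", 4L, 11.104, 6.134, -6.504, 1, 0.50),
  atomLine(2L, " CA ", "ALA", "A", 4L, 11.639, 6.071, -5.147, 1, 0.60),
  atomLine(3L, " N  ", "VAL", "A", 5L, 12.000, 7.000, -4.000, 1, 0.80)
)

# Invented per-residue scores, named by residue number
scores <- c("4" = 0.25, "5" = 0.90)

# Replace columns 61-66 of every ATOM/HETATM record whose residue has a score
setBfactor <- function(lines, values) {
  isAtom <- grepl("^(ATOM|HETATM)", lines)
  resSeq <- trimws(substr(lines, 23, 26))
  newB   <- unname(values[resSeq])      # NA where no score is available
  idx    <- isAtom & !is.na(newB)
  if (any(idx)) {
    substr(lines[idx], 61, 66) <- sprintf("%6.2f", newB[idx])
  }
  lines
}

outLines <- setBfactor(pdbLines, scores)
writeLines(outLines)
```

Values should stay within what the six-character field can hold, otherwise the columns of the record no longer line up.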


 


 

Modelling DNA binding

One of the really interesting questions we can discuss with reference to our homology model is how sequence variation might result in changed DNA recognition sites, and then lead to changed cognate DNA binding sequences. In order to address this, we would need to generate a plausible structural model for how DNA is bound to APSES domains.

Since there is currently no software available that would reliably model such a complex from first principles[1], we will base a model of a bound complex on homology modelling as well. This means we need to find a similar structure for which the position of bound DNA is known, then superimpose that structure with our model. This places the DNA molecule into the spatial context of the model we are studying. It so happens that early in 2015 an APSES domain structure with bound DNA was published. You probably noticed it as a result of the PDB BLAST search: 4UX5, from the Magnaporthe oryzae Mbp1 orthologue PCG2[2].



A homologous protein/DNA complex structure

Task:

The PCG2 / DNA complex
  • Open Chimera and load the 4UX5 structure. Spend some time exploring it. There are two domains of identical sequence bound to one DNA molecule. The first question I would have is whether the two molecules bind to the same DNA motif - the CGCG core of the "MCB-box", and whether the observed protein:DNA interfaces are actually with the cognate sequence, or whether one (or both) proteins are non-specific complexes. The conditions under which proteins crystallize can be harsh, and physiological function under these conditions is not guaranteed.[3] Indeed, Liu et al. (2015) report that at low concentrations a 1:1 complex is formed and the 2:1 Protein:DNA complex only forms at high concentrations. Figure 3. of their paper shows that the detailed contacts between protein and DNA are in fact not identical.
  • Without taking this question too far, let's get a quick view of the comparison by duplicating one domain of the structure and superimposing it on the other. The authors feel that chain A represents the tighter, more specific mode of interaction; so we will duplicate chain B and superpose the copy on A.
  • In Chimera, open the Favorites → Model Panel and use the copy/combine button to create a copy of the 4UX5 model. Call it test.
  • Select chain B of the test model, then use Select → Invert (selected models) to apply the selection to everything in the test model except chain B.
  • Use Actions → Atoms/Bonds → delete to remove everything but Chain B.
  • Select and colour the chain red.
  • Back on the Model Panel, select both models and use the match... dialogue to open a MatchMaker dialogue window. Choose the radio button to match two specific chains and select 4UX5 chain A as the Reference chain, test chain B as the Chain to match. Click Apply.

You will see that the superimposed structures are very similar, that the main difference is in the orientation of the disordered C-terminus, but also that there is a structural difference between the two structures around Gly 84 which inserts into the minor groove of the double helix.

  • Select one of the residues of that loop in chain A by <control>-clicking on it and use Actions → Set pivot to set the centre of rotation to that residue: this makes it easier to visualize the binding situation when you make the molecules larger.
  • Select residues 81 to 87 (sequence VQGGYGKY) in both chains, turn their ribbon display off and display this range as "sticks".
  • Select nucleic acid in the structure submenu and turn ribbons and nucleotide objects off to display the DNA as sticks as well. Colour the DNA by element.
  • Study the situation. Focus on Gly 84.A, especially the interaction of its carbonyl oxygen, which hydrogen bonds to the N2 atom of G8.D chain. Gln 89.A hydrogen bonds to the N2 atom of G8.C chain. Gly 84 and Gln 82 thus recognize a G:C C:G pair. In the B chain, Gly 84.B does not contact the DNA well, since it contacts residues of chain A, especially Gln 82.A. The carbonyl atom of Gly 84.B hydrogen bonds to Gln 89.B, and therefore Gln 89.B is not available to contact nucleotide bases. What do you think[4]? It seems to me that a crucial interaction for the cognate sequence is contributed by Guanine 8.
  • Finally, use the Model Panel to select test and close it.


 

Superimposing your model

Both your homology model and the template structure provide valuable information:

  • The template structure shows how conserved the structure is at the protein/DNA interface. You have seen what subtle differences can give rise to a sequence specific complex and a non-specific binding mode. For Mbp1 we know that the APSES domain binds to the same cognate DNA sequence as PCG2. Since your model structure is heavily biased towards the template, evaluating the template in the context of a real protein/DNA complex allows you to judge which binding residues appear to be conserved and possibly modelled in an orientation that is productive for binding.
  • The model structure maps sequence variation into that context: are the crucial residues for sequence specific binding conserved?

Task:

  • Start by loading your model and the 1BM8 structure into your chimera session. Select all, turn all ribbons off, and set all atoms to stick representation. Then select H atoms by element and hide them.
  • We need to visualize and evaluate differences in binding between different proteins and for me it works well to colour everything by element, and give the carbon atoms some identifying, distinct colour. This is best achieved through the Chimera command line that you can turn on with the little "computer" icon on the left-hand side of the graphics window. Have a look at the Chimera Users guide, and choose select to learn how Chimera's selection syntax works.
  • Open the Model Panel to check which protein has which Chimera-internal model number. Then you can use the following selection syntax. Instead of the model numbers, I will type <YFO>, <4ux5>, and <1BM8> - you will certainly know by now that these are placeholder labels and you need to replace them with the numbers 0, 1, and 2 instead.
  • To colour the DNA carbon atoms white, type:
color white #<4ux5>:.C,.D & C
  • To colour the 4ux5 A chain carbon atoms grey, type:
color #878795 #<4ux5>:.A & C
Note: the color values after the first hash are rgb triplets in the hexadecimal numbering system - exactly like in R.
  • To undisplay the 4ux5 B chain, type:
~display #<4ux5>:.B
Note: this is the tilde character, not a hyphen or minus sign.
  • To colour the YFO model carbon atoms a pale reddish color, type:
color #b06268 #<YFO> & C
  • To colour the 1BM8 structure carbon atoms a pale greenish color, type:
color #92b098 #<1BM8> & C
  • Ready? Let's superimpose the chains.
    • Select all models in the Model Panel and click on match.
    • Set 4ux5 Chain A as the Reference chain.
    • Select YFO as a Chain to match, select the button for specific reference and specific match, and click Apply.
    • Repeat this with 1BM8 as the match chain.
  • Easy. Now enlarge the binding site. Remember that 4ux5 and 1bm8 are independently determined crystal structures, whereas YFO was modelled on 1bm8 and is expected to be very similar to it. To give you some guidance what you should focus on, select 4ux5 residue 84 CA atom and display it as Ball & Stick. You can also repeat Actions → Set pivot in case the pivot has shifted.
  • Study the scene. This is where stereo vision will help a lot.
  • What do you think? Is this what you expected? Can you explain what you see? Was the modelling process successful?


  • Now turn the display of 4ux5 chain B back on and turn chain A off instead. Then superimpose the 1BM8 template and your model on Chain B.


  • Again, focus on the binding region. What do you think of that? What would you have expected? Do you see a difference? What does this all mean?


NB: I hadn't seen this before, and I am completely intrigued by the results. In fact, I think I understand the protein much, much better now through this exercise. I'm very pleased with how this turned out.


 

Links and resources




 


Footnotes and references

  1. Rosetta may get the structure approximately right, Autodock may get the complex approximately right, but the coordinate changes involved in induced fit make the result unreliable, and we have no good way to validate whether the predicted complex is correct.
  2. Liu et al. (2015) Structural basis of DNA recognition by PCG2 reveals a novel DNA binding mode for winged helix-turn-helix domains. Nucleic Acids Res 43:1231-40. (pmid: 25550425)

    The MBP1 family proteins are the DNA binding subunits of MBF cell-cycle transcription factor complexes and contain an N terminal winged helix-turn-helix (wHTH) DNA binding domain (DBD). Although the DNA binding mechanism of MBP1 from Saccharomyces cerevisiae has been extensively studied, the structural framework and the DNA binding mode of other MBP1 family proteins remains to be disclosed. Here, we determined the crystal structure of the DBD of PCG2, the Magnaporthe oryzae orthologue of MBP1, bound to MCB-DNA. The structure revealed that the wing, the 20-loop, helix A and helix B in PCG2-DBD are important elements for DNA binding. Unlike previously characterized wHTH proteins, PCG2-DBD utilizes the wing and helix-B to bind the minor groove and the major groove of the MCB-DNA whilst the 20-loop and helix A interact non-specifically with DNA. Notably, two glutamines Q89 and Q82 within the wing were found to recognize the MCB core CGCG sequence through making hydrogen bond interactions. Further in vitro assays confirmed essential roles of Q89 and Q82 in the DNA binding. These data together indicate that the MBP1 homologue PCG2 employs an unusual mode of binding to target DNA and demonstrate the versatility of wHTH domains.

  3. This particular structure, however, was crystallized from a Tris buffer with 50 mM NaCl at pH 8.0, which are comparatively gentle conditions.
  4. Besides the coordinate difference between the chains, if chain B were indeed representative of a DNA "scanning" conformation, one would perhaps expect the local DNA structure that chain B binds to be closer to canonical B-DNA than the DNA binding interface of chain A...


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.


