Difference between revisions of "BIN-SX-Homology modelling"

 
(7 intermediate revisions by the same user not shown)
Line 32: Line 32:
 
After working through this unit you ...
 
After working through this unit you ...
 
* ... can produce a homology model using the Swiss-Model server;
 
* ... can produce a homology model using the Swiss-Model server;
* ... can work with Chimera to analyze its structural details.
+
* ... can work with ChimeraX to analyze its structural details.
 
</td>
 
</td>
 
</tr>
 
</tr>
Line 40: Line 40:
 
<b>Deliverables:</b><br />
 
<b>Deliverables:</b><br />
 
<section begin=deliverables />
 
<section begin=deliverables />
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-time_management" -->
+
<li><b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.</li>
*<b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
+
<li><b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.</li>
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-journal" -->
+
<li><b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].</li>
*<b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.
 
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-insights" -->
 
*<b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].
 
 
<section end=deliverables />
 
<section end=deliverables />
 
<!-- ============================  -->
 
<!-- ============================  -->
Line 51: Line 48:
 
<section begin=prerequisites />
 
<section begin=prerequisites />
 
<b>Prerequisites:</b><br />
 
<b>Prerequisites:</b><br />
<!-- included from "./data/ABC-unit_components.txt", section: "notes-prerequisites" -->
+
This unit builds on material covered in the following prerequisite units:<br />
This unit builds on material covered in the following prerequisite units:
 
 
*[[BIN-ALI-MSA|BIN-ALI-MSA (Multiple Sequence Alignment)]]
 
*[[BIN-ALI-MSA|BIN-ALI-MSA (Multiple Sequence Alignment)]]
 
*[[BIN-SX-Molecular_forcefields|BIN-SX-Molecular_forcefields (Molecular Forcefields)]]
 
*[[BIN-SX-Molecular_forcefields|BIN-SX-Molecular_forcefields (Molecular Forcefields)]]
Line 73: Line 69:
  
  
 +
=== Evaluation ===
 +
<b>Evaluation: NA</b><br />
 +
<div style="margin-left: 2rem;">This unit is not evaluated for course marks.</div>
 
== Contents ==
 
== Contents ==
<!-- included from "./components/BIN-SX-Homology_modelling.components.txt", section: "contents" -->
 
  
 
{{Task|1=
 
{{Task|1=
Line 80: Line 78:
  
 
*Read:
 
*Read:
{{#pmid: 24782522}}
+
{{#pmid: 29788355}}
  
 
}}
 
}}
Line 109: Line 107:
 
To illustrate how force fields modify protein structure in principle, let's consider changing a single amino acid in a structural template and then minimizing the structure's energy.
 
To illustrate how force fields modify protein structure in principle, let's consider changing a single amino acid in a structural template and then minimizing the structure's energy.
  
Such minimal changes to structure models can be done directly in Chimera. Let us consider the residue <code>A&nbsp;42</code> of the 1BM8 structure. It is oriented towards the core of the protein, but as the MSA shows, most other Mbp1 orthologs have a larger amino acid in this position: <code>V</code>, or even <code>I</code>.
+
Such minimal changes to structure models can be done directly in ChimeraX. Let us consider the residue <code>A&nbsp;42</code> of the 1BM8 structure. It is oriented towards the core of the protein, but as the MSA shows, most other Mbp1 orthologs have a larger amino acid in this position: <code>V</code>, or even <code>I</code>.
  
 
{{task|1=
 
{{task|1=
# Open <code>1BM8</code> in Chimera, hide the ribbons and show all protein atoms as a stick model.
+
* Open <code>1BM8</code> in ChimeraX, switch the camera to side-by-side stereo (<code>camera sbs</code>), use soft lighting (<code>lighting soft</code>), hide the ribbons and show all protein atoms as a stick model.
# Color the protein white.
+
* Color the protein white.
# Open the sequence window and select <code>A&nbsp;42</code>. Color it red. Choose '''Actions&nbsp;&rarr;&nbsp;Set pivot'''. This sets the center of rotation of the scene to <code>A&nbsp;42</code> so the residue will not pitch out of the visible scene when you rotate the protein. Study how nicely the alanine sidechain fits into the cavity formed by its surrounding residues.
+
* Open the Sequence Viewer and select <code>A&nbsp;42</code>. Color it red. Choose '''Actions&nbsp;&rarr;&nbsp;Set pivot'''. This sets the center of rotation of the scene to <code>A&nbsp;42</code> so the residue will not pitch out of the visible scene when you rotate the protein. Study how nicely the alanine sidechain fits into the cavity formed by its surrounding residues.
# To emphasize this better, select the protein atoms and display them as a '''sphere''' model to better appreciate the packing, i.e. the Van der Waals contacts. Use the '''Tools''' &rarr; '''Viewing Controls''' &rarr; '''Side View''' panel to move the clipping plane and see a section through the protein. Study the packing, in particular, note that the additional methyl groups of a valine or isoleucine would not have enough space in the structure. Then restore the clipping planes so you can see the whole molecule.
+
* To emphasize this better, select the protein atoms and calculate an "accessible surface" to better appreciate the packing, i.e. the Van der Waals contacts.
# Lets simplify the view: choose '''Actions &rarr; Atoms/Bonds &rarr; show''' and '''Actions &rarr; Atoms/Bonds &rarr; backbone&nbsp;only &rarr; chain&nbsp;trace'''. Then select <code>A&nbsp;42</code> again in the sequence window and choose '''Actions &rarr; Atoms/Bonds &rarr; show'''.
+
 
# Add the surrounding residues: choose '''Select &rarr; Zone...'''. In the window, see that the box is checked that selects all atoms at a distance of less then 5&Aring; to the current selection, and check the lower box to select the whole residue of any atom that matches the distance cutoff criterion. Click '''OK''' and choose '''Actions &rarr; Atoms/Bonds &rarr; show'''. You now have a very clear scene of the alanine residue in red, the surrounding side chains, and the rest of the structure as a C-alpha trace. You also see three water molecules. Spend a bit of time again, to get a sense for the spatial context<ref>Chimera uses a default '''distance to screen''' that is too close and that exaggerates the depth of the scene to a degree that it is difficult to fuse the stereo pairs. Choose '''Tools &rarr; Viewing controls &rarr; Camera''' and set the distance to screen to 50 cm. This will make stereo viewing easier and will also give a better sense of distance estimates in all three dimensions.</ref>.
+
: <code>hide #1</code>
#Select <code>A&nbsp;42</code> again: '''left-click''' (control click) on any atom of the alanine to select the atom, then '''up-arrow''' to select the entire residue. Now let's mutate this residue to isoleucine.
+
: <code>select #1:4-41,43-102</code>      ;# the protein without residue 42
#Choose '''Tools &rarr; Structure&nbsp;Editing &rarr; Rotamers''' and select <code>ILE</code> as the rotamer type. Click '''OK''', a window will pop up that shows you the possible rotamers for isoleucine together with their database-derived probabilities; you can select them in the window and cycle through them with your arrow keys. But note that the probabilities are '''very''' different - and thus show you high-energy and low-energy rotamers to choose from. Therefore, unless you have compelling reasons to do otherwise, try to find the highest-probability rotamer that may fit. This is where your stereo viewing practice becomes important, if not essential. It is really, really hard to do this reasonably in a 2D image! It becomes quite obvious in 3D.
+
: <code>surface sel enclose sel</code>    ;# its surface without A42
#I find that the first rotamer is actually not such a bad fit. The <code>CD</code> atom comes close to the sidechains of <code>I&nbsp;25</code> and <code>L&nbsp;96</code>. But we can assume that these are somewhat mobile and can accommodate a denser packing, because - as you can easily verify in your MSA - it is '''NOT''' the case that sequences that have <code>I&nbsp;42</code>, have a smaller residue in position <code>25</code> and/or <code>96</code>. So let's accept the most frequent <code>ILE</code> rotamer by selecting it in the rotamer window and clicking '''OK''' (while '''existing side chain(s): replace''' is selected).
+
: <code>color radial #1.1 center #1:42 palette "white:black"</code>
#Done.
+
: <code>transparency sel 50</code>
 +
: <code>select #1:42</code>        ;# A42 only
 +
: <code>surface sel enclose sel</code>
 +
: <code>color #1.2 red</code>      ;# our second surface is submodel 1.2
 +
 
 +
* Click on the '''Graphics''' tab to open the menu bar with graphics settings. There is an icon called '''Side view''' that opens a window with viewer position details. It is quite small when it is placed into the right-hand column, but you can detach it by dragging its menu bar and then increase its size. You can see the camera distance from the scene, and two clipping planes. Drag the clipping planes to visualize a section through the protein. Study the packing; in particular, note that not even a single additional methyl group of a valine or isoleucine would fit into the structure. Then restore the clipping planes so you can see the whole molecule.
 +
 
 +
* Let's create a clear view of the alanine sidechain in context. We display it as a stick model, the rest of the chain as a C-alpha trace, and then select the surrounding sidechains and display those too.
 +
 
 +
: <code>hide #1 surfaces</code>
 +
: <code>show @ca target ab</code>      ;# CA trace
 +
: <code>style stick</code>
 +
: <code>select zone :42 4.5 #1 extend true residues true</code>
 +
: <code>show sel target ab</code>
 +
: <code>select #1 & protein</code>
 +
: <code>hide @H*</code>                  ;# hide H-atoms
 +
: <code>size sel stickRadius 0.25</code>
 +
: <code>size pseudobondRadius 0.25</code>  ;# the lines connecting the CAs are "pseudobonds"
 +
 
 +
* You now have a very clear scene of the alanine residue in red, the surrounding side chains, and the rest of the structure as a C-alpha trace. You also see three water molecules. Spend a bit of time again, to get a sense for the spatial context.
 +
* Now let's mutate alanine 42 to isoleucine.
 +
 
 +
: <code>select #1:42</code>
 +
: <code>ui tool show rotamers</code> ;# or using the menu: '''Tools &rarr; Structure&nbsp;Editing &rarr; Rotamers'''
 +
 
 +
*Choose <code>ILE</code> as the rotamer type and click '''OK'''. A window will pop up that shows you the possible rotamers for isoleucine together with their database-derived probabilities; you can select them in the window and cycle through them with your arrow keys. But note that the probabilities are '''very''' different, so the list contains both low-energy and high-energy rotamers. Therefore, unless you have compelling reasons to do otherwise, try to find the highest-probability rotamer that may fit. This is where your stereo viewing practice becomes important, if not essential. It is really, really hard to do this reasonably in a 2D image! It becomes quite obvious in 3D.
 +
*I find that the first rotamer is actually not such a bad fit, and that number five is also quite plausible. In the first rotamer, the <code>CD</code> atom comes close to the sidechains of <code>I&nbsp;25</code> and <code>L&nbsp;96</code>. But we can assume that these are somewhat mobile and can accommodate a denser packing, because - as you can easily verify in your MSA - it is '''NOT''' the case that sequences that have <code>I&nbsp;42</code> also have a smaller residue in position <code>25</code> and/or <code>96</code>. So let's accept the most frequent <code>ILE</code> rotamer by selecting it in the rotamer window and clicking '''OK'''.
 +
*Done.
 
}}
 
}}
  
If you want to go over this in more detail, check the video tutorial on YouTube published by the NIAID bioinformatics group [https://www.youtube.com/watch?v=bcXMexN6hjY '''here''']. I would also encourage you to go over [https://www.youtube.com/watch?v=eJkrvr-xeXY '''Part 2 of the video tutorial'''] that discusses how to check for and resolve (by energy minimization) steric clashes. But do remember that it is not clear whether energy minimization will make your structure more correct in the sense of a smaller overall RMSD with the real, mutated protein.
 
  
Incidentally: What we have done here with one residue is exactly the way homology modeling works with entire sequences. The homology modelling program simply changes '''all''' amino acids to the residues of the '''target sequence''', based on the '''template structure'''. Let's now build a homology model for MYSPE Mbp1.
+
What we have done here with ''one'' residue is exactly the way homology modeling works with entire sequences. The homology modelling program simply changes '''all''' amino acids to the residues of the '''target sequence''', based on the '''template structure'''. Let's now build a homology model for MYSPE Mbp1.
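
The idea that a model is built by substituting template residues with target residues can be made concrete with a small sketch. The code below is only an illustration in base R, not part of the course scripts: the two aligned sequences are invented toy strings, <code>listSubstitutions()</code> is a hypothetical helper name, and the starting residue number is an assumption chosen for the example. It lists every aligned position at which the target differs from the template, in the usual "old residue, number, new residue" notation.

```r
# Toy aligned pair: invented sequences, NOT real Mbp1 data.
template <- "KVYSARYSG"
target   <- "KIYSIRYSG"

# Hypothetical helper: list the substitutions needed to turn the template
# into the target. Gap columns ("-") are skipped; residue numbers follow
# the template, starting at firstResidue.
listSubstitutions <- function(template, target, firstResidue = 1L) {
  tpl <- strsplit(template, "")[[1]]
  tgt <- strsplit(target, "")[[1]]
  stopifnot(length(tpl) == length(tgt))   # sequences must be aligned

  isRes <- tpl != "-"                     # columns holding a template residue
  resNum <- rep(NA_integer_, length(tpl))
  resNum[isRes] <- as.integer(firstResidue) + seq_len(sum(isRes)) - 1L

  changed <- isRes & tgt != "-" & tpl != tgt
  sprintf("%s%d%s", tpl[changed], resNum[changed], tgt[changed])
}

# Assumed numbering for this toy example: the first template residue is 38.
subs <- listSubstitutions(template, target, firstResidue = 38L)
print(subs)
```

A real modelling program does much more than this (it has to place the new sidechains and resolve clashes), but the list of substitutions is the starting point.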
  
 
{{Vspace}}
 
{{Vspace}}
Line 143: Line 167:
 
{{Smallvspace}}
 
{{Smallvspace}}
  
<source lang="R">
+
<pre>
library(msa)
 
 
 
 
# Recreate the database
 
# Recreate the database
source("makeProteinDB.R")
+
source("./myScripts/makeProteinDB.R")
  
 
# A: Define your TARGET sequence.
 
# A: Define your TARGET sequence.
 
#      You have defined a feature annotation for the MYSPE APSES domain in
 
#      You have defined a feature annotation for the MYSPE APSES domain in
#      the BIN-ALI-Optimal_sequence_alignment unit's R code. Retrieve it's
+
#      the BIN-ALI-Optimal_sequence_alignment unit's R code. Retrieve its
 
#      sequence from the feature annotation to get the TARGET sequence.
 
#      sequence from the feature annotation to get the TARGET sequence.
 
#
 
#
  
(targetName <- sprintf("MBP1_%s", biCode(MYSPE)))
+
( targetName <- sprintf("MBP1_%s", biCode(MYSPE)) )
  
 
# Get the protein IDs.
 
# Get the protein IDs.
(sel <- which(myDB$protein$name == targetName))
+
( sel <- which(myDB$protein$name == targetName) )
(proID <- myDB$protein$ID[sel])
+
( proID <- myDB$protein$ID[sel] )
  
 
# Find the feature ID in the feature table
 
# Find the feature ID in the feature table
(ftrID <- myDB$feature$ID[myDB$feature$name == "APSES fold"])
+
( ftrID <- myDB$feature$ID[myDB$feature$name == "APSES fold"] )
  
 
# Get the annotation ID.
 
# Get the annotation ID.
(fanID <- myDB$annotation$ID[myDB$annotation$proteinID == proID &
+
( fanID <- myDB$annotation$ID[myDB$annotation$proteinID == proID &
                             myDB$annotation$featureID == ftrID])
+
                             myDB$annotation$featureID == ftrID] )
  
 
# Get the feature start and end:
 
# Get the feature start and end:
(start <- myDB$annotation$start[fanID])
+
( start <- myDB$annotation$start[fanID] )
(end  <- myDB$annotation$end[fanID])
+
( end  <- myDB$annotation$end[fanID] )
  
 
# Extract the feature from the sequence
 
# Extract the feature from the sequence
Line 179: Line 201:
  
 
targetSeq
 
targetSeq
</source>
+
</pre>
  
 
{{Vspace}}
 
{{Vspace}}
Line 186: Line 208:
  
  
The [http://swissmodel.expasy.org/ SWISS-MODEL] server provides several different options for constructing homology models. The easiest option requires only a target sequence as input. In this mode the program will automatically choose suitable templates and create an input alignment. I think that is not the best way to use such a service: template choice and alignment both may be significantly influenced by biochemical reasoning, and an automated algorithm cannot make the necessary decisions. Should you use a structure of reduced resolution that however has a ligand bound? Should you move an indel from an active site to a loop region even though the sequence similarity score might be less? Questions like that may have answers that are different from the best choices an automated algorithm could make. But Swiss Model is flexible and allows us to upload an explicit alignment between target and template. Please note: the model you will produce is "easy" - the sequence similarity is high and there are no significant indels to consider, the automated mode would have done just as well. But the strategy we pursue here is also suitable for much more difficult problems. The automated strategy maybe not. More control over the process is a good thing.
+
The [http://swissmodel.expasy.org/ SWISS-MODEL] server provides several different options for constructing homology models. The easiest option requires only a target sequence as input. In this mode the program will automatically choose suitable templates and create an input alignment. I think that is not the best way to use such a service: template choice and alignment may both be significantly influenced by biochemical reasoning, and an automated algorithm cannot make the necessary decisions. Should you use a structure of lower resolution that nevertheless has a ligand bound? Should you move an indel from an active site to a loop region even though the sequence similarity score might be lower? Questions like these may have answers that differ from the best choices an automated algorithm could make. But SWISS-MODEL is flexible and allows us to upload an explicit alignment between target and template. Please note: the model you will produce is "easy" - the sequence similarity is high and there are no significant indels to consider, so the automated mode would have done just as well. But the strategy we pursue here is also suitable for significantly more difficult problems; the automated strategy may not be. More control over the process is a good thing.
  
 
Template choice is the first step. Often more than one related structure can be found in the PDB. The degree of sequence identity is the most important criterion, but there are many other factors to consider. Please refer to the [[Template_choice_principles|template choice principles]] page on this Wiki where I discuss more details and alternatives. To find related structures, you can search the PDB itself through its '''Advanced Search''' interface; for example one can search for sequence similarity with a BLAST search, or search for structural similarity by accessing structures according to their CATH or SCOP classification. But the BLAST search is probably the method of choice: after all, the most important measure of the probability of success for homology modelling is sequence similarity.
 
Template choice is the first step. Often more than one related structure can be found in the PDB. The degree of sequence identity is the most important criterion, but there are many other factors to consider. Please refer to the [[Template_choice_principles|template choice principles]] page on this Wiki where I discuss more details and alternatives. To find related structures, you can search the PDB itself through its '''Advanced Search''' interface; for example one can search for sequence similarity with a BLAST search, or search for structural similarity by accessing structures according to their CATH or SCOP classification. But the BLAST search is probably the method of choice: after all, the most important measure of the probability of success for homology modelling is sequence similarity.
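
Since sequence identity is the first criterion for template choice, it can help to see how such a number is computed from an alignment. The following base R sketch is an illustration under stated assumptions: the sequences are invented toy strings, <code>percentIdentity()</code> is a hypothetical helper, and the denominator used here (columns in which neither sequence has a gap) is only one of several common conventions, so values may differ from what BLAST or other tools report.

```r
# Hypothetical helper: percent identity of two aligned sequences.
# Only columns in which neither sequence has a gap are counted.
percentIdentity <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  stopifnot(length(x) == length(y))       # sequences must be aligned
  aligned <- x != "-" & y != "-"
  100 * sum(x[aligned] == y[aligned]) / sum(aligned)
}

# Toy alignment: invented sequences, NOT real Mbp1 data.
target    <- "MKV-YSAR"
templateA <- "MKIQYSAR"
templateB <- "MRV-YTAK"

pid <- c(A = percentIdentity(target, templateA),
         B = percentIdentity(target, templateB))
print(round(pid, 1))

# The candidate with the highest identity to the target
best <- names(pid)[which.max(pid)]
print(best)
```

Identity alone does not settle the choice; the other factors discussed on the template choice principles page still apply.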
Line 204: Line 226:
  
 
{{task|1=
 
{{task|1=
# Navigate to the [https://www.rcsb.org/pdb/home/home.do '''PDB'''].
+
* Navigate to the [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome '''BLAST''' query page].
# Click on '''Advanced Search''' to enter the advanced search interface.
+
* Click on '''Search''' &rarr; '''Advanced Search''' to enter the advanced search interface.
# Open the menu to '''Choose a Query Type:'''
+
* Paste your MYSPE APSES domain sequence (the "target sequence") and choose '''Protein Data Bank proteins''' as the database. The default parameters will work nicely. Click '''BLAST'''.
# Find the '''Sequence features''' section and choose '''Sequence (BLAST...)'''
 
# Copy the <code>targetSequence</code> from the R console and paste it into the '''Sequence''' field, select '''BLAST''' as the search tool, select '''not''' to mask low-complexity regions and '''Submit Query'''. Since the E-value is set rather high by default, you will get a number of low-confidence hits as well as the actual homologs, these have very low E-values.
 
  
All hits that are homologs are potentially suitable '''templates''', but some are more suitable than others. Consider how the coordinate sets differ and which features would make each more or less suitable for creating a homology model: you should consider ...
+
All hits that are homologs are potentially suitable "templates", but some are more suitable than others. Consider how the coordinate sets differ and which features would make each more or less suitable for creating a homology model: you should consider ...
  
 
:*sequence similarity to your target
 
:*sequence similarity to your target
Line 217: Line 237:
 
:*experimental method and quality of the data set
 
:*experimental method and quality of the data set
  
Sequence similarity is the most important, but we can have the PDB tabulate the other features concisely for this task.
+
As of September 2020, you should find four reasonable candidate structures from two species, three of which are from the same species. Some of the yeast sequences have longer chain lengths, but the additional residues are only disordered (otherwise these would be better-suited templates; regrettably, in the ''real world'' you need to check this yourself, since there is no automatic tool to evaluate disorder and its effects on template choice). Depending on MYSPE, your ideal template will be either 1BM8 or 4UX5. Let's consider both.
  
# There is a menu to create '''Reports:''' - select '''customizable table'''.
+
;Finally: Click on the Accession numbers to navigate to the sequence entries for those '''templates''' and save the FASTA sequences to your project directory. Name one file <code>./myScripts/1BM8_A.fa</code> and the other file <code>./myScripts/4UX5_A.fa</code> (save only chain '''A''' for 4UX5). These are your '''template sequences'''.
# Select (at least) the following information items:
 
;Structure Summary
 
* Experimental Method
 
;Sequence
 
* Chain Length
 
;Ligands
 
* Ligand Name
 
;Biological details
 
* Macromolecule Name
 
; refinement Details
 
* Resolution
 
* R Work
 
* R free
 
# click: '''Create report'''.
 
  
Unfortunately you don't get the E-values into the report, and those should strongly influence your final decision. As of October 2017, you should find four reasonable candidate structures from 2 species, three of which are from the same species. Some of the yeast sequences have a longer chain-lengths ... but those are only disordered residues (otherwise these would be better suited templates; regrettably, you'd need to check that in the ''real world'', there is no automatic tool to evaluate disorder and its effects on template choice). Depending on MYSPE, your ideal template will be either be 1BM8 or 4UX5. Let's consider both.
+
* Then visit the PDB entry pages and learn more about the structures - things like resolution, status of ligands, mutations etc.
 +
** [https://www.rcsb.org/structure/1BM8 '''1BM8''']
 +
** [https://www.rcsb.org/structure/4UX5 '''4UX5''']
  
;Finally: Click on the ID to navigate to the structure page for those '''templates''' and save the FASTA sequences to your project directory. Name one <code>1BM8_A.fa</code> and the other <code>4UX5_A.fa</code> (save only chain '''A''' for 4UX5). These are '''template sequence'''.
 
  
 
}}
 
}}
Line 244: Line 251:
 
{{Vspace}}
 
{{Vspace}}
  
<!--
 
 
===Sequence numbering===
 
===Sequence numbering===
  
It is not straightforward at all how to number sequence in such a project. A "natural" numbering starts with the start-codon of the full length protein and goes sequentially from there. However, this does not map exactly to other numbering schemes we have encountered. As you know the first residue of the APSES domain (as defined by CDD) is not Residue 1 of the Mbp1 protein. The first residue of the 1BM8 FASTA file <small>(one of the related PDB structures)</small> '''is''' the fourth residue of the Mbp1 protein. The first residue in the structure is GLN 3, therefore Q is the first residue in a FASTA sequence derived from the cordinate section of the PDB file (the <code>ATOM  </code> records. In the 1MB1 structure, the original N-terminal amino acids are present in the molecule, therefore they are present in the FASTA file which starts with <code>MSNQIY...</code>, but they are disordered in the structure and no coordinates are present for M and S. A sequence derived explicitly from the coordinates is therefore different from the reported FASTA sequence, which is really bad because that is what the modeling program has to work with  ... and so on. It can get complicated. You need to remember: a sequence number is not absolute, but assigned in a particular context and you need to be careful how to do this.
+
It is not straightforward at all how to number sequences in such a project. A "natural" numbering starts with the start-codon of the full length protein and goes sequentially from there. However, this does not map exactly to other numbering schemes we have encountered. As you know, the first residue of the APSES domain (as defined by CDD) is not Residue 1 of the Mbp1 protein. The first residue of the 1BM8 FASTA file <small>(one of the related PDB structures)</small> '''is''' the fourth residue of the Mbp1 protein. The first residue in the structure is GLN 4, therefore Q is the first residue in a FASTA sequence derived from the coordinate section of the PDB file (the <code>ATOM  </code> records). In the 1MB1 structure, the original N-terminal amino acids are present in the molecule, therefore they are present in the FASTA file which starts with <code>MSNQIY...</code>, but they are disordered in the structure and no coordinates are present for M and S. A sequence derived explicitly from the coordinates is therefore different from the reported FASTA sequence, which is really bad because that is what the modeling program has to work with  ... and so on. It can get complicated. You need to remember: a sequence number is not absolute, but assigned in a particular context, and you need to be very careful about which context a given number refers to.
  
 
Fortunately, the numbering for the residues in the coordinate section of our '''template''' structure corresponds not to its FASTA sequence, but to the numbering of the gene. Otherwise we would need to renumber the sequence <small>(e.g. by using the bio3D R package)</small>. If we did not do this, the sequence numbers in the model might not correspond to the sequence numbers of our target.
 
Fortunately, the numbering for the residues in the coordinate section of our '''template''' structure corresponds not to its FASTA sequence, but to the numbering of the gene. Otherwise we would need to renumber the sequence <small>(e.g. by using the bio3D R package)</small>. If we did not do this, the sequence numbers in the model might not correspond to the sequence numbers of our target.
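
To keep the two numbering contexts apart, it can be useful to write the conversion down explicitly. This is a minimal base R sketch, assuming (as stated above) that the first residue of the 1BM8 FASTA sequence is residue 4 of the full-length protein; the function names are hypothetical, and the simple offset only holds if the numbering has no gaps, insertion codes or missing residues.

```r
# Assumption taken from the text: the first residue of the 1BM8 FASTA
# sequence corresponds to residue 4 of the full-length Mbp1 protein.
firstGeneNumber <- 4L

# Hypothetical helper: 1-based index in the FASTA string -> gene numbering
fastaToGene <- function(i, first = firstGeneNumber) {
  as.integer(i) + first - 1L
}

# Hypothetical helper: gene numbering -> 1-based index in the FASTA string
geneToFasta <- function(n, first = firstGeneNumber) {
  as.integer(n) - first + 1L
}

print(fastaToGene(1))    # gene number of the first FASTA residue
print(geneToFasta(42))   # index of residue 42 within the FASTA string
```

For structures with missing residues or insertion codes, a constant offset is not enough and the mapping has to be read from the coordinate records themselves.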
  
 
+
{{Vspace}}
BELOW IS NOT NECESSARY FOR THE 1BM8 TEMPLATE. ALSO extraction can be done with bio3D
 
 
 
 
 
The homology '''model''' will be based on an alignment of '''target''' and '''template'''. Thus we have to define the target sequence. As discussed in class, PDB files have an explicit  and an implied sequence and these do not necessarily have to be the same. To compare the implied and the explicit sequence for the template, you need to extract sequence information from coordinates. One way to do this is via the Web interface for [http://swift.cmbi.ru.nl/servers/html/index.html '''WhatIf'''], a crystallography and molecular modeling package that offers many useful tools for coordinate manipulation tasks.
 
 
 
 
 
*Navigate to the '''Administration''' sub-menu of the [http://swift.cmbi.ru.nl/servers/html/index.html WhatIf Web server]. Follow the link to '''Make sequence file from PDB file'''. Enter the PDB-ID of your template into the form field and '''Send''' the request to the server. The server accesses the PDB file and extracts sequence information directly from the <code>ATOM&nbsp;&nbsp;</code> records of the file. The results will be returned in PIR format. Copy the results, edit them to FASTA format and save them in a text-only file. Make sure you create a valid FASTA formatted file! Use this '''implied''' sequence to check if and how it differs from the sequence ...
 
 
 
:*... listed in the <code>SEQRES</code> records of the coordinate file;
 
:*... given in the FASTA sequence for the template, which is provided by the PDB;
 
:*... stored in the protein database of the NCBI.
 
: and record your results.
 
 
 
* Establish how the sequence numbers in the coordinate section of your template(*) correspond to your target sequence numbering.
 
 
 
:(*) <small>These residue numbers are important, since they are referenced e.g. by VMD when you visualize the structure. The easiest way to list them is via the ''Sequence Viewer'' extension of VMD.</small>.
 
:<small>Don't do this for every residue individually but define ranges. Look at the correspondence of the first and last residue of target and template sequence and take indels into account. Establishing sequence correspondence precisely is crucially important! For example, when a publication refers to a residue by its sequence number, you have to be able to relate that number to the residue numbers of the model as well as your target sequence.</small>.
 
-->
 
  
 
===The input alignment===
 
===The input alignment===
Line 286: Line 274:
 
Here's how we do this in R:
 
Here's how we do this in R:
  
<source lang = "R">
+
<pre>
 
# Get all MBP1 Sequences
 
# Get all MBP1 Sequences
 
sel <- grep("^MBP1_", myDB$protein$name)
 
sel <- grep("^MBP1_", myDB$protein$name)
Line 297: Line 285:
  
 
# Read the template sequences
 
# Read the template sequences
seq1BM8 <- dbSanitizeSequence(readLines("1BM8_A.fa"))
+
seq1BM8 <- dbSanitizeSequence(readLines("./myScripts/1BM8_A.fa"))
 
names(seq1BM8) <- "1BM8_A"
 
names(seq1BM8) <- "1BM8_A"
seq4UX5 <- dbSanitizeSequence(readLines("4UX5_A.fa"))
+
seq4UX5 <- dbSanitizeSequence(readLines("./myScripts/4UX5_A.fa"))
 
names(seq4UX5) <- "4UX5_A"
 
names(seq4UX5) <- "4UX5_A"
  
Line 305: Line 293:
 
MBP1Set <- c(MBP1Set, seq1BM8, seq4UX5)
 
MBP1Set <- c(MBP1Set, seq1BM8, seq4UX5)
  
# Turn it into an AAStringSet
+
# Turn it into a Biostrings::AAStringSet
(MBP1Set <- AAStringSet(MBP1Set))  # You should have 13 sequences.
+
(MBP1Set <- Biostrings::AAStringSet(MBP1Set))  # You should have 13 sequences.
  
 
# Calculate an msa
 
# Calculate an msa
(MBP1msa <- msaMuscle(MBP1Set))
+
(MBP1msa <- msa::msaMuscle(MBP1Set))
  
 
# Inspect the msa
 
# Inspect the msa
Line 316: Line 304:
  
  
 +
</pre>
  
</source>
+
You need to decide which of the templates you will use. '''Choose either 1BM8 or 4UX5 - depending on which <ul>template</ul> has higher sequence similarity to the <ul>target</ul>.''' Next, extract aligned target and template sequences, while masking gaps that are not needed for the aligned pair.
  
You need to decide which of the templates you will use. '''Choose either 1BM8 or 4UX5 - depending on which template has higher sequence similarity to the target.''' Next, extract aligned target and template sequences, while masking gaps that are not needed for the aligned pair.
+
<pre>
 
 
<source lang="R">
 
  
 
# Write the alignments to file, we will need it later. Depending on which
 
# Write the alignments to file, we will need it later. Depending on which
 
# template you have decided on, execute ...
 
# template you have decided on, execute ...
writeMFA(fetchMSAmotif(MBP1msa, seq1BM8), myCon = "APSES-MBP1.fa") # or ...
+
writeMFA(fetchMSAmotif(MBP1msa, seq1BM8), myCon = "./myScripts/APSES-MBP1.fa") # or ...
writeMFA(fetchMSAmotif(MBP1msa, seq4UX5), myCon = "APSES-MBP1.fa")
+
writeMFA(fetchMSAmotif(MBP1msa, seq4UX5), myCon = "./myScripts/APSES-MBP1.fa")
  
 
# We extract the TARGET and TEMPLATE sequence, and remove any hyphens that
 
# We extract the TARGET and TEMPLATE sequence, and remove any hyphens that
Line 354: Line 341:
 
writeMFA(TTset)  # write output to multi FASTA format
 
writeMFA(TTset)  # write output to multi FASTA format
  
</source>
+
</pre>
  
 
}}
 
}}
Line 361: Line 348:
 
The result should be a two sequence alignment in '''multi-FASTA''' format, that was constructed from a number of supporting sequences and that contains your aligned '''target''' and '''template''' sequence. This is your '''input alignment''' for the homology modeling server. For <code>MBP1_CRYNE</code> aligned to <code>4UX5</code> the result looks like this:
 
The result should be a two sequence alignment in '''multi-FASTA''' format, that was constructed from a number of supporting sequences and that contains your aligned '''target''' and '''template''' sequence. This is your '''input alignment''' for the homology modeling server. For <code>MBP1_CRYNE</code> aligned to <code>4UX5</code> the result looks like this:
  
>MBP1_CRYNE
+
<pre>
MGKKVIASGGDNGPNTIYKATYSGVPVYEMVCR-DVAVMRRRSDAYLNATQILKVAGFDKPQRTRVLEREVQKGEHEKVQ
+
>MBP1_CRYNE
GGYGKYQGTWIPIERGLALAKQYGVEDILRPIIDYVPTSVSPPPAPKHSVAPPSKARRDK
+
MGKKVIASGGDNGPNTIYKATYSGVPVYEMVCR-DVAVMRRRSDAYLNATQILKVAGFDKPQRTRVLEREVQKGEHEKVQGGYGKYQGTWIPIERGLALAKQYGVEDILRPIIDYVPTSVSPPPAPKHSVAPPSKARRDK
  
>4UX5_A
+
>4UX5_A
MVKAAAAAASAPTGPGIYSATYSGIPVYEYQFGLKEHVMRRRVDDWINATHILKAAGFDKPARTRILEREVQKDQHEKVQ
+
MVKAAAAAASAPTGPGIYSATYSGIPVYEYQFGLKEHVMRRRVDDWINATHILKAAGFDKPARTRILEREVQKDQHEKVQGGYGKYQGTWIPLEAGEALAHRNNIFDRLRPIFEFSPGPDSPPPAPRH----TSKPKQPK
GGYGKYQGTWIPLEAGEALAHRNNIFDRLRPIFEFSPGPDSPPPAPRH----TSKPKQPK
+
</pre>
  
 
{{Vspace}}
 
{{Vspace}}
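Before pasting the input alignment into the server form, it can help to check it programmatically. The following is a minimal sketch in base R, not part of the course scripts: <code>readToyMFA()</code>, <code>checkAlignment()</code> and the toy sequences are hypothetical names and data made up for illustration. It only tests that all aligned sequences have the same length and contain nothing but letters and gap hyphens.

```r
# Parse multi-FASTA lines into a named character vector.
# (Hypothetical helper for illustration; not a course function.)
readToyMFA <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]           # drop empty lines
  isHeader <- startsWith(lines, ">")
  grp <- cumsum(isHeader)                         # record index for every line
  parts <- split(lines[!isHeader], grp[!isHeader])
  seqs <- vapply(parts, paste, character(1), collapse = "")
  names(seqs) <- sub("^>", "", lines[isHeader])
  seqs
}

# TRUE if all sequences have equal length and contain only letters or "-".
checkAlignment <- function(seqs) {
  sameLength <- length(unique(nchar(seqs))) == 1
  validChars <- all(grepl("^[A-Za-z-]+$", seqs))
  sameLength && validChars
}

# Toy alignment (made-up sequences, wrapped over two lines for the target)
toyMFA <- c(">TARGET", "MKV-IAS", "GG", ">TEMPLATE", "MKVAIASGG")
toyAli <- readToyMFA(toyMFA)
toyAli
checkAlignment(toyAli)

# An alignment with a stray space fails the check
badAli <- c(TARGET = "MKV IAS", TEMPLATE = "MKVAIAS")
checkAlignment(badAli)
```

If the check fails, look for stray spaces, line numbers or other special characters before submitting the alignment.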
Line 382: Line 369:
  
 
{{task|1=
 
{{task|1=
*Paste the aligned sequences of the MYSPE target and the template into the form field. SwissModel will analyse the sequences and ask you to identify target and template. The MYSPE sequence is your target. The 1BM8 or 4UX5 sequence is the template.
+
*Paste the aligned sequences of the MYSPE target and the template into the form field. SwissModel will analyse the sequences and ask you to identify target and template. The MYSPE sequence is your target. The 1BM8 or 4UX5 sequence is the template. Make sure there are no extraneous spaces or special characters in your sequence.
  
 
*Click '''Build Model''' to start the modeling process. This will take about a minute or so.
 
*Click '''Build Model''' to start the modeling process. This will take about a minute or so.
Line 388: Line 375:
 
* The resulting page returns information about the resulting model and its quality. You can rotate the model in the window on the right with the mouse. Regions that have a reddish hue have lower quality scores, i.e. they were harder to model or could not be modelled well with good geometry. Hovering the mouse over parts of the structure highlights the respective region of the sequence alignment.
 
* The resulting page returns information about the resulting model and its quality. You can rotate the model in the window on the right with the mouse. Regions that have a reddish hue have lower quality scores, i.e. they were harder to model or could not be modelled well with good geometry. Hovering the mouse over parts of the structure highlights the respective region of the sequence alignment.
  
*Mouse over the '''Model 01''' dropdown menu (under the icon of the template structure), and choose the '''PDB file'''. Note that the B-factor column of the coordinate section contains the QMEAN scores (between 0 and 1) that the server has calculated. Higher is better. '''Save the PDB file in your project directory call it <code>MBP1_MYSPE-APSES.pdb</code>.'''
+
*Mouse over the '''Model 01''' dropdown menu (under the icon of the template structure), and choose the '''PDB file'''. Note that the B-factor column of the coordinate section contains the QMEAN scores (between 0 and 1) that the server has calculated. Higher is better. '''Save the PDB file in your project directory and call it <code>./myScripts/MBP1_MYSPE-APSES.pdb</code>.'''
  
 
* Open the [http://swissmodel.expasy.org/docs/help SwissModel documentation] in a new tab. Read about the modelling process. There are a number of important technical details that help to understand what the computed coordinates of your model mean. You should pay special attention to the '''GMQE''' and '''QMEAN''' quality scores.
 
* Open the [http://swissmodel.expasy.org/docs/help SwissModel documentation] in a new tab. Read about the modelling process. There are a number of important technical details that help to understand what the computed coordinates of your model mean. You should pay special attention to the '''GMQE''' and '''QMEAN''' quality scores.
Line 402: Line 389:
  
  
We have spent a significant amount of time to prepare data for the analysis and in practice it usually seems to turn out that way, that the preparation of data occupies the greatest part of our efforts. The actual computational analysis is generally quite fast. And, unfortunately, the '''interpretation of results''' is often somewhat neglected. Don't be that way. Data does not explain itself. The interpretation of your computational results in a bilological context is the most important part.
+
We have spent a significant amount of time preparing data for the analysis, and in practice it usually turns out that way: the preparation of data occupies the greatest part of our efforts. The actual computational analysis is generally quite fast. And, unfortunately, the '''interpretation of results''' is often somewhat neglected. Don't be that way. Data does not explain itself. The interpretation of your computational results in a biological context is the most important part.
  
 
{{Vspace}}
 
{{Vspace}}
Line 426: Line 413:
 
{{task|1=
 
{{task|1=
  
# Explore and execute the following '''R''' script. It assumes that your model is in your is project directory and the file is called <code>MBP1_MYSPE-APSES.pdb</code>.
+
# Explore and execute the following '''R''' script. It assumes that your model is in your project directory and the file is called <code>MBP1_MYSPE-APSES.pdb</code>.
  
<source lang="rsplus">
+
<pre>
  
if (! require(bio3d, quietly=TRUE)) {
+
if (! requireNamespace("bio3d", quietly=TRUE)) {
 
   install.packages("bio3d")
 
   install.packages("bio3d")
  library(bio3d)
 
 
}
 
}
 
# Package information:
 
# Package information:
Line 439: Line 425:
 
#  data(package = "bio3d")    # available datasets
 
#  data(package = "bio3d")    # available datasets
  
PDB_INFILE      <- "MBP1_MYSPE-APSES.pdb"
+
PDB_INFILE      <- "./myScripts/MBP1_MYSPE-APSES.pdb"
PDB_OUTFILE    <- "MBP1_MYSPE-APSESrenum.pdb"
+
PDB_OUTFILE    <- "./myScripts/MBP1_MYSPE-APSESrenum.pdb"
  
  
Line 448: Line 434:
  
 
# == Read the MYSPE pdb file
 
# == Read the MYSPE pdb file
MYSPEmodel <- read.pdb(PDB_INFILE) # read the PDB file into a list
+
 
 +
MYSPEmodel <- bio3d::read.pdb(PDB_INFILE) # read the PDB file into a list
  
 
MYSPEmodel          # examine the information
 
MYSPEmodel          # examine the information
 
MYSPEmodel$atom[1,]  # get information for the first atom
 
MYSPEmodel$atom[1,]  # get information for the first atom
  
# Explore ?read.pdb and study the examples.
+
# Explore ?bio3d::read.pdb and study the examples.
  
 
# == Modify residue numbers for each atom
 
# == Modify residue numbers for each atom
resNum <- as.numeric(MYSPEmodel $atom[,"resno"])
+
resNum <- as.numeric(MYSPEmodel$atom[,"resno"])
 
resNum
 
resNum
 
resNum <- resNum - resNum[1] + iFirst  # add offset
 
resNum <- resNum - resNum[1] + iFirst  # add offset
MYSPEmodel $atom[ , "resno"] <- resNum  # replace old numbers with new
+
MYSPEmodel$atom[ , "resno"] <- resNum  # replace old numbers with new
  
 
# check result
 
# check result
MYSPEmodel $atom[ , "resno"]
+
MYSPEmodel$atom[ , "resno"]
MYSPEmodel $atom[1, ]
+
MYSPEmodel$atom[1, ]
  
 
# == Write output to file
 
# == Write output to file
write.pdb(pdb = MYSPEmodel, file=PDBout)
+
bio3d::write.pdb(pdb = MYSPEmodel, file=PDB_OUTFILE)
  
 
# Done. Open the renumbered PDB file in the RStudio editor
 
# Done. Open the renumbered PDB file in the RStudio editor
 
# and confirm that this has worked.
 
# and confirm that this has worked.
  
</source>
+
</pre>
 
}}
 
}}
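The renumbering step in the script is just a constant offset applied to every atom's residue number. Here is a small self-contained sketch of that arithmetic on a toy data frame. The data frame only mimics the <code>resno</code> column of the atom table; <code>toyAtoms</code> and <code>toyFirst</code> are made-up names and values, and the sketch assumes that the residues of the model are contiguous.

```r
# Toy atom table: five atoms in three residues, numbered from 1 (made-up data)
toyAtoms <- data.frame(
  resno = c(1, 1, 2, 2, 3),
  elety = c("N", "CA", "N", "CA", "N"),
  stringsAsFactors = FALSE
)

toyFirst <- 14   # hypothetical: number of the first model residue in the target

# Shift all residue numbers so that the first residue becomes toyFirst
toyAtoms$resno <- toyAtoms$resno - toyAtoms$resno[1] + toyFirst

toyAtoms$resno
```

All atoms of one residue keep the same number, and the spacing between residues does not change. If your model had a chain break, a single offset would not be enough and you would need one offset per contiguous range.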
  
Line 479: Line 466:
  
 
{{Smallvspace}}
 
{{Smallvspace}}
 
  
 
SwissModel calculates energies for each residue of the model with a molecular mechanics forcefield. The SwissModel modeling summary page contains a plot of these energies as a function of sequence number. The values - between 0.0 and 1.0 - are stored in the PDB file's B-factor field.
 
SwissModel calculates energies for each residue of the model with a molecular mechanics forcefield. The SwissModel modeling summary page contains a plot of these energies as a function of sequence number. The values - between 0.0 and 1.0 - are stored in the PDB file's B-factor field.
  
 
{{task|1=
 
{{task|1=
# Start Chimera and load the '''model''' coordinates that you have just renumbered.
+
* Start ChimeraX and load the '''model''' coordinates that you have just renumbered.
# Select all, hide Ribbons and show Atoms, bonds to view the entire model structure.
+
* set the camera to stereo to be able to examine details of the cluttered core of the protein.
# Choose '''Tools &rarr; Depiction &rarr; Render by attribute''' and select '''attributes of atoms''', '''Attribute: bfactor''', check '''color atoms''' and click '''Apply'''. Note that you can change the way the spectrum is mapped to the values by moving the blue, white and red bars over the histogram with your mouse.
+
* Select all, hide cartoons and show Atoms, bonds to view the entire model structure.
# Study the result: It seems that residues in the core of the protein have better energies (higher values) than residues at the surface, i.e. Swiss-Model was more confident in the predicted conformationstes. Why could that be the case?
+
* Enter: <code>color byattribute a:bfactor</code> to get atom-level B-factor coloring.
 +
* Study the result: It seems that residues in the core of the protein have better energies (higher values) than residues at the surface, i.e. Swiss-Model was more confident in the predicted conformations. Why could that be the case?
 +
 
 
}}
 
}}
  
Study the options of this window a bit, rendering by attribute is a powerful way to store and depict all manners of information with the molecule. You can simply write a little R script that uses bio3D to replace the B-factor or occupancy values with any value you might be interested in: energies, conservation scores, information ... whatever. Then render this property to map it on the 3D structure of your molecule...
 
  
 
{{Vspace}}
 
{{Vspace}}
  
 
==Modelling DNA binding==
 
==Modelling DNA binding==
 +
 +
{{Smallvspace}}
  
 
One of the really interesting questions we can discuss with reference to our homology model is how sequence variation might result in changed DNA recognition sites, and then lead to changed cognate DNA binding sequences. In order to address this, we would need to generate a plausible structural model for how DNA is bound to APSES domains.
 
One of the really interesting questions we can discuss with reference to our homology model is how sequence variation might result in changed DNA recognition sites, and then lead to changed cognate DNA binding sequences. In order to address this, we would need to generate a plausible structural model for how DNA is bound to APSES domains.
Line 503: Line 492:
  
 
===A homologous protein/DNA complex structure===
 
===A homologous protein/DNA complex structure===
 +
 +
{{Smallvspace}}
  
 
{{task|1=
 
{{task|1=
Line 508: Line 499:
 
; The PCG2 / DNA complex
 
; The PCG2 / DNA complex
  
* Open Chimera.
+
* Open ChimeraX.
 
* load the '''<code>4UX5</code>''' structure. Spend some time exploring it. There are two domains of identical sequence bound to one DNA molecule.
 
* load the '''<code>4UX5</code>''' structure. Spend some time exploring it. There are two domains of identical sequence bound to one DNA molecule.
 
**If your homology model was based on <code>4UX5</code>, Swiss-Model has already made two copies, and their orientation is the same as the template, so no superposition is required.
 
**If your homology model was based on <code>4UX5</code>, Swiss-Model has already made two copies, and their orientation is the same as the template, so no superposition is required.
** If your homology model was based on <code>1BM8</code>: make a second copy of your model. Open the '''Tools''' &rarr; '''General''' &rarr; '''Model Panel''' and use the '''copy/combine''' button to create a copy of your model. Then superimpose one copy on chain A of <code>4UX5</code>, and the other copy on chain B: open a '''MatchMaker''' dialogue window with '''Tools''' &rarr; '''Structure comparison''' &rarr; '''MatchMaker'''.  Choose the radio button two match two specific chains and select <code>4UX5</code> chain A as the '''Reference chain''', and one of your models as the '''Chain to match'''. Click '''Apply'''. Similarly superimpose the other copy of the model on chain B.
+
** If your homology model was based on <code>1BM8</code>: use the '''File''' &rarr; '''Open...''' menu option to load a second copy of the molecule. Then superimpose one copy on chain A of <code>4UX5</code>, and the other copy on chain B: open a '''MatchMaker''' dialogue window with '''Tools''' &rarr; '''Structure Analysis''' &rarr; '''MatchMaker'''.  Choose the radio button to match two specific chains and select <code>4UX5</code> chain A as the '''Reference chain''', and one of your models as the '''Chain to match'''. Click '''Apply'''. Similarly superimpose the other copy of the model on chain B.
  
 
*Color the <code>4UX5</code> protein chains grey.
 
*Color the <code>4UX5</code> protein chains grey.
 
*Color the <code>4UX5</code> nucleic acid chains "by element", hide ribbons, show Atoms/Bonds and set nucleotide objects '''off'''.
 
*Color the <code>4UX5</code> nucleic acid chains "by element", hide ribbons, show Atoms/Bonds and set nucleotide objects '''off'''.
*Now color your model '''by conservation score''':
 
**In the Multalign Viewer window choose '''Preferences''' &rarr; '''Headers''', and in the Headers window choose the Headers tab and select '''Conservation style''' &rarr; '''AL2CO'''<ref>{{#pmid:11524371}}</ref>. Click '''OK'''.
 
**In the Multalign Viewer window choose '''Structure''' &rarr; '''Render by Conservation''' to open the "Render/Select by Attribute" Window. Select your Model. Select '''mavConservation''' as the "Attribute" to render. Note that you can move the blue white and red coloured bars to adjust the way the colour scale is applied to the values. Click on the blue, white and red bar in turn and then on the colour swatch to change the colour. Choose a bright orange red for the low value threshold (high diversity), a dark red for the midpoint, and a dark greay for the high conservation values. Click on '''Apply'''. Are all residues that make protein-DNA interactions in the complex conserved between target and template? Are they conserved across the entire family?
 
 
 
* Do the two molecules bind to the same DNA motif - the CGCG core of the "MCB-box"? Do the chains have protein:DNA interfaces with the cognate sequence, or are one (or both) proteins  non-specific complexes? The conditions under which proteins crystallize can be harsh, and physiological function under these conditions is not guaranteed.<ref>This particular crystal structure however was crystallized from a Tris-buffer with 50mM NaCl at pH 8.0 - comparatively gentle conditions actually.</ref> Indeed, Liu ''et al.'' (2015) report that at low concentrations a 1:1 complex is formed and the 2:1 Protein:DNA complex only forms at high concentrations. Figure 3. of their paper shows that the detailed contacts between protein and DNA are in fact '''not''' identical.
 
* Do the two molecules bind to the same DNA motif - the CGCG core of the "MCB-box"? Do the chains have protein:DNA interfaces with the cognate sequence, or are one (or both) proteins  non-specific complexes? The conditions under which proteins crystallize can be harsh, and physiological function under these conditions is not guaranteed.<ref>This particular crystal structure however was crystallized from a Tris-buffer with 50mM NaCl at pH 8.0 - comparatively gentle conditions actually.</ref> Indeed, Liu ''et al.'' (2015) report that at low concentrations a 1:1 complex is formed and the 2:1 Protein:DNA complex only forms at high concentrations. Figure 3. of their paper shows that the detailed contacts between protein and DNA are in fact '''not''' identical.
  
Line 533: Line 520:
 
{{Vspace}}
 
{{Vspace}}
  
== Self-evaluation ==
 
<!--
 
=== Question 1===
 
 
Question ...
 
 
<div class="toccolours mw-collapsible mw-collapsed" style="width:800px">
 
Answer ...
 
<div class="mw-collapsible-content">
 
Answer ...
 
 
</div>
 
  </div>
 
 
  {{Vspace}}
 
 
-->
 
 
== Notes ==
 
== Notes ==
<!-- included from "./components/BIN-SX-Homology_modelling.components.txt", section: "notes" -->
 
<!-- included from "./data/ABC-unit_components.txt", section: "notes" -->
 
 
<references />
 
<references />
== Further reading, links and resources ==
 
<!-- {{#pmid: 19957275}} -->
 
<!-- {{WWW|WWW_GMOD}} -->
 
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 
 
{{Vspace}}
 
 
 
<!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_ask" -->
 
 
----
 
  
 
{{Vspace}}
 
{{Vspace}}
  
<b>If in doubt, ask!</b> If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
 
 
----
 
 
{{Vspace}}
 
  
 
<div class="about">
 
<div class="about">
Line 582: Line 534:
 
:2017-08-05
 
:2017-08-05
 
<b>Modified:</b><br />
 
<b>Modified:</b><br />
:2017-10-30
+
:2020-09-22
 
<b>Version:</b><br />
 
<b>Version:</b><br />
:1.0
+
:1.2
 
<b>Version history:</b><br />
 
<b>Version history:</b><br />
 +
*1.2 2020 Updates; major rewrites for ChimeraX; BLAST now at NCBI; using ./myScripts directory consistently; no GeSHi ...
 +
*1.1 Change from require() to requireNamespace() and use &lt;package&gt;::&lt;function&gt;() idiom.
 
*1.0 First live version
 
*1.0 First live version
 
*0.1 First stub
 
*0.1 First stub
 
</div>
 
</div>
[[Category:ABC-units]]
 
<!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_footer" -->
 
  
 
{{CC-BY}}
 
{{CC-BY}}
  
 +
[[Category:ABC-units]]
 +
{{UNIT}}
 +
{{LIVE}}
 
</div>
 
</div>
 
<!-- [END] -->
 
<!-- [END] -->

Latest revision as of 11:41, 28 September 2020

Homology Modeling

(Homology modeling: alignment, alignment, alignment.)


 


Abstract:

This unit introduces the principles of modelling structures based on the known coordinates of a homologue. The key to successful modelling is a carefully done multiple sequence alignment.


Objectives:
This unit will ...

  • ... introduce the principles behind homology modeling of structures;
  • ... teach how to produce a structural model of the MBP1_MYSPE APSES domain;
  • ... demonstrate how to analyze the model;

Outcomes:
After working through this unit you ...

  • ... can produce a homology model using the Swiss-Model server;
  • ... can work with ChimeraX to analyze its structural details.

Deliverables:

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

  • Prerequisites:
    This unit builds on material covered in the following prerequisite units:


     



     



     


    Evaluation

    Evaluation: NA

    This unit is not evaluated for course marks.

    Contents

    Task:

    • Read:
    Waterhouse et al. (2018) SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res 46:W296-W303. (pmid: 29788355)

    Homology modelling has matured into an important technique in structural biology, significantly contributing to narrowing the gap between known protein sequences and experimentally determined structures. Fully automated workflows and servers simplify and streamline the homology modelling process, also allowing users without a specific computational expertise to generate reliable protein models and have easy access to modelling results, their visualization and interpretation. Here, we present an update to the SWISS-MODEL server, which pioneered the field of automated modelling 25 years ago and been continuously further developed. Recently, its functionality has been extended to the modelling of homo- and heteromeric complexes. Starting from the amino acid sequences of the interacting proteins, both the stoichiometry and the overall structure of the complex are inferred by homology modelling. Other major improvements include the implementation of a new modelling engine, ProMod3 and the introduction a new local model quality estimation method, QMEANDisCo. SWISS-MODEL is freely available at https://swissmodel.expasy.org.


     


    Introduction

    In order to understand how specific residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to study an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. Explanations of a protein's observed properties and functions can't rely on the general fact that it binds DNA, we need to consider details in terms of specific residues and their spatial arrangement. In particular, it would be interesting to correlate the conservation patterns of key residues with their potential to make specific DNA binding interactions. Unfortunately, the experimental evidence (Taylor et al., 2000) is not sufficient to unambiguously define the details of how a DNA double helix might be bound. Moreover, several distinct modes of DNA binding are known for proteins of the winged-helix superfamily, of which the APSES domain is a member.

    In this assignment you will construct a molecular model of the APSES domain from the Mbp1 RBM orthologue in MYSPE.

    For the following, please remember the following terminology:

    Target
    The protein that you are planning to model.
    Template
    The protein whose structure you are using as a guide to build the model.
    Model
    The structure that results from the modelling process. It has the Target sequence and is similar to the Template structure.


     

    The basic idea - a Point Mutation

    To illustrate how force fields modify protein structure in principle, let's consider changing the sequence of a single amino acid, based on a structural template and minimize the structure's energy.

    Such minimal changes to structure models can be done directly in ChimeraX. Let us consider the residue A 42 of the 1BM8 structure. It is oriented towards the core of the protein, but as the MSA shows, most other Mbp1 orthologs have a larger amino acid in this position: V, or even I.

    Task:

    • Open 1BM8 in ChimeraX, turn the camera to stereo (camera sbs), use soft lighting (lighting soft), hide the ribbons and show all protein atoms as a stick model.
    • Color the protein white.
    • Open the Sequence Viewer and select A 42. Color it red. Choose Actions → Set pivot. This sets the center of rotation of the scene to A 42 so the residue will not pitch out of the visible scene when you rotate the protein. Study how nicely the alanine sidechain fits into the cavity formed by its surrounding residues.
    • To emphasize this better, select the protein atoms and calculate an "accessible surface" to better appreciate the packing, i.e. the Van der Waals contacts.
    hide #1
    select #1:4-41,43-102 ;# the protein without residue 42
    surface sel enclose sel ;# its surface without A42
    color radial #1.1 center #1:42 palette "white:black"
    transparency sel 50
    select #1:42 ;# A42 only
    surface sel enclose sel
    color #1.2 red ;# our second surface is submodel 1.2
    • Click on the Graphics tab to open the menu bar with graphics settings. There is an icon called Side view which opens a window with viewer position details. It is quite small when it is placed into the right-hand column, but you can detach it by dragging its menu bar and then increase its size. You can see the camera distance from the scene, and two clipping planes. Drag the clipping planes to visualize a section through the protein. Study the packing; in particular, note that not even a single additional methyl group of a valine or isoleucine would have space in the structure. Then restore the clipping planes so you can see the whole molecule.
    • Let's create a clear view of the alanine sidechain in context. We display it as a stick model, the rest of the chain as a C-alpha trace, and then select the surrounding sidechains and display those too.
    hide #1 surfaces
    show @ca target ab ;# CA trace
    style stick
    select zone :42 4.5 #1 extend true residues true
    show sel target ab
    select #1 & protein
    hide @H* ;# hide H-atoms
    size sel stickRadius 0.25
    size pseudobondRadius 0.25 ;# the lines connecting the CAs are "pseudobonds"
    • You now have a very clear scene of the alanine residue in red, the surrounding side chains, and the rest of the structure as a C-alpha trace. You also see three water molecules. Spend a bit of time again, to get a sense for the spatial context.
    • Now let's mutate Alanine 42 residue to isoleucine.
    select #1:42
    ui tool show rotamers ;# or using the menu: Tools → Structure Editing → Rotamers
    • Choose ILE as the rotamer type. Click OK, a window will pop up that shows you the possible rotamers for isoleucine together with their database-derived probabilities; you can select them in the window and cycle through them with your arrow keys. But note that the probabilities are very different - and thus show you high-energy and low-energy rotamers to choose from. Therefore, unless you have compelling reasons to do otherwise, try to find the highest-probability rotamer that may fit. This is where your stereo viewing practice becomes important, if not essential. It is really, really hard to do this reasonably in a 2D image! It becomes quite obvious in 3D.
    • I find that the first rotamer is actually not such a bad fit, and that number five is also quite plausible. Regarding the first rotamer the CD atom comes close to the sidechains of I 25 and L 96. But we can assume that these are somewhat mobile and can accommodate a denser packing, because - as you can easily verify in your MSA - it is NOT the case that sequences that have I 42, have a smaller residue in position 25 and/or 96. So let's accept the most frequent ILE rotamer by selecting it in the rotamer window and clicking OK.
    • Done.


    What we have done here with one residue is exactly the way homology modeling works with entire sequences. The homology modelling program simply changes all amino acids to the residues of the target sequence, based on the template structure. Let's now build a homology model for MYSPE Mbp1.


     

    Preparation

    • We need to define our Target sequence;
    • find a suitable structural Template; and
    • build a Model.


    Target sequence

    We have encountered the PDB 1BM8 structure before, the APSES domain of Saccharomyces cerevisiae Mbp1. This is a useful template to model the DNA binding domain of your RBM match. You have defined the sequence in the BIN-ALI-Optimal_sequence_alignment unit. Let's retrieve it. Open RStudio and load the project.


     
    # Recreate the database
    source("./myScripts/makeProteinDB.R")
    
    # A: Define your TARGET sequence.
    #      You have defined a feature annotation for the MYSPE APSES domain in
    #      the BIN-ALI-Optimal_sequence_alignment unit's R code. Retrieve its
    #      sequence from the feature annotation to get the TARGET sequence.
    #
    
    ( targetName <- sprintf("MBP1_%s", biCode(MYSPE)) )
    
    # Get the protein IDs.
    ( sel <- which(myDB$protein$name == targetName) )
    ( proID <- myDB$protein$ID[sel] )
    
    # Find the feature ID in the feature table
    ( ftrID <- myDB$feature$ID[myDB$feature$name == "APSES fold"] )
    
    # Get the annotation ID.
    ( fanID <- myDB$annotation$ID[myDB$annotation$proteinID == proID &
                                 myDB$annotation$featureID == ftrID] )
    
    # Get the feature start and end:
    ( start <- myDB$annotation$start[fanID] )
    ( end   <- myDB$annotation$end[fanID] )
    
    # Extract the feature from the sequence
    targetSeq <- substring(myDB$protein$sequence[sel], first = start, last = end)
    
    # Name it
    names(targetSeq) <- targetName
    
    targetSeq
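
The extraction step above boils down to a <code>substring()</code> call with the start and end coordinates of the annotated feature. Here is a minimal sketch of just that step on toy data, so you can see what to expect without loading the database. The sequence and the coordinates are made up, and <code>toySeq</code>, <code>toyStart</code> and <code>toyEnd</code> are hypothetical names.

```r
# Toy "protein": the 26 letters of the alphabet (made-up data)
toySeq <- paste(LETTERS, collapse = "")

# Hypothetical feature coordinates (1-based, inclusive)
toyStart <- 4
toyEnd   <- 10

# Extract the feature and name it
toyDomain <- substring(toySeq, first = toyStart, last = toyEnd)
names(toyDomain) <- "TOY_DOMAIN"

toyDomain

# The extracted length must equal end - start + 1
nchar(toyDomain) == toyEnd - toyStart + 1
```

Comparing the length of the extracted string against the annotated coordinates is a cheap way to catch off-by-one errors in the annotation.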
    


     

    Template choice and template sequence

    The SWISS-MODEL server provides several different options for constructing homology models. The easiest option requires only a target sequence as input. In this mode the program will automatically choose suitable templates and create an input alignment. I think that is not the best way to use such a service: template choice and alignment both may be significantly influenced by biochemical reasoning, and an automated algorithm cannot make the necessary decisions. Should you use a structure of reduced resolution that however has a ligand bound? Should you move an indel from an active site to a loop region even though the sequence similarity score might be less? Questions like that may have answers that are different from the best choices an automated algorithm could make. But Swiss Model is flexible and allows us to upload an explicit alignment between target and template. Please note: the model you will produce is "easy" - the sequence similarity is high and there are no significant indels to consider, the automated mode would have done just as well. But the strategy we pursue here is also suitable for significantly more difficult problems. The automated strategy may not be. More control over the process is a good thing.

    Template choice is the first step. Often more than one related structure can be found in the PDB. The degree of sequence identity is the most important criterion, but there are many other factors to consider. Please refer to the template choice principles page on this Wiki where I discuss more details and alternatives. To find related structures, you can search the PDB itself through its Advanced Search interface; for example one can search for sequence similarity with a BLAST search, or search for structural similarity by accessing structures according to their CATH or SCOP classification. But the BLAST search is probably the method of choice: after all, the most important measure of the probability of success for homology modelling is sequence similarity.


    Defining a template means finding a PDB coordinate set that has sufficient sequence similarity to your target that you can build a model based on that template. To find suitable PDB structures, we will perform a BLAST search at the PDB.




    Task:

    • Navigate to the BLAST query page.
    • Click on Search → Advanced Search to enter the advanced search interface.
    • Paste your MYSPE APSES domain sequence (the "target sequence") and choose Protein Data Bank proteins as the database. The default parameters will work fine. Click BLAST.

    All hits that are homologs are potentially suitable "templates", but some are more suitable than others. Consider how the coordinate sets differ and which features would make each more or less suitable for creating a homology model: you should consider ...

    • sequence similarity to your target
    • size of expected model (= length of alignment)
    • presence or absence of ligands
    • experimental method and quality of the data set

    As of September 2020, you should find four reasonable candidate structures from two species, three of which are from the same species. Some of the yeast sequences have longer chain lengths ... but those are only disordered residues (otherwise these would be better suited templates; regrettably, in the real world you would need to check that yourself, since there is no automatic tool to evaluate disorder and its effects on template choice). Depending on MYSPE, your ideal template will be either 1BM8 or 4UX5. Let's consider both.

    Finally:
    • Click on the Accession numbers to navigate to the sequence entries for those templates and save the FASTA sequences to your project directory. Name one file ./myScripts/1BM8_A.fa and the other file ./myScripts/4UX5_A.fa (save only chain A for 4UX5). These are your template sequences.
    • Then visit the PDB entry pages and learn more about the structures - things like resolution, status of ligands, mutations etc.


     

    Sequence numbering

    It is not at all straightforward how to number sequences in such a project. A "natural" numbering starts with the start codon of the full-length protein and goes sequentially from there. However, this does not map exactly to other numbering schemes we have encountered. As you know, the first residue of the APSES domain (as defined by CDD) is not residue 1 of the Mbp1 protein. The first residue of the 1BM8 FASTA file (one of the related PDB structures) is the fourth residue of the Mbp1 protein. The first residue in the structure is GLN 3, therefore Q is the first residue in a FASTA sequence derived from the coordinate section of the PDB file (the ATOM records). In the 1MB1 structure, the original N-terminal amino acids are present in the molecule, therefore they are present in the FASTA file, which starts with MSNQIY..., but they are disordered in the structure and no coordinates are present for M and S. A sequence derived explicitly from the coordinates is therefore different from the reported FASTA sequence, which is really bad because that is what the modeling program has to work with ... and so on. It can get complicated. You need to remember: a sequence number is not absolute, but assigned in a particular context, and you need to be very careful to get this right.

    Fortunately, the numbering of the residues in the coordinate section of our template structure corresponds not to its FASTA sequence, but to the numbering of the gene. Otherwise we would need to renumber the sequence (e.g. by using the bio3d R package). If we did not do this, the sequence numbers in the model might not correspond to the sequence numbers of our target.
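
    The bookkeeping described above can be sketched in a few lines of base R. This is only an illustration: the function name alignedResNo() and the toy sequence are hypothetical and not part of the course scripts, and the residue number of the first residue must be supplied by you from the context you are working in.

```r
# Map each column of an aligned (gapped) sequence to a residue number.
# firstResNo is the number of the first residue in the chosen context,
# e.g. 4 if the sequence starts at the fourth residue of the gene.
alignedResNo <- function(alignedSeq, firstResNo = 1L) {
  chars <- unlist(strsplit(alignedSeq, ""))
  isRes <- chars != "-"                       # TRUE for residues, FALSE for gaps
  resNo <- rep(NA_integer_, length(chars))    # gap columns stay NA
  resNo[isRes] <- seq_len(sum(isRes)) + as.integer(firstResNo) - 1L
  names(resNo) <- chars
  resNo
}

# Toy example (hypothetical sequence): numbering starts at residue 4
x <- alignedResNo("QIY--SAR", firstResNo = 4L)
x
```

    Gap columns are returned as NA, so the same alignment column can be looked up in two different numbering contexts without shifting by hand.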


     

    The input alignment

    The sequence alignment between target and template is the single most important factor that determines the quality of your model. No comparative modeling process will repair an incorrect alignment; it is useful to consider a homology model rather like a three-dimensional map of a sequence alignment rather than a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these just because they are convenient. You should take advantage of the much more sophisticated alignment methods available. Analysis of wrong models can't be expected to produce right results.

    The best possible alignment is constructed from a multiple sequence alignment that includes at least the target and template sequence and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. Your alignment should have been carefully reviewed by you and wherever required, manually adjusted to move insertions or deletions between target and template out of the secondary structure elements of the template structure.
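
    One way to support this manual review is to flag indels that fall inside secondary structure elements of the template. The following is a minimal sketch under stated assumptions: the function indelsInSSE(), the toy sequences and the helix boundaries are all hypothetical, and in practice you would take the secondary structure ranges from the template's PDB entry.

```r
# Flag indel columns of a target/template alignment that fall inside a
# secondary structure element (SSE) of the template.
# sse: data frame with integer columns start and end (template residue numbers).
indelsInSSE <- function(targetAln, templateAln, sse, firstResNo = 1L) {
  tg <- unlist(strsplit(targetAln, ""))
  tp <- unlist(strsplit(templateAln, ""))
  stopifnot(length(tg) == length(tp))

  tpGap <- tp == "-"
  # Template residue number per column; a template gap carries the number
  # of the residue that precedes it.
  tpNo  <- cumsum(!tpGap) + as.integer(firstResNo) - 1L
  indel <- xor(tg == "-", tpGap)             # gap in exactly one sequence

  inSSE <- mapply(function(i, gap) {
    # An insertion after residue i splits the SSE only if i + 1 is also in it.
    upper <- if (gap) sse$end - 1L else sse$end
    any(i >= sse$start & i <= upper)
  }, tpNo, tpGap)

  which(indel & inSSE)                        # alignment columns to review
}

# Toy example; the helix boundaries are made up for illustration.
sse <- data.frame(start = 3L, end = 6L)
flagged <- indelsInSSE("AB-DEFGHIK", "ABCDEFG-IK", sse)
flagged
```

    The function only reports columns; deciding where a flagged indel should be moved remains a matter of biochemical judgement.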

    In most of the Mbp1 orthologues, we do not observe indels in the APSES domain regions, but in some we do. Evolutionary pressure on the APSES domains has selected against indels in the more than 600 million years these sequences have evolved independently in their respective species. To obtain an alignment between the template sequence and the target sequence from your species we fetch the Mbp1 sequences from our database, add the template sequences, and convert them to an AAStringSet.


     

    Task:
    Here's how we do this in R:

    # Get all MBP1 Sequences
    sel <- grep("^MBP1_", myDB$protein$name)
    
    # Extract the sequences
    MBP1Set <- myDB$protein$sequence[sel]
    
    # Name the sequences
    names(MBP1Set) <- myDB$protein$name[sel]
    
    # Read the template sequences
    seq1BM8 <- dbSanitizeSequence(readLines("./myScripts/1BM8_A.fa"))
    names(seq1BM8) <- "1BM8_A"
    seq4UX5 <- dbSanitizeSequence(readLines("./myScripts/4UX5_A.fa"))
    names(seq4UX5) <- "4UX5_A"
    
    # Add the template sequences to the MBP1set
    MBP1Set <- c(MBP1Set, seq1BM8, seq4UX5)
    
    # Turn it into a Biostrings::AAStringSet
    (MBP1Set <- Biostrings::AAStringSet(MBP1Set))   # You should have 13 sequences.
    
    # Calculate an msa
    (MBP1msa <- msa::msaMuscle(MBP1Set))
    
    # Inspect the msa
    writeALN(fetchMSAmotif(MBP1msa, seq1BM8)) # and ...
    writeALN(fetchMSAmotif(MBP1msa, seq4UX5))
    
    
    
    You need to decide which of the templates you will use. Choose either 1BM8 or 4UX5, depending on which template has higher sequence similarity to the target.
    Next, extract aligned target and template sequences, while masking gaps that are not needed for the aligned pair.
    
    # Write the alignments to file, we will need it later. Depending on which
    # template you have decided on, execute ...
    writeMFA(fetchMSAmotif(MBP1msa, seq1BM8), myCon = "./myScripts/APSES-MBP1.fa") # or ...
    writeMFA(fetchMSAmotif(MBP1msa, seq4UX5), myCon = "./myScripts/APSES-MBP1.fa")
    
    # We extract the TARGET and TEMPLATE sequence, and remove any hyphens that
    # they both share. Remember: the TARGET is the MYSPE sequence in this alignment,
    # the TEMPLATE is either 1BM8_A or 4UX5_A. You need to edit this code so it
    # identifies the correct sequences for your situation:
    
    myT <- seq1BM8 # either ...
    myT <- seq4UX5 # ... or .
    
    targetSeq   <- as.character(fetchMSAmotif(MBP1msa, myT)[targetName])
    templateSeq <- as.character(fetchMSAmotif(MBP1msa, myT)[names(myT)])
    
    # Drop positions in which both sequences have hyphens.
    targetSeq   <- unlist(strsplit(targetSeq,   ""))
    templateSeq <- unlist(strsplit(templateSeq, ""))
    gapMask <- ! ((targetSeq == "-") & (templateSeq == "-"))
    targetSeq   <- paste0(targetSeq[gapMask], collapse = "")
    templateSeq <- paste0(templateSeq[gapMask], collapse = "")
    
    # Assemble sequences into a set
    TTset <- character()
    TTset[1] <- targetSeq
    TTset[2] <- templateSeq
    names(TTset) <- c(targetName, names(myT))
    
    writeMFA(TTset)  # write output to multi FASTA format
    
    


    The result should be a two sequence alignment in multi-FASTA format, that was constructed from a number of supporting sequences and that contains your aligned target and template sequence. This is your input alignment for the homology modeling server. For MBP1_CRYNE aligned to 4UX5 the result looks like this:

    >MBP1_CRYNE
    MGKKVIASGGDNGPNTIYKATYSGVPVYEMVCR-DVAVMRRRSDAYLNATQILKVAGFDKPQRTRVLEREVQKGEHEKVQGGYGKYQGTWIPIERGLALAKQYGVEDILRPIIDYVPTSVSPPPAPKHSVAPPSKARRDK
    
    >4UX5_A
    MVKAAAAAASAPTGPGIYSATYSGIPVYEYQFGLKEHVMRRRVDDWINATHILKAAGFDKPARTRILEREVQKDQHEKVQGGYGKYQGTWIPLEAGEALAHRNNIFDRLRPIFEFSPGPDSPPPAPRH----TSKPKQPK
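
    To compare how similar a target is to each candidate template, a simple percent identity can be computed from such an aligned pair. This is a rough sketch only: pairIdentity() is a hypothetical helper, the toy strings are not real sequences, and counting identities over columns where both sequences have a residue is just one of several common conventions.

```r
# Percent identity of two aligned (equal-length, gapped) sequences,
# counted over the columns in which both sequences have a residue.
pairIdentity <- function(a, b) {
  a <- unlist(strsplit(a, ""))
  b <- unlist(strsplit(b, ""))
  stopifnot(length(a) == length(b))
  aligned <- (a != "-") & (b != "-")
  nId     <- sum(aligned & (a == b))
  c(alignedColumns = sum(aligned),
    identical      = nId,
    pctIdentity    = round(100 * nId / sum(aligned), 1))
}

# Toy example (hypothetical sequences)
res <- pairIdentity("MK-VIAS", "MKAVLAS")
res
```

    Applied to the two strings of a target/template pair, this gives one number per candidate template that you can weigh against the other criteria for template choice.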
    


     

    Homology model

    The alignment defines the residue by residue relationship between target and template sequence. All we need to do now is to change every residue of the template to the target sequence - that's what the homology modelling server will do.


    SwissModel

    Access the Swissmodel server at https://swissmodel.expasy.org and click on the Start Modelling button. Under the Supported Inputs, choose Target-Template Alignment.

    Task:

    • Paste the aligned sequences of the MYSPE target and the template into the form field. SwissModel will analyse the sequences and ask you to identify target and template. The MYSPE sequence is your target. The 1BM8 or 4UX5 sequence is the template. Make sure there are no extraneous spaces or special characters in your sequence.
    • Click Build Model to start the modeling process. This will take about a minute or so.
    • The resulting page returns information about the resulting model and its quality. You can rotate the model in the window on the right with the mouse. Regions that have a reddish hue have lower quality scores, i.e. they were harder to model or could not be modelled well with good geometry. Hovering the mouse over parts of the structure highlights the respective region of the sequence alignment.
    • Mouse over the Model 01 dropdown menu (under the icon of the template structure), and choose the PDB file. Note that the B-factor column of the coordinate section contains the QMEAN scores (between 0 and 1) that the server has calculated. Higher is better. Save the PDB file in your project directory and call it ./myScripts/MBP1_MYSPE-APSES.pdb.
    • Open the SwissModel documentation in a new tab. Read about the modelling process. There are a number of important technical details that help to understand what the computed coordinates of your model mean; you should pay special attention to the GMQE and QMEAN quality scores.
    • Also save:
      • The output page as pdf (for reference)
      • The modeling report (as pdf)


    Model interpretation

    We have spent a significant amount of time preparing data for the analysis, and in practice it usually turns out that way: preparing the data takes up the greatest part of our effort. The actual computational analysis is generally quite fast. And, unfortunately, the interpretation of results is often somewhat neglected. Don't be that way. Data does not explain itself. The interpretation of your computational results in a biological context is the most important part.


     


    The PDB file

    Task:
    Open your model coordinates PDB file in RStudio (which is an excellent plain-text editor) and consider the following questions:

    • What is the residue number of the first residue in the model? What should it be, based on the alignment? If you read about a sequence number such as "residue 45" in a manuscript, which residues of your model correspond to that number?

    That's not easy to tell. But it should be.
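
    If you would rather not read the residue numbers off by eye, the fixed-column layout of ATOM records makes them easy to extract. The sketch below uses base R only; firstResidue() is a hypothetical helper and the two toy records are made up. It relies on the PDB format convention that the residue name is in columns 18-20, the chain in column 22, and the residue number in columns 23-26.

```r
# Return residue name, chain and number of the first ATOM record.
firstResidue <- function(pdbLines) {
  atoms <- pdbLines[startsWith(pdbLines, "ATOM")]
  data.frame(resName = trimws(substr(atoms, 18, 20)),
             chain   = substr(atoms, 22, 22),
             resNo   = as.integer(trimws(substr(atoms, 23, 26))))[1, ]
}

# Two toy ATOM records, built with sprintf() so the columns line up exactly.
toy <- sprintf(
  "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f          %2s",
  1:2, c(" N", " CA"), "GLN", "A", 4L,
  c(1.5, 2.5), c(0, 1), c(-3.2, -2.1), 1.0, c(0.85, 0.80), c("N", "C"))

first <- firstResidue(toy)
first

# With your own model (path as used in this unit):
# firstResidue(readLines("./myScripts/MBP1_MYSPE-APSES.pdb"))
```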


    Renumbering the model

    As you can see from the coordinate file, SwissModel numbers the first residue "1" in the 1BM8-derived structure, and 14 in the 4UX5 structure: it does not keep the numbering of the template. We should renumber the model so we can compare the model and the template with the same residue numbers and thus interpret our model with reference to sequence numbers we find in the manuscript describing the template structure. (An alternative would be to renumber the model to correspond to the sequence it came from. Remember that we have only excised a domain from the full-length sequence.) Carefully doing this by hand will take you a bit less than an hour. Fortunately we can do this with bio3d.

    Task:

    1. Explore and execute the following R script. It assumes that your model is in your project directory and the file is called MBP1_MYSPE-APSES.pdb.
    
    if (! requireNamespace("bio3d", quietly=TRUE)) {
      install.packages("bio3d")
    }
    # Package information:
    #  library(help = bio3d)       # basic information
    #  browseVignettes("bio3d")    # available vignettes
    #  data(package = "bio3d")     # available datasets
    
    PDB_INFILE      <- "./myScripts/MBP1_MYSPE-APSES.pdb"
    PDB_OUTFILE     <- "./myScripts/MBP1_MYSPE-APSESrenum.pdb"
    
    
    iFirst <-  4  # residue number for the first residue if your template was 1BM8
    iFirst <- 14  # residue number for the first residue if your template was 4UX5
    
    
    # == Read the MYSPE pdb file
    
    MYSPEmodel <- bio3d::read.pdb(PDB_INFILE) # read the PDB file into a list
    
    MYSPEmodel           # examine the information
    MYSPEmodel$atom[1,]  # get information for the first atom
    
    # Explore ?bio3d::read.pdb and study the examples.
    
    # == Modify residue numbers for each atom
    resNum <- as.numeric(MYSPEmodel$atom[,"resno"])
    resNum
    resNum <- resNum - resNum[1] + iFirst  # add offset
    MYSPEmodel$atom[ , "resno"] <- resNum   # replace old numbers with new
    
    # check result
    MYSPEmodel$atom[ , "resno"]
    MYSPEmodel$atom[1, ]
    
    # == Write output to file
    bio3d::write.pdb(pdb = MYSPEmodel, file = PDB_OUTFILE)
    
    # Done. Open the renumbered PDB file in the RStudio editor
    # and confirm that this has worked.
    
    


     

    First visualization - colouring the model by energy

     

    SwissModel calculates energies for each residue of the model with a molecular mechanics forcefield. The SwissModel modeling summary page contains a plot of these energies as a function of sequence number. The values - between 0.0 and 1.0 - are stored in the PDB file's B-factor field.

    Task:

    • Start ChimeraX and load the model coordinates that you have just renumbered.
    • Set the camera to stereo to be able to examine details of the cluttered core of the protein.
    • Select all, hide cartoons and show Atoms, bonds to view the entire model structure.
    • Enter: color byattribute a:bfactor to get atom-level B-factor coloring.
    • Study the result: It seems that residues in the core of the protein have better energies (higher values) than residues at the surface, i.e. Swiss-Model was more confident in the predicted conformations. Why could that be the case?
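
    The same values can also be summarized numerically. The following sketch uses a small made-up atom table; it assumes the column names resno and b, as they appear in the $atom table returned by bio3d::read.pdb(), and the scores are invented for illustration.

```r
# Toy atom table with the column names bio3d uses in pdb$atom
# (values are hypothetical).
atoms <- data.frame(resno = c(4, 4, 5, 5, 6, 6),
                    resid = c("GLN", "GLN", "ILE", "ILE", "TYR", "TYR"),
                    b     = c(0.60, 0.70, 0.90, 0.80, 0.40, 0.50))

# Mean score per residue, then ordered from lowest to highest
resScore <- tapply(atoms$b, atoms$resno, mean)
ranked   <- resScore[order(resScore)]
ranked

# With your own model this might look like:
# atoms <- bio3d::read.pdb("./myScripts/MBP1_MYSPE-APSESrenum.pdb")$atom
```

    Listing the lowest-scoring residues first tells you where to look in ChimeraX when you ask why some regions were modelled with less confidence.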


     

    Modelling DNA binding

     

    One of the really interesting questions we can discuss with reference to our homology model is how sequence variation might result in changed DNA recognition sites, and then lead to changed cognate DNA binding sequences. In order to address this, we would need to generate a plausible structural model for how DNA is bound to APSES domains.

    Since there is currently no software available that would reliably model such a complex from first principles[1], we will base a model of a bound complex on homology modelling as well. This means we need to find a similar structure for which the position of bound DNA is known, then superimpose that structure with our model. This places the DNA molecule into the spatial context of the model we are studying. As a result of the PDB BLAST search we found 4UX5, from the Magnaporthe oryzae Mbp1 orthologue PCG2[2]: this is a protein-DNA complex structure.


     

    A homologous protein/DNA complex structure

     

    Task:

    The PCG2 / DNA complex
    • Open ChimeraX.
    • Load the 4UX5 structure. Spend some time exploring it. There are two domains of identical sequence bound to one DNA molecule.
      • If your homology model was based on 4UX5, Swiss-Model has already made two copies, and their orientation is the same as the template, so no superposition is required.
      • If your homology model was based on 1BM8: use the File → Open... menu option to load a second copy of the molecule. Then superimpose one copy on chain A of 4UX5, and the other copy on chain B: open a MatchMaker dialogue window with Tools → Structure Analysis → MatchMaker. Choose the radio button to match two specific chains and select 4UX5 chain A as the Reference chain, and one of your models as the Chain to match. Click Apply. Similarly superimpose the other copy of the model on chain B.
    • Color the 4UX5 protein chains grey.
    • Color the 4UX5 nucleic acid chains "by element", hide ribbons, show Atoms/Bonds and set nucleotide objects off.
    • Do the two molecules bind to the same DNA motif - the CGCG core of the "MCB-box"? Do the chains have protein:DNA interfaces with the cognate sequence, or are one (or both) proteins non-specific complexes? The conditions under which proteins crystallize can be harsh, and physiological function under these conditions is not guaranteed.[3] Indeed, Liu et al. (2015) report that at low concentrations a 1:1 complex is formed and the 2:1 Protein:DNA complex only forms at high concentrations. Figure 3. of their paper shows that the detailed contacts between protein and DNA are in fact not identical.
    • Select one of the residues of that loop in chain A by <control>-clicking on it and use Action → Set pivot to set the centre of rotation to that residue: this makes it easier to visualize the binding situation when you make the molecules larger.
    • Study the situation. Focus on Gly 84.A, especially the interaction of its carbonyl oxygen, which hydrogen bonds to the N2 atom of G8 in chain D. Gln 89.A hydrogen bonds to the N2 atom of G8 in chain C. Gly 84 and Gln 89 thus recognize a G:C C:G pair. In the B chain, Gly 84.B does not contact the DNA well, since it contacts residues of chain A, especially Gln 82.A. The carbonyl oxygen of Gly 84.B hydrogen bonds to Gln 89.B, and therefore Gln 89.B is not available to contact nucleotide bases. What do you think?


     

    In summary: superimposing our homology model with a protein:DNA complex has allowed us to consider how our target sequence might perform its function. This is supported by considering variations in structure between chain A and B of the protein DNA complex that may point to different binding modes, and it is further supported by being able to map structural conservation onto our model, to understand which residues play a structural or functional role that is shared within the entire family.


     

    Notes

    1. Rosetta may get the structure approximately right, Autodock may get the complex approximately right, but the coordinate changes involved in induced fit make the result unreliable - and we have no good way to validate whether the predicted complex is correct.
    2. Liu et al. (2015) Structural basis of DNA recognition by PCG2 reveals a novel DNA binding mode for winged helix-turn-helix domains. Nucleic Acids Res 43:1231-40. (pmid: 25550425)

      The MBP1 family proteins are the DNA binding subunits of MBF cell-cycle transcription factor complexes and contain an N terminal winged helix-turn-helix (wHTH) DNA binding domain (DBD). Although the DNA binding mechanism of MBP1 from Saccharomyces cerevisiae has been extensively studied, the structural framework and the DNA binding mode of other MBP1 family proteins remains to be disclosed. Here, we determined the crystal structure of the DBD of PCG2, the Magnaporthe oryzae orthologue of MBP1, bound to MCB-DNA. The structure revealed that the wing, the 20-loop, helix A and helix B in PCG2-DBD are important elements for DNA binding. Unlike previously characterized wHTH proteins, PCG2-DBD utilizes the wing and helix-B to bind the minor groove and the major groove of the MCB-DNA whilst the 20-loop and helix A interact non-specifically with DNA. Notably, two glutamines Q89 and Q82 within the wing were found to recognize the MCB core CGCG sequence through making hydrogen bond interactions. Further in vitro assays confirmed essential roles of Q89 and Q82 in the DNA binding. These data together indicate that the MBP1 homologue PCG2 employs an unusual mode of binding to target DNA and demonstrate the versatility of wHTH domains.

    3. This particular crystal structure however was crystallized from a Tris-buffer with 50mM NaCl at pH 8.0 - comparatively gentle conditions actually.


     


    About ...
     
    Author:

    Boris Steipe <boris.steipe@utoronto.ca>

    Created:

    2017-08-05

    Modified:

    2020-09-22

    Version:

    1.2

    Version history:

    • 1.2 2020 Updates; major rewrites for ChimeraX; BLAST now at NCBI; using ./myScripts directory consistently; no GeSHi ...
    • 1.1 Change from require() to requireNamespace() and use <package>::<function>() idiom.
    • 1.0 First live version
    • 0.1 First stub

    This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License.