Difference between revisions of "BIN-SX-Homology modelling"

Revision as of 13:49, 31 October 2017

Homology Modeling

Keywords: Homology modeling: alignment, alignment, alignment.

Caution!

This unit is under development. There is some content here, but it is incomplete and may change significantly: links may lead nowhere, the content is likely to be rearranged, and objectives, deliverables etc. may be incomplete or missing. Do not work with this material until it is updated to "live" status.

Abstract

This unit introduces the principles of modelling structures based on the known coordinates of a homologue. The key to successful modelling is a carefully done multiple sequence alignment.

This unit ...

Prerequisites

You need to complete the following units before beginning this one:

Objectives

This unit will ...

... introduce the principles behind homology modeling of structures;
... teach how to produce a structural model of the MBP1_MYSPE APSES domain;
... demonstrate how to analyze the model.

Outcomes

After working through this unit you ...

... can produce a homology model using the Swiss-Model server;
... can work with Chimera to analyze its structural details.

Deliverables

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Evaluation

Evaluation: NA

This unit is not evaluated for course marks.

[ PubMed ] [ DOI ] Protein structure homology modelling has become a routine technique to generate 3D models for proteins when experimental structures are not available. Fully automated servers such as SWISS-MODEL with user-friendly web interfaces generate reliable models without the need for complex software packages or downloading large databases. Here, we describe the latest version of the SWISS-MODEL expert system for protein structure modelling. The SWISS-MODEL template library provides annotation of quaternary structure and essential ligands and co-factors to allow for building of complete structural models, including their oligomeric structure. The improved SWISS-MODEL pipeline makes extensive use of model quality estimation for selection of the most suitable templates and provides estimates of the expected accuracy of the resulting models. The accuracy of the models generated by SWISS-MODEL is continuously evaluated by the CAMEO system. The new web site allows users to interactively search for templates, cluster them by sequence similarity, structurally compare alternative templates and select the ones to be used for model building. In cases where multiple alternative template structures are available for a protein of interest, a user-guided template selection step allows building models in different functional states. SWISS-MODEL is available at http://swissmodel.expasy.org/.

Introduction

In order to understand how specific residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to study an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. Explanations of a protein's observed properties and functions can't rely on the general fact that it binds DNA; we need to consider details in terms of specific residues and their spatial arrangement. In particular, it would be interesting to correlate the conservation patterns of key residues with their potential to make specific DNA binding interactions. Unfortunately, the experimental evidence (Taylor et al., 2000) is not sufficient to unambiguously define the details of how a DNA double helix might be bound. Moreover, several distinct modes of DNA binding are known for proteins of the winged-helix superfamily, of which the APSES domain is a member.

In this assignment you will construct a molecular model of the APSES domain from the Mbp1 RBM orthologue in MYSPE.

For the following, please remember this terminology:

Target: The protein that you are planning to model.
Template: The protein whose structure you are using as a guide to build the model.
Model: The structure that results from the modelling process. It has the Target sequence and is similar to the Template structure.

The basic idea - a Point Mutation

To illustrate in principle how force fields modify protein structure, let's consider changing a single amino acid in a structural template and then minimizing the structure's energy.

Such minimal changes to structure models can be done directly in Chimera. Let us consider the residue A 42 of the 1BM8 structure. It is oriented towards the core of the protein, but as the MSA shows, most other Mbp1 orthologs have a larger amino acid in this position: V, or even I.

Task:

Open 1BM8 in Chimera, hide the ribbons and show all protein atoms as a stick model.
Color the protein white.
Open the sequence window and select A 42. Color it red. Choose Actions → Set pivot. Then study how nicely the alanine sidechain fits into the cavity formed by its surrounding residues.
To appreciate the packing, i.e. the van der Waals contacts, select the protein atoms and display them as a sphere model. Use the Favorites → Side view panel to move the clipping plane and see a section through the protein. Study the packing; in particular, note that the additional methyl groups of a valine or isoleucine would not have enough space in the structure. Then restore the clipping planes so you can see the whole molecule.
Let's simplify the view: choose Actions → Atoms/Bonds → show and Actions → Atoms/Bonds → backbone only → chain trace. Then select A 42 again in the sequence window and choose Actions → Atoms/Bonds → show.
Add the surrounding residues: choose Select → Zone.... In the window, see that the box is checked that selects all atoms at a distance of less than 5 Å from the current selection, and check the lower box to select the whole residue of any atom that matches the distance cutoff criterion. Click OK and choose Actions → Atoms/Bonds → show. You now have a very clear scene of the alanine residue in red, the surrounding side chains, and the rest of the structure as a C-alpha trace. You also see three water molecules. Spend a bit of time again, to get a sense for the spatial context^[1].
Select A 42 again: left-click (control click) on any atom of the alanine to select the atom, then up-arrow to select the entire residue. Now let's mutate this residue to isoleucine.
Choose Tools → Structure Editing → Rotamers and select ILE as the rotamer type. Click OK; a window will pop up that shows you the possible rotamers for isoleucine together with their database-derived probabilities. You can select them in the window and cycle through them with your arrow keys. Note that the probabilities are very different: high-probability rotamers correspond to low-energy conformations and low-probability rotamers to high-energy ones. Therefore, unless you have compelling reasons to do otherwise, try to find the highest-probability rotamer that may fit. This is where your stereo viewing practice becomes important, if not essential. It is really, really hard to do this reasonably in a 2D image! It becomes quite obvious in 3D.
I find that the first rotamer is actually not such a bad fit. The CD atom comes close to the sidechains of I 25 and L 96. But we can assume that these are somewhat mobile and can accommodate a denser packing, because - as you can easily verify in your MSA - it is NOT the case that sequences that have I 42, have a smaller residue in position 25 and/or 96. So let's accept the most frequent ILE rotamer by selecting it in the rotamer window and clicking OK (while existing side chain(s): replace is selected).
Done.
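
The zone selection used in the task above is, in essence, a distance test. As a minimal sketch, assuming atom coordinates are available as a numeric matrix with one row per atom, the following base R code selects every residue that has at least one atom closer than 5 Å to a chosen residue. The coordinates and the function name zoneResidues() are invented for illustration; they are not taken from 1BM8 or from Chimera.

```r
# Toy coordinates (invented for illustration): one row per atom, in Å.
xyz <- matrix(c( 0.0, 0.0, 0.0,    # residue 42, atom 1
                 1.5, 0.0, 0.0,    # residue 42, atom 2
                 4.0, 0.0, 0.0,    # residue 25, atom 1
                 9.0, 0.0, 0.0,    # residue 25, atom 2
                12.0, 0.0, 0.0,    # residue 96, atom 1
                 0.0, 6.6, 0.0),   # residue 60, atom 1
              ncol = 3, byrow = TRUE)
resno <- c(42, 42, 25, 25, 96, 60)

# Return the numbers of all residues that have at least one atom closer
# than `cutoff` to any atom of residue `center` (whole-residue selection).
zoneResidues <- function(xyz, resno, center, cutoff = 5.0) {
  iCenter <- which(resno == center)
  d <- as.matrix(dist(xyz))                        # pairwise atom distances
  dMin <- apply(d[ , iCenter, drop = FALSE], 1, min)
  near <- dMin < cutoff
  sort(unique(resno[near & resno != center]))
}

zone <- zoneResidues(xyz, resno, center = 42)
print(zone)
```

Selecting whole residues rather than single atoms mirrors the lower checkbox of the Zone dialogue: one atom within the cutoff is enough to include the entire residue.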

If you want to go over this in more detail, check the video tutorial on YouTube published by the NIAID bioinformatics group here. I would also encourage you to go over Part 2 of the video tutorial that discusses how to check for and resolve (by energy minimization) steric clashes. But do remember that it is not clear whether energy minimization will make your structure more correct in the sense of a smaller overall RMSD with the real, mutated protein.

Incidentally: What we have done here with one residue is exactly the way homology modeling works with entire sequences. The homology modelling program simply changes all amino acids to the residues of the target sequence, based on the template structure. Let's now build a homology model for MYSPE Mbp1.

Preparation

We need to define our Target sequence;
find a suitable structural Template; and
build a Model.

Target sequence

We have encountered the PDB 1BM8 structure before: the APSES domain of Saccharomyces cerevisiae Mbp1. This is a useful template to model the DNA binding domain of your RBM match. You have defined the sequence in the BIN-ALI-Optimal_sequence_alignment unit. Let's retrieve it. Open RStudio and load the project.

library(msa)

# Recreate the database
source("makeProteinDB.R")

# A: Define your TARGET sequence.
#      You have defined a feature annotation for the MYSPE APSES domain in
#      the BIN-ALI-Optimal_sequence_alignment unit's R code. Retrieve its
#      sequence from the feature annotation to get the TARGET sequence.
#

(targetName <- sprintf("MBP1_%s", biCode(MYSPE)))

# Get the protein IDs.
(sel <- which(myDB$protein$name == targetName))
(proID <- myDB$protein$ID[sel])

# Find the feature ID in the feature table
(ftrID <- myDB$feature$ID[myDB$feature$name == "APSES fold"])

# Get the annotation ID.
(fanID <- myDB$annotation$ID[myDB$annotation$proteinID == proID &
                             myDB$annotation$featureID == ftrID])

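# Get the feature start and end. fanID is an ID value, not necessarily a
# row index, so we select the annotation row whose ID matches.
(start <- myDB$annotation$start[myDB$annotation$ID == fanID])
(end   <- myDB$annotation$end[myDB$annotation$ID == fanID])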

# Extract the feature from the sequence
targetSeq <- substring(myDB$protein$sequence[sel], first = start, last = end)

# Name it
names(targetSeq) <- targetName

targetSeq

Template choice and template sequence

The SWISS-MODEL server provides several different options for constructing homology models. The easiest option requires only a target sequence as input. In this mode the program will automatically choose suitable templates and create an input alignment. I think that is not the best way to use such a service: template choice and alignment both may be significantly influenced by biochemical reasoning, and an automated algorithm cannot make the necessary decisions. Should you use a structure of reduced resolution that however has a ligand bound? Should you move an indel from an active site to a loop region even though the sequence similarity score might be less? Questions like that may have answers that are different from the best choices an automated algorithm could make. But Swiss Model is flexible and allows us to upload an explicit alignment between target and template. Please note: the model you will produce is "easy". The sequence similarity is high and there are no significant indels to consider, so the automated mode would have done just as well. But the strategy we pursue here is also suitable for much more difficult problems, for which the automated strategy may not be. More control over the process is a good thing.

Template choice is the first step. Often more than one related structure can be found in the PDB. The degree of sequence identity is the most important criterion, but there are many other factors to consider. Please refer to the template choice principles page on this Wiki where I discuss more details and alternatives. To find related structures, you can search the PDB itself through its Advanced Search interface; for example one can search for sequence similarity with a BLAST search, or search for structural similarity by accessing structures according to their CATH or SCOP classification. But the BLAST search is probably the method of choice: after all, the most important measure of the probability of success for homology modelling is sequence similarity.

Defining a template means finding a PDB coordinate set that has sufficient sequence similarity to your target that you can build a model based on that template. To find suitable PDB structures, we will perform a BLAST search at the PDB.

Task:

Navigate to the PDB.
Click on Advanced Search to enter the advanced search interface.
Open the menu to Choose a Query Type:
Find the Sequence features section and choose Sequence (BLAST...)
Copy the targetSeq sequence from the R console and paste it into the Sequence field, select BLAST as the search tool, select not to mask low-complexity regions and Submit Query. Since the E-value threshold is set rather high by default, you will get a number of low-confidence hits as well as the actual homologs; the homologs have very low E-values.

All hits that are homologs are potentially suitable templates, but some are more suitable than others. Consider how the coordinate sets differ and which features would make each more or less suitable for creating a homology model: you should consider ...

sequence similarity to your target
size of expected model (= length of alignment)
presence or absence of ligands
experimental method and quality of the data set

Sequence similarity is the most important, but we can have the PDB tabulate the other features concisely for this task.

There is a menu to create Reports: select customizable table.
Select (at least) the following information items:

Structure Summary: Experimental Method
Sequence: Chain Length
Ligands: Ligand Name
Biological details: Macromolecule Name
Refinement Details: Resolution, R Work, R Free

Click: Create report.

Unfortunately you don't get the E-values into the report, and those should strongly influence your final decision. As of October 2017, you should find four reasonable candidate structures from 2 species, three of which are from the same species. Some of the yeast sequences have longer chain lengths ... but the additional residues are only disordered (otherwise these would be better suited templates; regrettably, you'd need to check that in the real world, there is no automatic tool to evaluate disorder and its effects on template choice). Depending on MYSPE, your ideal template will be either 1BM8 or 4UX5. Let's consider both.

Finally: Click on the ID to navigate to the structure page for those templates and save the FASTA sequences to your project directory. Name one 1BM8_A.fa and the other 4UX5_A.fa (save only chain A for 4UX5). These are your template sequences.

The input alignment

The sequence alignment between target and template is the single most important factor that determines the quality of your model. No comparative modeling process will repair an incorrect alignment; it is useful to consider a homology model as a three-dimensional map of a sequence alignment rather than as a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these just because they are convenient. You should take advantage of the much more sophisticated alignment methods available. Analysis of wrong models can't be expected to produce right results.

The best possible alignment is constructed from a multiple sequence alignment that includes at least the target and template sequence and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. Your alignment should have been carefully reviewed by you and wherever required, manually adjusted to move insertions or deletions between target and template out of the secondary structure elements of the template structure.
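
Whether an indel touches a secondary structure element of the template can be checked with a few lines of R. The following is a minimal sketch, not part of the course scripts: it assumes that you have a secondary structure string with one character per template residue (H for helix, E for strand, - for loop), for example derived from a DSSP assignment. The sequences, the string, and the function name indelsInSSE() are invented for illustration.

```r
# Toy input (invented for illustration): an aligned target/template pair
# and a secondary structure string with one character per TEMPLATE residue.
target     <- "MKV--LSAGDEW"
template   <- "MKIAHLS-GDEW"
templateSS <- "-HHHHH--EEE"     # 11 characters for 11 template residues

# Return the alignment columns at which an indel touches a secondary
# structure element (SSE) of the template.
indelsInSSE <- function(target, template, ss) {
  tg <- unlist(strsplit(target,   ""))
  tp <- unlist(strsplit(template, ""))
  stopifnot(length(tg) == length(tp))
  stopifnot(sum(tp != "-") == nchar(ss))

  # Map the secondary structure states onto alignment columns.
  ssCol <- rep("-", length(tp))
  ssCol[tp != "-"] <- unlist(strsplit(ss, ""))

  # Deletions: a gap in the target opposite a structured template residue.
  del <- which(tg == "-" & ssCol %in% c("H", "E"))

  # Insertions: a gap in the template whose nearest template residues on
  # both sides belong to the same kind of SSE.
  ins <- integer(0)
  for (i in which(tp == "-")) {
    left  <- rev(which(tp[seq_len(i - 1)] != "-"))[1]
    right <- i + which(tp[seq_along(tp) > i] != "-")[1]
    if (!is.na(left) && !is.na(right) &&
        ssCol[left] %in% c("H", "E") && ssCol[left] == ssCol[right]) {
      ins <- c(ins, i)
    }
  }
  sort(c(del, ins))
}

flagged <- indelsInSSE(target, template, templateSS)
print(flagged)
```

Columns reported by such a check are candidates for manual adjustment; whether a gap should actually be moved remains a judgment that considers the other sequences in the multiple alignment.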

In most of the Mbp1 orthologues, we do not observe indels in the APSES domain regions, but in some we do. Evolutionary pressure on the APSES domains has selected against indels in the more than 600 million years these sequences have evolved independently in their respective species. To obtain an alignment between the template sequence and the target sequence from your species we fetch the Mbp1 sequences from our database, add the template sequences, and convert them to an AAStringSet.

Task:
Here's how we do this in R:

# Get all MBP1 Sequences
sel <- grep("^MBP1_", myDB$protein$name)

# Extract the sequences
MBP1Set <- myDB$protein$sequence[sel]

# Name the sequences
names(MBP1Set) <- myDB$protein$name[sel]

# Read the template sequences
seq1BM8 <- dbSanitizeSequence(readLines("1BM8_A.fa"))
names(seq1BM8) <- "1BM8_A"
seq4UX5 <- dbSanitizeSequence(readLines("4UX5_A.fa"))
names(seq4UX5) <- "4UX5_A"

# Add the template sequences to the MBP1set
MBP1Set <- c(MBP1Set, seq1BM8, seq4UX5)

# Turn it into an AAStringSet
(MBP1Set <- AAStringSet(MBP1Set))   # You should have 13 sequences.

# Calculate an msa
(MBP1msa <- msaMuscle(MBP1Set))

# Inspect the msa
writeALN(fetchMSAmotif(MBP1msa, seq1BM8)) # and ...
writeALN(fetchMSAmotif(MBP1msa, seq4UX5))

You need to decide which of the templates you will use. Choose either 1BM8 or 4UX5 - depending on which template has higher sequence similarity to the target. Next, extract aligned target and template sequences, while masking gaps that are not needed for the aligned pair.

# Write the alignments to file, we will need it later. Depending on which
# template you have decided on, execute ...
writeMFA(fetchMSAmotif(MBP1msa, seq1BM8), myCon = "APSES-MBP1.fa") # or ...
writeMFA(fetchMSAmotif(MBP1msa, seq4UX5), myCon = "APSES-MBP1.fa")

# We extract the TARGET and TEMPLATE sequence, and remove any hyphens that
# they both share. Remember: the TARGET is the MYSPE sequence in this alignment,
# the TEMPLATE is either 1BM8_A or 4UX5_A. You need to edit this code so it
# identifies the correct sequences for your situation:

myT <- seq1BM8 # either ...
myT <- seq4UX5 # ... or .

targetSeq   <- as.character(fetchMSAmotif(MBP1msa, myT)[targetName])
templateSeq <- as.character(fetchMSAmotif(MBP1msa, myT)[names(myT)])

# Drop positions in which both sequences have hyphens.
targetSeq   <- unlist(strsplit(targetSeq,   ""))
templateSeq <- unlist(strsplit(templateSeq, ""))
gapMask <- ! ((targetSeq == "-") & (templateSeq == "-"))
targetSeq   <- paste0(targetSeq[gapMask], collapse = "")
templateSeq <- paste0(templateSeq[gapMask], collapse = "")

# Assemble sequences into a set
TTset <- character()
TTset[1] <- targetSeq
TTset[2] <- templateSeq
names(TTset) <- c(targetName, names(myT))

writeMFA(TTset)  # write output to multi FASTA format

The result should be a two sequence alignment in multi-FASTA format, that was constructed from a number of supporting sequences and that contains your aligned target and template sequence. This is your input alignment for the homology modeling server. For MBP1_CRYNE aligned to 4UX5 the result looks like this:

>MBP1_CRYNE
MGKKVIASGGDNGPNTIYKATYSGVPVYEMVCR-DVAVMRRRSDAYLNATQILKVAGFDKPQRTRVLEREVQKGEHEKVQ
GGYGKYQGTWIPIERGLALAKQYGVEDILRPIIDYVPTSVSPPPAPKHSVAPPSKARRDK

>4UX5_A
MVKAAAAAASAPTGPGIYSATYSGIPVYEYQFGLKEHVMRRRVDDWINATHILKAAGFDKPARTRILEREVQKDQHEKVQ
GGYGKYQGTWIPLEAGEALAHRNNIFDRLRPIFEFSPGPDSPPPAPRH----TSKPKQPK
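
To compare how similar each candidate template is to the target, percent identity can be computed directly from an aligned pair like the one above. This is a minimal sketch in base R; the function name pairIdentity() is invented for illustration, and counting only the columns in which neither sequence has a gap is just one of several common conventions for the denominator.

```r
# Percent identity between two aligned sequences, counted over the
# columns in which neither sequence has a gap character.
pairIdentity <- function(a, b) {
  a <- unlist(strsplit(a, ""))
  b <- unlist(strsplit(b, ""))
  stopifnot(length(a) == length(b))
  aligned <- (a != "-") & (b != "-")
  100 * sum(a[aligned] == b[aligned]) / sum(aligned)
}

# Toy example (invented sequences): six gap-free columns, five identical.
toyIdentity <- pairIdentity("MKV-LSAG", "MKIALS-G")
print(toyIdentity)
```

With the objects of the script above, a call such as pairIdentity(TTset[1], TTset[2]) should report the value for your own target and template.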

Homology model

The alignment defines the residue by residue relationship between target and template sequence. All we need to do now is to change every residue of the template to the target sequence - that's what the homology modelling server will do.

SwissModel

Access the Swissmodel server at https://swissmodel.expasy.org and click on the Start Modelling button. Under the Supported Inputs, choose Target-Template Alignment.

Task:

Paste the aligned sequences of the MYSPE target and the template into the form field. SwissModel will analyse the sequences and ask you to identify target and template. The MYSPE sequence is your target. The 1BM8 or 4UX5 sequence is the template.

Click Build Model to start the modeling process. This will take about a minute or so.

The resulting page returns information about the resulting model and its quality. You can rotate the model in the window on the right with the mouse. Regions that have a reddish hue have lower quality scores, i.e. they were harder to model or could not be modelled well with good geometry. Hovering the mouse over parts of the structure highlights the respective region of the sequence alignment.

Mouse over the Model 01 dropdown menu (under the icon of the template structure), and choose the PDB file. Note that the B-factor column of the coordinate section contains the QMEAN scores (between 0 and 1) that the server has calculated. Higher is better. Save the PDB file in your project directory and call it MBP1_MYSPE-APSES.pdb.

Open the SwissModel documentation in a new tab. Read about the modelling process. There are a number of important technical details that help to understand what the computed coordinates of your model mean; you should pay special attention to the GMQE and QMEAN quality scores.

Also save:

- The output page as pdf (for reference)
- The modeling report (as pdf)
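
The per-residue quality values in the B-factor column of the saved PDB file can be read without any additional package, because ATOM records use fixed columns: the residue number is in columns 23 to 26 and the B-factor in columns 61 to 66. The following is a minimal sketch in base R. The toy records, the helper makeAtom() and the function readQuality() are invented for illustration, and the sketch assumes that all atoms of a residue carry the same value, so reading the CA atom is sufficient; you should verify this in your own file.

```r
# Build toy ATOM records in fixed-column PDB format (values invented).
makeAtom <- function(serial, name, resName, chain, resSeq, x, y, z, occ, b) {
  sprintf("ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f",
          serial, name, resName, chain, resSeq, x, y, z, occ, b)
}
toy <- c(makeAtom(1L, " N",  "MET", "A", 1L, 11.104, 13.207, 2.100, 1.00, 0.62),
         makeAtom(2L, " CA", "MET", "A", 1L, 12.560, 13.329, 2.250, 1.00, 0.62),
         makeAtom(3L, " CA", "LYS", "A", 2L, 14.912, 16.100, 3.015, 1.00, 0.83),
         "TER")

# Extract one quality value per residue from the CA atoms.
readQuality <- function(pdbLines) {
  atoms <- pdbLines[substr(pdbLines, 1, 6) == "ATOM  "]
  ca    <- atoms[trimws(substr(atoms, 13, 16)) == "CA"]
  data.frame(resno = as.integer(substr(ca, 23, 26)),
             score = as.numeric(substr(ca, 61, 66)))
}

q <- readQuality(toy)
print(q)
```

For the saved model, readQuality(readLines("MBP1_MYSPE-APSES.pdb")) should return a table that can be plotted against the residue number.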

Model interpretation

We have spent a significant amount of time preparing data for the analysis, and in practice it usually turns out that way: the preparation of data occupies the greatest part of our efforts. The actual computational analysis is generally quite fast. And, unfortunately, the interpretation of results is often somewhat neglected. Don't be that way. Data does not explain itself. The interpretation of your computational results is the most important part.

The PDB file

Task:
Open your model coordinates PDB file in RStudio (which is an excellent plain-text editor) and consider the following questions:

What is the residue number of the first residue in the model? What should it be, based on the alignment? If you read about a sequence number such as "residue 45" in a manuscript, which residues of your model correspond to that number?

That's not easy to tell. But it should be.

Renumbering the model

As you can see from the coordinate file, SwissModel numbers the first residue "1" in the 1BM8-derived structure, and 14 in the 4UX5 structure: it does not keep the numbering of the template. We should renumber the model so we can compare the model and the template with the same residue numbers and thus interpret our model with reference to sequence numbers we find in the manuscript describing the template structure. (An alternative would be to renumber the model to correspond to the sequence it came from. Remember that we have only excised a domain from the full-length sequence.) Carefully doing this by hand will take you a bit less than an hour. Fortunately we can do this with bio3d.
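
Both numbering schemes mentioned here amount to adding a constant offset, and only the value of the first residue number differs. A minimal sketch with toy numbers: the function name renumber() and the value of domainStart are hypothetical; domainStart stands in for the start position of the APSES domain in the full-length target sequence, i.e. the start value retrieved from the feature annotation.

```r
# Residue numbers as they might appear in a model file (toy values); the
# jump from 3 to 6 mimics residues that are missing from the model.
modelNum <- c(1, 2, 3, 6, 7)

# Shift all residue numbers by a constant so that the first one becomes
# `first`. Jumps in the numbering are preserved.
renumber <- function(resNum, first) {
  resNum - resNum[1] + first
}

# Template-based numbering (first residue 4, as for a 1BM8-derived model)
byTemplate <- renumber(modelNum, first = 4)

# Target-based numbering: hypothetical domain start in the full sequence
domainStart <- 21
byTarget <- renumber(modelNum, first = domainStart)

print(byTemplate)
print(byTarget)
```

Template-based numbering makes it easy to follow the literature on the template structure, whereas target-based numbering keeps the model consistent with your own sequence annotations.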

Task:

Explore and execute the following R script. It assumes that your model is in your project directory and the file is called MBP1_MYSPE-APSES.pdb.

if (! require(bio3d, quietly=TRUE)) {
  install.packages("bio3d")
  library(bio3d)
}
# Package information:
#  library(help = bio3d)       # basic information
#  browseVignettes("bio3d")    # available vignettes
#  data(package = "bio3d")     # available datasets

PDB_INFILE      <- "MBP1_MYSPE-APSES.pdb"
PDB_OUTFILE     <- "MBP1_MYSPE-APSESrenum.pdb"


iFirst <-  4  # residue number for the first residue if your template was 1BM8
iFirst <- 14  # residue number for the first residue if your template was 4UX5


# == Read the MYSPE pdb file
MYSPEmodel <- read.pdb(PDB_INFILE) # read the PDB file into a list

MYSPEmodel           # examine the information
MYSPEmodel$atom[1,]  # get information for the first atom

# Explore ?read.pdb and study the examples.

# == Modify residue numbers for each atom
resNum <- as.numeric(MYSPEmodel$atom[ , "resno"])
resNum
resNum <- resNum - resNum[1] + iFirst   # add offset
MYSPEmodel$atom[ , "resno"] <- resNum   # replace old numbers with new

# check result
MYSPEmodel$atom[ , "resno"]
MYSPEmodel$atom[1, ]

# == Write output to file (the file name was defined as PDB_OUTFILE above)
write.pdb(pdb = MYSPEmodel, file = PDB_OUTFILE)

# Done. Open the renumbered PDB file in the RStudio editor
# and confirm that this has worked.

First visualization - colouring the model by energy

SwissModel calculates energies for each residue of the model with a molecular mechanics forcefield. The SwissModel modeling summary page contains a plot of these energies as a function of sequence number. The values - between 0.0 and 1.0 - are stored in the PDB file's B-factor field.

Task:

Start Chimera and load the model coordinates that you have just renumbered.
Select all, hide Ribbons and show Atoms, bonds to view the entire model structure.
Choose Tools → Depiction → Render by attribute and select attributes of atoms, Attribute: bfactor, check color atoms and click OK.
Study the result: It seems that residues in the core of the protein have better energies (higher values) than residues at the surface. Why could that be the case?

Study the options of this window a bit: rendering by attribute is a powerful way to store and depict all manner of information with the molecule. You can simply write a little R script that uses bio3d to replace the B-factor or occupancy values with any value you might be interested in: energies, conservation scores, information ... whatever. Then render this property to map it on the 3D structure of your molecule...
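
The core of such a script is copying one value per residue onto all atoms of that residue. The following is a minimal sketch in base R that uses a small data frame as a stand-in for the atom table of a bio3d pdb object (where, to my knowledge, the residue number and B-factor columns are called resno and b). All values are invented for illustration.

```r
# Stand-in for the atom table of a pdb object: one row per atom (toy values).
atom <- data.frame(resno = c(4, 4, 4, 5, 5, 6),
                   elety = c("N", "CA", "C", "N", "CA", "N"),
                   b     = 0)

# Per-residue values to display, e.g. conservation scores (toy values).
score <- data.frame(resno = c(4, 5, 6),
                    value = c(0.9, 0.2, 0.5))

# Copy each residue's value to all of its atoms. Atoms of residues that
# have no score would receive NA and need to be handled before the
# coordinates are written to file.
atom$b <- score$value[match(atom$resno, score$resno)]
print(atom)
```

With a real model, the same assignment would be applied to the atom table of the object returned by read.pdb(), followed by write.pdb() to save a file that Chimera can render by attribute.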

Modelling DNA binding

One of the really interesting questions we can discuss with reference to our homology model is how sequence variation might result in changed DNA recognition sites, and then lead to changed cognate DNA binding sequences. In order to address this, we would need to generate a plausible structural model for how DNA is bound to APSES domains.

Since there is currently no software available that would reliably model such a complex from first principles^[2], we will base a model of a bound complex on homology modelling as well. This means we need to find a similar structure for which the position of bound DNA is known, then superimpose that structure with our model. This places the DNA molecule into the spatial context of the model we are studying. As a result of the PDB BLAST search we found 4UX5, from the Magnaporthe oryzae Mbp1 orthologue PCG2^[3]: this is a protein-DNA complex structure.

A homologous protein/DNA complex structure

Task:

The PCG2 / DNA complex

Open Chimera.
Load the 4UX5 structure. Spend some time exploring it. There are two domains of identical sequence bound to one DNA molecule.
- If your homology model was based on 4UX5, Swiss-Model has already made two copies, and their orientation is the same as the template, so no superposition is required.
- If your homology model was based on 1BM8: make a second copy of your model. Open the Tools → General → Model Panel and use the copy/combine button to create a copy of your model. Then superimpose one copy on chain A of 4UX5, and the other copy on chain B: open a MatchMaker dialogue window with Tools → Structure comparison → MatchMaker. Choose the radio button to match two specific chains and select 4UX5 chain A as the Reference chain, and one of your models as the Chain to match. Click Apply. Similarly superimpose the other copy of the model on chain B.

Color the 4UX5 protein chains grey.
Color the 4UX5 nucleic acid chains "by element", hide ribbons, show Atoms/Bonds and set nucleotide objects off.
Now color your model by conservation score:
- In the Multalign Viewer window choose Preferences → Headers, and in the Headers window choose the Headers tab and select Conservation style → AL2CO^[4]. Click OK.
- In the Multalign Viewer window choose Structure → Render by Conservation to open the "Render/Select by Attribute" Window. Select your Model. Select mavConservation as the "Attribute" to render. Note that you can move the blue, white and red coloured bars to adjust the way the colour scale is applied to the values. Click on the blue, white and red bar in turn and then on the colour swatch to change the colour. Choose a bright orange red for the low value threshold (high diversity), a dark red for the midpoint, and a dark grey for the high conservation values. Click on Apply. Are all residues that make protein-DNA interactions in the complex conserved between target and template? Are they conserved across the entire family?
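
To get a feeling for what a positional conservation score measures, here is a minimal sketch of an entropy-based score in base R. It is a simplified illustration and not the AL2CO implementation, which offers several weighting schemes and scoring methods. The alignment and the function name columnConservation() are invented, and the scaling to the range 0 to 1 is an assumption made for readability.

```r
# Toy alignment (invented): rows of equal length, "-" marks a gap.
aln <- c("MKVLS",
         "MKILS",
         "MRVLT",
         "MK-LS")

# Shannon-entropy based conservation per alignment column, scaled so that
# 1 means fully conserved and 0 means maximally diverse over 20 amino
# acids. Gap characters are ignored when frequencies are counted.
columnConservation <- function(aln) {
  m <- do.call(rbind, strsplit(aln, ""))
  apply(m, 2, function(col) {
    col <- col[col != "-"]
    if (length(col) == 0) { return(NA_real_) }
    p <- table(col) / length(col)
    H <- -sum(p * log2(p))
    1 - H / log2(20)
  })
}

cons <- columnConservation(aln)
print(round(cons, 2))
```

Columns one and four are invariant in this toy alignment and receive the maximum score; the remaining columns score lower, depending on how evenly their residues are distributed.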

Do the two molecules bind to the same DNA motif - the CGCG core of the "MCB-box"? Do the chains have protein:DNA interfaces with the cognate sequence, or are one (or both) proteins non-specific complexes? The conditions under which proteins crystallize can be harsh, and physiological function under these conditions is not guaranteed.^[5] Indeed, Liu et al. (2015) report that at low concentrations a 1:1 complex is formed and the 2:1 Protein:DNA complex only forms at high concentrations. Figure 3 of their paper shows that the detailed contacts between protein and DNA are in fact not identical.

Select one of the residues of that loop in chain A by <control>-clicking on it and use Actions → Set pivot to set the centre of rotation to that residue: this makes it easier to visualize the binding situation when you make the molecules larger.

Study the situation. Focus on Gly 84.A, especially the interaction of its carbonyl oxygen, which hydrogen bonds to the N2 atom of G8 in chain D. Gln 89.A hydrogen bonds to the N2 atom of G8 in chain C. Gly 84 and Gln 89 thus recognize a G:C C:G pair. In the B chain, Gly 84.B does not contact the DNA well, since it contacts residues of chain A, especially Gln 82.A. The carbonyl oxygen of Gly 84.B hydrogen bonds to Gln 89.B, and therefore Gln 89.B is not available to contact nucleotide bases. What do you think?

In summary: superimposing our homology model with a protein:DNA complex has allowed us to consider how our target sequence might perform its function. This is supported by considering variations in structure between chain A and B of the protein DNA complex that may point to different binding modes, and it is further supported by being able to map structural conservation onto our model, to understand which residues play a structural or functional role that is shared within the entire family.
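
The fitting step of a superposition can be sketched in a few lines of base R. The following uses the Kabsch algorithm, a standard least-squares method, and assumes that the two coordinate sets are already paired row by row; as far as I know, MatchMaker first derives that pairing from a sequence alignment, which is not shown here. The coordinates and the function name superimpose() are invented for illustration.

```r
# Least-squares superposition (Kabsch algorithm) of `mobile` onto `ref`.
# Both are n x 3 coordinate matrices with rows in matched order.
superimpose <- function(mobile, ref) {
  stopifnot(all(dim(mobile) == dim(ref)), ncol(ref) == 3)
  cM <- colMeans(mobile)
  cR <- colMeans(ref)
  M0 <- sweep(mobile, 2, cM)             # centre both sets on the origin
  R0 <- sweep(ref,    2, cR)
  s  <- svd(t(M0) %*% R0)                # SVD of the 3 x 3 covariance matrix
  d  <- sign(det(s$v %*% t(s$u)))        # guard against a reflection
  rot <- s$v %*% diag(c(1, 1, d)) %*% t(s$u)
  fitted <- sweep(M0 %*% t(rot), 2, cR, FUN = "+")
  list(xyz = fitted, rmsd = sqrt(mean(rowSums((fitted - ref)^2))))
}

# Toy reference coordinates (invented), and a rotated and shifted copy.
ref <- matrix(c(0, 0, 0,
                1, 0, 0,
                0, 2, 0,
                0, 0, 3,
                1, 1, 1), ncol = 3, byrow = TRUE)
a  <- 40 * pi / 180
Rz <- matrix(c(cos(a), -sin(a), 0,
               sin(a),  cos(a), 0,
               0,       0,      1), ncol = 3, byrow = TRUE)
mobile <- sweep(ref %*% t(Rz), 2, c(5, -2, 1), FUN = "+")

fit <- superimpose(mobile, ref)
print(fit$rmsd)
```

Because the mobile set is an exact rigid-body copy of the reference, the RMSD after fitting should be zero within numerical precision. For a real model and template, the remaining RMSD reflects the structural differences between the matched atoms.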

Notes

↑ Chimera uses a default distance to screen that is too close and that exaggerates the depth of the scene to a degree that it is difficult to fuse the stereo pairs. Choose Tools → Viewing controls → Camera and set the distance to screen to 50 cm. This will make stereo viewing easier and will also give a better match between distance estimates in all three dimensions.
↑ Rosetta may get the structure approximately right, Autodock may get the complex approximately right, but the coordinate changes involved in induced fit make the result unreliable - and we have no good way to validate whether the predicted complex is correct.

↑ Liu et al. (2015) Structural basis of DNA recognition by PCG2 reveals a novel DNA binding mode for winged helix-turn-helix domains. Nucleic Acids Res 43:1231-40. (pmid: 25550425)

[ PubMed ] [ DOI ] The MBP1 family proteins are the DNA binding subunits of MBF cell-cycle transcription factor complexes and contain an N terminal winged helix-turn-helix (wHTH) DNA binding domain (DBD). Although the DNA binding mechanism of MBP1 from Saccharomyces cerevisiae has been extensively studied, the structural framework and the DNA binding mode of other MBP1 family proteins remains to be disclosed. Here, we determined the crystal structure of the DBD of PCG2, the Magnaporthe oryzae orthologue of MBP1, bound to MCB-DNA. The structure revealed that the wing, the 20-loop, helix A and helix B in PCG2-DBD are important elements for DNA binding. Unlike previously characterized wHTH proteins, PCG2-DBD utilizes the wing and helix-B to bind the minor groove and the major groove of the MCB-DNA whilst the 20-loop and helix A interact non-specifically with DNA. Notably, two glutamines Q89 and Q82 within the wing were found to recognize the MCB core CGCG sequence through making hydrogen bond interactions. Further in vitro assays confirmed essential roles of Q89 and Q82 in the DNA binding. These data together indicate that the MBP1 homologue PCG2 employs an unusual mode of binding to target DNA and demonstrate the versatility of wHTH domains.

↑

Pei & Grishin (2001) AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics 17:700-12. (pmid: 11524371)

[ PubMed ] [ DOI ] MOTIVATION: Amino acid sequence alignments are widely used in the analysis of protein structure, function and evolutionary relationships. Proteins within a superfamily usually share the same fold and possess related functions. These structural and functional constraints are reflected in the alignment conservation patterns. Positions of functional and/or structural importance tend to be more conserved. Conserved positions are usually clustered in distinct motifs surrounded by sequence segments of low conservation. Poorly conserved regions might also arise from the imperfections in multiple alignment algorithms and thus indicate possible alignment errors. Quantification of conservation by attributing a conservation index to each aligned position makes motif detection more convenient. Mapping these conservation indices onto a protein spatial structure helps to visualize spatial conservation features of the molecule and to predict functionally and/or structurally important sites. Analysis of conservation indices could be a useful tool in detection of potentially misaligned regions and will aid in improvement of multiple alignments. RESULTS: We developed a program to calculate a conservation index at each position in a multiple sequence alignment using several methods. Namely, amino acid frequencies at each position are estimated and the conservation index is calculated from these frequencies. We utilize both unweighted frequencies and frequencies weighted using two different strategies. Three conceptually different approaches (entropy-based, variance-based and matrix score-based) are implemented in the algorithm to define the conservation index. Calculating conservation indices for 35522 positions in 284 alignments from SMART database we demonstrate that different methods result in highly correlated (correlation coefficient more than 0.85) conservation indices. 
Conservation indices show statistically significant correlation between sequentially adjacent positions i and i + j, where j < 13, and averaging of the indices over the window of three positions is optimal for motif detection. Positions with gaps display substantially lower conservation properties. We compare conservation properties of the SMART alignments or FSSP structural alignments to those of the ClustalW alignments. The results suggest that conservation indices should be a valuable tool of alignment quality assessment and might be used as an objective function for refinement of multiple alignments. AVAILABILITY: The C code of the AL2CO program and its pre-compiled versions for several platforms as well as the details of the analysis are freely available at ftp://iole.swmed.edu/pub/al2co/.

↑ This particular crystal structure, however, was crystallized from a Tris buffer with 50 mM NaCl at pH 8.0 - comparatively gentle conditions.

Self-evaluation

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.

About ...

Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2017-10-30

Version:

1.0

Version history:

1.0 First live version
0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.





@@ Line 51: / Line 51: @@
 <!-- included from "../components/BIN-SX-Homology_modeling.components.wtxt", section: "objectives" -->
 This unit will ...
-* ... introduce ;
+* ... introduce the principles behind homology modeling of structures;
-* ... demonstrate ;
+* ... teach how to produce a structural model of the MBP1_MYSPE APSES domain;
-* ... teach ;
+* ... demonstrate how to analyze the model;
 {{Vspace}}
@@ Line 61: / Line 61: @@
 <!-- included from "../components/BIN-SX-Homology_modeling.components.wtxt", section: "outcomes" -->
 After working through this unit you ...
-* ... can ;
+* ... can produce a homology model using the Swiss-Model server;
-* ... are familar with ;
+* ... can work with Chimera to analyze its structural details.
-* ... have begun to.
 {{Vspace}}
@@ Line 332: / Line 331: @@
 writeALN(fetchMSAmotif(MBP1msa, seq1BM8)) # and ...
 writeALN(fetchMSAmotif(MBP1msa, seq4UX5))
 </source>
-You need to decide which of the templates you will use. Next we will extract aligned target and template sequences, while masking gaps we don't need for the aligned pair.
+You need to decide which of the templates you will use. '''Choose either 1BM8 or 4UX5 - depending on which template has higher sequence similarity to the target.''' Next, extract aligned target and template sequences, while masking gaps that are not needed for the aligned pair.
 <source lang="R">
-# Next, we extract the TARGET and TEMPLATE sequence, and remove any hyphens that
+# Write the alignment to file; we will need it later. Depending on which
+# template you have decided on, execute ...
+writeMFA(fetchMSAmotif(MBP1msa, seq1BM8), myCon = "APSES-MBP1.fa") # or ...
+writeMFA(fetchMSAmotif(MBP1msa, seq4UX5), myCon = "APSES-MBP1.fa")
+# We extract the TARGET and TEMPLATE sequence, and remove any hyphens that
 # they both share. Remember: the TARGET is the MYSPE sequence in this alignment,
 # the TEMPLATE is either 1BM8_A or 4UX5_A. You need to edit this code so it
@@ Line 422: / Line 428: @@
 {{task|1=
-Open your '''model''' coordinates in a text-editor (make sure you view the PDB file in a fixed-width font (like "courier") so all the columns line up correctly) and consider the following questions:
+Open your '''model''' coordinates PDB file in RStudio (which is an excellent plain-text editor) and consider the following questions:
-*What is the residue number of the first residue in the '''model'''? What should it be, based on the alignment? If the putative DNA binding region was reported to be residues 50-74 in the Mbp1 protein, which residues of your '''model''' correspond to that region?
+*What is the residue number of the first residue in the '''model'''? What should it be, based on the alignment? If you read about a sequence number such as "residue 45" in a manuscript, which residues of your '''model''' correspond to that number?
 That's not easy to tell. But it should be.
@@ Line 431: / Line 437: @@
-===R code: renumbering the model ===
+===Renumbering the model ===
-As you have seen above, SwissModel numbers the first residue "1" and does not keep the numbering of the template. We should renumber the model so we can compare the model and the template with the same residue numbers. (An alternative renumbering would renumber the model correspond to the sequence it came from. Remember that we have only excised a domain from the full-length sequence.) Carefully doing this by hand will take you a bit less than an hour. Fortunately there is a very useful R package that will help: '''bio3d'''.
+As you can see from the coordinate file, SwissModel numbers the first residue "1" in the 1BM8-derived structure, and 14 in the 4UX5 structure: it does '''not''' keep the numbering of the template. We should renumber the model so we can compare the model and the template with the same residue numbers and thus interpret our model with reference to sequence numbers we find in the manuscript describing the template structure. (An alternative renumbering would make the model correspond to the sequence it came from. Remember that we have only excised a domain from the full-length sequence.) Carefully doing this by hand will take you a bit less than an hour. Fortunately we can do this with '''bio3d'''.
 {{task|1=
-# Navigate to the [http://thegrantlab.org/bio3d/index.php '''bio3D'''] home page to . '''bio3d''' has recently been made available via CRAN - previously it had to be compiled from source.
-# Explore and execute the following '''R''' script. I am assuming that your model is in your <code>PROJECTDIR</code> folder, change paths and filenames as required.
+# Explore and execute the following '''R''' script. It assumes that your model is in your project directory and the file is called <code>MBP1_MYSPE-APSES.pdb</code>.
 <source lang="rsplus">
-setwd(PROJECTDIR)
+if (! require(bio3d, quietly=TRUE)) {
-PDB_INFILE      <- "MYSPEmodel.pdb"
+  install.packages("bio3d")
-PDB_OUTFILE     <- "MYSPEmodelRenumbered.pdb"
+  library(bio3d)
-# The bio3d package provides functions for working with
-# protein structures in R
-if (!require(bio3d, quietly=TRUE)) {
-	install.packages("bio3d")
-	library(bio3d)
 }
 # Package information:
@@ Line 460: / Line 456: @@
 #  data(package = "bio3d")     # available datasets
+PDB_INFILE      <- "MBP1_MYSPE-APSES.pdb"
+PDB_OUTFILE     <- "MBP1_MYSPE-APSESrenum.pdb"
-# == Read the MYSPE pdb file
-iFirst <- 4  # residue number for the first residue
+iFirst <-  4  # residue number for the first residue if your template was 1BM8
+iFirst <- 14  # residue number for the first residue if your template was 4UX5
+# == Read the MYSPE pdb file
 MYSPEmodel <- read.pdb(PDB_INFILE) # read the PDB file into a list
@@ Line 485: / Line 485: @@
 write.pdb(pdb = MYSPEmodel, file=PDBout)
-# Done. Open the PDB file you have written in a text editor
+# Done. Open the renumbered PDB file in the RStudio editor
 # and confirm that this has worked.
 </source>
 }}
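The core of the renumbering step is a constant shift of residue numbers. The following is only a minimal sketch of that idea in base R, not the script from this unit: the data frame <code>atoms</code> is a hypothetical stand-in for the <code>$atom</code> table that bio3d's <code>read.pdb()</code> returns, and <code>renumberResidues()</code> is an illustrative helper name, not a bio3d function. It assumes that the model has no insertions or deletions relative to the template numbering; otherwise a single offset is not sufficient.

```r
# Hypothetical stand-in for the atom table of a model. In bio3d, read.pdb()
# returns a list whose $atom data frame has a resno column like this one.
atoms <- data.frame(
  resno = c(1, 1, 2, 2, 3),
  elety = c("N", "CA", "N", "CA", "N"),
  stringsAsFactors = FALSE
)

# renumberResidues() is an illustrative helper, not part of bio3d.
# It shifts all residue numbers by one constant offset so that the
# lowest-numbered residue of the model receives the number iFirst.
renumberResidues <- function(atoms, iFirst) {
  offset <- iFirst - min(atoms$resno)
  atoms$resno <- atoms$resno + offset
  atoms
}

renumbered <- renumberResidues(atoms, iFirst = 4)
print(unique(renumbered$resno))
```

If the model already starts at the desired number, the offset is zero and the numbering is left unchanged.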
+{{Vspace}}
-&nbsp;
+===First visualization - colouring the model by energy===
+{{Smallvspace}}
-===First visualization===
-&nbsp;<br>
+SwissModel calculates energies for each residue of the model with a molecular mechanics forcefield. The SwissModel modeling summary page contains a plot of these energies as a function of sequence number. The values - between 0.0 and 1.0 - are stored in the PDB file's B-factor field.
-Since a homology model inherits its structural details from the '''template''', your model of the MYSPE sequence should look very similar to the original 1BM8 structure.
 {{task|1=
 # Start Chimera and load the '''model''' coordinates that you have just renumbered.
-# From the PDB, also load the '''template''' structure. (Use File &rarr; Fetch by ID ...)
+# Select all, hide Ribbons and show Atoms, bonds to view the entire model structure.
-# In the '''Favourites''' &rarr; '''Model Panel''' window you can switch between the two molecules.
-# Hide the ribbon and choose '''backbone only &rarr; full'''. You will note that the backbone of the two structures is virtually identical.
-# Next, choose '''Actions &rarr; Atoms/Bonds &rarr; show''' to display display the two molecules in a stick style and note how the sidechains have been modeled. Note especially how sidechain coordinates have been guessed, where the template had shorter sidechains than the target. It may be more clear if you hide H-atoms: '''Select &rarr; Chemistry &rarr; Element &rarr; H''' and '''Actions &rarr; Atoms/Bonds &rarr; hide'''
-# Display only residue 50 to 74 to focus on the putative helix-turn-helix domain. You can drag your mouse in the  '''Favourites &rarr; Sequence''', window to select the range then '''Select &rarr; Invert (selected model)''' and '''Actions &rarr; Atoms/Bonds &rarr; hide'''. Or you can use Chimera's commandline: <code>~display</code> to undisplay everything, <code>show #:50-74</code> to show this residue range for all models.
-# Study the result: a model of the HTH subdomain of MYSPE's RBM to Mbp1.
-}}
-&nbsp;
-==Coloring the model by energy ==
-SwissModel calculates energies for each residue of the model with a molecular mechanics forcefield. The SwissModel modeling summary page contains a plot of these energies as a function of sequence number like. The values - between 0.0 and 1.0 - are stored in the PDB file's B-factor field.
-{{task|1=
-# Back in Chimera, use the model panel to '''close''' the 1BM8 structure. Select all and show Atoms, bonds to view the entire model structure.
 # Choose '''Tools &rarr; Depiction &rarr; Render by attribute''' and select '''attributes of atoms''', '''Attribute: bfactor''', check '''color atoms''' and click '''OK'''.
 # Study the result: It seems that residues in the core of the protein have better energies (higher values) than residues at the surface. Why could that be the case?
@@ Line 528: / Line 509: @@
 Study the options of this window a bit: rendering by attribute is a powerful way to store and depict all manner of information with the molecule. You can simply write a little R script that uses bio3d to replace the B-factor or occupancy values with any value you might be interested in: energies, conservation scores, information ... whatever. Then render this property to map it on the 3D structure of your molecule...
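As a rough illustration of replacing B-factor values with a per-residue property, here is a sketch in base R under stated assumptions: <code>atoms</code> is a hypothetical stand-in for the <code>$atom</code> data frame of a structure read with bio3d's <code>read.pdb()</code> (which has <code>resno</code> and <code>b</code> columns), the <code>scores</code> vector is made-up example data, and <code>setBfactorByResidue()</code> is an illustrative helper name rather than a bio3d function.

```r
# Hypothetical atom table: residue numbers and current B-factor values.
atoms <- data.frame(
  resno = c(4, 4, 5, 5, 6),
  b     = c(0.71, 0.71, 0.55, 0.55, 0.80)
)

# Made-up per-residue scores, named by residue number.
scores <- c("4" = 0.9, "5" = 0.2, "6" = 0.5)

# setBfactorByResidue() is an illustrative helper, not part of bio3d.
# Every atom receives the score of the residue it belongs to.
setBfactorByResidue <- function(atoms, scores) {
  idx <- match(as.character(atoms$resno), names(scores))
  if (anyNA(idx)) {
    stop("No score for residue(s): ",
         paste(unique(atoms$resno[is.na(idx)]), collapse = ", "))
  }
  atoms$b <- unname(scores[idx])
  atoms
}

scored <- setBfactorByResidue(atoms, scores)
print(scored$b)
```

With a real structure one would apply the same assignment to the atom table of the object returned by <code>read.pdb()</code> and save the result with <code>write.pdb()</code>. Since the B-factor column of a PDB file is a fixed-width field with two decimal places, it is probably best to scale scores to a modest numeric range first.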
+{{Vspace}}
-&nbsp;
-&nbsp;
 ==Modelling DNA binding==
@@ Line 538: / Line 515: @@
 One of the really interesting questions we can discuss with reference to our homology model is how sequence variation might result in changed DNA recognition sites, and then lead to changed cognate DNA binding sequences. In order to address this, we would need to generate a plausible structural model for how DNA is bound to APSES domains.
-Since there is currently no software available that would reliably model such a complex from first principles<ref>''Rosetta'' may get the structure approximately right, ''Autodock'' may get the complex approximately right, but the coordinate changes involved in induced fit makes the result unreliable - and we have no good way to validate whether the predicted complex is correct. </ref>, we will base a model of  a bound complex on homology modelling as well. This means we need to find a similar structure for which the position of bound DNA is known, then superimpose that structure with our model. This places the DNA molecule into the spatial context of the model we are studying. It so happens that early in 2015 an APSES domain structure with bound DNA was published. You probably noticed it as a result of the PDB BLAST search: [http://www.rcsb.org/pdb/explore/explore.do?structureId=4UX5 '''4UX5'''], from the ''Magnaporthe oryzae'' Mbp1 orhologue PCG2<ref>{{#pmid: 25550425}}</ref>.
+Since there is currently no software available that would reliably model such a complex from first principles<ref>''Rosetta'' may get the structure approximately right, ''Autodock'' may get the complex approximately right, but the coordinate changes involved in induced fit make the result unreliable - and we have no good way to validate whether the predicted complex is correct. </ref>, we will base a model of a bound complex on homology modelling as well. This means we need to find a similar structure for which the position of bound DNA is known, then superimpose that structure with our model. This places the DNA molecule into the spatial context of the model we are studying. As a result of the PDB BLAST search we found [http://www.rcsb.org/pdb/explore/explore.do?structureId=4UX5 '''4UX5'''], from the ''Magnaporthe oryzae'' Mbp1 orthologue PCG2<ref>{{#pmid: 25550425}}</ref>: this is a protein-DNA complex structure.
-<!-- But can we also find (and align) distant relatives based purely on '''structural similarity''', ideally a protein-DNA complex? -->
+{{Vspace}}
 ===A homologous protein/DNA complex structure===
 {{task|1=
@@ Line 551: / Line 525: @@
 ; The PCG2 / DNA complex
-* Open Chimera and load the '''<code>4UX5</code>''' structure. Spend some time exploring it. There are two domains of identical sequence bound to one DNA molecule. The first question I would have is whether the two molecules bind to the same DNA motif - the CGCG core of the "MCB-box", and whether the observed protein:DNA interfaces are actually with the cognate sequence, or whether one (or both) proteins are non-specific complexes. The conditions under which proteins crystallize can be harsh, and physiological function under these conditions is not guaranteed.<ref>This particular crystal structure however was crystallized from a Tris-buffer with 50mM NaCl at pH 8.0 - comparatively gentle conditions actually.</ref> Indeed, Liu ''et al.'' (2015) report that at low concentrations a 1:1 complex is formed and the 2:1 Protein:DNA complex only forms at high concentrations. Figure 3. of their paper shows that the detailed contacts between protein and DNA are in fact '''not''' identical.
+* Open Chimera.
+* load the '''<code>4UX5</code>''' structure. Spend some time exploring it. There are two domains of identical sequence bound to one DNA molecule.
-* Without taking this question too far, let's get a quick view of the comparison by duplicating one domain of the structure and superimposing it on the other. The authors feel that chain <code>A</code> represents the tighter, more specific mode of interaction; so we will duplicate chain <code>B</code> and superpose the copy on <code>A</code>.
+**If your homology model was based on <code>4UX5</code>, Swiss-Model has already made two copies, and their orientation is the same as the template, so no superposition is required.
+** If your homology model was based on <code>1BM8</code>: make a second copy of your model. Open the '''Tools''' &rarr; '''General''' &rarr; '''Model Panel''' and use the '''copy/combine''' button to create a copy of your model. Then superimpose one copy on chain A of <code>4UX5</code>, and the other copy on chain B: open a '''MatchMaker''' dialogue window with '''Tools''' &rarr; '''Structure comparison''' &rarr; '''MatchMaker'''.  Choose the radio button to match two specific chains and select <code>4UX5</code> chain A as the '''Reference chain''', and one of your models as the '''Chain to match'''. Click '''Apply'''. Similarly superimpose the other copy of the model on chain B.
-* In Chimera, open the '''Favorites''' &rarr; '''Model Panel''' and use the '''copy/combine''' button to create a copy of the <code>4UX5</code> model. Call it <code>test</code>.
+*Color the <code>4UX5</code> protein chains grey.
-* '''Select''' chain B of the <code>test</code> model, then use '''Select''' &rarr; '''Invert (selected models)''' to apply the selection to everything in the <code>test</code> model '''except''' chain B.
+*Color the <code>4UX5</code> nucleic acid chains "by element", hide ribbons, show Atoms/Bonds and set nucleotide objects '''off'''.
-* Use '''Actions''' &rarr; '''Atoms/Bonds''' &rarr; '''delete''' to remove everything ''but'' Chain B.
+*Now color your model '''by conservation score''':
-* Select and colour the chain red.
+**In the Multalign Viewer window choose '''Preferences''' &rarr; '''Headers''', and in the Headers window choose the Headers tab and select '''Conservation style''' &rarr; '''AL2CO'''<ref>{{#pmid:11524371}}</ref>. Click '''OK'''.
-* Back on the Model Panel, select both models and use the '''match...''' dialogue to open a '''MatchMaker''' dialogue window.  Choose the radio button two match two specific chains and select <code>4UX5</code> chain A as the '''Reference chain''', <code>test</code> chain B as the '''Chain to match'''. Click '''Apply'''.
+**In the Multalign Viewer window choose '''Structure''' &rarr; '''Render by Conservation''' to open the "Render/Select by Attribute" Window. Select your Model. Select '''mavConservation''' as the "Attribute" to render. Note that you can move the blue, white and red coloured bars to adjust the way the colour scale is applied to the values. Click on the blue, white and red bar in turn and then on the colour swatch to change the colour. Choose a bright orange red for the low value threshold (high diversity), a dark red for the midpoint, and a dark grey for the high conservation values. Click on '''Apply'''. Are all residues that make protein-DNA interactions in the complex conserved between target and template? Are they conserved across the entire family?
-You will see that the superimposed structures are very similar, that the main difference is in the orientation of the disordered C-terminus, but also that there is a structural difference between the two structures around Gly 84 which inserts into the minor groove of the double helix.
+* Do the two molecules bind to the same DNA motif - the CGCG core of the "MCB-box"? Do the chains have protein:DNA interfaces with the cognate sequence, or are one (or both) proteins non-specific complexes? The conditions under which proteins crystallize can be harsh, and physiological function under these conditions is not guaranteed.<ref>This particular crystal structure, however, was crystallized from a Tris buffer with 50 mM NaCl at pH 8.0 - comparatively gentle conditions.</ref> Indeed, Liu ''et al.'' (2015) report that at low concentrations a 1:1 complex is formed and the 2:1 Protein:DNA complex only forms at high concentrations. Figure 3 of their paper shows that the detailed contacts between protein and DNA are in fact '''not''' identical.
 * Select one of the residues of that loop in chain A by &lt;control&gt;-clicking on it and use '''Actions''' &rarr; '''Set pivot''' to set the centre of rotation to that residue: this makes it easier to visualize the binding situation when you make the molecules larger.
-* Select residues 81 to 87 and the corresponding (sequence <code>VQGGYGKY</code>) and in both chains turn their ribbon display off and display this range as "sticks".
+* Study the situation. Focus on Gly 84.A, especially the interaction of its carbonyl oxygen, which hydrogen bonds to the N2 atom of G8.D chain. Gln 89.A hydrogen bonds to the N2 atom of G8.C chain. Gly 84 and Gln 82 thus recognize a G:C C:G pair. In the B chain, Gly 84.B does not contact the DNA well, since it contacts residues of chain A, especially Gln 82.A. The carbonyl atom of Gly 84.B hydrogen bonds to Gln 89.B, and therefore Gln 89.B is not available to contact nucleotide bases. What do you think?
-* Select '''nucleic acid''' in the '''structure''' submenu and turn ribbons and nucleotide objects off to display the DNA as sticks as well. Colour the DNA by element.
-* Study the situation. Focus on Gly 84.A, especially the interaction of its carbonyl oxygen, which hydrogen bonds to the N2 atom of G8.D chain. Gln 89.A hydrogen bonds to the N2 atom of G8.C chain. Gly 84 and Gln 82 thus recognize a G:C C:G pair. In the B chain, Gly 84.B does not contact the DNA well, since it contacts residues of chain A, especially Gln 82.A. The carbonyl atom of Gly 84.B hydrogen bonds to Gln 89.B. and therefore Gln89.B is not available to contact nucleotide bases. What do you think<ref>Besides the coordinate difference between the chains, if indeed chain B would be representative of a DNA "scanning" conformation, perhaps one should expect that the local DNA structure that chain B binds to is structurally closer to canonical B-DNA than the DNA binding interface of chain A...</ref>? It seems to me that a crucial interaction for the cognate sequence is contributed by Guanine 8,
-* Finally, use the Model Panel to select <code>test</code> and '''close''' it.
 }}
+{{Vspace}}
+In summary: superimposing our homology model with a protein:DNA complex has allowed us to consider how our target sequence might perform its function. This is supported by considering variations in structure between chain A and B of the protein:DNA complex that may point to different binding modes, and it is further supported by being able to map structural conservation onto our model, to understand which residues play a structural or functional role that is shared within the entire family.
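To make the idea of a per-position conservation score more concrete, the following base R sketch computes a simple unweighted, entropy-based index for each column of a small alignment. It is only an illustration of the general approach: it is not the AL2CO implementation that Chimera uses, the aligned sequences are made-up example data, <code>columnConservation()</code> is a hypothetical helper name, and the normalization by log2(20) is an assumption chosen so that a fully conserved column scores 1.

```r
# Made-up aligned sequences of equal length ("-" would denote a gap).
aln <- c("MQIYSA", "MQIYSG", "MEVYTA", "MQIFSA")

# columnConservation() is an illustrative helper, not AL2CO itself.
# For each alignment column it computes the Shannon entropy of the
# observed residue frequencies (gaps ignored) and converts it to a
# score between 0 and 1, where 1 means fully conserved.
columnConservation <- function(aln) {
  stopifnot(length(unique(nchar(aln))) == 1)   # sequences must be aligned
  m <- do.call(rbind, strsplit(aln, ""))       # one row per sequence
  apply(m, 2, function(col) {
    col <- col[col != "-"]
    if (length(col) == 0) return(NA_real_)     # column of gaps only
    p <- table(col) / length(col)              # residue frequencies
    H <- -sum(p * log2(p))                     # Shannon entropy in bits
    1 - H / log2(20)                           # normalize to 20 amino acids
  })
}

cons <- columnConservation(aln)
print(round(cons, 2))
```

Scores like these, one per alignment column, are the kind of values that could be mapped to residues and rendered on a structure via the B-factor field.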
-&nbsp;
-===Superimposing your model===
-Both your homology model and the template structure provide valuable information:
-* The template structure shows how conserved the structure is at the protein/DNA interface. You have seen what subtle differences can give rise to a sequence specific complex and a non-specific binding mode. For Mbp1 we know that the APSES domain binds to the same cognate DNA sequence as PCG2. Since your model structure is heavily biased towards the template, evaluating the template in the context of a real protein/DNA complex allows you to judge which binding residues appear to be conserved and possibly modelled in an orientation that is productive for binding.
-* The model structure maps sequence variation into that context: are the crucial residues for sequence specific binding conserved?
-{{task|1=
-* Start by loading your model and the 1BM8 structure into your chimera session. Select all, turn all ribbons off, and set all atoms to stick representation. Then select H atoms by element and '''hide''' them.
-* We need to visualize and evaluate differences in binding between different proteins and for me it works well to colour everything by element, and give the carbon atoms some identifying, distinct colour. This is best achieved through the Chimera command line that you can turn on with the little "computer" icon on the left-hand side of the graphics window. Have a look at the [https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/framecommand.html Chimera Users guide], and choose '''select''' to learn how Chimera's selection syntax works.
-* Open the Model Panel to check which protein has which Chimera-internal model number. Then you can use the following selection syntax. Instead of the model numbers, I will type <code>MYSPE</code>, <code>&lt;4ux5&gt;</code>, and <code>&lt;1BM8&gt;</code> - you will certainly know by now that these are placeholder labels and you need to replace them with the numbers <code>0</code>, <code>1</code>, and <code>2</code> instead.
-:* To colour the DNA carbon atoms white, type:<br />
-::<code>color white #&lt;4ux5&gt;:.C,.D & C</code>
-:* To colour the 4ux5 A chain carbon atoms grey, type:<br />
-::<code>color #878795 #&lt;4ux5&gt;:.A & C</code>  <small>Note: the color values after the first hash are rgb triplets in the hexadecimal numbering systems - exactly like in '''R'''.</small>
-:* To undisplay the 4ux5 B chain, type:<br />
-::<code>~display #&lt;4ux5&gt;:.B</code> <small>Note: this is the tilde character, not a hyphen or minus sign.</small>
-:* To colour the MYSPE model carbon atoms a pale reddish color, type:<br />
-::<code>color #b06268 #&lt;MYSPE&gt; & C</code>
-:* To colour the 1BM8 structure carbon atoms a pale greenish color, type:<br />
-::<code>color #92b098 #&lt;1BM8&gt; & C</code>
-* Ready? Let's superimpose the chains.
-** Select all models in the Model Panel and click on '''match'''.
-** Set 4ux5 Chain A as the Reference chain.
-** Select MYSPE as a '''Chain to match''', select the button for specific reference and specific match, and click '''Apply'''.
-** Repeat this with 1BM8 as the match chain.
-* Easy. Now enlarge the binding site. Remember that 4ux5 and 1BM8 are independently determined crystal structures, whereas MYSPE was modelled on 1BM8 and is expected to be '''very''' similar to it. To give you some guidance on what you should focus on, select the CA atom of 4ux5 residue 84 and display it as '''Ball & Stick'''. You can also repeat the '''Action''' "Set Pivot" in case the pivot has shifted.
-* Study the scene. This is where stereo vision will help '''a lot'''.
-* What do you think? Is this what you expected? Can you explain what you see? Was the modelling process successful?
-<!-- I see that the model is very good regarding the global fold, but completely different in the binding loop. This is not expected. -->
-* Now turn the display of 4ux5 chain B back on and turn chain A off instead. Then superimpose the 1BM8 template and your model on Chain B.
-* Again, focus on the binding region. What do you think of that? What would you have expected? Do you see a difference? What does this all mean?
-}}
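The placeholder substitution in the colouring commands above can be sketched in a few lines of base R. This is only an illustration: the model numbers assigned in <code>modelID</code> are hypothetical and must be replaced with the numbers that your own Model Panel shows.

```r
# Hypothetical model numbers: replace them with those shown in your Model Panel.
modelID <- c("4ux5" = 0L, "MYSPE" = 1L, "1BM8" = 2L)

# Build the Chimera command strings from the task, with the placeholder
# labels replaced by the actual model numbers.
cmds <- c(
  sprintf("color white #%d:.C,.D & C", modelID[["4ux5"]]),   # DNA carbons
  sprintf("color #878795 #%d:.A & C",  modelID[["4ux5"]]),   # 4ux5 chain A carbons
  sprintf("~display #%d:.B",           modelID[["4ux5"]]),   # undisplay 4ux5 chain B
  sprintf("color #b06268 #%d & C",     modelID[["MYSPE"]]),  # model carbons
  sprintf("color #92b098 #%d & C",     modelID[["1BM8"]])    # 1BM8 carbons
)

cat(cmds, sep = "\n")
```

The printed lines can then be typed or pasted into the Chimera command line one at a time.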
-NB: I haven't seen this before and I am completely intrigued by the results. In fact, I think I understand the protein much, much better now through this exercise. I'm very pleased with how this turned out.
-First of all, you may notice that not all of the structures are really different, despite our having requested only dissimilar sequences, and that not all images show DNA. This appears to be a deficiency of the algorithm. But you can also easily recognize how in most of the structures the '''recognition helix inserts into the major groove of B-DNA''' (e.g. 1BC8, 1CF7) and the wing - if clearly visible at all in the image - appears to make accessory interactions with the DNA backbone. There is one exception: the structure 1DP7 shows how the human RFX1 protein binds DNA in a non-canonical way, through the beta-strands of the "wing". This is interesting since it suggests there is more than one way for winged helix domains to bind to DNA. We can therefore use structural superposition of '''your homology model''' and '''two of the winged-helix proteins''' to decide whether the canonical or the non-canonical mode of DNA binding seems to be more plausible for Mbp1 orthologues.
-&nbsp;
-===Preparation and superposition of a canonical complex===
-&nbsp;<br>
-The structure we shall use as a reference for the '''canonical binding mode''' is the Elk-1 transcription factor.
-[[Image:A5_canonical_wHTH.jpg|frame|none|Stereo-view of the canonical DNA binding mode of the Winged Helix domain family. Shown here is the Elk-1 transcription factor - an ETS DNA binding domain - in complex with a high-affinity binding site (1DUX). Note how the "recognition helix" inserts into the major groove of the DNA molecule. The color gradient ramps from blue (34) to green (84). Note how the first helix of the "helix-turn-helix" architecture serves only to position the recognition helix and makes few interactions by itself.]]
-The 1DUX coordinate-file contains two protein domains and two B-DNA dimers in one asymmetric unit. For simplicity, you should delete the second copy of the complex from the PDB file. (Remember that PDB files are simply text files that can be edited.)
-{{task|1=
-# Find the 1DUX structure in the image gallery and open the 1DUX structure explorer page in a separate window. Download the coordinates to your computer.
-# Open the coordinate file in a text-editor (TextEdit or Notepad - '''NOT''' MS-Word!) and delete the coordinates for chains <code>D</code>,<code>E</code> and <code>F</code>; you may also delete all <code>HETATM</code> records and the <code>MASTER</code> record. Save the file with a different name, e.g. 1DUX_monomer.pdb .
-# Open VMD and load your homology model. Turn off the axes, display the model as a Tube representation in stereo, and color it by Index. Then load your edited 1DUX file, display this coordinate set in a tube representation as well, and color it by ColorID in some color you like. It is important that you can distinguish easily which structure is which.
-# You could use the Extensions&rarr;Analysis&rarr;RMSD calculator interface to superimpose the two structures '''if''' you knew which residues correspond to each other. Sometimes it is useful to do exactly that: define exact correspondences between residue pairs and superimpose according to these selected pairs. For our purpose it is much simpler to use the Multiseq tool (and the structures are simple and small enough that the STAMP algorithm for structural alignment can define corresponding residue pairs automatically). Open the '''multiseq''' extension window, select the check-boxes next to both protein structures, and open the '''Tools&rarr;Stamp Structural Alignment''' interface.
-# In the '''Stamp Alignment Options''' window, check the radio-button for ''Align the following ...'' '''Marked Structures''' and click on '''OK'''.
-# In the '''Graphical Representations''' window, double-click on all "NewCartoon" representations for both molecules, to undisplay them.
-# You should now see a superimposed tube model of your homology model and the 1DUX protein-DNA complex. You can explore it, display side-chains etc. and study some of the details of how a transcription factor recognizes and binds to its cognate DNA sequence. However, remember that your '''model''''s side-chain orientations have not been determined experimentally but inferred from the '''template''', and that the template's structure was determined in the absence of bound DNA ligand.
-# Orient and scale your superimposed structures so that their structural similarity is apparent, and the recognition helix can be clearly seen inserting into the DNA major groove. You may want to keep a copy of the image for future reference. Consider which parts of the structure appear to superimpose best.  Note whether it is plausible that your '''model''' could bind a B-DNA double-helix in this orientation.
-}}
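The manual editing of the coordinate file described above can also be sketched in base R, working directly on the text lines of the file. This is a minimal sketch under stated assumptions, not a replacement for checking the file by eye: the function name <code>stripPDB()</code>, the helper <code>pdbLine()</code> and the toy records are all hypothetical. It relies only on the fixed-column PDB layout, in which the record name occupies columns 1 to 6 and the chain ID is in column 22.

```r
# Remove unwanted records from the text lines of a PDB file:
# coordinate records of the chains in dropChains, plus all HETATM and MASTER records.
stripPDB <- function(lines, dropChains = c("D", "E", "F")) {
  rec   <- trimws(substr(lines, 1, 6))     # record name, columns 1-6
  chain <- substr(lines, 22, 22)           # chain ID, column 22
  dropCoord <- rec %in% c("ATOM", "ANISOU", "TER") & chain %in% dropChains
  dropOther <- rec %in% c("HETATM", "MASTER")
  lines[!(dropCoord | dropOther)]
}

# Helper that writes the first 26 columns of a toy record in PDB column layout.
pdbLine <- function(rec, serial, name, resn, chain, resi) {
  sprintf("%-6s%5d %-4s %3s %s%4d", rec, serial, name, resn, chain, resi)
}

# Toy input (hypothetical records, not taken from 1DUX).
demo <- c(
  pdbLine("ATOM",   1L, " P",  "DA",  "A", 1L),    # kept
  pdbLine("ATOM",   2L, " CA", "MET", "C", 5L),    # kept
  pdbLine("ATOM",   3L, " P",  "DA",  "D", 1L),    # dropped: chain D
  pdbLine("ATOM",   4L, " CA", "MET", "F", 5L),    # dropped: chain F
  pdbLine("HETATM", 5L, " O",  "HOH", "A", 101L),  # dropped: HETATM
  "MASTER",                                        # dropped
  "END"                                            # kept
)

kept <- stripPDB(demo)
print(kept)

# With a real file (the input file name is an assumption):
# writeLines(stripPDB(readLines("1DUX.pdb")), "1DUX_monomer.pdb")
```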
-&nbsp;<br>
-&nbsp;
-===Preparation and superposition of a non-canonical complex===
-The structure displaying a non-canonical complex between a winged-helix domain and its cognate DNA binding site is the human Regulatory Factor X.
-[[Image:A5_non-canonical_wHTH.jpg|frame|none|Stereo-view of a non-canonical wHTH-DNA complex, discovered with the structure of human Regulatory Factor X (hRFX) binding its cognate X-box DNA sequence (1DP7). Note how the helix that corresponds to the recognition helix in the canonical domain lies across the minor groove whereas the beta-"wing" inserts into the major groove. The color gradient ramps from blue (18) to green (68).]]
-Before we can work with this, however, we have to fix an annoying problem. If you download and view the <code>1DP7</code> structure in VMD, you will notice that there is only a single strand of DNA! Where is the second strand of the double helix? It is not in the coordinate file, because it happens to be exactly equivalent to the first strand, rotated around a two-fold axis of symmetry in the crystal lattice. We need to download and work with the so-called '''Biological Assembly''' instead. But there is a problem related to the way the PDB stores replicates in biological assemblies. The PDB generates the additional chains as copies of the original and delineates them with <code>MODEL</code> and <code>ENDMDL</code> records, just like in a multi-structure NMR file. The chain IDs and the atom numbers are the same as in the original. The PDB file thus contains the '''same molecule in two different orientations''', not '''two independent molecules'''. This is an important difference regarding how such molecules are displayed by VMD. '''If you try to use the biological unit file of the PDB, VMD does not recognize that there is a second molecule present and displays only one chain.''' And that looks exactly like the one we have seen before. We have to edit the file, extract the second DNA molecule, change its chain ID and then append it to the original 1DP7 structure<ref>My apologies if this is tedious. '''But''' in the real world, we encounter such problems a lot and I would be remiss not to use this opportunity to let you practice how to fix an issue that could otherwise be a roadblock in a project of yours.</ref>...
-{{task|1=
-# On the structure explorer page for 1DP7, select the option '''Download Files''' &rarr; '''PDB File'''.
-# Also select the option '''Download Files''' &rarr; '''Biological Assembly'''.
-# Uncompress the biological assembly file.
-# Open the file in a text editor.
-# Delete everything except the '''second DNA molecule'''. This comes after the <code>MODEL   2</code> line and has chain ID '''D'''. Keep the <code>TER</code> and <code>END</code> lines. Save this with a new filename (e.g. <code>1DP7_DNAonly.pdb</code>).
-# Also delete all <code>HETATM</code> records for <code>HOH</code>, <code>PEG</code> and <code>EDO</code>, as well as the entire second protein chain and the <code>MASTER</code> record. The resulting file should only contain the DNA chain and its copy and one protein chain. Save the file with a new name, e.g. <code>1DP7_BDNA.PDB</code>.
-# Use a procedure similar to [[BIO_Assignment_Week_8#R code: renumbering the model in the last assignment]] to change the chain ID.
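-<source lang="rsplus">
-library(bio3d)                        # provides read.pdb() and write.pdb()
-PDBin  <- "1DP7_DNAonly.pdb"          # input: the second DNA strand only
-PDBout <- "1DP7_DNAnewChain.pdb"      # output: the same strand with a new chain ID
-pdb <- read.pdb(PDBin)                # parse the coordinate records
-pdb$atom[ , "chain"] <- "E"           # relabel every atom as chain E
-write.pdb(pdb = pdb, file = PDBout)   # write the relabelled coordinates
-</source>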
-# Use your text-editor to open both the <code>1DP7.pdb</code> structure file and the  <code>1DP7_DNAnewChain.pdb</code>. Copy the DNA coordinates, paste them into the original file before the <code>END</code> line and save.
-# Open the edited coordinate file with VMD. You should see '''one protein chain''' and a '''B-DNA double helix'''. (Actually, the B-DNA helix has a gap, because the R-library did not read the BRDU nucleotide as DNA). Switch to stereo viewing and spend some time to see how '''amazingly beautiful''' the complementarity between the protein and the DNA helix is (you might want to display ''protein'' and ''nucleic'' in separate representations and color the DNA chain by ''Position'' &rarr; ''Radial'' for clarity) ... in particular, appreciate how not all positively charged side chains contact the phosphate backbone, but some penetrate into the helix and make detailed interactions with the nucleobases!
-# Then clear all molecules.
-# In VMD, open '''Extensions&rarr;Analysis&rarr;MultiSeq'''. When you run MultiSeq for the first time, you will be asked for a directory in which to store metadata. You can use the default, or a directory of your choice; you may subsequently skip all steps that ask you to install "required" databases locally since we will not need them for this task.
-# Choose '''File&rarr;Import Data''', browse to your directory and load one by one:
-:: -Your model;
-:: -The 1DUX complex;
-:: -The 1DP7 complex.
-# Mark all three protein chains by selecting the checkbox next to their name and choose '''Tools&rarr; STAMP structural alignment'''.
-# '''Align''' the '''Marked Structures''', choose a '''scanscore''' of '''2''' and '''scanslide''' of '''5'''. Also choose '''Slow scan'''. You may have to play around with the settings to get the molecules to superimpose: but they '''can''' be superimposed quite well - at least the DNA-binding helices and the wings should line up.
-# In the graphical representations window, double-click on the cartoon representations that multiseq has generated to undisplay them, also undisplay the Tube representation of 1DUX. Then create a Tube representation for 1DP7, and select a Color by ColorID (a different color that you like). The resulting scene should look similar to the one you have created above, only with 1DP7 in place of 1DUX and colored differently.
-# Orient and scale your superimposed structures so that their structural similarity is apparent, and the differences in binding elements are clear. Perhaps visualizing a solvent accessible surface of the DNA will help you understand the spatial requirements of the complex formation. You may want to keep a copy of the image for future reference. Note whether it is plausible that your '''model''' could bind a B-DNA double-helix in the "alternative" conformation.
-}}
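As an alternative to editing by hand, the extraction and relabelling of the second DNA strand described above can be sketched in base R on the raw text lines of the biological assembly file. This is a sketch under stated assumptions: the function name <code>extractChain()</code>, the helper <code>pdbLine()</code> and the toy records are hypothetical, and the code assumes that <code>MODEL</code> records are numbered consecutively from 1 and that the chain ID sits in column 22 of each coordinate record.

```r
# Return the coordinate lines of one chain from one MODEL block,
# with the chain ID (column 22) replaced by a new letter.
extractChain <- function(lines, model = 2L, chain = "D", newChain = "E",
                         dropRes = c("HOH", "PEG", "EDO")) {
  rec   <- trimws(substr(lines, 1, 6))   # record name, columns 1-6
  block <- cumsum(rec == "MODEL")        # number of the MODEL block each line is in
  sel <- block == model &
         rec %in% c("ATOM", "HETATM", "TER") &
         substr(lines, 22, 22) == chain &
         !(substr(lines, 18, 20) %in% dropRes)   # skip solvent and additives
  out <- lines[sel]
  substr(out, 22, 22) <- newChain        # overwrite the chain ID in place
  out
}

# Helper that writes the first 26 columns of a toy record in PDB column layout.
pdbLine <- function(rec, serial, name, resn, chain, resi) {
  sprintf("%-6s%5d %-4s %3s %s%4d", rec, serial, name, resn, chain, resi)
}

# Toy input (hypothetical records, not taken from 1DP7).
demo <- c(
  "MODEL        1",
  pdbLine("ATOM",   1L, " P",  "DA",  "D", 1L),
  "ENDMDL",
  "MODEL        2",
  pdbLine("ATOM",   1L, " P",  "DA",  "D", 1L),    # selected and relabelled
  pdbLine("ATOM",   2L, " CA", "GLY", "P", 20L),   # other chain, skipped
  pdbLine("HETATM", 3L, " O",  "HOH", "D", 201L),  # water, skipped
  "ENDMDL",
  "END"
)

secondStrand <- extractChain(demo)
print(secondStrand)
```

Because the lines are copied verbatim and only column 22 is changed, <code>HETATM</code> records of the selected chain that are not listed in <code>dropRes</code> are kept as they are, which may be worth considering wherever a parser handles modified nucleotides differently from standard ones.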
 {{Vspace}}
-{{task|1=
-# Spend some time studying the complex.
-# Recapitulate in your mind how we have arrived at this comparison, in particular, how this was possible even though the sequence similarity between these proteins is low - none of these winged helix domains came up as a result of our previous BLAST search in the PDB.
-# Think carefully about the following question: considering the position of the two DNA helices relative to the MYSPE structural model, which binding mode appears to be more plausible for protein-DNA interactions in the MYSPE Mbp1 APSES domain? Is it the canonical, or the non-canonical binding mode? Is there evidence that allows you to distinguish between the two modes?
-# Before you quit VMD, save the "state" of your session so you can reload it later. We will look at residue conservation once we have built phylogenetic trees. In the main VMD window, choose '''File&rarr;Save State...'''.
-}}
-==Interpretation==
-Analysis of the ligand binding site:
-* http://dnasite.limlab.ibms.sinica.edu.tw/
-* http://proline.biochem.iisc.ernet.in/pocketannotate/
-* http://www.biosolveit.de/PoseView/
-* Comparison with seq2logo
-{{#pmid: 19483101}}
-* ProteDNA server, PMID: 19483101
-* http://serv.csbb.ntu.edu.tw/ProteDNA/
-* http://protedna.csie.ntu.edu.tw/
-* Multi Harmony
-{{#pmid: 20525785}}