BIO Assignment Week 8

Latest revision as of 21:23, 4 December 2016

Assignment for Week 8
Predictions: Homology Modeling

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

Concepts and activities (and reading, if applicable) for this assignment will be topics on the next quiz.

Introduction

In order to understand how specific residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to study an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. Explanations of a protein's observed properties and functions can't rely on the general fact that it binds DNA; we need to consider details in terms of specific residues and their spatial arrangement. In particular, it would be interesting to correlate the conservation patterns of key residues with their potential to make specific DNA binding interactions. Unfortunately, the experimental evidence we have considered in Assignment 2 (Taylor et al., 2000) is not sufficient to unambiguously define the details of how a DNA double helix might be bound. Moreover, at least two distinct modes of DNA binding are known for proteins of the winged-helix superfamily, of which the APSES domain is a member.

In this assignment you will construct a molecular model of the APSES domain from the Mbp1 RBM orthologue in your assigned species.

Please keep the following terminology in mind:

Target: The protein that you are planning to model.
Template: The protein whose structure you are using as a guide to build the model.
Model: The structure that results from the modelling process. It has the Target sequence and is similar to the Template structure.

A brief overview article on the construction and use of homology models is linked to the resource section at the bottom of this page. That section also contains links to other sites and resources you might find useful or interesting.

A Point Mutation

To illustrate how homology modelling works in principle, let's consider changing the sequence of a single amino acid, based on a structural template.

Such minimal changes to structure models can be done directly in Chimera. Let us consider the residue A 42 of the 1BM8 structure. It is oriented towards the core of the protein, but most other Mbp1 orthologs have a larger amino acid in this position, V, or even I.

Task:

Open 1BM8 in Chimera, hide the ribbons and show all atoms as a stick model.
Color the protein white.
Open the sequence window and select A 42. Color it red. Choose Actions → Set pivot. Then study how nicely the alanine sidechain fits into the cavity formed by its surrounding residues.
To emphasize this better, hide the solvent molecules and select only the protein atoms. Display them as a sphere model to better appreciate the packing, i.e. the van der Waals contacts we discussed in class. Use the Favorites → Side view panel to move the clipping plane and see a section through the protein. Study the packing; in particular, note that the additional methyl groups of a valine or isoleucine would not have enough space in the structure. Then restore the clipping planes so you can see the whole molecule.
Let's simplify the view: choose Actions → Atoms/Bonds → backbone only → chain trace. Then select A 42 again in the sequence window and choose Actions → Atoms/Bonds → show.
Add the surrounding residues: choose Select → Zone.... In the window, see that the box is checked that selects all atoms at a distance of less than 5 Å to the current selection, and check the lower box to select the whole residue of any atom that matches the distance cutoff criterion. Click OK and choose Actions → Atoms/Bonds → show.
Select A 42 again: left-click (control click) on any atom of the alanine to select the atom, then up-arrow to select the entire residue. Now let's mutate this residue to isoleucine.
Choose Tools → Structure Editing → Rotamers and select ILE as the rotamer type. Click OK; a window will pop up that shows you the possible rotamers for isoleucine together with their database-derived probabilities. You can select them in the window and cycle through them with your arrow keys. But note that the probabilities are very different - and thus show you high-energy and low-energy rotamers to choose from. Therefore, unless you have compelling reasons to do otherwise, try to find the highest-probability rotamer that may fit. This is where your stereo viewing practice becomes important, if not essential. It is really, really hard to do this reasonably in a 2D image! It becomes quite obvious in 3D. Btw: I find such "quantitative" work - where the real distances are important - easier in orthographic than in perspective view (cf. the Camera panel).
I find that the first rotamer is actually not such a bad fit. The CD atom comes close to the sidechains of I 25 and L 96. But we can assume that these are somewhat mobile and can accommodate a denser packing, because - as you can easily verify in your Jalview alignment - it is NOT the case that sequences that have I 42, have a smaller residue in position 25 and/or 96. So let's accept the most frequent ILE rotamer by selecting it in the rotamer window and clicking OK (while existing side chain(s): replace is selected).
Done.

If you want to go over this in more detail, check the video tutorial on YouTube published by the NIAID bioinformatics group here. I would also encourage you to go over Part 2 of the video tutorial that discusses how to check for and resolve (by energy minimization) steric clashes. But do remember that it is not clear whether energy minimization will make your structure more correct in the sense of a smaller overall RMSD with the real, mutated protein.
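The RMSD mentioned above is simply the root of the mean squared distance between matched atoms. A minimal base R sketch, assuming both coordinate sets list the same atoms in the same order and have already been superimposed; the function name rmsd and the coordinates are made up for illustration:

```r
# Root-mean-square deviation between two sets of matched atom positions.
# a, b: numeric matrices with one row per atom and columns x, y, z, listing
# the same atoms in the same order and already superimposed.
rmsd <- function(a, b) {
  stopifnot(is.matrix(a), is.matrix(b), all(dim(a) == dim(b)))
  sqrt(mean(rowSums((a - b)^2)))
}

# Made-up coordinates for illustration only: three atoms,
# each displaced by 2 Angstrom along z in the second set.
ref <- rbind(c(0, 0, 0), c(1.5, 0, 0), c(1.5, 1.5, 0))
mod <- ref + cbind(0, 0, rep(2, 3))
rmsd(ref, mod)
```

A smaller value means the two coordinate sets agree more closely; it says nothing about which of the two is closer to the true structure.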

What we have done here with one residue is exactly the way homology modeling works with entire sequences. The homology modelling program simply changes all amino acids to the residues of the target sequence, based on the template structure. Let's now build a homology model for YFO Mbp1.

Preparation

We need to define our Target sequence;
find a suitable structural Template; and
build a Model.

Target sequence

We have encountered the PDB 1BM8 structure before, the APSES domain of Saccharomyces cerevisiae Mbp1. This is a useful template to model the DNA binding domain of your RBM match. But what exactly is the aligned region of the APSES domain? We could use several approaches to define the APSES domain:

we could use the Biostrings package to calculate a pairwise sequence alignment with the 1BM8 sequence, like we did previously for the full-length sequences. This would give us the domain boundaries.
we could calculate a multiple sequence alignment, while including the 1BM8 sequence. This would also allow us to infer domain boundaries, actually in all sequences in our database at once. But we have found previously that such multiple sequence alignments are quite sensitive to un-alignable regions of which we have quite a few in the full length sequences. We do need an MSA, but we do need to restrict the length of the sequences we align to a reasonable region.
we could access the domain annotations at CDD or at the SMART Database, but both have interfaces that are difficult to use computationally, and have other issues: NCBI does not recognize APSES domains, only the smaller KilA-N domain, and SMART sometimes does not find APSES domains in our sequences.
the most straightforward approach of course is to use the annotation that you already have produced for the APSES domain in MBP1_<YFO>. You should be able to simply take the MBP1_SACCE sequence and the one for YFO from the APSES.mfa file.

This is the 1BM8 sequence:

>SACCE
QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI
LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF

Template choice and template sequence

The SWISS-MODEL server provides several different options for constructing homology models. The easiest option requires only a target sequence as input. In this mode the program will automatically choose suitable templates and create an input alignment. I would argue however that this is not the best way to use such a service: template choice and alignment both may be significantly influenced by biochemical reasoning, and an automated algorithm cannot make the necessary decisions. Should you use a structure of reduced resolution that however has a ligand bound? Should you move an indel from an active site to a loop region even though the sequence similarity score might be less? Questions like that may yield answers that are different from the best choices an automated algorithm could make. But SWISS-MODEL is flexible and allows us to upload an explicit alignment between target and template. Please note: the model you will produce is "easy" - the sequence similarity is high and there are no indels to consider, so the automated mode would have done just as well. But the strategy we pursue here is suitable also for much more difficult problems. The automated strategy probably is not.

Template choice is the first step. Often more than one related structure can be found in the PDB. The degree of sequence identity is the most important criterion, but there are many other factors to consider. Please refer to the template choice principles page on this Wiki where I discuss more details and alternatives. To find related structures, you can search the PDB itself through its Advanced Search interface; for example one can search for sequence similarity with a BLAST search, or search for structural similarity by accessing structures according to their CATH or SCOP classification. But the BLAST search is probably the method of choice: after all, the most important measure of the probability of success for homology modelling is sequence similarity.

Defining a template means finding a PDB coordinate set that has sufficient sequence similarity to your target that you can build a model based on that template. To find suitable PDB structures, we will perform a BLAST search at the PDB.

Task:

Retrieve your aligned YFO's Mbp1 RBM APSES domain sequence from the APSES.mfa selection you have prepared for the phylogeny assignment. This YFO sequence is your target sequence.
Navigate to the PDB.
Click on Advanced to enter the advanced search interface.
Open the menu to Choose a Query Type:
Find the Sequence features section and choose Sequence (BLAST...)
Paste your target sequence into the Sequence field, select not to mask low-complexity regions and Submit Query. Since the E-value cutoff is set rather high by default, you will get a number of low-confidence hits as well as the actual homologs, which have very low E-values.

All hits that are homologs are potentially suitable templates, but some are more suitable than others. Consider how the coordinate sets differ and which features would make each more or less suitable for creating a homology model: you should consider ...

sequence similarity to your target
size of expected model (= length of alignment)
presence or absence of ligands
experimental method and quality of the data set

Sequence similarity is the most important, but we can have the PDB tabulate the other features concisely for this task.

There is a menu to create Reports: select customizable table.
Select (at least) the following information items:

Structure Summary: Experimental Method
Sequence: Chain Length
Ligands: Ligand Name
Biological Details: Macromolecule Name
Refinement Details: Resolution, R Work, R Free

Click: Create report.

Unfortunately you don't get the E-values into the report, and those should strongly influence your final decision. However in our case the sequences and therefore the E-values of the top three hits are all the same. And there is a new structure from January 2015, with a lower resolution. Some of the sequences have a longer chain-length ... but those are only disordered residues (otherwise these would be better suited templates; regrettably, you'd need to check that in the real world, there is no automatic tool to evaluate disorder and its effects on template choice). In my opinion that leaves pretty much only one unambiguous choice for our template: 1BM8.

Finally: Click on the 1BM8 ID to navigate to the structure page for the template and save the FASTA sequence to your computer. This is the template sequence.

Sequence numbering

It is not at all straightforward how to number sequences in such a project. A "natural" numbering starts with the start-codon of the full length protein and goes sequentially from there. However, this does not map exactly to other numbering schemes we have encountered. As you know the first residue of the APSES domain (as defined by CDD) is not Residue 1 of the Mbp1 protein. The first residue of the 1BM8 FASTA file (one of the related PDB structures) is the fourth residue of the Mbp1 protein. The first residue in the structure is GLN 4, therefore Q is the first residue in a FASTA sequence derived from the coordinate section of the PDB file (the ATOM records). In the 1MB1 structure, the original N-terminal amino acids are present in the molecule, therefore they are present in the FASTA file which starts with MSNQIY..., but they are disordered in the structure and no coordinates are present for M and S. A sequence derived explicitly from the coordinates is therefore different from the reported FASTA sequence, which is really bad because that is what the modeling program has to work with ... and so on. It can get complicated. You need to remember: a sequence number is not absolute, but assigned in a particular context and you need to be careful how to do this.

Fortunately, the numbering for the residues in the coordinate section of our template structure corresponds not to its FASTA sequence, but to the numbering of the gene. Otherwise we would need to renumber the sequence (e.g. by using the bio3d R package). If we did not do this, the sequence numbers in the model might not correspond to the sequence numbers of our target.
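One way to make such an offset explicit is to locate the coordinate-derived sequence inside the longer, "naturally" numbered sequence. The following base R sketch uses only the sequence fragments quoted in this section (MSNQIY... and the start of the 1BM8 sequence); the function name firstResidue is hypothetical, and exact string matching is a simplifying assumption that will fail if the two sequences differ.

```r
# Locate a coordinate-derived sequence within a longer reference sequence
# and report the residue number of its first residue.
# Both strings are short fragments quoted in the text, not full sequences.
fullStart  <- "MSNQIYSARYSGVDVYEF"   # start of the full-length numbering (M is residue 1)
coordStart <- "QIYSARYSGVDVYEF"      # start of the sequence seen in the ATOM records

firstResidue <- function(domain, full) {
  pos <- regexpr(domain, full, fixed = TRUE)   # 1-based position, -1 if absent
  if (pos < 0) stop("domain sequence not found in full-length sequence")
  as.integer(pos)
}

iFirst <- firstResidue(coordStart, fullStart)
iFirst        # residue number of the first residue with coordinates
iFirst - 1    # offset to add to a numbering that starts at 1
```

For sequences that are similar but not identical, the position would have to come from an alignment instead of an exact match.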

The input alignment

The sequence alignment between target and template is the single most important factor that determines the quality of your model. No comparative modeling process will repair an incorrect alignment; it is useful to consider a homology model more like a three-dimensional map of a sequence alignment than a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these just because they are convenient. You should take advantage of the much more sophisticated alignment methods available. Analysis of wrong models can't be expected to produce right results.

The best possible alignment is usually constructed from a multiple sequence alignment that includes at least the target and template sequence and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. Your alignment should have been carefully reviewed by you and wherever required, manually adjusted to move insertions or deletions between target and template out of the secondary structure elements of the template structure.

In most of the Mbp1 orthologues, we do not observe indels in the APSES domain regions. Evolutionary pressure on the APSES domains has selected against indels in the more than 600 million years these sequences have evolved independently in their respective species. To obtain an alignment between the template sequence and the target sequence from your species, proceed as follows.

Task:
Choose one of the following options to align your target and template sequence. Make sure your template sequence is included, i.e. the FASTA sequence of 1BM8.

In Jalview...

Load your APSES domain sequences plus the 1BM8 sequence in Jalview. Include the sequence of your template protein and align using Muscle.
Delete all sequences you no longer need, i.e. keep only the APSES domains of the target (from your species) and the template (from the PDB) and choose Edit → Remove empty columns. This is your input alignment.
Choose File→Output to textbox→FASTA to obtain the aligned sequences. They should both have exactly the same length, i.e. N- or C- termini have to be padded by hyphens if the original sequences had different length. Save the sequences in a text-file.

Using a different MSA program

Copy the FASTA formatted sequences of the Mbp1 proteins in the reference species from the Reference APSES domain page.
Access the MSA tools page at the EBI.
Paste the Mbp1 sequence set, your target sequence and the template sequence into the input form.
Run an alignment (I like T-coffee) and save the output.

Using the R bioconductor MSA package that you used previously.

Refer back to the page if you are lacking notes how to go about this.

Whatever method you use: the result should be a two sequence alignment in multi-FASTA format, that was constructed from a number of supporting sequences and that contains your aligned target and template sequence. This is your input alignment for the homology modeling server. For a Schizosaccharomyces pombe model, which I am using as an example here, it looks like this:

>1BM8_A 
QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI
LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
>Mbp1_SCHPO 2-100 NP_593032
AVHVAVYSGVEVYECFIKGVSVMRRRRDSWLNATQILKVADFDKPQRTRV
LERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPILSL

In this case, there are no indels and therefore no hyphens - in your case there may be.
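Before uploading, it can be worth checking the input alignment programmatically. The sketch below, in base R, parses the example alignment shown above, confirms that both rows have the same length, and reports gapped columns and percent identity. The function name readAlignedFasta is hypothetical, and the parser assumes a simple multi-FASTA layout in which every header line is followed by at least one sequence line.

```r
# Parse a two-sequence multi-FASTA alignment held in a character vector and
# check that it is usable as a target-template input alignment.
fastaLines <- c(
  ">1BM8_A",
  "QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI",
  "LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF",
  ">Mbp1_SCHPO 2-100 NP_593032",
  "AVHVAVYSGVEVYECFIKGVSVMRRRRDSWLNATQILKVADFDKPQRTRV",
  "LERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPILSL"
)

readAlignedFasta <- function(lines) {
  lines  <- trimws(lines)
  isHead <- startsWith(lines, ">")
  group  <- cumsum(isHead)                 # record index for every line
  seqs   <- vapply(split(lines[!isHead], group[!isHead]),
                   paste, character(1), collapse = "")
  names(seqs) <- sub("^>(\\S+).*", "\\1", lines[isHead])  # first word of header
  seqs
}

aln <- readAlignedFasta(fastaLines)
stopifnot(length(unique(nchar(aln))) == 1)   # aligned rows must have equal length

chars    <- strsplit(aln, "")
template <- chars[[1]]
target   <- chars[[2]]
gapped   <- template == "-" | target == "-"  # columns containing an indel
identity <- sum(template == target & !gapped) / sum(!gapped)

nchar(aln)             # alignment length per sequence
sum(gapped)            # number of gapped columns
round(100 * identity)  # percent identity over ungapped columns
```

To check your own alignment, replace fastaLines with readLines() on the file you saved.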

Homology model

The alignment defines the residue by residue relationship between target and template sequence. All we need to do now is to change every residue of the template to the target sequence.

SwissModel

Access the Swissmodel server at http://swissmodel.expasy.org and click on the Start Modelling button. Under the Supported Inputs, choose Target-Template Alignment.

Task:

Paste the aligned sequences of the YFO target and the 1BM8 template into the form field. SwissModel will analyse the sequences and ask you to identify target and template. The YFO sequence is your target. The 1BM8 sequence is the template.

Click Validate Target Template Alignment and check that the returned alignment is correct. All non-identical residues are shown in light-grey.

Click Build Model to start the modeling process. This will take about a minute or so.

The resulting page returns information about the resulting model and its quality. You can rotate the model in the window on the right with the mouse. Regions that have a reddish hue have lower quality scores, i.e. they were harder to model or could not be modelled well with good geometry. Hovering the mouse over parts of the structure highlights the respective region of the sequence alignment.

Mouse over the Model 01 dropdown menu (under the icon of the template structure), and choose the PDB file. Note that the B-factor column of the coordinate section contains the QMEAN scores (between 0 and 1) that the server has calculated. Higher is better. Save the PDB file on your computer.

Open the SwissModel documentation in a new tab. Read about the modelling process. There are a number of important technical details that help to understand what the computed coordinates of your model mean; you should pay special attention to the GMQE and QMEAN quality scores.

Also save:

- The output page as pdf (for reference)
- The modeling report (as pdf)

Model interpretation

We have spent a significant amount of time preparing data for the analysis, and in practice it usually turns out that way: the preparation of data occupies the greatest part of our efforts. The actual computational analysis is generally quite fast. And, unfortunately, the interpretation of results is often somewhat neglected. Don't be that way. Data does not explain itself. The interpretation of your computational results is the most important part.

We will look at our homology model with two different questions:

Can we define the DNA binding residues?
Can we tell which residues are conserved for functional reasons, rather than for structural reasons?

The PDB file

Task:
Open your model coordinates in a text-editor (make sure you view the PDB file in a fixed-width font (like "courier") so all the columns line up correctly) and consider the following questions:

What is the residue number of the first residue in the model? What should it be, based on the alignment? If the putative DNA binding region was reported to be residues 50-74 in the Mbp1 protein, which residues of your model correspond to that region?

That's not easy to tell. But it should be.

R code: renumbering the model

As you have seen above, SwissModel numbers the first residue "1" and does not keep the numbering of the template. We should renumber the model so we can compare the model and the template with the same residue numbers. (An alternative renumbering would renumber the model to correspond to the sequence it came from. Remember that we have only excised a domain from the full-length sequence.) Carefully doing this by hand will take you a bit less than an hour. Fortunately there is a very useful R package that will help: bio3d.

Task:

Navigate to the bio3d home page. bio3d has recently been made available via CRAN - previously it had to be compiled from source.

Explore and execute the following R script. I am assuming that your model is in your PROJECTDIR folder, change paths and filenames as required.

setwd(PROJECTDIR)
PDB_INFILE      <- "YFOmodel.pdb"
PDB_OUTFILE     <- "YFOmodelRenumbered.pdb"


# The bio3d package provides functions for working with 
# protein structures in R 
if (!require(bio3d, quietly=TRUE)) { 
	install.packages("bio3d")
	library(bio3d)
}

# == Read the YFO pdb file

iFirst <- 4  # residue number for the first residue
 
YFOmodel <- read.pdb(PDB_INFILE) # read the PDB file into a list

YFOmodel           # examine the information
YFOmodel$atom[1,]  # get information for the first atom

# Explore ?read.pdb and study the examples.

# == Modify residue numbers for each atom
resNum <- as.numeric(YFOmodel$atom[ , "resno"])
resNum
resNum <- resNum - resNum[1] + iFirst  # shift so the first residue becomes iFirst
YFOmodel$atom[ , "resno"] <- resNum    # replace old numbers with new

# check result
YFOmodel$atom[ , "resno"]
YFOmodel$atom[1, ]

# == Write output to file
write.pdb(pdb = YFOmodel, file = PDB_OUTFILE)  # PDB_OUTFILE is defined at the top

# Done. Open the PDB file you have written in a text editor
# and confirm that this has worked.

First visualization

Since a homology model inherits its structural details from the template, your model of the YFO sequence should look very similar to the original 1BM8 structure.

Task:

Start Chimera and load the model coordinates that you have just renumbered.
From the PDB, also load the template structure. (Use File → Fetch by ID ...)
In the Favorites → Model Panel window you can switch between the two molecules.
Hide the ribbon and choose backbone only → full. You will note that the backbone of the two structures is virtually identical.
Next, choose Actions → Atoms/Bonds → show to display the two molecules in a stick style and note how the sidechains have been modeled. Note especially how sidechain coordinates have been guessed, where the template had shorter sidechains than the target. It may be more clear if you hide H-atoms: Select → Chemistry → Element → H and Actions → Atoms/Bonds → hide.
Display only residue 50 to 74 to focus on the putative helix-turn-helix domain. You can drag your mouse in the Favorites → Sequence window to select the range, then Select → Invert (selected model) and Actions → Atoms/Bonds → hide. Or you can use Chimera's commandline: ~display to undisplay everything, show #:50-74 to show this residue range for all models.
Study the result: a model of the HTH subdomain of YFO's RBM to Mbp1.

Coloring the model by energy

SwissModel calculates energies for each residue of the model with a molecular mechanics forcefield. The SwissModel modeling summary page contains a plot of these energies as a function of sequence number. The values - between 0.0 and 1.0 - are stored in the PDB file's B-factor field.
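If you want to look at these numbers directly instead of by colour, the B-factor field can be read from the fixed-width ATOM records. A minimal base R sketch, assuming standard PDB column positions (residue number in columns 23-26, B-factor in columns 61-66); the function name bFactorByResidue, the demo records and the 0.7 threshold are made up for illustration:

```r
# Read per-residue values from the B-factor field of PDB ATOM records.
# Fixed-width PDB columns: residue number 23-26, B-factor 61-66.
bFactorByResidue <- function(pdbLines) {
  atoms <- pdbLines[startsWith(pdbLines, "ATOM")]
  resno <- as.integer(substr(atoms, 23, 26))
  b     <- as.numeric(substr(atoms, 61, 66))
  tapply(b, resno, mean)            # one mean value per residue number
}

# Made-up records for illustration, written with sprintf so that the columns
# line up; in practice use readLines() on your model file instead.
fmt <- "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f"
demo <- c(
  sprintf(fmt, 1L, " N  ", "ALA", "A", 1L, 11.104, 6.134, -6.504, 1.00, 0.62),
  sprintf(fmt, 2L, " CA ", "ALA", "A", 1L, 11.639, 6.071, -5.147, 1.00, 0.62),
  sprintf(fmt, 3L, " N  ", "VAL", "A", 2L, 12.535, 4.829, -5.006, 1.00, 0.81)
)

scores <- bFactorByResidue(demo)
scores
names(scores)[scores < 0.7]   # residues below a hypothetical threshold of 0.7
```

Reading the columns by position rather than splitting on whitespace matters because adjacent PDB fields are not always separated by a blank.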

Task:

Back in Chimera, use the model panel to close the 1BM8 structure. Select all and show Atoms, bonds to view the entire model structure.
Choose Tools → Depiction → Render by attribute and select attributes of atoms, Attribute: bfactor, check color atoms and click OK.
Study the result: It seems that residues in the core of the protein have better energies (higher values) than residues at the surface. Why could that be the case?

Study the options of this window a bit; rendering by attribute is a powerful way to store and depict all manner of information with the molecule. You can simply write a little R script that uses bio3d to replace the B-factor or occupancy values with any value you might be interested in: energies, conservation scores, information ... whatever. Then render this property to map it on the 3D structure of your molecule...
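The core of such a script is a lookup from residue number to score. The sketch below uses a small stand-in data frame so that it runs without any package; with bio3d you would apply the same function to the atom table of the object returned by read.pdb() and then call write.pdb(), assuming that table stores B-factors in a column named b. The function name mapScores and all scores are hypothetical.

```r
# Replace B-factor values with an arbitrary per-residue score.
# 'atoms' stands in for an atom table with one row per atom and the columns
# resno (residue number) and b (B-factor); the scores are made up.
atoms <- data.frame(
  resno = c(4, 4, 4, 5, 5, 6),
  elety = c("N", "CA", "C", "N", "CA", "N"),
  b     = 0
)
score <- c("4" = 0.10, "5" = 0.95, "6" = 0.40)   # hypothetical, keyed by residue number

mapScores <- function(atoms, score, missing = 0) {
  b <- unname(score[as.character(atoms$resno)])   # look up each atom's residue
  b[is.na(b)] <- missing                          # residues without a score
  atoms$b <- b
  atoms
}

atoms <- mapScores(atoms, score)
atoms
```

Every atom of a residue receives the same value, so Render by attribute will colour whole residues uniformly.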

Modelling DNA binding

One of the really interesting questions we can discuss with reference to our homology model is how sequence variation might result in changed DNA recognition sites, and then lead to changed cognate DNA binding sequences. In order to address this, we would need to generate a plausible structural model for how DNA is bound to APSES domains.

Since there is currently no software available that would reliably model such a complex from first principles^[1], we will base a model of a bound complex on homology modelling as well. This means we need to find a similar structure for which the position of bound DNA is known, then superimpose that structure with our model. This places the DNA molecule into the spatial context of the model we are studying. It so happens that early in 2015 an APSES domain structure with bound DNA was published. You probably noticed it as a result of the PDB BLAST search: 4UX5, from the Magnaporthe oryzae Mbp1 orthologue PCG2^[2].

A homologous protein/DNA complex structure

Task:

The PCG2 / DNA complex

Open Chimera and load the 4UX5 structure. Spend some time exploring it. There are two domains of identical sequence bound to one DNA molecule. The first question I would have is whether the two molecules bind to the same DNA motif - the CGCG core of the "MCB-box", and whether the observed protein:DNA interfaces are actually with the cognate sequence, or whether one (or both) proteins are non-specific complexes. The conditions under which proteins crystallize can be harsh, and physiological function under these conditions is not guaranteed.^[3] Indeed, Liu et al. (2015) report that at low concentrations a 1:1 complex is formed and the 2:1 Protein:DNA complex only forms at high concentrations. Figure 3 of their paper shows that the detailed contacts between protein and DNA are in fact not identical.

Without taking this question too far, let's get a quick view of the comparison by duplicating one domain of the structure and superimposing it on the other. The authors feel that chain A represents the tighter, more specific mode of interaction; so we will duplicate chain B and superpose the copy on A.

In Chimera, open the Favorites → Model Panel and use the copy/combine button to create a copy of the 4UX5 model. Call it test.
Select chain B of the test model, then use Select → Invert (selected models) to apply the selection to everything in the test model except chain B.
Use Actions → Atoms/Bonds → delete to remove everything but Chain B.
Select and colour the chain red.
Back on the Model Panel, select both models and use the match... dialogue to open a MatchMaker dialogue window. Choose the radio button to match two specific chains and select 4UX5 chain A as the Reference chain, test chain B as the Chain to match. Click Apply.

You will see that the superimposed structures are very similar, that the main difference is in the orientation of the disordered C-terminus, but also that there is a structural difference between the two structures around Gly 84 which inserts into the minor groove of the double helix.

Select one of the residues of that loop in chain A by <control>-clicking on it and use Actions → Set pivot to set the centre of rotation to that residue: this makes it easier to visualize the binding situation when you make the molecules larger.

Select residues 81 to 87 (sequence VQGGYGKY) in both chains, turn their ribbon display off, and display this range as "sticks".
Select nucleic acid in the structure submenu and turn ribbons and nucleotide objects off to display the DNA as sticks as well. Colour the DNA by element.
Study the situation. Focus on Gly 84.A, especially the interaction of its carbonyl oxygen, which hydrogen bonds to the N2 atom of G8.D chain. Gln 89.A hydrogen bonds to the N2 atom of G8.C chain. Gly 84 and Gln 82 thus recognize a G:C C:G pair. In the B chain, Gly 84.B does not contact the DNA well, since it contacts residues of chain A, especially Gln 82.A. The carbonyl atom of Gly 84.B hydrogen bonds to Gln 89.B, and therefore Gln 89.B is not available to contact nucleotide bases. What do you think^[4]? It seems to me that a crucial interaction for the cognate sequence is contributed by Guanine 8.
Finally, use the Model Panel to select test and close it.

Superimposing your model

Both your homology model and the template structure provide valuable information:

The template structure shows how conserved the structure is at the protein/DNA interface. You have seen what subtle differences can give rise to a sequence specific complex and a non-specific binding mode. For Mbp1 we know that the APSES domain binds to the same cognate DNA sequence as PCG2. Since your model structure is heavily biased towards the template, evaluating the template in the context of a real protein/DNA complex allows you to judge which binding residues appear to be conserved and possibly modelled in an orientation that is productive for binding.

The model structure maps sequence variation into that context: are the crucial residues for sequence specific binding conserved?

Task:

Start by loading your model and the 1BM8 structure into your Chimera session. Select all, turn all ribbons off, and set all atoms to stick representation. Then select H atoms by element and hide them.

We need to visualize and evaluate differences in binding between different proteins and for me it works well to colour everything by element, and give the carbon atoms some identifying, distinct colour. This is best achieved through the Chimera command line that you can turn on with the little "computer" icon on the left-hand side of the graphics window. Have a look at the Chimera Users guide, and choose select to learn how Chimera's selection syntax works.
Open the Model Panel to check which protein has which Chimera-internal model number. Then you can use the following selection syntax. Instead of the model numbers, I will type <YFO>, <4ux5>, and <1BM8> - you will certainly know by now that these are placeholder labels and you need to replace them with the numbers 0, 1, and 2 instead.

To colour the DNA carbon atoms white, type:

color white #<4ux5>:.C,.D & C

To colour the 4ux5 A chain carbon atoms grey, type:

color #878795 #<4ux5>:.A & C

Note: the color values after the first hash are RGB triplets in the hexadecimal numbering system - exactly like in R.

To undisplay the 4ux5 B chain, type:

~display #<4ux5>:.B

Note: this is the tilde character, not a hyphen or minus sign.

To colour the YFO model carbon atoms a pale reddish color, type:

color #b06268 #<YFO> & C

To colour the 1BM8 structure carbon atoms a pale greenish color, type:

color #92b098 #<1BM8> & C
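The hexadecimal colour codes in the commands above can be checked in R, where the same notation applies. This is a minimal sketch using only base R's col2rgb() and rgb(); the variable names (chimeraCols, rgbValues, hexAgain) are illustrative labels, not part of the assignment.

```r
# Hex colour codes taken from the Chimera commands above
chimeraCols <- c(chainA = "#878795",   # 4ux5 chain A carbons (grey)
                 YFO    = "#b06268",   # model carbons (pale red)
                 tmpl   = "#92b098")   # 1BM8 carbons (pale green)

# Decode each code into its red, green and blue components (0 to 255)
rgbValues <- col2rgb(chimeraCols)
print(rgbValues)

# Converting back reproduces the codes (rgb() writes upper-case hex digits)
hexAgain <- rgb(rgbValues["red", ], rgbValues["green", ], rgbValues["blue", ],
                maxColorValue = 255)
print(hexAgain)
```

Each pair of hexadecimal digits is one colour channel, so #878795 decodes to red 135, green 135, blue 149: a grey with a slight blue tint.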

Ready? Let's superimpose the chains.
- Select all models in the Model Panel and click on match.
- Set 4ux5 Chain A as the Reference chain.
- Select YFO as a Chain to match, select the button for specific reference and specific match, and click Apply.
- Repeat this with 1BM8 as the match chain.
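To make concrete what a superposition computes, here is a minimal sketch of a least-squares fit (the Kabsch algorithm) in base R. It is not Chimera's implementation: Chimera also has to decide which atoms of the two chains to pair, whereas this sketch assumes the two coordinate sets are already paired row by row. The function name superimpose and the toy coordinates are made up for illustration.

```r
# Least-squares superposition of two paired coordinate sets
# (rows = atoms, columns = x, y, z) using the Kabsch algorithm.
superimpose <- function(mobile, fixed) {
  stopifnot(identical(dim(mobile), dim(fixed)), ncol(fixed) == 3)
  cMob <- colMeans(mobile)
  cFix <- colMeans(fixed)
  mob0 <- sweep(mobile, 2, cMob)            # centre both sets on the origin
  fix0 <- sweep(fixed,  2, cFix)
  s <- svd(t(mob0) %*% fix0)                # SVD of the 3 x 3 covariance matrix
  d <- sign(det(s$v %*% t(s$u)))            # d = -1 flags an improper rotation
  rot <- s$v %*% diag(c(1, 1, d)) %*% t(s$u)
  fitted <- sweep(mob0 %*% t(rot), 2, cFix, "+")   # rotate, then translate
  list(xyz  = fitted,
       rmsd = sqrt(mean(rowSums((fitted - fixed)^2))))
}

# Toy example: rotate and shift a random "structure", then fit it back
set.seed(112358)
fixed  <- matrix(rnorm(30), ncol = 3)
theta  <- pi / 5
rotZ   <- matrix(c( cos(theta), sin(theta), 0,
                   -sin(theta), cos(theta), 0,
                    0,          0,          1), nrow = 3)
mobile <- sweep(fixed %*% t(rotZ), 2, c(5, -2, 1), "+")

fit <- superimpose(mobile, fixed)
fit$rmsd    # effectively zero: the displaced copy fits back exactly
```

With real structures the RMSD will not be zero, and its size after fitting is the quantity to keep in mind when comparing the model, 1BM8 and the 4ux5 chains.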

Easy. Now enlarge the binding site. Remember that 4ux5 and 1BM8 are independently determined crystal structures, whereas YFO was modelled on 1BM8 and is expected to be very similar to it. To give you some guidance on what to focus on, select the CA atom of 4ux5 residue 84 and display it as Ball & Stick. You can also repeat the Action "Set Pivot" in case the pivot has shifted.

Study the scene. This is where stereo vision will help a lot.

What do you think? Is this what you expected? Can you explain what you see? Was the modelling process successful?

Now turn the display of 4ux5 chain B back on and turn chain A off instead. Then superimpose the 1BM8 template and your model on Chain B.

Again, focus on the binding region. What do you think of that? What would you have expected? Do you see a difference? What does this all mean?

Nb. I haven't seen this before and I am completely intrigued by the results. In fact, I think I understand the protein much, much better now through this exercise. I'm very pleased how this turned out.

Links and resources

PDB file format (see the Coordinate Section if you are unsure about chain identifiers)
Wikipedia on Structural Superposition (although the article is called "Structural Alignment")
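If you are unsure which chain identifiers a coordinate file contains, you can read them directly from the fixed-width ATOM records: in the PDB format the chain identifier is in column 22 and the residue number in columns 23 to 26. The following is a minimal sketch in base R. The three records are constructed here purely for illustration (made-up coordinates, not taken from a real entry), and makeAtom is a hypothetical helper; for a real file you would read the lines with readLines() instead.

```r
# Build illustrative ATOM records with the PDB fixed-width column layout.
# Columns: 1-6 record name, 7-11 serial, 13-16 atom name, 18-20 residue name,
#          22 chain ID, 23-26 residue number, 31-54 x, y, z coordinates.
makeAtom <- function(serial, atom, res, chain, resSeq, x, y, z) {
  sprintf("ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f",
          serial, atom, res, chain, resSeq, x, y, z)
}

pdbLines <- c(makeAtom(1L, " N  ", "GLN", "A", 3L, 20.154, 29.699,  5.276),
              makeAtom(2L, " CA ", "GLN", "A", 3L, 21.618, 29.674,  5.352),
              makeAtom(3L, " P  ", " DC", "C", 1L, 10.000, 12.500, -3.250))

# Keep ATOM records only, then cut out the fixed-width fields
atoms   <- pdbLines[substr(pdbLines, 1, 6) == "ATOM  "]
chainID <- substr(atoms, 22, 22)
resSeq  <- as.integer(substr(atoms, 23, 26))

table(chainID)   # number of atoms per chain identifier
```

The same substr() positions work on any standard PDB file, which makes it easy to confirm which chain letters to use in a Chimera selection such as :.A or :.C,.D.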

Footnotes and references

↑ Rosetta may get the structure approximately right, Autodock may get the complex approximately right, but the coordinate changes involved in induced fit make the result unreliable - and we have no good way to validate whether the predicted complex is correct.

↑ Liu et al. (2015) Structural basis of DNA recognition by PCG2 reveals a novel DNA binding mode for winged helix-turn-helix domains. Nucleic Acids Res 43:1231-40. (pmid: 25550425)

↑ This particular crystal structure however was crystallized from a Tris-buffer with 50 mM NaCl at pH 8.0 - comparatively gentle conditions actually.
↑ Besides the coordinate difference between the chains, if indeed chain B would be representative of a DNA "scanning" conformation, perhaps one should expect that the local DNA structure that chain B binds to is structurally closer to canonical B-DNA than the DNA binding interface of chain A...

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.

Do consider how to ask your questions so that a meaningful answer is possible:
- How to create a Minimal, Complete, and Verifiable example on stackoverflow and ...
- How to make a great R reproducible example are required reading.

< Assignment 7

Assignment 9 >

Abstract of Liu et al. (2015): The MBP1 family proteins are the DNA binding subunits of MBF cell-cycle transcription factor complexes and contain an N terminal winged helix-turn-helix (wHTH) DNA binding domain (DBD). Although the DNA binding mechanism of MBP1 from Saccharomyces cerevisiae has been extensively studied, the structural framework and the DNA binding mode of other MBP1 family proteins remains to be disclosed. Here, we determined the crystal structure of the DBD of PCG2, the Magnaporthe oryzae orthologue of MBP1, bound to MCB-DNA. The structure revealed that the wing, the 20-loop, helix A and helix B in PCG2-DBD are important elements for DNA binding. Unlike previously characterized wHTH proteins, PCG2-DBD utilizes the wing and helix-B to bind the minor groove and the major groove of the MCB-DNA whilst the 20-loop and helix A interact non-specifically with DNA. Notably, two glutamines Q89 and Q82 within the wing were found to recognize the MCB core CGCG sequence through making hydrogen bond interactions. Further in vitro assays confirmed essential roles of Q89 and Q82 in the DNA binding. These data together indicate that the MBP1 homologue PCG2 employs an unusual mode of binding to target DNA and demonstrate the versatility of wHTH domains.


Contents

Introduction

A Point Mutation

Preparation

Target sequence

Template choice and template sequence

Sequence numbering

The input alignment

Homology model

SwissModel

Model interpretation

The PDB file

R code: renumbering the model

First visualization

Coloring the model by energy

Modelling DNA binding

A homologous protein/DNA complex structure

Superimposing your model

Links and resources

Footnotes and references

Ask, if things don't work for you!


@@ Line 1: / Line 1: @@
 <div id="BIO">
 <div class="b1">
-Assignment for Week 7<br />
+Assignment for Week 8<br />
-<span style="font-size: 70%">Multiple Sequence Alignment</span>
+<span style="font-size: 70%">Predictions: Homology Modeling</span>
 </div>
+<table style="width:100%;"><tr>
+<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_7|&lt;&nbsp;Assignment&nbsp;7]]</td>
+<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[BIO_Assignment_Week_9|Assignment&nbsp;9&nbsp;&gt;]]</td>
+</tr></table>
 {{Template:Inactive}}
-Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.
+Concepts and activities (and reading, if applicable) for this assignment will be topics on the next quiz.
 __TOC__
 ==Introduction==
-In the last assignment we discovered homologs to ''S. cerevisiae'' Mbp1 in YFO. Some of these will be orthologs to Mbp1, some will be paralogs. Some will have similar function, some will not. We discussed previously that genes that evolve under continuously similar evolutionary pressure should be most similar in sequence, and should have the most similar "function".
-In this assignment we will define the YFO gene that is the most similar ortholog to ''S. cerevisiae'' Mbp1, and perform a multiple sequence alignment with it.
-Let us briefly review the basic concepts.
+In order to understand how specific residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to study an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. Explanations of a protein's observed properties and functions can't rely on the general fact that it binds DNA, we need to consider details in terms of specific residues and their spatial arrangement. In particular, it would be interesting to correlate the conservation patterns of key residues with their potential to make specific DNA binding interactions. Unfortunately, the experimental evidence we have considered in Assignment 2 ([http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=10747782 Taylor ''et al.'', 2000]) is not sufficient to unambiguously define the details of how a DNA double helix might be bound. Moreover, at least two distinct modes of DNA binding are known for proteins of the winged-helix superfamily, of which the APSES domain is a member.
-==Orthologs and Paralogs revisited==
+In this assignment you will (1) construct a molecular model of the APSES domain from the Mbp1 RBM orthologue in your assigned species.
-<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+For the following, please remember the following terminology:
-&nbsp;<br>
+;Target
-;All related genes are homologs.
+:The protein that you are planning to model.
-</div>
+;Template
+:The protein whose structure you are using as a guide to build the model.
+;Model
+:The structure that results from the modelling process. It has the '''Target sequence''' and is similar to the '''Template structure'''.
+&nbsp;
+A brief overview article on the construction and use of homology models is linked to the resource section at the bottom of this page. That section also contains links to other sites and resources you might find useful or interesting.
-Two central definitions about the mutual relationships between related genes go back to Walter Fitch who stated them in the 1970s:
-<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
-&nbsp;<br>
-;Orthologs have diverged after speciation.
-;Paralogs have diverged after duplication.
-</div>
+&nbsp;
 &nbsp;
-[[Image:OrthologParalog.jpg|frame|none|'''Hypothetical evolutionary tree.''' A single gene evolves through two speciation events and one duplication event. A duplication occurs during the evolution from reptilian to synapsid. It is easy to see how this pair of genes (paralogs) in the ancestral synapsid gives rise to two pairs of genes in pig and elephant, respectively. All ''circle'' genes are mutually orthologs, they form a "cluster of orthologs". All genes within one species are mutual paralogs&ndash;they are so called ''in-paralogs''. The ''circle'' gene in pig and the ''triangle'' gene in the elephant are so-called ''out-paralogs''. Somewhat counterintuitively, the ''triangle'' gene in the pig and the ''circle'' gene in the raven are also orthologs.
-The "phylogram" on the right symbolizes the amount of evolutionary change as proportional to height difference to the "root". It is easy to see how a bidirectional BLAST search will only find pairs of most similar orthologs. If applied to a group of species, bidirectional BLAST searches will find clusters of orthologs only (except if genes were lost, or there are  anomalies in the evolutionary rate.)]]
-==Defining orthologs==
-To be reasonably certain about orthology relationships, we would need to construct and analyze detailed evolutionary trees. This is computationally expensive and the results are not always unambiguous either, as we will see in a later assignment. But a number of different strategies are available that use precomputed results to define orthologs. These are especially useful for large, cross genome surveys. They are less useful for detailed analysis of individual genes. Pay the sites a visit and try a search.
-;Orthologs by COGs and KOGS
-:The [http://www.ncbi.nlm.nih.gov/COG/new/ '''COGs'''] database is a database of clusters of mutually orthologous bacterial genes at the NCBI and the KOGS database is its eukaryotic counterpart. I am sure it is very well done, but the interface is useless. One could download the data and code one's own analysis routines I guess...
-;Orthologs by OMA and OrthoDB
-:[http://www.orthodb.org/ '''OrthoDB'''] includes a large number of species, among them all of our protein-sequenced fungi. However the search function (by keyword) retrieves many paralogs together with the orthologs, for example, the yeast Swi4 protein is found together with yeast Mbp1 and these two are clearly paralogs.
-:[http://omabrowser.org/ '''OMA'''] (the Orthologous Matrix) maintained at the Swiss Federal Institute of Technology contains a large number of orthologs from sequenced genomes. Searching with <code>MBP1_YEAST</code> (this is the Swissprot ID) as a "Group" search finds the correct gene in EREGO, KLULA, CANGL and SACCE. But searching with the sequence of the ''Ustilago maydis'' ortholog does not find the yeast protein, but the orthologs in YARLI, SCHPO, LACCBI, CRYNE and USTMA. Apparently the orthologous group has been split into several subgroups across the fungi.
-;Orthologs by syntenic gene order conservation
-:We will revisit this when we explore the UCSC genome browser.
-;Orthologs by RBM
-:This is easy to do. Simply pick the gene which you have identified and annotated for YFO in [[BIO_Assignment_Week_3|Assignment 3]] and confirm that it is the best match in yeast. The results are unambiguous regarding the applied criterion, but there may be residual doubt whether these two best-matching sequences are actually the most similar orthologs.
-{{task|1=
-# Navigate to the BLAST homepage.
-# Paste the YFO RefSeq sequence identifier into the search field. (You don't have to search with sequences&ndash;you can search directly with an NCBI identifier '''IF''' you want to search with the full-length sequence.)
-# Set the database to refseq, and restrict the species to ''Saccharomyces cerevisiae''.
-# Run BLAST.
-# Keep the window open for the next task.
-The top hit should be yeast Mbp1 (NP_010227). E mail me your sequence identifiers if it is not.
-If it is, you have confirmed the '''RBM''' or '''BBM''' criterion (Reciprocal Best Match or Bidirectional Best Hit, respectively).
-<small>Technically, this is not perfectly true since you have searched with the APSES domain in one direction, with the full-length sequence in the other. For this task I wanted you to try the ''search-with-accession-number''. Therefore the procedural laxness, I hope it is permissible. In fact, performing the reverse search with the YFO APSES domain should actually be more stringent, i.e. if you find the right gene with the longer sequence, you are even more likely to find the right gene with the shorter one.</small>
+==A Point Mutation==
-}}
+To illustrate how homology modelling works in principle, let's consider changing the sequence of a single amino acid, based on a structural template.
-;Orthology by annotation
+Such minimal changes to structure models can be done directly in Chimera. Let us consider the residue <code>A&nbsp;42</code> of the 1BM8 structure. It is oriented towards the core of the protein, but most other Mbp1 orthologs have a larger amino acid in this position, <code>V</code>, or even <code>I</code>.
-:The NCBI precomputes BLAST results and makes them available at the RefSeq database entry for your protein.
 {{task|1=
-# In your BLAST result page, click on the RefSeq link for your query to navigate to the RefSeq database entry for your protein.
+# Open <code>1BM8</code> in Chimera, hide the ribbons and show all atoms as a stick model.
-# Follow the '''Blink''' link in the right-hand column under '''Related information'''.
+# Color the protein white.
-# Restrict the view to Fungi and RefSeq under the "Display options"
+# Open the sequence window and select <code>A&nbsp;42</code>. Color it red. Choose '''Actions&nbsp;&rarr;&nbsp;Set pivot'''. Then study how nicely the alanine sidechain fits into the cavity formed by its surrounding residues.
+# To emphasize this better, hide the solvent molecules and select only the protein atoms. Display them as a '''sphere''' model to better appreciate the packing, i.e. the Van der Waals contacts we discussed in class. Use the '''Favorites&nbsp;&rarr;&nbsp;Side view''' panel to move the clipping plane and see a section through the protein. Study the packing, in particular, note that the additional methyl groups of a valine or isoleucine would not have enough space in the structure. Then restore the clipping planes so you can see the whole molecule.
-You should see a number of genes with low E-values and high coverage in other fungi - however this search is problematic since the full length gene across the database finds mostly Ankyrin domains.
+# Let's simplify the view: choose '''Actions &rarr; Atoms/Bonds &rarr; backbone&nbsp;only &rarr; chain&nbsp;trace'''. Then select <code>A&nbsp;42</code> again in the sequence window and choose '''Actions &rarr; Atoms/Bonds &rarr; show'''.
+# Add the surrounding residues: choose '''Select &rarr; Zone...'''. In the window, see that the box is checked that selects all atoms at a distance of less than 5&Aring; to the current selection, and check the lower box to select the whole residue of any atom that matches the distance cutoff criterion. Click '''OK''' and choose '''Actions &rarr; Atoms/Bonds &rarr; show'''.
+#Select <code>A&nbsp;42</code> again: '''left-click''' (control click) on any atom of the alanine to select the atom, then '''up-arrow''' to select the entire residue. Now let's mutate this residue to isoleucine.
+#Choose '''Tools &rarr; Structure&nbsp;Editing &rarr; Rotamers''' and select <code>ILE</code> as the rotamer type. Click '''OK''', a window will pop up that shows you the possible rotamers for isoleucine together with their database-derived probabilities; you can select them in the window and cycle through them with your arrow keys. But note that the probabilities are '''very''' different - and thus show you high-energy and low-energy rotamers to choose from. Therefore, unless you have compelling reasons to do otherwise, try to find the highest-probability rotamer that may fit. This is where your stereo viewing practice becomes important, if not essential. It is really, really hard to do this reasonably in a 2D image! It becomes quite obvious in 3D. Btw: I find such "quantitative" work - where the real distances are important - easier in '''orthographic''' than in '''perspective''' view (cf. the '''Camera''' panel).
+#I find that the first rotamer is actually not such a bad fit. The <code>CD</code> atom comes close to the sidechains of <code>I&nbsp;25</code> and <code>L&nbsp;96</code>. But we can assume that these are somewhat mobile and can accommodate a denser packing, because - as you can easily verify in your Jalview alignment - it is '''NOT''' the case that sequences that have <code>I&nbsp;42</code>, have a smaller residue in position <code>25</code> and/or <code>96</code>. So let's accept the most frequent <code>ILE</code> rotamer by selecting it in the rotamer window and clicking '''OK''' (while '''existing side chain(s): replace''' is selected).
+#Done.
 }}
-You will find that '''all''' of these approaches yield '''some''' of the orthologs. But none finds them all. The take home message is: precomputed results are good for large-scale survey-type investigations, where you can't humanly process the information by hand. But for more detailed questions, careful manual searches are still indsipensable.
+If you want to go over this in more detail, check the video tutorial on YouTube published by the NIAID bioinformatics group [http://www.youtube.com/watch?v=bcXMexN6hjY '''here''']. I would also encourage you to go over [http://www.youtube.com/watch?v=eJkrvr-xeXY '''Part 2 of the video tutorial'''] that discusses how to check for and resolve (by energy minimization) steric clashes. But do remember that it is not clear whether energy minimization will make your structure more correct in the sense of a smaller overall RMSD with the real, mutated protein.
-<div class="mw-collapsible mw-collapsed" data-expandtext="Expand for crowdsourcing" data-collapsetext="Collapse">
-;Orthology by crowdsourcing
-:Luckily a crowd of willing hands has prepared the necessary sequences for you: below you will find a link the annotated and verified Mbp1 orthologs from last year's course  :-)
-<div class="mw-collapsible-content">
-We could call this annotation by many hands {{WP|Crowdsourcing|"crowdsourcing"}} - handing out small parcels of work to many workers, who would typically allocate only a small share of their time, but here the strength is in numbers and especially projects that organize via the Internet can tally up very impressive manpower, for free, or as {{WP|Microwork}}. These developments have some interest for bioinformatics: many of our more difficult tasks  can not be easily built into an algorithm, language related tasks such as text-mining, or pattern matching tasks come to mind. Allocating this to a large number of human contributors may be a viable alternative to computation. A marketplace where this kind of work is already a reality is {{WP|Amazon Mechanical Turk|Amazon's "Mechanical Turk" Marketplace}}: programmers&ndash;"requesters"&ndash; use an open interface to post tasks for payment, "providers" from all over the world can engage in these. Tasks may include matching of pictures, or evaluating the aesthetics of competing designs. A quirky example I came across recently was when information designer David McCandless had 200 "Mechanical Turks" draw a small picture of their soul for his collection.
-The name {{WP|The Turk|"Mechanical Turk"}} by the way relates to a famous ruse, when a Hungarian inventor and adventurer toured the imperial courts of 18<sup>th</sup> century Europe with an automaton, dressed in turkish robes and turban, that played chess at the grandmaster level against opponents that included Napoleon Bonaparte and Benjamin Franklin. No small mechanical feat in any case, it was only in the 19<sup>th</sup> century that it was revealed that the computational power was actually provided by a concealed human.
+What we have done here with one residue is exactly the way homology modeling works with entire sequences. The homology modelling program simply changes '''all''' amino acids to the residues of the '''target sequence''', based on the '''template structure'''. Let's now build a homology model for YFO Mbp1.
-Are you up for some "Turking"? Mail me the RefSeq ID of your YFO protein that is the RBM for Mbp1, for 10% bonus on the next quiz, before the quiz.
-</div>
-</div>
 &nbsp;
-==Align and Annotate==
+==Preparation==
-&nbsp;<br>
-===Review of domain annotations===
-APSES domains are relatively easy to identify and annotate but we have had problems with the ankyrin domains in Mbp1 homologues. Both CDD as well as SMART have identified such domains, but while the domain model was based on the same Pfam profile for both, and both annotated approximately the same regions, the details of the alignments and the extent of the predicted region was different.
-[http://www.yeastgenome.org/cgi-bin/locus.fpl?locus=mbp1 Mbp1] forms heterodimeric complexes with a homologue, [http://www.yeastgenome.org/cgi-bin/locus.fpl?locus=swi6 Swi6]. Swi6 does not have an APSES domain, thus it does not bind DNA. But it is similar to Mbp1 in the region spanning the ankyrin domains and in 1999 [http://www.ncbi.nlm.nih.gov/pubmed/10048928 Foord ''et al.''] published its crystal structure ([http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1SW6 1SW6]). This structure is a good model for Ankyrin repeats in Mbp1. For details, please refer to the consolidated [[Mbp1 annotation|Mbp1 annotation page]] I have prepared.
-In what follows, we will use the program JALVIEW - a Java based multiple sequence alignment editor to load and align sequences and to consider structural similarity between yeast Mbp1 and its closest homologue in your organism.
-In this part of the assignment,
-#You will load sequences that are most similar to Mbp1 into an MSA editor;
+* We need to define our '''Target sequence''';
-#You will add sequences of ankyrin domain models;
+* find a suitable structural '''Template'''; and
-#You will perform a multiple sequence alignment;
+* build a '''Model'''.
-#You will try to improve the alignment manually;
-<!-- Finally you will consider if the Mbp1 APSES domains could extend beyond the section of homology with Swi6 -->
-===Jalview, loading sequences===
+===Target sequence===
+We have encountered the PDB <code>1BM8</code> structure before, the APSES domain of ''Saccharomyces cerevisiae'' Mbp1. This is a useful template to model the DNA binding domain of your RBM match. But what exactly is the aligned region of the APSES domain? We could use several approaches to define the APSES domain:
-Geoff Barton's lab in Dundee has developed an integrated MSA editor and sequence annotation workbench with a number of very useful functions. It is written in Java and should run on Mac, Linux and Windows platforms without modifications.
+* we could use the biostrings package to calculate a pairwise sequence alignment with the <code>1BM8</code> sequence, like we did previously for the full-length sequences. This would give us the domain boundaries.
+* we could calculate a multiple sequence alignment, while including the <code>1BM8</code> sequence. This would also allow us to infer domain boundaries, actually in all sequences in our database at once. But we have found previously that such multiple sequence alignments are quite sensitive to un-alignable regions of which we have quite a few in the full length sequences. We do need an MSA, but we do need to restrict the length of the sequences we align to a reasonable region.
+* we could access the domain annotations at [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml CDD] or at the [http://smart.embl-heidelberg.de/ SMART Database], but both have interfaces that are difficult to use computationally, and have other issues: NCBI does not recognize APSES domains, only the smaller KilA-N domain, and SMART sometimes does not find APSES domains in our sequences.
+* the most straightforward approach of course is to use the annotation that you already have produced for the APSES domain in <tt>MBP1_&lt;YFO&gt;</tt>. You should be able to simply take the MBP1_SACCE sequence and the one for YFO from the <tt>APSES.mfa</tt> file.
+This is the 1BM8 sequence:
+ >SACCE
+ QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI
+ LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
-{{#pmid: 19151095}}
+<!--
-We will use this tool for this assignment and explore its features as we go along.
 {{task|1=
-#Navigate to the [http://www.jalview.org/ Jalview homepage] click on '''Download''', install Jalview on your computer and start it. A number of windows that showcase the program's abilities will load, you can close these.
-#Prepare homologous Mbp1 sequences for alignment:
-##Open the '''[[All Mbp1 proteins]]''' page.
-##Copy the FASTA sequences of the reference proteins, paste them into a text file (TextEdit on the Mac, Notepad on Windows) and save the file; you could give it an extension of <code>.fa</code>&ndash;but you don't have to.
-##Check whether the sequence for YFO is included in the list. If it is, fine. If it is not, retrieve it from NCBI, paste it into the file and edit the header like the other sequences. If the wrong sequence from YFO is included, replace it and let me know.
-#Return to Jalview and select File &rarr; Input Alignment &rarr; from File and open your file. A window with sequences should appear.
-#Copy the sequences for ankyrin domain models (below), click on the Jalview window, select File &rarr; Add sequences &rarr; from Textbox and paste them into the Jalview textbox. Paste two separate copies of the CD00204 consensus sequence and one copy of 1SW6.
-##When all the sequences are present, click on '''Add'''.
-Jalview now displays all the sequences, but of course this is not yet an alignment.
+* In our case it seems the best results are had when searching the [http://prosite.expasy.org/prosite.html Prosite] database with the [http://prosite.expasy.org/scanprosite/ ScanProsite] interface.
-}}
+Let's have a first look at ScanProsite, using the yeast Mbp1 sequence. We need the UniProt ID to search Prosite. With your protein database loaded in a fresh '''R''' session, type
-;Ankyrin domain models
+<source lang="RSplus">
- >CD00204 ankyrin repeat consensus sequence from CDD
+# (commands indented, to align their components and
- NARDEDGRTPLHLAASNGHLEVVKLLLENGADVNAKDNDGRTPLHLAAKNGHLEIVKLLL
+# help you understand their relationship)
- EKGADVNARDKDGNTPLHLAARNGNLDVVKLLLKHGADVNARDKDGRTPLHLAAKNGHL
- >1SW6 from PDB - unstructured loops replaced with xxxx
+       refDB$protein$uniProtID
- GPIITFTHDLTSDFLSSPLKIMKALPSPVVNDNEQKMKLEAFLQRLLFxxxxSFDSLLQE
+                               which(refDB$protein$name == "MBP1")
- VNDAFPNTQLNLNIPVDEHGNTPLHWLTSIANLELVKHLVKHGSNRLYGDNMGESCLVKA
+       refDB$protein$uniProtID[which(refDB$protein$name == "MBP1")]
- VKSVNNYDSGTFEALLDYLYPCLILEDSMNRTILHHIIITSGMTGCSAAAKYYLDILMGW
+uID <- refDB$protein$uniProtID[which(refDB$protein$name == "MBP1")]
- IVKKQNRPIQSGxxxxDSILENLDLKWIIANMLNAQDSNGDTCLNIAARLGNISIVDALL
+uID
- DYGADPFIANKSGLRPVDFGAG
+</source>
-===Computing alignments===
+* Navigate to [http://prosite.expasy.org/scanprosite/ ScanProsite], paste the UniprotID for yeast Mbp1 into the text field, select '''Table''' output for STEP 3, and '''START THE SCAN'''.
-The EBI has a very convenient [http://www.ebi.ac.uk/Tools/msa/ page to access a number of MSA algorithms]. This is especially convenient when you want to compare, e.g. T-Coffee and Muscle and MAFFT results to see which regions of your alignment are robust. You could use any of these tools, just paste your sequences into a Webform, download the results and load into Jalview. Easy.
+You should see four feature hits: the APSES domain, and three ankyrin domain sequences that partially overlap. We could copy and paste the start and end numbers and IDs but that would be lame. Let's get them directly from Prosite instead, because we will want to fetch a few of these. Prosite does not have a nice API like UniProt, but the principles of using '''R''''s <code>httr</code> package to send POST requests and retrieve the results are the same. Getting data informally from Web pages is called '''screenscraping''' and is really a life-saving skill. The first step to capture the data from this page via screenscraping is to look into the HTML code of the page.
-But even easier is to calculate the alignments directly from Jalview. Or at least it is, when the service is actually available. (Not today. <small>Bummer.</small>)
+(I am writing this section from the perspective of the Chrome browser - I don't think other browsers have all of the functionality that I am describing here. You may need to install Chrome to try this...)
-;Calculate a MAFFT alignment when the Jalview Web service is available:
+* Use the menu and access '''View''' &rarr; '''Developer''' &rarr; '''View Source'''. Scroll through the page. You should easily be able to identify the data table. That's fair enough: each of the lines contains the UniProt ID and we should be able to identify them. But how to send the request to get this page in the first place?
-{{task|1=
+*Use the browser's back button, and again: '''View''' &rarr; '''Developer''' &rarr; '''View Source'''. This is the page that accepts user input in a so-called <code>form</code> via several different types of elements: "radio-buttons", a "text-box", "check-boxes", a "drop down menu" and a "submit" button. We need to figure out what each of the values is so that we can construct a valid <code>POST</code> request. If we get them wrong, in the wrong order, or have parts missing, it is likely that the server will simply ignore our request. These elements are much harder to identify than the lines of feature information, and it's really easy to get them wrong, miss something and get no output. But Chrome has a great tool to help us: it allows you to see the exact, assembled <code>POST</code> header that it sent to the Prosite server!
-#In Jalview, select '''Web Service &rarr; Alignment &rarr; MAFFT Multiple Protein Sequence Alignment'''. The alignment is calculated in a few minutes and displayed in a new window.
-}}
-;Calculate a MAFFT alignment when the Jalview Web service is NOT available:
+* On the scanProsite page, open '''View''' &rarr; '''Developer''' &rarr; '''Developer Tools''' in the Chrome menu. '''Then''' click again on '''START THE SCAN'''. The Developer Tools page will show you information about what just happened in the transaction it negotiated to retrieve the results page. Click on the '''Network''' tab, and then on the top element: <code>PSScan.cgi</code>. This contains the form data. Then click on the '''Headers''' tab and scroll down until you see the '''Request Payload'''. This has all the required <code>POST</code> elements nicely spelled out. No guesswork required. What worked from the browser should work the same way from an '''R''' script. Analogous to our UniProt fetch code, we create a <code>POST</code> query:
-{{task|1=
+<source lang="RSplus">
-#In Jalview, select '''File &rarr; Output to Textbox &rarr; FASTA'''
-#Copy the sequences.
-#Navigate to the [http://www.ebi.ac.uk/Tools/msa/mafft/ '''MAFFT Input form'''] at the EBI.
-#Paste your sequences into the form.
-#Click on '''Submit'''.
-#Close the Jalview sequence window and either save your MAFFT alignment to file and load in Jalview, or simply ''''File &rarr; Input Alignment &rarr; from Textbox''', paste and click '''New Window'''.
-}}
+URL <- "http://prosite.expasy.org/cgi-bin/prosite/PSScan.cgi"
+response <- POST(URL,
+                 body = list(meta = "opt1",
+                             meta1_protein = "opt1",
+                             seq = "P39678",
+                             skip = "on",
+                             output = "tabular"))
+# Note how the list-elements correspond to the page header's
+# Request Payload. We include everything but the value of the
+# submit button (which is for display only) in our POST
+# request.
-In any case, you should now have an alignment.
+# Send off this request, and you should have a response in a few
+# seconds.
-{{task|1=
+# The text contents of the response are available with the
-#Choose '''Colour &rarr; Hydrophobicity''' and '''&rarr; by Conservation'''. Then select '''Modify Conservation Threshold...'''  and adjust the slider left or right to see which columns are highly conserved. You will notice that the Swi6 sequence that was supposed to align only to the ankyrin domains was in fact aligned to other parts of the sequence as well. This is one part of the MSA that we will have to correct manually and a common problem when aligning sequences of different lengths.
+# content() function:
-}}
+content(response, "text")
-===Editing ankyrin domain alignments===
+# ... should show you the same as the page contents that
+# you have seen in the browser. Now we need to extract
+# the data from the page: we need regular expressions, but
+# only simple ones. First, we strsplit() the response into
+# individual lines, since each of our data elements is on
+# its own line. We simply split on the "\\n" newline character.
+lines <- unlist(strsplit(content(response, "text"), "\\n"))
+head(lines)
-A '''good''' MSA comprises only columns of residues that play similar roles in the proteins' mechanism and/or that evolve in a comparable structural context. Since the alignment reflects the result of biological selection and conservation, it has relatively few indels and the indels it has are usually not placed into elements of secondary structure or into functional motifs. The contiguous features annotated for Mbp1 are expected to be left intact by a good alignment.
+# Now we define a query pattern for the lines we want:
+# we can use the uID, bracketed by two "|" pipe
+# characters:
-A '''poor''' MSA has many errors in its columns; these contain residues that actually have different functions or structural roles, even though they may look similar according to a (pairwise!) scoring matrix. A poor MSA also may have introduced indels in biologically irrelevant positions, to maximize spurious sequence similarities. Some of the features annotated for Mbp1 will be disrupted in a poor alignment and residues that are conserved may be placed into different columns.
+pattern <- paste("\\|", uID, "\\|", sep="")
-Often errors or inconsistencies are easy to spot, and manually editing an MSA is not generally frowned upon, even though this is not a strictly objective procedure. The main goal of manual editing is to make an alignment biologically more plausible. Most comonly this means to mimize the number of rare evolutionary events that the alignment suggests and/or to emphasize conservation of known functional motifs. Here are some examples for what one might aim for in manually editing an alignment:
+# ... and select only the lines that match this
+# pattern:
-;Reduce number of indels
+lines <- lines[grep(pattern, lines)]
- From a Probcons alignment:
+lines
-_DEBHA    ILKTE-K<span style="color: rgb(255, 0, 0);">-</span>T<span style="color: rgb(255, 0, 0);">---</span>K--SVVK      ILKTE----KTK---SVVK
-_GIBZE    MLGLN<span style="color: rgb(255, 0, 0);">-</span>PGLKEIT--HSIT      MLGLNPGLKEIT---HSIT
-_CANAL    ILKTE-K<span style="color: rgb(255, 0, 0);">-</span>I<span style="color: rgb(255, 0, 0);">---</span>K--NVVK      ILKTE----KIK---NVVK
-_SCHPO    ELDDI-I<span style="color: rgb(255, 0, 0);">-</span>ESGDY--ENVD      ELDDI-IESGDY---ENVD
-_ASPFU    ----N<span style="color: rgb(255, 0, 0);">-</span>PGLREIC--HSIT  -&gt;  ----NPGLREIC---HSIT
-_USTMA    LVKTC<span style="color: rgb(255, 0, 0);">-</span>PALDPHI--TKLK      LVKTCPALDPHI---TKLK
-_ASPTE    VLDAN<span style="color: rgb(255, 0, 0);">-</span>PGLREIS--HSIT      VLDANPGLREIS---HSIT
-_DEBHA    LLESTPKQYHQHI--KRIR      LLESTPKQYHQHI--KRIR
-_CANAL    LLESTPKEYQQYI--KRIR      LLESTPKEYQQYI--KRIR
-<small>Gaps marked in red were moved. The sequence similarity in the alignment does not change considerably, however the total number of indels in this excerpt is reduced to 13 from the original 22</small>
+# ... captures the four lines of output.
+# Now we break the lines apart into
+# tokens: this is another application of
+# strsplit(), but this time we split either on
+# "pipe" characters, "|" OR on tabs "\t". Look at the
+# regex "\\t|\\|" in the strsplit() call:
-;Move indels to more plausible position
+strsplit(lines[1], "\\t|\\|")
- From a CLUSTAL alignment:
-_CANGL     MKHEKVQ------GGYGRFQ---GTW      MKHEKV<span style="color: rgb(0, 170, 0);">Q</span>------GGYGRFQ---GTW
-_CANAL     KIKNVVK------VGSMNLK---GVW      KIKNVV<span style="color: rgb(0, 170, 0);">K</span>------VGSMNLK---GVW
-_SCHPO     VDSKHP<span style="color: rgb(255, 0, 0);">-</span>----------<span style="color: rgb(255, 0, 0);">Q</span>ID---GVW  -&gt;  VDSKHP<span style="color: rgb(0, 170, 0);">Q</span>-----------ID---GVW
-_ASPFU     EICHSIT------GGALAAQ---GYW      EICHSI<span style="color: rgb(0, 170, 0);">T</span>------GGALAAQ---GYW
-<small>The two characters marked in red were swapped. This does not change the number of indels but places the "Q" into a a column in which it is more highly conserved (green). Progressive alignments are especially prone to this type of error.</small>
+# Its parts are (\\t)=tab (|)=or (\\|)=pipe.
+# Both "t" and "|" need to be escaped with a backslash.
+# "t" has to be escaped because we want to match a tab (\t),
+# not the literal character "t". And "|" has to be escaped
+# because we mean the literal pipe character, not its
+# usual (special) meaning OR. Thus sometimes the backslash
+# turns a special meaning off, and sometimes it turns a
+# special meaning on. Unfortunately there's no easy way
+# to tell - you just need to remember the characters - or
+# have a reference handy. The special characters are
+# (){}[]^$?*+.|\   ... and some of them have different
+# meanings depending on where in the regex they are
+# (e.g. "-" is special only inside a [...] character class).
-;Conserve motifs
+# Let's put the tokens into named slots of a vector.
- From a CLUSTAL alignment:
-_SCHPO      --DKR<span style="color: rgb(255, 0, 0);">V</span>A---<span style="color: rgb(255, 0, 0);">G</span>LWVPP      --DKR<span style="color: rgb(0, 255, 0);">V</span>A--<span style="color: rgb(0, 255, 0);">G</span>-LWVPP
- XBP1_SACCE      GGYIK<span style="color: rgb(255, 0, 0);">I</span>Q---<span style="color: rgb(255, 0, 0);">G</span>TWLPM      GGYIK<span style="color: rgb(0, 255, 0);">I</span>Q--<span style="color: rgb(0, 255, 0);">G</span>-TWLPM
-_ASPTE      --DE<span style="color: rgb(255, 0, 0);">I</span>A<span style="color: rgb(255, 0, 0);">G</span>---NVWISP  -&gt;  ---DE<span style="color: rgb(0, 255, 0);">I</span>A--<span style="color: rgb(0, 255, 0);">G</span>NVWISP
-_KLULA      GGYIK<span style="color: rgb(255, 0, 0);">I</span>Q---<span style="color: rgb(255, 0, 0);">G</span>TWLPY      GGYIK<span style="color: rgb(0, 255, 0);">I</span>Q--<span style="color: rgb(0, 255, 0);">G</span>-TWLPY
-<small>The first of the two residues marked in red is a conserved, solvent exposed hydrophobic residue that may mediate domain interactions. The second residue is the conserved glycine in a beta turn that cannot be mutated without structural disruption. Changing the position of a gap and insertion in one sequence improves the conservation of both motifs.</small>
+features <- list()
+for (line in lines) {
+    tokens <- unlist(strsplit(line, "\\t|\\|"))
+    features <- rbind(features, c(uID   =  tokens[2],
+                                  start =  tokens[4],
+                                  end   =  tokens[5],
+                                  psID  =  tokens[6],
+                                  psName = tokens[7]))
+}
+features
+</source>
+This forms the basis of a function that collects the features automatically from a ScanProsite result. We still need to do a bit more on the database part, but this is mostly bookkeeping:
-The Ankyrin domains are quite highly diverged, the boundaries not well defined and not even CDD, SMART and SAS agree on the precise annotations. We expect there to be alignment errors in this region. Nevertheless we would hope that a good alignment would recognize homology in that region and that ideally the required <i>indels</i> would be placed between the secondary structure elements, not in their middle. But judging from the sequence alignment alone, we cannot judge where the secondary structure elements ought to be. You should therefore add the following "sequence" to the alignment; it contains exactly as many characters as the Swi6 sequence above and annotates the secondary structure elements. I have derived it from the 1SW6 structure
+* We need to put the feature annotations into a database table and link them to a protein ID and to a description of the feature itself.
+* We need a function that extracts feature sequences in FASTA format.
+* And, since we are changing the structure of the database, we need a way to migrate your old database contents to a newer version.
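As a minimal sketch of how the parsing steps above could be collected into one function: the helper name <code>parsePrositeLines()</code> and the mock input lines are assumptions for illustration only, and this is not the code of <code>fetchPrositeFeatures()</code> in <code>dbUtilities.R</code>. It reuses the same pattern and the same token positions as the loop above, but returns a data frame and works on any character vector of tabular result lines, so it can be tried without a network connection.

```r
# Hypothetical helper (not part of dbUtilities.R): parse the tabular
# result lines for one UniProt ID into a data frame.
parsePrositeLines <- function(lines, uID) {
    # keep only lines that contain the ID between two pipe characters
    pattern <- paste0("\\|", uID, "\\|")
    hits <- lines[grep(pattern, lines)]
    # split each line on tabs or pipes, as in the loop above
    tokens <- strsplit(hits, "\\t|\\|")
    pick <- function(i) {
        vapply(tokens, function(x) x[i], character(1))
    }
    data.frame(uID    = pick(2),
               start  = as.integer(pick(4)),
               end    = as.integer(pick(5)),
               psID   = pick(6),
               psName = pick(7),
               stringsAsFactors = FALSE)
}

# Mock input: the layout follows the token positions used above, but
# the coordinate values are made up for this example.
mock <- c("some other line of the page",
          "sp|P39678|MBP1_YEAST\t5\t111\tPS51299\tHTH_APSES\t.",
          "sp|P39678|MBP1_YEAST\t394\t423\tPS50088\tANK_REPEAT\t.")

features <- parsePrositeLines(mock, "P39678")
print(features)
```

Storing <code>start</code> and <code>end</code> as integers means they can be passed directly to <code>substr()</code> when a feature sequence is extracted later.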
- >SecStruc 1SW6 E: strand   t: turn   H: helix   _: irregular
+I don't think much new can be learned from this, so I have written those functions and put them into <code>dbUtilities.R</code>. But you can certainly learn something from having a look at the code of
- _EEE__tt___ttt______EE_____t___HHHHHHHHHHHHHHHH_xxxx_HHHHHHH
- HHHH_t_____t_____t____HHHHHHH__tHHHHHHHHH____t___tt____HHHHH
- HH__HHHH___HHHHHHHHHHHHHEE_t____HHHHHHHHH__t__HHHHHHHHHHHHHH
- HHHHHH__EEE_xxxx_HHHHHt_HHHHHHH______t____HHHHHHHH__HHHHHHHH
- H____t____t____HHHH___
-<div class="reference-box">[http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetPage.pl?pdbcode=1sw6&template=protein.html&r=wiring&l=1&chain=A '''1SW6_A''' at the PDBSum database of structure annotations] You can compare the diagram there with this text string.</div>
+*<code>fetchPrositeFeatures()</code>
+*<code>addFeatureToDB()</code>
+*<code>getFeatureFASTA()</code>
+Also, have a quick look back at our [[BIO_Assignment_Week_3#The_Protein_datamodel|database schema:]] this update has implemented the proteinFeature and the feature table. Do you remember what they were good for?
-To proceed:
+Time for a database update. You must be up to date with the latest version of <code>dbUtilities.R</code> for this to work. When you are, execute the following steps:
-#Manually align the Swi6 sequence with yeast Mbp1
-#Bring the Secondary structure annotation into its correct alignment with Swi6
-#Bring both CDD ankyrin profiles into the correct alignment with yeast Mbp1
-Proceed along the following steps:
+<source lang="R">
-{{task|1=
+updateVerifiedFile("363ffbae3ff21ba80aa4fbf90dcc75164dbf10f8")
-#Add the secondary structure annotation to the sequence alignment in Jalview. Copy the annotation, select File &rarr; Add sequences &rarr; from Textbox and paste the sequence.
-#Select Help &rarr; Documentation and read about '''Editing Alignments''', '''Cursor Mode''' and '''Key strokes'''.
-#Click on the yeast Mbp1 sequence '''row''' to select the entire row. Then use the cursor key to move that sequence down, so it is directly above the 1SW6 sequence. Select the row of 1SW6 and use shift/mouse to move the sequence elements and edit the alignment to match yeast Mbp1. Refer to the alignment given in the [[Mbp1_annotation|Mbp1 annotation page]] for the correct alignment.
-#Align the secondary structure elements with the 1SW6 sequence: Every character of 1SW6 should be matched with either E, t, H, or _. The result should be similar to the [[Mbp1_annotation|Mbp1 annotation page]]. If you need to insert gaps into all sequences in the alignment, simply drag your mouse over all row headers - movement of sequences is constrained to selected regions, the rest is locked into place to prevent inadvertent misalignments. Remember to save your project from time to time: '''File &rarr; save''' so you can reload a previous state if anything goes wrong and can't be fixed with '''Edit &rarr; Undo'''.
-#Finally align the two CD00204 consensus sequences to their correct positions (again, refer to the [[Mbp1_annotation|Mbp1 annotation page]]).
-#You can now consider the principles stated above and see if you can improve the alignment, for example by moving indels out of regions of secondary structure if that is possible without changing the character of the aligned columns significantly. Select blocks within which to work to leave the remaining alignment unchanged. So that this does not become tedious, you can restrict your editing to one Ankyrin repeat that is structurally defined in Swi6. You may want to open the 1SW6 structure in VMD to define the boundaries of one such repeat. You can copy and paste sections from Jalview into your assignment for documentation or export sections of the alignment to HTML (see the example below).
-}}
-=== Editing ankyrin domain alignments - Sample===
+# Make a backup copy of your protein database.
+# Load your protein database. Then merge the data in your database
+# with the updated reference database. (Obviously, substitute the
+# actual filename in the placeholder strings below. And don't type
+# the angle brackets!)
-This sample was created by
+<my-new-database> <- mergeDB(<my-old-database>, refDB)
-# Editing the alignments as described above;
+# check that this has worked:
-# Copying a block of aligned sequence;
+str(<my-new-database>)
-# Pasting it To New Alignment;
-# Colouring the residues by Hydrophobicity and setting the colour saturation according to Conservation;
-# Choosing File &rarr; Export Image &rarr; HTML and pasting the resulting HTML source into this Wikipage.
+# and save your database.
-<table border="1"><tr><td>
+save(<my-new-database>, file="<my-DB-filename.02>.RData")
-<table border="0" cellpadding="0" cellspacing="0">
-<tr><td colspan="6"></td>
+# Now, for each of your proteins, add the domain annotations to
-<td colspan="9">10<br>|</td><td></td>
+# the database. You could write a loop to do this but it's probably
-<td colspan="9">20<br>|</td><td></td>
+# better to check the results of each annotation before committing
-<td colspan="9">30<br>|</td><td></td>
+# it to the database. So just paste the UniProt IDs as argument of
-<td colspan="3"></td><td colspan="3">40<br>|</td>
+# the function fetchPrositeFeatures(), execute and repeat.
-</tr>
-<tr><td nowrap="nowrap">MBP1_USTMA/341-368&nbsp;&nbsp;</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f3eef9">Y</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#eeeefe">D</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#fdeeef">L</td>
-<td>-</td>
+features <- fetchPrositeFeatures(<one-of-my-proteins-uniProt-IDs>)
-<td>-</td>
+refDB <- addFeatureToDB(refDB, features)
-<td>-</td>
-<td bgcolor="#f9eef3">A</td>
-<td bgcolor="#eeeefe">D</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+# When you are done, save your database.
-<td>-</td>
+</source>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#ffd8d8">I</td>
-<td bgcolor="#fbd8db">L</td>
-<td>-</td>
-<td>-</td>
+Finally, we can create a sequence selection of APSES domains
-<td>-</td>
+from our reference proteins. The function <code>getFeatureFasta()</code>
-<td>-</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#fbeef1">F</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#cfaddc">G</td>
+* accepts a feature name such as <code>"HTH_APSES"</code>;
-<td bgcolor="#dad8fd">E</td>
+* finds the corresponding feature ID;
-<td bgcolor="#d9c2e7">T</td>
+* finds all matching entries in the proteinFeature table;
-<td bgcolor="#d3c2ee">P</td>
+* looks up the start and end position of each feature;
-<td bgcolor="#f7adb3">L</td>
+* fetches the corresponding substring from the sequence entries;
-<td bgcolor="#ccaddf">T</td>
+* adds a meaningful header line; and
-<td bgcolor="#ecc2d5">M</td>
+* writes everything to output.
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#adadff">R</td>
+... so that you can simply execute:
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#f4eef8">S</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1B_SCHCO/470-498&nbsp;&nbsp;</td>
-<td>-</td>
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#eeeefe">D</td>
+<source lang="R">
-<td bgcolor="#f4eef7">G</td>
+cat(getFeatureFasta(<my-new-database>, "HTH_APSES"))
-<td bgcolor="#eeeefe">D</td>
+</source>
-<td bgcolor="#f3eef9">Y</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#eeeeff">K</td>
-<td bgcolor="#f4eef8">S</td>
-<td>-</td>
+Here are the first five sequences from that result:
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+<source lang="text">
-<td bgcolor="#f7d8e0">F</td>
+>CC1G_01306_COPCI    HTH_APSES 6:112
-<td bgcolor="#fbd8db">L</td>
+IFKATYSGIPVYEMMCKGVAVMRRRSDSWLNATQILKVAGFDKPQRTRVLEREVQKGEHE
-<td>-</td>
+KVQGGYGKYQGTWIPLERGMQLAKQYNCEHLLRPIIEFTPAAKSPPL
-<td>-</td>
+>CNBB4890_CRYNE    HTH_APSES 17:123
-<td>-</td>
+IYKATYSGVPVYEMVCRDVAVMRRRSDAYLNATQILKVAGFDKPQRTRVLEREVQKGEHE
-<td>-</td>
+KVQGGYGKYQGTWIPIERGLALAKQYGVEDILRPIIDYVPTSVSPPP
-<td bgcolor="#dad8fd">D</td>
+>COCMIDRAFT_338_BIPOR    HTH_APSES 9:115
-<td bgcolor="#fdeeef">L</td>
+IYSATYSNVPVYECNVNGHHVMRRRADDWINATHILKVADYDKPARTRILEREVQKGVHE
+KVQGGYGKYQGTWIPLEEGRGLAERNGVLDKMRAIFDYVPGDRSPPP
+>WALSEDRAFT_68476_WALME    HTH_APSES 83:192
+IYSAVYSGVGVYEAMIRGIAVMRRRADGYMNATQILKVAGVDKGRRTKILEREILAGLHE
+KIQGGYGKYQGTWIPFERGRELALQYGCDHLLAPIFDFNPSVMQPSAGRS
+>PGTG_08863_PUCGR    HTH_APSES 90:196
+IYKATYSGVPVLEMPCEGIAVMRRRSDSWLNATQILKVAGFDKPQRTRVLEREIQKGTHE
+KIQGGYGKYQGTWVPLDRGIDLAKQYGVDHLLSALFNFQPSSNESPP
+[...]
+</source>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#c5c2fb">E</td>
-<td bgcolor="#efeefd">H</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">D</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#f7adb3">L</td>
-<td bgcolor="#b0adfa">N</td>
+At the bottom of these sequences, you should see the APSES sequences from
-<td bgcolor="#ffc2c2">I</td>
+YFO, '''in particular the Mbp1 RBM sequence from YFO'''. Email me if you have trouble getting to that stage.
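The core of the extraction steps listed above (look up start and end, fetch the substring, add a header) can be sketched in a few lines. This is an illustration under stated assumptions, not the code of the function in <code>dbUtilities.R</code>: the name <code>featureToFasta()</code>, its arguments and the mock sequence are hypothetical, and the header layout merely imitates the output shown above.

```r
# Hypothetical sketch: turn one annotated feature into a FASTA record.
featureToFasta <- function(name, sequence, start, end, fName, width = 60) {
    stopifnot(start >= 1, end >= start, end <= nchar(sequence))
    # fetch the feature's substring from the full-length sequence
    s <- substr(sequence, start, end)
    # break it into lines of at most `width` characters
    from <- seq(1, nchar(s), by = width)
    body <- substring(s, from, from + width - 1)
    # header: protein name, feature name, start:end
    header <- sprintf(">%s    %s %d:%d", name, fName,
                      as.integer(start), as.integer(end))
    c(header, body)
}

# Mock data: a made-up 100-residue sequence and made-up coordinates.
mockSeq <- paste(rep("ACDEFGHIKL", 10), collapse = "")
fasta <- featureToFasta("MOCK_PROT", mockSeq, 6, 75, "HTH_APSES")
cat(fasta, sep = "\n")
```

Keeping the coordinates in the header makes it possible to map positions in the extracted domain back to the numbering of the full-length protein.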
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#adadff">R</td>
-<td bgcolor="#fcc2c4">V</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#eeeefe">N</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_ASHGO/465-494&nbsp;&nbsp;</td>
+We'll need to align these sequences with the template...
-<td>F</td>
-<td bgcolor="#f4eef8">S</td>
-<td bgcolor="#f2eefa">P</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#f3eef9">Y</td>
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#ffeeee">I</td>
-<td>-</td>
-<td>-</td>
+}}
-<td>-</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#f4eef8">T</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+-->
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fbd8db">L</td>
-<td bgcolor="#ffd8d8">I</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+===Template choice and template sequence===
-<td>-</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#f9eef3">A</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#efc2d0">C</td>
-<td bgcolor="#eeeeff">K</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#e6d8f0">S</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#d3c2ee">P</td>
-<td bgcolor="#f7adb3">L</td>
-<td bgcolor="#b3adf7">H</td>
-<td bgcolor="#ffc2c2">I</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e5adc6">M</td>
-<td bgcolor="#c5c2fb">N</td>
+The [http://swissmodel.expasy.org/ SWISS-MODEL] server provides several different options for constructing homology models. The easiest option requires only a target sequence as input. In this mode the program will automatically choose suitable templates and create an input alignment. I would argue, however, that this is not the best way to use such a service: template choice and alignment both may be significantly influenced by biochemical reasoning, and an automated algorithm cannot make the necessary decisions. Should you use a structure with lower resolution if it has a ligand bound? Should you move an indel from an active site to a loop region even though the sequence similarity score might be lower? Questions like these may yield answers that differ from the best choices an automated algorithm could make. But SWISS-MODEL is flexible and allows us to upload an explicit alignment between target and template. Please note: the model you will produce is "easy": the sequence similarity is high and there are no indels to consider, so the automated mode would have done just as well. But the strategy we pursue here is also suitable for much more difficult problems. The automated strategy probably is not.
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#eeeefe">D</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_CLALU/550-586&nbsp;&nbsp;</td>
-<td>G</td>
-<td bgcolor="#eeeefe">N</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#eeeefe">N</td>
-<td bgcolor="#f4eef7">G</td>
+Template choice is the first step. Often more than one related structure can be found in the PDB. The degree of sequence identity is the most important criterion, but there are many other factors to consider. Please refer to the [[Template_choice_principles|template choice principles]] page on this Wiki where I discuss more details and alternatives. To find related structures, you can search the PDB itself through its '''Advanced Search''' interface; for example one can search for sequence similarity with a BLAST search, or search for structural similarity by accessing structures according to their CATH or SCOP classification. But the BLAST search is probably the method of choice: after all, the most important measure of the probability of success for homology modelling is sequence similarity.
-<td bgcolor="#eeeefe">N</td>
-<td bgcolor="#f4eef8">S</td>
-<td>N</td>
-<td>D</td>
-<td>K</td>
-<td bgcolor="#eeeeff">K</td>
-<td bgcolor="#eeeefe">E</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fbd8db">L</td>
+Defining a '''template''' means finding a PDB coordinate set that has sufficient sequence similarity to your '''target''' that you can build a model based on that '''template'''. To find suitable PDB structures, we will perform a BLAST search at the PDB.
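Since sequence identity is the first criterion for template choice, it can help to compute it explicitly for a target/template pair. The following is a minimal sketch: the function name <code>percentIdentity()</code> is hypothetical, and it assumes the two sequences are already aligned to equal length. Identity is counted over columns in which both sequences have a residue; other definitions of the denominator exist and give slightly different numbers. The example pair is the ''S. pombe'' Mbp1 APSES domain (residues 2-100) and chain A of 1BM8, which align without indels.

```r
# Hypothetical helper: percent identity of two aligned sequences of
# equal length, counted over columns without a gap character ("-").
percentIdentity <- function(a, b) {
    stopifnot(nchar(a) == nchar(b))
    x <- strsplit(a, "")[[1]]
    y <- strsplit(b, "")[[1]]
    aligned <- x != "-" & y != "-"
    100 * sum(x[aligned] == y[aligned]) / sum(aligned)
}

# Target: Mbp1_SCHPO/2-100; template: 1BM8 chain A.
target <- paste0(
    "AVHVAVYSGVEVYECFIKGVSVMRRRRDSWLNATQILKVADFDKPQRTRVLERQVQIGAHEKVQGGYGKYQG",
    "TWVPFQRGVDLATKYKVDGIMSPILSL")
template <- paste0(
    "QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQG",
    "TWVPLNIAKQLAEKFSVYDQLKPLFDF")

identity <- percentIdentity(target, template)
cat(sprintf("Identity: %.1f %%\n", identity))
```

For an alignment that contains gaps, pass the gapped strings: columns with a gap in either sequence are excluded from the count.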
-<td bgcolor="#ffd8d8">I</td>
-<td>S</td>
-<td>K</td>
-<td>F</td>
-<td>L</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#efeefd">H</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#edadbd">F</td>
-<td bgcolor="#b3adf7">H</td>
-<td bgcolor="#ffc2c2">I</td>
+<!-- NOTE TO SELF: use the following sequence to test the procedure
-<td bgcolor="#e4adc7">A</td>
+>Mbp1_SCHPO/2-100 NP_593032
-<td bgcolor="#e4adc7">A</td>
+AVHVAVYSGVEVYECFIKGVSVMRRRRDSWLNATQILKVADFDKPQRTRVLERQVQIGAHEKVQGGYGKYQG
-<td bgcolor="#c6ade5">Y</td>
+TWVPFQRGVDLATKYKVDGIMSPILSL
-<td bgcolor="#c5c2fb">N</td>
+>1BM8_A
-<td bgcolor="#f9eef3">M</td>
+QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQG
-<td bgcolor="#f4eef8">S</td>
+TWVPLNIAKQLAEKFSVYDQLKPLFDF
-</tr>
+-->
-<tr><td nowrap="nowrap">MBPA_COPCI/514-542&nbsp;&nbsp;</td>
-<td>-</td>
-<td bgcolor="#efeefd">H</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#eeeefe">D</td>
-<td bgcolor="#fbeef1">F</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#f4eef8">S</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+{{task|1=
-<td>-</td>
+# Retrieve your '''aligned''' YFO's Mbp1 RBM APSES domain sequence from the <tt>APSES.mfa</tt> selection you have prepared for the phylogeny assignment. This YFO sequence is your '''target''' sequence.
-<td>-</td>
+# Navigate to the [http://www.pdb.org/pdb/home/home.do PDB].
-<td>-</td>
+# Click on '''Advanced''' to enter the advanced search interface.
-<td bgcolor="#fbd8db">L</td>
+# Open the menu to '''Choose a Query Type:'''
-<td bgcolor="#fdd8da">V</td>
+# Find the '''Sequence features''' section and choose '''Sequence (BLAST...)'''
-<td>-</td>
+# Paste your '''target''' sequence into the '''Sequence''' field, select '''not''' to mask low-complexity regions and '''Submit Query'''. Since the E-value is set rather high by default, you will get a number of low-confidence hits as well as the actual homologs; the homologs have very low E-values.
-<td>-</td>
-<td>-</td>
-<td>-</td>
+All hits that are homologs are potentially suitable '''templates''', but some are more suitable than others. Consider how the coordinate sets differ and which features would make each more or less suitable for creating a homology model: you should consider ...
-<td bgcolor="#dad8fd">D</td>
-<td bgcolor="#fdeeef">L</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#c5c2fb">E</td>
-<td bgcolor="#efeefd">H</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">D</td>
-<td bgcolor="#d9c2e7">T</td>
+:*sequence similarity to your target
-<td bgcolor="#ebc2d5">A</td>
+:*size of expected model (= length of alignment)
-<td bgcolor="#ffadad">I</td>
+:*presence or absence of ligands
-<td bgcolor="#b0adfa">N</td>
+:*experimental method and quality of the data set
-<td bgcolor="#ffc2c2">I</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#adadff">R</td>
-<td bgcolor="#fcc2c4">V</td>
-<td bgcolor="#f4eef7">G</td>
+Sequence similarity is the most important, but we can have the PDB tabulate the other features concisely for this task.
-<td bgcolor="#eeeefe">N</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_DEBHA/507-550&nbsp;&nbsp;</td>
-<td>I</td>
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#eeeefe">D</td>
-<td bgcolor="#f4eef8">S</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#eeeefe">E</td>
+# There is a menu to create '''Reports:''' - select '''customizable table'''.
-<td bgcolor="#ffeeee">I</td>
+# Select (at least) the following information items:
-<td>-</td>
+;Structure Summary
-<td>-</td>
+* Experimental Method
-<td>-</td>
+;Sequence
-<td bgcolor="#eeeefe">E</td>
+* Chain Length
-<td bgcolor="#eeeefe">N</td>
+;Ligands
-<td>K</td>
+* Ligand Name
-<td>K</td>
+;Biological details
+* Macromolecule Name
+;Refinement Details
+* Resolution
+* R Work
+* R Free
+# Click: '''Create report'''.
-<td>L</td>
+Unfortunately you don't get the E-values into the report, and those should strongly influence your final decision. In our case, however, the sequences and therefore the E-values of the top three hits are all the same. There is also a new structure from January 2015, with a lower resolution. Some of the sequences have a longer chain length ... but the additional residues are only disordered (otherwise these would be better-suited templates; regrettably, in the ''real world'' you'd need to check that yourself, since there is no automatic tool to evaluate disorder and its effects on template choice). In my opinion that leaves pretty much only one unambiguous choice for our template: 1BM8.
-<td>S</td>
-<td>L</td>
-<td>S</td>
-<td>D</td>
-<td>K</td>
-<td>K</td>
-<td>E</td>
-<td bgcolor="#fbd8db">L</td>
-<td bgcolor="#ffd8d8">I</td>
+;Finally: Click on the 1BM8 ID to navigate to the structure page for the '''template''' and save the FASTA sequence to your computer. This is '''the template sequence'''.
-<td>A</td>
-<td>K</td>
-<td>F</td>
-<td>I</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#efeefd">H</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#ffc2c2">I</td>
+}}
-<td bgcolor="#eeeefe">D</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#edadbd">F</td>
-<td bgcolor="#b3adf7">H</td>
-<td bgcolor="#ffc2c2">I</td>
-<td bgcolor="#fbadaf">V</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#c6ade5">Y</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#fdeeef">L</td>
-<td bgcolor="#eeeefe">N</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1A_SCHCO/388-415&nbsp;&nbsp;</td>
-<td>-</td>
-<td>-</td>
+&nbsp;
-<td bgcolor="#f3eef9">Y</td>
-<td bgcolor="#f2eefa">P</td>
-<td bgcolor="#eeeeff">K</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#fdeeef">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9eef3">A</td>
+===Sequence numbering===
-<td bgcolor="#eeeefe">D</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fdd8da">V</td>
-<td bgcolor="#fbd8db">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#dad8fd">N</td>
+&nbsp;
-<td bgcolor="#fbeef1">F</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#c5c2fb">E</td>
-<td bgcolor="#eeeefe">D</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">E</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
+It is not straightforward at all how to number sequences in such a project. A "natural" numbering starts with the start-codon of the full length protein and goes sequentially from there. However, this does not map exactly to other numbering schemes we have encountered. As you know, the first residue of the APSES domain (as defined by CDD) is not Residue 1 of the Mbp1 protein. The first residue of the 1BM8 FASTA file <small>(one of the related PDB structures)</small> '''is''' the fourth residue of the Mbp1 protein. The first residue in the structure is GLN 3, therefore Q is the first residue in a FASTA sequence derived from the coordinate section of the PDB file (the <code>ATOM  </code> records). In the 1MB1 structure, the original N-terminal amino acids are present in the molecule, therefore they are present in the FASTA file which starts with <code>MSNQIY...</code>, but they are disordered in the structure and no coordinates are present for M and S. A sequence derived explicitly from the coordinates is therefore different from the reported FASTA sequence, which is really bad because that is what the modeling program has to work with ... and so on. It can get complicated. You need to remember: a sequence number is not absolute, but assigned in a particular context, and you need to be careful how to do this.
-<td bgcolor="#f7adb3">L</td>
-<td bgcolor="#ccaddf">T</td>
-<td bgcolor="#ecc2d5">M</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#adadff">R</td>
-<td bgcolor="#efc2d0">C</td>
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#f4eef8">S</td>
+Fortunately, the numbering for the residues in the coordinate section of our '''template''' structure corresponds not to its FASTA sequence, but to the numbering of the gene. Otherwise we would need to renumber the sequence <small>(e.g. by using the bio3d R package)</small>. If we did not, the sequence numbers in the model might not correspond to the sequence numbers of our target.
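
The offset between the two numbering schemes can be made explicit with a few lines of base R. This is only a sketch: the variable and function names are made up for illustration, and the two strings are just the first residues quoted above, not the full sequences.

<source lang="rsplus">
# First residues of the full-length Mbp1 protein and of the 1BM8 FASTA
# sequence, as quoted in the text (fragments only).
mbp1Start     <- "MSNQIY"
templateStart <- "QIY"

# Position of the template's first residue in the gene numbering
iFirst <- as.integer(regexpr(templateStart, mbp1Start, fixed = TRUE))
iFirst

# Convert between an index in the template FASTA and the gene numbering
fastaToGene <- function(i, first = iFirst) { i + first - 1 }
geneToFasta <- function(n, first = iFirst) { n - first + 1 }

fastaToGene(1)           # gene number of the first FASTA residue
geneToFasta(c(50, 74))   # FASTA positions of gene residues 50 and 74
</source>

The same arithmetic applies to any sequence that is a contiguous excerpt of a longer one, as long as there are no gaps between the two.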
-</tr>
-<tr><td nowrap="nowrap">MBP1_AJECA/374-403&nbsp;&nbsp;</td>
-<td>T</td>
-<td bgcolor="#fdeeef">L</td>
-<td bgcolor="#f2eefa">P</td>
-<td bgcolor="#f2eefa">P</td>
-<td bgcolor="#efeefd">H</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#ffeeee">I</td>
+<!--
-<td>-</td>
+BELOW IS NOT NECESSARY FOR THE 1BM8 TEMPLATE. ALSO extraction can be done with bio3D
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f4eef8">S</td>
-<td bgcolor="#f9eef3">M</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fbd8db">L</td>
-<td bgcolor="#fbd8db">L</td>
-<td>-</td>
+The homology '''model''' will be based on an alignment of '''target''' and '''template'''. Thus we have to define the target sequence. As discussed in class, PDB files have an explicit  and an implied sequence and these do not necessarily have to be the same. To compare the implied and the explicit sequence for the template, you need to extract sequence information from coordinates. One way to do this is via the Web interface for [http://swift.cmbi.ru.nl/servers/html/index.html '''WhatIf'''], a crystallography and molecular modeling package that offers many useful tools for coordinate manipulation tasks.
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#e6d8f0">S</td>
-<td bgcolor="#f4eef8">S</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#d8c2e8">S</td>
-<td bgcolor="#eeeefe">N</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">D</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#f7adb3">L</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
+*Navigate to the '''Administration''' sub-menu of the [http://swift.cmbi.ru.nl/servers/html/index.html WhatIf Web server]. Follow the link to '''Make sequence file from PDB file'''. Enter the PDB-ID of your template into the form field and '''Send''' the request to the server. The server accesses the PDB file and extracts sequence information directly from the <code>ATOM&nbsp;&nbsp;</code> records of the file. The results will be returned in PIR format. Copy the results, edit them to FASTA format and save them in a text-only file. Make sure you create a valid FASTA formatted file! Use this '''implied''' sequence to check if and how it differs from the sequence ...
-<td bgcolor="#adadff">K</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#faeef2">C</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_PARBR/380-409&nbsp;&nbsp;</td>
-<td>I</td>
-<td bgcolor="#fdeeef">L</td>
-<td bgcolor="#f2eefa">P</td>
+:*... listed in the <code>SEQRES</code> records of the coordinate file;
-<td bgcolor="#f2eefa">P</td>
+:*... given in the FASTA sequence for the template, which is provided by the PDB;
-<td bgcolor="#efeefd">H</td>
+:*... stored in the protein database of the NCBI.
-<td bgcolor="#eeeefe">Q</td>
+: and record your results.
-<td bgcolor="#ffeeee">I</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f4eef8">S</td>
-<td bgcolor="#fdeeef">L</td>
+* Establish how the sequence numbers in the coordinate section of your template(*) correspond to your target sequence numbering.
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+:(*) <small>These residue numbers are important, since they are referenced e.g. by VMD when you visualize the structure. The easiest way to list them is via the ''Sequence Viewer'' extension of VMD.</small>.
-<td>-</td>
+:<small>Don't do this for every residue individually but define ranges. Look at the correspondence of the first and last residue of target and template sequence and take indels into account. Establishing sequence correspondence precisely is crucially important! For example, when a publication refers to a residue by its sequence number, you have to be able to relate that number to the residue numbers of the model as well as your target sequence.</small>.
-<td bgcolor="#fbd8db">L</td>
+&nbsp;
-<td bgcolor="#fbd8db">L</td>
+&nbsp;
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#e6d8f0">S</td>
-<td bgcolor="#f4eef8">S</td>
+-->
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#d8c2e8">S</td>
-<td bgcolor="#eeeefe">N</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">D</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#f7adb3">L</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#adadff">K</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#faeef2">C</td>
-</tr>
+&nbsp;
-<tr><td nowrap="nowrap">MBP1_NEOFI/363-392&nbsp;&nbsp;</td>
-<td>T</td>
-<td bgcolor="#faeef2">C</td>
-<td bgcolor="#f4eef8">S</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#eeeefe">D</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#ffeeee">I</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#eeeefe">D</td>
-<td bgcolor="#fdeeef">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+===The input alignment===
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fbd8db">L</td>
-<td bgcolor="#fbd8db">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#e6d8f0">S</td>
-<td bgcolor="#faeef2">C</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#d8c2e8">S</td>
-<td bgcolor="#eeeefe">N</td>
-<td bgcolor="#cfaddc">G</td>
+&nbsp;
-<td bgcolor="#dad8fd">D</td>
+The sequence alignment between target and template is the single most important factor that determines the quality of your model. No comparative modeling process will repair an incorrect alignment; it is useful to consider a homology model as a three-dimensional map of a sequence alignment rather than a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these just because they are convenient. You should take advantage of the much more sophisticated alignment methods available. Analysis of wrong models can't be expected to produce right results.
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#f7adb3">L</td>
-<td bgcolor="#fcc2c4">V</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#adadff">R</td>
+The best possible alignment is usually constructed from a multiple sequence alignment that includes at least '''the target and template sequence''' and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. You should review your alignment carefully and, wherever required, adjust it manually to move insertions or deletions between target and template out of the secondary structure elements of the template structure.
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#f9eef3">A</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_ASPNI/365-394&nbsp;&nbsp;</td>
-<td>T</td>
-<td bgcolor="#fbeef1">F</td>
-<td bgcolor="#f4eef8">S</td>
-<td bgcolor="#f2eefa">P</td>
+In most of the Mbp1 orthologues, we do not observe indels in the APSES domain regions. Evolutionary pressure on the APSES domains has selected against indels in the more than 600 million years these sequences have evolved independently in their respective species. To obtain an alignment between the '''template sequence''' and the '''target sequence''' from your species, proceed as follows.
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#fdeeee">V</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#eeeefe">D</td>
-<td bgcolor="#fdeeef">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+&nbsp;
-<td bgcolor="#fbd8db">L</td>
-<td bgcolor="#fbd8db">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#e6d8f0">S</td>
-<td bgcolor="#faeef2">C</td>
-<td bgcolor="#eeeefe">Q</td>
+{{task|1=
-<td bgcolor="#c5c2fb">D</td>
+Choose one of the following options to align your '''target''' and '''template''' sequence. Make sure your '''template''' sequence is included, i.e. the FASTA sequence of 1BM8.
-<td bgcolor="#d8c2e8">S</td>
-<td bgcolor="#fdeeee">V</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">D</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#fbadaf">V</td>
-<td bgcolor="#f7adb3">L</td>
-<td bgcolor="#fcc2c4">V</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#adadff">R</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#fdeeee">V</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_UNCRE/377-406&nbsp;&nbsp;</td>
+;In Jalview...
-<td>M</td>
+* Load your APSES domain sequences plus the 1BM8 sequence (the sequence of your '''template protein''') in Jalview and align using Muscle.
-<td bgcolor="#f3eef9">Y</td>
+* Delete all sequences you no longer need, i.e. keep only the APSES domains of the '''target''' (from your species) and the '''template''' (from the PDB) and choose '''Edit &rarr; Remove empty columns'''. This is your '''input alignment'''.
-<td bgcolor="#f2eefa">P</td>
+* Choose '''File&rarr;Output to textbox&rarr;FASTA''' to obtain the aligned sequences. They should both have exactly the same length, i.e. N- or C-termini have to be padded by hyphens if the original sequences had different lengths. Save the sequences in a text file.
-<td bgcolor="#efeefd">H</td>
-<td bgcolor="#efeefd">H</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#fdeeee">V</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#fdeeef">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+;Using a different MSA program
-<td>-</td>
+* Copy the FASTA formatted sequences of the Mbp1 proteins in the reference species from the [[Reference APSES domains (reference species)|'''Reference APSES domain page''']].
-<td>-</td>
+* Access the [http://www.ebi.ac.uk/Tools/msa/ '''MSA tools page at the EBI'''].
-<td>-</td>
+* Paste the Mbp1 sequence set, your '''target''' sequence and the '''template''' sequence into the input form.
-<td>-</td>
+*Run an alignment (I like T-coffee) and save the output.
-<td bgcolor="#fbd8db">L</td>
-<td bgcolor="#fbd8db">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f2d8e5">A</td>
-<td bgcolor="#f4eef8">S</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#d8c2e8">S</td>
-<td bgcolor="#eeeefe">N</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">D</td>
+;Using the '''R''' bioconductor [[BIO_Assignment_Week_4#Computing_an_MSA_in_R|MSA package that you used previously]].
-<td bgcolor="#d9c2e7">T</td>
+Refer back to that page if you lack notes on how to go about this.
-<td bgcolor="#ebc2d5">A</td>
+}}
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#f7adb3">L</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#adadff">K</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#faeef2">C</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_PENCH/439-468&nbsp;&nbsp;</td>
-<td>T</td>
-<td bgcolor="#faeef2">C</td>
-<td bgcolor="#f4eef8">S</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#eeeefe">D</td>
+Whatever method you use, the result should be a two-sequence alignment in '''multi-FASTA''' format that was constructed from a number of supporting sequences and that contains your aligned '''target''' and '''template''' sequence. This is your '''input alignment''' for the homology modeling server. For a ''Schizosaccharomyces pombe'' model, which I am using as an example here, it looks like this:
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#ffeeee">I</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#f9eef3">M</td>
-<td>-</td>
-<td>-</td>
+ >1BM8_A
-<td>-</td>
+ QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI
-<td>-</td>
+ LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
-<td>-</td>
+ >Mbp1_SCHPO 2-100 NP_593032
-<td>-</td>
+ AVHVAVYSGVEVYECFIKGVSVMRRRRDSWLNATQILKVADFDKPQRTRV
-<td>-</td>
+ LERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPILSL
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fbd8db">L</td>
-<td bgcolor="#fbd8db">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#e6d8f0">S</td>
-<td bgcolor="#faeef2">C</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
+In this case, there are no indels and therefore no hyphens - in your case there may be.
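
Before submitting an input alignment it is worth checking it programmatically. The following base R sketch uses the two example sequences shown above; the variable names are illustrative only, and the checks are a suggestion, not something the modeling server requires in this form.

<source lang="rsplus">
# Aligned sequences copied from the S. pombe example above
template <- paste0("QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI",
                   "LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF")
target   <- paste0("AVHVAVYSGVEVYECFIKGVSVMRRRRDSWLNATQILKVADFDKPQRTRV",
                   "LERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPILSL")

# Both rows of an alignment must have the same number of columns
stopifnot(nchar(template) == nchar(target))

# Only one-letter amino acid codes and hyphens are expected
stopifnot(grepl("^[ACDEFGHIKLMNPQRSTVWY-]+$", c(template, target)))

# Split into columns and compute identity over columns without gaps
tpl <- strsplit(template, "")[[1]]
tgt <- strsplit(target, "")[[1]]
ungapped <- tpl != "-" & tgt != "-"
pctIdentity <- 100 * sum(tpl[ungapped] == tgt[ungapped]) / sum(ungapped)
round(pctIdentity, 1)
</source>

If either <code>stopifnot()</code> fails, the alignment file needs to be repaired before it is pasted into the server form.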
-<td bgcolor="#c5c2fb">Q</td>
-<td bgcolor="#eeeefe">N</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">D</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#fbadaf">V</td>
-<td bgcolor="#f7adb3">L</td>
-<td bgcolor="#fcc2c4">V</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#adadff">R</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#f9eef3">A</td>
-</tr>
-<tr><td nowrap="nowrap">MBPA_TRIVE/407-436&nbsp;&nbsp;</td>
+&nbsp;
-<td>V</td>
-<td bgcolor="#fbeef1">F</td>
-<td bgcolor="#f2eefa">P</td>
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#efeefd">H</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#ffeeee">I</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+==Homology model==
-<td bgcolor="#f4eef8">S</td>
-<td bgcolor="#fdeeef">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fbd8db">L</td>
-<td bgcolor="#fbd8db">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+The alignment defines the residue by residue relationship between '''target''' and '''template''' sequence. All we need to do now is to change every residue of the template to the target sequence.
-<td bgcolor="#e6d8f0">S</td>
-<td bgcolor="#f4eef8">S</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#eeeefe">N</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">D</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#f7adb3">L</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#adadff">K</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#f4eef7">G</td>
+===SwissModel===
-<td bgcolor="#faeef2">C</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_PHANO/400-429&nbsp;&nbsp;</td>
-<td>T</td>
-<td bgcolor="#f4eef9">W</td>
-<td bgcolor="#ffeeee">I</td>
-<td bgcolor="#f2eefa">P</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#eeeefe">E</td>
+&nbsp;<br>
-<td bgcolor="#fdeeee">V</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f4eef8">T</td>
-<td bgcolor="#eeeeff">R</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+Access the Swissmodel server at '''http://swissmodel.expasy.org''' and click on the '''Start Modelling''' button. Under the '''Supported Inputs''', choose '''Target-Template Alignment'''.
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fbd8db">L</td>
-<td bgcolor="#fbd8db">L</td>
+{{task|1=
-<td>-</td>
+*Paste the aligned sequences of the YFO target and the 1BM8 template into the form field. SwissModel will analyse the sequences and ask you to identify target and template. The YFO sequence is your target. The 1BM8 sequence is the template.
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#f9eef3">A</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#c5c2fb">Q</td>
+* Click '''Validate Target Template Alignment''' and check that the returned alignment is correct. All non-identical residues are shown in light-grey.
-<td bgcolor="#eeeefe">N</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">D</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#ffadad">I</td>
-<td bgcolor="#e5adc6">M</td>
-<td bgcolor="#ffc2c2">I</td>
-<td bgcolor="#e4adc7">A</td>
+*Click '''Build Model''' to start the modeling process. This will take about a minute.
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#adadff">R</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#f9eef3">A</td>
-</tr>
-<tr><td nowrap="nowrap">MBPA_SCLSC/294-313&nbsp;&nbsp;</td>
-<td>-</td>
-<td>-</td>
+* The results page returns information about the model and its quality. You can rotate the model in the window on the right with the mouse. Regions that have a reddish hue have lower quality scores, i.e. they were harder to model or could not be modelled well with good geometry. Hovering the mouse over parts of the structure highlights the respective region of the sequence alignment.
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+*Mouse over the '''Model 01''' dropdown menu (under the icon of the template structure), and choose the '''PDB file'''. Note that the B-factor column of the coordinate section contains the QMEAN scores (between 0 and 1) that the server has calculated. Higher is better. Save the PDB file on your computer.
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+* Open the [http://swissmodel.expasy.org/docs/help SwissModel documentation] in a new tab and read about the modelling process. There are a number of important technical details that help to understand what the computed coordinates of your model mean. Pay special attention to the '''GMQE''' and '''QMEAN''' quality scores.
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fbd8db">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#dad8fd">D</td>
+* Also save:
-<td bgcolor="#f9eef3">A</td>
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#ffc2c2">I</td>
-<td bgcolor="#eeeefe">N</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
+** The output page as pdf (for reference)
-<td bgcolor="#ffadad">I</td>
+** The modeling report (as pdf)
-<td bgcolor="#b3adf7">H</td>
+}}
-<td bgcolor="#ffc2c2">I</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#adadff">K</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#eeeeff">K</td>
-<td bgcolor="#f9eef3">A</td>
-</tr>
-<tr><td nowrap="nowrap">MBPA_PYRIS/363-392&nbsp;&nbsp;</td>
-<td>T</td>
-<td bgcolor="#f4eef9">W</td>
-<td bgcolor="#ffeeee">I</td>
-<td bgcolor="#f2eefa">P</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#fdeeee">V</td>
+==Model interpretation==
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f4eef8">T</td>
-<td bgcolor="#eeeeff">R</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fbd8db">L</td>
-<td bgcolor="#fbd8db">L</td>
-<td>-</td>
+We have spent a significant amount of time preparing data for the analysis, and in practice it usually turns out that way: the preparation of data occupies the greatest part of our efforts. The actual computational analysis is generally quite fast. And, unfortunately, the '''interpretation of results''' is often somewhat neglected. Don't be that way. Data does not explain itself. The interpretation of your computational results is the most important part.
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#f9eef3">A</td>
-<td bgcolor="#f9eef3">A</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#c5c2fb">Q</td>
-<td bgcolor="#eeeefe">N</td>
+We will look at our homology model with two different questions:
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">D</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#ffadad">I</td>
-<td bgcolor="#e5adc6">M</td>
-<td bgcolor="#ffc2c2">I</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
+* Can we define the DNA binding residues?
-<td bgcolor="#adadff">R</td>
+* Can we tell which residues are conserved for functional reasons, rather than for structural reasons?
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#f9eef3">A</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_/361-390&nbsp;&nbsp;</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#eeeefe">N</td>
-<td bgcolor="#efeefd">H</td>
-<td bgcolor="#f4eef8">S</td>
-<td bgcolor="#fdeeef">L</td>
-<td>G</td>
-<td>V</td>
-<td>L</td>
-<td bgcolor="#f4eef8">S</td>
-<td bgcolor="#eeeefe">Q</td>
+&nbsp;
-<td>-</td>
+&nbsp;
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+=== The PDB file ===
-<td>-</td>
+&nbsp;<br>
-<td bgcolor="#f7d8e0">F</td>
-<td bgcolor="#f3d8e4">M</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#dad8fd">D</td>
-<td bgcolor="#f4eef8">T</td>
+{{task|1=
-<td bgcolor="#eeeefe">Q</td>
+Open your '''model''' coordinates in a text-editor. Make sure you view the PDB file in a fixed-width font (like "courier") so that all the columns line up correctly. Then consider the following questions:
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">D</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#f7adb3">L</td>
+*What is the residue number of the first residue in the '''model'''? What should it be, based on the alignment? If the putative DNA binding region was reported to be residues 50-74 in the Mbp1 protein, which residues of your '''model''' correspond to that region?
-<td bgcolor="#b3adf7">H</td>
-<td bgcolor="#ffc2c2">I</td>
-<td bgcolor="#f7adb3">L</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#adadff">R</td>
-<td bgcolor="#d8c2e8">S</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#f9eef3">A</td>
-</tr>
+That's not easy to tell. But it should be.
-<tr><td nowrap="nowrap">MBP1_ASPFL/328-364&nbsp;&nbsp;</td>
-<td>T</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#f2eefa">P</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#fdeeee">V</td>
-<td>I</td>
+}}
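
One way to make this correspondence easy to look up is to derive it from the input alignment. The sketch below is an assumption-laden illustration in base R: the function <code>mapTemplateToModel()</code> is hypothetical (not part of any package), it assumes the model is numbered sequentially from 1, and the short alignment with a one-residue insertion is invented to show how gaps are handled.

<source lang="rsplus">
# Map residue numbers of the template (gene numbering) to residue numbers
# of a model that is numbered sequentially from 1.
# tplAli, tgtAli: aligned sequences of equal length, "-" marks a gap.
# tplFirst: number of the first template residue (4 for 1BM8).
mapTemplateToModel <- function(tplAli, tgtAli, tplFirst = 4) {
    tpl <- strsplit(tplAli, "")[[1]]
    tgt <- strsplit(tgtAli, "")[[1]]
    stopifnot(length(tpl) == length(tgt))
    # Running residue counters; NA where the column holds a gap
    tplNum <- ifelse(tpl == "-", NA, cumsum(tpl != "-") + tplFirst - 1)
    modNum <- ifelse(tgt == "-", NA, cumsum(tgt != "-"))
    data.frame(template = tplNum, model = modNum)
}

# Toy alignment (hypothetical): the target has a one-residue insertion
map <- mapTemplateToModel("QIYS-ARYSG", "AVHVGAVYSG")
map

# Model residues that correspond to template residues 6 to 9
map$model[map$template %in% 6:9]
</source>

For an alignment without indels the table reduces to a constant offset between the two columns; with indels, the offset changes after every gap, which is why defining ranges from the alignment is safer than counting by hand.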
-<td>T</td>
-<td>L</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#eeeeff">R</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f7d8e0">F</td>
-<td bgcolor="#ffd8d8">I</td>
-<td>S</td>
-<td>E</td>
+===R code: renumbering the model ===
-<td>I</td>
-<td>V</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#fdeeef">L</td>
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#cfaddc">G</td>
+As you have seen above, SwissModel numbers the first residue "1" and does not keep the numbering of the template. We should renumber the model so we can compare the model and the template with the same residue numbers. (An alternative would be to renumber the model to correspond to the sequence it came from. Remember that we have only excised a domain from the full-length sequence.) Carefully doing this by hand will take you a bit less than an hour. Fortunately there is a very useful R package that will help: '''bio3d'''.
-<td bgcolor="#dad8fd">D</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#f7adb3">L</td>
-<td bgcolor="#b0adfa">N</td>
-<td bgcolor="#f9c2c7">L</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#adadff">R</td>
+{{task|1=
-<td bgcolor="#ebc2d5">A</td>
+# Navigate to the [http://thegrantlab.org/bio3d/index.php '''bio3d'''] home page. '''bio3d''' has recently been made available via CRAN; previously it had to be compiled from source.
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#f4eef8">S</td>
-</tr>
-<tr><td nowrap="nowrap">MBPA_MAGOR/375-404&nbsp;&nbsp;</td>
-<td>Q</td>
-<td bgcolor="#efeefd">H</td>
-<td bgcolor="#eeeefe">D</td>
-<td bgcolor="#f2eefa">P</td>
-<td bgcolor="#eeeefe">N</td>
-<td bgcolor="#fbeef1">F</td>
-<td bgcolor="#fdeeee">V</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#eeeefe">Q</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+# Explore and execute the following '''R''' script. I am assuming that your model is in your <code>PROJECTDIR</code> folder, change paths and filenames as required.
-<td bgcolor="#fbd8db">L</td>
-<td bgcolor="#fbd8db">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#dad8fd">D</td>
-<td bgcolor="#f9eef3">A</td>
-<td bgcolor="#eeeefe">Q</td>
+<source lang="rsplus">
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#eeeefe">D</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#fbadaf">V</td>
-<td bgcolor="#b3adf7">H</td>
+setwd(PROJECTDIR)
-<td bgcolor="#f9c2c7">L</td>
+PDB_INFILE      <- "YFOmodel.pdb"
-<td bgcolor="#e4adc7">A</td>
+PDB_OUTFILE     <- "YFOmodelRenumbered.pdb"
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#b0adfa">Q</td>
-<td bgcolor="#c2c2ff">R</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#f4eef8">S</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_CHAGL/361-390&nbsp;&nbsp;</td>
-<td>S</td>
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#f4eef8">S</td>
-<td bgcolor="#f9eef3">A</td>
-<td bgcolor="#eeeefe">D</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#fdeeef">L</td>
-<td>-</td>
-<td>-</td>
+# The bio3d package provides functions for working with
-<td>-</td>
+# protein structures in R
-<td bgcolor="#eeeefe">Q</td>
+if (!require(bio3d, quietly=TRUE)) {
-<td bgcolor="#eeeefe">Q</td>
+	install.packages("bio3d")
-<td>-</td>
+	library(bio3d)
-<td>-</td>
+}
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+# == Read the YFO pdb file
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fbd8db">L</td>
-<td bgcolor="#fbd8db">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+iFirst <- 4  # residue number for the first residue
-<td>-</td>
-<td bgcolor="#dad8fd">D</td>
+YFOmodel <- read.pdb(PDB_INFILE) # read the PDB file into a list
-<td bgcolor="#f4eef8">S</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">N</td>
+YFOmodel           # examine the information
-<td bgcolor="#d9c2e7">T</td>
+YFOmodel$atom[1,]  # get information for the first atom
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#fbadaf">V</td>
-<td bgcolor="#b3adf7">H</td>
-<td bgcolor="#f9c2c7">L</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e5adc6">M</td>
-<td bgcolor="#c2c2ff">R</td>
+# Explore ?read.pdb and study the examples.
-<td bgcolor="#eeeefe">D</td>
-<td bgcolor="#f9eef3">A</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_PODAN/372-401&nbsp;&nbsp;</td>
-<td>V</td>
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#f2eefa">P</td>
-<td bgcolor="#eeeefe">E</td>
+# == Modify residue numbers for each atom
-<td bgcolor="#eeeefe">E</td>
+resNum <- as.numeric(YFOmodel$atom[ , "resno"])
-<td bgcolor="#fdeeee">V</td>
+resNum
-<td>-</td>
+resNum <- resNum - resNum[1] + iFirst  # add offset
-<td>-</td>
+YFOmodel$atom[ , "resno"] <- resNum    # replace old numbers with new
-<td>-</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#f9eef3">A</td>
-<td>-</td>
-<td>-</td>
+# check result
-<td>-</td>
+YFOmodel$atom[ , "resno"]
-<td>-</td>
+YFOmodel$atom[1, ]
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fbd8db">L</td>
+# == Write output to file
-<td bgcolor="#fbd8db">L</td>
+write.pdb(pdb = YFOmodel, file = PDB_OUTFILE)  # filename as defined at the top
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#dad8fd">D</td>
-<td bgcolor="#f9eef3">A</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
+# Done. Open the PDB file you have written in a text editor
-<td bgcolor="#c5c2fb">E</td>
+# and confirm that this has worked.
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#f7adb3">L</td>
-<td bgcolor="#b3adf7">H</td>
-<td bgcolor="#f9c2c7">L</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#adadff">R</td>
-<td bgcolor="#fcc2c4">V</td>
-<td bgcolor="#eeeefe">N</td>
-<td bgcolor="#f9eef3">A</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_LACTH/458-487&nbsp;&nbsp;</td>
+</source>
+}}
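
The offset arithmetic in the script can be tried out without a PDB file. The sketch below uses an invented vector of residue numbers as a stand-in for the <code>resno</code> column; <code>iFirstTarget</code> is a hypothetical name, and its value of 2 is taken from the ''S. pombe'' example header (residues 2-100) to illustrate the alternative renumbering mentioned above.

<source lang="rsplus">
# Toy stand-in for the residue numbers of a model's atoms (hypothetical):
# three residues numbered from 1, with several atoms each
resNum <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3)

# Renumber to match the 1BM8 template, which starts at residue 4
iFirst <- 4
templateNumbering <- resNum - resNum[1] + iFirst

# Alternative: renumber to match the position in the full-length target,
# e.g. 2 for the S. pombe example, whose domain spans residues 2-100
iFirstTarget <- 2
targetNumbering <- resNum - resNum[1] + iFirstTarget

# Atoms of the same residue still share one number after the shift
table(resNum)
table(templateNumbering)
table(targetNumbering)
</source>

Because the same constant is added to every atom, the number of atoms per residue is unchanged; only the labels move.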
-<td>F</td>
-<td bgcolor="#f4eef8">S</td>
-<td bgcolor="#f2eefa">P</td>
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#f3eef9">Y</td>
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#ffeeee">I</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+&nbsp;
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#eeeefe">N</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fbd8db">L</td>
-<td bgcolor="#ffd8d8">I</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+===First visualization===
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#f9eef3">A</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#c5c2fb">Q</td>
-<td bgcolor="#eeeefe">N</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">D</td>
-<td bgcolor="#d9c2e7">T</td>
+&nbsp;<br>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#fbadaf">V</td>
-<td bgcolor="#b3adf7">H</td>
-<td bgcolor="#f9c2c7">L</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#b0adfa">Q</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#f4eef7">G</td>
+Since a homology model inherits its structural details from the '''template''', your model of the YFO sequence should look very similar to the original 1BM8 structure.
-<td bgcolor="#eeeefe">D</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_FILNE/433-460&nbsp;&nbsp;</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f3eef9">Y</td>
-<td bgcolor="#f2eefa">P</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#eeeefe">E</td>
+{{task|1=
-<td bgcolor="#fdeeef">L</td>
+# Start Chimera and load the '''model''' coordinates that you have just renumbered.
-<td>-</td>
+# From the PDB, also load the '''template''' structure. (Use File &rarr; Fetch by ID ...)
-<td>-</td>
+# In the '''Favourites''' &rarr; '''Model Panel''' window you can switch between the two molecules.
-<td>-</td>
+# Hide the ribbon and choose '''backbone only &rarr; full'''. You will note that the backbone of the two structures is virtually identical.
-<td bgcolor="#f9eef3">A</td>
+# Next, choose '''Actions &rarr; Atoms/Bonds &rarr; show''' to display the two molecules in a stick style and note how the sidechains have been modeled. Note especially how sidechain coordinates have been guessed where the template had shorter sidechains than the target. It may be clearer if you hide H-atoms: '''Select &rarr; Chemistry &rarr; Element &rarr; H''' and '''Actions &rarr; Atoms/Bonds &rarr; hide'''
-<td bgcolor="#eeeefe">D</td>
+# Display only residues 50 to 74 to focus on the putative helix-turn-helix domain. You can drag your mouse in the '''Favorites &rarr; Sequence''' window to select the range, then use '''Select &rarr; Invert (selected models)''' and '''Actions &rarr; Atoms/Bonds &rarr; hide'''. Or you can use Chimera's command line: <code>~display</code> to undisplay everything, then <code>show #:50-74</code> to show this residue range for all models.
-<td>-</td>
+# Study the result: a model of the HTH subdomain of YFO's RBM to Mbp1.
-<td>-</td>
+}}
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fdd8da">V</td>
-<td bgcolor="#ffd8d8">I</td>
+&nbsp;
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#fbeef1">F</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#c5c2fb">E</td>
+==Coloring the model by energy==
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">E</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#f7adb3">L</td>
-<td bgcolor="#ccaddf">T</td>
-<td bgcolor="#ffc2c2">I</td>
-<td bgcolor="#e4adc7">A</td>
+SwissModel calculates energies for each residue of the model with a molecular mechanics forcefield. The SwissModel modeling summary page contains a plot of these energies as a function of sequence number. The values - between 0.0 and 1.0 - are stored in the PDB file's B-factor field.
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#adadff">R</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#f4eef8">S</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_KLULA/477-506&nbsp;&nbsp;</td>
-<td>F</td>
-<td bgcolor="#f4eef8">T</td>
-<td bgcolor="#f2eefa">P</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#f3eef9">Y</td>
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#ffeeee">I</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#eeeefe">D</td>
+{{task|1=
-<td bgcolor="#fdeeee">V</td>
+# Back in Chimera, use the model panel to '''close''' the 1BM8 structure. Select all and show Atoms, bonds to view the entire model structure.
-<td>-</td>
+# Choose '''Tools &rarr; Depiction &rarr; Render by attribute''' and select '''attributes of atoms''', '''Attribute: bfactor''', check '''color atoms''' and click '''OK'''.
-<td>-</td>
+# Study the result: It seems that residues in the core of the protein have better energies (higher values) than residues at the surface. Why could that be the case?
-<td>-</td>
+}}
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+Study the options of this window a bit: rendering by attribute is a powerful way to store and depict all manner of information with the molecule. You can simply write a little R script that uses bio3D to replace the B-factor or occupancy values with any value you might be interested in: energies, conservation scores, information ... whatever. Then render this property to map it onto the 3D structure of your molecule.
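The idea of storing an arbitrary per-residue value in the B-factor field can be sketched as follows. The text suggests R with bio3D; this sketch swaps in plain Python and edits the fixed-column PDB text directly (B-factor in columns 61 to 66), so it assumes nothing about the bio3D API. The function name <code>set_bfactors</code> and the <code>values</code> dictionary are hypothetical names chosen for illustration.

```python
def set_bfactors(pdb_lines, values):
    """Return a copy of pdb_lines with B-factors replaced per residue.

    values maps (chain_id, residue_number) to a float, e.g. a conservation
    score. Atoms of residues that have no entry are left unchanged.
    Assumes standard fixed-column PDB records (B-factor in columns 61-66).
    """
    out = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 66:
            # chain ID is column 22, residue number is columns 23-26
            key = (line[21], int(line[22:26]))
            if key in values:
                # overwrite columns 61-66 with the new value, width 6, 2 decimals
                line = line[:60] + f"{values[key]:6.2f}" + line[66:]
        out.append(line)
    return out
```

To use this, one would read the model file into a list of lines, call <code>set_bfactors</code> with the scores of interest, write the lines back to a new file, and then colour by <code>bfactor</code> in Chimera as in the task above.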
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fbd8db">L</td>
-<td bgcolor="#ffd8d8">I</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#eeeefe">D</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#d8c2e8">S</td>
-<td bgcolor="#d3c2ee">P</td>
+&nbsp;
-<td bgcolor="#f7adb3">L</td>
-<td bgcolor="#b3adf7">H</td>
-<td bgcolor="#d5c2ec">Y</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#ccaddf">T</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#eeeeff">K</td>
-<td bgcolor="#eeeefe">D</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_SCHST/468-501&nbsp;&nbsp;</td>
-<td>A</td>
-<td bgcolor="#eeeeff">K</td>
-<td bgcolor="#eeeefe">D</td>
-<td bgcolor="#f2eefa">P</td>
-<td bgcolor="#eeeefe">D</td>
-<td bgcolor="#eeeefe">N</td>
-<td bgcolor="#eeeeff">K</td>
+&nbsp;
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#eeeeff">K</td>
-<td bgcolor="#eeeefe">D</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+==Modelling DNA binding==
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fbd8db">L</td>
-<td bgcolor="#ffd8d8">I</td>
-<td>A</td>
+One of the really interesting questions we can discuss with reference to our homology model is how sequence variation might result in changed DNA recognition sites, and then lead to changed cognate DNA binding sequences. In order to address this, we would need to generate a plausible structural model for how DNA is bound to APSES domains.
-<td>K</td>
-<td>F</td>
-<td>I</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#efeefd">H</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#d8c2e8">S</td>
-<td bgcolor="#eeeefe">D</td>
+Since there is currently no software available that would reliably model such a complex from first principles<ref>''Rosetta'' may get the structure approximately right, ''Autodock'' may get the complex approximately right, but the coordinate changes involved in induced fit make the result unreliable - and we have no good way to validate whether the predicted complex is correct. </ref>, we will base a model of a bound complex on homology modelling as well. This means we need to find a similar structure for which the position of bound DNA is known, then superimpose that structure with our model. This places the DNA molecule into the spatial context of the model we are studying. It so happens that early in 2015 an APSES domain structure with bound DNA was published. You probably noticed it as a result of the PDB BLAST search: [http://www.rcsb.org/pdb/explore/explore.do?structureId=4UX5 '''4UX5'''], from the ''Magnaporthe oryzae'' Mbp1 orthologue PCG2<ref>{{#pmid: 25550425}}</ref>.
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#edadbd">F</td>
-<td bgcolor="#b3adf7">H</td>
-<td bgcolor="#ffc2c2">I</td>
-<td bgcolor="#eaadc0">C</td>
-<td bgcolor="#caade0">S</td>
-<td bgcolor="#b3adf7">H</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#fdeeef">L</td>
-<td bgcolor="#eeeefe">N</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_SACCE/496-525&nbsp;&nbsp;</td>
-<td>F</td>
-<td bgcolor="#f4eef8">S</td>
-<td bgcolor="#f2eefa">P</td>
+<!-- But can we also find (and align) distant relatives based purely on '''structural similarity''', ideally a protein-DNA complex? -->
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#f3eef9">Y</td>
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#ffeeee">I</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#eeeefe">E</td>
-<td bgcolor="#fdeeef">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+===A homologous protein/DNA complex structure===
-<td>-</td>
-<td bgcolor="#fbd8db">L</td>
-<td bgcolor="#fbd8db">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#f4eef8">T</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#c2c2ff">K</td>
-<td bgcolor="#eeeefe">N</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">D</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#ebc2d5">A</td>
-<td bgcolor="#f7adb3">L</td>
+{{task|1=
-<td bgcolor="#b3adf7">H</td>
-<td bgcolor="#ffc2c2">I</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#caade0">S</td>
-<td bgcolor="#adadff">K</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#eeeefe">D</td>
-</tr>
-<tr><td nowrap="nowrap">CD00204/1-19&nbsp;&nbsp;</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#f9eef3">A</td>
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#c5c2fb">E</td>
-<td bgcolor="#eeeefe">D</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#d8d8ff">R</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#d3c2ee">P</td>
-<td bgcolor="#f7adb3">L</td>
-<td bgcolor="#b3adf7">H</td>
-<td bgcolor="#f9c2c7">L</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#caade0">S</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#efeefd">H</td>
-</tr>
-<tr><td nowrap="nowrap">CD00204/99-118&nbsp;&nbsp;</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fdd8da">V</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#f9eef3">A</td>
-<td bgcolor="#eeeeff">R</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#c2c2ff">K</td>
-<td bgcolor="#eeeefe">D</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#d8d8ff">R</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#d3c2ee">P</td>
-<td bgcolor="#f7adb3">L</td>
-<td bgcolor="#b3adf7">H</td>
-<td bgcolor="#f9c2c7">L</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#adadff">K</td>
-<td bgcolor="#c5c2fb">N</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#efeefd">H</td>
-</tr>
-<tr><td nowrap="nowrap">1SW6/203-232&nbsp;&nbsp;</td>
-<td>L</td>
-<td bgcolor="#eeeefe">D</td>
-<td bgcolor="#fdeeef">L</td>
-<td bgcolor="#eeeeff">K</td>
-<td bgcolor="#f4eef9">W</td>
-<td bgcolor="#ffeeee">I</td>
-<td bgcolor="#ffeeee">I</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9eef3">A</td>
-<td bgcolor="#eeeefe">N</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f3d8e4">M</td>
-<td bgcolor="#fbd8db">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#dad8fd">N</td>
-<td bgcolor="#f9eef3">A</td>
-<td bgcolor="#eeeefe">Q</td>
-<td bgcolor="#c5c2fb">D</td>
-<td bgcolor="#d8c2e8">S</td>
-<td bgcolor="#eeeefe">N</td>
-<td bgcolor="#cfaddc">G</td>
-<td bgcolor="#dad8fd">D</td>
-<td bgcolor="#d9c2e7">T</td>
-<td bgcolor="#efc2d0">C</td>
-<td bgcolor="#f7adb3">L</td>
-<td bgcolor="#b0adfa">N</td>
-<td bgcolor="#ffc2c2">I</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#e4adc7">A</td>
-<td bgcolor="#adadff">R</td>
-<td bgcolor="#f9c2c7">L</td>
-<td bgcolor="#f4eef7">G</td>
-<td bgcolor="#eeeefe">N</td>
-</tr>
-<tr><td nowrap="nowrap">SecStruc/203-232&nbsp;&nbsp;</td>
-<td>t</td>
-<td bgcolor="#f5eef6">_</td>
-<td bgcolor="#efeefd">H</td>
-<td bgcolor="#efeefd">H</td>
-<td bgcolor="#efeefd">H</td>
-<td bgcolor="#efeefd">H</td>
-<td bgcolor="#efeefd">H</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#efeefd">H</td>
-<td bgcolor="#efeefd">H</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#ead8ed">_</td>
-<td bgcolor="#ead8ed">_</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#ead8ed">_</td>
-<td bgcolor="#f5eef6">_</td>
-<td bgcolor="#f5eef6">_</td>
-<td bgcolor="#dec2e3">_</td>
-<td bgcolor="#d9c2e7">t</td>
-<td bgcolor="#f5eef6">_</td>
-<td bgcolor="#d2add8">_</td>
-<td bgcolor="#ead8ed">_</td>
-<td bgcolor="#dec2e3">_</td>
-<td bgcolor="#c7c2f9">H</td>
-<td bgcolor="#b3adf7">H</td>
-<td bgcolor="#b3adf7">H</td>
-<td bgcolor="#c7c2f9">H</td>
-<td bgcolor="#b3adf7">H</td>
-<td bgcolor="#b3adf7">H</td>
-<td bgcolor="#b3adf7">H</td>
-<td bgcolor="#c7c2f9">H</td>
-<td bgcolor="#f5eef6">_</td>
-<td bgcolor="#f5eef6">_</td>
-</tr>
-</table>
-</td></tr>
-</table>
-;Aligned sequences before editing. The algorithm has placed gaps into the Swi6 helix <code>LKWIIAN</code> and the four-residue gaps before the block of well aligned sequence on the right are poorly supported.
-<table border="1"><tr><td>
-<table border="0" cellpadding="0" cellspacing="0">
-<tr><td colspan="6"></td>
-<td colspan="9">10<br>|</td><td></td>
-<td colspan="9">20<br>|</td><td></td>
-<td colspan="9">30<br>|</td><td></td>
-<td colspan="3"></td><td colspan="3">40<br>|</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_USTMA/341-368&nbsp;&nbsp;</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#dfd2f0">Y</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#fbd2d5">L</td>
-<td bgcolor="#f0d2e0">A</td>
-<td bgcolor="#d4d2fc">D</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#ffbfbf">I</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#f5d2db">F</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">E</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#c2abe8">P</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#bf99d7">T</td>
-<td bgcolor="#e5abc5">M</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">R</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#e2d2ee">S</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1B_SCHCO/470-498&nbsp;&nbsp;</td>
-<td>-</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#dfd2f0">Y</td>
-<td bgcolor="#d2d2ff">K</td>
-<td bgcolor="#e2d2ee">S</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f2bfcc">F</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#fbd2d5">L</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#afabfa">E</td>
-<td bgcolor="#d5d2fb">H</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#9d99f9">N</td>
-<td bgcolor="#ffabab">I</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">R</td>
-<td bgcolor="#fcabae">V</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#d4d2fc">N</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_ASHGO/465-494&nbsp;&nbsp;</td>
-<td>F</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#dfd2f0">Y</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#ffd2d2">I</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#e2d2ed">T</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#ffbfbf">I</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#f0d2e0">A</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#eaabbf">C</td>
-<td bgcolor="#d2d2ff">K</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#d6bfe7">S</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#c2abe8">P</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#ffabab">I</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#df99b8">M</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#d4d2fc">D</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_CLALU/550-586&nbsp;&nbsp;</td>
-<td>G</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#d4d2fc">D</td>
-<td>K</td>
-<td>K</td>
-<td>E</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>L</td>
-<td>I</td>
-<td>S</td>
-<td>K</td>
-<td bgcolor="#f2bfcc">F</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#d5d2fb">H</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#e999ad">F</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#ffabab">I</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#b899df">Y</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#f0d2df">M</td>
-<td bgcolor="#e2d2ee">S</td>
-</tr>
-<tr><td nowrap="nowrap">MBPA_COPCI/514-542&nbsp;&nbsp;</td>
-<td>-</td>
+; The PCG2 / DNA complex
-<td bgcolor="#d5d2fb">H</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#f5d2db">F</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#e2d2ee">S</td>
-<td>-</td>
+* Open Chimera and load the '''<code>4UX5</code>''' structure. Spend some time exploring it. There are two domains of identical sequence bound to one DNA molecule. The first question I would have is whether the two molecules bind to the same DNA motif - the CGCG core of the "MCB-box" - and whether the observed protein:DNA interfaces are actually with the cognate sequence, or whether one (or both) proteins are non-specific complexes. The conditions under which proteins crystallize can be harsh, and physiological function under these conditions is not guaranteed.<ref>This particular structure, however, was crystallized from a Tris buffer with 50 mM NaCl at pH 8.0 - comparatively gentle conditions.</ref> Indeed, Liu ''et al.'' (2015) report that at low concentrations a 1:1 complex is formed and the 2:1 protein:DNA complex only forms at high concentrations. Figure 3 of their paper shows that the detailed contacts between protein and DNA are in fact '''not''' identical.
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
+* Without taking this question too far, let's get a quick view of the comparison by duplicating one domain of the structure and superimposing it on the other. The authors feel that chain <code>A</code> represents the tighter, more specific mode of interaction; so we will duplicate chain <code>B</code> and superpose the copy on <code>A</code>.
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#fcbfc1">V</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#fbd2d5">L</td>
-<td bgcolor="#d4d2fc">Q</td>
+* In Chimera, open the '''Favorites''' &rarr; '''Model Panel''' and use the '''copy/combine''' button to create a copy of the <code>4UX5</code> model. Call it <code>test</code>.
-<td bgcolor="#afabfa">D</td>
+* '''Select''' chain B of the <code>test</code> model, then use '''Select''' &rarr; '''Invert (selected models)''' to apply the selection to everything in the <code>test</code> model '''except''' chain B.
-<td bgcolor="#afabfa">E</td>
+* Use '''Actions''' &rarr; '''Atoms/Bonds''' &rarr; '''delete''' to remove everything ''but'' Chain B.
-<td bgcolor="#d5d2fb">H</td>
+* Select and colour the chain red.
-<td bgcolor="#c399d4">G</td>
+* Back on the Model Panel, select both models and use the '''match...''' button to open a '''MatchMaker''' dialogue window. Choose the radio button to match two specific chains and select <code>4UX5</code> chain A as the '''Reference chain''', <code>test</code> chain B as the '''Chain to match'''. Click '''Apply'''.
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#ff9999">I</td>
-<td bgcolor="#9d99f9">N</td>
+You will see that the superimposed structures are very similar, that the main difference is in the orientation of the disordered C-terminus, but also that there is a structural difference between the two structures around Gly 84 which inserts into the minor groove of the double helix.
-<td bgcolor="#ffabab">I</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">R</td>
-<td bgcolor="#fcabae">V</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#d4d2fc">N</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_DEBHA/507-550&nbsp;&nbsp;</td>
+* Select one of the residues of that loop in chain A by &lt;control&gt;-clicking on it and use '''Actions''' &rarr; '''Set pivot''' to set the centre of rotation to that residue: this makes it easier to visualize the binding situation when you make the molecules larger.
-<td>I</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#ffd2d2">I</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#d4d2fc">N</td>
+* Select residues 81 to 87 (sequence <code>VQGGYGKY</code>) in both chains, turn their ribbon display off, and display this range as "sticks".
-<td>K</td>
+* Select '''nucleic acid''' in the '''structure''' submenu and turn ribbons and nucleotide objects off to display the DNA as sticks as well. Colour the DNA by element.
-<td>K</td>
+* Study the situation. Focus on Gly 84.A, especially the interaction of its carbonyl oxygen, which hydrogen bonds to the N2 atom of G8 in chain D. Gln 89.A hydrogen bonds to the N2 atom of G8 in chain C. Gly 84 and Gln 82 thus recognize a G:C C:G pair. In the B chain, Gly 84.B does not contact the DNA well, since it contacts residues of chain A, especially Gln 82.A. The carbonyl oxygen of Gly 84.B hydrogen bonds to Gln 89.B, and therefore Gln 89.B is not available to contact nucleotide bases. What do you think<ref>Besides the coordinate difference between the chains, if chain B were indeed representative of a DNA "scanning" conformation, perhaps one should expect that the local DNA structure that chain B binds to is structurally closer to canonical B-DNA than the DNA binding interface of chain A...</ref>? It seems to me that a crucial interaction for the cognate sequence is contributed by Guanine 8,
-<td>L</td>
+* Finally, use the Model Panel to select <code>test</code> and '''close''' it.
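The contacts discussed in the steps above can also be checked numerically from the coordinate file. The following is a minimal sketch, assuming standard fixed-column PDB records; the helper names and the 3.5 Å donor-acceptor cutoff are assumptions made for illustration (a common rule of thumb, not a value taken from the text or from Liu ''et al.'').

```python
import math

# Assumed heuristic cutoff (Angstrom) for a donor-acceptor distance;
# this is a rule of thumb, not a value from the assignment text.
HBOND_CUTOFF = 3.5


def parse_atom(line):
    """Parse one fixed-column PDB ATOM/HETATM record into a small dict."""
    return {
        "name": line[12:16].strip(),   # atom name, columns 13-16
        "chain": line[21],             # chain ID, column 22
        "resnum": int(line[22:26]),    # residue number, columns 23-26
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


def find_atom(pdb_lines, chain, resnum, name):
    """Return the first atom matching chain, residue number and atom name."""
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            atom = parse_atom(line)
            if (atom["chain"], atom["resnum"], atom["name"]) == (chain, resnum, name):
                return atom
    return None  # no such atom in the file


def atom_distance(a, b):
    """Euclidean distance between two parsed atoms, in Angstrom."""
    return math.dist(a["xyz"], b["xyz"])
```

With the lines of the 4UX5 coordinate file in hand, one could look up the carbonyl oxygen of Gly 84 in chain A with <code>find_atom(lines, "A", 84, "O")</code> and the N2 atom of residue 8 in chain D with <code>find_atom(lines, "D", 8, "N2")</code>, then compare <code>atom_distance</code> for that pair against the cutoff, and repeat the comparison for chain B.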
-<td>S</td>
-<td>L</td>
-<td>S</td>
-<td>D</td>
-<td>K</td>
-<td>K</td>
-<td>E</td>
-<td>L</td>
-<td>I</td>
-<td>A</td>
-<td>K</td>
-<td bgcolor="#f2bfcc">F</td>
-<td bgcolor="#ffbfbf">I</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#d5d2fb">H</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#ffabab">I</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#e999ad">F</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#ffabab">I</td>
-<td bgcolor="#fb999c">V</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#b899df">Y</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#fbd2d5">L</td>
-<td bgcolor="#d4d2fc">N</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1A_SCHCO/388-415&nbsp;&nbsp;</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#dfd2f0">Y</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#d2d2ff">K</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#fbd2d5">L</td>
-<td bgcolor="#f0d2e0">A</td>
-<td bgcolor="#d4d2fc">D</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fcbfc1">V</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#f5d2db">F</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#afabfa">E</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">E</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#bf99d7">T</td>
-<td bgcolor="#e5abc5">M</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">R</td>
-<td bgcolor="#eaabbf">C</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#e2d2ee">S</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_AJECA/374-403&nbsp;&nbsp;</td>
-<td>T</td>
-<td bgcolor="#fbd2d5">L</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#d5d2fb">H</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#ffd2d2">I</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#f0d2df">M</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#d6bfe7">S</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#caabe0">S</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">K</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#f4d2dc">C</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_PARBR/380-409&nbsp;&nbsp;</td>
-<td>I</td>
-<td bgcolor="#fbd2d5">L</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#d5d2fb">H</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#ffd2d2">I</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#fbd2d5">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#d6bfe7">S</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#caabe0">S</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">K</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#f4d2dc">C</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_NEOFI/363-392&nbsp;&nbsp;</td>
-<td>T</td>
-<td bgcolor="#f4d2dc">C</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#ffd2d2">I</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#fbd2d5">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#d6bfe7">S</td>
-<td bgcolor="#f4d2dc">C</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#caabe0">S</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#fcabae">V</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">R</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#f0d2e0">A</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_ASPNI/365-394&nbsp;&nbsp;</td>
-<td>T</td>
-<td bgcolor="#f5d2db">F</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#fcd2d3">V</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#fbd2d5">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#d6bfe7">S</td>
-<td bgcolor="#f4d2dc">C</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#caabe0">S</td>
-<td bgcolor="#fcd2d3">V</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#fb999c">V</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#fcabae">V</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">R</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#fcd2d3">V</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_UNCRE/377-406&nbsp;&nbsp;</td>
-<td>M</td>
-<td bgcolor="#dfd2f0">Y</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#d5d2fb">H</td>
-<td bgcolor="#d5d2fb">H</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#fcd2d3">V</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#fbd2d5">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#eabfd3">A</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#caabe0">S</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">K</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#f4d2dc">C</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_PENCH/439-468&nbsp;&nbsp;</td>
-<td>T</td>
-<td bgcolor="#f4d2dc">C</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#ffd2d2">I</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#f0d2df">M</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#d6bfe7">S</td>
-<td bgcolor="#f4d2dc">C</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#afabfa">Q</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#fb999c">V</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#fcabae">V</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">R</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#f0d2e0">A</td>
-</tr>
-<tr><td nowrap="nowrap">MBPA_TRIVE/407-436&nbsp;&nbsp;</td>
-<td>V</td>
-<td bgcolor="#f5d2db">F</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#d5d2fb">H</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#ffd2d2">I</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#fbd2d5">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#d6bfe7">S</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">K</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#f4d2dc">C</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_PHANO/400-429&nbsp;&nbsp;</td>
-<td>T</td>
-<td bgcolor="#e2d2ef">W</td>
-<td bgcolor="#ffd2d2">I</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#fcd2d3">V</td>
-<td bgcolor="#e2d2ed">T</td>
-<td bgcolor="#d2d2ff">R</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#f0d2e0">A</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#afabfa">Q</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#ff9999">I</td>
-<td bgcolor="#df99b8">M</td>
-<td bgcolor="#ffabab">I</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">R</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#f0d2e0">A</td>
-</tr>
-<tr><td nowrap="nowrap">MBPA_SCLSC/294-313&nbsp;&nbsp;</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#f0d2e0">A</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#ffabab">I</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#ff9999">I</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#ffabab">I</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">K</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#d2d2ff">K</td>
-<td bgcolor="#f0d2e0">A</td>
-</tr>
-<tr><td nowrap="nowrap">MBPA_PYRIS/363-392&nbsp;&nbsp;</td>
-<td>T</td>
-<td bgcolor="#e2d2ef">W</td>
-<td bgcolor="#ffd2d2">I</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#fcd2d3">V</td>
-<td bgcolor="#e2d2ed">T</td>
-<td bgcolor="#d2d2ff">R</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#f0d2e0">A</td>
-<td bgcolor="#f0d2e0">A</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#afabfa">Q</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#ff9999">I</td>
-<td bgcolor="#df99b8">M</td>
-<td bgcolor="#ffabab">I</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">R</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#f0d2e0">A</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_/361-390&nbsp;&nbsp;</td>
-<td>N</td>
-<td bgcolor="#d5d2fb">H</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#fbd2d5">L</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#fcd2d3">V</td>
-<td bgcolor="#fbd2d5">L</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f2bfcc">F</td>
-<td bgcolor="#ebbfd3">M</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#e2d2ed">T</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#ffabab">I</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">R</td>
-<td bgcolor="#caabe0">S</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#f0d2e0">A</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_ASPFL/328-364&nbsp;&nbsp;</td>
-<td>T</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#fcd2d3">V</td>
-<td bgcolor="#ffd2d2">I</td>
-<td bgcolor="#e2d2ed">T</td>
-<td>L</td>
-<td>G</td>
-<td>R</td>
-<td>F</td>
-<td>I</td>
-<td>S</td>
-<td>E</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#ffbfbf">I</td>
-<td bgcolor="#fcbfc1">V</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#fbd2d5">L</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#9d99f9">N</td>
-<td bgcolor="#f7abb2">L</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#9999ff">R</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#e2d2ee">S</td>
-</tr>
-<tr><td nowrap="nowrap">MBPA_MAGOR/375-404&nbsp;&nbsp;</td>
-<td>Q</td>
-<td bgcolor="#d5d2fb">H</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#f5d2db">F</td>
-<td bgcolor="#fcd2d3">V</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#f0d2e0">A</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#fb999c">V</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#f7abb2">L</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9d99f9">Q</td>
-<td bgcolor="#ababff">R</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#e2d2ee">S</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_CHAGL/361-390&nbsp;&nbsp;</td>
-<td>S</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#f0d2e0">A</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#fbd2d5">L</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#fb999c">V</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#f7abb2">L</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#df99b8">M</td>
-<td bgcolor="#ababff">R</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#f0d2e0">A</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_PODAN/372-401&nbsp;&nbsp;</td>
-<td>V</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#fcd2d3">V</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#f0d2e0">A</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#f0d2e0">A</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#afabfa">E</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#f7abb2">L</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">R</td>
-<td bgcolor="#fcabae">V</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#f0d2e0">A</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_LACTH/458-487&nbsp;&nbsp;</td>
-<td>F</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#dfd2f0">Y</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#ffd2d2">I</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#d4d2fc">N</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#ffbfbf">I</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#f0d2e0">A</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#afabfa">Q</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#fb999c">V</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#f7abb2">L</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9d99f9">Q</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#d4d2fc">D</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_FILNE/433-460&nbsp;&nbsp;</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#dfd2f0">Y</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#fbd2d5">L</td>
-<td bgcolor="#f0d2e0">A</td>
-<td bgcolor="#d4d2fc">D</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fcbfc1">V</td>
-<td bgcolor="#ffbfbf">I</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#f5d2db">F</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#afabfa">E</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">E</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#bf99d7">T</td>
-<td bgcolor="#ffabab">I</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">R</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#e2d2ee">S</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_KLULA/477-506&nbsp;&nbsp;</td>
-<td>F</td>
-<td bgcolor="#e2d2ed">T</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#dfd2f0">Y</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#ffd2d2">I</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#fcd2d3">V</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#ffbfbf">I</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#caabe0">S</td>
-<td bgcolor="#c2abe8">P</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#c5abe5">Y</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#bf99d7">T</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#d2d2ff">K</td>
-<td bgcolor="#d4d2fc">D</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_SCHST/468-501&nbsp;&nbsp;</td>
-<td>A</td>
-<td bgcolor="#d2d2ff">K</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#d2d2ff">K</td>
-<td bgcolor="#d2d2ff">K</td>
-<td bgcolor="#d4d2fc">D</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>L</td>
-<td>I</td>
-<td>A</td>
-<td>K</td>
-<td bgcolor="#f2bfcc">F</td>
-<td bgcolor="#ffbfbf">I</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#d5d2fb">H</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#caabe0">S</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#e999ad">F</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#ffabab">I</td>
-<td bgcolor="#e699b1">C</td>
-<td bgcolor="#be99d9">S</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#fbd2d5">L</td>
-<td bgcolor="#d4d2fc">N</td>
-</tr>
-<tr><td nowrap="nowrap">MBP1_SACCE/496-525&nbsp;&nbsp;</td>
-<td>F</td>
-<td bgcolor="#e2d2ee">S</td>
-<td bgcolor="#ded2f2">P</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#dfd2f0">Y</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#ffd2d2">I</td>
-<td bgcolor="#d4d2fc">E</td>
-<td bgcolor="#fbd2d5">L</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#e2d2ed">T</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#ababff">K</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#e3abc6">A</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#ffabab">I</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#be99d9">S</td>
-<td bgcolor="#9999ff">K</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#d4d2fc">D</td>
-</tr>
-<tr><td nowrap="nowrap">CD00204/1-19&nbsp;&nbsp;</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#f0d2e0">A</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#afabfa">E</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#bfbfff">R</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#c2abe8">P</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#f7abb2">L</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#be99d9">S</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#d5d2fb">H</td>
-</tr>
-<tr><td nowrap="nowrap">CD00204/99-118&nbsp;&nbsp;</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#fcbfc1">V</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#f0d2e0">A</td>
-<td bgcolor="#d2d2ff">R</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#ababff">K</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#bfbfff">R</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#c2abe8">P</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#f7abb2">L</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">K</td>
-<td bgcolor="#afabfa">N</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#d5d2fb">H</td>
-</tr>
-<tr><td nowrap="nowrap">1SW6/203-232&nbsp;&nbsp;</td>
-<td>L</td>
-<td bgcolor="#d4d2fc">D</td>
-<td bgcolor="#fbd2d5">L</td>
-<td bgcolor="#d2d2ff">K</td>
-<td bgcolor="#e2d2ef">W</td>
-<td bgcolor="#ffd2d2">I</td>
-<td bgcolor="#ffd2d2">I</td>
-<td bgcolor="#f0d2e0">A</td>
-<td bgcolor="#d4d2fc">N</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#ebbfd3">M</td>
-<td bgcolor="#f9bfc4">L</td>
-<td bgcolor="#c2bffc">N</td>
-<td bgcolor="#f0d2e0">A</td>
-<td bgcolor="#d4d2fc">Q</td>
-<td bgcolor="#afabfa">D</td>
-<td bgcolor="#caabe0">S</td>
-<td bgcolor="#d4d2fc">N</td>
-<td bgcolor="#c399d4">G</td>
-<td bgcolor="#c2bffc">D</td>
-<td bgcolor="#cbabdf">T</td>
-<td bgcolor="#eaabbf">C</td>
-<td bgcolor="#f699a1">L</td>
-<td bgcolor="#9d99f9">N</td>
-<td bgcolor="#ffabab">I</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#dd99b9">A</td>
-<td bgcolor="#9999ff">R</td>
-<td bgcolor="#f7abb2">L</td>
-<td bgcolor="#e4d2ec">G</td>
-<td bgcolor="#d4d2fc">N</td>
-</tr>
-<tr><td nowrap="nowrap">SecStruc/203-232&nbsp;&nbsp;</td>
-<td>t</td>
-<td bgcolor="#e6d2e9">_</td>
-<td bgcolor="#d5d2fb">H</td>
-<td bgcolor="#d5d2fb">H</td>
-<td bgcolor="#d5d2fb">H</td>
-<td bgcolor="#d5d2fb">H</td>
-<td bgcolor="#d5d2fb">H</td>
-<td bgcolor="#d5d2fb">H</td>
-<td bgcolor="#d5d2fb">H</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td bgcolor="#dcbfe1">_</td>
-<td bgcolor="#dcbfe1">_</td>
-<td bgcolor="#dcbfe1">_</td>
-<td bgcolor="#e6d2e9">_</td>
-<td bgcolor="#e6d2e9">_</td>
-<td bgcolor="#d2abd8">_</td>
-<td bgcolor="#cbabdf">t</td>
-<td bgcolor="#e6d2e9">_</td>
-<td bgcolor="#c799cf">_</td>
-<td bgcolor="#dcbfe1">_</td>
-<td bgcolor="#d2abd8">_</td>
-<td bgcolor="#b2abf7">H</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#b2abf7">H</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#a199f6">H</td>
-<td bgcolor="#b2abf7">H</td>
-<td bgcolor="#e6d2e9">_</td>
-<td bgcolor="#e6d2e9">_</td>
-</tr>
-</table>
-</td></tr>
-</table>
-;Aligned sequence after editing. A significant cleanup of the frayed region is possible. Now there is only one insertion event, and it is placed into the loop that connects two helices of the 1SW6 structure.
-===Final analysis===
-{{task|1=
-* Compare the distribution of indels in the ankyrin repeat regions of your alignments.
-**'''Review''' whether the indels in this region are concentrated in segments that connect the helices, or if they are more or less evenly distributed along the entire region of similarity.
-**Think about whether the assertion that ''indels should not be placed in elements of secondary structure'' has merit in your alignment.
-**Recognize that an indel in an element of secondary structure could be interpreted in a number of different ways:
-*** The alignment is correct, the annotation is correct too: the indel is tolerated in that particular case, for example by extending the length of an &alpha;-helix or &beta;-strand;
-*** The alignment algorithm has made an error, the structural annotation is correct: the indel should be moved a few residues;
-*** The alignment is correct, the structural annotation is wrong, this is not a secondary structure element after all;
-*** Both the algorithm and the annotation are probably wrong, but we have no data to improve the situation.
-(<small>NB: remember that the structural annotations have been made for the yeast protein and might have turned out differently for the other proteins...</small>)
-You should be able to analyse discrepancies between annotation and expectation in a structured and systematic way. In particular if you notice indels that have been placed into an '''annotated''' region of secondary structure, you should be able to comment on whether the location of the indel has strong support from aligned sequence motifs, or whether the indel could possibly be moved into a different location without much loss in alignment quality.
-*Considering the whole alignment and your experience with editing, you should be able to state whether the position of indels relative to structural features of the ankyrin domains in your organism's Mbp1 protein is reliable. That would be the result of this task, in which you combine multiple sequence and structural information.
-*You can also critically evaluate database information that you have encountered:
-# Navigate to the [http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi?INPUT_TYPE=precalc&SEQUENCE=6320147 '''CDD annotation'''] for yeast Mbp1.
-# You can check the precise alignment boundaries of the ankyrin domains by clicking on the (+) icon to the left of the matching domain definition.
-# Confirm that CDD extends the ankyrin domain annotation beyond the 1SW6 domain boundaries. Given your assessment of conservation in the region beyond the structural annotation:  do you think that extending the annotation is reasonable also in YFO's protein? Is there evidence for this in the alignment of the CD00204 consensus with well aligned blocks of sequence beyond the positions that match Swi6?
 }}
-==R code: load alignment and compute information scores==
-<!-- Add sequence weighting and sampling bias correction ? -->
-As discussed in the lecture, Shannon information is calculated as the difference between expected and observed entropy, where entropy is the negative sum over probabilities times the log of those probabilities:
+&nbsp;
+===Superimposing your model===
+Both your homology model and the template structure provide valuable information:
+* The template structure shows how conserved the structure is at the protein/DNA interface. You have seen how subtle differences can give rise to either a sequence-specific complex or a non-specific binding mode. For Mbp1 we know that the APSES domain binds to the same cognate DNA sequence as PCG2. Since your model structure is heavily biased towards the template, evaluating the template in the context of a real protein/DNA complex allows you to judge which binding residues appear to be conserved and possibly modelled in an orientation that is productive for binding.
-Here we compute Shannon information scores for aligned positions of the APSES domain, and plot the values in '''R'''. You can try this with any part of your alignment, but I have used only the aligned residues for the APSES domain for my example. This is a good choice for a first try, since there are (almost) no gaps.
+* The model structure maps sequence variation into that context: are the crucial residues for sequence-specific binding conserved?
 {{task|1=
-# Export only the sequences of the aligned APSES domains to a file on your computer, in FASTA format. You could call this: <code>Mbp1_All_APSES.fa</code>.
-# Explore the R-code below. Be sure that you understand it correctly. Note that there is no sampling bias correction, so positions with large numbers of gaps will receive artificially high scores.
+* Start by loading your model and the 1BM8 structure into your Chimera session. Select all, turn all ribbons off, and set all atoms to stick representation. Then select H atoms by element and '''hide''' them.
-<source lang="rsplus">
+* We need to visualize and evaluate differences in binding between different proteins. For me it works well to colour everything by element and then give the carbon atoms of each molecule a distinct, identifying colour. This is best achieved through the Chimera command line, which you can turn on with the little "computer" icon on the left-hand side of the graphics window. Have a look at the [https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/framecommand.html Chimera User's Guide], and choose '''select''' to learn how Chimera's selection syntax works.
+* Open the Model Panel to check which protein has which Chimera-internal model number. Then you can use the following selection syntax. Instead of the model numbers, I will type <code>&lt;YFO&gt;</code>, <code>&lt;4ux5&gt;</code>, and <code>&lt;1BM8&gt;</code> - you will certainly know by now that these are placeholder labels and you need to replace them with the numbers <code>0</code>, <code>1</code>, and <code>2</code> instead.
-# CalculateInformation.R
+:* To colour the DNA carbon atoms white, type:<br />
-# Calculate Shannon information for positions in a multiple sequence alignment.
+::<code>color white #&lt;4ux5&gt;:.C,.D & C</code>
-# Requires: an MSA in multi FASTA format
-# It is good practice to set variables you might want to change
-# in a header block so you don't need to hunt all over the code
-# for strings you need to update.
-#
-setwd("/your/R/working/directory")
-mfa      <- "Mbp1_All_APSES.fa"
-# ================================================
-#    Read sequence alignment fasta file
-# ================================================
-# read MFA datafile using seqinr function read.alignment()
-library(seqinr)
-tmp  <- read.alignment(mfa, format="fasta")
-MSA  <- as.matrix(tmp)  # convert the list into a characterwise matrix
-                        # with appropriate row and column names using
-                        # the seqinr function as.matrix.alignment()
-                        # You could have a look under the hood of this
-                        # function to understand better how to convert a
-                        # list into something else ... simply type
-                        # "as.matrix.alignment" - without the parentheses
-                        # to retrieve the function source code (as for any
-                        # function btw).
-### Explore contents of and access to the matrix of sequences
+:* To colour the 4ux5 A chain carbon atoms grey, type:<br />
-MSA
+::<code>color #878795 #&lt;4ux5&gt;:.A & C</code>  <small>Note: the color value after the first hash is an RGB triplet in hexadecimal notation - exactly like in '''R'''.</small>
-MSA[1,]
-MSA[,1]
-MSA["MBP1_SACCE/1-75",1:10]
-length(MSA[,1])
+:* To undisplay the 4ux5 B chain, type:<br />
+::<code>~display #&lt;4ux5&gt;:.B</code> <small>Note: this is the tilde character, not a hyphen or minus sign.</small>
-# ================================================
+:* To colour the YFO model carbon atoms a pale reddish color, type:<br />
-#    define function to calculate entropy
+::<code>color #b06268 #&lt;YFO&gt; & C</code>
-# ================================================
-entropy <- function(v) { # calculate shannon entropy for the aa vector v
+:* To colour the 1BM8 structure carbon atoms a pale greenish color, type:<br />
-	                     # Note: we are not correcting for small sample sizes
+::<code>color #92b098 #&lt;1BM8&gt; & C</code>
-	                     # here. Thus if there are a large number of gaps in
-	                     # the alignment, this will look like small entropy
-	                     # since only a few amino acids are present. In the
-	                     # extreme case: if a position is only present in
-	                     # one sequence, that one amino acid will be treated
-	                     # as 100% conserved - zero entropy. Sampling error
-	                     # corrections are discussed eg. in Schneider et al.
-	                     # (1986) JMB 188:414
-	l <- length(v)
-	a <- rep(0, 21)      # initialize a vector with 21 elements (20 aa plus gap)
-	                     # the set the name of each row to the one letter
-	                     # code. Through this, we can access a row by its
-	                     # one letter code.
-	names(a)  <- unlist(strsplit("acdefghiklmnpqrstvwy-", ""))
-	for (i in 1:l) {       # for the whole vector of amino acids
-		c <- v[i]          # retrieve the character
-		a[c] <- a[c] + 1   # increment its count by one
-	} # note: we could also have used the table() function for this
-	tot <- sum(a) - a["-"] # calculate number of observed amino acids
-	                       # i.e. subtract gaps
-	a <- a/tot             # frequency is observations of one amino acid
-	                       # divided by all observations. We assume that
-	                       # frequency equals probability.
-	a["-"] <- 0
-	for (i in 1:length(a)) {
-		if (a[i] != 0) { # if a[i] is not zero, otherwise leave as is.
-			             # By definition, 0*log(0) = 0  but R calculates
-			             # this in parts and returns NaN for log(0).
-			a[i] <- a[i] * (log(a[i])/log(2)) # replace a[i] with
-			                                  # p(i) log_2(p(i))
-		}
-	}
-	return(-sum(a)) # return Shannon entropy
-}
-# ================================================
+* Ready? Let's superimpose the chains.
-#    calculate entropy for reference distribution
+** Select all models in the Model Panel and click on '''match'''.
-#    (from UniProt, c.f. Assignment 2)
+** Set 4ux5 Chain A as the Reference chain.
-# ================================================
+** Select YFO as a '''Chain to match''', select the button for specific reference and specific match, and click '''Apply'''.
+** Repeat this with 1BM8 as the match chain.
-refData <- c(
+* Easy. Now enlarge the binding site. Remember that 4ux5 and 1bm8 are independently determined crystal structures, whereas YFO was modelled on 1bm8 and is expected to be '''very''' similar to it. To give you some guidance on what to focus on, select the CA atom of 4ux5 residue 84 and display it as '''Ball & Stick'''. You can also repeat the '''Action''' "Set Pivot" in case the pivot has shifted.
-    "A"=8.26,
-    "Q"=3.93,
-    "L"=9.66,
-    "S"=6.56,
-    "R"=5.53,
-    "E"=6.75,
-    "K"=5.84,
-    "T"=5.34,
-    "N"=4.06,
-    "G"=7.08,
-    "M"=2.42,
-    "W"=1.08,
-    "D"=5.45,
-    "H"=2.27,
-    "F"=3.86,
-    "Y"=2.92,
-    "C"=1.37,
-    "I"=5.96,
-    "P"=4.70,
-    "V"=6.87
-    )
-### Calculate the entropy of this distribution
+* Study the scene. This is where stereo vision will help '''a lot'''.
-H.ref <- 0
+* What do you think? Is this what you expected? Can you explain what you see? Was the modelling process successful?
-for (i in 1:length(refData)) {
-	p <- refData[i]/sum(refData) # convert % to probabilities
-    H.ref <- H.ref - (p * (log(p)/log(2)))
-}
-# ================================================
+<!-- I see that the model is very good regarding the global fold, but completely different in the binding loop. This is not expected. -->
-#    calculate information for each position of
-#    multiple sequence alignment
-# ================================================
-lAli <- dim(MSA)[2] # length of row in matrix is second element of dim(<matrix>).
+* Now turn the display of 4ux5 chain B back on and turn chain A off instead. Then superimpose the 1BM8 template and your model on Chain B.
-I <- rep(0, lAli)   # initialize result vector
-for (i in 1:lAli) {
-	I[i] = H.ref - entropy(MSA[,i])  # I = H_ref - H_obs
-}
-### evaluate I
-I
-quantile(I)
-hist(I)
-plot(I)
-# you can see that we have quite a large number of columns with the same,
+* Again, focus on the binding region. What do you think of that? What would you have expected? Do you see a difference? What does this all mean?
-# high value ... what are these?
-which(I > 4)
-MSA[,which(I > 4)]
-# And what is in the columns with low values?
+}}
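The colouring commands in the task above differ only in the model numbers you substitute for the placeholders. As a minimal sketch, the substitution can be scripted so that the commands are printed ready to paste into the Chimera command line. The function name <code>chimera_colour_commands</code> and the example assignment of the numbers 0, 1 and 2 to the three models are assumptions made for illustration; check the Model Panel for the numbers in your own session.

```python
def chimera_colour_commands(yfo, ux5, bm8):
    """Return the task's Chimera commands with model numbers filled in.

    yfo, ux5 and bm8 are the Chimera-internal model numbers of the YFO
    model, the 4ux5 complex and the 1BM8 structure (hypothetical
    parameter names, not part of Chimera itself).
    """
    return [
        f"color white #{ux5}:.C,.D & C",   # DNA chains C and D: white carbons
        f"color #878795 #{ux5}:.A & C",    # 4ux5 chain A: grey carbons
        f"~display #{ux5}:.B",             # undisplay 4ux5 chain B
        f"color #b06268 #{yfo} & C",       # YFO model: pale reddish carbons
        f"color #92b098 #{bm8} & C",       # 1BM8: pale greenish carbons
    ]


if __name__ == "__main__":
    # Assumed example numbering; replace with the values from your Model Panel.
    for command in chimera_colour_commands(yfo=0, ux5=1, bm8=2):
        print(command)
```

The command strings are copied from the task; only the placeholders are replaced. Typing the commands by hand works just as well, and the script is merely a convenience if you need to repeat the colouring after reloading the structures.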
-MSA[,which(I < 1.5)]
-# ===================================================
-#    plot the information
-#    (c.f. Assignment 5, see there for explanations)
-# ===================================================
-IP <- (I-min(I))/(max(I) - min(I) + 0.0001)
-nCol <- 15
-IP <- floor(IP * nCol) + 1
-spect <- colorRampPalette(c("#DD0033", "#00BB66", "#3300DD"), bias=0.6)(nCol)
-# Optionally, show the highest information scores in grey: to do so,
-# change the highest level of the spectrum to grey (uncomment the next line).
-#spect[nCol] <- "#CCCCCC"
-Icol <- vector()
-for (i in 1:length(I)) {
-	Icol[i] <- spect[ IP[i] ]
-}
-plot(1,1, xlim=c(0, lAli), ylim=c(-0.5, 5) ,
-     type="n", bty="n", xlab="position in alignment", ylab="Information (bits)")
-# plot as rectangles: height is information and color is coded to information
-for (i in 1:lAli) {
-   rect(i, 0, i+1, I[i], border=NA, col=Icol[i])
-}
-# As you can see, some of the columns reach very high values, but they are not
+N.b. I haven't seen this before and I am completely intrigued by the results. In fact, I think I understand the protein much, much better now through this exercise. I'm very pleased with how this turned out.
-# contiguous in sequence. Are they contiguous in structure? We will find out in
-# a later assignment, when we map computed values to structure.
-</source>
-}}
-[[Image:InformationPlot.jpg|frame|none|Plot of information vs. sequence position produced by the '''R''' script above, for an alignment of Mbp1 ortholog APSES domains.]]
+&nbsp;
+== Links and resources ==
+:* [http://www.wwpdb.org/documentation/format23/v2.3.html '''PDB file format'''] (see the Coordinate Section if you are unsure about chain identifiers)
+:* [http://en.wikipedia.org/wiki/Structural_alignment Wikipedia on '''Structural Superposition'''] <small>(although the article is called "Structural Alignment")</small>
-&nbsp;
-== Links and resources ==
-{{#pmid: 22407712}}
+<!-- ;Reference sequences
+:* [[Reference Mbp1 orthologues (all fungi)|'''Mbp1 ortholog sequences (all fungi)''']]
+-->
 <!-- {{#pmid: 19957275}} -->
@@ Line 3,831: / Line 728: @@
 {{#lst:BIO_Assignment_Week_1|assignment_footer}}
+<table style="width:100%;"><tr>
+<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_7|&lt;&nbsp;Assignment&nbsp;7]]</td>
+<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[BIO_Assignment_Week_9|Assignment&nbsp;9&nbsp;&gt;]]</td>
+</tr></table>
 &nbsp;
 [[Category:Bioinformatics]]
 </div>