BIO Assignment 5 2011

Latest revision as of 17:51, 1 December 2014

Note! This assignment is currently active. All significant changes will be announced on the mailing list.

Where is the hidden beauty in structure, and where, the "ultimate truth"? In the previous assignments we have studied sequence conservation in APSES family domains and we have discovered homologues in all fungal species. This is an ancient protein family that had already duplicated to several paralogues at the time the cenancestor of all fungi lived, more than 600,000,000 years ago, in the Vendian period of the Proterozoic era of Precambrian times.

In order to understand how specific residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to study an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. Explanations of a protein's observed properties and functions can't rely on the general fact that it binds DNA; we need to consider details in terms of specific residues and their spatial arrangement. In particular, it would be interesting to correlate the conservation patterns of key residues with their potential to make specific DNA binding interactions. Unfortunately, no structure of an APSES domain in complex with bound DNA has been solved up to now, and the experimental evidence we have considered in Assignment 2 (Taylor et al., 2000) is not sufficient to unambiguously define the details of how a DNA double helix might be bound. Moreover, at least two distinct modes of DNA binding are known for proteins of the winged-helix superfamily, of which the APSES domain is a member.

In this assignment you will (1) construct a molecular model of the APSES domain from the Mbp1 orthologue in your assigned species, (2) identify similar structures of distantly related domains for which protein-DNA complexes are known, (3) assemble a hypothetical complex structure and (4) discuss whether the available evidence allows you to distinguish between different modes of ligand binding.

For what follows, please remember this terminology:

Target: The protein that you are planning to model.
Template: The protein whose structure you are using as a guide to build the model.
Model: The structure that results from the modeling process. It has the Target sequence and is similar to the Template structure.

A brief overview article on the construction and use of homology models is linked to the resource section at the bottom of this page. That section also contains links to other sites and resources you might find useful or interesting.

Preparation, submission and due date

Read carefully: be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we see too many assignments which, arduously effected, nevertheless intimate nescience of elementary tenets of molecular biology. If the sentence above did not trigger an urge to open a dictionary, you are trying to guess, rather than confirm, possibly important information.

Review the guidelines for preparation and submission of BCH441 assignments.

The due date for the assignment is Monday, December 5, at 12:00 noon.

Your documentation for the procedures you follow in this assignment will be worth 2 marks - 1 mark for generating the model and 1 mark for the selection/superposition and visualization of protein/DNA complexes.

(1) Preparation

(1.1) Template choice and template sequence

The SWISS-MODEL server provides several different options for constructing homology models. The easiest is probably the Automated Mode, which requires only a target sequence as input: in this mode the program automatically chooses suitable templates and creates an input alignment. I disagree, however, that this is the best way to use such a service: template choice and alignment may both be significantly influenced by biochemical reasoning, and an automated algorithm cannot make the necessary decisions. Should you use a structure of lower resolution that nevertheless has a ligand bound? Should you move an indel from an active site to a loop region even though the sequence similarity score might be lower? Questions like these may yield answers that run counter to the best choices an automated algorithm could make. Therefore we will use the Alignment Mode of SWISS-MODEL in this assignment, choose our own template and upload our own alignment.

Template choice is the first step. Often more than one related structure can be found in the PDB. We have touched on principles of selecting template structures in the lecture and I have posted a short summary of template choice principles on this Wiki. One can search the PDB itself through its Advanced Search interface: for example, one can search for sequence similarity with a BLAST search, or search for structural similarity by accessing structures according to their CATH or SCOP classification. Alternatively, one can always use the BLAST interface at the NCBI, since the sequences contained in PDB files are accessible as a database subsection on the BLAST menu.

In assignment 2 you have already searched for structures of APSES domains in the PDB. If you need to repeat this:

Use the NCBI BLAST interface to identify all PDB files that are clearly homologous to your target APSES domain.
In Assignment 2, you have defined the extent of the APSES domain in yeast Mbp1. In Assignment 3, you have aligned reference APSES domains with those you found in your species. In Assignment 4 you have confirmed by phylogenetic analysis and Reciprocal Best Match which of these APSES domain sequences is the most closely related orthologue to yeast Mbp1. This sequence is the best candidate for having a conserved function similar to yeast Mbp1. Therefore, this sequence is the target for the homology modeling procedure.
Defining a template means finding a PDB coordinate set that has sufficient sequence similarity to your target that you can build a model based on that template. To find suitable PDB structures, use your target sequence as input for a BLAST search, and select Protein Data Bank proteins (pdb) as the Database you search in. Hits that are homologues are all suitable templates in principle, but some are more suitable than others. Consider how the coordinate sets differ and which features would make each more or less suitable for creating a homology model; comment briefly on

sequence similarity to your target
size of expected model (= length of alignment)
presence or absence of ligands
experimental method and quality of the data set

Then choose the template you consider the most suitable and note why you have decided to use this template.

It is not at all straightforward how to number sequences in such a project. The "natural" numbering starts with the start-codon of the full length protein and goes sequentially from there. However, this does not map exactly to other numbering schemes we have encountered. As you know, the first residue of the APSES domain (as defined by CDD) is not Residue 1 of the Mbp1 protein. The first residue of the 1MB1 FASTA file (one of the related PDB structures) is the first residue of Mbp1 protein, but the last five residues are an artificial His tag. Is H125 of 1MB1 thus equivalent to R125 in MBP1_SACCE? The N-terminus of the Mbp1 crystal structure is disordered. The first residue in the structure is ASN 3, therefore N is the first residue in a FASTA sequence derived from the coordinate section of the PDB file (the ATOM records), whereas the SEQRES records start with MET ... and so on. You need to remember: a sequence number is not absolute, but assigned in a particular context.

The homology model will be based on an alignment of target and template. Thus we have to define the target sequence. As discussed in class, PDB files have an explicit and an implied sequence and these do not necessarily have to be the same. To compare the implied and the explicit sequence for the template, you need to extract sequence information from coordinates. One way to do this is via the Web interface for WhatIf, a crystallography and molecular modeling package that offers many useful tools for coordinate manipulation tasks.

Navigate to the Administration sub-menu of the WhatIf Web server. Follow the link to Make sequence file from PDB file. Enter the PDB-ID of your template into the form field and Send the request to the server. The server accesses the PDB file and extracts sequence information directly from the ATOM records of the file. The results will be returned in PIR format. Copy the results, edit them to FASTA format and save them in a text-only file. Make sure you create a valid FASTA formatted file! Use this implied sequence to check if and how it differs from the sequence ...

... listed in the SEQRES records of the coordinate file;
... given in the FASTA sequence for the template, which is provided by the PDB;
... stored in the protein database of the NCBI.

and record your results.
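The comparison of the explicit and the implied sequence can also be scripted. The following is a minimal sketch, not a replacement for the WhatIf server: it assumes a standard fixed-column PDB file, handles only the 20 standard amino acids (anything else becomes "X"), and the function names are made up for this illustration.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_sequence(pdb_text, chain):
    """Explicit sequence: residues listed in the SEQRES records of one chain."""
    residues = []
    for line in pdb_text.splitlines():
        # Column 12 (index 11) holds the chain ID, residue names start at column 20.
        if line.startswith("SEQRES") and line[11:12] == chain:
            residues.extend(line[19:].split())
    return "".join(THREE_TO_ONE.get(name, "X") for name in residues)


def atom_sequence(pdb_text, chain):
    """Implied sequence: one letter per residue that has ATOM coordinates."""
    letters = []
    last_key = None
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[21:22] == chain:
            key = line[22:27]  # residue number plus insertion code
            if key != last_key:
                letters.append(THREE_TO_ONE.get(line[17:20], "X"))
                last_key = key
    return "".join(letters)
```

If the two strings differ, the difference usually shows disordered termini or loops (present in SEQRES, absent from the coordinates), which is exactly the kind of discrepancy you are asked to record.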

Establish how the sequence numbers in the coordinate section of your template(*) correspond to your target sequence numbering.

(*) These residue numbers are important, since they are referenced e.g. by VMD when you visualize the structure. The easiest way to list them is via the Sequence Viewer extension of VMD.

Don't do this for every residue individually but define ranges. Look at the correspondence of the first and last residue of target and template sequence and take indels into account. Establishing sequence correspondence precisely is crucially important! For example, when a publication refers to a residue by its sequence number, you have to be able to relate that number to the residue numbers of the model as well as your target sequence.
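To illustrate what "defining ranges" means, here is a small sketch that derives corresponding residue-number ranges from two rows of an alignment. It assumes that residue numbers increase by one from a known starting number in each sequence (no numbering gaps or insertion codes in the coordinate file); the function names are hypothetical.

```python
def map_residue_numbers(aligned_a, aligned_b, start_a, start_b):
    """Map residue numbers of sequence A to those of sequence B.

    aligned_a, aligned_b: two rows of one alignment (same length, '-' = gap).
    start_a, start_b: residue number of the first residue in each row.
    Returns a dict {number_in_a: number_in_b} for columns without gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    mapping = {}
    num_a, num_b = start_a, start_b
    for char_a, char_b in zip(aligned_a, aligned_b):
        if char_a != "-" and char_b != "-":
            mapping[num_a] = num_b
        if char_a != "-":
            num_a += 1
        if char_b != "-":
            num_b += 1
    return mapping


def as_ranges(mapping):
    """Collapse a mapping into (a_first, a_last, b_first, b_last) ranges."""
    ranges = []
    for a, b in sorted(mapping.items()):
        if ranges and a == ranges[-1][1] + 1 and b == ranges[-1][3] + 1:
            # Both numbers continue the previous range: extend it.
            ranges[-1][1] = a
            ranges[-1][3] = b
        else:
            # An indel (or the very first residue) starts a new range.
            ranges.append([a, a, b, b])
    return [tuple(r) for r in ranges]
```

Each indel in the alignment starts a new range, so an alignment without indels yields exactly one range and a constant offset between the two numbering schemes.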

(1.2) The input alignment

The sequence alignment between target and template is the single most important factor that determines the quality of your model. No comparative modeling process will repair an incorrect alignment; it is useful to consider a homology model as a three-dimensional map of a sequence alignment rather than a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SWISS-MODEL server will align sequences and select template structures for you, it would be unwise to use these only because they are convenient. You should take advantage of the much more sophisticated alignment methods available. Analysis of wrong models can't be expected to produce right results.

The best possible alignment is usually constructed from a multiple sequence alignment that includes at least the target and template sequence and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. Your alignment should have been carefully reviewed by you and wherever required, manually adjusted to move insertions or deletions between target and template out of the secondary structure elements of the template structure.
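As a rough aid for this manual review, the sketch below flags alignment columns in which an indel between target and template falls inside a secondary structure element of the template. The element ranges are something you would have to supply yourself (for example from the HELIX and SHEET records of the template, or from VMD); the function name and the input layout are assumptions made for this illustration.

```python
def indels_in_elements(aligned_target, aligned_template, template_start, elements):
    """Return alignment columns (0-based) where an indel lies inside an element.

    elements: list of (first, last) template residue-number ranges, one per
    helix or strand. Template residues are assumed to be numbered
    consecutively, starting at template_start.
    """
    hits = []
    number = template_start - 1  # number of the last template residue seen
    for column, (t, p) in enumerate(zip(aligned_target, aligned_template)):
        if p != "-":
            number += 1
        if t == "-" and p != "-":
            # Deletion in the target: template residue `number` has no partner.
            inside = any(first <= number <= last for first, last in elements)
        elif p == "-" and t != "-":
            # Insertion in the target between template residues number, number + 1.
            inside = any(first <= number and number + 1 <= last
                         for first, last in elements)
        else:
            continue
        if inside:
            hits.append(column)
    return hits
```

An empty result does not prove that the alignment is right; it only tells you that no indel needs to be moved out of the elements you listed.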

In most of the Mbp1 orthologues, we do not observe indels in the APSES domain regions (and for the ones in which we do see indels, we might suspect that these are actually gene-model errors). Evolutionary pressure on the APSES domains has selected against indels in the more than 600 million years these sequences have evolved independently in their respective species. To obtain an alignment between the template sequence and the target sequence from your species, proceed as follows.

In Jalview...

Load your Jalview project with aligned APSES domain sequences or recreate it from the Mbp1 orthologue in your species and the APSES domains from the Reference APSES domain page that I prepared for Assignment 4. Include the sequence of your template protein and re-align.
Delete all sequences you no longer need, i.e. keep only the APSES domains of the target (from your species) and the template (from the PDB) and choose Edit → Remove empty columns. This is your input alignment.
Choose File→Output to textbox→FASTA to obtain the aligned sequences. They should both have exactly the same length, i.e. N- or C- termini have to be padded by hyphens if the original sequences had different lengths. Save the sequences in a text-file.

Using a different MSA program

Copy the FASTA formatted sequences of the Mbp1 proteins in the reference species from the Reference APSES domain page.
Access e.g. the MSA tools page at the EBI.
Paste the Mbp1 sequence set, your target sequence and the template sequence into the input form.
Run the alignment and save the output.

By hand

APSES domains are strongly conserved and have few if any indels. You could also simply align by hand.

Copy the CLUSTAL formatted reference alignment of the Mbp1 proteins in the reference species from the Reference APSES domain page.
Open a new file in a text editor.
Paste the Mbp1 sequence set, your target sequence and the template sequence into the file.
Align by hand, replace all spaces with hyphens and save the output.

Whatever method you use: the result should be a multiple sequence alignment in multi-FASTA format, that was constructed from a number of supporting sequences and that contains your aligned target and template sequence. This is your input alignment for the homology modeling server.
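Before uploading, it can be worth checking the file mechanically. The following sketch reads a multi-FASTA alignment and reports two common problems mentioned above: rows of unequal length and characters other than letters and hyphens (such as spaces left over from aligning by hand). The function names are invented for this example, and the check says nothing about whether the alignment is biologically sensible.

```python
import string

ALLOWED = set(string.ascii_uppercase + "-")


def read_fasta(text):
    """Parse multi-FASTA text into {name: sequence}; name = first header word."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def alignment_problems(records):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) > 1:
        problems.append("rows differ in length: %s" % sorted(lengths))
    for name, seq in records.items():
        bad = sorted(set(seq.upper()) - ALLOWED)
        if bad:
            problems.append("%s contains unexpected characters: %r" % (name, bad))
    return problems
```

For example, `alignment_problems(read_fasta(open("alignment.fa").read()))` should return an empty list for a usable input file; the file name here is only a placeholder.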

(2) Homology model

(2.1) SwissModel

Access the Swissmodel server at http://swissmodel.expasy.org . Navigate to the Alignment Mode page.

Paste your alignment for target and template into the form field. Refer to the Fallback Data file if you are not sure about the format. Make sure to select the correct option (FASTA) for the alignment input format on the form.

Click submit alignment and on the returned page define your target and template sequence. For the template sequence define the PDB ID of the coordinate file it came from. Enter the correct Chain-ID (usually "A", note: upper-case).

If you run into problems, compare your input to the fallback data. It has worked for me, it will work for you. In particular we have seen problems that arise from "special" characters in the FASTA header like the pipe "|" character that the NCBI uses to separate IDs - keep the header short and remove all non-alphanumeric characters to be safe.
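Cleaning the headers by hand is quick for two sequences, but the rule is simple enough to script. This is a sketch of the advice above (short header, alphanumeric characters only); the length limit of 20 is an arbitrary choice for this example, not a documented server requirement.

```python
import re


def clean_header(header, max_length=20):
    """Reduce a FASTA header to a short, purely alphanumeric identifier."""
    text = header.lstrip(">")
    text = re.sub(r"[^A-Za-z0-9]", "", text)
    return text[:max_length]


def clean_fasta(text, max_length=20):
    """Rewrite all header lines of a FASTA text; sequence lines are kept as-is."""
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            out.append(">" + clean_header(line, max_length))
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

Note that truncation can make two different headers identical, so check that target and template still have distinct names afterwards.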

Click submit alignment and review the alignment on the returned page. Make sure it has been interpreted correctly by the server. The conserved residues have to be lined up and matching. Then click submit alignment again, to start the modeling process.

The resulting page returns information about the resulting model. Save the model coordinates on your computer. Read the information on what is being returned by the server (click on the red question mark icon). Paste the Anolea profile into your assignment.

Do not paste a screenshot of the result, but copy and paste the image from the Web-page! You do not need to submit the actual coordinate files with your assignment.

(3) Model analysis

(3.1) The PDB file

Open your model coordinates in a text-editor (make sure you view the PDB file in a fixed-width font) and consider the following questions:

What is the residue number of the first residue in the model? What should it be, based on the alignment? If the putative DNA binding region was reported to be residues 50-74 in the Mbp1 protein, which residues of your model correspond to that region?
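You can answer the first question by reading the file in a text editor, as asked. As a cross-check, the short sketch below lists the first and last residue number found in the ATOM records of each chain. It assumes standard PDB column positions; the function name is made up, and a blank chain ID (which some modeling servers write) is reported as a single space.

```python
def residue_ranges(pdb_text):
    """Return {chain_id: (lowest, highest)} residue numbers from ATOM records."""
    ranges = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        chain = line[21:22]          # column 22: chain identifier
        number = int(line[22:26])    # columns 23-26: residue sequence number
        first, last = ranges.get(chain, (number, number))
        ranges[chain] = (min(first, number), max(last, number))
    return ranges
```

Compare the reported range with the range you expect from your input alignment; a mismatch means that the model numbering follows a different context than your target numbering.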

(3.2) First visualization

In Assignment 2 you have already studied an Mbp1 structure and compared it with your organism's Mbp1 APSES domain. Since a homology model inherits its structural details from the template, the model should look very similar to the original structure but contain the sequence of the target.

Save your model coordinates to your computer and visualize the structure in VMD. Make an informative (parallel, not cross-eyed!) stereo view that shows the general orientation of the helix-turn-helix motif and the "wing", and paste it into your assignment.

(4) The DNA ligand

(4.1) Finding a similar protein-DNA complex

One of the really interesting questions we can discuss with reference to our model is how sequence variation might result in changed DNA recognition sites, and then lead to changed cognate DNA binding sequences. In order to address this, we would need to generate a plausible structural model for how DNA is bound to APSES domains.

Since there is currently no software available that would accurately model such a complex from first principles, we will base a model of a bound complex on homology modeling as well. This means we need to find a similar structure for which the position of bound DNA is known, then superimpose that structure with our model. This places the DNA molecule into the spatial context of the model we are studying. However, you may remember from the third assignment that the APSES domains in fungi seem to be a relatively small family. And there is no structure available of an APSES domain-DNA complex. How can we find a coordinate set of a structurally similar protein-DNA complex?

Remember that homologous sequences can have diverged to the point where their sequence similarity is no longer recognizable, even though their structure may be quite well conserved. Thus if we could find similar structures in the PDB, these might provide us with some plausible hypotheses for how DNA is bound by APSES domains. We thus need a tool similar to BLAST, not for sequence alignment but for structure alignment: a kind of BLAST for structures. Just like with sequence searches, we might not want to search with the entire protein if what we are interested in is a subdomain that binds to DNA. Attempting to match all structural elements in addition to the ones we are actually interested in is likely to make the search less specific: we would find false positives that are similar to some irrelevant part of our structure. However, defining too small a subdomain would also lead to a loss of specificity: in the extreme it is easy to imagine that the search for e.g. a single helix would retrieve very many hits that would be quite meaningless.

At the NCBI, VAST is provided as a search tool for structural similarity search.

At the EBI there are a number of very well designed structure analysis tools linked off the Structural Analysis page. As part of its MSD Services, PDBeFold provides a convenient interface for structure searches.

However, we have also read previously that the APSES domains are members of a much larger superfamily, the "winged helix" DNA binding domains, of which hundreds of structures have been solved.

Stereo-view of a subdomain within the 1MB1 structure that includes residues 36 to 76. The color gradient ramps from blue (36) to green (76) and the "wing" is clearly seen as the green pair of beta-strands, extending to the right of the helix-turn-helix motif.

APSES domains represent one branch of the tree of helix-turn-helix (HTH) DNA binding modules. (A review on HTH proteins is linked from the resources section at the bottom of this page.) Winged Helix domains typically bind their cognate DNA with a "recognition helix" which precedes the beta hairpin and binds into the major groove; additional stabilizing interactions are provided by the edge of a beta-strand binding into the minor groove. This is good news: once we have determined that the APSES domain is actually an example of a larger group of transcription factors, we can compare our model to a structure of a protein-DNA complex. Superfamilies of such structural domains are compiled in the CATH database. Unfortunately CATH itself does not provide information about whether the structures have been determined as complexes. But we can search the PDB with CATH codes and restrict the results to complexes. Essentially, this should give us a list of all winged helix domains for which structures of complexes with DNA have been determined. This works as follows:

For reference, access CATH domain 1.10.10.10; this is the domain you will use to find protein-DNA complexes.
Navigate to the PDB home page and follow the link to Advanced Search
In the options menu for "Choose a Query Type" select Structure Features → CATH Classification Browser. A window will open that allows you to navigate down through the CATH tree. You can view the Class/Architecture/Topology names on the CATH page linked above. Click on the triangle icons (not the text) for "Mainly Alpha"→"Orthogonal Bundle"→"ARC repressor mutant, subunit A" then click on the link to "winged helix repressor DNA binding domain". Or, just enter "winged helix" into the search field. This subquery should match more than 500 coordinate entries.
Click on the (+) button behind "Add search criteria" to add an additional query. Select the option "Structure Features"→"Macromolecule type". In the option menus that pop up, select "Contains Protein → Yes", "Contains DNA → Yes", "Contains RNA → Ignore", "Contains DNA/RNA hybrid → Ignore". This selects files that contain Protein-DNA complexes.
Check the box below this subquery to "Remove Similar Sequences at 90% identity" and click on "Submit Query". This query should retrieve more than 90 complexes.
Scroll down to the beginning of the list of PDB codes and locate the "Generate reports" menu. Under the heading Custom reports select Image collage. This is a fast way to obtain an overview of the structures that have been returned. First of all you may notice that in fact not all of the structures are really different, despite selecting only to retrieve dissimilar sequences. This appears to be a deficiency of the algorithm. But you can also easily recognize how in most of the structures the recognition helix inserts into the major groove of B-DNA (e.g. 1BC8, 1CF7). There is one exception: the structure 1DP7 shows how the human RFX1 protein binds DNA in a non-canonical way, through the beta-strands of the "wing". This is interesting since it suggests there is more than one way for winged helix domains to bind to DNA. We can therefore use structural superposition of your homology model and two of the winged-helix proteins to decide whether the canonical or the non-canonical mode of DNA binding seems to be more plausible for Mbp1 orthologues.

Follow the procedure outlined above, from a CATH entry page up to viewing a Collage (or alternatively a tabular view) of the retrieved coordinate files. You can be maximally concise in your documentation for the procedure I have defined above, but I expect you to have spent enough time on this process to understand the key elements of the PDB's advanced search interface.

(4.2) Preparation and superposition of a canonical complex

The structure we shall use as a reference for the canonical binding mode is the Elk-1 transcription factor.

Stereo-view of the canonical DNA binding mode of the Winged Helix domain family. Shown here is the Elk-1 transcription factor - an ETS DNA binding domain - in complex with a high-affinity binding site (1DUX). Note how the "recognition helix" inserts into the major groove of the DNA molecule. The color gradient ramps from blue (34) to green (84). Note how the first helix of the "helix-turn-helix" architecture serves only to position the recognition helix and makes few interactions by itself.

The 1DUX coordinate-file contains two protein domains and two B-DNA dimers in one asymmetric unit. For simplicity, you should delete the second copy of the complex from the PDB file. (Remember that PDB files are simply text files that can be edited.)

Access the PDB and navigate to the 1DUX structure explorer page. Download the coordinates to your computer.
Open the coordinate file in a text-editor and delete the coordinates for chains D,E and F; you may also delete all HETATM records and the MASTER record. Save the file with a different name, e.g. 1DUX_monomer.pdb .
Open VMD and load your homology model. Turn off the axes, display the model as a Tube representation in stereo, and color it by Index. Then load your edited 1DUX file, display this coordinate set in a tube representation as well, and color it by ColorID in some color you like. It is important that you can distinguish easily which structure is which.
You could use the Extensions→Analysis→RMSD calculator interface to superimpose the two structures if you knew which residues correspond to each other. Sometimes it is useful to do exactly that: define exact correspondences between residue pairs and superimpose according to these selected pairs. For our purpose it is much simpler to use the MultiSeq tool (and the structures are simple and small enough that the STAMP algorithm for structural alignment can define corresponding residue pairs automatically). Open the MultiSeq extension window, select the check-boxes next to both protein structures, and open the Tools→Stamp Structural Alignment interface.
In the "Stamp Alignment Options" window, check the radio-button for Align the following ... Marked Structures and click on OK.
In the Graphical Representations window, double-click on all "NewCartoon" representations for both molecules, to undisplay them.
You should now see a superimposed tube model of your homology model and the 1DUX protein-DNA complex. You can explore it, display side-chains etc. and study some of the details of how a transcription factor recognizes and binds to its cognate DNA sequence. However, remember that your model's side-chain orientations have not been determined experimentally but inferred from the template, and that the template's structure was determined in the absence of bound DNA ligand.
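The text-editor step in the procedure above (deleting chains D, E and F, the HETATM records and the MASTER record from the 1DUX file) can also be done with a few lines of code. This is a minimal sketch that relies only on the fixed column layout of PDB files; the function name is hypothetical. It does not touch CONECT or other bookkeeping records, and a bare TER line without a chain ID is kept.

```python
# Record types whose column 22 holds a chain identifier.
COORD_RECORDS = ("ATOM  ", "HETATM", "ANISOU", "TER   ")


def strip_chains(pdb_text, drop_chains, drop_hetatm=True):
    """Remove the given chains, the MASTER record and optionally all HETATMs."""
    kept = []
    for line in pdb_text.splitlines():
        record = line[:6].ljust(6)
        if record == "MASTER":
            continue
        if drop_hetatm and record == "HETATM":
            continue
        if record in COORD_RECORDS and len(line) > 21 and line[21] in drop_chains:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"
```

For the file described above one would call `strip_chains(text, "DEF")` and write the result to a new file such as 1DUX_monomer.pdb.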

Orient and scale your superimposed structures so that their structural similarity is apparent, and the recognition helix can be clearly seen inserting into the DNA major groove. Paste a copy of your image into your assignment. Remark briefly on which parts of the structure appear to superimpose best. Note whether this orientation of a B-DNA double-helix is a plausible model for DNA binding of your Mbp1 orthologue.

(4.3) Preparation and superposition of a non-canonical complex

The structure displaying a non-canonical complex between a winged-helix domain and its cognate DNA binding site is the human Regulatory Factor X.

Stereo-view of a non-canonical wHTH-DNA complex, discovered with the structure of human Regulatory Factor X (hRFX) binding its cognate X-box DNA sequence (1DP7). Note how the helix that corresponds to the recognition helix in the canonical domain lies across the minor groove whereas the beta-"wing" inserts into the major groove. The color gradient ramps from blue (18) to green (68).

Before we can work with this, however, we have to fix an annoying problem caused by the way the PDB stores replicates in biological assemblies. The PDB generates additional chains as copies of the original and delineates them with MODEL and ENDMDL records, just like in a multi-structure NMR file. The chain IDs and the atom numbers are the same as the original. The PDB file thus contains the same molecule in two different orientations, not two independent molecules. This is an important difference regarding how such molecules are displayed by VMD. If you were to use the biological unit file of the PDB, VMD would not recognize that there is a second molecule present and would display only one. We have to edit the file to merge the two molecules by removing the MODEL and ENDMDL records, and while we're editing the file we'll also remove unneeded heteroatoms and the second copy of the protein chain (which we don't need; we need only the second B-DNA strand). But then we end up with residues that have exactly the same residue number in the same file. That won't work for visualization, since the program expects residue numbers to be unique. Therefore we have to renumber the residues. Here's how:

On the structure explorer page for 1DP7, select the option Download Files → Biological Assembly.
Download, save and uncompress the file.
Open the file in a text editor.
Delete both MODEL and both ENDMDL records.
Also delete all HETATM records for HOH, PEG and EDO, as well as the entire second protein chain and the MASTER record. The resulting file should only contain the DNA chain and its copy and one protein chain. Save the file with a new name.
Access the Whatif Web interface and click on Administration and Renumber a PDB File from 1. Upload your edited file, access the results and save the file.
Open the renumbered file with VMD. You should see one protein chain and a B-DNA double helix. Switch to stereo viewing and spend some time to see how amazingly beautiful the complementarity between the protein and the DNA helix is (you might want to display chain P and chain D in separate representations and color the DNA chain by Position → Radial for clarity) ... in particular, appreciate how Arginine 76 interacts with the base of Guanine 92!
Then clear all molecules.
In VMD, open Extensions→Analysis→MultiSeq. When you run MultiSeq for the first time, you will be asked for a directory in which to store metadata. You can use the default, or a directory of your choice; you may subsequently skip all steps that ask you to install "required" databases locally since we will not need them for this task.
A window will appear - the MultiSeq window - it contains the sequence of the APSES domain you are visualizing. MultiSeq will also generate an additional cartoon representation of the structure. Choose File→Import Data, browse to your directory and load:
- Your model;
- The 1DUX complex;
- The 1DP7 complex.
Mark all three protein chains by selecting the checkbox next to their name and run the STAMP structural alignment.
In the graphical representations window, double-click on the cartoon representations that multiseq has generated to undisplay them, also undisplay the Tube representation of 1DUX. Then create a Tube representation for 1DP7, and select a Color by ColorID (a different color that you like). The resulting scene should look similar to the one you have created above, only with 1DP7 in place of 1DUX and colored differently.
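The editing and renumbering steps for the 1DP7 biological assembly can be approximated in code as follows. This is a sketch under several assumptions: the protein chain is taken to be chain "P" (as suggested by the chain names mentioned above), residues are renumbered consecutively in order of appearance, which may not be identical to what the WhatIf server does, and the function names are invented for this example.

```python
def merge_assembly(pdb_text, protein_chain="P", drop_hets=("HOH", "PEG", "EDO")):
    """Merge MODEL blocks, keeping the protein chain from the first model only."""
    model = 0
    lines = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "MODEL":
            model += 1
            continue
        if record in ("ENDMDL", "MASTER"):
            continue
        if record == "HETATM" and line[17:20].strip() in drop_hets:
            continue
        if (record in ("ATOM", "ANISOU", "TER") and model > 1
                and line[21:22] == protein_chain):
            continue  # second copy of the protein chain is not needed
        lines.append(line)
    return lines


def renumber_residues(lines, start=1):
    """Give every residue a unique number, counting up in order of appearance."""
    out = []
    number = start - 1
    last_key = None
    for line in lines:
        record = line[:6].strip()
        if record in ("ATOM", "HETATM", "ANISOU"):
            # A residue is identified by chain, number + insertion code and name.
            key = (line[21:22], line[22:27], line[17:20])
            if key != last_key:
                number += 1
                last_key = key
            line = line[:22] + "%4d " % number + line[27:]
        elif record == "TER" and len(line) > 26:
            line = line[:22] + "%4d " % number + line[27:]
        out.append(line)
    return out
```

A typical use would be `"\n".join(renumber_residues(merge_assembly(text)))`. Whichever route you take, inspect the result in VMD as described above to confirm that one protein chain and a complete double helix are present.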

Orient and scale your superimposed structures so that their structural similarity is apparent, the orientation is similar to the scene generated above and the 1DP7 "wing" can be clearly seen inserting into the DNA major groove. Paste a copy of your image into your assignment. Remark briefly on which parts of the structure appear to superimpose best. Note whether this orientation of a B-DNA double-helix is a plausible model for DNA binding of your Mbp1 orthologue.

(4.4) Coloring by conservation

With the superimposed coordinates, you can begin to get a sense whether either or both binding modes could be appropriate for a protein-DNA complex in your Mbp1 orthologue. But these are geometrical criteria only, and the protein in your species may be flexible enough to adopt a different conformation in a complex, and different again from your model. A more powerful way to analyze such hypothetical complexes is to look at conservation patterns. With VMD, you can import a sequence alignment into the MultiSeq extension and color residues by conservation. The protocol below assumes:

You have prealigned the reference Mbp1 proteins with your species' Mbp1 orthologue;
You have saved the alignment in a CLUSTAL format.

You can use Jalview or any other MSA server to do so. You can even do this by hand - there should be few if any indels and the correct alignment is easy to see.

Load the Mbp1 APSES alignment into MultiSeq.

(A) In the MultiSeq Window, navigate to File → Import Data...; Choose "From Files" and Browse to the location of the alignment you have saved. The File navigation window gives you options which files to enable: choose to Enable ALN files (these are CLUSTAL formatted multiple sequence alignments).

(B) Open the alignment file and click on Ok to import the data; it will take a short while to load. If the data can't be loaded, the file may have the wrong extension: .aln is required.

(C) Find the Mbp1_SACCE sequence in the list, click on it and move it to the top of the Sequences list with your mouse (the list is not static, you can re-order the sequences in any way you like).

You will see that the 1MB1 sequence and the APSES domain sequence do not match: at the N-terminus the sequence that corresponds to the PDB structure has extra residues, and in the middle the APSES sequences may have gaps inserted.

Bring the 1MB1 sequence in register with the APSES alignment.

(A) MultiSeq supports typical text-editor selection mechanisms. Clicking on a residue selects it, clicking on a row selects the whole sequence. Dragging with the mouse selects several residues, shift-clicking selects ranges, and option-clicking toggles the selection on or off for individual residues. Using the mouse and/or the shift key as required, select the entire first column of the sequences you have imported.

(B) Select Edit → Enable Editing... → Gaps only to allow changing indels.

(C) Pressing the spacebar once should insert a gap character before the selected column in all sequences. Insert as many gaps as you need to align the beginning of the sequences with the corresponding residues of 1MB1: S I M ...

(D) Now insert as many gaps as you need into the structure sequence, to align it completely with the Mbp1_SACCE APSES domain sequence. Simply select residues in the sequence and use the space bar to insert gaps. (Note: I have noticed a bug that sometimes prevents slider or keyboard input to the MultiSeq window; it fails to regain focus after operations in a different window. I don't know whether this is a Mac related problem or a more general bug in MultiSeq. When this happens I quit VMD and restore the session from a saved state. It is a bit annoying but not mission-critical.)

(E) When you are done, it may be prudent to save the state of your alignment. Use File → Save Session...
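To know in advance how many gap characters the imported sequences need, you can count how far into the structure sequence the aligned domain begins. The following is a minimal sketch under stated assumptions: `leading_gaps` is a hypothetical helper, it relies on an exact match of the first few residues (which holds only if the two sequences are identical in that stretch), and the sequences in the usage note are toy strings, not the real 1MB1 or Mbp1 sequences.

```python
def leading_gaps(structure_seq, domain_seq, probe_len=8):
    """Return the number of gaps to prepend to domain_seq so that its first
    residues line up with structure_seq, or None if no exact match is found."""
    # Ignore gap characters that may already be present in either sequence.
    probe = domain_seq.replace("-", "")[:probe_len]
    if not probe:
        return None
    offset = structure_seq.replace("-", "").find(probe)
    return offset if offset >= 0 else None
```

For example, with the toy strings `leading_gaps("NQIYSARYSIMKRK", "SIMKRK")` the domain starts eight residues into the structure sequence, so eight gaps would be needed.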

Color by similarity.

(A) Use the View → Coloring → Sequence similarity → BLOSUM30 option to color the residues in the alignment and structure. This clearly shows you where conserved and variable residues are located and allows you to analyze their structural context.

(B) You can adjust the color scale in the usual way by navigating to VMD main → Graphics → Colors..., choosing the Color Scale tab and adjusting the scale midpoint.

(C) Navigate to the Representations window and apply the color scheme to your tube-and-sidechain representation: double-click on the NewCartoon representation to hide it and use User coloring of your Tube and Licorice representations to apply the sequence similarity color gradient that MultiSeq has calculated.
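To cross-check what the coloring shows, a per-column conservation score can be computed directly from the alignment. The sketch below is deliberately simpler than MultiSeq's BLOSUM-based similarity: it only reports, for each column, the fraction of sequences that share the most common residue, and it counts gaps as non-matching. The function name `column_conservation` is an assumption made for illustration.

```python
from collections import Counter


def column_conservation(seqs):
    """Return one score per alignment column: the fraction of sequences
    that carry the most common residue in that column (gaps never match)."""
    if not seqs or len({len(s) for s in seqs}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    scores = []
    for column in zip(*seqs):
        residues = [c.upper() for c in column if c not in "-."]
        if not residues:
            scores.append(0.0)  # column consists of gaps only
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(seqs))
    return scores
```

Columns with a score of 1.0 are invariant in your alignment; these are the positions you would expect to see at one end of the color scale on the structure.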

Once you have colored the residues of your model by conservation, create another informative stereo-image and paste it into your assignment.

(4.4) Interpretation

Analysis (2 marks)

Considering the conservation patterns for Mbp1 orthologues, and assuming that all these orthologues bind DNA in a similar way, which model appears to be more plausible for protein-DNA interactions in APSES domains? Is it the canonical, or the non-canonical binding mode? Discuss briefly what you would expect to find and how this relates to your observations. Distinguish clearly between experimental evidence, computational inference and empirical hypothesis. You are welcome to upload detail views (stereo!) of particular sidechains, surfaces etc. if this helps your arguments. Sometimes a picture is worth many words. But this is not a requirement; we are more interested in evidence-based reasoning than in the form of the presentation.

(5) Summary of Resources

Links and background reading

Review (PDF, restricted) Manuel Peitsch on Homology Modeling
Review (PDF, restricted) Aravind et al. Helix-turn-helix domains
Review (PDF, restricted) Gajiwala & Burley, winged-Helix domains
PDB file format (see the Coordinate Section if you are unsure about chain identifiers)
Wikipedia on Structural Superposition (although the article is called "Structural Alignment")

Data

Fallback Data page - Refer to this page in case your own efforts fail, or you have insurmountable problems with your input files.

Reference sequences and alignments

Reference APSES domains page

[End of assignment]

If you have any questions at all, don't hesitate to mail me at boris.steipe@utoronto.ca or post your question to the 2011 Course Mailing List.

Contents

(1) Preparation

(1.1) Template choice and template sequence

(1.2) The input alignment

(2) Homology model

(2.1) SwissModel

(3) Model analysis

(3.1) The PDB file

(3.2) First visualization

(4) The DNA ligand

(4.1) Finding a similar protein-DNA complex

(4.2) Preparation and superposition of a canonical complex

(4.2) Preparation and superposition of a non-canonical complex

(4.3) Coloring by conservation

(4.4) Interpretation

(5) Summary of Resources

@@ Line 1: / Line 1: @@
+<!-- {{Template:Inactive}} -->
+{{Template:Active}}
 __TOC__
 &nbsp;
@@ Line 4: / Line 8: @@
 <div style="padding: 5px; background: #A6AFD0;  border:solid 1px #AAAAAA; font-size:200%;font-weight:bold;">
-Assignment 5 - Homology modeling
+Assignment 5 (last: 2011) - Homology modeling
 </div>
-Please note: This assignment is currently inactive. Unannounced changes may be made at any time.
+<div style="padding: 15px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+;How could the search for ultimate truth have revealed so hideous and visceral-looking an object?
+::''<small>Max Perutz (on his first glimpse of the Hemoglobin structure)</small>''
+</div>
 &nbsp;
-<!-- '''Please note: This assignment is currently active. All changes will be announced on the course mailing list.'''-->
 &nbsp;
-<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+Where is the hidden beauty in structure, and where, the "ultimate truth"? In the previous assignments we have studied sequence conservation in APSES family domains and we have discovered homologues in all fungal species. This is an ancient protein family that had already duplicated to several paralogues at the time the cenancestor of all fungi lived, more than 600,000,000 years ago, in the [http://www.ucmp.berkeley.edu/fungi/fungifr.html Vendian period] of the Proterozoic era of Precambrian times.
-Introduction
-&nbsp;
-;How could the search for ultimate truth have revealed so hideous and visceral-looking an object?
+In order to understand how specific residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to study an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. Explanations of a protein's observed properties and functions can't rely on the general fact that it binds DNA; we need to consider details in terms of specific residues and their spatial arrangement. In particular, it would be interesting to correlate the conservation patterns of key residues with their potential to make specific DNA binding interactions. Unfortunately, no APSES domain structure in complex with bound DNA has been solved up to now, and the experimental evidence we have considered in Assignment 2 ([http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=10747782 Taylor ''et al.'', 2000]) is not sufficient to unambiguously define the details of how a DNA double helix might be bound. Moreover, at least two distinct modes of DNA binding are known for proteins of the winged-helix superfamily, of which the APSES domain is a member.
-:''<small>Max Perutz (on his first glimpse of the Hemoglobin structure)</small>''
-</div>
-Where is the hidden beauty in structure, and where, the "ultimate truth"? In the previous assignments we have studied sequence conservation in APSES family domains and looked at how these domains have evolved over time. We have seen that this is an ancient family, that had several members already in the cenancestor of all fungi, an organism that lived in the [http://www.ucmp.berkeley.edu/fungi/fungifr.html vendian period] of the proterozoic era of precambrian times, more than 600,000,000 years ago.
-In order to understand how particular residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to consider an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. In particular, it would be interesting to correlate the conservation patterns we have observed in the MSAs with specific DNA binding interactions. Unfortunately, the 1MB1 structure does not have DNA bound and the evidence we have considered in Assignment 2 ([http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=10747782 Taylor ''et al.'', 2000]) is not sufficient to define the details of how a DNA double helix might be bound. These details would require the structure of a complex that contains protein as well as DNA. No such complex of an APSES domain has yet been crystallized.
-''In this assignment you will construct a molecular model of the Mbp1 orthologue in your assigned organism, identify similar structures of distantly related domains for which protein-DNA complexes are known, define whether the available evidence allows you to distinguish between different modes of ligand binding, and assemble a hypothetical complex structure.''
+''In this assignment you will (1) construct a molecular model of the APSES domain from the Mbp1 orthologue in your assigned species, (2) identify similar structures of distantly related domains for which protein-DNA complexes are known, (3) assemble a hypothetical complex structure and (4) discuss whether the available evidence allows you to distinguish between different modes of ligand binding.''
 For the following, please remember the following terminology:
@@ Line 36: / Line 32: @@
 ;Model
 :The structure that results from the modeling process. It has the '''Target sequence''' and is similar to the '''Template structure'''.
-A brief overview article on the construction and use of homology models is linked to the resource section at the bottom of this page. That section also contains all links to other sites and resources you might require.
+&nbsp;
+A brief overview article on the construction and use of homology models is linked to the resource section at the bottom of this page. That section also contains links to other sites and resources you might find useful or interesting.
+{{Template:Preparation|
+care=Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we see too many assignments which, arduously effected, nevertheless intimate nescience of elementary tenets of molecular biology. If the sentence above did not trigger an urge to open a dictionary, you are trying to guess, rather than confirm possibly important information.|
+num=5|
+ord=fifth|
+due = Monday, December 5. at 12:00 noon}}
-<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+&nbsp;
-Preparation, submission and due date
+;Your documentation for the procedures you follow in this assignment will be worth 2 marks - 1 mark for generating the model and 1 mark for the selection/superposition and visualization of protein/DNA complexes.
+&nbsp;
+<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
+==(1) Preparation==
 </div>
-Read carefully. Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we see too many assignments which, arduously effected, nevertheless intimate nescience of elementary tenets of molecular biology. If the sentence above did not trigger an urge to open a dictionary, you have a tendency to guess, rather than confirm possibly important information.
-Prepare a Microsoft Word document with a title page that contains:
+<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
-*your full name
+===(1.1) Template choice and template sequence===
-*your Student ID
+</div>
-*your e-mail address
+The [http://swissmodel.expasy.org/ SWISS-MODEL] server provides several different options for constructing homology models. The easiest is probably the '''Automated Mode''', which requires only a target sequence as input; in this mode the program will automatically choose suitable templates and create an input alignment. I disagree, however, that this is the best way to use such a service: template choice and alignment may both be significantly influenced by biochemical reasoning, and an automated algorithm cannot make the necessary decisions. Should you use a structure of lower resolution that has a ligand bound? Should you move an indel from an active site to a loop region even though the sequence similarity score might be less? Questions like these may yield answers that are counter to the best choices an automated algorithm could make. Therefore we will use the '''Alignment Mode''' of Swiss-Model in this assignment, choose our own template and upload our own alignment.
-*the organism name you have been [[Organism_list_2006|assigned]]
+Template choice is the first step. Often more than one related structure can be found in the PDB. We have touched on principles of selecting template structures in the lecture and I have posted a short summary of [[Template_choice_principles|template choice principles]] on this Wiki. One can search the PDB itself through its '''Advanced Search''' interface: for example, one can search for sequence similarity with a BLAST search, or search for structural similarity by accessing structures according to their CATH or SCOP classification. Alternatively, one can always use the BLAST interface at the NCBI, since the sequences contained in PDB files are accessible as a database subsection on the BLAST menu.
-Follow the steps outlined below. You are encouraged to  write your answers in short answer form or point form, '''like you would document an analysis in a laboratory notebook'''. However, you must
-*document what you have done,
-*note what Web sites and tools you have used,
-*paste important data sequences, alignments, information etc.
-'''If you do not document the process of your work, we will deduct marks.'''  Try to be concise, not wordy! Use your judgement: are you giving us enough information so we could exactly reproduce what you have done? If not, we will deduct marks. Avoid RTF and unnecessary formating. Do not paste screendumps. Keep the size of your submission below 1.5 MB.
+<div style="padding: 5px; background: #DDDDEE;">
+In Assignment 2 you have already searched for structures of APSES domains in the PDB. If you need to repeat this:
+*Use the NCBI BLAST interface to identify all PDB files that are clearly homologous to your target APSES domain.
+*In Assignment 2, you have defined the extent of the APSES domain in yeast Mbp1. In Assignment 3, you have aligned reference APSES domains with those you found in your species. In Assignment 4 you have confirmed by phylogenetic analysis and ''Reciprocal Best Match'' which of these APSES domain sequences is the closest related orthologue to yeast Mbp1. This sequence is the best candidate for having a conserved function similar to yeast Mbp1. Therefore, this sequence is the '''target''' for the homology modeling procedure.
+*Defining a '''template''' means finding a PDB coordinate set that has sufficient sequence similarity to your '''target''' that you can build a model based on that '''template'''. To find suitable PDB structures, use your '''target''' sequence as input for a BLAST search, and select Protein Data Bank proteins (pdb) as the '''Database''' you search in. Hits that are homologues are all suitable '''templates''' in principle, but some are more suitable than others. Consider how the coordinate sets differ and which features would make each more or less suitable for creating a homology model; comment briefly on
+:*sequence similarity to your target
+:*size of expected model (= length of alignment)
+:*presence or absence of ligands
+:*experimental method and quality of the data set
+Then choose the '''template''' you consider the most suitable and note why you have decided to use this template.
+</div>
-Write your answers into separate paragraphs and give each its title. Save your document with a filename of:
-<code>A5_family name.given name.doc</code>
-<small>(for example my fifth assignment would be named: A5_steipe.boris.doc - and don't switch the order of your given name and familyname please!)</small>
-Finally e-mail the document to [mailto: boris.steipe@utoronto.ca] before the due date.
+It is not at all straightforward how to number sequences in such a project. The "natural" numbering starts with the start-codon of the full length protein and goes sequentially from there. However, this does not map exactly to other numbering schemes we have encountered. As you know, the first residue of the APSES domain (as defined by CDD) is not Residue 1 of the Mbp1 protein. The first residue of the 1MB1 FASTA file <small>(one of the related PDB structures)</small> '''is''' the first residue of Mbp1 protein, but the last five residues are an artificial His tag. Is H125 of 1MB1 thus equivalent to R125 in MBP1_SACCE? The N-terminus of the Mbp1 crystal structure is disordered. The first residue in the structure is ASN 3, therefore N is the first residue in a FASTA sequence derived from the coordinate section of the PDB file (the <code>ATOM  </code> records), whereas the SEQRES records start with MET ... and so on. You need to remember: a sequence number is not absolute, but assigned in a particular context.
-Your document must not contain macros. Please turn off and/or remove all macros from your Word document; we will disable macros, since they pose a security risk.
+The homology '''model''' will be based on an alignment of '''target''' and '''template'''. Thus we have to define the target sequence. As discussed in class, PDB files have an explicit  and an implied sequence and these do not necessarily have to be the same. To compare the implied and the explicit sequence for the template, you need to extract sequence information from coordinates. One way to do this is via the Web interface for [http://swift.cmbi.ru.nl/servers/html/index.html '''WhatIf'''], a crystallography and molecular modeling package that offers many useful tools for coordinate manipulation tasks.
-We do not have the resources to correct formatting errors or to convert assignments into different formats. <!-- Becoming familiar and proficient with technologies is part of the course objectives and that includes e-mail attachments. I will also not accept files that are significantly in excess of 1.5 MB. This will be enforced in this assignment, as as the assignment includes a number of image files and as a proficient user of your computer you should be aware of an image's size, its resolution, its displayed size and its file format, all of which influence the displayed image and the size of its file.--> Keep your image-file sizes manageable!
+<div style="padding: 5px; background: #DDDDEE;">
+*Navigate to the '''Administration''' sub-menu of the [http://swift.cmbi.ru.nl/servers/html/index.html WhatIf Web server]. Follow the link to '''Make sequence file from PDB file'''. Enter the PDB-ID of your template into the form field and '''Send''' the request to the server. The server accesses the PDB file and extracts sequence information directly from the <code>ATOM&nbsp;&nbsp;</code> records of the file. The results will be returned in PIR format. Copy the results, edit them to FASTA format and save them in a text-only file. Make sure you create a valid FASTA formatted file! Use this '''implied''' sequence to check if and how it differs from the sequence ...
-:<small>Image sizes are measured in pixels - 600px across is sufficient for the assignment, resolutions are measured in dpi (dots per imperial inch) - 72 dpi is the standard resolution for images that are viewed on a monitor; the displayed size may be scaled (in %) by an application program: stereo images should be presented so that equivalent points are approximately 6 cm apart; images can be stored uncompressed as .tiff or.bmp, or compressed as .gif or .jpg. .gif is preferred for images with large, monochrome areas and sharp, high-contrast edges; '''.jpg is preferred for images with shades and halftones such as the structure views required here;''' .tiff is preferred to archive master copies of images in a lossless fashion, use LZW compression for .tiff files if your system/application supports it; .bmp is not preferred for anything, its used because its easier to code.</small>
+:*... listed in the <code>SEQRES</code> records of the coordinate file;
+:*... given in the FASTA sequence for the template, which is provided by the PDB;
+:*... stored in the protein database of the NCBI.
+: and record your results.
-<!--Make it a habit to focus on information, pure and simple, and avoid HTML and RTF formatting and the like, where it does not contribute significantly to emphasize actual information. -->Information that you present (such as added colouring, formatting etc.) should be meaningful. If you have technical difficulties, post your questions to the list and/or contact me.
+* Establish how the sequence numbers in the coordinate section of your template(*) correspond to your target sequence numbering.
-All required stereo views are to be presented as divergent stereo frames (left eye's view in the left frame). <!--Marks will be deducted if they are not.--> Remember to list the Rasmol command input you have used to generate the images.
+</div>
-With the number of students in the course, we have to economize on processing the assignments. '''Thus we will not accept assignments that are not prepared as described above.''' If you have technical difficulties, contact me.
+:(*) <small>These residue numbers are important, since they are referenced e.g. by VMD when you visualize the structure. The easiest way to list them is via the ''Sequence Viewer'' extension of VMD.</small>
+:<small>Don't do this for every residue individually but define ranges. Look at the correspondence of the first and last residue of target and template sequence and take indels into account. Establishing sequence correspondence precisely is crucially important! For example, when a publication refers to a residue by its sequence number, you have to be able to relate that number to the residue numbers of the model as well as your target sequence.</small>
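The idea of an implied sequence with its own residue numbers can be made concrete with a short Python sketch that reads the fixed-width ATOM records of a PDB file. This is an illustration only, not a replacement for the WhatIf server or the VMD Sequence Viewer: the function name `implied_sequence` is hypothetical, only the first model and the 20 standard amino acids are handled, and insertion codes are used only to tell residues apart.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def implied_sequence(pdb_text, chain="A"):
    """Return [(residue_number, one_letter_code), ...] for one chain,
    taken from the ATOM records in the order they appear in the file."""
    residues, seen = [], set()
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # use the first model only
        if not line.startswith("ATOM  ") or len(line) < 27:
            continue
        if line[21] != chain:  # column 22: chain identifier
            continue
        key = (line[22:26], line[26])  # columns 23-27: number + insertion code
        if key in seen:
            continue  # further atoms (or alternate locations) of the same residue
        seen.add(key)
        residues.append((int(line[22:26]), THREE_TO_ONE.get(line[17:20], "X")))
    return residues
```

The first tuple in the result gives the first residue number of the coordinate section (ASN 3 in the 1MB1 case described above), which is the offset you need when you define numbering ranges between template and target.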
+&nbsp;
+&nbsp;
-'''The due date for the assignment is Wednesday, December 20. at 10:00 in the morning.'''
+<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
-<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+===(1.2) The input alignment===
-Grading
 </div>
+&nbsp;<br>
+The sequence alignment between target and template is the single most important factor that determines the quality of your model. No comparative modeling process will repair an incorrect alignment; it is useful to consider a homology model as a three-dimensional map of a sequence alignment rather than a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these only because they are convenient. You should take advantage of the much more sophisticated alignment methods available. Analysis of wrong models can't be expected to produce right results.
+The best possible alignment is usually constructed from a multiple sequence alignment that includes at least '''the target and template sequence''' and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. Your alignment should have been carefully reviewed by you and wherever required, manually adjusted to move insertions or deletions between target and template out of the secondary structure elements of the template structure.
+In most of the Mbp1 orthologues, we do not observe indels in the APSES domain regions (and for the ones in which we do see indels, we might suspect that these are actually gene-model errors). Evolutionary pressure on the APSES domains has selected against indels in the more than 600 million years these sequences have evolved independently in their respective species. To obtain an alignment between the '''template sequence''' and the '''target sequence''' from your species, proceed as follows.
+<div style="padding: 5px; background: #DDDDEE;">
+;In Jalview...
+* Load your Jalview project with aligned APSES domain sequences or recreate it from the Mbp1 orthologue in your species and the APSES domains from the [[Reference APSES domain sequences (reference species)|'''Reference APSES domain page''']] that I prepared for Assignment 4. Include the sequence of your '''template protein''' and re-align.
+* Delete all sequences you no longer need, i.e. keep only the APSES domains of the '''target''' (from your species) and the '''template''' (from the PDB) and choose '''Edit &rarr; Remove empty columns'''. This is your '''input alignment'''.
+* Choose '''File&rarr;Output to textbox&rarr;FASTA''' to obtain the aligned sequences. They should both have exactly the same length, i.e. N- or C- termini have to be padded by hyphens if the original sequences had different lengths. Save the sequences in a text-file.
-Don't wait until the last day to find out there are problems! This assignment has been structured so that it should be doable in three or four  hours. The assignment is excellent preparation for the exam, so even if its due later, its a good idea to do it earlier. Assignments that are received past the due date will have one mark deducted at the first minute of every twelve hour period past the due date. Assignments received more than 5 days past the due date will not be assessed. If you need an extension, you '''must''' arrange this beforehand.
+;Using a different MSA program
+* Copy the FASTA formatted sequences of the Mbp1 proteins in the reference  species from the [[Reference APSES domain sequences (reference species)|'''Reference APSES domain page''']].
+* Access e.g. the MSA tools page at the EBI.
+* Paste the Mbp1 sequence set, your '''target''' sequence and the '''template''' sequence into the input form.
+*Run the alignment and save the output.
-Marks are noted below in the section headings for of the tasks. A total of 10 marks will be awarded, if your assignment answers all of the questions. A total of 2 bonus marks (up to a maximum of 10 overall) can be awarded for particularily interesting findings, or insightful comments. A total of 2 marks can be subtracted for lack of form or for glaring errors. The marks you receive will
-* count directly towards your final marks at the end of term, for BCH441 (undergraduates), or
-* be divided by two for BCH1441 (graduates).
-&nbsp;
+;By hand
-&nbsp;
+APSES domains are strongly conserved and have few if any indels. You could also simply align by hand.
-<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
+* Copy the CLUSTAL formatted reference alignment of the Mbp1 proteins in the reference species from the [[Reference APSES domain sequences (reference species)|'''Reference APSES domain page''']].
-==(1) Preparation==
+* Open a new file in a text editor.
+* Paste the Mbp1 sequence set, your '''target''' sequence and the '''template''' sequence into the file.
+*Align by hand, replace all spaces with hyphens and save the output.
 </div>
-<!--
+Whatever method you use: the result should be a multiple sequence alignment in '''multi-FASTA''' format, that was constructed from a number of supporting sequences and that contains your aligned '''target''' and '''template''' sequence. This is your '''input alignment''' for the homology modeling server.
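Before uploading, the input alignment can be checked with a short script. The following is a minimal sketch under stated assumptions: the helper names `clean_header` and `prepare_input_alignment` are made up, headers are reduced to letters and digits only (a conservative choice that avoids characters such as the NCBI pipe), spaces in hand-aligned sequences are turned into hyphens, and the script refuses to continue if the aligned sequences differ in length.

```python
import re


def clean_header(header, max_len=20):
    """Strip '>' and every non-alphanumeric character; keep the header short."""
    cleaned = re.sub(r"[^A-Za-z0-9]", "", header)
    return cleaned[:max_len] or "seq"


def prepare_input_alignment(records):
    """Turn (header, aligned_sequence) pairs into multi-FASTA text.

    Raises ValueError if the aligned sequences do not have equal length."""
    cleaned = [(clean_header(h), s.replace(" ", "-").upper()) for h, s in records]
    if len({len(s) for _, s in cleaned}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    return "".join(">%s\n%s\n" % (h, s) for h, s in cleaned)
```

If the length check fails, pad the shorter sequence with hyphens at the terminus where residues are actually missing rather than padding blindly, since padding at the wrong end shifts the whole alignment.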
+<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
+==(2) Homology model==
+</div>
+&nbsp;
+&nbsp;
 <div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
-===Choosing a template (X marks)===
+=== (2.1) SwissModel===
 </div>
 &nbsp;<br>
-Often more than one related structure can be found in the PDB. We have discussed principles of selecting template structures in the lecture. Interestingly the PDB itself cannot be searched for the contents of its holdings, by structural- or sequence similarity, but there is always BLAST since the NCBI conveniently allows you to search against all sequences in PDB files.
-*Use BLAST to identify all PDB files that contain APSES domains that are clearly homologuous to your target. (Document that you have searched in the correct subsection of the Genbank holdings). For the hits you find, consider how these structures differ and which features would make each more or less suitable for your task. Comment briefly on what options you have, select one template and note why you have decided to use this particular structure as a template. Include aspects of sequence similarity, length of the sequence, presence or absence of ligands and their potential effect on the structure, and experimental method and quality in your reasoning.
+Access the Swissmodel server at '''http://swissmodel.expasy.org''' . Navigate to the '''Alignment Mode''' page.
-*Note which sequence is contained in the coordinate section of the PDB file; note if and how this implied sequence differs from the sequences ...
+&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
+*Paste your alignment for target and template into the form field. Refer to the [[Homology_modeling_fallback_data|'''Fallback Data file''']] if you are not sure about the format. Make sure to select the correct option (FASTA) for the alignment input format on the form.
-:*listed in the seqres records;
+* Click '''submit alignment ''' and on the returned page define your '''target''' and '''template''' sequence. For the '''template sequence''' define the PDB ID of the coordinate file it came from. Enter the correct Chain-ID <small>(usually "A", note: upper-case)</small>.
-:*given in the FASTA sequence for the template that the PDB provides;
+:<small>If you run into problems, compare your input to the fallback data. It has worked for me, it will work for you. In particular we have seen problems that arise from "special" characters in the FASTA header like the pipe "<code>|</code>" character that the NCBI uses to separate IDs - keep the header short and remove all non-alphanumeric characters to be safe.</small>
-:*and that stored by the NCBI.
-* In a table, establish the correspondence of the coordinate sequence numbering (defined by the residue numbers/insertion codes in the atom records) with your target sequence numbering.
+*Click '''submit alignment''' and review the alignment on the returned page. Make sure it has been interpreted correctly by the server. '''The conserved residues have to be lined up and matching'''. Then click '''submit alignment''' again, to start the modeling process.
-* Retrieve the most suitable template structure coordinate file from the PDB.
+* The resulting page returns information about the model. Save the '''model coordinates''' on your computer. Read the information on what is being returned by the server (click on the red question mark icon). Paste the Anolea profile into your assignment.
+:<small>Do not paste a screenshot of the result, but copy and paste the image from the Web-page! You do not need to submit the actual coordinate files with your assignment.</small>
+</div>
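The advice above about FASTA headers can be sketched in a few lines of Python. This is only an illustration: the function name <code>sanitize_fasta</code> and the length limit of 20 characters are assumptions made for this example, not requirements of the modeling server.

```python
import re


def sanitize_fasta(text, max_header_len=20):
    """Return FASTA text whose header lines contain only letters and digits.

    max_header_len is an arbitrary limit chosen for this sketch.
    """
    cleaned = []
    for line in text.splitlines():
        if line.startswith(">"):
            # Drop pipes, spaces, dots and every other non-alphanumeric character.
            header = re.sub(r"[^A-Za-z0-9]", "", line[1:])
            cleaned.append(">" + header[:max_header_len])
        else:
            # Sequence lines are kept, minus surrounding whitespace.
            cleaned.append(line.strip())
    return "\n".join(cleaned) + "\n"


if __name__ == "__main__":
    print(sanitize_fasta(">MBP1|CHAGL test\nAGIYSATYSGIPVYEYQF\n"))
```

Gap characters in the sequence lines are left untouched, so an aligned target/template pair stays in register.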
--->
+<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
+==(3) Model analysis==
+</div>
 &nbsp;
 &nbsp;
 <div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
-=== The input alignment (X marks)===
+=== (3.1) The PDB file ===
 </div>
 &nbsp;<br>
-The sequence alignment between target and template is the single most important factor that determines the quality of your model.
+Open your '''model''' coordinates in a text-editor (make sure you view the PDB file in a fixed-width font) and consider the following questions:
-No homology modeling process will repair an incorrect alignment and it is useful to consider a homology model rather like a three-dimensional map of a sequence alignment, rather than a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these just because they are convenient, rather than the more sophisticated methods and more informed procedures we have discussed. Detailed analysis of fallacious models rarely leads to good results.
-The best possible alignment is usually constructed from a multiple sequence alignment that includes at least the target and template sequence and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. Typically such an alignment will also include additional optimization steps to move insertions or deletions between target and template out of the secondary structure elements of the template structure.
+<br><div style="padding: 5px; background: #DDDDEE;">
+*What is the residue number of the first residue in the '''model'''? What should it be, based on the alignment? If the putative DNA binding region was reported to be residues 50-74 in the Mbp1 protein, which residues of your '''model''' correspond to that region?
+</div>
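If you prefer to check the numbering with a script rather than by eye, the following Python sketch reads the residue numbers from the fixed-width columns of the <code>ATOM</code> records (residue name in columns 18-20, chain ID in column 22, residue number in columns 23-26). The function names and the example offsets are hypothetical, and the range conversion assumes that the region of interest contains no indels between target and template.

```python
def residue_numbers(pdb_lines, chain=None):
    """Return the ordered list of (residue number, residue name) pairs in ATOM records."""
    seen = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        if chain is not None and line[21] != chain:
            continue
        key = (int(line[22:26]), line[17:20].strip())
        # Many atoms share one residue: record each residue only once.
        if not seen or seen[-1] != key:
            seen.append(key)
    return seen


def shift_range(start, end, observed_first, expected_first):
    """Convert a range given in reference numbering into the model's numbering.

    Assumes a constant offset, i.e. no indels inside the range.
    """
    offset = expected_first - observed_first
    return (start - offset, end - offset)


if __name__ == "__main__":
    # Hypothetical case: the model starts at 1, the alignment says it should start at 4.
    print(shift_range(50, 74, observed_first=1, expected_first=4))
```

To use it on your own file, pass the lines of the model, for example <code>residue_numbers(open("model.pdb").read().splitlines())</code>, where <code>model.pdb</code> stands for whatever name you saved the coordinates under.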
+<!-- discuss flagging of loops - setting of B-factor to 99.0 phps. ANOLEA vs. Gromos ... packing vs. energy? -->
-Here is an excerpt from the T-coffee aligned Mbp1 sequences: it contains all the residues of the yeast sequence that are found in the 1MB1 crystal structure - the '''template''' sequence for our homology model - and it has been edited to remove the N-terminal gaps in the sequence. Thus the N-terminus is 21 amino acids longer than the definition of the APSES domain in CDD (which starts with <code>SIMKR...</code>), the C- terminus is slightly shorter.
-Since the sequences are very similar to each other, there is no ambiguity in the alignment and the construction of a homology model should be straightforward. Normally one would spend considerable effort at this stage to consider which parts of the target sequence and the template sequence appear to be correctly aligned and to edit the alignment manually. In our case, evolutionary pressure was so strong that essentially all have evolved without a single indel in their sequence.
+<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+===(3.2) First visualization===
+</div>
+&nbsp;<br>
-I have added to the alignment the APSES domain of [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=116197493&dopt=GenPept XP_001224558], the ''Chaetomium globosum'' Mbp1 orthologue (MBP1_CHAGL). This will serve as the reference and fallback sequence.
+In assignment 2 you have already studied a Mbp1 structure and compared it with your organism's Mbp1 APSES domain. Since a homology model inherits its structural details from the '''template''', the model should look very similar to the original structure but contain the sequence of the '''target'''.
-MB1            NQIYSARYSGVDVYEFIHSTG---SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEV
+&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
- MBP1_CANGL      NQIYSAKYSGVDVYEFIHPTG---SIMKRKNDGWVNATHILKAANFAKAKRTRILEKEV
+*Save your '''model''' coordinates to your computer and visualize the structure in VMD. Make an informative (parallel, not cross-eyed!) stereo view that shows the general orientation of the helix-turn-helix motif and the "wing", and paste it into your assignment.
- MBP1_EREGO      TQIYSAKYSGVEVYEFLHPTG---SIMKRKADDWVNATHILKAAKFAKAKRTRILEKEV
- MBP1_KLULA      NQIYSAKYSGVDVYEFIHPTG---SIMKRKADNWVNATHILKAAKFPKAKRTRILEKEV
- MBP1_CANAL      SQIYSATYSNVPAFEFVTSEG---PIMRRKKDSWINATHILKIAKFPKAKRTRILEKDV
- MBP1_DEBHA      TQIYSATYSNVPVFEFVTLEG---PIMRRKLDSWINATHILKIAKFPKAKRTRILEKDV
- MBP1_YARLI      MSIYKATYSGVPVYEFQCKNV---AVMRRKSDGWVNATHILKVAGFDKPQRTRILEKEV
- MBP1_SCHPO      SAVHVAVYSGVEVYECFIKGV---SVMRRRRDSWLNATQILKVADFDKPQRTRVLERQV
- MBP1_USTMA      KTIFKATYSGVPVYECIINNV---AVMRRRSDDWLNATQILKVVGLDKPQRTRVLEREI
- MBP1_ASPNI      SNVYSATYSSVPVYEFKIGTD---SVMRRRSDDWINATHILKVAGFDKPARTRILEREV
- MBP1_ASPTE      SKIYSATYSSVPVYEFKIEGD---SVMRRRADDWINATHILKVAGFDKPARTRILEREV
- MBP1_CRYNE      PKVYASVYSGVPVFEAMIRGI---SVMRRASDSWVNATQILKVAGVHKSARTKILEKEV
- MBP1_GIBZE      G-IYSASYSGVDVYEMEVNNI---AVMRRRNDSWLNATQILKVAGVDKGKRTKILEKEI
- MBP1_NEUCR      IYSLQATYSGVGVYEMEVNNV---AVMRRQKDGWVNATQILKVANIDKGRRTKILEKEI
- MBP1_MAGGR      P-IYTAVYSNVEVYEFEVNGV---AVMKRIGDSKLNATQILKVAGVEKGKRTKILEKEI
- MBP1_ASPFU      PQIYKAVYSNVSVYEMEVNGV---AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEI
- MBP1_CHAGL      AGIYSATYSGIPVYEYQFGPDMKEHVMRRREDNWINATHILKAAGFDKPARTRILERDV
-MB1            LKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLF
- MBP1_CANGL      LKEMHEKVQGGFGKYQGTWVPLNIAINLAEKFDVYQDLKPLF
- MBP1_EREGO      IKDTHEKVQGGFGKYQGTWVPLDIARRLAQKFEVLEELRPLF
- MBP1_KLULA      ITDTHEKVQGGFGKYQGTWIPLELASKLAEKFEVLDELKPLF
- MBP1_CANAL      QTGIHEKVQGGYGKYQGTYVPLDLGAAIARNFGVYDVLKPIF
- MBP1_DEBHA      QTGVHEKVQGGYGKYQGTYVPLDLGADIAKNFGVFDSLRPIF
- MBP1_YARLI      QKGVHEKVQGGYGKYQGTWVPLERAREIATLYDVDSHLAPIF
- MBP1_SCHPO      QIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPIL
- MBP1_USTMA      QKGIHEKVQGGYGKYQGTWIPLDVAIELAERYNIQGLLQPIT
- MBP1_ASPNI      QKGVHEKVQGGYGKYQGTWIPLQEGRQLAERNNILDKLLPIF
- MBP1_ASPTE      QKGVHEKVQGGYGKYQGTWIPLPEGRLLAERNNIIDKLRPIF
- MBP1_CRYNE      LNGIHEKIQGGYGKYQGTWVPLDRGRDLAEQYGVGSYLSSVF
- MBP1_GIBZE      QTGEHEKVQGGYGKYQGTWIKFERGLQVCRQYGVEELLRPLL
- MBP1_NEUCR      QIGEHEKVQGGYGKYQGTWIPFERGLEVCRQYGVEELLSKLL
- MBP1_MAGGR      QTGEHEKVQGGYGKYQGTWIKYERALEVCRQYGVEELLRPLL
- MBP1_ASPFU      AAGEHEKVQGGYGKYQGTWVNYQRGVELCREYHVEELLRPLL
- MBP1_CHAGL      QKDVHEKIQGGYGKYQGTWIPLEQGRALAQRNNIYDRLRPIF
+</div>
 &nbsp;<br>
+&nbsp;<br>
+<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
-It should be obvious to you by now how you can copy a string of amino acids from such an alignment and create a FASTA file. However we need to take a little detour: this detour brings us to the question of sequence numbers.
+==(4) The DNA ligand==
+</div>
+&nbsp;
+&nbsp;
-It is not straightforward at all how to number sequences in such a project. The "natural" way would be to start a sequential numbering from the start-codon of the full length protein and go sequentially from there. However imagine what would happen if a curator discovered that one of the splice-sites for a gene has been missed in automatic annotation. All of a sudden a corrected sequence would have a different length than the one that may have been used for earlier studies. Unfortunately, there is no mechanism (''wouldn't it be nice!'') that automatically goes back through the literature and your lab-journal and updates the revised sequence numbering... But there are other possible complications, regarding sequence numbers. The first residue of the CDD-APSES domain is not Residue 1 of the Mbp1 protein. The first residue of the 1MB1 FASTA file ''is'' the first residue of Mbp1 protein, but the last five residues are an artificial His tag. Is H125 of 1MB1 the equivalent residue to R125 in MBP1_SACCE? The N-terminus of the Mbp1 crystal structure is disordered. The first residue in the structure is ASN 3, whereas the SEQRES records start with MET ... and so on. The take-home message is that a sequence number is nothing absolute, but something that makes sense only in a particular context. To emphasize this, we will write a FASTA header for our '''target''' sequence that lists the residues of the source sequence it corresponds to. In terms of actual sequence numbering, we will adopt the numbering of the 1MB1 protein throughout to be able to consistently label particular amino acids.
+<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
-Access the sequence of "your" organism's Mbp1 Orthologue at UniProt. (You can use the links I have provided in the table below).
+===(4.1) Finding a similar protein-DNA complex===
+</div>
+&nbsp;<br>
+One of the really interesting questions we can discuss with reference to our model is how sequence variation might result in changed DNA recognition sites, and then lead to changed cognate DNA binding sequences. In order to address this, we would need to generate a plausible structural model for how DNA is bound to APSES domains.
-<table style="border-left:1px solid #AAAAAA; border-bottom:1px solid #AAAAAA;" cellpadding="10" cellspacing="0">
+Since there is currently no software available that would accurately model such a complex from first principles, we will base a model of  a bound complex on homology modeling as well. This means we need to find a similar structure for which the position of bound DNA is known, then superimpose that structure with our model. This places the DNA molecule into the spatial context of the model we are studying. However, you may remember from the third assignment that the APSES domains in fungi seem to be a relatively small family. And there is no structure available of an APSES domain-DNA complex. How can we find a coordinate set of a structurally similar protein-DNA complex?
-<tr style="background: #BDC3DC;">
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b><i>Organism</i></b></td>
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b>Uniprot Accession</b></td>
-</tr>
-<tr style="background: #FFFFFF;">
+Remember that homologous sequences can have diverged to the point where their sequence similarity is no longer recognizable, however their structure may be quite well conserved. Thus if we could find similar structures in the PDB, these might provide us with some plausible hypotheses for how DNA is bound by APSES domains. We thus need a tool similar to BLAST, but not for the purpose of sequence alignment, but for structure alignment. A kind of BLAST for structures. Just like with sequence searches, we might not want to search with the entire protein, if what we are interested in is a subdomain that binds to DNA. Attempting to match all structural elements in addition to the ones we are actually interested in is likely to make the search less specific - we would find false positives that are similar to some irrelevant part of our structure. However, defining too small a subdomain would also lead to a loss of specificity: in the extreme it is easy to imagine that the search for e.g. a single helix would retrieve very many hits that would be quite meaningless.
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus fumigatus</i></td>
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q4WGN2_ASPFU Q4WGN2]</td>
-</tr>
-<tr style="background: #E9EBF3;">
+At the '''NCBI''', [http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml VAST] is provided as a search tool for structural similarity search.
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus nidulans</i></td>
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q5B8H6_EMENI Q5B8H6]</td>
-</tr>
-<tr style="background: #FFFFFF;">
+At the '''EBI''' there are a number of very well designed structure analysis tools linked off the [http://www.ebi.ac.uk/Tools/structural.html '''Structural Analysis''' page]. As part of its MSD Services, [http://www.ebi.ac.uk/msd-srv/ssm/ '''PDBeFold'''] provides a convenient interface for structure searches.
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus terreus</i></td>
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q0CQJ5_ASPTE Q0CQJ5]</td>
-</tr>
-<tr style="background: #E9EBF3;">
+However we have also read previously that the APSES domains are members of a much larger superfamily, the "winged helix" DNA binding domains, of which hundreds of structures have been solved.
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Candida albicans</i></td>
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q5ANP5_CANAL Q5ANP5]</td>
-</tr>
-<tr style="background: #FFFFFF;">
+&nbsp;<br>
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Candida glabrata</i></td>
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q6FWD6_CANGL Q6FWD6]</td>
-</tr>
-<tr style="background: #E9EBF3;">
+[[Image:A5_Mbp1_subdomain.jpg|frame|none|Stereo-view of a subdomain within the 1MB1 structure that includes residues 36 to 76. The color gradient ramps from blue (36) to green (76) and the "wing" is clearly seen as the green pair of beta-strands, extending to the right of the helix-turn-helix motif.]]
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Cryptococcus neoformans</i></td>
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q5KHS0_CRYNE Q5KHS0]</td>
-</tr>
-<tr style="background: #FFFFFF;">
+&nbsp;<br>
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Debaryomyces hansenii</i></td>
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q6BSN6_DEBHA Q6BSN6]</td>
-</tr>
-<tr style="background: #E9EBF3;">
+APSES domains represent one branch of the tree of helix-turn-helix (HTH) DNA binding modules. (A review on HTH proteins is linked from the resources section at the bottom of this page). Winged Helix domains typically bind their cognate DNA with a "recognition helix" which precedes the beta hairpin and binds into the major groove; additional stabilizing interactions are provided by the edge of a beta-strand binding into the minor groove. This is good news: once we have determined that the APSES domain is actually an example of a larger group of transcription factors, we can compare our model to a structure of a protein-DNA complex. Superfamilies of such structural domains are compiled in the CATH database. Unfortunately CATH itself does not provide information about whether the structures have been determined as complexes. '''But''' we can search the PDB with CATH codes and restrict the results to complexes. Essentially, this should give us a list of all winged helix domains for which the structure of complexes with DNA have been determined. This works as follows:
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Eremothecium gossypii</i></td>
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q752H3_ASHGO Q752H3]</td>
-</tr>
-<tr style="background: #FFFFFF;">
+* For reference, access [http://cathwww.biochem.ucl.ac.uk/cgi-bin/cath/GotoCath.pl?cath=1.10.10.10 CATH domain 1.10.10.10]; this is the domain you will use to find protein-DNA complexes.
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Gibberella zeae</i></td>
+* Navigate to the [http://www.pdb.org/ PDB home page] and follow the link to [http://www.pdb.org/pdb/search/advSearch.do Advanced Search]
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q4IEY8_GIBZE Q4IEY8]</td>
+* In the options menu for "Choose a Query Type" select Structure Features &rarr; CATH Classification Browser. A window will open that allows you to navigate down through the CATH tree. You can view the Class/Architecture/Topology names on the CATH page linked above. Click on '''the triangle icons''' (not the text) for "Mainly Alpha"&rarr;"Orthogonal Bundle"&rarr;"ARC repressor mutant, subunit A" then click on the link to "winged helix repressor DNA binding domain". Or, just enter "winged helix" into the search field. This subquery should match more than 500 coordinate entries.
-</tr>
+* Click on the (+) button behind "Add search criteria" to add an additional query. Select the option "Structure Features"&rarr;"Macromolecule type". In the option menus that pop up, select "Contains Protein &rarr; Yes", "Contains DNA &rarr; Yes", "Contains RNA &rarr; Ignore", "Contains DNA/RNA hybrid &rarr; Ignore". This selects files that contain Protein-DNA complexes.
+* Check the box below this subquery to "Remove Similar Sequences at 90% identity" and click on "Submit Query". This query should retrieve more than 90 complexes.
+* Scroll down to the beginning of the list of PDB codes and locate the "Generate reports" menu. Under the heading '''Custom reports''' select '''Image collage'''. This is a fast way to obtain an overview of the structures that have been returned. First of all you may notice that in fact not all of the structures are really different, despite selecting only to retrieve dissimilar sequences. This appears to be a deficiency of the algorithm. But you can also easily recognize how in most of the structures the '''recognition helix inserts into the major groove of B-DNA''' (e.g. 1BC8, 1CF7). There is one exception: the structure 1DP7 shows how the human RFX1 protein binds DNA in a non-canonical way, through the beta-strands of the "wing". This is interesting since it suggests there is more than one way for winged helix domains to bind to DNA. We can therefore use structural superposition of '''your homology model''' and '''two of the winged-helix proteins''' to decide whether the canonical or the non-canonical mode of DNA binding seems to be more plausible for Mbp1 orthologues.
-<tr style="background: #E9EBF3;">
+&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Kluyveromyces lactis</i></td>
+* Follow the procedure outlined above, from a CATH entry page up to viewing a Collage (or alternatively a tabular view) of the retrieved coordinate files. You can be maximally concise in your documentation for the procedure I have defined above, but I expect you to have spent enough time on this process to understand the key elements of the PDB's advanced search interface.
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=MBP1_KLULA P39679]</td>
+</div>
-</tr>
-<tr style="background: #FFFFFF;">
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Magnaporthe grisea</i></td>
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q3S405_MAGGR Q3S405]</td>
-</tr>
-<tr style="background: #E9EBF3;">
+<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Neurospora crassa</i></td>
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q7SBG9_NEUCR Q7SBG9]</td>
-</tr>
-<tr style="background: #FFFFFF;">
+===(4.2) Preparation and superposition of a canonical complex===
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Saccharomyces cerevisiae</i></td>
+</div>
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=MBP1_YEAST P39678]</td>
+&nbsp;<br>
-</tr>
-<tr style="background: #E9EBF3;">
+The structure we shall use as a reference for the '''canonical binding mode''' is the Elk-1 transcription factor.
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Schizosaccharomyces pombe</i></td>
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=RES2_SCHPO P41412]</td>
-</tr>
-<tr style="background: #FFFFFF;">
+[[Image:A5_canonical_wHTH.jpg|frame|none|Stereo-view of the canonical DNA binding mode of the Winged Helix domain family. Shown here is the Elk-1 transcription factor - an ETS DNA binding domain - in complex with a high-affinity binding site (1DUX). Note how the "recognition helix" inserts into the major groove of the DNA molecule. The color gradient ramps from blue (34) to green (84). Note how the first helix of the "helix-turn-helix" architecture serves only to position the recognition helix and makes few interactions by itself.]]
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Ustilago maydis</i></td>
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q4P117_USTMA Q4P117]</td>
-</tr>
-<tr style="background: #E9EBF3;">
+The 1DUX coordinate-file contains two protein domains and two B-DNA dimers in one asymmetric unit. For simplicity, you should delete the second copy of the complex from the PDB file. (Remember that PDB files are simply text files that can be edited.)
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Yarrowia lipolytica</i></td>
-  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q6CGF5_YARLI Q6CGF5]</td>
-</tr>
-</table>
+* Access the PDB and navigate to the 1DUX structure explorer page. Download the coordinates to your computer.
+* Open the coordinate file in a text-editor and delete the coordinates for chains <code>D</code>,<code>E</code> and <code>F</code>; you may also delete all <code>HETATM</code> records and the <code>MASTER</code> record. Save the file with a different name, e.g. 1DUX_monomer.pdb .
+* Open VMD and load your homology model. Turn off the axes, display the model as a Tube representation in stereo, and color it by Index. Then load your edited 1DUX file, display this coordinate set in a tube representation as well, and color it by ColorID in some color you like. It is important that you can distinguish easily which structure is which.
+* You could use the Extensions&rarr;Analysis&rarr;RMSD calculator interface to superimpose the two structures '''if''' you knew which residues correspond to each other. Sometimes it is useful to do exactly that: define exact correspondences between residue pairs and superimpose according to these selected pairs. For our purpose it is much simpler to use the Multiseq tool (and the structures are simple and small enough that the STAMP algorithm for structural alignment can define corresponding residue pairs automatically). Open the '''multiseq''' extension window, select the check-boxes next to both protein structures, and open the '''Tools&rarr;Stamp Structural Alignment''' interface.
+* In the '''Stamp Alignment Options''' window, check the radio-button for ''Align the following ...'' '''Marked Structures''' and click on '''OK'''.
+* In the '''Graphical Representations''' window, double-click on all "NewCartoon" representations for both molecules, to undisplay them.
+* You should now see a superimposed tube model of your homology model and the 1DUX protein-DNA complex. You can explore it, display side-chains etc. and study some of the details of how a transcription factor recognizes and binds to its cognate DNA sequence. However, remember that your '''model''''s side-chain orientations have not been determined experimentally but inferred from the '''template''', and that the template's structure was determined in the absence of bound DNA ligand.
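The manual editing of the 1DUX file described above can also be done with a short script. The following Python sketch drops the records for chains <code>D</code>, <code>E</code> and <code>F</code> as well as all <code>HETATM</code> and <code>MASTER</code> records. The function names and the file names in the usage note are assumptions for this example; it relies only on the fixed-width PDB layout, in which the chain ID is found in column 22.

```python
def keep_first_complex(lines, drop_chains=("D", "E", "F")):
    """Return the PDB lines without the second complex, HETATM and MASTER records."""
    kept = []
    for line in lines:
        record = line[:6].strip()
        if record in ("HETATM", "MASTER"):
            continue
        # Coordinate-type records carry the chain ID in column 22 (index 21).
        if record in ("ATOM", "ANISOU", "TER") and len(line) > 21 and line[21] in drop_chains:
            continue
        kept.append(line)
    return kept


def filter_file(source, destination):
    """Read a PDB file, filter it and write the result under a new name."""
    with open(source) as handle:
        lines = handle.read().splitlines()
    with open(destination, "w") as handle:
        handle.write("\n".join(keep_first_complex(lines)) + "\n")
```

A call such as <code>filter_file("1DUX.pdb", "1DUX_monomer.pdb")</code> would then produce the edited file. Check the result in a text editor before loading it into VMD.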
+&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
+* Orient and scale your superimposed structures so that their structural similarity is apparent, and the recognition helix can be clearly seen inserting into the DNA major groove. Paste a copy of your image into your assignment. Remark briefly on which parts of the structure appear to superimpose best.  Note whether this orientation of a B-DNA double-helix is a plausible model for DNA binding of your Mbp1 orthologue.
-<div style="padding: 5px; background: #EEEEEE;">
-*Copy your organism's Mbp1 sequence from the alignment above. Then define the start- and end- sequence numbers of the '''target''' sequence relative to the full-length protein. (You can easily access the full-length protein sequence at the NCBI through the [[Assignment_3|links in the RefSeq column table of Assignment 3]]  Prepare a FASTA formatted file for the '''target''' sequence in your organism, giving it an appropriate header and include the sequence numbers. Refer to the [[Assignment_5_fallback_data|'''Fallback data''']] file if you are not sure about the format.
 </div>
 &nbsp;<br>
+&nbsp;
-Your FASTA sequence should look similar to this, and most importantly contain the '''exact''' same number of residues. Except that if both sequences have aligned gaps, you can delete the gaps corr.
-  >1MB1: Mbp1_SACCE 1..100
+<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
- NQIYSARYSGVDVYEFIHSTG---SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEV
- LKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLF
-Instruction
+===(4.2) Preparation and superposition of a non-canonical complex===
-&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
-*Task.
 </div>
-&nbsp;
-&nbsp;
-<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
+The structure displaying a non-canonical complex between a winged-helix domain and its cognate DNA binding site is the human Regulatory Factor X.
+[[Image:A5_non-canonical_wHTH.jpg|frame|none|Stereo-view of a non-canonical wHTH-DNA complex, discovered with the structure of human Regulatory Factor X (hRFX) binding its cognate X-box DNA sequence (1DP7). Note how the helix that corresponds to the recognition helix in the canonical domain lies across the minor groove whereas the beta-"wing" inserts into the major groove. The color gradient ramps from blue (18) to green (68).]]
+Before we can work with this however, we have to fix an annoying problem caused by the way the PDB stores replicates in biological assemblies. The PDB generates additional chains as copies of the original and delineates them with <code>MODEL</code> and <code>ENDMDL</code> records, just like in a multi-structure NMR file. The chain IDs and the atom numbers are the same as the original. The PDB file thus contains the '''same molecule in two different orientations''', not '''two independent molecules'''. This is an important difference regarding how such molecules are displayed by VMD. If you were to use the biological unit file of the PDB, VMD does not recognize that there is a second molecule present and displays only one. We have to edit the file to merge the two molecules by removing the MODEL and ENDMDL records - and while we're editing the file we'll also remove unneeded heteroatoms and the second copy of the protein chain (which we don't need, we need only the second B-DNA strand). But then we end up with residues that have '''exactly the same residue number''' in the same file. That won't work for visualization, since the program expects residue numbers to be unique, therefore we have to renumber the residues. Here's how:
+* On the structure explorer page for 1DP7, select the option '''Download Files''' &rarr; '''Biological Assembly'''.
+* Download, save and uncompress the file.
+* Open the file in a text editor.
+* Delete both <code>MODEL</code> and both <code>ENDMDL</code> records.
+* Also delete all <code>HETATM</code> records for <code>HOH</code>, <code>PEG</code> and <code>EDO</code>, as well as the entire second protein chain and the <code>MASTER</code> record. The resulting file should only contain the DNA chain and its copy and one protein chain. Save the file with a new name.
+* Access the [http://swift.cmbi.ru.nl/servers/html/index.html '''Whatif Web interface'''] and click on '''Administration''' and '''Renumber a PDB File from 1'''. Upload your edited file, access the results and save the file.
+* Open the renumbered file with VMD. You should see '''one protein chain''' and a '''B-DNA double helix'''. Switch to stereo viewing and spend some time to see how '''amazingly beautiful''' the complementarity between the protein and the DNA helix is (you might want to display ''chain P'' and ''chain D'' in separate representations and color the DNA chain by ''Position'' &rarr; ''Radial'' for clarity) ... in particular, appreciate how Arginine 76 interacts with the base of Guanine 92!
+* Then clear all molecules.
+* In VMD, open '''Extensions&rarr;Analysis&rarr;MultiSeq'''. When you run MultiSeq for the first time, you will be asked for a directory in which to store metadata. You can use the default, or a directory of your choice; you may subsequently skip all steps that ask you to install "required" databases locally since we will not need them for this task.
+* A window will appear - the MultiSeq window - it contains the sequence of the APSES domain you are visualizing. MultiSeq will also generate an additional cartoon representation of the structure. Choose '''File&rarr;Import Data''', browse to your directory and load:
+** Your model;
+** The 1DUX complex;
+** The 1DP7 complex.
+* Mark all three protein chains by selecting the checkbox next to their name and run the STAMP structural alignment.
+* In the graphical representations window, double-click on the cartoon representations that multiseq has generated to undisplay them, also undisplay the Tube representation of 1DUX. Then create a Tube representation for 1DP7, and select a Color by ColorID (a different color that you like). The resulting scene should look similar to the one you have created above, only with 1DP7 in place of 1DUX and colored differently.
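For reference, the editing and renumbering steps for the 1DP7 biological assembly can be approximated by the Python sketch below. It is a stand-in under stated assumptions, not a description of what the Whatif server does internally: it removes the <code>MODEL</code>, <code>ENDMDL</code> and <code>MASTER</code> records, drops <code>HETATM</code> records for <code>HOH</code>, <code>PEG</code> and <code>EDO</code>, drops the protein chain of the second copy (assumed here to be chain <code>P</code>), and numbers the remaining residues sequentially from 1. Atom serial numbers are left untouched. The function name is hypothetical.

```python
def merge_and_renumber(lines, protein_chain="P", drop_het=("HOH", "PEG", "EDO")):
    """Merge the MODEL blocks of a biological assembly and renumber residues from 1.

    lines: PDB lines without line endings. protein_chain is an assumption.
    """
    out = []
    model_index = 0   # number of MODEL records seen so far
    new_number = 0    # running residue number
    last_key = None   # (model, chain, old residue number + insertion code)
    for line in lines:
        record = line[:6].strip()
        if record == "MODEL":
            model_index += 1
            continue
        if record in ("ENDMDL", "MASTER"):
            continue
        if record == "HETATM" and line[17:20].strip() in drop_het:
            continue
        if record in ("ATOM", "HETATM", "ANISOU", "TER") and len(line) > 26:
            # Keep the protein only from the first copy of the complex.
            if model_index > 1 and line[21] == protein_chain:
                continue
            key = (model_index, line[21], line[22:27])
            if key != last_key:
                new_number += 1
                last_key = key
            # Columns 23-26 hold the residue number, column 27 the insertion code.
            line = line[:22] + f"{new_number:4d}" + " " + line[27:]
        out.append(line)
    return out
```

Read the file with <code>open(...).read().splitlines()</code>, pass the lines to the function, and write the result joined with newlines to a new file. Compare the output with the file returned by the Whatif server if you want to confirm that both approaches give unique residue numbers.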
-==(2) Homology model==
+&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
+* Orient and scale your superimposed structures so that their structural similarity is apparent, the orientation is similar to the scene generated above and the 1DP7 "wing" can be clearly seen inserting into the DNA major groove. Paste a copy of your image into your assignment. Remark briefly on which parts of the structure appear to superimpose best.  Note whether this orientation of a B-DNA double-helix is a plausible model for DNA binding of your Mbp1 orthologue.
 </div>
-&nbsp;
-&nbsp;
 <div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
-=== SUB section Heading (X marks)===
+===(4.3) Coloring by conservation===
 </div>
-&nbsp;<br>
-Instruction
+With the superimposed coordinates, you can begin to get a sense whether either or both binding modes could be appropriate for a protein-DNA complex in your Mbp1 orthologue. But these are geometrical criteria only, and the protein in your species may be flexible enough to adopt a different conformation in a complex, and different again from your model. A more powerful way to analyze such hypothetical complexes is to look at conservation patterns. With VMD, you can import a sequence alignment into the MultiSeq extension and color residues by conservation. The protocol below assumes:
-&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
-*Task
+*You have prealigned the reference Mbp1 proteins with your species' Mbp1 orthologue;
-</div>
+*You have saved the alignment in a CLUSTAL format.
-&nbsp;<br>
+You can use Jalview or any other MSA server to do so. You can even do this by hand - there should be few if any indels and the correct alignment is easy to see.
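If you do assemble the alignment by hand, the following Python sketch shows one way to write it out in a CLUSTAL-style layout. It is only an illustration: the function name <code>format_clustal</code>, the header text, and the toy sequences are assumptions, not part of the assignment data, and the optional conservation line of the CLUSTAL format is omitted. Whether a given program accepts a file without that line is something you should check.

```python
def format_clustal(records, width=60):
    """Return CLUSTAL-style text for a list of (name, aligned_sequence) pairs.

    Hypothetical helper for illustration; names must not contain spaces.
    """
    if not records:
        raise ValueError("no sequences given")
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    total = lengths.pop()
    # Pad names to a common width so that residues start in the same column.
    pad = max(len(name) for name, _ in records) + 4
    lines = ["CLUSTAL W (1.83) multiple sequence alignment", ""]
    for start in range(0, total, width):
        lines.append("")  # blank line separates alignment blocks
        for name, seq in records:
            lines.append(name.ljust(pad) + seq[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy placeholder sequences, NOT real APSES domain data.
    toy = [("Mbp1_SACCE", "SIMKA-GT"), ("Mbp1_TOYSP", "SIMRAEGT")]
    print(format_clustal(toy, width=60))
```

Save the resulting text in a file whose name ends in <code>.aln</code>.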
+;Load the Mbp1 APSES alignment into MultiSeq.
+:(A) In the MultiSeq window, navigate to '''File &rarr; Import Data...''', choose "From Files" and browse to the location of the alignment you have saved. The file navigation window lets you choose which file types to enable: enable <code>ALN</code> files (these are CLUSTAL formatted multiple sequence alignments).
+:(B) Open the alignment file and click on '''Ok''' to import the data; it will take a short while to load. If the data can't be loaded, the file may have the wrong extension: .aln is required.
+:(C) Find the Mbp1_SACCE sequence in the list, click on it and move it to the top of the Sequences list with your mouse (the list is not static, you can re-order the sequences in any way you like).
-Instruction
+You will see that the 1MB1 sequence and the APSES domain sequence do not match: at the N-terminus the sequence that corresponds to the PDB structure has extra residues, and in the middle the APSES sequences may have gaps inserted.
-&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
-*Task.
-</div>
-&nbsp;
+;Bring the 1MB1 sequence in register with the APSES alignment.
-&nbsp;
+:(A) MultiSeq supports typical text-editor selection mechanisms. Clicking on a residue selects it, clicking on a row selects the whole sequence. Dragging with the mouse selects several residues, shift-clicking selects ranges, and option-clicking toggles the selection on or off for individual residues. Using the mouse and/or the shift key as required, select the '''entire first column''' of the sequences you have imported.
+:(B) Select '''Edit &rarr; Enable Editing... &rarr; Gaps only''' to allow changing indels.
+:(C) Pressing the spacebar once should insert a gap character before the '''selected column''' in all sequences. Insert as many gaps as you need to align the beginning of sequences with the corresponding residues of 1MB1: <code>S I M ...</code>
+:(D) Now insert as many gaps as you need into the structure sequence to align it completely with the Mbp1_SACCE APSES domain sequence: simply select residues in the sequence and use the space bar to insert gaps. (Note: I have noticed a bug that sometimes prevents slider or keyboard input to the MultiSeq window; it fails to regain focus after operations in a different window. I don't know whether this is a Mac-related problem or a more general bug in MultiSeq. When this happens I quit VMD and restore the session from a saved state. It is a bit annoying but not mission-critical.)
+:(E) When you are done, it may be prudent to save the state of your alignment. Use '''File &rarr; Save Session...'''
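The register shift described in the steps above can be sketched in Python. This is only a minimal illustration of the idea, not what MultiSeq does internally: the function names <code>register_offset</code> and <code>prepend_gaps</code> and the toy sequences are assumptions made for this example.

```python
def register_offset(structure_seq, domain_seq, anchor):
    """Return how many gap columns must precede the alignment so that
    `anchor` (e.g. the first residues of the domain) lines up with the
    same residues in the structure sequence. Hypothetical helper."""
    s = structure_seq.find(anchor)
    d = domain_seq.find(anchor)
    if s < 0 or d < 0:
        raise ValueError("anchor not found in both sequences")
    if s < d:
        raise ValueError("structure sequence would need the leading gaps instead")
    return s - d


def prepend_gaps(records, n):
    """Insert n gap characters in front of every aligned sequence."""
    return [(name, "-" * n + seq) for name, seq in records]


if __name__ == "__main__":
    # Toy placeholder strings, NOT the real 1MB1 or Mbp1 sequences.
    structure = "XXXSIMAAA"
    alignment = [("Mbp1_SACCE", "SIMAAA"), ("Mbp1_TOYSP", "SIMAAG")]
    shift = register_offset(structure, alignment[0][1], "SIM")
    for name, seq in prepend_gaps(alignment, shift):
        print(name, seq)
```

Gaps in the middle of the structure sequence still have to be placed by inspection, as described in step (D).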
+;Color by similarity
+:(A) Use the '''View &rarr; Coloring &rarr; Sequence similarity &rarr; BLOSUM30''' option to color the residues in the alignment and structure. This clearly shows you where conserved and variable residues are located and allows you to analyze their structural context.
+:(B) You can adjust the color scale in the usual way by navigating to '''VMD main &rarr; Graphics &rarr; Colors...''', choosing the Color Scale tab and adjusting the scale midpoint.
+:(C) Navigate to the '''Representations''' window and apply the color scheme to your tube-and-sidechain representation: double-click on the NewCartoon representation to hide it and use '''User''' coloring of your ''Tube'' and ''Licorice'' representations to apply the sequence similarity color gradient that MultiSeq has calculated.
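To get a feeling for what a per-residue conservation value means, here is a minimal Python sketch that scores each alignment column by the fraction of sequences sharing its most common residue. Note the assumptions: this identity-based score is a simplified stand-in, not the BLOSUM30-based similarity that MultiSeq calculates, and the function name <code>column_conservation</code> and the toy sequences are made up for this example.

```python
from collections import Counter


def column_conservation(aligned):
    """Return one score per alignment column: the fraction of sequences
    that carry the most common (non-gap) residue in that column.
    Simplified identity measure, NOT MultiSeq's BLOSUM30 scoring."""
    if not aligned:
        return []
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n = len(aligned)
    scores = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / n)
    return scores


if __name__ == "__main__":
    # Toy placeholder alignment, NOT real APSES domain data.
    toy = ["SIMK", "SIMR", "S-MR"]
    for position, score in enumerate(column_conservation(toy), start=1):
        print(position, round(score, 2))
```

Columns with a score near 1 are invariant across the aligned sequences; a substitution matrix such as BLOSUM30 additionally gives partial credit to chemically similar residues.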
-<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
+&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
-==(3) Model analysis==
+* Once you have colored the residues of your model by conservation, create another informative stereo-image and paste it into your assignment.
 </div>
-&nbsp;
-&nbsp;
 <div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
-=== SUB section Heading (X marks)===
+===(4.4) Interpretation===
 </div>
-&nbsp;<br>
-Instruction
-&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+<div style="padding: 5px; background: #FFCC99;">
-*Task
+;Analysis (2 marks)
+* Considering the conservation patterns for Mbp1 orthologues, and assuming that all these orthologues bind DNA in a similar way, which model appears to be more plausible for protein-DNA interactions in APSES domains? Is it the canonical, or the non-canonical binding mode? Discuss briefly what you would expect to find and how this relates to your observations. Distinguish clearly between experimental evidence, computational inference and empirical hypothesis. You are welcome to upload detail views (stereo!) of particular sidechains, surfaces etc. if this helps your arguments. Sometimes a picture is worth many words. But this is not a requirement: we are more interested in evidence-based reasoning than in the form of the presentation.
 </div>
-&nbsp;<br>
-Instruction
-&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
-*Task.
-</div>
-&nbsp;
+<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
-&nbsp;
-==(3) Summary of Resources==
+==(5) Summary of Resources==
 </div>
 &nbsp;<br>
-;Links
+;Links and background reading
 :* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Peitsch_2002_UseOfModels.pdf '''Review (PDF, restricted)''' Manuel Peitsch on Homology Modeling]
-:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Aravind_2005_HTHdomains.pdf '''Review (PDF, restricted)''' Aravind ''et al.'' Helix-turn-helix domains] (background reading, not required reading)
+:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Aravind_2005_HTHdomains.pdf '''Review (PDF, restricted)''' Aravind ''et al.'' Helix-turn-helix domains]
-:* [[Organism_list_2006|Assigned Organisms]]
+:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/2000_Gajiwala_WingedHelixDomains.pdf '''Review (PDF, restricted)''' Gajiwala &amp; Burley, Winged-helix domains]
-:* [http://www.rcsb.org/pdb/file_formats/pdb/pdbguide2.2/guide2.2_frame.html '''PDB file format''']
+:* [http://www.wwpdb.org/documentation/format23/v2.3.html '''PDB file format'''] (see the Coordinate Section if you are unsure about chain identifiers)
 :* [http://en.wikipedia.org/wiki/Structural_alignment Wikipedia on '''Structural Superposition'''] <small>(although the article is called "Structural Alignment")</small>
-;Alignments
-:'''Mbp1 proteins:'''
-:* [[All_Mbp1_CLUSTAL|Mbp1 proteins '''CLUSTAL''' aligned]]
-:* [[All_Mbp1_MUSCLE|Mbp1 proteins '''MUSCLE''' aligned]]
-:* [[All_Mbp1_T-COFFEE|Mbp1 proteins '''T-Coffee''' aligned (text version)]]
-:'''APSES domains:'''
+;Data
-:* [[APSES_domains_PSI-BLAST|All APSES domains - alignment based on '''PSI-BLAST''' results]]
-:* [[APSES_domains_CLUSTAL|All APSES domains -  '''CLUSTAL-W''' alignment]]
-:* [[APSES_domains_probcons|All APSES domains -  '''probcons''' alignment]]
-;Trees
+:* [[Homology_modeling_fallback_data|'''Fallback Data page''']] <small> - Refer to this page in case your own efforts fail, or you have insurmountable problems with your input files.</small>
-:*[[APSES_domains_reference_tree|'''APSES domains reference tree''']]
-:*[[Revised_Mbp1_APSES_domain_tree| '''revised Mbp1 APSES domain tree''']]
-&nbsp;
+;Reference sequences and alignments
-&nbsp;
+:* [[Reference APSES domain sequences (reference species)|'''Reference APSES domains page''']]
-<div style="padding: 5px; background: #D3D8E8;  border:solid 1px #AAAAAA;">
-[End of assignment]
-</div>
-If you have any questions at all, don't hesitate to mail me at [mailto:boris.steipe@utoronto.ca boris.steipe@utoronto.ca] or post your question to the [mailto:bch441_2006@googlegroups.com Course Mailing List]
+{{Template:Assignment_Footer}}