Difference between revisions of "BIO Assignment 4 2011"

From "A B C"
Line 1: Line 1:
<!-- {{Template:Active}} -->
+
{{Template:Active}}
{{Template:Inactive}}
+
<!-- {{Template:Inactive}} -->
  
  
Line 18: Line 18:
 
&nbsp;
 
&nbsp;
  
Where is the hidden beauty in structure, and where, the "ultimate truth"? In the previous assignments we have studied sequence conservation in APSES family domains and looked at how these domains have evolved over time. We have seen that this is an ancient family, that had several members already in the cenancestor of all fungi, an organism that lived in the [http://www.ucmp.berkeley.edu/fungi/fungifr.html vendian period] of the proterozoic era of precambrian times, more than 600,000,000 years ago.
+
Where is the hidden beauty in structure, and where, the "ultimate truth"? In the previous assignments we have studied sequence conservation in APSES family domains and we have discovered homologues in all fungal species. This is an ancient protein family that had already duplicated to several paralogues at the time the cenancestor of all fungi lived, more than 600,000,000 years ago, in the [http://www.ucmp.berkeley.edu/fungi/fungifr.html Vendian period] of the Proterozoic era of Precambrian times.
  
In order to understand how particular residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to consider an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. In particular, it would be interesting to correlate the conservation patterns we have observed in the MSAs with specific DNA binding interactions. Unfortunately, the 1MB1 structure does not have DNA bound and the evidence we have considered in Assignment 2 ([http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=10747782 Taylor ''et al.'', 2000]) is not sufficient to define the details of how a DNA double helix might be bound. These details would require the structure of a complex that contains protein as well as DNA. No such complex of an APSES domain has yet been crystallized.
+
In order to understand how specific residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to study an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. Explanations of a protein's observed properties and functions can't rely on the general fact that it binds DNA; we need to consider details in terms of specific residues and their spatial arrangement. In particular, it would be interesting to correlate the conservation patterns of key residues with their potential to make specific DNA binding interactions. Unfortunately, no APSES domain structure in complex with bound DNA has been solved up to now, and the experimental evidence we have considered in Assignment 2 ([http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=10747782 Taylor ''et al.'', 2000]) is not sufficient to unambiguously define the details of how a DNA double helix might be bound. Moreover, at least two distinct modes of DNA binding are known for proteins of the winged-helix superfamily, of which the APSES domain is a member.
  
''In this assignment you will construct a molecular model of the Mbp1 orthologue in your assigned organism, identify similar structures of distantly related domains for which protein-DNA complexes are known, define whether the available evidence allows you to distinguish between different modes of ligand binding, and assemble a hypothetical complex structure.''
+
''In this assignment you will (1) construct a molecular model of the Mbp1 orthologue in your assigned organism, (2) identify similar structures of distantly related domains for which protein-DNA complexes are known, (3) assemble a hypothetical complex structure, and (4) discuss whether the available evidence allows you to distinguish between different modes of ligand binding.''
  
 
For the following, please remember the following terminology:
 
For the following, please remember the following terminology:
Line 34: Line 34:
 
&nbsp;
 
&nbsp;
  
A brief overview article on the construction and use of homology models is linked to the resource section at the bottom of this page. That section also contains links to other sites and resources you might require.
+
A brief overview article on the construction and use of homology models is linked to the resource section at the bottom of this page. That section also contains links to other sites and resources you might find useful or interesting.
  
 
{{Template:Preparation|
 
{{Template:Preparation|
Line 40: Line 40:
 
num=4|
 
num=4|
 
ord=fourth|
 
ord=fourth|
due = Monday, November 5 at 10:00 in the morning}}
+
due = Monday, November 12 at 10:00 in the morning}}
  
  
Line 47: Line 47:
 
</div>
 
</div>
  
 
<!--
 
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
===Choosing a template (1 marks)===
+
===(1.1) Template choice and sequence (1 mark)===
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
Often more than one related structure can be found in the PDB. We have discussed principles of selecting template structures in the lecture. Interestingly the PDB itself cannot be searched for the contents of its holdings, by structural- or sequence similarity, but there is always BLAST since the NCBI conveniently allows you to search against all sequences in PDB files.
+
Often more than one related structure can be found in the PDB. We have touched on principles of selecting template structures in the lecture and there is a short summary of [[Template_choice_principles|template choice principles]] on this Wiki. One can either search the PDB itself through its '''Advanced Search''' interface; for example one can search for sequence similarity with a BLAST search, or search for structural similarity by accessing structures according to their CATH or SCOP classification. But one can always also use the BLAST interface at the NCBI, since the sequences contained in PDB files are accessible as a database subsection on the BLAST menu.
  
*Use BLAST to identify all PDB files that contain APSES domains that are clearly homologuous to your target. (Document that you have searched in the correct subsection of the Genbank holdings). For the hits you find, consider how these structures differ and which features would make each more or less suitable for your task. Comment briefly on what options you have, select one template and note why you have decided to use this particular structure as a template. Include aspects of sequence similarity, length of the sequence, presence or absence of ligands and their potential effect on the structure, and experimental method and quality in your reasoning.
+
<div style="padding: 5px; background: #DDDDEE;">
 +
*Use the NCBI BLAST interface to identify all PDB files that are clearly homologous to your target APSES domain, if you haven't already done so in Assignment 2. Document that you have searched in the correct subsection of the database by selecting "pdb" on the database options menu. For the hits you find, consider how these coordinate sets differ and which features would make each more or less suitable for your task by commenting briefly on  
 +
:*sequence similarity to your target
 +
:*size of expected model (length of alignment)
 +
:*presence or absence of ligands
 +
:*experimental method and quality of the data set
 +
Then choose the '''template''' you consider the most suitable and note why you have decided to use this template.
 +
 
 +
* Retrieve the most suitable template structure coordinate file from the PDB.
  
*Note which sequence is contained in the coordinate section of the PDB file; note if and how this implied sequence differs from the sequences ...
+
(0.5 marks)
 +
</div>
  
:*listed in the seqres records;
+
It is not at all straightforward to number sequences in such a project. The "natural" numbering starts with the start-codon of the full length protein and goes sequentially from there. However, this does not map exactly to other numbering schemes we have encountered. As you know, the first residue of the APSES domain as the CDD defines it is not Residue 1 of the Mbp1 protein. The first residue of e.g. the 1MB1 FASTA file '''is''' the first residue of the Mbp1 protein, but the last five residues are an artificial His tag. Is H125 of 1MB1 thus equivalent to R125 in MBP1_SACCE? The N-terminus of the Mbp1 crystal structure is disordered. The first residue in the structure is ASN 3, therefore N is the first residue in a FASTA sequence derived from the coordinate section of the PDB file (the <code>ATOM  </code> records), whereas the SEQRES records start with MET ... and so on. You need to remember: a sequence number is not absolute, but derived from a particular context.
:*given in the FASTA sequence for the template that the PDB provides;
 
:*and that stored by the NCBI.
 
  
* In a table, establish the correspondence of the coordinate sequence numbering (defined by the residue numbers/insertion codes in the atom records) with your target sequence numbering.
+
The homology model will be based on an alignment of target and template. Thus we have to define the target sequence. As discussed in class, PDB files have an explicit and an implied sequence and these do not necessarily have to be the same. To compare the implied and the explicit sequence for the template, you need to extract sequence information from coordinates. One way to do this is via the Web interface for [http://swift.cmbi.ru.nl/servers/html/index.html '''WhatIf'''], a crystallography and molecular modeling package that offers many useful tools for coordinate manipulation tasks.
  
* Retrieve the most suitable template structure coordinate file from the PDB.
+
<div style="padding: 5px; background: #DDDDEE;">
 +
*Navigate to the '''Administration''' sub-menu of the [http://swift.cmbi.ru.nl/servers/html/index.html WhatIf Web server]. Follow the link to '''Make sequence file from PDB file'''. Enter the PDB-ID of your template into the form field and '''Send''' the request to the server. The server accesses the PDB file and extracts sequence information directly from the <code>ATOM&nbsp;&nbsp;</code> records of the file. The results will be returned in PIR format. Copy the results, edit them to FASTA format and save them in a text-only file. Make sure you create a valid FASTA formatted file! Use this '''implied''' sequence to check if and how it differs from the sequence ...
  
-->
+
:*... listed in the <code>SEQRES</code> records of the coordinate file;
 +
:*... given in the FASTA sequence for the template, which is provided by the PDB;
 +
:*... stored in the protein database of the NCBI.
 +
: and record your results.
  
 +
* In a table, establish how the sequence numbers in the coordinate section of your template(*) correspond to your target sequence numbering.
  
 +
(0.5 marks)
 +
</div>
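As a cross-check on the sequence returned by the WhatIf server, the implied sequence can also be read directly from the coordinate section. The following is a minimal sketch, not part of the original assignment: it assumes a standard fixed-column PDB file, considers only <code>ATOM&nbsp;&nbsp;</code> records of one chain, and reports unknown residue names as "X". The function name <code>implied_sequence</code> and the default chain ID "A" are illustrative assumptions; older files may carry a blank chain identifier instead.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def implied_sequence(pdb_lines, chain_id="A"):
    """Return (first_residue_number, sequence) from the ATOM records of one chain.

    Hypothetical helper: relies on the fixed PDB columns for residue name
    (18-20), chain ID (22), residue number (23-26) and insertion code (27).
    """
    residues = []   # (residue number, one-letter code) in file order
    seen = set()    # (residue number, insertion code) pairs already recorded
    for line in pdb_lines:
        if not line.startswith("ATOM  ") or len(line) < 27:
            continue
        if line[21] != chain_id:
            continue
        key = (int(line[22:26]), line[26])
        if key in seen:
            continue  # further atoms of a residue we already have
        seen.add(key)
        residues.append((key[0], THREE_TO_ONE.get(line[17:20], "X")))
    if not residues:
        return None, ""
    return residues[0][0], "".join(code for _, code in residues)
```

Calling the function with the lines of a coordinate file (for example <code>implied_sequence(open("template.pdb").read().splitlines())</code>, where the file name is a placeholder) yields the number of the first ordered residue together with the one-letter sequence, which can then be compared with the <code>SEQRES</code> records and the FASTA files.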
 +
 +
:(*) <small>These residue numbers are important, since they are referenced e.g. by VMD when you visualize the structure. The easiest way to list them is via the ''Sequence Viewer'' extension of VMD.</small>
 +
:<small>Don't do this for every residue individually but define ranges. Look at the correspondence of the first and last residue of target and template sequence and take indels into account. Establishing sequence correspondence precisely is crucially important! For example, when a publication refers to a residue by its sequence number, you have to be able to relate that number to the residue numbers of the model as well as your target sequence.</small>
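The range-based correspondence described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed method: both sequences are taken from the same alignment (equal length, hyphens as gap characters), residue numbers are assumed to increase by one without insertion codes or chain breaks, and the function name <code>numbering_ranges</code> is hypothetical.

```python
def numbering_ranges(template_aln, target_aln, template_start, target_start):
    """Map template residue numbers to target residue numbers as ranges.

    Returns a list of ((tpl_first, tpl_last), (tgt_first, tgt_last)) tuples,
    one for each gap-free stretch of the alignment.
    """
    if len(template_aln) != len(target_aln):
        raise ValueError("aligned sequences must have the same length")
    ranges = []
    tpl_num, tgt_num = template_start, target_start
    run = None  # [tpl_first, tpl_last, tgt_first, tgt_last] of the open stretch
    for a, b in zip(template_aln, target_aln):
        if a != "-" and b != "-":
            if run is None:
                run = [tpl_num, tpl_num, tgt_num, tgt_num]
            else:
                run[1], run[3] = tpl_num, tgt_num
        elif run is not None:
            # An indel closes the current stretch.
            ranges.append(((run[0], run[1]), (run[2], run[3])))
            run = None
        # Advance each counter only where that sequence has a residue.
        if a != "-":
            tpl_num += 1
        if b != "-":
            tgt_num += 1
    if run is not None:
        ranges.append(((run[0], run[1]), (run[2], run[3])))
    return ranges
```

For an alignment without indels the result is a single range, and the table reduces to one constant offset between template and target numbering.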
 
&nbsp;
 
&nbsp;
 
&nbsp;
 
&nbsp;
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
=== The input alignment (1 marks)===
+
 
 +
===(1.2) The input alignment (1 mark)===
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
  
The sequence alignment between target and template is the single most important factor that determines the quality of your model.
+
The sequence alignment between target and template is the single most important factor that determines the quality of your model. No comparative modeling process will repair an incorrect alignment; it is useful to consider a homology model rather like a three-dimensional map of a sequence alignment rather than a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these only because they are convenient. You should take advantage of the much more sophisticated alignment methods available. Analysis of wrong models can't be expected to produce right results.
  
No homology modeling process will repair an incorrect alignment and it is useful to consider a homology model rather like a three-dimensional map of a sequence alignment, rather than a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these just because they are convenient, rather than the more sophisticated methods and more informed procedures we have discussed. Detailed analysis of fallacious models rarely leads to good results.
+
The best possible alignment is usually constructed from a multiple sequence alignment that includes at least '''the target and template sequence''' and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. Your alignment should have been carefully reviewed by you and wherever required, manually adjusted to move insertions or deletions between target and template out of the secondary structure elements of the template structure.
  
The best possible alignment is usually constructed from a multiple sequence alignment that includes at least the target and template sequence and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. Typically such an alignment will also include additional optimization steps to move insertions or deletions between target and template out of the secondary structure elements of the template structure.
+
In the case of Mbp1 genes however, all orthologues we have considered have no indels in the APSES domain regions. Evolutionary pressure on the APSES domains has selected against indels in the more than 600 million years these sequences have evolved independently in their respective species.
  
Here is an excerpt from the T-coffee aligned Mbp1 sequences: it contains all the residues of the yeast sequence that are found in the 1MB1 crystal structure - the '''template''' sequence for our homology model - and it has been edited to remove the N-terminal gaps in the sequence. Thus the N-terminus is 21 amino acids longer than the definition of the APSES domain in CDD (which starts with <code>SIMKR...</code>), the C- terminus is slightly shorter.  
+
Accordingly, all we need to do is to write the APSES domain sequences one under the other.
  
Since the sequences are very similar to each other, there is no ambiguity in the alignment and the construction of a homology model should be straightforward. Normally one would spend considerable effort at this stage to consider which parts of the target sequence and the template sequence appear to be correctly aligned and to edit the alignment manually. In our case, evolutionary pressure was so strong that essentially all have evolved without a single indel in their sequence.
+
<div style="padding: 5px; background: #DDDDEE;">
 
+
* Copy the FASTA formatted sequence for the APSES domain of your organism's Mbp1 orthologue from the sequences [[All_APSES_domains|defined in Assignment 3]] and save it as FASTA formatted text file. This is your '''target''' sequence. Compare this with the FASTA formatted file you have extracted from the PDB coordinate set. This is your '''template''' sequence. Then generate a multi-FASTA formatted file that contains both sequences, and '''pad''' the sequence(s) where required with hyphens as gap characters, so that target and template sequences have exactly the same length and are aligned. Refer to the [[Assignment_4_fallback_data|'''Fallback data''']] if you are not sure about the format.  
I have added to the alignment the APSES domain of [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=116197493&dopt=GenPept XP_001224558], the ''Chaetomium globosum'' Mbp1 orthologue (MBP1_CHAGL). This will serve as the reference and fallback sequence.
 
 
 
1MB1            NQIYSARYSGVDVYEFIHSTG---SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEV
 
MBP1_CANGL      NQIYSAKYSGVDVYEFIHPTG---SIMKRKNDGWVNATHILKAANFAKAKRTRILEKEV
 
MBP1_EREGO      TQIYSAKYSGVEVYEFLHPTG---SIMKRKADDWVNATHILKAAKFAKAKRTRILEKEV
 
MBP1_KLULA      NQIYSAKYSGVDVYEFIHPTG---SIMKRKADNWVNATHILKAAKFPKAKRTRILEKEV
 
MBP1_CANAL      SQIYSATYSNVPAFEFVTSEG---PIMRRKKDSWINATHILKIAKFPKAKRTRILEKDV
 
MBP1_DEBHA      TQIYSATYSNVPVFEFVTLEG---PIMRRKLDSWINATHILKIAKFPKAKRTRILEKDV
 
MBP1_YARLI      MSIYKATYSGVPVYEFQCKNV---AVMRRKSDGWVNATHILKVAGFDKPQRTRILEKEV
 
MBP1_SCHPO      SAVHVAVYSGVEVYECFIKGV---SVMRRRRDSWLNATQILKVADFDKPQRTRVLERQV
 
MBP1_USTMA      KTIFKATYSGVPVYECIINNV---AVMRRRSDDWLNATQILKVVGLDKPQRTRVLEREI
 
MBP1_ASPNI      SNVYSATYSSVPVYEFKIGTD---SVMRRRSDDWINATHILKVAGFDKPARTRILEREV
 
MBP1_ASPTE      SKIYSATYSSVPVYEFKIEGD---SVMRRRADDWINATHILKVAGFDKPARTRILEREV
 
MBP1_CRYNE      PKVYASVYSGVPVFEAMIRGI---SVMRRASDSWVNATQILKVAGVHKSARTKILEKEV
 
MBP1_GIBZE      G-IYSASYSGVDVYEMEVNNI---AVMRRRNDSWLNATQILKVAGVDKGKRTKILEKEI
 
MBP1_NEUCR      IYSLQATYSGVGVYEMEVNNV---AVMRRQKDGWVNATQILKVANIDKGRRTKILEKEI
 
MBP1_MAGGR      P-IYTAVYSNVEVYEFEVNGV---AVMKRIGDSKLNATQILKVAGVEKGKRTKILEKEI
 
MBP1_ASPFU      PQIYKAVYSNVSVYEMEVNGV---AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEI
 
MBP1_CHAGL      AGIYSATYSGIPVYEYQFGPDMKEHVMRRREDNWINATHILKAAGFDKPARTRILERDV
 
 
1MB1            LKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLF
 
MBP1_CANGL      LKEMHEKVQGGFGKYQGTWVPLNIAINLAEKFDVYQDLKPLF
 
MBP1_EREGO      IKDTHEKVQGGFGKYQGTWVPLDIARRLAQKFEVLEELRPLF
 
MBP1_KLULA      ITDTHEKVQGGFGKYQGTWIPLELASKLAEKFEVLDELKPLF
 
MBP1_CANAL      QTGIHEKVQGGYGKYQGTYVPLDLGAAIARNFGVYDVLKPIF
 
MBP1_DEBHA      QTGVHEKVQGGYGKYQGTYVPLDLGADIAKNFGVFDSLRPIF
 
MBP1_YARLI      QKGVHEKVQGGYGKYQGTWVPLERAREIATLYDVDSHLAPIF
 
MBP1_SCHPO      QIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPIL
 
MBP1_USTMA      QKGIHEKVQGGYGKYQGTWIPLDVAIELAERYNIQGLLQPIT
 
MBP1_ASPNI      QKGVHEKVQGGYGKYQGTWIPLQEGRQLAERNNILDKLLPIF
 
MBP1_ASPTE      QKGVHEKVQGGYGKYQGTWIPLPEGRLLAERNNIIDKLRPIF
 
MBP1_CRYNE      LNGIHEKIQGGYGKYQGTWVPLDRGRDLAEQYGVGSYLSSVF
 
MBP1_GIBZE      QTGEHEKVQGGYGKYQGTWIKFERGLQVCRQYGVEELLRPLL
 
MBP1_NEUCR      QIGEHEKVQGGYGKYQGTWIPFERGLEVCRQYGVEELLSKLL
 
MBP1_MAGGR      QTGEHEKVQGGYGKYQGTWIKYERALEVCRQYGVEELLRPLL
 
MBP1_ASPFU      AAGEHEKVQGGYGKYQGTWVNYQRGVELCREYHVEELLRPLL
 
MBP1_CHAGL      QKDVHEKIQGGYGKYQGTWIPLEQGRALAQRNNIYDRLRPIF
 
  
 +
(1 mark)
 +
</div>
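The padding step can be illustrated with a short Python sketch. It is only a sketch under assumptions: it pads the C-terminal end of shorter sequences with hyphens so that all rows have the same length, and it does not place internal or N-terminal gaps, which still have to be set by hand from the alignment. The function name <code>padded_multi_fasta</code> and the headers used in the usage note are hypothetical.

```python
def padded_multi_fasta(records, width=60):
    """Return multi-FASTA text with all sequences padded to equal length.

    records: list of (header, sequence) pairs; sequences may already
    contain '-' gap characters. Shorter rows are right-padded with '-'.
    """
    longest = max(len(seq) for _, seq in records)
    out = []
    for header, seq in records:
        seq = seq.ljust(longest, "-")
        out.append(">" + header)
        # Wrap the sequence into lines of at most `width` characters.
        out.extend(seq[i:i + width] for i in range(0, longest, width))
    return "\n".join(out) + "\n"
```

For example, <code>padded_multi_fasta([("template", "ABCDE"), ("target", "ABC")])</code> pads the second sequence to <code>ABC--</code>. Save the returned text as a text-only file before uploading it.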
 +
&nbsp;<br>
 
&nbsp;<br>
 
&nbsp;<br>
  
It should be obvious to you by now how you can copy a string of amino acids from such an alignment and create a FASTA file. However we need to take a little detour: this detour brings us to the question of sequence numbers.
+
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
  
It is not at all straightforward to number sequences in such a project. The "natural" way would be to start a sequential numbering from the start-codon of the full length protein and go sequentially from there. However imagine what would happen if a curator discovered that one of the splice-sites for a gene has been missed in automatic annotation. All of a sudden a corrected sequence would have a different length than the one that may have been used for earlier studies. Unfortunately, there is no mechanism (''wouldn't it be nice!'') that automatically goes back through the literature and your lab-journal and updates the revised sequence numbering... But there are other possible complications regarding sequence numbers. The first residue of the CDD-APSES domain is not Residue 1 of the Mbp1 protein. The first residue of the 1MB1 FASTA file ''is'' the first residue of the Mbp1 protein, but the last five residues are an artificial His tag. Is H125 of 1MB1 the equivalent residue to R125 in MBP1_SACCE? The N-terminus of the Mbp1 crystal structure is disordered. The first residue in the structure is ASN 3, whereas the SEQRES records start with MET ... and so on. The take-home message is that a sequence number is nothing absolute, but something that makes sense only in a particular context. To emphasize this, we will write a FASTA header for our '''target''' sequence that lists the residues of the source sequence it corresponds to. In terms of actual sequence numbering, we will adopt the numbering of the 1MB1 protein throughout to be able to consistently label particular amino acids.
+
==(2) Homology model==
 +
</div>
 +
&nbsp;
 +
&nbsp;
  
Access the sequence of "your" organism's Mbp1 Orthologue at UniProt. (You can use the links I have provided in the table below).
+
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 +
=== (2.1) SwissModel (1 mark)===
 +
</div>
 +
&nbsp;<br>
  
 +
Access the Swissmodel server at '''http://swissmodel.expasy.org''' . Navigate to the '''Alignment Interface'''.
  
<table style="border-left:1px solid #AAAAAA; border-bottom:1px solid #AAAAAA;" cellpadding="10" cellspacing="0">
+
&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
<tr style="background: #BDC3DC;">
+
*Paste your alignment for target and model into the form field. Refer to the [[Assignment_4_fallback_data|'''Fallback Data file''']] if you are not sure about the format. Make sure to select the correct option for the alignment input format on the form.
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b><i>Organism</i></b></td>
+
:<small>(You have to choose the correct format, and, if e.g. you choose a CLUSTAL format, you have to include a header line and a blank line. In the past we have seen problems with uploading alignments that have not been saved as "text only", and with periods (i.e. ".") in sequence names of CLUSTAL formatted alignments. Underscores appear to be safe.)</small>
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b>Uniprot Accession</b></td>
 
</tr>
 
  
<tr style="background: #FFFFFF;">
+
* Click '''submit alignment''' and on the returned page define your '''target''' and '''template''' sequence. For the '''template sequence''' define the PDB ID of the coordinate file. Enter the correct Chain-ID.
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus fumigatus</i></td>
+
:<small>Recently the PDB has undergone a "remediation" process in which archived coordinate files were altered by the database to conform to new format standards. One of the changes was to assign a chain identifier of "A" to all chains that did not previously have a chain identifier. SwissModel uses a derivative of coordinate sets from the PDB (a dataset they call ExPDB). Apparently the PDB proper and ExPDB have now gone out of synchrony; when I entered the (correct, according to PDB) chain designation "A" for 1MB1, SwissModel rejected the alignment with a nondescript error message. When I entered an underscore "_" instead, which would be the designation for a chain without explicit chain identifier, such as the pre-remediation version of the coordinates, the alignment was accepted and processed. I have e-mailed SwissModel about the problem; they are in the process of correcting it and may or may not be done while you are working on your assignments. If your template chain has the chain identifier "A" and your alignment gets rejected, try entering an underscore instead.</small>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q4WGN2_ASPFU Q4WGN2]</td>
+
:<small>'''Enter''' the correct chain ID into the form-field even if you think it already appears there, don't simply accept the preloaded default. There is a bug in SwissModel's parser code that may cause incorrect strings to be sent to the server from that field. I have e-mailed SwissModel about the problem which may or may not be corrected while you are working on your assignments.</small>
</tr>
 
  
<tr style="background: #E9EBF3;">
+
*Click '''submit alignment''' and review the alignment on the returned page. Make sure it has been interpreted correctly by the server. The conserved residues have to be lined up and matching. Then click '''submit alignment''' again, to start the modeling process.
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus nidulans</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q5B8H6_EMENI Q5B8H6]</td>
 
</tr>
 
  
<tr style="background: #FFFFFF;">
+
* The returned page provides information about the resulting model. Save the '''model coordinates''' on your computer. Read the information on what is being returned by the server (click on the red question mark icon). Paste the Anolea profile into your assignment.
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus terreus</i></td>
+
:<small>Do not paste a screenshot of the result, but copy and paste the image from the Web-page! You do not need to submit the actual coordinate files with your assignment.</small>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q0CQJ5_ASPTE Q0CQJ5]</td>
 
</tr>
 
  
<tr style="background: #E9EBF3;">
+
(1 mark)
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Candida albicans</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q5ANP5_CANAL Q5ANP5]</td>
 
</tr>
 
 
 
<tr style="background: #FFFFFF;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Candida glabrata</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q6FWD6_CANGL Q6FWD6]</td>
 
</tr>
 
 
 
<tr style="background: #E9EBF3;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Cryptococcus neoformans</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q5KHS0_CRYNE Q5KHS0]</td>
 
</tr>
 
 
 
<tr style="background: #FFFFFF;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Debaryomyces hansenii</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q6BSN6_DEBHA Q6BSN6]</td>
 
</tr>
 
 
 
<tr style="background: #E9EBF3;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Eremothecium gossypii</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q752H3_ASHGO Q752H3]</td>
 
</tr>
 
 
 
<tr style="background: #FFFFFF;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Gibberella zeae</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q4IEY8_GIBZE Q4IEY8]</td>
 
</tr>
 
 
 
<tr style="background: #E9EBF3;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Kluyveromyces lactis</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=MBP1_KLULA P39679]</td>
 
</tr>
 
 
 
<tr style="background: #FFFFFF;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Magnaporthe grisea</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q3S405_MAGGR Q3S405]</td>
 
</tr>
 
 
 
<tr style="background: #E9EBF3;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Neurospora crassa</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q7SBG9_NEUCR Q7SBG9]</td>
 
</tr>
 
 
 
<tr style="background: #FFFFFF;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Saccharomyces cerevisiae</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=MBP1_YEAST P39678]</td>
 
</tr>
 
 
 
<tr style="background: #E9EBF3;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Schizosaccharomyces pombe</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=RES2_SCHPO P41412]</td>
 
</tr>
 
 
 
<tr style="background: #FFFFFF;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Ustilago maydis</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q4P117_USTMA Q4P117]</td>
 
</tr>
 
 
 
<tr style="background: #E9EBF3;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Yarrowia lipolytica</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q6CGF5_YARLI Q6CGF5]</td>
 
</tr>
 
 
 
</table>
 
 
 
 
 
<div style="padding: 5px; background: #EEEEEE;">
 
*Copy your organism's Mbp1 sequence from the alignment above. Then define the start- and end- sequence numbers of the '''target''' sequence relative to the full-length protein. Prepare a FASTA formatted file for the '''target''' sequence in your organism, giving it an appropriate header and include the sequence numbers. Refer to the [[Assignment_5_fallback_data|'''Fallback data''']] file if you are not sure about the format. (1 mark)
 
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
 +
In case you do not wish to submit the modelling job yourself, or have insurmountable problems when using the SwissModel interface, you may access the result files from the  [[Assignment_4_fallback_data|'''Fallback Data file''']]. Document the problems and note this in your assignment.
  
Your FASTA sequence should look similar to this:
 
  
  >1MB1: Mbp1_SACCE 1..100
+
<div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;">
NQIYSARYSGVDVYEFIHSTG---SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEV
 
LKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLF
 
  
 +
==(3) Model analysis==
 +
</div>
 
&nbsp;
 
&nbsp;
 
&nbsp;
 
&nbsp;
  
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
+
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 +
=== (3.1) The PDB file (1 mark)===
 +
</div>
 +
&nbsp;<br>
 +
 
 +
Open your '''model''' coordinates in a text-editor (make sure you view the PDB file in a fixed-width font) and consider the following questions: (Alternatively, view the coordinates linked to the [[Assignment_5_fallback_data|'''Fallback Data file''']].)
  
==(2) Homology model==
+
&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
 +
*What is the residue number of the first residue in the '''model'''? What should it be, based on the alignment? If the putative DNA binding region was reported to be residues 50-74 in the Mbp1 protein, which residues of the '''model''' correspond to that?
 +
(1 mark)
 
</div>
 
</div>
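
The residue numbers can also be read programmatically instead of in a text editor. The sketch below assumes a standard fixed-column PDB file, in which the residue sequence number occupies columns 23 to 26 of each <code>ATOM</code> record. The second function applies a simple constant offset between full-length numbering and model numbering; this is only valid if there are no insertions or deletions between the first residue and the residue of interest, so check your alignment for gaps. Both function names are hypothetical.

```python
def first_residue_number(pdb_text: str) -> int:
    """Return the residue number of the first ATOM record in a PDB file."""
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # Columns 23-26 (1-based) hold the residue sequence number.
            return int(line[22:26])
    raise ValueError("no ATOM records found")


def to_model_numbering(residue: int, first_in_protein: int, first_in_model: int) -> int:
    """Map a full-length residue number to model numbering by a constant offset.

    Assumes the model is an ungapped stretch of the protein over this range.
    """
    return residue - first_in_protein + first_in_model
```

For example, if the modelled sequence started at residue 4 of the full-length protein but the model were numbered from 1, <code>to_model_numbering(50, 4, 1)</code> would map residue 50 to model residue 47. These numbers are made up; determine the real ones from your own model and alignment.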
 +
 +
<!-- discuss flagging of loops - setting of B-factor to 99.0 phps. ANOLEA vs. Gromos ... packing vs. energy? -->
 
&nbsp;
 
&nbsp;
 
&nbsp;
 
&nbsp;
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
=== (2.1) SwissModel (1 mark)===
+
===(3.2) First visualization (1 mark)===
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
  
Access the Swissmodel server at [http://swissmodel.expasy.org '''http://swissmodel.expasy.org'''] . Navigate to the '''Alignment Interface'''.
+
In assignment 2 you have already studied an Mbp1 structure and compared it with your organism's Mbp1 APSES domain. Since a homology model inherits its structural details from the '''template''', the model should look very similar to the original structure but contain the sequence of the '''target'''.
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
*Copy from the alignment above the 1MB1 sequence and the sequence from your organism, and paste it into the form field. Refer to the [[Assignment_5_fallback_data|'''Fallback Data file''']] if you are not sure about the format.
+
*Save your '''model''' coordinates to your harddisk and visualize the structure in VMD. (Alternatively, copy and save the coordinates linked to the [[Assignment_4_fallback_data|'''Fallback Data file''']] to your harddisk.) Make an informative stereo view that shows the general orientation of the helix-turn-helix motif and the "wing", and paste it into your assignment.
:(You have to choose the format, and, if e.g. you choose a CLUSTAL format, you have to include a header line and a blank line. Other common problems uploading your alignment may include uploading a file that has not been saved as "text only" and periods i.e.  "."  in sequence names. Underscores appear to be safe.)
 
  
* Click '''submit''' and define your '''target''' and '''template''' sequence. For the '''template sequence''' define the coordinate file and chain. (In our case the coordinate file is <code>'''1MB1'''</code> and the chain is "<code>'''_'''</code>", i.e. none, since the PDB file does not contain more than one chain.)
+
* Discuss briefly which parts of the model may be unreliable and color these (if any) distinctly in your submitted image.
  
*Click '''submit''' and request the construction of a homology model: enter your e-mail address and check the button for '''Normal Mode''', not "Swiss-PDB Viewer mode". (This is important, since there will be problems with the output otherwise.) Click '''submit'''. You should receive four files by e-mail within half an hour or so. (1 mark)
+
(1 mark)
 
 
(You do not need to submit any coordinate files with your assignment.)
 
  
 
</div>
 
</div>
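
If you choose the CLUSTAL format for your alignment, the header line and the blank line that follows it are easy to forget. The following is a minimal sketch of how such a file could be written from two aligned sequences; it is not taken from the SwissModel documentation, and whether the server accepts exactly this header text is an assumption. The function name and the check on sequence names are hypothetical conveniences.

```python
def clustal_format(alignment: dict, width: int = 60) -> str:
    """Format aligned sequences as CLUSTAL-style text.

    `alignment` maps sequence names to gapped sequences of equal length.
    The output starts with a header line and a blank line, followed by
    blocks of at most `width` alignment columns.
    """
    names = list(alignment)
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    for name in names:
        # Periods and spaces in names are a common cause of upload problems.
        if "." in name or " " in name:
            raise ValueError(f"unsafe character in sequence name: {name!r}")
    pad = max(len(name) for name in names) + 2
    total = lengths.pop()
    lines = ["CLUSTAL W multiple sequence alignment", ""]
    for start in range(0, total, width):
        for name in names:
            lines.append(name.ljust(pad) + alignment[name][start:start + width])
        lines.append("")  # a blank line separates blocks
    return "\n".join(lines)
```

Save the result as plain text, for example with <code>open("alignment.aln", "w").write(clustal_format(...))</code>, where the file name is again only an example.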
 
&nbsp;<br>
 
&nbsp;<br>
In case you do not wish to submit the modelling job yourself, you can access the result files from the [[Assignment_5_fallback_data|'''Fallback Data file''']].
+
&nbsp;<br>
 
 
  
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
  
==(3) Model analysis==
+
==(4) The DNA ligand==
 
</div>
 
</div>
 
&nbsp;
 
&nbsp;
Line 276: Line 193:
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
=== (3.1) The PDB file (1 mark)===
+
 
 +
===(4.1) Finding a similar protein-DNA complex (1 mark)===
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
  
Open your  '''model''' coordinates in a text-editor (make sure you view the PDB file in a fixed-width font) and consider the following questions: (Alternatively, view the coordinates linked to the [[Assignment_5_fallback_data|'''Fallback Data file''']].)
+
One of the really interesting questions we can discuss with reference to our model is how sequence variation might be converted into changing DNA recognition sites, and then lead to changed cognate DNA binding sequences. But in order to address this, we would need to add a plausible model for how DNA is bound to APSES domains.
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
Since there is currently no software available that would accurately model such a complex from first principles, we will base a model of a bound complex on homology modeling as well. This means we need to find a similar structure for which the position of bound DNA is known, then superimpose that structure with our model. This places the DNA molecule into the spatial context of the model we are studying. However, you may remember from the third assignment that the APSES domains in fungi seem to be a relatively small family. And there is no structure available of an APSES domain-DNA complex. How can we find a coordinate set of a structurally similar protein-DNA complex?
*What is the residue number of the first residue in the '''model'''? What should it be, based on the alignment? If the putative DNA binding region was reported to be residues 50-74 in the Mbp1 protein, which residues of the '''model''' correspond to that? (1 mark)
+
 
</div>
+
Remember that homologous sequences can have diverged to the point where their sequence similarity is no longer recognizable, while their structure may still be quite well conserved. Thus if we could find similar structures in the PDB, these might provide us with some plausible hypotheses for how DNA is bound by APSES domains. We thus need a tool similar to BLAST, not for sequence alignment but for structure alignment: a kind of BLAST for structures. Just like with sequence searches, we might not want to search with the entire protein if all we are interested in is a subdomain that binds to DNA. Attempting to match all structural elements in addition to the ones we are actually interested in is likely to make the search less specific - we would find false positives that are similar to some irrelevant part of our structure. However, defining too small a subdomain would also lead to a loss of specificity: in the extreme it is easy to imagine that the search for e.g. a single helix would retrieve very many hits that would be quite meaningless.
 +
 
 +
At the '''NCBI''', [http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml VAST] is provided as a tool for structural similarity searches.
  
<!-- discuss flagging of loops - setting of B-factor to 99.0 -->
+
At the '''EBI''' there are a number of very well designed structure analysis tools linked off the [http://www.ebi.ac.uk/Tools/structural.html '''Structural Analysis''' page]. As part of its MSD Services, [http://www.ebi.ac.uk/msd-srv/ssm/ '''MSDfold'''] provides a convenient interface for structure searches.
  
&nbsp;
+
However, we have also read previously that the APSES domains are members of a much larger superfamily, the "winged helix" DNA binding domains, of which hundreds of structures have been solved.
&nbsp;
 
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
===(3.2) first visualization (3 marks)===
 
</div>
 
 
&nbsp;<br>
 
&nbsp;<br>
  
In assignment 2 you have already studied the 1MB1 coordinate file and compared it to your organism's Mbp1 APSES domain. Since a homology model inherits its structural details from the '''template''', the model should look very similar to the original structure but contain the sequence of the '''target'''.
+
[[Image:A5_Mbp1_subdomain.jpg|frame|none|Stereo-view of a subdomain within the 1MB1 structure that includes residues 36 to 76. The color gradient ramps from blue (36) to green (76) and the "wing" is clearly seen as the green pair of beta-strands, extending to the right of the helix-turn-helix motif.]]
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
&nbsp;<br>
*Save the attachment of your '''model''' coordinates to your harddisk and visualize it in RasMol. (Alternatively, copy and save the coordinates from the  [[Assignment_5_fallback_data|'''Fallback Data file''']] to your harddisk.) Make an informative view, divergent stereo and paste it into your assignment. (3 marks)
 
  
</div>
+
APSES domains represent one branch of the tree of helix-turn-helix (HTH) DNA binding modules. (A recent review on HTH proteins is linked from the resources section at the bottom of this page). Winged Helix domains typically bind their cognate DNA with a "recognition helix" which precedes the beta hairpin and binds into the major groove; additional stabilizing interactions are provided by the edge of a beta-strand binding into the minor groove. This is good news: once we have determined that the APSES domain is actually an example of a larger group of transcription factors, we can compare our model to a structure of  a protein-DNA complex. CATH does not provide information on complexes, but we can search the PDB with CATH codes in the following way:
&nbsp;<br>
 
  
 +
* Access [http://cathwww.biochem.ucl.ac.uk/cgi-bin/cath/GotoCath.pl?cath=1.10.10.10 CATH domain 1.10.10.10].
 +
* Navigate to the [http://www.pdb.org/ PDB home page] and follow the link to [http://www.pdb.org/pdb/search/advSearch.do Advanced Search]
 +
* In the options menu for "Choose a Query Type" select Structure Features &rarr; CATH classification. A window will open that allows you to navigate down through the CATH tree. The interface is awkward because it does not display the actual CATH codes along with the class names, but you can view the class names on the CATH page linked above. Click on '''the triangle icons''' before "Mainly Alpha"&rarr;"Orthogonal Bundle"&rarr;"ARC repressor mutant, subunit A" then click on the link to "winged helix repressor DNA binding domain". As of this writing, this subquery matches 295 structures.
 +
* Click on the (+) button behind the subquery to add an additional query. Select the option "Structure Summary"&rarr;"Molecule / Chain type". In the option menus that pop up, select "Contains Protein &rarr; Yes", "Contains DNA &rarr; Yes", "Contains RNA &rarr; Ignore". This selects files that contain Protein-DNA complexes.
 +
* Check the box below this subquery to "Remove Similar Sequences at 90% identity" and click on "Evaluate Query". As of this writing, seventy complexes were returned.
 +
* In the left-hand menu, under the Tabulate section, click on the "Collage" function to display icons of the structure files. This is a fast way to obtain an overview of the structures that have been returned. First of all you may notice that in fact not all of the structures are really different, despite selecting only to retrieve dissimilar sequences. This appears to be a deficiency of the algorithm. But you can also easily recognize how the recognition helix inserts into the major groove of most of the structures that were returned (at least those where the domain is not a very small part of a much larger complex). There is one exception: the structure 1DP7 shows how the human RFX1 protein binds DNA in a non-canonical way. We shall use structural superposition of your homology model and two of the winged-helix proteins to decide which mode of DNA binding seems to be more plausible for Mbp1 homologues.
  
[[Image:A5_Mbp1_subdomain.jpg|frame|none|Stereo-view of a subdomain within the 1MB1 structure that includes residues 36 to 76. The color gradient ramps from blue (36) to green (76).]]
+
&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
 +
* Follow the procedure outlined above, from a CATH entry page up to viewing a Collage (or alternatively a tabular view) of the retrieved coordinate files. You can be maximally concise documenting the procedure I have defined above, but do spend a bit of time to understand the key elements of the PDB's advanced search interface.
  
&nbsp;
+
(1 mark)
&nbsp;
+
</div>
  
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
  
===(3.3) modeling a DNA ligand (4 marks)===
+
===(4.2) Preparation and superposition of a canonical complex (1 mark)===
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
  
The really interesting question we could begin to address with our model is how sequence variation might be converted into changing DNA recognition sites, and then lead to changed cognate DNA binding sequences. But in order to address this, we would need to add a plausible model for a bound DNA molecule to our model.
+
The structure we shall use as a reference for the canonical binding mode is the Elk-1 transcription factor.
  
Since there is currently no software available that would accurately model such a complex from first principles, we will base this on homology modeling as well. This means we need to find a similar structure for which the complex structure is known. However, you may remember from the third assignment that the APSES domains in fungi seem to be a relatively small family. And there is no structure available of a protein-DNA complex. Now what?
+
[[Image:A5_canonical_wHTH.jpg|frame|none|Stereo-view of the canonical DNA binding mode of the Winged Helix domain family. Shown here is the Elk-1 transcription factor - an ETS DNA binding domain - in complex with a high-affinity binding site (1DUX). Note how the "recognition helix" inserts into the major groove of the DNA molecule. The color gradient ramps from blue (34) to green (84). Note how the first helix of the "helix-turn-helix" architecture serves only to position the recognition helix and makes few interactions by itself.]]
  
Remember that homologous sequences can have diverged to the point where their sequence similarity is no longer recognizable, however their structure may be quite well conserved. Thus if we could find similar structures in the PDB, these might provide us with some plausible hypotheses for how DNA is bound by APSES domains. We thus need a tool similar to BLAST, but not for the purpose of sequence alignment, but for structure alignment. A kind of BLAST for structures.
+
The 1DUX coordinate-file contains two protein domains and two B-DNA dimers in one asymmetric unit. For simplicity, let's delete the second copy.
  
However, very similar to BLAST, we might not want to search with the entire protein, if all we are interested in is a subdomain that binds to DNA. Attempting to match all structural elements in addition to the ones we are actually interested in is likely to make the search less specific - we would find false positives that are similar to some irrelevant part of our structure. However, defining too small of a subdomain would also lead to a loss of specificity: in the extreme it is easy to imagine that the search for e.g. a single helix would retrieve very many hits that would be quite meaningless. The arrangement of the residues from 50 to 74 that we have already discussed in Assignment 2 suggests that the compact subdomain from 36 to 76 (see the image above) might be a useful structure to search with: it contains the residues we are interested in and enough of connected secondary structure elements to be structurally meaningful.
+
* Access the PDB and navigate to the 1DUX structure explorer page. Download the coordinates to your computer.
 +
* Open the coordinate file in a text-editor and delete the coordinates for chains <code>D</code>,<code>E</code> and <code>F</code>; you may also delete all <code>HETATM</code> records and the <code>MASTER</code> record. Save the file with a different name, e.g. 1DUX_monomer.pdb .
 +
* Open VMD and load your homology model. Turn off the axes, display the model as a Tube representation in stereo, and color it by Index. Then load your edited 1DUX file, display this coordinate set in a tube representation as well, and color it by ColorID in some color you like. It is important that you can distinguish easily which structure is which.
 +
* You could use the Extensions&rarr;Analysis&rarr;RMSD calculator interface to superimpose the two structures '''if''' you knew which residues correspond to each other. Sometimes it is useful to do exactly that: define exact correspondences between residue pairs and superimpose according to these selected pairs. For our purpose it is much simpler to use the Multiseq tool (and the structures are simple and small enough that the STAMP algorithm for structural alignment can define corresponding residue pairs automatically). Open the '''multiseq''' extension window, select the check-boxes next to both protein structures, and open the '''Tools&rarr;Stamp Structural Alignment''' interface.
 +
* In the "'Stamp Alignment Options'" window, check the radio-button for ''Align the following ...'' '''Marked Structures''' and click on '''OK'''.
 +
* In the '''Graphical Representations''' window, double-click on all "NewCartoon" representations for both molecules, to undisplay them.
 +
* You should now see a superimposed tube model of your homology model and the 1DUX protein-DNA complex. You can explore it, display side-chains etc. and study some of the details of how a transcription factor recognizes and binds to its cognate DNA sequence. However, remember that the model's side-chain orientations have not been experimentally determined but inferred from the template, and that the template's structure was determined in the absence of bound ligand.
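
The manual editing of the 1DUX file described in the steps above can also be sketched as a short script. This is only an illustration under the assumption of a standard fixed-column PDB file, in which the chain identifier is found in column 22 of <code>ATOM</code>, <code>ANISOU</code>, <code>TER</code> and <code>HETATM</code> records; the function name and the file names in the usage note are hypothetical.

```python
def keep_chains(pdb_text: str, keep: set, drop_records=("HETATM", "MASTER")) -> str:
    """Return PDB text restricted to the chains listed in `keep`.

    Records named in `drop_records` are removed entirely. Records that do
    not carry a chain identifier (e.g. HEADER, CRYST1, END) are kept as is.
    """
    chain_records = ("ATOM", "ANISOU", "TER", "HETATM")
    out = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record in drop_records:
            continue
        chain = line[21:22]  # column 22 (1-based) is the chain identifier
        if record in chain_records and chain.strip() and chain not in keep:
            continue
        out.append(line)
    return "\n".join(out) + "\n"
```

Deleting chains <code>D</code>, <code>E</code> and <code>F</code> is then equivalent to keeping the remaining chains, e.g. <code>keep_chains(open("1DUX.pdb").read(), {"A", "B", "C"})</code>, and writing the result to a new file such as 1DUX_monomer.pdb. Inspect the output in a text editor before loading it into VMD.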
  
At the '''NCBI''', [http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml VAST] is a structural similarity search tool for this purpose. Unfortunately it does not seem to be able to handle a query with such a structural subdomain (the process did not finish after several days) but at least you can get a list of structural neighbors of the 1MB1 full-length template structure, by entering the PDB ID in a small form field on the VAST home page, and then clicking on the colored bar labeled "Chain" on the MMDB structure summary page. This precomputed page for the 1MB1 structure shows a number of diverse proteins matching to various helices and strands of the structure.
+
&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
 +
* Orient and scale your superimposed structures so that their structural similarity is apparent, and the recognition helix can be clearly seen inserting into the DNA major groove. Paste a copy of your image into your assignment. Remark briefly on which parts of the structure appear to superimpose best.  Note whether this orientation of a B-DNA double-helix is a plausible model for DNA binding of your Mbp1 orthologue.
  
At the '''EBI''' there are a number of very well designed structure analysis tools linked off the [http://www.ebi.ac.uk/Tools/structural.html '''Structural Analysis''' page]. As part of its MSD Services, the SSM (Secondary Structure Matching service) provides a well thought out interface for searching files from the PDB or uploading coordinates.
+
(1 mark)
 +
</div>
 +
&nbsp;<br>
 +
&nbsp;
  
After uploading the coordinates for residues 36 to 76 of the 1MB1 structure, running the search, and sorting the results by alignment length, the top hits include a number of nucleotide binding proteins such as a replication terminator (1F4K), the LexA repressor (1MVD) and a "Winged Helix" protein (1KQ8). These are all members of a much larger superfamily, the "winged helix" DNA binding domains ([http://cathwww.biochem.ucl.ac.uk/cgi-bin/cath/GotoCath.pl?cath=1.10.10.10 CATH 1.10.10.10]), of which hundreds of structures have been solved. They represent one branch of the tree of helix-turn-helix (HTH) DNA binding modules. (A recent review on HTH proteins is linked from the resources section at the bottom of this page). Winged Helix domains typically bind their cognate DNA with a "recognition helix" which precedes the beta hairpin and binds into the major groove; additional stabilizing interactions are provided by the edge of the beta strand binding into the minor groove.
 
  
<!-- The other service the EBI structure links to is the DALI server. DALI was one of the first algorithms capable of large-scale protein structure searches; it was developed by Liisa Holm and is now hosted by her group in Helsinki. Submitting our search domain generates the e-mailed result linked to here. Both results (there are only two) are also found in the top 100 list of the SSM service. The winged helix domain 1DP7 merits some comment though: its structure shows a novel mode of binding for DNA. Here, it is the beta-wing, not the "recognition helix" that inserts into the major groove! We will consider this in more detail below.
+
<div style="padding: 5px; background: #E9EBF3; border:solid 1px #AAAAAA;">
  
First we shall explore some of the structures that SSM has returned. The SSM server presents its result details in Web pages, but it also allows to download the entire result set in an XML formatted file. This is a common method of data-interchange in bioinformatics but you would not want to actually read such a file and manually extract information (even though you could, in principle). Thus I have prepared a summary file of the alignment details of the SSM results. This should allow you to rapidly find the exact aligned residues in the matched domains. While I have derived this file from the output through a computer program I have written, you could easily have accessed the same information on the Web, had you run the query yourself. -->
+
===(4.2) Preparation and superposition of a non-canonical complex (1 mark)===
 +
</div>
 +
&nbsp;<br>
  
This is good news: once we have determined that the APSES domain is actually an example of a larger group of transcription factors, we can pick one of these for which a DNA complex structure is known. I have picked one such structure from the list of hits that were returned by SSM: it is the Elk-1 transcription factor.
+
The structure displaying a non-canonical complex between a winged-helix domain and its cognate DNA binding site is the human Regulatory Factor X.
  
[[Image:A5_canonical_wHTH.jpg|frame|none|Stereo-view of the canonical DNA binding mode of the Winged Helix domain family. Shown here is the Elk-1 transcription factor - an ETS DNA binding domain - in complex with a high-affinity binding site (pdb|1DUX). Note how the "recognition helix" inserts into the major groove of the DNA molecule. The color gradient ramps from blue (34) to green (84). Note how the first helix of the "helix-turn-helix" architecture serves only to position the recognition helix and makes few interactions by itself.]]
+
[[Image:A5_non-canonical_wHTH.jpg|frame|none|Stereo-view of a non-canonical wHTH-DNA complex, discovered with the structure of human Regulatory Factor X (hRFX) binding its cognate X-box DNA sequence (1DP7). Note how the helix that corresponds to the recognition helix in the canonical domain lies across the minor groove whereas the beta-"wing" inserts into the major groove. The color gradient ramps from blue (18) to green (68).]]
  
Now all that is left to do is to bring the DNA molecule into the correct orientation for our '''model''' and then to combine the two files. We need to superimpose the Elk-1 protein/DNA complex onto our '''model'''.
+
The 1DP7 coordinate-file contains only one protein domain and only one B-DNA monomer in its asymmetric unit. This is a file for which we have to generate ''biological unit'' coordinates! Then, for simplicity we will delete the second protein monomer. As you know, there are at least two systems that make the so-called biological units available: the PDB itself, through the Biological Unit file that is accessible via the "Download Files" section on any Structure Explorer page, and the EBI through the PQS service. '''How''' the biological units are stored is subtly different in the two cases, and for our purpose this small difference is important. The PDB generates additional chains as copies of the original and delineates them with <code>MODEL</code>, <code>ENDMDL</code> records, just like in a multi-structure NMR file. The chain IDs and the atom numbers are the same as the original. The EBI's PQS service creates copies that have distinct atom numbers and chain IDs. The difference is that the PDB file thus '''contains the same molecule in two different orientations''', whereas the PQS file contains '''two independent molecules'''. This is an important difference when it comes to selecting residues, visualizing and superimposing structures. For VMD, the PQS way of doing things is the right way to go, since by default only the first <code>MODEL</code> will be displayed if several are available.
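
To check which of the two conventions a given biological unit file follows, you can count its <code>MODEL</code> records and list its chain identifiers. The following is a minimal sketch that assumes a standard fixed-column PDB file; the function name is hypothetical.

```python
def summarize_copies(pdb_text: str):
    """Return (number of MODEL records, list of chain IDs in order of appearance)."""
    models = 0
    chains = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "MODEL":
            models += 1
        elif record in ("ATOM", "HETATM"):
            chain = line[21:22]  # column 22 (1-based) is the chain identifier
            if chain not in chains:
                chains.append(chain)
    return models, chains
```

A file that stores its copies as models would be expected to report two or more <code>MODEL</code> records with a repeated set of chain IDs, whereas a file that stores independent molecules would report no <code>MODEL</code> records and a longer list of distinct chain IDs.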
  
;Structure superposition
+
* Access the [http://pqs.ebi.ac.uk/ '''EBI PQS server'''], enter 1DP7 into the '''PDBidcode''' form field and click on '''Submit'''.
There are quite a number of superposition servers available on the Web; a remarkably comprehensive overview can be found in [http://en.wikipedia.org/wiki/Structural_alignment Wikipedia]. However, overengineering and a black-box mentality make our task more difficult than it need be: most tools do not allow users to specify particular alignment zones but attempt to automatically define the zones of residues to be superimposed according to some geometric target function. Almost none return the actual rotation matrix and translation vector that is used for the superposition. And almost none transform the coordinates of heteroatoms such as solvent, ligands or DNA molecules along with the protein coordinates. An exception that I have found to be very usable is the [http://www.predictioncenter.org/local/lga/lga.html Local-Global Alignment server ('''LGA''')], written by Adam Zemla. The procedure is quite straightforward:
+
* On the results page, click on the link under '''1dp7_0''', which is the unique suggestion for a biological unit that the server has identified.
 +
* On the PQS OUTPUT page that is retrieved, click on the '''1dp7.mmol''' link, this will load the PDB formatted coordinate file.
 +
* Save the coordinates as 1DP7_complex.pdb (or some other name that makes sense to you), open it in a text editor, delete the <code>HETATM</code> records from the end and the entire chain "B". Also make sure not to delete any of the <code>TER</code> records for chains "D", "P" or "A". Save the file.
 +
* In the multiseq window, choose File&rarr;Import Data, '''Browse...''' to your 1DP7_complex file, select it and click on '''Open'''. Click '''OK''' to load the file.
 +
* Mark all three protein chains by selecting the checkbox next to their name and again run the STAMP structural alignment.
 +
* In the graphical representations window, double-click again on all cartoon representations that multiseq has generated to undisplay them, undisplay also the Tube representation of 1DUX, create a Tube representation for 1DP7, and select a Color by ColorID (a different color you like). The resulting scene should look similar to the one you have created above, only with 1DP7 in place of 1DUX and colored differently.
  
*Define the structure to be rotated (1DUX in this case). This is a dimer, so download the file from the PDB and manually edit to contain only DNA chains A and B and protein chain C.
+
&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
*Define the structure to be held constant (1MB1 in this case). Download from PDB.
+
* Orient and scale your superimposed structures so that their structural similarity is apparent, the orientation is similar to the scene generated above and the 1DP7 "wing" can be clearly seen inserting into the DNA major groove. Paste a copy of your image into your assignment. Remark briefly on which parts of the structure appear to superimpose best.  Note whether this orientation of a B-DNA double-helix is a plausible model for DNA binding of your Mbp1 orthologue.
*Use the "browse" option to define both files as input on the LGA input form
 
*Use the option to have both coordinate sets included in your output: <code>-o2</code>
 
*Submit
 
  
The results arrive by e-mail. I have linked the resulting PDB file to the [[Assignment_5_fallback_data|'''Fallback Data page''']]. <small>If you run this analysis on your own, you may want to review the types of edits I made to the PDB file to get it displayed correctly in Rasmol.</small>
+
(1 mark)
 +
</div>
 +
&nbsp;<br>
  
 +
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
===(4.3) Interpretation (2 marks)===
*Save the superimposed coordinates in a file, open and view in Rasmol and note how well the "recognition helix" and adjacent beta strands superimpose! (Alternatively, copy and save the coordinates from the [[Assignment_5_fallback_data|'''Fallback Data file''']] to your harddisk.) Make an informative view in divergent stereo and paste it into your assignment. (4 marks)
 
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
&nbsp;
 
  
 +
In your previous assignment, you have commented on conservation patterns in Mbp1 orthologues. You can refer back to your last results (easier to do), or you can import the APSES domain alignment for Mbp1 proteins and again color by conservation (easier to study) to briefly discuss the following question.
 +
 +
&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
 +
* Considering the conservation patterns for Mbp1 orthologues, and assuming that all these orthologues bind DNA in a similar way, which model appears to be more plausible for protein-DNA interactions in APSES domains? Is it the canonical, or the non-canonical binding mode? Discuss briefly what you would expect to find and how this relates to your observations. Distinguish clearly between experimental evidence, computational inference and empirical hypothesis. You are of course welcome to paste detail views (stereo !) of particular sidechains, or surfaces etc. if this helps your arguments. Sometimes a picture is worth many words. But this is not a requirement, we are more interested in evidence-based reasoning than in the form of the presentation.
 +
 +
(2 marks)
 +
</div>
 +
&nbsp;<br>
 +
&nbsp;<br>
  
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
  
==(4) Summary of Resources==
+
==(5) Summary of Resources==
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
  
;Links
+
;Links and background reading
 +
 
 
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Peitsch_2002_UseOfModels.pdf '''Review (PDF, restricted)''' Manuel Peitsch on Homology Modeling]
 
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Peitsch_2002_UseOfModels.pdf '''Review (PDF, restricted)''' Manuel Peitsch on Homology Modeling]
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Aravind_2005_HTHdomains.pdf '''Review (PDF, restricted)''' Aravind ''et al.'' Helix-turn-helix domains] (background reading, not required reading)
+
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Aravind_2005_HTHdomains.pdf '''Review (PDF, restricted)''' Aravind ''et al.'' Helix-turn-helix domains]
:* [[Organism_list_2006|Assigned Organisms]]
+
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/2000_Gajiwala_WingedHelixDomains.pdf '''Review (PDF, restricted)''' Gajiwala &amp; Burley, winged-Helix domains]
:* [http://www.rcsb.org/pdb/file_formats/pdb/pdbguide2.2/guide2.2_frame.html '''PDB file format''']
+
:* [[Organism_list_2007|Assigned Organisms]]
 +
:* [http://www.wwpdb.org/documentation/format23/v2.3.html '''PDB file format'''] (see the Coordinate Section if you are unsure about chain identifiers)
 
:* [http://en.wikipedia.org/wiki/Structural_alignment Wikipedia on '''Structural Superposition'''] <small>(although the article is called "Structural Alignment")</small>
 
:* [http://en.wikipedia.org/wiki/Structural_alignment Wikipedia on '''Structural Superposition'''] <small>(although the article is called "Structural Alignment")</small>
  
:* [[Assignment_5_fallback_data|'''Fallback Data page''']]
+
;[[Assignment_4_fallback_data|'''Fallback Data page''']]
  
 
;Alignments
 
;Alignments
:* [[All_Mbp1_T-COFFEE|Mbp1 proteins '''T-Coffee''' aligned (text version)]]
+
:* [[APSES_domains_MUSCLE|APSES domains MUSCLE aligned]]
  
 
&nbsp;
 
&nbsp;
 
&nbsp;
 
&nbsp;
  
<div style="padding: 5px; background: #D3D8E8;  border:solid 1px #AAAAAA;">
+
{{Template:Assignment_Footer}}
[End of assignment]
 
</div>
 
 
 
If you have any questions at all, don't hesitate to mail me at [mailto:boris.steipe@utoronto.ca boris.steipe@utoronto.ca] or post your question to the [mailto:bch441_2006@googlegroups.com Course Mailing List]
 

Revision as of 13:52, 1 November 2007

Note! This assignment is currently active. All significant changes will be announced on the mailing list.

 
 


   

Assignment 4 - Homology modeling

How could the search for ultimate truth have revealed so hideous and visceral-looking an object?
Max Perutz (on his first glimpse of the Hemoglobin structure)

   

Where is the hidden beauty in structure, and where, the "ultimate truth"? In the previous assignments we have studied sequence conservation in APSES family domains and we have discovered homologues in all fungal species. This is an ancient protein family that had already duplicated to several paralogues at the time the cenancestor of all fungi lived, more than 600,000,000 years ago, in the Vendian period of the Proterozoic era of Precambrian times.

In order to understand how specific residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to study an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. Explanations of a protein's observed properties and functions can't rely on the general fact that it binds DNA; we need to consider details in terms of specific residues and their spatial arrangement. In particular, it would be interesting to correlate the conservation patterns of key residues with their potential to make specific DNA binding interactions. Unfortunately, no structure of an APSES domain in complex with bound DNA has been solved up to now, and the experimental evidence we have considered in Assignment 2 (Taylor et al., 2000) is not sufficient to unambiguously define the details of how a DNA double helix might be bound. Moreover, at least two distinct modes of DNA binding are known for proteins of the winged-helix superfamily, of which the APSES domain is a member.

In this assignment you will (1) construct a molecular model of the Mbp1 orthologue in your assigned organism, (2) identify similar structures of distantly related domains for which protein-DNA complexes are known, (3) assemble a hypothetical complex structure and (4) discuss whether the available evidence allows you to distinguish between different modes of ligand binding.

For what follows, please remember this terminology:

Target
The protein that you are planning to model.
Template
The protein whose structure you are using as a guide to build the model.
Model
The structure that results from the modeling process. It has the Target sequence and is similar to the Template structure.

 

A brief overview article on the construction and use of homology models is linked to the resource section at the bottom of this page. That section also contains links to other sites and resources you might find useful or interesting.

Preparation, submission and due date

Read carefully.
Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we see too many assignments which, arduously effected, nevertheless intimate nescience of elementary tenets of molecular biology. If the sentence above did not trigger an urge to open a dictionary, you are trying to guess, rather than confirm possibly important information.

Review the guidelines for preparation and submission of BCH441 assignments.

The due date for the assignment is Monday, November 12 at 10:00 in the morning.

   


(1) Preparation


(1.1) Template choice and sequence (1 mark)

 
Often more than one related structure can be found in the PDB. We have touched on principles of selecting template structures in the lecture and there is a short summary of template choice principles on this Wiki. One can search the PDB itself through its Advanced Search interface: for example, one can search for sequence similarity with a BLAST search, or search for structural similarity by accessing structures according to their CATH or SCOP classification. Alternatively, one can always use the BLAST interface at the NCBI, since the sequences contained in PDB files are accessible as a database subsection on the BLAST menu.

  • Use the NCBI BLAST interface to identify all PDB files that are clearly homologous to your target APSES domain, if you haven't already done so in Assignment 2. Document that you have searched in the correct subsection of the database by selecting "pdb" on the database options menu. For the hits you find, consider how these coordinate sets differ and which features would make each more or less suitable for your task by commenting briefly on
  • sequence similarity to your target
  • size of expected model (length of alignment)
  • presence or absence of ligands
  • experimental method and quality of the data set

Then choose the template you consider the most suitable and note why you have decided to use this template.

  • Retrieve the most suitable template structure coordinate file from the PDB.

(0.5 marks)
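When you comment on sequence similarity and alignment length for each hit, it can help to compute both numbers the same way for every candidate. The following is a minimal sketch, not part of the BLAST output: it assumes you have copied the two aligned rows (with "-" as the gap character) from the alignment, and the function name identity_stats is made up for illustration.

```python
def identity_stats(row_a, row_b):
    """Return (identical, aligned, percent) for two rows of one alignment.

    Only columns in which neither row has a gap are counted as aligned.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a == "-" or b == "-":
            continue  # gap column: contributes neither to length nor identity
        aligned += 1
        if a == b:
            identical += 1
    percent = 100.0 * identical / aligned if aligned else 0.0
    return identical, aligned, percent
```

Note that percent identity depends on what you divide by; here it is the number of ungapped columns, which may differ from the denominator BLAST reports.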

It is not at all straightforward how to number sequences in such a project. The "natural" numbering starts with the start-codon of the full length protein and goes sequentially from there. However, this does not map exactly to other numbering schemes we have encountered. As you know, the first residue of the APSES domain as the CDD defines it is not Residue 1 of the Mbp1 protein. The first residue of e.g. the 1MB1 FASTA file is the first residue of the Mbp1 protein, but the last five residues are an artificial His tag. Is H125 of 1MB1 thus equivalent to R125 in MBP1_SACCE? The N-terminus of the Mbp1 crystal structure is disordered. The first residue in the structure is ASN 3, therefore N is the first residue in a FASTA sequence derived from the coordinate section of the PDB file (the ATOM records), whereas the SEQRES records start with MET ... and so on. You need to remember: a sequence number is not absolute, but derived from a particular context.

The homology model will be based on an alignment of target and template. Thus we have to define the target sequence. As discussed in class, PDB files have an explicit and an implied sequence and these do not necessarily have to be the same. To compare the implied and the explicit sequence for the template, you need to extract sequence information from coordinates. One way to do this is via the Web interface for WhatIf, a crystallography and molecular modeling package that offers many useful tools for coordinate manipulation tasks.
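The difference between the explicit and the implied sequence can also be seen directly in the coordinate file. The sketch below is an illustration only, assuming a standard fixed-column PDB file with a single model and only the 20 standard amino acids; the function names are made up for this example, and unknown residue names are written as X.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_sequence(pdb_lines, chain):
    """Explicit sequence: residues listed in the SEQRES records of one chain."""
    seq = []
    for line in pdb_lines:
        # Column 12 holds the chain ID, columns 20-70 the residue names.
        if line.startswith("SEQRES") and len(line) > 11 and line[11] == chain:
            for name in line[19:70].split():
                seq.append(THREE_TO_ONE.get(name, "X"))
    return "".join(seq)


def atom_sequence(pdb_lines, chain):
    """Implied sequence: residues that actually have ATOM coordinates."""
    seq = []
    seen = set()
    for line in pdb_lines:
        if line.startswith("ATOM") and len(line) > 26 and line[21] == chain:
            key = line[22:27]  # residue number plus insertion code
            if key not in seen:
                seen.add(key)
                seq.append(THREE_TO_ONE.get(line[17:20], "X"))
    return "".join(seq)
```

For a structure with a disordered N-terminus, the string returned by atom_sequence would be expected to start later than the one returned by seqres_sequence.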

  • Navigate to the Administration sub-menu of the WhatIf Web server. Follow the link to Make sequence file from PDB file. Enter the PDB-ID of your template into the form field and Send the request to the server. The server accesses the PDB file and extracts sequence information directly from the ATOM   records of the file. The results will be returned in PIR format. Copy the results, edit them to FASTA format and save them in a text-only file. Make sure you create a valid FASTA formatted file! Use this implied sequence to check if and how it differs from the sequence ...
  • ... listed in the SEQRES records of the coordinate file;
  • ... given in the FASTA sequence for the template, which is provided by the PDB;
  • ... stored in the protein database of the NCBI.
and record your results.
  • In a table, establish how the sequence numbers in the coordinate section of your template(*) correspond to your target sequence numbering.

(0.5 marks)

(*) These residue numbers are important, since they are referenced e.g. by VMD when you visualize the structure. The easiest way to list them is via the Sequence Viewer extension of VMD.
Don't do this for every residue individually but define ranges. Look at the correspondence of the first and last residue of target and template sequence and take indels into account. Establishing sequence correspondence precisely is crucially important! For example, when a publication refers to a residue by its sequence number, you have to be able to relate that number to the residue numbers of the model as well as your target sequence.
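The bookkeeping behind such a table of ranges can be sketched as follows. This is an illustration under stated assumptions, not a required tool: it assumes you already have the two aligned rows and know the residue number of the first residue in each row, that residue numbers increase by one along each sequence (no chain breaks or insertion codes), and the function names number_map and as_ranges are made up for this example.

```python
def number_map(template_aln, target_aln, template_start, target_start):
    """Map template residue numbers to target residue numbers.

    Both strings are rows of the same alignment ('-' marks a gap); the
    start values are the residue numbers of each row's first residue.
    """
    if len(template_aln) != len(target_aln):
        raise ValueError("aligned rows must have the same length")
    mapping = {}
    t_num, q_num = template_start, target_start
    for t_char, q_char in zip(template_aln, target_aln):
        if t_char != "-" and q_char != "-":
            mapping[t_num] = q_num
        # Advance each counter only where that row has a residue.
        if t_char != "-":
            t_num += 1
        if q_char != "-":
            q_num += 1
    return mapping


def as_ranges(mapping):
    """Collapse a residue map into ((t_first, t_last), (q_first, q_last)) rows."""
    rows = []
    for t_num in sorted(mapping):
        q_num = mapping[t_num]
        if rows and rows[-1][0][1] + 1 == t_num and rows[-1][1][1] + 1 == q_num:
            # Both numberings continue without a break: extend the last row.
            rows[-1] = ((rows[-1][0][0], t_num), (rows[-1][1][0], q_num))
        else:
            rows.append(((t_num, t_num), (q_num, q_num)))
    return rows
```

Every indel starts a new row in the result, which is the reason ranges rather than single residues are the natural unit for the table.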

   

(1.2) The input alignment (1 mark)

 

The sequence alignment between target and template is the single most important factor that determines the quality of your model. No comparative modeling process will repair an incorrect alignment; it is useful to consider a homology model more like a three-dimensional map of a sequence alignment than a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these only because they are convenient. You should take advantage of the much more sophisticated alignment methods available. Analysis of wrong models can't be expected to produce right results.

The best possible alignment is usually constructed from a multiple sequence alignment that includes at least the target and template sequence and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. Your alignment should have been carefully reviewed by you and wherever required, manually adjusted to move insertions or deletions between target and template out of the secondary structure elements of the template structure.

In the case of Mbp1 genes however, all orthologues we have considered have no indels in the APSES domain regions. Evolutionary pressure on the APSES domains has selected against indels in the more than 600 million years these sequences have evolved independently in their respective species.

Accordingly, all we need to do is to write the APSES domain sequences one under the other.

  • Copy the FASTA formatted sequence for the APSES domain of your organism's Mbp1 orthologue from the sequences defined in Assignment 3 and save it as FASTA formatted text file. This is your target sequence. Compare this with the FASTA formatted file you have extracted from the PDB coordinate set. This is your template sequence. Then generate a multi-FASTA formatted file that contains both sequences, and pad the sequence(s) where required with hyphens as gap characters, so that target and template sequences have exactly the same length and are aligned. Refer to the Fallback data if you are not sure about the format.

(1 mark)
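The padding step can be sketched in a few lines. This is a minimal illustration that assumes, as stated above, that there are no internal indels, so that the only thing to determine is how far one sequence starts after the other; the offset has to come from your own comparison of the two sequences, and the function names are made up for this example.

```python
def pad_to_equal_length(seq_a, seq_b, offset_b=0):
    """Pad two ungapped sequences with '-' so that both rows are equally long.

    offset_b is the number of positions by which seq_b starts later than
    seq_a (negative if seq_b starts earlier). Assumes no internal indels.
    """
    row_a = "-" * max(0, -offset_b) + seq_a
    row_b = "-" * max(0, offset_b) + seq_b
    width = max(len(row_a), len(row_b))
    return row_a.ljust(width, "-"), row_b.ljust(width, "-")


def multi_fasta(records, line_width=60):
    """Format (name, sequence) pairs as multi-FASTA text."""
    out = []
    for name, seq in records:
        out.append(">" + name)
        for i in range(0, len(seq), line_width):
            out.append(seq[i:i + line_width])
    return "\n".join(out) + "\n"
```

Whatever tool you use, check the result by eye: conserved residues must end up in the same columns.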

 
 

(2) Homology model

   

(2.1) SwissModel (1 mark)

 

Access the Swissmodel server at http://swissmodel.expasy.org . Navigate to the Alignment Interface.

 

  • Paste your alignment for target and template into the form field. Refer to the Fallback Data file if you are not sure about the format. Make sure to select the correct option for the alignment input format on the form.
(You have to choose the correct format, and if, e.g., you choose a CLUSTAL format, you have to include a header line and a blank line. In the past we have seen problems with uploading alignments that have not been saved as "text only", and with including periods, i.e. ".", in sequence names of CLUSTAL formatted alignments. Underscores appear to be safe.)
  • Click submit alignment and on the returned page define your target and template sequence. For the template sequence define the PDB ID of the coordinate file. Enter the correct Chain-ID.
Recently the PDB has undergone a "remediation" process in which archived coordinate files were altered by the database to conform to new format standards. One of the changes was to assign a chain identifier of "A" to all chains that did not previously have a chain identifier. SwissModel uses a derivative of coordinate sets from the PDB (a dataset they call ExPDB). Apparently the PDB proper and ExPDB have now gone out of synchrony; when I entered the (correct, according to PDB) chain designation "A" for 1MB1, SwissModel rejected the alignment with a nondescript error message. When I entered an underscore "_" instead, which would be the designation for a chain without explicit chain identifier, such as the pre-remediation version of the coordinates, the alignment was accepted and processed. I have e-mailed SwissModel about the problem; they are in the process of correcting it and may or may not be done while you are working on your assignments. If your template chain has the chain identifier "A" and your alignment gets rejected, try entering an underscore instead.
Enter the correct chain ID into the form-field even if you think it already appears there; don't simply accept the preloaded default. There is a bug in SwissModel's parser code that may cause incorrect strings to be sent to the server from that field. I have e-mailed SwissModel about the problem, which may or may not be corrected while you are working on your assignments.
  • Click submit alignment and review the alignment on the returned page. Make sure it has been interpreted correctly by the server. The conserved residues have to be lined up and matching. Then click submit alignment again, to start the modeling process.
  • The resulting page returns information about the resulting model. Save the model coordinates on your computer. Read the information on what is being returned by the server (click on the red question mark icon). Paste the Anolea profile into your assignment.
Do not paste a screenshot of the result, but copy and paste the image from the Web-page! You do not need to submit the actual coordinate files with your assignment.

(1 mark)
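If you decide to submit a CLUSTAL formatted alignment, the formatting points mentioned above (a header line, a blank line, no periods in sequence names) can be sketched like this. Treat it as an assumption-laden illustration: the exact header wording shown is the one commonly written by CLUSTAL W, but whether the server requires precisely this text is not confirmed here, and the function name clustal_text is made up.

```python
def clustal_text(records, block=60):
    """Format aligned (name, row) pairs as CLUSTAL-style text.

    Periods in names are replaced by underscores, since periods have
    caused upload problems in the past.
    """
    safe = [(name.replace(".", "_"), row) for name, row in records]
    lengths = {len(row) for _, row in safe}
    if len(lengths) != 1:
        raise ValueError("all aligned rows must have the same length")
    total = lengths.pop()
    width = max(len(name) for name, _ in safe) + 4  # name column plus spacing
    # Header line (assumed wording), followed by the required blank line.
    lines = ["CLUSTAL W (1.83) multiple sequence alignment", ""]
    for start in range(0, total, block):
        for name, row in safe:
            lines.append(name.ljust(width) + row[start:start + block])
        lines.append("")  # blank line between blocks
    return "\n".join(lines)
```

Save the result as a plain text file before pasting or uploading it.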

 
In case you do not wish to submit the modelling job yourself, or have insurmountable problems when using the SwissModel interface, you may access the result files from the Fallback Data file. Document the problems and note this in your assignment.


(3) Model analysis

   

(3.1) The PDB file (1 mark)

 

Open your model coordinates in a text-editor (make sure you view the PDB file in a fixed-width font) and consider the following questions: (Alternatively, view the coordinates linked to the Fallback Data file.)

 

  • What is the residue number of the first residue in the model? What should it be, based on the alignment? If the putative DNA binding region was reported to be residues 50-74 in the Mbp1 protein, which residues of the model correspond to that?

(1 mark)

   

(3.2) First visualization (1 mark)

 

In assignment 2 you have already studied a Mbp1 structure and compared it with your organism's Mbp1 APSES domain. Since a homology model inherits its structural details from the template, the model should look very similar to the original structure but contain the sequence of the target.

 

  • Save your model coordinates to your harddisk and visualize the structure in VMD. (Alternatively, copy and save the coordinates linked to the Fallback Data file to your harddisk.) Make an informative stereo view that shows the general orientation of the helix-turn-helix motif and the "wing", and paste it into your assignment.
  • Discuss briefly which parts of the model may be unreliable and color these (if any) distinctly in your submitted image.

(1 mark)

 
 

(4) The DNA ligand

   

(4.1) Finding a similar protein-DNA complex (1 mark)

 

One of the really interesting questions we can discuss with reference to our model is how sequence variation might be converted into changing DNA recognition sites, and then lead to changed cognate DNA binding sequences. But in order to address this, we would need to add a plausible model for how DNA is bound to APSES domains.

Since there is currently no software available that would accurately model such a complex from first principles, we will base a model of a bound complex on homology modeling as well. This means we need to find a similar structure for which the position of bound DNA is known, then superimpose that structure with our model. This places the DNA molecule into the spatial context of the model we are studying. However, you may remember from the third assignment that the APSES domains in fungi seem to be a relatively small family. And there is no structure available of an APSES domain-DNA complex. How can we find a coordinate set of a structurally similar protein-DNA complex?

Remember that homologous sequences can have diverged to the point where their sequence similarity is no longer recognizable; however, their structure may be quite well conserved. Thus if we could find similar structures in the PDB, these might provide us with some plausible hypotheses for how DNA is bound by APSES domains. We thus need a tool similar to BLAST, not for sequence alignment but for structure alignment: a kind of BLAST for structures. Just like with sequence searches, we might not want to search with the entire protein, if what we are interested in is a subdomain that binds to DNA. Attempting to match all structural elements in addition to the ones we are actually interested in is likely to make the search less specific - we would find false positives that are similar to some irrelevant part of our structure. However, defining too small a subdomain would also lead to a loss of specificity: in the extreme it is easy to imagine that the search for e.g. a single helix would retrieve very many hits that would be quite meaningless.

At the NCBI, VAST is provided as a search tool for structural similarity search.

At the EBI there are a number of very well designed structure analysis tools linked off the Structural Analysis page. As part of its MSD Services, MSDfold provides a convenient interface for structure searches.

However, we have also read previously that the APSES domains are members of a much larger superfamily, the "winged helix" DNA binding domains, of which hundreds of structures have been solved.

 

Stereo-view of a subdomain within the 1MB1 structure that includes residues 36 to 76. The color gradient ramps from blue (36) to green (76) and the "wing" is clearly seen as the green pair of beta-strands, extending to the right of the helix-turn-helix motif.

 

APSES domains represent one branch of the tree of helix-turn-helix (HTH) DNA binding modules. (A recent review on HTH proteins is linked from the resources section at the bottom of this page). Winged Helix domains typically bind their cognate DNA with a "recognition helix" which precedes the beta hairpin and binds into the major groove; additional stabilizing interactions are provided by the edge of a beta-strand binding into the minor groove. This is good news: once we have determined that the APSES domain is actually an example of a larger group of transcription factors, we can compare our model to a structure of a protein-DNA complex. CATH does not provide information on complexes, but we can search the PDB with CATH codes in the following way:

  • Access CATH domain 1.10.10.10.
  • Navigate to the PDB home page and follow the link to Advanced Search
  • In the options menu for "Choose a Query Type" select Structure Features → CATH classification. A window will open that allows you to navigate down through the CATH tree. The interface is awkward because it does not display the actual CATH codes along with the class names, but you can view the class names on the CATH page linked above. Click on the triangle icons before "Mainly Alpha"→"Orthogonal Bundle"→"ARC repressor mutant, subunit A" then click on the link to "winged helix repressor DNA binding domain". As of this writing, this subquery matches 295 structures.
  • Click on the (+) button behind the subquery to add an additional query. Select the option "Structure Summary"→"Molecule / Chain type". In the option menus that pop up, select "Contains Protein → Yes", "Contains DNA → Yes", "Contains RNA → Ignore". This selects files that contain Protein-DNA complexes.
  • Check the box below this subquery to "Remove Similar Sequences at 90% identity" and click on "Evaluate Query". As of this writing, seventy complexes were returned.
  • In the left-hand menu, under the Tabulate section, click on the "Collage" function to display icons of the structure files. This is a fast way to obtain an overview of the structures that have been returned. First of all you may notice that in fact not all of the structures are really different, despite selecting only to retrieve dissimilar sequences. This appears to be a deficiency of the algorithm. But you can also easily recognize how the recognition helix inserts into the major groove of most of the structures that were returned (at least those where the domain is not a very small part of a much larger complex). There is one exception: the structure 1DP7 shows how the human RFX1 protein binds DNA in a non-canonical way. We shall use structural superposition of your homology model and two of the winged-helix proteins to decide which mode of DNA binding seems to be more plausible for Mbp1 homologues.

 

  • Follow the procedure outlined above, from a CATH entry page up to viewing a Collage (or alternatively a tabular view) of the retrieved coordinate files. You can be maximally concise documenting the procedure I have defined above, but do spend a bit of time to understand the key elements of the PDB's advanced search interface.

(1 mark)


(4.2) Preparation and superposition of a canonical complex (1 mark)

 

The structure we shall use as a reference for the canonical binding mode is the Elk-1 transcription factor.

Stereo-view of the canonical DNA binding mode of the Winged Helix domain family. Shown here is the Elk-1 transcription factor - an ETS DNA binding domain - in complex with a high-affinity binding site (1DUX). Note how the "recognition helix" inserts into the major groove of the DNA molecule. The color gradient ramps from blue (34) to green (84). Note how the first helix of the "helix-turn-helix" architecture serves only to position the recognition helix and makes few interactions by itself.

The 1DUX coordinate-file contains two protein domains and two B-DNA dimers in one asymmetric unit. For simplicity, let's delete the second copy.

  • Access the PDB and navigate to the 1DUX structure explorer page. Download the coordinates to your computer.
  • Open the coordinate file in a text-editor and delete the coordinates for chains D,E and F; you may also delete all HETATM records and the MASTER record. Save the file with a different name, e.g. 1DUX_monomer.pdb .
  • Open VMD and load your homology model. Turn off the axes, display the model as a Tube representation in stereo, and color it by Index. Then load your edited 1DUX file, display this coordinate set in a tube representation as well, and color it by ColorID in some color you like. It is important that you can distinguish easily which structure is which.
  • You could use the Extensions→Analysis→RMSD calculator interface to superimpose the two structures if you knew which residues correspond to each other. Sometimes it is useful to do exactly that: define exact correspondences between residue pairs and superimpose according to these selected pairs. For our purpose it is much simpler to use the Multiseq tool (and the structures are simple and small enough that the STAMP algorithm for structural alignment can define corresponding residue pairs automatically). Open the multiseq extension window, select the check-boxes next to both protein structures, and open the Tools→Stamp Structural Alignment interface.
  • In the "Stamp Alignment Options" window, check the radio-button for Align the following ... Marked Structures and click on OK.
  • In the Graphical Representations window, double-click on all "NewCartoon" representations for both molecules, to undisplay them.
  • You should now see a superimposed tube model of your homology model and the 1DUX protein-DNA complex. You can explore it, display side-chains etc. and study some of the details of how a transcription factor recognizes and binds to its cognate DNA sequence. However, remember that the model's side-chain orientations have not been experimentally determined but inferred from the template, and that the template's structure was determined in the absence of bound ligand.
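The text-editor step in the procedure above (removing chains D, E and F, the HETATM records and the MASTER record) can also be done with a short script. This is a minimal sketch that assumes a standard fixed-column PDB file; the function name strip_records is made up, and CONECT or SEQRES records that refer to the removed chains are deliberately left untouched.

```python
def strip_records(pdb_lines, drop_chains, drop_hetatm=True, drop_master=True):
    """Return PDB lines without the given chains, HETATM and MASTER records."""
    kept = []
    for line in pdb_lines:
        record = line[:6].strip()
        # Coordinate-type records carry the chain ID in column 22.
        if record in ("ATOM", "ANISOU", "TER", "HETATM"):
            if len(line) > 21 and line[21] in drop_chains:
                continue
        if drop_hetatm and record == "HETATM":
            continue
        if drop_master and record == "MASTER":
            continue
        kept.append(line)
    return kept
```

To use it, read the downloaded 1DUX file into a list of lines, call strip_records with drop_chains set to "DEF", and write the returned lines to a new file such as 1DUX_monomer.pdb. Open the result in a text editor afterwards to confirm that only the intended chains remain.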

 

  • Orient and scale your superimposed structures so that their structural similarity is apparent, and the recognition helix can be clearly seen inserting into the DNA major groove. Paste a copy of your image into your assignment. Remark briefly on which parts of the structure appear to superimpose best. Note whether this orientation of a B-DNA double-helix is a plausible model for DNA binding of your Mbp1 orthologue.

(1 mark)

 
 


(4.2) Preparation and superposition of a non-canonical complex (1 mark)

 

The structure displaying a non-canonical complex between a winged-helix domain and its cognate DNA binding site is the human Regulatory Factor X.

Stereo-view of a non-canonical wHTH-DNA complex, discovered with the structure of human Regulatory Factor X (hRFX) binding its cognate X-box DNA sequence (1DP7). Note how the helix that corresponds to the recognition helix in the canonical domain lies across the minor groove whereas the beta-"wing" inserts into the major groove. The color gradient ramps from blue (18) to green (68).

The 1DP7 coordinate-file contains only one protein domain and only one B-DNA monomer in its asymmetric unit. This is a file for which we have to generate biological unit coordinates! Then, for simplicity we will delete the second protein monomer. As you know, there are at least two systems that make the so-called biological units available: the PDB itself, through the Biological Unit file that is accessible via the "Download Files" section on any Structure Explorer page, and the EBI through the PQS service. How the biological units are stored is subtly different for both cases and for our purpose this small difference is important. The PDB generates additional chains as copies of the original and delineates them with MODEL, ENDMDL records, just like in a multi-structure NMR file. The chain IDs and the atom numbers are the same as the original. The EBI's PQS service creates copies that have distinct atom numbers and chain IDs. The difference is that the PDB file thus contains the same molecule in two different orientations, whereas the PQS file contains two independent molecules. This is an important difference when it comes to selecting residues, visualizing and superimposing structures. For VMD, the PQS way of doing things is the right way to go, since by default only the first MODEL will be displayed if several are available.
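If you are unsure which of the two conventions a downloaded biological unit file follows, counting MODEL records and chain identifiers tells you quickly. The following is only a sketch for a standard fixed-column PDB file; the function name summarize_units is made up for illustration.

```python
def summarize_units(pdb_lines):
    """Return (number of MODEL records, sorted chain IDs in ATOM/HETATM records)."""
    models = 0
    chains = set()
    for line in pdb_lines:
        if line.startswith("MODEL"):
            models += 1
        elif line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chains.add(line[21])  # chain ID is in column 22
    return models, sorted(chains)
```

A file in which copies are separated by MODEL records would be expected to report several models but repeated chain IDs, whereas a file with independent copies reports no MODEL records and more distinct chain IDs.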

  • Access the EBI PQS server, enter 1DP7 into the PDBidcode form field and click on Submit.
  • On the results page, click on the link under 1dp7_0, which is the unique suggestion for a biological unit that the server has identified.
  • On the PQS OUTPUT page that is retrieved, click on the 1dp7.mmol link, this will load the PDB formatted coordinate file.
  • Save the coordinates as 1DP7_complex.pdb (or some other name that makes sense to you), open it in a text editor, delete the HETATM records from the end and the entire chain "B". Also make sure not to delete any of the TER records for chains "D", "P" or "A". Save the file.
  • In the multiseq window, choose File→Import Data, Browse... to your 1DP7_complex file, select it and click on Open. Click OK to load the file.
  • Mark all three protein chains by selecting the checkbox next to their name and again run the STAMP structural alignment.
  • In the graphical representations window, double-click again on all cartoon representations that multiseq has generated to undisplay them, undisplay also the Tube representation of 1DUX, create a Tube representation for 1DP7, and select a Color by ColorID (a different color you like). The resulting scene should look similar to the one you have created above, only with 1DP7 in place of 1DUX and colored differently.

 

  • Orient and scale your superimposed structures so that their structural similarity is apparent, the orientation is similar to the scene generated above and the 1DP7 "wing" can be clearly seen inserting into the DNA major groove. Paste a copy of your image into your assignment. Remark briefly on which parts of the structure appear to superimpose best. Note whether this orientation of a B-DNA double-helix is a plausible model for DNA binding of your Mbp1 orthologue.

(1 mark)

 

(4.3) Interpretation (2 marks)

 

In your previous assignment, you have commented on conservation patterns in Mbp1 orthologues. You can refer back to your last results (easier to do), or you can import the APSES domain alignment for Mbp1 proteins and again color by conservation (easier to study) to briefly discuss the following question.

 

  • Considering the conservation patterns for Mbp1 orthologues, and assuming that all these orthologues bind DNA in a similar way, which model appears to be more plausible for protein-DNA interactions in APSES domains? Is it the canonical, or the non-canonical binding mode? Discuss briefly what you would expect to find and how this relates to your observations. Distinguish clearly between experimental evidence, computational inference and empirical hypothesis. You are of course welcome to paste detail views (stereo !) of particular sidechains, or surfaces etc. if this helps your arguments. Sometimes a picture is worth many words. But this is not a requirement, we are more interested in evidence-based reasoning than in the form of the presentation.

(2 marks)

 
 

(5) Summary of Resources

 

Links and background reading
Fallback Data page
Alignments

   

[End of assignment]

If you have any questions at all, don't hesitate to mail me at boris.steipe@utoronto.ca or post your question to the 2011 Course Mailing List.