Difference between revisions of "BIO Assignment Week 7"

From "A B C"
Line 44: Line 44:
  
 
 
 
 
 +
==Warm-up: a minimal change==
 +
Minimal changes to structure models can be done directly in Chimera. This illustrates the principle of full-scale modeling quite nicely. As an example, let us consider the residue <code>A&nbsp;42</code> of the 1BM8 structure. It is oriented towards the core of the protein, but most other Mbp1 orthologs have a larger amino acid in this position, <code>V</code>, or even <code>I</code>.
 +
 +
{{task|1=
 +
# Open <code>1BM8</code> in Chimera, hide the ribbons and show all atoms as a stick model.
 +
# Color the protein white.
 +
# Open the sequence window and select <code>A&nbsp;42</code>. Color it red. Choose '''Actions&nbsp;&rarr;&nbsp;Set pivot'''. Then study how nicely the alanine sidechain fits into the cavity formed by its surrounding residues.
 +
# To emphasize this better, hide the solvent molecules and select only the protein atoms. Display them as a '''sphere''' model to better appreciate the packing, i.e. the van der Waals contacts we discussed in class. Use the '''Favorites&nbsp;&rarr;&nbsp;Side view''' panel to move the clipping plane and see a section through the protein. Study the packing; in particular, note that the additional methyl groups of a valine or isoleucine would not have enough space in the structure. Then restore the clipping planes so you can see the whole molecule.
 +
# Let's simplify the view: choose '''Actions &rarr; Atoms/Bonds &rarr; backbone&nbsp;only &rarr; chain&nbsp;trace'''. Then select <code>A&nbsp;42</code> again in the sequence window and choose '''Actions &rarr; Atoms/Bonds &rarr; show'''.
 +
# Add the surrounding residues: choose '''Select &rarr; Zone...'''. In the window, make sure that the box is checked that selects all atoms at a distance of less than 5&Aring; from the current selection, and check the lower box to select the whole residue of any atom that matches the distance cutoff criterion. Click '''OK''' and choose '''Actions &rarr; Atoms/Bonds &rarr; show'''.
 +
#Select <code>A&nbsp;42</code> again: '''left-click''' (control click) on any atom of the alanine to select the atom, then '''up-arrow''' to select the entire residue. Now let's mutate this residue to isoleucine.
 +
#Choose '''Tools &rarr; Structure&nbsp;Editing &rarr; Rotamers''' and select <code>ILE</code> as the rotamer type. Click '''OK'''; a window will pop up that shows you the possible rotamers for isoleucine together with their database-derived probabilities. You can select them in the window and cycle through them with your arrow keys. But note that the probabilities are '''very''' different - they distinguish low-energy rotamers (high probability) from high-energy rotamers (low probability). Therefore, unless you have compelling reasons to do otherwise, try to find the highest-probability rotamer that may fit. This is where your stereo viewing practice becomes important, if not essential. It is really, really hard to do this reasonably in a 2D image! It becomes quite obvious in 3D. Btw: I find such "quantitative" work - where the real distances are important - easier in '''orthographic''' than in '''perspective''' view (cf. the '''Camera''' panel).
 +
#I find that the first rotamer is actually not such a bad fit. The <code>CD</code> atom comes close to the sidechains of <code>I&nbsp;25</code> and <code>L&nbsp;96</code>. But we can assume that these are somewhat mobile and can accommodate a denser packing, because - as you can easily verify in your Jalview alignment - it is '''NOT''' the case that sequences that have <code>I&nbsp;42</code>, have a smaller residue in position <code>25</code> and/or <code>96</code>. So let's accept the most frequent <code>ILE</code> rotamer by selecting it in the rotamer window and clicking '''OK''' (while '''existing side chain(s): replace''' is selected).
 +
#Done.
 +
}}
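The zone selection used in the task above can be sketched in a few lines of Python. This is only an illustration of the principle, not Chimera's implementation: the function name <code>residues_within</code>, the residue labels, and the toy coordinates are assumptions made up for the example. It selects a whole residue as soon as any one of its atoms lies closer than the cutoff to any atom of the current selection.

```python
import math

def residues_within(atoms, selection, cutoff=5.0):
    """Return the residues that have at least one atom closer than
    `cutoff` (in Angstrom) to any atom of the selected residue.

    `atoms` is a list of (residue_label, (x, y, z)) tuples.
    """
    selected_coords = [xyz for res, xyz in atoms if res == selection]
    hits = set()
    for res, xyz in atoms:
        if res == selection:
            continue  # the selection itself is not part of its zone
        if any(math.dist(xyz, s) < cutoff for s in selected_coords):
            hits.add(res)  # one matching atom selects the whole residue
    return sorted(hits)

# Hypothetical toy coordinates, NOT taken from 1BM8.
toy_atoms = [
    ("ALA 42", (0.0, 0.0, 0.0)),
    ("ALA 42", (1.5, 0.0, 0.0)),
    ("ILE 25", (4.0, 0.0, 0.0)),   # within 5 A of ALA 42
    ("ILE 25", (9.0, 0.0, 0.0)),   # too far, but the residue is already selected
    ("LEU 96", (0.0, 4.5, 0.0)),   # within 5 A of ALA 42
    ("GLY 70", (12.0, 0.0, 0.0)),  # outside the zone
]

print(residues_within(toy_atoms, "ALA 42"))
```

With real data, the atom list would come from the <code>ATOM</code> records of the coordinate file.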
 +
 +
If you want to go over this in more detail, check the video tutorial on YouTube published by the NIAID bioinformatics group [http://www.youtube.com/watch?v=bcXMexN6hjY '''here''']. I would also encourage you to go over [http://www.youtube.com/watch?v=eJkrvr-xeXY '''Part 2 of the video tutorial'''] that discusses how to check for and resolve (by energy minimization) steric clashes. But do remember that it is not clear whether energy minimization will make your structure more correct in the sense of a smaller overall RMSD with the real, mutated protein.
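To make the RMSD criterion concrete, here is a minimal Python sketch. It assumes that the two coordinate sets contain the same atoms in the same order and have already been superimposed; the function name <code>rmsd</code> and the example coordinates are assumptions for illustration only.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of
    (x, y, z) coordinates. No superposition is performed here."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate sets of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Hypothetical example: two atoms, one of which is displaced by 2 A.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
print(round(rmsd(model, reference), 3))
```

A lower value after energy minimization would only count as an improvement if the reference is the experimentally determined structure of the mutated protein, which we usually do not have.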
 +
 +
What we have done here with one residue is exactly the way homology modeling works with entire sequences. Let's now build a homology model for YFO Mbp1.
 +
 
==Preparation==
 
==Preparation==
  
 
===Target sequence===
 
===Target sequence===
The first step of homology modelling is to determine which sequence to model. We have determined the putative orthologue with conserved function in YFO by reciprocal best match with ''Saccharomyces cerevisiae'' Mbp1. Your sequence was initially found with an APSES domain search in YFO and the alignments with the yeast sequence are straightforward for the most part. There is one exception however: the alignment of '''ASPFU''' gene XP_754232 starts 22 amino acids downstream of the yeast sequence and that is odd because this implies that the gene is missing part of the domain's N-terminus. In such cases you '''must''' consider whether there could be a database error, most likely based on an erroneous gene model. This is not absolutely germane to this assignment, so I have placed the process into the collapsible section below. However do make an effort to understand what the issue is here and how to address it.
+
The first step of homology modelling is to determine which sequence to model. We have determined the putative orthologue with conserved function in YFO by reciprocal best match with ''Saccharomyces cerevisiae'' Mbp1. Your sequence was initially found with an APSES domain search in YFO and the alignments with the yeast sequence are straightforward for the most part.
 +
 
 +
There are two exceptions however: the alignments of '''ASPFU''' gene XP_754232 and '''CAPCO''' gene XP_007722875 are both missing part of the domain's N-terminus. This is odd, because it may imply that the APSES domains of these genes are not properly folded. When alignments give such surprising results, you '''must''' consider whether there could be an error in the published sequence, perhaps stemming from an erroneous gene model. This is not absolutely germane to this assignment, so I have placed the process into the collapsible section below as optional reading. However, it may be useful for you to understand what the issue is here and how to address it.
  
 
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand to read about gene model correction" data-collapsetext="Collapse">
 
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand to read about gene model correction" data-collapsetext="Collapse">
Line 54: Line 76:
  
 
<div class="mw-collapsible-content">
 
<div class="mw-collapsible-content">
An alignment of APSES domain sequence shows the shortened N-terminus of the ASPFU protein, relative to SACCE and e.g. the closely related ''aspegillus nidulans'', ASPNI:
+
An alignment of APSES domain sequences shows the shortened N-terminus of the ASPFU and the CAPCO protein, relative to SACCE and e.g. the closely related ''Aspergillus nidulans'', ASPNI:
 
  APSES domains:
 
  APSES domains:
 
  Mbp1_SACCE  QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAA...
 
  Mbp1_SACCE  QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAA...
 
  Mbp1_ASPNI  NVYSATYSSVPVYEFKIGTDSVMRRRSDDWINATHILKVA...
 
  Mbp1_ASPNI  NVYSATYSSVPVYEFKIGTDSVMRRRSDDWINATHILKVA...
 
  Mbp1_ASPFU  ----------------------MRRRGDDWINATHILKVA...
 
  Mbp1_ASPFU  ----------------------MRRRGDDWINATHILKVA...
 +
  Mbp1_CAPCO  ----------------------MRRRSDDWVNATHILKVA...
 +
 +
We analyse this for the ASPFU gene.
  
 
Working from the possibility that this may be a gene model error - e.g. a false translational start, a frameshift due to a sequencing error, or an erroneously modelled intron, we check whether the translation of the genomic sequence supports the presence of the expected amino acids. This is easily done running TBLASTN - BLASTing the protein query against the six reading frames of the ASPFU genome. We find the following:
 
Working from the possibility that this may be a gene model error - e.g. a false translational start, a frameshift due to a sequencing error, or an erroneously modelled intron, we check whether the translation of the genomic sequence supports the presence of the expected amino acids. This is easily done running TBLASTN - BLASTing the protein query against the six reading frames of the ASPFU genome. We find the following:
Line 78: Line 103:
 
:We can easily adapt this to the sequence range we need ...
 
:We can easily adapt this to the sequence range we need ...
 
<ol start="4">
 
<ol start="4">
<li>... and follow: http://www.ncbi.nlm.nih.gov/projects/mapview/seq_reg.cgi?taxid=746128&chr=3&from=3691100&to=3691372 to yield:
+
<li>... and follow: http://www.ncbi.nlm.nih.gov/nuccore/NC_007196.1?from=3691003&to=3691243&report=fasta to yield:
 
</ol>
 
</ol>
  >gi|71025130:3691100-3691372 Aspergillus fumigatus Af293 chromosome 3, whole genome shotgun sequence
+
  >gi|71025130:3691003-3691243 Aspergillus fumigatus Af293 chromosome 3, whole genome shotgun sequence
  GTTTTCCCTACAGCCTCAAATTTTCCCCAATGCCTCTTTCCATCCATTTTGCCCCTTCCTTCGCCGCGAA
+
  ACGGTTTGCGGAGACGGGCATTATGGCGGCGGTGGATTTCTCAAAAATCTATTCTGCTACATACAGCAGC
  GCCAATCTAACGCAGTTCAATAGGTTCCAGTTTACGAGTTCAAAGTCGATGGCGAAAGTGTTATGCGCCG
+
  GTAAGTCTCTTCTAATTGCGTATCTCTGTTTTCCCTACAGCCTCAAATTTTCCCCAATGCCTCTTTCCAT
  ACGAGGCGATGATTGGATCAATGCTACACATATTCTTAAAGTAGCTGGTTTTGACAAGCCAGCGAGAACC
+
  CCATTTTGCCCCTTCCTTCGCCGCGAAGCCAATCTAACGCAGTTCAATAGGTTCCAGTTTACGAGTTCAA
  CGGATTCTCGAGCGTGAAGTCCAGAAAGGGACTCATGAGAAGGTTCAGGGTGGTTATGGCAAA
+
  AGTCGATGGCGAAAGTGTTATGCGCCGACGA
 +
 
  
 
<ol start="5">
 
<ol start="5">
<li>To translate this, we navigate to any of the [http://bips.u-strasbg.fr/EMBOSS/ '''EMBOSS''' tools servers] and use "remap" - we want to see the translation matched to the nucleotide sequence. We turn restriction sites off, translate all three forward frames and paste and manually align the SACCE Mbp1 sequence into the output to see what we expect and what we got:
+
<li>To translate this, we navigate to any of the [http://bips.u-strasbg.fr/EMBOSS/ '''EMBOSS''' tools servers] and use "remap" - we want to see the translation matched to the nucleotide sequence. We turn restriction sites off, translate all three forward frames and paste and manually align the SACCE Mbp1 sequence into the output to see what we expect and what we got. I have selected only the frame(s) that actually give a match, and I have pasted the homologous CAPCO and SACCE sequences (lower case) to demonstrate their similarity:
 
</ol>
 
</ol>
          TCGCCGCGAAGCCAATCTAACGCAGTTCAATAGGTTCCAGTTTACGAGTTCAAAGTCGAT
+
ASPFU    ACGGTTTGCGGAGACGGGCATTATGGCGGCGGTGGATTTCTCAAAAATCTATTCTGCTACATACAGCAGC
                    70        80        90        100      110      120     
+
                                                                       
          ----:----|----:----|----:----|----:----|----:----|----:----|
+
  ASPFU      F A E T G I M A A D F S K  I  Y  S  A  T Y  S  S  
          S  P S Q S N A V Q * V  P Y E F K V  D 
+
  CAPCO                          m - a f d - k e i y s a t y s n  
SACCE          Q I  Y  S  A  R Y  S  G  V  D  V  Y  E  F  I  H S
+
  SACCE                          m s - - - - n q i y s a r y s g
            R R E A N L T Q F N R F Q F T  S  S K S M
 
            A A K P I * R S S I G S S  L  R  V Q S R W
 
 
   
 
   
         
+
         
          GGCGAAAGTGTTATGCGCCGACGAGGCGATGATTGGATCAATGCTACACATATTCTTAAA
+
ASPFU    GTAAGTCTCTTCTAATTGCGTATCTCTGTTTTCCCTACAGCCTCAAATTTTCCCCAATGCCTCTTTCCAT
                    130      140      150      160      170      180     
+
 
          ----:----|----:----|----:----|----:----|----:----|----:----|
+
  ASPFU    V L  F * ...
          G E V M R R R G D D W I N A T H I L K ...  
+
  CAPCO    v a - -    ...
  SACCE     T G  S  I K K K D  D  W  V N  A  T  H  I  L  K ...
+
  SACCE    v d - -    ...
            A K V L C A D E A M I G S M L H I F L K
+
         
            R K C Y A P T R R * L D Q C Y T Y S * S
+
  ASPFU    CCATTTTGCCCCTTCCTTCGCCGCGAAGCCAATCTAACGCAGTTCAATAGGTTCCAGTTTACGAGTTCAA
 +
                                                              ... V Y E F K  
 +
CAPCO                                                        ... v  y  e  l  k
 +
  SACCE                                                       ...  v  y  e  f  i
 +
         
 +
ASPFU      AGTCGATGGCGAAAGTGTTATGCGCCGACGAGGCGATGATTGGATCAATGCTACACATATTCTTAAA
 +
 +
ASPFU      V  D G E V R R G D  D  W  I N  A  T  H  I  L  K ...
 +
  CAPCO      v  a  g  d h i m r r r s d d w v n a t h i l k ...
 +
  SACCE      h  s  t  g s i m k r k k d d w v n a t h i l k ...
  
:This clearly shows us that there is N-terminal sequence that ought to be added to the gene model, upstream of the reported translational start of <tt>MRRR...</tt>. The sequence should begin:  
+
 
<tt>VPVYEFKVDGESVMRRR...</tt>. Is that all? Where is the actual N-terminus? There is no Initiation M in sight and this suggests an assembly error. Unfortunately, the search gets messy from that point on. It is not trivial to retrieve arbitrary nucleotide sequence anchored on BLAST hits ... and I have abandoned this for now. The take home lesson here is: if your retrieved protein sequence does not conform to your expectations, it may be worthwhile to follow up with the actual nucleotide sequence.
+
:This clearly shows us that there is N-terminal sequence that ought to be added to the gene model, upstream of the reported translational start of <tt>MRRR...</tt>. The sequences thus most likely begin as follows:  
 +
 
 +
ASPFU  MAAVDFSKIYSATYSSVSLFVYEFKVDGE-----SVMRRRGDDWINATHILK...
 +
CAPCO  ma-fd-keiysatysnva--vyelkvagd-----himrrrsddwvnathilk...
 +
SACCE  ms----nqiysarysgvd--ysgvdvyefihstgsimkrkkddwvnathilk...
 +
 
 +
The fact that the truncated N-terminus appears in both closely '''related''' genes and species suggests that what we see here is a mis-annotated intron. The take-home lesson is: if your retrieved protein sequence does not conform to your expectations, it may be worthwhile to follow up with the actual nucleotide sequence.
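The translation step can also be reproduced without a web server. The following Python sketch translates a nucleotide sequence in the three forward frames with the standard genetic code; it is a simplified stand-in for what "remap" displays, not a reimplementation of the EMBOSS tool. The function names are assumptions for illustration, and the test sequence is the first line of the genomic ASPFU segment quoted above.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, frame=0):
    """Translate `dna` starting at offset `frame` (0, 1 or 2).
    Codons that contain anything other than T, C, A, G become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def forward_frames(dna):
    """Return the translations of all three forward reading frames."""
    return [translate(dna, frame) for frame in range(3)]

# First line of the genomic segment retrieved above.
segment = ("ACGGTTTGCGGAGACGGGCATTATGGCGGCGGTGGATTTCTCAAAAATCTATTC"
           "TGCTACATACAGCAGC")

for frame, protein in enumerate(forward_frames(segment)):
    print(frame, protein)
```

One of the three frames contains the residues that are missing from the annotated protein, upstream of the intron.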
  
 
</div>
 
</div>
Line 118: Line 157:
  
  
The [http://swissmodel.expasy.org/ SWISS-MODEL] server provides several different options for constructing homology models. The easiest is probably the '''Automated Mode''' that requires only a target sequence as input. In this mode the program will automatically choose suitable templates and create an input alignment. I disagree however that that is the best way to use such a service: the reason is that template choice and alignment both may be significantly influenced by biochemical reasoning, and an automated algorithm cannot make the necessary decisions. Should you use a structure of reduced resolution that however has a ligand bound? Should you move an indel from an active site to a loop region even though the sequence similarity score might be less? Questions like that may yield answers that are counter to the best choices an automated algorithm could make. Therefore we will use the '''Alignment Mode''' of Swiss-Model in this assignment, choose our own template and upload our own alignment. But please note: the model you will produce is "easy" - the sequence similarity is high and there are no indels to consider. But the strategy we pursue here is suitable also for much more difficult problems. The automated strategy probably is not.
+
The [http://swissmodel.expasy.org/ SWISS-MODEL] server provides several different options for constructing homology models. The easiest option requires only a target sequence as input. In this mode the program will automatically choose suitable templates and create an input alignment. I would argue, however, that this is not the best way to use such a service: template choice and alignment both may be significantly influenced by biochemical reasoning, and an automated algorithm cannot make the necessary decisions. Should you use a structure of reduced resolution that however has a ligand bound? Should you move an indel from an active site to a loop region even though the sequence similarity score might be less? Questions like that may yield answers that are counter to the best choices an automated algorithm could make. But SWISS-MODEL is flexible and allows us to upload an explicit alignment between target and template. Please note: the model you will produce is "easy" - the sequence similarity is high and there are no indels to consider, so the automated mode would have done just as well. But the strategy we pursue here is suitable also for much more difficult problems. The automated strategy probably is not.
  
 
Template choice is the first step. Often more than one related structure can be found in the PDB. We have touched on principles of selecting template structures in the lectures; please refer to the [[Template_choice_principles|template choice principles]] page on this Wiki where I have reviewed the principles and discussed more details and alternatives. One can either search the PDB itself through its '''Advanced Search''' interface; for example one can search for sequence similarity with a BLAST search, or search for structural similarity by accessing structures according to their CATH or SCOP classification. But the BLAST search is probably the method of choice: after all, the most important measure of the probability of success for homology modeling is sequence similarity.
 
Template choice is the first step. Often more than one related structure can be found in the PDB. We have touched on principles of selecting template structures in the lectures; please refer to the [[Template_choice_principles|template choice principles]] page on this Wiki where I have reviewed the principles and discussed more details and alternatives. One can either search the PDB itself through its '''Advanced Search''' interface; for example one can search for sequence similarity with a BLAST search, or search for structural similarity by accessing structures according to their CATH or SCOP classification. But the BLAST search is probably the method of choice: after all, the most important measure of the probability of success for homology modeling is sequence similarity.
Line 125: Line 164:
  
 
Defining a '''template''' means finding a PDB coordinate set that has sufficient sequence similarity to your '''target''' that you can build a model based on that '''template'''. In  [[BIO_Assignment_Week_2#Structure_search|Assignment 2]] you have used a keyword search at the PDB to find "Mbp1" structures - but some of these structures were not homologs: keyword searches are notoriously unreliable. To find suitable PDB structures, we will perform a BLAST search at the PDB instead.
 
Defining a '''template''' means finding a PDB coordinate set that has sufficient sequence similarity to your '''target''' that you can build a model based on that '''template'''. In  [[BIO_Assignment_Week_2#Structure_search|Assignment 2]] you have used a keyword search at the PDB to find "Mbp1" structures - but some of these structures were not homologs: keyword searches are notoriously unreliable. To find suitable PDB structures, we will perform a BLAST search at the PDB instead.
 +
 +
 +
<!-- NOTE TO SELF: use the following sequence to test the procedure
 +
>Mbp1_SCHPO/2-100 NP_593032
 +
AVHVAVYSGVEVYECFIKGVSVMRRRRDSWLNATQILKVADFDKPQRTRVLERQVQIGAHEKVQGGYGKYQG
 +
TWVPFQRGVDLATKYKVDGIMSPILSL
 +
>1BM8_A
 +
QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQG
 +
TWVPLNIAKQLAEKFSVYDQLKPLFDF
 +
-->
 +
 +
  
  
 
{{task|1=
 
{{task|1=
# Retrieve your YFO Mbp1-like APSES domain sequence. You can find the domain boundaries for the yeast protein in the [[Mbp1_protein_reference_annotation|Mbp1 annotation reference page]], and you can get the aligned sequence from your MSA, or simply recompute it with the <code>needle</code> program of the EMBOSS suite. This YFO sequence is your '''target''' sequence.
+
# Retrieve your YFO Mbp1-like APSES domain sequence. You can find the domain boundaries for the yeast protein in the [[Mbp1_protein_reference_annotation|Mbp1 annotation reference page]], and you can get the aligned sequence from your Jalview alignment, or simply recompute it with the <code>needle</code> program of the EMBOSS suite. This YFO sequence is your '''target''' sequence.
 
# Navigate to the [http://www.pdb.org/pdb/home/home.do PDB].
 
# Navigate to the [http://www.pdb.org/pdb/home/home.do PDB].
 
# Click on '''Advanced''' to enter the advanced search interface.
 
# Click on '''Advanced''' to enter the advanced search interface.
Line 162: Line 213:
 
Unfortunately you don't get the E-values into the report, and those should strongly influence your final decision. However in our case the sequences and therefore the E-values of the top three hits are all the same. None of the structures has a bound DNA ligand, but the experimental methods and structure quality are different. Two of the sequences have a longer chain-length ... but those are only disordered residues (otherwise these would be better suited templates; regrettably, you'd need to check that in the ''real world'', since there is no automatic tool to evaluate disorder and its effects on template choice). In my opinion that leaves pretty much only one unambiguous choice: 1BM8. In case you don't agree, please let me know.
 
Unfortunately you don't get the E-values into the report, and those should strongly influence your final decision. However in our case the sequences and therefore the E-values of the top three hits are all the same. None of the structures has a bound DNA ligand, but the experimental methods and structure quality are different. Two of the sequences have a longer chain-length ... but those are only disordered residues (otherwise these would be better suited templates; regrettably, you'd need to check that in the ''real world'', since there is no automatic tool to evaluate disorder and its effects on template choice). In my opinion that leaves pretty much only one unambiguous choice: 1BM8. In case you don't agree, please let me know.
  
;Finally: navigate to the structure page for your '''template''' and save the FASTA file to your computer.
+
;Finally: Click on the 1BM8 ID to navigate to the structure page for the '''template''' and save the FASTA sequence to your computer. This is '''the template sequence'''.
  
 
}}
 
}}
Line 178: Line 229:
 
It is not straightforward at all how to number sequences in such a project. A "natural" numbering starts with the start-codon of the full length protein and goes sequentially from there. However, this does not map exactly to other numbering schemes we have encountered. As you know the first residue of the APSES domain (as defined by CDD) is not Residue 1 of the Mbp1 protein. The first residue of the 1BM8 FASTA file <small>(one of the related PDB structures)</small> '''is''' the fourth residue of the Mbp1 protein. The first residue in the structure is GLN 3, therefore Q is the first residue in a FASTA sequence derived from the coordinate section of the PDB file (the <code>ATOM  </code> records). In the 1MB1 structure, the original N-terminal amino acids are present in the molecule, therefore they are present in the FASTA file which starts with <code>MSNQIY...</code>, but they are disordered in the structure and no coordinates are present for M and S. A sequence derived explicitly from the coordinates is therefore different from the reported FASTA sequence, which is really bad because that is what the modeling program has to work with  ... and so on. It can get complicated. You need to remember: a sequence number is not absolute, but assigned in a particular context and you need to be careful how to do this.
 
It is not straightforward at all how to number sequences in such a project. A "natural" numbering starts with the start-codon of the full length protein and goes sequentially from there. However, this does not map exactly to other numbering schemes we have encountered. As you know the first residue of the APSES domain (as defined by CDD) is not Residue 1 of the Mbp1 protein. The first residue of the 1BM8 FASTA file <small>(one of the related PDB structures)</small> '''is''' the fourth residue of the Mbp1 protein. The first residue in the structure is GLN 3, therefore Q is the first residue in a FASTA sequence derived from the coordinate section of the PDB file (the <code>ATOM  </code> records). In the 1MB1 structure, the original N-terminal amino acids are present in the molecule, therefore they are present in the FASTA file which starts with <code>MSNQIY...</code>, but they are disordered in the structure and no coordinates are present for M and S. A sequence derived explicitly from the coordinates is therefore different from the reported FASTA sequence, which is really bad because that is what the modeling program has to work with  ... and so on. It can get complicated. You need to remember: a sequence number is not absolute, but assigned in a particular context and you need to be careful how to do this.
  
Fortunately, the numbering for the residues in the coordinate section of our '''target''' structure corresponds not to its FASTA sequence, but to the numbering of the gene. Otherwise we would need to renumber the sequence <small>(e.g. by using the bio3D R package)</small>.
+
Fortunately, the numbering for the residues in the coordinate section of our '''target''' structure corresponds not to its FASTA sequence, but to the numbering of the gene. Otherwise we would need to renumber the sequence <small>(e.g. by using the bio3D R package)</small>. If we did not do this, the sequence numbers in the model might not correspond to the sequence numbers of our target.
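To see how a coordinate-derived sequence and its numbering can be read from a PDB file, consider the following Python sketch. It relies only on the fixed-column layout of <code>ATOM</code> records (residue name in columns 18-20, chain ID in column 22, residue number in columns 23-26). The helper <code>atom_line</code>, the function <code>ca_sequence</code>, and the three example records are hypothetical: they are built for illustration and numbered by position in <code>MSNQIY...</code>, not copied from an actual coordinate file.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def atom_line(serial, atom_name, res_name, chain, res_seq):
    """Build the first 26 columns of an ATOM record (hypothetical helper)."""
    return f"{'ATOM':<6}{serial:>5} {atom_name:^4} {res_name:>3} {chain}{res_seq:>4}"

def ca_sequence(pdb_lines, chain="A"):
    """Return (number of the first residue, one-letter sequence) as
    derived from the C-alpha atoms of one chain."""
    first_number = None
    sequence = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA" or line[21] != chain:
            continue
        if first_number is None:
            first_number = int(line[22:26])
        sequence.append(THREE_TO_ONE.get(line[17:20], "X"))
    return first_number, "".join(sequence)

# Illustrative records only; a real file has many atoms per residue.
records = [
    atom_line(1, "CA", "GLN", "A", 4),
    atom_line(2, "CA", "ILE", "A", 5),
    atom_line(3, "CA", "TYR", "A", 6),
]

first, seq = ca_sequence(records)
print(first, seq)
```

The offset between the position of a residue in the coordinate-derived sequence and its residue number is then simply the number of the first residue minus one.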
  
 
<!--
 
<!--
Line 211: Line 262:
  
 
&nbsp;
 
&nbsp;
The sequence alignment between target and template is the single most important factor that determines the quality of your model. No comparative modeling process will repair an incorrect alignment; it is useful to consider a homology model rather like a three-dimensional map of a sequence alignment rather than a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these only because they are convenient. You should take advantage of the much more sophisticated alignment methods available. Analysis of wrong models can't be expected to produce right results.
+
The sequence alignment between target and template is the single most important factor that determines the quality of your model. No comparative modeling process will repair an incorrect alignment; it is useful to consider a homology model rather like a three-dimensional map of a sequence alignment rather than a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these just because they are convenient. You should take advantage of the much more sophisticated alignment methods available. Analysis of wrong models can't be expected to produce right results.
  
 
The best possible alignment is usually constructed from a multiple sequence alignment that includes at least '''the target and template sequence''' and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. Your alignment should have been carefully reviewed by you and wherever required, manually adjusted to move insertions or deletions between target and template out of the secondary structure elements of the template structure.
 
The best possible alignment is usually constructed from a multiple sequence alignment that includes at least '''the target and template sequence''' and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. Your alignment should have been carefully reviewed by you and wherever required, manually adjusted to move insertions or deletions between target and template out of the secondary structure elements of the template structure.
  
In most of the Mbp1 orthologues, we do not observe indels in the APSES domain regions - (and for the ones in which we do see indels, we might suspect that these are actually gene-model errors). Evolutionary pressure on the APSES domains has selected against indels in the more than 600 million years these sequences have evolved independently in their respective species. To obtain an alignment between the '''template sequence''' and the '''target sequence''' from your species, proceed as follows.  
+
In most of the Mbp1 orthologues, we do not observe indels in the APSES domain regions. Evolutionary pressure on the APSES domains has selected against indels in the more than 600 million years these sequences have evolved independently in their respective species. To obtain an alignment between the '''template sequence''' and the '''target sequence''' from your species, proceed as follows.  
  
  
Line 238: Line 289:
  
 
;Using the EMBOSS explorer
 
;Using the EMBOSS explorer
* Use the <code>needle</code> tool for the alignment  ... but remember that pairwise alignments will only be suitable in casethe alignment is absolutely unambiguous (such as here) . If there are any indels, an MSA will give much more reliable information.
+
* Use the <code>needle</code> tool for the alignment ... but remember that pairwise alignments will only be suitable in case the alignment is absolutely unambiguous (such as here). If there are any indels, an MSA will give much more reliable information.
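Whether an alignment is "unambiguous" can be checked quickly by counting identities and gaps. The Python sketch below does this for two aligned sequences of equal length; the function name <code>alignment_stats</code> is an assumption for illustration, and the example uses the ungapped part of the ASPFU and SACCE APSES sequences shown in the collapsible section above.

```python
def alignment_stats(seq_a, seq_b, gap="-"):
    """Return (matches, compared columns, gapped columns, percent identity)
    for two aligned sequences of equal length. Columns that contain a
    gap in either sequence are not counted as compared."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = compared = gapped = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            gapped += 1
            continue
        compared += 1
        if a == b:
            matches += 1
    identity = 100.0 * matches / compared if compared else 0.0
    return matches, compared, gapped, identity

aspfu = "MRRRGDDWINATHILKVA"
sacce = "MKRKKDDWVNATHILKAA"

matches, compared, gapped, identity = alignment_stats(aspfu, sacce)
print(matches, compared, gapped, round(identity, 1))
```

If the number of gapped columns is not zero, the alignment contains indels and should be reviewed in the context of the MSA before it is used for modeling.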
  
  
Line 273: Line 324:
 
&nbsp;<br>
 
&nbsp;<br>
  
Access the Swissmodel server at '''http://swissmodel.expasy.org''' . Navigate to the '''Alignment Mode''' page.
+
Access the Swissmodel server at '''http://swissmodel.expasy.org''' and click on '''Start Modelling'''. Then, under the '''Supported Inputs''', click on '''Target-Template Alignment'''.
  
 
{{task|1=
 
{{task|1=
*Paste your alignment for target and model into the form field. Refer to the [[Homology_modeling_fallback_data|'''Fallback Data file''']] if you are not sure about the format. Make sure to select the correct option (FASTA) for the alignment input format on the form.
+
*Paste your alignment for target and model into the form field. Click on the question mark next to "Supported Inputs" if you are not sure about the format. SwissModel will analyse the sequences and ask you to identify target and template. The YFO sequence is your target. The 1BM8 sequence is the template.
 
 
* Click '''submit alignment ''' and on the returned page define your '''target''' and '''template''' sequence. For the '''template sequence''' define the PDB ID of the coordinate file it came from. Enter the correct Chain-ID <small>(usually "A", note: upper-case)</small>.
 
:<small>If you run into problems, compare your input to the fallback data. It has worked for me, it will work for you. In particular we have seen problems that arise from "special" characters in the FASTA header like the pipe "<code>{{!}}</code>" character that the NCBI uses to separate IDs - keep the header short and remove all non-alphanumeric characters to be safe.</small>
 
 
 
*Click '''submit alignment''' and review the alignment on the returned page. Make sure it has been interpreted correctly by the server. '''The conserved residues have to be lined up and matching'''. Then click '''submit alignment''' again, to start the modeling process.
 
  
* The resulting page returns information about the resulting model. Save the '''model coordinates''' on your computer. Read the information on what is being returned by the server (click on the red questionmark icon). Study the quality measures.
+
* Click '''Validate Target Template Alignment''' and check that the returned alignment is correct.
}}
 
  
 +
*Click '''Build Model''' to start the modeling process.
  
The server should complete your model within a few minutes and alert you by e-mail. You will also find the results in the Webpage you started the model from.  
+
* The resulting page returns information about the resulting model. Mouse over the '''Model 01''', open the '''PDB file''' and save the coordinates to your computer. Read the information on what is being returned by the server (click on the question mark icon). Study the quality measures.
  
 +
* Also save:
  
{{task|1=
+
** The output page as pdf (for reference)
# Click on '''download model: as pdb'''.
+
** The modeling report (as pdf)
# Also save:
 
## The output page as pdf (for reference)
 
## The "Energy profile".
 
 
}}
 
}}
  
Line 319: Line 363:
  
 
{{task|1=
 
{{task|1=
# Navigate to the [http://users.mccammon.ucsd.edu/~bgrant/bio3d/index.html '''bio3D'''] home page and follow the link to the download section.
+
# Navigate to the [http://thegrantlab.org/bio3d/index.php '''bio3D'''] home page. '''bio3d''' is not available for installation via CRAN, but needs to be installed from source. Instructions for the different platforms are at http://thegrantlab.org/bio3d/tutorials/installing-bio3d. Follow the instructions and install '''bio3d''' for '''R''' on your platform.
# Follow the instructions to install '''bio3D''' for '''R''' on your platform.
+
 
 
# Explore and execute the following '''R''' script. I am assuming that your model is in your working directory, change paths and filenames as required.
 
# Explore and execute the following '''R''' script. I am assuming that your model is in your working directory, change paths and filenames as required.
  
 
<source lang="rsplus">
 
<source lang="rsplus">
 
# renumberPDB.R
 
# renumberPDB.R
# This is an exceedingly simple renumbering script that uses the
+
 
# bio3D package. We simply set the first residue number to what it
+
# This is a simple renumbering script that uses the bio3D
 +
# package. We simply set the first residue number to what it
 
# should be and renumber all residues based on the first one.
 
# should be and renumber all residues based on the first one.
 
# The script assumes your input PDBfile is in your working
 
# The script assumes your input PDBfile is in your working
 
# directory.
 
# directory.
 +
 +
# To run this, you must have installed the bio3D R package; instructions
 +
# are here: http://thegrantlab.org/bio3d/tutorials/installing-bio3d
  
 
setwd("~/my/working/directory")
 
setwd("~/my/working/directory")
 
PDBin      <- "YFO_model.pdb"
 
PDBin      <- "YFO_model.pdb"
 
PDBout    <- "YFO_model_ren.pdb"
 
PDBout    <- "YFO_model_ren.pdb"
 +
 
first <- 4  # residue number that the first residue should have
 
first <- 4  # residue number that the first residue should have
 
   
 
   
Line 366: Line 415:
 
write.pdb(pdb=pdb,file=PDBout)
 
write.pdb(pdb=pdb,file=PDBout)
  
# Done.
+
# Done. Open the PDB file you have written in a text editor and confirm
 +
# that this has worked.
 +
 
 
</source>
 
</source>
 
}}
 
}}
Line 377: Line 428:
 
&nbsp;<br>
 
&nbsp;<br>
  
Previously you have already studied a Mbp1 structure and compared it with your organism's Mbp1 APSES domain. Since a homology model inherits its structural details from the '''template''', the model should look very similar to the original structure but contain the sequence of the '''target'''.
+
Since a homology model inherits its structural details from the '''template''', your model of the YFO sequence should look very similar to the original 1BM8 structure.
  
 
{{task|1=
 
{{task|1=
# Load the '''model''' coordinates you have saved on your computer in VMD.
+
# Start Chimera and load the '''model''' coordinates that you have just renumbered.
# From the PDB, also load the '''template''' structure. (Use File &rarr; New Molecule ...)
+
# From the PDB, also load the '''template''' structure. (Use File &rarr; Fetch by ID ...)
# In the '''Graphics''' &rarr; '''Representations''' window you can switch between the two molecules by clicking on the '''Selected Molecule'''.
+
# In the '''Favorites''' &rarr; '''Model Panel''' window you can switch between the two molecules.
# Choose '''Trace''' as the '''Drawing Method''' and give the two chains distinct colors
+
# Hide the ribbon and choose '''backbone only &rarr; full'''. You will note that the backbone of the two structures is virtually identical.
# The two molecules should already be aligned quite well. To be sure, go back to the VMD main window, choose '''Extensions''' &rarr; '''Analysis''' &rarr; '''RMSD calculator''' and align the two chains.
+
# Next, choose '''Actions &rarr; Atoms/Bonds &rarr; show''' to display the two molecules in a stick style and note how the sidechains have been modeled. Note especially how sidechain coordinates have been guessed where the template had shorter sidechains than the target. It may be clearer if you hide H-atoms: '''Select &rarr; Chemistry &rarr; Element &rarr; H''' and '''Actions &rarr; Atoms/Bonds &rarr; hide'''.
# Note the backbone coordinate differences, if any.
+
# Display only residues 50 to 74 to focus on the putative helix-turn-helix domain. Choose '''Favorites &rarr; Sequence''', select the residues for one model, then '''Select &rarr; Invert (selected model)''' and '''Actions &rarr; Atoms/Bonds &rarr; hide'''.
# Next, display the two molecules in a line or licorice style and note how the sidechains have been modeled. Note especially how sidechain coordinates have been guessed, where the template had shorter sidechains than the target.
+
# Study the result. A model of the HTH domain of YFO Mbp1.
# Display only the selections <code>residue 50 to 74</code> respectively <code>residue 50 to 74 and not element H</code> to confirm that the numbering targets the right residues.
 
 
}}
 
}}
  
Line 393: Line 443:
 
&nbsp;<br>
 
&nbsp;<br>
  
 +
==Coloring the model by energy ==
  
==R code: coloring the model by energy ==
+
SwissModel calculates energies for each residue of the model with a molecular mechanics forcefield. The SwissModel modeling summary page contains a plot of these energies as a function of sequence number. The values, between 0.0 and 1.0, are stored in the B-factor field of the PDB file.
  
Swiss model calculates energies for each residue of the model with a molecular mechanics forcefield. The result summary page contains an image what these energies look like. You have downloaded the Energy profile scores, but it will be useful to be able to map these scores to the actual model.
 
 
 
The general strategy we can use here is to use the '''B factor''' field in the PDB file. As discussed in class, B factors characterize the mobility or disorder of an atom in a crystal structure and VMD allows you to color structures according to their B factors. All we need to do is to get the
 
  
 
{{task|1=
 
{{task|1=
# Back in VMD, undisplay the structure of your model, by selecting the model in the '''Graphical Representations''' window and double-clicking on the representation (in the window where its style, color and selection are listed.) The representation description changes from black to pink.
+
# Back in Chimera, use the model panel to '''close''' the 1BM8 structure.
# Then select the template structure and draw it in licorice style.
+
# Choose '''Tools &rarr; Depiction &rarr; Render by attribute''' and select '''attributes of atoms''', '''Attribute: bfactor''', check '''color atoms''' and click '''OK'''.
# Choose '''Beta''' as the coloring method.
+
# Study the result: It seems that residues in the core of the protein have better energies than residues at the surface. Why could that be the case?
# In the VMD main window, choose '''Graphics''' &rarr; '''Colors''' and select the '''Color Scale''' tab.
 
# Choose '''BWR''' (blue - white - red) as the color scale. This colors low B-values (immobile residues) a cool blue and high B-values (mobile residues) a warm red. Note how there is a tendency for immobile residues in the core and higher B-values on the surface.
 
 
}}
 
}}
  
 
+
Study the options of this window a bit: rendering by attribute is a powerful way to store and depict all manner of information with the molecule. Simply write a little R script that uses bio3D to replace the B-factor or occupancy values with any value you might be interested in: energies, conservation scores, information ... whatever. Then render this property to map it onto the 3D structure of your molecule. If you want to experiment with this a bit, you could apply the information scores from the previous assignment to your model, using a script that is easy to derive from the renumbering R-script you have studied above.
If you examine the model PDB file, you will notice that there are only two B-values used: 99 for "completely made up" coordinates, 50 for all others. Let us thus load the energy value file in R and put these values in the correct PDB field. You have downloaded the Energy profile from SwissModel, right? Formally, the script is a bit similar to the one above, but '''pay special attention''' to the way we use conditional expressions to select exatly the rows and columns we want.
 
 
 
<source lang="rsplus">
 
 
 
# energy2Bvalue.R
 
# This is a simple script that uses the bio3D package to replace

# the B-values of a model with rescaled per-residue scores that are

# read from the SwissModel energy profile file.
 
# The script assumes your input PDBfile is in your working
 
# directory.
 
 
 
setwd("~/Documents/00.0.DEV/35-BCH441 Bioinformatics 2012")  # no backslash escapes: "\ " is not valid in an R string
 
PDBin    <- "SCHPO_model_ren.pdb"
 
PDBout    <- "SCHPO_model_energyB.pdb"
 
Eprof    <- "Local_energy_profile.csv"
 
 
# ================================================
 
#    Read coordinate file
 
# ================================================
 
 
# read PDB file using bio3D function read.pdb()
 
library(bio3d)
 
pdb  <- read.pdb(PDBin) # read the PDB file into a list
 
 
 
# ================================================
 
#    Read energy file
 
# ================================================
 
 
en <- read.csv(Eprof, header=TRUE, sep=" ") # read file
 
 
 
en  # examine contents
 
scores <- unlist(en[,"QMEANlocalScore"])
 
 
 
# normalize "scores" to lie between 0 and 80
 
scores <- 80 * ((scores - min(scores))/(max(scores)-min(scores)))
 
 
 
 
 
# ================================================
 
#    replace B-values with energies
 
# ================================================
 
 
 
 
### get the correct sequence of residue numbers
 
 
 
numbers <- pdb$atom[pdb$atom[,"elety"] == "CA","resno"]
 
 
 
# This might warrant some explanation:
 
# pdb$atom[,"elety"] == "CA" is a logical expression: TRUE for
 
# all rows which are CA atoms.
 
# If this expression appears in the "rows" position of  
 
# pdb$atom[rows, columns], only those rows will be selected for
 
# which the expression is TRUE. From these rows, I collect only
 
# the column with the name "resno". This gives me the residue
 
# numbers, in sequence, assuming every residue has a C-alpha
 
# (CA) atom. Therefore every *index* of numbers[] corresponds to

# the same index of the vector scores[], which holds the scores,

# sequentially; every *value* of numbers[] corresponds to a "resno"

# in pdb$atom[,].
 
 
 
for (i in 1:length(scores)) { # for all values in the scores vector
 
residue <- numbers[i]    # define which residue this is
 
pdb$atom[pdb$atom[,"resno"] == residue,"b"] <- scores[i] # update "b" for
 
                                                        # all atoms in
 
                                                        # this residue
 
}
 
 
 
 
 
# ================================================
 
#    Write output to file
 
# ================================================
 
 
 
write.pdb(pdb=pdb,file=PDBout)
 
 
 
# Done.
 
 
 
</source>
 
 
 
# Load the new coordinate file, color by B-values ("'''Beta'''") and ensure the color scale is "BWR".

# Think carefully about which residues in general are considered reliable (low energy, blue) and which ones are less reliable (pink and red, high B-values). I think one of the quiz questions will probably be about that.
 
  
  
  
 
;That is all.
 
;That is all.
 
  
 
==Links and Resources==
 
==Links and Resources==
Line 499: Line 464:
 
&nbsp;<br>
 
&nbsp;<br>
  
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Peitsch_2002_UseOfModels.pdf '''Review (PDF, restricted)''' Manuel Peitsch on Homology Modeling]
+
{{#pmid: 24782522}}
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Aravind_2005_HTHdomains.pdf '''Review (PDF, restricted)''' Aravind ''et al.'' Helix-turn-helix domains]
+
{{#pmid: 22323219}}
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/2000_Gajiwala_WingedHelixDomains.pdf '''Review (PDF, restricted)''' Gajiwala &amp; Burley, winged-Helix domains]
+
{{#pmid: 12117790}}
 +
 
 +
 
 
:* [http://www.wwpdb.org/documentation/format23/v2.3.html '''PDB file format'''] (see the Coordinate Section if you are unsure about chain identifiers)
 
:* [http://www.wwpdb.org/documentation/format23/v2.3.html '''PDB file format'''] (see the Coordinate Section if you are unsure about chain identifiers)
 
:* [http://en.wikipedia.org/wiki/Structural_alignment Wikipedia on '''Structural Superposition'''] <small>(although the article is called "Structural Alignment")</small>
 
:* [http://en.wikipedia.org/wiki/Structural_alignment Wikipedia on '''Structural Superposition'''] <small>(although the article is called "Structural Alignment")</small>
  
 
;Data
 
 
:* [[Homology_modeling_fallback_data|'''Fallback Data page''']] <small> - Refer to this page in case your own efforts fail, or you have insurmountable problems with your input files.</small>
 
  
  

Revision as of 18:34, 18 November 2014

Assignment for Week 8
Homology Modeling

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.



 

Introduction

How could the search for ultimate truth have revealed so hideous and visceral-looking an object?
Max Perutz (on his first glimpse of the Hemoglobin structure)

   

Where is the hidden beauty in structure, and where, the "ultimate truth"? In the previous assignments we have studied sequence conservation in APSES family domains and we have discovered homologues in all fungal species. This is an ancient protein family that had already duplicated to several paralogues at the time the cenancestor of all fungi lived, more than 600,000,000 years ago, in the Vendian period of the Proterozoic era of Precambrian times.

In order to understand how specific residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to study an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. Explanations of a protein's observed properties and functions can't rely on the general fact that it binds DNA; we need to consider details in terms of specific residues and their spatial arrangement. In particular, it would be interesting to correlate the conservation patterns of key residues with their potential to make specific DNA binding interactions. Unfortunately, no APSES domain structure in complex with bound DNA has been solved up to now, and the experimental evidence we have considered in Assignment 2 (Taylor et al., 2000) is not sufficient to unambiguously define the details of how a DNA double helix might be bound. Moreover, at least two distinct modes of DNA binding are known for proteins of the winged-helix superfamily, of which the APSES domain is a member.

In this and the following assignment you will (1) construct a molecular model of the APSES domain from the Mbp1 orthologue in your assigned species, (2) identify similar structures of distantly related domains for which protein-DNA complexes are known, (3) assemble a hypothetical complex structure and (4) consider whether the available evidence allows you to distinguish between different modes of ligand binding.

For what follows, please remember this terminology:

Target
The protein that you are planning to model.
Template
The protein whose structure you are using as a guide to build the model.
Model
The structure that results from the modeling process. It has the Target sequence and is similar to the Template structure.

 

A brief overview article on the construction and use of homology models is linked to the resource section at the bottom of this page. That section also contains links to other sites and resources you might find useful or interesting.


 

Warm-up: a minimal change

Minimal changes to structure models can be done directly in Chimera. This illustrates the principle of full-scale modeling quite nicely. For an example, let us consider the residue A 42 of the 1BM8 structure. It is oriented towards the core of the protein, but most other Mbp1 orthologs have a larger amino acid in this position, V, or even I.

Task:

  1. Open 1BM8 in Chimera, hide the ribbons and show all atoms as a stick model.
  2. Color the protein white.
  3. Open the sequence window and select A 42. Color it red. Choose Actions → Set pivot. Then study how nicely the alanine sidechain fits into the cavity formed by its surrounding residues.
  4. To emphasize this better, hide the solvent molecules and select only the protein atoms. Display them as a sphere model to better appreciate the packing, i.e. the van der Waals contacts we discussed in class. Use the Favorites → Side view panel to move the clipping plane and see a section through the protein. Study the packing; in particular, note that the additional methyl groups of a valine or isoleucine would not have enough space in the structure. Then restore the clipping planes so you can see the whole molecule.
  5. Let's simplify the view: choose Actions → Atoms/Bonds → backbone only → chain trace. Then select A 42 again in the sequence window and choose Actions → Atoms/Bonds → show.
  6. Add the surrounding residues: choose Select → Zone.... In the window, see that the box is checked that selects all atoms at a distance of less than 5 Å to the current selection, and check the lower box to select the whole residue of any atom that matches the distance cutoff criterion. Click OK and choose Actions → Atoms/Bonds → show.
  7. Select A 42 again: left-click (control click) on any atom of the alanine to select the atom, then up-arrow to select the entire residue. Now let's mutate this residue to isoleucine.
  8. Choose Tools → Structure Editing → Rotamers and select ILE as the rotamer type. Click OK, a window will pop up that shows you the possible rotamers for isoleucine together with their database-derived probabilities; you can select them in the window and cycle through them with your arrow keys. But note that the probabilities are very different - and thus show you high-energy and low-energy rotamers to choose from. Therefore, unless you have compelling reasons to do otherwise, try to find the highest-probability rotamer that may fit. This is where your stereo viewing practice becomes important, if not essential. It is really, really hard to do this reasonably in a 2D image! It becomes quite obvious in 3D. Btw: I find such "quantitative" work - where the real distances are important - easier in orthographic than in perspective view (cf. the Camera panel).
  9. I find that the first rotamer is actually not such a bad fit. The CD atom comes close to the sidechains of I 25 and L 96. But we can assume that these are somewhat mobile and can accommodate a denser packing, because, as you can easily verify in your Jalview alignment, it is NOT the case that sequences that have I 42 have a smaller residue in position 25 and/or 96. So let's accept the most frequent ILE rotamer by selecting it in the rotamer window and clicking OK (while existing side chain(s): replace is selected).
  10. Done.

If you want to go over this in more detail, check the video tutorial on YouTube published by the NIAID bioinformatics group here. I would also encourage you to go over Part 2 of the video tutorial that discusses how to check for and resolve (by energy minimization) steric clashes. But do remember that it is not clear whether energy minimization will make your structure more correct in the sense of a smaller overall RMSD with the real, mutated protein.

What we have done here with one residue is exactly the way homology modeling works with entire sequences. Let's now build a homology model for YFO Mbp1.

Preparation

Target sequence

The first step of homology modelling is to determine which sequence to model. We have determined the putative orthologue with conserved function in YFO by reciprocal best match with Saccharomyces cerevisiae Mbp1. Your sequence was initially found with an APSES domain search in YFO and the alignments with the yeast sequence are straightforward for the most part.

There are two exceptions, however: the alignments of the ASPFU gene XP_754232 and the CAPCO gene XP_007722875 are both missing part of the domain's N-terminus. This is odd, because it may imply that the APSES domains of these genes are not properly folded. When an alignment gives such surprising results, you must consider whether there could be an error in the published sequence, perhaps stemming from an erroneous gene model. This is not absolutely germane to this assignment, so I have placed the process into the collapsible section below as optional reading. However, it may be useful for you to understand what the issue is here and how to address it.

Correcting the ASPFU Mbp1 gene model.


An alignment of APSES domain sequences shows the shortened N-terminus of the ASPFU and CAPCO proteins, relative to SACCE and e.g. the closely related Aspergillus nidulans, ASPNI:

APSES domains:
Mbp1_SACCE  QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAA...
Mbp1_ASPNI  NVYSATYSSVPVYEFKIGTDSVMRRRSDDWINATHILKVA...
Mbp1_ASPFU  ----------------------MRRRGDDWINATHILKVA...
Mbp1_CAPCO  ----------------------MRRRSDDWVNATHILKVA...
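A truncation like this can also be flagged programmatically. The following is a minimal sketch in base R, not part of the original workflow: it counts the gap characters that precede the first residue of each aligned sequence. The names `apses` and `leadingGaps()` are assumptions introduced for illustration; the strings are the alignment rows shown above, without the trailing ellipsis.

```r
# Aligned N-terminal APSES domain sequences from the alignment above.
apses <- c(
  Mbp1_SACCE = "QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAA",
  Mbp1_ASPNI = "NVYSATYSSVPVYEFKIGTDSVMRRRSDDWINATHILKVA",
  Mbp1_ASPFU = paste0(strrep("-", 22), "MRRRGDDWINATHILKVA"),
  Mbp1_CAPCO = paste0(strrep("-", 22), "MRRRSDDWVNATHILKVA")
)

# leadingGaps() is a hypothetical helper: it returns the number of "-"
# characters before the first residue of each aligned sequence.
leadingGaps <- function(x) {
  nchar(sub("^(-*).*$", "\\1", x))
}

gaps <- leadingGaps(apses)
print(gaps)

# Sequences with a suspiciously long run of leading gaps:
print(names(gaps)[gaps > 10])
```

The cutoff of 10 is arbitrary; in a real alignment you would choose it relative to the length of the domain.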

We analyse this for the ASPFU gene.

Working from the possibility that this may be a gene model error - e.g. a false translational start, a frameshift due to a sequencing error, or an erroneously modelled intron, we check whether the translation of the genomic sequence supports the presence of the expected amino acids. This is easily done running TBLASTN - BLASTing the protein query against the six reading frames of the ASPFU genome. We find the following:


Aspergillus fumigatus Af293 chromosome 3, whole genome shotgun sequence
Sequence ID: ref|NC_007196.1| Length: 4079167 Number of Matches: 2
[...]
Query  10       VDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILE ...
                V VYEF     S+M+R+ DDW+NATHILK A F K  RTRILE ...
Sbjct  3691193  VPVYEFKVDGESVMRRRGDDWINATHILKVAGFDKPARTRILE ...

Indeed, there is sequence upstream of the gene's published translation start that matches well with our query! But where is the correct translation start? For that we need to look at the actual nucleotide sequence and translate it. Remember: BLAST is a local sequence alignment algorithm and it won't retrieve everything that matches to our query, just the best matching segment. ASPFU chromosome 3 is over 4 megabases long, so let us try to obtain only the region we are actually interested in: from upstream of base 3691193, let's say 3691100 (make sure this offset is divisible by three, to stay in the same reading frame), to downstream of it, say 3691372.
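The reading-frame bookkeeping is simple modular arithmetic. As a minimal sketch in base R (the helper name `frameOf()` is an assumption introduced here, not an existing function), one can ask in which of the three forward frames of a fetched fragment a codon will be read, given the chromosome coordinate at which the codon starts:

```r
# frameOf() is a hypothetical helper. codonStart is the chromosome
# coordinate of the first base of a codon; fragmentStart is the
# coordinate of the first base of the fragment that was retrieved.
# The result is the forward reading frame (1, 2 or 3) of the fragment
# in which that codon appears.
frameOf <- function(codonStart, fragmentStart) {
  ((codonStart - fragmentStart) %% 3) + 1
}

hitStart <- 3691193              # first subject base of the TBLASTN hit

print(hitStart - 3691100)        # the offset: 93, divisible by three
print(frameOf(hitStart, 3691100))
```

If the fragment starts at an offset that is not a multiple of three, nothing is lost: the codon simply falls into frame 2 or 3, which is one reason to translate all three forward frames.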

  1. At the NCBI genome project site we search for Aspergillus fumigatus.
  2. At the Aspergillus fumigatus genome project site we click on chromosome 3 to access the map viewer.
  3. Hovering over the Download/View sequence link shows us how a URL to access sequence data is structured:
http://www.ncbi.nlm.nih.gov/projects/mapview/seq_reg.cgi?taxid=746128&chr=3&from=1&to=4079167
We can easily adapt this to the sequence range we need ...
  4. ... and follow: http://www.ncbi.nlm.nih.gov/nuccore/NC_007196.1?from=3691003&to=3691243&report=fasta to yield:
>gi|71025130:3691003-3691243 Aspergillus fumigatus Af293 chromosome 3, whole genome shotgun sequence
ACGGTTTGCGGAGACGGGCATTATGGCGGCGGTGGATTTCTCAAAAATCTATTCTGCTACATACAGCAGC
GTAAGTCTCTTCTAATTGCGTATCTCTGTTTTCCCTACAGCCTCAAATTTTCCCCAATGCCTCTTTCCAT
CCATTTTGCCCCTTCCTTCGCCGCGAAGCCAATCTAACGCAGTTCAATAGGTTCCAGTTTACGAGTTCAA
AGTCGATGGCGAAAGTGTTATGCGCCGACGA
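Since the URL follows a simple pattern, it can be assembled for any coordinate range. The sketch below uses only `sprintf()`; the function name `nuccoreFastaURL()` is hypothetical, and the pattern is the one quoted above, which the NCBI may have changed since this page was written.

```r
# nuccoreFastaURL() is a hypothetical helper that fills an accession
# number and a coordinate range into the URL pattern shown above.
nuccoreFastaURL <- function(accession, from, to) {
  sprintf("http://www.ncbi.nlm.nih.gov/nuccore/%s?from=%d&to=%d&report=fasta",
          accession, as.integer(from), as.integer(to))
}

myURL <- nuccoreFastaURL("NC_007196.1", 3691003, 3691243)
print(myURL)

# Number of bases we expect to get back for this range:
nBases <- 3691243 - 3691003 + 1
print(nBases)
```

The expected length is a useful sanity check: the sequence retrieved above has three full lines of 70 bases plus 31 bases on the last line.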


  5. To translate this, we navigate to any of the EMBOSS tools servers and use "remap" - we want to see the translation matched to the nucleotide sequence. We turn restriction sites off, translate all three forward frames and paste and manually align the SACCE Mbp1 sequence into the output to see what we expect and what we got. I have selected only the frame(s) that actually give a match, and I have pasted the homologous CAPCO and SACCE sequences (lower case) to demonstrate their similarity:
ASPFU     ACGGTTTGCGGAGACGGGCATTATGGCGGCGGTGGATTTCTCAAAAATCTATTCTGCTACATACAGCAGC
                                                                        
ASPFU      R  F  A  E  T  G  I  M  A  A  V  D  F  S  K  I  Y  S  A  T  Y  S  S  
CAPCO                           m  -  a  f  d  -  k  e  i  y  s  a  t  y  s  n  
SACCE                           m  s  -  -  -  -  n  q  i  y  s  a  r  y  s  g

         
ASPFU     GTAAGTCTCTTCTAATTGCGTATCTCTGTTTTCCCTACAGCCTCAAATTTTCCCCAATGCCTCTTTCCAT
 
ASPFU     V  S  L  F  *  ... 
CAPCO     v  a  -  -     ...
SACCE     v  d  -  -     ...
         
ASPFU     CCATTTTGCCCCTTCCTTCGCCGCGAAGCCAATCTAACGCAGTTCAATAGGTTCCAGTTTACGAGTTCAA
                                                             ...  V  Y  E  F  K 
CAPCO                                                        ...  v  y  e  l  k 
SACCE                                                        ...  v  y  e  f  i
         
ASPFU      AGTCGATGGCGAAAGTGTTATGCGCCGACGAGGCGATGATTGGATCAATGCTACACATATTCTTAAA

ASPFU       V  D  G  E  S  V  M  R  R  R  G  D  D  W  I  N  A  T  H  I  L  K ...
CAPCO       v  a  g  d  h  i  m  r  r  r  s  d  d  w  v  n  a  t  h  i  l  k ...
SACCE       h  s  t  g  s  i  m  k  r  k  k  d  d  w  v  n  a  t  h  i  l  k ...
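If no EMBOSS server is at hand, the same three-frame translation can be sketched in a few lines of base R. This is an illustration under stated assumptions, not a replacement for remap: it uses the standard genetic code, translates forward frames only, and the names `aa`, `translateFrame()` and `firstLine` are introduced here for the example.

```r
# Standard genetic code; codons are enumerated in T, C, A, G order so
# that they line up with the amino acid string below.
bases  <- c("T", "C", "A", "G")
grid   <- expand.grid(b3 = bases, b2 = bases, b1 = bases,
                      stringsAsFactors = FALSE)
codons <- paste0(grid$b1, grid$b2, grid$b3)
aa     <- strsplit(paste0("FFLLSSSSYY**CC*W", "LLLLPPPPHHQQRRRR",
                          "IIIMTTTTNNKKSSRR", "VVVVAAAADDEEGGGG"), "")[[1]]
names(aa) <- codons

# translateFrame() is a hypothetical helper: it translates one forward
# reading frame (1, 2 or 3) of a DNA string. Stop codons are written
# as "*", codons that are not in the table as "X".
translateFrame <- function(dna, frame = 1) {
  dna <- toupper(dna)
  if (nchar(dna) - frame + 1 < 3) {
    return("")
  }
  starts   <- seq(frame, nchar(dna) - 2, by = 3)
  triplets <- substring(dna, starts, starts + 2)
  residues <- aa[triplets]
  residues[is.na(residues)] <- "X"
  paste(residues, collapse = "")
}

# First line of the genomic fragment shown above.
firstLine <- paste0("ACGGTTTGCGGAGACGGGCATTATGGCGGCGGTGGATTTC",
                    "TCAAAAATCTATTCTGCTACATACAGCAGC")

for (f in 1:3) {
  cat("Frame", f, ":", translateFrame(firstLine, f), "\n")
}
```

For this line, the second frame is the one that contains the residues aligned above (R F A E T G I M A A V ...). Translating frame by frame makes it easy to spot where an exon ends: the frame that matched runs into stop codons or stops resembling the homologous sequences.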


This clearly shows us that there is N-terminal sequence that ought to be added to the gene model, upstream of the reported translational start of MRRR.... The sequences thus most likely begin as follows:
ASPFU   MAAVDFSKIYSATYSSVSLFVYEFKVDGESVMRRRGDDWINATHILK...
CAPCO   ma-fd-keiysatysnva--vyelkvagdhimrrrsddwvnathilk...
SACCE   ms----nqiysarysgvd--vyefihstgsimkrkkddwvnathilk...

The fact that the truncated N-terminus appears in both closely related genes and species suggests that what we see here is a mis-annotated intron. The take-home lesson is: if your retrieved protein sequence does not conform to your expectations, it may be worthwhile to follow up with the actual nucleotide sequence.


 

Template choice and template sequence

The SWISS-MODEL server provides several different options for constructing homology models. The easiest option requires only a target sequence as input. In this mode the program will automatically choose suitable templates and create an input alignment. I would argue however that that is not the best way to use such a service: template choice and alignment both may be significantly influenced by biochemical reasoning, and an automated algorithm cannot make the necessary decisions. Should you use a structure of reduced resolution that however has a ligand bound? Should you move an indel from an active site to a loop region even though the sequence similarity score might be less? Questions like that may yield answers that are counter to the best choices an automated algorithm could make. But Swiss Model is flexible and allows us to upload an explicit alignment between target and template. Please note: the model you will produce is "easy" - the sequence similarity is high and there are no indels to consider, the automated mode would have done just as well. But the strategy we pursue here is suitable also for much more difficult problems. The automated strategy probably is not.

Template choice is the first step. Often more than one related structure can be found in the PDB. We have touched on principles of selecting template structures in the lectures; please refer to the template choice principles page on this Wiki where I have reviewed the principles and discussed more details and alternatives. One can search the PDB itself through its Advanced Search interface: for example, one can search for sequence similarity with a BLAST search, or search for structural similarity by accessing structures according to their CATH or SCOP classification. But the BLAST search is probably the method of choice: after all, the most important measure of the probability of success for homology modeling is sequence similarity.

In Assignment 3, you have defined the extent of the APSES domain in yeast Mbp1. In Assignment 6, you have used PSI-BLAST to search for APSES domains in YFO. In Assignment 7 you have confirmed by Reciprocal Best Match which of these APSES domain sequences is the closest related orthologue to yeast Mbp1. This sequence is the best candidate for having a conserved function similar to yeast Mbp1. Therefore, this sequence is the one you will model: it is called the target for the homology modeling procedure. In the same assignment you have also computed a multiple sequence alignment that includes the sequence of Mbp1 with YFO.

Defining a template means finding a PDB coordinate set that has sufficient sequence similarity to your target that you can build a model based on that template. In Assignment 2 you have used a keyword search at the PDB to find "Mbp1" structures - but some of these structures were not homologs: keyword searches are notoriously unreliable. To find suitable PDB structures, we will perform a BLAST search at the PDB instead.




Task:

  1. Retrieve your YFO Mbp1-like APSES domain sequence. You can find the domain boundaries for the yeast protein in the Mbp1 annotation reference page, and you can get the aligned sequence from your Jalview alignment, or simply recompute it with the needle program of the EMBOSS suite. This YFO sequence is your target sequence.
  2. Navigate to the PDB.
  3. Click on Advanced to enter the advanced search interface.
  4. Open the menu to Choose a Query Type:
  5. Find the Sequence features section and choose Sequence (BLAST...)
  6. Paste your target sequence into the Sequence field, select not to mask low-complexity regions and Submit Query. Since the E-value cutoff is set rather high by default, you will get a number of low-confidence hits as well as the actual homologs; the latter have very low E-values.

All hits that are homologs are potentially suitable templates, but some are more suitable than others. Consider how the coordinate sets differ and which features would make each more or less suitable for creating a homology model: you should consider ...

  • sequence similarity to your target
  • size of expected model (= length of alignment)
  • presence or absence of ligands
  • experimental method and quality of the data set

Sequence similarity is the most important, but we can have the PDB tabulate the other features concisely for this task.

  1. There is a menu to create Reports: select customizable table.
  2. Select (at least) the following information items:
       Structure Summary
         • Experimental Method
       Sequence
         • Chain Length
       Ligands
         • Ligand Name
       Biological details
         • Macromolecule Name
       Refinement Details
         • Resolution
         • R Work
         • R Free
  3. Click Create report.

Unfortunately, you don't get the E-values in the report, and those should strongly influence your final decision. However, in our case the sequences, and therefore the E-values, of the top three hits are all the same. None of the structures has a bound DNA ligand, but the experimental methods and structure quality differ. Two of the sequences have a longer chain length, but the additional residues are disordered (otherwise these structures would be better suited as templates). Regrettably, in the real world you would need to check this yourself: there is no automatic tool to evaluate disorder and its effects on template choice. In my opinion that leaves only one unambiguous choice: 1BM8. If you don't agree, please let me know.

Finally
Click on the 1BM8 ID to navigate to the structure page for the template and save the FASTA sequence to your computer. This is the template sequence.


 


Sequence numbering

 

It is not at all straightforward how to number sequences in such a project. A "natural" numbering starts with the start codon of the full-length protein and proceeds sequentially from there. However, this does not map exactly to other numbering schemes we have encountered. As you know, the first residue of the APSES domain (as defined by CDD) is not residue 1 of the Mbp1 protein. The first residue of the 1BM8 FASTA file (one of the related PDB structures) is the fourth residue of the Mbp1 protein. The first residue in the structure is GLN 3, therefore Q is the first residue in a FASTA sequence derived from the coordinate section of the PDB file (the ATOM records). In the 1MB1 structure, the original N-terminal amino acids are present in the molecule, therefore they are present in the FASTA file, which starts with MSNQIY..., but they are disordered in the structure and no coordinates are present for M and S. A sequence derived explicitly from the coordinates is therefore different from the reported FASTA sequence, which is really bad because that is what the modeling program has to work with ... and so on. It can get complicated. You need to remember: a sequence number is not absolute, but is assigned in a particular context, and you need to be careful about how you do this.

Fortunately, the numbering of the residues in the coordinate section of our template structure corresponds not to its FASTA sequence, but to the numbering of the gene. Otherwise we would need to renumber the sequence (e.g. by using the bio3D R package). If we did not do this, the sequence numbers in the model might not correspond to the sequence numbers of our target.
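
The bookkeeping between "position in a sequence string" and "residue number" can be sketched in a few lines of base R. This is only an illustration: the helper names seqIndexToResNum() and resNumToSeqIndex() are made up for this example, and the value 4 for the first residue follows the statement above that the 1BM8 FASTA sequence starts at the fourth residue of Mbp1.

```r
# Hypothetical helpers (not from any package): convert between the position
# of a residue in a sequence string and its residue number in a numbering
# scheme whose first residue carries the number "first".

seqIndexToResNum <- function(i, first) {
  i + (first - 1)
}

resNumToSeqIndex <- function(resNum, first) {
  resNum - (first - 1)
}

# Start of the template FASTA sequence; its first residue is assumed to be
# residue 4 of the full-length protein.
templateSeq <- "QIYSARYSGV"
first <- 4

seqIndexToResNum(1, first)            # residue number of the first letter
idx <- resNumToSeqIndex(6, first)     # where is residue 6 in the string?
substr(templateSeq, idx, idx)         # which amino acid is found there?
```

The same offset logic applies to any other context, as long as the numbering has no gaps or insertion codes; if it does, a simple offset is not sufficient.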


 


The input alignment

The sequence alignment between target and template is the single most important factor that determines the quality of your model. No comparative modeling process will repair an incorrect alignment; it is useful to think of a homology model as a three-dimensional map of a sequence alignment rather than as a structure in its own right. In a homology modeling project, the largest amount of time should typically be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these just because they are convenient. You should take advantage of the much more sophisticated alignment methods available. Analysis of wrong models can't be expected to produce right results.

The best possible alignment is usually constructed from a multiple sequence alignment that includes at least the target and template sequences, and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. You should have reviewed your alignment carefully and, wherever required, manually adjusted it to move insertions or deletions between target and template out of the secondary structure elements of the template structure.

In most of the Mbp1 orthologues, we do not observe indels in the APSES domain regions. Evolutionary pressure on the APSES domains has selected against indels in the more than 600 million years these sequences have evolved independently in their respective species. To obtain an alignment between the template sequence and the target sequence from your species, proceed as follows.


 

Task:
Choose one of the following options to align your target and template sequences.


In Jalview...
  • Load your Jalview project with aligned APSES domain sequences or recreate it from the Mbp1 orthologue sequences from the Mbp1 protein orthologs page that I prepared for Assignment 7. Include the sequence of your template protein and re-align.
  • Delete all sequences you no longer need, i.e. keep only the APSES domains of the target (from your species) and the template (from the PDB) and choose Edit → Remove empty columns. This is your input alignment.
  • Choose File → Output to textbox → FASTA to obtain the aligned sequences. They should both have exactly the same length, i.e. N- or C-termini have to be padded with hyphens if the original sequences had different lengths. Save the sequences in a text file.


Using a different MSA program
  • Copy the FASTA formatted sequences of the Mbp1 proteins in the reference species from the Reference APSES domain page.
  • Access e.g. the MSA tools page at the EBI.
  • Paste the Mbp1 sequence set, your target sequence and the template sequence into the input form.
  • Run the alignment and save the output.


Using the EMBOSS explorer
  • Use the needle tool for the alignment ... but remember that pairwise alignments are only suitable if the alignment is absolutely unambiguous (such as here). If there are any indels, an MSA will give much more reliable information.


By hand

APSES domains are strongly conserved and have few if any indels. You could also simply align by hand.

  • Copy the CLUSTAL formatted reference alignment of the Mbp1 proteins in the reference species from the Reference APSES domain page.
  • Open a new file in a text editor.
  • Paste the Mbp1 sequence set, your target sequence and the template sequence into the file.
  • Align by hand, replace all spaces with hyphens and save the output.


Whatever method you use, the result should be a two-sequence alignment in multi-FASTA format that was constructed with the help of a number of supporting sequences and that contains your aligned target and template sequences. This is your input alignment for the homology modeling server. For a Schizosaccharomyces pombe model, which I am using as an example here, it looks like this:

>1BM8_A 
QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI
LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
>Mbp1_SCHPO 2-100 NP_593032
AVHVAVYSGVEVYECFIKGVSVMRRRRDSWLNATQILKVADFDKPQRTRV
LERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPILSL
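
Before submitting an alignment, it is worth checking that both rows really have the same number of columns. The following base R sketch does this for the example above and also computes the fraction of identical residues. readMultiFasta() is a helper written for this illustration only, not a function from a package; with a real file you would pass it the result of readLines() on your saved alignment.

```r
# Hypothetical helper: minimal multi-FASTA reader. Takes a character vector
# of lines and returns a named character vector, one string per sequence.
readMultiFasta <- function(lines) {
  lines <- lines[nchar(lines) > 0]               # drop empty lines
  isHeader <- grepl("^>", lines)
  group <- cumsum(isHeader)                      # sequence index of each line
  seqs <- tapply(lines[!isHeader], group[!isHeader], paste, collapse = "")
  seqs <- as.vector(seqs)
  names(seqs) <- trimws(sub("^>", "", lines[isHeader]))
  seqs
}

aliLines <- c(
  ">1BM8_A",
  "QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI",
  "LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF",
  ">Mbp1_SCHPO 2-100 NP_593032",
  "AVHVAVYSGVEVYECFIKGVSVMRRRRDSWLNATQILKVADFDKPQRTRV",
  "LERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPILSL"
)
ali <- readMultiFasta(aliLines)

# Both rows of an alignment must have the same number of columns.
stopifnot(length(unique(nchar(ali))) == 1)

# Fraction of identical residues, counted over columns without gaps.
a <- strsplit(ali[1], "")[[1]]
b <- strsplit(ali[2], "")[[1]]
noGap <- a != "-" & b != "-"
fracId <- sum(a[noGap] == b[noGap]) / sum(noGap)
round(100 * fracId, 1)                           # percent identity
```

If the length check fails, one of the sequences has lost its padding hyphens and the alignment needs to be repaired before it is used for modeling.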


 

Homology model

 


SwissModel

 

Access the SwissModel server at http://swissmodel.expasy.org and click on Start Modelling. Then, under Supported Inputs, click on Target-Template Alignment.

Task:

  • Paste your alignment of target and template into the form field. Click on the question mark next to "Supported Inputs" if you are not sure about the format. SwissModel will analyse the sequences and ask you to identify target and template. The YFO sequence is your target. The 1BM8 sequence is the template.
  • Click Validate Target Template Alignment and check that the returned alignment is correct.
  • Click Build Model to start the modeling process.
  • The resulting page presents information about the model. Mouse over Model 01, open the PDB file and save the coordinates to your computer. Read the information on what is being returned by the server (click on the question mark icon). Study the quality measures.
  • Also save:
    • The output page as pdf (for reference)
    • The modeling report (as pdf)

Model analysis

   

The PDB file

 

Task:
Open your model coordinates in a text editor (make sure you view the PDB file in a fixed-width font such as "courier", so that all the columns line up correctly) and consider the following questions:

  • What is the residue number of the first residue in the model? What should it be, based on the alignment? If the putative DNA binding region was reported to be residues 50-74 in the Mbp1 protein, which residues of your model correspond to that region?


R code: renumbering the model

As you have seen, SwissModel numbers the first residue "1" and does not keep the numbering of the template. We should renumber the model so we can compare the model and the template with the same residue numbers. Fortunately there is a very useful R package that will help us with that.

Task:

  1. Navigate to the bio3D home page. Current versions of bio3d can be installed from CRAN with install.packages("bio3d"); if that does not work on your platform, the package needs to be installed from source. Follow the platform-specific instructions at http://thegrantlab.org/bio3d/tutorials/installing-bio3d to install bio3d for R on your platform.
  2. Explore and execute the following R script. I am assuming that your model is in your working directory; change paths and filenames as required.
# renumberPDB.R

# This is a simple renumbering script that uses the bio3D 
# package. We simply set the first residue number to what it
# should be and renumber all residues based on the first one.
# The script assumes your input PDBfile is in your working
# directory.

# To run this, you must have installed the bio3D R package; instructions
# are here: http://thegrantlab.org/bio3d/tutorials/installing-bio3d

setwd("~/my/working/directory")
PDBin      <- "YFO_model.pdb"
PDBout     <- "YFO_model_ren.pdb"

first <- 4  # residue number that the first residue should have
 
# ================================================
#    Read coordinate file
# ================================================
 
# read PDB file using bio3D function read.pdb()
library(bio3d)
pdb  <- read.pdb(PDBin) # read the PDB file into a list

pdb            # examine the information
pdb$atom[1,]   # get information for the first atom

# you can explore ?read.pdb and study the examples.

# ================================================
#    Change residue numbers
# ================================================


resNum <- as.numeric(pdb$atom[,"resno"])  # get residue numbers for all atoms
resNum <- resNum + (first - resNum[1])         # calculate offset
pdb$atom[,"resno"] <- resNum             # replace old numbers with new
pdb$atom[1,]                                   # check result


# ================================================
#    Write output to file
# ================================================

write.pdb(pdb=pdb,file=PDBout)

# Done. Open the PDB file you have written in a text editor and confirm
# that this has worked.
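
One way to confirm the result without opening an editor is to read the residue numbers straight from the ATOM records. The sketch below uses only base R and relies on the fixed-column PDB format, in which the residue number occupies columns 23 to 26. The function name getResNumbers() and the three example records are made up for illustration; they are not output of SwissModel.

```r
# Hypothetical helper: return the distinct residue numbers found in the
# ATOM records of a PDB file that has been read as a vector of lines.
getResNumbers <- function(pdbLines) {
  atoms <- pdbLines[grepl("^ATOM  ", pdbLines)]
  resNum <- as.integer(substr(atoms, 23, 26))    # columns 23-26: residue number
  unique(resNum)
}

# Made-up ATOM records in PDB format. With a real file, use
#   pdbLines <- readLines("YFO_model_ren.pdb")
# instead.
pdbLines <- c(
  "ATOM      1  N   GLN A   4      10.000  20.000  30.000  1.00  0.50           N",
  "ATOM      2  CA  GLN A   4      11.000  21.000  31.000  1.00  0.50           C",
  "ATOM      3  N   ILE A   5      12.000  22.000  32.000  1.00  0.60           N"
)

resNums <- getResNumbers(pdbLines)
resNums[1]          # should equal the value of "first" in the script above
range(resNums)      # first and last residue number in the file
```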


 

First visualization

 

Since a homology model inherits its structural details from the template, your model of the YFO sequence should look very similar to the original 1BM8 structure.

Task:

  1. Start Chimera and load the model coordinates that you have just renumbered.
  2. From the PDB, also load the template structure. (Use File → Fetch by ID ...)
  3. In the Favorites → Model Panel window you can switch between the two molecules.
  4. Hide the ribbon and choose Actions → Atoms/Bonds → backbone only → full. You will note that the backbones of the two structures are virtually identical.
  5. Next, choose Actions → Atoms/Bonds → show to display the two molecules in a stick style and note how the sidechains have been modeled. Note especially how sidechain coordinates have been guessed where the template had shorter sidechains than the target. It may be clearer if you hide the H-atoms: Select → Chemistry → Element → H and Actions → Atoms/Bonds → hide.
  6. Display only residues 50 to 74 to focus on the putative helix-turn-helix domain. Choose Favorites → Sequence, select the residues for one model, then Select → Invert (selected model) and Actions → Atoms/Bonds → hide.
  7. Study the result: a model of the HTH domain of YFO Mbp1.

 
 

Coloring the model by energy

SwissModel calculates energies for each residue of the model with a molecular mechanics forcefield. The SwissModel modeling summary page contains a plot of these energies as a function of sequence number. The values, which lie between 0.0 and 1.0, are stored in the B-factor field of the PDB file.


Task:

  1. Back in Chimera, use the model panel to close the 1BM8 structure.
  2. Choose Tools → Depiction → Render by attribute and select attributes of atoms, Attribute: bfactor, check color atoms and click OK.
  3. Study the result: It seems that residues in the core of the protein have better energies than residues at the surface. Why could that be the case?

Study the options of this window a bit; rendering by attribute is a powerful way to store and depict all manner of information with the molecule. Simply write a little R script that uses bio3D to replace the B-factor or occupancy values with any value you might be interested in: energies, conservation scores, information ... whatever. Then render this property to map it onto the 3D structure of your molecule. If you want to experiment with this a bit, you could apply the information scores from the previous assignment to your model, using a script that is easy to derive from the renumbering R script you have studied above.
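
As a starting point, here is a minimal sketch of such a replacement. Note that it uses base R string operations rather than bio3D, so that it runs without additional packages; it relies on the fixed-column PDB format (residue number in columns 23 to 26, B-factor in columns 61 to 66). The function name setBfactors(), the example records and the scores are all made up for illustration.

```r
# Hypothetical helper: overwrite the B-factor field of every ATOM record
# with a per-residue score.
#   pdbLines: character vector, one PDB line per element
#   scores:   named numeric vector; names are residue numbers as strings
setBfactors <- function(pdbLines, scores) {
  isAtom <- grepl("^ATOM  ", pdbLines)
  resNum <- as.integer(substr(pdbLines[isAtom], 23, 26))
  newB <- scores[as.character(resNum)]
  stopifnot(!anyNA(newB))                 # every residue needs a score
  # The B-factor field is six characters wide, so scores must be < 1000.
  substr(pdbLines[isAtom], 61, 66) <- sprintf("%6.2f", newB)
  pdbLines
}

# Made-up ATOM records; with a real file, use readLines() on your model.
pdbLines <- c(
  "ATOM      1  N   GLN A   4      10.000  20.000  30.000  1.00  0.50           N",
  "ATOM      2  CA  GLN A   4      11.000  21.000  31.000  1.00  0.50           C",
  "ATOM      3  N   ILE A   5      12.000  22.000  32.000  1.00  0.60           N"
)

# Made-up scores for residues 4 and 5 (e.g. conservation or information).
scores <- c("4" = 0.25, "5" = 0.75)

newLines <- setBfactors(pdbLines, scores)
as.numeric(substr(newLines, 61, 66))      # B-factors after replacement

# writeLines(newLines, "YFO_model_scores.pdb")   # save for use in Chimera
```

All atoms of a residue receive the same value, which is what Render by attribute needs in order to color whole residues consistently.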


That is all.

Links and Resources

 

Biasini et al. (2014) SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Res 42:W252-8. (pmid: 24782522)

Protein structure homology modelling has become a routine technique to generate 3D models for proteins when experimental structures are not available. Fully automated servers such as SWISS-MODEL with user-friendly web interfaces generate reliable models without the need for complex software packages or downloading large databases. Here, we describe the latest version of the SWISS-MODEL expert system for protein structure modelling. The SWISS-MODEL template library provides annotation of quaternary structure and essential ligands and co-factors to allow for building of complete structural models, including their oligomeric structure. The improved SWISS-MODEL pipeline makes extensive use of model quality estimation for selection of the most suitable templates and provides estimates of the expected accuracy of the resulting models. The accuracy of the models generated by SWISS-MODEL is continuously evaluated by the CAMEO system. The new web site allows users to interactively search for templates, cluster them by sequence similarity, structurally compare alternative templates and select the ones to be used for model building. In cases where multiple alternative template structures are available for a protein of interest, a user-guided template selection step allows building models in different functional states. SWISS-MODEL is available at http://swissmodel.expasy.org/.

Bordoli & Schwede (2012) Automated protein structure modeling with SWISS-MODEL Workspace and the Protein Model Portal. Methods Mol Biol 857:107-36. (pmid: 22323219)

Comparative protein structure modeling is a computational approach to build three-dimensional structural models for proteins using experimental structures of related protein family members as templates. Regular blind assessments of modeling accuracy have demonstrated that comparative protein structure modeling is currently the most reliable technique to model protein structures. Homology models are often sufficiently accurate to substitute for experimental structures in a wide variety of applications. Since the usefulness of a model for specific application is determined by its accuracy, model quality estimation is an essential component of protein structure prediction. Comparative protein modeling has become a routine approach in many areas of life science research since fully automated modeling systems allow also nonexperts to build reliable models. In this chapter, we describe practical approaches for automated protein structure modeling with SWISS-MODEL Workspace and the Protein Model Portal.

Peitsch (2002) About the use of protein models. Bioinformatics 18:934-8. (pmid: 12117790)

Protein models can be of great assistance in functional genomics, as they provide the structural insights often necessary to understand protein function. Although comparative modelling is far from yielding perfect structures, this is still the most reliable method and the quality of the predictions is now well understood. Models can be classified according to their correctness and accuracy, which will impact their applicability and usefulness in functional genomics and a variety of situations.




Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.