Difference between revisions of "Template choice principles"

From "A B C"
m
 
(8 intermediate revisions by the same user not shown)
Line 1: Line 1:
The most important step of comparative modelling is a carefully done multiple sequence alignment of the '''target''' sequence with a protein of known structure. However, you can't expect a useful model either, if you use an unsuitable '''template''' and for many templates more than one coordinate file is available.  
+
<div id="BIO">
 +
<div class="b1">
 +
Template choice<br />
 +
<span style="font-size: 70%"> for comparative modeling</span>
 +
</div>
 +
 
 +
 
 +
The most important step of comparative modelling is a carefully done multiple sequence alignment of the '''target''' sequence with a protein of known structure. However, you can't expect a useful model either, if you use an unsuitable '''template''', and for many templates more than one coordinate file is available.  
  
 
;All homologues can contribute template information to your project!
 
;All homologues can contribute template information to your project!
 +
 +
{{Vspace}}
 +
 +
{{Vspace}}
 +
 +
__TOC__
 +
 +
{{Vspace}}
 +
 +
{{Vspace}}
 +
 +
==Searching for templates==
 +
  
 
;How to find a template
 
;How to find a template
* Keyword searches are possible, but unrealiable: there is no guarantee that the keyword you are thinking of is used, rather than a synonym, or that it is correctly spelled.
+
* Keyword searches are possible, but unreliable: there is no guarantee that the researchers who have deposited the structure have used the keyword you are thinking of, or that it is correctly spelled.
* Sequence searches: '''BLAST and PSI-BLAST are the tools of first choice to find homologues structures.''' Try a BLAST search in the PDB subsection of the protein database first. If this is unsuccessful, do a PSI-BLAST search in "nr" and look for homologues sequences that are flagged with the "known structre" icon.
+
* Sequence searches: '''BLAST and PSI-BLAST are the tools of first choice to find homologous structures.''' Try a BLAST search in the PDB subsection of the protein database first. If this is unsuccessful, do a PSI-BLAST search in "nr" and look for homologs that are flagged with the "known structure" icon.
* Use of CATH - a hierarchical classification of the entire PDB will contain domains you can use if you know your protein's folding architecture, however the actual alignment is likely to be very challenging.
+
* Use [http://www.cathdb.info/ '''CATH'''] or [http://scop.mrc-lmb.cam.ac.uk/scop/ '''SCOP''']. These hierarchical classifications of the entire PDB will contain domains that may serve as templates, if you know your protein's folding architecture. Sometimes protein families have diverged in sequence so far that alignments fail. A structural superposition of structures from a family may pinpoint key conserved residues that must be represented in the sequence alignment you use for your modelling procedure.
  
 +
{{Vspace}}
 +
 +
{{Vspace}}
 +
 +
==Evaluating Templates==
 +
 +
Evaluation is based on an accurate alignment between target and template sequence.
 +
 +
{{Vspace}}
 +
 +
===Alignment===
 
;Hard and easy results
 
;Hard and easy results
* Since structural similarity correlates with sequence similarity, use the structure with the highest degree of % sequence identity (not alignment score) as a template. Easy results are those where no indels ave to be considered. Modeling indels is unreliable. In selected cases you may consider using a closely related template overall, but importing a same-length loop from a more distantly related template.
+
* Since structural similarity correlates with sequence similarity, use the structure with the highest degree of % sequence identity (not alignment score) as a template. ''Easy'' modeling tasks are those where no indels have to be considered. Structural modeling of indels is always unreliable. ''Hard'' modeling tasks have significant indels or uncertain alignments over the length of the target. In selected cases you may consider using a template of high sequence identity to model the global fold, and then to import coordinates for a loop of the same length as your target from a more distantly related template. Whether such a loop will have the same conformation as your target protein depends on whether the loop length has been ''conserved'' from a shared ancestor, or whether it has changed, and then ''converged'' to your target sequence. If you have a phylogenetic tree available, you may be able to figure this out. Nevertheless, that template at least provides an example of a low-energy loop configuration of the correct length in the global context of the target protein.
  
 +
{{Vspace}}
 +
 +
===Suitability===
 
;Assessing suitability
 
;Assessing suitability
  
The model must be relevant to your protein's function! If you have a choice:
+
The model must be '''relevant''' to your protein's function! If you have a choice:
  
* Choose orthologues over paralogues;
+
* Choose orthologs fulfilling the ''Reciprocal Best Match'' criterion over paralogs that may be functionally diverged;
 
* Choose protein-ligand complexes over unliganded structures;
 
* Choose protein-ligand complexes over unliganded structures;
* Choose structures in a functional state (bound inhibitor? Heterooligomer?) over free structures;
+
* Choose structures in a functional state (bound inhibitor? heterooligomer? phosphorylated? proteolytic processing?) over free, unmodified structures;
* Choose native sequences over mutated sequences (incl. His-tag, SeMet, post-translational modifications);
+
* Choose native sequences over mutated sequences (incl. His-tag, SeMet, non-physiological post-translational modifications);
* Chose coordinate sets in which the regions of interest are well ordered over regions that are locally disordered and have high B-factors,  or regions that are highly divergent in of NMR model sets;
+
* Choose coordinate sets in which the regions of interest are well ordered over regions that are locally disordered and have high B-factors,  or regions that are highly divergent in NMR model sets;
 
* Choose structures where crystal packing contacts are distant from regions of interest over those where crystal packing may introduce conformational artefacts.
 
* Choose structures where crystal packing contacts are distant from regions of interest over those where crystal packing may introduce conformational artefacts.
  
Line 26: Line 60:
 
Use the highest-quality structure available:
 
Use the highest-quality structure available:
 
* Use the structure with the best resolution (low values: 2.0 &Aring; is better than 2.5 &Aring;).
 
* Use the structure with the best resolution (low values: 2.0 &Aring; is better than 2.5 &Aring;).
* Treat NMR structures like crystal structures with a resulution (at best) worse than 2.5 &Aring;
+
* Treat NMR structures like crystal structures with a resolution (at best) worse than 2.5 &Aring;
* Well refined structures have R-values better than 10% of their nominal resolution.  
+
* Well refined structures have R-values better than 10% of their nominal resolution (e.g. 2&Aring;: R&lt; 0.2).  
* R-free, and R-merge are additional quality metrics ... but are difficult to assess for the non-expert.
+
* R-free, and R-merge are additional quality metrics ... but are difficult to assess for the non-expert. Here too: lower is better.
 +
 
 +
 
 +
 
 +
{{Vspace}}
 +
 
 +
{{Vspace}}
 +
 
 +
 
 +
[[Category:Bioinformatics]]
 +
</div>

Latest revision as of 00:35, 7 January 2017

Template choice
for comparative modeling


The most important step of comparative modelling is a carefully constructed multiple sequence alignment of the target sequence with a protein of known structure. However, even a good alignment cannot yield a useful model if the template is unsuitable, and for many templates more than one coordinate file is available.

All homologues can contribute template information to your project!


 


 


 


 

Searching for templates

How to find a template
  • Keyword searches are possible, but unreliable: there is no guarantee that the researchers who have deposited the structure have used the keyword you are thinking of, or that it is correctly spelled.
  • Sequence searches: BLAST and PSI-BLAST are the tools of first choice to find homologous structures. Try a BLAST search in the PDB subsection of the protein database first. If this is unsuccessful, do a PSI-BLAST search in "nr" and look for homologs that are flagged with the "known structure" icon.
  • Use CATH or SCOP. These hierarchical classifications of the entire PDB will contain domains that may serve as templates, if you know your protein's folding architecture. Sometimes protein families have diverged in sequence so far that alignments fail. A structural superposition of structures from a family may pinpoint key conserved residues that must be represented in the sequence alignment you use for your modelling procedure.
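
As a rough illustration of the sequence-search step, the sketch below filters and ranks BLAST hits. It assumes the search results have been saved in the standard 12-column tabular format (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The function name, the E-value cutoff of 1e-5, and the example rows are hypothetical placeholders, not values taken from this page.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    subject: str             # identifier of the database sequence
    percent_identity: float  # column 3 of the tabular output
    alignment_length: int    # column 4
    evalue: float            # column 11


def parse_blast_tabular(text: str, max_evalue: float = 1e-5) -> List[Hit]:
    """Keep hits at or below max_evalue, sorted by % identity (highest first)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hit = Hit(fields[1], float(fields[2]), int(fields[3]), float(fields[10]))
        if hit.evalue <= max_evalue:
            hits.append(hit)
    return sorted(hits, key=lambda h: h.percent_identity, reverse=True)


# Invented rows for illustration only; the subject names are not real PDB entries.
EXAMPLE = "\n".join([
    "target\ttmplB\t35.0\t180\t110\t4\t5\t184\t3\t180\t2e-20\t95.1",
    "target\ttmplA\t62.5\t200\t75\t0\t1\t200\t1\t200\t1e-80\t250",
    "target\ttmplC\t28.0\t60\t43\t1\t10\t69\t7\t66\t0.5\t30.2",
])

if __name__ == "__main__":
    for hit in parse_blast_tabular(EXAMPLE):
        print(hit.subject, hit.percent_identity, hit.evalue)
```

Sorting by % identity rather than by bit score follows the advice in the Alignment section; the E-value filter only removes hits that are unlikely to be homologs at all.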


 


 

Evaluating Templates

Evaluation is based on an accurate alignment between target and template sequence.


 

Alignment

Hard and easy results
  • Since structural similarity correlates with sequence similarity, use the structure with the highest degree of % sequence identity (not alignment score) as a template. Easy modeling tasks are those where no indels have to be considered. Structural modeling of indels is always unreliable. Hard modeling tasks have significant indels or uncertain alignments over the length of the target. In selected cases you may consider using a template of high sequence identity to model the global fold, and then to import coordinates for a loop of the same length as your target from a more distantly related template. Whether such a loop will have the same conformation as your target protein depends on whether the loop length has been conserved from a shared ancestor, or whether it has changed, and then converged to your target sequence. If you have a phylogenetic tree available, you may be able to figure this out. Nevertheless, that template at least provides an example of a low-energy loop configuration of the correct length in the global context of the target protein.
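
To make the easy/hard distinction concrete, the following sketch computes % sequence identity and counts indels from one aligned target/template pair. It assumes both sequences are rows taken from the same alignment, with "-" as the gap character, and it defines % identity over the columns in which both sequences have a residue; other conventions exist. The function name and the example sequences are hypothetical.

```python
def alignment_stats(target_aln: str, template_aln: str) -> dict:
    """Return % identity, number of aligned columns, and number of indels."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")

    identical = 0
    aligned = 0
    indels = 0
    gap_side = None  # which sequence currently carries an open gap, if any

    for a, b in zip(target_aln, template_aln):
        if a == "-" and b == "-":
            continue  # column is empty in both rows (gap only elsewhere in the MSA)
        if a == "-" or b == "-":
            side = "target" if a == "-" else "template"
            if side != gap_side:
                indels += 1  # a new gap run starts here
            gap_side = side
        else:
            gap_side = None
            aligned += 1
            if a.upper() == b.upper():
                identical += 1

    percent_identity = 100.0 * identical / aligned if aligned else 0.0
    return {"percent_identity": percent_identity,
            "aligned": aligned,
            "indels": indels}


if __name__ == "__main__":
    # Invented sequences: one substitution, one insertion, one deletion.
    stats = alignment_stats("MKV-LTAGQ", "MRVALT--Q")
    print(stats)
    print("easy" if stats["indels"] == 0 else "hard", "modeling task")
```

A pair with zero indels would count as an easy task in the sense used above; any indel means that part of the model is less reliable.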


 

Suitability

Assessing suitability

The model must be relevant to your protein's function! If you have a choice:

  • Choose orthologs that fulfil the Reciprocal Best Match criterion over paralogs that may have diverged in function;
  • Choose protein-ligand complexes over unliganded structures;
  • Choose structures in a functional state (bound inhibitor? heterooligomer? phosphorylated? proteolytic processing?) over free, unmodified structures;
  • Choose native sequences over mutated sequences (incl. His-tag, SeMet, non-physiological post-translational modifications);
  • Choose coordinate sets in which the regions of interest are well ordered over those in which these regions are locally disordered with high B-factors, or highly divergent across the models of an NMR ensemble;
  • Choose structures where crystal packing contacts are distant from regions of interest over those where crystal packing may introduce conformational artefacts.
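
The Reciprocal Best Match criterion in the first item can be sketched as follows: two proteins from different genomes satisfy it if each is the other's best hit when searching the other genome. The sketch assumes the best hits have already been determined (for example from BLAST searches in both directions) and stored in two dictionaries; the function name and all identifiers are hypothetical.

```python
from typing import Dict, List, Tuple


def reciprocal_best_matches(best_a_to_b: Dict[str, str],
                            best_b_to_a: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (a, b) pairs in which a's best hit is b and b's best hit is a."""
    return sorted(
        (a, b)
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    )


if __name__ == "__main__":
    # Invented identifiers for illustration only.
    a_to_b = {"targetA1": "tmplB1", "targetA2": "tmplB1", "targetA3": "tmplB3"}
    b_to_a = {"tmplB1": "targetA1", "tmplB3": "targetA9"}
    print(reciprocal_best_matches(a_to_b, b_to_a))
```

In this example only targetA1 and tmplB1 qualify: targetA2 also hits tmplB1 best, but tmplB1 does not return the favour, which is the pattern one might expect from a paralog.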
Assessing quality

Use the highest-quality structure available:

  • Use the structure with the best resolution (low values: 2.0 Å is better than 2.5 Å).
  • Treat NMR structures like crystal structures with a resolution of (at best) worse than 2.5 Å.
  • Well-refined structures have R-values below one tenth of their nominal resolution in Å (e.g. 2.0 Å: R < 0.20).
  • R-free and R-merge are additional quality metrics, but they are difficult for the non-expert to assess. Here too: lower is better.
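
The resolution and R-value rules of thumb above can be written down as a small sketch. It assumes resolution is given in Å and the R-value as a fraction; the class, the function names, and the three example entries are hypothetical, and breaking resolution ties by R-value is a choice made for this example only.

```python
from typing import List, NamedTuple


class Structure(NamedTuple):
    label: str         # any identifier for the coordinate file
    resolution: float  # nominal resolution in Angstrom (lower is better)
    r_value: float     # crystallographic R-value as a fraction (lower is better)


def is_well_refined(structure: Structure) -> bool:
    """Rule of thumb: R-value below one tenth of the nominal resolution."""
    return structure.r_value < structure.resolution / 10.0


def rank_by_quality(structures: List[Structure]) -> List[Structure]:
    """Sort by resolution first, then by R-value (both ascending)."""
    return sorted(structures, key=lambda s: (s.resolution, s.r_value))


if __name__ == "__main__":
    # Invented values for illustration only.
    candidates = [
        Structure("fileA", 2.0, 0.19),
        Structure("fileB", 2.5, 0.27),
        Structure("fileC", 1.8, 0.21),
    ]
    for s in rank_by_quality(candidates):
        print(s.label, s.resolution, s.r_value, is_well_refined(s))
```

Here fileC has the best resolution but fails the R-value rule of thumb, so the ranking and the refinement check should be read together rather than relying on one number.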