Steipe Lab - Canonical Sequence Approximation

[ overview ] [ V_k-ALL ] [ V_l-HUMAN ] [ V_l-MOUSE ] [ V_H-ALL ] [ V_H-HUMAN ] [ V_H-MOUSE ]

The Canonical Sequence Approximation: A Rational Route to Intrabodies
Lecture slides from the CHI 3rd annual "Recombinant Antibodies" conference, Cambridge, MA, April 24, 2002.
Abstract: To address the stability problems of immunoglobulin domains in the reducing environment of the cytoplasm, we have devised a strategy for rational stability engineering based on consensus sequences. Point mutations are predicted from sequence alignments and provide incremental increases of stability. These can be combined in existing domains, or used for the design of hyperstable frameworks and routinely allow expression of the target domains as soluble intrabodies with high yield. The method is straightforward, does not require knowledge of the structure, is applicable to scFvs as well as isolated domains and improves expression yields and solubility at the same time.

Concept

The canonical sequence approximation derives from a conceptual link between statistical thermodynamics and observed sequence distributions in proteins. Its purpose is to describe sequences in terms of their information content towards generating the protein's folded structure. This bears on the protein's thermodynamic stability: stability is the experimental metric of the protein folding problem. This correlation can be used to predict stabilizing mutations in protein domains. If we consider a group of sequences that represent a single fold, then in the framework of statistical thermodynamics that fold can be viewed as a "system", and the individual amino acid positions of the fold are the "components" or "particles" of the system. Each component can occur in one of 21 "microstates": it can be one of the 20 proteinogenic amino acids or a gap. The set of all specific sequences, or "macrostates", that are compatible with the fold would be a canonical ensemble. We may call the specific sequence that best represents that ensemble, i.e. the consensus sequence of the canonical ensemble, a canonical sequence. If we introduce a number of assumptions, we can attempt to approximate that sequence from an analysis of observed sequences.
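As a minimal sketch of the idea, the consensus of a set of aligned sequences can be computed column by column, treating each position as a component with 21 possible states. The sequences in `toy_alignment` are invented for illustration and are not taken from the Kabat database; the function names are likewise hypothetical.

```python
from collections import Counter

# 20 proteinogenic amino acids plus the gap character: 21 "microstates".
STATES = "ACDEFGHIKLMNPQRSTVWY-"


def column_frequencies(alignment):
    """Return one {state: relative frequency} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must be aligned to the same length")
    frequencies = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in alignment)
        total = sum(counts.values())
        frequencies.append({state: n / total for state, n in counts.items()})
    return frequencies


def consensus(alignment):
    """Pick the most frequent state in every column."""
    return "".join(max(col, key=col.get) for col in column_frequencies(alignment))


# Invented toy alignment (assumption, for illustration only).
toy_alignment = ["DIVMTQ", "DIVLTQ", "DIQMTQ", "EIVMTQ", "DIVMT-"]
print(consensus(toy_alignment))
```

Ties between equally frequent states are resolved arbitrarily here (the first state encountered wins); a real analysis would need to treat such positions as undetermined.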
This concept originated from the observation that the stability effects of point mutations in an immunoglobulin V_k domain did not correlate with simple concepts for predicting the effects of such mutations. We had prepared variants with methionine, leucine and isoleucine in position 21 of the recombinantly expressed murine V_k domain of McPC603:

Effects of V_k residue-21 mutations.

| Amino acid | Rotatable bonds in side chain (1) | Free energy of transfer, kJ/mol (2) | Water molecule (3) | Frequency (4) | Stability, kJ/mol (5) |
|---|---|---|---|---|---|
| Leu 21 | 2 | -7.5 | partially present | 12 % | -12.2 |
| Met 21 | 3 | -5.4 | present | 19 % | -13.5 |
| Ile 21 | 2 | -12.1 | absent | 66 % | -14.5 |

(1) The number of rotatable bonds correlates with the loss in entropy upon folding of the protein.
(2) The free energy of transfer measures the change in solvation energy on taking a side chain from the hydrated environment of the unfolded state to the hydrophobic core of a folded protein. (Data from Nozaki, Y. & Tanford, C. (1971) JBC 246:2211.)
(3) A conserved structural water molecule is found in the core of many V_k domains, adjacent to amino acid 21. It makes very good hydrogen bonds to donors and acceptors from three strands of the protein. From inspection, one would expect the presence or absence of this water molecule to have a significant effect on stability.
(4) The frequency is given as the percentage of occurrence of the respective amino acid in position 21 in the Kabat Database of immunoglobulin sequences.
(5) Stability is measured experimentally from unfolding transition curves by denaturation in urea. The change from the folded to the unfolded state is monitored with the fluorescence of the domain's single tryptophan.
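The table can be checked for qualitative agreement with a short script. The sketch below counts, for each numerical property, in how many of the three pairwise comparisons the property orders two variants the same way as the measured stability. The sign conventions are assumptions made for this illustration: fewer rotatable bonds, a more negative free energy of transfer and a higher frequency are each taken to predict a more stable variant. The water molecule column is left out because the table does not assign it a direction.

```python
from itertools import combinations

# Values copied from the table above (position 21 of the McPC603 V_k domain).
VARIANTS = {
    "Leu": {"rotatable_bonds": 2, "transfer_kj": -7.5, "frequency": 0.12, "stability_kj": -12.2},
    "Met": {"rotatable_bonds": 3, "transfer_kj": -5.4, "frequency": 0.19, "stability_kj": -13.5},
    "Ile": {"rotatable_bonds": 2, "transfer_kj": -12.1, "frequency": 0.66, "stability_kj": -14.5},
}

# +1: a larger value is assumed to predict higher stability; -1: a smaller value.
PREDICTORS = {"rotatable_bonds": -1, "transfer_kj": -1, "frequency": +1}


def concordant_pairs(prop, sign):
    """Count variant pairs that the property ranks like the measured stability."""
    count = 0
    for a, b in combinations(VARIANTS, 2):
        predicted = sign * (VARIANTS[a][prop] - VARIANTS[b][prop])
        # More negative stability value = more stable, hence the sign flip.
        observed = -(VARIANTS[a]["stability_kj"] - VARIANTS[b]["stability_kj"])
        if predicted * observed > 0:  # ties count as "no prediction"
            count += 1
    return count


results = {prop: concordant_pairs(prop, sign) for prop, sign in PREDICTORS.items()}
for prop, n in results.items():
    print(f"{prop}: {n} of 3 pairs ordered like the measured stability")
```

Under these assumptions only the frequency orders all three pairs correctly; the rotatable bonds get one pair right and the free energy of transfer two.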

As can be seen from the table, considering these amino acid properties does not predict the observed stabilities of the three variants, not even qualitatively. Only the frequency of observation of the amino acids in the Kabat database of sequences of proteins of the immune system follows the experimentally observed stabilities. This raises the question of whether this connection between frequency and stability has any significance. Of course, the relationship between the observed frequency of a system's state and the energy associated with this state is the realm of statistical thermodynamics. Boltzmann's law describes quantitatively how the energy of a system is distributed among the states of its components: low-energy states are frequent and high-energy states are rare. Interestingly, the Boltzmann equation can be derived without any reference to energy, particles, temperature or other physical quantities. The equation follows simply from mathematical combinatorics. Systems to which the Boltzmann equation applies have to conform to a few very general requirements:

The number of elements in the system has to be constant.
The system possesses a certain quantity - "energy" - which is constant.
This quantity is distributed freely among the elements and stochastically redistributed.
The state an element is in - i.e. the quantity of "energy" associated with it - is independent of the states of other elements.
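These requirements can be illustrated with a small simulation that contains no physics at all. In the sketch below, a fixed number of elements share a fixed number of abstract "quanta", which are passed at random from one element to another. The parameter values (1000 elements, one quantum per element, 200,000 exchange attempts) and the function name are arbitrary choices for this illustration, not part of the original work.

```python
import random
from collections import Counter


def simulate_exchange(n_elements=1000, quanta_per_element=1,
                      n_steps=200_000, seed=0):
    """Randomly redistribute a conserved quantity among a fixed set of elements."""
    rng = random.Random(seed)
    quanta = [quanta_per_element] * n_elements  # constant number of elements
    for _ in range(n_steps):
        donor = rng.randrange(n_elements)
        acceptor = rng.randrange(n_elements)
        if quanta[donor] > 0:          # an empty element cannot give anything
            quanta[donor] -= 1         # the total stays constant:
            quanta[acceptor] += 1      # one quantum moves from donor to acceptor
    return quanta


quanta = simulate_exchange()
histogram = Counter(quanta)
for level in range(5):
    print(level, histogram[level] / len(quanta))
```

Although every element starts in the same state, the population of each level is expected to fall off roughly geometrically with the amount of "energy" it holds (close to 1/2, 1/4, 1/8, ... for an average of one quantum per element), which is the exponential form of a Boltzmann distribution.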

The application of the Boltzmann equation to an ensemble of immunoglobulin domains is based on the concept of an immunoglobulin repertoire that approximates a canonical ensemble of sequences. Each sequence is derived from one of a set of germ-line sequences and selected in a process of random, independent mutations to be compatible with every aspect of antibody function. Such an ensemble is expected to have two properties: A) the average level of domain stability is marginal and B) the ensemble is in a state of equilibrium with respect to sequence changes affecting stability. These properties follow from the fact that while destabilizing random mutations are highly probable, they are selectively neutral as long as the overall domain stability does not fall below a certain threshold. Conversely, stabilizing random mutations are highly improbable, and there is no positive selection for them once stability exceeds that threshold.

The Immunoglobulin Domain "Ensemble": If a number of immunoglobulin domains are superimposed, at low resolution they all look very similar, despite significant sequence differences in typically more than 20 % of their residues. In the Canonical Sequence Approximation, we imagine this low-resolution superposition to comprise an averaged environment for individual amino acid changes, which are otherwise random and independent. The fitness of each domain (approximated by its thermodynamic stability) is assumed to be (nearly) constant: severely disruptive mutations are not observed in the sequence database since they do not lead to functional immunoglobulins. Under these assumptions, the most probable amino acid frequency distribution is a Boltzmann distribution and the consensus residue is the fittest (most stable) choice. We have corroborated this by experiment.

Our canonical sequence approximation states that the most probable distribution of amino acids at a specific position is given by Boltzmann's law. To the extent that this is true, we can calculate a statistical "free energy" from the frequencies of observation. This statistical "free energy" quantifies selection on that position: the deviation of the observed distribution from randomness. Obviously, sequence changes need to be independent and the observed distributions have to approximate the underlying probabilities well. As a result, to the degree that stability contributes to the selective pressure on a specific position, the statistical "free energy" will correlate with the free energy of folding.
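A minimal sketch of this calculation follows. It assumes the usual Boltzmann form, in which the ratio of two frequencies f_i / f_ref equals exp(-ΔG_i / RT), and reports each residue's statistical "free energy" relative to the most frequent residue at the position. The temperature of 298.15 K is an assumption made for this illustration; the frequencies are those given for position 21 in the table above.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol K)


def statistical_free_energies(frequencies, temperature_k=298.15):
    """Statistical "free energy" (kJ/mol) of each residue relative to the
    most frequent one, from f_i / f_ref = exp(-dG_i / RT)."""
    rt = R_KJ * temperature_k
    f_ref = max(frequencies.values())
    return {
        residue: -rt * math.log(f / f_ref)
        for residue, f in frequencies.items()
        if f > 0  # residues that were never observed have no finite value
    }


# Observed frequencies at position 21 (from the table above).
position_21 = {"Leu": 0.12, "Met": 0.19, "Ile": 0.66}
dg_stat = statistical_free_energies(position_21)
for residue, dg in sorted(dg_stat.items(), key=lambda item: item[1]):
    print(f"{residue}: {dg:+.2f} kJ/mol relative to the consensus residue")
```

With these inputs the values come out at roughly 4.2 kJ/mol for Leu and 3.1 kJ/mol for Met relative to Ile. That is the same order as the measured stability differences in the table (2.3 and 1.0 kJ/mol), though larger in magnitude.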
In order to predict the effects of point mutations on stability, two approximations are introduced. The first approximation is that the Kabat database of sequences of immunoglobulins represents the canonical ensemble. To the extent that this is true, the underlying probabilities are obtained by simply averaging over the sequences of the database. We expect deviations to arise from significant sampling errors of the database (e.g. bias for the capacity to bind small haptens, bias towards a few intensively studied sequence families, errors from species-specific differences...), but simple averaging avoids requiring detailed assumptions about sequence evolution in the immune system. The second approximation is that selection is only for domain stability. We expect systematic overestimation of stability effects to arise from the contributions of other factors, common to all domains, that influence the process of selection on some positions. However, no position of the domain can be freely mutated without any effect on stability, and even if selective pressure is towards a different factor, severely disruptive mutations are never allowed. Since antigen binding imposes specific constraints on individual domains, the effects of the requirements for antigen binding should average out over the range of all observed sequences, and the prediction should also hold true in the complementarity determining regions.
Finally, this allows us to predict sequence improvements. Whenever a residue can be replaced with one that is observed significantly more frequently in the database, this replacement is expected to stabilize the domain. Further, individual replacements can be combined; their effects are expected to be additive.
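The prediction step can be sketched as a scan over positions: wherever the consensus residue is sufficiently more frequent than the residue actually present, the replacement is proposed. Everything specific in the example is an assumption for illustration: the function names, the default threshold of 3:1, the temperature, the five-residue toy domain and its per-position frequencies (only the second column reuses the position-21 frequencies from the table above), and the simple 1-based position index, which is not Kabat numbering.

```python
import math

RT_KJ = 8.314462618e-3 * 298.15  # kJ/mol at an assumed 298.15 K


def suggest_replacements(domain, column_freqs, min_ratio=3.0):
    """List (position, present, proposed, frequency ratio) for positions where
    the consensus residue is at least min_ratio times more frequent."""
    suggestions = []
    for pos, (present, freqs) in enumerate(zip(domain, column_freqs), start=1):
        best = max(freqs, key=freqs.get)
        if best == present or best == "-":
            continue  # already consensus, or the consensus is a gap
        f_present = freqs.get(present, 0.0)
        ratio = freqs[best] / f_present if f_present > 0 else float("inf")
        if ratio >= min_ratio:
            suggestions.append((pos, present, best, ratio))
    return suggestions


def combined_statistical_free_energy(suggestions):
    """Sum -RT ln(ratio) over the suggestions, assuming additive effects."""
    return sum(-RT_KJ * math.log(ratio) for _, _, _, ratio in suggestions)


# Hypothetical domain fragment and per-position frequencies.
toy_domain = "ELVMT"
toy_freqs = [
    {"D": 0.80, "E": 0.20},
    {"I": 0.66, "M": 0.19, "L": 0.12},
    {"V": 0.60, "Q": 0.40},
    {"M": 0.70, "L": 0.30},
    {"T": 1.00},
]

suggestions = suggest_replacements(toy_domain, toy_freqs)
for pos, present, proposed, ratio in suggestions:
    print(f"position {pos}: {present} -> {proposed} (frequency ratio {ratio:.1f}:1)")
print(f"combined: {combined_statistical_free_energy(suggestions):.2f} kJ/mol")
```

The combined value is a statistical estimate only. Given the second approximation described above, it should be read as an upper bound on the expected stabilization rather than as a predicted measurement.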
To use this principle, you can access the precompiled statistics for residue frequencies and distributions for V_k, V_l and V_H domains through the links at the header and footer of this page.
Even though we have used assumptions that need to be carefully discussed, and the procedure is not strictly falsifiable (since the relative contribution of stability and other factors to the selective pressure on a position is not known), it has proven to be very useful for stability engineering of immunoglobulin domains and other proteins, such as p53, SH3 and WW domains and phytase. It is successful in the absence of structural information and it can be extended to other protein families, provided that sequence divergence is sufficiently small to make co-variation of residues or subdomains unlikely. We have obtained correct predictions from frequency ratios of 3:1 or even 2:1 suggesting that the minimum number of sequences needed for meaningful predictions can be correspondingly small. Even though the precise sequence of the improved domains is probably not already in the immune repertoire, no novel immunogenic epitopes are generated since only consensus residues are introduced. Finally, this method of consensus engineering introduces no non-natural structural motifs into the domain and thus can be expected to provide immunoglobulin frameworks that are especially well suited for CDR-grafting procedures.

[ overview ] [ V_k-ALL ] [ V_l-HUMAN ] [ V_l-MOUSE ] [ V_H-ALL ] [ V_H-HUMAN ] [ V_H-MOUSE ]

References

Ohage, E.C. and Steipe, B. (1999) Intrabody construction and expression I: The critical role of VL domain stability. J Mol Biol 291: 1119-1128.

Ohage, E.C., Wirtz, P., Barnikow, J. and Steipe, B. (1999) Intrabody construction and expression II: A synthetic catalytic Fv. J Mol Biol 291: 1129-1134.

Wirtz, P. and Steipe, B. (1999) Intrabody construction and expression III: Engineering hyperstable VH domains. Protein Science 8: 2245-2250.

Ohage, E.C., Graml, W., Walter, M.M., Steinbacher, S. and Steipe, B. (1997) beta-turn propensities as paradigms for the analysis of structural motifs to engineer protein stability. Protein Science 6: 233-241.

Steipe, B., Schiller, B., Plückthun, A. and Steinbacher, S. (1994) Sequence statistics reliably predict stabilizing mutations in a protein domain. J Mol Biol 240: 188-192.


B. Steipe and Program in Proteomics and Bioinformatics, University of Toronto
boris.steipe@utoronto.ca
Last revision: April 2002