Lecture 03


Sequence Properties

What you should take home from this part of the course

Understand the ideas of analysis by composition and analysis by signal;
Know what deterministic pattern matching is;
Recognize and understand the term regular expression;
Be familiar with common sequence signals in DNA, RNA and proteins;
Be familiar with the Prosite database and the Prosite scan server;
Know where to find EMBOSS tools and how to use them;
Know about the offerings on the ExPASy tools collection page;
Work on an understanding of how biological facts can be translated into hypotheses and how hypotheses can be translated into computational procedures for analysis.

Links summary

...

Exercises

Retrieve and read the Prosite documentation entry for the Leucine Zipper.
Download entry 1NWQ from the PDB, visualize the Leucine Zipper with VMD and study its architecture (stereo vision!).

Lecture Slides

Slide 001

Lecture 03, Slide 001
This finding made the news. You should be aware of important new developments: subscribe to read at least the news items from Nature and Science, preferably subscribe to and browse their tables of contents too. In this particular new finding, researchers challenge our current concept of "genome": what is a genome, if the same physical DNA molecule can contain coding information for more than one species? Also, this finding further emphasizes the importance of horizontal gene transfer in evolution.

Slide 002

Lecture 03, Slide 002
What properties of a sequence can you analyze to describe what it is or does?

Slide 003

Lecture 03, Slide 003

Slide 004

Lecture 03, Slide 004

Slide 005

Lecture 03, Slide 005

Slide 006

Lecture 03, Slide 006

Slide 007

Lecture 03, Slide 007
A protein's isoelectric point depends on the pK values of its amino acids; the pK values characterize the propensity of an amino acid sidechain to dissociate, which in turn depends on how energetically favourable dissociation is. For example: since a negatively charged amino acid will be stabilized in a positive electrostatic field, such a field will shift its pK value down. This means the pH value at which the sidechain is 50% ionized is lower, or in other words, in a positive electrostatic field the concentration of protons must be higher to keep a proton associated with the sidechain.
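The dependence of the isoelectric point on pK values can be sketched with the Henderson-Hasselbalch relation. The following Python sketch is only an illustration: the pK table is an assumed set of textbook-style values (published tables differ slightly), the function names are made up for this example, and shifts of pK values caused by the local electrostatic environment are ignored.

```python
# Assumed, illustrative pK values; real tables differ slightly between sources.
PK_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5, "Nterm": 8.6}
PK_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1, "Cterm": 3.6}


def net_charge(sequence, ph):
    """Approximate net charge of a peptide at a given pH."""
    counts = {aa: sequence.count(aa) for aa in "KRHDECY"}
    counts["Nterm"] = 1
    counts["Cterm"] = 1
    charge = 0.0
    for group, pk in PK_POSITIVE.items():
        # Fraction of a basic group that is protonated (charge +1).
        charge += counts[group] / (1.0 + 10 ** (ph - pk))
    for group, pk in PK_NEGATIVE.items():
        # Fraction of an acidic group that is deprotonated (charge -1).
        charge -= counts[group] / (1.0 + 10 ** (pk - ph))
    return charge


def isoelectric_point(sequence, tolerance=0.001):
    """pH at which the net charge is zero, found by bisection on pH 0..14."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positive: the pI lies at a higher pH
        else:
            high = mid  # negative or zero: the pI lies at a lower pH
    return (low + high) / 2.0


if __name__ == "__main__":
    peptide = "ACDKEHRY"  # hypothetical toy peptide
    print(round(isoelectric_point(peptide), 2))
```

Bisection works here because the net charge decreases monotonically as the pH rises. Lowering a pK value in the table imitates the effect of a stabilizing field on that sidechain.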

Compositional properties of nucleic acids include hybridization temperature and helix structure.
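As a rough illustration of a compositional property of a nucleic acid, the Python sketch below estimates the melting temperature of a short oligonucleotide with the Wallace rule (2 °C per A or T, 4 °C per G or C). This rule is only a rule of thumb for short oligonucleotides, and the function names are chosen for this example.

```python
def gc_fraction(dna):
    """Fraction of G and C bases in a DNA sequence."""
    dna = dna.upper()
    gc = dna.count("G") + dna.count("C")
    return gc / len(dna)


def wallace_tm(oligo):
    """Rule-of-thumb melting temperature (deg C) of a short oligonucleotide."""
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    primer = "AGCTTGACCTGAATCGGA"  # hypothetical primer sequence
    print(gc_fraction(primer), wallace_tm(primer))
```

Both values depend only on how many bases of each kind are present, not on their order, which is what makes them compositional properties.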

Slide 008

Lecture 03, Slide 008
Simple tools exist to conveniently calculate compositional properties of peptide sequences. For example, the EMBOSS GUI serves EMBOSS tools on the Web from many freely accessible servers around the world. One of these tools is the pepstats routine that was used to create the output above.

Slide 009

Lecture 03, Slide 009

Slide 010

Lecture 03, Slide 010
Comparison of our unknown text (the German translation of "What's in a name ...") with letter frequencies of German, English and French shows a tendency towards the German origins, but we can't immediately say that this is statistically significant. It is, after all, a very short sample. It is interesting to consider the outliers W (for reasons of alliteration) and I and S (for assonance). Both poetic devices can be regarded as constraints on the choice of words that lead to deviations from the expected distributions.
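Counting letter frequencies is the simplest form of analysis by composition. A minimal Python sketch follows; the function name and the English sample text are choices made for this example, and no reference frequency tables for the three languages are included.

```python
from collections import Counter


def letter_frequencies(text):
    """Relative frequency of each letter, ignoring case and non-letters."""
    letters = [char for char in text.upper() if char.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {letter: count / total for letter, count in counts.items()}


if __name__ == "__main__":
    sample = ("What's in a name? That which we call a rose "
              "by any other name would smell as sweet.")
    frequencies = letter_frequencies(sample)
    # Print the letters from most to least frequent.
    for letter, value in sorted(frequencies.items(), key=lambda item: -item[1]):
        print(letter, round(value, 3))
```

To compare a text with a language, the same function would be applied to a large reference text in that language. With a sample this short, small differences should not be over-interpreted.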

Slide 011

Lecture 03, Slide 011
This has a corollary in the non-random distribution of amino acids across species. Some of this may be due to physicochemical properties of amino acids in a particular ecological niche, but such effects may also be due to chance characteristics of the biochemical machinery of replication and translation.

Slide 012

Lecture 03, Slide 012
Sometimes the excess or depletion of a component may carry important information. In the example above, the choice of words has been dictated by their letter composition. The poem Eunoia is an example of a univocalic lipogram, a form of concrete poetry, in which the author has constrained himself to use only a single vowel in each of the poem's chapters.

Slide 013

Lecture 03, Slide 013
The atypical distribution and clustering of particular amino acids suggests consequences for folding and interactions of the encoded protein.

Slide 014

Lecture 03, Slide 014
In this graph, amino acids have been ordered according to their standard frequencies in proteins (blue) to emphasize deviations in a particular protein (white). The sequence of the single-stranded RNA binding protein Nab3p is remarkable, possessing long tracts of glutamine and glutamic acid and an excess of proline. Just looking at these anomalies allows us to infer that large tracts of sequence are likely not structured (and thus fulfill their function as an adaptable, flexible polypeptide) and that large segments are highly negatively charged (and thus presumably have a highly positively charged ligand, such as the (+)charges of exposed, single stranded nucleotide bases).
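Long tracts of a single residue, like those described for Nab3p, are easy to find computationally. The Python sketch below reports every run of identical residues above a minimum length. The sequence is a made-up toy string, not the Nab3p sequence, and the function name and threshold are assumptions for illustration.

```python
from itertools import groupby


def residue_tracts(sequence, min_length=4):
    """Return (residue, start position, length) for each homopolymer tract.

    Positions are 1-based, as is usual for sequence annotation.
    """
    tracts = []
    position = 0
    for residue, run in groupby(sequence):
        length = len(list(run))
        if length >= min_length:
            tracts.append((residue, position + 1, length))
        position += length
    return tracts


if __name__ == "__main__":
    toy = "MAQQQQQEEEEPKPPSQQQQQQ"  # hypothetical sequence
    for residue, start, length in residue_tracts(toy):
        print(residue, start, length)
```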

Slide 015

Lecture 03, Slide 015

Slide 016

Lecture 03, Slide 016
A sequence is fundamentally different from an unordered set, since it places its components into a context. Here is where biology differs from human language: A pattern with a different sequence is a different pattern. Constraints on patterns can be structural or functional.

Slide 017

Lecture 03, Slide 017

Slide 018

Lecture 03, Slide 018

Slide 019

Lecture 03, Slide 019

Slide 020

Lecture 03, Slide 020
Restriction endonucleases are the quintessential pattern recognition molecules. They bind strongly to the specific conformation of DNA that is associated with a particular DNA sequence. Even though the structural differences between DNA strands of similar sequence are small, evolutionary pressure has resulted in enzymes that are highly specific for their cognate sequence. An excellent site for endonuclease information is Rebase.
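In silico, finding restriction sites is an exact pattern search. The Python sketch below locates the EcoRI recognition sequence GAATTC in a DNA string and checks that the site is palindromic, that is, equal to its own reverse complement. The test sequence and the function names are made up for this example.

```python
def reverse_complement(dna):
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(dna))


def find_sites(dna, site):
    """1-based start positions of all (possibly overlapping) occurrences."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start + 1)
        start = dna.find(site, start + 1)
    return positions


if __name__ == "__main__":
    ecori = "GAATTC"
    plasmid_fragment = "TTGAATTCAAGAATTCG"  # hypothetical sequence
    print(find_sites(plasmid_fragment, ecori))
    print(reverse_complement(ecori) == ecori)
```

Because the site is palindromic, searching one strand is sufficient. For a non-palindromic site, the reverse complement of the site would have to be searched as well.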

Slide 021

Lecture 03, Slide 021
Sequence maps collect annotations from various sources - for plasmids in the laboratory either in linear form (based on the remap tool of the EMBOSS suite) ...

Slide 022

Lecture 03, Slide 022
... or as a circular map, e.g. from PlasMapper.

Slide 023

Lecture 03, Slide 023

Slide 024

Lecture 03, Slide 024
Pattern search (or pattern matching) means inspecting an entity and stating whether that entity is an example of a given pattern. Usually the entity is a substring of a sequence, but patterns in protein structure, biological networks or morphogenesis can also be computationally defined. Pattern discovery means finding patterns that have not been defined a priori.

Slide 025

Lecture 03, Slide 025
Deterministic pattern search is a well understood field of computer science. Much more elegant solutions than "brute force" search exist ...

Slide 026

Lecture 03, Slide 026
... such as the Boyer-Moore algorithm. For a step-by-step version see here. Defining optimal algorithms and analyzing their resource requirements is the domain of computer science.
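To make the contrast concrete, here is a Python sketch of a brute-force search next to the Horspool simplification of Boyer-Moore. Note that this is the simplified Horspool variant (bad-character shifts only), not the full Boyer-Moore algorithm, and the function names are chosen for this example.

```python
def brute_force_search(text, pattern):
    """Try every alignment of the pattern against the text."""
    m = len(pattern)
    if m == 0:
        return []
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def horspool_search(text, pattern):
    """Boyer-Moore-Horspool: skip alignments using a shift table."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # For each character of the pattern except the last one, store the
    # distance from its rightmost occurrence to the end of the pattern.
    shift = {char: m - 1 - i for i, char in enumerate(pattern[:-1])}
    matches = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        # Shift according to the text character under the pattern's last
        # position; characters absent from the table allow a full jump.
        pos += shift.get(text[pos + m - 1], m)
    return matches


if __name__ == "__main__":
    print(brute_force_search("GATTACAGATTACA", "TTACA"))
    print(horspool_search("GATTACAGATTACA", "TTACA"))
```

Both functions return the same 0-based positions. The difference is that the second one can skip several alignments at once, which pays off most for long patterns and large alphabets such as the 20 amino acids.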

Slide 027

Lecture 03, Slide 027
If searches are to be repeated, precomputed index trees are much faster than examining the entire sequence. Simply look up where a pattern could be. Time (and storage space) invested in constructing the index pays off manyfold for every lookup.
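The idea of a precomputed index can be sketched in Python as follows. For brevity this example uses a hash table of fixed-length words (k-mers) rather than an index tree; the principle of investing once in construction and then looking up candidate positions is the same. The function names and the word length are assumptions for illustration.

```python
from collections import defaultdict


def build_kmer_index(sequence, k):
    """Map every word of length k to the list of its 0-based positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def lookup(index, sequence, pattern, k):
    """Find a pattern of length >= k using its first k-mer as a seed."""
    hits = []
    for start in index.get(pattern[:k], []):
        # Only the seeded positions are verified, not the whole sequence.
        if sequence.startswith(pattern, start):
            hits.append(start)
    return hits


if __name__ == "__main__":
    genome = "GATTACAGATTACA"  # hypothetical sequence
    index = build_kmer_index(genome, 3)
    print(lookup(index, genome, "TTACA", 3))
```

The index costs time and memory proportional to the sequence length, but each later query only inspects the positions stored for its seed word.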

Slide 028

Lecture 03, Slide 028

Slide 029

Lecture 03, Slide 029
To be able to search for patterns we need a convention to define them. In particular, we would like to be able to find degenerate patterns: patterns in which we allow a number of alternative choices for particular positions. Such patterns are commonly written as Regular Expressions (even though some sites, such as the ProSite database, use a custom variant of the concept).

Slide 030

Lecture 03, Slide 030
Here is an example of regular expression searching: the leucine zipper, a protein dimerization element found frequently in transcription factors, is defined by PROSITE as <tt>L-x(6)-L-x(6)-L-x(6)-L</tt>.
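The PROSITE notation maps almost directly onto a regular expression. The Python sketch below translates a small subset of the syntax: single residues, <tt>x</tt> for any residue, <tt>[...]</tt> for alternatives, <tt>{...}</tt> for excluded residues, repeat counts in parentheses, and the <tt><</tt> and <tt>></tt> anchors. It is a minimal illustration with a made-up function name, not a complete PROSITE parser.

```python
import re

# One pattern element: optional "<", a residue specification,
# an optional repeat count such as (6) or (2,4), and optional ">".
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, residue, count, end = match.groups()
        if residue == "x":
            residue = "."                          # any residue
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"   # excluded residues
        repeat = "{" + count + "}" if count else ""
        parts.append(("^" if start else "") + residue + repeat
                     + ("$" if end else ""))
    return "".join(parts)


if __name__ == "__main__":
    print(prosite_to_regex("L-x(6)-L-x(6)-L-x(6)-L"))
```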

Slide 031

Lecture 03, Slide 031
A crude Perl program to find the Leucine Zipper pattern uses a regular expression at its core. <tt>(L.{6}){3,}L</tt> means: a string matching an "L", followed by 6 occurrences of any character ("."), repeated three or more times, and terminated by a final "L". (The arcane-looking print statement is just there to capture the sequence number of the pattern.)
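The Perl program on the slide is not reproduced in these notes. As a stand-in, the same regular expression can be applied in Python as sketched below; the function name and the test sequence are made up for this example.

```python
import re

# An "L" followed by any six characters, three or more times, then a final "L".
LEUCINE_ZIPPER = re.compile(r"(L.{6}){3,}L")


def find_zippers(sequence):
    """Return (1-based start position, matched substring) for each match."""
    return [(match.start() + 1, match.group(0))
            for match in LEUCINE_ZIPPER.finditer(sequence)]


if __name__ == "__main__":
    protein = "MK" + "LAAAAAA" * 3 + "LGG"  # hypothetical sequence
    for position, hit in find_zippers(protein):
        print(position, hit)
```

Note that <tt>finditer</tt> reports non-overlapping matches only, and that the quantifier is greedy, so a longer zipper is returned as one match rather than several.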

Slide 032

Lecture 03, Slide 032
Having this as a Perl program on the computer makes it trivially easy to adjust the query, for example to allow any of the amino acids V, I, L, or M in the pattern and thus find examples that may be functionally related to the core leucine zipper.
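The adjustment amounts to replacing the literal "L" with a character class. A Python sketch of the relaxed query follows; the helper name, its parameters and the test sequence are assumptions for illustration.

```python
import re


def heptad_pattern(residues="L", min_repeats=3):
    """Compile a zipper-like pattern allowing any of the given residues."""
    return re.compile(
        f"([{residues}].{{6}}){{{min_repeats},}}[{residues}]"
    )


if __name__ == "__main__":
    strict = heptad_pattern("L")
    relaxed = heptad_pattern("VILM")
    candidate = "VAAAAAAIAAAAAALAAAAAAM"  # hypothetical sequence
    print(relaxed.pattern)
    print(bool(strict.search(candidate)), bool(relaxed.search(candidate)))
```

A relaxed pattern will match many more sequences, so hits are candidates for closer inspection rather than evidence of a functional zipper.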