Lecture 04
Sequence Analysis
- What you should take home from this part of the course
- Understand the ideas of analysis by composition and analysis by signal;
- Know what deterministic pattern matching is;
- Recognize and understand the term regular expression;
- Be familiar with common sequence signals in DNA, RNA and proteins;
- Be familiar with the Prosite database and the Prosite scan server;
- Know where to find EMBOSS tools and how to use them;
- Know about the offerings on the ExPASy tools collection page;
- Work on an understanding of how biological facts can be translated into hypotheses and how hypotheses can be translated into computational procedures for analysis.
- Links summary
- Exercises
- Retrieve and read the Prosite documentation entry for the Leucine Zipper.
- Download entry 1NWQ from the PDB, visualize the Leucine Zipper with VMD and study its architecture (stereo vision!).
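
As a small illustration of deterministic pattern matching with regular expressions, the sketch below translates a Prosite-style pattern into a Python regular expression. The pattern string used here for the Leucine Zipper, `L-x(6)-L-x(6)-L-x(6)-L`, should be checked against the Prosite entry you retrieve in the exercise. The function name `prosite_to_regex` and the test sequence are made up for this example, and only the most common pattern elements are handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple Prosite-style pattern into a Python regex string.

    Handles literal residues, x (any residue), [ABC] (one of), {ABC} (none of),
    repeat counts such as x(6) or x(2,4), and the < and > anchors.
    This is an illustrative helper, not an official Prosite tool.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}  -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(6) -> x{6}
        element = element.replace("x", ".")                     # x    -> any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence anchors
        parts.append(element)
    return "".join(parts)

zipper = prosite_to_regex("L-x(6)-L-x(6)-L-x(6)-L")
print(zipper)

# Hypothetical toy sequence with a leucine at every seventh position.
toy_sequence = "MK" + "LAAAAAA" * 3 + "L" + "GG"
hit = re.search(zipper, toy_sequence)
print(hit.start(), hit.end())
```

A pattern this short and unspecific will match many sequences that are not Leucine Zippers, which is one reason to read the documentation entry and not only the pattern.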
Lecture Slides
Slide 001
![](/abc/images/4/49/L04_s001.jpg)
Lecture 04, Slide 001
From the Science News, Sept. 14. As far as systems biology complexities go, this one must be near the top: intimate interactions between humans' most complex and second-most complex systems. The key method here is a bioinformatics approach to classifying genes: pattern searches in the promoter regions. (NB: not studying in isolation but forming study groups is an excellent idea!) Are you lonelier than average? Check with the UCLA Loneliness Scale.
Slide 002
Slide 003
Slide 004
Slide 005
![](/abc/images/a/ae/L04_s005.jpg)
Lecture 04, Slide 005
To generate this collection of sequences, the feature "Gal4-binding-site" was searched in the Saccharomyces Genome Database (SGD); in the resulting overview page, binding site annotations recorded by Harbison et al. (2004) were shown for all occurrences; the actual sequences were retrieved by specifying the genome coordinates in the appropriate search form of the database. I have added ten bases upstream and downstream of the core binding region. This procedure could be done by hand in about the same time it took me to write the small screen-scraping program to fetch the sequences. Depending on your programming proficiency, you will find that some tasks can be done efficiently by hand; for others it is more efficient to spend the time searching the Web for a better way to achieve them; and only for a comparatively small number of tasks is it worthwhile (or mandatory) to write your own code.
Slide 006
![](/abc/images/b/b1/L04_s006.jpg)
Lecture 04, Slide 006
A consensus sequence simply lists the most frequent amino acid or nucleotide at each position, or a random one if there is more than one with the highest frequency. The consensus sequence is the one that you would synthesize to make an idealized representative of the set. It is likely to bind more tightly or to be more stable than each of the individual sequences in the alignment.
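
The definition above can be sketched in a few lines of Python. The function name `consensus` and the four aligned sequences are invented for illustration; they are not the Gal4 sites from the slide. Ties are broken by a random choice, as described.

```python
import random
from collections import Counter

def consensus(aligned, rng=None):
    """Return the most frequent character of each alignment column.

    If several characters share the highest count, one of them is
    picked at random (rng can be passed in for reproducibility).
    """
    rng = rng or random.Random(0)
    result = []
    for column in zip(*aligned):          # iterate over alignment columns
        counts = Counter(column)
        top = max(counts.values())
        candidates = sorted(c for c, n in counts.items() if n == top)
        result.append(rng.choice(candidates))
    return "".join(result)

# Made-up, equal-length example sequences.
sites = ["CGGAT", "CGGTT", "CCGAT", "CGGAA"]
print(consensus(sites))
```

Note how much information is lost: the output does not tell you whether a position was perfectly conserved or barely won a majority.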
Slide 007
![](/abc/images/6/61/L04_s007.jpg)
Lecture 04, Slide 007
Introducing nucleotide ambiguity codes to represent situations in which more than one nucleotide has the highest frequency improves the situation a bit, but ambiguity remains in how to apply them. Consider the position after the conserved CCG pattern, with 9 As and 3 Gs: should we report the consensus A, or is it more interesting to report that the only observed alternative is another purine base and write R instead?
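
One way to make this choice explicit is a frequency threshold: every base that reaches the threshold in a column contributes to the ambiguity code. The sketch below uses the standard IUPAC nucleotide codes; the function name `iupac_consensus` and the `min_fraction` parameter are assumptions made for this example, not a convention from the lecture.

```python
from collections import Counter

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they cover.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def iupac_consensus(aligned, min_fraction=0.2):
    """Report, per column, the code for all bases with frequency >= min_fraction."""
    out = []
    for column in zip(*aligned):
        counts = Counter(column)
        kept = frozenset(b for b, n in counts.items() if n / len(column) >= min_fraction)
        if not kept:  # threshold too strict: fall back to the single most frequent base
            kept = frozenset(counts.most_common(1)[0][0])
        out.append(IUPAC[kept])
    return "".join(out)

# A single column with 9 As and 3 Gs, as in the example above.
column = ["A"] * 9 + ["G"] * 3
print(iupac_consensus(column))                    # G reaches 25%, so both purines count
print(iupac_consensus(column, min_fraction=0.3))  # stricter threshold keeps only A
```

The threshold is arbitrary, which is exactly the point of the question posed above: the same column can be reported as A or as R depending on what you decide is worth reporting.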
Slide 008
![](/abc/images/2/2d/L04_s008.jpg)
Lecture 04, Slide 008
Sequence logo of Gal4 binding sites with 10 flanking nucleotides on either side. Created with WebLogo. A sequence logo is a graphical representation of aligned sequences in which the height of the column at each position is proportional to the (Shannon) information of that position, and the relative size of each character is proportional to its frequency within the column. Sequence logos were pioneered by Tom Schneider, who maintains an informative website about their use and theoretical foundations. Note that there is considerable additional information in the flanking sequences that is not included in the published description of the core binding pattern; it is advantageous if you are able to rerun such analyses, rather than rely on someone else's opinion.
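
The two quantities that determine a logo can be computed directly. The following is a minimal sketch for DNA, assuming a uniform background and omitting the small-sample correction that published logo tools typically apply; the function names are invented for this example.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=4):
    """Information content of one alignment column in bits.

    Computed as log2(alphabet size) minus the Shannon entropy of the
    observed character frequencies (no small-sample correction).
    """
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in Counter(column).values())
    return math.log2(alphabet_size) - entropy

def logo_heights(aligned, alphabet_size=4):
    """Per column, the height of each character: information times frequency."""
    heights = []
    for column in zip(*aligned):
        info = column_information(column, alphabet_size)
        total = len(column)
        heights.append({c: info * n / total for c, n in Counter(column).items()})
    return heights

print(column_information("AAAA"))  # perfectly conserved column
print(column_information("AACC"))  # two equally frequent bases
print(column_information("ACGT"))  # all four bases equally frequent
print(logo_heights(["AA", "AC"]))
```

A perfectly conserved DNA column carries 2 bits, and a column in which all four bases are equally frequent carries none, which is why uninformative positions nearly vanish from a logo.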
Slide 009
Slide 010
Slide 011
Slide 012
![](/abc/images/3/35/L04_s012.jpg)
Lecture 04, Slide 012
In this informal example, I have simply counted matches with the consensus sequence (excluding "N"). We can slide the PSSM over the entire chromosome, and calculate scores for each position. Only the middle sequence is an annotated binding site. Whatever method we use for probabilistic pattern matching, we will always get a score. It is then our problem to decide what the score means.
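
The informal scoring described here, counting matches with the consensus while ignoring "N" positions, can be sketched as a sliding window. The short consensus and the target sequence below are toy values invented for this example, not the slide's data.

```python
def match_score(window, consensus):
    """Count positions where the window equals the consensus, skipping 'N'."""
    return sum(1 for w, c in zip(window, consensus) if c != "N" and w == c)

def scan(sequence, consensus):
    """Slide the consensus along the sequence; return (position, score) pairs."""
    width = len(consensus)
    return [(i, match_score(sequence[i:i + width], consensus))
            for i in range(len(sequence) - width + 1)]

toy_consensus = "CGGNNCCG"      # hypothetical, much shorter than a real site
toy_sequence = "AACGGTACCGTT"   # hypothetical target sequence

scores = scan(toy_sequence, toy_consensus)
best = max(scores, key=lambda pair: pair[1])
print(scores)
print(best)
```

Every window receives a score, including those that have nothing to do with a binding site; deciding which scores are meaningful is a separate question.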
Slide 013
Slide 014
Slide 015
![](/abc/images/d/d6/L04_s015.jpg)
Lecture 04, Slide 015
This first-order Markov model depends only on the current state. Higher-order models take increasing lengths of "history" into account, that is, how the system arrived in its current state. Note that the exit probabilities of a state always have to sum to 1.0. The so-called "stationary probability" over a long period of time for p(rain) is 0.167; this is determined by the combined effects of all individual transition probabilities. The stationary probabilities for two or three consecutive rainy days are 4.2% and 2.1%, respectively.
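
To see how stationary probabilities follow from transition probabilities, here is a minimal sketch that repeatedly applies the transition table to a starting distribution until it settles. The two-state weather model and its numbers are hypothetical; they are not the values from the slide and will not reproduce the figures quoted above.

```python
def stationary(transitions, steps=1000):
    """Approximate the stationary distribution of a first-order Markov chain.

    transitions[s][t] is the probability of moving from state s to state t.
    """
    states = list(transitions)
    for state, row in transitions.items():
        if abs(sum(row.values()) - 1.0) > 1e-9:
            raise ValueError(f"exit probabilities of {state!r} must sum to 1.0")
    dist = {s: 1.0 / len(states) for s in states}  # start from a uniform guess
    for _ in range(steps):
        dist = {t: sum(dist[s] * transitions[s].get(t, 0.0) for s in states)
                for t in states}
    return dist

# Hypothetical transition probabilities, for illustration only.
weather = {
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.4, "rain": 0.6},
}

pi = stationary(weather)
print(pi)

# Long-run probability of observing two consecutive rainy days in this toy model.
print(pi["rain"] * weather["rain"]["rain"])
```

For a two-state chain the result can be checked by hand: p(rain) = P(sun→rain) / (P(sun→rain) + P(rain→sun)), which is 0.2 / 0.6 for the numbers used here.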
Slide 016
![](/abc/images/c/cb/L04_s016.jpg)
Lecture 04, Slide 016
Hidden Markov Model: on Wikipedia.
Slide 017
Slide 018
Slide 019
Slide 020
Slide 021
Slide 022
Slide 023
![](/abc/images/a/a4/L04_s023.jpg)
Lecture 04, Slide 023
Signal peptide example for recognition of sequence features with HMMs or NNs: common features in Gram-negative signal-peptide sequences are shown in a sequence logo. Sequences were aligned on the signal-peptidase cleavage site. Their common features include a positively charged N-terminus, a hydrophobic helical stretch and a small residue that precedes the actual cleavage site.
Slide 024
![](/abc/images/1/1a/L04_s024.jpg)
Lecture 04, Slide 024
SignalP is the premier Web server to detect signal sequences.
Slide 025
Slide 026
Slide 027
deleted
Slide 028
Slide 029
Slide 030
Slide 031
Slide 032
![](/abc/images/6/6f/L04_s032.jpg)
Lecture 04, Slide 032
You should be familiar with these most fundamental descriptors; they come up time and time again in the literature. Here is a series of highly readable reviews on topics of medical statistics by Jonathan Ball and coauthors:
*(1) Presenting and summarising data
*(2) Samples and populations
*(3) Hypothesis testing and P values
*(4) Sample size calculations
*(5) Comparison of means
*(6) Nonparametric methods
*(7) Correlation and regression
*(8) Qualitative data - tests of association
*(9) One-way analysis of variance
*(10) Further nonparametric methods
*(11) Assessing risk
*(12) Survival analysis
*(13) Receiver operating characteristic curves
*(14) Logistic regression
Slide 033
Slide 034
![](/abc/images/b/b5/L04_s034.jpg)
Lecture 04, Slide 034
Statistical model: on Wikipedia.
Slide 035
Slide 036
Slide 037
Slide 038
Slide 039
Slide 040
![](/abc/images/b/bd/L04_s040.jpg)
Lecture 04, Slide 040
Still not convinced? Try the simulation here.
Slide 041
Slide 042
Slide 043
Slide 044
Slide 045
Slide 046
Slide 047
Slide 048
Slide 049
Slide 050
Slide 051
Slide 052
Slide 053
Slide 054
![](/abc/images/7/76/L04_s054.jpg)
Lecture 04, Slide 054
Multiple testing: on Wikipedia
Slide 055
Slide 056
Slide 057
Slide 058
Slide 059
Slide 060
Slide 061
Slide 062
Slide 063
Slide 064
Slide 065
![](/abc/images/e/e2/L04_s065.jpg)
Lecture 04, Slide 065
We can describe a set of observations as a distribution, and we can express this distribution as a vector if we define each element of the vector to represent a particular amino acid. This gives us a convenient and intuitive way to define a metric to compare two distributions: by considering the differences between all components of the two distributions. If we interpret this geometrically, a distribution of n elements corresponds to a point in an n-dimensional space, and the difference we are using here is the distance between the two points defined by the two distributions. We could use different metrics, but this one (the vector norm) is intuitive and convenient. The comparison between the frequency distribution of all amino acids in the sequence database (fexp, the expected distribution for a random sample of amino acids)
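
The metric described above can be written out in a few lines. In this sketch the expected distribution is simply uniform, which is a placeholder assumption; in practice fexp would be taken from database frequencies. The function names and the example peptide are invented for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def frequency_vector(sequence):
    """Relative frequency of each of the 20 amino acids, in a fixed order.

    Assumes the sequence contains only the 20 standard one-letter codes.
    """
    counts = Counter(sequence)
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]

def distance(f_obs, f_exp):
    """Euclidean distance (vector norm of the difference) between two distributions."""
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(f_obs, f_exp)))

f_exp = [1.0 / 20] * 20                  # placeholder: uniform expectation
peptide = "LLKKLLEELLKKLLEELLKK"         # hypothetical 20-residue sequence
print(distance(frequency_vector(peptide), f_exp))
```

Each distribution is one point in 20-dimensional space, and the single number returned is the straight-line distance between the two points.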
Slide 066
![](/abc/images/4/4f/L04_s066.jpg)
Lecture 04, Slide 066
We can apply the same metric to a set of the same number of simulated amino acids, in which the probability of picking an amino acid is given by its expectation value, fexp. If we do this many times, we will obtain a distribution of d values that tells us how different the relative frequencies of amino acids are, when they are generated by our simulator, relative to what we see in the database. Note that even over many simulations we still get an error every time, simply because the number of amino acids in every single run is small (20, in our example) and thus, whatever we do, the sample can never exactly reproduce the database distribution. This is important to understand: we are not simulating the distribution, we are simulating the influence of a limited-size sample!
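
A minimal sketch of such a simulation follows. As a placeholder, fexp is again assumed to be uniform; the function name `simulate_distances`, the number of runs and the seed are choices made for this example only.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def distance(f_obs, f_exp):
    """Euclidean distance between two frequency vectors."""
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(f_obs, f_exp)))

def simulate_distances(f_exp, sample_size=20, runs=10000, seed=1):
    """Draw many small samples from f_exp and return the distance d for each."""
    rng = random.Random(seed)
    distances = []
    for _ in range(runs):
        # Pick sample_size amino acids with probabilities given by f_exp.
        sample = rng.choices(AMINO_ACIDS, weights=f_exp, k=sample_size)
        counts = Counter(sample)
        f_sim = [counts[aa] / sample_size for aa in AMINO_ACIDS]
        distances.append(distance(f_sim, f_exp))
    return distances

f_exp = [1.0 / 20] * 20  # placeholder: uniform expectation
d_values = simulate_distances(f_exp)
print(min(d_values), sum(d_values) / len(d_values), max(d_values))
```

Even though every sample is drawn from fexp itself, the distances are spread over a range well above zero; that spread is the effect of the limited sample size.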
Slide 067
![](/abc/images/1/13/L04_s067.jpg)
Lecture 04, Slide 067
Once we have simulated the experiment many times, we can compare the observed outcome with the one that would be expected if the amino acids had been randomly picked from the database distribution. In our example, the result deviates considerably from what we would expect, but not by enough to be significant at the 95% level.
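
The comparison can be expressed as an empirical p-value: the fraction of simulated outcomes that are at least as extreme as the observed one. The sketch below uses made-up integer values in place of real simulated distances so that the arithmetic is easy to follow; the function name and the 0.05 cutoff (corresponding to the 95% level) are assumptions of this example.

```python
def empirical_p_value(observed, simulated):
    """Fraction of simulated values that are at least as large as the observed one."""
    at_least_as_extreme = sum(1 for value in simulated if value >= observed)
    return at_least_as_extreme / len(simulated)

def is_significant(observed, simulated, alpha=0.05):
    """True if fewer than a fraction alpha of simulations reach the observed value."""
    return empirical_p_value(observed, simulated) < alpha

# Made-up stand-ins for simulated d values: 0, 1, 2, ..., 99.
simulated = list(range(100))

print(empirical_p_value(90, simulated), is_significant(90, simulated))
print(empirical_p_value(98, simulated), is_significant(98, simulated))
```

An observation larger than 90% of the simulated values clearly deviates from the typical outcome, yet it does not pass a 0.05 cutoff, which is the situation described for the example above.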
Slide 068
Slide 069
![](/abc/images/3/33/L04_s069.jpg)
Lecture 04, Slide 069
If we want to simulate events according to a particular probability distribution, we can use the procedure given above. The procedure is not very efficient, since many values will be discarded if the interval is large. For each particular distribution there will be more efficient, specialized ways to generate it. However, this procedure is completely general, and it is trivial to change the target probability distribution's parameters; all you need is the definition of the distribution.
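
The slide itself is not reproduced in these notes. Assuming the procedure it shows is the usual accept/reject scheme (rejection sampling), it can be sketched as follows: pick a candidate x uniformly from an interval, pick a second uniform number y between 0 and the maximum of the density, and keep x only if y falls below the density at x. All names and parameter values here are choices made for this example.

```python
import math
import random

def rejection_sample(density, low, high, density_max, n, seed=1):
    """Draw n values distributed according to density on [low, high].

    density_max must be at least as large as the density anywhere in the interval.
    """
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < n:
        x = rng.uniform(low, high)          # candidate value
        y = rng.uniform(0.0, density_max)   # uniform height under the bounding box
        if y < density(x):                  # keep x with probability density(x) / density_max
            accepted.append(x)
    return accepted

def normal_density(x, mu=0.0, sigma=1.0):
    """Probability density of a normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

samples = rejection_sample(normal_density, -5.0, 5.0, normal_density(0.0), n=5000)
mean = sum(samples) / len(samples)
print(len(samples), mean)
```

With the interval from -5 to 5, roughly three out of four candidates are discarded; widening the interval discards even more, which is the inefficiency mentioned above. Changing the target only requires passing a different `density` function.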
Slide 070
Slide 071
Slide 072
Slide 073
Slide 074
Slide 075