Difference between revisions of "BIO Assignment 4 2011"

Revision as of 05:16, 10 November 2008

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

As we have seen in the previous assignments, the Mbp1 transcription factor has homologues in all other fungi, yet - looking at orthologues - this is not always a clear one-to-one mapping of related genes to each other. It appears that various systems of APSES domain transcription factors have evolved independently. Of course this bears directly on our notion of function - what it means to say that two genes in different organisms have the "same" function. In case two organisms both have an orthologous gene for the same, distinct function, this may be warranted. But what if that gene has duplicated in one of them, and the two paralogues now perform different, related functions in one organism? In order to be able to even ask such questions, we need to understand how we can make the evolutionary history of gene families explicit. This is the domain of phylogenetic analysis. We can ask questions like: how many paralogues did the cenancestor of a clade possess? Which of these underwent additional duplications in the phylogenesis of the organism I am studying? Did any genes get lost? And - adding additional biological insight to the picture - did the observed duplications lead to the "invention" of new biological systems? When was that? And how did the species benefit from this event?

We will develop this kind of analysis in this assignment. In the previous assignment you have established which genes are the reciprocally most closely related orthologues to Mbp1 and to other yeast APSES domain genes. In this assignment, we will analyse their evolutionary relationship and compare it to the evolutionary relationship of all fungal APSES domains. The goal is to define families of related transcription factors and their evolutionary history.

A number of good tools for phylogenetic analysis exist; general purpose packages include the (free) PHYLIP package and the (commercial) PAUP package. Specialized tools for tree-building include TREE-PUZZLE or MrBayes. This assignment is constructed around programs that are available in PHYLIP; however, you are welcome to use other tools that fulfil a similar purpose if you wish. In this field, researchers consider trees that have been built with ML (maximum likelihood) methods to be more reliable than trees that are built with parsimony methods, or distance methods such as NJ (Neighbor Joining). However, ML methods are also much more compute-intensive. Just like with multiple sequence alignments, some algorithms will come closer to guessing the truth than others, and it is usually hard to tell which is the more trustworthy of two diverging results. The prudent researcher tries out alternatives and forms her own opinion. Specifically, we may usually assume results that converge, independent of the algorithm, to be more reliable than those that depend strongly on a particular algorithm or details of input data.

But regarding algorithm and resources: we will take two shortcuts in this assignment (and both shortcuts are things you should not do in real life):

One: we will use an efficient tree-building algorithm, not the best-available one. This is an algorithm which is available through an online Webserver, without the need for you to install software on your own machine. In real life you would of course use the most accurate algorithm you can get, regardless of the resources this requires, since it makes no sense to waste your time on a careful analysis of inaccurate trees. Your supervisor would want it so as well. And if not she, the reviewers of your manuscript. (However, the simpler algorithm we use here appears to give quite plausible results for the situation we are studying.)

Two: we will assume the tree the algorithm constructs is correct. In real life you would establish its reliability with a bootstrap procedure: repeat the tree-building a hundred times with resampled data (alignment columns drawn at random, with replacement) and see which branches and groupings are robust and which depend on the details of the data. However, we should acknowledge that bifurcations that are very close to each other have not been "resolved". Any conscientious reviewer would flag such leniency and send your results back to you for a bootstrapping exercise at the computer. In phylogenetic analysis, not all lines a program draws are equally trustworthy. Don't take the trees as a given fact just because a program suggests this. Look at the evidence, include independent information where available, use your reasoning, and analyse the results critically.
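The resampling step of a bootstrap can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alignment is held in a plain dictionary, the sequence names and residues are invented, and the function name bootstrap_alignment is hypothetical. In PHYLIP itself, the seqboot program produces such replicates.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    Columns are drawn at random, with replacement, until the replicate
    has as many columns as the original. The same columns are used for
    every sequence, so aligned residues stay together.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


# Toy alignment: names and residues are made up for illustration only.
toy_alignment = {
    "MBP1_SACCE": "KIHRAVYSGV",
    "SWI4_SACCE": "KIMRATFSNV",
    "MBP1_CANAL": "QIYKAVYSGI",
}

rng = random.Random(0)  # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(toy_alignment, rng) for _ in range(100)]
```

Each replicate would then be fed through the same tree-building procedure, and a grouping that appears in most of the hundred trees is considered well supported.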

In case you want to review the concepts of trees, clades, LCAs, OTUs and the like, I have linked an excellent and very understandable introduction-level article on phylogenetic analysis (pdf) here and in the resource section at the bottom of this page.

Preparation, submission and due date

Read carefully.: Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which important aspects have simply overlooked marks unnecessarily. If you did not notice that the above did not make sense, you are reading what you expect, not what is written.

Review the guidelines for preparation and submission of BCH441 assignments.

The due date for the assignment is Friday, December 7 at 15:00 in the afternoon.

(1) Preparations

(1.1) Preparing Input Files (2 marks)

Introduction: Task

For this assignment, we start from the multiple sequence alignments we have constructed previously. We will edit the alignment to make it suitable for phylogenetic analysis. We will construct a phylogenetic tree and we will analyse and discuss the tree.

The phylogenetic tree we will construct will contain all APSES domains we have found. In order to interpret such a tree it is crucial to have some sense of what these domains are, i.e. to cluster them according to their orthologues. Only then can we analyse the tree by asking which subclades mirror the accepted phylogeny of fungi and which ones differ. In the third assignment, you have defined the true orthologues for most of the domains we had previously found with our PSI-BLAST search. (I have filled in the rest.) From this information, I have revised the gene names in the MUSCLE alignment of all APSES domains. When we calculate a phylogenetic tree with these sequences, we should expect orthologues to cluster into the same subclade. Of course, not all fungi have the same number of APSES domain homologues, but from the data we have compiled it should be possible to define their evolutionary history with reference to the other species.

Introduction: Principle

In order to use molecular sequences for the construction of phylogenetic trees, you have to build a multiple alignment first, then edit it. This is important: all rows of sequences have to contain the exact same number of characters and to hold aligned characters in corresponding positions. Phylogeny programs are not meant to revise an alignment but to analyse evolutionary relationships, given the alignment. Their inferences are made on a column-wise basis and if your columns contain data from unrelated positions, the inferences are going to be questionable.

The result of the tree construction is a decision about the most likely evolutionary relationships. Fundamentally, tree-construction programs decide which sequences had common ancestors.

Distance based phylogeny programs start by using sequence comparisons to estimate evolutionary distances:

they apply a model of evolution such as a mutation data matrix, to calculate a score for each pair of sequences,
this score is stored in a "distance matrix" ...
... and used to estimate a tree that groups sequences with close relationships together (e.g. by using an NJ, Neighbor Joining, algorithm).

They are fast, can work on large numbers of sequences, but are less accurate if genes evolve at different rates.
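The first step of a distance method can be sketched as follows. Note the substitution: instead of a model of evolution such as a mutation data matrix, this sketch uses the uncorrected fraction of differing residues (the "p-distance"), which is much cruder than what protdist calculates. The function names and the toy sequences are assumptions made for illustration.

```python
def p_distance(seq_a, seq_b):
    """Fraction of differing residues, ignoring columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """Collapse each pair of aligned sequences into a single number."""
    names = list(alignment)
    return {(p, q): p_distance(alignment[p], alignment[q])
            for p in names for q in names}


# Toy alignment: invented residues, for illustration only.
toy_alignment = {
    "A": "KIHRAVYSGV",
    "B": "KIMRATFSNV",
    "C": "QIYKAVYSGI",
}
dist = distance_matrix(toy_alignment)
```

Whatever distance measure is used, the result has the same shape: one number per pair of sequences, which is all the tree-building step gets to see.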

Parsimony based phylogeny programs build a tree that minimizes the number of mutation events that are required to get from a common ancestral sequence to all observed sequences. They take all columns into account, not just a single number per sequence pair, as the Distance Methods do. For closely related sequences they work very well, but they construct inaccurate trees when they can't make good estimates for the required number of sequence changes.

ML, or Maximum Likelihood methods attempt to find the tree for which the observed sequences would be the most likely under a particular evolutionary model. They are based on a rigorous statistical framework and yield the most robust results. But they are also VERY compute intensive and a tree of the size that we are building in this assignment is already almost beyond the resources of common workstations (runs about a day on my computer). However, one may split a large problem into smaller, obvious subtrees (e.g. analysing orthologues as a group, only including a few paralogues for comparison) and then merge the smaller trees; this way even very large problems can become tractable. ML methods also suffer less from "long-branch attraction" - the phenomenon that weakly similar sequences can be grouped inappropriately close together in a tree due to spurious shared differences.

Clearly, in order for tree-estimation to work, one must not include fragments of sequence which have evolved under a totally different evolutionary model than all others, such as fused domains or large stretches of indels. Thus it is appropriate to edit the sequences and pare them down to a most characteristic subset of amino acids. The goal is not to be as comprehensive as possible but to input those columns of aligned residues that will best represent the true phylogenetic relationships between the sequences.

Introduction: Problems

Gaps are a real problem here, as usual. Strictly speaking, the similarity score of an alignment program as well as the distance score of a phylogeny program are not calculated for an ordered sequence, but for a sum of independent values, one for each aligned column of characters. The order of the columns does not change the score. However, in an optimal sequence alignment with gaps, this is no longer strictly true, since a one-character gap creation has a different penalty score than a one-character gap extension! Most alignment programs use a model with a constant gap insertion penalty and a linear gap extension penalty. This is not rigorously justified from biology, but parametrized (or you could say "tweaked") to correspond to our observations. However, most phylogeny programs (such as the programs in PHYLIP) do not work in this way. PHYLIP strictly operates on columns of characters and treats a gap character just like a residue with the one letter code "-". Thus gap insertion and gap extension characters get the same score. For short indels, this underestimates the distance between pairs of sequences, since any evolutionary model should reflect the fact that gaps are much less likely than point mutations. If the gap is very long though, all positions are counted individually as many single substitutions (rather than as one lengthy event) and this overestimates the distance. And it gets worse: long stretches of gaps can make sequences appear similar in a way that is not justified, just because they are identical in the "-" character. It is therefore common and acceptable to edit gaps in the alignment down to one or two characters, or to remove them.

Introduction: Practice

In practice, follow the fundamental principle that all characters in a column should be related by homology. This implies the following rules of thumb:

Remove all stretches of residues in which the alignment appears ambiguous (not just highly variable, but ambiguous regarding the aligned positions).
Remove all frayed N- and C- termini, especially regions in which not all sequences that are being compared appear homologous and that may stem from unrelated domains.
Remove all but approximately one column from gapped regions, and all residues N- and C- terminal of the gap in which the alignment appears questionable. (I would keep one gapped column as a placeholder for a rare and very distinct evolutionary event, rather than simply deleting them all; some researchers remove all gaps.)
Also, consider that neither residues that are completely different between all species, nor residues that are completely conserved are informative for relationship distances.
If your sequences are too long, you may run out of memory. 60-80 aligned residues should be plenty and if the sequences fit on a single line you will save yourself potential trouble with block-wise vs. interleaved input.
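The rule about gapped regions can be illustrated with a small Python sketch. It is deliberately simplistic and rests on assumptions: a column counts as "gapped" if any sequence has a "-" in it, and the first column of each run of gapped columns is the one kept as placeholder. The function name collapse_gap_runs and the toy sequences are invented. Judging ambiguous stretches and frayed termini still has to be done by eye.

```python
def collapse_gap_runs(alignment):
    """Keep every ungapped column, plus the first column of each run of gapped columns."""
    names = list(alignment)
    length = len(alignment[names[0]])
    keep = []
    in_gap_run = False
    for i in range(length):
        gapped = any(alignment[name][i] == "-" for name in names)
        # An ungapped column is always kept; a gapped column only if it starts a run.
        if not gapped or not in_gap_run:
            keep.append(i)
        in_gap_run = gapped
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


# Toy alignment with a three-column gapped region (invented residues).
toy_alignment = {
    "A": "KIH---RAVY",
    "B": "KIMPQ-RATF",
    "C": "QIY---KAVY",
}
trimmed = collapse_gap_runs(toy_alignment)
```

In this example the three gapped columns shrink to a single placeholder column, so the indel is still recorded as one event without dominating the distance estimate.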

(A very useful trick with Microsoft Word is that you can select blocks of text and entire columns in the document with your mouse: hold the "ALT" key depressed while you click and drag your mouse to select. This will greatly facilitate the preparation of sequences. You can treat that selection as any other selected text: color or highlight characters, or delete them. Importantly, you can also cut and paste entire columns! Of course, this will only work as expected if you use a fixed-width font such as Courier or "Courier New". )

The preparation of the input file of aligned residues used by the PHYLIP package is straightforward in principle; just carefully follow the instructions in PHYLIP's well written documentation. If you plan to use an outgroup for your tree, it is a good idea to move that to the first line of your alignment, since this is where PHYLIP will look for it by default.

Some notes on how to avoid common editing troubles. Copy the sequences from the pages linked from the Resources section below. Paste them into a document, using the Word "Edit → Paste special → Unformatted text". Set the page-setup to "landscape", the font-size to something small, then you can put every sequence into one line. Take special note that your files must not include tab characters! (Tabs are counted as one single character by the phylogeny programs.) You can use Word to globally replace all tabs (specified as "^t") with a blank, to make sure. Spaces count, so display your alignment in a fixed-width font, such as Courier (or "Courier New"), not a proportional-width font such as Times, Arial, or Helvetica, and ensure all columns in your alignments align as they should. As always, make sure you save your input files as "Text Only".

A note if you are working on a Mac and saving input on disk, to run with a locally installed PHYLIP version: here MS Word will play one of its usual shenanigans on you since it writes text files with the old-style OS 9 Carriage Return characters (\r; ASCII 13; hex 0D; CR). Just by looking at the file, this is quite invisible, but such carriage returns are not going to be recognized as line ends by PHYLIP and most other UNIX based programs. It may not make a difference when you paste your sequences to a Web server, but if you compute things locally it will appear to the program as though all the input were passed in one single, very long line. And this can (and did) lead to head-banging rounds of frustration. You need to replace them with linefeed (newline) characters (\n; ASCII 10; hex 0A; LF) and you can't even do that within Word(!). Open a UNIX terminal window and navigate to the directory where your files reside. Then type:

tr "\r" "\n" < infile > outfile

... where outfile is different from infile (careful: if a file by the name of outfile already exists, the shell will cheerfully overwrite it). Alternatively you could type the following Perl one-line program:

perl -e 'while(<>){tr/\r/\n/;print}' < infile > outfile

In your assignment submission, clearly identify the source alignment you are using, and describe how it was created and how the gene names were defined. Paste your unaltered source alignment into your document, clearly highlight or otherwise color the columns that you have selected, annotate why you have selected them, and paste your resulting input file as well. Here is an example of what this might look like:

(Possible) steps in editing a multiple sequence alignment towards a PHYLIP input file. a: raw alignment (CLUSTAL format); b: sequences assembled into single lines; c: columns to be deleted highlighted in red - 1, 3 and 4: large gaps; 2: uncertain alignment and 5: frayed C-terminus: both would put non-homologous characters into the same column; d: input data for PHYLIP: names for sequences must not be longer than 10 characters, the first line must contain the number of sequences and the sequence length. PHYLIP is very picky about incorrectly formatted input, read the PHYLIP sequence format guide.
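The format constraints named in the caption (first line with the number of sequences and the sequence length, names of at most 10 characters) can be checked and written with a short Python sketch. The function name to_phylip and the example sequences are assumptions for illustration, and the sketch covers only the simple case where every sequence fits on one line; the PHYLIP sequence format guide remains the authority on details.

```python
def to_phylip(alignment):
    """Format an alignment as PHYLIP input, one sequence per line."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same number of characters")
    for name, seq in alignment.items():
        if len(name) > 10:
            raise ValueError(f"name longer than 10 characters: {name}")
        if "\t" in name or "\t" in seq:
            raise ValueError(f"tab character found in entry: {name}")
    # Header: number of sequences, then number of aligned characters.
    lines = [f"{len(alignment)} {lengths.pop()}"]
    # Names are padded with blanks to exactly 10 characters.
    lines += [f"{name:<10}{seq}" for name, seq in alignment.items()]
    return "\n".join(lines) + "\n"


# Invented example sequences; real input would come from the edited MSA.
example = to_phylip({
    "MBP1_SACCE": "KIHRAVYSGV",
    "SWI4_SACCE": "KIMRATFSNV",
})
```

Writing the returned string to a file with newline characters (not carriage returns) gives an input file in the shape shown in panel d.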

Introduction: Web Service and data

You have two choices for completing the assignment: either to use one of the PHYLIP on-line servers that generously provide public computing resources, or to download and install the PHYLIP program package on your own computer at home. If you choose the former, one of your options is the PHYLIP service at the Institut Pasteur in France.

I have tried the Pasteur service many times, and it works - however not always entirely without problems. Uninformative errors may occur when your input is too large for the system's memory (like: "sequences not aligned" ... "out of memories" and such) and once, after submitting a number of jobs, the system locked me out to wait until results would be received by e-mail (which then never happened). Regrettably, this is not documented. However, the integration of their services in a logical sequence of steps is very convenient and some of their services use algorithms that improve on PHYLIP. If you decide to install PHYLIP instead, good for you. That is easy to do and well documented, and there are far fewer limitations on memory - but if you don't read and understand the instructions carefully, you may be in for a spell of frustration.

Either way, I have posted typical input files and result files on the fallback data page, to allow you to bail out in case technical problems become overwhelming. If you use the data posted here instead of your own, you must document that fact and explain what you have tried, and why that has failed. The posted data is a fallback, not a shortcut.

For this assignment, we will use a simple distance based tree construction method, specifically UPGMA, which PHYLIP's neighbor program offers as an alternative to the neighbor joining algorithm. This represents a reasonable compromise between accuracy and speed, especially when applied to moderately dissimilar sequences. In general, distance methods include two steps: (1) calculate a pairwise-distance matrix between sequences, (2) construct a tree, based on the matrix. Thus all the information in the alignment between a pair of sequences is collapsed into a single number: their pairwise distance. Alternative approaches, parsimony as well as ML based algorithms, take individual columns into account.
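To make step (2) concrete, here is a minimal Python sketch of UPGMA clustering. It is not PHYLIP's implementation: it only returns the branching order as nested tuples, without branch lengths, and the function name, the input layout (a dictionary keyed by pairs of names) and the toy distances are assumptions. UPGMA repeatedly joins the two closest clusters and averages distances, which implicitly assumes that all lineages evolve at the same rate.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the UPGMA branching order as nested tuples.

    `dist` maps pairs (a, b) of names to their pairwise distance.
    """
    sizes = {name: 1 for name in names}                  # cluster -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        # Join the two clusters that are currently closest.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        for c in sizes:
            if c not in (a, b):
                # Distance to the new cluster: average weighted by cluster size.
                d[frozenset((merged, c))] = (
                    sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


# Toy distances, invented for illustration.
toy_dist = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
    ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
}
tree = upgma(["A", "B", "C", "D"], toy_dist)
```

With these distances, A and B are joined first, then C is added to that pair, and D joins last.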

Prepare an input file that is representative of the APSES domains.

Access the revised MSA for all APSES domains, linked here (and from the resources section at the bottom of the page). Prepare a PHYLIP formatted input file from this MSA, restricting the number of sequence characters to no more than 70. Read the PHYLIP format documentation and follow the considerations discussed above. (See the fallback data in case you get stuck, but you must prepare (and document) an input file according to the instructions, even if you end up using the fallback data for whatever reason.) Do not forget to document how you have prepared your input file: define where your source-sequences came from, define which columns you have deleted by highlighting the deleted residues in one sequence, and include your input file in the assignment. (2 marks)

(1.2) Calculating a Tree (2 marks)

Using the protdist program of PHYLIP, calculate a distance matrix for the input file you have prepared. (See the fallback data in case you get stuck) (1 mark)

If you use the PHYLIP Webserver, select the neighbor joining algorithm from the menu options (neighbor on the PHYLIP server) and click the button "run the selected program on outfile" ; on the next form, click the button to the "advanced neighbor form", choose the option "UPGMA" and click on the button "run neighbor". When the program is done, select the option drawgram and click Run the selected program on outtree. Choose a cladogram tree-style and a suitable output format (e.g. postscript). Paste the trees into your assignment.

If you use a locally installed version of PHYLIP use neighbor with the UPGMA method to construct a tree for the input file. Open the file outfile in a text-editor, copy and paste the trees into your assignment.

In both cases, the process is: protdist → neighbor → drawgram

(1 mark for constructing and displaying the tree).

(2) Analysis

I have constructed a cladogram for the species we are analysing, based on data published for 1551 fungal ribosomal sequences. Such reference trees from rRNA data are a standard method of phylogenetic analysis, supported by the assumption that rRNA sequences are monophyletic and have evolved under comparable selective pressure in all species.

Cladogram of fungi studied in the assignments. This cladogram is based on small subunit ribosomal RNA (rRNA) sequences, and largely follows Tehler et al. (2003) Mycol Res. 107:901-916. Even though many details of fungal phylogeny remain unresolved, the branches shown here individually appear to have strong support. In a cladogram such as this, the branch lengths are not drawn to any scale of similarity. I have labeled all speciation events so you can refer to these labels in your assignment.

In order to study the evolutionary history of the entire gene family you can use the tree you have computed or access the APSES domains reference tree here.

This is a complicated tree, and it can look impenetrably confusing at first. Here are two principles that will help you make sense of the tree.

A: A gene that is present in an ancestral species is inherited by all descendant species. The gene has to be observed in all OTUs, unless it has been lost (which is a rare event). This means, if a gene is present in two widely divergent species, but in none of the other descendants of the LCA, it is possible that there is some problem with the tree (long branch attraction maybe), or the sequence has been acquired through horizontal gene transfer.

B: Paralogous genes in an ancestral species should give rise to monophyletic subtrees for each of the genes, in all descendants; this means: if the LCA of a branch has e.g. three genes, we would expect three copies of the species cladogram below this branchpoint, one for each of these genes. Each of these subtrees should recapitulate the reference phylogenetic tree of the OTUs, up to the branchpoint of their LCA.
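Principle B can be turned into a simple check: given a rooted tree, do the genes you believe to be orthologues form a subtree of their own? The Python sketch below assumes the tree is written as nested tuples with gene names as leaves; the function names and the gene labels are hypothetical and serve only as illustration.

```python
def leaves(tree):
    """Return the set of leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def is_monophyletic(tree, group):
    """True if some subtree of `tree` contains exactly the leaves in `group`."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical gene tree: two paralogues, each present in two species.
gene_tree = (("MBP1_SACCE", "MBP1_CANAL"), ("SWI4_SACCE", "SWI4_CANAL"))

orthologues_cluster = is_monophyletic(gene_tree, {"MBP1_SACCE", "MBP1_CANAL"})
species_cluster = is_monophyletic(gene_tree, {"MBP1_SACCE", "SWI4_SACCE"})
```

In this toy tree the orthologues form their own subtree, while the two genes from the same species do not, which is the pattern expected when the duplication predates the speciation.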

With these two simple principles (you should draw them out on a piece of paper if they do not seem obvious to you), you can probably pry the reference tree of all APSES domains apart quite nicely. A few colored pencils and a printout of the tree will help.

(2.1) The Cenancestor's APSES Domains (2 marks)

Refer to your tree or the reference tree for the following two tasks. Be specific, to support your arguments, i.e. use specific branchpoints (by numbers or letters) and OTU or gene names in your arguments (see the example below).

Discuss briefly how many APSES domain proteins the fungal cenancestor appears to have possessed and what evidence you see in the tree that this is so. (2 marks)

(2.2) Unraveling your organism's APSES domains (4 marks)

Assume that the phylogenetic tree for fungi is correct, and that the mixed gene tree is fundamentally correct in its overall arrangement but may have local inaccuracies due to the limited resolution of the method. You have identified the APSES domain genes of the fungal cenancestor above. Apply the expectations we have stated above to discuss briefly through what sequence of duplications and/or gene loss your organism has ended up with the APSES domains it possesses today. Make specific reference to the species tree and either your constructed tree or the reference tree. (4 marks)

For example, the following discussion for Saccharomyces cerevisiae would be sufficient for full marks:

(Numbers refer to branchpoints of the mixed gene tree, letters to branchpoints of the species tree). There are four subclades that are shared by most current species; they branch from 129, 108, 76 and (94 + 102). For the latter case, the branching order appears not to be well resolved, but by comparison with the species tree, we can argue that branch 102 corresponds to branch (H) and should be inserted between branchpoints 94 (corresponding to (A)) and 96 (B), not after branch 74. This is because the species under 95 and 102 share a common ancestor (B) that is distinct from 95. Saccharomyces cerevisiae has one gene in each of these major subclades, there is no gene loss. (Note however that there is no Dikaryomycota (2) orthologue of a Sok2 gene.) Saccharomyces cerevisiae has an additional paralogue to Sok2 that created the Phd1 gene. This is shared with Candida albicans. There are three possibilities to explain this: (i) the gene could have been duplicated before (H) and then lost in separate, independent events after I, J, K, M and N in those species that do not possess an orthologue, (ii) the gene could have arisen after (N) or after (K) and then passed by horizontal gene transfer from or to S. cerevisiae, or (iii) the annotations of orthologues could be incorrect and some of the genes labelled SokA (Sok2 paralogues) could in fact be Phd1 orthologues; if this were the case it would require a reassessment of how much gene-loss would be necessary to explain the subclade below 108.

(3) Summary of Resources

Links

APSES domain alignment

All APSES domains - MUSCLE aligned and sequence names revised

Tree

APSES domains reference tree

[End of assignment]

If you have any questions at all, don't hesitate to mail me at boris.steipe@utoronto.ca or post your question to the Course Mailing List




@@ Line 1: / Line 1: @@
 <!-- {{Template:Active}} -->
 {{Template:Inactive}}
@@ Line 8: / Line 9: @@
 <div style="padding: 5px; background: #A6AFD0;  border:solid 1px #AAAAAA; font-size:200%;font-weight:bold;">
-Assignment 4 - Homology modeling
+Assignment 5 - Phylogenetic Analysis
 </div>
-<div style="padding: 15px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
-;How could the search for ultimate truth have revealed so hideous and visceral-looking an object?
+Introduction
-::''<small>Max Perutz (on his first glimpse of the Hemoglobin structure)</small>''
+&nbsp;
+;Nothing in Biology makes sense except in the light of evolution.
+:''Theodosius Dobzhansky''
 </div>
-&nbsp;
-&nbsp;
-Where is the hidden beauty in structure, and where, the "ultimate truth"? In the previous assignments we have studied sequence conservation in APSES family domains and we have discovered homologues in all fungal species. This is an ancient protein family that had already duplicated to several paralogues at the time the cenancestor of all fungi lived, more than 600,000,000 years ago, in the [http://www.ucmp.berkeley.edu/fungi/fungifr.html Vendian period] of the Proterozoic era of Precambrian times.
+... but does evolution make sense in the light of biology?
+As we have seen in the previous assignments, the Mbp1 transcription factor has homologues in all other fungi, yet - looking at orthologues - this is not always a clear one-to-one mapping of related genes to each other. It appears that various systems of APSES domain transcription factors have evolved independently. Of course this bears directly on our notion of ''function'' - what it means to say that two genes in different organisms have the "same" function. In case two organisms both have an orthologous gene for the same, distinct function, this may be warranted. But what if that gene has duplicated in one of them, and the two paralogues now perform different, related functions in one organism? In order to be able to even ask such questions, we need to understand how we can make the evolutionary history of gene families explicit. This is the domain of '''phylogenetic analysis'''. We can ask questions like: how many paralogues did the cenancestor of a clade possess? Which of these underwent additional duplications in the phylogenesis of the organism I am studying? Did any genes get lost? And - adding additional biological insight to the picture - did the observed duplications lead to the "invention" of new biological systems? When was that? And how did the species benefit from this event?
+We will develop this kind of analysis in this assignment. In the previous assignment you have established which genes are the reciprocally most closely related orthologues to Mbp1 and to other yeast APSES domain genes. In this assignment, we will analyse their evolutionary relationship and compare it to the evolutionary relationship of all fungal APSES domains. The goal is to define families of related transcription factors and their evolutionary history.
+A number of good tools for phylogenetic analysis exist; ''general purpose packages'' include the (free) [http://evolution.genetics.washington.edu/phylip.html PHYLIP] package and the (commercial) PAUP package. ''Specialized tools'' for tree-building include TREE-PUZZLE or MrBayes. This assignment is constructed around programs that are available in PHYLIP; however, you are welcome to use other tools that fulfil a similar purpose if you wish. In this field, researchers consider trees that have been built with ML (maximum likelihood) methods to be more reliable than trees that are built with parsimony methods, or distance methods such as NJ (Neighbor Joining). However, ML methods are also much more compute-intensive. Just like with multiple sequence alignments, some algorithms will come closer to guessing the truth than others, and it is usually hard to tell which is the more trustworthy of two diverging results. The prudent researcher tries out alternatives and forms her own opinion. Specifically, we may usually assume results that converge, independent of the algorithm, to be more reliable than those that depend strongly on a particular algorithm or details of input data.
+But regarding algorithm and resources: we will take two shortcuts in this assignment (and both shortcuts are things you should not do ''in real life''):
-In order to understand how specific residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to study an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. Explanations of a protein's observed properties and functions can't rely on the general fact that it binds DNA, we need to consider details in terms of specific residues and their spatial arrangement. In particular, it would be interesting to correlate the conservation patterns of key residues with their potential to make specific DNA binding interactions. Unfortunately, no APSES domain structures in complex with bound DNA has been solved up to now, and the experimental evidence we have considered in Assignment 2 ([http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=10747782 Taylor ''et al.'', 2000]) is not sufficient to unambiguously define the details of how a DNA double helix might be bound. Moreover, at least two distinct modes of DNA binding are known for proteins of the winged-helix superfamily, of which the APSES domain is a member.
+'''One''': we will use an '''efficient''' tree-building algorithm, not the best-available one. This is an algorithm which is available through an online Webserver, without the need for you to install software on your own machine. In ''real life'' you would of course use the most accurate algorithm you can get, regardless of the resources this requires, since it makes no sense to waste your time on a careful analysis of inaccurate trees. Your supervisor would want it so as well. And if not she, the reviewers of your manuscript. <small>(However, the simpler algorithm we use here appears to give quite plausible results for the situation we are studying.)</small>
-''In this assignment you will (1) construct a molecular model of the Mbp1 orthologue in your assigned organism, (2) identify similar structures of distantly related domains for which protein-DNA complexes are known, (3) assemble a hypothetical complex structure and(4) discuss whether the available evidence allows you to distinguish between different modes of ligand binding, ''
+'''Two''': we will assume the tree the algorithm constructs is ''correct''. In ''real life'' you would establish its reliability with a bootstrap procedure: repeat the tree-building a hundred times with resampled data and see which branches and groupings are robust and which depend on the details of the data. However, we should acknowledge that bifurcations that are very close to each other have not been "resolved". Any conscientious reviewer would flag such leniency and send your results back to you for a bootstrapping exercise at the computer. In phylogenetic analysis, not all lines a program draws are equally trustworthy. Don't take the trees as a given fact just because a program suggests this. Look at the evidence, include independent information where available, use your reasoning, and analyse the results critically.
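The resampling step of a bootstrap can be sketched in a few lines. This is a minimal illustration, assuming the standard nonparametric bootstrap (alignment columns drawn with replacement); the function name and the toy sequences are hypothetical and not part of the assignment data. PHYLIP provides its own program, seqboot, for this purpose.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns resampled with replacement.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    rng: a random.Random instance, so that replicates are reproducible.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all aligned sequences must have the same length")
    # Draw column indices with replacement. Every row uses the same draw,
    # so characters that were aligned stay together in one column.
    cols = [rng.randrange(length) for _ in range(length)]
    return {n: "".join(alignment[n][c] for c in cols) for n in names}

# Toy alignment (hypothetical sequences, for illustration only).
toy = {"seqA": "MKVLA", "seqB": "MKILA", "seqC": "MRVLS"}
replicates = [bootstrap_alignment(toy, random.Random(seed)) for seed in range(100)]
```

Each replicate would then be passed through the same tree-building procedure, and a grouping counts as robust if it appears in most of the resulting trees.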
-For the following, please remember the following terminology:
+In case you want to review the concepts of trees, clades, LCAs, OTUs and the like, I have linked an excellent and very understandable introduction-level [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Baldauf_2003_PhylogenyTutorial.pdf article on phylogenetic analysis (pdf)] here and in the resource section at the bottom of this page.
-;Target
-:The protein that you are planning to model.
-;Template
-:The protein whose structure you are using as a guide to build the model.
-;Model
-:The structure that results from the modeling process. It has the '''Target sequence''' and is similar to the '''Template structure'''.
 &nbsp;
-A brief overview article on the construction and use of homology models is linked to the resource section at the bottom of this page. That section also contains links to other sites and resources you might find useful or interesting.
 {{Template:Preparation|
-care=Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we see too many assignments which, arduously effected, nevertheless intimate nescience of elementary tenets of molecular biology. If the sentence above did not trigger an urge to open a dictionary, you are trying to guess, rather than confirm possibly important information.|
+care=Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which important aspects have simply overlooked marks unnecessarily. If you did not notice that the above did not make sense, you are reading what you expect, not what is written.|
-num=4|
+num=5|
-ord=fourth|
+ord=fifth|
-due = Monday, November 12 at 10:00 in the morning}}
+due = Friday, December 7 at 15:00 in the afternoon}}
+&nbsp;
+&nbsp;
 <div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
-==(1) Preparation==
+==(1) Preparations==
 </div>
+&nbsp;
+&nbsp;
 <div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
-===(1.1) Template choice and sequence (1 mark)===
+===(1.1) Preparing Input Files (2 marks)===
 </div>
 &nbsp;<br>
-Often more than one related structure can be found in the PDB. We have touched on principles of selecting template structures in the lecture and there is a short summary of [[Template_choice_principles|template choice principles]] on this Wiki. One can either search the PDB itself through its '''Advanced Search''' interface; for example one can search for sequence similarity with a BLAST search, or search for structural similarity by accessing structures according to their CATH or SCOP classification. But one can always also use the BLAST interface at the NCBI, since the sequences contained in PDB files are accessible as a database subsection on the BLAST menu.
-<div style="padding: 5px; background: #DDDDEE;">
+=====Introduction: Task=====
-*Use the NCBI BLAST interface to identify all PDB files that are clearly homologous to your target APSES domain, if you haven't already done so in Assignment 2. Document that you have searched in the correct subsection of the database by selecting "pdb" on the database options menu. For the hits you find, consider how these coordinate sets differ and which features would make each more or less suitable for your task by commenting briefly on
+For this assignment, we start from the multiple sequence alignments we have constructed previously. We will edit the alignment to make it suitable for phylogenetic analysis. We will construct a phylogenetic tree and we will analyse and discuss the tree.
-:*sequence similarity to your target
-:*size of expected model (length of alignment)
+The phylogenetic tree we will construct will contain all APSES domains we have found. In order to '''interpret''' such a tree it is crucial to have some sense of what these domains are, i.e. to cluster them according to their orthologues. Only then can we analyse the tree by asking which subclades mirror the accepted phylogeny of fungi and which ones differ. In the third assignment, you have defined the true orthologues for most of the domains we had previously found with our PSI-BLAST search. (I have filled in the rest.) From this information, I have revised the gene names in the [[APSES_domains_MUSCLE_revised|'''MUSCLE alignment of all APSES domains''']]. When we calculate a phylogenetic tree with these sequences, we should expect orthologues to cluster into the same subclade. Of course, not all fungi have the same number of APSES domain homologues, but from the data we have compiled it should be possible to define their evolutionary history with reference to the other species.
-:*presence or absence of ligands
-:*experimental method and quality of the data set
-Then choose the '''template''' you consider the most suitable and note why you have decided to use this template.
-* Retrieve the most suitable template structure coordinate file from the PDB.
+=====Introduction: Principle=====
+In order to use molecular sequences for the construction of phylogenetic trees, you have to build a multiple alignment first, then edit it. This is important: all rows of sequences have to contain the exact same number of characters and to hold aligned characters in corresponding positions. Phylogeny programs are not meant to revise an alignment but to analyse evolutionary relationships, given the alignment. Their inferences are made on a column-wise basis and if your columns contain data from unrelated positions, the inferences are going to be questionable.
-(0.5 marks)
+The result of the tree construction is a decision about the most likely evolutionary relationships. Fundamentally, tree-construction programs decide which sequences had common ancestors.
-</div>
-It is not straightforward at all how to number sequence in such a project. The "natural" numbering starts with the start-codon of the full length protein and goes sequentially from there. However, this does not map exactly to other numbering schemes we have encountered. As you know the first residue of the APSES domain as the CDD defines it is not Residue 1 of the Mbp1 protein. The first residue of the e.g. 1MB1 FASTA file '''is''' the first residue of Mbp1 protein, but the last five residues are an artifical His tag. Is H125 of 1MB1 thus equivalent to R125 in MBP1_SACCE? The N-terminus of the Mbp1 crystal structure is disordered. The first residue in the structure is ASN 3, therefore N is the first residue in a FASTA sequence derived from the cordinate section of the PDB file (the <code>ATOM  </code> records; whereas the SEQRES records start with MET ... and so on. You need to remember: a sequence number is not absolute, but derived from a particular context.
+'''Distance based''' phylogeny programs start by using sequence comparisons to estimate evolutionary distances:
+* they apply a model of evolution such as a mutation data matrix, to calculate a score for each '''pair''' of sequences,
+* this score is stored in a "distance matrix" ...
+* ... and used to estimate a tree that groups sequences with close relationships together (e.g. by using an NJ, Neighbor Joining, algorithm).
+They are fast, can work on large numbers of sequences, but are less accurate if genes evolve at different rates.
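The first two steps above can be sketched as follows. As a simplifying assumption, this sketch uses the plain fraction of differing positions (the "p-distance") in place of a mutation data matrix, so the numbers are not what '''protdist''' would report; the function names and toy sequences are hypothetical.

```python
def p_distance(a, b):
    """Fraction of aligned positions that differ between two sequences.

    Positions where either sequence has a gap are skipped (an assumption
    of this sketch, not a statement about any particular program).
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(alignment):
    """Pairwise distances, stored as a dict keyed by (name1, name2)."""
    names = list(alignment)
    return {(m, n): p_distance(alignment[m], alignment[n])
            for m in names for n in names}

# Toy alignment (hypothetical sequences, for illustration only).
toy = {"seqA": "MKVLA", "seqB": "MKILA", "seqC": "MRVLS"}
matrix = distance_matrix(toy)
```

Note that each pair of sequences is reduced to a single number here, which is exactly the loss of information that distinguishes distance methods from parsimony and ML methods.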
-The homology model will be based on an alignment of target and template. Thus we have to define the target sequence. As discussed in class, PDB files have an explicit  and an implied sequence and these do not necessarily have to be the same. To compare the implied and the explicit sequence for the template, you need to extract sequence information from coordinates. One way to do this is via the Web interface for [http://swift.cmbi.ru.nl/servers/html/index.html '''WhatIf'''], a crystallography and molecular modeling package that offers many useful tools for coordinate manipulation tasks.
+'''Parsimony based''' phylogeny programs build a tree that minimizes the number of mutation events that are required to get from a common ancestral sequence to all observed sequences. They take all columns into account, not just a single number per sequence pair, as the Distance Methods do. For closely related sequences they work very well, but they construct inaccurate trees when they can't make good estimates for the required number of sequence changes.
-<div style="padding: 5px; background: #DDDDEE;">
+'''ML''', or '''Maximum Likelihood''' methods attempt to find the tree for which the observed sequences would be the most likely under a particular evolutionary model. They are based on a rigorous statistical framework and yield the most robust results. But they are also VERY compute intensive and a tree of the size that we are building in this assignment is already almost beyond the resources of common workstations (runs about a day on my computer). However, one may split a large problem into smaller, obvious subtrees (e.g. analysing orthologues as a group, only including a few paralogues for comparison) and then merge the smaller trees; this way even very large problems can become tractable. They also suffer less from "long-branch attraction" - the phenomenon that weakly similar sequences can be grouped inappropriately close together in a tree due to spurious shared differences.
-*Navigate to the '''Administration''' sub-menu of the [http://swift.cmbi.ru.nl/servers/html/index.html WhatIf Web server]. Follow the link to '''Make sequence file from PDB file'''. Enter the PDB-ID of your template into the form field and '''Send''' the request to the server. The server accesses the PDB file and extracts sequence information directly from the <code>ATOM&nbsp;&nbsp;</code> records of the file. The results will be returned in PIR format. Copy the results, edit them to FASTA format and save them in a text-only file. Make sure you create a valid FASTA formatted file! Use this '''implied''' sequence to check if and how it differs from the sequence ...
-:*... listed in the <code>SEQRES</code> records of the coordinate file;
+Clearly, in order for tree-estimation to work, one must not include fragments of sequence which have evolved under a totally different evolutionary model than all others, such as domain fusions, or large stretches of indels. Thus it is appropriate to edit the sequences and pare them down to a ''most characteristic subset'' of amino acids. The goal is not to be as comprehensive as possible but to input those columns of aligned residues that will best represent the ''true'' phylogenetic relationships between the sequences.
-:*... given in the FASTA sequence for the template, which is provided by the PDB;
-:*... stored in the protein database of the NCBI.
-: and record your results.
-* In a table, establish how the sequence numbers in the coordinate section of your template(*) correspond to your target sequence numbering.
+=====Introduction: Problems=====
+Gaps are a real problem here, as usual. Strictly speaking, the similarity score of an '''alignment''' program as well as the distance score of a '''phylogeny''' program are not calculated for an ordered ''sequence'', but for a ''sum of independent values'', one for each aligned column of characters. The order of the columns does not change the score. However, in an optimal sequence alignment with gaps, this is no longer strictly true since a one-character gap creation has a different penalty score than a one-character gap extension! Most '''alignment''' programs use a model with a constant gap insertion penalty and a linear gap extension penalty. This is not rigorously justified from biology, but parametrized (or you could say "tweaked") to correspond to our observations. However, most '''phylogeny''' programs (such as the programs in PHYLIP) do not work in this way. PHYLIP strictly operates on columns of characters and treats a gap character just like a residue with the one letter code "-". Thus gap insertion- and extension- characters get the same score. For short indels, this '''underestimates''' the distance between pairs of sequences, since any evolutionary model should reflect the fact that gaps are much less likely than point mutations. If the gap is very long though, all events are counted individually as many single substitutions (rather than one lengthy one) and this '''overestimates''' the distance. And it gets worse: long stretches of gaps can make sequences appear similar in a way that is not justified, just because they are identical in the "-" character. It is therefore common and acceptable to edit gaps in the alignment to one or two characters, or to remove them.
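The effect of counting a gap character like any other residue can be illustrated with a toy mismatch count. This is only a sketch of the column-wise counting described above, assuming the simplest possible score (one point per differing column); it is not the scoring that any PHYLIP program actually applies, and the sequences are hypothetical.

```python
def naive_mismatches(a, b):
    """Count differing columns, treating '-' as just another character."""
    return sum(x != y for x, y in zip(a, b))

ref      = "MKVLAGHTWE"
one_gap  = "MKVL-GHTWE"   # a single-residue indel
long_gap = "MK------WE"   # one long indel, presumably a single event
also_gap = "MR------WD"   # a different sequence that shares the same gap

short_indel_cost = naive_mismatches(ref, one_gap)        # same cost as one substitution
long_indel_cost = naive_mismatches(ref, long_gap)        # one event counted six times
shared_gap_cost = naive_mismatches(long_gap, also_gap)   # gap columns count as identical
```

The single-residue indel costs no more than a point mutation, the long indel is counted once per gapped column, and the two gapped sequences differ only in their two substituted residues because the six shared "-" columns look like identities.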
-(0.5 marks)
+=====Introduction: Practice=====
-</div>
+In practice, follow the fundamental principle that '''all characters in a column should be related by homology'''. This implies the following rules of thumb:
-:(*) <small>These residue numbers are important, since they are referenced e.g. by VMD when you visualize the structure. The easiest way to list them is via the ''Sequence Viewer'' extension of VMD.</small>.
+:*Remove all stretches of residues in which the ''alignment'' appears ambiguous (not just highly variable, but ambiguous regarding the aligned positions).
-:<small>Don't do this for every residue individually but define ranges. Look at the correspondence of the first and last residue of target and template sequence and take indels into account. Establishing sequence correspondence precisely is crucially important! For example, when a publication refers to a residue by its sequence number, you have to be able to relate that number to the residue numbers of the model as well as your target sequence.</small>.
+:*Remove all frayed N- and C- termini, especially regions in which not all sequences that are being compared appear homologous and that may stem from unrelated domains.
-&nbsp;
+:*Remove all but approximately one column from gapped regions, and all residues N- and C- terminal of the gap in which the alignment appears questionable. (I would keep one gapped column as a placeholder for a rare and very distinct evolutionary event, rather than simply deleting them all; some researchers remove all gaps.)
-&nbsp;
+:*Also, consider that neither residues that are completely different between all species, nor residues that are completely conserved are informative for relationship distances.
+:*If your sequences are too long, you may run out of memory. 60-80 aligned residues should be plenty and if the sequences fit on a single line you will save yourself potential trouble with block-wise vs. interleaved input.
-<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+:<small>(A '''very''' useful trick with Microsoft Word is that you can select blocks of text and entire columns in the document with your mouse: hold the "ALT" key depressed while you click and drag your mouse to select. This will greatly facilitate the preparation of sequences. You can treat that selection as any other selected text: color or highlight characters, or delete them. Importantly, you can also cut and paste entire columns! Of course, this will only work as expected if you use a fixed-width font such as Courier or "Courier New". )</small>
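The rule about gapped regions is mechanical enough to sketch in code. The following is a minimal illustration, assuming that every column containing at least one "-" counts as gapped and that the first column of each gapped run is kept as the placeholder; the function name and toy sequences are hypothetical. It cannot judge whether an alignment is ambiguous or a terminus is frayed, which still requires inspecting the alignment by eye.

```python
def collapse_gap_runs(alignment):
    """Keep only the first column of each run of consecutive gapped columns.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    A column is treated as gapped if any sequence has '-' in it.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all aligned sequences must have the same length")
    keep = []
    in_run = False
    for i in range(length):
        gapped = any(alignment[n][i] == "-" for n in names)
        if not gapped:
            keep.append(i)
            in_run = False
        elif not in_run:
            keep.append(i)      # placeholder column for this indel
            in_run = True
    return {n: "".join(alignment[n][i] for i in keep) for n in names}

# Toy alignment (hypothetical sequences, for illustration only).
toy = {"seqA": "MKVLAGHTWE", "seqB": "MK----HTWE", "seqC": "MRVLA-HTWD"}
edited = collapse_gap_runs(toy)
```

All rows are edited with the same list of column indices, so the result is still a valid alignment with rows of equal length.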
-===(1.2) The input alignment  (1 mark)===
+The preparation of the input file of aligned residues used by the PHYLIP package is straightforward in principle; just carefully follow the instructions in PHYLIP's well written documentation. If you plan to use an outgroup for your tree, it is a good idea to move that to the first line of your alignment, since this is where PHYLIP will look for it by default.
-</div>
-&nbsp;<br>
-The sequence alignment between target and template is the single most important factor that determines the quality of your model. No comparative modeling process will repair an incorrect alignment; it is useful to consider a homology model rather like a three-dimensional map of a sequence alignment rather than a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these only because they are convenient. You should take advantage of the much more sophisticated alignment methods available. Analysis of wrong models can't be expected to produce right results.
+Some notes on how to avoid common editing troubles. Copy the sequences from the pages linked from the ''Resources'' section below. Paste them into a document, using the Word "Edit &rarr; Paste special &rarr; Unformatted text". Set the page-setup to "landscape", the font-size to something small, then you can put every sequence into one line. Take special note that your files must not include tab characters! (Tabs are counted as one single character by the phylogeny programs.) You can use Word to globally replace all tabs (specified as "^t") with a blank, to make sure. Spaces count, so display your alignment in a fixed-width font, such as Courier (or "Courier New"), not a proportional-width font such as Times, Arial, or Helvetica, and ensure all columns in your alignments align as they should. As always, make sure you save your input files as "Text Only".
-The best possible alignment is usually constructed from a multiple sequence alignment that includes at least '''the target and template sequence''' and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. Your alignment should have been carefully reviewed by you and wherever required, manually adjusted to move insertions or deletions between target and template out of the secondary structure elements of the template structure.
+<small>
+:A note if you are working on a '''Mac''' and saving input on disk, to run with a locally installed PHYLIP version: here MS Word will play one of its usual [http://en.wikipedia.org/wiki/Shenanigan shenanigans] on you since it writes text files with the old-style OS 9 Carriage Return characters <code>(\r; ASCII 13; hex 0D; CR)</code>. Just by looking at the file, this is quite invisible but such "Carriage returns" are not going to be recognized by PHYLIP and most other UNIX based programs. It may not make a difference when you paste your sequences to a Web server; but if you compute things locally it will appear to the program as though all the input were passed in one single, very long line. And this can (and did) lead to head-banging rounds of frustration. You need to replace them with '''Linefeed''' resp. '''Newline''' characters <code>(\n; ASCII 10; hex 0A; LF)</code> and you can't even do that within Word(!). Open a UNIX terminal window and navigate to the directory where your files reside. Then type:
-In the case of Mbp1 genes however, all orthologues we have considered have no indels in the APSES domain regions. Evolutionary pressure on the APSES domains has selected against indels in the more than 600 million years these sequences have evolved independently in their respective species.
+:'''tr "\r" "\n" &lt; infile     &gt; outfile'''
-Accordingly, all we need to do is to write the APSES domain sequences one under the other.
+:... where outfile is different from infile (careful: if a file by the name of outfile already exists, '''tr''' will cheerfully overwrite it.) Alternatively you could type the following perl one-line program :
-<div style="padding: 5px; background: #DDDDEE;">
+:'''perl -e 'while(&lt;&gt;){tr/\r/\n/;print}'  &lt; infile     &gt; outfile'''
-* Copy the FASTA formatted sequence for the APSES domain of your organism's Mbp1 orthologue from the sequences [[All_APSES_domains|defined in Assignment 3]] and save it as FASTA formatted text file. This is your '''target''' sequence. Compare this with the FASTA formatted file you have extracted from the PDB coordinate set. This is your '''template''' sequence. Then generate a multi-FASTA formatted file that contains both sequences, and '''pad''' the sequence(s) where required with hyphens as gap characters, so that target and template sequences have exactly the same length and are aligned.  Refer to the [[Assignment_4_fallback_data|'''Fallback data''']] if you are not sure about the format.
+</small>
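The same line-ending conversion can be sketched in Python, for those who prefer a script to a one-liner. This is a minimal sketch with hypothetical function names; unlike the <code>tr</code> command, it also converts Windows-style CR+LF pairs to a single linefeed instead of producing blank lines.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CR+LF pairs and bare CR characters to LF."""
    # Replace the two-byte pairs first, so that they do not become two linefeeds.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

def convert_file(infile, outfile):
    """Read infile as raw bytes and write a copy with LF line endings."""
    with open(infile, "rb") as src:
        data = src.read()
    with open(outfile, "wb") as dst:   # careful: overwrites an existing outfile
        dst.write(normalize_newlines(data))
```

Files are opened in binary mode so that Python does not translate line endings on its own before the replacement is applied.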
-(1 mark)
-</div>
-&nbsp;<br>
-&nbsp;<br>
-<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
+In your assignment submission, clearly identify the source alignment you are using, and describe how it was created and how the gene names were defined. Paste your unaltered source alignment into your document, clearly highlight or otherwise color the columns that you have selected, annotate why you have selected them and paste your resulting input file as well. Here is an example of what this might look like:
-==(2) Homology model==
-</div>
-&nbsp;
-&nbsp;
-<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+[[Image:EditingGuide.jpg|frame|none|(Possible) steps in editing a multiple sequence alignment towards a PHYLIP input file. '''a''': raw alignment (CLUSTAL format); '''b''': sequences assembled into single lines; '''c''': columns to be deleted highlighted in red - 1, 3 and 4: large gaps; 2: uncertain alignment and 5: frayed C-terminus: both would put non-homologous characters into the same column; '''d''': input data for PHYLIP: names for sequences must not be longer than 10 characters, the first line must contain the number of sequences and the sequence length. PHYLIP is very picky about incorrectly formatted input, read the [http://evolution.genetics.washington.edu/phylip/doc/sequence.html PHYLIP sequence format guide].]]
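The format constraints named in the caption (a header line with the number of sequences and the sequence length, names of at most 10 characters) can be checked and applied programmatically. The following is a minimal sketch that writes each sequence on a single line; the function name and toy sequences are hypothetical, and the PHYLIP sequence format guide remains the authoritative description.

```python
def to_phylip(alignment):
    """Format an alignment as PHYLIP input with one line per sequence.

    alignment: dict mapping sequence name -> aligned sequence.
    Raises ValueError on the formatting problems discussed in the text.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    lines = [f"{len(names):5d} {length:5d}"]    # number of sequences, sequence length
    for name in names:
        seq = alignment[name]
        if len(name) > 10:
            raise ValueError(f"name longer than 10 characters: {name}")
        if len(seq) != length:
            raise ValueError(f"sequence length differs for {name}")
        if "\t" in name or "\t" in seq:
            raise ValueError(f"tab character found in {name}")
        lines.append(name.ljust(10) + seq)      # name padded with blanks to 10 characters
    return "\n".join(lines) + "\n"              # LF line endings

# Toy alignment (hypothetical sequences, for illustration only).
toy = {"seqA": "MKVLA", "seqB": "MKILA", "seqC": "MRVLS"}
phylip_text = to_phylip(toy)
```

Writing <code>phylip_text</code> to a text-only file named <code>infile</code> would then give input in the layout shown in panel '''d''' of the figure.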
-=== (2.1) SwissModel (1 mark)===
-</div>
-&nbsp;<br>
-Access the Swissmodel server at '''http://swissmodel.expasy.org''' . Navigate to the '''Alignment Interface'''.
+=====Introduction: Web Service and data=====
-&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
+You have two choices for completing the assignment: either to use one of the [http://evolution.gs.washington.edu/phylip/phylipweb.html PHYLIP on-line servers] that generously provide public computing resources, or to download and install the [http://evolution.genetics.washington.edu/phylip.html PHYLIP program package] on your own computer at home. If you choose the former, one of your options is the [http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html '''PHYLIP''' service at the '''Institut Pasteur'''] in France.
-*Paste your alignment for target and model into the form field. Refer to the [[Assignment_4_fallback_data|'''Fallback Data file''']] if you are not sure about the format. Make sure to select the correct option for the alignment input format on the form.
-:<small>(You have to choose the correct format, and, if e.g. you choose a CLUSTAL format, you have to include a header line and a blank line. In the past we have seen problems with uploading alignments that have not been saved as "text only" and including periods i.e.   "."  in sequence names of CLUSTAL formatted alignments. Underscores appear to be safe.</small>
-* Click '''submit alignment ''' and on the returned page define your '''target''' and '''template''' sequence. For the '''template sequence''' define the PDB ID of the coordinate file. Enter the correct Chain-ID.
+<small>I have tried the Pasteur service many times, and it works - however not always entirely without problems. Uninformative errors may occur when your input is too large for the system's memory (like: "sequences not aligned" ... "out of memories" and such) and once, after submitting a number of jobs, the system locked me out to wait until results would be received by e-mail (which then did not happen). Regrettably, this is not documented. However the integration of their services in a logical sequence of steps is very convenient and some of their services use algorithms that improve on PHYLIP. If you decide to install PHYLIP instead, good for you. That is easy to do, well documented, and there are far fewer limitations on memory - but if you don't read and understand the instructions carefully, you may be in for a spell of frustration.</small>
-:<small>Recently the PDB has undergone a "remediation" process in which archived coordinate files were altered by the database to conform to new format standards. One of the changes was to assign a chain identifier of "A" to all chains that did not previously have a chain identifier. SwissModel uses a derivative of coordinate sets from the PDB (a dataset they call ExPDB). Apparently the PDB proper and ExPDB have now gone out of synchrony; when I entered the (correct, according to PDB) chain designation "A" for 1MB1, SwissModel rejected the alignment with a nondescript error message. When I entered an underscore "_" instead, which would be the designation for a chain without explicit chain identifier, such as the pre-remidation versio of the coordinates, the alignment was accepted and processed. I have e-mailed SwissModel about the problem; they are in the process of correcting it and may or may not be done while you are working on your assignments. If your template chain has the chain identifier "A" and your alignment gets rejected, try entering entering an underscore instead.</small>
-:<small>'''Enter''' the correct chain ID into the form-field even if you think it already appears there, don't simply accept the preloaded default. There is a bug in SwissModel's parser code that may cause incorrect strings to be sent to the server from that field. I have e-mailed SwissModel about the problem which may or may not be corrected while you are working on your assignments.</small>
-*Click '''submit alignment''' and review the alignment on the returned page. Make sure it has been interpreted correctly by the server. The conserved residues have to be lined up and matching. Then click '''submit alignment''' again, to start the modeling process.
+Either way, I have posted typical input files and result files on the [[Assignment_5_fallback_data|fallback data page]], to allow you to bail out in case technical problems become overwhelming. If you use the data posted here instead of your own, you '''must''' document that fact and explain what you have tried, and why that has failed. The posted data is a fallback, not a shortcut.
-* The resulting page returns information about the resulting model. Save the '''model coordinates''' on your computer. Read the information on what is being returned by the server (click on the red questionmark icon). Paste the Anolea profile into your assignment.
+For this assignment, we will use a simple distance based tree construction method, specifically the UPGMA method, which PHYLIP offers as an option of its '''neighbor''' program alongside neighbor joining. This represents a reasonable compromise between accuracy and speed, especially when applied to moderately dissimilar sequences. In general, distance methods include '''two''' steps: (1) calculate a pairwise-distance matrix between sequences, (2) construct a tree, based on the matrix. Thus all the information in the alignment between a pair of sequences is collapsed into a single number: their pairwise distance. Alternative approaches, parsimony as well as ML based algorithms, take individual columns into account.
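The second step, building a tree from the matrix, can be sketched for UPGMA as follows. This is a minimal illustration that returns only the branching order as nested tuples and omits branch lengths; the function name and the toy distances are hypothetical, and it is not the implementation used by PHYLIP's '''neighbor''' program.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster sequences by UPGMA and return the topology as nested tuples.

    names: list of sequence names.
    dist: dict mapping frozenset({name1, name2}) -> pairwise distance.
    """
    trees = {i: name for i, name in enumerate(names)}   # cluster id -> subtree
    sizes = {i: 1 for i in trees}                        # cluster id -> number of leaves
    d = {frozenset((i, j)): dist[frozenset((names[i], names[j]))]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(trees) > 1:
        closest = min(d, key=d.get)          # the pair of clusters with the smallest distance
        a, b = sorted(closest)
        del d[closest]
        for c in trees:
            if c in (a, b):
                continue
            dac = d.pop(frozenset((a, c)))
            dbc = d.pop(frozenset((b, c)))
            # Distance to the merged cluster: average over all leaf pairs,
            # i.e. the two old distances weighted by cluster size.
            d[frozenset((next_id, c))] = (
                (sizes[a] * dac + sizes[b] * dbc) / (sizes[a] + sizes[b]))
        trees[next_id] = (trees.pop(a), trees.pop(b))
        sizes[next_id] = sizes.pop(a) + sizes.pop(b)
        next_id += 1
    return next(iter(trees.values()))

# Toy distances (hypothetical values, for illustration only).
names = ["seqA", "seqB", "seqC"]
dist = {frozenset(("seqA", "seqB")): 0.2,
        frozenset(("seqA", "seqC")): 0.4,
        frozenset(("seqB", "seqC")): 0.6}
tree = upgma(names, dist)
```

With these distances, seqA and seqB are joined first and seqC attaches to that pair. UPGMA implicitly assumes that all lineages evolve at the same rate, which is one reason to treat the resulting tree with the caution discussed at the top of this page.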
-:<small>Do not paste a screenshot of the result, but copy and paste the image from the Web-page! You do not need to submit the actual coordinate files with your assignment.</small>
-(1 mark)
-</div>
 &nbsp;<br>
-In case you do not wish to submit the modelling job yourself, or have insurmountable problems when using the SwissModel interface, you may access the result files from the  [[Assignment_4_fallback_data|'''Fallback Data file''']]. Document the problems and note this in your assignment.
+<div style="padding: 5px; background: #DDDDEE;">
+Prepare an input file that is representative of the APSES domains.
+*Access the [[APSES_domains_MUSCLE_revised|revised MSA for all APSES domains]], linked here (and from the resources section at the bottom of the page). Prepare a PHYLIP formatted input file from this MSA, restricting the number of sequence characters to no more than 70. Read the [http://evolution.genetics.washington.edu/phylip/doc/main.html#inputfiles PHYLIP format documentation] and follow the considerations discussed above. ([[Assignment_5_fallback_data|See the fallback data in case you get stuck]], but you '''must''' prepare (and document) an input file according to the instructions, even if you end up using the fallback data for whatever reason.) Do not forget to document how you have prepared your input file: define where your source-sequences came from, define which columns you have deleted by highlighting the deleted residues in one sequence, and include your input file in the assignment. (2 marks)
-<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
-==(3) Model analysis==
 </div>
-&nbsp;
-&nbsp;
-<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+&nbsp;<br>
-=== (3.1) The PDB file (1 mark)===
-</div>
 &nbsp;<br>
-Open your '''model''' coordinates in a text-editor (make sure you view the PDB file in a fixed-width font) and consider the following questions: (Alternatively, view the coordinates linked to the [[Assignment_5_fallback_data|'''Fallback Data file''']].)
+<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
-&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
+===(1.2) Calculating a Tree (2 marks)===
-*What is the residue number of the first residue in the '''model'''? What should it be, based on the alignment? If the putative DNA binding region was reported to be residues 50-74 in the Mbp1 protein, which residues of the '''model''' correspond to that?
-(1 mark)
 </div>
-<!-- discuss flagging of loops - setting of B-factor to 99.0 phps. ANOLEA vs. Gromos ... packing vs. energy? -->
+&nbsp;<br>
-&nbsp;
+&nbsp;<br>
-&nbsp;
+<div style="padding: 5px; background: #DDDDEE;">
+*Using the '''protdist''' program of PHYLIP, calculate a distance matrix for the input file you have prepared. ([[Assignment_5_fallback_data|See the fallback data in case you get stuck]]) (1 mark)
-<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+*If you use the PHYLIP Webserver, select the neighbor joining algorithm from the menu options ('''neighbor''' on the PHYLIP server) and click the button "run the selected program on outfile"; on the next form, click the button to the "advanced neighbor form", choose the option "UPGMA" and click on the button "run neighbor". When the program is done, select the option '''drawgram''' and click '''Run the selected program on outtree'''. Choose a '''cladogram''' tree-style and a suitable output format (e.g. postscript). Paste the trees into your assignment.
-===(3.2) First visualization (1 mark)===
-</div>
-&nbsp;<br>
-In assignment 2 you have already studied an Mbp1 structure and compared it with your organism's Mbp1 APSES domain. Since a homology model inherits its structural details from the '''template''', the model should look very similar to the original structure but contain the sequence of the '''target'''.
+*If you use a locally installed version of PHYLIP, use '''neighbor''' with the UPGMA method to construct a tree for the input file. Open the file '''outfile''' in a text-editor, then copy and paste the trees into your assignment.
-&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
+In both cases, the process is: <code>protdist</code> &rarr; <code>neighbor</code> &rarr; <code>drawgram</code>
-*Save your '''model''' coordinates to your harddisk and visualize the structure in VMD. (Alternatively, copy and save the coordinates linked to the  [[Assignment_4_fallback_data|'''Fallback Data file''']] to your harddisk.) Make an informative stereo view that shows the general orientation of the helix-turn-helix motif and the "wing", and paste it into your assignment.
-* Discuss briefly which parts of the model may be unreliable and color these (if any) distinctly in your submitted image.
+:(1 mark for constructing and displaying the tree).
-(1 mark)
 </div>
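+To make the UPGMA option less of a black box, here is a minimal sketch of the clustering idea in Python. It is an illustration only and is not PHYLIP's '''neighbor''' implementation: the function name <code>upgma</code>, the four OTU names and the distances are all made up. At each step the two closest clusters are merged, the new node is placed at half their distance, and distances to the merged cluster are averaged, weighted by cluster size.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster OTUs by UPGMA and return a Newick string with branch lengths.

    names: list of OTU names
    dist:  dict mapping frozenset({a, b}) -> distance between OTUs a and b
    """
    # Each cluster is keyed by its Newick label and stores (size, node height).
    clusters = {name: (1, 0.0) for name in names}
    d = dict(dist)
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        size_a, height_a = clusters.pop(a)
        size_b, height_b = clusters.pop(b)
        # The new internal node sits at half the distance between the two clusters.
        height = d[frozenset((a, b))] / 2.0
        label = f"({a}:{height - height_a:.3f},{b}:{height - height_b:.3f})"
        # Distance to every remaining cluster: size-weighted average.
        for c in clusters:
            d[frozenset((label, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[label] = (size_a + size_b, height)
    return next(iter(clusters)) + ";"


# Hypothetical distances between four OTUs named A, B, C and D.
toy = {
    frozenset("AB"): 2.0, frozenset("AC"): 6.0, frozenset("BC"): 6.0,
    frozenset("AD"): 10.0, frozenset("BD"): 10.0, frozenset("CD"): 10.0,
}
tree = upgma(["A", "B", "C", "D"], toy)
```

+Because every node is placed at half the distance between the clusters it joins, all tips end up at the same distance from the root. This is the molecular clock assumption built into UPGMA, and it is worth keeping in mind when you interpret your tree.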
 &nbsp;<br>
 &nbsp;<br>
@@ Line 187: / Line 161: @@
 <div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
-==(4) The DNA ligand==
+==(2) Analysis==
 </div>
-&nbsp;
-&nbsp;
-<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+I have constructed a cladogram for the species we are analysing, based on data published for 1551 fungal ribosomal sequences. Such reference trees from rRNA data are a standard method of phylogenetic analysis, supported by the assumption that rRNA sequences are monophyletic and have evolved under comparable selective pressure in all species.
-===(4.1) Finding a similar protein-DNA complex (1 mark)===
+[[Image:FungiCladogram.jpg|frame|none|Cladogram of fungi studied in the assignments. This cladogram is based on small subunit rRNA sequences, and largely follows ''Tehler et al.'' (2003) ''Mycol Res.'' '''107''':901-916. Even though many details of fungal phylogeny remain unresolved, the branches shown here individually appear to have strong support. In a cladogram such as this, the branch lengths are not drawn to any scale of similarity. I have labeled all speciation events so you can refer to these labels in your assignment.]]
-</div>
-&nbsp;<br>
-One of the really interesting questions we can discuss with reference to our model is how sequence variation might be converted into changing DNA recognition sites, and then lead to changed cognate DNA binding sequences. But in order to address this, we would need to add a plausible model for how DNA is bound to APSES domains.
+In order to study the evolutionary history of the entire gene family you can use the tree you have computed or access the [[APSES_domains_reference_tree|'''APSES domains reference tree''']] here.
-Since there is currently no software available that would accurately model such a complex from first principles, we will base a model of a bound complex on homology modeling as well. This means we need to find a similar structure for which the position of bound DNA is known, then superimpose that structure with our model. This places the DNA molecule into the spatial context of the model we are studying. However, you may remember from the third assignment that the APSES domains in fungi seem to be a relatively small family. And there is no structure available of an APSES domain-DNA complex. How can we find a coordinate set of a structurally similar protein-DNA complex?
+This is a complicated tree, and it can look impenetrably confusing at first. Here are two principles that will help you make sense of the tree.
-Remember that homologous sequences can have diverged to the point where their sequence similarity is no longer recognizable, however their structure may be quite well conserved. Thus if we could find similar structures in the PDB, these might provide us with some plausible hypotheses for how DNA is bound by APSES domains. We thus need a tool similar to BLAST, but not for the purpose of sequence alignment, but for structure alignment. A kind of BLAST for structures. Just like with sequence searches, we might not want to search with the entire protein, if we are interested in is a subdomain that binds to DNA. Attempting to match all structural elements in addition to the ones we are actually interested in is likely to make the search less specific - we would find false positives that are similar to some irrelevant part of our structure. However, defining too small of a subdomain would also lead to a loss of specificity: in the extreme it is easy to imagine that the search for e.g. a single helix would retrieve very many hits that would be quite meaningless.
+A: '''A gene that is present in an ancestral species is inherited in all descendant species.''' The gene has to be observed in all OTUs, unless it has been lost (which is a rare event). This means, if a gene is present in two widely divergent species, but in none of the other descendants of the LCA, it is possible that there is some problem with the tree (long branch attraction maybe), or the sequence has been acquired through horizontal gene transfer.
-At the '''NCBI''', [http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml VAST] is provided as a search tool for structural similarity search.
+B: '''Paralogous genes in an ancestral species should give rise to monophyletic subtrees for each of the genes, in all descendants'''; this means: if the LCA of a branch has e.g. three genes, we would expect three copies of the species cladogram below this branchpoint, one for each of these genes. Each of these subtrees should recapitulate the reference phylogenetic tree of the OTUs, up to the branchpoint of their LCA.
-At the '''EBI''' there are a number of very well designed structure analysis tools linked off the [http://www.ebi.ac.uk/Tools/structural.html '''Structural Analysis''' page]. As part of its MSD Services, [http://www.ebi.ac.uk/msd-srv/ssm/ '''MSDfold'''] provides a convenient interface for structure searches.
+With these two simple principles (you should draw them out on a piece of paper if they do not seem obvious to you), you can probably pry the [[APSES_domains_reference_tree|reference tree of all APSES domains]] apart quite nicely. A few colored pencils and a printout of the tree will help.
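+Principle A can also be turned into a small bookkeeping exercise. The following Python sketch is only an illustration under stated assumptions: the species tree is a made-up nested tuple with placeholder names (<code>sp1</code> to <code>sp5</code>), not the fungal cladogram, and the function names are hypothetical. Given the set of species in which a gene is observed, it finds the subtree below their LCA and lists the species in which, by principle A, the gene must have been lost (or not yet found).

```python
def leaves(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def lca_subtree(tree, observed):
    """Return the smallest subtree that contains every species in `observed`."""
    if not isinstance(tree, str):
        for child in tree:
            if observed <= leaves(child):
                # All observed species sit below this child: descend further.
                return lca_subtree(child, observed)
    return tree


def inferred_losses(tree, observed):
    """Species below the LCA of `observed` in which the gene is not observed."""
    return leaves(lca_subtree(tree, observed)) - observed


# Hypothetical species tree with placeholder names.
species_tree = (("sp1", "sp2"), ("sp3", ("sp4", "sp5")))

few = inferred_losses(species_tree, {"sp3", "sp5"})
many = inferred_losses(species_tree, {"sp1", "sp4"})
```

+In this toy tree, a gene seen only in <code>sp3</code> and <code>sp5</code> implies a single missing species, whereas a gene seen only in <code>sp1</code> and <code>sp4</code> places the LCA at the root and leaves three species unaccounted for. The second pattern is the kind of case where you should consider tree artefacts or horizontal gene transfer. Note that the sketch counts species, not loss events; several missing species in one subclade can share a single loss.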
-However we have also read previously that the APSES domains are members of a much larger superfamily, the "winged helix" DNA binding domains , of which hundreds of structures have been solved.
-&nbsp;<br>
+&nbsp;
+&nbsp;
-[[Image:A5_Mbp1_subdomain.jpg|frame|none|Stereo-view of a subdomain within the 1MB1 structure that includes residues 36 to 76. The color gradient ramps from blue (36) to green (76) and the "wing" is clearly seen as the green pair of beta-strands, extending to the right of the helix-turn-helix motif.]]
-&nbsp;<br>
+<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
-APSES domains represent one branch of the tree of helix-turn-helix (HTH) DNA binding modules. (A recent review on HTH proteins is linked from the resources section at the bottom of this page). Winged Helix domains typically bind their cognate DNA with a "recognition helix" which precedes the beta hairpin and binds into the major groove; additional stabilizing interactions are provided by the edge of a beta-strand binding into the minor groove. This is good news: once we have determined that the APSES domain is actually an example of a larger group of transcription factors, we can compare our model to a structure of  a protein-DNA complex. CATH does not provide information on complexes, but we can search the PDB with CATH codes in the following way:
+===(2.1) The Cenancestor's APSES Domains (2 marks)===
-* Access [http://cathwww.biochem.ucl.ac.uk/cgi-bin/cath/GotoCath.pl?cath=1.10.10.10 CATH domain 1.10.10.10].
-* Navigate to the [http://www.pdb.org/ PDB home page] and follow the link to [http://www.pdb.org/pdb/search/advSearch.do Advanced Search]
-* In the options menu for "Choose a Query Type" select Structure Features &rarr; CATH classification. A window will open that allows you to navigate down through the CATH tree. The interface is awkward because it does not display the actual CATH codes along with the class names, but you can view the class names on the CATH page linked above. Click on '''the triangle icons''' before "Mainly Alpha"&rarr;"Orthogonal Bundle"&rarr;"ARC repressor mutant, subunit A" then click on the link to "winged helix repressor DNA binding domain". As of this writing, this subquery matches 295 structures.
-* Click on the (+) button behind the subquery to add an additional query. Select the option "Structure Summary"&rarr;"Molecule / Chain type". In the option menus that pop up, select "Contains Protein &rarr; Yes",  "Contains DNA &rarr; Yes""Contains RNA &rarr; Ignore". This selects files that contain Protein-DNA complexes.
-* Check the box below this subquery to "Remove Similar Sequences at 90% identity" and click on "Evaluate Query". As of this writing, seventy complexes were returned.
-* In the left-hand menu, under the Tabulate section, click on the "Collage" function to display icons of the structure files. This is a fast way to obtain an overview of the structures that have been returned. First of all you may notice that in fact not all of the structures are really different, despite selecting only to retrieve dissimilar sequences. This appears to be a deficiency of the algorithm. But you can also easily recognize how the recognition helix inserts into the major groove of most of the structures that were returned (at least those where the domain is not a very small part of a much larger complex). There is one exception: the structure 1DP7 shows how the human RFX1 protein binds DNA in a non-canonical way. We shall use structural superposition of your homology model and two of the winged-helix proteins to decide which mode of DNA binding seems to be more plausible for Mbp1 homologues.
-&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
-* Follow the procedure outlined above, from a CATH entry page up to viewing a Collage (or alternatively a tabular view) of the retrieved coordinate files. You can be maximally concise documenting the procedure I have defined above, but do spend a bit of time to understand the key elements of the PDB's advanced search interface.
-(1 mark)
 </div>
+Refer to your tree or the reference tree for the following two tasks. Be specific, to support your arguments, i.e. use specific branchpoints (by numbers or letters) and OTU or gene names in your arguments (see the example below).
-<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
-===(4.2) Preparation and superposition of a canonical complex (1 mark)===
-</div>
 &nbsp;<br>
+&nbsp;<br>
+<div style="padding: 5px; background: #DDDDEE;">
-The structure we shall use as a reference for the canonical binding mode is the Elk-1 transcription factor.
+Discuss briefly how many APSES domain proteins the fungal cenancestor appears to have possessed and what evidence you see in the tree that this is so. (2 marks)
-[[Image:A5_canonical_wHTH.jpg|frame|none|Stereo-view of the canonical DNA binding mode of the Winged Helix domain family. Shown here is the Elk-1 transcription factor - an ETS DNA binding domain - in complex with a high-affinity binding site (1DUX). Note how the "recognition helix" inserts into the major groove of the DNA molecule. The color gradient ramps from blue (34) to green (84). Note how the first helix of the "helix-turn-helix" architecture serves only to position the recognition helix and makes few interactions by itself.]]
-The 1DUX coordinate-file contains two protein domains and two B-DNA dimers in one asymmetric unit. For simplicity, let's delete the second copy.
-* Access the PDB and navigate to the 1DUX structure explorer page. Download the coordinates to your computer.
-* Open the coordinate file in a text-editor and delete the coordinates for chains <code>D</code>,<code>E</code> and <code>F</code>; you may also delete all <code>HETATM</code> records and the <code>MASTER</code> record. Save the file with a different name, e.g. 1DUX_monomer.pdb .
-* Open VMD and load your homology model. Turn off the axes, display the model as a Tube representation in stereo, and color it by Index. Then load your edited 1DUX file, display this coordinate set in a tube representation as well, and color it by ColorID in some color you like. It is important that you can distinguish easily which structure is which
-* You could use the Extensions&rarr;Analysis&rarr;RMSD calculator interface to superimpose the two structures '''IF''' you knew which residues correspond to each other. Sometimes it is useful to do exactly that: define exact correspondences between residue pairs and superimpose according to these selected pairs. For our purpose it is much simpler to use the Multiseq tool (and the structures are simple and small enough that the STAMP algorithm for structural alignment can define corresponding residue pairs automatically). Open the '''multiseq''' extension window, select the check-boxes next to both protein structures, and open the '''Tools&rarr;Stamp Structural Alignment''' interface.
-* In the "'Stamp Alignment Options'" window, check the radio-button for ''Align the following ...'' '''Marked Structures''' and click on '''OK'''.
-* In the '''Graphical Representations''' window, double-click on all "NewCartoon" representations for both molecules, to undisplay them.
-* You should now see a superimposed tube model of your homology model and the 1DUX protein-DNA complex. You can explore it, display side-chains etc. and study some of the details of how a transcription factor recognizes and binds to its cognate DNA sequence. However, remember that the model's side-chain orientations have not been experimentally determined but inferred from the template, and that the template's structure was determined in the absence of bound ligand.
-&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
-* Orient and scale your superimposed structures so that their structural similarity is apparent, and the recognition helix can be clearly seen inserting into the DNA major groove. Paste a copy of your image into your assignment. Remark briefly on which parts of the structure appear to superimpose best.  Note whether this orientation of a B-DNA double-helix is a plausible model for DNA binding of your Mbp1 orthologue.
-(1 mark)
 </div>
 &nbsp;<br>
 &nbsp;
 <div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
-===(4.2) Preparation and superposition of a non-canonical complex (1 mark)===
+===(2.2) Unraveling your organism's APSES domains (4 marks)===
 </div>
 &nbsp;<br>
-The structure displaying a non-canonical complex between a winged-helix domain and its cognate DNA binding site is the human Regulatory Factor X.
-[[Image:A5_non-canonical_wHTH.jpg|frame|none|Stereo-view of a non-canonical wHTH-DNA complex, discovered with the structure of human Regulatory Factor X (hRFX) binding its cognate X-box DNA sequence (1DP7). Note how the helix that corresponds to the recognition helix in the canonical domain lies across the minor groove whereas the beta-"wing" inserts into the major groove. The color gradient ramps from blue (18) to green (68).]]
-The 1DP7 coordinate-file contains only one protein domain and only one B-DNA monomer in its asymmetric unit. This is a file for which we have to generate ''biological unit'' coordinates! Then, for simplicity we will delete the second protein monomer. As you know, there are at least two systems that make the so-called biological units available: the PDB itself, through the Biological Unit file that is accessible via the "Download Files" section on any Structure Explorer page, and the EBI through the PQS service. '''How''' the biological units are stored is subtly different for both cases and for our purpose this small difference is important. The PDB generates additional chains as copies of the original and delineates them with <code>MODEL</code>, <code>ENDMDL</code> records, just like in a multi-structure NMR file. The chain IDs and the atom numbers are the same as the original. The EBI's PQS service creates copies that have distinct atom numbers and chain IDs. The difference is that the PDB file thus '''contains the same molecule in two different orientations''', whereas the PQS file contains '''two independent molecules'''. This is an important difference when it comes to selecting residues, visualizing and superimposing structures. For VMD, the PQS way of doing things is the right way to go, since by default only the first <code>MODEL</code> will be displayed if several are available.
-* Access the [http://pqs.ebi.ac.uk/ '''EBI PQS server'''], enter 1DP7 into the '''PDBidcode''' form field and click on '''Submit'''.
-* On the results page, click on the link under '''1dp7_0''', which is the unique suggestion for a biological unit that the server has identified.
-* On the PQS OUTPUT page that is retrieved, click on the '''1dp7.mmol''' link, this will load the PDB formatted coordinate file.
-* Save the coordinates as 1DP7_complex.pdb (or some other name that makes sense to you), open it in a text editor, delete the <code>HETATM</code> records from the end and the entire chain "B". Also make sure not to delete any of the <code>TER</code> records for chains "D", "P" or "A". Save the file.
-* In the multiseq window, choose File&rarr;Import Data, '''Browse...''' to your 1DP7_complex file, select it and click on '''Open'''. Click '''OK''' to load the file.
-* Mark all three protein chains by selecting the checkbox next to their name and again run the STAMP structural alignment.
-* In the graphical representations window, double-click again on all cartoon representations that multiseq has generated to undisplay them, undisplay also the Tube representation of 1DUX, create a Tube representation for 1DP7, and select a Color by ColorID (a different color you like). The resulting scene should look similar to the one you have created above, only with 1DP7 in place of 1DUX and colored differently.
-&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
-* Orient and scale your superimposed structures so that their structural similarity is apparent, the orientation is similar to the scene generated above and the 1DP7 "wing" can be clearly seen inserting into the DNA major groove. Paste a copy of your image into your assignment. Remark briefly on which parts of the structure appear to superimpose best.  Note whether this orientation of a B-DNA double-helix is a plausible model for DNA binding of your Mbp1 orthologue.
-(1 mark)
-</div>
 &nbsp;<br>
+<div style="padding: 5px; background: #DDDDEE;">
-<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+Assume that the phylogenetic tree for fungi is correct, and that the mixed gene tree is fundamentally correct in its overall arrangement but may have local inaccuracies due to the limited resolution of the method. You have identified the APSES domain genes of the fungal cenancestor above. Apply the expectations we have stated above to discuss briefly through what sequence of duplications and/or gene loss your organism has ended up with the APSES domains it possesses today. Make specific reference to the species tree and either your constructed tree or the [[APSES_domains_reference_tree|reference tree]]. (4 marks)
-===(4.3) Interpretation (2 marks)===
 </div>
 &nbsp;<br>
+&nbsp;
-In your previous assignment, you have commented on conservation patterns in Mbp1 orthologues. You can refer back to your last results (easier to do), or you can import the APSES domain alignment for Mbp1 proteins and again color by conservation (easier to study) to briefly discuss the following question.
+For example, the following discussion for ''Saccharomyces cerevisiae'' would be sufficient for full marks:
+:(Numbers refer to branchpoints of the mixed gene tree, letters to branchpoints of the species tree). There are four subclades that are shared by most current species; they branch from 129, 108, 76 and (94 + 102). For the latter case, the branching order does not appear to be well resolved, but by comparison with the species tree, we can argue that branch 102 corresponds to branch (H) and should be inserted between branchpoints 94 (corresponding to (A)) and 96 (B), not after branch 74. This is because the species under 95 and 102 share a common ancestor (B) that is distinct from 95. ''Saccharomyces cerevisiae'' has one gene in each of these major subclades; there is no gene loss. (Note however that there is no Dikaryomycota (2) orthologue of a Sok2 gene.) ''Saccharomyces cerevisiae'' has an additional paralogue to Sok2 that created the Phd1 gene. This is shared with ''Candida albicans''. There are three possibilities to explain this: (''i'') the gene could have been duplicated before (H) and then lost in separate, independent events after I, J, K, M and N in those species that do not possess an orthologue; (''ii'') the gene could have arisen after (N) or after (K) and then passed by horizontal gene transfer from or to ''S. cerevisiae''; or (''iii'') the annotations of orthologues could be incorrect and some of the genes labelled SokA (Sok2 paralogues) could in fact be Phd1 orthologues. If this were the case, it would require a reassessment of how much gene loss would be necessary to explain the subclade below 108.
-&nbsp;<br><div style="padding: 5px; background: #DDDDEE;">
+&nbsp;
-* Considering the conservation patterns for Mbp1 orthologues, and assuming that all these orthologues bind DNA in a similar way, which model appears to be more plausible for protein-DNA interactions in APSES domains? Is it the canonical, or the non-canonical binding mode? Discuss briefly what you would expect to find and how this relates to your observations. Distinguish clearly between experimental evidence, computational inference and empirical hypothesis. You are of course welcome to paste detail views (stereo !) of particular sidechains, or surfaces etc. if this helps your arguments. Sometimes a picture is worth many words. But this is not a requirement, we are more interested in evidence-based reasoning than in the form of the presentation.
+&nbsp;
-(2 marks)
-</div>
-&nbsp;<br>
-&nbsp;<br>
 <div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
-==(5) Summary of Resources==
+==(3) Summary of Resources==
 </div>
 &nbsp;<br>
-;Links and background reading
+;Links
+:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Baldauf_2003_PhylogenyTutorial.pdf '''Review (PDF, restricted)''' Sandra Baldauf: Phylogeny for the Faint of Heart]
-:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Peitsch_2002_UseOfModels.pdf '''Review (PDF, restricted)''' Manuel Peitsch on Homology Modeling]
-:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Aravind_2005_HTHdomains.pdf '''Review (PDF, restricted)''' Aravind ''et al.'' Helix-turn-helix domains]
-:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/2000_Gajiwala_WingedHelixDomains.pdf '''Review (PDF, restricted)''' Gajiwala &amp; Burley, winged-Helix domains]
 :* [[Organism_list_2007|Assigned Organisms]]
-:* [http://www.wwpdb.org/documentation/format23/v2.3.html '''PDB file format'''] (see the Coordinate Section if you are unsure about chain identifiers)
+:* [http://evolution.genetics.washington.edu/phylip.html '''PHYLIP''' home page]
-:* [http://en.wikipedia.org/wiki/Structural_alignment Wikipedia on '''Structural Superposition'''] <small>(although the article is called "Structural Alignment")</small>
+:* [http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html '''PHYLIP''' Web Service at the Institut Pasteur]
+:*[[Assignment_5_fallback_data|'''Fallback data''']]
-;[[Assignment_4_fallback_data|'''Fallback Data page''']]
+;APSES domain alignment
+:* [[APSES_domains_MUSCLE_revised|All '''APSES domains - MUSCLE aligned''' and sequence names revised]]
-;Alignments
+;Tree
-:* [[APSES_domains_MUSCLE|APSES domains MUSCLE aligned]]
+:*[[APSES_domains_reference_tree|'''APSES domains reference tree''']]
 &nbsp;
 &nbsp;
-{{Template:Assignment_Footer}}
+<div style="padding: 5px; background: #D3D8E8;  border:solid 1px #AAAAAA;">
+[End of assignment]
+</div>
+If you have any questions at all, don't hesitate to mail me at [mailto:boris.steipe@utoronto.ca boris.steipe@utoronto.ca] or post your question to the [mailto:bch441_2007@googlegroups.com Course Mailing List]