BIO Assignment Week 7

From "A B C"
Jump to navigation Jump to search

Assignment for Week 8
Phylogenetic Analysis

< Assignment 7 Assignment 9 >

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.



 


 

Nothing in Biology makes sense except in the light of evolution.
Theodosius Dobzhansky

... but does evolution make sense in the light of biology?

As we have seen in the previous assignments, the Mbp1 transcription factor has homologues in all other fungi, yet there is not always a clear one-to-one mapping between members of a family in distantly related species. It appears that various systems of APSES domain transcription factors have evolved independently. Of course this bears directly on our notion of function - what it means to say that two genes in different organisms have the "same" function. In case two organisms both have an orthologous gene for the same, distinct function, saying that the function is the same may be warranted. But what if that gene has duplicated in one species, and the two paralogues now perform different, related functions in one organism? Theses two are still orthologues to the other species, but now we expect functionally significant residues to have adapted to the new role of one paralogue. In order to be able to even ask such questions, we need to make the evolutionary history of gene families explicit. This is the domain of phylogenetic analysis. We can ask questions like: how many paralogues did the cenancestor of a clade possess? Which of these underwent additional duplications in the phylogenesis of the organism I am studying? Did any genes get lost? And - adding additional biological insight to the picture - did the observed duplications lead to the "invention" of new biological systems? When was that? And perhaps even: how did the species benefit from this event?

We will develop this kind of analysis in this assignment. In the previous assignment you have established which gene in your species is the reciprocally most closely related orthologue to yeast Mbp1 (with reciprocal best match) and you have identified the full complement of APSES domain genes in your assigned organism (as a result of your PSI-BLAST search). In this assignment, we will analyse these genes' evolutionary relationship and compare it to the evolutionary relationship of other fungal APSES domains. The goal is to define families of related transcription factors and their evolutionary history. I have prepared APSES domains from six diverse reference species, you will add YFO's APSES domain sequences and compute the phylogram for all genes. The goal is to identify orthologues and paralogues.

A number of excellent tools for phylogenetic analysis exist; general purpose packages include the (free) PHYLIP package, the MEGA package and the (commercial) PAUP* package. Of these, only MEGA is still under active development, although PHYLIP still functions perfectly (except for problems with graphical windows under Mac OS 10.6). Specialized tools for tree-building include Treepuzzle or Mr. Bayes. This assignment is constructed around programs that are available in PHYLIP, however you are welcome to use other tools that fulfill a similar purpose if you wish. In this field, researchers consider trees that have been built with ML (maximum likelihood) methods to be more reliable than trees that are built with parsimony methods, or distance methods such as NJ (Neighbor Joining). However ML methods are also much more compute-intensive. Just like with multiple sequence alignments, some algorithms will come closer to guessing the truth and others will not and usually it is hard to tell which is the more trustworthy of two diverging results. The prudent researcher tries out alternatives and forms her own opinion. Specifically, we may usually assume results that converge when computed with different algorithms, to be more reliable than those that depend strongly on a particular algorithm, parameters, or details of input data.

However, we will take a shortcut in this assignment (something you should not do in real life). We will skip establishing the reliability of the tree with a bootstrap procedure, i.e. repeat the tree-building a hundred times with partial data and see which branches and groupings are robust and which depend on the details of the data. (If you are interested, have a look here for the procedure for running a bootstrap analysis on the data set you are working with, but this may require a day or so of computing time on your computer.) In this assignment, we will simply acknowledge that bifurcations that are very close to each other have not been "resolved" and be appropriately cautious in our inferences. In phylogenetic analysis, not all lines a program draws are equally trustworthy. Don't take the trees as a given fact just because a program suggests this. Look at the evidence, include independent information where available, use your reasoning, and analyse the results critically. As you will see, there are some facts that we know for certain: we know which species the genes come from, and we can (usually) make good assumptions about the relationship of the species themselves - the history of speciation events that underlies all evolution of genes. This is extremely helpful information for our work.


If you would like to review concepts of trees, clades, LCAs, OTUs and the like, I have linked an excellent and very understandable introduction-level article on phylogenetic analysis here and to the resource section at the bottom of this page.

Baldauf (2003) Phylogeny for the faint of heart: a tutorial. Trends Genet 19:345-51. (pmid: 12801728)

PubMed ] [ DOI ] Phylogenetic trees seem to be finding ever broader applications, and researchers from very different backgrounds are becoming interested in what they might have to say. This tutorial aims to introduce the basics of building and interpreting phylogenetic trees. It is intended for those wanting to understand better what they are looking at when they look at someone else's trees or to begin learning how to build their own. Topics covered include: how to read a tree, assembling a dataset, multiple sequence alignment (how it works and when it does not), phylogenetic methods, bootstrap analysis and long-branch artefacts, and software and resources.

nbsp;



nbsp;


Preparing input alignments

In this section, we start from a collection of homologous APSES domains, construct a multiple sequence alignment, and edit the alignment to make it suitable for phylogenetic analysis.


Principles

In order to use molecular sequences for the construction of phylogenetic trees, you have to build a multiple alignment first, then edit it. This is important: all rows of sequences have to contain the exact same number of characters and to hold aligned characters in corresponding positions. Phylogeny programs are not meant to revise an alignment but to analyze evolutionary relationships, after the alignment has been determined. The program's inferences are made on a column-wise basis and if your columns contain data from unrelated positions, the inferences are going to be questionable. Clearly, in order for tree-estimation to work, one must not include fragments of sequence which have evolved under a different evolutionary model as all others, e.g. after domain fusion, or after accommodating large stretches of indels. Thus it is appropriate to edit the sequences and pare them down to a most characteristic subset of amino acids. The goal is not to be as comprehensive as possible, but to input those columns of aligned residues that will best represent the true phylogenetic relationships between the sequences.


The result of the tree construction is a decision about the most likely evolutionary relationships. Fundamentally, tree-construction programs decide which sequences had common ancestors.


Distance based phylogeny programs start by using sequence comparisons to estimate evolutionary distances:

  • they apply a model of evolution such as a mutation data matrix, to calculate a score for each pair of sequences,
  • this score is stored in a "distance matrix" ...
  • ... and used to estimate a tree that groups sequences with close relationships together. (e.g. by using an NJ, Neigbor Joining, algorithm).

They are fast, can work on large numbers of sequences, but are less accurate if genes evolve at different rates.


Parsimony based phylogeny programs build a tree that minimizes the number of mutation events that are required to get from a common ancestral sequence to all observed sequences. They take all columns into account, not just a single number per sequence pair, as the Distance Methods do. For closely related sequences they work very well, but they construct inaccurate trees when they can't make good estimates for the required number of sequence changes.


ML, or Maximum Likelihood methods attempt to find the tree for which the observed sequences would be the most likely under a particular evolutionary model. They are based on a rigorous statistical framework and yield the most robust results. But they are also quite compute intensive and a tree of the size that we are building in this assignment is a challenge for the resources of common workstation (runs about an hour on my computer). If the problem is too large, one may split a large problem into smaller, obvious subtrees (e.g. analysing orthologues as a group, only including a few paralogues for comparison) and then merge the smaller trees; this way even very large problems can become tractable.

ML methods suffer less from "long-branch attraction" - the phenomenon that weakly similar sequences can be grouped inappropriately close together in a tree due to spuriously shared differences.


Bayesian methods don't estimate the tree that gives the highest likelihood for the observed data, but find the most probably tree, given that the data have been observed. If this sounds conceptually similar to you, then you are not wrong. However, the approaches employ very different algorithms. And Bayesian methods need a "prior" on trees before observation.


Choosing sequences

In principle, we have discussed strategies for using PSI-BLAST to collect suitable sequences earlier. To prepare the process, I have collected all APSES domains for six reference fungal species, together with the KilA-N domain of E. coli. The process is explained on the reference APSES domains page.


Renaming sequences

Renaming sequences so that their species is apparent is crucial for the interpretation of mixed gene trees. Refer to the reference APSES domains page to see how I have prepared the FASTA sequence headers.


Adding an outgroup

To analyse phylogenetic trees it is useful (and for some algorithms required) to define an outgroup, a sequence that presumably diverged from all other sequences in a clade before they split up among themselves. Wherever the outgroup inserts into the tree, this is the root of the rest of the tree. And whenever a molecular clock is assumed, the branching point that connects the outgroup can be assumed to be the oldest divergence event. I have defined an outgroup sequence and added it to the reference APSES domains page. The procedure is explained in detail on that page.

>gi|301025594|ref|ZP_07189117.1| KilA-N domain protein [Escherichia coli MS 69-1]
MTSFQLSLISREIDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQS
FKGGRPENQGTWVHPDIAINLAQWLSPKFAVQVSRWVREWMSGERTTAEMPVHLKRYMVNRSRIPHTHFS
ILNELTFNLVAPLEQAGYTLPEKMVPDISQGRVFSQWLRDNRNVEPKTFPTYDHEYPDGRVYPARLYPNE
YLADFKEHFNNIWLPQYAPKYFADRDKKALALIEKIMLPNLDGNEQF

E. coli KilA-N protein. Residues that do not align with APSES domains are shown in grey.


Calculating alignments

Task:

  1. Navigate to the reference APSES domains page and copy the APSES/KilA-N domain sequences.
  2. Open Jalview, select File → Input Alignment → from Textbox and paste the sequences into the textbox.
  3. Add the APSES domain sequences from your species (YFO) that you have previously defined through PSI-BLAST. Don't worry that the sequences are longer, the MSA algorithm should be able to take care of that. However: do rename your sequences to follow the pattern for the other domains, i.e. edit the FASTA header line to begin with the five-letter abbreviated species code.
  4. When all the sequences are present, click on New Window.
  5. In Jalview, select Web Service → Alignment → MAFFT Multiple Sequence Alignment. The alignment is calculated in a few minutes and displayed in a new window.
  6. Choose any colour scheme and add Colour → by Conservation. Adjust the slider left or right to see which columns are highly conserved.
  7. Save the alignment as a Jalview project before editing it for phylogenetic analysis. You may need it again.

Editing sequences

As discussed in the lecture, we should edit our alignments to make them suitable for phylogeny calculations. Here are the principles:

Follow the fundamental principle that all characters in a column should be related by homology. This implies the following rules of thumb:

  • Remove all stretches of residues in which the alignment appears ambiguous (not just highly variable, but ambiguous regarding the aligned positions).
  • Remove all frayed N- and C- termini, especially regions in which not all sequences that are being compared appear homologous and that may stem from unrelated domains. You want to only retain the APSES domains. All the extra residues from the YFO sequence can be deleted.
  • Remove all gapped regions that appear to be alignment artefacts due to inappropriate input sequences.
  • Remove all but approximately one column from gapped regions in those cases where the presence of several related insertions suggest that the indel is real, and not just an alignment artefact. (Some researchers simply remove all gapped regions).
  • Remove sections N- and C- terminal of gaps where the alignment appears questionable.
  • If the sequences fit on a single line you will save yourself potential trouble with block-wise vs. interleaved input. If you do run out of memory try removing columns of sequence. Or remove species that you are less interested in from the alignment.
  • Move your outgroup sequence to the first line of your alignment, since this is where PHYLIP will look for it by default.

Handling indels

Gaps are a real problem, as usual. Strictly speaking, the similarity score of an alignment program as well as the distance score of a phylogeny program are not calculated for an ordered sequence, but for a sum of independent values, one for each aligned columns of characters. The order of the columns does not change the score. However in an optimal sequence alignment with gaps, this is no longer strictly true since a one-character gap creation has a different penalty score than a one-character gap extension! Most alignment programs use a model with a constant gap insertion penalty and a linear gap extension penalty. This is not rigorously justified from biology, but parametrized (or you could say "tweaked") to correspond to our observations. However, most phylogeny programs, (such as the programs in PHYLIP) do not work in this way. PHYLIP strictly operates on columns of characters and treats a gap character just like a residue with the one letter code "-". Thus gap insertion- and extension- characters get the same score. For short indels, this underestimates the distance between pairs of sequences, since any evolutionary model should reflect the fact that gaps are much less likely than point mutations. If the gap is very long though, all events are counted individually as many single substitutions (rather than one lengthy one) and this overestimates the distance. And it gets worse: long stretches of gaps can make sequences appear similar in a way that is not justified, just because they are identical in the "-" character. It is therefore common and acceptable to edit gaps in the alignment and delete all but one or two columns of gapped sequence, or to remove such columns altogether.


(Possible) steps in editing a multiple sequence alignment towards a PHYLIP input file. a: raw alignment (CLUSTAL format); b: sequences assembled into single lines; c: columns to be deleted highlighted in red - 1, 3 and 4: large gaps; 2: uncertain alignment and 5: frayed C-terminus: both would put non-homologous characters into the same column; d: input data for PHYLIP: names for sequences must not be longer than 10 characters, the first line must contain the number of sequences and the sequence length. PHYLIP is very picky about incorrectly formatted input, read the PHYLIP sequence format guide.


Task:
Prepare a PHYLIP input file from the sequences you have prepared following the principles above. The simplest way to achieve this appears to be:

    1. Copy the sequences you want into a textfile. Make sure the "reference sequences", are included, the outgroup and the sequences from YFO.
    2. In a browser, navigate to the Readseq sequence conversion service.
    3. Paste your sequences into the form and choose Phylip as the output format. Click on submit.
    4. Save the resulting page as a text file. Give it some useful name such as APSES_domains.phy.

Calculating trees

In this section we perform the actual phylogenetic calculation.

Task:

  1. Download the PHYLIP package from the Phylip homepage and install it on your computer.
  2. Make a copy of your PHYLIP formatted sequence alignment file and name it infile. Note: make sure that your Microsoft Windows operating system does not silently append the extension ".txt" to your file. It should be called "infile", nothing else. Place this file into the directory where the PHYLIP executables reside on your computer.
  3. Run the proml program of PHYLIP (protein sequences, maximum likelihood tree) to calculate a phylogenetic tree (on the Mac, use proml.app). The program will automatically use "infile" for its input. Use the default parameters except that you should change option S: Speedier but rougher analysis? to No, not rough - your analysis should not sacrifice accuracy for speed. The calculation may take some fifteen minutes or so..


The program produces two output files: the outfile contains a summary of the run, the likelihood of bifurcations, and an ASCII representation of the tree. Open it with your usual text editor to have a look, and save the file with a meaningful name. The outtree contains the resulting tree in so-called "Newick" format. Again, have a look and save it with a meaningful filename.


Analysing your tree

In order to analyse your tree, you need a species tree as reference. Then you can begin comparing your expectations with the observed tree.


The species tree reference

I have constructed a cladogram for many of the species we are analysing, based on data published for 1551 fungal ribosomal sequences. The six reference species are included. Such reference trees from rRNA data are a standard method of phylogenetic analysis, supported by the assumption that rRNA sequences are monophyletic and have evolved under comparable selective pressure in all species.

Cladogram of many fungi studied in the assignments. This cladogram is based on small subunit ribosomal rRNA sequences, and largely follows Tehler et al. (2003) Mycol Res. 107:901-916. Even though many details of fungal phylogeny remain unresolved, the branches shown here individually appear to have strong support. In a cladogram such as this, the branch lengths are not drawn to any scale of similarity.

Your species may not be included in this cladogram, but you can easily create your own species tree with the following procedure:

Task:

  1. Access the NCBI taxonomy database Entrez query page.
  2. Edit the list of reference species below to include your species and paste it into the form.
"Aspergillus nidulans"[Scientific Name] OR
"Candida albicans"[Scientific Name] OR
"Neurospora crassa"[Scientific Name] OR
"Saccharomyces cerevisiae"[Scientific Name] OR
"Schizosaccharomyces pombe"[Scientific Name] OR
"Ustilago maydis"[Scientific Name]
  1. Next, as Display Settings option, select Common Tree.

You can use that tree as is - or visualize it more nicely as follows

  1. Select the phylip tree option from the menu, and click save as to save the tree in phylip (Newick) tree format.
  2. The output can be edited, and visualized in any program that reads phylip trees. One particularly nice viewer is the iTOL - Interactive Tree of Life project. Copy the contents of the phyliptree.phy file that the NCBI page has written, navigate to the iTOL project, click on the Data Upload tab, paste your tree data and click Upload. Then go to the main display page to view the tree. Change the view from Circular to Normal.
Alternatively ...

You can look up your species in the latest version of the species tree for the fungi:

Ebersberger et al. (2012) A consistent phylogenetic backbone for the fungi. Mol Biol Evol 29:1319-34. (pmid: 22114356)

PubMed ] [ DOI ] The kingdom of fungi provides model organisms for biotechnology, cell biology, genetics, and life sciences in general. Only when their phylogenetic relationships are stably resolved, can individual results from fungal research be integrated into a holistic picture of biology. However, and despite recent progress, many deep relationships within the fungi remain unclear. Here, we present the first phylogenomic study of an entire eukaryotic kingdom that uses a consistency criterion to strengthen phylogenetic conclusions. We reason that branches (splits) recovered with independent data and different tree reconstruction methods are likely to reflect true evolutionary relationships. Two complementary phylogenomic data sets based on 99 fungal genomes and 109 fungal expressed sequence tag (EST) sets analyzed with four different tree reconstruction methods shed light from different angles on the fungal tree of life. Eleven additional data sets address specifically the phylogenetic position of Blastocladiomycota, Ustilaginomycotina, and Dothideomycetes, respectively. The combined evidence from the resulting trees supports the deep-level stability of the fungal groups toward a comprehensive natural system of the fungi. In addition, our analysis reveals methodologically interesting aspects. Enrichment for EST encoded data-a common practice in phylogenomic analyses-introduces a strong bias toward slowly evolving and functionally correlated genes. Consequently, the generalization of phylogenomic data sets as collections of randomly selected genes cannot be taken for granted. A thorough characterization of the data to assess possible influences on the tree reconstruction should therefore become a standard in phylogenomic analyses.

Visualizing the tree

Once Phylip is done calculating the tree, the tree in a text format will be contained in the Phylip outfile - the documentation of what the program has done. Open this textfile for a first look. The tree is complicated and it can look confusing at first. The tree in Newick format is contained in the Phylip file outtree. Visualize it as follows:

Task:

  1. Open outtree in a texteditor and copy the tree.
  2. Visualize the tree in alternative representations:
    1. I have already mentioned the iTOL - Interactive Tree of Life project viewer.
    2. Navigate to the Proweb treeviewer, paste and visualize your tree.
    3. Navigate to the Trex-online Newick tree viewer for an alternative view. Visualize the tree as a phylogram. You can increase the window height to keep the labels from overlapping.
  3. A particularly useful viwer is actually Jalview.
    1. Open Jalview, copy the sequences you have used and paste them via File → Input Alignment → from Textbox.
    2. In the alignment window, choose File → Load associated Tree and load the Phylip outtree file. You can click into the tree-window to show which clades branch off at what level - it should be obvious that you can identify three major subclades (plus the outgroup). This view is particularly informative, since you can associate the clades of the tree with the actual sequences in the alignment, and get a good sense what sequence features the tree is based on.
    3. Try the Calculate → Sort → By Tree Order option to sort the sequences by their position in the tree. Also note that you can flip the tree around a node by double-clicking on it. This is especially useful: try to rearrange the tree so that the subdivisions into clades are apparent. Clicking into the window "cuts" the tree and colours your sequences according to the clades in which they are found. This is useful to understand what particular sequences contributed to which part of the phylogenetic inference.
    4. Study the tree: understand what you see and what you would have expected.


Here are two principles that will help you make sense of the tree.


A: A gene that is present in an ancestral species is inherited in all descendant species. The gene has to be observed in all OTUs, unless its has been lost (which is a rare event).

B: Paralogous genes in an ancestral species should give rise to monophyletic subtrees for each of the paralogues, in all descendants; this means: if the LCA of a branch has e.g. three genes, we would expect three copies of the species cladogram below this branchpoint, one for each of these genes. Each of these subtrees should recapitulate the reference phylogenetic tree of the species, up to the branchpoint of their LCA.


With these two simple principles (you should draw them out on a piece of paper if they do not seem obvious to you), you can probably pry your tree apart quite nicely. A few colored pencils and a printout of the tree will help.

The APSES domains of LCA

Note: A common confusion about cenancestral genes (LCA = Last Common Ancestor) arises from the fact that by far not all expected genes are present in the OTUs. Some will have been lost, some will have been incorrectly annotated in their genome (frameshifts!) and not been found with PSI-BLAST, some may have diverged beyond recognizability. In general you have to ask: given the species represented in a subclade, what is the last common ancestor of that branch? The expectation is that all descendants of that ancestor should be represented in that branch unless one of the above reasons why a gene might be absent would apply.


Task:

  • Consider how many APSES domain proteins the fungal cenancestor appears to have possessed and what evidence you see in the tree that this is so. Note that the hallmark of a clade that originated in the cenancestor is that it contains species from all subsequent major branches of the species tree.


The APSES domains of YFO

Assume that the cladogram for fungi that I have given above is correct, and that the mixed gene tree you have calculated is fundamentally correct in its overall arrangement but may have local inaccuracies due to the limited resolution of the method. You have identified the APSES domain genes of the fungal cenancestor above. Apply the expectations we have stated above to identify the sequence of duplications and/or gene loss in your organism through which YFO has ended up with the APSES domains it possesses today.

Task:

  1. Print the tree to a single sheet of paper.
  2. Mark the clades for the genes of the cenancestor.
  3. Label all subsequent branchpoints that affect the gene tree for YFO with either "D" (for duplication) or "S" (for speciation). Remember that specific speciation events can appear more than once in a tree. Identify such events.
  4. Bring this sheet with you to the quiz on Wednesday.

Bonus: when did it happen?

A very cool resource is Timetree - a tool that allows you to estimate divergence times between species. For example, the speciation event that separated the main branches of the fungi - i.e. the time when the fungal cenacestor lived - is given by the divergence time of Schizosaccharomyces pombe and Saccharomyces cerevisiaea: 761,000,000 years ago. For comparison, these two fungi are therefore approximately as related to each other as you are ...

A) to the rabbit?
B) to the opossum?
C) to the chicken?
D) to the rainbow trout?
E) to the warty sea squirt?
F) to the bumblebee?
G) to the earthworm?
H) to the fly agaric?

Check it out - the question will be on the quiz.

Identifying Orthologs

In the last assignment we discovered homologs to S. cerevisiae Mbp1 in YFO. Some of these will be orthologs to Mbp1, some will be paralogs. Some will have similar function, some will not. We discussed previously that genes that evolve under continuously similar evolutionary pressure should be most similar in sequence, and should have the most similar "function".

In this assignment we will define the YFO gene that is the most similar ortholog to S. cerevisiae Mbp1, and perform a multiple sequence alignment with it.

Let us briefly review the basic concepts.

 

All related genes are homologs.


Two central definitions about the mutual relationships between related genes go back to Walter Fitch who stated them in the 1970s:

 

Orthologs have diverged after speciation.
Paralogs have diverged after duplication.


 


Hypothetical evolutionary tree. A single gene evolves through two speciation events and one duplication event. A duplication occurs during the evolution from reptilian to synapsid. It is easy to see how this pair of genes (paralogs) in the ancestral synapsid gives rise to two pairs of genes in pig and elephant, respectively. All circle genes are mutually orthologs, they form a "cluster of orthologs". All genes within one species are mutual paralogs–they are so called in-paralogs. The circle gene in pig and the triangle gene in the elephant are so-called out-paralogs. Somewhat counterintuitively, the triangle gene in the pig and the circle gene in the raven are also orthologs - but this has to be, since the last common ancestor diverged by speciation. The "phylogram" on the right symbolizes the amount of evolutionary change as proportional to height difference to the "root". It is easy to see how a bidirectional BLAST search will only find pairs of most similar orthologs. If applied to a group of species, bidirectional BLAST searches will find clusters of orthologs only (except if genes were lost, or there are anomalies in the evolutionary rate.)


Defining orthologs

To be reasonably certain about orthology relationships, we would need to construct and analyze detailed evolutionary trees. This is computationally expensive and the results are not always unambiguous either, as we will see in a later assignment. But a number of different strategies are available that use precomputed results to define orthologs. These are especially useful for large, cross genome surveys. They are less useful for detailed analysis of individual genes. Pay the sites a visit and try a search.


Orthologs by eggNOG
The eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) database contains orthologous groups of genes at the EMBL. It seems to be continuously updtaed, the search functionality is reasonable and the results for yeast Mbp1 show many genes from several fungi. Importantly, there is only one gene annotated for each species. Alignments and trees are also available, as are database downloads for algorithmic analysis.

 

Powell et al. (2014) eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Res 42:D231-9. (pmid: 24297252)

PubMed ] [ DOI ] With the increasing availability of various 'omics data, high-quality orthology assignment is crucial for evolutionary and functional genomics studies. We here present the fourth version of the eggNOG database (available at http://eggnog.embl.de) that derives nonsupervised orthologous groups (NOGs) from complete genomes, and then applies a comprehensive characterization and analysis pipeline to the resulting gene families. Compared with the previous version, we have more than tripled the underlying species set to cover 3686 organisms, keeping track with genome project completions while prioritizing the inclusion of high-quality genomes to minimize error propagation from incomplete proteome sets. Major technological advances include (i) a robust and scalable procedure for the identification and inclusion of high-quality genomes, (ii) provision of orthologous groups for 107 different taxonomic levels compared with 41 in eggNOGv3, (iii) identification and annotation of particularly closely related orthologous groups, facilitating analysis of related gene families, (iv) improvements of the clustering and functional annotation approach, (v) adoption of a revised tree building procedure based on the multiple alignments generated during the process and (vi) implementation of quality control procedures throughout the entire pipeline. As in previous versions, eggNOGv4 provides multiple sequence alignments and maximum-likelihood trees, as well as broad functional annotation. Users can access the complete database of orthologous groups via a web interface, as well as through bulk download.


Orthologs at OrthoDB
OrthoDB includes a large number of species, among them all of our protein-sequenced fungi. However the search function (by keyword) retrieves many paralogs together with the orthologs, for example, the yeast Soc2 and Phd1 proteins are found in the same orthologous group these two are clearly paralogs.

 

Waterhouse et al. (2013) OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs. Nucleic Acids Res 41:D358-65. (pmid: 23180791)

PubMed ] [ DOI ] The concept of orthology provides a foundation for formulating hypotheses on gene and genome evolution, and thus forms the cornerstone of comparative genomics, phylogenomics and metagenomics. We present the update of OrthoDB-the hierarchical catalog of orthologs (http://www.orthodb.org). From its conception, OrthoDB promoted delineation of orthologs at varying resolution by explicitly referring to the hierarchy of species radiations, now also adopted by other resources. The current release provides comprehensive coverage of animals and fungi representing 252 eukaryotic species, and is now extended to prokaryotes with the inclusion of 1115 bacteria. Functional annotations of orthologous groups are provided through mapping to InterPro, GO, OMIM and model organism phenotypes, with cross-references to major resources including UniProt, NCBI and FlyBase. Uniquely, OrthoDB provides computed evolutionary traits of orthologs, such as gene duplicability and loss profiles, divergence rates, sibling groups, and now extended with exon-intron architectures, syntenic orthologs and parent-child trees. The interactive web interface allows navigation along the species phylogenies, complex queries with various identifiers, annotation keywords and phrases, as well as with gene copy-number profiles and sequence homology searches. With the explosive growth of available data, OrthoDB also provides mapping of newly sequenced genomes and transcriptomes to the current orthologous groups.


Orthologs at OMA

OMA (the Orthologous Matrix) maintained at the Swiss Federal Institute of Technology contains a large number of orthologs from sequenced genomes. Searching with MBP1_YEAST (this is the Swissprot ID) as a "Group" search finds the correct gene in EREGO, KLULA, CANGL and SACCE. But searching with the sequence of the Ustilago maydis ortholog does not find the yeast protein, but the orthologs in YARLI, SCHPO, LACCBI, CRYNE and USTMA. Apparently the orthologous group has been split into several subgroups across the fungi. However as a whole the database is carefully constructed and available for download and API access; a large and useful resource.

 

Altenhoff et al. (2011) OMA 2011: orthology inference among 1000 complete genomes. Nucleic Acids Res 39:D289-94. (pmid: 21113020)

PubMed ] [ DOI ] OMA (Orthologous MAtrix) is a database that identifies orthologs among publicly available, complete genomes. Initiated in 2004, the project is at its 11th release. It now includes 1000 genomes, making it one of the largest resources of its kind. Here, we describe recent developments in terms of species covered; the algorithmic pipeline--in particular regarding the treatment of alternative splicing, and new features of the web (OMA Browser) and programming interface (SOAP API). In the second part, we review the various representations provided by OMA and their typical applications. The database is publicly accessible at http://omabrowser.org.

... see also the related articles, much innovative and carefully done work on automated orthologue definition by the Dessimoz group.


Orthologs by syntenic gene order conservation
We will revisit this when we explore the UCSC genome browser.


Orthologs by RBM
Defining it yourself. RBM (or: Reciprocal Best Match) is easy to compute and half of the work you have already done in Assignment 3. Get the ID for the gene which you have identified and annotated as the best BLAST match for Mbp1 in YFO and confirm that this gene has Mbp1 as the most significant hit in the yeast proteome. The results are unambiguous, but there may be residual doubt whether these two best-matching sequences are actually the most similar orthologs.

Task:

  1. Navigate to the BLAST homepage.
  2. Paste the YFO RefSeq sequence identifier into the search field. (You don't have to search with sequences–you can search directly with an NCBI identifier IF you want to search with the full-length sequence.)
  3. Set the database to refseq, and restrict the species to Saccharomyces cerevisiae.
  4. Run BLAST.
  5. Keep the window open for the next task.

The top hit should be yeast Mbp1 (NP_010227). E mail me your sequence identifiers if it is not. If it is, you have confirmed the RBM or BBM criterion (Reciprocal Best Match or Bidirectional Best Hit, respectively).

Technically, this is not perfectly true since you have searched with the APSES domain in one direction, with the full-length sequence in the other. For this task I wanted you to try the search-with-accession-number. Therefore the procedural laxness, I hope it is permissible. In fact, performing the reverse search with the YFO APSES domain should actually be more stringent, i.e. if you find the right gene with the longer sequence, you are even more likely to find the right gene with the shorter one.


Orthology by annotation
The NCBI precomputes BLAST results and makes them available at the RefSeq database entry for your protein.

Task:

  1. In your BLAST result page, click on the RefSeq link for your query to navigate to the RefSeq database entry for your protein.
  2. Follow the Blink link in the right-hand column under Related information.
  3. Restrict the view RefSeq under the "Display options" and to Fungi.

You should see a number of genes with low E-values and high coverage in other fungi - however this search is problematic since the full length gene across the database finds mostly Ankyrin domains.


You will find that all of these approaches yield some of the orthologs. But none finds them all. The take home message is: precomputed results are good for large-scale survey-type investigations, where you can't humanly process the information by hand. But for more detailed questions, careful manual searches are still indsipensable.

Orthology by crowdsourcing
Luckily a crowd of willing hands has prepared the necessary sequences for you: in the section below you will find a link to the annotated and verified Mbp1 orthologs from last year's course :-)

We could call this annotation by many hands "crowdsourcing" - handing out small parcels of work to many workers, who would typically allocate only a small share of their time, but here the strength is in numbers and especially projects that organize via the Internet can tally up very impressive manpower, for free, or as Microwork. These developments have some interest for bioinformatics: many of our more difficult tasks can not be easily built into an algorithm, language related tasks such as text-mining, or pattern matching tasks come to mind. Allocating this to a large number of human contributors may be a viable alternative to computation. A marketplace where this kind of work is already a reality is Amazon's "Mechanical Turk" Marketplace: programmers–"requesters"– use an open interface to post tasks for payment, "providers" from all over the world can engage in these. Tasks may include matching of pictures, or evaluating the aesthetics of competing designs. A quirky example I came across recently was when information designer David McCandless had 200 "Mechanical Turks" draw a small picture of their soul for his collection.

The name "Mechanical Turk" by the way relates to a famous ruse, when a Hungarian inventor and adventurer toured the imperial courts of 18th century Europe with an automaton, dressed in turkish robes and turban, that played chess at the grandmaster level against opponents that included Napoleon Bonaparte and Benjamin Franklin. No small mechanical feat in any case, it was only in the 19th century that it was revealed that the computational power was actually provided by a concealed human.

Are you up for some "Turking"? Before the next quiz, edit the Mbp1 RBM page on the Student Wiki and include the RBM for Mbp1, for a 10% bonus on the next quiz.


 


Coloring a 3D model by conservation

With the superimposed coordinates, you can begin to get a sense whether either or both binding modes could be appropriate for a protein-DNA complex in your Mbp1 orthologue. But these are geometrical criteria only, and the protein in your species may be flexible enough to adopt a different conformation in a complex, and different again from your model. A more powerful way to analyze such hypothetical complexes is to look at conservation patterns. With VMD, you can import a sequence alignment into the MultiSeq extension and color residies by conservation. The protocol below assumes

  • You have prealigned the reference Mbp1 proteins with your species' Mbp1 orthologue;
  • You have saved the alignment in a CLUSTAL format.

You can use Jalview or any other MSA server to do so. You can even do this by hand - there should be few if any indels and the correct alignment is easy to see.

Task:

Load the Mbp1 APSES alignment into MultiSeq.
(A) In the MultiSeq Window, navigate to File → Import Data...; Choose "From Files" and Browse to the location of the alignment you have saved. The File navigation window gives you options which files to enable: choose to Enable ALN files (these are CLUSTAL formatted multiple sequence alignments).
(B) Open the alignment file, click on Ok to import the data, it will take a short while to load. If the data can't be loaded, the file may have the wrong extension: .aln is required.
(C) find the Mbp1_SACCE sequence in the list, click on it and move it to the top of the Sequences list with your mouse (the list is not static, you can re-order the sequences in any way you like).

You will see that the 1MB1 sequence and the APSES domain sequence do not match: at the N-terminus the sequence that corresponds to the PDB structure has extra residues, and in the middle the APSES sequences may have gaps inserted.

Bring the 1MB1 sequence in register with the APSES alignment.
(A)MultiSeq supports typical text-editor selection mechanisms. Clicking on a residue selects it, clicking on a row selects the whole sequence. Dragging with the mouse selects several residues, shift-clicking selects ranges, and option-clicking toggles the selection on or off for individual residues. Using the mouse and/or the shift key as required, select the entire first column of the sequences you have imported.
(B) Select Edit → Enable Editing... → Gaps only to allow changing indels.
(C) Pressing the spacebar once should insert a gap character before the selected column in all sequences. Insert as many gaps as you need to align the beginning of sequences with the corresponding residues of 1MB1: S I M ...
(D) Now insert as many gaps as you need into the structure sequence, to align it completely with the Mbp1_SACCE APSES domain sequence. (Simply select residues in the sequence and use the space bar to insert gaps. (Note: I have noticed a bug that sometimes prevents slider or keyboard input to the MultiSeq window; it fails to regain focus after operations in a different window. I don't know whether this is a Mac related problem or a more general bug in MultiSeq. When this happens I quit VMD and restore the session from a saved state. It is a bit annoying but not mission-critical.)
(E) When you are done, it may be prudent to save the state of your alignment. Use File → Save Session...
Color by similarity
(A) Use the View → Coloring → Sequence similarity → BLOSUM30 option to color the residues in the alignment and structure. This clearly shows you where conserved and variable residues are located and allows to analyze their structural context.
(B) You can adjust the color scale in the usual way by navigating to VMD main → Graphics → Colors..., choosing the Color Scale tab and adjusting the scale midpoint.
(C) Navigate to the Representations window and apply the color scheme to your tube-and-sidechain representation: double-click on the NewCartoon representation to hide it and use User coloring of your Tube and Licorice representations to apply the sequence similarity color gradient that MultiSeq has calculated.
 
  • Once you have colored the residues of your model by conservation, create another informative stereo-image and paste it into your assignment.

 

Links and Resources

Literature
Ebersberger et al. (2012) A consistent phylogenetic backbone for the fungi. Mol Biol Evol 29:1319-34. (pmid: 22114356)

PubMed ] [ DOI ] The kingdom of fungi provides model organisms for biotechnology, cell biology, genetics, and life sciences in general. Only when their phylogenetic relationships are stably resolved, can individual results from fungal research be integrated into a holistic picture of biology. However, and despite recent progress, many deep relationships within the fungi remain unclear. Here, we present the first phylogenomic study of an entire eukaryotic kingdom that uses a consistency criterion to strengthen phylogenetic conclusions. We reason that branches (splits) recovered with independent data and different tree reconstruction methods are likely to reflect true evolutionary relationships. Two complementary phylogenomic data sets based on 99 fungal genomes and 109 fungal expressed sequence tag (EST) sets analyzed with four different tree reconstruction methods shed light from different angles on the fungal tree of life. Eleven additional data sets address specifically the phylogenetic position of Blastocladiomycota, Ustilaginomycotina, and Dothideomycetes, respectively. The combined evidence from the resulting trees supports the deep-level stability of the fungal groups toward a comprehensive natural system of the fungi. In addition, our analysis reveals methodologically interesting aspects. Enrichment for EST encoded data-a common practice in phylogenomic analyses-introduces a strong bias toward slowly evolving and functionally correlated genes. Consequently, the generalization of phylogenomic data sets as collections of randomly selected genes cannot be taken for granted. A thorough characterization of the data to assess possible influences on the tree reconstruction should therefore become a standard in phylogenomic analyses.

Marcet-Houben & Gabaldón (2009) The tree versus the forest: the fungal tree of life and the topological diversity within the yeast phylome. PLoS ONE 4:e4357. (pmid: 19190756)

PubMed ] [ DOI ] A recurrent topic in phylogenomics is the combination of various sequence alignments to reconstruct a tree that describes the evolutionary relationships within a group of species. However, such approach has been criticized for not being able to properly represent the topological diversity found among gene trees. To evaluate the representativeness of species trees based on concatenated alignments, we reconstruct several fungal species trees and compare them with the complete collection of phylogenies of genes encoded in the Saccharomyces cerevisiae genome. We found that, despite high levels of among-gene topological variation, the species trees do represent widely supported phylogenetic relationships. Most topological discrepancies between gene and species trees are concentrated in certain conflicting nodes. We propose to map such information on the species tree so that it accounts for the levels of congruence across the genome. We identified the lack of sufficient accuracy of current alignment and phylogenetic methods as an important source for the topological diversity encountered among gene trees. Finally, we discuss the implications of the high levels of topological variation for phylogeny-based orthology prediction strategies.

Baldauf (2003) Phylogeny for the faint of heart: a tutorial. Trends Genet 19:345-51. (pmid: 12801728)

PubMed ] [ DOI ] Phylogenetic trees seem to be finding ever broader applications, and researchers from very different backgrounds are becoming interested in what they might have to say. This tutorial aims to introduce the basics of building and interpreting phylogenetic trees. It is intended for those wanting to understand better what they are looking at when they look at someone else's trees or to begin learning how to build their own. Topics covered include: how to read a tree, assembling a dataset, multiple sequence alignment (how it works and when it does not), phylogenetic methods, bootstrap analysis and long-branch artefacts, and software and resources.

Tuimala, Jarno (2006) A primer to phylogenetic analysis using the PHYLIP package.  
(pmid: None)Source URL ] The purpose of this tutorial is to demonstrate how to use PHYLIP, a collection of phylogenetic analysis software, and some of the options that are available. This tutorial is not intended to be a course in phylogenetics, although some phylogenetic concepts will be discussed briefly. There are other books available which cover the theoretical sides of the phylogenetic analysis, but the actual data analysis work is less well covered. Here we will mostly deal with molecular sequence data analysis in the current PHYLIP version 3.66.


Software
Sequences


 

Biasini et al. (2014) SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Res 42:W252-8. (pmid: 24782522)

PubMed ] [ DOI ] Protein structure homology modelling has become a routine technique to generate 3D models for proteins when experimental structures are not available. Fully automated servers such as SWISS-MODEL with user-friendly web interfaces generate reliable models without the need for complex software packages or downloading large databases. Here, we describe the latest version of the SWISS-MODEL expert system for protein structure modelling. The SWISS-MODEL template library provides annotation of quaternary structure and essential ligands and co-factors to allow for building of complete structural models, including their oligomeric structure. The improved SWISS-MODEL pipeline makes extensive use of model quality estimation for selection of the most suitable templates and provides estimates of the expected accuracy of the resulting models. The accuracy of the models generated by SWISS-MODEL is continuously evaluated by the CAMEO system. The new web site allows users to interactively search for templates, cluster them by sequence similarity, structurally compare alternative templates and select the ones to be used for model building. In cases where multiple alternative template structures are available for a protein of interest, a user-guided template selection step allows building models in different functional states. SWISS-MODEL is available at http://swissmodel.expasy.org/.

Bordoli & Schwede (2012) Automated protein structure modeling with SWISS-MODEL Workspace and the Protein Model Portal. Methods Mol Biol 857:107-36. (pmid: 22323219)

PubMed ] [ DOI ] Comparative protein structure modeling is a computational approach to build three-dimensional structural models for proteins using experimental structures of related protein family members as templates. Regular blind assessments of modeling accuracy have demonstrated that comparative protein structure modeling is currently the most reliable technique to model protein structures. Homology models are often sufficiently accurate to substitute for experimental structures in a wide variety of applications. Since the usefulness of a model for specific application is determined by its accuracy, model quality estimation is an essential component of protein structure prediction. Comparative protein modeling has become a routine approach in many areas of life science research since fully automated modeling systems allow also nonexperts to build reliable models. In this chapter, we describe practical approaches for automated protein structure modeling with SWISS-MODEL Workspace and the Protein Model Portal.

Peitsch (2002) About the use of protein models. Bioinformatics 18:934-8. (pmid: 12117790)

PubMed ] [ DOI ] Protein models can be of great assistance in functional genomics, as they provide the structural insights often necessary to understand protein function. Although comparative modelling is far from yielding perfect structures, this is still the most reliable method and the quality of the predictions is now well understood. Models can be classified according to their correctness and accuracy, which will impact their applicability and usefulness in functional genomics and a variety of situations.



 


Footnotes and references


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.



< Assignment 7 Assignment 9 >