Difference between revisions of "User:Boris/Temp/APB"

<!-- {{Template:Inactive}} -->
+
<div id="APB">
{{Template:Active}}
 
  
 +
<table width="40%"><tr><td class="l1">&nbsp;</td><td>
  
 +
===Hardware===
 +
<table width="100%">
 +
<tr class="s1"><td class="l1">High performance computing <!-- (... at the bench: GPUs, FPGAs, Clusters) --></td></tr>
 +
<tr class="s2"><td class="l1">Cloud computing</td></tr>
 +
<tr><td class="sp">&nbsp;</td></tr>
 +
</table>
  
__TOC__
+
===Systems and Tools===
&nbsp;
+
<table width="100%">
&nbsp;
 
  
<div style="padding: 5px; background: #A6AFD0;  border:solid 1px #AAAAAA; font-size:200%;font-weight:bold;">
+
<tr class="s1"><td class="l1 mw-collapsible mw-collapsed" data-expandtext="Expand subtopics" data-collapsetext="Collapse">[[Unix]]
Assignment 4 - Phylogenetic Analysis
+
<div class="mw-collapsible-content">
 +
<table width="100%"><tr class="s2"><td class="l2">[[Unix system administration]]</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">[[Unix automation]]</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">[[Program installation]]</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">[[wget]]</td></tr></table>
 
</div>
 
</div>
 +
</td></tr>
  
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+
<tr class="s2"><td class="l1">[[Network Configuration]]</td></tr>
Introduction
+
<tr class="s1"><td class="l1">[[Apache]]</td></tr>
&nbsp;
+
<tr class="s2"><td class="l1">[[MySQL]]</td></tr>
 +
<tr class="s1"><td class="l1">[[Tools for the bioinformatics lab]]</td></tr>
 +
<tr class="s2"><td class="l1">[[GBrowse|GBrowse and LDAS]]</td></tr>
 +
<tr><td class="sp">&nbsp;</td></tr>
 +
</table>
  
;Nothing in Biology makes sense except in the light of evolution.
+
===Programming===
:''Theodosius Dobzhansky''
+
<table width="100%" >
</div>
+
<tr class="s1"><td class="l1">[[IDE|IDE (Integrated Development Environment)]]</td></tr>
 
+
<tr class="s2"><td class="l1">[[Regular Expressions]]</td></tr>
... but does evolution make sense in the light of biology?
+
<tr class="s1"><td class="l1">[[Screenscraping]]</td></tr>
 
 
As we have seen in the previous assignments, the Mbp1 transcription factor has homologues in all other fungi, yet there is not always a clear one-to-one mapping between members of a family in distantly related species. It appears that various systems of APSES domain transcription factors have evolved independently. Of course this bears directly on our notion of function - what it means to say that two genes in different organisms have the "same" function. In case two organisms both have an orthologous gene for the same, distinct function, saying that the function is the same may be warranted. But what if that gene has duplicated in one species, and the two paralogues now perform different, related functions in one organism? These two are still orthologues to the gene in the other species, but now we expect functionally significant residues to have adapted to the new role of one paralogue. In order to be able to even ask such questions, we need to make the evolutionary history of gene families explicit. This is the domain of '''phylogenetic analysis'''. We can ask questions like: how many paralogues did the cenancestor of a clade possess? Which of these underwent additional duplications in the phylogenesis of the organism I am studying? Did any genes get lost? And - adding additional biological insight to the picture - did the observed duplications lead to the "invention" of new biological systems? When was that? And perhaps even: how did the species benefit from this event?
 
 
 
 
 
We will develop this kind of analysis in this assignment. In the previous assignment you have established which gene in your species is the reciprocally most closely related orthologue to yeast Mbp1 and you have identified the full complement of APSES domain genes in your assigned organism. In this assignment, we will analyse these genes' evolutionary relationship and compare it to the evolutionary relationship of all fungal APSES domains. The goal is to define families of related transcription factors and their evolutionary history.
 
 
 
A number of good tools for phylogenetic analysis exist; ''general purpose packages'' include the (free) [http://evolution.genetics.washington.edu/phylip.html PHYLIP] package and the (commercial) PAUP package. ''Specialized tools'' for tree-building include TREE-PUZZLE or MrBayes. This assignment is constructed around programs that are available in PHYLIP; however, you are welcome to use other tools that fulfil a similar purpose if you wish. In this field, researchers consider trees that have been built with ML (maximum likelihood) methods to be more reliable than trees that are built with parsimony methods, or distance methods such as NJ (Neighbor Joining). However, ML methods are also much more compute-intensive. Just like with multiple sequence alignments, some algorithms will come closer to guessing the truth than others, and usually it is hard to tell which is the more trustworthy of two diverging results. The prudent researcher tries out alternatives and forms her own opinion. Specifically, we may usually assume results that converge, independent of the algorithm, to be more reliable than those that depend strongly on a particular algorithm or details of input data.
 
 
 
But regarding algorithms and resources: we will take two shortcuts in this assignment (and both shortcuts are things you should not do ''in real life''):
 
 
 
'''One''': we will use an '''efficient''' tree-building algorithm, not the best-available one. This is an algorithm which is available through an online Webserver, without the need for you to install software on your own machine. In ''real life'' you would of course use the most accurate algorithm you can get, regardless of the resources this requires, since it makes no sense to waste your time on a careful analysis of inaccurate trees. Your supervisor would want it so as well. And if not she, the reviewers of your manuscript. <small>(However, the simpler algorithm we use here appears to give quite plausible results for the situation we are studying.)</small>
 
 
 
'''Two''': we will assume the tree the algorithm constructs is ''correct''. In ''real life'' you would establish its reliability with a bootstrap procedure: repeat the tree-building a hundred times with resampled data (alignment columns drawn at random, with replacement) and see which branches and groupings are robust and which depend on the details of the data. However, we should acknowledge that bifurcations that are very close to each other have not been "resolved". Any conscientious reviewer would flag such leniency and send your results back to you for a bootstrapping exercise at the computer. In phylogenetic analysis, not all lines a program draws are equally trustworthy. Don't take the trees as a given fact just because a program suggests this. Look at the evidence, include independent information where available, use your reasoning, and analyse the results critically.
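
The resampling step of a bootstrap can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure of any particular program: the toy alignment, the function name <code>bootstrap_columns</code> and the fixed random seed are assumptions made for this example. Each pseudo-replicate has as many columns as the original alignment, but the columns are drawn at random with replacement; a tree would then be built from every replicate.

```python
import random

def bootstrap_columns(rows, rng):
    """Return one pseudo-replicate: same width, columns drawn with replacement."""
    n_cols = len(rows[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[i] for i in picks) for row in rows]

# Hypothetical toy alignment, not real APSES data.
alignment = ["MKVLA", "MKILA", "MRVLS"]
rng = random.Random(42)  # fixed seed so that a run can be repeated
replicates = [bootstrap_columns(alignment, rng) for _ in range(100)]
```

Note that whole columns are resampled, so the residues within a column always stay together.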
 
 
 
In case you want to review the concepts of trees, clades, LCAs, OTUs and the like, I have linked an excellent and very understandable introduction-level [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Baldauf_2003_PhylogenyTutorial.pdf article on phylogenetic analysis (pdf)] here and in the resource section at the bottom of this page.
 
 
 
&nbsp;
 
 
 
{{Template:Preparation|
 
care=Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which important aspects have simply overlooked marks unnecessarily. If you did not notice that the above did not make sense, you are reading what you expect, not what is written.|
 
num=4|
 
ord=fourth|
 
due = Monday, November 17 at 10:00 in the morning}}
 
 
 
;Your documentation for the procedures you follow in this assignment will be worth 1 mark.
 
 
 
&nbsp;
 
&nbsp;
 
 
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
==(1) Preparations==
 
</div>
 
&nbsp;
 
&nbsp;
 
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+
<tr class="s2"><td class="l1 mw-collapsible mw-collapsed" data-expandtext="Expand subtopics" data-collapsetext="Collapse">[[Perl]]
===(1.1) Preparing Input Files===
+
<div class="mw-collapsible-content">
 +
<table width="100%"><tr class="s1"><td class="l2">[[Perl basic programming]]</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">[[Perl hash example]]</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">[[Perl LWP example]]</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">[[Perl MySQL introduction]]</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">[[Perl OBO parser]]</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">[[Perl basic programming]]</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">[[Perl programming exercises 1]]</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">[[Perl programming exercises 2]]</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">[[Perl programming Data Structures]]</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">[[Perl references]]</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">[[Perl simulation]]</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">[[Perl: Object oriented programming]]</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">[[Perl: Ugly programming]]</td></tr></table>
 
</div>
 
</div>
&nbsp;<br>
+
</td></tr>
 
 
=====Introduction: Task=====
 
For this assignment, we start from the multiple sequence alignments we have constructed previously. We will edit the alignment to make it suitable for phylogenetic analysis. We will construct a phylogenetic tree and we will analyse the tree.
 
 
 
The phylogenetic tree we will construct will represent all APSES domains of the species we have analyzed. In order to '''interpret''' such a tree it is crucial to have some sense of what these domains are, i.e. to cluster them according to their orthologues. Only then can we analyse the tree by asking which subclades mirror the accepted phylogeny of fungi and which ones differ. In the third assignment, we have assigned orthology from reciprocal best match analysis. Based on this information, I have revised the gene names in the [[APSES_domains_MUSCLE_revised|'''MUSCLE alignment of all APSES domains''']]. When we calculate a phylogenetic tree with these sequences, we should expect orthologues to cluster into the same subclade. Of course, not all fungi have the same number of APSES domain homologues, but from the data we have compiled it should be possible to define their evolutionary history with reference to the other species.
 
 
 
=====Introduction: Principle=====
 
In order to use molecular sequences for the construction of phylogenetic trees, you have to build a multiple alignment first, then edit it. This is important: all rows of sequences have to contain the exact same number of characters and to hold '''aligned characters in corresponding positions'''. Phylogeny programs are not meant to revise an alignment but to analyse evolutionary relationships, given the alignment. Their inferences are made on a column-wise basis and if your columns contain data from unrelated positions, the inferences are going to be questionable.
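
A quick check of the "same number of characters" requirement could look like the following Python sketch. The function name <code>check_alignment</code> is an assumption made for this example; it only verifies row lengths and cannot tell whether the columns are actually homologous.

```python
def check_alignment(rows):
    """Return the common row length, or raise ValueError if rows differ."""
    lengths = {len(row) for row in rows}
    if len(lengths) != 1:
        raise ValueError(f"rows differ in length: {sorted(lengths)}")
    return lengths.pop()
```

For example, <code>check_alignment(["AC-T", "ACGT"])</code> returns 4, while rows of unequal length raise an error before they ever reach a phylogeny program.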
 
 
 
The result of the tree construction is a decision about the most likely evolutionary relationships. Fundamentally, tree-construction programs decide which sequences had common ancestors.
 
 
 
'''Distance based''' phylogeny programs start by using sequence comparisons to estimate evolutionary distances:
 
* they apply a model of evolution such as a mutation data matrix, to calculate a score for each '''pair''' of sequences,
 
* this score is stored in a "distance matrix" ...
 
* ... and used to estimate a tree that groups sequences with close relationships together (e.g. by using an NJ, Neighbor Joining, algorithm).
 
They are fast, can work on large numbers of sequences, but are less accurate if genes evolve at different rates.
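
The first two steps above can be sketched in Python. As a stand-in for a real model of evolution, this example uses the plain fraction of differing positions (the so-called p-distance), which is an assumption made to keep the sketch short; a program such as '''protdist''' applies a proper substitution model instead. The function names are hypothetical.

```python
def p_distance(a, b):
    """Fraction of ungapped, aligned columns in which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(rows):
    """Symmetric matrix holding one score for each pair of sequences."""
    n = len(rows)
    return [[p_distance(rows[i], rows[j]) for j in range(n)] for i in range(n)]
```

Whatever the model, the outcome is the same kind of object: a single number per sequence pair, which is all that the tree-building step gets to see.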
 
 
 
'''Parsimony based''' phylogeny programs build a tree that minimizes the number of mutation events that are required to get from a common ancestral sequence to all observed sequences. They take all columns into account, not just a single number per sequence pair, as the Distance Methods do. For closely related sequences they work very well, but they construct inaccurate trees when they can't make good estimates for the required number of sequence changes.
 
 
 
'''ML''', or '''Maximum Likelihood''' methods attempt to find the tree for which the observed sequences would be the most likely under a particular evolutionary model. They are based on a rigorous statistical framework and yield the most robust results. But they are also VERY compute intensive and a tree of the size that we are building in this assignment is already almost beyond the resources of common workstations (runs about a day on my computer). However, one may split a large problem into smaller, obvious subtrees (e.g. analysing orthologues as a group, only including a few paralogues for comparison) and then merge the smaller trees; this way even very large problems can become tractable. ML methods also suffer less from "long-branch attraction" - the phenomenon that weakly similar sequences can be grouped inappropriately close together in a tree due to spurious shared differences.
 
 
 
Clearly, in order for tree-estimation to work, one must not include fragments of sequence which have evolved under a different evolutionary model than all others, e.g. after domain fusion, or after accommodating large stretches of indels. Thus it is appropriate to edit the sequences and pare them down to a ''most characteristic subset'' of amino acids. The goal is not to be as comprehensive as possible, but to input those columns of aligned residues that will best represent the ''true'' phylogenetic relationships between the sequences.
 
 
 
=====Introduction: Problems=====
 
Gaps are a real problem here, as usual. Strictly speaking, the similarity score of an '''alignment''' program as well as the distance score of a '''phylogeny''' program are not calculated for an ordered ''sequence'', but for a ''sum of independent values'', one for each aligned column of characters. The order of the columns does not change the score. However, in an optimal sequence alignment with gaps, this is no longer strictly true, since a one-character gap creation has a different penalty score than a one-character gap extension! Most '''alignment''' programs use a model with a constant gap insertion penalty and a linear gap extension penalty. This is not rigorously justified from biology, but parametrized (or you could say "tweaked") to correspond to our observations. However, most '''phylogeny''' programs (such as the programs in PHYLIP) do not work in this way. PHYLIP strictly operates on columns of characters and treats a gap character just like a residue with the one-letter code "-". Thus gap insertion and gap extension characters get the same score. For short indels, this '''underestimates''' the distance between pairs of sequences, since any evolutionary model should reflect the fact that gaps are much less likely than point mutations. If the gap is very long though, all events are counted individually as many single substitutions (rather than one lengthy one) and this '''overestimates''' the distance. And it gets worse: long stretches of gaps can make sequences appear similar in a way that is not justified, just because they are identical in the "-" character. It is therefore common and acceptable to edit gaps in the alignment down to one or two characters, or to remove them.
 
 
 
=====Introduction: Practice=====
 
In practice, follow the fundamental principle that '''all characters in a column should be related by homology'''. This implies the following rules of thumb:
 
 
 
:*Remove all stretches of residues in which the ''alignment'' appears ambiguous (not just highly variable, but ambiguous regarding the aligned positions).
 
:*Remove all frayed N- and C- termini, especially regions in which not all sequences that are being compared appear homologous and that may stem from unrelated domains.
 
:*Remove all but approximately one column from gapped regions, and all residues N- and C- terminal of the gap in which the alignment appears questionable. (I would keep one gapped column as a placeholder for a rare and very distinct evolutionary event, rather than simply deleting them all; some researchers remove all gaps.)
 
:*Also, consider that neither residues that are completely different between all species, nor residues that are completely conserved are informative for relationship distances.
 
:*If your sequences are too long, you may run out of memory. 60-80 aligned residues should be plenty and if the sequences fit on a single line you will save yourself potential trouble with block-wise vs. interleaved input.
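
If you prefer a script to manual editing, the gap-related rule of thumb above could be sketched in Python as follows. The toy alignment and the function names are assumptions made for this example, and the sketch handles only the simple case of a single gapped region. Deciding which columns are ambiguous or frayed still requires looking at the alignment yourself.

```python
def gap_fraction(rows):
    """Per-column fraction of rows that carry a gap character."""
    return [col.count("-") / len(col) for col in zip(*rows)]

def keep_columns(rows, keep):
    """Reduce every row to the 0-based column indices listed in keep."""
    return ["".join(row[i] for i in keep) for row in rows]

# Hypothetical toy alignment with one gapped region (columns 2 and 3).
rows = ["MK--VLA", "MKT-ILA", "MR--VLS"]
fractions = gap_fraction(rows)
ungapped = [i for i, f in enumerate(fractions) if f == 0.0]
gapped = [i for i, f in enumerate(fractions) if f > 0.0]
# Keep all ungapped columns plus one gapped column as a placeholder.
keep = sorted(ungapped + gapped[:1])
trimmed = keep_columns(rows, keep)
```

Recording the list <code>keep</code> also documents exactly which columns went into the input file.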
 
 
 
:<small>(A '''very''' useful trick with Microsoft Word is that you can select blocks of text and entire columns in the document with your mouse: hold the "ALT" key depressed while you click and drag your mouse to select. This will greatly facilitate the preparation of sequences. You can treat that selection as any other selected text: color or highlight characters, or delete them. Importantly, you can also cut and paste entire columns! Of course, this will only work as expected if you use a fixed-width font such as Courier or "Courier New". )</small>
 
 
 
The preparation of the input file of aligned residues used by the PHYLIP package is straightforward in principle; just carefully follow the instructions in PHYLIP's well-written documentation. If you plan to use an outgroup for your tree, it is a good idea to move that to the first line of your alignment, since this is where PHYLIP will look for it by default.
 
 
 
Some notes on how to avoid common editing troubles. Copy the sequences from the pages linked from the ''Resources'' section below. Paste them into a document, using the Word "Edit &rarr; Paste special &rarr; Unformatted text". Set the page-setup to "landscape", the font-size to something small, then you can put every sequence into one line. Take special note that your files must not include tab characters! (Tabs are counted as one single character by the phylogeny programs.) You can use Word to globally replace all tabs (specified as "^t") with a blank, to make sure. Spaces count, so display your alignment in a fixed-width font, such as Courier (or "Courier New"), not a proportional-width font such as Times, Arial, or Helvetica, and ensure all columns in your alignments align as they should. As always, make sure you save your input files as "Text Only".
 
 
 
<small>
 
:A note if you are working on a '''Mac''' and saving input on disk, to run with a locally installed PHYLIP version: here MS Word will play one of its usual [http://en.wikipedia.org/wiki/Shenanigan shenanigans] on you since it writes text files with the old-style OS 9 Carriage Return characters <code>(\r; ASCII 13; hex 0D; CR)</code>. Just by looking at the file, this is quite invisible, but such "Carriage returns" are not going to be recognized by PHYLIP and most other UNIX based programs. It may not make a difference when you paste your sequences to a Web server, but if you compute things locally it will appear to the program as though all the input were passed in one single, very long line. And this can (and did) lead to head-banging rounds of frustration. You need to replace them with '''Linefeed''' (also called '''Newline''') characters <code>(\n; ASCII 10; hex 0A; LF)</code> and you can't even do that within Word(!). Open a UNIX terminal window and navigate to the directory where your files reside. Then type:
 
 
 
:'''tr "\r" "\n" &lt; infile    &gt; outfile'''
 
 
 
:... where outfile is different from infile (careful: if a file by the name of outfile already exists, '''tr''' will cheerfully overwrite it.) Alternatively you could type the following perl one-line program :
 
 
 
:'''perl -e 'while(&lt;&gt;){tr/\r/\n/;print}'  &lt; infile    &gt; outfile'''
 
</small>
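
The same clean-up can be sketched in Python, together with the replacement of tab characters mentioned earlier. The function names are assumptions made for this example. Unlike the <code>tr</code> command, this sketch turns a Windows-style CR+LF pair into a single Linefeed rather than two.

```python
def clean_text(raw):
    """Convert CR and CRLF line endings to LF and replace tabs with one blank."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    return text.replace("\t", " ")

def clean_file(infile, outfile):
    """Write a cleaned copy of infile to outfile (which is overwritten)."""
    # newline="" on reading keeps the original line endings visible to us.
    with open(infile, newline="") as src:
        raw = src.read()
    # newline="\n" on writing stops Python from translating "\n" again.
    with open(outfile, "w", newline="\n") as dst:
        dst.write(clean_text(raw))
```

As with the commands above, choose an output file name that is different from the input file name.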
 
 
 
 
 
In your assignment submission, clearly highlight or otherwise color the columns that you have selected, annotate why you have selected them and paste your resulting input file as well. Here is an example of what this might look like:
 
  
 +
<tr class="s1"><td class="l1">[[BioPerl]]</td></tr>
 +
<tr class="s2"><td class="l1">[[PHP]]</td></tr>
 +
<tr class="s1"><td class="l1">[[Data modelling]]</td></tr>
 +
<tr class="s2"><td class="l1">BioPython <!-- (scope, highlights, installation, use, support) --></td></tr>
 +
<tr class="s1"><td class="l1">Graphical output <!-- (PNG and SVG) --></td></tr>
 +
<tr class="s2"><td class="l1">[[Autonomous agents]]</td></tr>
 +
</table>
  
[[Image:EditingGuide.jpg|frame|none|(Possible) steps in editing a multiple sequence alignment towards a PHYLIP input file. '''a''': raw alignment (CLUSTAL format); '''b''': sequences assembled into single lines; '''c''': columns to be deleted highlighted in red - 1, 3 and 4: large gaps; 2: uncertain alignment and 5: frayed C-terminus: both would put non-homologous characters into the same column; '''d''': input data for PHYLIP: names for sequences must not be longer than 10 characters, the first line must contain the number of sequences and the sequence length. PHYLIP is very picky about incorrectly formatted input, read the [http://evolution.genetics.washington.edu/phylip/doc/sequence.html PHYLIP sequence format guide].]]
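
The layout of panel '''d''' can also be produced by a short Python sketch, assuming the simple case in which every sequence fits on one line (sequential format). The function name <code>to_phylip</code> and the sequence names in the usage note are hypothetical; the PHYLIP sequence format guide remains the authority on details.

```python
def to_phylip(records):
    """Format (name, sequence) pairs as PHYLIP input text, one line per sequence."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    # First line: number of sequences and number of aligned characters.
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, seq in records:
        if len(name) > 10:
            raise ValueError(f"name longer than 10 characters: {name}")
        # The name field is padded with blanks to exactly 10 characters.
        lines.append(name.ljust(10) + seq)
    return "\n".join(lines) + "\n"
```

For instance, <code>to_phylip([("Mbp1", "MKVLA"), ("Swi4", "MKILA")])</code> yields a header line with the numbers 2 and 5, followed by two lines in which the sequence starts at the eleventh character.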
+
===Algorithms===
 +
<table width="100%" >
 +
<tr class="sh"><td class="l1">Algorithms on Sequences</td></tr>
 +
<tr class="s1"><td class="l2">[[Dynamic Programming]]</td></tr>
 +
<tr class="s2"><td class="l2">[[Multiple Sequence Alignment]]</td></tr>
 +
<tr class="s1"><td class="l2">[[Genome Assembly]]</td></tr>
  
=====Introduction: Web Service and data=====
+
<tr><td class="sp">&nbsp;</td></tr>
  
You have two choices for completing the assignment: either to use one of the [http://evolution.gs.washington.edu/phylip/phylipweb.html PHYLIP on-line servers] that generously provide public computing resources, or to download and install the [http://evolution.genetics.washington.edu/phylip.html PHYLIP program package] on your own computer at home. If you choose the former, one of your options is the [http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html '''PHYLIP''' service at the '''Institut Pasteur'''] in France.
+
<tr class="sh"><td class="l1">Algorithms on Structures</td></tr>
 +
<tr class="s1"><td class="l2">[[Docking]]</td></tr>
 +
<tr class="s2"><td class="l2">Protein Structure Prediction <!-- ''ab initio'' --></td></tr>
  
<small>I have tried the Pasteur service many times, and it works - however not always entirely without problems. Uninformative errors may occur when your input is too large for the system's memory (like: "sequences not aligned" ... "out of memories" and such) and once, after submitting a number of jobs, the system locked me out to wait until results would be received by e-mail (which then did not happen). Regrettably, this is not documented. However, the integration of their services in a logical sequence of steps is very convenient and some of their services use algorithms that improve on PHYLIP. If you decide to install PHYLIP instead, good for you. That is easy to do, well documented, and there are far fewer limitations on memory - but if you don't read and understand the instructions carefully, you may be in for a spell of frustration.</small>
+
<tr><td class="sp">&nbsp;</td></tr>
  
Either way, I have posted typical input files and result files on the [[Assignment_5_fallback_data|fallback data page]], to allow you to bail out in case technical problems become overwhelming. If you use the data posted here instead of your own, you '''must''' document that fact and explain what you have tried, and why that has failed. The posted data is a fallback, not a shortcut.
+
<tr class="sh"><td class="l1">Algorithms on Trees</td></tr>
 +
<tr class="s1"><td class="l2">Computing with trees <!-- Bayesian approaches for phylogenetic trees, tree comparison) --></td></tr>
  
For this assignment, we will use a simple distance-based tree construction method, specifically UPGMA, which PHYLIP offers as an option of its '''neighbor''' program alongside the neighbor joining algorithm. This represents a reasonable compromise between accuracy and speed, especially when applied to moderately dissimilar sequences. In general, distance methods include '''two''' steps: (1) calculate a pairwise-distance matrix between sequences, (2) construct a tree, based on the matrix. Thus all the information in the alignment between a pair of sequences is collapsed into a single number: their pairwise distance. Alternative approaches, parsimony as well as ML based algorithms, take individual columns into account.
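
To make step (2) concrete, here is a minimal Python sketch of UPGMA clustering. It is an illustration under simplifying assumptions, not PHYLIP's implementation: the function name <code>upgma</code> is hypothetical, sequence names must be unique, ties are broken arbitrarily, and only the tree topology is reported (no branch lengths). UPGMA repeatedly merges the two closest clusters and replaces their distances to every other cluster by a size-weighted average.

```python
def upgma(names, dist):
    """Cluster by UPGMA; return the topology as a Newick string."""
    # Cluster label -> number of original sequences it contains.
    sizes = {name: 1 for name in names}
    # Distances keyed by the (unordered) pair of cluster labels.
    d = {frozenset((a, b)): dist[i][j]
         for i, a in enumerate(names)
         for j, b in enumerate(names) if i < j}
    while len(sizes) > 1:
        pair = min(d, key=d.get)  # the closest pair of clusters
        a, b = sorted(pair)
        merged = f"({a},{b})"
        na, nb = sizes.pop(a), sizes.pop(b)
        for c in sizes:  # size-weighted average distance to the others
            dac = d.pop(frozenset((a, c)))
            dbc = d.pop(frozenset((b, c)))
            d[frozenset((merged, c))] = (na * dac + nb * dbc) / (na + nb)
        del d[pair]
        sizes[merged] = na + nb
    return next(iter(sizes)) + ";"
```

With three sequences A, B and C and the distance matrix <code>[[0, 2, 6], [2, 0, 6], [6, 6, 0]]</code>, the closest pair A and B is merged first and the result is <code>((A,B),C);</code>.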
+
<tr><td class="sp">&nbsp;</td></tr>
  
&nbsp;<br>
+
<tr class="sh"><td class="l1">Algorithms on Networks</td></tr>
<div style="padding: 5px; background: #DDDDEE;">
+
<tr class="s1"><td class="l2">Network metrics <!-- (Degree distributions, Centrality metrics, other metrics on topology, small-world- vs. random-geometric controversy) --></td></tr>
Prepare an input file that is representative of the APSES domains.
+
<tr class="s2"><td class="l3">[[Dijkstras Algorithm]]</td></tr>
 +
<tr class="s1"><td class="l3">[[Floyd Warshall Algorithm]]</td></tr>
 +
</table>
  
*Access the [[APSES_domains_MUSCLE_revised|revised MSA for all APSES domains]], linked here (and from the resources section at the bottom of the page). Prepare a PHYLIP formatted input file from this MSA, restricting the number of sequence characters to no more than 70. Read the [http://evolution.genetics.washington.edu/phylip/doc/main.html#inputfiles PHYLIP format documentation] and follow the considerations discussed above. ([[Assignment_5_fallback_data|See the fallback data in case you get stuck]], but you '''must''' prepare (and document) an input file according to the instructions, even if you end up using the fallback data for whatever reason.) Do not forget to document how you have prepared your input file: define where your source-sequences came from, define which columns you have deleted by highlighting the deleted residues in one sequence, and include your input file in the assignment.
 
</div>
 
  
&nbsp;<br>
+
===Communication and collaboration===
&nbsp;<br>
+
<table width="100%" >
 +
<tr class="s1"><td class="l1">[[MediaWiki]]</td></tr>
 +
<tr class="s2"><td class="l1">[[HTML essentials]]</td></tr>
 +
<tr class="s1"><td class="l1">[[HTML 5]]</td></tr>
 +
<tr class="s2"><td class="l1">[[SADI|SADI Semantic Automated Discovery and Integration]]</td></tr>
 +
<tr class="s1"><td class="l1">[[CGI]]</td></tr>
 +
<tr><td class="sp">&nbsp;</td></tr>
 +
</table>
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+
===Statistics===
 +
<table width="100%" >
 +
<tr class="s1"><td class="l1">[[Pattern discovery]]</td></tr>
 +
<tr class="s2"><td class="l1">Correlation <!-- (Covariance matrices and their interpretation, application to large problems, collaborative filtering, MIC and MINE) --></td></tr>
 +
<tr class="s1"><td class="l1">Clustering methods <!-- (Algorithms and choice (including: hierarchical, model-based and partition clustering, graphical methods (MCL), flow based methods (RRW) and spectral methods). Implementation in R if possible) --></td></tr>
 +
<tr class="s2"><td class="l1">Cluster metrics <!-- (Cluster quality metrics (Akaike, BIC)–when and how) --></td></tr>
 +
<tr class="s1"><td class="l1">[[Map equation|The Map Equation]] </td></tr>
 +
<tr class="s2"><td class="l1">Machine learning <!-- (Classification problems: Neural Networks, HMMs, SVM..) --></td></tr>
  
===(1.2) Calculating a Tree===
+
<tr class="s1"><td class="l1 mw-collapsible mw-collapsed" data-expandtext="Expand subtopics" data-collapsetext="Collapse">[[R]]
 +
<div class="mw-collapsible-content">
 +
<table width="100%"><tr class="s2"><td class="l2">R plotting</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">[[R programming]]</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">R EDA</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">R regression</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">R PCA</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">R Clustering</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">R Classification <!-- Phrasing inquiry as a classification problem, dealing with noisy data, machine learning approaches to classification, implementation in R) --></td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">R hypothesis testing</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">[[Bioconductor]]</td></tr></table>
 
</div>
 
</div>
 +
</td></tr>
  
&nbsp;<br>
+
<tr><td class="sp">&nbsp;</td></tr>
&nbsp;<br>
+
</table>
<div style="padding: 5px; background: #DDDDEE;">
 
  
*Using the '''protdist''' program of PHYLIP, calculate a distance matrix for the input file you have prepared. ([[Assignment_5_fallback_data|See the fallback data in case you get stuck]]) (1 mark)
+
===Applications===
 
+
<table width="100%" >
*If you use the PHYLIP Webserver,  select the neighbor joining algorithm from the menu options ('''neighbor''' on the PHYLIP server) and click the button "run the selected program on outfile" ; on the next form, click the button to the "advanced neighbor form", choose the option "UPGMA" and click on the button "run neighbor". When the program is done, select the option '''drawgram''' and click '''Run the selected program on outtree'''. Choose a '''cladogram''' tree-style and a suitable output format (e.g. postscript). Paste the trees into your assignment.
+
<tr class="s1"><td class="l1">[[Data integration]] <!-- Add BioMart: Biodata integration, and data-mining of complex, related, descriptive data --></td></tr>
 
+
<tr class="s2"><td class="l1">Text mining <!-- (Use cases, tasks and metrics, taggers, vocabulary mapping, Practicals: R-support, Python/Perl support, others...) --></td></tr>
*If you use a locally installed version of PHYLIP use '''neighbor''' with the UPGMA method to construct a tree for the input file. Open the file '''outfile''' in a text-editor, copy and paste the trees into your assignment.
+
<tr class="s1"><td class="l1">[[HMMER]]</td></tr>
 
+
<tr class="s2"><td class="l1">High-throughput sequencing</td></tr>
In both cases, the process is: <code>protdist</code> &rarr; <code>neighbor</code> &rarr; <code>drawgram</code>
+
<tr class="s1"><td class="l1">Functional annotation <!-- GFF --></td></tr>
 +
<tr class="s2"><td class="l1">Microarray analysis <!-- (... in R: differential expression and multiple testing; Loading and normalizing data, calculating differential expression, LOWESS, the question of significance, FWERs: Bonferroni and FDR; SAM and LIMMA) --></td></tr>
 +
<tr><td class="sp">&nbsp;</td></tr>
 +
</table>
 +
</td></tr></table>
  
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
 
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
 
==(2) Analysis (3 marks)==
 
</div>
 
 
I have constructed a cladogram for the species we are analysing, based on data published for 1551 fungal ribosomal sequences. Such reference trees from rRNA data are a standard method of phylogenetic analysis, supported by the assumption that rRNA sequences are monophyletic and have evolved under comparable selective pressure in all species.
 
 
[[Image:FungiCladogram.jpg|frame|none|Cladogram of fungi studied in the assignments. This cladogram is based on small subunit ribosomal rRNA sequences, and largely follows ''Tehler et al.'' (2003) ''Mycol Res.'' '''107''':901-916. Even though many details of fungal phylogeny remain unresolved, the branches shown here individually appear to have strong support. In a cladogram such as this, the branch lengths are not drawn to any scale of similarity. I have labeled all speciation events so you can refer to these labels in your assignment.]]
 
 
In order to study the evolutionary history of the entire gene family you can use the tree you have computed or access the [[APSES_domains_reference_tree|'''APSES domains reference tree''']] here.
 
 
This is a complicated tree, and it can look impenetrably confusing at first. Here are two principles that will help you make sense of the tree.
 
 
A: '''A gene that is present in an ancestral species is inherited by all descendant species.''' The gene has to be observed in all OTUs, unless it has been lost (which is a rare event). This means that if a gene is present in two widely divergent species, but in none of the other descendants of their LCA, either there is some problem with the tree (long branch attraction, perhaps), or the sequence has been acquired through horizontal gene transfer.
 
 
B: '''Paralogous genes in an ancestral species should give rise to monophyletic subtrees for each of the genes, in all descendants'''; this means: if the LCA of a branch has e.g. three genes, we would expect three copies of the species cladogram below this branchpoint, one for each of these genes. Each of these subtrees should recapitulate the reference phylogenetic tree of the OTUs, up to the branchpoint of their LCA.
 
 
With these two simple principles (you should draw them out on a piece of paper if they do not seem obvious to you), you can probably pry the [[APSES_domains_reference_tree|reference tree of all APSES domains]] apart quite nicely. A few colored pencils and a printout of the tree will help.
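
Principle B can also be checked mechanically once a tree is written down. The following is only a minimal sketch, not part of the assignment: it assumes a rooted tree stored as nested Python tuples, and the leaf labels (<code>geneA_sp1</code> etc.) and function names are hypothetical placeholders, not names from the reference tree. A group of genes counts as monophyletic here if some subtree contains exactly those leaves and no others.

```python
def leaves(tree):
    """Return the set of leaf labels below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some subtree of the rooted tree contains exactly the leaves in group."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Toy gene tree (hypothetical labels): two paralogues, geneA and geneB,
# each present in three species sp1..sp3.
gene_tree = (
    (("geneA_sp1", "geneA_sp2"), "geneA_sp3"),
    (("geneB_sp1", "geneB_sp2"), "geneB_sp3"),
)

gene_a = {"geneA_sp1", "geneA_sp2", "geneA_sp3"}
mixed = {"geneA_sp1", "geneB_sp1"}

print(is_monophyletic(gene_tree, gene_a))  # True: all geneA copies form one subtree
print(is_monophyletic(gene_tree, mixed))   # False: the two paralogues do not group by species
```

In the toy tree each paralogue forms its own subtree, and each subtree repeats the same species branching order, which is the pattern principle B leads us to expect.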
 
 
 
&nbsp;
 
&nbsp;
 
 
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
 
===(2.1) The Cenancestor's APSES Domains (1 mark)===
 
</div>
 
 
Refer to your tree or the reference tree for the following two tasks. Be specific, to support your arguments, i.e. use specific branchpoints (by numbers or letters) and OTU or gene names in your arguments (see the example below).
 
 
&nbsp;<br>
 
&nbsp;<br>
 
<div style="padding: 5px; background: #FFCC99;">
 
;Analysis (1 mark)
 
 
Discuss briefly how many APSES domain proteins the fungal cenancestor appears to have possessed and what evidence you see in the tree that this is so.
 
</div>
 
&nbsp;<br>
 
&nbsp;
 
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
 
===(2.2) Unraveling your organism's APSES domains (2 marks)===
 
</div>
 
 
&nbsp;<br>
 
&nbsp;<br>
 
<div style="padding: 5px; background: #FFCC99;">
 
;Analysis (2 marks)
 
 
Assume that the phylogenetic tree for fungi is correct, and that the mixed gene tree is fundamentally correct in its overall arrangement but may have local inaccuracies due to the limited resolution of the method. You have identified the APSES domain genes of the fungal cenancestor above. Apply the expectations we have stated above to discuss briefly through what sequence of duplications and/or gene loss your organism has ended up with the APSES domains it possesses today. Make specific reference to the species tree and either your constructed tree or the [[APSES_domains_reference_tree|reference tree]]. (2 marks)
 
</div>
 
&nbsp;<br>
 
&nbsp;
 
 
For example, the following discussion for ''Saccharomyces cerevisiae'' would be sufficient for full marks:
 
:(Numbers refer to branchpoints of the mixed gene tree, letters to branchpoints of the species tree.) There are four subclades that are shared by most current species; they branch from 129, 108, 76 and (94 + 102). For the latter case, the branching order appears not to be well resolved, but by comparison with the species tree, we can argue that branch 102 corresponds to branch (H) and should be inserted between branchpoints 94 (corresponding to (A)) and 96 (B), not after branch 74. This is because the species under 95 and 102 share a common ancestor (B) that is distinct from 95. ''Saccharomyces cerevisiae'' has one gene in each of these major subclades; there is no gene loss. (Note, however, that there is no Dikaryomycota (2) orthologue of a Sok2 gene.) ''Saccharomyces cerevisiae'' has an additional paralogue to Sok2 that created the Phd1 gene. This is shared with ''Candida albicans''. There are three possibilities to explain this: (''i'') the gene could have been duplicated before (H) and then lost in separate, independent events after I, J, K, M and N in those species that do not possess an orthologue; (''ii'') the gene could have arisen after (N) or after (K) and then passed by horizontal gene transfer from or to ''S. cerevisiae''; or (''iii'') the annotations of orthologues could be incorrect and some of the genes labelled SokA (Sok2 paralogues) could in fact be Phd1 orthologues. If this were the case, it would require a reassessment of how much gene loss would be necessary to explain the subclade below 108.
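
An argument like hypothesis (''i'') above depends on counting how many independent losses a given presence/absence pattern requires. The sketch below illustrates that count under stated assumptions: the gene existed at the root of the clade considered, it can be lost but never regained, and the species tree is given as nested tuples. The species names <code>sp1</code> to <code>sp5</code> and the function name are hypothetical placeholders and do not correspond to the labelled branchpoints of the cladogram.

```python
def min_losses(tree, present):
    """Count loss events needed to explain which species carry the gene.

    Assumes the gene was present at the root and is never regained.
    Returns (losses, found), where found says whether any species
    below this node still carries the gene.
    """
    if isinstance(tree, str):
        return (0, tree in present)
    results = [min_losses(child, present) for child in tree]
    if not any(found for _, found in results):
        # The whole clade lacks the gene: one loss, charged at the parent node.
        return (0, False)
    losses = sum(n for n, found in results if found)       # losses inside clades that kept the gene
    losses += sum(1 for _, found in results if not found)  # one loss per sister clade without the gene
    return (losses, True)


# Toy species tree with hypothetical names.
species_tree = ((("sp1", "sp2"), "sp3"), ("sp4", "sp5"))

# Gene seen only in two distant species: three independent losses are needed.
print(min_losses(species_tree, {"sp1", "sp4"})[0])          # 3

# Gene seen in one whole clade: a single loss on the other branch suffices.
print(min_losses(species_tree, {"sp1", "sp2", "sp3"})[0])   # 1
```

A pattern that needs many independent losses is not impossible, but since gene loss is assumed to be rare, it makes the alternative explanations (horizontal transfer, or mislabelled orthologues) worth discussing.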
 
 
&nbsp;
 
&nbsp;
 
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
 
==(3) Summary of Resources==
 
</div>
 
&nbsp;<br>
 
 
;Links
 
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Baldauf_2003_PhylogenyTutorial.pdf '''Review (PDF, restricted)''' Sandra Baldauf: Phylogeny for the Faint of Heart]
 
:* [http://evolution.genetics.washington.edu/phylip.html '''PHYLIP''' home page]
 
:* [http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html '''PHYLIP''' Web Service at the Institut Pasteur]
 
:*[[Assignment_5_fallback_data|'''Fallback data''']]
 
 
;APSES domain alignment
 
:* [[APSES_domains_MUSCLE_revised|All '''APSES domains - MUSCLE aligned''' and sequence names revised]]
 
 
;Tree
 
:*[[APSES_domains_reference_tree|'''APSES domains reference tree''']]
 
 
&nbsp;
 
&nbsp;
 
 
<div style="padding: 5px; background: #D3D8E8;  border:solid 1px #AAAAAA;">
 
[End of assignment]
 
</div>
 
 
If you have any questions at all, don't hesitate to mail me at [mailto:boris.steipe@utoronto.ca boris.steipe@utoronto.ca] or post your question to the [mailto:bch441_2011@googlegroups.com Course Mailing List].
 

Latest revision as of 12:44, 27 September 2015

 

Hardware

High performance computing
Cloud computing
 

Systems and Tools

Unix
Network Configuration
Apache
MySQL
Tools for the bioinformatics lab
GBrowse and LDAS
 

Programming

IDE (Integrated Development Environment)
Regular Expressions
Screenscraping
Perl
BioPerl
PHP
Data modelling
BioPython
Graphical output
Autonomous agents

Algorithms

Algorithms on Sequences
Dynamic Programming
Multiple Sequence Alignment
Genome Assembly
 
Algorithms on Structures
Docking
Protein Structure Prediction
 
Algorithms on Trees
Computing with trees
 
Algorithms on Networks
Network metrics
Dijkstras Algorithm
Floyd Warshall Algorithm


Communication and collaboration

MediaWiki
HTML essentials
HTML 5
SADI Semantic Automated Discovery and Integration
CGI
 

Statistics

Pattern discovery
Correlation
Clustering methods
Cluster metrics
The Map Equation
Machine learning
R
R plotting
R programming
R EDA
R regression
R PCA
R Clustering
R Classification
R hypothesis testing
Bioconductor
 

Applications

Data integration
Text mining
HMMER
High-throughput sequencing
Functional annotation
Microarray analysis