Difference between revisions of "BIO Assignment Week 11"

From "A B C"
m
 
(27 intermediate revisions by the same user not shown)
Line 2: Line 2:
 
<div class="b1">
 
<div class="b1">
 
Assignment for Week 11<br />
 
Assignment for Week 11<br />
<span style="font-size: 70%">Calculating Phylogenies</span>
+
<span style="font-size: 70%">Protein-Protein Interactions</span>
 
</div>
 
</div>
 +
<table style="width:100%;"><tr>
 +
<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_10|&lt;&nbsp;Assignment&nbsp;10]]</td>
 +
<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">&nbsp;</td>
 +
</tr></table>
  
 
{{Template:Inactive}}
 
{{Template:Inactive}}
Line 14: Line 18:
  
 
&nbsp;
 
&nbsp;
 +
 
==Introduction==
 
==Introduction==
  
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+
{{task|1=
  
&nbsp;
+
* Carefully read the lecture notes for this unit <span class="PDFlink">[http://steipe.biochemistry.utoronto.ca/abc/CourseMaterials/BCH441/11-Interactions_LectureNotes.pdf Week 11: Annotated Notes <small>(PDF&nbsp;12.2&nbsp;MB)</small>]</span>.
  
;Nothing in Biology makes sense except in the light of evolution.
+
* For a useful overview of graph-theory concepts you could additionally have a look at:
:''Theodosius Dobzhansky''
+
{{#pmid: 21527005}}
</div>
 
  
... but does evolution make sense in the light of biology?
+
However, the concepts you need to know for this assignment should become clear from the notes.
 
 
As we have seen in the previous assignments, the Mbp1 transcription factor has homologues in all other fungi, yet there is not always a clear one-to-one mapping between members of a family in distantly related species. It appears that various systems of APSES domain transcription factors have evolved independently. Of course this bears directly on our notion of function - what it means to say that two genes in different organisms have the "same" function. In case two organisms both have an orthologous gene for the same, distinct function, saying that the function is the same may be warranted. But what if that gene has duplicated in one species, and the two paralogues now perform different, related functions in one organism? These two are still orthologues to the other species, but now we expect functionally significant residues to have adapted to the new role of one paralogue. In order to be able to even ask such questions, we need to make the evolutionary history of gene families explicit. This is the domain of '''phylogenetic analysis'''. We can ask questions like: how many paralogues did the cenancestor of a clade possess? Which of these underwent additional duplications in the phylogenesis of the organism I am studying? Did any genes get lost? And - adding additional biological insight to the picture - did the observed duplications lead to the "invention" of new biological systems? When was that? And perhaps even: how did the species benefit from this event?
 
 
 
We will develop this kind of analysis in this assignment. In the previous assignment you have established which gene in your species is the reciprocally most closely related orthologue to yeast Mbp1 (with ''reciprocal best match'') and you have identified the full complement of APSES domain genes in your assigned organism (as a result of your PSI-BLAST search). In this assignment, we will analyse these genes' evolutionary relationship and compare it to the evolutionary relationship of all fungal APSES domains. The goal is to define families of related transcription factors and their evolutionary history.
 
 
 
A number of excellent tools for phylogenetic analysis exist; ''general purpose packages'' include the (free) [http://evolution.genetics.washington.edu/phylip.html '''PHYLIP'''] package, the [http://www.megasoftware.net/ '''MEGA''' package] and the (commercial) PAUP* package. Of these, only MEGA is still under active development, although PHYLIP still functions perfectly (except for problems with graphical windows under Mac OS 10.6). ''Specialized tools'' for tree-building include TREE-PUZZLE or MrBayes. This assignment is constructed around programs that are available in PHYLIP; however, you are welcome to use other tools that fulfill a similar purpose if you wish. In this field, researchers consider trees that have been built with ML (maximum likelihood) methods to be more reliable than trees that are built with parsimony methods, or distance methods such as NJ (Neighbor Joining). However, ML methods are also much more compute-intensive. Just like with multiple sequence alignments, some algorithms will come closer to guessing the truth and others will not, and usually it is hard to tell which is the more trustworthy of two diverging results. The prudent researcher tries out alternatives and forms her own opinion. Specifically, we may usually assume results that converge when computed with different algorithms to be more reliable than those that depend strongly on a particular algorithm, parameters, or details of input data.
 
 
 
However: regarding algorithm and resources, we will take a shortcut in this assignment (something you should not do in real life). We will assume that the tree the algorithm constructs is correct. In "real life" you would establish its reliability with a bootstrap procedure: repeat the tree-building a hundred times with resampled data and see which branches and groupings are robust and which depend on the details of the data. <small>(If you are interested, I can mail you the procedure for running a bootstrap analysis on the tree you are computing, but this may require a day or so of computing time on your computer.)</small> In this assignment, we will simply acknowledge that bifurcations that are very close to each other have not been "resolved" and be appropriately cautious in our inferences. In phylogenetic analysis, not all lines a program draws are equally trustworthy. Don't take the trees as a given fact just because a program suggests this. Look at the evidence, include independent information where available, use your reasoning, and analyse the results critically. As you will see, there are some facts that we know for certain: we know which species the genes we have sequenced come from, and we can (usually) make good assumptions about the relationship of the species themselves - the history of speciation events that underlies all evolution of genes. This is extremely helpful information for our work.
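
The resampling step of a bootstrap can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure mentioned above: the function name <code>bootstrap_alignment</code> and the three short sequence fragments are hypothetical, and the sketch uses the standard nonparametric bootstrap, which draws alignment columns with replacement so that every replicate has the same length as the original. Each replicate would then be handed to the tree-building program.

<source lang="python">
import random


def bootstrap_alignment(seqs, rng):
    """Return one bootstrap replicate of an alignment.

    seqs maps sequence names to aligned strings of equal length.
    Columns are drawn with replacement, so some columns appear
    several times in the replicate and others not at all.
    """
    names = list(seqs)
    ncol = len(seqs[names[0]])
    picks = [rng.randrange(ncol) for _ in range(ncol)]
    return {name: "".join(seqs[name][i] for i in picks) for name in names}


# Toy input: short fragments, for illustration only.
toy = {
    "Mbp1_SACCE": "IMKRKKDDWV",
    "Swi4_SACCE": "VMRRTKDDWI",
    "Phd1_SACCE": "VVRRADNNMI",
}

rng = random.Random(1)  # fixed seed so that the run is reproducible
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
print(len(replicates), "replicates of length", len(replicates[0]["Mbp1_SACCE"]))
</source>

A grouping that appears in most of the trees built from such replicates is considered robust; one that appears in only a few depends on the details of the data.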
 
 
 
 
 
For this assignment, we start from the APSES domains you have collected previously. You will align these domains with a set of reference domains and edit the alignment to make it suitable for phylogenetic analysis, using Jalview. Then you will construct a phylogenetic tree and interpret the tree. The goal is to identify orthologues and paralogues. <!-- Optionally, you will look at structural and functional conservation of residues. -->
 
 
 
In case you want to review the concepts of trees, clades, LCAs, OTUs and the like, I have linked an excellent and very understandable introduction-level article on phylogenetic analysis here and in the resource section at the bottom of this page.
 
 
 
{{#pmid: 12801728}}
 
 
 
 
 
==Preparing input alignments==
 
 
 
In this section, we start from a collection of homologous APSES domains, construct a multiple sequence alignment, and edit the alignment to make it suitable for phylogenetic analysis.
 
 
 
 
 
===Principles===
 
 
 
In order to use molecular sequences for the construction of phylogenetic trees, you have to build a multiple alignment first, then edit it. This is important: all rows of sequences have to contain the exact same number of characters and to hold '''aligned characters in corresponding positions'''. Phylogeny programs are not meant to revise an alignment but to analyze evolutionary relationships, '''after''' the alignment has been determined. The program's inferences are made on a column-wise basis and if your columns contain data from unrelated positions, the inferences are going to be questionable. Clearly, in order for tree-estimation to work, one must not include fragments of sequence which have evolved under a different evolutionary model than all others, e.g. after domain fusion, or after accommodating large stretches of indels. Thus it is appropriate to edit the sequences and pare them down to a most characteristic subset of amino acids. The goal is not to be as comprehensive as possible, but to input those columns of aligned residues that will best represent the true phylogenetic relationships between the sequences.
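
The requirement that all rows contain the same number of characters is easy to check before running a phylogeny program. The following Python sketch is a hypothetical helper, not part of any package mentioned here: <code>read_fasta</code> and <code>alignment_length</code> are names introduced for illustration, and the input is assumed to be aligned multi-FASTA text.

<source lang="python">
def read_fasta(text):
    """Parse multi-FASTA text into a dict: name -> concatenated sequence."""
    seqs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # name ends at the first blank
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs


def alignment_length(seqs):
    """Return the common row length, or raise ValueError if rows differ."""
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError(f"rows differ in length: {sorted(lengths)}")
    return lengths.pop()


# Toy input: short fragments, for illustration only.
example = """>Mbp1_SACCE
IMKRKKDDWV
>Swi4_SACCE
VMRRTKDDWI
"""
print(alignment_length(read_fasta(example)))
</source>

Note that such a check only guards against malformed input; it cannot tell whether the characters in a column are actually homologous.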
 
 
 
 
 
The result of the tree construction is a decision about the most likely evolutionary relationships. Fundamentally, tree-construction programs decide which sequences had common ancestors.
 
 
 
 
 
'''Distance based''' phylogeny programs start by using sequence comparisons to estimate evolutionary distances:
 
 
 
* they apply a model of evolution such as a mutation data matrix, to calculate a score for each pair of sequences,
 
* this score is stored in a "distance matrix" ...
 
* ... and used to estimate a tree that groups sequences with close relationships together (e.g. by using an NJ, Neighbor Joining, algorithm).
 
 
 
They are fast, can work on large numbers of sequences, but are less accurate if genes evolve at different rates.
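
The first two steps of a distance method can be sketched as follows. This Python example is a simplification under stated assumptions: instead of a mutation data matrix it uses the plain fraction of differing residues (the so-called p-distance) as a stand-in for a real model of evolution, it ignores columns in which either sequence has a gap, and the function names and toy fragments are hypothetical. The tree-estimation step (e.g. NJ) is not shown.

<source lang="python">
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Fraction of differing residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("sequences share no ungapped columns")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs):
    """Return a symmetric nested dict d[name1][name2] of pairwise distances."""
    names = list(seqs)
    d = {n: {n: 0.0} for n in names}
    for m, n in combinations(names, 2):
        d[m][n] = d[n][m] = p_distance(seqs[m], seqs[n])
    return d


# Toy input: short fragments, for illustration only.
toy = {
    "Mbp1_SACCE": "IMKRKKDDWV",
    "Swi4_SACCE": "VMRRTKDDWI",
    "Phd1_SACCE": "VVRRADNNMI",
}

d = distance_matrix(toy)
for name in toy:
    print(name, [round(d[name][other], 2) for other in toy])
</source>

The matrix holds a single number per sequence pair; all information about which columns differ is lost at this point.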
 
 
 
 
 
'''Parsimony based''' phylogeny programs build a tree that minimizes the number of mutation events that are required to get from a common ancestral sequence to all observed sequences. They take all columns into account, not just a single number per sequence pair, as the Distance Methods do. For closely related sequences they work very well, but they construct inaccurate trees when they can't make good estimates for the required number of sequence changes.
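
To illustrate how mutation events are counted per column, here is a minimal Python sketch of the Fitch algorithm for a fixed, rooted binary tree. It is an illustration only and not how any particular package is implemented: the tree is given as nested pairs of sequence names, all substitutions cost the same, and the four sequences and both topologies are made up. A real program additionally has to search over tree topologies.

<source lang="python">
def fitch(tree, states):
    """Return (possible ancestral states, minimum number of changes).

    tree is either a leaf name (str) or a pair (left, right) of subtrees;
    states maps leaf names to the character observed in one column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:  # the subtrees can agree: no extra change needed
        return common, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, seqs):
    """Sum the minimum number of changes over all alignment columns."""
    ncol = len(next(iter(seqs.values())))
    total = 0
    for i in range(ncol):
        column = {name: s[i] for name, s in seqs.items()}
        total += fitch(tree, column)[1]
    return total


# Hypothetical four-sequence alignment and two candidate topologies.
seqs = {"s1": "KRGW", "s2": "KRGF", "s3": "ADGW", "s4": "ADGF"}
tree_a = (("s1", "s2"), ("s3", "s4"))
tree_b = (("s1", "s3"), ("s2", "s4"))
print(parsimony_score(tree_a, seqs), parsimony_score(tree_b, seqs))
</source>

The topology with the lower score requires fewer changes and is preferred under the parsimony criterion.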
 
 
 
 
 
'''ML''', or '''Maximum Likelihood''' methods attempt to find the tree for which the observed sequences would be the most likely under a particular evolutionary model. They are based on a rigorous statistical framework and yield the most robust results. But they are also quite compute intensive and a tree of the size that we are building in this assignment is a challenge for the resources of a common workstation (runs about an hour on my computer). If the problem is too large, one may split a large problem into smaller, obvious subtrees (e.g. analysing orthologues as a group, only including a few paralogues for comparison) and then merge the smaller trees; this way even very large problems can become tractable.
 
 
 
ML methods suffer less from "long-branch attraction" - the phenomenon that weakly similar sequences can be grouped inappropriately close together in a tree due to spuriously shared differences.
 
 
 
 
 
'''Bayesian''' methods don't estimate the tree that gives the highest likelihood for the observed data, but find the most probable tree, given that the data have been observed. If this sounds conceptually similar to you, then you are not wrong. However, the approaches employ very different algorithms. And Bayesian methods need a "prior" on trees before observation.
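
The difference can be written compactly with Bayes' theorem. The notation is introduced here for illustration only: ''T'' stands for a tree (topology and branch lengths), ''D'' for the alignment, and the sum runs over all candidate trees (it becomes an integral where branch lengths vary continuously).

<math>\hat{T}_{\mathrm{ML}} = \arg\max_{T} P(D \mid T), \qquad P(T \mid D) = \frac{P(D \mid T)\,P(T)}{\sum_{T'} P(D \mid T')\,P(T')}</math>

ML methods maximize the likelihood ''P(D|T)'' on the left; Bayesian methods evaluate the posterior ''P(T|D)'' on the right, which requires the prior ''P(T)''.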
 
 
 
 
 
===Choosing sequences===
 
 
 
 
 
In principle, we have discussed strategies for using PSI-BLAST to collect suitable sequences earlier. To prepare the process, I have collected all APSES domains for six reference fungal species, together with the KilA-N domain of ''E. coli''. The process is explained on the [[Reference APSES domains|reference APSES domains page]].
 
 
 
 
 
====Renaming sequences====
 
  
 +
}}
  
Renaming sequences so that their species is apparent is crucial for the interpretation of mixed gene trees. Refer to the [[Reference APSES domains|reference APSES domains page]] to see how I have prepared the FASTA sequence headers.
 
  
 +
{{Vspace}}
  
===Adding an outgroup===
+
==Data Sources==
  
  
To analyse phylogenetic trees it is useful (and for some algorithms required) to define an outgroup, a sequence that presumably diverged from all other sequences in a clade before they split up among themselves. Wherever the outgroup inserts into the tree, this is the root of the rest of the tree. And whenever a molecular clock is assumed, the branching point that connects the outgroup can be assumed to be the oldest divergence event. I have defined an outgroup sequence and added it to the [[Reference APSES domains|reference APSES domains page]]. The procedure is explained in detail on that page.
+
'''Interaction databases''' face problems similar to those of sequence databases: the need for standards for abstracting biological concepts into computable objects, data integrity, search and retrieval, and the metrics of comparison. There is however an added complication: interactions are rarely all-or-none, and the high-throughput experimental methods have large false-positive and false-negative rates. This makes it necessary to define '''confidence scores''' for interactions. On top of experimental methods, there are also a variety of methods for {{WP|Protein–protein_interaction_prediction|computational interaction prediction}}. However, even though the "gold standard" is careful, small-scale laboratory experiments, different curation efforts on the same experimental publication usually lead to different results - with as little as 42% overlap between databases being reported.
  
>gi|301025594|ref|ZP_07189117.1| KilA-N domain protein [Escherichia coli MS 69-1]
+
Currently, likely the best integrated protein-protein interaction database is [http://www.ebi.ac.uk/intact/ '''IntAct'''], at the EBI, which besides curating interactions from the literature hosts interactions from the IMEx consortium, an extensive data-sharing agreement between a number of general and specialized source databases.
<span style="color: #999999;">MTSFQLSLISRE</span>IDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQS
 
FKGGRPENQGTWVHPDIAINLAQ<span style="color: #999999;">WLSPKFAVQVSRWVREWMSGERTTAEMPVHLKRYMVNRSRIPHTHFS
 
ILNELTFNLVAPLEQAGYTLPEKMVPDISQGRVFSQWLRDNRNVEPKTFPTYDHEYPDGRVYPARLYPNE
 
YLADFKEHFNNIWLPQYAPKYFADRDKKALALIEKIMLPNLDGNEQF</span>
 
''E. coli'' KilA-N protein. Residues that do not align with APSES domains are shown in grey.
 
  
 +
{{vspace}}
  
===Calculating alignments===
 
 
Rather than have you go through the process of using these alignments and adding your YFO sequences to them in Jalview (which would be a bit redundant with what you have done earlier), I have created an alignment of all APSES domains for this assignment. The process too is explained in detail [[Reference_APSES_domains#All_APSES_domains_for_all_course_species|'''on the reference APSES domains page''']]. Read the explanation.
 
 
 
<!--
 
 
{{task|1=
 
{{task|1=
#Navigate to the [[Reference APSES domains|reference APSES domains page]] and copy the APSES/KilA-N domain sequences.
 
#Open Jalview, select '''File &rarr; Input Alignment &rarr; from Textbox''' and paste the sequences into the textbox.
 
#Add the APSES domain sequences '''from your species (YFO)''' that you have previously defined through PSI-BLAST. Don't worry that the sequences are longer, the MSA algorithm should be able to take care of that. However: do rename your sequences to follow the pattern for the other domains, i.e. edit the FASTA header line to begin with the five-letter abbreviated species code.
 
#When all the sequences are present, click on '''New Window'''.
 
#In Jalview, select Web Service &rarr; Alignment &rarr; MAFFT Multiple Sequence Alignment. The alignment is calculated in a few minutes and displayed in a new window.
 
#Choose any colour scheme and add '''Colour &rarr; by Conservation'''. Adjust the slider left or right to see which columns are highly conserved.
 
#Save the alignment as a Jalview project before editing it for phylogenetic analysis. You may need it again.
 
}}
 
-->
 
 
===Editing sequences===
 
As discussed in the lecture, we should edit our alignments to make them suitable for phylogeny calculations. Here are the principles:
 
 
Follow the fundamental principle that '''all characters in a column should be related by homology'''. This implies the following rules of thumb:
 
 
*Remove all stretches of residues in which the ''alignment'' appears ambiguous (not just highly variable, but ambiguous regarding the aligned positions).
 
*Remove all frayed N- and C- termini, especially regions in which not all sequences that are being compared appear homologous and that may stem from unrelated domains. You want to only retain the APSES domains. All the extra residues from the YFO sequence can be deleted.
 
*Remove all gapped regions that appear to be alignment artefacts due to inappropriate input sequences.
 
*Remove all but approximately one column from gapped regions '''in those cases where the presence of several related insertions suggests that the indel is real, and not just an alignment artefact.''' (Some researchers simply remove all gapped regions).
 
*Remove sections N- and C- terminal of gaps where the alignment appears questionable. 
 
*If the sequences fit on a single line you will save yourself potential trouble with block-wise vs. interleaved input. If you do run out of memory try removing columns of sequence. Or remove species that you are less interested in from the alignment.
 
*Move your outgroup sequence to the first line of your alignment, since this is where PHYLIP will look for it by default.
 
 
====Handling indels====
 
 
Gaps are a real problem, as usual. Strictly speaking, the similarity score of an '''alignment''' program as well as the distance score of a '''phylogeny''' program are not calculated for an ordered sequence, but for a sum of independent values, one for each aligned column of characters. The order of the columns does not change the score. However in an optimal sequence alignment with gaps, this is no longer strictly true since a one-character gap creation has a different penalty score than a one-character gap extension! Most '''alignment''' programs use a model with a constant gap insertion penalty and a linear gap extension penalty. This is not rigorously justified from biology, but parametrized (or you could say "tweaked") to correspond to our observations. However, most '''phylogeny''' programs (such as the programs in PHYLIP) do not work in this way. PHYLIP strictly operates on columns of characters and treats a gap character just like a residue with the one letter code "-". Thus gap insertion- and extension- characters get the same score. For short indels, this '''underestimates''' the distance between pairs of sequences, since any evolutionary model should reflect the fact that gaps are much less likely than point mutations. If the gap is very long though, all events are counted individually as many single substitutions (rather than one lengthy one) and this '''overestimates''' the distance. And it gets worse: long stretches of gaps can make sequences appear similar in a way that is not justified, just because they are identical in the "-" character. It is therefore common and acceptable to edit gaps in the alignment and delete all but one or two columns of gapped sequence, or to remove such columns altogether.
 
 
 
[[Image:EditingGuide.jpg|frame|none|(Possible) steps in editing a multiple sequence alignment towards a PHYLIP input file. '''a''': raw alignment (CLUSTAL format); '''b''': sequences assembled into single lines; '''c''': columns to be deleted highlighted in red - 1, 3 and 4: large gaps; 2: uncertain alignment and 5: frayed C-terminus: both would put non-homologous characters into the same column; '''d''': input data for PHYLIP: names for sequences must not be longer than 10 characters, the first line must contain the number of sequences and the sequence length. PHYLIP is very picky about incorrectly formatted input, read the [http://evolution.genetics.washington.edu/phylip/doc/sequence.html PHYLIP sequence format guide].]]
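
The last step in the figure, writing the PHYLIP input, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name <code>to_phylip</code> is hypothetical, and it assumes the simple sequential layout in which each sequence occupies a single line and each name is padded with blanks to exactly ten characters; check the sequence format guide linked above before relying on it. The optional <code>outgroup</code> argument moves that sequence to the first row.

<source lang="python">
def to_phylip(seqs, outgroup=None):
    """Format aligned sequences (dict: name -> string) as PHYLIP input text."""
    names = list(seqs)
    if outgroup is not None:  # put the outgroup on the first line
        names.remove(outgroup)
        names.insert(0, outgroup)
    lengths = {len(seqs[n]) for n in names}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    too_long = [n for n in names if len(n) > 10]
    if too_long:
        raise ValueError(f"names longer than 10 characters: {too_long}")
    # First line: number of sequences and number of columns.
    lines = [f"{len(names)} {lengths.pop()}"]
    # Then one line per sequence: name padded to 10 characters, then residues.
    lines += [f"{n:<10}{seqs[n]}" for n in names]
    return "\n".join(lines) + "\n"


# Toy input: short fragments, for illustration only.
toy = {
    "Mbp1_SACCE": "IMKRKKDDWV",
    "Swi4_SACCE": "VMRRTKDDWI",
    "KilA_ESCCO": "---RAKDGYI",
}
print(to_phylip(toy, outgroup="KilA_ESCCO"), end="")
</source>

Names such as <code>Mbp1_SACCE</code> are exactly ten characters long, so the residues follow the name without a separating blank.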
 
 
I have taken a shortcut here, by simply removing all columns that contain more than 90% gap characters. For this purpose, I have written the following short Perl script - you could do the same thing with a few lines of '''R''' - or you could actually do this by hand. It will take you a little while, but these kinds of tasks are still doable, albeit tedious. Of course, this is not a Perl course, but try reading through the comments to get a sense of what this does.
 
 
<source lang="Perl">
 
#!/usr/bin/perl
 
# RemoveGaps.pl
 
# Read an aligned multi FASTA file and remove all columns that exceed
 
# a threshold of gap characters.
 
# Write output sorted by species code.
 
# BS Nov 2013
 
use strict;
 
use warnings;
 
 
my $mfaFile = "APSES_domains.mfa";
 
#my $mfaFile = "test.mfa";
 
 
my $MAXGAP = 0.9;  # Maximum allowed gap characters in a column
 
my $GAPCHAR = '-';  # The gap character
 
 
my %seq;    # Hash to hold the concatenated sequences that we
 
            # read from the .mfa file
 
my @ali =(());  # 2D array to hold the aligned sequences
 
my %phy;  # Hash to hold the output with gap columns removed
 
 
my $key;
 
 
#read the .mfa file into our hash
 
open(IN, '<', $mfaFile) or die "Cannot open $mfaFile: $!";
 
while (my $line = <IN>) { # process all lines from this file
 
    # header lines start with ">": capture the name up to the first whitespace
 
    if ($line =~ m/^>(\S+)/) { # new header
 
        $key = $1;
 
    }
 
    else {
 
        chomp($line); # remove linebreaks
 
        $seq{$key} .= $line; # add to sequence (or create new entry)
 
    }
 
}
 
close IN;
 
 
# iterate through the hash and convert all the strings to arrays
 
foreach my $key (keys(%seq)) {
 
    my @a = split(//, $seq{$key}); # split the string into single characters.
 
    push(@ali, [ @a ]); # store the array of characters in our array
 
    $seq{$key} = scalar(@ali) - 1; # store the row index in the original hash
 
}
 
 
my $nrow = scalar(@ali);        # number of rows...
 
my $ncol = scalar(@{$ali[0]});  # number of columns ...
 
 
for (my $iC=0; $iC<$ncol; $iC++) {      # for all columns...
 
    my $gaps = 0;                      # clear number of gaps and ...
 
    for (my $iR=0; $iR<$nrow; $iR++) {      # for all rows ...
 
        if ($ali[$iR][$iC] eq $GAPCHAR) {
 
            $gaps++;                        # count gaps
 
        }
 
    }
 
    if ($gaps / $nrow <= $MAXGAP) {     # then, if the fraction of gaps does

                                        # not exceed the threshold, ...
 
        # append the characters to the output hash
 
        foreach my $key (keys(%seq)) {
 
            my $iRow = $seq{$key};          # fetch the row index for this key...
 
            $phy{$key} .= $ali[$iRow][$iC];  # and append array cell contents
 
        }
 
    }
 
}
 
 
# Now iterate through all keys in %phy and print sequences in
 
# multi FASTA format. But do this nicely sorted by organism!
 
 
foreach my $key (sort({substr($a,5,5) cmp substr($b,5,5) } keys(%phy))) {
 
    print (">");
 
    print ("$key\n");
 
    print ("$phy{$key}\n");
 
}
 
 
exit();
 
 
</source>
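
For comparison, the core of the gap filter fits into a few lines of Python. This is a sketch of the same idea and not a translation of the script: the function name <code>remove_gappy_columns</code> is hypothetical, the sequences are passed in as a dictionary instead of being read from a file, and a column is kept when its fraction of gap characters does not exceed the threshold.

<source lang="python">
def remove_gappy_columns(seqs, max_gap=0.9, gap="-"):
    """Drop alignment columns whose gap fraction exceeds max_gap.

    seqs maps names to aligned strings of equal length; a new dict
    with the same names and the filtered rows is returned.
    """
    names = list(seqs)
    nrow = len(names)
    columns = zip(*(seqs[n] for n in names))  # one tuple per column
    kept = [col for col in columns if col.count(gap) / nrow <= max_gap]
    return {n: "".join(col[i] for col in kept) for i, n in enumerate(names)}


# Toy input with a low threshold, so that the effect is visible.
toy = {"a": "A--C", "b": "A--C", "c": "AG-C", "d": "AGTC"}
for name, row in remove_gappy_columns(toy, max_gap=0.5).items():
    print(name, row)
</source>

In the toy example the third column has three gaps in four rows and is removed, while the second column (two gaps in four rows) is kept.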
 
 
Finally: here is the resulting file - 284 sequences in all! They are sorted into three blocks: at the top is the ''E. coli'' KilA-N sequence - the "outgroup". This is followed by the domains from six fungal species that span the phylogenetic tree of fungi; we'll call them our "reference species". Finally: all the other genome-sequenced fungi.
 
 
<small>
 
<source lang="text">
 
>KilA_ESCCO
 
----------RAKDGYINATSMCRT----AGKLLSDYTRLLSRDMGIPISEIQSFKGGRPENQGTWVHPDIAINLAQ-----
 
 
 
 
>Mbp1_SACCE
 
IHSTGS-IMKRKKDDWVNATHILKA----ANFAKAKRTRILEKEVLKE--THEKVQGGFGKYQGTWVPLNIAKQLAEKFSVY
 
>Phd1_SACCE
 
--NGIS-VVRRADNNMINGTKLLNV----TKMTRGRRDGILRSEK-----VREVVKIGSMHLKGVWIPFERAYILAQREQI-
 
>Sok2_SACCE
 
--NGIS-VVRRADNDMVNGTKLLNV----TKMTRGRRDGILKAEK-----IRHVVKIGSMHLKGVWIPFERALAIAQREKI-
 
>Swi4_SACCE
 
---TKI-VMRRTKDDWINITQVFKI----AQFSKTKRTKILEKESNDM--QHEKVQGGYGRFQGTWIPLDSAKFLVNKYEI-
 
>Xbp1_SACCE
 
------DFHWNNIKPELRICQSYKDF--LINELG--PDQIDLPNL-NPANFTKRIRGGYIKIQGTWLPMEISRLLCLRFC--
 
>Aps1_CANAL
 
--NNVS-VVRRADNNMINGTKLLNV----AQMTRGRRDGILKSEK-----VRHVVKIGSMHLKGVWIPFERALAMAQREQI-
 
>Aps2_CANAL
 
MMNESS-IMRRCKDDWVNATQILKC----CNFPKAKRTKILEKGVQQG--LHEKVQGGFGRFQGTWIPLEDARKLAKTYGV-
 
>Aps4_CANAL
 
---NNHWVIWDYETGWVHLTGIWKASNVSPSHLKADIVKLLESTPKEYQQYIKRIRGGFLKIQGTWLPYKLCKILARRFCYY
 
>Aps7_CANAL
 
-HGEII-VLRRVQDSFVNVTQLFQILIKLEVLPTSQVDNYFDNEILSN--LKYF--GSNIYLQGIWIPYDKAVNLALKFDIY
 
>Mbp1_CANAL
 
VTSEGP-IMRRKKDSWINATHILKI----AKFPKAKRTRILEKDVQTG--IHEKVQGGYGKYQGTYVPLDLGAAIARNFGVY
 
>Aps3_CANAL
 
---NIL-VSRREDTNYINGTKLLNV----IGMTRGKRDGILKTEK-----IKNVVKVGSMNLKGVWIPFDRAYEIARNEGV-
 
>Aps6_CANAL
 
MMNESS-IMRRCKDDWVNATQILKC----CNFPKAKRTKILEKGVQQG--LHEKVQGGFGRFQGTWIPLEDARRLAKTYGV-
 
>Aps5_CANAL
 
--NNVS-VVRRADNNMINGTKLLNV----AQMTRGRRDGILKSEK-----VRHVVKIGSMHLKGVWIPFERALAMAQREQI-
 
>MBP1_USTMA
 
IINNVA-VMRRRSDDWLNATQILKV----VGLDKPQRTRVLEREIQKG--IHEKVQGGYGKYQGTWIPLDVAIELAERYNI-
 
>Aps4_USTMA
 
---RGHTMMIDVDTSFVRFTSITQAL----GKNKVNFGRLVKTCP-ALDPHITKLKGGYLSIQGTWLPFDLAKELSRR----
 
>Aps2_USTMA
 
-VRGIA-VMRRRGDGWLNATQILKI----AGIEKTRRTKILEKSILTG--EHEKIQGGYGKFQGTWIPLQRAQQVAAEYNV-
 
>Aps2_NEUCR
 
---GIC-VARREDNAMINGTKLLNV----AGMTRGRRDGILKSEK-----VRHVVKIGPMHLKGVWIPFERALDFANKEKI-
 
>Aps3_NEUCR
 
---PSYFLMRRSQDGYISATGMFKATFPYASQEEEEAERKYIKSIPTT--SSEETAG------NVWIPPEQALILAEEYQI-
 
>Aps4_NEUCR
 
-------VMRRRHDDWVNATHILKA----AGFDKPARTRILEREVQKD--THEKIQGGYGRYQGTWIPLEQAEALARRNNIY
 
>Aps1_NEUCR
 
--NNVA-VMRRQKDGWVNATQILKV----ANIDKGRRTKILEKEIQIG--EHEKVQGGYGKYQGTWIPFERGLEVCRQYGV-
 
>Mbp1_ASPNI
 
-----S-VMRRRSDDWINATHILKV----AGFDKPARTRILEREVQKG--VHEKVQGGYGKYQGTWIPLQEGRQLAERNNI-
 
>Aps4_ASPNI
 
----TYFLMRRSKDGFVSATGMFKIAFPWAKLDEERSEREYLKTRTET--SEDEIAG------NVWISPLLALELAKEYQMY
 
>Aps3_ASPNI
 
--KGVC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----VRHVVKIGPMHLKGVWIPFERALEFANKEKI-
 
>Aps7_ASPNI
 
---KQWTVMWDYNIGLVRTTHLFKCN----DYSKTTPAKMLNQNP-GLRDICHSITGGALAAQGYWMPYEAAKAIAATFC--
 
>Aps6_ASPNI
 
-----S-VMRRRSDDWINATHILKV----AGFDKPARTRILEREVQKG--VHEKVQGGYGKYQGTWIPLPEGRMLAERNNI-
 
>Aps5_ASPNI
 
---KTWVISWDYNVGLVLTRSLFKCN----GHPKTAPAKVLKMNP-GLGDISHSITGGALVGQGYWMPFRAAKALATTFC--
 
>Aps8_ASPNI
 
--NGVA-VMKRRSDSWLNATQILKV----AGVVKARRTKTLEKEIAAG--EHEKVQGGYGKYQGTWVNYQRGVELCREYHV-
 
>Aps9_ASPNI
 
--NGVA-VMKRRSDGWLNATQILKV----AGVVKARRTKTLEKEIAAG--EHEKVQGGYGKYQGTWVNYQRGVELCREYHV-
 
>Aps2_ASPNI
 
----TYFLMRRSKDGYVSATGMFKIAFPWAKLEEERSEREYLKTRPET--SEDEIAG------NVWISPVLALELAAEYKMY
 
>Aps1_ASPNI
 
--KGVC-VARREDNGMINGTKLLNV----AGMTRGRRDGILKSEK-----VRNVVKIGPMHLKGVWIPFDRALEFANKEKI-
 
>Res1_SCHPO
 
-INGFP-LMKRCHDNWLNATQILKI----AELDKPRRTRILEKFAQKG--LHEKIQGGCGKYQGTWVPSERAVELAHEYNVF
 
>CdcA_SCHPO
 
---GDNVALRRCPDSYFNISQILRL----AGTSSSENAKELDDIIESG--DYENVDSKHPQIDGVWVPYDRAISIAKRYGVY
 
>Aps4_SCHPO
 
-----HFLMRMAKDSSISATSMFRSAFPKATQEEEDLEMRWIRDNLNP--IEDKRVA------GLWVPPADALALAKDYSM-
 
>MBP1_SCHPO
 
-IKGVS-VMRRRRDSWLNATQILKV----ADFDKPQRTRVLERQVQIG--AHEKVQGGYGKYQGTWVPFQRGVDLATKYKV-
 
 
 
 
>Aps3_AJEDE
 
---KTYTVMWDYNIGLVRTTSLFRCN----NYSKTAPAKMLNANP-GLREICHSITGGALAAQGYWMPFEAAKAVAATFC--
 
>Aps2_AJEDE
 
--KGVC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----VRNVVKIGPMHLKGVWIPFERALEFANKEKI-
 
>Aps1_AJEDE
 
--NGVA-VMRRRSDSWLNATQILKV----AGVMKARRTKTLEKEVAAG--EHEKVQGGYGKYQGTWVNYERGVELCRHYHVF
 
>Mbp1_AJEDE
 
-------VMRRRADDWINATHILKV----AGLDKPARTRILEREVQKG--VHEKVQGGYGKYQGTWVPLQEGRELAERNGI-
 
>Aps2_ARTBE
 
--KGVC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----IRHVVKIGPMHLKGVWIPFERALDFANKEKI-
 
>Aps1_ARTBE
 
-------VMRRRVDDWVNATHILKA----AGLDKPSRTRILERDVQRG--VHEKIQGGYGKYQGTWIPLAEARALADKNNV-
 
>Aps3_ARTBE
 
--------MRRRSDSWLNATQILKV----AGVAKARRTKTLEKEVAAG--EHEKVQGGYGKYQGTWVSYERGLELCRRYQV-
 
>Aps5_ARTGY
 
--KGVC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----IRHVVKIGPMHLKGVWIPFERALDFANKEKI-
 
>Aps4_ARTGY
 
-------VMRRRVDDWVNATHILKA----AGLDKPSRTRILEREVQRG--VHEKIQGGYGKYQGTWIPLAEARALADKNGV-
 
>Aps3_ARTGY
 
---KVYTVMWDYNIGLVRTTSLFRCN----NYSKTAPAKMLNANP-GLREICHSITGGALAAQGYWMPFEAAKAVAATFC--
 
>Aps1_ARTGY
 
----SYFLMRRSRDGHISASGMFKIAFPWAKHSEESDERDYLRTRPET--SEDEIAG------NVWISPELALELAREYGI-
 
>Aps2_ARTGY
 
--NGVA-MMRRRSDSWLNATQILKV----AGVAKARRTKTLEKEVAAG--DHEKVQGGYGKYQGTWVSYERGLELCRRYQV-
 
>Aps2_ASHGO
 
--NGVS-VVRRADNDMINGTKLLNV----AKMTRGRRDGILKAEK-----VRHVVKIGSMHLKGVWIPFERALALAQREKI-
 
>Aps1_ASHGO
 
-----I-VMRRLHDDWVNITQVFKV----ATFSKTQRTKILEKESADI--SHEKIQGGYGRFQGTWIPLDSAKGLVAKYEI-
 
>Aps4_ASHGO
 
-----TDVHWNQVDPTWKLCRLYQQ------------EKNLDFTP-EFQDCYKRIRGGYIKIQGTWLPMEICKRLCIRFC--
 
>Aps3_ASHGO
 
LHPTGS-IMKRKADDWVNATHILKA----AKFAKAKRTRILEKEVIKD--THEKVQGGFGKYQGTWVPLDIARRLAQKFEV-
 
>Aps1_ASPCL
 
--NGVA-VMKRRSDSWLNATQILKV----AGVVKARRTKTLEKEIAAG--EHEKVQGGYGKYQGTWVNYQRGVDLCREYHV-
 
>Aps3_ASPCL
 
--KGVC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----VRHVVKIGPMHLKGVWIPFERALEFANKEKI-
 
>Aps5_ASPCL
 
---GES-VMRRRGDNWINATHILKV----AGFDKPARTRILEREVQKG--THEKVQGGYGKYQGTWIPLPEGRLLAERNNI-
 
>Aps4_ASPCL
 
----TYFLMRRSKDGYVSATGMFKIAFPWAKLEEEKAEREYLKSRDET--SEDEIAG------NIWISPTLALELAKEYQMY
 
>Aps2_ASPCL
 
---KEWTVMWDYNIGLVRTTHLFKCN----DYSKTTPAKMLNLNP-GLREICHSITGGALAAQGYWMPFEAAKAVAATFC--
 
>Aps2_ASPFU
 
----TYFLMRRSKDGYVSATGMFKIAFPWAKLEEEKAEREYLKTREGT--SEDEIAG------NIWVSPLLALELAKEYQMY
 
>Mbp1_ASPFU
 
--------MRRRGDDWINATHILKV----AGFDKPARTRILEREVQKG--THEKVQGGYGKYQGTWIPLHEGRLLAERNNI-
 
>Aps4_ASPFU
 
--KGVC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----VRHVVKIGPMHLKGVWIPFERALEFANKEKI-
 
>Aps1_ASPFU
 
--NGVA-VMKRRSDSWLNATQILKV----AGVVKARRTKTLEKEIAAG--EHEKVQGGYGKYQGTWVNYQRGVELCREYHV-
 
>Aps3_ASPFU
 
---KEWIVMWDYNIGLVRTTHLFKCN----DYS-----KMLNANP-GLREICHSITGGALAAQGYWMPYEAAKAVAATFC--
 
>Aps1_ASPTE
 
--KGVC-VARREDNSMINGTKLLNV----AGMTRGRRDGILKSEK-----IRHVVKIGPMHLKGVWIPFERALEFANKEKI-
 
>Mbp1_ASPTE
 
-----S-VMRRRADDWINATHILKV----AGFDKPARTRILEREVQKG--VHEKVQGGYGKYQGTWIPLPEGRLLAERNNI-
 
>Aps3_ASPTE
 
----TYFLM----DGYVSATGMFKIAFPWAKLDEERSEREYLKSREET--SEDEIAG------NVWISPKLALELAGEYQMY
 
>Aps4_ASPTE
 
--NGVA-VMKRRSDSWLNATQILKV----AGVVKARRTKTLEKEIAAG--EHEKVQGGYGKYQGTWVNYQRGVDLCREYHV-
 
>Aps2_ASPTE
 
---KEWLIMWDYNIGLVRTTPLFRSQ----NYSKTTPAKVLDANP-GLREISHSITGGAIVAQGYWIPFEAAKAVAATFC--
 
>Aps4_CANDU
 
---NIL-VSRREDTNYINGTKLLNV----IGMTRGKRDGILKTEK-----IKNVVKVGSMNLKGVWIPFDRAYEIARNEGV-
 
>Aps1_CANDU
 
IMNDYS-IMRRCKDDWVNATQILKC----CNFPKAKRTKILEKGVQQG--LHEKVQGGFGRFQGTWIPLEDARRLAESYGV-
 
>Aps3_CANDU
 
VTSEGP-IMRRKKDSWINATHILKI----AKFPKAKRTRILEKDVQTG--IHEKVQGGYGKYQGTYVPLDLGAAIAKNFGVY
 
>Aps5_CANDU
 
-HNEII-VLRRVQDSFVNITQLFQILIKLDLLSASQVNNYFDNEILSN--LEYF--GSNTFLQGIWIPYDRAVNLALKFDVY
 
>Aps2_CANDU
 
---NNHWVIWDYETGWVHLTGIWKASNVSPSHLKADIVKLLESTPKEYQQYIKRIRGGFLKIQGTWLPFKLCKILARRFCYY
 
>Aps6_CANDU
 
--NNVS-VVRRADNNMINGTKLLNV----AQMTRGRRDGILKSEK-----VRHVVKIGSMHLKGVWIPFERALVMAQREGI-
 
>Mbp1_CANGA
 
IHPTGS-IMKRKNDGWVNATHILKA----ANFAKAKRTRILEKEVLKE--MHEKVQGGFGKYQGTWVPLNIAINLAEKFDVY
 
>Aps3_CANGL
 
--NGVS-VVRRADNDMINGTKLLNV----TKMTRGKRDGILRSEK-----YRKVVKIGSMHLKGVWIPFERALFIAKREKI-
 
>Aps4_CANGL
 
-----I-VMRRTMDDWVNVTQVFKI----AQFSKTQRTKILEKESTNM--KHEKVQGGYGRFQGTWVPLEAAKFMTTKYNI-
 
>Aps2_CANGL
 
-HNGVT-VVRRADNDMVNGTKLLNV----TGMTRGRRDGILKNEP-----VRDVVKGGPMTLKGVWIPIDRARAIARQEGI-
 
>Aps1_CANGL
 
------DFHWFDISEKVRIFEQFKQH--LEKDRN--VDCSTIP---KAEEYIQRIRGGYIKIQGTWVPWYIAKLICIRFC--
 
>Mbp1_CANOR
 
VTSEGP-IMRRKGDSWINATHILKI----AKLPKAKRTRILEKDVQTG--IHEKVQGGYGKYQGTYVPLKLGEVIARNYDVY
 
>Swi4_CANOR
 
--NDSP-IMRRCKDDWVNATQILKC----CNFPKAKRTKILEKGVQQG--LHEKVQGGFGRFQGTWIPLEDARRLACTYGV-
 
>Efh1_CANOR
 
--NEIL-VSRREDNNYINCTKLLNV----TGMSRGKRDGILKTEK-----VKDVVKVGTMNLKGVWVPFDRAYEIARNEGV-
 
>Efg1_CANOR
 
--NNVS-VVRRADNNMINGTKLLNV----AQMTRGRRDGILKSEK-----VRHVVKIGSMHLKGVWIPFERALSMAQRENI-
 
>Swi6_CANOR
 
---EII-VLRRVQDSFINASQLLKILVRLHIVTPIQVKNYLNNEVLSN--LEYF--GNNKSLRGIWVPYNKGVKIALDFDVY
 
>Aps1_CANOR
 
---NDHWVIWDYETGFVHLTGIWKASPPCASHFKADIVKLLESTPKQYQAYIKRIRGGFLKIQGTWLPFKLCKILARRFCY-
 
>Aps1_CANTR
 
---NNHWVIWDYETGWVHLTGIWKASNVSPSHMKADIVKLLESTPKEYQHYIKRIRGGFLKIQGTWLPYKLCKILARRFCYH
 
>Aps5_CANTR
 
---NIL-VSRREDSNYINGTKLLNV----IGMTRGKRDGILKTEK-----VKNVVKVGSMNLKGVWIPFDRAYEIARNEGV-
 
>Aps3_CANTR
 
-DEELI-ILRRVQDSFINVTQLFEILVKLDLLTLSQLNNFFDNEILSN--LKYF--GSNTYIKGIWIPYDKAVELALKFDIY
 
>Aps4_CANTR
 
VTSEGP-IMRRKSDSWINATHILKI----AKFPKARRTRILEKDVQTG--VHEKVQGGYGKYQGTYVPLELGATIAKNFGVY
 
>Aps2_CANTR
 
--NDSP-IMRRCKDDWVNATQILKC----CNFPKAKRTKILEKGVQQG--LHEKVQGGFGRFQGTWIPLEDARRLAETYGV-
 
>Aps3_CHAGL
 
--NNVA-VMRRQTDGWLNATQILKV----AGVDKGRRTKILEKEIQTG--EHEKVQGGYGKYQGTWIPFERGFEVCRQYGV-
 
>Aps2_CHAGL
 
----SYTVMWDYN---------------------TAPAKMLNLNP-GLKDITYSITGGSIKAQGYWMPYSCAKAVCATFC--
 
>Aps1_CHAGL
 
---PSYFLMRRSHDGFVSATGMFKG-----------------HSLPST--SHEETAG------NVWIPPEEALVLAEEYNI-
 
>Aps4_CHAGL
 
---GIC-VARREDNAMINGTKLLNV----AGMTRGRRDGILKSEK-----VRHVVKIGPMHLKGVWIPYDRALDFANKEKI-
 
>Mbp1_CHAGL
 
-------VMRRREDNWINATHILKA----AGFDKPARTRILERDVQKD--VHEKIQGGYGKYQGTWIPLEQGRALAQRNNIY
 
>Aps3_CLALU
 
----SQWIIWDHETGNVLLTSLWRAADKLRAPPKADIVKLLESTPKELHASIKRVRGGFLKIQGTWVPHALCRRLARRFCYY
 
>Aps2_CLALU
 
--NNVS-VVRRADNNMINGTKLLNV----AQMTRGRRDGILKSEK-----IRHVVKIGSMHLKGVWIPFERALAMAQREGI-
 
>Aps4_CLALU
 
----VV-VSRREKDDYVNGTKLLNV----TGMSRGKRDGLLKTEK-----GRIVVRNGPMNLKGVWIPFHRASEIARNEGV-
 
>Aps6_CLALU
 
VTKEGP-IMRRKSDSWINATHILKI----AKFPKAKRTRILEKDVQTG--IHEKVQGGYGKYQGTYVPLDLGAEIAKSFGIF
 
>Aps1_CLALU
 
-DKPIL-VLRRVQDSYVNVSQMLEILVLTGHFSKDQVSGFLRNEILHS--TQYLPRGNVEQIRGLWIPYDKAVSIAVRFDLY
 
>Aps5_CLALU
 
--------MRRCKDDWVNATQILKL----CNFPKAKRTKILEKGVQQG--LHEKVQGGYGRFQGTWIPLADARRLADEYGI-
 
>Aps3_COCIM
 
----TYFLMRRSKDGYVSATGMFKIAFPWAKLADEKSEREYLRGLPET--SPDEVAG------NLWISPELALELAEEYRM-
 
>Aps2_COCIM
 
---KIHTVMWDYNVGLVRTTSLFKCN----NYPKTAPGKMLDANR-GLREICHSITGGALAAQGYWMPFEAAKAVAATFC--
 
>Aps1_COCIM
 
--KGVC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----VRHVVKIGPMHLKGVWIPFERALEFANKEKI-
 
>Aps4_COCIM
 
-----S-VMRRRHDDWINATHILKV----AGLDKPSRTRILEREVQKG--THEKIQGGYGKYQGTWVPLADGRAVAERNKV-
 
>Mbp1_COCPO
 
-----S-VMRRRHDDWINATHILKV----AGLDKPSRTRILEREVQKG--THEKIQGGYGKYQGTWVPLADGRAVAERNKV-
 
>Aps3_COCPO
 
---KIHTVMWDYNVGLVRTTSLFKCN----NYPKTAPGKMLDANR-GLREICHSITGGALAAQGYWMPFEAAKAVAATFC--
 
>Aps4_COCPO
 
--NGVA-VMRRRSDSWLNATQILKV----AGVVKARRTKTLEKEVVSG--EHEKVQGGYGKYQGTWVSYQRGVELCRRYHV-
 
>Aps2_COCPO
 
----TYFLMRRSKDGYVSATGMFKIAFPWAKLADEKSEREYLRGLPET--SPDEVAG------NLWISPELALELAEEYRM-
 
>Aps1_COCPO
 
--KGVC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----VRHVVKIGPMHLKGVWIPFERALEFANKEKI-
 
>Aps6_DEBHA
 
-DDPIV-ILRRVQDSYINISQLFSILLKIGHLSEAQLTNFLNNEILTN--TQYL--SSVRDLRGLWIPYDRAVSLALKFDIY
 
>Aps1_DEBHA
 
--NNSP-IMRRCKDDWVNATQILKC----CNFPKAKRTKILEKGVQQG--LHEKIQGGYGRFQGTWIPLADAQRLAASYGV-
 
>Aps5_DEBHA
 
--NNVS-VVRRADNNMINGTKLLNV----AQMTRGRRDGILKSEK-----VRHVVKIGSMHLKGVWIPFERALAMAQREGI-
 
>Aps4_DEBHA
 
---GIL-VSRREDTNYVNGTKLLNV----AGMTRGKRDGILKTEK-----TKSVVKVGAMNLKGVWIPFERASEIARNEGI-
 
>Aps2_DEBHA
 
---NNHWIIWDYETGFVHLTGIWKASVNTHRNLKADIVKLLESTPKQYHQHIKRIRGGFLKIQGTWLPFDLCKMLAKRFCYH
 
>Aps3_DEBHA
 
VTSEGP-IMRRKSDSWINATHILKI----AKFPKAKRTRILEKDVQTG--VHEKVQGGYGKYQGTYVPLDLGADIAKNFGVF
 
>Aps2_ERECY
 
--NSVS-VVRRADNDMINGTKLLNV----AKMTRGRRDGILKAEK-----VRHVVKIGSMHLKGVWIPFERALALAQREKI-
 
>Aps3_ERECY
 
IHPTGS-IMKRKADDWVNATHILKA----AKFAKAKRTRILEKEVIKD--IHEKVQGGFGKYQGTWVPLDIARRLAEKFDV-
 
>Aps1_ERECY
 
-----TDVHWNQLDPAWKLCQMFQEI--RKNMPRTGSSEHLDFTL-DFQDCYKRIRGGYIKIQGTWLPLEISRQLCTRFC--
 
>Aps4_ERECY
 
-----I-VMRRLHDDWVNITQVFKV----ASFTKTQRTKVLEKESTDI--NHEKIQGGYGRFQGTWIPLLSAQNLVAKYCI-
 
>Aps2_KAZAF
 
---SHI-VMRRTRDDWINITQVFKV----AKFSKNHRTKVLERESSNL--RHEKVQGGYGRFQGTWIPLVDAKRLIAEYNI-
 
>Aps7_KAZAF
 
LRKRYIELHWQNITATMKLFNEFKNY-VLEHEPN--VDATLFQNY-NMADLIHRIRGGCIKVQGTWFPMELAKLFCIKF---
 
>Aps1_KAZAF
 
IHPTGS-IMKRKKDGWVNATHILKA----ANFAKAKRTRILEKEVLPG--THEKVQGGFGKYQGTWIPLESAIALAEKFAVY
 
>Aps8_KAZAF
 
-----V-VMRRTRDDWVNITQVFKI----AQFSKTQRTKLLEKESMNI--QHEKVQGGYGRFQGTWVPLDAARDIAAKYSI-
 
>Aps3_KAZAF
 
LHPAGS-IMKRRIDNWVNATHVLKI----ANFNKSKRLRLLEKEVIKAGKAYEKIQGGSGKYQGTWVPLEVAKELAVKFEV-
 
>Aps6_KAZAF
 
--NGVS-VVRRADNDMINGTKLLNV----TKMTRGRRDGILRGEK-----VRNVVKIGSMHLKGVWIPFERAYLIAQREKI-
 
>Aps4_KAZAF
 
--NGVS-VVRRADNDMINGTKLLNV----TKMTRGRRDGILKAEK-----IRHVVKIGSMHLKGVWIPFERARYMAEKEKI-
 
>Aps5_KAZAF
 
--------HWNNLSKELKILKNFKDF--LINEKH--LTEENLLNY-NLNNLIQRIRGGYIKIQGTWLPMEIAKLICSRFC--
 
>Aps2_KLULA
 
--NGVS-VVRRADNDMINGTKLLNV----TRMTRGRRDGILKAEK-----IRHVVKIGSMHLKGVWIPFERALVMAQREKI-
 
>Aps3_KLULA
 
IHPTGS-IMKRKADNWVNATHILKA----AKFPKAKRTRILEKEVITD--THEKVQGGFGKYQGTWIPLELASKLAEKFEV-
 
>Aps1_KLULA
 
-------IMRRCNDNWLNITQVFKA----GSFTKAQRTKILEKEANEI--KHEKIQGGYGRFQGTWIPWESTKYLVEKYNI-
 
>Aps4_KOMPA
 
--NGVS-VVRRADNNMINGTKLLNV----AKMTRGRRDGMLKSEK-----IRHVVKIGSMHLKGVWIPFDRALAMAQKEHI-
 
>Aps2_KOMPA
 
VTPLTS-VMRRKSDDWINATHILKV----ADFPKAKRTRILERDIQVG--THEKVQGGYGKYQGTWVPLESAVKIAETFDV-
 
>Aps1_KOMPA
 
VVQKIP-LSRRADNDYVNATKLLNL----TGMRRGRRDGILKLEK-----QRQVVKTGTIDLKGVWVPLKRAIKLAKAEQVF
 
>Aps3_KOMPA
 
ICNTFP-LMRRCSDDWVNVTQILKI----AQFPKAQRTKILEKEVHDK--THQRIQGGYGRFQGTWTPLDIARNLAMNYG--
 
>Aps2_LACTH
 
-----I-VMRRCMDNWVNITQVFKI----ASFSKTQRTKILEKESNMV--KHEKIQGGYGRFQGTWIPLENAHYLVQKYSV-
 
>Aps1_LACTH
 
--NGVS-VVRRADNDMINGTKLLNV----AKMTRGRRDGILKAEK-----IRHVVKVGSMHLKGVWIPFDRALAMAQREKI-
 
>Mbp1_LACTH
 
IHPTGS-IMKRKEDDWVNATHILKA----AKFAKAKRTRILEKEVIKD--THEKVQGGFGKYQGTWVPLDIARSLAAKFEV-
 
>Aps3_LODEL
 
--NDSP-IMRRCKDDWVNATQILKC----CNFPKAKRTKILEKGVQQG--VHEKIQGGFGRFQGTWIPLEDARRLAATYGV-
 
>Aps1_LODEL
 
---EGP-IMRRKLDSWINATHILKI----AKLPKAKRTRILEKDVQTG--IHEKVQGGYGKYQGTYVPLELGEIIARNYDVY
 
>Aps4_LODEL
 
---NIL-VSRREDTNYINCTKLLNV----VGMTRGKRDGILKTEK-----VKQVVKVGSMNLKGVWIPFDRAYEIARNEGV-
 
>Aps2_LODEL
 
--NNVS-VVRRADNNMINGTKLLNV----AQMTRGRRDGILKLEK-----VRHVVKIGSMHLKGVWIPFERALTMAQRENI-
 
>Aps1_MAGOR
 
---GVC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----MRHVVKIGPMHLKGVWIPFERALDFANKEKI-
 
>Aps2_MAGOR
 
--NGVA-VMKRIGDSKLNATQILKV----AGVEKGKRTKILEKEIQTG--EHEKVQGGYGKYQGTWIKYERALEVCRQYGV-
 
>Aps3_MAGOR
 
-------VMRRRVDDWINATHILKA----AGFDKPARTRILEREVQKD--QHEKVQGGYGKYQGTWIPLEAGEALAHRNNIF
 
>Aps4_MAGOR
 
---NAYFLMRRSSDGYVSATGMFKATFPYADAEDEEAERNYIKSLPAT--SKEETAG------NVWISPDQALALAEEYSI-
 
>Aps1_MALGL
 
--KGVC-VARRHDNNMVNGTKLLNV----CGMSRGKRDGILKNEK-----ERIVVKVGAMHLKGVWIAFSRGKQLAEQHGI-
 
>Aps3_MALGL
 
---GIA-LMRRRSDGYLNATQILKI----AGIEKARRTRILEKEILTG--EHDKVQGGYGTFQGTWIPLQRAQELAISYNVY
 
>Aps2_MALGL
 
IIKDVA-VMRRRSDAWLNATQILKV----VGLDKSQRTRVLEKEVQKG--THEKVQGGYGKYQGTWIPMDVAIALAEHYHI-
 
>Aps5_MEYGU
 
---SLV-ILRRVQDSFVNVSQLFSILVRLGHSNPDQISSFLSNEILSS--SHYT--GSNPMLQGLWVSYDRAVALALRFDIY
 
>Aps2_MEYGU
 
---GVL-VSRREDTNYINGTKLLNV----AGMSRGKRDGILKTEK-----DRYVVRAGAMSLKGVWIPYERAKEIARNEGV-
 
>Mbp1_MEYGU
 
VTSEGP-IMRRKLDSWINATHILKI----ARFPKAKRTRILEKDVQTG--IHEKVQGGYGKYQGTYVPLNLGAEIAQSFGVY
 
>Aps4_MEYGU
 
---NGQSIIWDYESGYVHLTGIWKAADLPKSNSKADIVKLLESTPRQHQAKIKRIRGGFLKIQGTWLPYSLCRILARRFCYH
 
>Aps1_MEYGU
 
--NNVS-VVRRADNNMINGTKLLNV----AQMTRGRRDGILKSEK-----VRHVVKIGSMHLKGVWIPFDRALAMAQREGI-
 
>Aps3_MEYGU
 
--------MRRVKDNWVNATQILKC----CNFPKAKRTKILEKGVQQG--LHEKIQGGYGRFQGTWIPLEDAQQLAANYGL-
 
>Aps2_MILFA
 
VTSEGP-IMRRKSDSWINATHILKI----AKFPKAKRTRILEKDVQTG--IHEKVQGGYGKYQGTYVPLELGAEIARSFGIY
 
>Aps6_MILFA
 
---NNQWIIWDYETGLVHLTGIWKASQSGSKSVKADIMKLLESTPKQYHSNIKRIRGGFLKIQGTWMPYDLCKVLARRFCYH
 
>Aps5_MILFA
 
--NNVS-VVRRADNNMINGTKLLNV----AQMTRGRRDGILKSEK-----VRHVVKIGSMHLKGVWIPFERALAMAQREGI-
 
>Aps8_MILFA
 
--NNVS-VVRRADNNMINGTKLLNV----AQMTRGRRDGILKSEK-----VRHVVKIGSMHLKGVWIPFERALAMAQREGI-
 
>Aps7_MILFA
 
--NNSP-IMRRCKDDWVNATQILKC----CNFPKAKRTKILEKGVQQG--LHEKIQGGYGRFQGTWIPLANAQKLAASYGV-
 
>Aps4_MILFA
 
---NNQWIIWDYETSLVHLTGIWKASSSGSKSVKADIMKLLESTPKQYHSNIKRIRGGYLKIQGTWMPYGLCKVLARRFCYH
 
>Aps9_MILFA
 
--NNSP-IMRRCKDDWVNATQILKC----CNFPKAKRTKILEKGVQQG--LHEKIQGGYGRFQGTWIPLANAQKLAASYGV-
 
>Mbp1_MILFA
 
VTSEGP-IMRRKSDSWINATHILKI----AKFPKAKRTRILEKDVQTG--IHEKVQGGYGKYQGTYVPLDLGAEIARSFGIY
 
>Aps3_MILFA
 
---GIL-VSRREDTNFVNGTKLLNV----AGMTRGKRDGILKTEK-----TKSVIKVGTMNLKGVWIPFERAAEIARNEGI-
 
>Aps1_MILFA
 
----VI-ILRRVQDSYVNISQLLSILVKMGHFNQTRLNNFLNNEIITN--PQYS--AEVRQLRGLWIPYDKAVSLALKFDIY
 
>ApsA_MILFA
 
----VI-ILRRVQDSYVNISQLLSILVKMGHFNQTRLNNFLNNEIITN--PQYS--ADVKQLRGLWISYDKAVSLALKFDIY
 
>Aps1_MYCTH
 
---GIC-VARREDNSMINGTKLLNV----AGMTRGRRDGILKSEK-----VRHVVKIGPMHLKGVWIPFERALDFANKEKI-
 
>Aps4_MYCTH
 
---PSYFLMRRSEDGYVSATGMFKATFPYATQEEEEAERKYIKSLPST--SPEETAG------NVWIPPEQALILAEEYQI-
 
>Mbp1_MYCTH
 
-------VMRRREDNWINATHILKA----AGFDKPARTRILERDVQKD--IHEKIQGGYGKYQGTWIPLEHGEALAQRNNVY
 
>Aps2_MYCTH
 
---TDYTVMWDHNVGLVRMTPFFKCR----GYSKTTPAKMLNLNP-GLKDITYSITGGSIKAQGYWMPYSCAKAVCATFC--
 
>Aps3_MYCTH
 
--NNVA-VMRRQADGWLNATQILKV----AGVDKGRRTKILEKEIQTG--EHEKVQGGYGKYQGTWIPFERGYEVCRQYGV-
 
>Aps4_NAUCA
 
-----KDFHWNNLPPILKAINHFRNI--LQMEKG--ITSDYLASM-KDCDFCQRIRGGYIKIQGTWLPIEMAKLICTKFC--
 
>Aps2_NAUCA
 
--NGVS-VVRRADNDMINGTKLLNV----TKMTRGRRDGILKSEK-----IRHVVKIGSMHLKGVWVPFERARLMAGREHI-
 
>Aps1_NAUCA
 
-CNGVA-VVRRADNDMINGTKLLNV----TKMTRGRRDGILRAEK-----VRSVIKIGSMHLKGVWIPFDRALMMAKREKI-
 
>Aps5_NAUCA
 
IHPTGS-VMKRKKDDWVNATHILKA----ANFAKAKRTRILDKEVMGR--KHEKVQGGFGKYQGTWVPLEIATELAMKFDVY
 
>Aps3_NAUCA
 
-----SDLHWNNMSPDLQITESFKKD--LIINKH--CNEQDLKDL-NLSNLIQRIRGGYIKIQGTWLPLEIARLLSLRFC--
 
>Aps6_NAUCA
 
-----I-VMRRTKDDWINVTQVFKI----ADFSKAHRTKVLEKESSDM--MHEKVQGGYGRFQGTWIPLESALMLVQKYKI-
 
>Aps1_NAUDA
 
--NGVS-VVRRADNDMINGTKLLNV----SKMTRGRRDGILKAEK-----IRHVVKIGSMHLKGVWIPFERARIMAEKEKI-
 
>Aps5_NAUDA
 
--NSVS-VIRRADNDMINGTKLLNV----TKMTRGRRDGILRTEK-----IRKVVKIGSMHLKGVWIPFDRAYEIARREKI-
 
>Aps2_NAUDA
 
-----SDLHWNNISSNIKLCDSFKQY--LTKREN--IPAETLKNL-TLSMLIQRIRGGYIKIQGTWLPMEICRSLCLRFC--
 
>Aps4_NAUDA
 
VHPTGS-VMKRKSDDWVNATHILKV----ANFSKAKRTRILEKEVLKE--THEKVQGGFGKYQGTWVPMNIALNLAEKYGVY
 
>Aps3_NAUDA
 
----KV-VMRRTRDDWINITQVFKI----GKFSKAQRTKVLELEANEM--KHEKVQGGYGRFQGTWIPLESAMFLAKKYTI-
 
>Aps1_NECHA
 
---TEYAVMWDYNVGLVRMTPFFKCC----RYGKTIPAKMLGLNQ-GLKEITHSITGGSIAAQGYWMPYQCARAVCATFC-Y
 
>Mbp1_NECHA
 
-------VMRRRQDNWINATHILKA----AGFDKPARTRILERDVQKD--VHEKIQGGYGKYQGTWIPLESGQALAERHSV-
 
>Aps3_NECHA
 
---GIC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----VRHVVKIGPMHLKGVWIPYDRALDFANKEKI-
 
>Aps2_NECHA
 
---NSYFLMRRSFDGYVSATGMFKATFPYAEAADEEAERKFIKSLATT--SPEETAG------NIWIPPEQALALADEYQI-
 
>Aps4_NECHA
 
--NNIA-VMRRRNDSWLNATQILKV----AGVDKGKRTKILEKEIQTG--EHEKVQGGYGKYQGTWITFDRGVQVCRQYGV-
 
>Aps4_NEOFI
 
---GES-VMRRRGDNWINATHILKV----AGFDKPARTRILEREVQKG--THEKVQGGYGKYQGTWIPLPEGRLLAERNNI-
 
>Aps3_NEOFI
 
--KGVC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----VRHVVKIGPMHLKGVWIPFERALEFANKEKI-
 
>Aps2_NEOFI
 
---KEWIVMWDYNIGIVRTTHLFKCN----DYSKTTPAKMLNANP-GLREICHSITGGALAAQGYWMPYEAAKAVAATFC--
 
>Aps5_NEOFI
 
----TYFLMRRSKDGYVSATGMFKIAFPWAKLEEEKAEREYLKTREGT--SEDEIAG------NIWVSPLLALELAKEYQMY
 
>Aps1_NEOFI
 
--NGVA-VMKRRSDSWLNATQILKV----AGVVKARRTKTLEKEIAAG--EHEKVQGGYGKYQGTWVNYQRGVELCREYHV-
 
>Aps1_PUCGR
 
---NGQYIMIDCETGMVHFTGIWKAL----GHTKADVVKLVESDP-TIAPYLRKVRGGYLKIQGTWLPFDTAQTLARR----
 
>Aps2_PUCGR
 
-CEGIA-VMRRRSDSWLNATQILKV----AGFDKPQRTRVLEREIQKG--THEKIQGGYGKYQGTWVPLDRGIDLAKQYGV-
 
>Aps4_PUCGR
 
-HKGVT-VGRLKGSGLVNGTKLLNL----AGISRGKRDGILKNEK-----IRKVVKHGTMHLKGVWIAFDRAVFLAEQHSI-
 
>Aps3_PUCGR
 
---GIG-VMRRRSDSYMNATQILKV----AGLDKSKRTRILEREIIQG--EHEKIQGGYGRYQGTWVPFTRAQELATQLNV-
 
>Aps2_PYRTE
 
----SYFLMRRSSDGYISATGMFKAAFPWASLIEEDAERKYQKTFPSA--GAEEVAG------SVWIAPEEALALSEEYGM-
 
>Aps4_PYRTE
 
---KEYVVVWDYNVGLVRMTPFFKSC----KYSKTIPAKALRENP-GLKEISYSITGGALVCQGYWMPYHAARAIAATFC-Y
 
>Aps5_PYRTE
 
--NGNH-VMRRRADDWINATHILKV----ADYDKPARTRILEREVQKG--VHEKVQGGYGKYQGTWIPLEEGRHLAERNGV-
 
>Aps1_PYRTE
 
--NRVA-VMRRRSDGWLNATQILKV----AGVDKGKRTKVLEKEILTG--EHEKVQGGYGKYQGTWINYRRGREFCRQYGV-
 
>Aps3_PYRTE
 
--KGVC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----TRHVVKIGPMHLKGVWIPFERALEFANKEKI-
 
>Aps5_PYRTR
 
----SYFLMRRSSDGYISATGMFKAAFPWASLIEEDAERKYQKTFPSA--GAEEVAG------SVWIAPEEALALSEEYGM-
 
>Aps1_PYRTR
 
---KEYVVVWDYNIGLVRMTPFFKSC----KYSKTIPAKALRENP-GLKEISYSITGGALVCQGYWMPYHAAKAIAATFC-Y
 
>Aps3_PYRTR
 
--NRVA-VMRRRSDGWLNATQILKV----AGVDKGKRTKVLEKEILTG--EHEKVQGGYGKYQGTWINYRRGREFCRQYGV-
 
>Aps4_PYRTR
 
--NGNH-VMRRRADDWINATHILKV----ADYDKPARTRILEREVQKG--VHEKVQGGYGKYQGTWIPLEEGRHLAERNGV-
 
>Aps2_PYRTR
 
--KGVC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----TRHVVKIGPMHLKGVWIPFERALEFANKEKI-
 
>Aps4_SCHJA
 
---NPHFLMRMAKNSHISATSMFRSAFPKATPEEEEAEMSWIQQHLHP--VEEKQVS------GLWVSPEDALALAKDYHM-
 
>Aps2_SCHJA
 
LIKGVS-VMRRRHDSWLNATQILKV----ADFDKPQRTRILEKEVQKG--HHEKVQGGYGKYQGTWVPFKRGLELAVQFKV-
 
>Aps1_SCHJA
 
IVNGVA-VMKRCRDGWLNATQILKV----AELDKPKRTRVLEKFAQRG--IHEKVQGGYGKYQGTWVPLQRGVELAMEFQVH
 
>Aps3_SCHJA
 
---GKR-VLRRCSDSYVNLSHVLQL----IGSSPMQIARELDPIIAAG--DFENVDGRDAELNGVWVPLSRIGNICEKHGL-
 
>Aps5_SCHST
 
--NDSP-IMRRCKDDWVNATQILKC----CNFPKAKRTKILEKGVQQG--LHEKVQGGFGRFQGTWIPLPDAQRLATMYGV-
 
>Aps1_SCHST
 
---GVL-VSRREDTNFVNGTKLLNV----IGMTRGKRDGILKTEK-----TRNVVKVGSMNLKGVWIPFDRAFEIARNEGV-
 
>Aps2_SCHST
 
--NNVS-VVRRADNNMINGTKLLNV----AQMTRGRRDGILKSEK-----VRHVVKIGSMHLKGVWIPFERALAMAQREGI-
 
>Aps3_SCHST
 
VTSEGP-IMRRKSDSWINATHILKI----AKFPKAKRTRILEKDVQTG--VHEKVQGGYGKYQGTYVPLELGRDIAKNFGVF
 
>Aps4_SCHST
 
LDNTVV-ILRRVQDSYVNVTQLFGILLKLGHFNETQLNNFFNNEIVTN--IQLQ--GANTQLRGLWISYDRAVALALQFDIY
 
>Aps4_SCLSC
 
--NRIA-VMRRRKDSWLNATQILKV----AGIEKGKRTKVLEKEILIG--DHEKVQGGYGKYQGTWIRFERGVEFCKQYGV-
 
>Aps2_SCLSC
 
----SYFLMRRSSDGYISATGMFKATFPYAEAAEEEMERRYIKSLPTT--SVDETAG------NVWIPPHHALELAEEYQI-
 
>Aps1_SCLSC
 
--KGVC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----MRHVVKIGPMHLKGVWIPFERALDFANKEKI-
 
>Mbp1_SCLSC
 
-------VMRRRHDDWINATHILKA----AGFDKPARTRILEREVQKE--EHEKIQGGYGKYQGTWVPLEKGQALAQRNNIY
 
>Aps3_SCLSC
 
---KDYTVMWDYNVGLVRITPFFKCC----KYSKTTPAKMLGLNP-GLKEITHSITGGALAAQGYWMPYSCALAVCTTFCSH
 
>Aps1_SORMA
 
--NNVA-VMRRQKDGWVNATQILKV----ANIDKGRRTKILEKEIQIG--EHEKVQGGYGKYQGTWIPFERGLEVCRQYGV-
 
>Aps4_SORMA
 
-------VMRRRHDDWVNATHILKA----AGFDKPARTRILEREVQKD--THEKIQGGYGRYQGTWIPLEQAEALARRNNIY
 
>Aps3_SORMA
 
---GIC-VARREDNAMINGTKLLNV----AGMTRGRRDGILKSEK-----VRHVVKIGPMHLKGVWIPFERALDFANKEKI-
 
>Aps2_SORMA
 
---PSYFLMRRSQDGYISATGMFKATFPYASTEEEEAERKYIKSLPTT--SHEETAG------NVWIPPEQALILAEEYQI-
 
>Aps4_TALMA
 
---GEC-LMRRRADDWINATHILKV----AGFDKPSRTRILEREVQKG--VHEKVQGGYGKYQGTWIPLPEARLLAERNNI-
 
>Aps2_TALMA
 
--NGIA-VMKRRSDSWLNATQILKV----AGVVKAKRTKTLEKEIAAG--EHEKVQGGYGKYQGTWVSYQRGVELCREYQV-
 
>Aps3_TALMA
 
----TYFLMRRSKDGYISATGMFKIAFPWAKAEEEKTEREYVKSKTET--SIDETAG------NLWISPLLALELAKEYQM-
 
>Aps5_TALMA
 
--KGVC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----VRHVVKIGPMHLKGVWIPYERALDFANKEKI-
 
>Aps1_TALMA
 
---KTWTMMWDYNIGLVRTTHLFKCL----DYPKTTPAKMLNSNE-GLRDICHSITGGALAAQGYWMPFETAKAVAATFC-Y
 
>Aps5_TALST
 
--KGVC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----VRHVVKIGPMHLKGVWIPYERALDFANKEKI-
 
>Aps3_TALST
 
---GEC-LMRRRADDWINATHILKV----AGFDKPSRTRILEREVQKG--VHEKVQGGYGKYQGTWIPLPEARLLAERNNI-
 
>Aps2_TALST
 
--NGIA-VMKRRSDSWLNATQILKV----AGVVKAKRTKTLEKEIAAG--EHEKVQGGYGKYQGTWVSYQRGVELCREYQV-
 
>Aps1_TALST
 
-----WTIMWDYNIGLVRTTHLFKCL----DYPKTTPAKMLNANE-GLRDICHSITGGALAAQGYWMPFETAKAVAATFC-Y
 
>Aps4_TALST
 
----TYFLMRRSKDGYISATGMFKIAFPWAKAEEEKAEREYVKSKTET--SVDETAG------NLWISPMLALELAKEYQM-
 
>Aps3_TETBL
 
-----I-VMRRTKNDWINITQVFKL----ASFSKTKRTKILEKESIDI--EHEKVQGGYGRFQGTWIPLHYAKLLVNKYNI-
 
>Aps2_TETBL
 
-----------------KLVDGYRAF--LCRQYP--EHAEELRHV-PFASLLQRIRGGYIKIQGTWLPYEVSRQICTRFC--
 
>Aps1_TETBL
 
LHPTGS-IMKRKTDNWVNATHILKA----AHLPKAKRTRILERQILNN--NHEKVQGGFGKYQGTWIPLEDAVALAREFGVY
 
>Aps1_TETPH
 
IANGVV-VLRRADNHMVNGTKLLNV----TGMTRGRRDRMLRSEK-----ERHVVKVGLMHSKGVWIPLERARYLAEKTNI-
 
>Aps5_TETPH
 
-----I-VMRRKNNDWVNITQVLKL----ASFSKTKRTKIIEKESMNM--EHEKVQGGYGRFQGTWIPLSSTKELIEKYNI-
 
>Mbp1_TETPH
 
LHSTGS-VMKRKKDGWVNATHILKT----ANFAKAKRTRILEKEVIQE--THEKVQGGFGKYQGTWVPLSVAISLAQKFEVY
 
>Aps4_TETPH
 
---TKT-VMRKVSNDWVNATQIFKI----ANFTKNKRTRILEREAKLI--KHEKIQGGYGRFQGTWIPLDDAKMLVNKYEI-
 
>Aps3_TETPH
 
--------HWANVSNYLKLLIVFKNY--ILNGENDGVNTDKMQNL-SIYDLINRIRGGYIKIQGTWLPWIMAKEICKRFC--
 
>Aps2_TETPH
 
--NGIS-VVRRADNDMINGTKLLNV----TKMTRGRRDGILKAEK-----TRKVVKMGTLNLKGVWIPFDRAYCIARREKI-
 
>Mbp1_TETRE
 
IHPTGS-IMKRKIDGWVNATHILKA----AKFPKAKRTRILEKEVIHE--IHEKVQGGFGKYQGTWVPTDIATRLSKKFGVF
 
>Aps3_THITE
 
---GIC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----IRHVVKIGPMHLKGVWIPFERALDFANKEKI-
 
>Mbp1_THITE
 
-------VMRRREDNWINATHILKA----AGFDKPARTRILEREVQKE--AHRKIQGGYGKYQGTWISLEQGEVLARRNNVY
 
>Aps1_THITE
 
--NNVA-VMRRQHDSWLNATQILKV----AGVDKGRRTKILEKEIQTG--QHEKVQGGYGKYQGTWIPFERGVEVCRQYGV-
 
>Aps2_THITE
 
---PSYFLMRRSVDGFVSATGMFKATFPYATQEEEEAERKYIRSLSST--SPEETAG------NVWIPPEQALALAEDYKI-
 
>Aps3_TORDE
 
-----I-VMRRTADDWVNITQVFKI----AQFSKTQRTKVLEKESTDM--RHEKVQGGYGRFQGTWIPLENAKYMVSKYNI-
 
>Aps1_TORDE
 
IHPTGS-VMKRKTDDWVNATHILKA----AKFAKAKRTRILEKEVIKE--VHEKVQGGFGKYQGTWVPLDIATRLANKFDVY
 
>Aps2_TORDE
 
--NGVS-VVRRADNDMINGTKLLNV----AKITRGRRDGILKAER-----IRHVVKIGSMHLKGVWIPFERAHAMAQREKI-
 
>Aps4_TRIRU
 
----SYFLMRRSRDGHISASGMFKIAFPWAKHSEEADEREYLRTRPET--SEDEIAG------NVWISPELALELAREYGI-
 
>Aps3_TRIRU
 
--NGVA-MMRRRSDSWLNATQILKV----AGVAKARRTKTLEKEVAAG--EHEKVQGGYGKYQGTWVSYERGLELCRRYQV-
 
>Aps5_TRIRU
 
--KGVC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----IRHVVKIGPMHLKGVWIPFERALDFANKEKI-
 
>Aps1_TRIRU
 
-------VMRRRVDDWVNATHILKA----AGLDKPSRTRILERDVQRG--VHEKIQGGYGKYQGTWIPLAEARALADKNNV-
 
>Aps2_TRIRU
 
---KVYTVMWDYNIGLVRTTSLFRCN----NYSKTAPAKMLNANP-GLREICHSITGGALAAQGYWMPFEAAKAVAATFC--
 
>Aps3_TRIVE
 
--KGVC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----IRHVVKIGPMHLKGVWIPFERALDFANKEKI-
 
>Aps1_TRIVE
 
-------VMRRRVDDWVNATHILKA----AGLDKPSRTRILERDVQRG--VHEKIQGGYGKYQGTWIPLAEARALADKNNV-
 
>Aps2_TRIVE
 
--NGVA-MMRRRSDSWLNATQILKV----AGVAKARRTKTLEKEVAAG--EHEKVQGGYGKYQGTWVSYERGLELCRRYQV-
 
>Aps3_UNCRE
 
--KGVC-VARREDNHMVNGTKLLNV----AGMTRGRRDGILKSEK-----IRHVVKIGPMHLKGVWIPFERALEFANKEKI-
 
>Mbp1_UNCRE
 
-----S-VMRRRHDDWINATHILKV----AGLDKPSRTRILEREVQKG--THEKIQGGYGKYQGTWVPLPDGRHLAERNNV-
 
>Aps1_UNCRE
 
----TYFLMRRSKDGYVSATGMFKIAFPWAKQAEEKGEREYLRGHPNT--SSDETAG------NLWISPELALELAEEYKM-
 
>Aps2_UNCRE
 
--NGVA-VMRRRSDSWLNATQILKV----AGVVKARRTKTLEKEVASG--EHEKVQGGYGKYQGTWVSYQRGVELCRRYHV-
 
>Aps2_VANPO
 
VVNGIT-VLRRDDNNMINGTKLLNV----TKMTRGRRDRILRAEK-----IRHVVKIGSMHLKGVWIPLERAKRMAQMENIY
 
>Aps1_VANPO
 
--NGVS-VVRRADNDMINGTKLLNV----TKMTRGRRDGILKAEK-----IRHVVKVGSMNLKGVWIPFERALLMAKKEKI-
 
>Aps4_VANPO
 
IHPTGS-VMKRKLDNWVNATHILKA----ANFAKAKRTRILEKEVIKE--THEKVQGGFGKYQGTWVPLDIARKLAEKFGVH
 
>Aps3_VANPO
 
-----I-VMRRTSNDWINITQIFKL----ASFTKTKRTKVLEIESNNI--QHEKVQGGYGRFQGTWIPLNDAKNLVQKYNI-
 
>Aps6_VANPO
 
--------HWNNISNELKLLITFKDY--LRIKRN--LPESQLTNL-TIYDLIQRIRGGYIKIQGTWLPWEISRILCIRFC-Y
 
>Aps5_VANPO
 
-----T-VMRRTLDDWINITQVFKL----ASFSKTKRTKILEKETKSI--DHEKIQGGYGRFQGTWIPLICAKTIVIKYNI-
 
>Aps4_VERAL
 
---GIC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----LRHVVKIGPMHLKGVWIPFERALDFANKEKI-
 
>Aps3_VERAL
 
--VAEFMVMWDYNIGLVRMTPFFKCC----KYGKTVPAKMLSLNP-GLKDITHSITGGAILAQGYWMPYNCAKAVCATFC-Y
 
>Aps2_VERAL
 
---NSYFLMRRSHDGYVSATGMFKATYPYAEAHEEETERRYIKSLPST--SPEETAG------NVWIPPDHALSLAEEYGV-
 
>Aps5_VERAL
 
-------VMRRRQDNWINATHILKA----AGFDKPARTRILEREVQKE--KHEKVQGGYGKYQGTWIPLNQGQQLAQRNNCY
 
>Aps1_VERAL
 
---GVA-VMRRRNDSWLNATQILKV----AGVEKGKRTKILEKEIQTG--EHEKVQGGYGKYQGTWIKFERAVEVCRQYGV-
 
>Aps1_YARLI
 
---NNQWIIWDYHTGYVHLTGLWKAI----GNSKADIVKLIDNSP-DLEAVIRRVRGGYLKIQGTWVPYDIARALASRTCYF
 
>Aps5_YARLI
 
---GIC-VARREDNDMINGTKLLNV----AGMTRGRRDGILKGEK-----LRHVVKAGAMHLKGVWIPYDRALEFANKEKI-
 
>Aps2_YARLI
 
-CKNVA-VMRRKSDGWVNATHILKV----AGFDKPQRTRILEKEVQKG--VHEKVQGGYGKYQGTWVPLERAREIATLYDV-
 
>Aps4_YARLI
 
---GVC-VARREDNNMINGTKLLNV----VGMTRGRRDGILKTEK-----IRHVVKIGAMHLKGVWIPYERALAFAQRERI-
 
>Aps3_YARLI
 
MANDVA-VMRRRTDSSLNATQILKV----AGVEKSKRTKILEKEILTG--AHEKVQGGYGKYQGTWIPYERGVDLCRQYSVY
 
>Aps2_ZYGRO
 
-----I-VMRRTQDDWVNITQVFKI----AQFSKTQRTKVLEKESNDM--RHEKVQGGYGRFQGTWIPLEDAKYMVTKYNI-
 
>Aps1_ZYGRO
 
--NGVS-VVRRADNDMINGTKLLNV----AKITRGRRDGILKAER-----IRHVVKIGSMHLKGVWIPFERAQVMAEREKI-
 
>Mbp1_ZYGRO
 
IHPTGS-VMKRRDDDWVNATHILKA----ARFAKAKRTRILEKEVIKE--VHEKVQGGFGKYQGTWVPMDVARTLATKFGVH
 
>Aps1_ZYMTR
 
-------VMRRRSDDWINATHILKV----AQYDKPARTRILEREVQKG--VHEKVQGGYGKYQGTWIPLPDGRLLAQKNSV-
 
>Aps3_ZYMTR
 
-VHNVA-VMRRRSDGWLNATQILKV----AGVDKGKRTKVLEKEILPG--EHEKVQGGYGKYQGTWISYQRGREFCRQYGV-
 
>Aps4_ZYMTR
 
-----YFLMRRSSDGFISATGMFKAAFPYAQQEEELLEKDYIKSLPAA--SSEEVAG------NVWIDAHKALELADEYGI-
 
>Aps2_ZYMTR
 
--NGVC-VARREDNHMINGTKLLNV----AGMTRGRRDGILKSEK-----TRHVVKIGPMHLKGVWIPFDRALDFANKEKI-
 
  
</source>
+
* Access [http://www.ebi.ac.uk/intact/ '''IntAct'''] and enter the UniProt ID for yeast Mbp1 <tt>P39678</tt>.
</small>
+
* Click on the "Graph" tab to load a network graph.
 +
* Switch "Merge edges" '''off''' to show the reported edges for this interaction individually. Which protein pair has the most interactions? Does this make sense?
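
The question about which protein pair has the most reported interactions amounts to counting parallel edges in a multigraph. The following is only a minimal sketch of that idea in Python: the edge list is a made-up toy example (<tt>partnerA</tt> and <tt>partnerB</tt> are hypothetical placeholder names, not IntAct results), and the function name <tt>count_reported_edges</tt> is an assumption for illustration.

```python
from collections import Counter


def count_reported_edges(edges):
    """Count how often each unordered protein pair is reported.

    Each edge is a (protein, protein) tuple; (a, b) and (b, a) are
    treated as the same pair.
    """
    return Counter(tuple(sorted(pair)) for pair in edges)


# Toy edge list with placeholder partner names (not real IntAct data).
edges = [
    ("Mbp1", "partnerA"),
    ("partnerA", "Mbp1"),
    ("Mbp1", "partnerB"),
]

counts = count_reported_edges(edges)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

Sorting each pair before counting makes the count independent of which protein was listed first in a report.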
  
;... now we can calculate phylogenies.
+
But then what?
 
 
But wait: do we really want to work with all these sequences? 284 sequences is not too large, strictly speaking - I have run a maximum likelihood tree on my aging, wheezing laptop in a bit under an hour. You are welcome to try this for yourself. But for the purposes of the assignment, you may want to reduce the number of sequences to those from the six "reference species": (<tt>SACCE, CANAL, USTMA, NEUCR, ASPNI</tt> and <tt>SCHPO</tt>), plus the outgroup, plus YFO.
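
If you prefer not to copy the sequences by hand, the selection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the FASTA headers end in an underscore followed by the five-letter species code (as in <tt>Aps1_YARLI</tt>), the function names are my own, and the two sequences in the usage example are shortened toy strings, not real alignment rows. You would still need to add the codes for the outgroup and for YFO yourself.

```python
REFERENCE_SPECIES = {"SACCE", "CANAL", "USTMA", "NEUCR", "ASPNI", "SCHPO"}


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # the wiki listing separates records with blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def select_species(records, species_codes):
    """Keep the records whose header ends in _CODE for one of the codes."""
    return [(h, s) for h, s in records if h.rsplit("_", 1)[-1] in species_codes]


# Toy input: shortened placeholder sequences, for illustration only.
toy = """
>Mbp1_SACCE

--NGVS-VVRR

>Aps1_YARLI

---NNQWIIWD
"""

selected = select_species(parse_fasta(toy), REFERENCE_SPECIES)
print([header for header, _ in selected])
```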
 
 
 
 
 
{{task|1=
 
  
#Prepare a PHYLIP input file from a selection of the prepared sequences above. The simplest way to achieve this appears to be:
+
If you are like me, you would now like to be able to link expression profiles, information about known complexes, GO annotations, knock-out phenotypes etc. etc. Too bad.
##Copy the sequences you want into a text file. Make sure the "reference sequences", the outgroup and the sequences from YFO are all included.
 
##In a browser, navigate to the [http://www-bimas.cit.nih.gov/molbio/readseq/ '''Readseq sequence conversion service'''].
 
##Paste your sequences into the form and choose '''Phylip''' as the output format. Click on '''submit'''.
 
##Save the resulting page as a text file. Give it some useful name such as <code>APSES_domains.phy</code>.  
 
  
 
}}
 
}}
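
To get a feeling for what the conversion produces, here is a minimal Python sketch that writes aligned sequences in sequential PHYLIP layout: a first line with the number of sequences and the alignment length, then one line per sequence with the name padded to ten characters. This is an assumption-laden illustration, not a replacement for Readseq: the function name <tt>to_phylip</tt> is mine, Readseq may write an interleaved layout instead, and the sequences in the example are shortened toy strings.

```python
def to_phylip(records):
    """Format (name, aligned_sequence) pairs as sequential PHYLIP text."""
    if not records:
        raise ValueError("no sequences given")
    length = len(records[0][1])
    for name, seq in records:
        if len(seq) != length:
            raise ValueError(f"{name}: aligned sequences must have equal length")
        if len(name) > 10:
            raise ValueError(f"{name}: names are limited to 10 characters")
    # Header line: number of sequences, then number of alignment columns.
    lines = [f" {len(records)} {length}"]
    for name, seq in records:
        # Pad the name with blanks to exactly ten characters.
        lines.append(f"{name:<10}{seq}")
    return "\n".join(lines) + "\n"


# Toy alignment rows (shortened placeholders, for illustration only).
toy_records = [("Mbp1_SACCE", "--NGVS-VVRR"), ("Aps1_YARLI", "---NNQWIIWD")]
phylip_text = to_phylip(toy_records)
print(phylip_text)
```

The ten-character name field is the reason why the sequence names in this assignment follow the pattern of a four-character gene name, an underscore and a five-letter species code.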
  
==Calculating trees==
+
{{Vspace}}
 
 
In this section we perform the actual phylogenetic calculation.
 
  
  
 +
==Working with biological graphs in R==
  
 
{{task|1=
 
{{task|1=
  
#Download the PHYLIP package from the [http://evolution.genetics.washington.edu/phylip.html Phylip homepage] and install it on your computer.
+
* Open RStudio.
# Make a copy of your PHYLIP formatted sequence alignment file and name it <code>infile</code>. Note: make sure that your Microsoft Windows operating system does not silently append the extension ".txt" to your file. It should be called "infile", nothing else. Place this file into the directory where the PHYLIP executables reside on your computer.
+
* Choose File &rarr; Recent Projects &rarr; BCH441_2016.
#Run the '''proml''' program of PHYLIP (protein sequences, maximum likelihood tree) to calculate a phylogenetic tree (on the Mac, use proml.app). The program will automatically use "infile" for its input. Use the default parameters except that you should change option <code>S: Speedier but rougher analysis?</code> to <code>No, not rough</code> - your analysis should not sacrifice accuracy for speed. The calculation will take some fifteen minutes or so.
+
* Pull the latest version of the project repository from GitHub.
 
+
* type <tt>init()</tt>
 
+
* Open the file <tt>BCH441_A11.R</tt> and work through the entire tutorial.
The program produces two output files: the <code>outfile</code> contains a summary of the run, the likelihood of bifurcations, and '''an ASCII representation of the tree'''. Open it with your usual text editor to have a look, and save the file with a meaningful name. The <code>outtree</code> contains the resulting tree in so-called "Newick" format. Again, have a look and save it with a meaningful filename.
 
  
 +
* At the end of the tutorial, you are asked to print '''R''' code and data on a sheet of paper and bring it to class. I will mark this, and it is worth a maximum of 4 marks. Be careful to follow the instructions exactly, especially regarding how to use your student number as a randomization seed.
  
 
}}
 
}}
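
A quick way to check that a Newick file such as <code>outtree</code> contains the sequences you expect is to list its leaf names. The following Python sketch assumes unquoted leaf labels (such as <tt>Mbp1_SACCE</tt>) and uses a simple regular expression rather than a full Newick parser; the function name <tt>leaf_names</tt> and the toy tree are assumptions for illustration.

```python
import re


def leaf_names(newick):
    """Return the leaf labels of a Newick string, in order of appearance.

    A leaf label directly follows "(" or ","; labels of internal nodes
    follow ")" and branch lengths follow ":", so neither is captured.
    Quoted labels are not handled in this sketch.
    """
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


# Toy tree with a line break, as PHYLIP wraps long trees over several lines.
toy_tree = "((Mbp1_SACCE:0.1,Aps1_YARLI:0.2):0.3,\nAps2_ZYGRO:0.4);"
print(leaf_names(toy_tree))
```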
  
 +
;This is all that is required. There is optional material below that you may find interesting.
  
<!-- Bootstrapping ...
+
{{Vspace}}
* run seqboot
 
* rename outfile to infile
 
* rerun proml, use option M for multiple datasets with speedy option (use "jumble" of 1)
 
* rename outtree to intree
 
* run consense
 
* Use option R to define trees as rooted
 
 
 
Should run at least overnight.
 
-->
 
 
 
==Analysing your tree==
 
 
 
In order to analyse your tree, you need a species tree as reference. Then you can begin comparing your expectations with the observed tree.
 
 
 
 
 
===The species tree reference===
 
  
  
I have constructed a cladogram for many of the species we are analysing, based on data published for 1551 fungal ribosomal sequences. The six reference species are included. Such reference trees from rRNA data are a standard method of phylogenetic analysis, supported by the assumption that rRNA sequences are monophyletic and have evolved under comparable selective pressure in all species.
+
==Optional: Data visualization and analysis ==
  
[[Image:FungiCladogram.jpg|frame|none|Cladogram of many fungi studied in the assignments. This cladogram is based on small subunit ribosomal RNA sequences, and largely follows ''Tehler et al.'' (2003) ''Mycol Res.'' '''107''':901-916. Even though many details of fungal phylogeny remain unresolved, the branches shown here individually appear to have strong support. In a cladogram such as this, the branch lengths are not drawn to any scale of similarity.]]
+
{{Vspace}}
  
Your species may not be included in this cladogram, but you can easily calculate your own with the following procedure:
+
If you work a lot with interaction networks, sooner or later you will come across [http://www.cytoscape.org/ Cytoscape]. It is more or less the standard among "professional" systems biologists. But it is not an online tool.
  
 
{{task|1=
 
{{task|1=
#Access the [http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=retrieve&db=taxonomy NCBI taxonomy database Entrez query page].
+
* Navigate to the [http://www.cytoscape.org/ '''Cytoscape'''] homepage and find out what the program does and how to install it. Many tutorials are available online. But this is software that needs to be downloaded and installed, and it definitely has a learning curve.
#Edit the list of reference species below to include your species and paste it into the form.
 
  
"Aspergillus nidulans"[Scientific Name] OR
+
}}
"Candida albicans"[Scientific Name] OR
 
"Neurospora crassa"[Scientific Name] OR
 
"Saccharomyces cerevisiae"[Scientific Name] OR
 
"Schizosaccharomyces pombe"[Scientific Name] OR
 
"Ustilago maydis"[Scientific Name]
 
  
#Next, as '''Display Settings''' option, select '''Common Tree'''.
+
{{Vspace}}
  
You can use that tree as is - or visualize it more nicely as follows:
+
The state of integrated '''online''' interaction viewers these days could be improved. Have a look at this article that discusses the gap between what one would need to do, and what is offered:
 +
{{#pmid: 26077899}}
  
#Select the '''phylip tree''' option from the menu, and click '''save as''' to save the tree in Newick format.
 
#The output can be edited, and visualized in any program that reads Newick trees. One particularly nice viewer is the [http://itol.embl.de/ '''iTOL''' - Interactive Tree of Life project''']. Copy the contents of the <code>phyliptree.phy</code> file that the NCBI page has written, navigate to the iTOL project, click on '''Data Upload''', paste your tree and click '''Upload'''. Then '''go to the main display page''' to view the tree. Change the view from '''Circular''' to '''Normal'''.
 
}}
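
Editing the query by hand is easy to get wrong (a missing quote or a trailing <tt>OR</tt>). As a small convenience, the query text can be assembled from a list of species names. This Python sketch only builds the string in the form shown in the list of reference species; it does not contact NCBI, and the function name <tt>taxonomy_query</tt> is an assumption for illustration.

```python
def taxonomy_query(species_names):
    """Join species names into an Entrez taxonomy query, one term per line."""
    terms = [f'"{name}"[Scientific Name]' for name in species_names]
    return " OR\n".join(terms)


reference_species = [
    "Aspergillus nidulans",
    "Candida albicans",
    "Neurospora crassa",
    "Saccharomyces cerevisiae",
    "Schizosaccharomyces pombe",
    "Ustilago maydis",
]

# Append the scientific name of your own species to the list before printing.
query = taxonomy_query(reference_species)
print(query)
```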
 
  
;Alternatively ...
+
{{vspace}}
You can look up your species in the latest version of the species tree for the fungi:
 
{{#pmid: 22114356}}
 
  
===Visualizing the tree===
 
  
 
+
The online resource that comes out as the best is the one at the [http://string-db.org/ String database].
Once Phylip is done calculating the tree, the tree in a text format will be contained in the Phylip <code>outfile</code> - the documentation of what the program has done. Open this textfile for a first look. The tree is complicated and it can look confusing at first. The tree in Newick format is contained in the Phylip file <code>outtree</code>. Visualize it as follows:
 
  
 
{{task|1=
 
{{task|1=
  
 
+
* Navigate to the [http://string-db.org/ '''String database'''] and search for ''saccharomyces cerevisiae'' Mbp1 interactors.
#Open <code>outtree</code> in a texteditor and copy the tree.
+
* Visualize the network. Add a few proteins by clicking the ('''+''') button two or three times.
#Visualize the tree in alternative representations:
+
* Click on a node to get a synopsis of its function.
##I have already mentioned the [http://itol.embl.de/ '''iTOL''' - Interactive Tree of Life project'''] viewer.
+
* Explore the "confidence", "evidence" and "actions" networks for the retrieved interactors.
##Navigate to the [http://www.proweb.org/treeviewer/ Proweb treeviewer], paste and visualize your tree.
+
* Not all interacting proteins are also predicted to have a '''functional''' relationship with Mbp1. Do you agree?
##Navigate to the [http://www.trex.uqam.ca/index.php?action=newick&project=trex Trex-online Newick tree viewer] for an alternative view. Visualize the tree as a phylogram. You can increase the window height to keep the labels from overlapping.
+
* Explore the clustering and layout options. Do you understand what they do?
##In your Jalview window, choose '''File &rarr; Load associated Tree''' and load the Phylip <code>outtree</code> file. You can click into the tree-window to show which clades branch off at what level - it should be obvious that you can identify three major subclades (plus the outgroup). This view is particularly informative, since you can associate the clades of the tree with the actual sequences in the alignment, and get a good sense of what sequence features the tree is based on.
+
* Explore the '''Views''' on
##Look at the '''View''' options and note that you can sort the sequences by their position in the tree. Also note that you can flip the tree around a node by double-clicking on it. This is especially useful: try to rearrange the tree so that the two smaller main clades are next to each other. In particular note that it would take only a single rearrangement of the topology to join the smaller two of the three main clades.
+
:*Neighborhood (not relevant for our query though)
##Study the tree: understand what you see and what you would have expected.  
+
:*Fusion (also not relevant for our query)
 +
:*Occurrence
 +
:*Coexpression
 +
:*Experiments
 +
:*Database, and
 +
:*Textmining
 +
Each of these is a method for predicting functional relationships. Figure out how each one contributes to evidence of a functional interaction between Mbp1 and its predicted functional partners. I find the '''Occurrence view''' a unique and intriguing tool: visualizing in which organisms '''groups of genes''' are either all absent or all present allows one to quickly establish functional clusters.
  
 
}}
 
}}
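
The idea behind the Occurrence view, that genes which are present or absent together across organisms may belong to one functional cluster, can be sketched as a comparison of presence/absence profiles. The following Python example is only an illustration of that principle with made-up toy data; it is not how String computes its scores, and all names (<tt>cooccurrence_groups</tt>, <tt>geneA</tt>, <tt>org1</tt> etc.) are hypothetical.

```python
from collections import defaultdict


def cooccurrence_groups(presence, organisms):
    """Group genes that share an identical presence/absence profile.

    presence maps each gene name to the set of organisms in which it occurs.
    """
    groups = defaultdict(list)
    for gene in sorted(presence):
        # The profile is one True/False value per organism, in fixed order.
        profile = tuple(org in presence[gene] for org in organisms)
        groups[profile].append(gene)
    # Only profiles shared by two or more genes hint at a functional cluster.
    return [genes for genes in groups.values() if len(genes) > 1]


# Toy data: hypothetical genes and organisms, for illustration only.
organisms = ["org1", "org2", "org3"]
presence = {
    "geneA": {"org1", "org3"},
    "geneB": {"org1", "org3"},
    "geneC": {"org2"},
}

clusters = cooccurrence_groups(presence, organisms)
print(clusters)
```

Requiring identical profiles is the strictest possible criterion; a real analysis would tolerate some mismatches, since individual genes can be lost or missed by annotation.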
  
 +
In summary, String is a convincingly well built tool to explore functional relationships between proteins.
  
 +
{{vspace}}
  
Here are two principles that will help you make sense of the tree.
 
  
 +
<!--
  
A: '''A gene that is present in an ancestral species is inherited in all descendant species'''. The gene has to be observed in all OTUs, unless it has been lost (which is a rare event).
+
&nbsp;
 +
==Introductory reading==
 +
<section begin=reading />
 +
{{#pmid:20940177}}
 +
<section end=reading />
  
B: '''Paralogous genes in an ancestral species should give rise to monophyletic subtrees for each of the paralogues, in all descendants'''; this means: if the LCA of a branch has e.g. three genes, we would expect three copies of the species cladogram below this branchpoint, one for each of these genes. Each of these subtrees should recapitulate the reference phylogenetic tree of the species, up to the branchpoint of their LCA.
 
  
 +
&nbsp;
 +
==Contents==
 +
* Abstraction and standards
 +
* Databases
 +
* Confidence scores
 +
{{#pmid:22115179}}
  
With these two simple principles (you should draw them out on a piece of paper if they do not seem obvious to you), you can probably pry your tree apart quite nicely. A few colored pencils and a printout of the tree will help.
 
  
 +
&nbsp;
  
===The APSES domains of LCA===
+
==Further reading and resources==
 +
;Standards
 +
{{#pmid:21063946}}
 +
;Data
 +
{{#pmid:18823568}}
 +
{{#pmid:20221918}}
 +
{{#pmid:21863499}}
 +
{{#pmid:21877287}}
 +
{{#pmid: 21078182}}
 +
;Databases
 +
{{#pmid: 15173116}}
 +
{{#pmid: 21045058}}
 +
{{#pmid: 22611057}}
  
Note: A common confusion about cenancestral genes (LCA = Last Common Ancestor) arises from the fact that far from all expected genes are present in the OTUs. Some will have been lost, some will have been incorrectly annotated in their genome (frameshifts!) and not been found with PSI-BLAST, and some may have diverged beyond recognition. In general you have to ask: '''given the species represented in a subclade, what is the last common ancestor of that branch'''? The expectation is that '''all''' descendants of that ancestor should be represented in that branch '''unless''' one of the above reasons for a gene's absence applies.
 
  
  
{{task|1=
+
==Interaction prediction==
 +
Interologs for YFO...
  
  
* Consider how many APSES domain proteins the fungal cenancestor appears to have possessed and what evidence you see in the tree that this is so. Note that the hallmark of a clade that originated in the cenancestor is that it contains species from '''all''' subsequent major branches of the species tree.
+
&nbsp;
  
 +
==Visualizing Interactions==
  
}}
 
  
 +
'''[http://www.cytoscape.org/ Cytoscape]''' is a program originally written in Trey Ideker's lab at the [http://www.systemsbiology.org/ Institute for Systems Biology] that is now a thriving, open-source community project for the development of a biology-oriented network display and analysis tool.
  
  
===The APSES domains of YFO===
+
{{#pmid:21063955}}
  
Assume that the cladogram for fungi that I have given above is correct, and that the mixed gene tree you have calculated is fundamentally correct in its overall arrangement but may have local inaccuracies due to the limited resolution of the method. You have identified the APSES domain genes of the fungal cenancestor above. Apply the expectations we have stated above to  identify the sequence of duplications and/or gene loss in your organism through which YFO has ended up with the APSES domains it possesses today.
 
  
{{task|1=
+
Cytoscape is now [http://cytoscape.org/ available as '''version 3'''] and should be straightforward to download and install.
  
# Print the tree to a single sheet of paper.
+
<div class="reference-box">Cytoscape 3 tutorials <small>([http://opentutorials.cgl.ucsf.edu/index.php/Portal:Cytoscape3])</small>
# Mark the clades for the genes of the cenancestor.
+
* [http://opentutorials.cgl.ucsf.edu/index.php/Tutorial:Introduction_to_Cytoscape_3 Introduction to Cytoscape 3: User Interface]
# Label all subsequent branchpoints that affect the gene tree for YFO  with either '''"D"''' (for duplication) or '''"S"''' (for speciation). Remember that specific speciation events can appear more than once in a tree. Identify such events.
+
* [http://opentutorials.cgl.ucsf.edu/index.php/Tutorial:Introduction_to_Cytoscape_3.1-part2 Introduction to Cytoscape 3.1: Part 2 - importing networks]
# ;Bring this sheet with you to the quiz on Wednesday.
+
* [http://opentutorials.cgl.ucsf.edu/index.php/Tutorial:Introduction_to_Cytoscape_3-part3 Introduction to Cytoscape 3: Part 3 - Web import]
 +
* [http://opentutorials.cgl.ucsf.edu/index.php/Tutorial:Filtering_and_Editing_in_Cytoscape_3 Cytoscape 3: Filtering and editing]
 +
</div>
  
}}
 
  
 +
<div class="reference-box">Cytoscape tutorials <small>([http://wiki.cytoscape.org/Presentations/Basic])</small>
 +
* Browse over the [http://wiki.cytoscape.org/Presentations/03_Download_Data Cytoscape Downloading Data tutorial]
 +
* Work through the [http://irefindex.uio.no/wiki/iRefScape '''iRefScape''' &mdash; iRefIndex Cytoscape plugin tutorial:] Installation, data selection and use.
 +
* Work through the [http://wiki.cytoscape.org/Presentations/04_Expression_Data Cytoscape Basic expression analysis tutorial]
  
==Bonus: when did it happen?==
+
{{#pmid:20926419}}
 +
{{#pmid:21877285}}
 +
</div>
  
A very cool resource I have only just discovered is [http://www.timetree.org/ '''Timetree'''] - a tool that allows you to estimate divergence times between species. For example, the speciation event that separated the main branches of the fungi - i.e. the time when the fungal cenancestor lived - is given by the divergence time of ''Schizosaccharomyces pombe'' and ''Saccharomyces cerevisiae'': 761,000,000 years ago. For comparison, these two fungi are therefore approximately as related to each other as '''you''' are ...
+
<div class="reference-box">The [http://wiki.cytoscape.org/Welcome '''Cytoscape wiki''' and manual], and the [http://wiki.cytoscape.org/Cytoscape_User_Manual/Network_Formats Cytoscape manual page on '''network formats'''].</div>
 +
;Platform
 +
{{#pmid:14597658}}
 +
{{#pmid:17947979}}
 +
{{#pmid:19597788}}
 +
{{#pmid:21149340}}
 +
;Plugins
 +
{{#pmid:20122237}}
 +
{{#pmid:20926419}}
 +
{{#pmid:21473782}}
 +
{{#pmid:21975162}}
 +
{{#pmid:22070249}}
 +
</div>
  
A) to the rabbit?<br>
+
==Complex Analysis==
B) to the opossum?<br>
 
C) to the chicken?<br>
 
D) to the rainbow trout?<br>
 
E) to the warty sea squirt?<br>
 
F) to the bumblebee?<br>
 
G) to the earthworm?<br>
 
H) to the fly agaric?<br>
 
  
Check it out - the question will be on the quiz.
+
* https://www.bioconductor.org/packages/release/bioc/html/RCytoscape.html
  
  
  
  
 +
&nbsp;
  
 
;That is all.
 
;That is all.
Line 965: Line 221:
  
 
&nbsp;
 
&nbsp;
 +
 +
-->
  
 
== Links and resources ==
 
== Links and resources ==
  
;Literature
+
{{vspace}}
{{#pmid: 22114356}}
+
{{#pmid: 21527005}}
{{#pmid: 19190756}}
 
{{#pmid: 12801728}}
 
:* [http://evolution.genetics.washington.edu/phylip/phylip.html '''PHYLIP''' documentation]
 
{{PDF
 
|authors= Tuimala, Jarno
 
|year= 2006
 
|title= A primer to phylogenetic analysis using the PHYLIP package
 
|journal=
 
|volume=
 
|pages=
 
|URL= http://koti.mbnet.fi/tuimala/oppaat/phylip2.pdf
 
|doi=
 
|file= Tuimala_PHYLIP.pdf
 
|abstract= The purpose of this tutorial is to demonstrate how to use PHYLIP, a collection of phylogenetic analysis software, and some of the options that are available. This tutorial is not intended to be a course in phylogenetics, although some phylogenetic concepts will be discussed briefly. There are other books available which cover the theoretical sides of the phylogenetic analysis, but the actual data analysis work is less well covered. Here we will mostly deal with molecular sequence data analysis in the current PHYLIP version 3.66.
 
}}
 
  
 
+
<!--
;Software
+
{{#pmid: 18823568}}
:* [http://evolution.genetics.washington.edu/phylip.html '''PHYLIP''' home page]
+
{{#pmid: 22115179}}
:* [http://itol.embl.de/ '''iTOL''' - Interactive Tree of Life project''']
+
{{#pmid: 19957275}}
 
+
-->
;Sequences
 
:* [[Reference APSES domains|'''reference APSES domains page''']]
 
 
 
 
 
<!-- {{#pmid: 19957275}} -->
 
 
<!-- {{WWW|WWW_GMOD}} -->
 
<!-- {{WWW|WWW_GMOD}} -->
 
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
Line 1,002: Line 240:
 
&nbsp;
 
&nbsp;
 
{{#lst:BIO_Assignment_Week_1|assignment_footer}}
 
{{#lst:BIO_Assignment_Week_1|assignment_footer}}
 +
 +
<table style="width:100%;"><tr>
 +
<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_10|&lt;&nbsp;Assignment&nbsp;10]]</td>
 +
<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">&nbsp;</td>
 +
</tr></table>
  
  

Latest revision as of 04:12, 13 December 2016

Assignment for Week 11
Protein-Protein Interactions

< Assignment 10  

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.



 

Introduction

Task:

  • For a useful overview of graph-theory concepts you could additionally have a look at:
Pavlopoulos et al. (2011) Using graph theory to analyze biological networks. BioData Min 4:10. (pmid: 21527005)

Understanding complex systems often requires a bottom-up analysis towards a systems biology approach. The need to investigate a system, not only as individual components but as a whole, emerges. This can be done by examining the elementary constituents individually and then how these are connected. The myriad components of a system and their interactions are best characterized as networks and they are mainly represented as graphs where thousands of nodes are connected with thousands of vertices. In this article we demonstrate approaches, models and methods from the graph theory universe and we discuss ways in which they can be used to reveal hidden properties and features of a network. This network profiling combined with knowledge extraction will help us to better understand the biological significance of the system.

However, the concepts you need to know for this assignment should become clear from the notes.


 

Data Sources

Interaction databases face problems similar to those of sequence databases: the need for standards for abstracting biological concepts into computable objects, data integrity, search and retrieval, and the metrics of comparison. There is however an added complication: interactions are rarely all-or-none, and the high-throughput experimental methods have large false-positive and false-negative rates. This makes it necessary to define confidence scores for interactions. On top of experimental methods, there are also a variety of methods for computational interaction prediction. However, even though careful, small-scale laboratory experiments are the "gold standard", different curation efforts on the same experimental publication usually lead to different results, with as little as 42% overlap between databases being reported.
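To make the idea of "overlap between databases" concrete, here is a minimal Python sketch that compares two sets of curated interactions. The protein names and interaction lists are invented placeholders, and the Jaccard index is only one possible overlap measure; it is an assumption here, not necessarily the measure behind the 42% figure quoted above.

```python
def normalize(pairs):
    """Treat interactions as undirected: (A, B) and (B, A) are the same edge."""
    return {tuple(sorted(p)) for p in pairs}


def overlap(db_a, db_b):
    """Return (shared edges, union of edges, Jaccard index) for two edge lists."""
    a, b = normalize(db_a), normalize(db_b)
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, union, jaccard


# Hypothetical curation results from two databases (placeholder names only)
db_one = [("protA", "protB"), ("protC", "protB"), ("protA", "protD")]
db_two = [("protB", "protA"), ("protB", "protC"),
          ("protA", "protE"), ("protC", "protF")]

shared, union, jaccard = overlap(db_one, db_two)
print(f"{len(shared)} shared of {len(union)} total, Jaccard = {jaccard:.2f}")
```

Sorting each pair before comparison matters: without it, the same interaction reported in opposite orientations by the two sources would be counted as two different edges.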

Currently, the best integrated protein-protein interaction database is probably IntAct, at the EBI, which, besides curating interactions from the literature, hosts interactions from the IMEx consortium, an extensive data-sharing agreement between a number of general and specialized source databases.


 

Task:

  • Access IntAct and enter the UniProt ID for yeast Mbp1 P39678.
  • Click on the "Graph" tab to load a network graph.
  • Switch "Merge edges" off to show the reported edges for this interaction individually. Which protein pair has the most interactions? Does this make sense?
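The question of which protein pair has the most reported interactions can also be answered programmatically once the records are exported. The following Python sketch assumes a simple list of (interactor A, interactor B, detection method) tuples; the records and the tuple layout are made up for illustration and do not reflect IntAct's actual export format.

```python
from collections import Counter


def count_evidence(records):
    """Count how many separate reports support each undirected protein pair."""
    counts = Counter()
    for a, b, _method in records:
        counts[tuple(sorted((a, b)))] += 1
    return counts


# Made-up records: (interactor A, interactor B, detection method)
records = [
    ("P1", "P2", "two hybrid"),
    ("P2", "P1", "affinity purification"),
    ("P1", "P3", "two hybrid"),
    ("P2", "P1", "pull down"),
]

counts = count_evidence(records)
pair, n = counts.most_common(1)[0]
print(pair, n)
```

This corresponds to viewing the graph with "Merge edges" switched off: every record is its own edge, and the pair with the highest count is the one with the most independent reports.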

But then what?

If you are like me, you would now like to be able to link expression profiles, information about known complexes, GO annotations, knock-out phenotypes etc. etc. Too bad.


 


Working with biological graphs in R

Task:

  • Open RStudio.
  • Choose File → Recent Projects → BCH441_2016.
  • Pull the latest version of the project repository from GitHub.
  • Type init()
  • Open the file BCH441_A11.R and work through the entire tutorial.
  • At the end of the tutorial, you are asked to print R code and data on a sheet of paper and bring this to class. This will be marked by me and is worth a maximum of 4 marks. Be careful to follow the instructions exactly, especially regarding how to use your student number as a randomization seed.
This is all that is required. There is optional material below that you may find interesting.


 


Optional: Data visualization and analysis

 

If you work a lot with interaction networks, sooner or later you will come across Cytoscape. It is more or less the standard among "professional" systems biologists. But it is not an online tool.

Task:

  • Navigate to the Cytoscape homepage and find out what the program does and how to install it. Many tutorials are available online. But this is software that needs to be downloaded and installed, and it definitely has a learning curve.


 

The state of integrated online interaction viewers these days could be improved. Have a look at this article that discusses the gap between what one would need to do, and what is offered:

Jeanquartier et al. (2015) Integrated web visualizations for protein-protein interaction databases. BMC Bioinformatics 16:195. (pmid: 26077899)

BACKGROUND: Understanding living systems is crucial for curing diseases. To achieve this task we have to understand biological networks based on protein-protein interactions. Bioinformatics has come up with a great amount of databases and tools that support analysts in exploring protein-protein interactions on an integrated level for knowledge discovery. They provide predictions and correlations, indicate possibilities for future experimental research and fill the gaps to complete the picture of biochemical processes. There are numerous and huge databases of protein-protein interactions used to gain insights into answering some of the many questions of systems biology. Many computational resources integrate interaction data with additional information on molecular background. However, the vast number of diverse Bioinformatics resources poses an obstacle to the goal of understanding. We present a survey of databases that enable the visual analysis of protein networks. RESULTS: We selected M=10 out of N=53 resources supporting visualization, and we tested against the following set of criteria: interoperability, data integration, quantity of possible interactions, data visualization quality and data coverage. The study reveals differences in usability, visualization features and quality as well as the quantity of interactions. StringDB is the recommended first choice. CPDB presents a comprehensive dataset and IntAct lets the user change the network layout. A comprehensive comparison table is available via web. The supplementary table can be accessed on http://tinyurl.com/PPI-DB-Comparison-2015. CONCLUSIONS: Only some web resources featuring graph visualization can be successfully applied to interactive visual analysis of protein-protein interaction. Study results underline the necessity for further enhancements of visualization integration in biochemical analysis tools. 
Identified challenges are data comprehensiveness, confidence, interactive feature and visualization maturing.


 


The online resource that comes out as the best is the one at the String database.

Task:

  • Navigate to the String database and search for saccharomyces cerevisiae Mbp1 interactors.
  • Visualize the network. Add a few proteins by clicking the (+) button two or three times.
  • Click on a node to get a synopsis of its function.
  • Explore the "confidence", "evidence" and "actions" networks for the retrieved interactors.
  • Not all interacting proteins are also predicted to have a functional relationship with Mbp1. Do you agree?
  • Explore the clustering and layout options. Do you understand what they do?
  • Explore the Views on
  • Neighborhood (not relevant for our query though)
  • Fusion (also not relevant for our query)
  • Occurrence
  • Coexpression
  • Experiments
  • Database, and
  • Textmining

Each of these is a method for predicting functional relationships. Figure out how each one contributes to evidence of a functional interaction between Mbp1 and its predicted functional partners. I find the Occurrence view a unique and intriguing tool: visualizing in which organisms groups of genes are either all absent or all present allows you to quickly establish functional clusters.
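The reasoning behind the Occurrence view can be sketched in a few lines of Python: genes that are present and absent in exactly the same organisms share a profile and are candidates for a functional cluster. The gene names, organism names and presence data below are hypothetical, and exact profile matching is a simplifying assumption; this is not a description of how String itself computes co-occurrence evidence.

```python
from collections import defaultdict


def group_by_profile(presence, organisms):
    """Group genes that share the same presence/absence pattern across organisms."""
    groups = defaultdict(list)
    for gene, present_in in presence.items():
        # One boolean per organism, in a fixed order, forms the profile
        profile = tuple(org in present_in for org in organisms)
        groups[profile].append(gene)
    return {profile: sorted(genes) for profile, genes in groups.items()}


# Hypothetical data: the set of organisms in which each gene is found
organisms = ["org1", "org2", "org3", "org4"]
presence = {
    "geneA": {"org1", "org2", "org4"},
    "geneB": {"org1", "org2", "org4"},
    "geneC": {"org1", "org3"},
    "geneD": {"org2", "org1", "org4"},
}

groups = group_by_profile(presence, organisms)
for profile, genes in groups.items():
    pattern = "".join("+" if present else "-" for present in profile)
    print(pattern, genes)
```

In practice, profiles rarely match exactly because of gene loss and annotation errors, so a real analysis would more likely compare profiles with a similarity measure and a threshold rather than require identity.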

In summary, String is a well-built tool for exploring functional relationships between proteins.


 


Links and resources

 
Pavlopoulos et al. (2011) Using graph theory to analyze biological networks. BioData Min 4:10. (pmid: 21527005)

Understanding complex systems often requires a bottom-up analysis towards a systems biology approach. The need to investigate a system, not only as individual components but as a whole, emerges. This can be done by examining the elementary constituents individually and then how these are connected. The myriad components of a system and their interactions are best characterized as networks and they are mainly represented as graphs where thousands of nodes are connected with thousands of vertices. In this article we demonstrate approaches, models and methods from the graph theory universe and we discuss ways in which they can be used to reveal hidden properties and features of a network. This network profiling combined with knowledge extraction will help us to better understand the biological significance of the system.


 


Footnotes and references


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.



< Assignment 10