Reference APSES domains (reference species)

Revision as of 19:34, 14 October 2008 by Boris (talk | contribs)


Multi-FASTA file of all APSES domains in fungal proteins.

Executing the PSI-BLAST search

The starting point of this list is a BLAST search with one known APSES domain sequence. This query sequence - the Mbp1 APSES domain - was defined as follows, based on Pfam profile 02292: APSES.

>Yeast Mbp1 APSES domain (AA 24..102 of NP_010227)
SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKY
QGTWVPLNIAKQLAEKFSVYDQLKPLFDF
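
As a quick sanity check, the wrapped query record can be joined into one string and its length compared with the residue range given in the header. This is a minimal sketch; the function name `read_single_fasta` is an illustrative choice, not part of the original workflow.

```python
# Query record as given above; the sequence is wrapped over two lines.
QUERY_FASTA = """>Yeast Mbp1 APSES domain (AA 24..102 of NP_010227)
SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKY
QGTWVPLNIAKQLAEKFSVYDQLKPLFDF"""


def read_single_fasta(text):
    """Return (header, sequence) for one FASTA record, joining wrapped lines."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    return lines[0][1:], "".join(lines[1:])


header, seq = read_single_fasta(QUERY_FASTA)
start, end = 24, 102            # residue range stated in the header
expected_len = end - start + 1  # coordinates are inclusive
print(len(seq), expected_len)
```

Both numbers should agree if the record was copied without losing residues.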

A PSI-BLAST search was run against the nr subset of GenPept without further restrictions (Oct. 2007). The default PSI-BLAST parameters were used, except that the BLOSUM45 matrix was selected and the E-value cutoff was lowered from 10.0 to 1.0.
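
The search described here was presumably run through the NCBI web interface. For readers who want to try similar settings locally, the sketch below assembles an argument list for the `psiblast` program of the BLAST+ suite. It assumes a local installation and a locally available database; the file names and the helper `psiblast_args` are hypothetical, and the 0.005 inclusion threshold is the value quoted in the next paragraph. The list is only built and printed, not executed.

```python
def psiblast_args(query_path, db="nr", out_path="apses_psiblast.txt"):
    """Build a psiblast command line mirroring the settings in the text.

    Assumptions: BLAST+ is installed and `db` names a local database.
    """
    return [
        "psiblast",
        "-query", query_path,
        "-db", db,
        "-matrix", "BLOSUM45",          # non-default scoring matrix
        "-evalue", "1.0",               # report cutoff lowered from 10.0
        "-inclusion_ethresh", "0.005",  # threshold for inclusion in the profile
        "-num_iterations", "0",         # 0 = iterate until convergence
        "-out", out_path,
    ]


args = psiblast_args("mbp1_apses.fa")  # hypothetical query file name
print(" ".join(args))
```

The list could be passed to `subprocess.run(args, check=True)`. Results from a current database will differ from the October 2007 search.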

The search converged after 6 iterations, i.e. PSI-BLAST found no additional hits with an E-value better than the inclusion threshold of 0.005. 164 sequences were found and contributed to the profile. However, some of these sequences are redundant, i.e. the same amino acid sequence appears in different database entries, and some come from organisms other than the ones we are considering in the assignment. Even though these latter sequences are removed later, it was appropriate to include them initially: they contribute information to the PSI-BLAST search profile and improve the sensitivity and specificity of the search.

It would certainly be possible, albeit somewhat tedious, to edit the list of proteins manually by checking or unchecking which hits to include. I have written a short Perl script to automate this task and to rename the sequences at the same time. Renaming is not required and does not add information; RefSeq / GenPept accession numbers are sufficient to name the sequences uniquely. However, the final analysis of sequence alignment or phylogeny results is much easier if the sequence labels tell us something about the organism each sequence came from and which other sequences it might be similar to.

After removing redundant sequences, sequence fragments that did not span the entire Mbp1 APSES domain, and sequences from fungi that are not in the list of organisms for this course, 69 sequences remained for analysis.
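
The three filters described above can be sketched in a few lines. This is not the Perl script mentioned earlier, only an illustration in Python. The record fields (`acc`, `organism`, `seq`, `q_start`, `q_end`), the coverage threshold and the toy records are all made-up placeholders; how "spanning the entire domain" was actually decided is not stated in the text.

```python
def filter_hits(hits, course_organisms, min_query_coverage):
    """Keep hits from course organisms that cover enough of the query
    and whose sequence has not been seen before."""
    seen = set()
    kept = []
    for hit in hits:
        if hit["organism"] not in course_organisms:
            continue  # organism not on the course list
        covered = hit["q_end"] - hit["q_start"] + 1
        if covered < min_query_coverage:
            continue  # fragment: too little of the query domain is covered
        if hit["seq"] in seen:
            continue  # identical sequence under another accession
        seen.add(hit["seq"])
        kept.append(hit)
    return kept


# Toy records, invented purely to exercise the three filters.
toy_hits = [
    {"acc": "ACC_1", "organism": "SACCE", "seq": "AAAA", "q_start": 1, "q_end": 79},
    {"acc": "ACC_2", "organism": "SACCE", "seq": "AAAA", "q_start": 1, "q_end": 79},
    {"acc": "ACC_3", "organism": "OTHER", "seq": "CCCC", "q_start": 1, "q_end": 79},
    {"acc": "ACC_4", "organism": "ASHGO", "seq": "GGGG", "q_start": 30, "q_end": 79},
]
kept = filter_hits(toy_hits, {"SACCE", "ASHGO"}, min_query_coverage=70)
print([h["acc"] for h in kept])
```

Only `ACC_1` survives here: `ACC_2` is a duplicate, `ACC_3` is from an unlisted organism, and `ACC_4` covers only 50 query positions.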


Constructing the multi-FASTA file

A multi-FASTA file is the default input format for many MSA programs; it is simply a file that contains more than one FASTA-formatted sequence.
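
To make the format concrete, here is a minimal sketch of a multi-FASTA reader: every line starting with ">" opens a new record, and all following lines up to the next ">" belong to its sequence. The function name `parse_multi_fasta` is an illustrative choice; the two sample records are taken from the final list on this page.

```python
def parse_multi_fasta(text):
    """Return a list of (header, sequence) tuples from multi-FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = """>Mbp1_SACCE (79  ids)  NP_010227    (024..102)
SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
>Mbp1_ASHGO (66  ids)  NP_986147    (031..109)
SIMKRKADDWVNATHILKAAKFAKAKRTRILEKEVIKDTHEKVQGGFGKYQGTWVPLDIARRLAQKFEVLEELRPLFDF
"""
records = parse_multi_fasta(sample)
for header, seq in records:
    print(header.split()[0], len(seq))
```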

The PSI-BLAST search has already defined the sequences from each source protein that are similar to the APSES search profile. We only need to extract them in a convenient way from the search results. NCBI offers a number of options to format the result page. They are available from a link at the top of the BLAST results page, "Reformat these Results". The principal format options are:

  • Pairwise: the default
  • Pairwise with identities: shows only differences to the query sequence
  • query anchored with/without identities: looks something like a multiple sequence alignment, with hyphens for gaps; insertions relative to the query are displayed below the sequence
  • flat-query anchored with/without identities: this looks like a multiple sequence alignment (in fact it is one: all sequences are aligned to the profile)
  • hit-table: gives only the numerical parameters describing the quality of the matches

When we select the flat-query anchored with/without identities option, it is reasonably straightforward to obtain the aligned sequences, copy and paste them into a Word document, and convert them into multi-FASTA format with a few Edit > Replace commands.
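
The same conversion can also be scripted instead of done by hand. The sketch below assumes a simplified layout for the pasted alignment: each row has four whitespace-separated fields (identifier, start position, sequence fragment, end position), long alignments are split into several blocks, and residues are shown as letters rather than dots. The real BLAST output may differ in detail, so treat the layout and the toy input (invented identifiers and residues) as assumptions.

```python
def flat_alignment_to_fasta(text):
    """Concatenate the fragments of each row identifier, drop gap
    characters, and return the result as multi-FASTA text."""
    order, fragments = [], {}
    for line in text.splitlines():
        parts = line.split()
        # Accept only rows of the assumed form: id, start, fragment, end.
        if len(parts) != 4 or not (parts[1].isdigit() and parts[3].isdigit()):
            continue
        name, fragment = parts[0], parts[2]
        if name not in fragments:
            order.append(name)
            fragments[name] = []
        fragments[name].append(fragment.replace("-", ""))
    return "".join(f">{name}\n{''.join(fragments[name])}\n" for name in order)


# Toy alignment in the assumed layout (two blocks, two rows each).
toy_alignment = """
Query_1  1   SIMKRK-KDD  9
hit_A    5   SIMRRKADDD  14

Query_1  10  WVNAT  14
hit_A    15  W-NAT  18
"""
fasta_text = flat_alignment_to_fasta(toy_alignment)
print(fasta_text)
```

Removing the hyphens yields unaligned sequences, which is what an MSA program expects as input. Keep them if the PSI-BLAST alignment itself is to be preserved.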

Renaming sequences

To make the interpretation of alignments and gene trees easier, the Mbp1 orthologues of all species were labelled Mbp1_???? (e.g. Mbp1_ASPFU). All yeast sequences were labelled with their gene name (e.g. Sok2_SACCE). All other sequences were named after the yeast gene with which they share the most identities, with the last digit replaced by A, B, C as required (e.g. SokA_ASHGO). Note that relabelling the sequences does not change the data or its interpretation; it is merely helpful. Finally, the sequences were sorted so that the Mbp1 orthologues come first in the list, followed by all other sequences sorted by organism.
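
The labelling and sorting rules can be expressed compactly. This is a sketch of the convention as described above, not the script that was actually used; the helper names `paralogue_label` and `sort_labels` are hypothetical.

```python
def paralogue_label(yeast_gene, index, organism):
    """Name a non-orthologous sequence after the most similar yeast gene,
    replacing the final digit with A, B, C, ... (index 0 -> A)."""
    letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[index]
    return f"{yeast_gene[:-1]}{letter}_{organism}"


def sort_labels(labels):
    """Put Mbp1 orthologues first, then order by organism code.

    Python's sort is stable, so labels that tie keep their input order.
    """
    def key(label):
        organism = label.split("_")[1]
        return (0 if label.startswith("Mbp1_") else 1, organism)
    return sorted(labels, key=key)


print(paralogue_label("Sok2", 0, "ASHGO"))
print(sort_labels(["Sok2_SACCE", "Mbp1_ASPFU", "SokA_ASHGO", "Mbp1_ASHGO"]))
```

Note that in the list below the query, Mbp1_SACCE, is placed first, ahead of the alphabetical order; this sketch does not reproduce that special case.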

The final 69 sequences

>Mbp1_SACCE (79  ids)  NP_010227    (024..102)
SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
>Mbp1_ASHGO (66  ids)  NP_986147    (031..109)
SIMKRKADDWVNATHILKAAKFAKAKRTRILEKEVIKDTHEKVQGGFGKYQGTWVPLDIARRLAQKFEVLEELRPLFDF
>Mbp1_ASPFU (49  ids)  XP_754232    (001..077)
MRRRGDDWINATHILKVAGFDKPARTRILEREVQKGTHEKVQGGYGKYQGTWIPLHEGRLLAERNNIIDKLRPIFDY
>Mbp1_ASPNI (50  ids)  XP_660758    (028..106)
SVMRRRSDDWINATHILKVAGFDKPARTRILEREVQKGVHEKVQGGYGKYQGTWIPLQEGRQLAERNNILDKLLPIFDY
>Mbp1_ASPTE (49  ids)  XP_001213217 (028..106)
SVMRRRADDWINATHILKVAGFDKPARTRILEREVQKGVHEKVQGGYGKYQGTWIPLPEGRLLAERNNIIDKLRPIFDY
>Mbp1_CANAL (53  ids)  XP_723071    (026..103)
IMRRKKDSWINATHILKIAKFPKAKRTRILEKDVQTGIHEKVQGGYGKYQGTYVPLDLGAAIARNFGVYDVLKPIFEF
>Mbp1_CANGL (71  ids)  XP_445458    (024..102)
SIMKRKNDGWVNATHILKAANFAKAKRTRILEKEVLKEMHEKVQGGFGKYQGTWVPLNIAINLAEKFDVYQDLKPLFDF
>Mbp1_COPCI (43  ids)  EAU84310     (025..103)
AVMRRRSDSWLNATQILKVAGFDKPQRTRVLEREVQKGEHEKVQGGYGKYQGTWIPLERGMQLAKQYNCEHLLRPIIEF
>Mbp1_CRYNE (47  ids)  XP_570545    (133..211)
SVMRRASDSWVNATQILKVAGVHKSARTKILEKEVLNGIHEKIQGGYGKYQGTWVPLDRGRDLAEQYGVGSYLSSVFDF
>Mbp1_DEBHA (50  ids)  XP_458784    (027..104)
IMRRKLDSWINATHILKIAKFPKAKRTRILEKDVQTGVHEKVQGGYGKYQGTYVPLDLGADIAKNFGVFDSLRPIFEF
>Mbp1_GIBZE (48  ids)  XP_390560    (040..117)
VMRRRSDDWINATHILKAAGFDKPARTRILERDVQKDVHEKIQGGYGKYQGTWIPLESGQALAERHSVIDRLRPIFEY
>Mbp1_KLULA (64  ids)  XP_454189    (025..103)
SIMKRKADNWVNATHILKAAKFPKAKRTRILEKEVITDTHEKVQGGFGKYQGTWIPLELASKLAEKFEVLDELKPLFDF
>Mbp1_MAGGR (48  ids)  XP_362974    (040..117)
VMRRRVDDWINATHILKAAGFDKPARTRILEREVQKDQHEKVQGGYGKYQGTWIPLEAGEALAHRNNIFDRLRPIFEF
>Mbp1_NEUCR (50  ids)  XP_955821    (037..114)
VMRRRHDDWVNATHILKAAGFDKPARTRILEREVQKDTHEKIQGGYGRYQGTWIPLEQAEALARRNNIYERLKPIFEF
>Mbp1_PICST (52  ids)  XP_001386821 (026..103)
IMRRKLDSWINATHILKIAKFPKAKRTRILEKDVQTGVHEKVQGGYGKYQGTYVPLELGRDIAKNFGVFDILKPIFDF
>Mbp1_SCHPO (43  ids)  NP_595496    (027..103)
MKRCHDNWLNATQILKIAELDKPRRTRILEKFAQKGLHEKIQGGCGKYQGTWVPSERAVELAHEYNVFDLIQPLIEY
>Mbp1_USTMA (41  ids)  XP_762343    (026..104)
AVMRRRSDDWLNATQILKVVGLDKPQRTRVLEREIQKGIHEKVQGGYGKYQGTWIPLDVAIELAERYNIQGLLQPITSY
>Mbp1_YARLI (49  ids)  XP_500257    (022..100)
AVMRRKSDGWVNATHILKVAGFDKPQRTRILEKEVQKGVHEKVQGGYGKYQGTWVPLERAREIATLYDVDSHLAPIFNY
>Swi4_ASHGO (58  ids)  NP_986370    (043..115)
VMRRLHDDWVNITQVFKVATFSKTQRTKILEKESADISHEKIQGGYGRFQGTWIPLDSAKGLVAKYEITDIVV
>Sok2_ASHGO (67  ids)  NP_983001    (352..425)
SVVRRADNDMINGTKLLNVAKMTRGRRDGILKAEKVRHVVKIGSMHLKGVWIPFERALALAQREKIVDMLFPLF
>MbpB_ASPFU (22  ids)  XP_751244    (151..225)
VMWDYNIGLVRTTHLFKCNDYSKMLNANPGLREICHSITGGALAAQGYWMPYEAAKAVAATFCWKIRHALTPLFG
>MbpA_ASPFU (40  ids)  XP_748947    (105..183)
AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVELCREYHVEELLRPLLEY
>Sok2_ASPFU (58  ids)  XP_755125    (152..224)
VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPFERALEFANKEKITDLLYPLF
>MbpB_ASPNI (19  ids)  XP_001392970 (124..203)
ISWDYNVGLVLTRSLFKCNGHPKTAPAKVLKMNPGLGDISHSITGGALVGQGYWMPFRAAKALATTFCWNIRFVLTPMFG
>SokB_ASPNI (21  ids)  XP_663009    (131..211)
TVMWDYNIGLVRTTHLFKCNDYSKTTPAKMLNQNPGLRDICHSITGGALAAQGYWMPYEAAKAIAATFCWKIRFALTPLFG
>MbpA_ASPNI (40  ids)  XP_001391313 (118..196)
AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVELCREYHVEELLRPLLEY
>SokA_ASPNI (56  ids)  XP_663440    (152..224)
VARREDNGMINGTKLLNVAGMTRGRRDGILKSEKVRNVVKIGPMHLKGVWIPFDRALEFANKEKITDLLYPLF
>Sok2_ASPNI (58  ids)  XP_001390623 (153..225)
VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPFERALEFANKEKITDLLYPLF
>MbpB_ASPTE (21  ids)  XP_001212599 (130..212)
IMWDYNIGLVRTTPLFRSQNYSKTTPAKVLDANPGLREISHSITGGAIVAQDKPGYWIPFEAAKAVAATFCWRIRYALTPIFG
>MbpA_ASPTE (40  ids)  XP_001215548 (007..085)
AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVDLCREYHVEELLRPLLEY
>Sok2_ASPTE (59  ids)  XP_001218256 (139..211)
VARREDNSMINGTKLLNVAGMTRGRRDGILKSEKIRHVVKIGPMHLKGVWIPFERALEFANKEKITDLLYPLF
>MbpC_CANAL (22  ids)  XP_723412    (087..178)
VLRRVQDSFVNVTQLFQILIKLEVLPTSQVDNYFDNEILSNLKYFGSSSNTPQYLDLRKHQNIYLQGIWIPYDKAVNLALKFDIYEITKKLF
>MbpB_CANAL (25  ids)  XP_710918    (256..346)
VIWDYETGWVHLTGIWKASLTIDGSNVSPSHLKADIVKLLESTPKEYQQYIKRIRGGFLKIQGTWLPYKLCKILARRFCYYLRYSLIPIFG
>MbpA_CANAL (48  ids)  XP_712970    (006..082)
SIMRRCKDDWVNATQILKCCNFPKAKRTKILEKGVQQGLHEKVQGGFGRFQGTWIPLEDARKLAKTYGVTEELAPVL
>Sok2_CANAL (49  ids)  XP_711513    (469..541)
VSRREDTNYINGTKLLNVIGMTRGKRDGILKTEKIKNVVKVGSMNLKGVWIPFDRAYEIARNEGVDSLLYPLF
>Phd1_CANAL (65  ids)  XP_714237    (228..301)
SVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREQIVDMLYPLF
>SokA_CANGL (56  ids)  XP_449680    (143..216)
TVVRRADNDMVNGTKLLNVTGMTRGRRDGILKNEPVRDVVKGGPMTLKGVWIPIDRARAIARQEGIEQWLYPLF
>Swi4_CANGL (61  ids)  XP_444966    (062..140)
VMRRTMDDWVNVTQVFKIAQFSKTQRTKILEKESTNMKHEKVQGGYGRFQGTWVPLEAAKFMTTKYNIDNPVVNTILSF
>Sok2_CANGL (64  ids)  XP_448847    (224..297)
SVVRRADNDMINGTKLLNVTKMTRGKRDGILRSEKYRKVVKIGSMHLKGVWIPFERALFIAKREKIVDLLYPLF
>MbpA_COPCI (26  ids)  EAU85126     (059..139)
IMMDIDDGYILWTGIWKALGNSKADIVKMIDSQPDLAPLIRRVRGGYLKIQGTWMPYEVALKLSRRVAWPIRHDLVPLFGF
>MbpA_CRYNE (42  ids)  XP_569090    (036..114)
AVMRRRSDAYLNATQILKVAGFDKPQRTRVLEREVQKGEHEKVQGGYGKYQGTWIPIERGLALAKQYGVEDILRPIIDY
>MbpB_DEBHA (26  ids)  XP_459773    (187..275)
IIWDYETGFVHLTGIWKASINDEVNTHRNLKADIVKLLESTPKQYHQHIKRIRGGFLKIQGTWLPFDLCKMLAKRFCYHIRFQLIPIFG
>Swi4_DEBHA (26  ids)  XP_459901    (067..158)
ILRRVQDSYINISQLFSILLKIGHLSEAQLTNFLNNEILTNTQYLSSGGSNPQFNDLRNHEVRDLRGLWIPYDRAVSLALKFDIYELAKSLF
>MbpA_DEBHA (45  ids)  XP_457246    (028..103)
IMRRCKDDWVNATQILKCCNFPKAKRTKILEKGVQQGLHEKIQGGYGRFQGTWIPLADAQRLAASYGVTPDLAPVL
>SokA_DEBHA (50  ids)  XP_460447    (213..285)
VSRREDTNYVNGTKLLNVAGMTRGKRDGILKTEKTKSVVKVGAMNLKGVWIPFERASEIARNEGIDGLLYPLF
>Sok2_DEBHA (64  ids)  XP_459785    (307..380)
SVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREGIVDLLYPLF
>MbpB_GIBZE (21  ids)  XP_389978    (139..219)
AVMWDYNIGLVRMTPFFKCRGYGKTIPAKMLGLNPGLKEITHSITGGSIAAQGYWMPYRCAKAICATFCHPIAGALIPIFG
>MbpA_GIBZE (39  ids)  XP_384396    (045..123)
AVMRRRNDSWLNATQILKVAGVDKGKRTKILEKEIQTGEHEKVQGGYGKYQGTWIKFERGLQVCRQYGVEELLRPLLTY
>Sok2_GIBZE (55  ids)  XP_390305    (226..298)
VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPYDRALDFANKEKITELLYPLF
>Swi4_KLULA (50  ids)  XP_454890    (119..197)
IMRRCNDNWLNITQVFKAGSFTKAQRTKILEKEANEIKHEKIQGGYGRFQGTWIPWESTKYLVEKYNINNKVVKRIVEF
>Sok2_KLULA (67  ids)  XP_455299    (386..459)
SVVRRADNDMINGTKLLNVTRMTRGRRDGILKAEKIRHVVKIGSMHLKGVWIPFERALVMAQREKIVDLLYALF
>MbpB_MAGGR (20  ids)  XP_369301    (096..176)
TVMWDYGCGLVRMTHFFKCRGYTKTVPGKVLNQNHGLKDITYSITGGSISAQGYWMPFACARAVCATFCHPIAGALIPIFG
>MbpA_MAGGR (39  ids)  XP_365024    (131..209)
AVMKRIGDSKLNATQILKVAGVEKGKRTKILEKEIQTGEHEKVQGGYGKYQGTWIKYERALEVCRQYGVEELLRPLLEY
>Sok2_MAGGR (57  ids)  XP_368552    (133..205)
VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKMRHVVKIGPMHLKGVWIPFERALDFANKEKITELLYPLF
>MbpA_NEUCR (40  ids)  XP_962967    (071..147)
AVMRRQKDGWVNATQILKVANIDKGRRTKILEKEIQIGEHEKVQGGYGKYQGTWIPFERGLEVCRQYGVEELLSKLL
>MbpA_PICST (46  ids)  XP_001383745 (006..081)
IMRRCKDDWVNATQILKCCNFPKAKRTKILEKGVQQGLHEKVQGGFGRFQGTWIPLPDAQRLATMYGVTADAAPVL
>SokA_PICST (49  ids)  XP_001385235 (239..311)
VSRREDTNFVNGTKLLNVIGMTRGKRDGILKTEKTRNVVKVGSMNLKGVWIPFDRAFEIARNEGVDEALHPLF
>Sok2_PICST (64  ids)  XP_001383609 (194..267)
SVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREGIVDLLYPLF
>Sok2_SACCE (74  ids)  EDN64408     (435..508)
SVVRRADNDMVNGTKLLNVTKMTRGRRDGILKAEKIRHVVKIGSMHLKGVWIPFERALAIAQREKIADYLYPLF
>Phd1_SACCE (74  ids)  NP_012881    (208..281)
SVVRRADNNMINGTKLLNVTKMTRGRRDGILRSEKVREVVKIGSMHLKGVWIPFERAYILAQREQILDHLYPLF
>Swi4_SACCE (79  ids)  EDN63086     (060..138)
VMRRTKDDWINITQVFKIAQFSKTKRTKILEKESNDMQHEKVQGGYGRFQGTWIPLDSAKFLVNKYEIIDPVVNSILTF
>MbpB_SCHPO (21  ids)  NP_596132    (088..164)
LRRCPDSYFNISQILRLAGTSSSENAKELDDIIESGDYENVDSKHPQIDGVWVPYDRAISIAKRYGVYEILQPLISF
>MbpA_SCHPO (41  ids)  NP_593032    (027..104)
SVMRRRRDSWLNATQILKVADFDKPQRTRVLERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPILS
>MbpA_USTMA (24  ids)  XP_760925    (057..138)
TMMIDVDTSFVRFTSITQALGKNKVNFGRLVKTCPALDPHITKLKGGYLSIQGTWLPFDLAKELSRRIAWEIRDHLVPLFGY
>Swi4_USTMA (42  ids)  XP_761485    (182..260)
AVMRRRGDGWLNATQILKIAGIEKTRRTKILEKSILTGEHEKIQGGYGKFQGTWIPLQRAQQVAAEYNVSHLLQPILEF
>MbpB_YARLI (26  ids)  XP_505499    (080..159)
IIWDYHTGYVHLTGLWKAIGNSKADIVKLIDNSPDLEAVIRRVRGGYLKIQGTWVPYDIARALASRTCYFIRFALIPLFG
>MbpA_YARLI (44  ids)  XP_501770    (036..114)
AVMRRRTDSSLNATQILKVAGVEKSKRTKILEKEILTGAHEKVQGGYGKYQGTWIPYERGVDLCRQYSVYDVLQPLLAF
>SokA_YARLI (55  ids)  CAB45654     (144..216)
VARREDNDMINGTKLLNVAGMTRGRRDGILKGEKLRHVVKAGAMHLKGVWIPYDRALEFANKEKIIDLLFPLF
>Sok2_YARLI (60  ids)  XP_501102    (130..202)
VARREDNNMINGTKLLNVVGMTRGRRDGILKTEKIRHVVKIGAMHLKGVWIPYERALAFAQRERIVDVLYPLF
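
The header lines above follow a regular pattern: label, number of identities ("ids"), accession, and residue range. This makes simple consistency checks possible. The sketch below parses one header with a regular expression, compares the residue range with the sequence length, and counts position-by-position identities against the query. It assumes that "ids" is the number of residues identical to the Mbp1_SACCE query; a plain positional count is only comparable to that value for pairs of equal length aligned without gaps.

```python
import re

# Pattern for headers such as ">Mbp1_ASHGO (66  ids)  NP_986147    (031..109)"
HEADER_RE = re.compile(r"^>(\w+)\s+\((\d+)\s+ids\)\s+(\S+)\s+\((\d+)\.\.(\d+)\)$")


def parse_header(line):
    """Split a header line of the list above into its fields."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unexpected header: {line!r}")
    label, ids, accession, start, end = m.groups()
    return {"label": label, "ids": int(ids), "accession": accession,
            "start": int(start), "end": int(end)}


def count_identities(a, b):
    """Count identical residues at matching positions (ungapped, equal length)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b))


query_seq = "SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF"
hit_header = ">Mbp1_ASHGO (66  ids)  NP_986147    (031..109)"
hit_seq = "SIMKRKADDWVNATHILKAAKFAKAKRTRILEKEVIKDTHEKVQGGFGKYQGTWVPLDIARRLAQKFEVLEELRPLFDF"

hit = parse_header(hit_header)
span = hit["end"] - hit["start"] + 1
print(hit["label"], span == len(hit_seq))
print(count_identities(query_seq, hit_seq), hit["ids"])
```

For this pair the positional count agrees with the value in the header (66), and the residue range matches the sequence length (79).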