Reference APSES domains (reference species)

Revision as of 07:00, 23 November 2011 by Boris (talk | contribs)


Multi FASTA file of APSES domains in six fungal reference species.

This page collects APSES domain sequences from six fungal species that are used as reference species for the course. The species are:

  • Aspergillus nidulans (ASPNI)
  • Candida albicans (CANAL)
  • Neurospora crassa (NEUCR)
  • Saccharomyces cerevisiae (SACCE)
  • Schizosaccharomyces pombe (SCHPO)
  • Ustilago maydis (USTMA)


Executing the PSI-BLAST search

Defining the APSES Domain sequence
  1. Navigate to the NCBI BLAST page;
  2. Follow the link to protein BLAST and enter the yeast Mbp1 refseq ID NP_010227 into the input form;
  3. Select the PSI-BLAST algorithm to search for domains in the sequence and click Run BLAST;
  4. Click on the graphical summary of the result to access the CDD conserved domains report for the sequence;
  5. Click on the (+) sign next to the link to the KilA-N (pfam04383) domain to display the query/profile alignment. This is what it looks like:
                          10        20        30        40        50        60        70        80
                  ....*....|....*....|....*....|....*....|....*....|....*....|....*....|....*....|
gi 6320147     19 IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQ---------------GGFGKYQGTWVPLNIA 83
Cdd:pfam04383   3 YNDFEIIIRRDKDGYINATKLCKAAGAKGKRFRNWLRLESTKELIEELSkennpdkliiienrkGKGGRLQGTYVHPDLA 82


                          90
                  ....*....|....
gi 6320147     84 KQLA----EKFSVY 93
Cdd:pfam04383  83 LAIAswisPEFALK 96

This gives us the following APSES domain sequence:

>Yeast Mbp1 APSES domain (AA 19..93 of NP_010227)
IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQG 
GFGKYQGTWVPLNIAKQLAEKFSVY
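
The same ungapped sequence can be recovered from the query rows of the alignment above by concatenating them and deleting the gap characters. The following Python lines are a minimal sketch of that step; the function name ungap is only illustrative.

```python
# Query rows copied from the CDD alignment above (gi 6320147, residues 19..93).
query_rows = [
    "IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQ---------------GGFGKYQGTWVPLNIA",
    "KQLA----EKFSVY",
]

def ungap(rows):
    """Concatenate alignment rows and drop the '-' gap characters."""
    return "".join(rows).replace("-", "")

domain = ungap(query_rows)

# Residues 19..93 inclusive span 93 - 19 + 1 = 75 amino acids.
print(len(domain))
print(domain)
```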


Searching for APSES domains

A PSI-BLAST search was executed, searching in the refseq subset of the NCBI protein database and restricting the species to the six fungal reference species plus Escherichia coli. The latter was included to retrieve the KilA-N domain sequence, which we need as an outgroup for phylogenetic analysis.

The search converged after 5 iterations. In each iteration, matches covering less than 80% of the query length were manually removed, even if they had low E-values. To avoid including false positives, and thus corrupting the profile, hits with E > 10^-4 were also removed. The final result included 39 sequences. The check-boxes next to the alignments were used to select sequences with > 80% coverage of the query, and only the highest-scoring KilA-N domain protein was kept. Clicking on Get selected sequences created a results page of 29 sequences. These were then displayed in FASTA (text) format.
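
The two manual selection criteria (coverage and E-value) can be written down as a small filter. This is only a sketch of the rule, not the procedure that was actually used: the Hit record, its field names and the three example hits are hypothetical, and it assumes that the query was the 75-residue Mbp1 APSES domain defined above.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    accession: str   # hypothetical identifier, not taken from the real search
    align_len: int   # number of query residues covered by the alignment
    evalue: float

QUERY_LEN = 75       # assumed query: Mbp1 APSES domain, AA 19..93
MIN_COVERAGE = 0.80  # remove matches covering less than 80% of the query
MAX_EVALUE = 1e-4    # remove hits with E > 10^-4

def keep(hit: Hit) -> bool:
    """Return True if the hit passes both the coverage and the E-value rule."""
    coverage = hit.align_len / QUERY_LEN
    return coverage >= MIN_COVERAGE and hit.evalue <= MAX_EVALUE

# Made-up hits that exercise each rule.
hits = [
    Hit("hit_A", 75, 1e-40),  # full length and significant: kept
    Hit("hit_B", 50, 1e-30),  # about 67% coverage: removed despite low E-value
    Hit("hit_C", 70, 1e-2),   # good coverage, but E-value too high: removed
]
selected = [h.accession for h in hits if keep(h)]
print(selected)
```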



Constructing the multi-FASTA file

A multi-FASTA file is the default input format for many MSA programs; it is simply a file that contains more than one FASTA-formatted sequence.
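
To illustrate the format, here is a minimal Python sketch that reads multi-FASTA text into (header, sequence) pairs. The function name read_multi_fasta is an assumption made for this example; the two records are copied from this page.

```python
def read_multi_fasta(text):
    """Parse multi-FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header line closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; collect and join them.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """\
>Yeast Mbp1 APSES domain (AA 19..93 of NP_010227)
IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQG
GFGKYQGTWVPLNIAKQLAEKFSVY
>Swi4_ASHGO (58  ids)  NP_986370    (043..115)
VMRRLHDDWVNITQVFKVATFSKTQRTKILEKESADISHEKIQGGYGRFQGTWIPLDSAKGLVAKYEITDIVV
"""
records = read_multi_fasta(example)
for header, seq in records:
    print(header, len(seq))
```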

The PSI-BLAST search has already defined the segments of each source protein that are similar to the APSES search profile. We only need to extract them from the search results in a convenient way. NCBI offers a number of options to format the result page. They are reached from a link at the top of the BLAST results page, "Reformat these Results". The principal format options are:

  • Pairwise: the default
  • Pairwise with identities: showing only differences to the query sequence
  • query anchored with/without identities: looks something like a multiple sequence alignment, with hyphens for gaps; insertions relative to the query are displayed below the sequence
  • flat-query anchored with/without identities: this now looks like a multiple sequence alignment (in fact it is one: all sequences are aligned to the profile)
  • hit-table: this gives only the numerical parameters describing the quality of the matches.

When we select the flat-query anchored with/without identities option, it is reasonably straightforward to obtain the aligned sequences, copy and paste them into a Word document, and convert that into multi-FASTA format with a few Edit > Replace commands.
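
The copy-and-replace conversion could also be scripted. The Python sketch below assumes that every alignment row has the layout "identifier start segment end" and that the alignment is split into several blocks; the exact layout of the NCBI output may differ, and the rows in toy are made up for illustration, not real BLAST output. The function name anchored_to_fasta is hypothetical.

```python
def anchored_to_fasta(alignment_text):
    """Convert 'id start segment end' alignment rows to multi-FASTA text."""
    segments = {}  # identifier -> list of aligned segments, in input order
    for line in alignment_text.splitlines():
        fields = line.split()
        # Assumed row layout: identifier, start, aligned segment, end.
        if len(fields) != 4:
            continue
        seq_id, start, segment, end = fields
        if not (start.isdigit() and end.isdigit()):
            continue
        segments.setdefault(seq_id, []).append(segment)
    out = []
    for seq_id, parts in segments.items():
        out.append(">" + seq_id)
        # Join the blocks and remove gap characters to get the raw sequence.
        out.append("".join(parts).replace("-", ""))
    return "\n".join(out) + "\n"

# Toy rows in the assumed layout (two blocks, one query and one hit).
toy = """
Query  1   ACDE-FGH  7
hitA   3   ACDEKFGH  10

Query  8   IKLM  11
hitA   11  I-LM  13
"""
fasta_text = anchored_to_fasta(toy)
print(fasta_text)
```

Keeping the hyphens instead of removing them would preserve the alignment; they are removed here because the sequences listed on this page are ungapped.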

Renaming sequences

To make the interpretation of alignments and gene trees easier, the Mbp1 orthologues of all species were labelled Mbp1_???? (e.g. Mbp1_ASPFU). All yeast sequences were labelled with their gene name (e.g. Sok2_SACCE). All other sequences were named after the yeast gene with which they share the most identities, with the last digit replaced by A, B, C as required (e.g. SokA_ASHGO). Note that relabelling the sequences does not change the data or its interpretation; it is just helpful. Finally, the sequences were sorted to have the Mbp1 orthologues first in the list, followed by all other sequences sorted by organism.
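
The labelling and sorting rules can be sketched in a few lines of Python. The function names are hypothetical. Ordering by species code within each group is an assumption of this sketch; the text only specifies that Mbp1 orthologues come first and that the rest are sorted by organism, and the list on this page places Mbp1_SACCE at the very top.

```python
import string

def paralogue_label(yeast_gene, index, species):
    """Replace the last digit of the yeast gene name by A, B, C, ...

    For example, ("Sok2", 0, "ASHGO") gives "SokA_ASHGO".
    """
    letter = string.ascii_uppercase[index]
    return f"{yeast_gene[:-1]}{letter}_{species}"

def sort_key(label):
    """Mbp1 orthologues first, then everything else grouped by species code."""
    gene, species = label.split("_")
    return (gene != "Mbp1", species)

labels = [
    "Sok2_SACCE", "Mbp1_USTMA", "Swi4_ASHGO",
    "Mbp1_ASPFU", "Sok2_ASHGO", "Mbp1_SACCE",
]
# sorted() is stable, so labels with equal keys keep their input order.
ordered = sorted(labels, key=sort_key)
print(paralogue_label("Sok2", 0, "ASHGO"))
print(ordered)
```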

The final 69 sequences

>Mbp1_SACCE (79  ids)  NP_010227    (024..102)
SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
>Mbp1_ASHGO (66  ids)  NP_986147    (031..109)
SIMKRKADDWVNATHILKAAKFAKAKRTRILEKEVIKDTHEKVQGGFGKYQGTWVPLDIARRLAQKFEVLEELRPLFDF
>Mbp1_ASPFU (49  ids)  XP_754232    (001..077)
MRRRGDDWINATHILKVAGFDKPARTRILEREVQKGTHEKVQGGYGKYQGTWIPLHEGRLLAERNNIIDKLRPIFDY
>Mbp1_ASPNI (50  ids)  XP_660758    (028..106)
SVMRRRSDDWINATHILKVAGFDKPARTRILEREVQKGVHEKVQGGYGKYQGTWIPLQEGRQLAERNNILDKLLPIFDY
>Mbp1_ASPTE (49  ids)  XP_001213217 (028..106)
SVMRRRADDWINATHILKVAGFDKPARTRILEREVQKGVHEKVQGGYGKYQGTWIPLPEGRLLAERNNIIDKLRPIFDY
>Mbp1_CANAL (53  ids)  XP_723071    (026..103)
IMRRKKDSWINATHILKIAKFPKAKRTRILEKDVQTGIHEKVQGGYGKYQGTYVPLDLGAAIARNFGVYDVLKPIFEF
>Mbp1_CANGL (71  ids)  XP_445458    (024..102)
SIMKRKNDGWVNATHILKAANFAKAKRTRILEKEVLKEMHEKVQGGFGKYQGTWVPLNIAINLAEKFDVYQDLKPLFDF
>Mbp1_COPCI (43  ids)  EAU84310     (025..103)
AVMRRRSDSWLNATQILKVAGFDKPQRTRVLEREVQKGEHEKVQGGYGKYQGTWIPLERGMQLAKQYNCEHLLRPIIEF
>Mbp1_CRYNE (47  ids)  XP_570545    (133..211)
SVMRRASDSWVNATQILKVAGVHKSARTKILEKEVLNGIHEKIQGGYGKYQGTWVPLDRGRDLAEQYGVGSYLSSVFDF
>Mbp1_DEBHA (50  ids)  XP_458784    (027..104)
IMRRKLDSWINATHILKIAKFPKAKRTRILEKDVQTGVHEKVQGGYGKYQGTYVPLDLGADIAKNFGVFDSLRPIFEF
>Mbp1_GIBZE (48  ids)  XP_390560    (040..117)
VMRRRSDDWINATHILKAAGFDKPARTRILERDVQKDVHEKIQGGYGKYQGTWIPLESGQALAERHSVIDRLRPIFEY
>Mbp1_KLULA (64  ids)  XP_454189    (025..103)
SIMKRKADNWVNATHILKAAKFPKAKRTRILEKEVITDTHEKVQGGFGKYQGTWIPLELASKLAEKFEVLDELKPLFDF
>Mbp1_MAGGR (48  ids)  XP_362974    (040..117)
VMRRRVDDWINATHILKAAGFDKPARTRILEREVQKDQHEKVQGGYGKYQGTWIPLEAGEALAHRNNIFDRLRPIFEF
>Mbp1_NEUCR (50  ids)  XP_955821    (037..114)
VMRRRHDDWVNATHILKAAGFDKPARTRILEREVQKDTHEKIQGGYGRYQGTWIPLEQAEALARRNNIYERLKPIFEF
>Mbp1_PICST (52  ids)  XP_001386821 (026..103)
IMRRKLDSWINATHILKIAKFPKAKRTRILEKDVQTGVHEKVQGGYGKYQGTYVPLELGRDIAKNFGVFDILKPIFDF
>Mbp1_SCHPO (43  ids)  NP_595496    (027..103)
MKRCHDNWLNATQILKIAELDKPRRTRILEKFAQKGLHEKIQGGCGKYQGTWVPSERAVELAHEYNVFDLIQPLIEY
>Mbp1_USTMA (41  ids)  XP_762343    (026..104)
AVMRRRSDDWLNATQILKVVGLDKPQRTRVLEREIQKGIHEKVQGGYGKYQGTWIPLDVAIELAERYNIQGLLQPITSY
>Mbp1_YARLI (49  ids)  XP_500257    (022..100)
AVMRRKSDGWVNATHILKVAGFDKPQRTRILEKEVQKGVHEKVQGGYGKYQGTWVPLERAREIATLYDVDSHLAPIFNY
>Swi4_ASHGO (58  ids)  NP_986370    (043..115)
VMRRLHDDWVNITQVFKVATFSKTQRTKILEKESADISHEKIQGGYGRFQGTWIPLDSAKGLVAKYEITDIVV
>Sok2_ASHGO (67  ids)  NP_983001    (352..425)
SVVRRADNDMINGTKLLNVAKMTRGRRDGILKAEKVRHVVKIGSMHLKGVWIPFERALALAQREKIVDMLFPLF
>MbpB_ASPFU (22  ids)  XP_751244    (151..225)
VMWDYNIGLVRTTHLFKCNDYSKMLNANPGLREICHSITGGALAAQGYWMPYEAAKAVAATFCWKIRHALTPLFG
>MbpA_ASPFU (40  ids)  XP_748947    (105..183)
AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVELCREYHVEELLRPLLEY
>Sok2_ASPFU (58  ids)  XP_755125    (152..224)
VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPFERALEFANKEKITDLLYPLF
>MbpB_ASPNI (19  ids)  XP_001392970 (124..203)
ISWDYNVGLVLTRSLFKCNGHPKTAPAKVLKMNPGLGDISHSITGGALVGQGYWMPFRAAKALATTFCWNIRFVLTPMFG
>SokB_ASPNI (21  ids)  XP_663009    (131..211)
TVMWDYNIGLVRTTHLFKCNDYSKTTPAKMLNQNPGLRDICHSITGGALAAQGYWMPYEAAKAIAATFCWKIRFALTPLFG
>MbpA_ASPNI (40  ids)  XP_001391313 (118..196)
AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVELCREYHVEELLRPLLEY
>SokA_ASPNI (56  ids)  XP_663440    (152..224)
VARREDNGMINGTKLLNVAGMTRGRRDGILKSEKVRNVVKIGPMHLKGVWIPFDRALEFANKEKITDLLYPLF
>Sok2_ASPNI (58  ids)  XP_001390623 (153..225)
VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPFERALEFANKEKITDLLYPLF
>MbpB_ASPTE (21  ids)  XP_001212599 (130..212)
IMWDYNIGLVRTTPLFRSQNYSKTTPAKVLDANPGLREISHSITGGAIVAQDKPGYWIPFEAAKAVAATFCWRIRYALTPIFG
>MbpA_ASPTE (40  ids)  XP_001215548 (007..085)
AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVDLCREYHVEELLRPLLEY
>Sok2_ASPTE (59  ids)  XP_001218256 (139..211)
VARREDNSMINGTKLLNVAGMTRGRRDGILKSEKIRHVVKIGPMHLKGVWIPFERALEFANKEKITDLLYPLF
>MbpC_CANAL (22  ids)  XP_723412    (087..178)
VLRRVQDSFVNVTQLFQILIKLEVLPTSQVDNYFDNEILSNLKYFGSSSNTPQYLDLRKHQNIYLQGIWIPYDKAVNLALKFDIYEITKKLF
>MbpB_CANAL (25  ids)  XP_710918    (256..346)
VIWDYETGWVHLTGIWKASLTIDGSNVSPSHLKADIVKLLESTPKEYQQYIKRIRGGFLKIQGTWLPYKLCKILARRFCYYLRYSLIPIFG
>MbpA_CANAL (48  ids)  XP_712970    (006..082)
SIMRRCKDDWVNATQILKCCNFPKAKRTKILEKGVQQGLHEKVQGGFGRFQGTWIPLEDARKLAKTYGVTEELAPVL
>Sok2_CANAL (49  ids)  XP_711513    (469..541)
VSRREDTNYINGTKLLNVIGMTRGKRDGILKTEKIKNVVKVGSMNLKGVWIPFDRAYEIARNEGVDSLLYPLF
>Phd1_CANAL (65  ids)  XP_714237    (228..301)
SVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREQIVDMLYPLF
>SokA_CANGL (56  ids)  XP_449680    (143..216)
TVVRRADNDMVNGTKLLNVTGMTRGRRDGILKNEPVRDVVKGGPMTLKGVWIPIDRARAIARQEGIEQWLYPLF
>Swi4_CANGL (61  ids)  XP_444966    (062..140)
VMRRTMDDWVNVTQVFKIAQFSKTQRTKILEKESTNMKHEKVQGGYGRFQGTWVPLEAAKFMTTKYNIDNPVVNTILSF
>Sok2_CANGL (64  ids)  XP_448847    (224..297)
SVVRRADNDMINGTKLLNVTKMTRGKRDGILRSEKYRKVVKIGSMHLKGVWIPFERALFIAKREKIVDLLYPLF
>MbpA_COPCI (26  ids)  EAU85126     (059..139)
IMMDIDDGYILWTGIWKALGNSKADIVKMIDSQPDLAPLIRRVRGGYLKIQGTWMPYEVALKLSRRVAWPIRHDLVPLFGF
>MbpA_CRYNE (42  ids)  XP_569090    (036..114)
AVMRRRSDAYLNATQILKVAGFDKPQRTRVLEREVQKGEHEKVQGGYGKYQGTWIPIERGLALAKQYGVEDILRPIIDY
>MbpB_DEBHA (26  ids)  XP_459773    (187..275)
IIWDYETGFVHLTGIWKASINDEVNTHRNLKADIVKLLESTPKQYHQHIKRIRGGFLKIQGTWLPFDLCKMLAKRFCYHIRFQLIPIFG
>Swi4_DEBHA (26  ids)  XP_459901    (067..158)
ILRRVQDSYINISQLFSILLKIGHLSEAQLTNFLNNEILTNTQYLSSGGSNPQFNDLRNHEVRDLRGLWIPYDRAVSLALKFDIYELAKSLF
>MbpA_DEBHA (45  ids)  XP_457246    (028..103)
IMRRCKDDWVNATQILKCCNFPKAKRTKILEKGVQQGLHEKIQGGYGRFQGTWIPLADAQRLAASYGVTPDLAPVL
>SokA_DEBHA (50  ids)  XP_460447    (213..285)
VSRREDTNYVNGTKLLNVAGMTRGKRDGILKTEKTKSVVKVGAMNLKGVWIPFERASEIARNEGIDGLLYPLF
>Sok2_DEBHA (64  ids)  XP_459785    (307..380)
SVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREGIVDLLYPLF
>MbpB_GIBZE (21  ids)  XP_389978    (139..219)
AVMWDYNIGLVRMTPFFKCRGYGKTIPAKMLGLNPGLKEITHSITGGSIAAQGYWMPYRCAKAICATFCHPIAGALIPIFG
>MbpA_GIBZE (39  ids)  XP_384396    (045..123)
AVMRRRNDSWLNATQILKVAGVDKGKRTKILEKEIQTGEHEKVQGGYGKYQGTWIKFERGLQVCRQYGVEELLRPLLTY
>Sok2_GIBZE (55  ids)  XP_390305    (226..298)
VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPYDRALDFANKEKITELLYPLF
>Swi4_KLULA (50  ids)  XP_454890    (119..197)
IMRRCNDNWLNITQVFKAGSFTKAQRTKILEKEANEIKHEKIQGGYGRFQGTWIPWESTKYLVEKYNINNKVVKRIVEF
>Sok2_KLULA (67  ids)  XP_455299    (386..459)
SVVRRADNDMINGTKLLNVTRMTRGRRDGILKAEKIRHVVKIGSMHLKGVWIPFERALVMAQREKIVDLLYALF
>MbpB_MAGGR (20  ids)  XP_369301    (096..176)
TVMWDYGCGLVRMTHFFKCRGYTKTVPGKVLNQNHGLKDITYSITGGSISAQGYWMPFACARAVCATFCHPIAGALIPIFG
>MbpA_MAGGR (39  ids)  XP_365024    (131..209)
AVMKRIGDSKLNATQILKVAGVEKGKRTKILEKEIQTGEHEKVQGGYGKYQGTWIKYERALEVCRQYGVEELLRPLLEY
>Sok2_MAGGR (57  ids)  XP_368552    (133..205)
VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKMRHVVKIGPMHLKGVWIPFERALDFANKEKITELLYPLF
>MbpA_NEUCR (40  ids)  XP_962967    (071..147)
AVMRRQKDGWVNATQILKVANIDKGRRTKILEKEIQIGEHEKVQGGYGKYQGTWIPFERGLEVCRQYGVEELLSKLL
>MbpA_PICST (46  ids)  XP_001383745 (006..081)
IMRRCKDDWVNATQILKCCNFPKAKRTKILEKGVQQGLHEKVQGGFGRFQGTWIPLPDAQRLATMYGVTADAAPVL
>SokA_PICST (49  ids)  XP_001385235 (239..311)
VSRREDTNFVNGTKLLNVIGMTRGKRDGILKTEKTRNVVKVGSMNLKGVWIPFDRAFEIARNEGVDEALHPLF
>Sok2_PICST (64  ids)  XP_001383609 (194..267)
SVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREGIVDLLYPLF
>Sok2_SACCE (74  ids)  EDN64408     (435..508)
SVVRRADNDMVNGTKLLNVTKMTRGRRDGILKAEKIRHVVKIGSMHLKGVWIPFERALAIAQREKIADYLYPLF
>Phd1_SACCE (74  ids)  NP_012881    (208..281)
SVVRRADNNMINGTKLLNVTKMTRGRRDGILRSEKVREVVKIGSMHLKGVWIPFERAYILAQREQILDHLYPLF
>Swi4_SACCE (79  ids)  EDN63086     (060..138)
VMRRTKDDWINITQVFKIAQFSKTKRTKILEKESNDMQHEKVQGGYGRFQGTWIPLDSAKFLVNKYEIIDPVVNSILTF
>MbpB_SCHPO (21  ids)  NP_596132    (088..164)
LRRCPDSYFNISQILRLAGTSSSENAKELDDIIESGDYENVDSKHPQIDGVWVPYDRAISIAKRYGVYEILQPLISF
>MbpA_SCHPO (41  ids)  NP_593032    (027..104)
SVMRRRRDSWLNATQILKVADFDKPQRTRVLERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPILS
>MbpA_USTMA (24  ids)  XP_760925    (057..138)
TMMIDVDTSFVRFTSITQALGKNKVNFGRLVKTCPALDPHITKLKGGYLSIQGTWLPFDLAKELSRRIAWEIRDHLVPLFGY
>Swi4_USTMA (42  ids)  XP_761485    (182..260)
AVMRRRGDGWLNATQILKIAGIEKTRRTKILEKSILTGEHEKIQGGYGKFQGTWIPLQRAQQVAAEYNVSHLLQPILEF
>MbpB_YARLI (26  ids)  XP_505499    (080..159)
IIWDYHTGYVHLTGLWKAIGNSKADIVKLIDNSPDLEAVIRRVRGGYLKIQGTWVPYDIARALASRTCYFIRFALIPLFG
>MbpA_YARLI (44  ids)  XP_501770    (036..114)
AVMRRRTDSSLNATQILKVAGVEKSKRTKILEKEILTGAHEKVQGGYGKYQGTWIPYERGVDLCRQYSVYDVLQPLLAF
>SokA_YARLI (55  ids)  CAB45654     (144..216)
VARREDNDMINGTKLLNVAGMTRGRRDGILKGEKLRHVVKAGAMHLKGVWIPYDRALEFANKEKIIDLLFPLF
>Sok2_YARLI (60  ids)  XP_501102    (130..202)
VARREDNNMINGTKLLNVVGMTRGRRDGILKTEKIRHVVKIGAMHLKGVWIPYERALAFAQRERIVDVLYPLF