Difference between revisions of "Reference APSES domains (reference species)"

From "A B C"
Line 47: Line 47:
 
=====Searching for APSES domains=====
 
=====Searching for APSES domains=====
  
A PSI-BLAST search was executed, searching in the '''refseq''' subset of the NCBI protein database and restricting the species to the six fungal reference species plu ''Escherichi coli'''. The latter was chosen to retrieve the KilA-N domain sequence which we need as an outgroup for phylogenetic analyis.  
+
A PSI-BLAST search was executed, searching in the '''refseq''' subset of the NCBI protein database and restricting the species to the six fungal reference species plu ''Escherichia coli''. The latter was chosen to retrieve the KilA-N domain sequence which we need as an outgroup for phylogenetic analysis.  
  
The search converged after 5 iterations in which matches of less than 80% of the query length were manually removed, even if they had low E-values. Also, care was taken not to include false positives and thus to avoid profile corruption, and hits with E > 10<sup>-4</sup> were also removed. The final result included 39 sequences. The check-boxes next to the alignments were used to select sequences with > 80% coverage to the query and only the highest-scoring KilA-N domain protein was kept. Clicking on '''Get selected sequences''' created a results page of 29 sequences. These were then displayed in a FASTA(text) format.
+
The search converged after 5 iterations in which matches of less than 80% of the query length were manually removed, even if they had low E-values. Also, care was taken not to include false positives and thus to avoid profile corruption, and hits with E > 10<sup>-4</sup> were also removed. The check-boxes next to the alignments were used to select sequences with > 80% coverage to the query and only the highest-scoring KilA-N domain protein was kept. Clicking on '''Get selected sequences''' created a results page of 27 sequences. These were then displayed in a FASTA(text) format and their headers were slightly edited to create a dataset of [[Reference APSES full length proteins]].
 
 
 
 
<!-- CONTINUE HERE -->
 
  
  
 
====Constructing the multi-FASTA file====
 
====Constructing the multi-FASTA file====
  
A multi-FASTA file is the default input format for many MSA programs, it is simply a file that contains more than one FASTA formatted sequence.
+
A multi-FASTA file is the default input format for many MSA programs, it is simply a file that contains more than one FASTA formatted sequence. To generate the multi-FASTA file of APSES domains, we could have simply adited the full length proteins manually. But there is a simpler way to achieve this. The PSI-BLAST search has already defined the sequences from each source protein that are similar to the APSES search profile. We only need to extract them in a convenient way from the search results. NCBI offers a number of options to format the BLAST result page: they are presented from alink at the top of the BLAST results page: " Reformat these Results": the principal options for the format are:
 
 
The PSI-BLAST search has already defined the sequences from each source protein that are similar to the APSES search profile. We only need to extract them in a convenient way from the search results. NCBI offers a number of options to format the result page: they are presented from alink at the top of the BLAST results page: " Reformat these Results": the principal options for the format are:
 
  
 
*'''Pairwise''': the default
 
*'''Pairwise''': the default
Line 70: Line 65:
  
 
====Renaming sequences====
 
====Renaming sequences====
To make the interpretation of alignments and gene trees easier, the Mbp1 orthologues for all species were labeled <code>Mbp1_????</code> (e.g. <code>Mbp1_ASPFU</code>). All yeast sequences were labelled with their gene name  (e.g. <code>Sok2_SACCE</code>). All other sequences were named according to the yeast gene they share the most identities with, where the last digit was replaced with A, B, C - as required.  (e.g. <code>SokA_ASHGO</code>). Note that such relabeling sequences does not change the data or its interpretation, it is just helpful. Finally the squences were sorted to have the Mbp1 orthologues first in the list, then all other sequences sorted by organism.
+
To make the interpretation of alignments and gene trees easier, all yeast sequences were labelled with their gene name  (e.g. <code>Sok2_SACCE</code>). All other sequences were named APS1_, APS2_, APS3_ ... - as required.  (e.g. <code>APS1_USTMA</code>). Note that such relabeling sequences does not change the data or its interpretation, it is just helpful.
  
====The final 69 sequences====
+
====The final 27 APSES domain reference sequences====
  
  >Mbp1_SACCE (79  ids)  NP_010227    (024..102)
+
  >APS1_USTMA XP_762343 UM06196
  SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
+
  IINNVAVMRRRSDDWLNATQILKVVGLDKPQRTRVLEREIQKGIHEKVQGGYGKYQGTWIPLDVAIELAE
  >Mbp1_ASHGO (66  ids)  NP_986147    (031..109)
+
  RYNI
  SIMKRKADDWVNATHILKAAKFAKAKRTRILEKEVIKDTHEKVQGGFGKYQGTWVPLDIARRLAQKFEVLEELRPLFDF
+
   
  >Mbp1_ASPFU (49  ids)  XP_754232    (001..077)
+
  >APS1_NEUCR XP_962967 NCU07587
  MRRRGDDWINATHILKVAGFDKPARTRILEREVQKGTHEKVQGGYGKYQGTWIPLHEGRLLAERNNIIDKLRPIFDY
+
  VNNVAVMRRQKDGWVNATQILKVANIDKGRRTKILEKEIQIGEHEKVQGGYGKYQGTWIPFERGLEVCRQ
  >Mbp1_ASPNI (50  ids)  XP_660758    (028..106)
+
  YGV
  SVMRRRSDDWINATHILKVAGFDKPARTRILEREVQKGVHEKVQGGYGKYQGTWIPLQEGRQLAERNNILDKLLPIFDY
+
   
  >Mbp1_ASPTE (49  ids)  XP_001213217 (028..106)
+
  >APS2_NEUCR XP_955821 NCU07246
  SVMRRRADDWINATHILKVAGFDKPARTRILEREVQKGVHEKVQGGYGKYQGTWIPLPEGRLLAERNNIIDKLRPIFDY
+
  VMRRRHDDWVNATHILKAAGFDKPARTRILEREVQKDTHEKIQGGYGRYQGTWIPLEQAEALARRNNIY
  >Mbp1_CANAL (53  ids)  XP_723071    (026..103)
+
   
IMRRKKDSWINATHILKIAKFPKAKRTRILEKDVQTGIHEKVQGGYGKYQGTYVPLDLGAAIARNFGVYDVLKPIFEF
+
  >MBP1_SACCE NP_010227 Mbp1
  >Mbp1_CANGL (71  ids)  XP_445458    (024..102)
+
  IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAE
  SIMKRKNDGWVNATHILKAANFAKAKRTRILEKEVLKEMHEKVQGGFGKYQGTWVPLNIAINLAEKFDVYQDLKPLFDF
+
  KFSVY
  >Mbp1_COPCI (43  ids)  EAU84310    (025..103)
+
   
  AVMRRRSDSWLNATQILKVAGFDKPQRTRVLEREVQKGEHEKVQGGYGKYQGTWIPLERGMQLAKQYNCEHLLRPIIEF
+
  >APS1_CANAL XP_712970 potential DNA binding component of SBF
  >Mbp1_CRYNE (47  ids)  XP_570545    (133..211)
+
  MMNESSIMRRCKDDWVNATQILKCCNFPKAKRTKILEKGVQQGLHEKVQGGFGRFQGTWIPLEDARKLAK
  SVMRRASDSWVNATQILKVAGVHKSARTKILEKEVLNGIHEKIQGGYGKYQGTWVPLDRGRDLAEQYGVGSYLSSVFDF
+
  TYGV
  >Mbp1_DEBHA (50  ids)  XP_458784    (027..104)
+
   
  IMRRKLDSWINATHILKIAKFPKAKRTRILEKDVQTGVHEKVQGGYGKYQGTYVPLDLGADIAKNFGVFDSLRPIFEF
+
  >APS1_ASPNI XP_660758.1 AN3154
  >Mbp1_GIBZE (48  ids)  XP_390560    (040..117)
+
  IGTDSVMRRRSDDWINATHILKVAGFDKPARTRILEREVQKGVHEKVQGGYGKYQGTWIPLQEGRQLAER
VMRRRSDDWINATHILKAAGFDKPARTRILERDVQKDVHEKIQGGYGKYQGTWIPLESGQALAERHSVIDRLRPIFEY
+
  NNI
>Mbp1_KLULA (64  ids)  XP_454189    (025..103)
+
   
SIMKRKADNWVNATHILKAAKFPKAKRTRILEKEVITDTHEKVQGGFGKYQGTWIPLELASKLAEKFEVLDELKPLFDF
+
  >APS1_SCHPO NP_595496 MBF transcription factor complex subunit Res1
>Mbp1_MAGGR (48  ids)  XP_362974    (040..117)
+
  INGFPLMKRCHDNWLNATQILKIAELDKPRRTRILEKFAQKGLHEKIQGGCGKYQGTWVPSERAVELAHE
VMRRRVDDWINATHILKAAGFDKPARTRILEREVQKDQHEKVQGGYGKYQGTWIPLEAGEALAHRNNIFDRLRPIFEF
+
  YNVF
>Mbp1_NEUCR (50  ids) XP_955821    (037..114)
+
   
  VMRRRHDDWVNATHILKAAGFDKPARTRILEREVQKDTHEKIQGGYGRYQGTWIPLEQAEALARRNNIYERLKPIFEF
+
  >APS2_SCHPO NP_593032 MBF transcription factor complex subunit Res2
  >Mbp1_PICST (52  ids)  XP_001386821 (026..103)
+
  IKGVSVMRRRRDSWLNATQILKVADFDKPQRTRVLERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATK
  IMRRKLDSWINATHILKIAKFPKAKRTRILEKDVQTGVHEKVQGGYGKYQGTYVPLELGRDIAKNFGVFDILKPIFDF
+
  YKV
  >Mbp1_SCHPO (43  ids)  NP_595496   (027..103)
+
   
  MKRCHDNWLNATQILKIAELDKPRRTRILEKFAQKGLHEKIQGGCGKYQGTWVPSERAVELAHEYNVFDLIQPLIEY
+
  >APS2_CANAL XP_723071 potential DNA binding component of MBF
  >Mbp1_USTMA (41  ids)  XP_762343    (026..104)
+
  VTSEGPIMRRKKDSWINATHILKIAKFPKAKRTRILEKDVQTGIHEKVQGGYGKYQGTYVPLDLGAAIAR
  AVMRRRSDDWLNATQILKVVGLDKPQRTRVLEREIQKGIHEKVQGGYGKYQGTWIPLDVAIELAERYNIQGLLQPITSY
+
  NFGVY
  >Mbp1_YARLI (49  ids)  XP_500257    (022..100)
+
   
  AVMRRKSDGWVNATHILKVAGFDKPQRTRILEKEVQKGVHEKVQGGYGKYQGTWVPLERAREIATLYDVDSHLAPIFNY
+
  >APS2_ASPNI XP_664319 hypothetical protein AN6715
  >Swi4_ASHGO (58  ids)  NP_986370    (043..115)
+
  VNGVAVMKRRSDGWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVELCRE
  VMRRLHDDWVNITQVFKVATFSKTQRTKILEKESADISHEKIQGGYGRFQGTWIPLDSAKGLVAKYEITDIVV
+
  YHV
  >Sok2_ASHGO (67  ids)  NP_983001    (352..425)
+
   
  SVVRRADNDMINGTKLLNVAKMTRGRRDGILKAEKVRHVVKIGSMHLKGVWIPFERALALAQREKIVDMLFPLF
+
  >APS2_USTMA XP_761485 UM05338
  >MbpB_ASPFU (22  ids)  XP_751244    (151..225)
+
  VRGIAVMRRRGDGWLNATQILKIAGIEKTRRTKILEKSILTGEHEKIQGGYGKFQGTWIPLQRAQQVAAE
  VMWDYNIGLVRTTHLFKCNDYSKMLNANPGLREICHSITGGALAAQGYWMPYEAAKAVAATFCWKIRHALTPLFG
+
  YNV
  >MbpA_ASPFU (40  ids)  XP_748947    (105..183)
+
   
  AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVELCREYHVEELLRPLLEY
+
  >SWI4_SACCE NP_011036 Swi4p
  >Sok2_ASPFU (58  ids)  XP_755125    (152..224)
+
  TKIVMRRTKDDWINITQVFKIAQFSKTKRTKILEKESNDMQHEKVQGGYGRFQGTWIPLDSAKFLVNKYE
  VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPFERALEFANKEKITDLLYPLF
+
  I
  >MbpB_ASPNI (19  ids)  XP_001392970 (124..203)
+
   
ISWDYNVGLVLTRSLFKCNGHPKTAPAKVLKMNPGLGDISHSITGGALVGQGYWMPFRAAKALATTFCWNIRFVLTPMFG
+
  >APS3_SCHPO NP_596132 MBF transcription factor complex subunit Cdc10
>SokB_ASPNI (21  ids)  XP_663009    (131..211)
+
  GDNVALRRCPDSYFNISQILRLAGTSSSENAKELDDIIESGDYENVDSKHPQIDGVWVPYDRAISIAKR
  TVMWDYNIGLVRTTHLFKCNDYSKTTPAKMLNQNPGLRDICHSITGGALAAQGYWMPYEAAKAIAATFCWKIRFALTPLFG
+
  YGVY
  >MbpA_ASPNI (40  ids)  XP_001391313 (118..196)
+
   
  AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVELCREYHVEELLRPLLEY
+
  >APS3_CANAL XP_714237 potential DNA binding regulator of filamentous growth
  >SokA_ASPNI (56  ids)  XP_663440    (152..224)
+
  NNVSVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREQI
  VARREDNGMINGTKLLNVAGMTRGRRDGILKSEKVRNVVKIGPMHLKGVWIPFDRALEFANKEKITDLLYPLF
+
   
  >Sok2_ASPNI (58  ids)  XP_001390623 (153..225)
+
  >SOK2_SACCE NP_013729 Sok2p
  VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPFERALEFANKEKITDLLYPLF
+
  NGISVVRRADNDMVNGTKLLNVTKMTRGRRDGILKAEKIRHVVKIGSMHLKGVWIPFERALAIAQREKI
  >MbpB_ASPTE (21  ids)  XP_001212599 (130..212)
+
   
IMWDYNIGLVRTTPLFRSQNYSKTTPAKVLDANPGLREISHSITGGAIVAQDKPGYWIPFEAAKAVAATFCWRIRYALTPIFG
+
  >APS3_ASPNI XP_663440 STUA CELL PATTERN FORMATION-ASSOCIATED PROTEIN
>MbpA_ASPTE (40  ids)  XP_001215548 (007..085)
+
  GVCVARREDNGMINGTKLLNVAGMTRGRRDGILKSEKVRNVVKIGPMHLKGVWIPFDRALEFANKEKI
AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVDLCREYHVEELLRPLLEY
+
   
>Sok2_ASPTE (59  ids)  XP_001218256 (139..211)
+
  >PHD1_SACCE NP_012881 Phd1p
VARREDNSMINGTKLLNVAGMTRGRRDGILKSEKIRHVVKIGPMHLKGVWIPFERALEFANKEKITDLLYPLF
+
  NGISVVRRADNNMINGTKLLNVTKMTRGRRDGILRSEKVREVVKIGSMHLKGVWIPFERAYILAQREQI
>MbpC_CANAL (22  ids)  XP_723412    (087..178)
+
   
  VLRRVQDSFVNVTQLFQILIKLEVLPTSQVDNYFDNEILSNLKYFGSSSNTPQYLDLRKHQNIYLQGIWIPYDKAVNLALKFDIYEITKKLF
+
  >APS4_CANAL XP_710918 CaO19.5210
  >MbpB_CANAL (25  ids)  XP_710918    (256..346)
+
  LNNHWVIWDYETGWVHLTGIWKASLTIDGSNVSPSHLKADIVKLLESTPKEYQQYIKRIRGGFLKIQGTW
  VIWDYETGWVHLTGIWKASLTIDGSNVSPSHLKADIVKLLESTPKEYQQYIKRIRGGFLKIQGTWLPYKLCKILARRFCYYLRYSLIPIFG
+
  LPYKLCKILARRFCYY
  >MbpA_CANAL (48  ids)  XP_712970    (006..082)
+
   
SIMRRCKDDWVNATQILKCCNFPKAKRTKILEKGVQQGLHEKVQGGFGRFQGTWIPLEDARKLAKTYGVTEELAPVL
+
  >APS3_NEUCR XP_960837 NCU01414
>Sok2_CANAL (49  ids)  XP_711513    (469..541)
+
  GICVARREDNAMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPFERALDFANKEKI
VSRREDTNYINGTKLLNVIGMTRGKRDGILKTEKIKNVVKVGSMNLKGVWIPFDRAYEIARNEGVDSLLYPLF
+
   
>Phd1_CANAL (65  ids)  XP_714237   (228..301)
+
  >APS5_CANAL XP_711513 potential DNA binding protein
SVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREQIVDMLYPLF
+
  NILVSRREDTNYINGTKLLNVIGMTRGKRDGILKTEKIKNVVKVGSMNLKGVWIPFDRAYEIARNEGV
  >SokA_CANGL (56  ids)  XP_449680    (143..216)
+
   
  TVVRRADNDMVNGTKLLNVTGMTRGRRDGILKNEPVRDVVKGGPMTLKGVWIPIDRARAIARQEGIEQWLYPLF
+
  >APS4_ASPNI XP_663009 AN5405
  >Swi4_CANGL (61  ids)  XP_444966    (062..140)
+
  TVMWDYNIGLVRTTHLFKCNDYSKTTPAKMLNQNPGLRDICHSITGGALAAQGYWMPYEAAKAIAATFC
  VMRRTMDDWVNVTQVFKIAQFSKTQRTKILEKESTNMKHEKVQGGYGRFQGTWVPLEAAKFMTTKYNIDNPVVNTILSF
+
   
  >Sok2_CANGL (64  ids)  XP_448847    (224..297)
+
  >APS3_USTMA XP_760925 UM04778
SVVRRADNDMINGTKLLNVTKMTRGKRDGILRSEKYRKVVKIGSMHLKGVWIPFERALFIAKREKIVDLLYPLF
+
  VRGHTMMIDVDTSFVRFTSITQALGKNKVNFGRLVKTCPALDPHITKLKGGYLSIQGTWLPFDLAKELSR
  >MbpA_COPCI (26  ids)  EAU85126    (059..139)
+
  R
IMMDIDDGYILWTGIWKALGNSKADIVKMIDSQPDLAPLIRRVRGGYLKIQGTWMPYEVALKLSRRVAWPIRHDLVPLFGF
+
   
  >MbpA_CRYNE (42  ids)  XP_569090    (036..114)
+
  >APS4_SCHPO NP_596166
  AVMRRRSDAYLNATQILKVAGFDKPQRTRVLEREVQKGEHEKVQGGYGKYQGTWIPIERGLALAKQYGVEDILRPIIDY
+
  HFLMRMAKDSSISATSMFRSAFPKATQEEEDLEMRWIRDNLNPIEDKRVAGLWVPPADALALAKDYSM
  >MbpB_DEBHA (26  ids)  XP_459773    (187..275)
+
   
  IIWDYETGFVHLTGIWKASINDEVNTHRNLKADIVKLLESTPKQYHQHIKRIRGGFLKIQGTWLPFDLCKMLAKRFCYHIRFQLIPIFG
+
  >KILA_ESCCO ZP_07189117 KilA-N domain protein
  >Swi4_DEBHA (26  ids)  XP_459901    (067..158)
+
  IDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQSFKGGRPENQGTW
ILRRVQDSYINISQLFSILLKIGHLSEAQLTNFLNNEILTNTQYLSSGGSNPQFNDLRNHEVRDLRGLWIPYDRAVSLALKFDIYELAKSLF
+
  VHPDIAINLAQ
  >MbpA_DEBHA (45  ids)  XP_457246    (028..103)
+
   
IMRRCKDDWVNATQILKCCNFPKAKRTKILEKGVQQGLHEKIQGGYGRFQGTWIPLADAQRLAASYGVTPDLAPVL
+
  >APS6_CANAL XP_723412 potential transcriptional co-activator
>SokA_DEBHA (50  ids)  XP_460447    (213..285)
+
  HGEIIVLRRVQDSFVNVTQLFQILIKLEVLPTSQVDNYFDNEILSNLKYFGSSSNTPQYLDLRKHQNIYL
  VSRREDTNYVNGTKLLNVAGMTRGKRDGILKTEKTKSVVKVGAMNLKGVWIPFERASEIARNEGIDGLLYPLF
+
  QGIWIPYDKAVNLALKFDIY
  >Sok2_DEBHA (64  ids)  XP_459785    (307..380)
+
   
  SVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREGIVDLLYPLF
+
  >APS4_NEUCR XP_962267 NCU06560
  >MbpB_GIBZE (21  ids)  XP_389978    (139..219)
+
  FLMRRSQDGYISATGMFKATFPYASQEEEEAERKYIKSIPTTSSEETAGNVWIPPEQALILAEEYQI
  AVMWDYNIGLVRMTPFFKCRGYGKTIPAKMLGLNPGLKEITHSITGGSIAAQGYWMPYRCAKAICATFCHPIAGALIPIFG
+
   
  >MbpA_GIBZE (39  ids)  XP_384396    (045..123)
+
  >APS5_ASPNI XP_657766 AN0162
AVMRRRNDSWLNATQILKVAGVDKGKRTKILEKEIQTGEHEKVQGGYGKYQGTWIKFERGLQVCRQYGVEELLRPLLTY
+
  TYFLMRRSKDGYVSATGMFKIAFPWAKLEEERSEREYLKTRPETSEDEIAGNVWISPVLALELAAEYKMY
  >Sok2_GIBZE (55  ids)  XP_390305    (226..298)
 
VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPYDRALDFANKEKITELLYPLF
 
  >Swi4_KLULA (50  ids)  XP_454890    (119..197)
 
  IMRRCNDNWLNITQVFKAGSFTKAQRTKILEKEANEIKHEKIQGGYGRFQGTWIPWESTKYLVEKYNINNKVVKRIVEF
 
  >Sok2_KLULA (67  ids)  XP_455299    (386..459)
 
  SVVRRADNDMINGTKLLNVTRMTRGRRDGILKAEKIRHVVKIGSMHLKGVWIPFERALVMAQREKIVDLLYALF
 
  >MbpB_MAGGR (20  ids)  XP_369301    (096..176)
 
TVMWDYGCGLVRMTHFFKCRGYTKTVPGKVLNQNHGLKDITYSITGGSISAQGYWMPFACARAVCATFCHPIAGALIPIFG
 
  >MbpA_MAGGR (39  ids)  XP_365024    (131..209)
 
AVMKRIGDSKLNATQILKVAGVEKGKRTKILEKEIQTGEHEKVQGGYGKYQGTWIKYERALEVCRQYGVEELLRPLLEY
 
>Sok2_MAGGR (57  ids)  XP_368552    (133..205)
 
  VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKMRHVVKIGPMHLKGVWIPFERALDFANKEKITELLYPLF
 
  >MbpA_NEUCR (40  ids)  XP_962967    (071..147)
 
AVMRRQKDGWVNATQILKVANIDKGRRTKILEKEIQIGEHEKVQGGYGKYQGTWIPFERGLEVCRQYGVEELLSKLL
 
>MbpA_PICST (46  ids)  XP_001383745 (006..081)
 
IMRRCKDDWVNATQILKCCNFPKAKRTKILEKGVQQGLHEKVQGGFGRFQGTWIPLPDAQRLATMYGVTADAAPVL
 
>SokA_PICST (49  ids)  XP_001385235 (239..311)
 
  VSRREDTNFVNGTKLLNVIGMTRGKRDGILKTEKTRNVVKVGSMNLKGVWIPFDRAFEIARNEGVDEALHPLF
 
  >Sok2_PICST (64  ids)  XP_001383609 (194..267)
 
  SVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREGIVDLLYPLF
 
  >Sok2_SACCE (74  ids)  EDN64408    (435..508)
 
SVVRRADNDMVNGTKLLNVTKMTRGRRDGILKAEKIRHVVKIGSMHLKGVWIPFERALAIAQREKIADYLYPLF
 
  >Phd1_SACCE (74  ids)  NP_012881    (208..281)
 
  SVVRRADNNMINGTKLLNVTKMTRGRRDGILRSEKVREVVKIGSMHLKGVWIPFERAYILAQREQILDHLYPLF
 
  >Swi4_SACCE (79  ids)  EDN63086    (060..138)
 
  VMRRTKDDWINITQVFKIAQFSKTKRTKILEKESNDMQHEKVQGGYGRFQGTWIPLDSAKFLVNKYEIIDPVVNSILTF
 
  >MbpB_SCHPO (21  ids)  NP_596132    (088..164)
 
LRRCPDSYFNISQILRLAGTSSSENAKELDDIIESGDYENVDSKHPQIDGVWVPYDRAISIAKRYGVYEILQPLISF
 
>MbpA_SCHPO (41  ids)  NP_593032    (027..104)
 
SVMRRRRDSWLNATQILKVADFDKPQRTRVLERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPILS
 
>MbpA_USTMA (24  ids)  XP_760925    (057..138)
 
TMMIDVDTSFVRFTSITQALGKNKVNFGRLVKTCPALDPHITKLKGGYLSIQGTWLPFDLAKELSRRIAWEIRDHLVPLFGY
 
>Swi4_USTMA (42  ids)  XP_761485    (182..260)
 
  AVMRRRGDGWLNATQILKIAGIEKTRRTKILEKSILTGEHEKIQGGYGKFQGTWIPLQRAQQVAAEYNVSHLLQPILEF
 
  >MbpB_YARLI (26  ids)  XP_505499    (080..159)
 
  IIWDYHTGYVHLTGLWKAIGNSKADIVKLIDNSPDLEAVIRRVRGGYLKIQGTWVPYDIARALASRTCYFIRFALIPLFG
 
  >MbpA_YARLI (44  ids)  XP_501770    (036..114)
 
  AVMRRRTDSSLNATQILKVAGVEKSKRTKILEKEILTGAHEKVQGGYGKYQGTWIPYERGVDLCRQYSVYDVLQPLLAF
 
  >SokA_YARLI (55  ids)  CAB45654    (144..216)
 
VARREDNDMINGTKLLNVAGMTRGRRDGILKGEKLRHVVKAGAMHLKGVWIPYDRALEFANKEKIIDLLFPLF
 
  >Sok2_YARLI (60  ids)  XP_501102    (130..202)
 
  VARREDNNMINGTKLLNVVGMTRGRRDGILKTEKIRHVVKIGAMHLKGVWIPYERALAFAQRERIVDVLYPLF
 

Revision as of 06:55, 24 November 2011


Multi-FASTA file of APSES domains in six fungal reference species.

This page collects APSES domain sequences from six fungal species that are used as reference species for the course. The species are:

  • Aspergillus nidulans (ASPNI)
  • Candida albicans (CANAL)
  • Neurospora crassa (NEUCR)
  • Saccharomyces cerevisiae (SACCE)
  • Schizosaccharomyces pombe (SCHPO)
  • Ustilago maydis (USTMA)


Executing the PSI-BLAST search

Defining the APSES Domain sequence
  1. Navigate to the NCBI BLAST page;
  2. Follow the link to protein BLAST and enter the yeast Mbp1 refseq ID NP_010227 into the input form;
  3. Select the PSI-BLAST algorithm to search for domains in the sequence and Run BLAST;
  4. Click on the graphical summary of the result to access the CDD conserved domains report for the sequence;
  5. Click on the (+) sign next to the link to the KilA-N (pfam04383) domain to display the query/profile alignment. This is what it looks like:
                          10        20        30        40        50        60        70        80
                  ....*....|....*....|....*....|....*....|....*....|....*....|....*....|....*....|
gi 6320147     19 IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQ---------------GGFGKYQGTWVPLNIA 83
Cdd:pfam04383   3 YNDFEIIIRRDKDGYINATKLCKAAGAKGKRFRNWLRLESTKELIEELSkennpdkliiienrkGKGGRLQGTYVHPDLA 82


                          90
                  ....*....|....
gi 6320147     84 KQLA----EKFSVY 93
Cdd:pfam04383  83 LAIAswisPEFALK 96

This gives us the following APSES domain sequence:

>Yeast Mbp1 APSES domain (AA 19..93 of NP_010227)
IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQG 
GFGKYQGTWVPLNIAKQLAEKFSVY
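
The step from the alignment to the domain sequence can be sketched in a few lines of Python. This is only an illustration, not the procedure used on this page: it assumes the query rows (gi 6320147) have been copied by hand from the CDD alignment above, and the helper name ungap is made up for this example.

```python
def ungap(aligned_rows):
    """Join the aligned query rows and drop the gap characters ('-')."""
    return "".join(row.replace("-", "") for row in aligned_rows)


# Query rows copied from the CDD alignment shown above (residues 19..83 and 84..93).
query_rows = [
    "IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQ---------------GGFGKYQGTWVPLNIA",
    "KQLA----EKFSVY",
]

apses_domain = ungap(query_rows)

# Residues 19..93 span 93 - 19 + 1 = 75 positions, so the ungapped
# sequence should have exactly that length.
print(len(apses_domain))
print(apses_domain)
```

Checking the length against the residue range is a cheap way to catch a row that was dropped or copied twice.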


Searching for APSES domains

A PSI-BLAST search was executed, searching in the refseq subset of the NCBI protein database and restricting the species to the six fungal reference species plus Escherichia coli. The latter was chosen to retrieve the KilA-N domain sequence, which we need as an outgroup for phylogenetic analysis.

The search converged after 5 iterations. In each iteration, matches covering less than 80% of the query length were manually removed, even if they had low E-values. Care was also taken to exclude false positives and thus avoid profile corruption, and hits with E > 10^-4 were removed as well. The check-boxes next to the alignments were used to select sequences with > 80% coverage of the query, and only the highest-scoring KilA-N domain protein was kept. Clicking on Get selected sequences created a results page of 27 sequences. These were then displayed in FASTA (text) format and their headers were slightly edited to create a dataset of Reference APSES full length proteins.
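
The two manual selection criteria (coverage and E-value) can be written down as a small filter. This is a sketch under stated assumptions: the Hit record, the accessions and their values are hypothetical, coverage is taken to mean the aligned span of the query divided by the query length, and a query length of 75 (the APSES domain defined above) is assumed. On this page the selection was done by hand with the check-boxes on the results page.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str     # hypothetical identifier, for illustration only
    query_start: int   # first aligned query position (1-based)
    query_end: int     # last aligned query position (inclusive)
    evalue: float


def keep_hit(hit, query_length, min_coverage=0.8, max_evalue=1e-4):
    """Return True if the hit covers enough of the query and has a low enough E-value."""
    coverage = (hit.query_end - hit.query_start + 1) / query_length
    return coverage >= min_coverage and hit.evalue <= max_evalue


QUERY_LENGTH = 75  # assumed: length of the APSES domain query

hits = [
    Hit("hypothetical_1", 1, 75, 1e-30),   # full coverage, low E-value
    Hit("hypothetical_2", 20, 60, 1e-12),  # low E-value, but only 41 of 75 positions
    Hit("hypothetical_3", 3, 72, 0.01),    # good coverage, but E-value above 10^-4
]

selected = [h.accession for h in hits if keep_hit(h, QUERY_LENGTH)]
print(selected)
```

The second hit illustrates the point made above: a match can have a very low E-value and still be removed because it covers too little of the query.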


Constructing the multi-FASTA file

A multi-FASTA file is the default input format for many MSA programs; it is simply a file that contains more than one FASTA-formatted sequence. To generate the multi-FASTA file of APSES domains, we could have simply edited the full length proteins manually, but there is a simpler way to achieve this. The PSI-BLAST search has already defined the sequences from each source protein that are similar to the APSES search profile. We only need to extract them in a convenient way from the search results. NCBI offers a number of options to format the BLAST result page. They are reached from a link at the top of the BLAST results page, "Reformat these Results". The principal format options are:

  • Pairwise: the default
  • Pairwise with identities: showing only differences to the query sequence
  • query anchored with/without identities: looks something like a multiple sequence alignment, hyphens for gaps, insertions relative to the query are displayed below the sequence
  • flat-query anchored with/without identities: This now looks like a multiple sequence alignment (in fact it is one, since all sequences are aligned to the profile).
  • hit-table: this gives only the numerical parameters describing the quality of the matches.

When we select the flat-query anchored with/without identities option, it is reasonably straightforward to obtain the aligned sequences, copy and paste them into a Word document, and convert that into multi-FASTA format with a few Edit > Replace commands.
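
As an alternative to the Edit > Replace route, the same conversion can be sketched in Python. This is not what was done for this page, and the row layout is an assumption: each alignment row is taken to consist of four whitespace-separated fields (identifier, start, aligned segment, end), modelled on the alignment shown earlier, and rows with any other number of fields are skipped. The function name rows_to_multifasta is made up for this example.

```python
def rows_to_multifasta(rows, width=70):
    """Convert alignment rows into multi-FASTA text.

    Segments belonging to the same identifier are concatenated in order,
    gap characters ('-') are removed, and each sequence is wrapped at
    `width` characters per line.
    """
    sequences = {}  # dicts keep insertion order, so the record order is preserved
    for row in rows:
        fields = row.split()
        if len(fields) != 4:
            continue  # skip blank lines, rulers and anything not in the assumed layout
        seq_id, _start, segment, _end = fields
        sequences[seq_id] = sequences.get(seq_id, "") + segment.replace("-", "")

    records = []
    for seq_id, seq in sequences.items():
        lines = [seq[i:i + width] for i in range(0, len(seq), width)]
        records.append("\n".join([">" + seq_id] + lines))
    return "\n\n".join(records) + "\n"


# Two alignment blocks for one sequence, in the assumed four-field layout.
example_rows = [
    "NP_010227 19 IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQ---------------GGFGKYQGTWVPLNIA 83",
    "",
    "NP_010227 84 KQLA----EKFSVY 93",
]

print(rows_to_multifasta(example_rows))
```

The default width of 70 matches the line length used for the final sequences on this page; headers produced this way still need to be renamed by hand or by a separate step.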

Renaming sequences

To make the interpretation of alignments and gene trees easier, all yeast sequences were labelled with their gene name (e.g. SOK2_SACCE). All other sequences were named APS1_, APS2_, APS3_ ... as required (e.g. APS1_USTMA). Note that relabelling the sequences in this way does not change the data or its interpretation; it simply makes the results easier to read.
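
The naming scheme can be expressed as a short function. This is a sketch only: it assumes that the non-yeast sequences are numbered per species in the order in which they are listed, and that yeast gene names are written in upper case. The function name make_labels and the input format (species code plus an optional yeast gene name) are made up for this example.

```python
from collections import defaultdict


def make_labels(entries):
    """Create sequence labels from (species_code, yeast_gene_or_None) pairs.

    Sequences with a yeast gene name are labelled GENE_SPECIES; all others
    are labelled APS1_, APS2_, ... counted separately for each species.
    """
    counters = defaultdict(int)
    labels = []
    for species, yeast_gene in entries:
        if yeast_gene is not None:
            labels.append(f"{yeast_gene.upper()}_{species}")
        else:
            counters[species] += 1
            labels.append(f"APS{counters[species]}_{species}")
    return labels


entries = [
    ("USTMA", None),
    ("NEUCR", None),
    ("NEUCR", None),
    ("SACCE", "Mbp1"),
]
print(make_labels(entries))
```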

The final 27 APSES domain reference sequences

>APS1_USTMA XP_762343 UM06196
IINNVAVMRRRSDDWLNATQILKVVGLDKPQRTRVLEREIQKGIHEKVQGGYGKYQGTWIPLDVAIELAE
RYNI

>APS1_NEUCR XP_962967 NCU07587
VNNVAVMRRQKDGWVNATQILKVANIDKGRRTKILEKEIQIGEHEKVQGGYGKYQGTWIPFERGLEVCRQ
YGV

>APS2_NEUCR XP_955821 NCU07246
VMRRRHDDWVNATHILKAAGFDKPARTRILEREVQKDTHEKIQGGYGRYQGTWIPLEQAEALARRNNIY

>MBP1_SACCE NP_010227 Mbp1
IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAE
KFSVY

>APS1_CANAL XP_712970 potential DNA binding component of SBF
MMNESSIMRRCKDDWVNATQILKCCNFPKAKRTKILEKGVQQGLHEKVQGGFGRFQGTWIPLEDARKLAK
TYGV

>APS1_ASPNI XP_660758.1  AN3154
IGTDSVMRRRSDDWINATHILKVAGFDKPARTRILEREVQKGVHEKVQGGYGKYQGTWIPLQEGRQLAER
NNI

>APS1_SCHPO NP_595496 MBF transcription factor complex subunit Res1
INGFPLMKRCHDNWLNATQILKIAELDKPRRTRILEKFAQKGLHEKIQGGCGKYQGTWVPSERAVELAHE
YNVF

>APS2_SCHPO NP_593032 MBF transcription factor complex subunit Res2
IKGVSVMRRRRDSWLNATQILKVADFDKPQRTRVLERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATK
YKV

>APS2_CANAL XP_723071 potential DNA binding component of MBF
VTSEGPIMRRKKDSWINATHILKIAKFPKAKRTRILEKDVQTGIHEKVQGGYGKYQGTYVPLDLGAAIAR
NFGVY

>APS2_ASPNI XP_664319 hypothetical protein AN6715
VNGVAVMKRRSDGWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVELCRE
YHV

>APS2_USTMA XP_761485 UM05338
VRGIAVMRRRGDGWLNATQILKIAGIEKTRRTKILEKSILTGEHEKIQGGYGKFQGTWIPLQRAQQVAAE
YNV

>SWI4_SACCE NP_011036 Swi4p
TKIVMRRTKDDWINITQVFKIAQFSKTKRTKILEKESNDMQHEKVQGGYGRFQGTWIPLDSAKFLVNKYE
I

>APS3_SCHPO NP_596132 MBF transcription factor complex subunit Cdc10
GDNVALRRCPDSYFNISQILRLAGTSSSENAKELDDIIESGDYENVDSKHPQIDGVWVPYDRAISIAKR
YGVY

>APS3_CANAL XP_714237 potential DNA binding regulator of filamentous growth
NNVSVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREQI

>SOK2_SACCE NP_013729 Sok2p
NGISVVRRADNDMVNGTKLLNVTKMTRGRRDGILKAEKIRHVVKIGSMHLKGVWIPFERALAIAQREKI

>APS3_ASPNI XP_663440 STUA CELL PATTERN FORMATION-ASSOCIATED PROTEIN
GVCVARREDNGMINGTKLLNVAGMTRGRRDGILKSEKVRNVVKIGPMHLKGVWIPFDRALEFANKEKI

>PHD1_SACCE NP_012881 Phd1p
NGISVVRRADNNMINGTKLLNVTKMTRGRRDGILRSEKVREVVKIGSMHLKGVWIPFERAYILAQREQI

>APS4_CANAL XP_710918 CaO19.5210
LNNHWVIWDYETGWVHLTGIWKASLTIDGSNVSPSHLKADIVKLLESTPKEYQQYIKRIRGGFLKIQGTW
LPYKLCKILARRFCYY

>APS3_NEUCR XP_960837 NCU01414
GICVARREDNAMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPFERALDFANKEKI

>APS5_CANAL XP_711513 potential DNA binding protein
NILVSRREDTNYINGTKLLNVIGMTRGKRDGILKTEKIKNVVKVGSMNLKGVWIPFDRAYEIARNEGV

>APS4_ASPNI XP_663009 AN5405
TVMWDYNIGLVRTTHLFKCNDYSKTTPAKMLNQNPGLRDICHSITGGALAAQGYWMPYEAAKAIAATFC

>APS3_USTMA XP_760925 UM04778
VRGHTMMIDVDTSFVRFTSITQALGKNKVNFGRLVKTCPALDPHITKLKGGYLSIQGTWLPFDLAKELSR
R

>APS4_SCHPO NP_596166
HFLMRMAKDSSISATSMFRSAFPKATQEEEDLEMRWIRDNLNPIEDKRVAGLWVPPADALALAKDYSM

>KILA_ESCCO ZP_07189117 KilA-N domain protein
IDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQSFKGGRPENQGTW
VHPDIAINLAQ

>APS6_CANAL XP_723412 potential transcriptional co-activator
HGEIIVLRRVQDSFVNVTQLFQILIKLEVLPTSQVDNYFDNEILSNLKYFGSSSNTPQYLDLRKHQNIYL
QGIWIPYDKAVNLALKFDIY

>APS4_NEUCR XP_962267 NCU06560
FLMRRSQDGYISATGMFKATFPYASQEEEEAERKYIKSIPTTSSEETAGNVWIPPEQALILAEEYQI

>APS5_ASPNI XP_657766 AN0162
TYFLMRRSKDGYVSATGMFKIAFPWAKLEEERSEREYLKTRPETSEDEIAGNVWISPVLALELAAEYKMY
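
A file like the one above can be checked with a small reader that rejoins wrapped sequence lines and counts the records per species. This is a minimal sketch, assuming that the label is the first word of each header and that the species code follows the last underscore; the function names are made up for this example, and the excerpt contains only two of the records.

```python
from collections import Counter


def read_multifasta(text):
    """Parse multi-FASTA text into a dict mapping label -> sequence."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:].split()[0]  # first word of the header
            records[label] = ""
        elif label is not None:
            records[label] += line       # rejoin wrapped sequence lines
    return records


def count_by_species(records):
    """Count records per species code (the part of the label after the last '_')."""
    return Counter(label.rsplit("_", 1)[1] for label in records)


excerpt = """>APS1_USTMA XP_762343 UM06196
IINNVAVMRRRSDDWLNATQILKVVGLDKPQRTRVLEREIQKGIHEKVQGGYGKYQGTWIPLDVAIELAE
RYNI

>APS3_CANAL XP_714237 potential DNA binding regulator of filamentous growth
NNVSVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREQI
"""

records = read_multifasta(excerpt)
print(len(records))
print(count_by_species(records))
```

Run on the complete file, the record count should equal the number of sequences given in the heading of this section.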