Difference between revisions of "Reference APSES domains (reference species)"

Revision as of 01:04, 20 November 2006


Multi FASTA file of all APSES domains in fungal proteins.

Executing the PSI-BLAST search

A PSI-BLAST search was executed with default parameters, searching in the RefSeq database, restricted to Fungi. The query sequence, the Mbp1 APSES domain, was defined as follows:

>Yeast Mbp1 APSES domain (AA 24..107 of NP_010227)
SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKY
QGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDG

The search returned 81 hits with significant E-values by the 5th iteration. Five of these were from the organism Chaetomium globosum and were removed from the list, since this is not one of the organisms we are studying. Six hits were aligned along only a part of the APSES domain. For five of these hits, reasonable similarity to the whole APSES domain was independently verified by manually performing a Needleman-Wunsch optimal alignment with the Mbp1 APSES domain sequence (EMBOSS NEEDLE using EBLOSUM30, default gap parameters).

However the match to the Neurospora crassa protein XP_962373 suggested an incorrect gene model. Consider the alignment:

QUERY       1 SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKY     50
                                          .:.:.:.||:....:.||..|.
XP_962373   1                             MLNQNPGLKDIAYSITGGAIKA     22

QUERY      51 QGTWVPLNIAKQLAEKF--SVYDQLKPLF--DFTQ---TDG              84
              ||.|.|:..||::...|  .:..:|.|||  ||..   :.|         
XP_962373  23 QGYWMPYACAKAVCATFCYQIAGALIPLFGPDFPSECISPGEPRYGIMII     72

In this situation you have to suspect that the gene-finding algorithm skipped part of the N-terminus, or that the sequence was derived from a partial mRNA. This sequence was removed from the analysis.
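
The global-alignment check described above can be illustrated with a minimal Needleman-Wunsch sketch. Note the assumptions: it scores with a plain match/mismatch value and a linear gap penalty (the parameter names and defaults here are illustrative), whereas the analysis itself used EMBOSS NEEDLE with EBLOSUM30 and its default gap parameters, so the scores are not comparable.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b).

    Scoring is a simplification (assumed for illustration): identity
    scoring and a linear gap penalty instead of a substitution matrix.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

A hit that covers the whole domain should align end to end with few long gap runs; a truncated sequence such as the one shown above will instead force a long terminal gap.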

Further, XP_712876 and XP_712970 were found to be identical sequences from the same organism. Only one of these duplicates was kept.
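
Identical entries of this kind can be found automatically. The following is a minimal sketch, assuming the sequences are available as multi-FASTA text; the function names `read_fasta` and `find_duplicates` are made up for this example.

```python
from collections import defaultdict


def read_fasta(text):
    """Parse multi-FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def find_duplicates(records):
    """Return groups of headers whose sequences are exactly identical."""
    by_seq = defaultdict(list)
    for header, seq in records:
        by_seq[seq].append(header)
    return [headers for headers in by_seq.values() if len(headers) > 1]
```

The comparison is only as broad as the text it is given: run on full-length proteins it finds duplicate database entries, whereas identical domain sequences alone do not imply that the full-length proteins are identical.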

This gave a total of 74 APSES domain sequences for analysis.


A multi-FASTA file

Since we are interested in only the APSES domain, we need to display the search results in an appropriate format. If we navigate to the page from where we sent the BLAST query, we have several options to display search results:

  • Pairwise: the default
  • Pairwise with identities: showing only differences to the query sequence
  • query anchored with/without identities: looks something like a multiple sequence alignment, hyphens for gaps, insertions relative to the query are displayed below the sequence
  • flat-query anchored with/without identities: this now looks like a multiple sequence alignment (in fact it is one: all sequences are aligned to the profile).
  • hit-table: this gives only the numerical parameters describing the quality of the matches.

Using the flat-query anchored with/without identities option, it is reasonably straightforward to obtain the aligned sequences, copy and paste them into a Word document and convert that into a multi-FASTA format with a few Edit > Replace commands. Of course, the sequences for which only partial matches were found need to be completed "by hand" (from the results of the pairwise sequence alignments that were used above to validate these sequences).
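
The same conversion can be sketched in a few lines of code instead of Edit > Replace commands. This is a sketch under a stated assumption about the input: each alignment row is taken to consist of four whitespace-separated fields (identifier, start position, aligned residues with hyphens for gaps, end position). The actual BLAST text output may differ in detail, and the function name `anchored_to_fasta` is made up for this example.

```python
def anchored_to_fasta(text):
    """Convert flat query-anchored alignment rows into multi-FASTA text.

    Assumed row layout (not verified against real BLAST output):
        identifier  start  ALIGNED-RESIDUES  end
    Rows that do not match this layout are skipped.
    """
    fragments = {}  # identifier -> list of ungapped pieces, in input order
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue
        ident, start, aligned, end = fields
        if not (start.isdigit() and end.isdigit()):
            continue
        # Remove gap characters; pieces from later alignment blocks are appended
        fragments.setdefault(ident, []).append(aligned.replace("-", ""))
    return "".join(
        f">{ident}\n{''.join(parts)}\n" for ident, parts in fragments.items()
    )
```

Sequences with only partial matches still come out truncated and need to be completed separately, as described above.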


Renaming sequences

To support the interpretation of alignments and gene trees, the Mbp1 orthologues for all species were named accordingly (e.g. MBP1_ASPFU). All yeast genes were given the yeast gene name (e.g. SOK2_SACCE). All other sequences were named with the last four digits of their RefSeq ID and a five-character species code according to their species. This is a pain to do by hand, so I wrote a little Perl script to parse this information from the original BLAST report and modify the headers in the multi-FASTA file accordingly. However, note that renaming sequences is somewhat "cosmetic" and does not change the data or its interpretation.
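
The header rewriting step can be sketched as follows. This is not the Perl script mentioned above: it assumes the names are supplied as a hand-made mapping from RefSeq accession to display name instead of being parsed from the BLAST report, and the function name `rename_headers` is made up for this example.

```python
def rename_headers(fasta_text, names):
    """Prefix FASTA headers with a display name looked up by accession.

    `names` maps the first word of a header (assumed to be the RefSeq
    accession) to a name such as "MBP1_SACCE". Headers without an entry
    in the mapping, and all sequence lines, are left unchanged.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            accession = fields[0] if fields else ""
            if accession in names:
                line = ">" + names[accession] + " " + " ".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


example = ">NP_010227 024..107\nSIMKRKKDDW\n>XP_955821 037..118\nVMRRRHDDWV\n"
print(rename_headers(example, {"NP_010227": "MBP1_SACCE"}))
```

Because only header lines are touched, the sequence data are guaranteed to be identical before and after renaming.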


The final 74 sequences

>MBP1_SACCE NP_010227 024..107
SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDG
>MBP1_YARLI XP_500257 022..105
AVMRRKSDGWVNATHILKVAGFDKPQRTRILEKEVQKGVHEKVQGGYGKYQGTWVPLERAREIATLYDVDSHLAPIFNYDDEDG
>XP_955821 037..118
VMRRRHDDWVNATHILKAAGFDKPARTRILEREVQKDTHEKIQGGYGRYQGTWIPLEQAEALARRNNIYERLKPIFEFQPGN
>XP_569090 036..117
AVMRRRSDAYLNATQILKVAGFDKPQRTRVLEREVQKGEHEKVQGGYGKYQGTWIPIERGLALAKQYGVEDILRPIIDYVPT
>MBP1_ASPNI XP_660758 028..110
SVMRRRSDDWINATHILKVAGFDKPARTRILEREVQKGVHEKVQGGYGKYQGTWIPLQEGRQLAERNNILDKLLPIFDYVAGD
>MBP1_KLULA XP_454189 025..108
SIMKRKADNWVNATHILKAAKFPKAKRTRILEKEVITDTHEKVQGGFGKYQGTWIPLELASKLAEKFEVLDELKPLFDFTQQEG
>MBP1_GIBZE XP_384396 045..129
AVMRRRNDSWLNATQILKVAGVDKGKRTKILEKEIQTGEHEKVQGGYGKYQGTWIKFERGLQVCRQYGVEELLRPLLTYDMGQDG
>MBP1_ASPTE XP_001213217 028..110
SVMRRRADDWINATHILKVAGFDKPARTRILEREVQKGVHEKVQGGYGKYQGTWIPLPEGRLLAERNNIIDKLRPIFDYVAGD
>MBP1_CANAL XP_723071 026..108
IMRRKKDSWINATHILKIAKFPKAKRTRILEKDVQTGIHEKVQGGYGKYQGTYVPLDLGAAIARNFGVYDVLKPIFEFQYIEG
>MBP1_CANGL XP_445458 024..107
SIMKRKNDGWVNATHILKAANFAKAKRTRILEKEVLKEMHEKVQGGFGKYQGTWVPLNIAINLAEKFDVYQDLKPLFDFSEENG
>XP_501770 036..116
AVMRRRTDSSLNATQILKVAGVEKSKRTKILEKEILTGAHEKVQGGYGKYQGTWIPYERGVDLCRQYSVYDVLQPLLAFDP
>XP_362974 121..199
VMRRRVDDWINATHILKAAGFDKPARTRILEREVQKDQHEKVQGGYGKYQGTWIPLEAGEALAHRNNIFDRLRPIFEFS
>XP_761485 182..262
AVMRRRGDGWLNATQILKIAGIEKTRRTKILEKSILTGEHEKIQGGYGKFQGTWIPLQRAQQVAAEYNVSHLLQPILEFDP
>MBP1_USTMA XP_762343 026..107
AVMRRRSDDWLNATQILKVVGLDKPQRTRVLEREIQKGIHEKVQGGYGKYQGTWIPLDVAIELAERYNIQGLLQPITSYVPS
>XP_390560 040..120
VMRRRSDDWINATHILKAAGFDKPARTRILERDVQKDVHEKIQGGYGKYQGTWIPLESGQALAERHSVIDRLRPIFEYVQG
>XP_754232 001..081
MRRRGDDWINATHILKVAGFDKPARTRILEREVQKGTHEKVQGGYGKYQGTWIPLHEGRLLAERNNIIDKLRPIFDYVAGD
>MBP1_CRYNE XP_570545 133..214
SVMRRASDSWVNATQILKVAGVHKSARTKILEKEVLNGIHEKIQGGYGKYQGTWVPLDRGRDLAEQYGVGSYLSSVFDFVPS
>MBP1_NEUCR XP_962967 071..155
AVMRRQKDGWVNATQILKVANIDKGRRTKILEKEIQIGEHEKVQGGYGKYQGTWIPFERGLEVCRQYGVEELLSKLLTHNRGQEG
>MBP1_DEBHA XP_458784 027..109
IMRRKLDSWINATHILKIAKFPKAKRTRILEKDVQTGVHEKVQGGYGKYQGTYVPLDLGADIAKNFGVFDSLRPIFEFTYVEG
>XP_712876 006..088
SIMRRCKDDWVNATQILKCCNFPKAKRTKILEKGVQQGLHEKVQGGFGRFQGTWIPLEDARRLAKTYGVTEELAPVLFLDFSD
>MBP1_MAGGR XP_365024 131..210
AVMKRIGDSKLNATQILKVAGVEKGKRTKILEKEIQTGEHEKVQGGYGKYQGTWIKYERALEVCRQYGVEELLRPLLEYN
>XP_664319 119..198
AVMKRRSDGWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVELCREYHVEELLRPLLEYD
>MBP1_ASPFU XP_748947 105..184
AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVELCREYHVEELLRPLLEYD
>MBP1_SCHPO NP_593032 027..110
SVMRRRRDSWLNATQILKVADFDKPQRTRVLERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPILSLDIDEG
>XP_001215548 007..086
AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVDLCREYHVEELLRPLLEYD
>NP_595496 026..106
LMKRCHDNWLNATQILKIAELDKPRRTRILEKFAQKGLHEKIQGGCGKYQGTWVPSERAVELAHEYNVFDLIQPLIEYSGS
>XP_457246 028..109
IMRRCKDDWVNATQILKCCNFPKAKRTKILEKGVQQGLHEKIQGGYGRFQGTWIPLADAQRLAASYGVTPDLAPVLYLDASD
>MBP1_EREGO NP_986147 031..114
SIMKRKADDWVNATHILKAAKFAKAKRTRILEKEVIKDTHEKVQGGFGKYQGTWVPLDIARRLAQKFEVLEELRPLFDFTRRDG
>NP_986370 043..124
VMRRLHDDWVNITQVFKVATFSKTQRTKILEKESADISHEKIQGGYGRFQGTWIPLDSAKGLVAKYEITDIVVLTVINFQPD
>SWI4_SACCE NP_011036 060..141
VMRRTKDDWINITQVFKIAQFSKTKRTKILEKESNDMQHEKVQGGYGRFQGTWIPLDSAKFLVNKYEIIDPVVNSILTFQFD
>XP_454890 119..200
IMRRCNDNWLNITQVFKAGSFTKAQRTKILEKEANEIKHEKIQGGYGRFQGTWIPWESTKYLVEKYNINNKVVKRIVEFIPD
>XP_444966 062..140
VMRRTMDDWVNVTQVFKIAQFSKTQRTKILEKESTNMKHEKVQGGYGRFQGTWVPLEAAKFMTTKYNIDNPVVNTILSF
>XP_459785 307..380
SVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREGIVDLLYPLF
>XP_663009 131..216
TVMWDYNIGLVRTTHLFKCNDYSKTTPAKMLNQNPGLRDICHSITGGALAAQGYWMPYEAAKAIAATFCWKIRFALTPLFGDNFPD
>SOK2_SACCE NP_013729 436..509
SVVRRADNDMVNGTKLLNVTKMTRGRRDGILKAEKIRHVVKIGSMHLKGVWIPFERALAIAQREKIADYLYPLF
>XP_449680 143..216
TVVRRADNDMVNGTKLLNVTGMTRGRRDGILKNEPVRDVVKGGPMTLKGVWIPIDRARAIARQEGIEQWLYPLF
>NP_983001 352..425
SVVRRADNDMINGTKLLNVAKMTRGRRDGILKAEKVRHVVKIGSMHLKGVWIPFERALALAQREKIVDMLFPLF
>XP_714197 227..300
SVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREQIVDMLYPLF
>XP_714237 228..301
SVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREQIVDMLYPLF
>XP_001218256 139..211
VARREDNSMINGTKLLNVAGMTRGRRDGILKSEKIRHVVKIGPMHLKGVWIPFERALEFANKEKITDLLYPLF
>XP_663440 152..224
VARREDNGMINGTKLLNVAGMTRGRRDGILKSEKVRNVVKIGPMHLKGVWIPFDRALEFANKEKITDLLYPLF
>XP_502292 285..357
VARREDNDMINGTKLLNVAGMTRGRRDGILKGEKLRHVVKAGAMHLKGVWIPYDRALEFANKEKIIDLLFPLF
>XP_501102 130..202
VARREDNNMINGTKLLNVVGMTRGRRDGILKTEKIRHVVKIGAMHLKGVWIPYERALAFAQRERIVDVLYPLF
>XP_755125 152..224
VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPFERALEFANKEKITDLLYPLF
>PHD1_SACCE NP_012881 208..281
SVVRRADNNMINGTKLLNVTKMTRGRRDGILRSEKVREVVKIGSMHLKGVWIPFERAYILAQREQILDHLYPLF
>XP_448847 224..297
SVVRRADNDMINGTKLLNVTKMTRGKRDGILRSEKYRKVVKIGSMHLKGVWIPFERALFIAKREKIVDLLYPLF
>XP_505499 080..165
IIWDYHTGYVHLTGLWKAIGNSKADIVKLIDNSPDLEAVIRRVRGGYLKIQGTWVPYDIARALASRTCYFIRFALIPLFGQDFPGT
>XP_455299 386..459
SVVRRADNDMINGTKLLNVTRMTRGRRDGILKAEKIRHVVKIGSMHLKGVWIPFERALVMAQREKIVDLLYALF
>XP_390305 226..298
VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPYDRALDFANKEKITELLYPLF
>XP_960837 139..211
VARREDNAMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPFERALDFANKEKITELLYPLF
>XP_368552 127..199
VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKMRHVVKIGPMHLKGVWIPFERALDFANKEKITELLYPLF
>XP_460447 213..285
VSRREDTNYVNGTKLLNVAGMTRGKRDGILKTEKTKSVVKVGAMNLKGVWIPFERASEIARNEGIDGLLYPLF
>XP_389978 139..218
AVMWDYNIGLVRMTPFFKCRGYGKTIPAKMLGLNPGLKEITHSITGGSIAAQGYWMPYRCAKAICATFCHPIAGALIPIF
>XP_711513 469..541
VSRREDTNYINGTKLLNVIGMTRGKRDGILKTEKIKNVVKVGSMNLKGVWIPFDRAYEIARNEGVDSLLYPLF
>NP_596132 088..165
LRRCPDSYFNISQILRLAGTSSSENAKELDDIIESGDYENVDSKHPQIDGVWVPYDRAISIAKRYGVYEILQPLISFN
>XP_751244 151..230
VMWDYNIGLVRTTHLFKCNDYSKMLNANPGLREICHSITGGALAAQGYWMPYEAAKAVAATFCWKIRHALTPLFGLDFPS
>XP_760925 057..143
TMMIDVDTSFVRFTSITQALGKNKVNFGRLVKTCPALDPHITKLKGGYLSIQGTWLPFDLAKELSRRIAWEIRDHLVPLFGYDFPST
>XP_001212599 130..218
IMWDYNIGLVRTTPLFRSQNYSKTTPAKVLDANPGLREISHSITGGAIVAQDKPGYWIPFEAAKAVAATFCWRIRYALTPIFGLDFPSQ
>XP_459773 187..274
IIWDYETGFVHLTGIWKASINDEVNTHRNLKADIVKLLESTPKQYHQHIKRIRGGFLKIQGTWLPFDLCKMLAKRFCYHIRFQLIPIF
>XP_710918 256..352
VIWDYETGWVHLTGIWKASLTIDGSNVSPSHLKADIVKLLESTPKEYQQYIKRIRGGFLKIQGTWLPYKLCKILARRFCYYLRYSLIPIFGTDFPDS
>XP_459901 067..158
ILRRVQDSYINISQLFSILLKIGHLSEAQLTNFLNNEILTNTQYLSSGGSNPQFNDLRNHEVRDLRGLWIPYDRAVSLALKFDIYELAKSLF
>XP_657766 089..163
LMRRSKDGYVSATGMFKIAFPWAKLEEERSEREYLKTRPETSEDEIAGNVWISPVLALELAAEYKMYDWVRALLD
>XP_385459 077..154
LMRRSYDGFVSATGMFKASFPYAEASDEDAERKYIKSLPTTSHEETAGNVWIPPEQALILAEEYKISPWIRALLDPTP
>XP_962267 085..162
LMRRSQDGYISATGMFKATFPYASQEEEEAERKYIKSIPTTSSEETAGNVWIPPEQALILAEEYQITPWIRALLDPSD
>XP_753510 089..163
LMRRSKDGYVSATGMFKIAFPWAKLEEEKAEREYLKTREGTSEDEIAGNIWVSPLLALELAKEYQMYDWVRALLD
>XP_363762 084..161
LMRRSSDGYVSATGMFKATFPYADAEDEEAERNYIKSLPATSKEETAGNVWISPDQALALAEEYSIATWIRALLDPTD
>XP_723412 087..178
VLRRVQDSFVNVTQLFQILIKLEVLPTSQVDNYFDNEILSNLKYFGSSSNTPQYLDLRKHQNIYLQGIWIPYDKAVNLALKFDIYEITKKLF
>NP_596166 062..140
LMRMAKDSSISATSMFRSAFPKATQEEEDLEMRWIRDNLNPIEDKRVAGLWVPPADALALAKDYSMTPFINALLEASST
>XBP1_SACCE NP_012165 314..415
RDLICQSYKDFLINELGPDQIDLPNLNPANFTKRIRGGYIKIQGTWLPMEISRLLCLRFCFPIRYFLVPIFGPDFPKDCESWYLAHQNVTFASSTTGAGAAT
>XP_001216355 084..197
TYFLMDGYVSATGMFKIAFPWAKLDEERSEREYLKSREETSEDEIAGNVWISPKLALELAGEYQMYNWVRALLDPTDIVQSPSSAKKQITPPPRYDLPPIEAPTQLTATSTRS
>XP_369301 092..188
EEYTVMWDYGCGLVRMTHFFKCRGYTKTVPGKVLNQNHGLKDITYSITGGSISAQESPNFGRMVIDRELVAHATREAESMYGRSMQAQAQQQGPLR
>XP_455262 289..388
YGKLDKPSKKDSQQKWNKWFQRESFSTYIDLHWHKLNPTLSTLLGQSYDAKIPFERMVKRIRGGYIKIQGTWLPYPVSKELCSRFCYPLRYLLVPLFGPDFPEKCEYWY
>NP_983869 277..365
YTDVHWNQVDPTWKQRLCRLYQQEKNLDFTPEFQDCYKRIRGGYIKIQGTWLPMEICKRLCIRFCFPIRYFLVPIFGEGFLQECHNWYF
>XP_446482 295..390
STSNSSVNYLDFHWFDISEKVRSQIFEQFKQHLEKDRNVDCSTIPKAEEYIQRIRGGYIKIQGTWVPWYIAKLICIRFCFPIRYLLVPIFGEQFPV
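
Since every header in the listing above ends in a residue range such as 024..107, a simple consistency check is to compare the length implied by the range with the length of the sequence line. The following is a minimal sketch, assuming one sequence line per record as in the listing; the function name `check_ranges` is made up for this example.

```python
def check_ranges(fasta_text):
    """Return (header, expected_length, actual_length) for mismatched records.

    Assumes the last word of each header is a range "start..end" and that
    each record has exactly one sequence line.
    """
    problems = []
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
        elif line and header is not None:
            start, end = (int(x) for x in header.split()[-1].split(".."))
            expected = end - start + 1
            if len(line) != expected:
                problems.append((header, expected, len(line)))
            header = None
    return problems
```

An empty result means all ranges agree with their sequences; any reported record is worth rechecking by hand, for example where a partial match was completed manually.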