Difference between revisions of "Reference APSES domains (reference species)"

Revision as of 07:00, 23 November 2011

Multi FASTA file of APSES domains in six fungal reference species.

This page collects APSES domain sequences from six fungal species that are used as reference species for the course. The species are:

Aspergillus nidulans (ASPNI)
Candida albicans (CANAL)
Neurospora crassa (NEUCR)
Saccharomyces cerevisiae (SACCE)
Schizosaccharomyces pombe (SCHPO)
Ustilago maydis (USTMA)

Executing the PSI-BLAST search

Defining the APSES Domain sequence

Navigate to the NCBI BLAST page;
Follow the link to protein BLAST and enter the yeast Mbp1 refseq ID NP_010227 into the input form;
Select the PHI-BLAST algorithm to search for domains in the sequence and Run BLAST;
Click on the graphical summary of the result to access the CDD conserved domains report for the sequence;
Click on the (+) sign next to the link to KilA-N(pfam 04383) domain to display the query/profile alignment. This is what it looks like:

                          10        20        30        40        50        60        70        80
                  ....*....|....*....|....*....|....*....|....*....|....*....|....*....|....*....|
gi 6320147     19 IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQ---------------GGFGKYQGTWVPLNIA 83
Cdd:pfam04383   3 YNDFEIIIRRDKDGYINATKLCKAAGAKGKRFRNWLRLESTKELIEELSkennpdkliiienrkGKGGRLQGTYVHPDLA 82


                          90
                  ....*....|....
gi 6320147     84 KQLA----EKFSVY 93
Cdd:pfam04383  83 LAIAswisPEFALK 96

This gives us the following APSES domain sequence:

>Yeast Mbp1 APSES domain (AA 19..93 of NP_010227)
IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQG 
GFGKYQGTWVPLNIAKQLAEKFSVY
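
The domain sequence can be derived from the alignment by removing the gap characters from the two query rows and joining them. The following is a minimal Python sketch of that step; the rows are copied from the alignment above, and the variable names are only illustrative.

```python
# Query rows (gi 6320147) copied from the CDD alignment shown above.
query_rows = [
    "IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQ---------------GGFGKYQGTWVPLNIA",
    "KQLA----EKFSVY",
]
first, last = 19, 93  # residue numbers reported at the ends of the alignment

# Remove alignment gaps and join the rows into one ungapped sequence.
domain = "".join(row.replace("-", "") for row in query_rows)

# The ungapped length must equal the number of residues in the range.
assert len(domain) == last - first + 1

# Print as FASTA, 50 residues per line.
print(f">Yeast Mbp1 APSES domain (AA {first}..{last} of NP_010227)")
for i in range(0, len(domain), 50):
    print(domain[i:i + 50])
```

The length check is a quick way to confirm that the residue range in the header agrees with the sequence that was copied.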

Searching for APSES domains

A PSI-BLAST search was executed, searching in the refseq subset of the NCBI protein database and restricting the species to the six fungal reference species plus Escherichia coli. The latter was included to retrieve the KilA-N domain sequence, which we need as an outgroup for phylogenetic analysis.

The search converged after 5 iterations. In these iterations, matches covering less than 80% of the query length were manually removed, even if they had low E-values. To avoid including false positives, and thus corrupting the profile, hits with E > 10^-4 were also removed. The final result included 39 sequences. The check-boxes next to the alignments were used to select sequences with > 80% coverage of the query, and only the highest-scoring KilA-N domain protein was kept. Clicking on Get selected sequences created a results page of 29 sequences. These were then displayed in FASTA (text) format.
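
The two manual selection criteria (at least 80% coverage of the query, and an E-value no larger than 10^-4) can be written down as a simple filter. This is only a sketch of the rule, not the procedure that was used on the NCBI page: the `Hit` class and the three example hits are hypothetical placeholders, not real search results.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str    # placeholder identifier, not a real database entry
    aligned_len: int  # number of query residues covered by the match
    evalue: float


QUERY_LEN = 75       # AA 19..93 of NP_010227
MIN_COVERAGE = 0.80  # matches covering less of the query are removed
MAX_EVALUE = 1e-4    # hits with a larger E-value are removed


def keep(hit: Hit) -> bool:
    """Return True if the hit passes both selection criteria."""
    coverage = hit.aligned_len / QUERY_LEN
    return coverage >= MIN_COVERAGE and hit.evalue <= MAX_EVALUE


# Hypothetical hits, for illustration only.
hits = [
    Hit("hit_A", 75, 1e-40),  # full coverage, low E-value: kept
    Hit("hit_B", 40, 1e-20),  # low E-value, but covers only about 53%: removed
    Hit("hit_C", 70, 1e-2),   # good coverage, but E-value too high: removed
]
kept = [h.accession for h in hits if keep(h)]
print(kept)
```

Note that `hit_B` is removed despite its low E-value, which mirrors the rule that short matches are excluded regardless of their score.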

Constructing the multi-FASTA file

A multi-FASTA file is the default input format for many MSA programs; it is simply a file that contains more than one FASTA-formatted sequence.

The PSI-BLAST search has already defined the sequences from each source protein that are similar to the APSES search profile. We only need to extract them in a convenient way from the search results. NCBI offers a number of options to format the result page. They are available from a link at the top of the BLAST results page, "Reformat these Results". The principal format options are:

Pairwise: the default
Pairwise with identities: showing only differences to the query sequence
Query anchored with/without identities: looks something like a multiple sequence alignment, with hyphens for gaps; insertions relative to the query are displayed below the sequence
Flat-query anchored with/without identities: this now looks like a multiple sequence alignment (in fact it is one: all sequences are aligned to the profile)
Hit-table: this gives only the numerical parameters describing the quality of the matches

When we select the flat-query anchored with/without identities option, it is reasonably straightforward to obtain the aligned sequences: copy and paste them into a Word document and convert that into multi-FASTA format with a few Edit > Replace commands.
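
As an alternative to manual Edit > Replace commands, the same conversion can be scripted. The sketch below assumes that each alignment row of the pasted text has four whitespace-separated fields (identifier, start position, aligned fragment, end position) and that long alignments are split into several blocks that repeat the identifiers. That layout is an assumption about the pasted text, and the function name and the toy input are hypothetical.

```python
def flat_alignment_to_fasta(text: str) -> str:
    """Convert flat, query-anchored alignment rows into multi-FASTA text.

    Assumed row layout: <identifier> <start> <aligned fragment> <end>.
    Rows that do not fit this layout (rulers, blank lines) are skipped.
    """
    seqs = {}  # identifier -> ungapped sequence, in order of first appearance
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue
        name, start, fragment, end = fields
        if not (start.isdigit() and end.isdigit()):
            continue
        # Drop gap characters and append to what was collected so far.
        seqs[name] = seqs.get(name, "") + fragment.replace("-", "")
    return "".join(f">{name}\n{seq}\n" for name, seq in seqs.items())


# Toy input with made-up identifiers and fragments, in two alignment blocks.
example = """
seqA  1  MKRK-DDW  7
seqB  5  MRRKSDDW  12

seqA  8  VNAT  11
seqB  13 INAT  16
"""
print(flat_alignment_to_fasta(example), end="")
```

Removing the hyphens yields unaligned sequences, which is what an MSA program expects as input; to keep the PSI-BLAST alignment itself, the `replace` call would be omitted.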

Renaming sequences

To make the interpretation of alignments and gene trees easier, the Mbp1 orthologues for all species were labelled Mbp1_???? (e.g. Mbp1_ASPFU). All yeast sequences were labelled with their gene name (e.g. Sok2_SACCE). All other sequences were named according to the yeast gene they share the most identities with, where the last digit was replaced with A, B, C as required (e.g. SokA_ASHGO). Note that such relabelling of sequences does not change the data or its interpretation; it is just helpful. Finally, the sequences were sorted to have the Mbp1 orthologues first in the list, followed by all other sequences sorted by organism.
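
The labelling and sorting conventions can be summarised in a few lines of Python. This is a sketch of the rules as described above, not the script that produced the list; the function names are hypothetical, and the order of sequences within one organism is not specified here.

```python
def paralogue_label(yeast_gene: str, letter: str, species: str) -> str:
    """Build a label such as SokA_ASPNI from the most similar yeast gene.

    The trailing digit of the yeast gene name is replaced by a letter.
    """
    if not yeast_gene[-1].isdigit():
        raise ValueError(f"expected a trailing digit in {yeast_gene!r}")
    return f"{yeast_gene[:-1]}{letter}_{species}"


def sort_key(label: str):
    """Sort Mbp1 orthologues first, then everything else by organism code."""
    gene, species = label.split("_")
    return (gene != "Mbp1", species)


labels = [
    "Sok2_SACCE",
    paralogue_label("Sok2", "A", "ASPNI"),
    "Mbp1_NEUCR",
    "MbpA_NEUCR",
    "Mbp1_ASPNI",
]
print(sorted(labels, key=sort_key))
```

Because Python's sort is stable, sequences from the same organism keep their original relative order.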

The final 69 sequences

>Mbp1_SACCE (79  ids)  NP_010227    (024..102)
SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
>Mbp1_ASHGO (66  ids)  NP_986147    (031..109)
SIMKRKADDWVNATHILKAAKFAKAKRTRILEKEVIKDTHEKVQGGFGKYQGTWVPLDIARRLAQKFEVLEELRPLFDF
>Mbp1_ASPFU (49  ids)  XP_754232    (001..077)
MRRRGDDWINATHILKVAGFDKPARTRILEREVQKGTHEKVQGGYGKYQGTWIPLHEGRLLAERNNIIDKLRPIFDY
>Mbp1_ASPNI (50  ids)  XP_660758    (028..106)
SVMRRRSDDWINATHILKVAGFDKPARTRILEREVQKGVHEKVQGGYGKYQGTWIPLQEGRQLAERNNILDKLLPIFDY
>Mbp1_ASPTE (49  ids)  XP_001213217 (028..106)
SVMRRRADDWINATHILKVAGFDKPARTRILEREVQKGVHEKVQGGYGKYQGTWIPLPEGRLLAERNNIIDKLRPIFDY
>Mbp1_CANAL (53  ids)  XP_723071    (026..103)
IMRRKKDSWINATHILKIAKFPKAKRTRILEKDVQTGIHEKVQGGYGKYQGTYVPLDLGAAIARNFGVYDVLKPIFEF
>Mbp1_CANGL (71  ids)  XP_445458    (024..102)
SIMKRKNDGWVNATHILKAANFAKAKRTRILEKEVLKEMHEKVQGGFGKYQGTWVPLNIAINLAEKFDVYQDLKPLFDF
>Mbp1_COPCI (43  ids)  EAU84310     (025..103)
AVMRRRSDSWLNATQILKVAGFDKPQRTRVLEREVQKGEHEKVQGGYGKYQGTWIPLERGMQLAKQYNCEHLLRPIIEF
>Mbp1_CRYNE (47  ids)  XP_570545    (133..211)
SVMRRASDSWVNATQILKVAGVHKSARTKILEKEVLNGIHEKIQGGYGKYQGTWVPLDRGRDLAEQYGVGSYLSSVFDF
>Mbp1_DEBHA (50  ids)  XP_458784    (027..104)
IMRRKLDSWINATHILKIAKFPKAKRTRILEKDVQTGVHEKVQGGYGKYQGTYVPLDLGADIAKNFGVFDSLRPIFEF
>Mbp1_GIBZE (48  ids)  XP_390560    (040..117)
VMRRRSDDWINATHILKAAGFDKPARTRILERDVQKDVHEKIQGGYGKYQGTWIPLESGQALAERHSVIDRLRPIFEY
>Mbp1_KLULA (64  ids)  XP_454189    (025..103)
SIMKRKADNWVNATHILKAAKFPKAKRTRILEKEVITDTHEKVQGGFGKYQGTWIPLELASKLAEKFEVLDELKPLFDF
>Mbp1_MAGGR (48  ids)  XP_362974    (040..117)
VMRRRVDDWINATHILKAAGFDKPARTRILEREVQKDQHEKVQGGYGKYQGTWIPLEAGEALAHRNNIFDRLRPIFEF
>Mbp1_NEUCR (50  ids)  XP_955821    (037..114)
VMRRRHDDWVNATHILKAAGFDKPARTRILEREVQKDTHEKIQGGYGRYQGTWIPLEQAEALARRNNIYERLKPIFEF
>Mbp1_PICST (52  ids)  XP_001386821 (026..103)
IMRRKLDSWINATHILKIAKFPKAKRTRILEKDVQTGVHEKVQGGYGKYQGTYVPLELGRDIAKNFGVFDILKPIFDF
>Mbp1_SCHPO (43  ids)  NP_595496    (027..103)
MKRCHDNWLNATQILKIAELDKPRRTRILEKFAQKGLHEKIQGGCGKYQGTWVPSERAVELAHEYNVFDLIQPLIEY
>Mbp1_USTMA (41  ids)  XP_762343    (026..104)
AVMRRRSDDWLNATQILKVVGLDKPQRTRVLEREIQKGIHEKVQGGYGKYQGTWIPLDVAIELAERYNIQGLLQPITSY
>Mbp1_YARLI (49  ids)  XP_500257    (022..100)
AVMRRKSDGWVNATHILKVAGFDKPQRTRILEKEVQKGVHEKVQGGYGKYQGTWVPLERAREIATLYDVDSHLAPIFNY
>Swi4_ASHGO (58  ids)  NP_986370    (043..115)
VMRRLHDDWVNITQVFKVATFSKTQRTKILEKESADISHEKIQGGYGRFQGTWIPLDSAKGLVAKYEITDIVV
>Sok2_ASHGO (67  ids)  NP_983001    (352..425)
SVVRRADNDMINGTKLLNVAKMTRGRRDGILKAEKVRHVVKIGSMHLKGVWIPFERALALAQREKIVDMLFPLF
>MbpB_ASPFU (22  ids)  XP_751244    (151..225)
VMWDYNIGLVRTTHLFKCNDYSKMLNANPGLREICHSITGGALAAQGYWMPYEAAKAVAATFCWKIRHALTPLFG
>MbpA_ASPFU (40  ids)  XP_748947    (105..183)
AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVELCREYHVEELLRPLLEY
>Sok2_ASPFU (58  ids)  XP_755125    (152..224)
VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPFERALEFANKEKITDLLYPLF
>MbpB_ASPNI (19  ids)  XP_001392970 (124..203)
ISWDYNVGLVLTRSLFKCNGHPKTAPAKVLKMNPGLGDISHSITGGALVGQGYWMPFRAAKALATTFCWNIRFVLTPMFG
>SokB_ASPNI (21  ids)  XP_663009    (131..211)
TVMWDYNIGLVRTTHLFKCNDYSKTTPAKMLNQNPGLRDICHSITGGALAAQGYWMPYEAAKAIAATFCWKIRFALTPLFG
>MbpA_ASPNI (40  ids)  XP_001391313 (118..196)
AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVELCREYHVEELLRPLLEY
>SokA_ASPNI (56  ids)  XP_663440    (152..224)
VARREDNGMINGTKLLNVAGMTRGRRDGILKSEKVRNVVKIGPMHLKGVWIPFDRALEFANKEKITDLLYPLF
>Sok2_ASPNI (58  ids)  XP_001390623 (153..225)
VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPFERALEFANKEKITDLLYPLF
>MbpB_ASPTE (21  ids)  XP_001212599 (130..212)
IMWDYNIGLVRTTPLFRSQNYSKTTPAKVLDANPGLREISHSITGGAIVAQDKPGYWIPFEAAKAVAATFCWRIRYALTPIFG
>MbpA_ASPTE (40  ids)  XP_001215548 (007..085)
AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVDLCREYHVEELLRPLLEY
>Sok2_ASPTE (59  ids)  XP_001218256 (139..211)
VARREDNSMINGTKLLNVAGMTRGRRDGILKSEKIRHVVKIGPMHLKGVWIPFERALEFANKEKITDLLYPLF
>MbpC_CANAL (22  ids)  XP_723412    (087..178)
VLRRVQDSFVNVTQLFQILIKLEVLPTSQVDNYFDNEILSNLKYFGSSSNTPQYLDLRKHQNIYLQGIWIPYDKAVNLALKFDIYEITKKLF
>MbpB_CANAL (25  ids)  XP_710918    (256..346)
VIWDYETGWVHLTGIWKASLTIDGSNVSPSHLKADIVKLLESTPKEYQQYIKRIRGGFLKIQGTWLPYKLCKILARRFCYYLRYSLIPIFG
>MbpA_CANAL (48  ids)  XP_712970    (006..082)
SIMRRCKDDWVNATQILKCCNFPKAKRTKILEKGVQQGLHEKVQGGFGRFQGTWIPLEDARKLAKTYGVTEELAPVL
>Sok2_CANAL (49  ids)  XP_711513    (469..541)
VSRREDTNYINGTKLLNVIGMTRGKRDGILKTEKIKNVVKVGSMNLKGVWIPFDRAYEIARNEGVDSLLYPLF
>Phd1_CANAL (65  ids)  XP_714237    (228..301)
SVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREQIVDMLYPLF
>SokA_CANGL (56  ids)  XP_449680    (143..216)
TVVRRADNDMVNGTKLLNVTGMTRGRRDGILKNEPVRDVVKGGPMTLKGVWIPIDRARAIARQEGIEQWLYPLF
>Swi4_CANGL (61  ids)  XP_444966    (062..140)
VMRRTMDDWVNVTQVFKIAQFSKTQRTKILEKESTNMKHEKVQGGYGRFQGTWVPLEAAKFMTTKYNIDNPVVNTILSF
>Sok2_CANGL (64  ids)  XP_448847    (224..297)
SVVRRADNDMINGTKLLNVTKMTRGKRDGILRSEKYRKVVKIGSMHLKGVWIPFERALFIAKREKIVDLLYPLF
>MbpA_COPCI (26  ids)  EAU85126     (059..139)
IMMDIDDGYILWTGIWKALGNSKADIVKMIDSQPDLAPLIRRVRGGYLKIQGTWMPYEVALKLSRRVAWPIRHDLVPLFGF
>MbpA_CRYNE (42  ids)  XP_569090    (036..114)
AVMRRRSDAYLNATQILKVAGFDKPQRTRVLEREVQKGEHEKVQGGYGKYQGTWIPIERGLALAKQYGVEDILRPIIDY
>MbpB_DEBHA (26  ids)  XP_459773    (187..275)
IIWDYETGFVHLTGIWKASINDEVNTHRNLKADIVKLLESTPKQYHQHIKRIRGGFLKIQGTWLPFDLCKMLAKRFCYHIRFQLIPIFG
>Swi4_DEBHA (26  ids)  XP_459901    (067..158)
ILRRVQDSYINISQLFSILLKIGHLSEAQLTNFLNNEILTNTQYLSSGGSNPQFNDLRNHEVRDLRGLWIPYDRAVSLALKFDIYELAKSLF
>MbpA_DEBHA (45  ids)  XP_457246    (028..103)
IMRRCKDDWVNATQILKCCNFPKAKRTKILEKGVQQGLHEKIQGGYGRFQGTWIPLADAQRLAASYGVTPDLAPVL
>SokA_DEBHA (50  ids)  XP_460447    (213..285)
VSRREDTNYVNGTKLLNVAGMTRGKRDGILKTEKTKSVVKVGAMNLKGVWIPFERASEIARNEGIDGLLYPLF
>Sok2_DEBHA (64  ids)  XP_459785    (307..380)
SVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREGIVDLLYPLF
>MbpB_GIBZE (21  ids)  XP_389978    (139..219)
AVMWDYNIGLVRMTPFFKCRGYGKTIPAKMLGLNPGLKEITHSITGGSIAAQGYWMPYRCAKAICATFCHPIAGALIPIFG
>MbpA_GIBZE (39  ids)  XP_384396    (045..123)
AVMRRRNDSWLNATQILKVAGVDKGKRTKILEKEIQTGEHEKVQGGYGKYQGTWIKFERGLQVCRQYGVEELLRPLLTY
>Sok2_GIBZE (55  ids)  XP_390305    (226..298)
VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPYDRALDFANKEKITELLYPLF
>Swi4_KLULA (50  ids)  XP_454890    (119..197)
IMRRCNDNWLNITQVFKAGSFTKAQRTKILEKEANEIKHEKIQGGYGRFQGTWIPWESTKYLVEKYNINNKVVKRIVEF
>Sok2_KLULA (67  ids)  XP_455299    (386..459)
SVVRRADNDMINGTKLLNVTRMTRGRRDGILKAEKIRHVVKIGSMHLKGVWIPFERALVMAQREKIVDLLYALF
>MbpB_MAGGR (20  ids)  XP_369301    (096..176)
TVMWDYGCGLVRMTHFFKCRGYTKTVPGKVLNQNHGLKDITYSITGGSISAQGYWMPFACARAVCATFCHPIAGALIPIFG
>MbpA_MAGGR (39  ids)  XP_365024    (131..209)
AVMKRIGDSKLNATQILKVAGVEKGKRTKILEKEIQTGEHEKVQGGYGKYQGTWIKYERALEVCRQYGVEELLRPLLEY
>Sok2_MAGGR (57  ids)  XP_368552    (133..205)
VARREDNHMINGTKLLNVAGMTRGRRDGILKSEKMRHVVKIGPMHLKGVWIPFERALDFANKEKITELLYPLF
>MbpA_NEUCR (40  ids)  XP_962967    (071..147)
AVMRRQKDGWVNATQILKVANIDKGRRTKILEKEIQIGEHEKVQGGYGKYQGTWIPFERGLEVCRQYGVEELLSKLL
>MbpA_PICST (46  ids)  XP_001383745 (006..081)
IMRRCKDDWVNATQILKCCNFPKAKRTKILEKGVQQGLHEKVQGGFGRFQGTWIPLPDAQRLATMYGVTADAAPVL
>SokA_PICST (49  ids)  XP_001385235 (239..311)
VSRREDTNFVNGTKLLNVIGMTRGKRDGILKTEKTRNVVKVGSMNLKGVWIPFDRAFEIARNEGVDEALHPLF
>Sok2_PICST (64  ids)  XP_001383609 (194..267)
SVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREGIVDLLYPLF
>Sok2_SACCE (74  ids)  EDN64408     (435..508)
SVVRRADNDMVNGTKLLNVTKMTRGRRDGILKAEKIRHVVKIGSMHLKGVWIPFERALAIAQREKIADYLYPLF
>Phd1_SACCE (74  ids)  NP_012881    (208..281)
SVVRRADNNMINGTKLLNVTKMTRGRRDGILRSEKVREVVKIGSMHLKGVWIPFERAYILAQREQILDHLYPLF
>Swi4_SACCE (79  ids)  EDN63086     (060..138)
VMRRTKDDWINITQVFKIAQFSKTKRTKILEKESNDMQHEKVQGGYGRFQGTWIPLDSAKFLVNKYEIIDPVVNSILTF
>MbpB_SCHPO (21  ids)  NP_596132    (088..164)
LRRCPDSYFNISQILRLAGTSSSENAKELDDIIESGDYENVDSKHPQIDGVWVPYDRAISIAKRYGVYEILQPLISF
>MbpA_SCHPO (41  ids)  NP_593032    (027..104)
SVMRRRRDSWLNATQILKVADFDKPQRTRVLERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPILS
>MbpA_USTMA (24  ids)  XP_760925    (057..138)
TMMIDVDTSFVRFTSITQALGKNKVNFGRLVKTCPALDPHITKLKGGYLSIQGTWLPFDLAKELSRRIAWEIRDHLVPLFGY
>Swi4_USTMA (42  ids)  XP_761485    (182..260)
AVMRRRGDGWLNATQILKIAGIEKTRRTKILEKSILTGEHEKIQGGYGKFQGTWIPLQRAQQVAAEYNVSHLLQPILEF
>MbpB_YARLI (26  ids)  XP_505499    (080..159)
IIWDYHTGYVHLTGLWKAIGNSKADIVKLIDNSPDLEAVIRRVRGGYLKIQGTWVPYDIARALASRTCYFIRFALIPLFG
>MbpA_YARLI (44  ids)  XP_501770    (036..114)
AVMRRRTDSSLNATQILKVAGVEKSKRTKILEKEILTGAHEKVQGGYGKYQGTWIPYERGVDLCRQYSVYDVLQPLLAF
>SokA_YARLI (55  ids)  CAB45654     (144..216)
VARREDNDMINGTKLLNVAGMTRGRRDGILKGEKLRHVVKAGAMHLKGVWIPYDRALEFANKEKIIDLLFPLF
>Sok2_YARLI (60  ids)  XP_501102    (130..202)
VARREDNNMINGTKLLNVVGMTRGRRDGILKTEKIRHVVKIGAMHLKGVWIPYERALAFAQRERIVDVLYPLF
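
Each header line in this list carries a label, an "ids" count, an accession number and a residue range. A short Python sketch for parsing such a header and checking that the residue range agrees with the sequence length is given below; the regular expression is written against the header layout seen in this list, and the function name is hypothetical.

```python
import re

# Layout assumed from the list above:
# >Label_SPECIES (NN  ids)  ACCESSION  (start..end)
HEADER = re.compile(
    r"^>([A-Za-z0-9]+)_([A-Z]+)\s+\((\d+)\s+ids\)\s+(\S+)\s+\((\d+)\.\.(\d+)\)\s*$"
)


def parse_record(header: str, sequence: str) -> dict:
    """Split a header into its fields and compare range and sequence length."""
    match = HEADER.match(header)
    if match is None:
        raise ValueError(f"unexpected header layout: {header!r}")
    gene, species, ids, accession, start, end = match.groups()
    start, end = int(start), int(end)
    return {
        "label": f"{gene}_{species}",
        "species": species,
        "ids": int(ids),
        "accession": accession,
        "range": (start, end),
        # True if the range spans exactly as many residues as the sequence has.
        "length_ok": len(sequence) == end - start + 1,
    }


# First record of the list above.
record = parse_record(
    ">Mbp1_SACCE (79  ids)  NP_010227    (024..102)",
    "SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF",
)
print(record)
```

A record for which `length_ok` is false would be worth checking by hand against the database entry.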

@@ Line 1: / Line 1: @@
 __NOTOC__
-;Multi FASTA file of all APSES domains in fungal proteins.
+;Multi FASTA file of APSES domains in six fungal reference species.
+This page collects APSES domain sequences from six fungal species that are used as reference species for the course. The species are:
+* Aspergillus nidulans (ASPNI)
+* Candida albicans (CANAL)
+* Neurospora crassa (NEUCR)
+* Saccharomyces cerevisiae (SACCE)
+* Schizosaccharomyces pombe (SCHPO)
+* Ustilago maydis (USTMA)
 ====Executing the PSI-BLAST search====
+=====Defining the APSES Domain sequence=====
+#Navigate to the [http://www.ncbi.nlm.nih.gov/blast NCBI BLAST page], accessed '''protein BLAST''';
+#Follow the link to '''protein BLAST''' and enter the yeast Mbp1 refseq ID NP_010227 into the input form;
+#Select the '''PHI-BLAST''' algorithm to search for domains in the sequence and '''Run BLAST''';
+#Click on the graphical summary of the result to access the '''CDD conserved domains''' report for the sequence;
+#Click on the (+) sign next to the link to [http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?ascbin=8&maxaln=10&seltype=2&uid=190963 KilA-N(pfam 04383)] domain to display the query/profile alignment. This is what it looks like:
+<table>
+<tr><td>
+ <font color=#700777>                          10        20        30        40        50        60        70        80</font>
+ <font color=#700777>                  ....*....|....*....|....*....|....*....|....*....|....*....|....*....|....*....|</font>
+ [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&doptcmdl=GenPept&db=Protein&term=6320147 gi 6320147]    <font color=#229922> 19 </font><font color=#2233CC>IHSTGS</font><font color=#FF4466>I</font><font color=#2233CC>MK</font><font color=#FF4466>R</font><font color=#2233CC>K</font><font color=#FF4466>KD</font><font color=#2233CC>DWV</font><font color=#FF4466>NAT</font><font color=#2233CC>HIL</font><font color=#FF4466>KAA</font><font color=#2233CC>NFAKAKRTRI</font><font color=#FF4466>L</font><font color=#2233CC>EK</font><font color=#FF4466>E</font><font color=#2233CC>VL</font><font color=#FF4466>KE</font><font color=#2233CC>TH</font><font color=#FF4466>E</font><font color=#2233CC>KVQ</font><font color=#888888>---------------</font><font color=#FF4466>G</font><font color=#2233CC>GF</font><font color=#FF4466>G</font><font color=#2233CC>KY</font><font color=#FF4466>QGT</font><font color=#2233CC>W</font><font color=#FF4466>V</font><font color=#2233CC>PLNI</font><font color=#FF4466>A</font> <font color=#229922>83</font>
+ [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&doptcmdl=GenPept&db=cdd&term=pfam04383 Cdd:pfam04383] <font color=#229922>  3 </font><font color=#2233CC>YNDFEI</font><font color=#FF4466>I</font><font color=#2233CC>IR</font><font color=#FF4466>R</font><font color=#2233CC>D</font><font color=#FF4466>KD</font><font color=#2233CC>GYI</font><font color=#FF4466>NAT</font><font color=#2233CC>KLC</font><font color=#FF4466>KAA</font><font color=#2233CC>GAKGKRFRNW</font><font color=#FF4466>L</font><font color=#2233CC>RL</font><font color=#FF4466>E</font><font color=#2233CC>ST</font><font color=#FF4466>KE</font><font color=#2233CC>LI</font><font color=#FF4466>E</font><font color=#2233CC>ELS</font><font color=#888888>kennpdkliiienrk</font><font color=#FF4466>G</font><font color=#2233CC>KG</font><font color=#FF4466>G</font><font color=#2233CC>RL</font><font color=#FF4466>QGT</font><font color=#2233CC>Y</font><font color=#FF4466>V</font><font color=#2233CC>HPDL</font><font color=#FF4466>A</font> <font color=#229922>82</font>
+ <font color=#700777>                          90</font>
+ <font color=#700777>                  ....*....|....</font>
+ [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&doptcmdl=GenPept&db=Protein&term=6320147 gi 6320147]    <font color=#229922> 84 </font><font color=#2233CC>KQL</font><font color=#FF4466>A</font><font color=#888888>----</font><font color=#2233CC>EK</font><font color=#FF4466>F</font><font color=#2233CC>SVY</font> <font color=#229922>93</font>
+ [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&doptcmdl=GenPept&db=cdd&term=pfam04383 Cdd:pfam04383] <font color=#229922> 83 </font><font color=#2233CC>LAI</font><font color=#FF4466>A</font><font color=#888888>swis</font><font color=#2233CC>PE</font><font color=#FF4466>F</font><font color=#2233CC>ALK</font> <font color=#229922>96</font>
+</td></tr>
+</table>
+This gives us the following APSES domain sequence:
+ >Yeast Mbp1 APSES domain (AA 19..93 of NP_010227)
+ IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQG
+ GFGKYQGTWVPLNIAKQLAEKFSVY
-The starting point of this list is a BLAST search with '''one''' known APSES domain sequence. This query sequence - the Mbp1 APSES domain - was defined as follows, based on [http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=66020 Pfam profile 02292: APSES].
- >Yeast Mbp1 APSES domain (AA 24..102 of NP_010227)
+=====Searching for APSES domains=====
- SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKY
- QGTWVPLNIAKQLAEKFSVYDQLKPLFDF
-A PSI-BLAST search was executed, searching in the '''nr''' subset of GenPept without further restrictions (Oct. 2007). The default parameters for PSI-BLAST were used, except for using the BLOSUM45 matrix and reducing the Evalue to 1.0 from 10.0.
+A PSI-BLAST search was executed, searching in the '''refseq''' subset of the NCBI protein database and restricting the species to the six fungal reference species plu ''Escherichi coli'''. The latter was chosen to retrieve the KilA-N domain sequence which we need as an outgroup for phylogenetic analyis.
-The search converged after 6 iterations, i.e. PSI-BLAST had found no additional new hits above the inclusion threshold E-value of 0.005. 164 sequences were found and contributed to the profile. However, some of these sequences are redundant, i.e. they are matches to the same amino acid sequence in different database entries, and some of these sequences are from organisnms other than the ones we are considering in the assignment. Even if these latter sequences  are removed, it was appropriate to keep them included initially: they contribute to the information in the PSI-BLAST search profile and improve the sensitivity and specificity of the search.
+The search converged after 5 iterations in which matches of less than 80% of the query length were manually removed, even if they had low E-values. Also, care was taken not to include false positives and thus to avoid profile corruption, and hits with E > 10<sup>-4</sup> were also removed. The final result included 39 sequences. The check-boxes next to the alignments were used to select sequences with > 80% coverage to the query and only the highest-scoring KilA-N domain protein was kept. Clicking on '''Get selected sequences''' created a results page of 29 sequences. These were then displayed in a FASTA(text) format.
-It would certainly not be impossible - albeit somewhat tedious - to manually edit the list of proteins by checking/unchecking which hits to include. I have written a short Perl script to automate this task and to rename the sequences at the same time. Renaming is not required and does not add information; RefSeq / GenPept accession numbers will do just fine to name the sequences uniquely. However the final analysis of sequence alignment or phylogeny results is much easier to do if the sequence labels actually tell us something about the organisms they came from and which other sequence they might be similar to.
-After removing redundant sequences, sequence fragments that did not span the entire Mbp1 APSES domain, and sequences from fungi that are not in the list of organisms for this course, 69 sequences remained for analysis.
+<!-- CONTINUE HERE -->
-<!--TODO:  In the next version of assignment, spend some time to carefully follow up on Xbp1 hits; I've left them out for now since a) they don't find APSES with RPS-BLAST at CDD, and B) this simplifies the phylogenetics... -->
 ====Constructing the multi-FASTA file====
