Difference between revisions of "Reference APSES domains (reference species)"

From "A B C"
 
(7 intermediate revisions by the same user not shown)
Line 5: Line 5:
  
 
__NOTOC__
 
__NOTOC__
 +
 +
 +
<div class="alert">
 +
The species used on this page are not the current set of [[Reference_species_for_fungi|reference species]]. Proceed with caution.
 +
</div>
  
 
<section begin=contents_summary />
 
<section begin=contents_summary />
Line 16: Line 21:
  
  
* see also: [[Mbp1_protein_reference_annotation|reference annotation of Mbp1 proteins]]
+
* see also: [[Reference APSES proteins (reference species)]]
  
  
Line 56: Line 61:
 
A PSI-BLAST search was executed, searching in the '''refseq''' subset of the NCBI protein database and restricting the species to the six fungal reference species plus ''Escherichia coli''. The latter was chosen to retrieve the KilA-N domain sequence which we need as an outgroup for phylogenetic analysis.  
 
A PSI-BLAST search was executed, searching in the '''refseq''' subset of the NCBI protein database and restricting the species to the six fungal reference species plus ''Escherichia coli''. The latter was chosen to retrieve the KilA-N domain sequence which we need as an outgroup for phylogenetic analysis.  
  
The search converged after 5 iterations in which matches of less than 80% of the query length were manually removed, even if they had low E-values. Also, care was taken not to include false positives and thus to avoid profile corruption, and hits with E > 10<sup>-4</sup> were also removed. The check-boxes next to the alignments were used to select sequences with > 80% coverage to the query and only the highest-scoring KilA-N domain protein was kept. Clicking on '''Get selected sequences''' created a results page of 27 sequences. These were then displayed in a FASTA(text) format and their headers were slightly edited to create a dataset of [[Reference APSES full length proteins]].
+
The search converged after 5 iterations in which matches of less than 80% of the query length were manually removed, even if they had low E-values. Also, care was taken not to include false positives and thus to avoid profile corruption, and hits with E > 10<sup>-4</sup> were also removed. The check-boxes next to the alignments were used to select sequences with > 80% coverage to the query and only the highest-scoring KilA-N domain protein was kept. Clicking on '''Get selected sequences''' created a results page of 27 sequences. These were then displayed in a FASTA(text) format and their headers were slightly edited to create the dataset [[Reference APSES proteins (reference species)]].
  
 
===Constructing the multi-FASTA file===
 
===Constructing the multi-FASTA file===
Line 194: Line 199:
 
Here  is a sample set of the APSES domain sequences to illustrate the '''phylip''' format. Sequences were aligned with MAFFT and edited in JALVIEW to remove gapped regions and frayed termini. The FASTA sequences were converted with [http://www-bimas.cit.nih.gov/molbio/readseq/ the Readseq server].  
 
Here  is a sample set of the APSES domain sequences to illustrate the '''phylip''' format. Sequences were aligned with MAFFT and edited in JALVIEW to remove gapped regions and frayed termini. The FASTA sequences were converted with [http://www-bimas.cit.nih.gov/molbio/readseq/ the Readseq server].  
  
<source lang="text">
+
<pre>
 
  27 78
 
  27 78
 
KILA_ESCCO  DGEIIHLRAK DGYINATSMC RT-A-GKLLS DYTRLKLSRD M-GIPIS-IQ
 
KILA_ESCCO  DGEIIHLRAK DGYINATSMC RT-A-GKLLS DYTRLKLSRD M-GIPIS-IQ
Line 251: Line 256:
 
             --TSSEETAG NVWIPPEQAL ILAEEYQI
 
             --TSSEETAG NVWIPPEQAL ILAEEYQI
  
</source>
+
</pre>
  
  
Line 268: Line 273:
 
* Paste the following organism restrictions into the '''Entrez query''' field. This includes all fungi we have worked with in the course, as well as ''Escherichia coli'' (for the KilA-N domain):
 
* Paste the following organism restrictions into the '''Entrez query''' field. This includes all fungi we have worked with in the course, as well as ''Escherichia coli'' (for the KilA-N domain):
  
<source lang="text">
+
<pre>
 
Ajellomyces dermatitidis [ORGN]
 
Ajellomyces dermatitidis [ORGN]
 
OR Arthroderma benhamiae [ORGN]
 
OR Arthroderma benhamiae [ORGN]
Line 331: Line 336:
 
OR Zymoseptoria tritici [ORGN]
 
OR Zymoseptoria tritici [ORGN]
 
OR Escherichia coli [ORGN]
 
OR Escherichia coli [ORGN]
</source>
+
</pre>
 
* Select '''PSI-BLAST''' as the algorithm.
 
* Select '''PSI-BLAST''' as the algorithm.
 
* '''BLAST''' this.
 
* '''BLAST''' this.
Line 381: Line 386:
 
====Processing the PSI-BLAST results====
 
====Processing the PSI-BLAST results====
 
* We need to collapse the separate aligned sections, remove the profusion of gap characters, and replace the semantically meaningless GI numbers with something that we can use for interpreting alignments and trees. I could do this by hand for the ~300 sequences in about 2 hours. I chose to write some Perl code instead. It works on the copied alignments, the headers, and the RBM annotations.
 
* We need to collapse the separate aligned sections, remove the profusion of gap characters, and replace the semantically meaningless GI numbers with something that we can use for interpreting alignments and trees. I could do this by hand for the ~300 sequences in about 2 hours. I chose to write some Perl code instead. It works on the copied alignments, the headers, and the RBM annotations.
<source lang=Perl>
+
<pre>
 
#!/usr/bin/perl
 
#!/usr/bin/perl
 
# ProcessPSI-BLAST.pl
 
# ProcessPSI-BLAST.pl
Line 491: Line 496:
 
exit();
 
exit();
  
</source>
+
</pre>
  
 
====Alignment====
 
====Alignment====
 
* The alignment was done at the EBI using MAFFT and written using FASTA output format.
 
* The alignment was done at the EBI using MAFFT and written using FASTA output format.
<source lang="text">
+
<pre>
 
>Mbp1_USTMA XP_762343
 
>Mbp1_USTMA XP_762343
 
--------IIN-NVA-VMRRRSDDWLN---------------------------------
 
--------IIN-NVA-VMRRRSDDWLN---------------------------------
Line 1,916: Line 1,921:
 
ELIQ-----------SFKGG----------------RP---ENQ-------GTW------
 
ELIQ-----------SFKGG----------------RP---ENQ-------GTW------
 
-------------VHPDIAINLAQ-----
 
-------------VHPDIAINLAQ-----
</source>
+
</pre>
  
 
-->
 
-->

Latest revision as of 06:51, 26 September 2020

Reference APSES domains



The species used on this page are not the current set of reference species. Proceed with caution.


Sequences of APSES domains in the fungal reference species - domain definition, PSI-BLAST search, and header editing.


The APSES domain proteins were determined with a PSI-BLAST search in the refseq database, using 1BM8_A as the search sequence, and restricting the search to the Reference species for fungi.




Executing the PSI-BLAST search

Defining the APSES Domain sequence

The APSES domain "proper"
  1. Navigate to the NCBI BLAST page;
  2. Follow the link to protein BLAST and enter the yeast Mbp1 refseq ID NP_010227 into the input form;
  3. Select the PHI-BLAST algorithm to search for domains in the sequence and Run BLAST;
  4. Click on the graphical summary of the result to access the CDD conserved domains report for the sequence;
  5. Click on the (+) sign next to the link to the KilA-N (pfam04383) domain to display the query/profile alignment. This is what it looks like:
                          10        20        30        40        50        60        70        80
                  ....*....|....*....|....*....|....*....|....*....|....*....|....*....|....*....|
gi 6320147     19 IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQ---------------GGFGKYQGTWVPLNIA 83
Cdd:pfam04383   3 YNDFEIIIRRDKDGYINATKLCKAAGAKGKRFRNWLRLESTKELIEELSkennpdkliiienrkGKGGRLQGTYVHPDLA 82


                          90
                  ....*....|....
gi 6320147     84 KQLA----EKFSVY 93
Cdd:pfam04383  83 LAIAswisPEFALK 96

This gives us the following APSES domain sequence:

>Yeast Mbp1 APSES domain (AA 19..93 of NP_010227)
IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQG 
GFGKYQGTWVPLNIAKQLAEKFSVY
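
The step from the CDD alignment to the FASTA record above can be sketched in a few lines of Python. This is only an illustration and not part of the original workflow: the two query rows are copied from the alignment shown above, and the helper names `ungap` and `wrap` are made up for this example.

```python
# Query rows copied from the CDD alignment above (gi 6320147, residues 19..93).
aligned_rows = [
    "IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQ---------------GGFGKYQGTWVPLNIA",
    "KQLA----EKFSVY",
]


def ungap(rows):
    """Join the alignment rows and drop the gap characters."""
    return "".join(rows).replace("-", "")


def wrap(seq, width=50):
    """Break a sequence into lines of at most `width` characters."""
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


domain = ungap(aligned_rows)
start, end = 19, 93  # coordinates reported in the alignment

# The ungapped length must agree with the residue range 19..93.
assert len(domain) == end - start + 1

print(f">Yeast Mbp1 APSES domain (AA {start}..{end} of NP_010227)")
print(wrap(domain))
```

The length check is a cheap way to catch a copy-and-paste error before the sequence is used as a search query.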

Searching for APSES domains

A PSI-BLAST search was executed, searching in the refseq subset of the NCBI protein database and restricting the species to the six fungal reference species plus Escherichia coli. The latter was chosen to retrieve the KilA-N domain sequence which we need as an outgroup for phylogenetic analysis.
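
As an illustration, the organism restriction for such a search can be assembled as an Entrez query string. The sketch below assumes the `[ORGN]` field tag joined with `OR`, and the full species names are inferred from the five-letter codes used on this page (SACCE, USTMA, NEUCR, ASPNI, SCHPO, CANAL, ESCCO); they should be checked against the reference species list before use.

```python
# Species names inferred from the five-letter codes on this page (an assumption).
SPECIES = [
    "Saccharomyces cerevisiae",
    "Ustilago maydis",
    "Neurospora crassa",
    "Aspergillus nidulans",
    "Schizosaccharomyces pombe",
    "Candida albicans",
    "Escherichia coli",  # included only to retrieve the KilA-N outgroup
]


def entrez_organism_query(names):
    """Build an 'A [ORGN] OR B [ORGN] ...' restriction, one organism per line."""
    return "\nOR ".join(f"{name} [ORGN]" for name in names)


query = entrez_organism_query(SPECIES)
print(query)
```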

The search converged after 5 iterations. In each iteration, matches covering less than 80% of the query length were manually removed, even if they had low E-values. To avoid profile corruption through false positives, hits with E > 10^-4 were also removed. The check-boxes next to the alignments were used to select sequences with > 80% coverage of the query, and only the highest-scoring KilA-N domain protein was kept. Clicking on Get selected sequences created a results page of 27 sequences. These were then displayed in FASTA (text) format and their headers were slightly edited to create the dataset Reference APSES proteins (reference species).
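
The manual selection rules described above can be written down as a small filter. This is a hedged sketch and not the procedure that was actually run: the `Hit` record, its field names, and the example values in the usage note are hypothetical, and the thresholds are simply the ones quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one PSI-BLAST hit (field names are assumptions)."""
    accession: str
    evalue: float
    aligned_length: int
    score: float = 0.0
    is_kila_n: bool = False


def select_hits(hits, query_length, min_coverage=0.8, max_evalue=1e-4):
    """Keep hits with enough query coverage and a low enough E-value.

    Of the KilA-N proteins that pass, only the highest-scoring one is kept,
    since a single outgroup sequence is all that is needed.
    """
    passed = [
        h for h in hits
        if h.aligned_length / query_length >= min_coverage and h.evalue <= max_evalue
    ]
    selected = [h for h in passed if not h.is_kila_n]
    kila_n = [h for h in passed if h.is_kila_n]
    if kila_n:
        selected.append(max(kila_n, key=lambda h: h.score))
    return selected
```

For example, with a 75-residue query, a made-up hit aligned over 40 residues would be dropped for coverage regardless of its E-value, and a made-up hit with E = 1e-3 would be dropped even if it spanned the whole query.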

Constructing the multi-FASTA file

A multi-FASTA file is the default input format for many MSA programs; it is simply a file that contains more than one FASTA-formatted sequence. To generate the multi-FASTA file of APSES domains, we could have edited the full-length proteins manually, but there is a simpler way. The PSI-BLAST search has already defined the sequences from each source protein that are similar to the APSES search profile; we only need to extract them in a convenient way from the search results. NCBI offers a number of options to format the BLAST results page, available from the "Formatting options" link at the top of the page. The principal format options are:

  • Pairwise: the default
  • Pairwise with identities: showing only differences to the query sequence
  • Query-anchored with/without identities: looks something like a multiple sequence alignment, with hyphens for gaps; insertions relative to the query are displayed below the sequence
  • Flat query-anchored with/without identities: this now looks like a multiple sequence alignment (in fact it is one: all sequences are aligned to the profile)
  • Hit table: this gives only the numerical parameters describing the quality of the matches

When we select the Flat query-anchored with letters for identities option, it is reasonably straightforward to obtain the aligned sequences, copy and paste them into a Word document, and convert that into multi-FASTA format with a few Edit > Replace commands.
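
The same conversion can also be scripted instead of done with Edit > Replace. The following is a minimal sketch under a stated assumption: each alignment row of the copied text has four whitespace-separated fields (identifier, start position, aligned fragment, end position). Real BLAST output may deviate from this, for example for rows that consist only of gaps, so the parser would need to be adapted. The identifiers and fragments in the toy input are made up.

```python
def collapse_blocks(text):
    """Concatenate the fragments of each sequence across blocks and remove gaps."""
    sequences = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumed row layout: identifier, start, aligned fragment, end.
        if len(fields) != 4 or not (fields[1].isdigit() and fields[3].isdigit()):
            continue  # skip blank lines and anything that is not an alignment row
        name, _, fragment, _ = fields
        sequences[name] = sequences.get(name, "") + fragment
    return {name: seq.replace("-", "") for name, seq in sequences.items()}


def to_multifasta(sequences, width=70):
    """Format a {name: sequence} mapping as multi-FASTA text."""
    out = []
    for name, seq in sequences.items():
        out.append(f">{name}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
        out.append("")
    return "\n".join(out)


# Toy input with hypothetical identifiers, two alignment blocks.
example = """
Query_1   1   IHSTGSIMKR  10
HYP_0001  5   I--NNVAVMR  12

Query_1   11  KKDDWVNATH  20
HYP_0001  13  RSDDWLNATQ  22
"""

print(to_multifasta(collapse_blocks(example)))
```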

Renaming sequences

To make the interpretation of alignments and gene trees easier, all Saccharomyces cerevisiae sequences were labeled with their gene name (e.g. Sok2_SACCE). Sequences that are presumed to be functionally equivalent orthologues of Mbp1 were identified through the Reciprocal Best Match (RBM) criterion and labeled as Mbp1_NNNNN, where NNNNN is the five-letter species code. All other sequences were named APS1_, APS2_, APS3_ ... as required (e.g. APS1_USTMA). There is no further significance in the numbers, i.e. APS1_USTMA is not necessarily an RBM to APS1_SCHPO. Note that such relabeling of sequences does not change the data or its interpretation; it is just helpful for interpreting the tree.
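
The naming scheme can be expressed as a short function. This is a sketch of the rules as described above, not the code that produced the headers below: the input tuple layout is an assumption, labels are written in upper case as in the FASTA headers, and the APS counter simply runs in input order, so the numbers it produces need not match the ones on this page.

```python
from collections import defaultdict


def label_sequences(records):
    """Assign labels to sequences.

    `records` is a list of (species_code, gene_name_or_None, is_mbp1_rbm)
    tuples; this layout is an assumption made for the example.
    """
    counters = defaultdict(int)
    labels = []
    for species, gene, is_mbp1_rbm in records:
        if is_mbp1_rbm:
            # Reciprocal best match to yeast Mbp1.
            labels.append(f"MBP1_{species}")
        elif gene:
            # Named gene, as for the S. cerevisiae sequences.
            labels.append(f"{gene.upper()}_{species}")
        else:
            # Everything else gets a running APS number per species.
            counters[species] += 1
            labels.append(f"APS{counters[species]}_{species}")
    return labels


print(label_sequences([
    ("SACCE", "Sok2", False),
    ("USTMA", None, True),
    ("USTMA", None, False),
    ("USTMA", None, False),
    ("SCHPO", None, False),
]))
```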

The final 27 APSES domain reference sequences

>KILA_ESCCO ZP_07189117 KilA-N domain protein
IDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQSFKGGRPENQGTW
VHPDIAINLAQ

>MBP1_SACCE NP_010227 Mbp1
IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAE
KFSVY

>MBP1_USTMA XP_762343 UM06196
IINNVAVMRRRSDDWLNATQILKVVGLDKPQRTRVLEREIQKGIHEKVQGGYGKYQGTWIPLDVAIELAE
RYNI

>MBP1_NEUCR XP_955821 NCU07246
VMRRRHDDWVNATHILKAAGFDKPARTRILEREVQKDTHEKIQGGYGRYQGTWIPLEQAEALARRNNIY

>MBP1_ASPNI XP_660758.1  AN3154
IGTDSVMRRRSDDWINATHILKVAGFDKPARTRILEREVQKGVHEKVQGGYGKYQGTWIPLQEGRQLAER
NNI

>MBP1_SCHPO NP_593032 MBF transcription factor complex subunit Res2
IKGVSVMRRRRDSWLNATQILKVADFDKPQRTRVLERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATK
YKV

>MBP1_CANAL XP_723071 potential DNA binding component of MBF
VTSEGPIMRRKKDSWINATHILKIAKFPKAKRTRILEKDVQTGIHEKVQGGYGKYQGTYVPLDLGAAIAR
NFGVY

>APS1_NEUCR XP_962967 NCU07587
VNNVAVMRRQKDGWVNATQILKVANIDKGRRTKILEKEIQIGEHEKVQGGYGKYQGTWIPFERGLEVCRQ
YGV

>APS1_CANAL XP_712970 potential DNA binding component of SBF
MMNESSIMRRCKDDWVNATQILKCCNFPKAKRTKILEKGVQQGLHEKVQGGFGRFQGTWIPLEDARKLAK
TYGV

>APS1_SCHPO NP_595496 MBF transcription factor complex subunit Res1
INGFPLMKRCHDNWLNATQILKIAELDKPRRTRILEKFAQKGLHEKIQGGCGKYQGTWVPSERAVELAHE
YNVF

>APS2_ASPNI XP_664319 hypothetical protein AN6715
VNGVAVMKRRSDGWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVELCRE
YHV

>APS2_USTMA XP_761485 UM05338
VRGIAVMRRRGDGWLNATQILKIAGIEKTRRTKILEKSILTGEHEKIQGGYGKFQGTWIPLQRAQQVAAE
YNV

>SWI4_SACCE NP_011036 Swi4p
TKIVMRRTKDDWINITQVFKIAQFSKTKRTKILEKESNDMQHEKVQGGYGRFQGTWIPLDSAKFLVNKYE
I

>APS3_SCHPO NP_596132 MBF transcription factor complex subunit Cdc10
GDNVALRRCPDSYFNISQILRLAGTSSSENAKELDDIIESGDYENVDSKHPQIDGVWVPYDRAISIAKR
YGVY

>APS3_CANAL XP_714237 potential DNA binding regulator of filamentous growth
NNVSVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREQI

>SOK2_SACCE NP_013729 Sok2p
NGISVVRRADNDMVNGTKLLNVTKMTRGRRDGILKAEKIRHVVKIGSMHLKGVWIPFERALAIAQREKI

>APS3_ASPNI XP_663440 STUA CELL PATTERN FORMATION-ASSOCIATED PROTEIN
GVCVARREDNGMINGTKLLNVAGMTRGRRDGILKSEKVRNVVKIGPMHLKGVWIPFDRALEFANKEKI

>PHD1_SACCE NP_012881 Phd1p
NGISVVRRADNNMINGTKLLNVTKMTRGRRDGILRSEKVREVVKIGSMHLKGVWIPFERAYILAQREQI

>APS4_CANAL XP_710918 CaO19.5210
LNNHWVIWDYETGWVHLTGIWKASLTIDGSNVSPSHLKADIVKLLESTPKEYQQYIKRIRGGFLKIQGTW
LPYKLCKILARRFCYY

>APS3_NEUCR XP_960837 NCU01414
GICVARREDNAMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPFERALDFANKEKI

>APS5_CANAL XP_711513 potential DNA binding protein
NILVSRREDTNYINGTKLLNVIGMTRGKRDGILKTEKIKNVVKVGSMNLKGVWIPFDRAYEIARNEGV

>APS4_ASPNI XP_663009 AN5405
TVMWDYNIGLVRTTHLFKCNDYSKTTPAKMLNQNPGLRDICHSITGGALAAQGYWMPYEAAKAIAATFC

>APS3_USTMA XP_760925 UM04778
VRGHTMMIDVDTSFVRFTSITQALGKNKVNFGRLVKTCPALDPHITKLKGGYLSIQGTWLPFDLAKELSR
R

>APS4_SCHPO NP_596166
HFLMRMAKDSSISATSMFRSAFPKATQEEEDLEMRWIRDNLNPIEDKRVAGLWVPPADALALAKDYSM

>APS6_CANAL XP_723412 potential transcriptional co-activator
HGEIIVLRRVQDSFVNVTQLFQILIKLEVLPTSQVDNYFDNEILSNLKYFGSSSNTPQYLDLRKHQNIYL
QGIWIPYDKAVNLALKFDIY

>APS4_NEUCR XP_962267 NCU06560
FLMRRSQDGYISATGMFKATFPYASQEEEEAERKYIKSIPTTSSEETAGNVWIPPEQALILAEEYQI

>APS5_ASPNI XP_657766 AN0162
TYFLMRRSKDGYVSATGMFKIAFPWAKLEEERSEREYLKTRPETSEDEIAGNVWISPVLALELAAEYKMY


Mbp1 orthologue reference alignment

This is a reference alignment of the APSES domains of those proteins that fulfilled the Reciprocal Best Match criterion with yeast Mbp1.

CLUSTAL format alignment by MAFFT L-INS-i (v6.850b)


MBP1_SACCE      IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVY
MBP1_CANAL      VTSEGPIMRRKKDSWINATHILKIAKFPKAKRTRILEKDVQTGIHEKVQGGYGKYQGTYVPLDLGAAIARNFGVY
MBP1_USTMA      IINNVAVMRRRSDDWLNATQILKVVGLDKPQRTRVLEREIQKGIHEKVQGGYGKYQGTWIPLDVAIELAERYNI-
MBP1_NEUCR      ------VMRRRHDDWVNATHILKAAGFDKPARTRILEREVQKDTHEKIQGGYGRYQGTWIPLEQAEALARRNNIY
MBP1_ASPNI      -IGTDSVMRRRSDDWINATHILKVAGFDKPARTRILEREVQKGVHEKVQGGYGKYQGTWIPLQEGRQLAERNNI-
MBP1_SCHPO      -IKGVSVMRRRRDSWLNATQILKVADFDKPQRTRVLERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKV-
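
The Reciprocal Best Match criterion used to select these sequences can be sketched as a simple test on two best-hit tables. The tables below are toy data with placeholder identifiers, not results of an actual search; in practice they would be filled from a forward BLAST search (yeast Mbp1 against the other species) and a reverse search (each top hit back against yeast).

```python
def is_reciprocal_best_match(query_id, candidate_id, forward_best, reverse_best):
    """Return True if the candidate is the query's best hit and vice versa.

    forward_best maps a query protein to its best hit in the other species;
    reverse_best maps that hit back to its best hit in the query species.
    """
    return (
        forward_best.get(query_id) == candidate_id
        and reverse_best.get(candidate_id) == query_id
    )


# Toy tables with placeholder identifiers (hypothetical, for illustration only).
forward_best = {"yeast_Mbp1": "other_P1"}
reverse_best = {"other_P1": "yeast_Mbp1", "other_P2": "yeast_Swi4"}

print(is_reciprocal_best_match("yeast_Mbp1", "other_P1", forward_best, reverse_best))
print(is_reciprocal_best_match("yeast_Mbp1", "other_P2", forward_best, reverse_best))
```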

Sample Phylip format

Here is a sample set of the APSES domain sequences to illustrate the phylip format. Sequences were aligned with MAFFT and edited in JALVIEW to remove gapped regions and frayed termini. The FASTA sequences were converted with the Readseq server.

 27 78
KILA_ESCCO   DGEIIHLRAK DGYINATSMC RT-A-GKLLS DYTRLKLSRD M-GIPIS-IQ
MBP1_SACCE   STGSIMKRKK DDWVNATHIL KA-A-NFAKA KRTRI-LEKE V-LKETH--E
MBP1_USTMA   NNVAVMRRRS DDWLNATQIL KV-V-GLDKP QRTRV-LERE I-QKGIH--E
MBP1_NEUCR   ----VMRRRH DDWVNATHIL KA-A-GFDKP ARTRI-LERE V-QKDTH--E
MBP1_ASPNI   GTDSVMRRRS DDWINATHIL KV-A-GFDKP ARTRI-LERE V-QKGVH--E
MBP1_SCHPO   KGVSVMRRRR DSWLNATQIL KV-A-DFDKP QRTRV-LERQ V-QIGAH--E
MBP1_CANAL   SEGPIMRRKK DSWINATHIL KI-A-KFPKA KRTRI-LEKD V-QTGIH--E
APS1_NEUCR   NNVAVMRRQK DGWVNATQIL KV-A-NIDKG RRTKI-LEKE I-QIGEH--E
APS1_CANAL   NESSIMRRCK DDWVNATQIL KC-C-NFPKA KRTKI-LEKG V-QQGLH--E
APS1_SCHPO   NGFPLMKRCH DNWLNATQIL KI-A-ELDKP RRTRI-LEKF A-QKGLH--E
APS2_ASPNI   NGVAVMKRRS DGWLNATQIL KV-A-GVVKA RRTKT-LEKE I-AAGEH--E
APS2_USTMA   RGIAVMRRRG DGWLNATQIL KI-A-GIEKT RRTKI-LEKS I-LTGEH--E
SWI4_SACCE   -TKIVMRRTK DDWINITQVF KI-A-QFSKT KRTKI-LEKE S-NDMQH--E
APS3_SCHPO   GDNVALRRCP DSYFNISQIL RL-A-GTSSS ENAKE-LDDI I-ESGDY--E
APS3_CANAL   NNVSVVRRAD NNMINGTKLL NV-A-QMTRG RRDGI-LKSE ----KVR--H
SOK2_SACCE   NGISVVRRAD NDMVNGTKLL NV-T-KMTRG RRDGI-LKAE ----KIR--H
APS3_ASPNI   -GVCVARRED NGMINGTKLL NV-A-GMTRG RRDGI-LKSE ----KVR--N
PHD1_SACCE   NGISVVRRAD NNMINGTKLL NV-T-KMTRG RRDGI-LRSE ----KVR--E
APS4_CANAL   NNHWVIWDYE TGWVHLTGIW KA-SLSHLKA DIVKL-LEST PKEYQQY-IK
APS3_NEUCR   -GICVARRED NAMINGTKLL NV-A-GMTRG RRDGI-LKSE ----KVR--H
APS5_CANAL   -NILVSRRED TNYINGTKLL NV-I-GMTRG KRDGI-LKTE ----KIK--N
APS4_ASPNI   ---TVMWDYN IGLVRTTHLF KC-N-DYSKT TPAKM-LNQN PGLRDIC--H
APS3_USTMA   RGHTMMIDVD TSFVRFTSIT QA-L-GKNKV NFGRL-VKTC P-ALDPH-IT
APS4_SCHPO   --HFLMRMAK DSSISATSMF RS-A-FPKAT QEEED-LEMR WIRDNLN---
APS6_CANAL   GEIIVLRRVQ DSFVNVTQLF QILE-VLPTS QVDNY-FDNE I-LSNLKYLR
APS4_NEUCR   ---FLMRRSQ DGYISATGMF KA-T-FPYAS QEEEE-AERK YIKSIPT---
APS5_ASPNI   -TYFLMRRSK DGYVSATGMF KI-A-FPWAK LEEER-SERE YLKTRPE---

             SFKGGRPENQ GTWVHPDIAI NLAQ----
             KVQGGFGKYQ GTWVPLNIAK QLAEKFSV
             KVQGGYGKYQ GTWIPLDVAI ELAERYNI
             KIQGGYGRYQ GTWIPLEQAE ALARRNNI
             KVQGGYGKYQ GTWIPLQEGR QLAERNNI
             KVQGGYGKYQ GTWVPFQRGV DLATKYKV
             KVQGGYGKYQ GTYVPLDLGA AIARNFGV
             KVQGGYGKYQ GTWIPFERGL EVCRQYGV
             KVQGGFGRFQ GTWIPLEDAR KLAKTYGV
             KIQGGCGKYQ GTWVPSERAV ELAHEYNV
             KVQGGYGKYQ GTWVNYQRGV ELCREYHV
             KIQGGYGKFQ GTWIPLQRAQ QVAAEYNV
             KVQGGYGRFQ GTWIPLDSAK FLVNKYEI
             NVDSKHPQID GVWVPYDRAI SIAKRYGV
             VVKIGSMHLK GVWIPFERAL AMAQREQI
             VVKIGSMHLK GVWIPFERAL AIAQREKI
             VVKIGPMHLK GVWIPFDRAL EFANKEKI
             VVKIGSMHLK GVWIPFERAY ILAQREQI
             RIRGGFLKIQ GTWLPYKLCK ILARRFCY
             VVKIGPMHLK GVWIPFERAL DFANKEKI
             VVKVGSMNLK GVWIPFDRAY EIARNEGV
             SITGGALAAQ GYWMPYEAAK AIAATFC-
             KLKGGYLSIQ GTWLPFDLAK ELSRR---
             --PIEDKRVA GLWVPPADAL ALAKDYSM
             KHQNIY--LQ GIWIPYDKAV NLALKFDI
             --TSSEETAG NVWIPPEQAL ILAEEYQI
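
The layout of the sample above (a header line with the number of sequences and columns, names in the first block only, residues in groups of ten) can be reproduced with a short writer. This is a sketch of the interleaved layout as it appears on this page, not a replacement for Readseq; the function name and its parameters are made up for this example.

```python
def to_phylip_interleaved(records, block=50, group=10):
    """Write (name, aligned_sequence) pairs in interleaved PHYLIP layout."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    ncol = lengths.pop()

    # Header: number of sequences and number of alignment columns.
    lines = [f" {len(records)} {ncol}"]
    for start in range(0, ncol, block):
        if start:
            lines.append("")  # blank line between blocks
        for name, seq in records:
            chunk = seq[start:start + block]
            grouped = " ".join(
                chunk[i:i + group] for i in range(0, len(chunk), group)
            )
            # Names (padded or truncated to 10 characters) appear in the first block only.
            prefix = f"{name[:10]:<10}" if start == 0 else " " * 10
            lines.append(f"{prefix}   {grouped}")
    return "\n".join(lines)


# First twelve columns of two sequences from the sample above, in small blocks.
records = [("KILA_ESCCO", "DGEIIHLRAKDG"), ("MBP1_SACCE", "STGSIMKRKKDD")]
print(to_phylip_interleaved(records, block=10, group=5))
```

With the default `block=50` and `group=10`, a 78-column alignment is written as one block of five groups followed by a block of 28 columns, as in the sample.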