Reference APSES domains (yeast)
- Multi-FASTA file of all Saccharomyces cerevisiae APSES domains.
Executing the PSI-BLAST search
The starting point of this list is a PSI-BLAST search with one known APSES domain sequence. This query sequence - the Mbp1 APSES domain - was defined as follows, based on Pfam profile 02292: APSES.
>Yeast Mbp1 APSES domain (AA 24..102 of NP_010227)
SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKY
QGTWVPLNIAKQLAEKFSVYDQLKPLFDF
Even though we are only interested in yeast genes, including other sequences in the PSI-BLAST search allows us to identify distant homologues as well. The PSI-BLAST search was run against the refseq subset of GenPept, with an Organism restriction to Fungi (taxid: 4751). Default PSI-BLAST parameters were used, except for the BLOSUM45 matrix and an E-value threshold of 0.1 instead of 10.
The search converged after 6 iterations, i.e. in the 6th iteration PSI-BLAST found no additional new hits above the inclusion threshold E-value of 0.005. Clicking on the Taxonomy reports link lists all hits sorted by the species they originate from. Clicking on the Saccharomyces cerevisiae link identifies the yeast genes that were found:
Saccharomyces cerevisiae S288c [ascomycetes] taxid 559292
ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c]  126  1e-37
ref|NP_011036.1| Swi4p [Saccharomyces cerevisiae S288c]  103  9e-30
ref|NP_013729.1| Sok2p [Saccharomyces cerevisiae S288c]   98  9e-28
ref|NP_012881.1| Phd1p [Saccharomyces cerevisiae S288c]   94  2e-27
ref|NP_012165.1| Xbp1p [Saccharomyces cerevisiae S288c]   55  1e-12
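For further filtering it can help to turn the hit list into records. The following is a minimal sketch, assuming each hit line has the layout shown above (accession, protein name, organism in brackets, score, E-value). The `Hit` type, the `parse_hits` helper and the regular expression are illustrative names, not part of any NCBI tool.

```python
import re
from typing import NamedTuple


class Hit(NamedTuple):
    accession: str
    protein: str
    score: int
    evalue: float


# Hit lines copied from the taxonomy report above.
report = """\
ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c] 126 1e-37
ref|NP_011036.1| Swi4p [Saccharomyces cerevisiae S288c] 103 9e-30
ref|NP_013729.1| Sok2p [Saccharomyces cerevisiae S288c] 98 9e-28
ref|NP_012881.1| Phd1p [Saccharomyces cerevisiae S288c] 94 2e-27
ref|NP_012165.1| Xbp1p [Saccharomyces cerevisiae S288c] 55 1e-12
"""

# Assumed line layout: ref|ACCESSION| NAME [ORGANISM] SCORE EVALUE
HIT = re.compile(r"^ref\|([^|]+)\|\s+(\S+)\s+\[[^\]]+\]\s+(\d+)\s+(\S+)\s*$")


def parse_hits(text):
    """Return a list of Hit records, skipping lines that do not match."""
    hits = []
    for line in text.splitlines():
        m = HIT.match(line.strip())
        if m:
            hits.append(Hit(m.group(1), m.group(2), int(m.group(3)), float(m.group(4))))
    return hits


hits = parse_hits(report)
weakest = min(hits, key=lambda h: h.score)
print(weakest.protein, weakest.score, weakest.evalue)
```

Sorting or filtering on `score` or `evalue` then makes the weakest hit (here Xbp1p) easy to single out.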
One of these (Xbp1) is only a partial match: its alignment spans just residues 343..385 (43 amino acids), compared with 74 to 79 residues for the other four hits. There is a somewhat complicated story here; for the purposes of the course I have therefore removed Xbp1 from consideration as an APSES transcription factor. We will work with the four yeast genes Mbp1, Swi4, Sok2, and Phd1.
Constructing the multi-FASTA file
A multi-FASTA file is the default input format for many MSA programs; it is simply a file that contains more than one FASTA-formatted sequence.
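To make the format concrete, here is a minimal sketch of a reader that splits multi-FASTA text into (header, sequence) pairs. The function name `parse_multi_fasta` and the two short example records are made up for illustration; the sequences are fragments of the domains discussed on this page.

```python
def parse_multi_fasta(text):
    """Return a list of (header, sequence) tuples from multi-FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Illustrative input: two records, the first wrapped over two lines.
example = """\
>seq_one example record
SIMKRKKDDW
VNATHILKAA
>seq_two example record
SVVRRADNDM
"""

records = parse_multi_fasta(example)
for header, seq in records:
    print(header, len(seq))
```

Each `>` line starts a new record, and all following lines up to the next `>` are joined into one sequence.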
The PSI-BLAST search has already defined the sequences from each source protein that are similar to the APSES search profile. We only need to extract them from the search results in a convenient way. NCBI offers options to format the result page through the "Reformat these Results" link at the top of the BLAST results page. The principal format options are:
- Pairwise: the default
- Pairwise with identities: showing only differences to the query sequence
- Query-anchored with/without letters for identities: looks something like a multiple sequence alignment, with hyphens for gaps; insertions relative to the query are displayed below the sequence
- Flat query-anchored with/without letters for identities: this now looks like a multiple sequence alignment (in fact it is one: all sequences are aligned to the profile).
I have selected the flat query-anchored with letters for identities option and restricted the output to Saccharomyces cerevisiae sequences:
Query      1    SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIA  60
NP_010227  24   SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIA  83
NP_011036  60    VMRRTKDDWINITQVFKIAQFSKTKRTKILEKESNDMQHEKVQGGYGRFQGTWIPLDSA  118
NP_013729  436  SVVRRADNDMVNGTKLLNVTKMTRGRRDGILKAEKIRHV---VKIGSMHLKGVWIPFERA  492
NP_012881  208  SVVRRADNNMINGTKLLNVTKMTRGRRDGILRSEKVREV---VKIGSMHLKGVWIPFERA  264
NP_012165  343                                       NFTKRIRGGYIKIQGTWLPMEIS  365

Query      61   KQLAEKFS--VYD-QLKPLFDF  79
NP_010227  84   KQLAEKFS--VYD-QLKPLFDF  102
NP_011036  119  KFLVNKYE--IIDPVVNSILTF  138
NP_013729  493  LAIAQREK--IAD-YLYPLF    509
NP_012881  265  YILAQREQ--ILD-HLYPLF    281
NP_012165  366  RLLCLRFCFPIRY-FLVPIFG   385
I can then simply copy the alignment, remove hyphens and create FASTA headers to which I have manually added some useful information. The "Query" itself (being identical to the original Mbp1 protein) and the Xbp1 partial match are not included.
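The copy, de-gap and relabel step can also be scripted. The sketch below assumes alignment rows of the form "accession, start, aligned segment, end", and joins the blocks for each accession before removing the hyphens. The helper `alignment_to_fasta`, the `names` lookup and the regular expression are illustrative, and only the Swi4 and Sok2 rows are included to keep the example short.

```python
import re

# Rows copied from the flat query-anchored output (Swi4 and Sok2 only).
alignment = """\
NP_011036  60    VMRRTKDDWINITQVFKIAQFSKTKRTKILEKESNDMQHEKVQGGYGRFQGTWIPLDSA  118
NP_013729  436  SVVRRADNDMVNGTKLLNVTKMTRGRRDGILKAEKIRHV---VKIGSMHLKGVWIPFERA  492
NP_011036  119  KFLVNKYE--IIDPVVNSILTF  138
NP_013729  493  LAIAQREK--IAD-YLYPLF    509
"""

# Hand-assigned labels, mirroring the manual annotation step.
names = {"NP_011036": "Swi4_SACCE", "NP_013729": "Sok2_SACCE"}

# Assumed row layout: ACCESSION START SEGMENT END
ROW = re.compile(r"^(\S+)\s+(\d+)\s+([A-Za-z-]+)\s+(\d+)\s*$")


def alignment_to_fasta(text, names):
    """Join alignment blocks per accession, strip gaps, emit multi-FASTA."""
    segments = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m or m.group(1) == "Query":
            continue  # skip blank lines and the query row
        acc, start, seq, end = m.group(1), int(m.group(2)), m.group(3), int(m.group(4))
        rec = segments.setdefault(acc, {"start": start, "end": end, "chunks": []})
        rec["end"] = end  # the last block seen gives the final end coordinate
        rec["chunks"].append(seq)
    lines = []
    for acc, rec in segments.items():
        seq = "".join(rec["chunks"]).replace("-", "")
        label = names.get(acc, acc)
        lines.append(f">{label} ({len(seq)} ids) {acc} ({rec['start']:03d}..{rec['end']})")
        lines.append(seq)
    return "\n".join(lines)


fasta = alignment_to_fasta(alignment, names)
print(fasta)
```

The header layout follows the one used for the reference set on this page; any other information added by hand would have to be supplied separately.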
Yeast APSES domains
This is the final yeast APSES domain reference sequence set in multi-FASTA format.
>Mbp1_SACCE (79 ids) NP_010227 (024..102)
SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
>Sok2_SACCE (74 ids) NP_013729 (436..509)
SVVRRADNDMVNGTKLLNVTKMTRGRRDGILKAEKIRHVVKIGSMHLKGVWIPFERALAIAQREKIADYLYPLF
>Phd1_SACCE (74 ids) NP_012881 (208..281)
SVVRRADNNMINGTKLLNVTKMTRGRRDGILRSEKVREVVKIGSMHLKGVWIPFERAYILAQREQILDHLYPLF
>Swi4_SACCE (79 ids) NP_011036 (060..138)
VMRRTKDDWINITQVFKIAQFSKTKRTKILEKESNDMQHEKVQGGYGRFQGTWIPLDSAKFLVNKYEIIDPVVNSILTF
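Because the headers were assembled by hand, a quick consistency check is worthwhile. The sketch below assumes the header layout used above and compares three numbers per record: the actual sequence length, the count given in the header, and the span of the coordinate range. The helper `check_records` and its regular expression are illustrative.

```python
import re

reference_set = """\
>Mbp1_SACCE (79 ids) NP_010227 (024..102)
SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
>Sok2_SACCE (74 ids) NP_013729 (436..509)
SVVRRADNDMVNGTKLLNVTKMTRGRRDGILKAEKIRHVVKIGSMHLKGVWIPFERALAIAQREKIADYLYPLF
>Phd1_SACCE (74 ids) NP_012881 (208..281)
SVVRRADNNMINGTKLLNVTKMTRGRRDGILRSEKVREVVKIGSMHLKGVWIPFERAYILAQREQILDHLYPLF
>Swi4_SACCE (79 ids) NP_011036 (060..138)
VMRRTKDDWINITQVFKIAQFSKTKRTKILEKESNDMQHEKVQGGYGRFQGTWIPLDSAKFLVNKYEIIDPVVNSILTF
"""

# Assumed header layout: >NAME (N ids) ACCESSION (START..END)
HEADER = re.compile(r"^>(\S+) \((\d+) ids\) (\S+) \((\d+)\.\.(\d+)\)$")


def check_records(text):
    """Return (name, actual_length, declared_length, span, consistent) per record."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    results = []
    # This sketch assumes one header line followed by one sequence line.
    for header, seq in zip(lines[0::2], lines[1::2]):
        m = HEADER.match(header)
        if m is None:
            raise ValueError(f"unexpected header: {header}")
        declared = int(m.group(2))
        span = int(m.group(5)) - int(m.group(4)) + 1
        results.append((m.group(1), len(seq), declared, span, len(seq) == declared == span))
    return results


results = check_records(reference_set)
for row in results:
    print(row)
```

A record for which the three numbers disagree would point to a copy-and-paste slip or a gap character left in the sequence.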