Reference APSES domains (reference species)
- Multi FASTA file of all APSES domains in fungal proteins.
The full-length protein sequences were copied from the previously prepared input file of [[All_APSES_proteins|86 proteins]] and pasted into the input form of the EBI ClustalW service. While ClustalW is no longer considered state of the art for multiple sequence alignment, it is computationally efficient and sufficiently accurate for the purpose of approximate domain boundary definition. What we want is to construct an input file for aligning just the APSES domains. This file should contain the following:
- our yeast APSES domain (this defines the boundaries of the domain we are interested in)
- for the other proteins, enough sequence extending N- and C-terminally beyond the aligned domain to ensure we are not throwing out conserved amino acids
- but not too much, since irrelevant sequence can cause problems for the alignment.
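The three requirements above can also be expressed programmatically. The following is only a minimal Python sketch, not the procedure used on this page (which was manual editing). It assumes the alignment is already available as a dictionary of gapped rows of equal length; the function names `domain_window` and `extract_domains` and the `flank` parameter are hypothetical. Note that `flank` counts alignment columns, not residues, so a gappy region yields fewer flanking residues.

```python
def domain_window(aligned_ref, domain, flank=15):
    """Return (start, end) alignment columns that cover `domain` in the
    gapped reference row, widened by `flank` columns on each side."""
    ungapped = aligned_ref.replace("-", "")
    pos = ungapped.find(domain)
    if pos < 0:
        raise ValueError("domain sequence not found in reference row")
    # cols[i] is the alignment column of the i-th residue of the reference
    cols = [i for i, ch in enumerate(aligned_ref) if ch != "-"]
    start = max(0, cols[pos] - flank)
    end = min(len(aligned_ref), cols[pos + len(domain) - 1] + 1 + flank)
    return start, end


def extract_domains(alignment, ref_id, domain, flank=15):
    """Cut the same column window out of every row and drop gap characters."""
    start, end = domain_window(alignment[ref_id], domain, flank)
    return {name: row[start:end].replace("-", "")
            for name, row in alignment.items()}
```

Choosing `flank` somewhere between 10 and 20 corresponds to the compromise described in the list: wide enough to keep conserved residues at the domain edges, narrow enough to leave out unrelated sequence.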
Scrolling through the ClustalW result page, the alignment blocks containing the Mbp1 APSES domain sequence were copied and pasted into a MSWord test document, then manually edited to contain only the APSES domains plus some 10 or 20 residues on each end. Through some simple replace commands, the result was then converted to FASTA format. What's a bit annoying is that ClustalW truncates the headers to their first word (in our case mostly the GI number), i.e. from a FASTA input of ...
... we get a Clustal record of ...
6320147 --------------------------------------MSNQIYSARYSGVDVYEFIHST 22
...which we can change back into a FASTA record:
>6320147
MSNQIYSARYSGVDVYEFIHST
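The same conversion from Clustal rows back to FASTA records can be sketched in Python. This is an illustration under assumptions, not the replace commands actually used: it assumes the standard Clustal text layout (a header line starting with `CLUSTAL`, rows of the form name, aligned segment, optional residue count, and conservation lines that begin with whitespace). The function name `clustal_to_fasta` is hypothetical.

```python
def clustal_to_fasta(text):
    """Join the per-block segments of each sequence, strip gap characters,
    and return the sequences as FASTA text in order of first appearance."""
    seqs = {}
    for line in text.splitlines():
        # Skip blank lines, the CLUSTAL header, and conservation lines
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, segment = fields[0], fields[1]  # a trailing residue count is ignored
        seqs[name] = seqs.get(name, "") + segment
    records = []
    for name, seq in seqs.items():
        records.append(">" + name + "\n" + seq.replace("-", "") + "\n")
    return "".join(records)
```

Applied to the single Clustal row shown above, this yields a record with the header `>6320147` and the sequence `MSNQIYSARYSGVDVYEFIHST`.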
There is no easy way to repair the headers in MSWord, but using a trivial Perl program this can be automated:
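As an illustration of the idea (a Python sketch, not the Perl program referred to above), the full headers can be looked up by their first word in the original FASTA input and written back into the trimmed file. The function name `restore_headers` is hypothetical, and the sketch assumes that the first word of each header, e.g. the GI number, is unique within the original file.

```python
def restore_headers(trimmed_fasta, original_fasta):
    """Replace each truncated header in `trimmed_fasta` with the full header
    from `original_fasta` that starts with the same first word."""
    full = {}
    for line in original_fasta.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                full[words[0]] = line[1:].strip()
    out = []
    for line in trimmed_fasta.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            key = words[0] if words else ""
            # Headers with no match in the original file are kept unchanged
            out.append(">" + full.get(key, line[1:].strip()))
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

Sequence lines pass through untouched, so the trimmed domain sequences are preserved and only the header lines change.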
However, I consider this cosmetic: the file would have been just as valid with only the GI numbers in the headers. Here is the resulting FASTA file containing only APSES domains: