Difference between revisions of "Reference APSES domains (reference species)"
Line 288: | Line 288: | ||
* We need to collapse the separate aligned sections, remove the profusion of gap characters, and replace the semantically meaningless GI numbers with something that we can use for interpreting alignments and trees. I could do this by hand for the ~300 sequences in about 2 hours. I choose to wrote some Perl code instead: | * We need to collapse the separate aligned sections, remove the profusion of gap characters, and replace the semantically meaningless GI numbers with something that we can use for interpreting alignments and trees. I could do this by hand for the ~300 sequences in about 2 hours. I choose to wrote some Perl code instead: | ||
<source lang=Perl> | <source lang=Perl> | ||
+ | #!/usr/bin/perl | ||
+ | # ProcessPSI-BLAST.pl | ||
+ | # Read headers and Flat query alignments from two files. | ||
+ | # Collapse all alignments into single, ungapped strings. | ||
+ | # Select which GI to use, construct meaningful header and print out | ||
+ | # header in multiFASTA format. | ||
+ | # BS Nov 2013 | ||
+ | use strict; | ||
+ | use warnings; | ||
+ | my $headerFile = "APSES_headers.txt"; | ||
+ | my $aliFile = "APSES_ali.txt"; | ||
+ | my $MINCOVER = 75; # Minimum required coverage (%) | ||
+ | my $MAXEXPECT = 0.0001; # Maximum allowed E value | ||
+ | |||
+ | my %headers; # Hash to hold the header data | ||
+ | my %sequences; # Hash to hold the sequences | ||
+ | |||
+ | open IN, $headerFile or die "$!"; | ||
+ | while (my $line = <IN>) { # process all lines from this file | ||
+ | # use regular expression to parse information from header line. | ||
+ | if ($line =~ m/^\s* # possibly whitespace | ||
+ | (\w+).* # capture the first word (as $1: protein name) | ||
+ | .*\[ # discard all characters until opening bracket | ||
+ | (\w+)\s(\w+) # capture two words ($2 and $3: species) | ||
+ | .*\] # discard all characters until closing bracket | ||
+ | \s+(\S+) # discard whitespace, capture word ($4: max score) | ||
+ | \s+(\S+) # discard whitespace, capture word ($5: total score) | ||
+ | \s+(\S+)% # discard whitespace, capture word ($6: coverage) | ||
+ | \s+(\S+) # discard whitespace, capture word ($7: E value) | ||
+ | \s+(\S+) # discard whitespace, capture word ($8: Identity) | ||
+ | \s+(\S+)\. # discard whitespace, capture word ($9: accession, without version) | ||
+ | /x ) { | ||
+ | if ($6 >= $MINCOVER && $7 <= $MAXEXPECT) { # only if both conditions hold... | ||
+ | my $h = substr($1,0,4) . "_"; # 4 characters of protein name, underscore | ||
+ | $h .= uc(substr($2,0,3)) . uc(substr($3,0,2)); # add species code | ||
+ | $headers{$9} = $h; # put this into the hash | ||
+ | } | ||
+ | } | ||
+ | } | ||
+ | close IN; | ||
+ | |||
+ | open IN, $aliFile or die "$!"; | ||
+ | while (my $line = <IN>) { # process all lines from this file | ||
+ | # use regular expression to parse information from header line. | ||
+ | if ($line =~ m/^(.._\S+)\s+ # capture accession number (as $1) | ||
+ | \d+\s+ # discard numbers and whitespace | ||
+ | ([A-Z-]+) # capture sequence ($2) | ||
+ | /x ) { | ||
+ | my $key = $1; | ||
+ | my $val = $2; | ||
+ | $val =~ s/-//g; # remove all hyphens | ||
+ | $sequences{$key} .= $val; # concatenate sequence fragment | ||
+ | # into hash (create entry if | ||
+ | # none exists yet). | ||
+ | } | ||
+ | } | ||
+ | close IN; | ||
+ | |||
+ | # Now iterate through all keys in %headers and print sequences in | ||
+ | # multi FASTA format. | ||
+ | |||
+ | foreach my $key (keys(%headers)) { | ||
+ | print (">"); | ||
+ | print ("$headers{$key}:$key\n"); | ||
+ | print ("$sequences{$key}\n"); | ||
+ | } | ||
+ | |||
+ | exit(); | ||
</source> | </source> | ||
Revision as of 23:25, 25 November 2013
Reference APSES domains
- Multi FASTA file of APSES domains in six fungal reference species.
This page collects APSES domain sequences from six fungal species that are used as reference species for the course. The species are:
- Aspergillus nidulans (ASPNI)
- Candida albicans (CANAL)
- Neurospora crassa (NEUCR)
- Saccharomyces cerevisiae (SACCE)
- Schizosaccharomyces pombe (SCHPO)
- Ustilago maydis (USTMA)
- see also: reference annotation of Mbp1 proteins
Executing the PSI-BLAST search
Defining the APSES Domain sequence
- The APSES domain "proper"
- Navigate to the NCBI BLAST page, accessed protein BLAST;
- Follow the link to protein BLAST and enter the yeast Mbp1 refseq ID NP_010227 into the input form;
- Select the PHI-BLAST algorithm to search for domains in the sequence and Run BLAST;
- Click on the graphical summary of the result to access the CDD conserved domains report for the sequence;
- Click on the (+) sign next to the link to KilA-N(pfam 04383) domain to display the query/profile alignment. This is what it looks like:
10 20 30 40 50 60 70 80 ....*....|....*....|....*....|....*....|....*....|....*....|....*....|....*....| gi 6320147 19 IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQ---------------GGFGKYQGTWVPLNIA 83 Cdd:pfam04383 3 YNDFEIIIRRDKDGYINATKLCKAAGAKGKRFRNWLRLESTKELIEELSkennpdkliiienrkGKGGRLQGTYVHPDLA 82 90 ....*....|.... gi 6320147 84 KQLA----EKFSVY 93 Cdd:pfam04383 83 LAIAswisPEFALK 96 |
This gives us the following APSES domain sequence:
>Yeast Mbp1 APSES domain (AA 19..93 of NP_010227) IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQG GFGKYQGTWVPLNIAKQLAEKFSVY
Searching for APSES domains
A PSI-BLAST search was executed, searching in the refseq subset of the NCBI protein database and restricting the species to the six fungal reference species plus Escherichia coli. The latter was chosen to retrieve the KilA-N domain sequence which we need as an outgroup for phylogenetic analysis.
The search converged after 5 iterations in which matches of less than 80% of the query length were manually removed, even if they had low E-values. Also, care was taken not to include false positives and thus to avoid profile corruption, and hits with E > 10-4 were also removed. The check-boxes next to the alignments were used to select sequences with > 80% coverage to the query and only the highest-scoring KilA-N domain protein was kept. Clicking on Get selected sequences created a results page of 27 sequences. These were then displayed in a FASTA(text) format and their headers were slightly edited to create a dataset of Reference APSES full length proteins.
Constructing the multi-FASTA file
A multi-FASTA file is the default input format for many MSA programs, it is simply a file that contains more than one FASTA formatted sequence. To generate the multi-FASTA file of APSES domains, we could have simply edited the full length proteins manually. But there is a simpler way to achieve this. The PSI-BLAST search has already defined the sequences from each source protein that are similar to the APSES search profile. We only need to extract them in a convenient way from the search results. NCBI offers a number of options to format the BLAST result page: they are presented from a link at the top of the BLAST results page: "Formatting options": the principal options for the format are:
- Pairwise: the default
- Pairwise with identities: showing only differences to the query sequence
- query anchored with/without identities: looks something like a multiple sequence alignment, hyphens for gaps, insertions relative to the query are displayed below the sequence
- flat-query anchored with/without identitites: This now looks like a multiple sequence alignment (in fact it is one - all sequences aligned to the profile).
- hit-table: this gives only the numerical parameters describing the quality of the matches.
When we select the Flat-query anchored with letters for identitites option, it is reasonably straightforward to obtain the aligned sequences, copy and paste them into a Word document and convert that into a multi-FASTA format with a few Edit > Replace commands.
Renaming sequences
To make the interpretation of alignments and gene trees easier, all Saccharomyces cerevisiaea sequences were labelled with their gene name (e.g. Sok2_SACCE
). Sequences that are presumed to be functionally equivalent orthologues to Mbp1 were identified through the Reciprocal Best Match (RBM) criterion and labeled as Mbp1_NNNNN
. All other sequences were named APS1_
, APS2_
, APS3_
... - as required. (e.g. APS1_USTMA
). There is no further significance in the numbers, i.e. APS1_USTMA
is not necessarily an RBM to APS1_SCHPO
. Note that such relabeling of sequences does not change the data or its interpretation, it is just helpful to interpret the tree.
The final 27 APSES domain reference sequences
>KILA_ESCCO ZP_07189117 KilA-N domain protein IDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQSFKGGRPENQGTW VHPDIAINLAQ >MBP1_SACCE NP_010227 Mbp1 IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAE KFSVY >MBP1_USTMA XP_762343 UM06196 IINNVAVMRRRSDDWLNATQILKVVGLDKPQRTRVLEREIQKGIHEKVQGGYGKYQGTWIPLDVAIELAE RYNI >MBP1_NEUCR XP_955821 NCU07246 VMRRRHDDWVNATHILKAAGFDKPARTRILEREVQKDTHEKIQGGYGRYQGTWIPLEQAEALARRNNIY >MBP1_ASPNI XP_660758.1 AN3154 IGTDSVMRRRSDDWINATHILKVAGFDKPARTRILEREVQKGVHEKVQGGYGKYQGTWIPLQEGRQLAER NNI >MBP1_SCHPO NP_593032 MBF transcription factor complex subunit Res2 IKGVSVMRRRRDSWLNATQILKVADFDKPQRTRVLERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATK YKV >MBP1_CANAL XP_723071 potential DNA binding component of MBF VTSEGPIMRRKKDSWINATHILKIAKFPKAKRTRILEKDVQTGIHEKVQGGYGKYQGTYVPLDLGAAIAR NFGVY >APS1_NEUCR XP_962967 NCU07587 VNNVAVMRRQKDGWVNATQILKVANIDKGRRTKILEKEIQIGEHEKVQGGYGKYQGTWIPFERGLEVCRQ YGV >APS1_CANAL XP_712970 potential DNA binding component of SBF MMNESSIMRRCKDDWVNATQILKCCNFPKAKRTKILEKGVQQGLHEKVQGGFGRFQGTWIPLEDARKLAK TYGV >APS1_SCHPO NP_595496 MBF transcription factor complex subunit Res1 INGFPLMKRCHDNWLNATQILKIAELDKPRRTRILEKFAQKGLHEKIQGGCGKYQGTWVPSERAVELAHE YNVF >APS2_ASPNI XP_664319 hypothetical protein AN6715 VNGVAVMKRRSDGWLNATQILKVAGVVKARRTKTLEKEIAAGEHEKVQGGYGKYQGTWVNYQRGVELCRE YHV >APS2_USTMA XP_761485 UM05338 VRGIAVMRRRGDGWLNATQILKIAGIEKTRRTKILEKSILTGEHEKIQGGYGKFQGTWIPLQRAQQVAAE YNV >SWI4_SACCE NP_011036 Swi4p TKIVMRRTKDDWINITQVFKIAQFSKTKRTKILEKESNDMQHEKVQGGYGRFQGTWIPLDSAKFLVNKYE I >APS3_SCHPO NP_596132 MBF transcription factor complex subunit Cdc10 GDNVALRRCPDSYFNISQILRLAGTSSSENAKELDDIIESGDYENVDSKHPQIDGVWVPYDRAISIAKR YGVY >APS3_CANAL XP_714237 potential DNA binding regulator of filamentous growth NNVSVVRRADNNMINGTKLLNVAQMTRGRRDGILKSEKVRHVVKIGSMHLKGVWIPFERALAMAQREQI >SOK2_SACCE NP_013729 Sok2p NGISVVRRADNDMVNGTKLLNVTKMTRGRRDGILKAEKIRHVVKIGSMHLKGVWIPFERALAIAQREKI >APS3_ASPNI XP_663440 STUA CELL PATTERN FORMATION-ASSOCIATED PROTEIN GVCVARREDNGMINGTKLLNVAGMTRGRRDGILKSEKVRNVVKIGPMHLKGVWIPFDRALEFANKEKI >PHD1_SACCE NP_012881 Phd1p NGISVVRRADNNMINGTKLLNVTKMTRGRRDGILRSEKVREVVKIGSMHLKGVWIPFERAYILAQREQI >APS4_CANAL XP_710918 CaO19.5210 LNNHWVIWDYETGWVHLTGIWKASLTIDGSNVSPSHLKADIVKLLESTPKEYQQYIKRIRGGFLKIQGTW LPYKLCKILARRFCYY >APS3_NEUCR XP_960837 NCU01414 GICVARREDNAMINGTKLLNVAGMTRGRRDGILKSEKVRHVVKIGPMHLKGVWIPFERALDFANKEKI >APS5_CANAL XP_711513 potential DNA binding protein NILVSRREDTNYINGTKLLNVIGMTRGKRDGILKTEKIKNVVKVGSMNLKGVWIPFDRAYEIARNEGV >APS4_ASPNI XP_663009 AN5405 TVMWDYNIGLVRTTHLFKCNDYSKTTPAKMLNQNPGLRDICHSITGGALAAQGYWMPYEAAKAIAATFC >APS3_USTMA XP_760925 UM04778 VRGHTMMIDVDTSFVRFTSITQALGKNKVNFGRLVKTCPALDPHITKLKGGYLSIQGTWLPFDLAKELSR R >APS4_SCHPO NP_596166 HFLMRMAKDSSISATSMFRSAFPKATQEEEDLEMRWIRDNLNPIEDKRVAGLWVPPADALALAKDYSM >APS6_CANAL XP_723412 potential transcriptional co-activator HGEIIVLRRVQDSFVNVTQLFQILIKLEVLPTSQVDNYFDNEILSNLKYFGSSSNTPQYLDLRKHQNIYL QGIWIPYDKAVNLALKFDIY >APS4_NEUCR XP_962267 NCU06560 FLMRRSQDGYISATGMFKATFPYASQEEEEAERKYIKSIPTTSSEETAGNVWIPPEQALILAEEYQI >APS5_ASPNI XP_657766 AN0162 TYFLMRRSKDGYVSATGMFKIAFPWAKLEEERSEREYLKTRPETSEDEIAGNVWISPVLALELAAEYKMY
Mbp1 orthologue reference alignment
This is a reference alignment of the APSES domains of those proteins that fulfilled the Reciprocal Best Match criterion with yeast Mbp1.
CLUSTAL format alignment by MAFFT L-INS-1 (v6.850b) MBP1_SACCE IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVY MBP1_CANAL VTSEGPIMRRKKDSWINATHILKIAKFPKAKRTRILEKDVQTGIHEKVQGGYGKYQGTYVPLDLGAAIARNFGVY MBP1_USTMA IINNVAVMRRRSDDWLNATQILKVVGLDKPQRTRVLEREIQKGIHEKVQGGYGKYQGTWIPLDVAIELAERYNI- MBP1_NEUCR ------VMRRRHDDWVNATHILKAAGFDKPARTRILEREVQKDTHEKIQGGYGRYQGTWIPLEQAEALARRNNIY MBP1_ASPNI -IGTDSVMRRRSDDWINATHILKVAGFDKPARTRILEREVQKGVHEKVQGGYGKYQGTWIPLQEGRQLAERNNI- MBP1_SCHPO -IKGVSVMRRRRDSWLNATQILKVADFDKPQRTRVLERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKV-
All course species APSES domains
To construct a reference alignment for all APSES domains in the various course species, the following process was used:
- Open a protein BLAST input window.
- Paste the yeast Mbp1 APSES domain sequence
>Yeast Mbp1 APSES domain (AA 19..93 of NP_010227) IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQG GFGKYQGTWVPLNIAKQLAEKFSVY
- Select refseq_protein as the Database.
- Paste the following organism restrictions into the Entrez query field. This includes all fungi we have worked with in the course, as well as Escherichia coli (for the KilA-N domain):
Ajellomyces dermatitidis [ORGN]
OR Arthroderma benhamiae [ORGN]
OR Arthroderma gypseum [ORGN]
OR Ashbya gossypii [ORGN]
OR Aspergillus clavatus [ORGN]
OR Aspergillus fumigatus [ORGN]
OR Aspergillus nidulans [ORGN]
OR Aspergillus niger [ORGN]
OR Aspergillus terreus [ORGN]
OR Candida albicans [ORGN]
OR Candida dubliniensis [ORGN]
OR Candida glabrata [ORGN]
OR Candida orthopsilosis [ORGN]
OR Candida tropicalis [ORGN]
OR Chaetomium globosum [ORGN]
OR Clavispora lusitaniae [ORGN]
OR Coccidioides immitis [ORGN]
OR Coccidioides posadasii [ORGN]
OR Debaryomyces hansenii [ORGN]
OR Eremothecium cymbalariae [ORGN]
OR Kazachstania africana [ORGN]
OR Kluyveromyces lactis [ORGN]
OR Komagataella pastoris [ORGN]
OR Lachancea thermotolerans [ORGN]
OR Lodderomyces elongisporus [ORGN]
OR Magnaporthe oryzae [ORGN]
OR Malassezia globosa [ORGN]
OR Meyerozyma guilliermondii [ORGN]
OR Millerozyma farinosa [ORGN]
OR Myceliophthora thermophila [ORGN]
OR Naumovozyma castellii [ORGN]
OR Naumovozyma dairenensis [ORGN]
OR Nectria haematococca [ORGN]
OR Neosartorya fischeri [ORGN]
OR Neurospora crassa [ORGN]
OR Paracoccidioides sp. [ORGN]
OR Puccinia graminis [ORGN]
OR Pyrenophora teres [ORGN]
OR Pyrenophora tritici-repentis [ORGN]
OR Saccharomyces cerevisiae[ORGN]
OR Saccharomyces cerevisiae [ORGN]
OR Scheffersomyces stipitis [ORGN]
OR Schizosaccharomyces japonicus [ORGN]
OR Sclerotinia sclerotiorum [ORGN]
OR Sordaria macrospora [ORGN]
OR Talaromyces marneffei [ORGN]
OR Talaromyces stipitatus [ORGN]
OR Tetrapisispora blattae [ORGN]
OR Tetrapisispora phaffii [ORGN]
OR Thielavia terrestris [ORGN]
OR Torulaspora delbrueckii [ORGN]
OR Trichophyton rubrum [ORGN]
OR Trichophyton verrucosum [ORGN]
OR Uncinocarpus reesii [ORGN]
OR Vanderwaltozyma polyspora [ORGN]
OR Verticillium alfalfae [ORGN]
OR Yarrowia lipolytica [ORGN]
OR Zygosaccharomyces rouxii [ORGN]
OR Zymoseptoria tritici [ORGN]
OR Escherichia coli [ORGN]
- Select PSI-BLAST as the algorithm.
- BLAST this.
- On the results page, select hits with >75% coverage and E values < 10-4 and iterate (6 rounds) to convergence.
- Open the Formatting options link and select Flat query anchored with letters for identities. The alignment then looks something like this:
[...] XP_962267 81 P-SYFLMRRSQD----GYISATGMF---------K----------------------A 102 XP_001212599 125 DK-EWLIMWDYNI----GLVRTTPLF---------R-------------S--------Q 148 XP_003666082 80 P-SYFLMRRSED----GYVSATGMF---------K----------------------A 101 XP_001398916 86 TYFLMRRSKD----GFVSATGMF---------K-------------I--------A 107 XP_001527061 504 NILVSRREDT----NYINCTKLL---------N-------------V--------V 525 XP_002417464 87 HN-EIIVLRRVQD----SFVNITQLFQILI-----K-------------L--------D 114 XP_657766 86 TYFLMRRSKD----GYVSATGMF---------K-------------I--------A 107 [...]
- Copy all those sequences, and paste them into a text file called APSES_ali.txt
- Copy the headers, and paste them into a separte text file called APSES_headers.txt; they look something like this:
APSES transcription factor Xbp1 [Aspergillus clavatus NRRL 1] 85.9 85.9 94% 2e-19 26% XP_001268422.1 ABR055Cp [Ashbya gossypii ATCC 10895] 86.3 86.3 96% 3e-19 26% NP_983001.2 hypothetical protein PICST_67427 [Scheffersomyces stipitis] 85.6 85.6 96% 3e-19 24% XP_001383609.2 hypothetical protein PGUG_03651 [Meyerozyma guilliermondii] 85.2 85.2 96% 3e-19 24% XP_001484270.1
- We need to collapse the separate aligned sections, remove the profusion of gap characters, and replace the semantically meaningless GI numbers with something that we can use for interpreting alignments and trees. I could do this by hand for the ~300 sequences in about 2 hours. I choose to wrote some Perl code instead:
#!/usr/bin/perl
# ProcessPSI-BLAST.pl
# Read headers and Flat query alignments from two files.
# Collapse all alignments into single, ungapped strings.
# Select which GI to use, construct meaningful header and print out
# header in multiFASTA format.
# BS Nov 2013
use strict;
use warnings;
my $headerFile = "APSES_headers.txt";
my $aliFile = "APSES_ali.txt";
my $MINCOVER = 75; # Minimum required coverage (%)
my $MAXEXPECT = 0.0001; # Maximum allowed E value
my %headers; # Hash to hold the header data
my %sequences; # Hash to hold the sequences
open IN, $headerFile or die "$!";
while (my $line = <IN>) { # process all lines from this file
# use regular expression to parse information from header line.
if ($line =~ m/^\s* # possibly whitespace
(\w+).* # capture the first word (as $1: protein name)
.*\[ # discard all characters until opening bracket
(\w+)\s(\w+) # capture two words ($2 and $3: species)
.*\] # discard all characters until closing bracket
\s+(\S+) # discard whitespace, capture word ($4: max score)
\s+(\S+) # discard whitespace, capture word ($5: total score)
\s+(\S+)% # discard whitespace, capture word ($6: coverage)
\s+(\S+) # discard whitespace, capture word ($7: E value)
\s+(\S+) # discard whitespace, capture word ($8: Identity)
\s+(\S+)\. # discard whitespace, capture word ($9: accession, without version)
/x ) {
if ($6 >= $MINCOVER && $7 <= $MAXEXPECT) { # only if both conditions hold...
my $h = substr($1,0,4) . "_"; # 4 characters of protein name, underscore
$h .= uc(substr($2,0,3)) . uc(substr($3,0,2)); # add species code
$headers{$9} = $h; # put this into the hash
}
}
}
close IN;
open IN, $aliFile or die "$!";
while (my $line = <IN>) { # process all lines from this file
# use regular expression to parse information from header line.
if ($line =~ m/^(.._\S+)\s+ # capture accession number (as $1)
\d+\s+ # discard numbers and whitespace
([A-Z-]+) # capture sequence ($2)
/x ) {
my $key = $1;
my $val = $2;
$val =~ s/-//g; # remove all hyphens
$sequences{$key} .= $val; # concatenate sequence fragment
# into hash (create entry if
# none exists yet).
}
}
close IN;
# Now iterate through all keys in %headers and print sequences in
# multi FASTA format.
foreach my $key (keys(%headers)) {
print (">");
print ("$headers{$key}:$key\n");
print ("$sequences{$key}\n");
}
exit();