Difference between revisions of "Reference APSES domains (reference species)"

From "A B C"
Jump to navigation Jump to search
Line 1: Line 1:
 +
__NOTOC__
 +
 
;Multi FASTA file of all APSES domains in fungal proteins.  
 
;Multi FASTA file of all APSES domains in fungal proteins.  
 
Generating this file turned out to be much more difficult than anticipated:
 
 
One feature I want this file to have is headers that identify the yeast protein that each ASPES domain is most similar to. E.g. if the ASPES domain of ''Aspergillus terreus'' protein ATEG_06370 is most similar to that of yeast Xbp1, I want it to be called Xbp1_ASPTE. Such meaningful protein names - rather than abstract GI or RefSeq identifiers - will be extremely helpful when it comes to analysing conservation or phylogeny.
 
 
The other feature is of course obtaining the full-length sequence of the ASPES domain. Unfortunately BLAST being a '''local-alignment''' algorithm does not guarantee that it will not shave off significant bits and pieces from the termini.
 
 
At this point, I would normally start writing a program. But let's see how far we can get with tools we find on the Web.
 
  
 
====Executing the PSI-BLAST search====
 
====Executing the PSI-BLAST search====
Line 17: Line 11:
 
  QGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDG
 
  QGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDG
  
The search returned 81 hits with significant e-values by the 5th iteration. 5 of these were from the organism ''Chaetomia globosum'' and were removed from the list since this is not one of the organisms we are studying. 6 hits had HSPs only along a part of the APSES domain. For five of these hits, reasonable similarity to the whole APSES domain was independently verified by manually performing a Needleman-Wunsch optimal alignment with the Mbp1 APSES domain sequence. (EMBOSS NEEDLE using EBLOSUM 30, default gap parameters).
+
The search returned 81 hits with significant e-values by the 5th iteration. 5 of these were from the organism ''Chaetomia globosum'' and were removed from the list since this is not one of the organisms we are studying. 6 hits were aligned only along a part of the APSES domain. For five of these hits, reasonable similarity to the whole APSES domain was independently verified by manually performing a Needleman-Wunsch optimal alignment with the Mbp1 APSES domain sequence. (EMBOSS NEEDLE using EBLOSUM 30, default gap parameters).
  
 
However the match to the ''Neurospora crassa'' protein XP_962373 suggested an incorrect gene model. Consider the alignment:
 
However the match to the ''Neurospora crassa'' protein XP_962373 suggested an incorrect gene model. Consider the alignment:
Line 35: Line 29:
  
  
<!-- In either case, a highly similar sequence across a part of the domain, where the rest of the expected sequence is missing in the database '''must''' lead you to suspect that something is wrong with the database sequence! In order to repair this, one needs to retrieve the genomic sequence, translate it and check whether it matches the query. The '''CDS''' feature section in the GenPept sequence entry contains the gene's '''locus tag''': "NCU06339.1". Searching with this locus-tag in the NCBI nucleotide database gives hits to the mRNA and the whole genome sequence. In the file for chromosome IV cont3.367, we find the following feature annotation (among many others:)
+
====A multi-FASTA file====
 +
Since we are interested in only the APSES domain, we need to display the search results in an appropriate format. If we navigate to the page from where we sent the BLAST query, we have several options to display search results:
  
 +
*'''Pairwise''': the default
 +
*'''Pairwise with identities''': showing only differences to the query sequence
 +
*'''query anchored with/without identities''': looks something like a multiple sequence alignment, hyphens for gaps, insertions relative to the query  are displayed ''below'' the sequence
 +
*'''flat-query anchored with/without identitites''': This now looks like a multiple sequence alignment (in fact it '''is''' one - all sequences aligned to the profile).
 +
*'''hit-table''': this gives only the numerical parameters describing the quality of the matches.
  
    gene            &lt;19631..&gt;21692
+
Using the  '''flat-query anchored with/without identitites''' option, it is reasonably straightforward to obtain the aligned sequences, copy and paste them into a Word document and convert that into a multi-FASTA format with a few Edit > Replace commands. Of course, the sequences for which only partial matches were found need to be completed "by hand" (from the reults of the pairwise sequence alignment described above to validate these sequences).
                    /locus_tag="NCU06339.1"
 
                    /db_xref="GeneID:3878537"
 
    mRNA            join(&lt;19631..19700,19759..&gt;21692)
 
                    /locus_tag="NCU06339.1"
 
                    /product="hypothetical protein"
 
                    /transcript_id="XM_957280.1"
 
                    /db_xref="GI:85107447"
 
                    /db_xref="GeneID:3878537"
 
    CDS            join(19631..19700,19759..21692)
 
                    /locus_tag="NCU06339.1"
 
                    /codon_start=1
 
                    /protein_id="XP_962373.1"
 
                    /db_xref="GI:85107448"
 
                    /db_xref="GeneID:3878537"
 
  
The string <code >&lt;19631..&gt;21692</code> tells us that the gene is encoded on the strand in this file, not the reverse complement, since the numbers are '''ascending'''. The &lt; &gt; brackets tell us that the gene may continue on upstream and downstream. And the mRNA description <code>join(&lt;19631..19700,19759..&gt;21692)</code> tells us that there is a 59 nucleotide intron in the coding sequence. So what we'll do is retrieve 1000 nuclotides upstream sequence together with the gene and then align the APSES domain to the translation. Entering the numbers 19631 and 21692into the '''range''' fields at the top of the page and displaying the range in FASTA format gives us the nucleotides we need. There are many ways to do this comparison, one of them is to use the '''EMBOSS transeq''' program to translate the Neurospora chromosome fragment in three frames (one needs to set the option to change all "*"s to "X"s), then use '''WATER''' to align the ASPES domain to each of them. -->
 
  
====Defining the most similar ASPES domain in yeast====
 
  
Normally we would find the most similar protein in another species by executing a BLAST search. In our case however, we have 70 sequences. Doing this by hand is possible - but painful. Even clicking through the precomputed '''''BLink'''''s (that we would conveniently find on the page returned through "Get selected sequences") will not help us, since, we are not looking for the most similar protein ''per se'', but for the most similar '''ASPES domain'''. So what we need is (1) an input file of ASPES domain sequences, and then (2) a way to BLAST them against the yeast genome. Let's ignore for the time being the requirement for full-length domain sequences and stick with the [[Glossary#HSP|HSPs]] that PSI-BLAST has found. Parsing the BLAST file and extracting the sequences by hand is, again, possible, but painful. Fortunately there is a simpler way. If we navigate to the page from  where we sent the BLAST query, we have several options to display search results:
+
====Renaming sequences====
 +
 
 +
To support the interpretation of alignments and gene trees, the Mbp1 orthologues for all species were named accordingly. All other sequences were named according to their species and the first four digits of their RefSeq ID. This is a pain to do by hand, so I wrote a little perl script to parse this information from the BLAST output and modify the headers accordingly. In case you are curious how this works, you can click '''[[Perl_parse_BLAST_organism|here]]'''.
 +
 
  
*'''Pairwise''': the default
 
*'''Pairwise with identities''': showing only differences to the query sequence
 
*'''query anchored with/without identities''': looks something like a multiple sequence alignment, hyphens for gaps, insertions relative to the query  are displayed ''below'' the sequence
 
*'''flat-query anchored with/without identitites''': This now looks like a multiple sequence alignment (in fact it '''is''' one - all sequences aligned to the profile).
 
*'''hit-table''': this gives only the numerical parameters describing the quality of the matches.
 
  
Thus using the '''flat-query anchored with/without identitites''' option, it is reasonably straightforward to obtain the matched sequences, copy and paste them into a Word document and convert that into a multi-FASTA format with a few Edit > Replace commands.
+
====Defining the most similar ASPES domain in yeast====
  
 +
Normally we would find the most similar protein in another species by executing a BLAST search. In our case however, we have 70 sequences. Doing this by hand is possible - but painful. Even clicking through the precomputed '''''BLink'''''s (that we would conveniently find on the page returned through "Get selected sequences") will not help us, since, we are not looking for the most similar protein ''per se'', but for the most similar '''ASPES domain'''. So what we need is (1) an input file of ASPES domain sequences, and then (2) a way to BLAST them against the yeast genome. Let's ignore for the time being the requirement for full-length domain sequences and stick with the  that PSI-BLAST has found. Parsing the BLAST file and extracting the sequences by hand is, again, possible, but painful. Fortunately there is a simpler way.
  
 
====The ASPES domain sequences====  
 
====The ASPES domain sequences====  

Revision as of 23:25, 18 November 2006


Multi FASTA file of all APSES domains in fungal proteins.

Executing the PSI-BLAST search

A PSI-BLAST search was executed with default parameters, searching in the RefSeq database, restricted to Fungi. The query sequence - the Mbp1 APSES domain - was defined as follows

>Yeast Mbp1 APSES domain (AA 24..107 of NP_010227)
SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKY
QGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDG

The search returned 81 hits with significant e-values by the 5th iteration. 5 of these were from the organism Chaetomia globosum and were removed from the list since this is not one of the organisms we are studying. 6 hits were aligned only along a part of the APSES domain. For five of these hits, reasonable similarity to the whole APSES domain was independently verified by manually performing a Needleman-Wunsch optimal alignment with the Mbp1 APSES domain sequence. (EMBOSS NEEDLE using EBLOSUM 30, default gap parameters).

However the match to the Neurospora crassa protein XP_962373 suggested an incorrect gene model. Consider the alignment:

QUERY       1 SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKY     50
                                          .:.:.:.||:....:.||..|.
XP_962373   1                             MLNQNPGLKDIAYSITGGAIKA     22

QUERY      51 QGTWVPLNIAKQLAEKF--SVYDQLKPLF--DFTQ---TDG              84
              ||.|.|:..||::...|  .:..:|.|||  ||..   :.|         
XP_962373  23 QGYWMPYACAKAVCATFCYQIAGALIPLFGPDFPSECISPGEPRYGIMII     72

In this situation you have to be suspicious that the gene-finder algorithm skipped a part of the N-terminus. Or, the sequence was derived from a partial m-RNA. This sequence was removed from analysis.

Further, XP_712876 and XP_712970 were found to be identical sequences from the same organism. Only one of these duplicates was kept.

This gave a total of 74 ASPES domain sequences for analysis.


A multi-FASTA file

Since we are interested in only the APSES domain, we need to display the search results in an appropriate format. If we navigate to the page from where we sent the BLAST query, we have several options to display search results:

  • Pairwise: the default
  • Pairwise with identities: showing only differences to the query sequence
  • query anchored with/without identities: looks something like a multiple sequence alignment, hyphens for gaps, insertions relative to the query are displayed below the sequence
  • flat-query anchored with/without identitites: This now looks like a multiple sequence alignment (in fact it is one - all sequences aligned to the profile).
  • hit-table: this gives only the numerical parameters describing the quality of the matches.

Using the flat-query anchored with/without identitites option, it is reasonably straightforward to obtain the aligned sequences, copy and paste them into a Word document and convert that into a multi-FASTA format with a few Edit > Replace commands. Of course, the sequences for which only partial matches were found need to be completed "by hand" (from the reults of the pairwise sequence alignment described above to validate these sequences).


Renaming sequences

To support the interpretation of alignments and gene trees, the Mbp1 orthologues for all species were named accordingly. All other sequences were named according to their species and the first four digits of their RefSeq ID. This is a pain to do by hand, so I wrote a little perl script to parse this information from the BLAST output and modify the headers accordingly. In case you are curious how this works, you can click here.


Defining the most similar ASPES domain in yeast

Normally we would find the most similar protein in another species by executing a BLAST search. In our case however, we have 70 sequences. Doing this by hand is possible - but painful. Even clicking through the precomputed BLinks (that we would conveniently find on the page returned through "Get selected sequences") will not help us, since, we are not looking for the most similar protein per se, but for the most similar ASPES domain. So what we need is (1) an input file of ASPES domain sequences, and then (2) a way to BLAST them against the yeast genome. Let's ignore for the time being the requirement for full-length domain sequences and stick with the that PSI-BLAST has found. Parsing the BLAST file and extracting the sequences by hand is, again, possible, but painful. Fortunately there is a simpler way.

The ASPES domain sequences

This requires to take all sequence identifiers, use their APSES domains and search them against the yeast genome. I actually have given up on finding a Web-tool to do this. Of course this can be done manually, through Blinks - but having to do this for 70 sequences was an uninspiring prospect. And it seems there are no BLAST Webservices that will accept batch-input of lists of sequences.


The full-length protein sequences were copied from the previously prepared input file of [[All_APSES_proteins|86 proteins] and pasted into the input form of the EBI ClustalW service. While this is no longer considered state-of-the-art for multiple sequence alignments, it is computationally efficient and sufficiently accurate for the purpose of approximate domain boundary definition. What we want to construct an input file for aligning just the APSES domains: this should contain the following

  • our yeast APSES domain (this defines the boundaries of the domain we are interested in)
  • enough sequence extending it N- and C-terminally for the other proteins to ensure we are not throwing out conserved amino acids
  • but not too much, since irrelevant sequence can cause problems for the alignment.

Scrolling through the ClustalW result page, the alignment blocks containing the Mbp1 APSES domain sequence were copied and pasted into a MSWord test document, then manually edited to contain only the APSES domains plus some 10 or 20 residues on each end. Through some simple replace commands, this was then brought into a FASTA format. What's a bit annoying is that this changes the headers to contain only the first word (in our case mostly the GI number) .. i.e. from a FASTA input of ...

>6320147 NP_010227.1 Mbp1p [Saccharomyces cerevisiae]
MSNQIYSARYSGVDVYEFIHST...

... we get a Clustal record of ...

6320147         --------------------------------------MSNQIYSARYSGVDVYEFIHST 22

...which we can change back into a FASTA record:

>6320147
MSNQIYSARYSGVDVYEFIHST

Tuhs losing part of the header information. There is no easy way to repair the headers in MSWord, but using a trivial perl program this can be automated:


However I consider this cosmetics - the file would have been just as valid with only the GI numbers in the header. Here is the resulting FASTA file containing only APSES domains:

Sources