BLAST
BLAST
BLAST
Objectives
- Understand the advantages and limitations of heuristic, local alignment vs. optimal alignment.
- Know how to initiate a BLAST search.
- Understand different BLAST algorithms and for which computational task they are appropriate.
- Understand the contents of the different databases offered by the NCBI for BLASTing and be able to restrict a search by database and organism.
- Understand how to set the algorithm's parameters for different purposes.
- Understand all information in a BLAST report.
- Be able to evaluate the significance of hits through E-values and other metrics / features of the alignment.
- Be able to use PSI BLAST and avoid and recognize profile corruption; be able to evaluate E-value trends of questionable alignments.
- Be familiar with novel developments beyond BLAST.
Links
- NCBI BLAST home page
- Altschul et al. (1990)
- Web BLAST options document
- Composition-based statistics
- Filtering and Masking
- The Statistics of Sequence Similarity Scores
- Siew&Fischer (2003)
- Yin and Fischer (2006)
- Manuscript citation
- Jones & Swindells (2002)
- Bioinformatics Solutions
- Pattern Hunter algorithm
- COMPASS (Sadreyev & Grishin, 2003)
- COMPASS on the Web
- Sadreyev et al. 2007, NAR Web server issue
Exercises
[...]
Slides
BLAST
Slide 0010
Slide 0011
The tight integration of search capacity with database holdings is the key to the utility of the data. Public investments in sequencing only pay off when the sequences are easily accessible!
Some computational definitions absolutely require genome-wide searches: e.g. the computational definition of orthologues, or compiling evolutionary conservation patterns.
Slide 0012
Slide 0013
Slide 0014
The enormous speed-up of BLAST is due to its use of an indexed table of database "words". The index is a list of positions at which each word occurs in the database. Using an index, it is very easy to examine every occurrence of a word in the database and try to extend the word match on both sides with additional similar sequence. The extension does not introduce gaps, because this is faster, but also because the statistics of ungapped alignments are tractable! The final step is the assenbly of significant hits into longer alignments. See also Altschul et al. (1990).
Slide 0015
Extensive help is available (and should be read!) for each of the options. Take the time to read the Web BLAST options document and be sure to understand how to format input, what databases are available and how the choice of database influences the results. If you are not confident with the document, ask on the course list.
Slide 0017
Be sure to understand the choices and their consequences for Composition-based statistics and for Filtering and Masking segments of low complexity in your query. Filtering is an important option to consider especially for PSI-BLAST searches!
Slide 0018
Each Blast "hit" represents an alignment that can contain one or more HSPs. Note: If a hit is followed by a second hit and no new GI number, it identifies a second region of similarity in the same sequence.
Slide 0019
Normally scores depend on the matrix that was used and can't be compared between different matrices and scoring systems. However the NCBI matrices have been normalized in bits, thus the scores between alignments with different matrices can be compared, (this is not generally the case with other matrices). In addition the percentage of Identical and similar ("positives") residues and the gap fraction are given. %-Identities and gap fraction are often used to conclude whether two sequences are homologous, the percentage of positives is not usually used since it depends on the matrix.
Slide 0020
The E-value is a statistically well founded metric that allows us to conclude the likelihood of a spurious alignment. Computing E-values is possible for HSPs since the statistics of gap-less alignments are analytically tractable, whereas gapped alignments have no theoretical description of the distribution of expected scores.
Note that E-values do not represent an assertion about the retrieved sequence, but an assertion about the score and its relation to the expected distribution of scores. Or, to rephrase this, a large E-value does not mean that your hit is not a homologue, but it means that an irrelevant sequence has a a high chance of having just as high a score due to chance similarities. To repeat: a large E-value does not mean your hit is not a homologue. However a small E-value does indeed mean that a chance alignment is unlikely.
It is important to realize that the E-value depends on the database size. Obviously, you would expect randomly high-scoring hits more often in a large database than in a small one. Thus an alignment with the same score will have smaller E-value searched against a particular genome than if you search it against the entire "nr" dataset of GenBank. (More detail in the NCBI tutorial: The Statistics of Sequence Similarity Scores.)
Slide 0021
In the example above, the BLAST search of a Pea defensin - PDB structure 1JKZ - achieved an E-value of only 6.7. However the hit that was retrieved
- is annotated as an arabidopsis defensin
- has 30% identity over the entire domain, albeit the domain is small
- requires only one single gap for alignment
- and has each and every single cysteine conserved, when compared to the query!
Each of these additional observations alone could have led you to conclude homology. The large E-value is primarily due to the fact that the protein sequences are quite short.
Slide 0022
How can there be too many hits, when lots-of-hits is what you are looking for? Either you find redundant sequences or trivially similar sequences that are obscurig the rare, interesting similarities you are looking for (GFP or other fusion proteins come to mind, for example), or you are searching in a database section that contains redundant sequences.
Note that restricting by organism does not restrict the search, but only the list of results that are being reported. The search takes just as long. Only the specialized genome search pages and some non-NCBI databases of model-organism genome projects offer BLAST searches on reduced datasets. These searches are faster.
Slide 0023
How many genes have no homologues? That depends. Unknown genes (or "ORFans") may comprise a significant (albeit diminishing) fraction of genomes. See Siew&Fischer (2003) and a discussion of the role of viral horizontal gene transfer in ORFans by Yin and Fischer (2006). In general, between 10 and 30% of sequences may fall into this category and it is likely that even the most closely related species have sequences that are unique.
Slide 0024
There are a number of different ways to execute BLAST searches and not all of them are equivalent. It is prudent not to rely on any single source of information but to verify and validate important results . Here is an example for a M. grisea Mbp1 homologue search (done 2007).
All searches performed with the query S. cerevisiae Mbp1 (NP_010227).
- 1
- BLAST search via nr database, Entrez restriction to Organism: M. grisea
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?PAGE=Proteins&PROGRAM=blastp
- Database size: 6,035,852
- Sequences: 13,709
- Comment: All sequences
- Best hit: ABA02072 (full length)
- Second: XP_365024 (full length)
- 2
- BLAST search via (B) icon from fungi genome project page
http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi?p3=11:Fungi
- Database size: 6,110,084
- Sequences: 14,009
- Comment: RefSeq only
- Best hit: XP_362974 (split)
- Second: XP_365024 (split)
- 3
- BLAST search via check-box on "genomic" Blast page
http://www.ncbi.nlm.nih.gov/sutils/genom_table.cgi?organism=fungi
- Database size: 481,986
- Sequences: 1,177
- Comment: Sequences not specified - "Completed Genome"
- Best hit: No significant hits returned
- 4
- BLAST search via (B) icon linked from genomic BLAST page
http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=148305
- Database size/Sequences/Best hit: same results as for strategy 2.
- Observations
ABA02072 is the protein with highest similarity to NP_010227 in the M. Grisea Genome. It is only found in the search against "nr". XP_365024 is almost identical to that sequence, except for a 25 AA deletion near the C-terminus. Apparently the ABA02072 transcript was predicted with a slightly different gene model. ABA02072 is not present in the RefSeq databases. The extra sequence in ABA02072 is reasonably well supported by sequence similarity.
The protein with highest similarity in the RefSeq database, XP_365024, is found as the second best hit in the genomic BLAST searches, due to the problem of a split HSP, similar to the Aspergillus cases. The best BLAST hit with a split HSP, XP_362974, is only the third-best hit in the full length alignments that strategy 1 has produced.
The database accessed via the check-box selection on the genomic BLAST page is not properly instantiated. It contains less then 10% of the sequences and the homologues are not among them.
- In conclusion
- - ABA02072 appears to be the most similar homologue to NP_010227.
- - The corresponding RefSeq sequence is not identical, due to a variant gene model.
- - Whether BLAST splits or doesn't split HSPs can depend on the source database.
- - E-values depend on the length of the aligned sequences; E-values for full length and partial alignments cannot be compared.
- - This case is *hard* because of 3 concurrent complications:
- A) presence / absence of the "correct" sequence in RefSeq
- B) returning partial / full-length alignments depending on database background
- C) two predicted sequences have different gene-model and thus exon structure.
This may be a bit of an extreme case, but some such variability of results is quite frequently found if one looks for it.
Slide 0025
PSI-BLAST
Slide 0027
Slide 0028
Slide 0029
Slide 0030
In this example, we are observing how the alignment and score for one hit from the entire set evolves over a number of iterations. The first E-value is 2e-04.
Slide 0031
The second E-value for the pair has decreased from 2e-04 to 2e-32. This has transformed a somewhat borderline hit to a certain homologue! If you look carefully, you will see that the detailed position of gaps has changed - just like in MSAs, consensus information can be invaluable to place gaps correctly - and the lenght of the alignment has grown considerably.
Slide 0032
Slide 0033
The E-value decreases further. A careful comparison of the trend of E-values can be very helpful for evaluating borderline hits. E-values of homologues almost always get dramatically smaller through the iterations. E-values of spurious hits get larger or stay approximately the same. Make it a habit to look at the E-value trend in questionable cases but exclude the questionable hit from the profile by unchecking the check-box on the search form, until you are satisfied that the sequence is a homologue after all. Getting unrelated sequences included in your profile will lead to profile corruption!
Slide 0034
Slide 0035
In the end, how many false positives can we expect? Unfortunately, more than we'd think. Jones & Swindells (2002) have run an analysis against decoy sequences that picked up false positives in 5% of all cases, after the fifth iteration, although the E-value threshold was set to 0.001. Even though their methodology was a bit ad hoc and finding false positives about 50 times more frequently than expected is not catastrophic, we must realize that protein sequences are not random strings and that rigorous statistics are very difficult for this complex problem. Use caution, use common sense and in questionable cases try to use independent confirmation of homology, such as conserved binding sites or functional motifs, if possible.
Slide 0036
Other BLAST variations
Slide 0038
A nice extension of normal sequence alignment is the graphical view of similarities. But note that BLAST is not an optimal sequence alignment algorithm and I question why one would use an inferior algorithm if one has better alternatives easily available? If you plan to use the results, as opposed to just looking at them, use EMBOSS needle respectively water instead!
Slide 0039
Beyond BLAST
Slide 0041
Is it possible to improve significantly on BLAST? Yes! An adaptation of the basic strategy of the algorithm improves both the speed and the sensitivity. The Ontario company Bioinformatics Solutions is marketing the Pattern Hunter algorithm, originally developed by Bin Ma of London and Ming Li of Waterloo.
Besides this being an interesting algorithm, this is an interesting spotlight on the Bioinformatics industry as well. A free academic license is offered for Windows installations only; most "real" bioinformatics would run on some flavor of UNIX machines. And while the fee for the full Academic License is not high (on the order of $1,000.00), the company reports "hundreds" of installed users, in contrast to the tens of thousands who use NCBI BLAST. We note that an important resource in world-wide, daily use does not perform as well as it could, because the resource provider does not acquire the intellectual property of those who could improve it. And since BLAST runs as well as the provider needs to make it to maintain its near monopoly in the user community, there seems to be no incentive for the NCBI to update their servers with PatternHunter. This is clearly the opposite of a win-win situation. What happens in Bionformatics is determined by politics and economics as much as in any other field.
Slide 0042
Why is PatternHunter better? Simply because it uses a more advanced way of defining the database words, or "seeds", that are used to find the initial high-scoring hits. PatternHunter uses spaced seeds, i.e. non-consecutive characters that increase the signal to noise ratio of similarity, as explained above. Thus the algorithm is both faster (because it spends less time looking at initial seeds that can't be extended well) and more sensitive, because once a hit is accepted, it is more likely to be true.
Slide 0043
Is it possible to improve significantly on PSI-BLAST? Yes, COMPASS (Sadreyev & Grishin, 2003) takes the idea of profile based searches further by aligning profiles of sequences against a database of profiles. The principle is the same as the "equivalence principle" for homology, sometimes we can detect distantly related homologues through a mutual similarity to an intermediate sequence. Run COMPASS on the Web against the SCOP database of structural domains (see also here Sadreyev et al. 2007, NAR Web server issue).