Expected Preparations:

  [BIN-ALI] BLAST
 
  The units listed above are part of this course and contain important preparatory material.  

Keywords: PSI-BLAST in practice; interpretation; significance and profile corruption; other BLASTS; beyond BLAST

Objectives:

This unit will …

  • … introduce the PSI-BLAST algorithm;

  • … teach how to run PSI-BLAST searches.

Outcomes:

After working through this unit you …

  • … can run a PSI-BLAST search via the NCBI’s online interface, taking care to avoid profile corruption;

  • … have discovered APSES domain proteins in MYSPE and added them to your protein database.


Deliverables:

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.

Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.

Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Your protein database: Add the MYSPE proteins to your database in a JSON file.


Evaluation:

NA: This unit is not evaluated for course marks.

Contents

Sensitive, fast, database-scale sequence searches are possible with the PSI-BLAST algorithm. This unit introduces the concept and guides you through searching for APSES domain proteins in MYSPE.

Task…

Heuristic profile-based alignment: PSI-BLAST

 

It is (deceptively) easy to perform BLAST searches via the Web interface, but to use such powerful computational tools to their greatest advantage takes a considerable amount of care, caution and consideration.

PSI-BLAST makes it possible to perform very sensitive searches for homologues that have diverged so far that their pairwise sequence similarity has become insignificant. It achieves this by establishing a profile of sequences to align with the database, rather than searching with individual sequences. This deemphasizes parts of the sequence that are variable and inconsequential, and focusses on the parts of greater structural and functional importance. As a consequence, the signal-to-noise ratio is greatly enhanced.
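To make the idea of a profile concrete, here is a minimal sketch in Python. It is not PSI-BLAST's actual procedure: the toy alignment, the uniform background frequencies and the add-one pseudocount are all assumptions chosen for illustration, and real profiles use sequence weighting and more careful pseudocounts. It only shows why a conserved column contributes more to a score than a variable one.

```python
import math
from collections import Counter

# Hypothetical toy alignment: columns 0 and 3 are conserved, 1 and 2 vary.
ALIGNMENT = ["WAKG", "WSRG", "WTEG", "WLDG"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequency
PSEUDOCOUNT = 1.0                 # assumption: simple add-one smoothing


def build_profile(alignment):
    """Return one dict per alignment column: residue -> log-odds score in bits."""
    profile = []
    total = len(alignment) + PSEUDOCOUNT * len(ALPHABET)
    for column in zip(*alignment):
        counts = Counter(column)  # missing residues count as 0
        profile.append({
            aa: math.log2(((counts[aa] + PSEUDOCOUNT) / total) / BACKGROUND)
            for aa in ALPHABET
        })
    return profile


def score(profile, sequence):
    """Sum the column scores for an ungapped sequence of profile length."""
    return sum(column[aa] for column, aa in zip(profile, sequence))


if __name__ == "__main__":
    profile = build_profile(ALIGNMENT)
    # Matches at the conserved positions outweigh matches at variable ones.
    print("WQQG:", round(score(profile, "WQQG"), 2))
    print("AAKA:", round(score(profile, "AAKA"), 2))
```

In this sketch "WQQG" matches only the two conserved columns, "AAKA" matches only the two variable ones, and the first still scores higher. That is the sense in which a profile focusses on the important positions.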

In this unit, we will set ourselves the task of using PSI-BLAST to find all orthologs and paralogs of the APSES-domain-containing transcription factors in MYSPE. We will use these sequences for multiple alignments, calculation of conservation etc.

The first methodological problem we have to address is what sequence to search with. Neither the full-length Mbp1 sequence from Saccharomyces cerevisiae nor its RBM from MYSPE is suitable: both contain multiple domains (in particular the ubiquitous Ankyrin domains) and would create broad, non-specific profiles. The APSES domain sequence, by contrast, is structurally well defined. The KilA-N domain, being shorter, is less likely to make a sensitive profile. Indeed, one of the results of our analysis will be to find whether APSES domains in fungi all have the same length as the Mbp1 domain, or whether some are indeed much shorter, like the KilA-N domain, as suggested by the Pfam alignment.

The second methodological problem we must address is how to perform a sensitive PSI-BLAST search in one organism. We need to balance two conflicting objectives:

Perhaps this is still manageable when we are searching in fungi, but imagine you are working with a bacterial protein, or a protein that is conserved across the entire tree of life: your search may find tens of thousands of sequences. And by next year, thousands more will have been added.

Therefore we have to find a middle ground: add enough species (sequences) to compile a sensitive profile, but not so many that we can no longer individually assess the sequences that contribute to the profile. We need to define a broadly representative but manageable set of species - to exploit the transitivity of homology - even if we are interested only in matches in one species: MYSPE. Please reflect on this and make sure you understand why we include sequences in a PSI-BLAST search that we are not actually interested in.
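The transitivity argument can be illustrated with a small Python sketch. This is a loose analogy only: PSI-BLAST merges intermediate sequences into a profile rather than following chains of pairwise hits, and the sequence labels and the set of "significant" pairs below are entirely made up. The point is that a distant match in MYSPE may only become reachable once sequences from other species are allowed to take part.

```python
# Hypothetical pairwise hits that pass some significance threshold.
SIGNIFICANT = {
    ("query", "homolog_A"),
    ("homolog_A", "homolog_B"),
    ("homolog_B", "target_MYSPE"),
}


def reachable(start, pairs, allowed):
    """Collect all sequences linked to `start` through chains of significant
    hits, using only sequences in `allowed` as intermediates."""
    neighbours = {}
    for a, b in pairs:
        if a in allowed and b in allowed:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
    found, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for nxt in neighbours.get(node, ()):
            if nxt not in found:
                found.add(nxt)
                frontier.append(nxt)
    return found - {start}


if __name__ == "__main__":
    only_myspe = {"query", "target_MYSPE"}
    everything = only_myspe | {"homolog_A", "homolog_B"}
    print(reachable("query", SIGNIFICANT, only_myspe))   # nothing is found
    print(reachable("query", SIGNIFICANT, everything))   # target is reached
```

Restricting the search to MYSPE alone finds nothing here, because the query and the target are not directly similar; including the two intermediate homologs connects them.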

We need a subset of species

  1. that represent as large a range as possible on the evolutionary tree;

  2. that are as well distributed as possible on the tree; and

  3. whose genomes are fully sequenced.
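One way to think about these three criteria is as a selection problem. The sketch below is not the procedure used in this course; it is a hypothetical greedy "farthest-point" selection over made-up species labels, made-up pairwise tree distances and made-up genome-completeness flags. It first drops species without a complete genome, then starts from the most distant pair, and then repeatedly adds the species that is farthest from everything already chosen.

```python
from itertools import combinations

# Hypothetical pairwise tree distances (arbitrary units) between made-up species.
DIST = {
    frozenset(("sp1", "sp2")): 0.10, frozenset(("sp1", "sp3")): 0.50,
    frozenset(("sp1", "sp4")): 0.90, frozenset(("sp1", "sp5")): 1.20,
    frozenset(("sp2", "sp3")): 0.50, frozenset(("sp2", "sp4")): 0.85,
    frozenset(("sp2", "sp5")): 1.20, frozenset(("sp3", "sp4")): 0.80,
    frozenset(("sp3", "sp5")): 1.10, frozenset(("sp4", "sp5")): 0.70,
}
# Hypothetical flags: is the genome fully sequenced?
COMPLETE = {"sp1": True, "sp2": True, "sp3": True, "sp4": True, "sp5": False}


def distance(a, b):
    return DIST[frozenset((a, b))]


def pick_representatives(species, k):
    """Greedy farthest-point selection of k species with complete genomes."""
    candidates = sorted(s for s in species if COMPLETE[s])  # criterion 3
    if k < 2 or len(candidates) < k:
        raise ValueError("need 2 <= k <= number of fully sequenced species")
    # Criterion 1: start from the most distant pair to span a large range.
    chosen = list(max(combinations(candidates, 2), key=lambda p: distance(*p)))
    # Criterion 2: keep adding the species farthest from all chosen ones.
    while len(chosen) < k:
        rest = [s for s in candidates if s not in chosen]
        chosen.append(max(rest, key=lambda s: min(distance(s, c) for c in chosen)))
    return chosen


if __name__ == "__main__":
    print(pick_representatives(COMPLETE.keys(), 3))
```

With these invented numbers, the near-duplicate "sp2" is skipped in favour of "sp3", and "sp5" is never considered because its genome is flagged as incomplete.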

 

Further Reading

Questions, comments

If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.

Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.

References

Page ID: BIN-ALI-PSI_BLAST

Author:
Boris Steipe ( <boris.steipe@utoronto.ca> )
Created:
2017-08-05
Last modified:
2022-09-14
Version:
1.1
Version History:
–  1.1 Maintenance
–  1.0 First live version, updated to new BLAST interface.
–  0.1 First stub
Tagged with:
–  Unit
–  Live
–  Has lecture slides
–  Has R code examples
–  Has further reading

 

[END]