BIN-ALI-PSI BLAST

PSI-BLAST

(PSI-BLAST in practice, interpretation, significance and profile corruption; other BLASTS; beyond BLAST)

Abstract:

Sensitive, fast, database-scale sequence searches are possible with the PSI-BLAST algorithm. This unit introduces the concept and guides you through searching for APSES domain proteins in MYSPE.

Objectives:
This unit will ...

... introduce the PSI-BLAST algorithm;
... teach how to run PSI-BLAST searches.

Outcomes:
After working through this unit you ...

... can run a PSI-BLAST search via the NCBI's online interface, taking care to avoid profile corruption;
... have discovered APSES domain proteins in MYSPE and added them to your protein database.

Deliverables:

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.

Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.

Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Your protein database: Add the MYSPE proteins to your database in a JSON file.

Prerequisites:
This unit builds on material covered in the following prerequisite units:

BIN-ALI-BLAST (BLAST heuristic sequence alignment)

Heuristic profile-based alignment: PSI BLAST

It is (deceptively) easy to perform BLAST searches via the Web interface, but to use such powerful computational tools to their greatest advantage takes a considerable amount of care, caution and consideration.

PSI-BLAST allows to perform very sensitive searches for homologues that have diverged so far that their pairwise sequence similarity has become insignificant. It achieves this by establishing a profile of sequences to align with the database, rather than searching with individual sequences. This deemphasizes parts of the sequence that are variable and inconsequential, and focusses on the parts of greater structural and functional importance. As a consequence, the signal to noise ratio is greatly enhanced.

In this unit, we will set ourselves the task to use PSI-BLAST and find all orthologs and paralogs of the APSES domain containing transcription factors in MYSPE. We will use these sequences for multiple alignments, calculation of conservation etc.

The first methodical problem we have to address is what sequence to search with. The full-length Mbp1 sequence from Saccharomyces cerevisiae or its RBM from MYSPE are not suitable: They contain multiple domains (in particular the ubiquitous Ankyrin domains) and would create broad, non-specific profiles. The APSES domain sequence by contrast is structurally well defined. The KilA-N domain, being shorter, is less likely to make a sensitive profile. Indeed one of the results of our analysis will be to find whether APSES domains in fungi all have the same length as the Mbp1 domain, or whether some are indeed much shorter, like the KILA-N domain, as suggested by the Pfam alignment.

The second methodical problem we must address is how to perform a sensitive PSI-BLAST search in one organism. We need to balance two conflicting objectives:

If we restrict the PSI-BLAST search to MYSPE, PSI-BLAST has little chance of building a meaningful profile - the number of homologous sequences that exist in this single species is too small. Thus the search will not become very sensitive.

If we don't restrict our search, but search in all species, the number of hits may become unwieldily large. It becomes increasingly difficult to closely check all hits as to whether they have good coverage. Also we need to evaluate the fringe cases of marginal E-value: should a new sequence be added to the profile, or should we hold off on it for one or two iterations, to see whether its E-value drops significantly. Whatever we do, we must avoid profile corruption.

Perhaps this is still be manageable when we are searching in fungi, but imagine you are working with a bacterial protein, or a protein that is conserved across the entire tree of life: your search may find tens of thousands of sequences. And by next year, thousands more will have been added.

Therefore we have to find a middle ground: add enough species (sequences) to compile a sensitive profile, but not so many that we can no longer individually assess the sequences that contribute to the profile. We need to define a broadly representative but manageable set of species - to exploit the transitivity of homology - even if we are interested only in matches in one species: MYSPE. Please reflect on this and make sure you understand why we include sequences in a PSI-BLAST search that we are not actually interested in.

We need a subset of species

that represent as large a range as possible on the evolutionary tree;
that are as well distributed as possible on the tree; and
whose genomes are fully sequenced.

Selecting species for a PSI-BLAST search

To select species, we will use an approach that is conceptually simple: select a set of species according to their shared taxonomic rank in the tree of life. Taxonomic classification is a hierarchical system that describes evolutionary relatedness for all living entities. The levels of this hierarchy are so called taxonomic ranks. These ranks are defined in Codes of Nomenclature that are curated by the self-governed international associations of scientists working in the field. The number of ranks is not specified: there is a general consensus on seven principal ranks (see below, in bold) but many subcategories exist and may be newly introduced. It is desired–but not mandated–that ranks represent clades (a group of related species, or a "branch" of a phylogeny), and it is desired–but not madated–that the rank is sharply defined. The system is based on subjective dissimilarity. Needless to say that it is in constant flux.

If we follow a link to an entry in the NCBI's Taxonomy database, eg. Saccharomyces cerevisiae S228c, the strain from which the original "yeast genome" was sequenced in the late 1990s, we see the following specification of its taxonomic lineage:

cellular organisms; Eukaryota; Opisthokonta;
Fungi; Dikarya; Ascomycota; Saccharomyceta;
Saccharomycotina; Saccharomycetes;
Saccharomycetales; Saccharomycetaceae;
Saccharomyces; Saccharomyces cerevisiae

These names can be mapped into taxonomic ranks, since the suffixes of these names e.g. -mycotina, -mycetaceae are specific to defined ranks. (NCBI does not provide this mapping, but Wikipedia is helpful here.)

Rank	Suffix	Example
Domain		Eukaryota (Eukarya)
Subdomain		Opisthokonta
Kingdom		Fungi
Subkingdom		Dikarya
Phylum		Ascomycota
rankless taxon^[1]	-myceta	Saccharomyceta
Subphylum	-mycotina	Saccharomycotina
Class	-mycetes	Saccharomycetes
Subclass	-mycetidae
Order	-ales	Saccharomycetales
Family	-aceae	Saccharomycetaceae
Subfamily	-oideae
Tribe	-eae
Subtribe	-ineae
Genus		Saccharomyces
Species		Saccharomyces cerevisiae
Strain		Saccharomyces cerevisiae S288c

You can see that there is no common mapping between the yeast lineage listed at the NCBI and the commonly recognized categories - not all ranks are represented. Nor is this consistent across species in the taxonomic database: some have subfamily ranks and some don't. And the tree is in no way normalized - some of the ranks have thousands of members, and for some, only a single extant member may be known, or it may be a rank that only relates to the fossil record.

But the ranks do provide some guidance to evolutionary divergence. Say you want to choose four species across the tree of life for a study, you should choose one from each of the major domains of life: Eubacteria, Euryarchaeota, Crenarchaeota-Eocytes, and Eukaryotes. Or you want to study a gene that is specific to mammals. Then you could choose from the clades listed in the NCBI taxonomy database under Mammalia (a class rank, and depending how many species you would want to include, use the subclass-, order-, or family rank (hover over the names to see their taxonomic rank.)

I have chosen the 10 species below to define a well-distributed search-space for PSI-BLAST. Of course you must add MYSPE to the selection for your own search purposes.

To enter these 10 species as an Entrez restriction, they need to be formatted as below. (One could also enter species one by one, by pressing the (+) button after the organism list - but that would be lame.)


"Aspergillus nidulans"[organism] OR
"Bipolaris oryzae"[organism] OR
"Coprinopsis cinerea"[organism] OR
"Cryptococcus neoformans"[organism] OR
"Neurospora crassa"[organism] OR
"Puccinia Graminis"[organism] OR
"Saccharomyces cerevisiae"[organism] OR
"Schizosaccharomyces pombe"[organism] OR
"Ustilago maydis"[organism] OR
"Wallemia mellicola"[organism]

Executing the PSI-BLAST search

We have a list of species. Good. Next up: how do we use it.

Task:

Navigate to the BLAST homepage.
Select protein BLAST.
Paste the MBP1_SACCE APSES domain sequence into the search field:

>APSES_MBP1 Residues 4-102 of S. cerevisiae Mbp1
QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI
LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF

Select Reference proteins (refseq_protein) as the database.
Copy the Entrez restrictions from above and add the correct name for MYSPE. (Obviously, you can't find sequences in MYSPE if MYSPE is not included among the genomes you are searching in.) Paste the list into the Entrez Query field.
In the Algorithm section, select PSI-BLAST.
Click on BLAST.

Evaluate the results carefully. Since we did not change the algorithm parameters, the threshold for inclusion was set at an E-value of 10 by default, and for a PSI BLAST search this is too lenient, i.e. the chances of including non-homologous hits is far too high. If you look at the table of your hits– in the Sequences producing significant alignments... section– there may also be a few sequences that have a low query coverage of less than 80%. Let's exclude these from the profile initially: not to worry, if they are true positives, the will come back with improved E-values and greater coverage in subsequent iterations. But if they were false positives, their E-values will rise and they will drop out of the profile and not contaminate it.

Task:

In the header section, click on Formatting options and in the line "Format for..." set the with inclusion threshold to 0.001 (This means E-values can't be above 10^-03 for the sequence to be included.)
Click on the Reformat button (top right).
In the table of sequence descriptions (not alignments!), click on Query cover to sort the table by coverage, not by score.
Deselect the check mark next to all seqeunces with less than 80% query coverage in the column on the right "Select for PSI BLAST"
Then scroll to Run PSI-BLAST iteration 2 ... and click on Go.

This is now the "real" PSI-BLAST at work: it constructs a profile from all the full-length sequences and searches with the profile, not with any individual sequence. Note that we are controlling what goes into the profile in two ways:

we are explicitly removing sequences with poor coverage; and
we are requiring a more stringent minimum E-value for each sequence.

Task:

Again, study the table of hits. Sequences highlighted in yellow have met the search criteria in the second iteration and are proposed for inclusion in the next iteration. Note that the coverage of (some) of the previously excluded sequences is now above 80% (These are the sequences that do not have a green check-mark next to them in the right-hand column). These are the ones you need to check carefully: do you agree that they should be included? If there is any doubt, perhaps because of a really marginal E-value, poor coverage or a function annotation that is not compatible with your query, it is safer to exclude a sequence, than to risk profile corruption. If the sequence is a true positive, it will return to the list in later iterations, usually with a better E-value as the profile improves. It's a good idea to note such sequences in your journal so you can keep track of how their E-values change.
Let's exclude partial matches one more time. Again, deselect all sequences with less than 80% coverage. Then run the third iteration.
Iterate the search in this way, successively relaxing the coverage threshold, until no more "New" sequences are added to the profile. The search has converged. Obviously the result depends on your data, but it would be unusual if the search had not converged after 6 iterations or so, and there is probably a mistake if there are more than 70 hits or so.
Now look at the list of excluded hits (if any), the hits that are reasonable but didn't quite make the cut. Are there any from MYSPE that seem like they should actually be included? Perhaps their E-value is only marginally above the threshold? If that's the case, try returning the E-value threshold to the default 0.005 and see what happens...

Iterate the PSI-BLAST search until no new sequences have been added (no yellow highlights). Then relax the coverage criteria - to 60% first, then less. But watch the E-values of the shorter sequences carefully. You can click on any of theDon't include sequences with E values greater than 1e-04 or so. The S. cerevisiae Xbp1 sequence seems to be a true positive, but it won't match the profile well because of a long insertion in the APSES domain - thus it has only 38% coverage. But its E-value is probably less then 1e-06. Keep that for last. There is no harm in proceeding through these iterations slowly and conservatively. But if you include only one non-homologous sequence by mistake, you will get profile corruption. For example, you might notice hypothetical protein CC1G_13964 [Coprinopsis cinerea okayama7#130] or hypothetical protein CNAG_05349 [Cryptococcus neoformans var. grubii H99] at E-values around 1e-04. These are probably false positives with an unexpectedly high number of indels and a location towards the C-terminus of the protein. Keep them out!

Once no "new" sequences have been added, we would always get the same result on additional iterations because there are no more changes to the profile. We say that the search has converged. Last I ran this, I took about 10 iterations to arrive at 45 high-confidence hits for the ten reference species. You should find approximately the same number - plus a few more for the domains you find in MYSPE.

Time to harvest.

Task:

In the header section of the BLAST report, click on Taxonomy reports and find MYSPE in the Organism Report section. These are your APSES domain homologs. All of them. There is a link to the alignment, the BLAST score, the E-value, and a link to the entry in RefSeq.
Copy the sequence identifiers from MYSPE, with E-values above your defined threshold to your journal.

For example, the table of Saccharomyces hits contains the following information:

------------------------------------------------------------------------------------------------
Saccharomyces cerevisiae S288c [ascomycetes]
------------------------------------------------------------------------------------------------
transcription factor MBP1 [Saccharomyces cerevisiae S288C]             138    8e-39    NP_010227
SBF complex DNA-binding subunit SWI4 [Saccharomyces cerevisiae S288C]  118    7e-32    NP_011036
Phd1p [Saccharomyces cerevisiae S288C]                                 110    8e-30    NP_012881
Sok2p [Saccharomyces cerevisiae S288C]                                 110    4e-29    NP_013729
Xbp1p [Saccharomyces cerevisiae S288C]                                46.0    2e-06    NP_012165
------------------------------------------------------------------------------------------------

Xbp1 is a special case. It has only very low coverage, but that is because it has a long domain insertion and the N-terminal match often is not recognized by alignment because the gap scores for long indels are unrealistically large. For now, I keep that sequence with the others, noting that it has a homologue in Neurospora crassa with a similar situation.

Task:

Add your sequences to your protein database

To add the sequences to your database, open each of the links to the RefSeq record for MYSPE organism into a separate tab.
Find the UniProt IDs.
Make a copy of "./data/refAPSES_PSI-BLAST.json" and call it "MYSPE_APSES_PSI-BLAST.json"
Open "./data/MYSPE_APSES_PSI-BLAST.json" in RStudio to see how it is formatted.
Delete the existing data for however many proteins your PSI-BLAST search has discovered in the MYSPE genome, and replace it with your protein's data. Save the file when you are done.
Validate your files online at https://jsonlint.com/
Add the following line to your database creation script myScripts/makeProteinDB.R:

  myDB <- dbAddProtein(    myDB, fromJSON("MYSPE_APSES_PSI-BLAST.json"))

Save myScripts/makeProteinDB.R.
Recreate the database by running:

  source("myScripts/makeProteinDB.R")

Confirm that your proteins have been correctly added:

sel <- which(myDB$taxonomy$species == MYSPE)
(taxid <- myDB$taxonomy$ID[sel])

sel <- which(myDB$protein$taxonomyID == taxid)
myDB$protein$name[sel]      # Does this return all of the APSES domain protein
                            # names of MYSPE? If not, this needs to be fixed.
                            # Ask for advice.

nchar(myDB$protein$sequence[sel])      # Does this give the correct sequence
                                       # lengths? If not, this needs to be
                                       # fixed. Ask for advice.

Notes

↑ The -myceta are well supported groups above the Class rank. See Leotiomyceta for details and references.

About ...

Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2020-09-25

Version:

1.1

Version history:

1.1 Maintenance
1.0 First live version, updated to new BLAST interface.
0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.

[1] The -myceta are well supported groups above the Class rank. See Leotiomyceta for details and references.

[1]

BIN-ALI-PSI BLAST

Contents

Evaluation

Contents

Heuristic profile-based alignment: PSI BLAST

Selecting species for a PSI-BLAST search

Executing the PSI-BLAST search

Further reading, links and resources

Notes

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Sections

Tools