BIN-ALI-PSI BLAST
PSI-BLAST
Keywords: PSI-BLAST in practice, interpretation, significance and profile corruption; other BLASTS; beyond BLAST
Contents
This unit is under development. There is some contents here but it is incomplete and/or may change significantly: links may lead to nowhere, the contents is likely going to be rearranged, and objectives, deliverables etc. may be incomplete or missing. Do not work with this material until it is updated to "live" status.
Abstract
...
This unit ...
Prerequisites
You need to complete the following units before beginning this one:
Objectives
...
Outcomes
...
Deliverables
- Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
- Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
- Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation
Evaluation: NA
- This unit is not evaluated for course marks.
Contents
Task:
- Read the introductory notes on the princiles and use of the "Position Specific Iterated BLAST" (PSI-BLAST) algorithm.
Heuristic profile-based alignment: PSI BLAST
It is (deceptively) easy to perform BLAST searches via the Web interface, but to use such powerful computational tools to their greatest advantage takes a considerable amount of care, caution and consideration.
PSI-BLAST allows to perform very sensitive searches for homologues that have diverged so far that their pairwise sequence similarity has become insignificant. It achieves this by establishing a profile of sequences to align with the database, rather than searching with individual sequences. This deemphasizes parts of the sequence that are variable and inconsequential, and focusses on the parts of greater structural and functional importance. As a consequence, the signal to noise ratio is greatly enhanced.
In this part of the assignment, we will set ourselves the task to use PSI-BLAST and find all orthologs and paralogs of the APSES domain containing transcription factors in YFO. We will use these sequences for multiple alignments, calculation of conservation etc.
The first methodical problem we have to address is what sequence to search with. The full-length Mbp1 sequence from Saccharomyces cerevisiae or its RBM from YFO are not suitable: They contain multiple domains (in particular the ubiquitous Ankyrin domains) and would create broad, non-specific profiles. The APSES domain sequence by contrast is structurally well defined. The KilA-N domain, being shorter, is less likely to make a sensitive profile. Indeed one of the results of our analysis will be to find whether APSES domains in fungi all have the same length as the Mbp1 domain, or whether some are indeed much shorter, like the KILA-N domain, as suggested by the Pfam alignment.
The second methodical problem we must address is how to perform a sensitive PSI-BLAST search in one organism. We need to balance two conflicting objectives:
- If we restrict the PSI-BLAST search to YFO, PSI-BLAST has little chance of building a meaningful profile - the number of homologues that actually are in YFO is too small. Thus the search will not become very sensitive.
- If we don't restrict our search, but search in all species, the number of hits may become unwieldily large. It becomes increasingly difficult to closely check all hits as to whether they have good coverage. Also we need to evaluate the fringe cases of marginal E-value: should a new sequence be added to the profile, or should we hold off on it for one or two iterations, to see whether its E-value drops significantly. By all means, we need to avoid profile corruption.
Perhaps this is still be manageable when we are searching in fungi, but imagine you are working with a bacterial protein, or a protein that is conserved across the entire tree of life: your search may find tens of thousands of sequences. And by next year, thousands more will have been added.
Therefore we have to find a middle ground: add enough organisms (sequences) to compile a sensitive profile, but not so many that we can no longer individually assess the sequences that contribute to the profile. We need to define a broadly representative but manageable set of species - to exploit the transitivity of homology - even if we are interested only in matches in one species: YFO. Please reflect on this and make sure you understand why we include sequences in a PSI-BLAST search that we are not actually interested in.
We need a subset of species
- that represent as large a range as possible on the evolutionary tree;
- that are as well distributed as possible on the tree; and
- whose genomes are fully sequenced.
Selecting species for a PSI-BLAST search
To select species, we will use an approach that is conceptually simple: select a set of species according to their shared taxonomic rank in the tree of life. Biological classification provides a hierarchical system that describes evolutionary relatedness for all living entities. The levels of this hierarchy are so called taxonomic ranks. These ranks are defined in Codes of Nomenclature that are curated by the self-governed international associations of scientists working in the field. The number of ranks is not specified: there is a general consensus on seven principal ranks (see below, in bold) but many subcategories exist and may be newly introduced. It is desired–but not mandated–that ranks represent clades (a group of related species, or a "branch" of a phylogeny), and it is desired–but not madated–that the rank is sharply defined. The system is based on subjective dissimilarity. Needless to say that it is in flux.
If we follow a link to an entry in the NCBI's Taxonomy database, eg. Saccharomyces cerevisiae S228c, the strain from which the original "yeast genome" was sequenced in the late 1990s, we see the following specification of its taxonomic lineage:
cellular organisms; Eukaryota; Opisthokonta;
Fungi; Dikarya; Ascomycota; Saccharomyceta;
Saccharomycotina; Saccharomycetes;
Saccharomycetales; Saccharomycetaceae;
Saccharomyces; Saccharomyces cerevisiae
These names can be mapped into taxonomic ranks, since the suffixes of these names e.g. -mycotina, -mycetaceae are specific to defined ranks. (NCBI does not provide this mapping, but Wikipedia is helpful here.)
Rank | Suffix | Example |
Domain | Eukaryota (Eukarya) | |
Subdomain | Opisthokonta | |
Kingdom | Fungi | |
Subkingdom | Dikarya | |
Phylum | Ascomycota | |
rankless taxon[1] | -myceta | Saccharomyceta |
Subphylum | -mycotina | Saccharomycotina |
Class | -mycetes | Saccharomycetes |
Subclass | -mycetidae | |
Order | -ales | Saccharomycetales |
Family | -aceae | Saccharomycetaceae |
Subfamily | -oideae | |
Tribe | -eae | |
Subtribe | -ineae | |
Genus | Saccharomyces | |
Species | Saccharomyces cerevisiae |
You can see that there is no common mapping between the yeast lineage listed at the NCBI and the commonly recognized categories - not all ranks are represented. Nor is this consistent across species in the taxonomic database: some have subfamily ranks and some don't. And the tree is in no way normalized - some of the ranks have thousands of members, and for some, only a single extant member may be known, or it may be a rank that only relates to the fossil record.
But the ranks do provide some guidance to evolutionary divergence. Say you want to choose four species across the tree of life for a study, you should choose one from each of the major domains of life: Eubacteria, Euryarchaeota, Crenarchaeota-Eocytes, and Eukaryotes. Or you want to study a gene that is specific to mammals. Then you could choose from the clades listed in the NCBI taxonomy database under Mammalia (a class rank, and depending how many species you would want to include, use the subclass-, order-, or family rank (hover over the names to see their taxonomic rank.)
I have chosen the 10 species below to define a well-distributed search-space for PSI-BLAST. Of course you must also include YFO in the selection (if YFO is not in this list already).
To enter these 10 species as an Entrez restriction, they need to be formatted as below. (One could also enter species one by one, by pressing the (+) button after the organism list)
"Wallemia mellicola"[organism] OR
"Puccinia Graminis"[organism] OR
"Ustilago maydis"[organism] OR
"Cryptococcus neoformans"[organism] OR
"Coprinopsis cinerea"[organism] OR
"Schizosaccharomyces pombe"[organism] OR
"Aspergillus nidulans"[organism] OR
"Neurospora crassa"[organism] OR
"Bipolaris oryzae"[organism] OR
"Saccharomyces cerevisiae"[organism]
Executing the PSI-BLAST search
We have a list of species. Good. Next up: how do we use it.
Task:
- Navigate to the BLAST homepage.
- Select protein BLAST.
- Paste the MBP1_SACCE APSES domain sequence into the search field:
>APSES_MBP1 Residues 4-102 of S. cerevisiae Mbp1 QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
- Select refseq as the database.
- Copy the Entrez restrictions from above and add the correct name for YFO to the list if it is not there already. (Obviously, you can't find sequences in YFO if YFO is not included among the genomes you are searching in.) Paste the list into the Entrez Query field.
- In the Algorithm section, select PSI-BLAST.
- Click on BLAST.
Evaluate the results carefully. Since we did not change the algorithm parameters, the threshold for inclusion was set at an E-value of 0.005 by default, and that may be a bit too lenient, i.e. it might include sequences that are not homologous. If you look at the table of your hits– in the Sequences producing significant alignments... section– there may also be a few sequences that have a low query coverage of less than 80%. Let's exclude these from the profile initially: not to worry, if they are true positives, the will come back with improved E-values and greater coverage in subsequent iterations. But if they were false positives, their E-values will rise and they will drop out of the profile and not contaminate it.
Task:
- In the header section, click on Formatting options and in the line "Format for..." set the with inclusion threshold to
0.001
(This means E-values can't be above 10-03 for the sequence to be included.) - Click on the Reformat button (top right).
- In the table of sequence descriptions (not alignments!), click on Query cover to sort the table by coverage, not by score.
- Deselect the check mark next to these sequences in the second-to-rightmost column Select for PSI blast.
- Then scroll to Run PSI-BLAST iteration 2 ... and click on
Go
.
This is now the "real" PSI-BLAST at work: it constructs a profile from all the full-length sequences and searches with the profile, not with any individual sequence. Note that we are controlling what goes into the profile in two ways:
- we are explicitly removing sequences with poor coverage; and
- we are requiring a more stringent minimum E-value for each sequence.
Task:
- Again, study the table of hits. Sequences highlighted in yellow have met the search criteria in the second iteration and are proposed for inclusion in the next iteration. Note that the coverage of (some) of the previously excluded sequences is now above 80%. These are the ones you need to check carefully: do you agree that they should be included? If there is any doubt, perhaps because of a really marginal E-value, poor coverage or a function annotation that is not compatible with your query, it is safer to exclude a sequence than to risk profile corruption. If the sequence is a true positive, it will return to the list in later iterations, usually with a better E-value as the profile improves. It's a good idea to note such sequences in your journal so you can keep track of how their E-values change.
- Let's exclude partial matches one more time. Again, deselect all sequences with less than 80% coverage. Then run the third iteration.
- Iterate the search in this way, successively relaxing the coverage threshold, until no more "New" sequences are added to the profile. The search has converged. Obviously the result depends on your data, but it would be unusual if the search had not converged after 6 iterations or so, and there is probably a mistake if there are more than 70 hits or so.
- Now look at the list of excluded hits (if any), the hits that are reasonable but didn't quite make the cut. Are there any from YFO that seem like they should actually be included? Perhaps their E-value is only marginally above the threshold? If that's the case, try returning the E-value threshold to the default 0.005 and see what happens...
Once no "new" sequences have been added, we would always get the same result on additional iterations because there are no more changes to the profile. We say that the search has converged. Time to harvest.
Task:
- In the header section of the BLAST report, click on Taxonomy reports and find YFO in the Organism Report section. These are your APSES domain homologs. All of them. There is a link to the alignment, the BLAST score, the E-value, and a link to the entry in RefSeq.
- From the report copy the sequence identifiers from YFO, with E-values above your defined threshold to your notebook.
For example, the list of Saccharomyces genes contains the following information:
Saccharomyces cerevisiae S288c [ascomycetes] taxid 559292
4e-37 ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c]
2e-30 ref|NP_011036.1| Swi4p [Saccharomyces cerevisiae S288c]
4e-27 ref|NP_012881.1| Phd1p [Saccharomyces cerevisiae S288c]
4e-27 ref|NP_013729.1| Sok2p [Saccharomyces cerevisiae S288c]
7e-06 ref|NP_012165.1| Xbp1p [Saccharomyces cerevisiae S288c]
Xbp1 is a special case. It has only very low coverage, but that is because it has a long domain insertion and the N-terminal match often is not recognized by alignment because the gap scores for long indels are unrealistically large. For now, I keep that sequence with the others.
Task:
- To add the sequences to your database, open each of the links to the RefSeq record for YFO organism into a separate tab.
- Find the UniProt IDs
- Go through the (short) section
add PSI BLAST results
in the Assignment 04 R-script.
So much for using PSI-BLAST. The last step seems a bit tedious, adding all this information by hand. There's got to be a better way, right?
But for now, we'll have a look at what the sequences tell us.
Model Based Alignments: PSSMs and HMMs
- Position Specific Scoring Matrices (PSSMs)
The sensitivity of PSI-BLAST is based on the alignment of profiles of related sequences. The profiles are represented as position specific scoring matrices compiled from the alignment of hits, first to the original sequence and then to the profile. Incidentally, this process can also be turned around, and a collection of pre-compiled PSSMs can be used to annotate protein sequence: this is the principle employed by RPS-BLAST, the tool that identifies conserved domains at the beginning of every BLAST search, and has been used to build the CDD database of conserved domains (for a very informative help-page on CDD see here.
- Hidden Markov Models (HMMs)
An approach to represent such profile information that is more general than PSSMs is a Hidden Markov model (HMM) and the standard tool to use HMMs in Bioinformatics is HMMER, written by Sean Eddy. HMMER has allowed to represent the entirety of protein sequences as a collection of profiles, stored in databases such as Pfam, Interpro, and SMART. While the details are slightly different, all of these services allow to scan sequences for the presence of domains. Importantly thus, the alignment results are not collections of full-length protein families, but annotate to domain families, i.e. full length proteins are decomposed into their homologous domains. This is a very powerful approach towards the functional annotation of unknown sequences.
In this section, we will annotate the YFO sequence with the domains it contains, using the database of domain HMMs curated by SMART in Heidelberg and Pfam at the EMBL. We will then compare these annotations with those determined for the orthologues in the reference species. In this way we can enhance the information about one protein by determining how its features are conserved.
Add PSI BLAST results to your Database
Task:
Once you have discovered APSES domain proteins in YFO, you should enter them
into your database. Copy and edit your code template that adds entries to the
protein table, once for each of your YFO hits.
Execute the code and confirm that the proteins have been added to your database with all of their required information. You can check the code in addSACCE_APSESproteins.R and compare this with your code if you get errors. But careful in case you want to reuse that code: dont use "ns = ref" in your dbAutoincrement calls!
Finally save the new version of myDB as version 03: "myDB.03.RData"
Further reading, links and resources
Notes
- ↑ The -myceta are well supported groups above the Class rank. See Leotiomyceta for details and references.
Self-evaluation
If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2017-08-05
Modified:
- 2017-08-05
Version:
- 0.1
Version history:
- 0.1 First stub
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.