BIN-ALI-PSI BLAST

From "A B C"
(Redirected from BIN-ALI-PSI-BLAST)
Jump to navigation Jump to search

PSI-BLAST

(PSI-BLAST in practice, interpretation, significance and profile corruption; other BLASTS; beyond BLAST)


 


Abstract:

Sensitive, fast, database-scale sequence searches are possible with the PSI-BLAST algorithm. This unit introduces the concept and guides you through searching for APSES domain proteins in MYSPE.


Objectives:
This unit will ...

  • ... introduce the PSI-BLAST algorithm;
  • ... teach how to run PSI-BLAST searches.

Outcomes:
After working through this unit you ...

  • ... can run a PSI-BLAST search via the NCBI's online interface, taking care to avoid profile corruption;
  • ... have discovered APSES domain proteins in MYSPE and added them to your protein database.

Deliverables:

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
  • Your protein database
    Add the MYSPE proteins to your database in a JSON file.

    Prerequisites:
    This unit builds on material covered in the following prerequisite units:


     



     



     


    Evaluation

    Evaluation: NA

    This unit is not evaluated for course marks.

    Contents


    Heuristic profile-based alignment: PSI BLAST

     

    It is (deceptively) easy to perform BLAST searches via the Web interface, but to use such powerful computational tools to their greatest advantage takes a considerable amount of care, caution and consideration.

    PSI-BLAST allows to perform very sensitive searches for homologues that have diverged so far that their pairwise sequence similarity has become insignificant. It achieves this by establishing a profile of sequences to align with the database, rather than searching with individual sequences. This deemphasizes parts of the sequence that are variable and inconsequential, and focusses on the parts of greater structural and functional importance. As a consequence, the signal to noise ratio is greatly enhanced.

    In this unit, we will set ourselves the task to use PSI-BLAST and find all orthologs and paralogs of the APSES domain containing transcription factors in MYSPE. We will use these sequences for multiple alignments, calculation of conservation etc.

    The first methodical problem we have to address is what sequence to search with. The full-length Mbp1 sequence from Saccharomyces cerevisiae or its RBM from MYSPE are not suitable: They contain multiple domains (in particular the ubiquitous Ankyrin domains) and would create broad, non-specific profiles. The APSES domain sequence by contrast is structurally well defined. The KilA-N domain, being shorter, is less likely to make a sensitive profile. Indeed one of the results of our analysis will be to find whether APSES domains in fungi all have the same length as the Mbp1 domain, or whether some are indeed much shorter, like the KILA-N domain, as suggested by the Pfam alignment.

    The second methodical problem we must address is how to perform a sensitive PSI-BLAST search in one organism. We need to balance two conflicting objectives:

    • If we restrict the PSI-BLAST search to MYSPE, PSI-BLAST has little chance of building a meaningful profile - the number of homologous sequences that exist in this single species is too small. Thus the search will not become very sensitive.
    • If we don't restrict our search, but search in all species, the number of hits may become unwieldily large. It becomes increasingly difficult to closely check all hits as to whether they have good coverage. Also we need to evaluate the fringe cases of marginal E-value: should a new sequence be added to the profile, or should we hold off on it for one or two iterations, to see whether its E-value drops significantly. Whatever we do, we must avoid profile corruption.

    Perhaps this is still be manageable when we are searching in fungi, but imagine you are working with a bacterial protein, or a protein that is conserved across the entire tree of life: your search may find tens of thousands of sequences. And by next year, thousands more will have been added.

    Therefore we have to find a middle ground: add enough species (sequences) to compile a sensitive profile, but not so many that we can no longer individually assess the sequences that contribute to the profile. We need to define a broadly representative but manageable set of species - to exploit the transitivity of homology - even if we are interested only in matches in one species: MYSPE. Please reflect on this and make sure you understand why we include sequences in a PSI-BLAST search that we are not actually interested in.

    We need a subset of species

    1. that represent as large a range as possible on the evolutionary tree;
    2. that are as well distributed as possible on the tree; and
    3. whose genomes are fully sequenced.


     



     

    Selecting species for a PSI-BLAST search

     


    To select species, we will use an approach that is conceptually simple: select a set of species according to their shared taxonomic rank in the tree of life. Taxonomic classification is a hierarchical system that describes evolutionary relatedness for all living entities. The levels of this hierarchy are so called taxonomic ranks. These ranks are defined in Codes of Nomenclature that are curated by the self-governed international associations of scientists working in the field. The number of ranks is not specified: there is a general consensus on seven principal ranks (see below, in bold) but many subcategories exist and may be newly introduced. It is desired–but not mandated–that ranks represent clades (a group of related species, or a "branch" of a phylogeny), and it is desired–but not madated–that the rank is sharply defined. The system is based on subjective dissimilarity. Needless to say that it is in constant flux.

    If we follow a link to an entry in the NCBI's Taxonomy database, eg. Saccharomyces cerevisiae S228c, the strain from which the original "yeast genome" was sequenced in the late 1990s, we see the following specification of its taxonomic lineage:


    cellular organisms; Eukaryota; Opisthokonta;
    Fungi; Dikarya; Ascomycota; Saccharomyceta;
    Saccharomycotina; Saccharomycetes;
    Saccharomycetales; Saccharomycetaceae;
    Saccharomyces; Saccharomyces cerevisiae
    


    These names can be mapped into taxonomic ranks, since the suffixes of these names e.g. -mycotina, -mycetaceae are specific to defined ranks. (NCBI does not provide this mapping, but Wikipedia is helpful here.)

    Rank Suffix Example
    Domain   Eukaryota (Eukarya)
      Subdomain   Opisthokonta
    Kingdom   Fungi
      Subkingdom   Dikarya
    Phylum   Ascomycota
      rankless taxon[1] -myceta Saccharomyceta
      Subphylum -mycotina Saccharomycotina
    Class -mycetes Saccharomycetes
      Subclass -mycetidae  
    Order -ales Saccharomycetales
    Family -aceae Saccharomycetaceae
      Subfamily -oideae  
      Tribe -eae  
      Subtribe -ineae  
    Genus   Saccharomyces
    Species   Saccharomyces cerevisiae
      Strain   Saccharomyces cerevisiae S288c


    You can see that there is no common mapping between the yeast lineage listed at the NCBI and the commonly recognized categories - not all ranks are represented. Nor is this consistent across species in the taxonomic database: some have subfamily ranks and some don't. And the tree is in no way normalized - some of the ranks have thousands of members, and for some, only a single extant member may be known, or it may be a rank that only relates to the fossil record.

    But the ranks do provide some guidance to evolutionary divergence. Say you want to choose four species across the tree of life for a study, you should choose one from each of the major domains of life: Eubacteria, Euryarchaeota, Crenarchaeota-Eocytes, and Eukaryotes. Or you want to study a gene that is specific to mammals. Then you could choose from the clades listed in the NCBI taxonomy database under Mammalia (a class rank, and depending how many species you would want to include, use the subclass-, order-, or family rank (hover over the names to see their taxonomic rank.)

    I have chosen the 10 species below to define a well-distributed search-space for PSI-BLAST. Of course you must add MYSPE to the selection for your own search purposes.

    To enter these 10 species as an Entrez restriction, they need to be formatted as below. (One could also enter species one by one, by pressing the (+) button after the organism list - but that would be lame.)

    
    "Aspergillus nidulans"[organism] OR
    "Bipolaris oryzae"[organism] OR
    "Coprinopsis cinerea"[organism] OR
    "Cryptococcus neoformans"[organism] OR
    "Neurospora crassa"[organism] OR
    "Puccinia Graminis"[organism] OR
    "Saccharomyces cerevisiae"[organism] OR
    "Schizosaccharomyces pombe"[organism] OR
    "Ustilago maydis"[organism] OR
    "Wallemia mellicola"[organism]
    
    



     

    Executing the PSI-BLAST search

     

    We have a list of species. Good. Next up: how do we use it.

    Task:

    1. Navigate to the BLAST homepage.
    2. Select protein BLAST.
    3. Paste the MBP1_SACCE APSES domain sequence into the search field:
    >APSES_MBP1 Residues 4-102 of S. cerevisiae Mbp1
    QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI
    LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
    
    1. Select Reference proteins (refseq_protein) as the database.
    2. Copy the Entrez restrictions from above and add the correct name for MYSPE. (Obviously, you can't find sequences in MYSPE if MYSPE is not included among the genomes you are searching in.) Paste the list into the Entrez Query field.
    3. In the Algorithm section, select PSI-BLAST.
    4. Click on BLAST.


    Evaluate the results carefully. Since we did not change the algorithm parameters, the threshold for inclusion was set at an E-value of 10 by default, and for a PSI BLAST search this is too lenient, i.e. the chances of including non-homologous hits is far too high. If you look at the table of your hits– in the Sequences producing significant alignments... section– there may also be a few sequences that have a low query coverage of less than 80%. Let's exclude these from the profile initially: not to worry, if they are true positives, the will come back with improved E-values and greater coverage in subsequent iterations. But if they were false positives, their E-values will rise and they will drop out of the profile and not contaminate it.


    Task:

    1. In the header section, click on Formatting options and in the line "Format for..." set the with inclusion threshold to 0.001 (This means E-values can't be above 10-03 for the sequence to be included.)
    2. Click on the Reformat button (top right).
    3. In the table of sequence descriptions (not alignments!), click on Query cover to sort the table by coverage, not by score.
    4. Deselect the check mark next to all seqeunces with less than 80% query coverage in the column on the right "Select for PSI BLAST"
    5. Then scroll to Run PSI-BLAST iteration 2 ... and click on Go.


    This is now the "real" PSI-BLAST at work: it constructs a profile from all the full-length sequences and searches with the profile, not with any individual sequence. Note that we are controlling what goes into the profile in two ways:

    1. we are explicitly removing sequences with poor coverage; and
    2. we are requiring a more stringent minimum E-value for each sequence.


    Task:

    1. Again, study the table of hits. Sequences highlighted in yellow have met the search criteria in the second iteration and are proposed for inclusion in the next iteration. Note that the coverage of (some) of the previously excluded sequences is now above 80% (These are the sequences that do not have a green check-mark next to them in the right-hand column). These are the ones you need to check carefully: do you agree that they should be included? If there is any doubt, perhaps because of a really marginal E-value, poor coverage or a function annotation that is not compatible with your query, it is safer to exclude a sequence, than to risk profile corruption. If the sequence is a true positive, it will return to the list in later iterations, usually with a better E-value as the profile improves. It's a good idea to note such sequences in your journal so you can keep track of how their E-values change.
    2. Let's exclude partial matches one more time. Again, deselect all sequences with less than 80% coverage. Then run the third iteration.
    3. Iterate the search in this way, successively relaxing the coverage threshold, until no more "New" sequences are added to the profile. The search has converged. Obviously the result depends on your data, but it would be unusual if the search had not converged after 6 iterations or so, and there is probably a mistake if there are more than 70 hits or so.
    4. Now look at the list of excluded hits (if any), the hits that are reasonable but didn't quite make the cut. Are there any from MYSPE that seem like they should actually be included? Perhaps their E-value is only marginally above the threshold? If that's the case, try returning the E-value threshold to the default 0.005 and see what happens...
    • Iterate the PSI-BLAST search until no new sequences have been added (no yellow highlights). Then relax the coverage criteria - to 60% first, then less. But watch the E-values of the shorter sequences carefully. You can click on any of theDon't include sequences with E values greater than 1e-04 or so. The S. cerevisiae Xbp1 sequence seems to be a true positive, but it won't match the profile well because of a long insertion in the APSES domain - thus it has only 38% coverage. But its E-value is probably less then 1e-06. Keep that for last. There is no harm in proceeding through these iterations slowly and conservatively. But if you include only one non-homologous sequence by mistake, you will get profile corruption. For example, you might notice hypothetical protein CC1G_13964 [Coprinopsis cinerea okayama7#130] or hypothetical protein CNAG_05349 [Cryptococcus neoformans var. grubii H99] at E-values around 1e-04. These are probably false positives with an unexpectedly high number of indels and a location towards the C-terminus of the protein. Keep them out!


    Once no "new" sequences have been added, we would always get the same result on additional iterations because there are no more changes to the profile. We say that the search has converged. Last I ran this, I took about 10 iterations to arrive at 45 high-confidence hits for the ten reference species. You should find approximately the same number - plus a few more for the domains you find in MYSPE.

    Time to harvest.


    Task:

    1. In the header section of the BLAST report, click on Taxonomy reports and find MYSPE in the Organism Report section. These are your APSES domain homologs. All of them. There is a link to the alignment, the BLAST score, the E-value, and a link to the entry in RefSeq.
    2. Copy the sequence identifiers from MYSPE, with E-values above your defined threshold to your journal.

    For example, the table of Saccharomyces hits contains the following information:

    ------------------------------------------------------------------------------------------------
    Saccharomyces cerevisiae S288c [ascomycetes]
    ------------------------------------------------------------------------------------------------
    transcription factor MBP1 [Saccharomyces cerevisiae S288C]             138    8e-39    NP_010227
    SBF complex DNA-binding subunit SWI4 [Saccharomyces cerevisiae S288C]  118    7e-32    NP_011036
    Phd1p [Saccharomyces cerevisiae S288C]                                 110    8e-30    NP_012881
    Sok2p [Saccharomyces cerevisiae S288C]                                 110    4e-29    NP_013729
    Xbp1p [Saccharomyces cerevisiae S288C]                                46.0    2e-06    NP_012165
    ------------------------------------------------------------------------------------------------
    

    Xbp1 is a special case. It has only very low coverage, but that is because it has a long domain insertion and the N-terminal match often is not recognized by alignment because the gap scores for long indels are unrealistically large. For now, I keep that sequence with the others, noting that it has a homologue in Neurospora crassa with a similar situation.


    Task:

    Add your sequences to your protein database
    • To add the sequences to your database, open each of the links to the RefSeq record for MYSPE organism into a separate tab.
    • Find the UniProt IDs.
    • Make a copy of "./data/refAPSES_PSI-BLAST.json" and call it "MYSPE_APSES_PSI-BLAST.json"
    • Open "./data/MYSPE_APSES_PSI-BLAST.json" in RStudio to see how it is formatted.
    • Delete the existing data for however many proteins your PSI-BLAST search has discovered in the MYSPE genome, and replace it with your protein's data. Save the file when you are done.
    • Validate your files online at https://jsonlint.com/
    • Add the following line to your database creation script myScripts/makeProteinDB.R:
      myDB <- dbAddProtein(    myDB, fromJSON("MYSPE_APSES_PSI-BLAST.json"))
    
    • Save myScripts/makeProteinDB.R.
    • Recreate the database by running:
      source("myScripts/makeProteinDB.R")
    
    • Confirm that your proteins have been correctly added:
    sel <- which(myDB$taxonomy$species == MYSPE)
    (taxid <- myDB$taxonomy$ID[sel])
    
    sel <- which(myDB$protein$taxonomyID == taxid)
    myDB$protein$name[sel]      # Does this return all of the APSES domain protein
                                # names of MYSPE? If not, this needs to be fixed.
                                # Ask for advice.
    
    nchar(myDB$protein$sequence[sel])      # Does this give the correct sequence
                                           # lengths? If not, this needs to be
                                           # fixed. Ask for advice.
    


     

    Further reading, links and resources

    • Limiting the minimal profile block width enhances PSI-BLAST performance significantly:
    Oda et al. (2017) Simple adjustment of the sequence weight algorithm remarkably enhances PSI-BLAST performance. BMC Bioinformatics 18:288. (pmid: 28578660)

    PubMed ] [ DOI ] BACKGROUND: PSI-BLAST, an extremely popular tool for sequence similarity search, features the utilization of Position-Specific Scoring Matrix (PSSM) constructed from a multiple sequence alignment (MSA). PSSM allows the detection of more distant homologs than a general amino acid substitution matrix does. An accurate estimation of the weights for sequences in an MSA is crucially important for PSSM construction. PSI-BLAST divides a given MSA into multiple blocks, for which sequence weights are calculated. When the block width becomes very narrow, the sequence weight calculation can be odd. RESULTS: We demonstrate that PSI-BLAST indeed generates a significant fraction of blocks having width less than 5, thereby degrading the PSI-BLAST performance. We revised the code of PSI-BLAST to prevent the blocks from being narrower than a given minimum block width (MBW). We designate the modified application of PSI-BLAST as PSI-BLASTexB. When MBW is 25, PSI-BLASTexB notably outperforms PSI-BLAST consistently for three independent benchmark sets. The performance boost is even more drastic when an MSA, instead of a sequence, is used as a query. CONCLUSIONS: Our results demonstrate that the generation of narrow-width blocks during the sequence weight calculation is a critically important factor that restricts the PSI-BLAST search performance. By preventing narrow blocks, PSI-BLASTexB upgrades the PSI-BLAST performance remarkably. Binaries and source codes of PSI-BLASTexB (MBW = 25) are available at https://github.com/kyungtaekLIM/PSI-BLASTexB .

    Notes

    1. The -myceta are well supported groups above the Class rank. See Leotiomyceta for details and references.


     


    About ...
     
    Author:

    Boris Steipe <boris.steipe@utoronto.ca>

    Created:

    2017-08-05

    Modified:

    2020-09-25

    Version:

    1.1

    Version history:

    • 1.1 Maintenance
    • 1.0 First live version, updated to new BLAST interface.
    • 0.1 First stub

    CreativeCommonsBy.png This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.