Difference between revisions of "BIN-ALI-PSI BLAST"

From "A B C"
Jump to navigation Jump to search
m
m
Line 19: Line 19:
  
  
{{DEV}}
+
{{LIVE}}
  
 
{{Vspace}}
 
{{Vspace}}
Line 29: Line 29:
 
<section begin=abstract />
 
<section begin=abstract />
 
<!-- included from "../components/BIN-ALI-PSI-BLAST.components.wtxt", section: "abstract" -->
 
<!-- included from "../components/BIN-ALI-PSI-BLAST.components.wtxt", section: "abstract" -->
...
+
Sensitive, fast, database-scale sequence searches are possible with the PSI-BLAST algorithm. This unit introduces the concept and guides you through searching for APSES domain proteins in MYSPE.
 
<section end=abstract />
 
<section end=abstract />
  
Line 47: Line 47:
 
=== Objectives ===
 
=== Objectives ===
 
<!-- included from "../components/BIN-ALI-PSI-BLAST.components.wtxt", section: "objectives" -->
 
<!-- included from "../components/BIN-ALI-PSI-BLAST.components.wtxt", section: "objectives" -->
...
+
This unit will ...
 +
* ... introduce the PSI-BLAST algorithm;
 +
* ... teach how to run PSI-BLAST searches.
  
 
{{Vspace}}
 
{{Vspace}}
Line 54: Line 56:
 
=== Outcomes ===
 
=== Outcomes ===
 
<!-- included from "../components/BIN-ALI-PSI-BLAST.components.wtxt", section: "outcomes" -->
 
<!-- included from "../components/BIN-ALI-PSI-BLAST.components.wtxt", section: "outcomes" -->
...
+
After working through this unit you ...
 +
* ... can run a PSI-BLAST search via the NCBI's online interface, taking care to avoid profile corruption;
 +
* ... have discovered APSES domain proteins in MYSPE and added them to your protein database.
  
 
{{Vspace}}
 
{{Vspace}}
Line 67: Line 71:
 
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-insights" -->
 
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-insights" -->
 
*<b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].
 
*<b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].
 +
;Your protein database: Add the MYSPE proteins to your database in a JSON file.
  
 
{{Vspace}}
 
{{Vspace}}
Line 86: Line 91:
  
 
{{Task|1=
 
{{Task|1=
*Read the introductory notes on {{ABC-PDF|BIN-ALI-PSI-BLAST|the princiles and use of the "Position Specific Iterated BLAST" (PSI-BLAST) algorithm}}.
+
*Read the introductory notes on {{ABC-PDF|BIN-ALI-PSI-BLAST|the principles and use of the "Position Specific Iterated BLAST" (PSI-BLAST) algorithm}}.
 
}}
 
}}
  
Line 108: Line 113:
 
The second methodical problem we must address is how to perform a sensitive PSI-BLAST search '''in one organism'''. We need to balance two conflicting objectives:
 
The second methodical problem we must address is how to perform a sensitive PSI-BLAST search '''in one organism'''. We need to balance two conflicting objectives:
  
* If we restrict the PSI-BLAST search to MYSPE, PSI-BLAST has little chance of building a meaningful profile - the number of homologues that actually are '''in''' MYSPE is too small. Thus the search will not become very sensitive.
+
* If we restrict the PSI-BLAST search to MYSPE, PSI-BLAST has little chance of building a meaningful profile - the number of homologous sequences that exist '''in this single species''' is too small. Thus the search will not become very sensitive.
  
 
       <!-- Column 1 end -->
 
       <!-- Column 1 end -->
Line 115: Line 120:
 
       <!-- Column 2 start -->
 
       <!-- Column 2 start -->
  
* If we don't restrict our search, but search in all species, the number of hits may become unwieldily large. It becomes increasingly difficult to closely check all hits as to whether they have good coverage. Also we need to evaluate the fringe cases of marginal E-value: should a new sequence be added to the profile, or should we hold off on it for one or two iterations, to see whether its E-value drops significantly. By all means, we need to avoid profile corruption.
+
* If we don't restrict our search, but search in all species, the number of hits may become unwieldily large. It becomes increasingly difficult to closely check all hits as to whether they have good coverage. Also we need to evaluate the fringe cases of marginal E-value: should a new sequence be added to the profile, or should we hold off on it for one or two iterations, to see whether its E-value drops significantly. Whatever we do, we '''must''' avoid profile corruption.
  
 
Perhaps this is still be manageable when we are searching in fungi, but imagine you are working with a bacterial protein, or a protein that is conserved across the entire tree of life: your search may find tens of thousands of sequences. And by next year, thousands more will have been added.
 
Perhaps this is still be manageable when we are searching in fungi, but imagine you are working with a bacterial protein, or a protein that is conserved across the entire tree of life: your search may find tens of thousands of sequences. And by next year, thousands more will have been added.
  
Therefore we have to find a middle ground: add enough organisms (sequences) to compile a sensitive profile, but not so many that we can no longer individually assess the sequences that contribute to the profile. We need to define a broadly representative but manageable set of species - to exploit the transitivity of homology - even if we are interested only in matches in one species: MYSPE. Please reflect on this and make sure you understand why we include sequences in a PSI-BLAST search that we are not actually interested in.
+
Therefore we have to find a middle ground: add enough species (sequences) to compile a sensitive profile, but not so many that we can no longer individually assess the sequences that contribute to the profile. We need to define a broadly representative but manageable set of species - to exploit the transitivity of homology - even if we are interested only in matches in one species: MYSPE. Please reflect on this and make sure you understand why we include sequences in a PSI-BLAST search that we are not actually interested in.
  
 
We need a subset of species
 
We need a subset of species
Line 125: Line 130:
 
# that are as well '''distributed''' as possible on the tree; and
 
# that are as well '''distributed''' as possible on the tree; and
 
# whose '''genomes''' are fully sequenced.
 
# whose '''genomes''' are fully sequenced.
 
 
  
 
       <!-- Column 2 end -->
 
       <!-- Column 2 end -->
Line 151: Line 154:
 
       <!-- Column 1 start -->
 
       <!-- Column 1 start -->
  
To select species, we will use an approach that is conceptually simple: select a set of species according to their shared taxonomic rank in the tree of life. {{WP|Biological classification|'''Biological classification'''}} provides a hierarchical system that describes evolutionary relatedness for all living entities. The levels of this hierarchy are so called {{WP|Taxonomic rank|'''taxonomic ranks'''}}. These ranks are defined in ''Codes of Nomenclature'' that are curated by the self-governed international associations of scientists working in the field. The number of ranks is not specified: there is a general consensus on seven principal ranks (see below, in bold) but many subcategories exist and may be newly introduced. It is desired&ndash;but not mandated&ndash;that ranks represent ''clades'' (a group of related species, or a "branch" of a phylogeny), and it is desired&ndash;but not madated&ndash;that the rank is sharply defined. The system is based on subjective dissimilarity. Needless to say that it is in flux.
+
To select species, we will use an approach that is conceptually simple: select a set of species according to their shared taxonomic rank in the tree of life. {{WP|Taxonomy_(biology)|'''Taxonomic classification'''}} is a hierarchical system that describes evolutionary relatedness for all living entities. The levels of this hierarchy are so called {{WP|Taxonomic rank|'''taxonomic ranks'''}}. These ranks are defined in ''Codes of Nomenclature'' that are curated by the self-governed international associations of scientists working in the field. The number of ranks is not specified: there is a general consensus on seven principal ranks (see below, in bold) but many subcategories exist and may be newly introduced. It is desired&ndash;but not mandated&ndash;that ranks represent ''clades'' (a group of related species, or a "branch" of a phylogeny), and it is desired&ndash;but not madated&ndash;that the rank is sharply defined. The system is based on subjective dissimilarity. Needless to say that it is in constant flux.
  
 
If we follow a link to an entry in the NCBI's Taxonomy database, eg. [http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=559292 ''Saccharomyces cerevisiae S228c''], the strain from which the original "yeast genome" was sequenced in the late 1990s, we see the following specification of its taxonomic lineage:
 
If we follow a link to an entry in the NCBI's Taxonomy database, eg. [http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=559292 ''Saccharomyces cerevisiae S228c''], the strain from which the original "yeast genome" was sequenced in the late 1990s, we see the following specification of its taxonomic lineage:
Line 269: Line 272:
 
<td>&nbsp;</td>
 
<td>&nbsp;</td>
 
<td>''Saccharomyces cerevisiae''</td>
 
<td>''Saccharomyces cerevisiae''</td>
 +
</tr>
 +
 +
<tr class="s1">
 +
<td>&nbsp;&nbsp;Strain</td>
 +
<td>&nbsp;</td>
 +
<td>''Saccharomyces cerevisiae S288c''</td>
 
</tr>
 
</tr>
  
Line 327: Line 336:
 
  QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI
 
  QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI
 
  LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
 
  LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
# Select '''refseq''' as the database.
+
# Select '''Reference proteins (refseq_protein)''' as the database.
# Copy the Entrez restrictions from above '''and add the correct name for MYSPE''' to the list if it is not there already. (Obviously, you can't find sequences in MYSPE if MYSPE is not included among the genomes you are searching in.) Paste the list into the '''Entrez Query''' field.
+
# Copy the Entrez restrictions from above '''and add the correct name for MYSPE'''. (Obviously, you can't find sequences in MYSPE if MYSPE is not included among the genomes you are searching in.) Paste the list into the '''Entrez Query''' field.
 
# In the '''Algorithm''' section, select PSI-BLAST.
 
# In the '''Algorithm''' section, select PSI-BLAST.
 
#Click on '''BLAST'''.
 
#Click on '''BLAST'''.
Line 334: Line 343:
  
  
Evaluate the results carefully. Since we did not change the algorithm parameters, the threshold for inclusion was set at an '''E-value''' of 0.005 by default, and that may be a bit too lenient, i.e. it might include sequences that are not homologous. If you look at the table of your hits&ndash; in the '''Sequences producing significant alignments...''' section&ndash; there may also be a few sequences that have a low query coverage of less than 80%. Let's exclude these from the profile initially: not to worry, if they are true positives, the will come back with improved E-values and greater coverage in subsequent iterations. But if they were false positives, their E-values will rise and they will drop out of the profile and not contaminate it.
+
Evaluate the results carefully. Since we did not change the algorithm parameters, the threshold for inclusion was set at an '''E-value''' of 10 by default, and for a PSI BLAST search this is too lenient, i.e. the chances of including non-homologous hits is far too high. If you look at the table of your hits&ndash; in the '''Sequences producing significant alignments...''' section&ndash; there may also be a few sequences that have a low query coverage of less than 80%. Let's exclude these from the profile initially: not to worry, if they are true positives, the will come back with improved E-values and greater coverage in subsequent iterations. But if they were false positives, their E-values will rise and they will drop out of the profile and not contaminate it.
  
  
Line 341: Line 350:
 
# Click on the '''Reformat''' button (top right).
 
# Click on the '''Reformat''' button (top right).
 
# In the table of sequence descriptions (not alignments!), click on '''Query cover''' to sort the table by coverage, not by score.
 
# In the table of sequence descriptions (not alignments!), click on '''Query cover''' to sort the table by coverage, not by score.
# '''Deselect''' the check mark next to these sequences in the second-to-rightmost column '''Select for PSI blast'''.
+
# '''Deselect''' the check mark next to all seqeunces with less than 80% query coverage in the column on the right "Select for PSI BLAST"
 
# Then scroll to '''Run PSI-BLAST iteration 2 ...''' and click on '''<code>Go</code>'''.
 
# Then scroll to '''Run PSI-BLAST iteration 2 ...''' and click on '''<code>Go</code>'''.
 
}}
 
}}
Line 352: Line 361:
  
 
{{task|1=
 
{{task|1=
#Again, study the table of hits. Sequences highlighted in yellow have met the search criteria in the second iteration and are proposed for inclusion in the next iteration. Note that the coverage of (some) of the previously excluded sequences is now above 80%. These are the ones you need to check carefully: do you agree that they should be included? If there is any doubt, perhaps because of a really marginal E-value, poor coverage or a function annotation that is not compatible with your query, it is safer to exclude a sequence than to risk profile corruption. If the sequence is a true positive, it will return to the list in later iterations, usually with a better E-value as the profile improves. It's a good idea to note such sequences in your journal so you can keep track of how their E-values change.
+
#Again, study the table of hits. Sequences highlighted in yellow have met the search criteria in the second iteration and are proposed for inclusion in the next iteration. Note that the coverage of (some) of the previously excluded sequences is now above 80% (These are the sequences that do not have a green check-mark next to them in the right-hand column). These are the ones you need to check carefully: do you agree that they should be included? If there is any doubt, perhaps because of a really marginal E-value, poor coverage or a function annotation that is not compatible with your query, it is safer to exclude a sequence, than to risk profile corruption. If the sequence is a true positive, it will return to the list in later iterations, usually with a better E-value as the profile improves. It's a good idea to note such sequences in your journal so you can keep track of how their E-values change.
 
# Let's exclude partial matches one more time. Again, deselect all sequences with less than 80% coverage. Then run the third iteration.
 
# Let's exclude partial matches one more time. Again, deselect all sequences with less than 80% coverage. Then run the third iteration.
 
# Iterate the search in this way, successively relaxing the coverage threshold, until no more "New" sequences are added to the profile. The search has converged. Obviously the result depends on your data, but it would be unusual if the search had not converged after 6 iterations or so, and there is probably a mistake if there are more than 70 hits or so.
 
# Iterate the search in this way, successively relaxing the coverage threshold, until no more "New" sequences are added to the profile. The search has converged. Obviously the result depends on your data, but it would be unusual if the search had not converged after 6 iterations or so, and there is probably a mistake if there are more than 70 hits or so.
 
# Now look at the list of excluded hits (if any), the hits that are reasonable but didn't quite make the cut. Are there any from MYSPE that seem like they should actually be included? Perhaps their E-value is only marginally above the threshold? If that's the case, try returning the E-value threshold to the default 0.005 and see what happens...
 
# Now look at the list of excluded hits (if any), the hits that are reasonable but didn't quite make the cut. Are there any from MYSPE that seem like they should actually be included? Perhaps their E-value is only marginally above the threshold? If that's the case, try returning the E-value threshold to the default 0.005 and see what happens...
 +
 +
* Iterate the PSI-BLAST search until no new sequences have been added (no yellow highlights). Then relax the coverage criteria - to 60% first, then less. But watch the E-values of the shorter sequences carefully. You can click on any of theDon't include sequences with E values greater than 1e-04 or so. The ''S. cerevisiae'' Xbp1 sequence seems to be a true positive, but it won't match the profile well because of a long insertion in the APSES domain - thus it has only 38% coverage. But its E-value is probably less then 1e-06. Keep that for last. There is no harm in proceeding through these iterations slowly and conservatively. But if you include only one non-homologous sequence by mistake, you will get profile corruption. For example, you might notice <code>hypothetical protein CC1G_13964 [Coprinopsis cinerea okayama7#130]</code> or <code>hypothetical protein CNAG_05349 [Cryptococcus neoformans var. grubii H99]</code> at E-values around 1e-04. These are probably false positives with an unexpectedly high number of indels and a location towards the C-terminus of the protein. Keep them out!
 +
 +
 
}}
 
}}
  
  
Once no "new" sequences have been added, we would always get the same result on additional iterations because there are no more changes to the profile. We say that the search has '''converged'''. Time to harvest.
+
Once no "new" sequences have been added, we would always get the same result on additional iterations because there are no more changes to the profile. We say that the search has '''converged'''. Last I ran this, I took about 10 iterations to arrive at 45 high-confidence hits for the ten reference species. You should find approximately the same number - plus a few more for the domains you find in MYSPE.
 +
 
 +
Time to harvest.
  
  
 
{{task|1=
 
{{task|1=
 
# In the header section of the BLAST report, click on '''Taxonomy reports''' and find MYSPE in the '''Organism Report''' section. These are '''your APSES domain homologs'''. All of them. There is a link to the alignment, the BLAST score, the E-value, and a link to the entry in RefSeq.
 
# In the header section of the BLAST report, click on '''Taxonomy reports''' and find MYSPE in the '''Organism Report''' section. These are '''your APSES domain homologs'''. All of them. There is a link to the alignment, the BLAST score, the E-value, and a link to the entry in RefSeq.
# From the report copy the sequence identifiers from MYSPE, with E-values above your defined threshold to your notebook.
+
# Copy the sequence identifiers from MYSPE, with E-values above your defined threshold to your journal.
 
}}
 
}}
  
For example, the list of ''Saccharomyces'' genes contains the following information:
+
For example, the table of ''Saccharomyces'' hits contains the following information:
  
<code>
+
------------------------------------------------------------------------------------------------
[http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=559292 Saccharomyces cerevisiae S288c]</b> [http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=4890 [ascomycetes]] taxid 559292<br \>
+
Saccharomyces cerevisiae S288c [ascomycetes]
4e-37  [http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6320147&dopt=GenPept ref|NP_010227.1|] Mbp1p [Saccharomyces cerevisiae S288c]<br \>
+
------------------------------------------------------------------------------------------------
2e-30  [http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6320957&dopt=GenPept ref|NP_011036.1|] Swi4p [Saccharomyces cerevisiae S288c]<br \>
+
transcription factor MBP1 [Saccharomyces cerevisiae S288C]            138    8e-39    NP_010227
4e-27  [http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6322808&dopt=GenPept ref|NP_012881.1|] Phd1p [Saccharomyces cerevisiae S288c]<br \>
+
SBF complex DNA-binding subunit SWI4 [Saccharomyces cerevisiae S288C] 118    7e-32    NP_011036
4e-27  [http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6323658&dopt=GenPept ref|NP_013729.1|] Sok2p [Saccharomyces cerevisiae S288c]<br \>
+
Phd1p [Saccharomyces cerevisiae S288C]                                 110    8e-30    NP_012881
7e-06   [http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6322090&dopt=GenPept ref|NP_012165.1|] Xbp1p [Saccharomyces cerevisiae S288c]<br \>
+
Sok2p [Saccharomyces cerevisiae S288C]                                 110    4e-29    NP_013729
</code>
+
Xbp1p [Saccharomyces cerevisiae S288C]                               46.0    2e-06   NP_012165
 +
------------------------------------------------------------------------------------------------
  
 
+
<small>[[Saccharomyces cerevisiae Xbp1|Xbp1]] is a special case. It has only very low coverage, but that is because it has a long domain insertion and the N-terminal match often is not recognized by alignment because the gap scores for long indels are unrealistically large. For now, I keep that sequence with the others, noting that it has a homologue in ''N. crassa'' with a similar situation.</small>
<small>[[Saccharomyces cerevisiae Xbp1|Xbp1]] is a special case. It has only very low coverage, but that is because it has a long domain insertion and the N-terminal match often is not recognized by alignment because the gap scores for long indels are unrealistically large. For now, I keep that sequence with the others.</small>
 
  
  
 
{{task|1=
 
{{task|1=
  
# To add the sequences to your database, open each of the links to the RefSeq record for MYSPE organism into a separate tab.
+
;Add your sequences to your protein database
# Find the UniProt IDs
 
# Go through the (short) section <code>add PSI BLAST results</code> in the Assignment 04 R-script.
 
  
}}
+
* To add the sequences to your database, open each of the links to the RefSeq record for MYSPE organism into a separate tab.
 +
* Find the UniProt IDs.
 +
* Make a copy of <code>"./data/refAPSES_PSI-BLAST.json"</code> and call it <code>"MYSPE_APSES_PSI-BLAST.json"</code>
 +
* Open <code>"./data/MYSPE_APSES_PSI-BLAST.json"</code> in RStudio to see how it is formatted.
 +
* Delete the existing data for however many proteins your PSI-BLAST search has discovered in the MYSPE genome, and replace it with your protein's data. Save the file when you are done.
 +
* Validate your files online at https://jsonlint.com/
 +
* Add the following line to your database creation script <code>makeProteinDB.R</code>:
 +
<source lang="R">
 +
  myDB <- dbAddProtein(    myDB, fromJSON("MYSPE_APSES_PSI-BLAST.json"))
 +
</source>
 +
* Save  <code>makeProteinDB.R</code>.
 +
* Recreate the database by running:
 +
<source lang="R">
 +
  source("makeProteinDB.R")
 +
</source>
 +
* Confirm that your proteins have been correctly added:
 +
<source lang="R">
 +
sel <- which(myDB$taxonomy$species == MYSPE)
 +
(taxid <- myDB$taxonomy$ID[sel])
  
<!--
+
sel <- which(myDB$protein$taxonomyID == taxid)
 +
myDB$protein$name[sel]      # Does this return all of the APSES domain protein
 +
                            # names of MYSPE? If not, this needs to be fixed.
 +
                            # Ask for advice.
  
Add to this assignment:
+
nchar(myDB$protein$sequence[sel])      # Does this give the correct sequence
- study the BLAST output format, links, tools, scores ...
+
                                      # lengths? If not, this needs to be
- compare the improvement in E-values to standard BLAST
+
                                      # fixed. Ask for advice.
- examine this in terms of sensitivity and specificity
+
</source>
  
-->
+
}}
  
 
So much for using PSI-BLAST. The last step seems a bit tedious, adding all this information by hand. There's got to be a better way, right?
 
 
But for now, we'll have a look at what the sequences tell us.
 
  
  
Line 419: Line 449:
  
 
The sensitivity of PSI-BLAST is based on the alignment of profiles of related sequences. The profiles are represented as position specific scoring matrices compiled from the alignment of hits, first to the original sequence and then to the profile. Incidentally, this process can also be turned around, and a collection of pre-compiled PSSMs can be used to annotate protein sequence: this is the principle employed by RPS-BLAST, the tool that identifies conserved domains at the beginning of every BLAST search, and has been used to build the CDD database of conserved domains (for a very informative help-page on CDD [https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml '''see here'''].
 
The sensitivity of PSI-BLAST is based on the alignment of profiles of related sequences. The profiles are represented as position specific scoring matrices compiled from the alignment of hits, first to the original sequence and then to the profile. Incidentally, this process can also be turned around, and a collection of pre-compiled PSSMs can be used to annotate protein sequence: this is the principle employed by RPS-BLAST, the tool that identifies conserved domains at the beginning of every BLAST search, and has been used to build the CDD database of conserved domains (for a very informative help-page on CDD [https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml '''see here'''].
 
<!--
 
=== CDD domain annotation ===
 
 
In the last assignment, you followed a link to '''CDD Search Results''' from the [http://www.ncbi.nlm.nih.gov/protein/NP_010227 RefSeq record for yeast Mbp1] and briefly looked at the information offered by the NCBI's Conserved Domain Database, a database of ''Position Specific Scoring Matrices'' that embody domain definitions. Rather than access precomputed results, you can also search CDD with sequences: assuming you have saved the MYSPE Mbp1 sequence in FASTA format, this is straightforward. If you did not save this sequence, return to [[BIO_Assignment_Week_3|Assignment 3]] and retrieve it again.
 
 
 
{{task|1=
 
# Access the [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml '''CDD database'''] at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
 
# Read the information. CDD is a superset of various other database domain annotations as well as NCBI-curated domain definitions.
 
# Copy the MYSPE Mbp1 FASTA sequence, paste it into the search form and click '''Submit'''.
 
## On the result page, clik on '''View full result'''
 
## Note that there are a number of partially overlapping ankyrin domain modules. We will study ankyrin domains in a later assignment.
 
## Also note that there may be blocks of sequence colored cyan in the sequence bar. Hover your mouse over the blocks to see what these blocks signify.
 
## Open the link to '''Search for similar domain architecture''' in a separate window and study it. This is the '''CDART''' database. Think about what these results may be useful for.
 
## Click on one of the ANK superfamily graphics and see what the associated information looks like: there is a summary of structure and function, links to specific literature and a tree of the relationship of related sequences.
 
}}
 
-->
 
 
 
 
      <!-- Column 1 end -->
 
    </div>
 
    <div class="col2">
 
      <!-- Column 2 start -->
 
 
; Hidden Markov Models (HMMs)
 
 
An approach to represent such profile information that is more general than PSSMs is a {{WP|Hidden Markov model|'''Hidden Markov model (HMM)'''}} and the standard tool to use HMMs in Bioinformatics is [http://hmmer.org/ '''HMMER'''], written by Sean Eddy. HMMER has allowed to represent the entirety of protein sequences as a collection of profiles, stored in databases such as [http://pfam.xfam.org/ '''Pfam'''], [https://www.ebi.ac.uk/interpro/ '''Interpro'''], and [http://smart.embl-heidelberg.de/ '''SMART'''].  While the details are slightly different, all of these services allow to scan sequences for the presence of domains. Importantly thus, the alignment results are not collections of full-length protein families, but annotate to domain families, i.e. full length proteins are decomposed into their homologous domains. This is a very powerful approach towards the functional annotation of unknown sequences.
 
 
In this section, we will annotate the MYSPE sequence with the domains it contains, using the database of domain HMMs curated by SMART in Heidelberg and Pfam at the EMBL. We will then compare these annotations with those determined for the orthologues in the reference species. In this way we can enhance the information about one protein by determining how its features are conserved.
 
 
      <!-- Column 2 end -->
 
    </div>
 
  </div>
 
</div>
 
 
{{Vspace}}
 
  
  
Line 485: Line 477:
  
 
*[[Reference_species_for_fungi|Our 10 "reference" fungi]]
 
*[[Reference_species_for_fungi|Our 10 "reference" fungi]]
 +
 +
* Limiting the minimal profile block width enhances PSI-BLAST performance significantly:
 +
{{#pmid:28578660}}
  
  
Line 548: Line 543:
 
:2017-08-05
 
:2017-08-05
 
<b>Version:</b><br />
 
<b>Version:</b><br />
:0.1
+
:1.0
 
<b>Version history:</b><br />
 
<b>Version history:</b><br />
 +
*1.0 First live version, updated to new BLAST interface.
 
*0.1 First stub
 
*0.1 First stub
 
</div>
 
</div>

Revision as of 00:54, 24 October 2017

Abstract

Sensitive, fast, database-scale sequence searches are possible with the PSI-BLAST algorithm. This unit introduces the concept and guides you through searching for APSES domain proteins in MYSPE.


 


This unit ...

Prerequisites

You need to complete the following units before beginning this one:


 


Objectives

This unit will ...

  • ... introduce the PSI-BLAST algorithm;
  • ... teach how to run PSI-BLAST searches.


 


Outcomes

After working through this unit you ...

  • ... can run a PSI-BLAST search via the NCBI's online interface, taking care to avoid profile corruption;
  • ... have discovered APSES domain proteins in MYSPE and added them to your protein database.


 


Deliverables

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Your protein database
Add the MYSPE proteins to your database in a JSON file.


 


Evaluation

Evaluation: NA

This unit is not evaluated for course marks.


 


Contents


Heuristic profile-based alignment: PSI BLAST

 

It is (deceptively) easy to perform BLAST searches via the Web interface, but to use such powerful computational tools to their greatest advantage takes a considerable amount of care, caution and consideration.

PSI-BLAST allows to perform very sensitive searches for homologues that have diverged so far that their pairwise sequence similarity has become insignificant. It achieves this by establishing a profile of sequences to align with the database, rather than searching with individual sequences. This deemphasizes parts of the sequence that are variable and inconsequential, and focusses on the parts of greater structural and functional importance. As a consequence, the signal to noise ratio is greatly enhanced.

In this unit, we will set ourselves the task to use PSI-BLAST and find all orthologs and paralogs of the APSES domain containing transcription factors in MYSPE. We will use these sequences for multiple alignments, calculation of conservation etc.

The first methodical problem we have to address is what sequence to search with. The full-length Mbp1 sequence from Saccharomyces cerevisiae or its RBM from MYSPE are not suitable: They contain multiple domains (in particular the ubiquitous Ankyrin domains) and would create broad, non-specific profiles. The APSES domain sequence by contrast is structurally well defined. The KilA-N domain, being shorter, is less likely to make a sensitive profile. Indeed one of the results of our analysis will be to find whether APSES domains in fungi all have the same length as the Mbp1 domain, or whether some are indeed much shorter, like the KILA-N domain, as suggested by the Pfam alignment.

The second methodical problem we must address is how to perform a sensitive PSI-BLAST search in one organism. We need to balance two conflicting objectives:

  • If we restrict the PSI-BLAST search to MYSPE, PSI-BLAST has little chance of building a meaningful profile - the number of homologous sequences that exist in this single species is too small. Thus the search will not become very sensitive.
  • If we don't restrict our search, but search in all species, the number of hits may become unwieldily large. It becomes increasingly difficult to closely check all hits as to whether they have good coverage. Also we need to evaluate the fringe cases of marginal E-value: should a new sequence be added to the profile, or should we hold off on it for one or two iterations, to see whether its E-value drops significantly. Whatever we do, we must avoid profile corruption.

Perhaps this is still be manageable when we are searching in fungi, but imagine you are working with a bacterial protein, or a protein that is conserved across the entire tree of life: your search may find tens of thousands of sequences. And by next year, thousands more will have been added.

Therefore we have to find a middle ground: add enough species (sequences) to compile a sensitive profile, but not so many that we can no longer individually assess the sequences that contribute to the profile. We need to define a broadly representative but manageable set of species - to exploit the transitivity of homology - even if we are interested only in matches in one species: MYSPE. Please reflect on this and make sure you understand why we include sequences in a PSI-BLAST search that we are not actually interested in.

We need a subset of species

  1. that represent as large a range as possible on the evolutionary tree;
  2. that are as well distributed as possible on the tree; and
  3. whose genomes are fully sequenced.


 



 

Selecting species for a PSI-BLAST search

 


To select species, we will use an approach that is conceptually simple: select a set of species according to their shared taxonomic rank in the tree of life. Taxonomic classification is a hierarchical system that describes evolutionary relatedness for all living entities. The levels of this hierarchy are so called taxonomic ranks. These ranks are defined in Codes of Nomenclature that are curated by the self-governed international associations of scientists working in the field. The number of ranks is not specified: there is a general consensus on seven principal ranks (see below, in bold) but many subcategories exist and may be newly introduced. It is desired–but not mandated–that ranks represent clades (a group of related species, or a "branch" of a phylogeny), and it is desired–but not madated–that the rank is sharply defined. The system is based on subjective dissimilarity. Needless to say that it is in constant flux.

If we follow a link to an entry in the NCBI's Taxonomy database, eg. Saccharomyces cerevisiae S228c, the strain from which the original "yeast genome" was sequenced in the late 1990s, we see the following specification of its taxonomic lineage:


cellular organisms; Eukaryota; Opisthokonta;
Fungi; Dikarya; Ascomycota; Saccharomyceta;
Saccharomycotina; Saccharomycetes;
Saccharomycetales; Saccharomycetaceae;
Saccharomyces; Saccharomyces cerevisiae


These names can be mapped into taxonomic ranks, since the suffixes of these names e.g. -mycotina, -mycetaceae are specific to defined ranks. (NCBI does not provide this mapping, but Wikipedia is helpful here.)

Rank Suffix Example
Domain   Eukaryota (Eukarya)
  Subdomain   Opisthokonta
Kingdom   Fungi
  Subkingdom   Dikarya
Phylum   Ascomycota
  rankless taxon[1] -myceta Saccharomyceta
  Subphylum -mycotina Saccharomycotina
Class -mycetes Saccharomycetes
  Subclass -mycetidae  
Order -ales Saccharomycetales
Family -aceae Saccharomycetaceae
  Subfamily -oideae  
  Tribe -eae  
  Subtribe -ineae  
Genus   Saccharomyces
Species   Saccharomyces cerevisiae
  Strain   Saccharomyces cerevisiae S288c


You can see that there is no common mapping between the yeast lineage listed at the NCBI and the commonly recognized categories - not all ranks are represented. Nor is this consistent across species in the taxonomic database: some have subfamily ranks and some don't. And the tree is in no way normalized - some of the ranks have thousands of members, and for some, only a single extant member may be known, or it may be a rank that only relates to the fossil record.

But the ranks do provide some guidance to evolutionary divergence. Say you want to choose four species across the tree of life for a study, you should choose one from each of the major domains of life: Eubacteria, Euryarchaeota, Crenarchaeota-Eocytes, and Eukaryotes. Or you want to study a gene that is specific to mammals. Then you could choose from the clades listed in the NCBI taxonomy database under Mammalia (a class rank, and depending how many species you would want to include, use the subclass-, order-, or family rank (hover over the names to see their taxonomic rank.)

I have chosen the 10 species below to define a well-distributed search-space for PSI-BLAST. Of course you must add MYSPE to the selection for your own search purposes.

To enter these 10 species as an Entrez restriction, they need to be formatted as below. (One could also enter species one by one, by pressing the (+) button after the organism list)

"Aspergillus nidulans"[organism] OR
"Bipolaris oryzae"[organism] OR
"Coprinopsis cinerea"[organism] OR
"Cryptococcus neoformans"[organism] OR
"Neurospora crassa"[organism] OR
"Puccinia Graminis"[organism] OR
"Saccharomyces cerevisiae"[organism] OR
"Schizosaccharomyces pombe"[organism] OR
"Ustilago maydis"[organism] OR
"Wallemia mellicola"[organism]



 

Executing the PSI-BLAST search

 

We have a list of species. Good. Next up: how do we use it.

Task:

  1. Navigate to the BLAST homepage.
  2. Select protein BLAST.
  3. Paste the MBP1_SACCE APSES domain sequence into the search field:
>APSES_MBP1 Residues 4-102 of S. cerevisiae Mbp1
QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI
LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
  1. Select Reference proteins (refseq_protein) as the database.
  2. Copy the Entrez restrictions from above and add the correct name for MYSPE. (Obviously, you can't find sequences in MYSPE if MYSPE is not included among the genomes you are searching in.) Paste the list into the Entrez Query field.
  3. In the Algorithm section, select PSI-BLAST.
  4. Click on BLAST.


Evaluate the results carefully. Since we did not change the algorithm parameters, the threshold for inclusion was set at an E-value of 10 by default, and for a PSI BLAST search this is too lenient, i.e. the chances of including non-homologous hits is far too high. If you look at the table of your hits– in the Sequences producing significant alignments... section– there may also be a few sequences that have a low query coverage of less than 80%. Let's exclude these from the profile initially: not to worry, if they are true positives, the will come back with improved E-values and greater coverage in subsequent iterations. But if they were false positives, their E-values will rise and they will drop out of the profile and not contaminate it.


Task:

  1. In the header section, click on Formatting options and in the line "Format for..." set the with inclusion threshold to 0.001 (This means E-values can't be above 10-03 for the sequence to be included.)
  2. Click on the Reformat button (top right).
  3. In the table of sequence descriptions (not alignments!), click on Query cover to sort the table by coverage, not by score.
  4. Deselect the check mark next to all seqeunces with less than 80% query coverage in the column on the right "Select for PSI BLAST"
  5. Then scroll to Run PSI-BLAST iteration 2 ... and click on Go.


This is now the "real" PSI-BLAST at work: it constructs a profile from all the full-length sequences and searches with the profile, not with any individual sequence. Note that we are controlling what goes into the profile in two ways:

  1. we are explicitly removing sequences with poor coverage; and
  2. we are requiring a more stringent minimum E-value for each sequence.


Task:

  1. Again, study the table of hits. Sequences highlighted in yellow have met the search criteria in the second iteration and are proposed for inclusion in the next iteration. Note that the coverage of (some) of the previously excluded sequences is now above 80% (These are the sequences that do not have a green check-mark next to them in the right-hand column). These are the ones you need to check carefully: do you agree that they should be included? If there is any doubt, perhaps because of a really marginal E-value, poor coverage or a function annotation that is not compatible with your query, it is safer to exclude a sequence, than to risk profile corruption. If the sequence is a true positive, it will return to the list in later iterations, usually with a better E-value as the profile improves. It's a good idea to note such sequences in your journal so you can keep track of how their E-values change.
  2. Let's exclude partial matches one more time. Again, deselect all sequences with less than 80% coverage. Then run the third iteration.
  3. Iterate the search in this way, successively relaxing the coverage threshold, until no more "New" sequences are added to the profile. The search has converged. Obviously the result depends on your data, but it would be unusual if the search had not converged after 6 iterations or so, and there is probably a mistake if there are more than 70 hits or so.
  4. Now look at the list of excluded hits (if any), the hits that are reasonable but didn't quite make the cut. Are there any from MYSPE that seem like they should actually be included? Perhaps their E-value is only marginally above the threshold? If that's the case, try returning the E-value threshold to the default 0.005 and see what happens...
  • Iterate the PSI-BLAST search until no new sequences have been added (no yellow highlights). Then relax the coverage criteria - to 60% first, then less. But watch the E-values of the shorter sequences carefully. You can click on any of theDon't include sequences with E values greater than 1e-04 or so. The S. cerevisiae Xbp1 sequence seems to be a true positive, but it won't match the profile well because of a long insertion in the APSES domain - thus it has only 38% coverage. But its E-value is probably less then 1e-06. Keep that for last. There is no harm in proceeding through these iterations slowly and conservatively. But if you include only one non-homologous sequence by mistake, you will get profile corruption. For example, you might notice hypothetical protein CC1G_13964 [Coprinopsis cinerea okayama7#130] or hypothetical protein CNAG_05349 [Cryptococcus neoformans var. grubii H99] at E-values around 1e-04. These are probably false positives with an unexpectedly high number of indels and a location towards the C-terminus of the protein. Keep them out!


Once no "new" sequences have been added, we would always get the same result on additional iterations because there are no more changes to the profile. We say that the search has converged. Last I ran this, I took about 10 iterations to arrive at 45 high-confidence hits for the ten reference species. You should find approximately the same number - plus a few more for the domains you find in MYSPE.

Time to harvest.


Task:

  1. In the header section of the BLAST report, click on Taxonomy reports and find MYSPE in the Organism Report section. These are your APSES domain homologs. All of them. There is a link to the alignment, the BLAST score, the E-value, and a link to the entry in RefSeq.
  2. Copy the sequence identifiers from MYSPE, with E-values above your defined threshold to your journal.

For example, the table of Saccharomyces hits contains the following information:

------------------------------------------------------------------------------------------------
Saccharomyces cerevisiae S288c [ascomycetes]
------------------------------------------------------------------------------------------------
transcription factor MBP1 [Saccharomyces cerevisiae S288C]             138    8e-39    NP_010227
SBF complex DNA-binding subunit SWI4 [Saccharomyces cerevisiae S288C]  118    7e-32    NP_011036
Phd1p [Saccharomyces cerevisiae S288C]                                 110    8e-30    NP_012881
Sok2p [Saccharomyces cerevisiae S288C]                                 110    4e-29    NP_013729
Xbp1p [Saccharomyces cerevisiae S288C]                                46.0    2e-06    NP_012165
------------------------------------------------------------------------------------------------

Xbp1 is a special case. It has only very low coverage, but that is because it has a long domain insertion and the N-terminal match often is not recognized by alignment because the gap scores for long indels are unrealistically large. For now, I keep that sequence with the others, noting that it has a homologue in N. crassa with a similar situation.


Task:

Add your sequences to your protein database
  • To add the sequences to your database, open each of the links to the RefSeq record for MYSPE organism into a separate tab.
  • Find the UniProt IDs.
  • Make a copy of "./data/refAPSES_PSI-BLAST.json" and call it "MYSPE_APSES_PSI-BLAST.json"
  • Open "./data/MYSPE_APSES_PSI-BLAST.json" in RStudio to see how it is formatted.
  • Delete the existing data for however many proteins your PSI-BLAST search has discovered in the MYSPE genome, and replace it with your protein's data. Save the file when you are done.
  • Validate your files online at https://jsonlint.com/
  • Add the following line to your database creation script makeProteinDB.R:
  myDB <- dbAddProtein(    myDB, fromJSON("MYSPE_APSES_PSI-BLAST.json"))
  • Save makeProteinDB.R.
  • Recreate the database by running:
  source("makeProteinDB.R")
  • Confirm that your proteins have been correctly added:
sel <- which(myDB$taxonomy$species == MYSPE)
(taxid <- myDB$taxonomy$ID[sel])

sel <- which(myDB$protein$taxonomyID == taxid)
myDB$protein$name[sel]      # Does this return all of the APSES domain protein
                            # names of MYSPE? If not, this needs to be fixed.
                            # Ask for advice.

nchar(myDB$protein$sequence[sel])      # Does this give the correct sequence
                                       # lengths? If not, this needs to be
                                       # fixed. Ask for advice.



 

Model Based Alignments: PSSMs and HMMs

 
Position Specific Scoring Matrices (PSSMs)

The sensitivity of PSI-BLAST is based on the alignment of profiles of related sequences. The profiles are represented as position specific scoring matrices compiled from the alignment of hits, first to the original sequence and then to the profile. Incidentally, this process can also be turned around, and a collection of pre-compiled PSSMs can be used to annotate protein sequence: this is the principle employed by RPS-BLAST, the tool that identifies conserved domains at the beginning of every BLAST search, and has been used to build the CDD database of conserved domains (for a very informative help-page on CDD see here.


Add PSI BLAST results to your Database

Task:
Once you have discovered APSES domain proteins in MYSPE, you should enter them into your database. Copy and edit your code template that adds entries to the protein table, once for each of your MYSPE hits.

Execute the code and confirm that the proteins have been added to your database with all of their required information. You can check the code in addSACCE_APSESproteins.R and compare this with your code if you get errors. But careful in case you want to reuse that code: dont use "ns = ref" in your dbAutoincrement calls!

Finally save the new version of myDB as version 03: "myDB.03.RData"



 


Further reading, links and resources

  • Limiting the minimal profile block width enhances PSI-BLAST performance significantly:
Oda et al. (2017) Simple adjustment of the sequence weight algorithm remarkably enhances PSI-BLAST performance. BMC Bioinformatics 18:288. (pmid: 28578660)

PubMed ] [ DOI ] BACKGROUND: PSI-BLAST, an extremely popular tool for sequence similarity search, features the utilization of Position-Specific Scoring Matrix (PSSM) constructed from a multiple sequence alignment (MSA). PSSM allows the detection of more distant homologs than a general amino acid substitution matrix does. An accurate estimation of the weights for sequences in an MSA is crucially important for PSSM construction. PSI-BLAST divides a given MSA into multiple blocks, for which sequence weights are calculated. When the block width becomes very narrow, the sequence weight calculation can be odd. RESULTS: We demonstrate that PSI-BLAST indeed generates a significant fraction of blocks having width less than 5, thereby degrading the PSI-BLAST performance. We revised the code of PSI-BLAST to prevent the blocks from being narrower than a given minimum block width (MBW). We designate the modified application of PSI-BLAST as PSI-BLASTexB. When MBW is 25, PSI-BLASTexB notably outperforms PSI-BLAST consistently for three independent benchmark sets. The performance boost is even more drastic when an MSA, instead of a sequence, is used as a query. CONCLUSIONS: Our results demonstrate that the generation of narrow-width blocks during the sequence weight calculation is a critically important factor that restricts the PSI-BLAST search performance. By preventing narrow blocks, PSI-BLASTexB upgrades the PSI-BLAST performance remarkably. Binaries and source codes of PSI-BLASTexB (MBW = 25) are available at https://github.com/kyungtaekLIM/PSI-BLASTexB .


 


Notes

  1. The -myceta are well supported groups above the Class rank. See Leotiomyceta for details and references.


 


Self-evaluation

 



 




 

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.



 

About ...
 
Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2017-08-05

Version:

1.0

Version history:

  • 1.0 First live version, updated to new BLAST interface.
  • 0.1 First stub

CreativeCommonsBy.png This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.