BIO Assignment Week 6
Assignment for Week 6
Sensitive database searches with PSI-BLAST
Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.
Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.
Introduction
- Take care of things, and they will take care of you.
- Shunryu Suzuki
Anyone can click buttons on a Web page, but using the powerful sequence database search tools correctly often takes considerably more care, caution and consideration.
Much of what we know about a protein's physiological function is based on the conservation of that function as the species evolves. We assess conservation by comparing sequences between related proteins. Conservation - or its opposite: variation - is a consequence of selection under constraints: protein sequences change as a consequence of DNA mutations, this changes the protein's structure, this in turn changes its function, and that has multiple effects on a species' fitness. Detrimental variants may be removed. Variation that is tolerated is largely neutral and therefore found only in positions that are neither structurally nor functionally critical. Conservation patterns can thus provide evidence for many different questions: structural conservation among proteins with similar 3D-structures, functional conservation among homologues with comparable roles, or amino acid propensities as predictors for protein engineering and design tasks.
Measuring conservation requires alignment. Therefore a carefully done multiple sequence alignment (MSA) is a cornerstone for the annotation of the essential properties of a gene or protein. MSAs are also useful to resolve ambiguities in the precise placement of indels and to ensure that columns in alignments actually contain amino acids that evolve in a similar context. MSAs serve as input for
- functional annotation;
- protein homology modeling;
- phylogenetic analyses, and
- sensitive homology searches in databases.
In order to perform a multiple sequence alignment, we obviously need a set of homologous sequences. This is where the trouble begins. All interpretation of MSA results depends absolutely on how the input sequences were chosen. Should we include only orthologs, or paralogs as well? Should we include only species with fully sequenced genomes, or can we tolerate that some orthologous genes are possibly missing for a species? Should we include all sequences we can lay our hands on, or should we restrict the selection to a manageable number of representative sequences? All of these choices influence our interpretation:
- orthologs are expected to be functionally and structurally conserved;
- paralogs may have divergent function but have similar structure;
- missing genes may make paralogs look like orthologs; and
- selection bias may weight our results toward sequences that are over-represented, and not provide a fair representation of evolutionary divergence.
In this assignment, we will set ourselves the task of using PSI-BLAST to find all orthologs and paralogs of the APSES domain containing transcription factors in YFO. We will use these sequences later for multiple alignments, calculation of conservation etc. The methodological problem we will address is: how do we perform a sensitive PSI-BLAST search in one organism? This is the issue:
- If we restrict the PSI-BLAST search to YFO, PSI-BLAST has little chance of building a meaningful profile - the number of homologues that actually are in YFO is too small. Thus the search will not become very sensitive.
- If we search in all species, the number of hits may become too large. This is maybe not such a problem if we can restrict our search to fungi, but imagine you are working with a bacterial protein, or a protein that is conserved across the entire tree of life: your search will find thousands of sequences. And by next year, thousands more will have been added. How will you evaluate the fringe cases, where you need to decide whether to add a new sequence with marginal E-value to the profile, or whether to hold off for one or two iterations, and to see whether the E-value drops significantly - to avoid profile corruption?
Therefore we have to find a middle ground: add enough species (sequences) to compile a sensitive profile, but not so many that we can no longer assess the sequences that contribute to the profile.
To put this into practice, the sequence search needs to address two issues before we begin:
- We need to define the sequence we are searching with; and
- We need to define the dataset we are searching in.
Defining the sequence to search with
Consider again the task we set out from: find all orthologs and paralogs of the APSES domain containing transcription factors in YFO.
Task:
What query sequence should you use? Should you ...
- Search with the full-length Mbp1 sequence from Saccharomyces cerevisiae?
- Search with the full-length Mbp1 homolog that you found in YFO?
- Search with the S. cerevisiae APSES domain sequence?
- Search with the APSES domain sequence from the YFO homolog, that you have defined by sequence alignment with the yeast protein?
- Search with the KilA-N domain sequence?
- The full-length Mbp1 sequence from Saccharomyces cerevisiae
- Since this sequence contains multiple domains (in particular the ubiquitous Ankyrin domains) it is not suitable for BLAST database searches. You must restrict your search to the domain of greatest interest for your question. That would be the APSES domain.
- The full-length Mbp1 homolog that you found in YFO
- What organism the search sequence comes from does not make a difference. Since you aim to find all homologs in YFO, it is not necessary to have your search sequence more or less similar to any particular homologs. In fact any APSES sequence should give you the same result, since they are all homologous. But the full-length sequence in YFO has the same problem as the Saccharomyces sequence.
- The S. cerevisiae APSES domain sequence?
- That would be my first choice, just because it is nicely defined as the sequence of the 1BM8 PDB structure. (1MB1 would also work, but you would need to edit out the penta-Histidine tag at the C-terminus that was engineered into the sequence to help purify the recombinantly expressed protein.)
- The APSES domain sequence from the YFO homolog, that you have defined by sequence alignment with the yeast protein?
- As argued above: since they are all homologs, any of them should lead to the same set of results.
- The KilA-N domain sequence?
- This is a shorter sequence and a more distant homolog to the domain we are interested in. It would not be my first choice: the fact that it is more distantly related might make the search more sensitive. The fact that it is shorter might make the search less specific. The effect of this tradeoff would need to be compared and considered. By the way: the same holds for the even shorter subdomain 50-74 we discussed in the last assignment. However: one of the results of our analysis will be whether APSES domains in fungi all have the same length as the Mbp1 domain, or whether some are indeed much shorter, as suggested by the Pfam alignment.
So in my opinion, you should search with the yeast Mbp1 APSES domain, i.e. the sequence which you have previously studied in the crystal structure. Where is that? Well, you might have saved it in your journal, or you can get it again from the PDB (i.e. here), or from Assignment 3.
Selecting species for a PSI-BLAST search
As discussed in the introduction, in order to use our sequence set for studying structural and functional features and conservation patterns of our APSES domain proteins, we should start with a well selected dataset of APSES domain containing homologs in YFO. Since these may be quite divergent, we can't rely on BLAST to find all of them; we need to use the much more sensitive search of PSI-BLAST instead. But even though you are interested only in YFO's genes, it would be a mistake to restrict the PSI-BLAST search to YFO. PSI-BLAST becomes more sensitive if the profile represents more diverged homologs. Therefore we should always search with a broadly representative set of species, even if we are interested only in the results for one of the species. This is important. Please reflect on this for a bit and make sure you understand the rationale for including sequences in the search that we are not actually interested in.
But you can also search with too many species: if the number of species is large and PSI-BLAST finds a large number of results:
- it becomes unwieldy to check the newly included sequences at each iteration, so false-positive hits may be included; the result is profile corruption and loss of specificity. The search will fail.
- since genomes from some parts of the Tree Of Life are over-represented, the inclusion of all sequences leads to selection bias and loss of sensitivity.
We should therefore try to find a subset of species
- that represent as large a range as possible on the evolutionary tree;
- that are as well distributed as possible on the tree; and
- whose genomes are fully sequenced.
These criteria are important. Again, reflect on them and understand their justification. Choosing your species well for a PSI-BLAST search can be crucial to obtain results that are robust and meaningful.
How can we define a list of such species, and how can we use the list?
The definition is a rather typical bioinformatics task for integrating datasources: "retrieve a list of representative fungi with fully sequenced genomes". Unfortunately, to do this in a principled way requires tools that you can't (yet) program: we would need to use a list of genome sequenced fungi, estimate their evolutionary distance and select a well-distributed sample. But we can come close enough to this with the following steps:
- Use a list of genome sequenced fungi (from NCBI);
- BLAST the yeast Mbp1 APSES domain against that list;
- Evaluate the taxonomy report that BLAST generates
- Select species of approximately similar taxonomic rank.
Again: reflect on this process and make sure you understand the principle. You should be able to ask yourself: how would I do this for a protein I work with after the course... ? (And know the answer.)
Task:
- Navigate to the BLAST home page.
- Find the link to list all genomic BLAST databases and follow it. This list will take you to a selection of genome-sequenced fungi.
- Find the section of Fungi and click on the small triangle if it is not yet "open".
- Don't be deceived: there are more species in the database than these. You could follow the links if you wanted to search in one particular genome. We will search in a set of genomes instead. Click on the small, round B icon, next to the group label Fungi. You should arrive at this page.
- From the drop-down menus select:
- Query: Protein
- Database: Protein
- BLAST-Program: blastp
- Check the boxes next to all species that have a pale yellow background. As you can read in the header of the page, these are completed genomic sequences. As of today, there are 34 such genomes.
- Paste your search sequence - i.e. the sequence of the yeast Mbp1 APSES domain into the field.
- Click on BLAST, then on View results on the next page.
- In the header section of the BLAST report, find the line Other reports and open the Taxonomy report in a separate tab or window.
- For completeness, scroll through the list of Descriptions - "Sequences producing significant alignments" and look at the accession numbers. Most of these are RefSeq IDs (either NP_... or XP_...). Make sure that for all of the species that do not have RefSeq identifiers there are variants or strains that do. The reason is: we would not want to inadvertently exclude species in favour of closely related other species for which the genome has not yet been imported into RefSeq. Since we will be doing our full-scale search on RefSeq, we want to ensure all our species are actually represented there. Please reflect on this for a moment and make sure you understand this point.
- Now examine the taxonomy report. The page has three sections: a Lineage Report, an Organism Report and the Taxonomy report.
To make use of the Taxonomy report, you should know that biological classification provides a hierarchical system that defines relationships for all living entities. The levels of the hierarchy are so-called taxonomic ranks. These ranks are defined in Codes of Nomenclature that are curated by the self-governed international associations of scientists working in the field. The number of ranks is not specified: there is a general consensus on seven principal ranks (see below, in bold) but many subcategories exist and may be newly introduced. It is desired, but not mandated, that ranks represent clades (a group of related species, or a "branch" of a phylogeny), and it is desired, but not mandated, that the rank is sharply defined. The system is based on subjective dissimilarity. Needless to say, it is in flux. However, the coarse outlines are basically stable and will serve for our purpose of identifying a number of well-distributed species from a set.
If we follow a link to an entry in the NCBI's Taxonomy database, e.g. Saccharomyces cerevisiae S288c, the strain from which the original "yeast genome" was sequenced in the late 1990s, we see the following specification of its taxonomic lineage:
cellular organisms; Eukaryota; Opisthokonta; Fungi; Dikarya;
Ascomycota; Saccharomyceta; Saccharomycotina; Saccharomycetes;
Saccharomycetales; Saccharomycetaceae; Saccharomyces; Saccharomyces cerevisiae
These names can be mapped into taxonomic ranks, since the suffixes of these names, e.g. -mycotina, -mycetaceae, are specific to defined ranks. (NCBI does not provide this mapping, but Wikipedia is helpful here.)
Rank | Suffix | Example |
Domain | | Eukaryota (Eukarya) |
Subdomain | | Opisthokonta |
Kingdom | | Fungi |
Subkingdom | | Dikarya |
Phylum | | Ascomycota |
rankless taxon[1] | -myceta | Saccharomyceta |
Subphylum | -mycotina | Saccharomycotina |
Class | -mycetes | Saccharomycetes |
Subclass | -mycetidae | |
Order | -ales | Saccharomycetales |
Family | -aceae | Saccharomycetaceae |
Subfamily | -oideae | |
Tribe | -eae | |
Subtribe | -ineae | |
Genus | | Saccharomyces |
Species | | Saccharomyces cerevisiae |
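The suffix-to-rank mapping in the table can be sketched in a few lines of Python. This is only an illustration under the assumption that the suffixes listed in the table are the only ones of interest; the names `SUFFIX_TO_RANK` and `rank_from_name` are made up for this example and are not part of any NCBI tool.

```python
# Suffix-to-rank mapping copied from the table above (fungal names only).
SUFFIX_TO_RANK = {
    "myceta": "rankless taxon",
    "mycotina": "Subphylum",
    "mycetes": "Class",
    "mycetidae": "Subclass",
    "ales": "Order",
    "aceae": "Family",
    "oideae": "Subfamily",
    "eae": "Tribe",
    "ineae": "Subtribe",
}


def rank_from_name(name):
    """Return the rank implied by the suffix of a one-word taxon name, or None."""
    if " " in name:
        # Binomials and strain names carry no rank suffix.
        return None
    lowered = name.lower()
    # Try longer suffixes first: "-aceae", "-oideae" and "-ineae" all end in "-eae".
    for suffix in sorted(SUFFIX_TO_RANK, key=len, reverse=True):
        if lowered.endswith(suffix):
            return SUFFIX_TO_RANK[suffix]
    return None


lineage = [
    "Saccharomyceta", "Saccharomycotina", "Saccharomycetes",
    "Saccharomycetales", "Saccharomycetaceae", "Saccharomyces",
]
for taxon in lineage:
    print(taxon, "->", rank_from_name(taxon))
```

Names without a listed suffix (e.g. Fungi, Dikarya, or a genus name) return `None`, so those ranks still have to be looked up by hand.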
Dikarya
. saccharomyceta
. . Saccharomycetales
. . . Saccharomycetaceae
. . . . Saccharomyces
. . . . . Saccharomyces cerevisiae
. . . . . . Saccharomyces cerevisiae S288c
. . . . . . Saccharomyces cerevisiae CEN.PK113-7D
. . . . . . Saccharomyces cerevisiae YJM789
. . . . . . Saccharomyces cerevisiae RM11-1a
. . . . . . Saccharomyces cerevisiae AWRI1631
. . . . . . Saccharomyces cerevisiae JAY291
. . . . . . Saccharomyces cerevisiae Lalvin QA23
. . . . . . Saccharomyces cerevisiae FostersB
. . . . . . Saccharomyces cerevisiae AWRI796
. . . . . . Saccharomyces cerevisiae VL3
. . . . . . Saccharomyces cerevisiae Vin13
. . . . . . Saccharomyces cerevisiae EC1118
. . . . . . Saccharomyces cerevisiae FostersO
. . . . . Saccharomyces cerevisiae x Saccharomyces kudriavzevii VIN7
. . . . mitosporic Nakaseomyces
. . . . . Candida glabrata
. . . . . . Candida glabrata CBS 138
. . . . Tetrapisispora phaffii CBS 4417
. . . . Kluyveromyces
. . . . . Kluyveromyces lactis
. . . . . . Kluyveromyces lactis NRRL Y-1140
. . . . Naumovozyma
. . . . . Naumovozyma dairenensis CBS 421
. . . . . Naumovozyma castellii CBS 4309
. . . . Zygosaccharomyces
. . . . . Zygosaccharomyces rouxii
. . . . . . Zygosaccharomyces rouxii CBS 732
. . . . Vanderwaltozyma polyspora DSM 70294
. . . . Eremothecium
. . . . . Eremothecium cymbalariae DBVPG#7215
. . . . . Eremothecium gossypii
. . . . . . Ashbya gossypii ATCC 10895
. . . . . . Ashbya gossypii FDAG1
. . . . Torulaspora delbrueckii
. . . . Lachancea thermotolerans CBS 6340
. . . . Komagataella pastoris
. . . . . Komagataella pastoris GS115
. . . . . Komagataella pastoris CBS 7435
. . . Candida
. . . . Candida dubliniensis CD36
. . . . Candida albicans
. . . . . Candida albicans WO-1
. . . . . Candida albicans SC5314
. . . Debaryomycetaceae
. . . . Scheffersomyces stipitis CBS 6054
. . . . Debaryomyces hansenii CBS767
. . . Yarrowia lipolytica CLIB122
. . leotiomyceta
. . . mitosporic Trichocomaceae
. . . . Aspergillus
. . . . . Aspergillus niger
. . . . . . Aspergillus niger CBS 513.88
. . . . . . Aspergillus niger ATCC 1015
. . . . . Aspergillus fumigatus
. . . . . . Aspergillus fumigatus Af293
. . . . . . Aspergillus fumigatus A1163
. . . . Penicillium chrysogenum Wisconsin 54-1255
. . . Sordariomycetidae
. . . . Magnaporthe
. . . . . Magnaporthe oryzae 70-15
. . . . . Magnaporthe grisea
. . . . Chaetomiaceae .
. . . . . Myceliophthora thermophila ATCC 42464 .
. . . . . Thielavia terrestris NRRL 8126
. . . Dothideomycetes .
. . . . Zymoseptoria tritici IPO323 .
. . . . Phaeosphaeria nodorum SN15
. . Schizosaccharomyces
. . . Schizosaccharomyces pombe
. . . . Schizosaccharomyces pombe 972h-
. Basidiomycota .
. . Ustilago maydis 521 .
. . Filobasidiella/Cryptococcus neoformans species complex
. . . Cryptococcus neoformans var. neoformans .
. . . . Cryptococcus neoformans var. neoformans JEC21 .
. . . . Cryptococcus neoformans var. neoformans B-3501A .
. . . Cryptococcus gattii WM276 .
You need to note that this report gives the highest taxonomic rank that is common to a group below it, i.e. two species differ in the rank immediately below the last one named. For example the two species identified as common at the class level ...
. . . Dothideomycetes .
. . . . Zymoseptoria tritici IPO323 .
. . . . Phaeosphaeria nodorum SN15
... differ at the subclass rank as we can see if we follow their links to the taxonomy browser.
. . . Dothideomycetes .
. . . . Dothideomycetidae
. . . . . Zymoseptoria tritici IPO323 .
. . . . Pleosporomycetidae
. . . . . Phaeosphaeria nodorum SN15
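If you save the report as text, the dotted indentation can be read programmatically. The following is a minimal sketch, assuming that every level of the hierarchy is written as one leading ". " and that a stray trailing "." carries no meaning; `parse_report` and `children_of` are hypothetical helper names.

```python
def parse_report(text):
    """Turn a dotted taxonomy report into a list of (depth, name) pairs."""
    entries = []
    for line in text.splitlines():
        tokens = line.split()
        depth = 0
        # Each leading "." token is one level of the hierarchy.
        while depth < len(tokens) and tokens[depth] == ".":
            depth += 1
        name_tokens = tokens[depth:]
        # Drop a stray trailing "." left over from the report layout.
        if name_tokens and name_tokens[-1] == ".":
            name_tokens = name_tokens[:-1]
        if name_tokens:
            entries.append((depth, " ".join(name_tokens)))
    return entries


def children_of(entries, parent):
    """Return the names listed exactly one level below `parent`."""
    result = []
    parent_depth = None
    for depth, name in entries:
        if parent_depth is None:
            if name == parent:
                parent_depth = depth
        elif depth <= parent_depth:
            break  # we have left the parent's subtree
        elif depth == parent_depth + 1:
            result.append(name)
    return result


report = """\
. . . Dothideomycetes .
. . . . Dothideomycetidae
. . . . . Zymoseptoria tritici IPO323 .
. . . . Pleosporomycetidae
. . . . . Phaeosphaeria nodorum SN15
"""
entries = parse_report(report)
print(children_of(entries, "Dothideomycetes"))
```

Listing the direct children of a node is one way to see at which rank the species below it first diverge.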
Our goal is to remove species that are "too similar". What that means precisely is really up to us, but for this purpose let's keep only one representative of each subfamily, i.e. we will remove (by hand) all but one representative of a set that shares a -oideae (subfamily), -eae (tribe) or -ineae (subtribe) designation or below. The result is the following set of 23 species (I have formatted them so they can be pasted into the Entrez filter field, but one could also enter species one by one, by pressing the (+) button after the organism list):
Saccharomyces cerevisiae [ORGN]
OR Candida glabrata [ORGN]
OR Tetrapisispora phaffii [ORGN]
OR Kluyveromyces lactis [ORGN]
OR Naumovozyma dairenensis [ORGN]
OR Zygosaccharomyces rouxii [ORGN]
OR Vanderwaltozyma polyspora [ORGN]
OR Ashbya gossypii [ORGN]
OR Torulaspora delbrueckii [ORGN]
OR Lachancea thermotolerans [ORGN]
OR Komagataella pastoris [ORGN]
OR Candida albicans [ORGN]
OR Debaryomyces hansenii [ORGN]
OR Yarrowia lipolytica [ORGN]
OR Aspergillus niger [ORGN]
OR Penicillium chrysogenum [ORGN]
OR Magnaporthe oryzae [ORGN]
OR Myceliophthora thermophila [ORGN]
OR Zymoseptoria tritici [ORGN]
OR Phaeosphaeria nodorum [ORGN]
OR Schizosaccharomyces pombe [ORGN]
OR Ustilago maydis [ORGN]
OR Cryptococcus neoformans [ORGN]
(Consider that this list is quite a bit shorter than the much larger number of species you originally found in the BLAST search report.)
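Typing such a filter by hand invites errors, so here is a small sketch of how the string could be assembled from a list of species names. The function names are made up for this example; only the `[ORGN]` tag and the `OR` syntax are taken from the list above.

```python
def add_species(species, name):
    """Return a copy of the list with `name` appended unless already present."""
    return list(species) if name in species else list(species) + [name]


def entrez_organism_filter(species):
    """Join species names into an Entrez organism restriction string."""
    return " OR ".join(f"{name} [ORGN]" for name in species)


selected = [
    "Saccharomyces cerevisiae",
    "Candida glabrata",
    "Schizosaccharomyces pombe",
]
# Adding a species that is already in the list changes nothing.
selected = add_species(selected, "Candida glabrata")
print(entrez_organism_filter(selected))
```

The result is a single-line string; the one-species-per-line layout shown above is only for readability.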
Executing the PSI-BLAST search
We have a list of species. Good. Next up: how do we use it?
Task:
- Navigate to the BLAST homepage.
- Select protein BLAST.
- Paste the APSES domain sequence into the search field.
- Select refseq as the database.
- Copy the organism restriction list from above and enter the correct name for YFO into the list if it is not there already. Obviously, you can't find sequences in YFO if YFO is not included in your search space. Paste the list into the Organism field.
- In the Algorithm section, select PSI-BLAST.
- Click on BLAST.
Evaluate the results carefully. Since we used default parameters, the threshold for inclusion was set at an E-value of 0.005 by default, and that may be a bit too lenient. If you look at the table of your hits (in the Sequences producing significant alignments... section) there are also quite a few sequences that have a low query coverage. Let's exclude these from the profile initially: not to worry, if they are true positives, they will come back with lower E-values in subsequent iterations. But if they are false positives, their E-values will rise and they should drop out of the profile and not contaminate it.
Task:
- In the header section, click on Formatting options and in the line "Format for..." set the inclusion threshold to 0.0001 (i.e. one more zero. This means E-values can't be above 1e-04 for the sequence to be included.)
- Click on the Reformat button (top right).
- In the table of sequence descriptions (not alignments!), click on the Query coverage to sort the table by coverage, not by score.
- Copy the rows with a coverage of less than 80% and paste them into some text editor so you can compare what happens with these sequences in the next iteration.
- Deselect the check mark next to these sequences. (For me these are three sequences, but with YFO included that may be a bit different.)
- Then next to Run PSI-BLAST iteration ..., click on Go.
This is now the "real" PSI-BLAST at work: it constructs a profile from all the full-length sequences and searches with the profile, not with any individual sequence. Note that we are controlling what goes into the profile in two ways:
- we are explicitly removing sequences with poor coverage; and
- we are requiring a maximum E-value for each sequence, i.e. only sequences with E-values at or below the inclusion threshold are included.
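The two controls can be written down as a simple filter. This is a sketch of the selection logic only, not of what the BLAST server does internally; the `Hit` record and the accessions in the example are hypothetical, while the thresholds (1e-04 and 80% coverage) are the ones used in this assignment.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str
    evalue: float
    coverage: float  # query coverage in percent


def select_for_profile(hits, max_evalue=1e-4, min_coverage=80.0):
    """Keep hits whose E-value is at or below the threshold and whose coverage is sufficient."""
    return [h for h in hits if h.evalue <= max_evalue and h.coverage >= min_coverage]


# Hypothetical hits, for illustration only.
hits = [
    Hit("hit_A", 1e-38, 99.0),   # strong, full-length: keep
    Hit("hit_B", 2e-06, 45.0),   # good E-value but partial match: hold back
    Hit("hit_C", 0.003, 95.0),   # full-length but above 1e-04: hold back
]
print([h.accession for h in select_for_profile(hits)])
```

Hits that are held back are not discarded for good: they can be re-evaluated after the next iteration.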
Task:
- Again, study the table of hits. Sequences marked with green dots were previously included. Sequences labelled with new have gone below the E-value threshold only in the second iteration. There are quite a few! Sequences without a label were previously excluded.
- Let's exclude partial matches one more time. Again, deselect all sequences with less than 80% coverage. (Here is where all your video game hours practicing rapid, targeted mouse clicks finally pay off!) Then run the third iteration.
- This time there are only a small number of new sequences, but the number of low-coverage sequences has also decreased somewhat.
- Again, deselect all sequences with less than 80% coverage. Note that the longer of these now have very low E-values. They look like true positives. But excluding them does not need to worry us, they will come back. We are more worried about false positives.
- Iterate the search in this way until no more "New" sequences are added to the profile.
Once no "new" sequences have been added, if we were to repeat the process again and again, we would always get the same result because the profile stays the same. We say that the search has converged. Good. Time to harvest.
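The convergence criterion can be illustrated with a toy loop. The `search` argument stands in for one PSI-BLAST iteration and is an assumption of this sketch: here it is a made-up lookup table, not a real database search.

```python
def iterate_until_converged(search, seed, max_iterations=20):
    """Repeat `search` until an iteration adds no new members to the profile."""
    members = set(seed)
    for iteration in range(1, max_iterations + 1):
        found = set(search(members))
        added = found - members
        if not added:
            # No "new" sequences: the profile would stay the same from here on.
            return members, iteration
        members |= added
    raise RuntimeError("search did not converge")


# Toy stand-in for a database search: which sequences each member can "detect".
DETECTS = {"query": {"a", "b"}, "a": {"c"}, "b": set(), "c": set()}


def toy_search(members):
    found = set()
    for member in members:
        found |= DETECTS[member]
    return found


profile, iterations = iterate_until_converged(toy_search, {"query"})
print(sorted(profile), iterations)
```

In the toy data, sequence "c" is only detectable once "a" is part of the profile, which mirrors why later iterations can find homologs that the first search missed.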
Task:
- At the header, click on Taxonomy reports and find YFO in the Organism Report section. These are your APSES domain homologs. All of them. Actually, perhaps more than all: the report may also include sequences with E-values above the inclusion threshold.
- From the report copy the sequence identifiers
- from YFO,
- with E-values below your defined threshold (i.e. better than the cutoff).
For example, the list of Saccharomyces genes is the following:
Saccharomyces cerevisiae S288c [ascomycetes] taxid 559292
ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c] [ 131] 1e-38
ref|NP_011036.1| Swi4p [Saccharomyces cerevisiae S288c] [ 123] 1e-35
ref|NP_012881.1| Phd1p [Saccharomyces cerevisiae S288c] [ 91] 1e-25
ref|NP_013729.1| Sok2p [Saccharomyces cerevisiae S288c] [ 93] 3e-25
ref|NP_012165.1| Xbp1p [Saccharomyces cerevisiae S288c] [ 40] 5e-07
ref|NP_010359.1| Tps2p [Saccharomyces cerevisiae S288c] [ 26] 0.011
But I believe that Tps2 is a false positive, with low coverage, high E-value, unrelated function[2] and a different structure. I ignore this one. Xbp1 is a special case. It has only very low coverage, but that is because it has a long domain insertion and the N-terminal match often is not recognized by alignment because the gap scores for long indels are unrealistically large. For now, I keep that sequence with the others.
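If the list is long, the identifiers can be pulled out of the copied text with a regular expression. The sketch below assumes the line layout shown in the example above (`ref|accession| name [organism] [score] E-value`); the pattern and function name are written for this illustration, and a purely numeric cutoff will not reproduce judgment calls such as keeping Xbp1 despite its low coverage.

```python
import re

# One report line: ref|ACCESSION| Name [Organism] [ score] E-value
LINE = re.compile(
    r"^ref\|(?P<accession>[^|]+)\|\s+(?P<protein>\S+)\s+"
    r"\[(?P<organism>[^\]]+)\]\s+\[\s*(?P<score>\d+)\]\s+(?P<evalue>\S+)$"
)


def parse_hits(report_text):
    """Return (accession, protein, E-value) for every line that matches the layout."""
    hits = []
    for line in report_text.splitlines():
        match = LINE.match(line.strip())
        if match:
            hits.append((match["accession"], match["protein"], float(match["evalue"])))
    return hits


report = """\
Saccharomyces cerevisiae S288c [ascomycetes] taxid 559292
ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c] [ 131] 1e-38
ref|NP_011036.1| Swi4p [Saccharomyces cerevisiae S288c] [ 123] 1e-35
ref|NP_012881.1| Phd1p [Saccharomyces cerevisiae S288c] [ 91] 1e-25
ref|NP_013729.1| Sok2p [Saccharomyces cerevisiae S288c] [ 93] 3e-25
ref|NP_012165.1| Xbp1p [Saccharomyces cerevisiae S288c] [ 40] 5e-07
ref|NP_010359.1| Tps2p [Saccharomyces cerevisiae S288c] [ 26] 0.011
"""
hits = parse_hits(report)
kept = [accession for accession, _, evalue in hits if evalue <= 1e-4]
print(kept)
```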
Next we need to retrieve the sequences. Tedious to retrieve them one by one, but we can get them all at the same time:
Task:
- Back at the header of BLAST results page, again open the Formatting options.
- Find the Limit results section and enter YFO's name into the field. For example
Saccharomyces cerevisiae [ORGN]
- Click on Reformat
- Scroll to the Alignments section, check the box next to each sequence you want to keep. At the bottom, click on Get selected sequences.
- http://www.ncbi.nlm.nih.gov/protein/6320147,6320957,6322808,6323658,6322090?report=docsum - The default report
- http://www.ncbi.nlm.nih.gov/protein/6320147,6320957,6322808,6323658,6322090?report=fasta - FASTA sequences with NCBI HTML markup
But even more flexible is the eUtils interface to the NCBI databases. For example you can download the dataset in text format by clicking below.
Note that this utility does not show anything, but downloads the (multi) fasta file to your default download directory.
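As a sketch of what such an eUtils request looks like, the following builds an `efetch` URL for the five identifiers used in the links above. It only constructs the URL and does not download anything; the helper name `efetch_fasta_url` is made up for this example.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(ids, db="protein"):
    """Build an efetch URL that returns the given records as plain-text FASTA."""
    params = {
        "db": db,
        "id": ",".join(str(i) for i in ids),
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


url = efetch_fasta_url([6320147, 6320957, 6322808, 6323658, 6322090])
print(url)
```

Pasting the printed URL into a browser, or passing it to a download tool, should return the multi-FASTA text. Note that `urlencode` writes the commas between identifiers as `%2C`.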
- That is all.
Links and resources
Footnotes and references
- ↑ The -myceta are well supported groups above the Class rank. See Leotiomyceta for details and references.
- ↑ It is a trehalose-6-phosphate synthase/phosphatase.
Ask, if things don't work for you!
- If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.
- Do consider how to ask your questions so that a meaningful answer is possible:
- How to create a Minimal, Complete, and Verifiable example on stackoverflow and ...
- How to make a great R reproducible example are required reading.