BIO Assignment Week 6

Assignment for Week 6
Sensitive database searches with PSI-BLAST

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.

Introduction

Take care of things, and they will take care of you.: Shunryu Suzuki

Anyone can click buttons on a Web page, but to use the powerful sequence database search tools right often takes considerable more care, caution and consideration.

Much of what we know about a protein's physiological function is based on the conservation of that function as the species evolves. We assess conservation by comparing sequences between related proteins. Conservation - or its opposite: variation - is a consequence of selection under constraints: protein sequences change as a consequence of DNA mutations, this changes the protein's structure, this in turn changes functions and that has the multiple effects on a species' fitness function. Detrimental variants may be removed. Variation that is tolerated is largely neutral and therefore found only in positions that are neither structurally nor functionally critical. Conservation patterns can thus provide evidence for many different questions: structural conservation among proteins with similar 3D-structures, functional conservation among homologues with comparable roles, or amino acid propensities as predictors for protein engineering and design tasks.

Measuring conservation requires alignment. Therefore a carefully done multiple sequence alignment (MSA) is a cornerstone for the annotation of the essential properties a gene or protein. MSAs are also useful to resolve ambiguities in the precise placement of indels and to ensure that columns in alignments actually contain amino acids that evolve in a similar context. MSAs serve as input for

functional annotation;
protein homology modeling;
phylogenetic analyses, and
sensitive homology searches in databases.

In order to perform a multiple sequence alignment, we obviously need a set of homologous sequences. This is where the trouble begins. All interpretation of MSA results depends absolutely on how the input sequences were chosen. Should we include only orthologs, or paralogs as well? Should we include only species with fully sequenced genomes, or can we tolerate that some orthologous genes are possibly missing for a species? Should we include all sequences we can lay our hands on, or should we restrict the selection to a manageable number of representative sequences? All of these choices influence our interpretation:

orthologs are expected to be functionally and structurally conserved;
paralogs may have divergent function but have similar structure;
missing genes may make paralogs look like orthologs; and
selection bias may weight our results toward sequences that are over-represented and do not provide a fair representation of evolutionary divergence.

In this assignment, we will set ourselves the task to use PSI-BLAST and find all orthologs and paralogs of the APSES domain containing transcription factors in YFO. We will use these sequences later for multiple alignments, calculation of conservation etc. The methodical problem we will address is: how do we perform a sensitive PSI-BLAST search in one organism. There is an issue to consider:

If we restrict the PSI-BLAST search to YFO, PSI-BLAST has little chance of building a meaningful profile - the number of homologues that actually are in YFO is too small. Thus the search will not become very sensitive.
If we don't restrict our search, but search in all species, the number of hits may become too large. It becomes increasingly difficult to closely check all hits as to whether they have good coverage, and how will we evaluate the fringe cases of marginal E-value, where we need to decide whether to include a new sequence in the profile, or whether to hold off on it for one or two iterations, to see whether the E-value drops significantly. Profile corruption would make the search useless. This is maybe still manageable if we restrict our search to fungi, but imagine you are working with a bacterial protein, or a protein that is conserved across the entire tree of life: your search will find thousands of sequences. And by next year, thousands more will have been added.

Therefore we have to find a middle ground: add enough species (sequences) to compile a sensitive profile, but not so many that we can no longer individually assess the sequences that contribute to the profile.

Thus in practice, a sensitive PSI-BLAST search needs to address two issues before we begin:

We need to define the sequence we are searching with; and
We need to define the dataset we are searching in.

Defining the sequence to search with

Consider again the task we set out from: find all orthologs and paralogs of the APSES domain containing transcription factors in YFO.

Task:
What query sequence should you use? Should you ...

Search with the full-length Mbp1 sequence from Saccharomyces cerevisiae?
Search with the full-length Mbp1 homolog that you found in YFO?
Search with the structurally defined S. cerevisiae APSES domain sequence?
Search with the APSES domain sequence from the YFO homolog, that you have defined by sequence alignment with the yeast protein?
Search with the KilA-N domain sequence?

Reflect on this (pretend this is a quiz question) and come up with a reasoned answer. Then click on "Expand" to read my opinion on this question.

The full-length Mbp1 sequence from Saccharomyces cerevisiae: Since this sequence contains multiple domains (in particular the ubiquitous Ankyrin domains) it is not suitable for BLAST database searches. You must restrict your search to the domain of greatest interest for your question. That would be the APSES domain.

The full-length Mbp1 homolog that you found in YFO: What organism the search sequence comes from does not make a difference. Since you aim to find all homologs in YFO, it is not necessary to have your search sequence more or less similar to any particular homologs. In fact any APSES sequence should give you the same result, since they are all homologous. But the full-length sequence in YFO has the same problem as the Saccharomyces sequence.

The structurally defined S. cerevisiae APSES domain sequence?: That would be my first choice, just because it is structurally well defined as a complete domain, and the sequence is easy to obtain from the 1BM8 PDB entry. (1MB1 would also work, but you would need to edit out the penta-Histidine tag at the C-terminus that was engineered into the sequence to help purify the recombinantly expressed protein.)

The APSES domain sequence from the YFO homolog, that you have defined by sequence alignment with the yeast protein?: As argued above: since they are all homologs, any of them should lead to the same set of results.

The KilA-N domain sequence?: This is a shorter sequence and a more distant homolog to the domain we are interested in. It would not be my first choice: the fact that it is more distantly related might make the search more sensitive. The fact that it is shorter might make the search less specific. The effect of this tradeoff would need to be compared and considered. By the way: the same holds for the even shorter subdomain 50-74 we discussed in the last assignment. However: one of the results of our analysis will be whether APSES domains in fungi all have the same length as the Mbp1 domain, or whether some are indeed much shorter, as suggested by the Pfam alignment.

So in my opinion, you should search with the yeast Mbp1 APSES domain, i.e. the sequence which you have previously studied in the crystal structure. Where is that? Well, you might have saved it in your journal, or you can get it again from the PDB (i.e. here, or from Assignment 3.

Selecting species for a PSI-BLAST search

As discussed in the introduction, in order to use our sequence set for studying structural and functional features and conservation patterns of our APSES domain proteins, we should start with a well selected dataset of APSES domain containing homologs in YFO. Since these may be quite divergent, we can't rely on BLAST to find all of them, we need to use the much more sensitive search of PSI-BLAST instead. But even though you are interested only in YFO's genes, it would be a mistake to restrict the PSI-BLAST search to YFO. PSI-BLAST becomes more sensitive if the profile represents more diverged homologs. Therefore we should always search with a broadly representative set of species, even if we are interested only in the results for one of the species. This is important. Please reflect on this for a bit and make sure you understand the rationale why we include sequences in the search that we are not actually interested in.

But you can also search with too many species: if the number of species is large and PSI-BLAST finds a large number of results:

it becomes unwieldy to check the newly included sequences at each iteration, inclusion of false-positive hits may result, profile corruption and loss of specificity. The search will fail.
since genomes from some parts of the Tree Of Life are over represented, the inclusion of all sequences leads to selection bias and loss of sensitivity.

We should therefore try to find a subset of species

that represent as large a range as possible on the evolutionary tree;
that are as well distributed as possible on the tree; and
whose genomes are fully sequenced.

These criteria are important. Again, reflect on them and understand their justification. Choosing your species well for a PSI-BLAST search can be crucial to obtain results that are robust and meaningful.

How can we define a list of such species, and how can we use the list?

The definition is a rather typical bioinformatics task for integrating datasources: "retrieve a list of representative fungi with fully sequenced genomes". Unfortunately, to do this in a principled way requires tools that you can't (yet) program: we would need to use a list of genome sequenced fungi, estimate their evolutionary distance and select a well-distributed sample. Regrettably you can't combine such information easily with the resources available from the NCBI.

We will use an approach that is conceptually similar: selecting a set of species according to their shared taxonomic rank in the tree of life. Biological classification provides a hierarchical system that describes evolutionary relatedness for all living entities. The levels of this hierarchy are so called taxonomic ranks. These ranks are defined in Codes of Nomenclature that are curated by the self-governed international associations of scientists working in the field. The number of ranks is not specified: there is a general consensus on seven principal ranks (see below, in bold) but many subcategories exist and may be newly introduced. It is desired–but not mandated–that ranks represent clades (a group of related species, or a "branch" of a phylogeny), and it is desired–but not madated–that the rank is sharply defined. The system is based on subjective dissimilarity. Needless to say that it is in flux.

If we follow a link to an entry in the NCBI's Taxonomy database, eg. Saccharomyces cerevisiae S228c, the strain from which the original "yeast genome" was sequenced in the late 1990s, we see the following specification of its taxonomic lineage:

cellular organisms; Eukaryota; Opisthokonta; Fungi; Dikarya; 
Ascomycota; Saccharomyceta; Saccharomycotina; Saccharomycetes; 
Saccharomycetales; Saccharomycetaceae; Saccharomyces; Saccharomyces cerevisiae

These names can be mapped into taxonomic ranks ranks, since the suffixes of these names e.g. -mycotina, -mycetaceae are specific to defined ranks. (NCBI does not provide this mapping, but Wikipedia is helpful here.)

Rank	Suffix	Example
Domain		Eukaryota (Eukarya)
Subdomain		Opisthokonta
Kingdom		Fungi
Subkingdom		Dikarya
Phylum		Ascomycota
rankless taxon^[1]	-myceta	Saccharomyceta
Subphylum	-mycotina	Saccharomycotina
Class	-mycetes	Saccharomycetes
Subclass	-mycetidae
Order	-ales	Saccharomycetales
Family	-aceae	Saccharomycetaceae
Subfamily	-oideae
Tribe	-eae
Subtribe	-ineae
Genus		Saccharomyces
Species		Saccharomyces cerevisiae

You can see that there is not a common mapping between the yeast lineage and the commonly recognized categories - not all ranks are represented. Nor is this consistent across species in the taxonomic database: some have subfamily ranks and some don't. And the tree is in no way normalized - some of the ranks have thousands of members, and for some, only a single extant member may be known, or it may be a rank that only relates to the fossil record. But the ranks do provide some guidance to evolutionary divergence. Say you want to choose four species across the tree of life for a study, you should choose one from each of the major domains of life: Eubacteria, Euryarchaeota, Crenarchaeota-Eocytes, and Eukaryotes. Or you want to study a gene that is specific to mammals. Then you could choose from the clades listed in the NCBI taxonomy database under Mammalia (a class rank, and depending how many species you would want to include, use the subclass-, order-, or family rank (hover over the names to see their taxonomic rank.) There will still be quite a bit of manual work involved and an exploration of different options on the Web may be useful. For our purposes here we can retrieve a good set of organisms from the ensembl fungal genomes page - maintained by the EBI's genome annotation group - that lists species grouped by taxonomic order. All of these organisms are genome-sequenced, we can pick a set of representatives:

Capnodiales Zymoseptoria tritici
Erysiphales Blumeria graminis
Eurotiales Aspergillus nidulans
Glomerellales Glomerella graminicola
Hypocreales Trichoderma reesei
Magnaporthales Magnaporthe oryzae
Microbotryales Microbotryum violaceum
Pezizales Tuber melanosporum
Pleosporales Phaeosphaeria nodorum
Pucciniales Puccinia graminis
Saccharomycetales Saccharomyces cerevisiae
Schizosaccharomycetales Schizosaccharomyces pombe
Sclerotiniaceae Sclerotinia sclerotiorum
Sordariales Neurospora crassa
Tremellales Cryptococcus neoformans
Ustilaginales Ustilago maydis

This set of organisms thus can be used to generate a PSI-BLAST search in a well-distributed set of species. Of course you must also include YFO (if YFO is not in this list already). To enter these 16 species as an Entrez restriction, they need to be formatted as below. (One could also enter species one by one, by pressing the (+) button after the organism list)

Aspergillus nidulans[orgn]
OR Blumeria graminis[orgn]
OR Cryptococcus neoformans[orgn]
OR Glomerella graminicola[orgn]
OR Magnaporthe oryzae[orgn]
OR Microbotryum violaceum[orgn] 
OR Neurospora crassa[orgn]
OR Phaeosphaeria nodorum[orgn]
OR Puccinia graminis[orgn]
OR Sclerotinia sclerotiorum[orgn]
OR Trichoderma reesei[orgn]
OR Tuber melanosporum[orgn]
OR Saccharomyces cerevisiae[orgn]
OR Schizosaccharomyces pombe[orgn]
OR Ustilago maydis[orgn]
OR Zymoseptoria tritici[orgn]

Executing the PSI-BLAST search

We have a list of species. Good. Next up: how do we use it.

Task:

Navigate to the BLAST homepage.
Select protein BLAST.
Paste the APSES domain sequence into the search field.
Select refseq as the database.
Copy the organism restriction list from above and enter the correct name for YFO into the list if it is not there already. Obviously, you can't find sequences in YFO if YFO is not included in your search space. Paste the list into the Entrez Query field.
In the Algorithm section, select PSI-BLAST.
Click on BLAST.

Evaluate the results carefully. Since we used default parameters, the threshold for inclusion was set at an E-value of 0.005 by default, and that may be a bit too lenient. If you look at the table of your hits– in the Sequences producing significant alignments... section– there may also be a few sequences that have a low query coverage of less than 80%. Let's exclude these from the profile initially: not to worry, if they are true positives, the will come back with improved E-values and greater coverage in subsequent iterations. But if they were false positives, their E-values will rise and they should drop out of the profile and not contaminate it.

Task:

In the header section, click on Formatting options and in the line "Format for..." set the with inclusion threshold to 0.001 (This means E-values can't be above 10^-03 for the sequence to be included.)
Click on the Reformat button (top right).
In the table of sequence descriptions (not alignments!), click on the Query coverage to sort the table by coverage, not by score.
Copy the rows with a coverage of less than 80% and paste them into some text editor so you can compare what happens with these sequences in the next iteration.
Deselect the check mark next to these sequences in the right-hand column Select for PSI blast. (For me these are six sequences, but with YFO included that may be a bit different.)
Then scroll to Run PSI-BLAST iteration 2 ... and click on Go.

This is now the "real" PSI-BLAST at work: it constructs a profile from all the full-length sequences and searches with the profile, not with any individual sequence. Note that we are controlling what goes into the profile in two ways:

we are explicitly removing sequences with poor coverage; and
we are requiring a more stringent minimum E-value for each sequence.

Task:

Again, study the table of hits. Sequences highlighted in yellow have met the search criteria in the second iteration. Note that the coverage of (some) of the previously excluded sequences is now above 80%.
Let's exclude partial matches one more time. Again, deselect all sequences with less than 80% coverage. Then run the third iteration.
Iterate the search in this way until no more "New" sequences are added to the profile. Then scan the list of excluded hits ... are there any from YFO that seem like they could potentially make the list? Marginal E-value perhaps, or reasonable E-value but less coverage? If that's the case, try returning the E-value threshold to the default 0.005 and see what happens...

Once no "new" sequences have been added, if we were to repeat the process again and again, we would always get the same result because the profile stays the same. We say that the search has converged. Good. Time to harvest.

Task:

At the header, click on Taxonomy reports and find YFO in the Organism Report section. These are your APSES domain homologs. All of them. Actually, perhaps more than all: the report may also include sequences with E-values above the inclusion threshold.
From the report copy the sequence identifiers
1. from YFO,
2. with E-values above your defined threshold.

For example, the list of Saccharomyces genes is the following:

Saccharomyces cerevisiae S288c [ascomycetes] taxid 559292 ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c] [ 131] 1e-38 ref|NP_011036.1| Swi4p [Saccharomyces cerevisiae S288c] [ 123] 1e-35 ref|NP_012881.1| Phd1p [Saccharomyces cerevisiae S288c] [ 91] 1e-25 ref|NP_013729.1| Sok2p [Saccharomyces cerevisiae S288c] [ 93] 3e-25 ref|NP_012165.1| Xbp1p [Saccharomyces cerevisiae S288c] [ 40] 5e-07

Xbp1 is a special case. It has only very low coverage, but that is because it has a long domain insertion and the N-terminal match often is not recognized by alignment because the gap scores for long indels are unrealistically large. For now, I keep that sequence with the others.

Next we need to retrieve the sequences. Tedious to retrieve them one by one, but we can get them all at the same time:

Task:

Return to the BLAST results page and again open the Formatting options.
Find the Limit results section and enter YFO's name into the field. For example Saccharomyces cerevisiae [ORGN]
Click on Reformat
Scroll to the Descriptions section, check the box at the left-hand margin, next to each sequence you want to keep. Then click on Download → FASTA complete sequence → Continue.

There are actually several ways to download lists of sequences. Using the results page utility is only one. But if you know the GIs of the sequences you need, you can get them more directly by putting them into the URL...

http://www.ncbi.nlm.nih.gov/protein/6320147,6320957,6322808,6323658,6322090?report=docsum - The default report
http://www.ncbi.nlm.nih.gov/protein/6320147,6320957,6322808,6323658,6322090?report=fasta - FASTA sequences with NCBI HTML markup

Even more flexible is the eUtils interface to the NCBI databases. For example you can download the dataset in text format by clicking below.

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=6320147,6320957,6322808,6323658,6322090&rettype=fasta&retmode=text

Note that this utility does not show anything, but downloads the (multi) fasta file to your default download directory.

That is all.

Links and resources

Footnotes and references

↑ The -myceta are well supported groups above the Class rank. See Leotiomyceta for details and references.

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.

Do consider how to ask your questions so that a meaningful answer is possible:
- How to create a Minimal, Complete, and Verifiable example on stackoverflow and ...
- How to make a great R reproducible example are required reading.

[1] The -myceta are well supported groups above the Class rank. See Leotiomyceta for details and references.

[1]

BIO Assignment Week 6

Contents

Introduction

Defining the sequence to search with

Selecting species for a PSI-BLAST search

Executing the PSI-BLAST search

Links and resources

Footnotes and references

Ask, if things don't work for you!

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Sections

Tools