Difference between revisions of "BIO Assignment Week 6"

Revision as of 18:38, 18 November 2014

Assignment for Week 6
Sensitive database searches with PSI-BLAST

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.

Introduction

Take care of things, and they will take care of you.: Shunryu Suzuki

Anyone can click buttons on a Web page, but to use the powerful sequence database search tools right often takes considerable more care, caution and consideration.

Much of what we know about a protein's physiological function is based on the conservation of that function as the species evolves. We assess conservation by comparing sequences between related proteins. Conservation - or its opposite: variation - is a consequence of selection under constraints: protein sequences change as a consequence of DNA mutations, this changes the protein's structure, this in turn changes functions and that has the multiple effects on a species' fitness function. Detrimental variants may be removed. Variation that is tolerated is largely neutral and therefore found only in positions that are neither structurally nor functionally critical. Conservation patterns can thus provide evidence for many different questions: structural conservation among proteins with similar 3D-structures, functional conservation among homologues with comparable roles, or amino acid propensities as predictors for protein engineering and design tasks.

Measuring conservation requires alignment. Therefore a carefully done multiple sequence alignment (MSA) is a cornerstone for the annotation of the essential properties a gene or protein. MSAs are also useful to resolve ambiguities in the precise placement of indels and to ensure that columns in alignments actually contain amino acids that evolve in a similar context. MSAs serve as input for

functional annotation;
protein homology modeling;
phylogenetic analyses, and
sensitive homology searches in databases.

In order to perform a multiple sequence alignment, we obviously need a set of homologous sequences. This is where the trouble begins. All interpretation of MSA results depends absolutely on how the input sequences were chosen. Should we include only orthologs, or paralogs as well? Should we include only species with fully sequenced genomes, or can we tolerate that some orthologous genes are possibly missing for a species? Should we include all sequences we can lay our hands on, or should we restrict the selection to a manageable number of representative sequences? All of these choices influence our interpretation:

orthologs are expected to be functionally and structurally conserved;
paralogs may have divergent function but have similar structure;
missing genes may make paralogs look like orthologs; and
selection bias may weight our results toward sequences that are over-represented and do not provide a fair representation of evolutionary divergence.

In this assignment, we will set ourselves the task to use PSI-BLAST and find all orthologs and paralogs of the APSES domain containing transcription factors in YFO. We will use these sequences later for multiple alignments, calculation of conservation etc. The methodical problem we will address is: how do we perform a sensitive PSI-BLAST search in one organism. There is an issue to consider:

If we restrict the PSI-BLAST search to YFO, PSI-BLAST has little chance of building a meaningful profile - the number of homologues that actually are in YFO is too small. Thus the search will not become very sensitive.
If we don't restrict our search, but search in all species, the number of hits may become too large. It becomes increasingly difficult to closely check all hits as to whether they have good coverage, and how will we evaluate the fringe cases of marginal E-value, where we need to decide whether to include a new sequence in the profile, or whether to hold off on it for one or two iterations, to see whether the E-value drops significantly. Profile corruption would make the search useless. This is maybe still manageable if we restrict our search to fungi, but imagine you are working with a bacterial protein, or a protein that is conserved across the entire tree of life: your search will find thousands of sequences. And by next year, thousands more will have been added.

Therefore we have to find a middle ground: add enough species (sequences) to compile a sensitive profile, but not so many that we can no longer individually assess the sequences that contribute to the profile.

Thus in practice, a sensitive PSI-BLAST search needs to address two issues before we begin:

We need to define the sequence we are searching with; and
We need to define the dataset we are searching in.

Defining the sequence to search with

Consider again the task we set out from: find all orthologs and paralogs of the APSES domain containing transcription factors in YFO.

Task:
What query sequence should you use? Should you ...

Search with the full-length Mbp1 sequence from Saccharomyces cerevisiae?
Search with the full-length Mbp1 homolog that you found in YFO?
Search with the structurally defined S. cerevisiae APSES domain sequence?
Search with the APSES domain sequence from the YFO homolog, that you have defined by sequence alignment with the yeast protein?
Search with the KilA-N domain sequence?

ExpandReflect on this (pretend this is a quiz question) and come up with a reasoned answer. Then click on "Expand" to read my opinion on this question.

The full-length Mbp1 sequence from Saccharomyces cerevisiae: Since this sequence contains multiple domains (in particular the ubiquitous Ankyrin domains) it is not suitable for BLAST database searches. You must restrict your search to the domain of greatest interest for your question. That would be the APSES domain.

The full-length Mbp1 homolog that you found in YFO: What organism the search sequence comes from does not make a difference. Since you aim to find all homologs in YFO, it is not necessary to have your search sequence more or less similar to any particular homologs. In fact any APSES sequence should give you the same result, since they are all homologous. But the full-length sequence in YFO has the same problem as the Saccharomyces sequence.

The structurally defined S. cerevisiae APSES domain sequence?: That would be my first choice, just because it is structurally well defined as a complete domain, and the sequence is easy to obtain from the 1BM8 PDB entry. (1MB1 would also work, but you would need to edit out the penta-Histidine tag at the C-terminus that was engineered into the sequence to help purify the recombinantly expressed protein.)

The APSES domain sequence from the YFO homolog, that you have defined by sequence alignment with the yeast protein?: As argued above: since they are all homologs, any of them should lead to the same set of results.

The KilA-N domain sequence?: This is a shorter sequence and a more distant homolog to the domain we are interested in. It would not be my first choice: the fact that it is more distantly related might make the search more sensitive. The fact that it is shorter might make the search less specific. The effect of this tradeoff would need to be compared and considered. By the way: the same holds for the even shorter subdomain 50-74 we discussed in the last assignment. However: one of the results of our analysis will be whether APSES domains in fungi all have the same length as the Mbp1 domain, or whether some are indeed much shorter, as suggested by the Pfam alignment.

So in my opinion, you should search with the yeast Mbp1 APSES domain, i.e. the sequence which you have previously studied in the crystal structure. Where is that? Well, you might have saved it in your journal, or you can get it again from the PDB (i.e. here, or from Assignment 3.

Selecting species for a PSI-BLAST search

As discussed in the introduction, in order to use our sequence set for studying structural and functional features and conservation patterns of our APSES domain proteins, we should start with a well selected dataset of APSES domain containing homologs in YFO. Since these may be quite divergent, we can't rely on BLAST to find all of them, we need to use the much more sensitive search of PSI-BLAST instead. But even though you are interested only in YFO's genes, it would be a mistake to restrict the PSI-BLAST search to YFO. PSI-BLAST becomes more sensitive if the profile represents more diverged homologs. Therefore we should always search with a broadly representative set of species, even if we are interested only in the results for one of the species. This is important. Please reflect on this for a bit and make sure you understand the rationale why we include sequences in the search that we are not actually interested in.

But you can also search with too many species: if the number of species is large and PSI-BLAST finds a large number of results:

it becomes unwieldy to check the newly included sequences at each iteration, inclusion of false-positive hits may result, profile corruption and loss of specificity. The search will fail.
since genomes from some parts of the Tree Of Life are over represented, the inclusion of all sequences leads to selection bias and loss of sensitivity.

We should therefore try to find a subset of species

that represent as large a range as possible on the evolutionary tree;
that are as well distributed as possible on the tree; and
whose genomes are fully sequenced.

These criteria are important. Again, reflect on them and understand their justification. Choosing your species well for a PSI-BLAST search can be crucial to obtain results that are robust and meaningful.

How can we define a list of such species, and how can we use the list?

The definition is a rather typical bioinformatics task for integrating datasources: "retrieve a list of representative fungi with fully sequenced genomes". Unfortunately, to do this in a principled way requires tools that you can't (yet) program: we would need to use a list of genome sequenced fungi, estimate their evolutionary distance and select a well-distributed sample. Regrettably you can't combine such information easily with the resources available from the NCBI.

We will use an approach that is conceptually similar: selecting a set of species according to their shared taxonomic rank in the tree of life. Biological classification provides a hierarchical system that describes evolutionary relatedness for all living entities. The levels of this hierarchy are so called taxonomic ranks. These ranks are defined in Codes of Nomenclature that are curated by the self-governed international associations of scientists working in the field. The number of ranks is not specified: there is a general consensus on seven principal ranks (see below, in bold) but many subcategories exist and may be newly introduced. It is desired–but not mandated–that ranks represent clades (a group of related species, or a "branch" of a phylogeny), and it is desired–but not madated–that the rank is sharply defined. The system is based on subjective dissimilarity. Needless to say that it is in flux.

If we follow a link to an entry in the NCBI's Taxonomy database, eg. Saccharomyces cerevisiae S228c, the strain from which the original "yeast genome" was sequenced in the late 1990s, we see the following specification of its taxonomic lineage:

cellular organisms; Eukaryota; Opisthokonta; Fungi; Dikarya; 
Ascomycota; Saccharomyceta; Saccharomycotina; Saccharomycetes; 
Saccharomycetales; Saccharomycetaceae; Saccharomyces; Saccharomyces cerevisiae

These names can be mapped into taxonomic ranks ranks, since the suffixes of these names e.g. -mycotina, -mycetaceae are specific to defined ranks. (NCBI does not provide this mapping, but Wikipedia is helpful here.)

Rank	Suffix	Example
Domain		Eukaryota (Eukarya)
Subdomain		Opisthokonta
Kingdom		Fungi
Subkingdom		Dikarya
Phylum		Ascomycota
rankless taxon^[1]	-myceta	Saccharomyceta
Subphylum	-mycotina	Saccharomycotina
Class	-mycetes	Saccharomycetes
Subclass	-mycetidae
Order	-ales	Saccharomycetales
Family	-aceae	Saccharomycetaceae
Subfamily	-oideae
Tribe	-eae
Subtribe	-ineae
Genus		Saccharomyces
Species		Saccharomyces cerevisiae

You can see that there is not a common mapping between the yeast lineage and the commonly recognized categories - not all ranks are represented. Nor is this consistent across species in the taxonomic database: some have subfamily ranks and some don't. And the tree is in no way normalized - some of the ranks have thousands of members, and for some, only a single extant member may be known, or it may be a rank that only relates to the fossil record. But the ranks do provide some guidance to evolutionary divergence. Say you want to choose four species across the tree of life for a study, you should choose one from each of the major domains of life: Eubacteria, Euryarchaeota, Crenarchaeota-Eocytes, and Eukaryotes. Or you want to study a gene that is specific to mammals. Then you could choose from the clades listed in the NCBI taxonomy database under Mammalia (a class rank, and depending how many species you would want to include, use the subclass-, order-, or family rank (hover over the names to see their taxonomic rank.) There will still be quite a bit of manual work involved and an exploration of different options on the Web may be useful. For our purposes here we can retrieve a good set of organisms from the ensembl fungal genomes page - maintained by the EBI's genome annotation group - that lists species grouped by taxonomic order. All of these organisms are genome-sequenced, we can pick a set of representatives:

Capnodiales Zymoseptoria tritici
Erysiphales Blumeria graminis
Eurotiales Aspergillus nidulans
Glomerellales Glomerella graminicola
Hypocreales Trichoderma reesei
Magnaporthales Magnaporthe oryzae
Microbotryales Microbotryum violaceum
Pezizales Tuber melanosporum
Pleosporales Phaeosphaeria nodorum
Pucciniales Puccinia graminis
Saccharomycetales Saccharomyces cerevisiae
Schizosaccharomycetales Schizosaccharomyces pombe
Sclerotiniaceae Sclerotinia sclerotiorum
Sordariales Neurospora crassa
Tremellales Cryptococcus neoformans
Ustilaginales Ustilago maydis

This set of organisms thus can be used to generate a PSI-BLAST search in a well-distributed set of species. Of course you must also include YFO (if YFO is not in this list already). To enter these 16 species as an Entrez restriction, they need to be formatted as below. (One could also enter species one by one, by pressing the (+) button after the organism list)

Aspergillus nidulans[orgn]
OR Blumeria graminis[orgn]
OR Cryptococcus neoformans[orgn]
OR Glomerella graminicola[orgn]
OR Magnaporthe oryzae[orgn]
OR Microbotryum violaceum[orgn] 
OR Neurospora crassa[orgn]
OR Phaeosphaeria nodorum[orgn]
OR Puccinia graminis[orgn]
OR Sclerotinia sclerotiorum[orgn]
OR Trichoderma reesei[orgn]
OR Tuber melanosporum[orgn]
OR Saccharomyces cerevisiae[orgn]
OR Schizosaccharomyces pombe[orgn]
OR Ustilago maydis[orgn]
OR Zymoseptoria tritici[orgn]

Executing the PSI-BLAST search

We have a list of species. Good. Next up: how do we use it.

Task:

Navigate to the BLAST homepage.
Select protein BLAST.
Paste the APSES domain sequence into the search field.
Select refseq as the database.
Copy the organism restriction list from above and enter the correct name for YFO into the list if it is not there already. Obviously, you can't find sequences in YFO if YFO is not included in your search space. Paste the list into the Entrez Query field.
In the Algorithm section, select PSI-BLAST.
Click on BLAST.

Evaluate the results carefully. Since we used default parameters, the threshold for inclusion was set at an E-value of 0.005 by default, and that may be a bit too lenient. If you look at the table of your hits– in the Sequences producing significant alignments... section– there may also be a few sequences that have a low query coverage of less than 80%. Let's exclude these from the profile initially: not to worry, if they are true positives, the will come back with improved E-values and greater coverage in subsequent iterations. But if they were false positives, their E-values will rise and they should drop out of the profile and not contaminate it.

Task:

In the header section, click on Formatting options and in the line "Format for..." set the with inclusion threshold to 0.001 (This means E-values can't be above 10^-03 for the sequence to be included.)
Click on the Reformat button (top right).
In the table of sequence descriptions (not alignments!), click on the Query coverage to sort the table by coverage, not by score.
Copy the rows with a coverage of less than 80% and paste them into some text editor so you can compare what happens with these sequences in the next iteration.
Deselect the check mark next to these sequences in the right-hand column Select for PSI blast. (For me these are six sequences, but with YFO included that may be a bit different.)
Then scroll to Run PSI-BLAST iteration 2 ... and click on Go.

This is now the "real" PSI-BLAST at work: it constructs a profile from all the full-length sequences and searches with the profile, not with any individual sequence. Note that we are controlling what goes into the profile in two ways:

we are explicitly removing sequences with poor coverage; and
we are requiring a more stringent minimum E-value for each sequence.

Task:

Again, study the table of hits. Sequences highlighted in yellow have met the search criteria in the second iteration. Note that the coverage of (some) of the previously excluded sequences is now above 80%.
Let's exclude partial matches one more time. Again, deselect all sequences with less than 80% coverage. Then run the third iteration.
Iterate the search in this way until no more "New" sequences are added to the profile. Then scan the list of excluded hits ... are there any from YFO that seem like they could potentially make the list? Marginal E-value perhaps, or reasonable E-value but less coverage? If that's the case, try returning the E-value threshold to the default 0.005 and see what happens...

Once no "new" sequences have been added, if we were to repeat the process again and again, we would always get the same result because the profile stays the same. We say that the search has converged. Good. Time to harvest.

Task:

At the header, click on Taxonomy reports and find YFO in the Organism Report section. These are your APSES domain homologs. All of them. Actually, perhaps more than all: the report may also include sequences with E-values above the inclusion threshold.
From the report copy the sequence identifiers
1. from YFO,
2. with E-values above your defined threshold.

For example, the list of Saccharomyces genes is the following:

Saccharomyces cerevisiae S288c [ascomycetes] taxid 559292 ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c] [ 131] 1e-38 ref|NP_011036.1| Swi4p [Saccharomyces cerevisiae S288c] [ 123] 1e-35 ref|NP_012881.1| Phd1p [Saccharomyces cerevisiae S288c] [ 91] 1e-25 ref|NP_013729.1| Sok2p [Saccharomyces cerevisiae S288c] [ 93] 3e-25 ref|NP_012165.1| Xbp1p [Saccharomyces cerevisiae S288c] [ 40] 5e-07

Xbp1 is a special case. It has only very low coverage, but that is because it has a long domain insertion and the N-terminal match often is not recognized by alignment because the gap scores for long indels are unrealistically large. For now, I keep that sequence with the others.

Next we need to retrieve the sequences. Tedious to retrieve them one by one, but we can get them all at the same time:

Task:

Return to the BLAST results page and again open the Formatting options.
Find the Limit results section and enter YFO's name into the field. For example Saccharomyces cerevisiae [ORGN]
Click on Reformat
Scroll to the Descriptions section, check the box at the left-hand margin, next to each sequence you want to keep. Then click on Download → FASTA complete sequence → Continue.

ExpandThere are actually several ways to download lists of sequences. Using the results page utility is only one. But if you know the GIs of the sequences you need, you can get them more directly by putting them into the URL...

That is all.

Links and resources

Footnotes and references

↑ The -myceta are well supported groups above the Class rank. See Leotiomyceta for details and references.

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.

Do consider how to ask your questions so that a meaningful answer is possible:
- How to create a Minimal, Complete, and Verifiable example on stackoverflow and ...
- How to make a great R reproducible example are required reading.

[1] The -myceta are well supported groups above the Class rank. See Leotiomyceta for details and references.

[1]

Difference between revisions of "BIO Assignment Week 6"

Revision as of 18:38, 18 November 2014

Contents

Introduction

Defining the sequence to search with

Selecting species for a PSI-BLAST search

Executing the PSI-BLAST search

Links and resources

Footnotes and references

Ask, if things don't work for you!

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Sections

Tools

@@ Line 40: / Line 40: @@
 *paralogs may have divergent function but have similar structure;
 *missing genes may make paralogs look like orthologs; and
-*selection bias may weight our results toward sequences that are over-represented, and not provide a fair representation of evolutionary divergence.
+*selection bias may weight our results toward sequences that are over-represented and do not provide a fair representation of evolutionary divergence.
-In this assignment, we will set ourselves the task to use PSI-BLAST and '''find all orthologs and paralogs of the APSES domain containing transcription factors in YFO'''. We will use these sequences later for multiple alignments, calculation of conservation ''etc''. The methodical problem we will address is: how do we perform a sensitive PSI-BLAST search '''in one organism'''. This is the issue:
+In this assignment, we will set ourselves the task to use PSI-BLAST and '''find all orthologs and paralogs of the APSES domain containing transcription factors in YFO'''. We will use these sequences later for multiple alignments, calculation of conservation ''etc''. The methodical problem we will address is: how do we perform a sensitive PSI-BLAST search '''in one organism'''. There is an issue to consider:
 * If we restrict the PSI-BLAST search to YFO, PSI-BLAST has little chance of building a meaningful profile - the number of homologues that actually are '''in''' YFO is too small. Thus the search will not become very sensitive.
-* If we search in all species, the number of hits may become too large. This is maybe not such a problem if we can restrict our search to fungi, but imagine you are working with a bacterial protein, or a protein that is conserved across the entire tree of life: your search will find thousands of sequences. And by next year, thousands more will have been added. How will you evaluate the fringe cases, where you need to decide whether to add a new sequence with marginal E-value to the profile, or whether to hold off for one or two iterations, and to see whether the E-value drops significantly - to avoid profile corruption?
+* If we don't restrict our search, but search in all species, the number of hits may become too large. It becomes increasingly difficult to closely check all hits as to whether they have good coverage, and how will we evaluate the fringe cases of marginal E-value, where we need to decide whether to include a new sequence in the profile, or whether to hold off on it for one or two iterations, to see whether the E-value drops significantly. Profile corruption would make the search useless. This is maybe still manageable if we restrict our search to fungi, but imagine you are working with a bacterial protein, or a protein that is conserved across the entire tree of life: your search will find thousands of sequences. And by next year, thousands more will have been added.
-Therefore we have to find a middle ground: add enough species (sequences) to compile a sensitive profile, but not so many that we can't anymore assess the sequences that contribute to the profile.
+Therefore we have to find a middle ground: add enough species (sequences) to compile a sensitive profile, but not so many that we can no longer individually assess the sequences that contribute to the profile.
-To put this into practice, the sequence search needs to address two issues before we begin:
+Thus in practice, a sensitive PSI-BLAST search needs to address two issues before we begin:
 # We need to define the sequence we are searching with; and
 # We need to define the dataset we are searching in.
@@ Line 70: / Line 70: @@
 # Search with the full-length Mbp1 sequence from ''Saccharomyces cerevisiae''?
 # Search with the full-length Mbp1 homolog that you found in YFO?
-# Search with the ''S. cerevisiae'' APSES domain sequence?
+# Search with the structurally defined ''S. cerevisiae'' APSES domain sequence?
 # Search with the APSES domain sequence from the YFO homolog, that you have defined by sequence alignment with the yeast protein?
 # Search with the KilA-N domain sequence?
@@ Line 83: / Line 83: @@
 :What organism the search sequence comes from does not make a difference. Since you aim to find '''all''' homologs in YFO, it is not necessary to have your search sequence more or less similar to '''any particular''' homologs. In fact '''any''' APSES sequence should give you the same result, since they are '''all''' homologous. But the full-length sequence in YFO has the same problem as the ''Saccharomyces'' sequence.
-;The ''S. cerevisiae'' APSES domain sequence?
+;The structurally defined ''S. cerevisiae'' APSES domain sequence?
-:That would be my first choice, just because it is nicely defined as the sequence of the <code>1BM8</code> PDB structure. (<code>1MB1</code> would also work, but you would need to edit out the penta-Histidine tag at the C-terminus that was engineered into the sequence to help purify the recombinantly expressed protein.)
+:That would be my first choice, just because it is structurally well defined as a complete domain, and the sequence is easy to obtain from the <code>1BM8</code> PDB entry. (<code>1MB1</code> would also work, but you would need to edit out the penta-Histidine tag at the C-terminus that was engineered into the sequence to help purify the recombinantly expressed protein.)
 ;The APSES domain sequence from the YFO homolog, that you have defined by sequence alignment with the yeast protein?
@@ Line 90: / Line 90: @@
 ;The KilA-N domain sequence?
-:This is a shorter sequence and a more distant homolog to the domain we are interested in. It would not be my first choice: the fact that it is more distantly related might make the search '''more sensitive'''. The fact that it is shorter might make the search '''less specific'''. The effect of this tradeoff would need to be compared and considered. By the way: the same holds for the even shorter subdomain 50-74 we discussed in the last assignment. However: one of the results of our analysis will be '''whether APSES domains in fungi all have the same length as the Mbp1 domain, or whether some are indeed much shorter, as sugested by the Pfam alignment.'''
+:This is a shorter sequence and a more distant homolog to the domain we are interested in. It would not be my first choice: the fact that it is more distantly related might make the search '''more sensitive'''. The fact that it is shorter might make the search '''less specific'''. The effect of this tradeoff would need to be compared and considered. By the way: the same holds for the even shorter subdomain 50-74 we discussed in the last assignment. However: one of the results of our analysis will be '''whether APSES domains in fungi all have the same length as the Mbp1 domain, or whether some are indeed much shorter, as suggested by the Pfam alignment.'''
@@ Line 121: / Line 121: @@
 How can we '''define''' a list of such species, and how can we '''use''' the list?
-The definition is a rather typical bioinformatics task for integrating datasources: "retrieve a list of reresentative fungi with fully sequenced genomes".  Unfortunately, to do this in a principled way requires tools that you can't (yet) program: we would need to use a list of genome sequenced fungi, estimate their evolutionary distance and select a well-distributed sample. But we can come close enough to this with the following steps:
+The definition is a rather typical bioinformatics task for integrating datasources: "retrieve a list of representative fungi with fully sequenced genomes".  Unfortunately, to do this in a principled way requires tools that you can't (yet) program: we would need to use a list of genome sequenced fungi, estimate their evolutionary distance and select a well-distributed sample. Regrettably you can't combine such information easily with the resources available from the NCBI.
-# Use a list of genome sequenced fungi (from NCBI);
+We will use an approach that is conceptually similar: selecting a set of species according to their shared taxonomic rank in the tree of life. {{WP|Biological classification|'''Biological classification'''}} provides a hierarchical system that describes evolutionary relatedness for all living entities. The levels of this hierarchy are so called {{WP|Taxonomic rank|'''taxonomic ranks'''}}. These ranks are defined in ''Codes of Nomenclature'' that are curated by the self-governed international associations of scientists working in the field. The number of ranks is not specified: there is a general consensus on seven principal ranks (see below, in bold) but many subcategories exist and may be newly introduced. It is desired&ndash;but not mandated&ndash;that ranks represent ''clades'' (a group of related species, or a "branch" of a phylogeny), and it is desired&ndash;but not madated&ndash;that the rank is sharply defined. The system is based on subjective dissimilarity. Needless to say that it is in flux.
-# BLAST the yeast Mbp1 APSES domain against that list;
-# Evaluate the taxonomy report that BLAST generates
-# Select species of approximately similar ''taxonomic rank''.
-Again: reflect on this process and make sure you understand the principle. You should be able to ask yourself: how would I do this for a protein I work with after the course... ? (And know the answer.)
-{{task|1=
-# Navigate to the [http://blast.ncbi.nlm.nih.gov/ '''BLAST'''] home page.
-# Find the link to '''list all genomic BLAST databases''' and follow it. This list will take you to a selection of genome-sequenced fungi.
-# Find the section of '''Fungi''' and click on the small triangle if it is not yet "open".
-# Don't be deceived: there are more species in the database than these. You could follow the links if you wanted to search in '''one particular genome'''. We will search in a '''set of genomes''' instead. Click on the small, round '''B''' icon, next to the group label '''Fungi'''. You should arrive at [http://www.ncbi.nlm.nih.gov/sutils/genom_table.cgi?organism=fungi this] page.
-#From the drop-down menus select:
-## Query: '''Protein'''
-## Database: '''Protein'''
-## BLAST-Program: '''blastp'''
-# Check the boxes next to all species that have a pale yellow background. As you can read in the header of the page, these are completed genomic sequences. As of today, there are 34 such genomes.
-# Paste your search sequence - i.e. the sequence of the yeast Mbp1 APSES domain into the field.
-# Click on '''BLAST''', then on '''View results''' on the next page.
-# In the header section of the BLAST report, find the line '''Other reports''' and open the '''Taxonomy report''' in a separate tab or window.
-# For completeness, scroll through the list of '''Descriptions''' - "Sequences producing significant alignments" and look at he accession numbers. Most of these are RefSeq IDs (either <code>NP_...</code> or <code>XP_...</code>). Make sure that for all of the species that do '''not''' have RefSeq identifiers there are variants or strains that do. The reason is: we would not want to inadvertently exclude species in favour of closely related other species for which the genome has not yet been imported into RefSeq. Since we will be doing our full-scale search on RefSeq, we want to ensure all our species are actually represented there. Please reflect on this for a moment and make sure you understand this point.
-# Now examine the taxonomy report. The page has three sections: a '''Lineage Report''', an '''Organism Report''' and the '''Taxonomy report'''.
-}}
-To make use of the Taxonomy report, you should know that {{WP|Biological classification|biological classification}} provides a hierarchical system that defines relationships for all living entities. The levels of the hierarchy are so called {{WP|Taxonomic rank|'''taxonomic ranks'''}}. These ranks are defined in ''Codes of Nomenclature'' that are curated by the self-governed international associations of scientists working in the field. The number of ranks is not specified: there is a general consensus on seven principal ranks (see below, in bold) but many subcategories exist and may be newly introduced. It is desired&ndash;but not mandated&ndash;that ranks represent ''clades'' (a group of related species, or a "branch" of a phylogeny), and it is desired&ndash;but not madated&ndash;that the rank is sharply defined. The system is based on subjective dissimilarity. Needless to say that it is in flux. However the coarse outlines are basically stable and will serve for our purpose of identifying a number of well-distributed species from a set.
 If we follow a link to an entry in the NCBI's Taxonomy database, eg. [http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=559292 ''Saccharomyces cerevisiae S228c''], the strain from which the original "yeast genome" was sequenced in the late 1990s, we see the following specification of its taxonomic lineage:
@@ Line 271: / Line 244: @@
-You can see that there is not a common mapping between the yeast lineage and the commonly recognized categories - not all ranks are represented. Nor is this consistent across species in the taxonomic database: some have subfamily ranks and some don't. However, armed with this information, we should be able make sense of the Taxonomy Report for our purposes:
+You can see that there is not a common mapping between the yeast lineage and the commonly recognized categories - not all ranks are represented. Nor is this consistent across species in the taxonomic database: some have subfamily ranks and some don't. And the tree is in no way normalized - some of the ranks have thousands of members, and for some, only a single extant member may be known, or it may be a rank that only relates to the fossil record. But the ranks do provide some guidance to evolutionary divergence. Say you want to choose four species across the tree of life for a study, you should choose one from each of the major '''domains''' of life: Eubacteria, Euryarchaeota, Crenarchaeota-Eocytes, and Eukaryotes. Or you want to study a gene that is specific to mammals. Then you could choose from the clades listed in the NCBI taxonomy database under [http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=40674&lvl=4 '''Mammalia'''] (a {{WP|Mammal_classification|'''class rank'''}}, and depending how many species you would want to include, use the
+subclass-, order-, or family rank (hover over the names to see their taxonomic rank.)
+There will still be quite a bit of manual work involved and an exploration of different options on the Web may be useful. For our purposes here we can retrieve a good set of organisms from the [http://fungi.ensembl.org/info/website/species.html '''ensembl fungal genomes page'''] - maintained by the EBI's genome annotation group - that lists species grouped by taxonomic ''order''. All of these organisms are genome-sequenced, we can pick a set of representatives:
-<div class="mw-collapsible mw-collapsed" data-expandtext="Expand" data-collapsetext="Collapse" style="border:#000000 solid 1px; padding: 10px; margin-left:25px; margin-right:25px;">The NCBI BLAST taxonomy report for hits to Mbp1 homologs in genome-sequenced fungi:
+# Capnodiales&nbsp;&nbsp;&nbsp;''Zymoseptoria tritici''
+# Erysiphales&nbsp;&nbsp;&nbsp;''Blumeria graminis''
+# Eurotiales&nbsp;&nbsp;&nbsp;''Aspergillus nidulans''
+# Glomerellales&nbsp;&nbsp;&nbsp;''Glomerella graminicola''
+# Hypocreales&nbsp;&nbsp;&nbsp;''Trichoderma reesei''
+# Magnaporthales&nbsp;&nbsp;&nbsp;''Magnaporthe oryzae''
+# Microbotryales&nbsp;&nbsp;&nbsp;''Microbotryum violaceum''
+# Pezizales&nbsp;&nbsp;&nbsp;''Tuber melanosporum''
+# Pleosporales&nbsp;&nbsp;&nbsp;''Phaeosphaeria nodorum''
+# Pucciniales&nbsp;&nbsp;&nbsp;''Puccinia graminis''
+# Saccharomycetales&nbsp;&nbsp;&nbsp;''Saccharomyces cerevisiae''
+# Schizosaccharomycetales&nbsp;&nbsp;&nbsp;''Schizosaccharomyces pombe''
+# Sclerotiniaceae&nbsp;&nbsp;&nbsp;''Sclerotinia sclerotiorum''
+# Sordariales&nbsp;&nbsp;&nbsp;''Neurospora crassa''
+# Tremellales&nbsp;&nbsp;&nbsp;''Cryptococcus neoformans''
+# Ustilaginales&nbsp;&nbsp;&nbsp;''Ustilago maydis''
-<source lang="text">
+This set of organisms thus can be used to generate a PSI-BLAST search in a well-distributed set of species. Of course '''you must also include YFO''' (<small>if YFO is not in this list already</small>).
-Dikarya
-. saccharomyceta
-. . Saccharomycetales
-</source>
-<div  class="mw-collapsible-content">
-<source lang="text">
-. . . Saccharomycetaceae
-. . . . Saccharomyces
-. . . . . Saccharomyces cerevisiae
-. . . . . . Saccharomyces cerevisiae S288c
-. . . . . . Saccharomyces cerevisiae CEN.PK113-7D
-. . . . . . Saccharomyces cerevisiae YJM789
-. . . . . . Saccharomyces cerevisiae RM11-1a
-. . . . . . Saccharomyces cerevisiae AWRI1631
-. . . . . . Saccharomyces cerevisiae JAY291
-. . . . . . Saccharomyces cerevisiae Lalvin QA23
-. . . . . . Saccharomyces cerevisiae FostersB
-. . . . . . Saccharomyces cerevisiae AWRI796
-. . . . . . Saccharomyces cerevisiae VL3
-. . . . . . Saccharomyces cerevisiae Vin13
-. . . . . . Saccharomyces cerevisiae EC1118
-. . . . . . Saccharomyces cerevisiae FostersO
-. . . . . Saccharomyces cerevisiae x Saccharomyces kudriavzevii VIN7
-. . . . mitosporic Nakaseomyces
-. . . . . Candida glabrata
-. . . . . . Candida glabrata CBS 138
-. . . . Tetrapisispora phaffii CBS 4417
-. . . . Kluyveromyces
-. . . . . Kluyveromyces lactis
-. . . . . . Kluyveromyces lactis NRRL Y-1140
-. . . . Naumovozyma
-. . . . . Naumovozyma dairenensis CBS 421
-. . . . . Naumovozyma castellii CBS 4309
-. . . . Zygosaccharomyces
-. . . . . Zygosaccharomyces rouxii
-. . . . . . Zygosaccharomyces rouxii CBS 732
-. . . . Vanderwaltozyma polyspora DSM 70294
-. . . . Eremothecium
-. . . . . Eremothecium cymbalariae DBVPG#7215
-. . . . . Eremothecium gossypii
-. . . . . . Ashbya gossypii ATCC 10895
-. . . . . . Ashbya gossypii FDAG1
-. . . . Torulaspora delbrueckii
-. . . . Lachancea thermotolerans CBS 6340
-. . . . Komagataella pastoris
-. . . . . Komagataella pastoris GS115
-. . . . . Komagataella pastoris CBS 7435
-. . . Candida
-. . . . Candida dubliniensis CD36
-. . . . Candida albicans
-. . . . . Candida albicans WO-1
-. . . . . Candida albicans SC5314
-. . . Debaryomycetaceae
-. . . . Scheffersomyces stipitis CBS 6054
-. . . . Debaryomyces hansenii CBS767
-. . . Yarrowia lipolytica CLIB122
-. . leotiomyceta
-. . . mitosporic Trichocomaceae
-. . . . Aspergillus
-. . . . . Aspergillus niger
-. . . . . . Aspergillus niger CBS 513.88
-. . . . . . Aspergillus niger ATCC 1015
-. . . . . Aspergillus fumigatus
-. . . . . . Aspergillus fumigatus Af293
-. . . . . . Aspergillus fumigatus A1163
-. . . . Penicillium chrysogenum Wisconsin 54-1255
-. . . Sordariomycetidae
-. . . . Magnaporthe
-. . . . . Magnaporthe oryzae 70-15
-. . . . . Magnaporthe grisea
-. . . . Chaetomiaceae .
-. . . . . Myceliophthora thermophila ATCC 42464 .
-. . . . . Thielavia terrestris NRRL 8126
-. . . Dothideomycetes .
-. . . . Zymoseptoria tritici IPO323 .
-. . . . Phaeosphaeria nodorum SN15
-. . Schizosaccharomyces
-. . . Schizosaccharomyces pombe
-. . . . Schizosaccharomyces pombe 972h-
-. Basidiomycota .
-. . Ustilago maydis 521 .
-. . Filobasidiella/Cryptococcus neoformans species complex
-. . . Cryptococcus neoformans var. neoformans .
-. . . . Cryptococcus neoformans var. neoformans JEC21 .
-. . . . Cryptococcus neoformans var. neoformans B-3501A .
-. . . Cryptococcus gattii WM276 .
-</source>
-</div>
-</div>
+To enter these 16 species as an Entrez restriction, they need to be formatted as below. (<small>One could also enter species one by one, by pressing the '''(+)''' button after the organism list</small>)
-You need to note that this report gives '''the highest taxonomic rank that is common to a group below it''', i.e. two species differ in the rank immediately below the last one named. For example the two species identified as common at the ''class'' level ...
- <source lang="text">. . . Dothideomycetes .
- . . . . Zymoseptoria tritici IPO323 .
- . . . . Phaeosphaeria nodorum SN15
-</source>
-... differ at the ''subclass'' rank as we can see if we follow their links to the taxonomy browser.
- <source lang="text">. . . Dothideomycetes .
- . . . . Dothideomycetidae
- . . . . . Zymoseptoria tritici IPO323 .
- . . . . Pleosporomycetidae
- . . . . . Phaeosphaeria nodorum SN15
-</source>
-Our goal is to remove species that are "too similar". What that means precisely is really up to us, but for this purpose let's keep only one representative of each ''subfamily'', i.e. we will remove (by hand) all but one representative of a set that shares a -oideaea (subfamily), -eaea (tribe) or -ineae (subtribe) designation or below. The result is the following set of 23 species (I have formatted them so they can be pasted into the Entrez filter field, but one could also enter species one by one, by pressing the '''(+)''' button after the organism list):
+<source lang="text">
+Aspergillus nidulans[orgn]
+OR Blumeria graminis[orgn]
+OR Cryptococcus neoformans[orgn]
+OR Glomerella graminicola[orgn]
+OR Magnaporthe oryzae[orgn]
+OR Microbotryum violaceum[orgn]
+OR Neurospora crassa[orgn]
+OR Phaeosphaeria nodorum[orgn]
+OR Puccinia graminis[orgn]
+OR Sclerotinia sclerotiorum[orgn]
+OR Trichoderma reesei[orgn]
+OR Tuber melanosporum[orgn]
+OR Saccharomyces cerevisiae[orgn]
+OR Schizosaccharomyces pombe[orgn]
+OR Ustilago maydis[orgn]
+OR Zymoseptoria tritici[orgn]
-<source lang="text">
-Saccharomyces cerevisiae [ORGN]
-OR Candida glabrata [ORGN]
-OR Tetrapisispora phaffii [ORGN]
-OR Kluyveromyces lactis [ORGN]
-OR Naumovozyma dairenensis [ORGN]
-OR Zygosaccharomyces rouxii [ORGN]
-OR Vanderwaltozyma polyspora [ORGN]
-OR Ashbya gossypii [ORGN]
-OR Torulaspora delbrueckii [ORGN]
-OR Lachancea thermotolerans [ORGN]
-OR Komagataella pastoris [ORGN]
-OR Candida albicans [ORGN]
-OR Debaryomyces hansenii [ORGN]
-OR Yarrowia lipolytica [ORGN]
-OR Aspergillus niger [ORGN]
-OR Penicillium chrysogenum [ORGN]
-OR Magnaporthe oryzae [ORGN]
-OR Myceliophthora thermophila [ORGN]
-OR Zymoseptoria tritici [ORGN]
-OR Phaeosphaeria nodorum [ORGN]
-OR Schizosaccharomyces pombe [ORGN]
-OR Ustilago maydis [ORGN]
-OR Cryptococcus neoformans [ORGN]
 </source>
-(Consider that this list is quite a bit shorter than the much larger number of species you originally found in the BLAST search report.)
@@ Line 423: / Line 297: @@
 ==Executing the PSI-BLAST search==
-We have a list of species. Goof. Next up: how do we '''use''' it.
+We have a list of species. Good. Next up: how do we '''use''' it.
 {{task|1=
@@ Line 432: / Line 306: @@
 # Paste the APSES domain sequence into the search field.
 # Select '''refseq''' as the database.
-# Copy the organism restriction list from above '''and enter the correct name for YFO''' into the list if it is not there already. Obviously, you can't find sequences in YFO if YFO is not included in your search space. Paste the list into the '''Organism''' field.
+# Copy the organism restriction list from above '''and enter the correct name for YFO''' into the list if it is not there already. Obviously, you can't find sequences in YFO if YFO is not included in your search space. Paste the list into the '''Entrez Query''' field.
 # In the '''Algorithm''' section, select PSI-BLAST.
 #Click on '''BLAST'''.
@@ Line 438: / Line 312: @@
-Evaluate the results carefully. Since we used default parameters, the threshold for inclusion was set at an '''E-value''' of 0.005 by default, and that may be a bit too lenient. If you look at the table of your hits&ndash; in the '''Sequences producing significant alignments...''' section&ndash; there are also quite a few sequences that have a low query coverage. Let's exclude these from the profile initially: not to worry, if they are true positives, the will come back with lower E-values in subsequent iterations. But if they were false positives, their E-values will rise and they should drop out of the profile and not contaminate it.
+Evaluate the results carefully. Since we used default parameters, the threshold for inclusion was set at an '''E-value''' of 0.005 by default, and that may be a bit too lenient. If you look at the table of your hits&ndash; in the '''Sequences producing significant alignments...''' section&ndash; there may also be a few sequences that have a low query coverage of less than 80%. Let's exclude these from the profile initially: not to worry, if they are true positives, the will come back with improved E-values and greater coverage in subsequent iterations. But if they were false positives, their E-values will rise and they should drop out of the profile and not contaminate it.
 {{task|1=
-#In the header section, click on '''Formatting options''' and in the line "Format for..." set the '''inclusion threshold''' to <code>0.0001</code> (i.e. one more zero. This meansE-values can't be above 1e-04 for the sequence to be included.)
+#In the header section, click on '''Formatting options''' and in the line "Format for..." set the '''with inclusion threshold''' to <code>0.001</code> (This means E-values can't be above 10<sup>-03</sup> for the sequence to be included.)
 # Click on the '''Reformat''' button (top right).
 # In the table of sequence descriptions (not alignments!), click on the '''Query coverage''' to sort the table by coverage, not by score.
 # Copy the rows with a coverage of less than 80% and paste them into some text editor so you can compare what happens with these sequences in the next iteration.
-# '''Deselect''' the check mark next to these sequences. (For me these are three sequences, but with YFO included that may be a bit different.)
+# '''Deselect''' the check mark next to these sequences in the right-hand column '''Select for PSI blast'''. (For me these are six sequences, but with YFO included that may be a bit different.)
-# Then next to '''Run PSI-BLAST iteration ...''', click on '''<code>Go</code>'''.
+# Then scroll to '''Run PSI-BLAST iteration 2 ...''' and click on '''<code>Go</code>'''.
 }}
@@ Line 453: / Line 327: @@
 This is now the "real" PSI-BLAST at work: it constructs a profile from all the full-length sequences and searches with the '''profile''', not with any individual sequence. Note that we are controlling what goes into the profile in two ways:
 # we are explicitly removing sequences with poor coverage; and
-# we are requiring a minimum E-value for each sequence.
+# we are requiring a more stringent minimum E-value for each sequence.
 {{task|1=
-#Again, study the table of hits. Sequences marked with green dots were previously included. Sequences labelled with ''new'' have gone below the E-value threshold only in the second iteration. There are quite a few! Sequences without a label were previously excluded.
+#Again, study the table of hits. Sequences highlighted in yellow have met the search criteria in the second iteration. Note that the coverage of (some) of the previously excluded sequences is now above 80%.
-# Let's exclude partial matches one more time. Again, deselect all sequences with less than 80% coverage <small>(Here is where all your video game hours practicing rapid, targeted mouse clicks finally pay off!)</small>) Then run the third iteration.
+# Let's exclude partial matches one more time. Again, deselect all sequences with less than 80% coverage. Then run the third iteration.
-# This time there are only a small number of new sequences, but the number of low-coverage sequences has also decreased somewhat.
+# Iterate the search in this way until no more "New" sequences are added to the profile. Then scan the list of excluded hits ... are there any from YFO that seem like they could potentially make the list? Marginal E-value perhaps, or reasonable E-value but less coverage? If that's the case, try returning the E-value threshold to the default 0.005 and see what happens...
-# Again, deselect all sequences with less than 80% coverage. Note that the longer of these now have '''very''' low e-values. They look like true positives. But excluding them does not need to worry us, the will come back. We are more worried about false positives.
-# Iterate the search in this way until no more "New" sequences are added to the profile.
 }}
@@ Line 484: / Line 356: @@
 [http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6323658&dopt=GenPept ref|NP_013729.1|] Sok2p [Saccharomyces cerevisiae S288c]          [  93]  3e-25<br \>
 [http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6322090&dopt=GenPept ref|NP_012165.1|] Xbp1p [Saccharomyces cerevisiae S288c]          [  40]  5e-07<br \>
-[http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6320279&dopt=GenPept ref|NP_010359.1|] Tps2p [Saccharomyces cerevisiae S288c]          [  26]  0.011<br \>
 </code>
-But I believe that Tps2 is a false positive, with low coverage, high E-value, unrelated function<ref>It is a trehalose-6-phosphate synthase/phosphatase.</ref> and a different [http://modbase.compbio.ucsf.edu/modbase-cgi/model_details.cgi?queryfile=1350847045_3197&searchmode=default&displaymode=moddetail&referer=yes&snpflag=& structure]. I ignore this one.
 [[Saccharomyces cerevisiae Xbp1|Xbp1]] is a special case. It has only very low coverage, but that is because it has a long domain insertion and the N-terminal match often is not recognized by alignment because the gap scores for long indels are unrealistically large. For now, I keep that sequence with the others.
@@ Line 496: / Line 366: @@
 {{task|1=
-# Back at the header of BLAST results page, again open the '''Formatting options'''.
+# Return to the BLAST results page and again open the '''Formatting options'''.
 # Find the '''Limit results''' section and enter YFO's name into the field. For example <code>Saccharomyces cerevisiae [ORGN]</code>
 # Click on '''Reformat'''
-# Scroll to the '''Alignments''' section, check the box next to each sequence you want to keep. At the bottom, click on '''Get selected sequences'''.
+# Scroll to the '''Descriptions''' section, check the box at the left-hand margin, next to each sequence you want to keep. Then click on '''Download &rarr; FASTA complete sequence &rarr; Continue'''.
-<div class="mw-collapsible mw-collapsed" data-expandtext="Expand" data-collapsetext="Collapse" style="border:#000000 solid 1px; padding: 10px; margin-left:25px; margin-right:25px;">This will take you to the '''Protein''' database. Note the URL for this page - it contains the GIs for the sequences separated by a comma. You can bookmark the page. You can also select '''Display settings ... FASTA (text)''', copy the sequences and save them in a single file. By the way: we call a file that contains several FASTA formatted sequences a '''Multi FASTA file'''. Also (not required for the assignment) you can explore the options to download sequences via  the URL:
+<div class="mw-collapsible mw-collapsed" data-expandtext="Expand" data-collapsetext="Collapse" style="border:#000000 solid 1px; padding: 10px; margin-left:25px; margin-right:25px;">There are actually several ways to download lists of sequences. Using the results page utility is only one. But if you know the GIs of the sequences you need, you can get them more directly by putting them into the URL...
 <div  class="mw-collapsible-content">
@@ Line 506: / Line 379: @@
 * http://www.ncbi.nlm.nih.gov/protein/6320147,6320957,6322808,6323658,6322090?report=fasta - FASTA sequences with NCBI HTML markup
-But even more flexible is the [http://www.ncbi.nlm.nih.gov/books/NBK25500/#chapter1.Downloading_Full_Records '''eUtils'''] interface to the NCBI databases. For example you can download the dataset in text format by clicking below.
+Even more flexible is the [http://www.ncbi.nlm.nih.gov/books/NBK25500/#chapter1.Downloading_Full_Records '''eUtils'''] interface to the NCBI databases. For example you can download the dataset in text format by clicking below.
 * http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=6320147,6320957,6322808,6323658,6322090&rettype=fasta&retmode=text