Expected Preparations:
Keywords: BLAST algorithm and Web interface; interpretation of BLAST alignments
Objectives:
This unit will …
Outcomes:
After working through this unit you …
Deliverables:
* Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
* Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.
* Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation: NA: This unit is not evaluated for course marks.
This unit introduces the BLAST algorithm and practices BLAST searches, both via the Web interface and scripted in R.
BLAST is by a wide margin the most important computational tool of molecular biology. It is so important that we couldn’t even begin our explorations here without it: we have already used BLAST in the BIN-Storing_data unit to find the sequence in MYSPE that is most similar to MBP1_SACCE.
Task…
A good, detailed introduction to finding homologues on a database scale, and to recognizing whether similar sequences are indeed homologous, is Pearson (2013)1.
In this unit we will use BLAST to identify Reciprocal Best Matches.
One of the important questions of model-organism based inference is: which genes perform the same function in two different organisms? In the absence of other information, our best guess is that these are the two genes that are mutually most similar. The keyword here is mutually. If the best match to MBP1_SACCE from S. cerevisiae among the proteins of S. pombe is RES2_SCHPO, the two proteins are only mutually most similar if RES2_SCHPO is in turn more similar to MBP1_SACCE than to any other S. cerevisiae protein. We call this a Reciprocal Best Match, or “RBM”2.
The argument is summarized in the figure on the right: genes that evolve under continuous selective pressure on their function accumulate mutations at a relatively lower rate, and are thus more similar to each other than genes that undergo neo- or sub-functionalization after duplication.
However, there is a catch: proteins are often composed of multiple domains that implement distinct roles of their function. Under the assumptions above we could hypothesize:
* a gene in MYSPE that has the “same” function as the Mbp1 cell-cycle checkpoint switch in yeast should be an RBM to Mbp1;
* a gene that binds to the same DNA sites as Mbp1 should have a DNA-binding domain that is an RBM to the DNA binding domain of Mbp1.
Thus we’ll compare RBMs in MYSPE for full-length Mbp1_SACCE and its DNA-binding domain, and see if the results are the same.
You have already performed the first half of the experiment: matching from S. cerevisiae to MYSPE. The backward match is simple.
Task…
1. … MBP1_MYSPE in the Query sequence field.
2. Select refseq_protein as the database to search in, and enter Saccharomyces cerevisiae (taxid:4932) to restrict the organism for which hits are reported.
If your top hit is NP_010227, you have confirmed the RBM between Mbp1_SACCE and Mbp1_MYSPE. If it is not, let me know: I expect the top hit to be NP_010227 and would like to verify your results if it is not3.
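The logic of the two searches can be written down in a few lines. The following is a minimal sketch in Python (the course scripts themselves are in R), assuming each search result has been reduced to a list of (subject ID, bit score) pairs. The function names, the `OTHER_…` identifiers, and all scores are hypothetical and serve only to illustrate the reciprocity test.

```python
def best_hit(hits):
    """Return the subject ID with the highest score, or None if there are no hits."""
    if not hits:
        return None
    return max(hits, key=lambda hit: hit[1])[0]


def is_reciprocal_best_match(query_id, forward_hits, reverse_hits_by_subject):
    """True if the top forward hit, searched back, has query_id as its own top hit."""
    top = best_hit(forward_hits)
    if top is None:
        return False
    return best_hit(reverse_hits_by_subject.get(top, [])) == query_id


# Hypothetical hit lists: (subject ID, bit score). All scores are invented.
# Forward: yeast Mbp1 (NP_010227) searched against MYSPE proteins.
forward = [("MBP1_MYSPE", 310.0), ("OTHER_MYSPE", 120.5)]
# Reverse: the top MYSPE hit searched against S. cerevisiae proteins.
reverse = {"MBP1_MYSPE": [("NP_010227", 305.2), ("OTHER_SACCE", 118.0)]}

print(is_reciprocal_best_match("NP_010227", forward, reverse))  # prints True
```

If the reverse search had ranked a different yeast protein first, the function would return False even though the forward search still points to the same MYSPE protein. That asymmetry is exactly what the word "mutually" guards against.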
The DNA-binding domain of Mbp1_SACCE is called an APSES domain. If the RBM of Saccharomyces cerevisiae Mbp1 in MYSPE is truly an orthologue, we expect all of the protein’s respective domains to have the RBM property as well. But let’s not simply assume what we can easily test. We’ll define the sequence of the APSES domain in MBP1_SACCE and MYSPE and see how these definitions are reflected in a BLAST search.
{{#lst:Reference annotation yeast Mbp1|CDD_APSES}}
Task…
Access BLAST and follow the link to the protein blast program.
Forward search:
1. Paste only the APSES domain sequence for MBP1_SACCE in the Query sequence field (copy the sequence from above).
2. Select refseq_protein as the database to search in, and enter the correct taxonomy ID for MYSPE in the Organism field.
3. Run BLAST. Examine the results.
4. If the top hit is the same protein you have already seen, OK. If it’s not, add it to your protein database in RStudio.
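The same choices made in the Web form (program, database, organism restriction) are what a scripted search has to supply. As a hedged illustration in Python, the sketch below only builds a submission URL and does not send it. The parameter names (`CMD`, `PROGRAM`, `DATABASE`, `QUERY`, `ENTREZ_QUERY`) follow NCBI's BLAST URL API as I understand it and should be checked against the current NCBI documentation. The function name and the placeholder query are hypothetical, and the course's R script may do this differently.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def blastp_put_url(query, taxid, database="refseq_protein"):
    """Build a protein BLAST submission URL restricted to one organism.

    Assumption: parameter names as in NCBI's BLAST URL API; verify before use.
    """
    params = {
        "CMD": "Put",                             # submit a new search
        "PROGRAM": "blastp",                      # protein query vs. protein database
        "DATABASE": database,
        "QUERY": query,                           # sequence or accession
        "ENTREZ_QUERY": f"txid{taxid}[ORGN]",     # organism restriction
    }
    return BLAST_URL + "?" + urlencode(params)


# Placeholder query (the 20 amino acid letters), not a real protein sequence.
# taxid 4932 is the S. cerevisiae ID given above; for the forward search,
# substitute the taxonomy ID of MYSPE.
url = blastp_put_url("ACDEFGHIKLMNPQRSTVWY", 4932)
print(url)
```

A complete script would also have to submit the request, wait for the search to finish, and retrieve and parse the results. Those steps are omitted here.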
With this we have confirmed which MYSPE sequence has the most highly conserved APSES domain. Can we take the sequence for the reverse search from the alignment that BLAST returns? Actually, that is not a good idea: BLAST is a heuristic, and the alignment it reports is not guaranteed to be optimal. We should compute an optimal sequence alignment instead. That is, we use two different tools for two different purposes: BLAST to identify one protein as the most similar among many alternatives, and optimal sequence alignment to determine the best alignment between two sequences. That best alignment is what we will annotate as the MYSPE APSES domain.
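To make "optimal" concrete, here is a minimal sketch in Python of optimal local alignment by dynamic programming (Smith-Waterman). It is an illustration only, not the course's R script. It assumes a simple match/mismatch score and a linear gap penalty, whereas protein alignments in practice use a substitution matrix such as BLOSUM62 and affine gap penalties. The sequences are toy strings, not real proteins.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill: each cell holds the best score of any local alignment ending there.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the highest-scoring cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy example: a short "domain" located inside a longer "protein".
print(smith_waterman("WGHE", "AAAWGHEKKK"))  # prints (8, 'WGHE', 'WGHE')
```

Because every cell of the matrix is evaluated, the result is guaranteed to be the highest-scoring local alignment under the chosen scoring scheme. BLAST gives up this guarantee in exchange for the speed needed to search whole databases.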
We will execute the sequence alignment and the reverse search in R.
Task…
1. … ABC-units R project. If you have loaded it before, choose File ▸ Recent projects ▸ ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
2. … init() if requested.
3. … BIN-ALI-BLAST.R and follow the instructions.
Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of the code.
Boratyn, Grzegorz M. et al. (2013). “BLAST: a more efficient report with usability improvements”. Nucleic Acids Research 41(Web Server issue):W29–33. [PMID: 23609542] [DOI: 10.1093/nar/gkt282]
If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.
Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.
2. Note that RBMs are usually orthologues, but the definitions of orthologue and RBM are not the same. Most importantly, many orthologues are not RBMs. We will explore this more when we discuss phylogenetic inference.
3. One such case we encountered involved a protein that has a corrupted annotation for the DNA binding domain. It appears to be the correct orthologue, but it only has the second highest BLAST score.