Biomolecules: The molecules of life; The genetic
code; Nucleic acids; Amino acids; Protein folding; Post-translational
modifications and protein biochemistry; Membrane proteins; Biological
function.
The Central Dogma: Regulation of transcription and
translation; Protein biosynthesis and degradation; Quality control.
Evolution: Theory of evolution; Variation, neutral
drift and selection.
If you are not already familiar with the
prior knowledge listed above, you need to prepare yourself from
other information sources.
The units listed above are part of this
course and contain important preparatory material.
Keywords: Concepts of homology; Orthologs; Paralogs
Objectives:
This unit will …
… introduce the concept of homology, define orthologues and
paralogues and discuss reasons for and consequences of gene
conservation;
… explore public database resources to find orthologues by BLAST
and in pre-annotated databases.
Outcomes:
After working through this unit you …
… define “homology”, “orthologue” and “paralogue”, and use the
terms correctly, and with a precise understanding of their meaning and
implications;
… are familiar with issues around the definition of homologous
genes and domains;
… know about sequence similarity and other measures that can
identify related proteins and be able to use this to define your own
exploratory strategies;
… have identified the RBM for the Saccharomyces
cerevisiae Mbp1 gene in MYSPE and explored other databases that make
pre-annotated relatedness information available.
Deliverables:
Time management: Before you begin,
estimate how long it will take you to complete this unit. Then, record
in your course journal: the number of hours you estimated, the number of
hours you worked on the unit, and the amount of time that passed between
start and completion of this unit.
Journal: Document your progress in
your Course
Journal. Some tasks may ask you to include specific items in your
journal. Don’t overlook these.
Insights: If you find something
particularly noteworthy about this unit, make a note in your insights!
page.
Evaluation:
NA: This unit is not evaluated for course marks.
Contents
Homology is the most important concept
for bioinformatics, since shared ancestry allows many inferences
about the structure and function of proteins. This unit introduces the
concept and explores MBP1_MYSPE relationships.
In the BIN-Storing_data
unit you found the protein of MYSPE that is most
similar to yeast Mbp1. Now we consider whether this
protein is homologous to the yeast protein.
Are the sequences similar?
You found the MYSPE sequence as a result of a BLAST
search, and you probably know that BLAST finds similar sequences in
large databases. But it will almost always find something, and
that could be a chance similarity. Significant
similarity could be very high, it could extend over the whole length of the
protein, or it could be restricted to individual domains. When would one say:
similar enough?
Do the proteins have similar structures?
If your protein happens to have had a part of its structure analyzed by
X-ray crystallography, you could compare the structures. However, this
is unlikely for the Mbp1 relatives.
What about patterns of conserved residues?
We need more proteins to consider that - and we need to
align them.
Are the proteins known to perform similar functions?
That might require function prediction. There might be an annotation in
the FASTA header of the MYSPE protein - but most likely the annotation
just says: inferred by similarity to the yeast protein (i.e. annotation
transfer). There could be experimental evidence though - check
carefully, just in case.
All of these considerations can be translated into bioinformatics
queries that we will pursue in later units.
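For the first question, one common way to judge whether a similarity is more than chance is the expected number of chance hits (the E-value) that BLAST reports. The following is a minimal sketch of the textbook relationship E = m · n · 2^(−S′) between a bit score S′, the query length m and the database length n. The lengths used here are made-up illustration values, and BLAST itself works with corrected "effective" lengths, so the numbers will not exactly match a real report.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Expected number of chance alignments scoring at least bit_score.

    Uses E = m * n * 2**(-S'), with S' in bits, m the query length and
    n the total database length (both in residues).
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Illustration values only (assumptions, not taken from a real search).
QUERY_LENGTH = 800          # residues in the query protein
DATABASE_LENGTH = 10 ** 9   # total residues in the database searched

for score in (30, 50, 100):
    e_value = expected_chance_hits(score, QUERY_LENGTH, DATABASE_LENGTH)
    print(f"bit score {score:>3}: E = {e_value:.3g}")
```

Each additional bit halves E, and a database twice as large doubles it. This is why the same alignment can be convincing in a search of a single proteome and unconvincing in a search of a very large database.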
Defining orthologs
For functional inference between organisms, the key is to find
orthologs.
To be reasonably certain about orthology relationships, one needs to
construct and analyze detailed evolutionary trees. This is
computationally expensive and the results are not always unambiguous.
But a number of different strategies are available that use
approximations, or precomputed results to define orthologs. These are
especially useful for large, cross genome surveys. They are less useful
for detailed analysis of individual genes.
Orthologs by RBM (Reciprocal Best Match)
The RBM criterion is only an approximation to orthology, but
computationally very tractable and usually correct. To find an RBM, first
search for the best match of a gene in the target genome, then check
whether that best match retrieves the original query when it is used to
search in the source genome. You have already done the first step when
you identified the best match of yeast Mbp1 in MYSPE. Now do the second
step:
Get the ID for the gene which you have identified and annotated as
the best BLAST match for Mbp1 in MYSPE and confirm that this gene has
Mbp1 as the most significant hit in the yeast proteome. The
results are unambiguous, but there may be residual doubt whether these
two best-matching sequences are actually the most similar
orthologs.
Again, the RBM workflow: to find the RBM of
gene-1 of species A in species B …
With a BLAST search, find the best match to gene-1 in species B. Let that be “gene-2”.
With a BLAST search, find the best match to gene-2 in species A.
If that match is again gene-1, the “RBM” has been confirmed.
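The workflow can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: it assumes the two BLAST searches have already been run and their hits stored as dictionaries of (subject, bit score) pairs, and all gene identifiers (A1, B1, …) and scores are hypothetical.

```python
def best_match(hits, query):
    """Return the subject with the highest bit score for query, or None."""
    candidates = hits.get(query, [])
    if not candidates:
        return None
    return max(candidates, key=lambda hit: hit[1])[0]


def reciprocal_best_match(gene_1, a_vs_b, b_vs_a):
    """Return gene-2 if gene-1 and gene-2 are reciprocal best matches, else None."""
    gene_2 = best_match(a_vs_b, gene_1)       # forward search: A -> B
    if gene_2 is None:
        return None
    if best_match(b_vs_a, gene_2) == gene_1:  # reverse search: B -> A
        return gene_2
    return None


# Hypothetical hit lists: query -> [(subject, bit score), ...]
A_VS_B = {"A1": [("B1", 250.0), ("B2", 90.0)],
          "A2": [("B1", 120.0)]}
B_VS_A = {"B1": [("A1", 248.0), ("A2", 118.0)],
          "B2": [("A3", 300.0), ("A1", 88.0)]}

for gene in ("A1", "A2"):
    print(gene, "->", reciprocal_best_match(gene, A_VS_B, B_VS_A))
```

In this toy data A2 finds B1 as its best match, but B1 finds A1, so A2 has no RBM. This is the situation the second step is meant to detect.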
Task…
Perform the second step of the RBM workflow:
Navigate to the BLAST homepage and access the protein BLAST
page.
Copy the RefSeq identifier for MBP1_MYSPE from your journal into the
search field. (You can search directly with an NCBI identifier
IF you want to search with the full-length
sequence.)
Set the database to refseq;
restrict the species to Saccharomyces cerevisiae
S288C.
Run BLAST.
Keep the window open for the next task.
The top hit should be yeast Mbp1 (NP_010227). Discuss on the board if
it is not.
If the top hit is NP_010227, you have confirmed the
RBM criterion (Reciprocal Best Match).
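If you download the result as a tab-separated hit table rather than reading it off the web page, the same check can be scripted. The sketch below assumes the standard 12-column tabular BLAST layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start and end, subject start and end, E-value, bit score). The table contents, the query label and the second subject identifier are made-up placeholders, and the accession version suffix is an assumption.

```python
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def top_hit(hit_table_text):
    """Return the row with the highest bit score as a dict, or None."""
    best = None
    for line in hit_table_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blank and comment lines
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):        # skip malformed rows
            continue
        row = dict(zip(COLUMNS, fields))
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        if best is None or row["bitscore"] > best["bitscore"]:
            best = row
    return best


# Made-up example rows; replace with your own downloaded hit table.
EXAMPLE_TABLE = "\n".join([
    "# hypothetical hit table",
    "\t".join(["MYSPE_QUERY", "NP_010227.1", "45.2", "110", "58", "2",
               "5", "112", "4", "113", "2e-30", "118"]),
    "\t".join(["MYSPE_QUERY", "OTHER_HIT.1", "38.0", "105", "60", "3",
               "8", "110", "30", "134", "4e-20", "88.6"]),
])

EXPECTED = "NP_010227"   # yeast Mbp1, as given in the task

hit = top_hit(EXAMPLE_TABLE)
accession = hit["sseqid"].split(".")[0]        # drop the version suffix
print("RBM confirmed" if accession == EXPECTED else "Top hit is " + accession)
```

Real hit tables may write the subject identifier in a different form, so inspect your file before relying on the comparison in the last line.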
Task…
Explain to someone you know why RBM is expected to
find orthologous pairs of genes. Don’t paraphrase the fact that they do,
or merely describe how an RBM analysis works, but explain
why we can expect it to be successful in identifying an
evolutionary relationship when all we have are measures of pairwise
similarity.
If you can’t figure it out, ask on the Discussion board.
Orthology by annotation
The NCBI precomputes groups of related genes and makes them available
via the HomoloGene database, linked from the RefSeq database entry for your
protein.
Task…
Navigate to the RefSeq protein page for MBP1_MYSPE. (There should be
a link from the query identifier in your BLAST result page).
Follow the HomoloGene link in the right-hand menu
under Related information. (Follow the link
to MBP1_SACCE if your species has not been annotated and
there is no HomoloGene link from your protein’s page.)
You should see a number of genes that are considered homologous in other
fungi, but there is no way to tell whether these are orthologues. The
links to proteins with shared domains show you that there are
several that share (non-specific) ankyrin domains, and only a few that
also have the (highly specific) KilA-N (or APSES) domain.
Orthologs by eggNOG
The eggNOG
(evolutionary genealogy of genes: Non-supervised Orthologous Groups)
database at the EMBL contains orthologous groups of genes. It seems to
be continuously updated, and the search functionality is reasonable. Try
the search with the MBP1_MYSPE RefSeq identifier. What I see are
orthologs annotated in non-fungi, but only via the ankyrin
domain, which is not a meaningful relationship. Alignments and
trees are also available, as are database downloads for algorithmic
analysis.
With the increasing availability of various ’omics data, high-quality
orthology assignment is crucial for evolutionary and functional genomics
studies. We here present the fourth version of the eggNOG database
(available at http://eggnog.embl.de) that derives nonsupervised
orthologous groups (NOGs) from complete genomes, and then applies a
comprehensive characterization and analysis pipeline to the resulting
gene families. Compared with the previous version, we have more than
tripled the underlying species set to cover 3686 organisms, keeping
track with genome project completions while prioritizing the inclusion
of high-quality genomes to minimize error propagation from incomplete
proteome sets. Major technological advances include (i) a robust and
scalable procedure for the identification and inclusion of high-quality
genomes, (ii) provision of orthologous groups for 107 different
taxonomic levels compared with 41 in eggNOGv3, (iii) identification and
annotation of particularly closely related orthologous groups,
facilitating analysis of related gene families, (iv) improvements of the
clustering and functional annotation approach, (v) adoption of a revised
tree building procedure based on the multiple alignments generated
during the process and (vi) implementation of quality control procedures
throughout the entire pipeline. As in previous versions, eggNOGv4
provides multiple sequence alignments and maximum-likelihood trees, as
well as broad functional annotation. Users can access the complete
database of orthologous groups via a web interface, as well as through
bulk download.
Orthologs at OrthoDB
OrthoDB includes
a large number of species, among them all of our protein-sequenced
fungi. However, the search function (by keyword - try “Mbp1”) retrieves
many paralogs together with the orthologs. For example, the yeast Sok2
and Phd1 proteins are found in the same orthologous group, although these two are
clearly paralogs. Again, the results are overloaded with
ankyrin-domain containing proteins.
Waterhouse, Robert M et al. (2013). “OrthoDB: a hierarchical catalog of
animal, fungal and bacterial orthologs”. Nucleic Acids Research 41(Database issue):D358–65. [PMID: 23180791] [DOI: 10.1093/nar/gks1116]
The concept of orthology provides a foundation for formulating
hypotheses on gene and genome evolution, and thus forms the cornerstone
of comparative genomics, phylogenomics and metagenomics. We present the
update of OrthoDB-the hierarchical catalog of orthologs (http://www.orthodb.org).
From its conception, OrthoDB promoted delineation of orthologs at
varying resolution by explicitly referring to the hierarchy of species
radiations, now also adopted by other resources. The current release
provides comprehensive coverage of animals and fungi representing 252
eukaryotic species, and is now extended to prokaryotes with the
inclusion of 1115 bacteria. Functional annotations of orthologous groups
are provided through mapping to InterPro, GO, OMIM and model organism
phenotypes, with cross-references to major resources including UniProt,
NCBI and FlyBase. Uniquely, OrthoDB provides computed evolutionary
traits of orthologs, such as gene duplicability and loss profiles,
divergence rates, sibling groups, and now extended with exon-intron
architectures, syntenic orthologs and parent-child trees. The
interactive web interface allows navigation along the species
phylogenies, complex queries with various identifiers, annotation
keywords and phrases, as well as with gene copy-number profiles and
sequence homology searches. With the explosive growth of available data,
OrthoDB also provides mapping of newly sequenced genomes and
transcriptomes to the current orthologous groups.
Orthologs at OMA
OMA (the Orthologous
Matrix), maintained at the Swiss Federal Institute of Technology, contains
a large number of orthologs from sequenced genomes. Searching with the
RefSeq identifier of MBP1_MYSPE may retrieve hits that you can access
via the “Orthologs” tab. (If not, try yeast Mbp1 NP_010227.)
As a whole this database is well constructed, the output is useful, and
data is available for download and API access; this would be the
resource of my first choice for pre-computed orthology queries.
Altenhoff, Adrian M et al. (2011). “OMA 2011: orthology inference among
1000 complete genomes”. Nucleic Acids Research 39(Database issue):D289–94. [PMID: 21113020] [DOI: 10.1093/nar/gkq1238]
OMA (Orthologous MAtrix) is a database that identifies orthologs among
publicly available, complete genomes. Initiated in 2004, the project is
at its 11th release. It now includes 1000 genomes, making it one of the
largest resources of its kind. Here, we describe recent developments in
terms of species covered; the algorithmic pipeline–in particular
regarding the treatment of alternative splicing, and new features of the
web (OMA Browser) and programming interface (SOAP API). In the second
part, we review the various representations provided by OMA and their
typical applications. The database is publicly accessible at http://omabrowser.org.
… see also the related articles, much innovative and carefully done
work on automated orthologue definition by the Dessimoz group.
Orthologs by syntenic gene order conservation
OMA also provides synteny information, one hallmark of an orthologous
relationship (Why?).
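One simple way to look at gene order conservation is to ask how many genes flanking a candidate pair are themselves matched between the two genomes. The sketch below is only an illustration under stated assumptions: each genome is reduced to a plain list of gene names in chromosomal order, candidate ortholog pairs are given as a dictionary, and all names and the window size are hypothetical. It is not how OMA computes synteny.

```python
def neighborhood(order, gene, window):
    """Return the set of genes within `window` positions of `gene`."""
    i = order.index(gene)
    return set(order[max(0, i - window):i]) | set(order[i + 1:i + 1 + window])


def shared_neighbors(order_a, order_b, pairs, gene_a, gene_b, window=2):
    """List neighbors of gene_a whose paired gene lies near gene_b."""
    near_a = neighborhood(order_a, gene_a, window)
    near_b = neighborhood(order_b, gene_b, window)
    return sorted(a for a in near_a if pairs.get(a) in near_b)


# Hypothetical gene orders and candidate ortholog pairs (A -> B).
ORDER_A = ["a1", "a2", "a3", "a4", "a5"]
ORDER_B = ["b9", "b2", "b3", "b4", "b7"]
PAIRS = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4", "a5": "b5"}

print(shared_neighbors(ORDER_A, ORDER_B, PAIRS, "a3", "b3"))
```

A candidate pair whose neighbors are also paired up supports the interpretation that the whole chromosomal region was inherited together.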
Questions, comments
If in doubt, ask! If anything about this content is
not clear to you, do not proceed but ask for clarification. If you have
ideas about how to make this material better, let’s hear them. We are
aiming to compile a list of FAQs for all learning units, and your
contributions will count towards your participation marks.
Improve this page! If you have questions or
comments, please post them on the Quercus Discussion board with a
subject line that includes the name of the unit.