Difference between revisions of "FND-Homology"

From "A B C"
m
m
 
(5 intermediate revisions by the same user not shown)
Line 1: Line 1:
<div id="BIO">
+
<div id="ABC">
  <div class="b1">
+
<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce; font-size:300%; font-weight:400; color: #000000; width:100%;">
 
Concepts and Consequences of Homology
 
Concepts and Consequences of Homology
  </div>
+
<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#b3dbce; font-size:30%; font-weight:200; color: #000000; ">
 
+
(Concepts of homology; Orthologs; Paralogs)
  {{Vspace}}
+
</div>
 
 
<div class="keywords">
 
<b>Keywords:</b>&nbsp;
 
Concepts of homology; Orthologs; Paralogs
 
 
</div>
 
</div>
  
{{Vspace}}
+
{{Smallvspace}}
 
 
 
 
__TOC__
 
 
 
{{Vspace}}
 
 
 
 
 
{{LIVE}}
 
 
 
{{Vspace}}
 
  
  
</div>
+
<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce33; font-size:85%;">
<div id="ABC-unit-framework">
+
<div style="font-size:118%;">
== Abstract ==
+
<b>Abstract:</b><br />
 
<section begin=abstract />
 
<section begin=abstract />
<!-- included from "../components/FND-Homology.components.wtxt", section: "abstract" -->
 
 
Homology is the most important concept for bioinformatics, since shared ancestry allows many inferences about the structure and function of proteins. This unit introduces the concept and explores MBP1_MYSPE relationships.
 
Homology is the most important concept for bioinformatics, since shared ancestry allows many inferences about the structure and function of proteins. This unit introduces the concept and explores MBP1_MYSPE relationships.
 
<section end=abstract />
 
<section end=abstract />
 
+
</div>
{{Vspace}}
+
<!-- ============================  -->
 
+
<hr>
 
+
<table>
== This unit ... ==
+
<tr>
=== Prerequisites ===
+
<td style="padding:10px;">
<!-- included from "../components/FND-Homology.components.wtxt", section: "prerequisites" -->
+
<b>Objectives:</b><br />
<!-- included from "ABC-unit_components.wtxt", section: "notes-external_prerequisites" -->
+
This unit will ...
You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:
+
* ... introduce the concept of homology, define orthologues and paralogues and discuss reasons for and consequences of gene conservation;
<!-- included from "FND-prerequisites.wtxt", section: "biomolecules" -->
+
* ... explore public database resources to find orthologues by BLAST and in pre-annotated databases.
 +
</td>
 +
<td style="padding:10px;">
 +
<b>Outcomes:</b><br />
 +
After working through this unit you ...
 +
* ... define "homology", "orthologue" and "paralogue", and use the terms correctly, and with a precise understanding of their meaning and implications;
 +
* ... are familiar with issues around the definition of homologous genes and domains;
 +
* ... know about sequence similarity and other measures that can identify related proteins and be able to use this to define your own exploratory strategies;
 +
* ... have identified the RBM for the ''Saccharomyces cerevisiae'' Mbp1 gene in MYSPE and explored other databases that make pre-annotated relatedness information available.
 +
</td>
 +
</tr>
 +
</table>
 +
<!-- ============================ -->
 +
<hr>
 +
<b>Deliverables:</b><br />
 +
<section begin=deliverables />
 +
<li><b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.</li>
 +
<li><b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.</li>
 +
<li><b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].</li>
 +
<section end=deliverables />
 +
<!-- ============================  -->
 +
<hr>
 +
<section begin=prerequisites />
 +
<b>Prerequisites:</b><br />
 +
You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:<br />
 
*<b>Biomolecules</b>: The molecules of life; nucleic acids and amino acids; the genetic code; protein folding; post-translational modifications and protein biochemistry; membrane proteins; biological function.
 
*<b>Biomolecules</b>: The molecules of life; nucleic acids and amino acids; the genetic code; protein folding; post-translational modifications and protein biochemistry; membrane proteins; biological function.
<!-- included from "FND-prerequisites.wtxt", section: "central_dogma" -->
 
 
*<b>The Central Dogma</b>: Regulation of transcription and translation; protein biosynthesis and degradation; quality control.
 
*<b>The Central Dogma</b>: Regulation of transcription and translation; protein biosynthesis and degradation; quality control.
<!-- included from "FND-prerequisites.wtxt", section: "evolution" -->
 
 
*<b>Evolution</b>: Theory of evolution; variation, neutral drift and selection.
 
*<b>Evolution</b>: Theory of evolution; variation, neutral drift and selection.
<!-- included from "ABC-unit_components.wtxt", section: "notes-prerequisites" -->
+
This unit builds on material covered in the following prerequisite units:<br />
You need to complete the following units before beginning this one:
 
 
*[[BIN-Storing_data|BIN-Storing_data (Storing Data)]]
 
*[[BIN-Storing_data|BIN-Storing_data (Storing Data)]]
 
*[[BIN-Sequence|BIN-Sequence (Sequence)]]
 
*[[BIN-Sequence|BIN-Sequence (Sequence)]]
 
*[[BIN-SX-Concepts|BIN-SX-Concepts (Concepts of Molecular Structure)]]
 
*[[BIN-SX-Concepts|BIN-SX-Concepts (Concepts of Molecular Structure)]]
 +
<section end=prerequisites />
 +
<!-- ============================  -->
 +
</div>
  
{{Vspace}}
+
{{Smallvspace}}
 
 
 
 
=== Objectives ===
 
<!-- included from "../components/FND-Homology.components.wtxt", section: "objectives" -->
 
This unit will ...
 
* ... introduce the concept of homology, define orthologues and paralogues and discuss reasons for and consequences of gene conservation;
 
* ... explore public database resources to find orthologues by BLAST and in pre-annotated databases.
 
 
 
{{Vspace}}
 
  
  
=== Outcomes ===
 
<!-- included from "../components/FND-Homology.components.wtxt", section: "outcomes" -->
 
After working through this unit you ...
 
* ... define "homology", "orthologue" and "paralogue", and use the terms correctly, and with a precise understanding of their meaning and implications;
 
* ... are familar with issues around the definition of homologous genes and domains;
 
* ... know about sequence similarity and other measures that can identify related proteins and be able to use this to define your own exploratory strategies;
 
* ... have identified the RBM for the ''saccharomyces cerevisiae'' Mbp1 gene in MYSPE and explored other databses that make pre-annotated relatedness information available.
 
  
{{Vspace}}
+
{{Smallvspace}}
  
  
=== Deliverables ===
+
__TOC__
<!-- included from "../components/FND-Homology.components.wtxt", section: "deliverables" -->
 
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-time_management" -->
 
*<b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
 
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-journal" -->
 
*<b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.
 
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-insights" -->
 
*<b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].
 
  
 
{{Vspace}}
 
{{Vspace}}
Line 88: Line 74:
  
 
=== Evaluation ===
 
=== Evaluation ===
<!-- included from "../components/FND-Homology.components.wtxt", section: "evaluation" -->
 
<!-- included from "ABC-unit_components.wtxt", section: "eval-none" -->
 
 
<b>Evaluation: NA</b><br />
 
<b>Evaluation: NA</b><br />
:This unit is not evaluated for course marks.
+
<div style="margin-left: 2rem;">This unit is not evaluated for course marks.</div>
 
 
{{Vspace}}
 
 
 
 
 
</div>
 
<div id="BIO">
 
 
== Contents ==
 
== Contents ==
<!-- included from "../components/FND-Homology.components.wtxt", section: "contents" -->
 
  
 
{{Task|1=
 
{{Task|1=
Line 112: Line 89:
 
;Consider!
 
;Consider!
  
In the [[BIN-Storing_data]] unit you have found the protein of MYSPE that is most similar to yeast Mbp1, in MYSPE. Consider if this protein is homologous to the yeast protein. For most of these questions, you will probably not know the answer right now, but we will find out more in later units.
+
In the [[BIN-Storing_data]] unit you have found the protein of MYSPE that is '''most similar''' to yeast Mbp1. Now we consider whether this protein is '''homologous''' to the yeast protein.
  
 
* Are the sequences similar?
 
* Are the sequences similar?
:Obviously you have found the MYSPE sequence as a result of a BLAST search and you probably known that BLAST finds similar sequences in large databases. But it will usually always find ''something'', and that could be a chance similarity. '''Significant''' similarity would be very high, would extend over the whole length of the protein, could be restricted to individual domains. When would you say: similar enough?
+
:Obviously you have found the MYSPE sequence as a result of a BLAST search and you probably know that BLAST finds similar sequences in large databases. But it will almost always find ''something'', and that could be a chance similarity. '''Significant''' similarity could be very high, could extend over the whole length of the protein, or could be restricted to individual domains. When would one say: similar enough?
  
 
* Do the proteins have similar structures?
 
* Do the proteins have similar structures?
:If your protein happens to have had a part of its structure analyzed by X-ray crystallography, you could compare the structures. However, this is unlikely for the Mbp1 relatives - except for the ankyrin domains. These are ubiquitous protein-protein interaction motifs and won't tell us much more than that. It's unlikely that other (parts of) the MYSPE protein structure are known.
+
:If your protein happens to have had a part of its structure analyzed by X-ray crystallography, you could compare the structures. However, this is unlikely for the Mbp1 relatives.
  
 
* What about patterns of conserved residues?
 
* What about patterns of conserved residues?
Line 124: Line 101:
  
 
* Are the proteins known to perform similar functions?
 
* Are the proteins known to perform similar functions?
:That might require function prediction. There might be an annotation in the FASTA header of the MYSPE protein - but it's likely to be made based on homology to the yeast protein. Could be experimental evidence though - check carefully, just in case.
+
:That might require function prediction. There might be an annotation in the FASTA header of the MYSPE protein - but most likely the annotation just says: inferred by similarity to the yeast protein (i.e. annotation transfer). There ''could'' be experimental evidence though - check carefully, just in case.
  
All of these considerations lead to bioinformatics queries that we will pursue in later units.
+
All of these considerations can be translated into bioinformatics queries that we will pursue in later units.
  
 
{{Vspace}}
 
{{Vspace}}
Line 138: Line 115:
  
 
;Orthologs by RBM (Reciprocal Best Match)
 
;Orthologs by RBM (Reciprocal Best Match)
:The RBM criterion is only an approximation to orthology, but computationally very tractable and usually correct. To find an RBM, first search for the best match of a gene in the target genome, then check whether that best match retrieves the original query when it used to serach in the source genome. You have already done the first step when you identified the best match of yeast Mbp1 in MYSPE. Now do the second step.
+
:The RBM criterion is only an approximation to orthology, but computationally very tractable and usually correct<ref>{{#pmid:23160176}}</ref>. To find an RBM, first search for the best match of a gene in the target genome, then check whether that best match retrieves the original query when it is used to search in the source genome. You have already done the first step when you identified the best match of yeast Mbp1 in MYSPE. Now do the second step:
  
 
Get the ID for the gene which you have identified and annotated as the best BLAST match for Mbp1 in MYSPE and confirm that this gene has Mbp1 as the most significant hit in the yeast proteome. <small>The results are unambiguous, but there may be residual doubt whether these two best-matching sequences are actually the most similar orthologs.</small>
 
Get the ID for the gene which you have identified and annotated as the best BLAST match for Mbp1 in MYSPE and confirm that this gene has Mbp1 as the most significant hit in the yeast proteome. <small>The results are unambiguous, but there may be residual doubt whether these two best-matching sequences are actually the most similar orthologs.</small>
 +
 +
;Again, the RBM workflow:
 +
To find the RBM of <tt>gene-1</tt> of species ''A'' in species ''B'' ...
 +
: With a BLAST search, find the best match to <tt>gene-1</tt> in species ''B''. Let that be "<tt>gene-2</tt>".
 +
: With a BLAST search, find the best match to <tt>gene-2</tt> in species ''A''.
 +
: If that match is again <tt>gene-1</tt> the "RBM" has been confirmed.
 +
  
 
{{task|1=
 
{{task|1=
 +
 +
Perform the second step of the RBM workflow:
 +
 
# Navigate to the BLAST homepage and access the protein BLAST page.
 
# Navigate to the BLAST homepage and access the protein BLAST page.
 
# Copy the RefSeq identifier for MBP1_MYSPE from your journal into the search field (You can search directly with an NCBI identifier '''IF''' you want to search with the full-length sequence.)
 
# Copy the RefSeq identifier for MBP1_MYSPE from your journal into the search field (You can search directly with an NCBI identifier '''IF''' you want to search with the full-length sequence.)
 
# Set the database to refseq;
 
# Set the database to refseq;
#  restrict the species to ''Saccharomyces cerevisiae''.
+
#  restrict the species to ''Saccharomyces cerevisiae S288C''.
 
# Run BLAST.
 
# Run BLAST.
 
# Keep the window open for the next task.
 
# Keep the window open for the next task.
  
The top hit should be yeast Mbp1 (NP_010227). Discuss on the list if it is not.
+
The top hit should be yeast Mbp1 (NP_010227). Discuss on the board if it is not.
  
If the top hit is NP_010227, you have confirmed the '''RBM''' or '''BBM''' criterion (Reciprocal Best Match or Bidirectional Best Hit, respectively).
+
If the top hit is NP_010227, you have confirmed the '''RBM''' criterion (Reciprocal Best Match).
  
 
}}
 
}}
Line 160: Line 147:
 
Explain to someone you know why '''RBM''' is expected to find orthologous pairs of genes. Don't paraphrase the fact that they do, or merely describe how an RBM analysis works, but explain '''why''' we can expect it to be successful in identifying an evolutionary relationship when all we have are measures of pairwise similarity.
 
Explain to someone you know why '''RBM''' is expected to find orthologous pairs of genes. Don't paraphrase the fact that they do, or merely describe how an RBM analysis works, but explain '''why''' we can expect it to be successful in identifying an evolutionary relationship when all we have are measures of pairwise similarity.
  
If you can't figure it out, ask on the mailing list.
+
If you can't figure it out, ask on the Discussion board.
  
 
}}
 
}}
Line 166: Line 153:
  
 
;Orthology by annotation
 
;Orthology by annotation
:The NCBI precomputes gropus of related genes and makes them available via the HomoloGene dtatabase from the RefSeq database entry for your protein.
+
:The NCBI precomputes groups of related genes and makes them available via the HomoloGene database from the RefSeq database entry for your protein.
  
 
{{task|1=
 
{{task|1=
 
# Navigate to the RefSeq protein page for MBP1_MYSPE. (There should be a link from the query identifier in your BLAST result page).
 
# Navigate to the RefSeq protein page for MBP1_MYSPE. (There should be a link from the query identifier in your BLAST result page).
# Follow the '''Homologene''' link in the right-hand menu under '''Related information'''.
+
# Follow the '''Homologene''' link in the right-hand menu under '''Related information'''. (Follow the [https://www.ncbi.nlm.nih.gov/homologene/?term=NP_010227 '''link to MBP1_SACCE'''] if your species has not been annotated and there is no Homologene link from your protein's page.)
  
 
You should see a number of genes that are considered homologous in other fungi, but there is no way to tell whether these are orthologues, and the links to proteins with shared domains show you that there are several that share (non-specific) ankyrin domains, and only a few that also have the (highly specific) Kila-N (or APSES) domain.
 
You should see a number of genes that are considered homologous in other fungi, but there is no way to tell whether these are orthologues, and the links to proteins with shared domains show you that there are several that share (non-specific) ankyrin domains, and only a few that also have the (highly specific) Kila-N (or APSES) domain.
Line 190: Line 177:
  
 
;Orthologs at OrthoDB
 
;Orthologs at OrthoDB
:[http://www.orthodb.org/ '''OrthoDB'''] includes a large number of species, among them all of our protein-sequenced fungi. However the search function (by keyword - try "Mbp1") retrieves many paralogs together with the orthologs, for example, the yeast Soc2 and Phd1 proteins are found in the same orthologous group these two are clearly paralogs and again results focus on ankyrin-domain containing proteins.
+
:[http://www.orthodb.org/ '''OrthoDB'''] includes a large number of species, among them all of our protein-sequenced fungi. However, the search function (by keyword - try "Mbp1") retrieves many paralogs together with the orthologs. For example, the yeast Soc2 and Phd1 proteins are found in the same orthologous group, although these two are clearly paralogs. Again, the results are overloaded with ankyrin-domain containing proteins.
 
<div class="mw-collapsible mw-collapsed" data-expandtext="more..." data-collapsetext="less" style="width:800px">
 
<div class="mw-collapsible mw-collapsed" data-expandtext="more..." data-collapsetext="less" style="width:800px">
 
&nbsp;
 
&nbsp;
Line 202: Line 189:
  
 
;Orthologs at OMA
 
;Orthologs at OMA
[http://omabrowser.org/ '''OMA'''] (the Orthologous Matrix) maintained at the Swiss Federal Institute of Technology contains a large number of orthologs from sequenced genomes. Searching with the refseq identifier of MBP1_MYSPE will probably retrieve hits that you can access via the "Orthologs" tab. As a whole this database is well constructed, the output is useful, and data is available for download and API access; this would be the resource of my first choice for pre-computed orthology queries.
+
[http://omabrowser.org/ '''OMA'''] (the Orthologous Matrix) maintained at the Swiss Federal Institute of Technology contains a large number of orthologs from sequenced genomes. Searching with the refseq identifier of MBP1_MYSPE may retrieve hits that you can access via the "Orthologs" tab (If not, try yeast Mbp1 [https://omabrowser.org/oma/info/YEAST00907/ <tt>NP_010227</tt>]). As a whole this database is well constructed, the output is useful, and data is available for download and API access; this would be the resource of my first choice for pre-computed orthology queries.
  
 
<div class="mw-collapsible mw-collapsed" data-expandtext="more..." data-collapsetext="less" style="width:800px">
 
<div class="mw-collapsible mw-collapsed" data-expandtext="more..." data-collapsetext="less" style="width:800px">
Line 221: Line 208:
  
  
 
 
 
{{Vspace}}
 
 
 
== Further reading, links and resources ==
 
{{#pmid: 16285863}}
 
<!-- {{WWW|WWW_GMOD}} -->
 
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 
 
{{Vspace}}
 
  
  
 
== Notes ==
 
== Notes ==
<!-- included from "../components/FND-Homology.components.wtxt", section: "notes" -->
 
<!-- included from "ABC-unit_components.wtxt", section: "notes" -->
 
 
<references />
 
<references />
  
 
{{Vspace}}
 
{{Vspace}}
  
 
</div>
 
<div id="ABC-unit-framework">
 
== Self-evaluation ==
 
<!-- included from "../components/FND-Homology.components.wtxt", section: "self-evaluation" -->
 
<!--
 
=== Question 1===
 
 
Question ...
 
 
<div class="toccolours mw-collapsible mw-collapsed" style="width:800px">
 
Answer ...
 
<div class="mw-collapsible-content">
 
Answer ...
 
 
</div>
 
  </div>
 
 
  {{Vspace}}
 
 
-->
 
 
{{Vspace}}
 
 
 
 
{{Vspace}}
 
 
 
<!-- included from "ABC-unit_components.wtxt", section: "ABC-unit_ask" -->
 
 
----
 
 
{{Vspace}}
 
 
<b>If in doubt, ask!</b> If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
 
 
----
 
 
{{Vspace}}
 
  
 
<div class="about">
 
<div class="about">
Line 291: Line 224:
 
:2017-08-05
 
:2017-08-05
 
<b>Modified:</b><br />
 
<b>Modified:</b><br />
:2017-09-30
+
:2020-09-23
 
<b>Version:</b><br />
 
<b>Version:</b><br />
:1.0
+
:1.0.1
 
<b>Version history:</b><br />
 
<b>Version history:</b><br />
 +
*1.0.1 2020 Maintenance
 
*1.0 First live version
 
*1.0 First live version
 
*0.1 First stub
 
*0.1 First stub
 
</div>
 
</div>
[[Category:ABC-units]]
 
<!-- included from "ABC-unit_components.wtxt", section: "ABC-unit_footer" -->
 
  
 
{{CC-BY}}
 
{{CC-BY}}
  
 +
[[Category:ABC-units]]
 +
{{UNIT}}
 +
{{LIVE}}
 
</div>
 
</div>
 
<!-- [END] -->
 
<!-- [END] -->

Latest revision as of 23:39, 23 September 2020

Concepts and Consequences of Homology

(Concepts of homology; Orthologs; Paralogs)


 


Abstract:

Homology is the most important concept for bioinformatics, since shared ancestry allows many inferences about the structure and function of proteins. This unit introduces the concept and explores MBP1_MYSPE relationships.


Objectives:
This unit will ...

  • ... introduce the concept of homology, define orthologues and paralogues and discuss reasons for and consequences of gene conservation;
  • ... explore public database resources to find orthologues by BLAST and in pre-annotated databases.

Outcomes:
After working through this unit you ...

  • ... define "homology", "orthologue" and "paralogue", and use the terms correctly, and with a precise understanding of their meaning and implications;
  • ... are familiar with issues around the definition of homologous genes and domains;
  • ... know about sequence similarity and other measures that can identify related proteins and are able to use this to define your own exploratory strategies;
  • ... have identified the RBM for the Saccharomyces cerevisiae Mbp1 gene in MYSPE and explored other databases that make pre-annotated relatedness information available.

Deliverables:

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

  • Prerequisites:
    You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:

    • Biomolecules: The molecules of life; nucleic acids and amino acids; the genetic code; protein folding; post-translational modifications and protein biochemistry; membrane proteins; biological function.
    • The Central Dogma: Regulation of transcription and translation; protein biosynthesis and degradation; quality control.
    • Evolution: Theory of evolution; variation, neutral drift and selection.

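    This unit builds on material covered in the following prerequisite units:

    • BIN-Storing_data (Storing Data)
    • BIN-Sequence (Sequence)
    • BIN-SX-Concepts (Concepts of Molecular Structure)
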
    Evaluation

    Evaluation: NA

    This unit is not evaluated for course marks.

    Contents

    Task:


    Considerations for the MYSPE "Mbp1"

     
    Consider!

    In the BIN-Storing_data unit you have found the protein of MYSPE that is most similar to yeast Mbp1. Now we consider whether this protein is homologous to the yeast protein.

    • Are the sequences similar?
    Obviously you have found the MYSPE sequence as a result of a BLAST search and you probably know that BLAST finds similar sequences in large databases. But it will almost always find something, and that could be a chance similarity. Significant similarity could be very high, could extend over the whole length of the protein, or could be restricted to individual domains. When would one say: similar enough?
    • Do the proteins have similar structures?
    If your protein happens to have had a part of its structure analyzed by X-ray crystallography, you could compare the structures. However, this is unlikely for the Mbp1 relatives.
    • What about patterns of conserved residues?
    We need more proteins to consider that - and we need to align them.
    • Are the proteins known to perform similar functions?
    That might require function prediction. There might be an annotation in the FASTA header of the MYSPE protein - but most likely the annotation just says: inferred by similarity to the yeast protein (i.e. annotation transfer). There could be experimental evidence though - check carefully, just in case.

    All of these considerations can be translated into bioinformatics queries that we will pursue in later units.
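As a first, hedged illustration of the "Are the sequences similar?" question, the sketch below computes two numbers that are commonly reported for a pairwise alignment: percent identity and query coverage. The function name, the toy alignment and the choice of denominator (columns without gaps) are assumptions made for this example; different tools define identity slightly differently, and neither number alone decides whether two proteins are homologous.

```python
def alignment_stats(aligned_query, aligned_subject, query_length):
    """Return (percent identity, percent query coverage) for one alignment.

    aligned_query / aligned_subject: equal-length strings, "-" marks a gap.
    query_length: length of the full, unaligned query sequence.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if query_length <= 0:
        raise ValueError("query_length must be positive")

    aligned_columns = 0   # columns with a residue in both sequences
    identical = 0         # aligned columns where the residues are the same
    query_residues = 0    # query residues that take part in the alignment

    for q, s in zip(aligned_query, aligned_subject):
        if q != "-":
            query_residues += 1
        if q != "-" and s != "-":
            aligned_columns += 1
            if q == s:
                identical += 1

    identity = 100.0 * identical / aligned_columns if aligned_columns else 0.0
    coverage = 100.0 * query_residues / query_length
    return identity, coverage


# Toy alignment (made up): 6 query residues aligned out of a 12-residue query.
identity, coverage = alignment_stats("MKV-LSA", "MRVQLS-", query_length=12)
print(identity, coverage)
```

A high identity over a short stretch (low coverage) may point to a single shared domain rather than to similarity over the whole protein, which is exactly the distinction raised in the question above.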


     


    Defining orthologs

    For functional inference between organisms, the key is to find orthologs.

    To be reasonably certain about orthology relationships, one needs to construct and analyze detailed evolutionary trees. This is computationally expensive and the results are not always unambiguous. But a number of different strategies are available that use approximations, or precomputed results to define orthologs. These are especially useful for large, cross genome surveys. They are less useful for detailed analysis of individual genes.
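The tree-based definition can be sketched in a few lines, assuming that a gene tree is already available and that every internal node has already been labelled as a speciation or a duplication event (producing those labels is the expensive and sometimes ambiguous step mentioned above). Two genes are then orthologs if their last common ancestor node is a speciation, and paralogs if it is a duplication. The `Node` class and the toy tree are hypothetical and exist only for this example.

```python
class Node:
    """A node of a labelled gene tree (illustrative data structure)."""

    def __init__(self, event=None, children=(), gene=None):
        self.event = event              # "speciation" or "duplication" (internal nodes)
        self.children = list(children)  # empty for leaves
        self.gene = gene                # gene label for leaves, None otherwise


def leaves(node):
    """Return the set of gene labels below (and including) this node."""
    if node.gene is not None:
        return {node.gene}
    found = set()
    for child in node.children:
        found |= leaves(child)
    return found


def relationship(root, gene_x, gene_y):
    """Classify two different genes by the event at their last common ancestor."""
    all_genes = leaves(root)
    if gene_x == gene_y or gene_x not in all_genes or gene_y not in all_genes:
        raise ValueError("need two different genes that are both in the tree")
    node = root
    while True:
        # Descend as long as one child still contains both genes.
        for child in node.children:
            below = leaves(child)
            if gene_x in below and gene_y in below:
                node = child
                break
        else:
            return "orthologs" if node.event == "speciation" else "paralogs"


# Toy tree: an ancestral duplication created families P and Q,
# then a speciation separated species A and B in each family.
tree = Node("duplication", [
    Node("speciation", [Node(gene="A_P"), Node(gene="B_P")]),
    Node("speciation", [Node(gene="A_Q"), Node(gene="B_Q")]),
])

print(relationship(tree, "A_P", "B_P"))  # same family, different species
print(relationship(tree, "A_P", "A_Q"))  # same species, different families
print(relationship(tree, "A_P", "B_Q"))  # different species, but diverged by duplication
```

The last case shows why "different species" alone does not make two genes orthologs: the event that separated them was the duplication, not the speciation.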

    Orthologs by RBM (Reciprocal Best Match)
    The RBM criterion is only an approximation to orthology, but computationally very tractable and usually correct[1]. To find an RBM, first search for the best match of a gene in the target genome, then check whether that best match retrieves the original query when it is used to search in the source genome. You have already done the first step when you identified the best match of yeast Mbp1 in MYSPE. Now do the second step:

    Get the ID for the gene which you have identified and annotated as the best BLAST match for Mbp1 in MYSPE and confirm that this gene has Mbp1 as the most significant hit in the yeast proteome. The results are unambiguous, but there may be residual doubt whether these two best-matching sequences are actually the most similar orthologs.

    Again, the RBM workflow

    To find the RBM of gene-1 of species A in species B ...

    With a BLAST search, find the best match to gene-1 in species B. Let that be "gene-2".
    With a BLAST search, find the best match to gene-2 in species A.
    If that match is again gene-1 the "RBM" has been confirmed.
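The three steps above can be sketched as a small function. This is a minimal sketch, not a BLAST client: it assumes the two searches have already been run and that their results are available as dictionaries that map each query to its hits and scores. All gene names and scores below are made up for illustration.

```python
def best_match(hits, query):
    """Return the ID of the highest-scoring hit for `query`, or None if there is none."""
    candidates = hits.get(query, {})
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def is_rbm(gene_1, a_to_b, b_to_a):
    """Check the RBM criterion for gene_1 of species A against species B.

    a_to_b: search results of species-A queries against species B
    b_to_a: search results of species-B queries against species A
    Returns (confirmed, gene_2), where gene_2 is the forward best match.
    """
    gene_2 = best_match(a_to_b, gene_1)   # step 1: best match in species B
    if gene_2 is None:
        return False, None
    back = best_match(b_to_a, gene_2)     # step 2: best match of gene_2 in species A
    return back == gene_1, gene_2         # step 3: reciprocal if we are back at gene_1


# Toy scores (made up): query -> {hit: score}
A_TO_B = {
    "A_gene1": {"B_gene7": 410.0, "B_gene3": 95.0},
    "A_gene2": {"B_gene3": 120.0},
}
B_TO_A = {
    "B_gene7": {"A_gene1": 405.0, "A_gene2": 88.0},
    "B_gene3": {"A_gene1": 130.0, "A_gene2": 118.0},
}

print(is_rbm("A_gene1", A_TO_B, B_TO_A))
print(is_rbm("A_gene2", A_TO_B, B_TO_A))
```

In the toy data, `A_gene2` finds `B_gene3` as its best match, but `B_gene3` is more similar to `A_gene1`, so the match is not reciprocal and the RBM criterion is not met.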


    Task:
    Perform the second step of the RBM workflow:

    1. Navigate to the BLAST homepage and access the protein BLAST page.
    2. Copy the RefSeq identifier for MBP1_MYSPE from your journal into the search field (You can search directly with an NCBI identifier IF you want to search with the full-length sequence.)
    3. Set the database to refseq;
    4. restrict the species to Saccharomyces cerevisiae S288C.
    5. Run BLAST.
    6. Keep the window open for the next task.

    The top hit should be yeast Mbp1 (NP_010227). Discuss on the board if it is not.

    If the top hit is NP_010227, you have confirmed the RBM criterion (Reciprocal Best Match).
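If you prefer to check the top hit programmatically, the sketch below shows one way to do it. It assumes that the search results were saved as a tab-separated hit table with the twelve default columns of the BLAST+ tabular format (`-outfmt 6`); the query identifier `XP_000000.1`, the second subject `NP_999999.1` and all numbers in the example are placeholders, not real results.

```python
import csv
import io

# Default columns of the BLAST+ tabular format (-outfmt 6)
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def top_hit(tabular_text):
    """Return the hit (as a dict) with the highest bit score, or None if there are no hits."""
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=COLUMNS, delimiter="\t")
    rows = [row for row in reader
            if row["qseqid"] and not row["qseqid"].startswith("#")]  # skip comments
    if not rows:
        return None
    return max(rows, key=lambda row: float(row["bitscore"]))


# Placeholder hit table: identifiers and numbers are invented for this example.
EXAMPLE = (
    "XP_000000.1\tNP_999999.1\t35.2\t310\t180\t9\t5\t300\t12\t320\t2e-40\t150\n"
    "XP_000000.1\tNP_010227.1\t48.7\t650\t290\t14\t1\t640\t1\t630\t1e-150\t520\n"
)

hit = top_hit(EXAMPLE)
accession = hit["sseqid"].split(".")[0]   # drop the version suffix
print(accession, accession == "NP_010227")
```

Selecting by bit score rather than by row order keeps the check independent of how the table happens to be sorted.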

    Task:
    Explain to someone you know why RBM is expected to find orthologous pairs of genes. Don't paraphrase the fact that they do, or merely describe how an RBM analysis works, but explain why we can expect it to be successful in identifying an evolutionary relationship when all we have are measures of pairwise similarity.

    If you can't figure it out, ask on the Discussion board.


    Orthology by annotation
    The NCBI precomputes groups of related genes and makes them available via the HomoloGene database from the RefSeq database entry for your protein.

    Task:

    1. Navigate to the RefSeq protein page for MBP1_MYSPE. (There should be a link from the query identifier in your BLAST result page).
    2. Follow the Homologene link in the right-hand menu under Related information. (Follow the link to MBP1_SACCE if your species has not been annotated and there is no Homologene link from your protein's page.)

    You should see a number of genes that are considered homologous in other fungi, but there is no way to tell whether these are orthologues, and the links to proteins with shared domains show you that there are several that share (non-specific) ankyrin domains, and only a few that also have the (highly specific) Kila-N (or APSES) domain.


    Orthologs by eggNOG
    The eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) database at the EMBL contains orthologous groups of genes. It seems to be continuously updated, and the search functionality is reasonable. Try the search with the MBP1_MYSPE refseq identifier. What I see are orthologs annotated in non-fungi, but the relationship is to the ankyrin domain, which is meaningless. Alignments and trees are also available, as are database downloads for algorithmic analysis.

     

    Powell et al. (2014) eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Res 42:D231-9. (pmid: 24297252)

    With the increasing availability of various 'omics data, high-quality orthology assignment is crucial for evolutionary and functional genomics studies. We here present the fourth version of the eggNOG database (available at http://eggnog.embl.de) that derives nonsupervised orthologous groups (NOGs) from complete genomes, and then applies a comprehensive characterization and analysis pipeline to the resulting gene families. Compared with the previous version, we have more than tripled the underlying species set to cover 3686 organisms, keeping track with genome project completions while prioritizing the inclusion of high-quality genomes to minimize error propagation from incomplete proteome sets. Major technological advances include (i) a robust and scalable procedure for the identification and inclusion of high-quality genomes, (ii) provision of orthologous groups for 107 different taxonomic levels compared with 41 in eggNOGv3, (iii) identification and annotation of particularly closely related orthologous groups, facilitating analysis of related gene families, (iv) improvements of the clustering and functional annotation approach, (v) adoption of a revised tree building procedure based on the multiple alignments generated during the process and (vi) implementation of quality control procedures throughout the entire pipeline. As in previous versions, eggNOGv4 provides multiple sequence alignments and maximum-likelihood trees, as well as broad functional annotation. Users can access the complete database of orthologous groups via a web interface, as well as through bulk download.


    Orthologs at OrthoDB
    OrthoDB includes a large number of species, among them all of our protein-sequenced fungi. However, the search function (by keyword: try "Mbp1") retrieves many paralogs together with the orthologs. For example, the yeast Sok2 and Phd1 proteins are found in the same orthologous group, although these two are clearly paralogs. Again, the results are overloaded with ankyrin-domain containing proteins.

     

    Waterhouse et al. (2013) OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs. Nucleic Acids Res 41:D358-65. (pmid: 23180791)

    The concept of orthology provides a foundation for formulating hypotheses on gene and genome evolution, and thus forms the cornerstone of comparative genomics, phylogenomics and metagenomics. We present the update of OrthoDB-the hierarchical catalog of orthologs (http://www.orthodb.org). From its conception, OrthoDB promoted delineation of orthologs at varying resolution by explicitly referring to the hierarchy of species radiations, now also adopted by other resources. The current release provides comprehensive coverage of animals and fungi representing 252 eukaryotic species, and is now extended to prokaryotes with the inclusion of 1115 bacteria. Functional annotations of orthologous groups are provided through mapping to InterPro, GO, OMIM and model organism phenotypes, with cross-references to major resources including UniProt, NCBI and FlyBase. Uniquely, OrthoDB provides computed evolutionary traits of orthologs, such as gene duplicability and loss profiles, divergence rates, sibling groups, and now extended with exon-intron architectures, syntenic orthologs and parent-child trees. The interactive web interface allows navigation along the species phylogenies, complex queries with various identifiers, annotation keywords and phrases, as well as with gene copy-number profiles and sequence homology searches. With the explosive growth of available data, OrthoDB also provides mapping of newly sequenced genomes and transcriptomes to the current orthologous groups.


    Orthologs at OMA

    OMA (the Orthologous Matrix), maintained at the Swiss Federal Institute of Technology, contains a large number of orthologs from sequenced genomes. Searching with the RefSeq identifier of MBP1_MYSPE may retrieve hits that you can access via the "Orthologs" tab (if not, try yeast Mbp1, NP_010227). As a whole, this database is well constructed, the output is useful, and the data are available for download and via API access. This would be my first choice of resource for pre-computed orthology queries.

     

    Altenhoff et al. (2011) OMA 2011: orthology inference among 1000 complete genomes. Nucleic Acids Res 39:D289-94. (pmid: 21113020)

    OMA (Orthologous MAtrix) is a database that identifies orthologs among publicly available, complete genomes. Initiated in 2004, the project is at its 11th release. It now includes 1000 genomes, making it one of the largest resources of its kind. Here, we describe recent developments in terms of species covered; the algorithmic pipeline--in particular regarding the treatment of alternative splicing, and new features of the web (OMA Browser) and programming interface (SOAP API). In the second part, we review the various representations provided by OMA and their typical applications. The database is publicly accessible at http://omabrowser.org.

    See also the related articles: the Dessimoz group has done much innovative and careful work on automated ortholog definition.
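    For programmatic access, a query might look like the following minimal sketch. It assumes that the OMA Browser offers a REST interface under https://omabrowser.org/api with a /protein/<id>/orthologs/ endpoint that returns a JSON list, and that each record carries a "rel_type" field; both the endpoint path and the field name are assumptions to check against the current OMA API documentation. Whether a RefSeq identifier such as NP_010227 resolves directly is also not guaranteed, so an OMA identifier may be needed instead. The function names are hypothetical.

```python
import json
from collections import Counter
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the OMA Browser REST API; verify before use.
OMA_API = "https://omabrowser.org/api"


def orthologs_url(protein_id):
    """Build the (assumed) URL that lists the orthologs of one protein."""
    return f"{OMA_API}/protein/{quote(protein_id, safe='')}/orthologs/"


def fetch_orthologs(protein_id, timeout=30):
    """Download and decode the JSON ortholog list (needs network access)."""
    with urlopen(orthologs_url(protein_id), timeout=timeout) as response:
        return json.load(response)


def count_relation_types(records, key="rel_type"):
    """Tally records by relation type, e.g. 1:1 versus many:1.

    The field name "rel_type" is an assumption about the JSON layout;
    records without that field are counted as "unknown".
    """
    return Counter(rec.get(key, "unknown") for rec in records)
```

    Calling count_relation_types(fetch_orthologs("NP_010227")) would then give a quick overview of how many of the retrieved orthologs are one-to-one, provided the assumptions above hold.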


    Orthologs by syntenic gene order conservation
    OMA also provides synteny information, one hallmark of an orthologous relationship (Why?).
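    One way to make the synteny criterion concrete is to ask whether the genes that flank a candidate ortholog pair are themselves orthologs of each other. The sketch below is only an illustration under simplifying assumptions: each genome is a simple ordered list of unique gene names, the ortholog map for the flanking genes is already known, and all gene names and function names are hypothetical. It is not how OMA computes its synteny information.

```python
def neighbours(order, gene):
    """Return the (left, right) neighbours of gene in a linear gene order."""
    i = order.index(gene)
    left = order[i - 1] if i > 0 else None
    right = order[i + 1] if i < len(order) - 1 else None
    return left, right


def has_syntenic_support(order_a, order_b, orthologs, gene_a, gene_b):
    """True if both genes flanking gene_a have orthologs flanking gene_b.

    orthologs maps genes of genome A to genes of genome B. The comparison
    uses sets, so an inverted segment still counts as conserved.
    """
    left_a, right_a = neighbours(order_a, gene_a)
    flank_b = set(neighbours(order_b, gene_b))
    if left_a is None or right_a is None or None in flank_b:
        return False  # a gene at the end of the list has only one neighbour
    mapped = {orthologs.get(left_a), orthologs.get(right_a)}
    return mapped == flank_b


# Hypothetical toy data: the segment a2-a3-a4 is inverted in genome B.
order_a = ["a1", "a2", "a3", "a4", "a5"]
order_b = ["b9", "b4", "b3", "b2", "b7"]
orthologs = {"a1": "b1", "a2": "b2", "a4": "b4"}

supported = has_syntenic_support(order_a, order_b, orthologs, "a3", "b3")
unsupported = has_syntenic_support(order_a, order_b, orthologs, "a2", "b2")
```

    Here the pair a3/b3 is supported because its flanking genes a2 and a4 map to b2 and b4, whereas the pair a2/b2 is not, since its neighbourhood is not conserved in the toy data.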




    Notes

    1. Wolf & Koonin (2012) A tight link between orthologs and bidirectional best hits in bacterial and archaeal genomes. Genome Biol Evol 4:1286-94. (pmid: 23160176)

      Orthologous relationships between genes are routinely inferred from bidirectional best hits (BBH) in pairwise genome comparisons. However, to our knowledge, it has never been quantitatively demonstrated that orthologs form BBH. To test this "BBH-orthology conjecture," we take advantage of the operon organization of bacterial and archaeal genomes and assume that, when two genes in compared genomes are flanked by two BBH show statistically significant sequence similarity to one another, these genes are bona fide orthologs. Under this assumption, we tested whether middle genes in "syntenic orthologous gene triplets" form BBH. We found that this was the case in more than 95% of the syntenic gene triplets in all genome comparisons. A detailed examination of the exceptions to this pattern, including maximum likelihood phylogenetic tree analysis, showed that some of these deviations involved artifacts of genome annotation, whereas very small fractions represented random assignment of the best hit to one of closely related in-paralogs, paralogous displacement in situ, or even less frequent genuine violations of the BBH-orthology conjecture caused by acceleration of evolution in one of the orthologs. We conclude that, at least in prokaryotes, genes for which independent evidence of orthology is available typically form BBH and, conversely, BBH can serve as a strong indication of gene orthology.


     


    About ...
     
    Author:

    Boris Steipe <boris.steipe@utoronto.ca>

    Created:

    2017-08-05

    Modified:

    2020-09-23

    Version:

    1.0.1

    Version history:

    • 1.0.1 2020 Maintenance
    • 1.0 First live version
    • 0.1 First stub

    This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.