Difference between revisions of "FND-Homology"

From "A B C"
m (Created page with "<div id="BIO"> <div class="b1"> Concepts and Consequences of Homology </div> {{Vspace}} <div class="keywords"> <b>Keywords:</b>  Concepts of homology </div> {{...")
 
m
Line 8: Line 8:
 
<div class="keywords">
 
<div class="keywords">
 
<b>Keywords:</b>&nbsp;
 
<b>Keywords:</b>&nbsp;
Concepts of homology
+
Concepts of homology; Orthologs; Paralogs
 
</div>
 
</div>
  
Line 19: Line 19:
  
  
{{STUB}}
+
{{DEV}}
  
 
{{Vspace}}
 
{{Vspace}}
Line 48: Line 48:
 
*[[BIN-Sequence]]
 
*[[BIN-Sequence]]
 
*[[BIN-SX-Concepts]]
 
*[[BIN-SX-Concepts]]
*[[FND-Genetic_code]]
 
  
 
{{Vspace}}
 
{{Vspace}}
Line 92: Line 91:
 
== Contents ==
 
== Contents ==
 
<!-- included from "../components/FND-Homology.components.wtxt", section: "contents" -->
 
<!-- included from "../components/FND-Homology.components.wtxt", section: "contents" -->
...
+
 
 +
{{Task|1=
 +
* Read the introductory notes on {{ABC-PDF|FND-Homology|concepts about "homology" of genes}}.
 +
}}
 +
 
 +
 
 +
===Selecting the YFO "Mbp1"===
 +
 
 +
{{Vspace}}
 +
 
 +
{{task|1=
 +
 
 +
# Back at the [http://www.ncbi.nlm.nih.gov/protein/NP_010227 Mbp1 protein page] follow the link to [http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins&PROGRAM=blastp&BLAST_PROGRAMS=blastp&QUERY=NP_010227.1&LINK_LOC=protein&PAGE_TYPE=BlastSearch Run BLAST...] under "Analyze this sequence".
 +
# This allows you to perform a sequence similarity search. You need to set two parameters:
 +
## As '''Database''', select '''Reference proteins (refseq_protein)''' from the drop down menu;
 +
## In the '''Organism''' field, type the species you have selected as YFO and select the corresponding taxonomy ID.
 +
# Click on '''Run BLAST''' to start the search. This should find a handful of genes, all of them in YFO. If you find none, or hundreds, or they are not all in the same species, you did something wrong. Ask on the mailing list and make sure to fix the problem.
 +
# Look at the top "hit" in the '''Descriptions''' section. The rightmost column contains sequence IDs under the '''Accession''' heading. The alignment and alignment score are shown in the '''Alignments''' section a bit further down the page. Look at the result.
 +
# In the header information for each hit is a link to its database entry, right next to '''Sequence ID'''.  It says something like <code>ref&#124;NP_123456789.1</code> or <code>ref&#124;XP_123456789</code> ... follow that link.
 +
# Note the RefSeq ID, and the search results %ID, E-value, whether one or more similar regions were found etc. in your Journal. We will refer to this sequence as "''YFO'' Mbp1" or similar in the future.
 +
# Finally access the [http://www.uniprot.org/uploadlists/ UniProt ID mapping service] to retrieve the UniProt ID for the protein. Paste the RefSeq ID and choose '''RefSeq Protein''' as the '''From:''' option and '''UniProtKB''' as the '''To:''' option.
 +
 
 +
:<small>If the mapping works, the UniProt ID will be in the '''Entry:''' column of the table that is being returned. Click the link and have a look at the UniProt entry page while you're there.</small>
 +
 
 +
<!-- What could go wrong? Sometimes the mapping does not work:
 +
I don't know why the mapping for some sequences is not available.
 +
If this happens, you can work around the problem as follows.
 +
 
 +
1. Load the RefSeq protein page
 +
2. View the protein as FASTA and copy the sequence.
 +
3. Open the UniProt BLAST page http://www.uniprot.org/blast/
 +
  (Yes, UniProt runs its own BLAST version, and that searches UniProt databases, not Genbank)
 +
4. Paste the sequence into the search form and run BLAST.
 +
 
 +
... if the sequence is in UniProt, you will get the top hit with 100% sequence identity. In your case it is:
 +
  H1VQK3  ( http://www.uniprot.org/uniprot/H1VQK3 )
 +
 
 +
I.e. UniProt contains the sequence, but the mapping service does not know.
 +
-->
 +
 
 +
 
 +
 
 +
}}
 +
 
 +
{{Vspace}}
 +
 
 +
 
 +
 
 +
==Defining orthologs==
 +
 
 +
To be reasonably certain about orthology relationships, we would need to construct and analyze detailed evolutionary trees. This is computationally expensive and the results are not always unambiguous either, as we will see in a later assignment. But a number of different strategies are available that use precomputed results to define orthologs. These are especially useful for large, cross genome surveys. They are less useful for detailed analysis of individual genes. Pay the sites a visit and try a search.
 +
 
 +
 
 +
;Orthologs by eggNOG
 +
:The [http://eggnog.embl.de/ '''eggNOG'''] (evolutionary genealogy of genes: Non-supervised Orthologous Groups) database at the EMBL contains orthologous groups of genes. It appears to be continuously updated; the search functionality is reasonable, and the results for yeast Mbp1 show many genes from several fungi. Importantly, there is only one gene annotated for each species. Alignments and trees are also available, as are database downloads for algorithmic analysis.
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="more..." data-collapsetext="less" style="width:800px">
 +
&nbsp;
 +
<div class="mw-collapsible-content">
 +
 
 +
{{#pmid: 24297252}}
 +
 
 +
</div>
 +
</div>
 +
 
 +
 
 +
;Orthologs at OrthoDB
 +
:[http://www.orthodb.org/ '''OrthoDB'''] includes a large number of species, among them all of our protein-sequenced fungi. However, the keyword search retrieves many paralogs together with the orthologs: for example, the yeast Sok2 and Phd1 proteins are found in the same orthologous group, although these two are clearly paralogs.
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="more..." data-collapsetext="less" style="width:800px">
 +
&nbsp;
 +
<div class="mw-collapsible-content">
 +
 
 +
{{#pmid: 23180791}}
 +
 
 +
</div>
 +
</div>
 +
 
 +
 
 +
;Orthologs at OMA
 +
:[http://omabrowser.org/ '''OMA'''] (the Orthologous Matrix), maintained at the Swiss Federal Institute of Technology, contains a large number of orthologs from sequenced genomes. Searching with <code>MBP1_YEAST</code> (this is the Swiss-Prot ID) as a "Group" search finds the correct gene in EREGO, KLULA, CANGL and SACCE. But searching with the sequence of the ''Ustilago maydis'' ortholog does not find the yeast protein; instead it finds the orthologs in YARLI, SCHPO, LACCBI, CRYNE and USTMA. Apparently the orthologous group has been split into several subgroups across the fungi. As a whole, however, the database is carefully constructed and available for download and API access: a large and useful resource.
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="more..." data-collapsetext="less" style="width:800px">
 +
&nbsp;
 +
<div class="mw-collapsible-content">
 +
 
 +
{{#pmid: 21113020}}
 +
 
 +
... see also the related articles, much innovative and carefully done work on automated orthologue definition by the Dessimoz group.
 +
</div>
 +
</div>
 +
 
 +
 
 +
;Orthologs by syntenic gene order conservation
 +
:We will revisit this when we explore the UCSC genome browser.
 +
 
 +
 
 +
;Orthologs by RBM
 +
:Defining it yourself. RBM (or: Reciprocal Best Match) is easy to compute and half of the work you have already done in [[BIO_Assignment_Week_3|Assignment 3]]. Get the ID for the gene which you have identified and annotated as the best BLAST match for Mbp1 in YFO and confirm that this gene has Mbp1 as the most significant hit in the yeast proteome. <small>The results are unambiguous, but there may be residual doubt whether these two best-matching sequences are actually the most similar orthologs.</small>
 +
 
 +
{{task|1=
 +
# Navigate to the BLAST homepage.
 +
# Paste the YFO RefSeq sequence identifier into the search field. (You don't have to search with sequences&ndash;you can search directly with an NCBI identifier '''IF''' you want to search with the full-length sequence.)
 +
# Set the database to refseq, and restrict the species to ''Saccharomyces cerevisiae''.
 +
# Run BLAST.
 +
# Keep the window open for the next task.
 +
 
 +
The top hit should be yeast Mbp1 (NP_010227). Email me your sequence identifiers if it is not.
 +
If it is, you have confirmed the '''RBM''' or '''BBH''' criterion (Reciprocal Best Match or Bidirectional Best Hit, respectively).
 +
 
 +
<small>Technically, this is not perfectly true since you have searched with the APSES domain in one direction, with the full-length sequence in the other. For this task I wanted you to try the ''search-with-accession-number''. Therefore the procedural laxness, I hope it is permissible. In fact, performing the reverse search with the YFO APSES domain should actually be more stringent, i.e. if you find the right gene with the longer sequence, you are even more likely to find the right gene with the shorter one.</small>
 +
}}
 +
 
 +
 
 +
;Orthology by annotation
 +
:The NCBI precomputes BLAST results and makes them available at the RefSeq database entry for your protein.
 +
 
 +
{{task|1=
 +
# In your BLAST result page, click on the RefSeq link for your query to navigate to the RefSeq database entry for your protein.
 +
# Follow the '''Blink''' link in the right-hand column under '''Related information'''.
 +
# Under "Display options", restrict the view to RefSeq and to Fungi.
 +
 
 +
You should see a number of genes with low E-values and high coverage in other fungi. However, this search is problematic: searching the database with the full-length gene finds mostly ankyrin domains.
 +
}}
 +
 
 +
 
 +
You will find that '''all''' of these approaches yield '''some''' of the orthologs. But none finds them all. The take-home message is: precomputed results are good for large-scale, survey-type investigations, where you cannot process the information by hand. But for more detailed questions, careful manual searches are still indispensable.
 +
 
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand for crowdsourcing" data-collapsetext="Collapse">
 +
;Orthology by crowdsourcing
 +
:Luckily a crowd of willing hands has prepared the necessary sequences for you: in the section below you will find a link to the annotated and verified Mbp1 orthologs from last year's course  :-)
 +
 
 +
<div class="mw-collapsible-content">
 +
We could call this annotation by many hands {{WP|Crowdsourcing|"crowdsourcing"}}: handing out small parcels of work to many workers, each of whom would typically contribute only a small share of their time. The strength is in numbers, and projects that organize via the Internet in particular can tally up very impressive manpower, for free or as {{WP|Microwork}}. These developments are of some interest for bioinformatics: many of our more difficult tasks cannot easily be built into an algorithm; language-related tasks such as text mining, or pattern-matching tasks, come to mind. Allocating such tasks to a large number of human contributors may be a viable alternative to computation. A marketplace where this kind of work is already a reality is {{WP|Amazon Mechanical Turk|Amazon's "Mechanical Turk" Marketplace}}: programmers ("requesters") use an open interface to post tasks for payment, and "providers" from all over the world can take these on. Tasks may include matching pictures or evaluating the aesthetics of competing designs. A quirky example I came across recently was when information designer David McCandless had 200 "Mechanical Turks" draw a small picture of their soul for his collection.
 +
 
 +
The name {{WP|The Turk|"Mechanical Turk"}}, by the way, relates to a famous ruse: a Hungarian inventor and adventurer toured the imperial courts of 18<sup>th</sup> century Europe with an automaton, dressed in Turkish robes and turban, that played chess at the grandmaster level against opponents that included Napoleon Bonaparte and Benjamin Franklin. No small mechanical feat in any case; it was only in the 19<sup>th</sup> century that it was revealed that the computational power was actually provided by a concealed human.
 +
 
 +
Are you up for some "Turking"? Before the next quiz, edit [http://biochemistry.utoronto.ca/steipe/abc/students/index.php/BCH441_2014_Assignment_7_RBM '''the Mbp1 RBM page on the Student Wiki'''] and include the RBM for Mbp1, for a 10% bonus on the next quiz.
 +
 
 +
</div>
 +
</div>
 +
 
 +
 
 +
 
  
 
{{Vspace}}
 
{{Vspace}}

Revision as of 04:35, 31 August 2017

Concepts and Consequences of Homology


 

Keywords:  Concepts of homology; Orthologs; Paralogs


 



 


Caution!

This unit is under development. There is some content here, but it is incomplete and/or may change significantly: links may lead nowhere, the content is likely to be rearranged, and objectives, deliverables etc. may be incomplete or missing. Do not work with this material until it is updated to "live" status.


 


Abstract

...


 


This unit ...

Prerequisites

You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:

  • Biomolecules: The molecules of life; nucleic acids and amino acids; the genetic code; protein folding; post-translational modifications and protein biochemistry; membrane proteins; biological function.
  • The Central Dogma: Regulation of transcription and translation; protein biosynthesis and degradation; quality control.
  • Evolution: Theory of evolution; variation, neutral drift and selection.

You need to complete the following units before beginning this one:


 


Objectives

...


 


Outcomes

...


 


Deliverables

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your course journal.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.


 


Evaluation

Evaluation: NA

This unit is not evaluated for course marks.


 


Contents

Task:


Selecting the YFO "Mbp1"

 

Task:

  1. Back at the Mbp1 protein page follow the link to Run BLAST... under "Analyze this sequence".
  2. This allows you to perform a sequence similarity search. You need to set two parameters:
    1. As Database, select Reference proteins (refseq_protein) from the drop down menu;
    2. In the Organism field, type the species you have selected as YFO and select the corresponding taxonomy ID.
  3. Click on Run BLAST to start the search. This should find a handful of genes, all of them in YFO. If you find none, or hundreds, or they are not all in the same species, you did something wrong. Ask on the mailing list and make sure to fix the problem.
  4. Look at the top "hit" in the Descriptions section. The rightmost column contains sequence IDs under the Accession heading. The alignment and alignment score are shown in the Alignments section a bit further down the page. Look at the result.
  5. In the header information for each hit is a link to its database entry, right next to Sequence ID. It says something like ref|NP_123456789.1 or ref|XP_123456789 ... follow that link.
  6. Note the RefSeq ID, and the search results %ID, E-value, whether one or more similar regions were found etc. in your Journal. We will refer to this sequence as "YFO Mbp1" or similar in the future.
  7. Finally access the UniProt ID mapping service to retrieve the UniProt ID for the protein. Paste the RefSeq ID and choose RefSeq Protein as the From: option and UniProtKB as the To: option.
If the mapping works, the UniProt ID will be in the Entry: column of the table that is being returned. Click the link and have a look at the UniProt entry page while you're there.
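
The identifiers mentioned in step 5 can also be taken apart programmatically before you record them in your journal. The following is a minimal Python sketch, not part of the original unit: the function name parse_refseq and the regular expression are illustrative assumptions, and only the NP_ and XP_ prefixes named above are handled. The prefix descriptions reflect the usual RefSeq convention as I understand it.

```python
import re

# Assumed pattern: optional "ref|" tag, an NP_ or XP_ accession, an optional
# ".version" suffix, and an optional trailing "|" as seen in some BLAST headers.
REFSEQ_PROTEIN = re.compile(r"(?:ref\|)?((?:NP|XP)_\d+)(?:\.(\d+))?\|?")

# Usual meaning of the two prefixes (stated here as an assumption).
PREFIX_MEANING = {
    "NP": "protein record, typically curated",
    "XP": "protein record predicted by an annotation pipeline (model)",
}


def parse_refseq(identifier):
    """Split a RefSeq protein identifier into accession, version and prefix."""
    match = REFSEQ_PROTEIN.fullmatch(identifier.strip())
    if match is None:
        raise ValueError("not an NP_/XP_ RefSeq protein identifier: " + identifier)
    accession, version = match.group(1), match.group(2)
    prefix = accession[:2]
    return {
        "accession": accession,
        "version": int(version) if version is not None else None,
        "prefix": prefix,
        "meaning": PREFIX_MEANING[prefix],
    }


print(parse_refseq("ref|NP_010227.1"))
```

The version number is kept separate because an accession without a version (as in the second example of step 5) refers to the record in general, not to one specific sequence version.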


 


Defining orthologs

To be reasonably certain about orthology relationships, we would need to construct and analyze detailed evolutionary trees. This is computationally expensive and the results are not always unambiguous either, as we will see in a later assignment. But a number of different strategies are available that use precomputed results to define orthologs. These are especially useful for large, cross genome surveys. They are less useful for detailed analysis of individual genes. Pay the sites a visit and try a search.


Orthologs by eggNOG
The eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) database at the EMBL contains orthologous groups of genes. It appears to be continuously updated; the search functionality is reasonable, and the results for yeast Mbp1 show many genes from several fungi. Importantly, there is only one gene annotated for each species. Alignments and trees are also available, as are database downloads for algorithmic analysis.

 

Powell et al. (2014) eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Res 42:D231-9. (pmid: 24297252)

With the increasing availability of various 'omics data, high-quality orthology assignment is crucial for evolutionary and functional genomics studies. We here present the fourth version of the eggNOG database (available at http://eggnog.embl.de) that derives nonsupervised orthologous groups (NOGs) from complete genomes, and then applies a comprehensive characterization and analysis pipeline to the resulting gene families. Compared with the previous version, we have more than tripled the underlying species set to cover 3686 organisms, keeping track with genome project completions while prioritizing the inclusion of high-quality genomes to minimize error propagation from incomplete proteome sets. Major technological advances include (i) a robust and scalable procedure for the identification and inclusion of high-quality genomes, (ii) provision of orthologous groups for 107 different taxonomic levels compared with 41 in eggNOGv3, (iii) identification and annotation of particularly closely related orthologous groups, facilitating analysis of related gene families, (iv) improvements of the clustering and functional annotation approach, (v) adoption of a revised tree building procedure based on the multiple alignments generated during the process and (vi) implementation of quality control procedures throughout the entire pipeline. As in previous versions, eggNOGv4 provides multiple sequence alignments and maximum-likelihood trees, as well as broad functional annotation. Users can access the complete database of orthologous groups via a web interface, as well as through bulk download.


Orthologs at OrthoDB
OrthoDB includes a large number of species, among them all of our protein-sequenced fungi. However, the keyword search retrieves many paralogs together with the orthologs: for example, the yeast Sok2 and Phd1 proteins are found in the same orthologous group, although these two are clearly paralogs.

 

Waterhouse et al. (2013) OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs. Nucleic Acids Res 41:D358-65. (pmid: 23180791)

The concept of orthology provides a foundation for formulating hypotheses on gene and genome evolution, and thus forms the cornerstone of comparative genomics, phylogenomics and metagenomics. We present the update of OrthoDB-the hierarchical catalog of orthologs (http://www.orthodb.org). From its conception, OrthoDB promoted delineation of orthologs at varying resolution by explicitly referring to the hierarchy of species radiations, now also adopted by other resources. The current release provides comprehensive coverage of animals and fungi representing 252 eukaryotic species, and is now extended to prokaryotes with the inclusion of 1115 bacteria. Functional annotations of orthologous groups are provided through mapping to InterPro, GO, OMIM and model organism phenotypes, with cross-references to major resources including UniProt, NCBI and FlyBase. Uniquely, OrthoDB provides computed evolutionary traits of orthologs, such as gene duplicability and loss profiles, divergence rates, sibling groups, and now extended with exon-intron architectures, syntenic orthologs and parent-child trees. The interactive web interface allows navigation along the species phylogenies, complex queries with various identifiers, annotation keywords and phrases, as well as with gene copy-number profiles and sequence homology searches. With the explosive growth of available data, OrthoDB also provides mapping of newly sequenced genomes and transcriptomes to the current orthologous groups.


Orthologs at OMA

OMA (the Orthologous Matrix), maintained at the Swiss Federal Institute of Technology, contains a large number of orthologs from sequenced genomes. Searching with MBP1_YEAST (this is the Swiss-Prot ID) as a "Group" search finds the correct gene in EREGO, KLULA, CANGL and SACCE. But searching with the sequence of the Ustilago maydis ortholog does not find the yeast protein; instead it finds the orthologs in YARLI, SCHPO, LACCBI, CRYNE and USTMA. Apparently the orthologous group has been split into several subgroups across the fungi. As a whole, however, the database is carefully constructed and available for download and API access: a large and useful resource.

 

Altenhoff et al. (2011) OMA 2011: orthology inference among 1000 complete genomes. Nucleic Acids Res 39:D289-94. (pmid: 21113020)

OMA (Orthologous MAtrix) is a database that identifies orthologs among publicly available, complete genomes. Initiated in 2004, the project is at its 11th release. It now includes 1000 genomes, making it one of the largest resources of its kind. Here, we describe recent developments in terms of species covered; the algorithmic pipeline--in particular regarding the treatment of alternative splicing, and new features of the web (OMA Browser) and programming interface (SOAP API). In the second part, we review the various representations provided by OMA and their typical applications. The database is publicly accessible at http://omabrowser.org.

... see also the related articles, much innovative and carefully done work on automated orthologue definition by the Dessimoz group.


Orthologs by syntenic gene order conservation
We will revisit this when we explore the UCSC genome browser.


Orthologs by RBM
Defining it yourself. RBM (or: Reciprocal Best Match) is easy to compute and half of the work you have already done in Assignment 3. Get the ID for the gene which you have identified and annotated as the best BLAST match for Mbp1 in YFO and confirm that this gene has Mbp1 as the most significant hit in the yeast proteome. The results are unambiguous, but there may be residual doubt whether these two best-matching sequences are actually the most similar orthologs.

Task:

  1. Navigate to the BLAST homepage.
  2. Paste the YFO RefSeq sequence identifier into the search field. (You don't have to search with sequences–you can search directly with an NCBI identifier IF you want to search with the full-length sequence.)
  3. Set the database to refseq, and restrict the species to Saccharomyces cerevisiae.
  4. Run BLAST.
  5. Keep the window open for the next task.

The top hit should be yeast Mbp1 (NP_010227). Email me your sequence identifiers if it is not. If it is, you have confirmed the RBM or BBH criterion (Reciprocal Best Match or Bidirectional Best Hit, respectively).

Technically, this is not perfectly true since you have searched with the APSES domain in one direction, with the full-length sequence in the other. For this task I wanted you to try the search-with-accession-number. Therefore the procedural laxness, I hope it is permissible. In fact, performing the reverse search with the YFO APSES domain should actually be more stringent, i.e. if you find the right gene with the longer sequence, you are even more likely to find the right gene with the shorter one.
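
The logic of the RBM criterion can be written down in a few lines. The following Python sketch is an illustration only, not the procedure used in the course: it assumes you have already collected the BLAST hits as (subject ID, E-value, bit score) tuples, and all identifiers other than NP_010227 as well as all numbers are made-up placeholders.

```python
def best_hit(hits):
    """Return the subject ID of the most significant hit, or None if there are none.

    Each hit is a (subject_id, e_value, bit_score) tuple. Hits are ranked by
    lowest E-value first; ties are broken by the highest bit score.
    """
    if not hits:
        return None
    return min(hits, key=lambda hit: (hit[1], -hit[2]))[0]


def is_reciprocal_best_match(query_id, forward_hits, reverse_hits_by_query):
    """Check the RBM criterion for one query protein.

    forward_hits: hits of the query in the other proteome (here: YFO).
    reverse_hits_by_query: maps an ID from the other proteome to the hits
    found when that protein is searched against the query's own proteome.
    """
    forward_best = best_hit(forward_hits)
    if forward_best is None:
        return False
    reverse_hits = reverse_hits_by_query.get(forward_best, [])
    return best_hit(reverse_hits) == query_id


# Placeholder data: only NP_010227 (yeast Mbp1) is taken from the text above.
forward = [("YFO_protein_A", 1e-40, 160.0), ("YFO_protein_B", 1e-12, 70.0)]
reverse = {
    "YFO_protein_A": [("NP_010227", 1e-38, 155.0),
                      ("yeast_other_protein", 1e-15, 80.0)],
}

print(is_reciprocal_best_match("NP_010227", forward, reverse))  # prints True
```

Ranking by E-value with the bit score as a tie-breaker is a design choice of this sketch; very strong hits can all be reported with an E-value of 0.0, in which case the E-value alone cannot separate them.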


Orthology by annotation
The NCBI precomputes BLAST results and makes them available at the RefSeq database entry for your protein.

Task:

  1. In your BLAST result page, click on the RefSeq link for your query to navigate to the RefSeq database entry for your protein.
  2. Follow the Blink link in the right-hand column under Related information.
  3. Under "Display options", restrict the view to RefSeq and to Fungi.

You should see a number of genes with low E-values and high coverage in other fungi. However, this search is problematic: searching the database with the full-length gene finds mostly ankyrin domains.


You will find that all of these approaches yield some of the orthologs. But none finds them all. The take-home message is: precomputed results are good for large-scale, survey-type investigations, where you cannot process the information by hand. But for more detailed questions, careful manual searches are still indispensable.
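
To see how incomplete each individual result is, you can compare the sets of species that different searches return. The Python sketch below is only an illustration: it uses the two OMA searches described above as example input, and the function name missing_per_search is a hypothetical helper, not part of any database API.

```python
# Species codes returned by the two OMA searches described in the text.
results = {
    "OMA group search with MBP1_YEAST": {"EREGO", "KLULA", "CANGL", "SACCE"},
    "OMA search with the Ustilago maydis sequence": {
        "YARLI", "SCHPO", "LACCBI", "CRYNE", "USTMA",
    },
}


def missing_per_search(results):
    """For each search, list the species that other searches found but it did not."""
    all_species = set().union(*results.values())
    return {name: sorted(all_species - found) for name, found in results.items()}


for name, missing in missing_per_search(results).items():
    print(name, "misses:", ", ".join(missing))
```

Adding the species lists you obtain from eggNOG, OrthoDB or your own RBM searches as further entries in the dictionary extends the comparison without changing the function.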

Orthology by crowdsourcing
Luckily a crowd of willing hands has prepared the necessary sequences for you: in the section below you will find a link to the annotated and verified Mbp1 orthologs from last year's course :-)

We could call this annotation by many hands "crowdsourcing": handing out small parcels of work to many workers, each of whom would typically contribute only a small share of their time. The strength is in numbers, and projects that organize via the Internet in particular can tally up very impressive manpower, for free or as Microwork. These developments are of some interest for bioinformatics: many of our more difficult tasks cannot easily be built into an algorithm; language-related tasks such as text mining, or pattern-matching tasks, come to mind. Allocating such tasks to a large number of human contributors may be a viable alternative to computation. A marketplace where this kind of work is already a reality is Amazon's "Mechanical Turk" Marketplace: programmers ("requesters") use an open interface to post tasks for payment, and "providers" from all over the world can take these on. Tasks may include matching pictures or evaluating the aesthetics of competing designs. A quirky example I came across recently was when information designer David McCandless had 200 "Mechanical Turks" draw a small picture of their soul for his collection.

The name "Mechanical Turk", by the way, relates to a famous ruse: a Hungarian inventor and adventurer toured the imperial courts of 18th century Europe with an automaton, dressed in Turkish robes and turban, that played chess at the grandmaster level against opponents that included Napoleon Bonaparte and Benjamin Franklin. No small mechanical feat in any case; it was only in the 19th century that it was revealed that the computational power was actually provided by a concealed human.

Are you up for some "Turking"? Before the next quiz, edit the Mbp1 RBM page on the Student Wiki and include the RBM for Mbp1, for a 10% bonus on the next quiz.



 


Further reading, links and resources

 


Notes


 


Self-evaluation

 



 




 

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.



 

About ...
 
Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2017-08-05

Version:

0.1

Version history:

  • 0.1 First stub

CreativeCommonsBy.png This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.