Difference between revisions of "BIN-SEQA-Collaboration"


Revision as of 16:25, 9 November 2017

Analysis of Collaborating Sequences


 

Keywords:  Analysis of collaborating sequences (with R examples); Coexpression; Pathways; Enrichment.


 



 


Caution!

This unit is under development. There is some content here, but it is incomplete and may change significantly: links may lead nowhere, the content is likely to be rearranged, and objectives, deliverables etc. may be incomplete or missing. Do not work with this material until it is updated to "live" status.


 


Abstract

...


 


This unit ...

Prerequisites

You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:

  • Biomolecules: The molecules of life; nucleic acids and amino acids; the genetic code; protein folding; post-translational modifications and protein biochemistry; membrane proteins; biological function.

You need to complete the following units before beginning this one:


 


Objectives

...


 


Outcomes

...


 


Deliverables

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.


 


Evaluation

Evaluation: NA

This unit is not evaluated for course marks.


 


Contents

Pathways

 

Pathways are perhaps the earliest biochemical representation of systems of collaborating genes. They are a particularly active area of bioinformatics research, stimulated by the vision of automatically grouping biological entities into meaningful systems, based on their known properties or the properties of their homologues. For instance, if Enzyme A produces metabolite 1 and Enzyme B consumes metabolite 1, even a computer can figure out that A and B can be functionally connected. If we stitch all such connections together, we arrive at a description of material flow inside the cell, or of regulatory connections, or of signaling events. Of course, A and B have to be in the same compartment, and metabolite 1 shouldn't be ATP. Or H2O. And if we infer relationships from well-studied model organisms, the components we compare really have to be orthologues. We are not yet sure to what degree automation will be possible. Careful, manual curation of pathway data is going to be with us for some time to come.
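
The producer/consumer reasoning above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the enzyme names, compartments and metabolite sets are invented toy data, and the function infer_links is a hypothetical helper, not part of any pathway database or library.

```python
from itertools import product

# Hypothetical toy data (not from any database): each enzyme has a
# compartment, a set of consumed metabolites and a set of produced ones.
ENZYMES = {
    "A": {"compartment": "cytosol",
          "consumes": {"glucose", "ATP"},
          "produces": {"metabolite 1", "ADP"}},
    "B": {"compartment": "cytosol",
          "consumes": {"metabolite 1", "H2O"},
          "produces": {"metabolite 2"}},
    "C": {"compartment": "mitochondrion",
          "consumes": {"metabolite 1"},
          "produces": {"metabolite 3"}},
    "D": {"compartment": "cytosol",
          "consumes": {"ADP"},
          "produces": {"ATP"}},
}

# Ubiquitous metabolites that would connect almost every enzyme pair.
CURRENCY = {"ATP", "ADP", "H2O"}


def infer_links(enzymes, currency):
    """Return sorted (producer, consumer, metabolite) triples.

    Two enzymes are linked if they share a compartment and the first
    produces a non-currency metabolite that the second consumes.
    """
    links = []
    for (p_name, p), (c_name, c) in product(enzymes.items(), repeat=2):
        if p_name == c_name or p["compartment"] != c["compartment"]:
            continue
        shared = (p["produces"] & c["consumes"]) - currency
        for metabolite in sorted(shared):
            links.append((p_name, c_name, metabolite))
    return sorted(links)


links = infer_links(ENZYMES, CURRENCY)
print(links)
```

With the toy data, only A and B are connected: C uses the same metabolite but sits in a different compartment, and D is linked to A only through ATP and ADP. Passing an empty set as currency shows how quickly such ubiquitous compounds create spurious connections.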


 

KEGG

 

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is one of the oldest and best-curated databases of metabolic and functional pathways. It stores hand-curated pathways for a number of model organisms and supports computational inference for other organisms by determining orthologues.


Task:


  • Access the KEGG Web site.

KEGG identifies organisms in the database by three- or four-letter codes, e.g. Homo sapiens → hsa, Saccharomyces cerevisiae → sce. Some, but not all, of our fungi have manually curated genes annotated in KEGG.

  • Navigate to the KEGG Organisms selection page and find out whether MYSPE has been annotated in KEGG, either by following its taxonomic lineage to the species level or by entering the species name into the search field. If it is there, note its KEGG three-letter species code. If you are certain it is not there, note that you have not found it and record the full taxonomic lineage of MYSPE (from the NCBI Taxonomy DB if it is not in your notes). Then proceed through this task using Cryptococcus neoformans (cne).
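
The organism lookup can also be done on a downloaded organism table. The sketch below assumes a four-column, tab-delimited layout (identifier, organism code, name, lineage) like the one the KEGG REST list/organism operation is understood to return; the sample rows are hand-written for illustration, the identifiers in the first column are placeholders, and parse_organism_table and find_organism are hypothetical helper names.

```python
def parse_organism_table(text):
    """Parse tab-delimited lines of: identifier, code, name, lineage."""
    records = []
    for line in text.strip().splitlines():
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # skip malformed lines rather than guessing
        identifier, code, name, lineage = fields[:4]
        records.append({
            "id": identifier,
            "code": code,
            "name": name,
            "lineage": lineage.split(";"),
        })
    return records


def find_organism(records, query):
    """Return all records whose name contains the query (case-insensitive)."""
    q = query.lower()
    return [r for r in records if q in r["name"].lower()]


# Illustrative sample only: identifiers and lineage strings are placeholders.
SAMPLE = (
    "T-placeholder-1\thsa\tHomo sapiens (human)\tEukaryotes;Animals\n"
    "T-placeholder-2\tsce\tSaccharomyces cerevisiae (budding yeast)\tEukaryotes;Fungi\n"
    "T-placeholder-3\tcne\tCryptococcus neoformans\tEukaryotes;Fungi\n"
)

records = parse_organism_table(SAMPLE)
hits = find_organism(records, "cryptococcus")
for hit in hits:
    print(hit["code"], hit["name"], " > ".join(hit["lineage"]))
```

An empty result from find_organism corresponds to the "not found" case in the task, in which you fall back to cne.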


Explore
  • Navigate to the KEGG2 entry page, which contains a number of options to search the database contents. The links in the list below are instantiated with searches relevant to yeast Mbp1 and the cell cycle. Try the links and make sure you understand what they return.
  • The simplest option is to search for a gene name. KEGG will return all matches to that name in record titles and annotations.
  • You can execute a BLAST search against the database and thus search with domain sequences, such as the APSES domain, rather than with entire genes.
  • You can define a ligand or enzyme.
  • You can use the PATHWAY search tool to retrieve information on a particular system.
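
The same keyword searches can be phrased as URLs for scripted access. This sketch assumes the KEGG REST "find" operation with the pattern /find/<database>/<keywords>, where several keywords are joined by "+"; this is a different interface from the Web forms linked above, the base URL is an assumption that should be checked against the current KEGG documentation, and find_url is a hypothetical helper. The code only builds the URLs and does not contact the server.

```python
from urllib.parse import quote

# Assumed base URL of the KEGG REST interface; verify before relying on it.
KEGG_REST = "https://rest.kegg.jp"


def find_url(database, keywords):
    """Build a keyword-search URL: words are URL-quoted and joined by '+'."""
    terms = "+".join(quote(word) for word in keywords.split())
    return f"{KEGG_REST}/find/{database}/{terms}"


# The three searches used in this task: gene name, enzyme, pathway.
queries = [
    ("genes", "mbp1"),
    ("enzyme", "cyclin-dependent kinase"),
    ("pathway", "cell cycle"),
]

urls = [find_url(db, kw) for db, kw in queries]
for url in urls:
    print(url)
```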


The gene search returns a list of genes. One of them should be sce:YDL056W, the KEGG identifier built from Mbp1's systematic name. Access that record and click on the Help button at the top of the record to find out what the returned results contain.

Not all of KEGG's curated genes contain a link to a pathway record. However, Mbp1 does. There is a line labeled Pathway with a link to the protein's curated pathway information: sce04111. (This pathway code, 04111, should also have come up as one of the pathways returned by the search for "cyclin-dependent kinase" as an enzyme, or by the pathway search for "cell cycle".)


  • Follow the link to the yeast cell-cycle pathway.


 

The position of Mbp1 is emphasized with a red box. Almost all boxes in this reference pathway are green, which indicates that a gene for that component of the pathway has been curated and stored in the database. The boxes are linked to the respective KEGG gene pages. The phases of the cell cycle G1 - S - G2 - M are indicated at the bottom of the chart.


  • Use the drop-down menu to switch to the comparative pathway map curated for MYSPE.

Note that this displays the same map, but now some of the boxes are white, because the KEGG curators have not annotated an orthologue for these genes in the KO (KEGG Orthology) database, and the green boxes are now linked to your organism's genes instead of the yeast genes. This is a very convenient way to check which components of the well-described yeast pathways have been curated as conserved in your organism.
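
The green/white colouring amounts to a lookup of each pathway component in the organism's orthologue annotations. A minimal sketch, assuming invented placeholder identifiers (KO_1 and so on, gene_x and gene_y) rather than real KO numbers or gene names; colour_boxes is a hypothetical helper, not a KEGG function.

```python
def colour_boxes(reference_components, organism_annotations):
    """Map each pathway component to ("green", gene) or ("white", None).

    reference_components: orthologue-group identifiers on the reference map.
    organism_annotations: dict from orthologue-group identifier to the
    organism's curated gene for that group.
    """
    boxes = {}
    for component in reference_components:
        gene = organism_annotations.get(component)
        boxes[component] = ("green", gene) if gene else ("white", None)
    return boxes


# Placeholder identifiers for illustration only.
REFERENCE = ["KO_1", "KO_2", "KO_3", "KO_4"]
MYSPE_ANNOTATIONS = {"KO_1": "gene_x", "KO_3": "gene_y"}

boxes = colour_boxes(REFERENCE, MYSPE_ANNOTATIONS)
conserved = [ko for ko, (colour, _) in boxes.items() if colour == "green"]
print(boxes)
print(f"{len(conserved)} of {len(REFERENCE)} components curated as conserved")
```

A white box in this scheme only means that no annotation is recorded, not that the orthologue is absent from the genome.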


 

If you explore the various organisms for which this map has been transferred by homology (options menu at the top), you will notice that for most organisms the KEGG curators have mapped far fewer genes to this yeast pathway. (The situation is somewhat better for metabolic pathways, by the way, since computational inference through orthology is less ambiguous there.) But orthologues are quite easy to find in KEGG too:

 

Task:

  • Return to the yeast version of the cell-cycle pathway map.
  • Click on the Mbp1 link on the yeast pathway map to go to the KEGG gene record for the Mbp1 protein. In the SSDB row, click the Ortholog button. Orthologues for all of the organisms (or at least their close relatives in the KEGG database) have been precomputed. It should take only a moment to check that the orthologue in your organism is listed too - even if the box was "white" on the pathway page. This is not an error - it just reflects different levels of annotation, curation and inference.

Once again, we are back at a familiar problem: many of our annotations, and an increasing share of them, are based on analogy and inference. We study one system experimentally in a model organism, then we attempt to map the components to another organism. But pursuing the idea of orthology in order to map function is tricky. Even orthologues may have diverged in evolution to distinct and dissimilar functional systems. Note for example that in yeast Mbp1 binds to Swi6 (the MBF complex) and Swi6 can also bind to Swi4, an Mbp1 homologue (the SBF complex). In many CRMs (cis-regulatory modules) the respective binding sites of Mbp1 and Swi4 are closely juxtaposed. However, only the Saccharomycotina seem to possess orthologues to Swi4, at least in the sense that these genes are more similar to Swi4 than to Mbp1. Yet in our phylogenetic analysis we noted an Mbp1 paralogue in the fungal cenancestor, which then was at the root of the Swi4/MbpA genes of our tree. We have called its descendant Swi4 in some cases and MbpA in others, since we have annotated it from the perspective of similarity to yeast. Is Mbp1 a gene that has taken on functions that are distinct from Swi4? MBF and SBF appear to be two complementary systems, presumably each having taken over some part of the space of functions from the other and probably having acquired a few novel functions along the way. But the situation in the other fungi cannot be unambiguously inferred from the evidence we have considered.

I hope that this short discussion has illuminated the problems associated with mapping functions between organisms based on gene similarity. To paraphrase the issue one more time: we are mapping concepts to biology, but "concepts" and "biology" exist in two different worlds. It is helpful, indeed crucial, to explain biology in terms of higher-order concepts. This is what we ultimately mean by "understanding" and indeed, if we did not try this, we would be merely "butterfly-collecting". But never, never fall into the trap of basing your biological conclusions - e.g. functional equivalence of biological objects - mechanically on a computed similarity of concepts (such as gene similarity, pathway position, GO annotation etc.). The mapping of concept to object may be arbitrarily imprecise and, as a consequence, so is the equivalence once we apply it to the "real" world.
 




Co-Expression

Task:
COXPRESdb is a well-curated database of pre-calculated co-expression profiles for model organisms. Expression values across a large number of published experiments on the same platform are compared via their coefficient of correlation. Highly correlated genes are either co-regulated, or one gene influences the expression level of the other.

  • Navigate to COXPRESdb.
  • Enter Mbp1 as a "gene alias" in the search field.
  • Click on the link to the coexpressed gene list. Do any of the "known" target genes appear here? How do you interpret this result?

Unfortunately, the support for yeast genes is very limited. COXPRESdb is, however, an excellent resource for studying genes of higher eukaryotes, especially human genes. You might want to consider it for its additional capabilities for your "systems" term project. Refer to the YouTube tutorials for details.
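
The comparison of expression profiles by their coefficient of correlation can be sketched as follows. This is a generic Pearson correlation on invented toy profiles, not the procedure COXPRESdb actually uses; the gene names, the expression values, the threshold of 0.9 and the helper coexpressed are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


# Invented expression values of four genes across five experiments.
PROFILES = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0],
}


def coexpressed(profiles, query, threshold=0.9):
    """List (gene, r) pairs whose correlation with the query is >= threshold."""
    hits = []
    for name, values in profiles.items():
        if name == query:
            continue
        r = pearson(profiles[query], values)
        if r >= threshold:
            hits.append((name, r))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


hits = coexpressed(PROFILES, "geneA")
for name, r in hits:
    print(f"{name}\t{r:.3f}")
```

Note that a high coefficient does not say which of the two explanations applies (co-regulation or a direct influence of one gene on the other), and a strongly anticorrelated gene such as geneC is missed by a one-sided threshold.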


 


 


Further reading, links and resources

Okamura et al. (2015) COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res 43:D82-6. (pmid: 25392420)

The COXPRESdb (http://coxpresdb.jp) provides gene coexpression relationships for animal species. Here, we report the updates of the database, mainly focusing on the following two points. For the first point, we added RNAseq-based gene coexpression data for three species (human, mouse and fly), and largely increased the number of microarray experiments to nine species. The increase of the number of expression data with multiple platforms could enhance the reliability of coexpression data. For the second point, we refined the data assessment procedures, for each coexpressed gene list and for the total performance of a platform. The assessment of coexpressed gene list now uses more reasonable P-values derived from platform-specific null distribution. These developments greatly reduced pseudo-predictions for directly associated genes, thus expanding the reliability of coexpression data to design new experiments and to discuss experimental results.


 


Notes


 


Self-evaluation

 



 




 

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.



 

About ...
 
Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2017-08-05

Version:

0.1

Version history:

  • 0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.