BCB420 2015 Project Plan


The 2015 Computational Systems Biology Project



Systems Discovery From Public Database Information

ABSTRACT

Systems can be defined as sets of all entities that jointly contribute to a function (or more generally: objective). While the notion of function in molecular biology is a topic of discussion, collaborating entities - such as biomolecules, complexes and higher-order assemblies - may be identifiable from shared properties, annotations, evolutionary history, expression patterns and other experimental observables. Here we integrate observations from a number of publicly available databases to discover the composition and hierarchical organization of human cellular systems.

The systems discovery proceeds along four tracks: (i) phylogenetic profiles harvest mutual information of genes; (ii) gene coexpression patterns identify regulatory units; (iii) importing gene sets from known signalling and regulatory pathways utilizes expert knowledge; and (iv) semantic similarity of GO annotations identifies functional similarity. All of these measures provide categorical and continuous data about functional similarities between pairs of genes. This information can be conceptualized as a weighted (directed) multigraph. Clustering, evidence-combining or machine learning strategies can then be applied to this dataset to discover systems as defined above.

An analysis of such systems will describe their composition, their relationships, and their presumed properties.


Scope

  • Goal 1 - Discover systems
    • Build (four) data processing pipelines that contribute shared-function information for genes
    • Aggregate data as multigraph
    • Discover systems as sets of highly connected genes
  • Goal 2 - Visualize/interpret systems
    • Visualize using graph analysis software (e.g. Cytoscape)
    • Determine intra-system relationships (composition)
    • Determine inter-system relationships (networks and hierarchies)
    • Determine properties from gene-wise annotations
  • Goal 3 - Disseminate
    • Disseminate methods (e.g. as R-package)
    • Disseminate results (e.g. as publication)


 

Preparations

  • Preparations I - Project Plan
The project plan will be drafted, discussed, reviewed and iteratively revised until it is precisely defined in every single step, including all required resources. It will be continuously refined as information becomes available. We need to define the objectives of the project, its partial goals and how to achieve them. This includes definition of data, data sources, analysis workflow, visualizations and their respective implementation as workflows. Special emphasis will be on the interpretation of the results. We will strive to achieve a meaningful conclusion that can be written up for a publication.
Milestone - Project plan defined

  • Preparations II - Organization
All project participants need access to a number of collaborative tools for sharing information and project results.
Milestone - Workflow for documentation and result sharing established

  • Preparations III - Roles and Tasks
The class will comprise students with widely varying skills and backgrounds. We need to identify available skills, but students will also define their own learning objectives for the project, through which they push the boundaries of their knowledge and capabilities as far as possible. Based on this survey of resources, we will break down the project into work packages and assign them, keeping redundancy and contingencies in mind.
Milestone - Packages assigned


 

Implementation

Phylogenetic Profile

  • Data
Define a well distributed set of species; define suitable genes (or gene sets) for analysis.
Milestone - Input data defined

  • Algorithms and tools
Define procedure for determining orthology, install databases and tools, define algorithm for mutual information, address undersampling issue (many more genes than species), address threshold issue (when is a shared function implied).
Milestone - Workflow defined, resources prepared

  • Data acquisition
Compute phylogenetic profiles
Milestone - Profiles computed, stored

  • Data analysis
Calculate mutual information, output results
Milestone - Results in database
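
As a minimal sketch of the "calculate mutual information" step, the following assumes that each phylogenetic profile is a binary presence/absence vector over the same ordered set of species. The function name and the example profiles are hypothetical, and the undersampling and threshold issues listed above are not addressed here.

```python
import math
from collections import Counter

def mutual_information(profile_a, profile_b):
    """Mutual information (in bits) between two equal-length profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same species")
    n = len(profile_a)
    joint = Counter(zip(profile_a, profile_b))   # counts of (a, b) state pairs
    count_a = Counter(profile_a)                 # marginal counts for gene a
    count_b = Counter(profile_b)                 # marginal counts for gene b
    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (n_ab / n) * math.log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi

# Hypothetical profiles over eight species (1 = ortholog present, 0 = absent).
gene_x = [1, 1, 0, 0, 1, 0, 1, 0]
gene_y = [1, 1, 0, 0, 1, 0, 1, 0]   # same pattern as gene_x
gene_z = [1, 0, 1, 0, 0, 0, 1, 1]   # statistically independent of gene_x

print(mutual_information(gene_x, gene_y))
print(mutual_information(gene_x, gene_z))
```

Identical profiles give the maximum value for a balanced binary profile, while independent profiles give zero; where to place the cutoff between them is the threshold issue noted under "Algorithms and tools".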

Gene Coexpression

  • Data
Define suitable platforms and perturbation conditions, find datasets, define normalization procedure to make datasets comparable.
Milestone - Acquisition of expression profiles defined

  • Algorithms and tools
Define co-expression measure (correlation), implement in software, decide on significance threshold
Milestone - Co-expression computable

  • Data acquisition
Retrieve data from GEO
Milestone - Data ready for analysis

  • Data analysis
Run co-expression measure, apply threshold
Milestone - Results in database
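
The "run co-expression measure, apply threshold" step could look roughly like the sketch below. It assumes Pearson correlation as the co-expression measure and already normalized expression values; the gene names, values and the threshold of 0.9 are placeholders, not project decisions.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: correlation undefined, treat as no evidence
    return cov / math.sqrt(var_x * var_y)

def coexpression_edges(expression, threshold):
    """Return {(gene1, gene2): r} for all pairs with |r| >= threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges[(g1, g2)] = r
    return edges

# Hypothetical normalized expression values over four conditions.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0],
    "GENE_C": [4.0, 3.0, 2.0, 1.0],
    "GENE_D": [1.0, 3.0, 1.0, 3.0],
}
edges = coexpression_edges(expression, threshold=0.9)
for pair, r in edges.items():
    print(pair, round(r, 3))
```

Keeping strongly negative correlations (as done here via the absolute value) is a design choice; if only positive co-expression should count as evidence of shared function, the comparison would use `r` instead of `abs(r)`.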


Pathways

  • Data

Q: Are web-based APIs the best way to access data? Maybe we need a way to download data in bulk?


Introduction

We first distinguish two types of pathways, since each will be handled differently:

  1. Flux-carrying pathways (enzymes that convert metabolites)
  2. Non-flux-carrying pathways (signalling cascades, cell regulatory pathways, etc.)

  • Pathway data is commonly expressed as models in the COBRA format, where the model is a structure with the fields:
    • .rxns (all reaction names)
    • .S (connectivity matrix)
    • .ub (upper bound on flux)
    • .lb (lower bound on flux)
  • Non-flux-carrying pathways can also be represented this way, but more work may be needed to make the representation robust. Signalling pathways in essence involve the conversion of one enzyme form to another; could these forms be treated as metabolites?
    • In the case of signalling pathways, an enzyme may exist in one of two states. The two states would be represented by two different metabolites; we may be able to alter this depending on need.


1. Define pathway data sources

Open question: We need to define sources which specifically deal with signalling pathways.

  • KEGG Pathway
    • KEGG has an API available here: [1]
    • A sample of the available information is here: [2]
      • We can get things like:
        • Orthology (KEGG orthology / KO)
        • Currently assigned pathway
        • Protein/Nucleotide sequences (although these can of course be found elsewhere as well)
  • MetaCyc
    • BioCyc API is available here: [3]
      • A list of useful information available from biocyc is here: [4]
  • Reactome pathway database
  • BRENDA could be used to find new enzyme data for candidate reactions


2. Find out how to create gene lists in target organism from pathways

  • From raw sequence data, the RAST server can be used to generate a list of genes and corresponding reactions.
  • From predefined pathways,


3. Define how to treat connections (networks) and crosstalk

  • Connections are defined in the S matrix, with rows corresponding to metabolites and columns corresponding to reactions. Each reaction is defined by entering the stoichiometric coefficients of its metabolites in the corresponding column.
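
To make the S matrix concrete, here is a minimal sketch that builds a COBRA-like structure as a plain Python dictionary. The field names rxns, S, lb and ub follow the list in the Introduction; the `mets` field for row labels, the function name, the reaction names and the bounds are assumptions added for illustration. Negative coefficients mark consumed metabolites and positive coefficients mark produced ones. The second reaction shows the idea of treating the two states of a signalling enzyme as two "metabolites".

```python
def build_model(reactions, bounds):
    """Build a COBRA-like model dict.

    reactions: {reaction name: {metabolite: stoichiometric coefficient}}
    bounds:    {reaction name: (lower flux bound, upper flux bound)}
    """
    rxns = list(reactions)
    mets = sorted({m for stoich in reactions.values() for m in stoich})
    row = {m: i for i, m in enumerate(mets)}
    # Rows are metabolites, columns are reactions.
    S = [[0.0] * len(rxns) for _ in mets]
    for j, rxn in enumerate(rxns):
        for met, coeff in reactions[rxn].items():
            S[row[met]][j] = coeff
    return {
        "rxns": rxns,
        "mets": mets,
        "S": S,
        "lb": [bounds[r][0] for r in rxns],
        "ub": [bounds[r][1] for r in rxns],
    }

# Hypothetical toy network: one metabolic step (protons omitted) and one
# signalling step in which the two enzyme states act as metabolites.
reactions = {
    "R_hexokinase": {"glucose": -1, "atp": -1, "g6p": 1, "adp": 1},
    "R_kinase_activation": {"kinase_inactive": -1, "kinase_active": 1},
}
bounds = {
    "R_hexokinase": (0.0, 1000.0),             # irreversible
    "R_kinase_activation": (-1000.0, 1000.0),  # reversible
}
model = build_model(reactions, bounds)
for met, coeffs in zip(model["mets"], model["S"]):
    print(f"{met:16s}", coeffs)
```

Crosstalk then shows up as rows with non-zero entries in columns that belong to different pathways.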


4. Define how to represent metadata (names, functions)

Types of metadata:

  • Assigned pathway based on curated databases (KEGG)
  • GO function
  • Lethality
  • Isozymes
  • This information can be stored in separate vectors, one per metadata type. Isozyme information could be stored as connectivity information. Lethality could be stored as binary variables for each reaction.


5. Define how to represent pathway topology and gene position

  • Chromosome
  • Position
    • This information can be found from the aforementioned databases.


6. Define how to deal with gaps.

  • Gaps can be filled using SEED's gap filling algorithm to make complete pathways. We can then search for these pathways using homology studies to see if there are candidate loci from the genomic data.
  • The answer to this question will likely depend on how we represent the pathways. If we stick to the COBRA format, gap filling tools are available [5]
  • We could rewrite gap filling tools specifically in R, or send to an online server (like SEED) for gap filling
  • If we cannot fill the gaps, we can at least identify them as well as identify orphan reactions (reactions for which the substrate cannot be produced)
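
Identifying gaps and orphan reactions, as in the last point above, can be sketched without any gap filling tool. The version below is a deliberately simple, local check and not the SEED or COBRA algorithm: it assumes all reactions are irreversible, takes a hypothetical set of externally supplied metabolites, and flags every reaction that consumes a metabolite which no reaction produces. All names are illustrative.

```python
def find_gaps(reactions, supplied):
    """Return (unproducible metabolites, orphan reactions).

    reactions: {reaction name: {metabolite: stoichiometric coefficient}}
               (negative = consumed, positive = produced; irreversible)
    supplied:  metabolites assumed to be available from outside the model
    """
    produced = set(supplied)
    for stoich in reactions.values():
        produced.update(m for m, c in stoich.items() if c > 0)

    unproducible = set()
    orphan_reactions = []
    for name, stoich in reactions.items():
        missing = {m for m, c in stoich.items() if c < 0 and m not in produced}
        if missing:
            orphan_reactions.append(name)
            unproducible |= missing
    return sorted(unproducible), orphan_reactions

# Hypothetical toy pathway: A -> B -> C is complete, X -> D has no source for X.
reactions = {
    "R1": {"A": -1, "B": 1},
    "R2": {"B": -1, "C": 1},
    "R3": {"X": -1, "D": 1},
}
gaps, orphans = find_gaps(reactions, supplied={"A"})
print(gaps, orphans)
```

A fuller treatment would propagate reachability from the supplied metabolites and handle reversible reactions; this sketch only looks one step back.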


Milestone - Data source and data model defined


7. Define how to deal with determining new related pathways, assuming the annotation server does not complete the job

Let us first define criteria for 'related' in terms of pathways:

  • If we define pathways as all enzymes which share a common node somewhere, we will classify almost all of metabolism into a common pathway. Signalling pathways may be easier to deal with in this manner
  • Genetic context (e.g. gene position) may indicate related function or membership in the same pathway
  • Co-expression can indicate related function, but not necessarily membership in the same pathway. If we define pathways more broadly to include sub-pathways, however, this may still work.
  • Phylogenetic distance could be related to shared function and potentially related pathways.

One difficulty will be to distinguish between genes related to UNIQUE pathways vs RELATED pathways

Using a neural network approach, we can feed the aforementioned inputs into a deep learning system. Depending on the size of the data, we can split it into three sets: (1) training, (2) test and (3) validation. With sufficient data, the network can be trained to improve its ability to cluster based on the given inputs. To extend this further, we can provide additional information that may seem trivial, allowing unexpected correlations to emerge in the hidden nodes.
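
The three-way split mentioned above could be sketched as follows. The fractions (70/15/15), the fixed seed and the function name are assumptions for illustration only; the network itself is not shown.

```python
import random

def split_pairs(pairs, fractions=(0.7, 0.15, 0.15), seed=0):
    """Shuffle labelled gene pairs and split them into three disjoint sets."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)   # seeded, so the split is reproducible
    n_train = round(len(shuffled) * fractions[0])
    n_test = round(len(shuffled) * fractions[1])
    return {
        "training": shuffled[:n_train],
        "test": shuffled[n_train:n_train + n_test],
        "validation": shuffled[n_train + n_test:],   # whatever remains
    }

# Hypothetical gene pairs; in practice each would carry its feature vector
# (mutual information, co-expression, genetic context, ...) and a label.
pairs = [(f"GENE_{i}", f"GENE_{i + 1}") for i in range(20)]
sets = split_pairs(pairs)
print({name: len(members) for name, members in sets.items()})
```

Since pairs that share a gene are not independent, a split by gene rather than by pair may be more appropriate; that choice is left open here.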

This is an interesting question and the answer will likely be determined by the other methods of data collection. Genome annotation will classify annotated genes into respective pathways based on their homology with other genes. Two further difficulties could arise:

  • The annotated gene does not belong to a pathway

  • Algorithms and tools
  • COBRA toolbox (for performing simulations on constraint based models)


Define tool to create pathway directory and tool to download data per pathway.



Milestone - Ready to collect data

  • Data acquisition
Collect data
Milestone - Pathway data downloaded

  • Data analysis
Break down pathway data into per-gene-pair information
Milestone - Pathway data represented in database
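
Breaking pathway data down into per-gene-pair information can be sketched as below, assuming that the downloaded data has already been reduced to a mapping from pathway identifier to member genes. The identifiers and gene names are placeholders.

```python
from collections import defaultdict
from itertools import combinations

def pathway_gene_pairs(pathways):
    """Map each unordered gene pair to the set of pathways both genes share."""
    shared = defaultdict(set)
    for pathway_id, genes in pathways.items():
        # sorted(set(...)) removes duplicates and gives each pair a canonical order
        for g1, g2 in combinations(sorted(set(genes)), 2):
            shared[(g1, g2)].add(pathway_id)
    return dict(shared)

# Hypothetical pathway memberships.
pathways = {
    "PW1": ["GENE_A", "GENE_B", "GENE_C"],
    "PW2": ["GENE_A", "GENE_B"],
}
pair_info = pathway_gene_pairs(pathways)
for pair, ids in sorted(pair_info.items()):
    print(pair, sorted(ids))
```

The number of shared pathways per pair could serve as the edge weight for this track, although large pathways generate many pairs and may need down-weighting.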


Gene Ontology Annotations

  • Data
Define GOA data sources
Milestone - Data source defined

  • Algorithms and tools
Define functional (semantic) similarity measure to use, define threshold, implement algorithm
Milestone - Semantic similarity computable

  • Data acquisition
Download GOA data
Milestone - GOA data downloaded

  • Data analysis
Compute similarity for gene pairs, store results
Milestone - Semantic similarity stored in database
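
The semantic similarity measure is still to be chosen. As one simple candidate, the sketch below computes the Jaccard index of two genes' GO term sets after extending each set with all ancestor terms. This is an assumption for illustration, not the selected measure, and the term identifiers and the small ontology are hypothetical.

```python
def with_ancestors(terms, parents):
    """Return the given terms plus all of their ancestors in the ontology."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed

def jaccard_similarity(terms_a, terms_b, parents):
    """Jaccard index of two ancestor-closed term sets (0 = disjoint, 1 = equal)."""
    a = with_ancestors(terms_a, parents)
    b = with_ancestors(terms_b, parents)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical ontology fragment: child term -> parent terms.
parents = {
    "t_kinase": ["t_catalysis"],
    "t_phosphatase": ["t_catalysis"],
    "t_catalysis": ["t_root"],
    "t_binding": ["t_root"],
}
annotations = {
    "GENE_1": {"t_kinase"},
    "GENE_2": {"t_phosphatase"},
    "GENE_3": {"t_binding"},
}
print(jaccard_similarity(annotations["GENE_1"], annotations["GENE_2"], parents))
print(jaccard_similarity(annotations["GENE_1"], annotations["GENE_3"], parents))
```

Information-content based measures weight rare terms more strongly than this set overlap does; comparing them is part of the "define functional (semantic) similarity measure" task.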

Data Integration

  • Algorithms and tools
Define clustering or other suitable integration algorithm, define parameters, implement
Milestone - Systems can be identified

  • Data analysis
Run data integration
Milestone - Gene lists for systems stored in database
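
The integration algorithm is not yet decided. As a baseline sketch, the code below treats the multigraph as a list of (gene, gene, evidence type, score) edges, collapses parallel edges into one weighted sum per gene pair, and reports connected components above a threshold as candidate systems. The equal evidence weights, the threshold, the undirected treatment of edges and all names are assumptions.

```python
from collections import defaultdict

def combine_evidence(edges, type_weights):
    """Collapse multigraph edges into one combined score per gene pair."""
    combined = defaultdict(float)
    for g1, g2, evidence, score in edges:
        pair = tuple(sorted((g1, g2)))        # undirected: canonical order
        combined[pair] += type_weights[evidence] * score
    return dict(combined)

def discover_systems(combined, threshold):
    """Connected components of the graph of pairs scoring >= threshold."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]   # path halving
            gene = parent[gene]
        return gene

    for (g1, g2), score in combined.items():
        find(g1)
        find(g2)
        if score >= threshold:
            parent[find(g1)] = find(g2)           # union the two components

    groups = defaultdict(set)
    for gene in list(parent):
        groups[find(gene)].add(gene)
    return sorted(sorted(members) for members in groups.values() if len(members) > 1)

# Hypothetical evidence from the four tracks.
edges = [
    ("G1", "G2", "phylogenetic_profile", 0.8),
    ("G1", "G2", "coexpression", 0.9),
    ("G2", "G3", "go_similarity", 0.7),
    ("G2", "G3", "pathway", 1.0),
    ("G3", "G4", "coexpression", 0.3),
    ("G5", "G4", "go_similarity", 0.9),
    ("G4", "G5", "pathway", 1.0),
]
type_weights = {
    "phylogenetic_profile": 0.25,
    "coexpression": 0.25,
    "pathway": 0.25,
    "go_similarity": 0.25,
}
systems = discover_systems(combine_evidence(edges, type_weights), threshold=0.4)
print(systems)
```

Connected components are the crudest notion of "highly connected"; a proper clustering algorithm would also use edge density and could yield the hierarchical organization the project aims for.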



Results

  • Results phase 1 - Interpretation
    • Visualize using graph analysis software (e.g. Cytoscape)
    • Determine intra-system relationships (composition)
    • Determine inter-system relationships (networks and hierarchies)
    • Determine properties from gene-wise annotations
Milestone - Analysis completed

  • Results phase 2 - Publication
(Write up project and submit for publication)
Milestone - TBD


 

Tasks and timeline

...TBD