BCB420 2015 Project Plan

The 2015 Computational Systems Biology Project

ABSTRACT

Systems can be defined as sets of all entities that jointly contribute to a function (or more generally: objective). While the notion of function in molecular biology is a topic of discussion, collaborating entities - such as biomolecules, complexes and higher-order assemblies - may be identifiable from shared properties, annotations, evolutionary history, expression patterns and other experimental observables. Here we integrate observations from a number of publicly available databases to discover the composition and hierarchical organization of human cellular systems.

The systems discovery proceeds along four tracks: (i) phylogenetic profiles harvest mutual information of genes; (ii) gene coexpression patterns identify regulatory units; (iii) importing gene sets from known signalling and regulatory pathways utilizes expert knowledge; and (iv) semantic similarity of GO annotations identifies functional similarity. All of these measures provide categorical and continuous data about functional similarities between pairs of genes. This information can be conceptualized as a weighted (directed) multigraph. Clustering, other evidence-combining or machine learning strategies can be applied to this dataset to discover systems as defined above.
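
As a minimal sketch of this representation, the block below stores one edge per gene pair and evidence track, so a pair can carry several parallel edges. The class name, the track labels, and the example weights are illustrative assumptions; the sketch also treats edges as undirected, which is a simplification of the (directed) multigraph mentioned above.

```python
from collections import defaultdict

# Hypothetical labels for the four evidence tracks named above.
TRACKS = ("phylo", "coexpr", "pathway", "go")


class EvidenceMultigraph:
    """Undirected weighted multigraph: one edge per (gene pair, track)."""

    def __init__(self):
        # (gene_a, gene_b) -> {track: weight}
        self._edges = defaultdict(dict)

    @staticmethod
    def _key(a, b):
        # Sort the pair so (a, b) and (b, a) address the same edge set.
        return (a, b) if a <= b else (b, a)

    def add_evidence(self, a, b, track, weight):
        if track not in TRACKS:
            raise ValueError(f"unknown track: {track}")
        self._edges[self._key(a, b)][track] = weight

    def evidence(self, a, b):
        # Return a copy of all parallel edges between the two genes.
        return dict(self._edges.get(self._key(a, b), {}))


# Made-up weights, for illustration only.
g = EvidenceMultigraph()
g.add_evidence("TP53", "MDM2", "pathway", 1.0)
g.add_evidence("MDM2", "TP53", "coexpr", 0.62)
print(g.evidence("TP53", "MDM2"))
```

Keeping the tracks separate on each pair leaves the choice of evidence-combining strategy open until the integration step.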

An analysis of such systems will describe their composition, their relationships, and their presumed properties.

Scope

Goal 1 - Discover systems
- Build (four) data processing pipelines that contribute shared-function information for genes
- Aggregate data as multigraph
- Discover systems as sets of highly connected genes

Goal 2 - Visualize/interpret systems
- Visualize using graph analysis software (e.g. Cytoscape)
- Determine intra-system relationships (composition)
- Determine inter-system relationships (networks and hierarchies)
- Determine properties from gene-wise annotations

Goal 3 - Disseminate
- Disseminate methods (e.g. as R-package)
- Disseminate results (e.g. as publication)

Preparations

Preparations I - Project Plan

The project plan will be drafted, discussed, reviewed and iteratively revised until it is precisely defined in every single step, including all required resources. It will be continuously refined as information becomes available. We need to define the objectives of the project, its partial goals and how to achieve them. This includes definition of data, data sources, analysis workflow, visualizations and their respective implementation as workflows. Special emphasis will be on the interpretation of the results. We will strive to achieve a meaningful conclusion that can be written up for a publication.

Milestone - Project plan defined

Preparations II - Organization

All project participants need access to a number of collaborative tools for sharing information and project results.

Milestone - Workflow for documentation and result sharing established

Preparations III - Roles and Tasks

The class will comprise students with widely varying skills and backgrounds. We need to identify available skills, but students will also define their own learning objectives for the project, through which they push the boundaries of their knowledge and capabilities as far as possible. Based on this survey of resources, we will break down the project into work packages and assign them, keeping redundancy and contingencies in mind.

Milestone - Packages assigned

Implementation

Phylogenetic Profile

Data

Define a well distributed set of species; define suitable genes (or gene sets) for analysis.

Milestone - Input data defined

Algorithms and tools

Define procedure for determining orthology, install databases and tools, define algorithm for mutual information, address undersampling issue (many more genes than species), address threshold issue (when is a shared function implied).

Milestone - Workflow defined, resources prepared

Data acquisition

Compute phylogenetic profiles

Milestone - Profiles computed, stored

Data analysis

Calculate mutual information, output results
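
One possible form of this calculation, assuming profiles are encoded as presence/absence vectors over the same ordered set of species, is the plug-in estimate of mutual information sketched below. The function name and the example profiles are hypothetical. The plug-in estimate is biased when few species are sampled, which is the undersampling issue noted above and is not corrected here.

```python
import math
from collections import Counter


def mutual_information(profile_a, profile_b):
    """Plug-in mutual information (bits) of two equal-length profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same species")
    n = len(profile_a)
    joint = Counter(zip(profile_a, profile_b))
    count_a = Counter(profile_a)
    count_b = Counter(profile_b)
    mi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b) simplifies to n_ab * n / (count_a * count_b).
        mi += p_ab * math.log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical profiles over eight species (1 = ortholog present).
gene_x = [1, 1, 0, 0, 1, 0, 1, 0]
gene_y = [1, 1, 0, 0, 1, 0, 1, 0]  # identical to gene_x
gene_z = [1, 0, 1, 0, 1, 0, 0, 1]  # statistically independent of gene_x

print(mutual_information(gene_x, gene_y))
print(mutual_information(gene_x, gene_z))
```

Identical profiles give the entropy of the profile itself, while independent profiles give zero; the threshold at which shared function is implied still has to be chosen.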

Milestone - Results in database

Gene Coexpression

Data

Define suitable platforms and perturbation conditions, find datasets, define normalization procedure to make datasets comparable.

Milestone - Acquisition of expression profiles defined

Algorithms and tools

Define co-expression measure (correlation), implement in software, decide on significance threshold

Milestone - Co-expression computable

Data acquisition

Retrieve data from GEO

Milestone - Data ready for analysis

Data analysis

Run co-expression measure, apply threshold
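
A minimal sketch of this step, assuming Pearson correlation as the co-expression measure and already normalized expression vectors of equal length per gene. The function names, the toy data, and the threshold value of 0.8 are placeholders, not decisions made by the plan.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A constant profile has no defined correlation; treat as no evidence.
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def coexpressed_pairs(expression, threshold):
    """Return {(gene_a, gene_b): r} for all pairs with |r| >= threshold."""
    genes = sorted(expression)
    pairs = {}
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(expression[g1], expression[g2])
            if abs(r) >= threshold:
                pairs[(g1, g2)] = r
    return pairs


# Toy expression values across four conditions (illustrative only).
expression = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8],
    "geneC": [4, 1, 3, 2],
}
pairs = coexpressed_pairs(expression, threshold=0.8)
print(pairs)
```

The all-pairs loop is quadratic in the number of genes, so a genome-scale run would likely need a vectorized or chunked implementation.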

Milestone - Results in database

Pathways

Data

Q: Are web-based APIs the best way to access data? Maybe we need a way to download data in bulk?

Introduction

We first define two types of pathways, as they will be treated separately:
1. Flux-carrying pathways (enzymes that convert metabolites)
2. Non-flux-carrying pathways (signalling cascades, cell regulatory pathways, etc.)

Pathway data is commonly expressed as models in the COBRA format. The model is a structure with the following fields:
.rxns (all reaction names)
.S (connectivity matrix)
.ub (upper bound on flux)
.lb (lower bound on flux)

Non-flux-carrying pathways can also be represented in this way, but more work may be needed to make the representation robust. Signalling pathways in essence involve the conversion of one enzyme form into another; could these forms be considered metabolites?
- In the case of signalling pathways, an enzyme may have two binary states. The two states would be represented by two different metabolites; we may be able to alter this depending on need.
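
To illustrate the idea, the sketch below writes a two-state signalling switch as a small structure with the fields listed above. It is a plain Python dictionary, not a COBRA Toolbox object; the "mets" field is added so that rows can be named, and all metabolite names, reaction names, and bounds are hypothetical.

```python
# Toy model mirroring the fields .rxns, .S, .lb and .ub listed above.
model = {
    "mets": ["kinase_inactive", "kinase_active"],
    "rxns": ["activation", "deactivation"],
    # Rows are metabolites, columns are reactions.
    "S": [
        [-1, 1],  # kinase_inactive: consumed by activation, produced by deactivation
        [1, -1],  # kinase_active: produced by activation, consumed by deactivation
    ],
    "lb": [0.0, 0.0],        # lower flux bounds (placeholder values)
    "ub": [1000.0, 1000.0],  # upper flux bounds (placeholder values)
}


def reaction_equation(model, rxn_name):
    """Render one column of S as 'substrates -> products' (coefficients omitted)."""
    j = model["rxns"].index(rxn_name)
    substrates, products = [], []
    for met, row in zip(model["mets"], model["S"]):
        coeff = row[j]
        if coeff < 0:
            substrates.append(met)
        elif coeff > 0:
            products.append(met)
    return " + ".join(substrates) + " -> " + " + ".join(products)


print(reaction_equation(model, "activation"))
print(reaction_equation(model, "deactivation"))
```

In this encoding each state change is an ordinary reaction column, so tools that read the S matrix need no special case for signalling.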

1. Define pathway data sources

Open question: We need to define sources which specifically deal with signalling pathways.

KEGG Pathway
- KEGG has an API available here: [1]
- A sample of the available information is here: [2]
  - We can get things like:
    - Orthology (KEGG Orthology / KO)
    - Currently assigned pathway
    - Protein/Nucleotide sequences (although these can of course be found elsewhere as well)

MetaCyc
- BioCyc API is available here: [3]
  - A list of useful information available from BioCyc is here: [4]

Reactome pathway database

BRENDA could be used to find new enzyme data for candidate reactions

2. Find out how to create gene lists in target organism from pathways

From raw sequence data, the RAST server can be used to generate a list of genes and corresponding reactions.
From predefined pathways,

3. Define how to treat connections (networks) and crosstalk

This is defined in the S matrix, with rows corresponding to metabolites and columns corresponding to reactions. Each reaction is defined by entering the stoichiometric coefficients of its metabolites in its column (negative for consumed metabolites, positive for produced ones).

4. Define how to represent metadata (names, functions)

Types of metadata:

Assigned pathway based on curated databases (KEGG)
GO function
Lethality
Isozymes

Each type of information can be stored in its own vector. Isozyme information could be stored as connectivity information. Lethality could be stored as a binary variable for each reaction.

5. Define how to represent pathway topology and gene position

Chromosome
Position
- This information can be found from the aforementioned databases.

6. Define how to deal with gaps.

Gaps can be filled using SEED's gap filling algorithm to make complete pathways. We can then search for these pathways using homology studies to see if there are candidate loci from the genomic data.
The answer to this question will likely depend on how we represent the pathways. If we stick to the COBRA format, there are gap filling tools [5].
We could rewrite gap filling tools specifically in R, or send the model to an online server (like SEED) for gap filling.
If we cannot fill the gaps, we can at least identify them, as well as identify orphan reactions (reactions for which the substrate cannot be produced).
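
A minimal sketch of the identification step, using the definition of orphan reactions given above. It assumes that all reactions are irreversible and that uptake is modelled as explicit reactions producing the metabolite; the function name and the toy network are hypothetical. It only checks direct producers and does not propagate through chains of blocked reactions.

```python
def orphan_reactions(mets, rxns, S):
    """Return [(reaction, [unproduced substrates])] for a stoichiometric matrix S.

    S has one row per metabolite and one column per reaction; negative
    entries mark substrates and positive entries mark products.
    """
    # A metabolite counts as produced if any reaction has a positive coefficient for it.
    produced = {m for m, row in zip(mets, S) if any(c > 0 for c in row)}
    orphans = []
    for j, rxn in enumerate(rxns):
        missing = [m for m, row in zip(mets, S)
                   if row[j] < 0 and m not in produced]
        if missing:
            orphans.append((rxn, missing))
    return orphans


# Toy network: uptake_A (-> A), r1 (A -> B), r2 (C -> D); nothing produces C.
mets = ["A", "B", "C", "D"]
rxns = ["uptake_A", "r1", "r2"]
S = [
    [1, -1, 0],   # A
    [0, 1, 0],    # B
    [0, 0, -1],   # C
    [0, 0, 1],    # D
]
print(orphan_reactions(mets, rxns, S))
```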

Milestone - Data source and data model defined

7. Define how to deal with determining new related pathways, assuming the annotation server does not complete the job

Let us first define criteria for 'related' in terms of pathways:

If we define pathways as all enzymes which share a common node somewhere, we will classify almost all of metabolism into a common pathway. Signalling pathways may be easier to deal with in this manner.
Genetic context (e.g. gene position) may indicate related function or membership in the same pathway.
Co-expression can indicate related function, but not necessarily membership in the same pathway, although this may still work if we define the pathway more broadly to include sub-pathways.
Phylogenetic distance could be related to shared function and potentially related pathways.

One difficulty will be to distinguish between genes related to UNIQUE pathways vs RELATED pathways

Using a neural network approach, we can feed the aforementioned inputs to a deep learning system. Depending on the size of the data, we can split it into three sets: 1. Training 2. Test 3. Validation. With sufficient data, the neural network can be trained to improve the algorithm's ability to cluster based on the given inputs. To extend this even further, we can provide additional information that may seem trivial, to allow unexpected correlations to emerge in the hidden nodes.
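
The three-way split could be sketched as follows. The 70/15/15 proportions, the fixed seed, and the function name are assumptions for illustration; the plan does not specify them.

```python
import random


def split_dataset(items, fractions=(0.7, 0.15, 0.15), seed=0):
    """Shuffle items reproducibly and split into (training, test, validation)."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * fractions[0])
    n_test = round(n * fractions[1])
    training = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    # The validation set takes whatever remains, so no item is lost to rounding.
    validation = shuffled[n_train + n_test:]
    return training, test, validation


# Twenty hypothetical gene-pair identifiers.
gene_pairs = [f"pair_{i}" for i in range(20)]
training, test, validation = split_dataset(gene_pairs)
print(len(training), len(test), len(validation))
```

A random split over gene pairs can place the same gene in several sets; whether that leaks information between sets depends on the features chosen and would need to be checked.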

This is an interesting question and the answer will likely be determined by the other methods of data collection. Genome annotation will classify annotated genes into respective pathways based on their homology with other genes. Two further difficulties could arise:

The annotated gene does not belong to a pathway

Algorithms and tools
COBRA Toolbox (for performing simulations on constraint-based models)

Define tool to create pathway directory and tool to download data per pathway.

Milestone - Ready to collect data

Data acquisition

Collect data

Milestone - Pathway data downloaded

Data analysis

Break down pathway data into per-gene-pair information
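
One simple way to do this, assuming each pathway has already been reduced to a list of member genes, is to emit every unordered pair of co-members together with the pathways they share. The function name and the pathway and gene identifiers below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pathway_gene_pairs(pathways):
    """Map each unordered gene pair to the set of pathways containing both genes."""
    shared = defaultdict(set)
    for name, genes in pathways.items():
        # Deduplicate and sort so each pair has one canonical orientation.
        for a, b in combinations(sorted(set(genes)), 2):
            shared[(a, b)].add(name)
    return dict(shared)


pathways = {
    "pathway_1": ["g1", "g2", "g3"],
    "pathway_2": ["g2", "g3", "g4"],
}
pair_info = pathway_gene_pairs(pathways)
for pair in sorted(pair_info):
    print(pair, sorted(pair_info[pair]))
```

The number of shared pathways per pair could serve as a categorical or count-valued edge weight; this ignores pathway topology, which would need a separate treatment.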

Milestone - Pathway data represented in database

Gene Ontology Annotations

Data

Define GOA data sources

Milestone - Data source defined

Algorithms and tools

Define functional (semantic) similarity measure to use, define threshold, implement algorithm

Milestone - Semantic similarity computable

Data acquisition

Download GOA data

Milestone - GOA data downloaded

Data analysis

Compute similarity for gene pairs, store results
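
Since the similarity measure is still to be defined, the sketch below uses a deliberately simple stand-in: the Jaccard index of two genes' annotation sets after each set has been extended with all ancestor terms. The toy ontology, the term names, and the function names are hypothetical and are not real GO identifiers.

```python
def with_ancestors(terms, parents):
    """Close a set of terms under the parent relation (term -> list of parents)."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, []))
    return closed


def jaccard_similarity(terms_a, terms_b, parents):
    """Jaccard index of two ancestor-closed annotation sets."""
    a = with_ancestors(terms_a, parents)
    b = with_ancestors(terms_b, parents)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy ontology: two branches under a common root.
parents = {
    "root": [],
    "metabolism": ["root"],
    "signalling": ["root"],
    "glycolysis": ["metabolism"],
    "gluconeogenesis": ["metabolism"],
}
annotations = {
    "gene1": {"glycolysis"},
    "gene2": {"gluconeogenesis"},
    "gene3": {"signalling"},
}
print(jaccard_similarity(annotations["gene1"], annotations["gene2"], parents))
print(jaccard_similarity(annotations["gene1"], annotations["gene3"], parents))
```

Genes annotated in the same branch score higher than genes that only share the root. An information-content-based measure would weight specific terms more heavily than this set overlap does.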

Milestone - Semantic similarity stored in database

Data Integration

Algorithms and tools

Define clustering or other suitable integration algorithm, define parameters, implement

Milestone - Systems can be identified

Data analysis

Run data integration
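
As a placeholder for the integration algorithm, which is still to be defined, the sketch below averages the available evidence per gene pair, keeps pairs above a threshold, and reports connected components as candidate systems. It assumes that all scores have been scaled to the range 0 to 1; the function name, the scores, and the threshold of 0.6 are illustrative. Connected components are a much weaker criterion than "highly connected" and stand in for a proper clustering method.

```python
def discover_systems(pair_scores, threshold):
    """Connected components of the graph of pairs with mean score >= threshold."""
    neighbours = {}
    for (a, b), scores in pair_scores.items():
        if not scores:
            continue
        combined = sum(scores.values()) / len(scores)
        if combined >= threshold:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)

    systems, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first traversal collects one component.
        component, stack = set(), [start]
        while stack:
            gene = stack.pop()
            if gene not in component:
                component.add(gene)
                stack.extend(neighbours[gene] - component)
        seen |= component
        systems.append(sorted(component))
    return systems


# Made-up per-track scores for a handful of gene pairs.
pair_scores = {
    ("g1", "g2"): {"coexpr": 0.9, "go": 0.7},
    ("g2", "g3"): {"pathway": 1.0},
    ("g3", "g4"): {"coexpr": 0.2},
    ("g5", "g6"): {"go": 0.8, "phylo": 0.6},
}
systems = discover_systems(pair_scores, threshold=0.6)
print(systems)
```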

Milestone - Gene lists for systems stored in database

Results

Results phase 1 - Interpretation

Visualize using graph analysis software (e.g. Cytoscape)
Determine intra-system relationships (composition)
Determine inter-system relationships (networks and hierarchies)
Determine properties from gene-wise annotations

Milestone - Analysis completed

Results phase 2 - Publication

(Write up project and submit for publication)

Milestone - TBD

Tasks and timeline

...TBD
