BCB420 2015 Project Plan
The 2015 Computational Systems Biology Project
Systems Discovery From Public Database Information
ABSTRACT
Systems can be defined as sets of all entities that jointly contribute to a function (or, more generally, an objective). While the notion of function in molecular biology remains a topic of discussion, collaborating entities, such as biomolecules, complexes and higher-order assemblies, may be identifiable from shared properties, annotations, evolutionary history, expression patterns and other experimental observables. Here we integrate observations from a number of publicly available databases to discover the composition and hierarchical organization of human cellular systems.
The systems discovery proceeds along four tracks: (i) phylogenetic profiles capture the mutual information between genes; (ii) gene coexpression patterns identify regulatory units; (iii) gene sets imported from known signalling and regulatory pathways contribute expert knowledge; and (iv) semantic similarity of GO annotations identifies functional similarity. All of these measures provide categorical and continuous data about functional similarities between pairs of genes. This information can be conceptualized as a weighted (directed) multigraph. Clustering, other evidence-combining strategies or machine learning can be applied to this dataset to discover systems as defined above.
An analysis of such systems will describe their composition, their relationships, and their presumed properties.
Scope
- Goal 1 - Discover systems
- Build (four) data processing pipelines that contribute shared-function information for genes
- Aggregate data as multigraph
- Discover systems as sets of highly connected genes
- Goal 2 - Visualize/interpret systems
- Visualize using graph analysis software (e.g. Cytoscape)
- Determine intra-system relationships (composition)
- Determine inter-system relationships (networks and hierarchies)
- Determine properties from gene-wise annotations
- Goal 3 - Disseminate
- Disseminate methods (e.g. as R-package)
- Disseminate results (e.g. as publication)
Preparations
- Preparations I - Project Plan
- The project plan will be drafted, discussed, reviewed and iteratively revised until it is precisely defined in every single step, including all required resources. It will be continuously refined as information becomes available. We need to define the objectives of the project, its partial goals and how to achieve them. This includes definition of data, data sources, analysis workflow, visualizations and their respective implementation as workflows. Special emphasis will be on the interpretation of the results. We will strive to achieve a meaningful conclusion that can be written up for a publication.
- Milestone - Project plan defined
- Preparations II - Organization
- All project participants need access to a number of collaborative tools for sharing information and project results.
- Milestone - Workflow for documentation and result sharing established
- Preparations III - Roles and Tasks
- The class will comprise students with widely varying skills and backgrounds. We need to identify available skills, but students will also define their own learning objectives for the project, through which they push the boundaries of their knowledge and capabilities as far as possible. Based on this survey of resources, we will break down the project into work packages and assign them, keeping redundancy and contingencies in mind.
- Milestone - Packages assigned
Implementation
Phylogenetic Profile
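The abstract describes this track as harvesting mutual information between genes from their phylogenetic profiles. A minimal sketch of that measure follows, assuming each profile is a presence/absence vector over the same ordered list of species. The function name and the toy profiles are hypothetical and only illustrate the calculation; they are not part of the project's pipeline.

```python
from collections import Counter
from math import log2


def mutual_information(profile_a, profile_b):
    """Mutual information (in bits) between two equal-length profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of species")
    if not profile_a:
        raise ValueError("profiles must not be empty")
    n = len(profile_a)
    joint = Counter(zip(profile_a, profile_b))  # counts of (a, b) states
    marg_a = Counter(profile_a)                 # counts of a states
    marg_b = Counter(profile_b)                 # counts of b states
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * log2(count * n / (marg_a[a] * marg_b[b]))
    return mi


# Hypothetical toy profiles: 1 = ortholog present in a species, 0 = absent.
gene_x = [1, 1, 0, 0, 1, 1, 0, 0]
gene_y = [1, 1, 0, 0, 1, 1, 0, 0]  # same pattern as gene_x
gene_z = [1, 0, 1, 0, 1, 0, 1, 0]  # statistically independent of gene_x here

mi_xy = mutual_information(gene_x, gene_y)
mi_xz = mutual_information(gene_x, gene_z)
```

Genes that are gained and lost together across species score high, while genes whose occurrence is unrelated score near zero. With real data the number of species is finite, so small positive values can arise by chance and would need a significance cutoff.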
Gene Coexpression
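The abstract states that gene coexpression patterns identify regulatory units. One common way to score coexpression is the Pearson correlation between expression vectors; the sketch below assumes that choice, which the plan itself does not prescribe. The function names, the threshold and the toy expression values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant vector has no defined correlation")
    return cov / sqrt(var_x * var_y)


def coexpression_edges(expression, threshold):
    """Return (gene_a, gene_b, r) for every gene pair with |r| >= threshold."""
    genes = sorted(expression)
    edges = []
    for i, gene_a in enumerate(genes):
        for gene_b in genes[i + 1:]:
            r = pearson(expression[gene_a], expression[gene_b])
            if abs(r) >= threshold:
                edges.append((gene_a, gene_b, r))
    return edges


# Hypothetical expression levels across four conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
edges = coexpression_edges(expression, threshold=0.9)
```

The all-pairs loop is quadratic in the number of genes, so a genome-scale run would likely need a vectorized or blocked implementation.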
Pathways
Q: Are web-based APIs the best way to access data? Maybe we need a way to download data in bulk?
We first define two types of pathways, since each will be handled differently:
1. Flux-carrying pathways (enzymes that convert metabolites)
2. Non-flux-carrying pathways (signalling cascades, cell regulatory pathways, etc.)
Open question: We need to define sources that deal specifically with signalling pathways.
3. Define how to treat connections (networks) and crosstalk
Types of metadata:
Let us first define a criterion for 'related' in terms of pathways:
One difficulty will be to distinguish between genes that belong to UNIQUE pathways and genes that belong to RELATED pathways. With a neural network approach, the aforementioned inputs can be fed to a deep learning system. Depending on the size of the data, we can split it into three sets:
1. Training
2. Test
3. Validation
With sufficient data, the neural network can be trained to improve the algorithm's ability to cluster based on the given inputs. To extend this further, we can provide additional information that may seem trivial, so that unexpected correlations can emerge in the hidden nodes.
This is an interesting question, and the answer will likely be determined by the other methods of data collection. Genome annotation will classify annotated genes into their respective pathways based on their homology with other genes. Two further difficulties could arise:
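The split into training, test and validation data mentioned in this section could be sketched as follows. This is only an illustration under stated assumptions: the function name, the default fractions and the toy gene pairs are hypothetical, and the plan does not specify how the data should be partitioned.

```python
import random


def three_way_split(items, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle items reproducibly and split into (train, validation, test)."""
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1:
        raise ValueError("fractions must lie between 0 and 1")
    if train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    shuffled = list(items)                     # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]          # whatever remains
    return train, validation, test


# Hypothetical labelled examples: pairs of gene identifiers.
gene_pairs = [("gene%d" % i, "gene%d" % (i + 1)) for i in range(20)]
train, validation, test = three_way_split(gene_pairs)
```

Splitting by gene pair can leak information when the same gene appears in several sets, so a split by gene or by pathway may be the safer choice for this kind of data.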
Gene Ontology Annotations
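The abstract names semantic similarity of GO annotations as the fourth track. As a deliberately simple stand-in, the sketch below scores two genes by the Jaccard overlap of their annotated terms after adding all ancestor terms. This is an assumption made for illustration: established measures such as Resnik similarity weight terms by information content instead. The ontology, term names and gene annotations are hypothetical toy data, not real GO identifiers.

```python
def with_ancestors(terms, parents):
    """Return the given terms plus all of their ancestors in the ontology DAG."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


def annotation_similarity(terms_a, terms_b, parents):
    """Jaccard index of the two ancestor-closed annotation sets."""
    closure_a = with_ancestors(terms_a, parents)
    closure_b = with_ancestors(terms_b, parents)
    union = closure_a | closure_b
    if not union:
        return 0.0  # neither gene is annotated
    return len(closure_a & closure_b) / len(union)


# Hypothetical toy ontology, given as child -> list of parents.
parents = {
    "root": [],
    "signalling": ["root"],
    "metabolism": ["root"],
    "kinase_cascade": ["signalling"],
    "glycolysis": ["metabolism"],
}
annotations = {
    "geneA": {"kinase_cascade"},
    "geneB": {"signalling"},
    "geneC": {"glycolysis"},
}
sim_ab = annotation_similarity(annotations["geneA"], annotations["geneB"], parents)
sim_ac = annotation_similarity(annotations["geneA"], annotations["geneC"], parents)
```

Genes annotated to nearby terms share most of their ancestors and score high, while genes that only share the root score low.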
Data Integration
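The abstract proposes to treat the pairwise evidence from all tracks as a weighted multigraph and to discover systems by clustering or other evidence-combining strategies. A minimal sketch of that idea follows, under several assumptions that the plan does not fix: edges are treated as undirected, parallel edges are combined by averaging the tracks that report a pair, and a "system" is taken to be a connected component above a weight threshold. All names, weights and the threshold are hypothetical.

```python
from collections import defaultdict


def build_multigraph(evidence):
    """Group (gene_a, gene_b, track, weight) records by unordered gene pair."""
    graph = defaultdict(dict)
    for gene_a, gene_b, track, weight in evidence:
        pair = tuple(sorted((gene_a, gene_b)))
        graph[pair][track] = weight  # one parallel edge per track
    return graph


def combined_weights(graph):
    """Collapse parallel edges by averaging over the tracks that report the pair."""
    return {pair: sum(tracks.values()) / len(tracks)
            for pair, tracks in graph.items()}


def discover_systems(weights, threshold):
    """Connected components using only edges with weight >= threshold."""
    neighbours = defaultdict(set)
    for (gene_a, gene_b), weight in weights.items():
        if weight >= threshold:
            neighbours[gene_a].add(gene_b)
            neighbours[gene_b].add(gene_a)
    systems, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            gene = stack.pop()
            if gene not in component:
                component.add(gene)
                stack.extend(neighbours[gene] - component)
        seen |= component
        systems.append(sorted(component))
    return systems


# Hypothetical evidence records from the four tracks.
evidence = [
    ("geneA", "geneB", "phylogenetic_profile", 0.9),
    ("geneA", "geneB", "coexpression", 0.7),
    ("geneB", "geneC", "pathway", 1.0),
    ("geneC", "geneD", "go_similarity", 0.2),
    ("geneD", "geneE", "coexpression", 0.8),
]
graph = build_multigraph(evidence)
systems = discover_systems(combined_weights(graph), threshold=0.5)
```

Connected components are the crudest notion of "highly connected"; a density-based or modularity-based clustering would be a natural refinement once real weights are available.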
Results
- Results phase 1 - Interpretation
- Visualize using graph analysis software (e.g. Cytoscape)
- Determine intra-system relationships (composition)
- Determine inter-system relationships (networks and hierarchies)
- Determine properties from gene-wise annotations
- Milestone - Analysis completed
- Results phase 2 - Publication
- (Write up project and submit for publication)
- Milestone - TBD
Tasks and timeline
...TBD