Difference between revisions of "Data integration"


Revision as of 23:15, 26 January 2012

Data integration


This page is a placeholder or is currently under development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


This page discusses how data elements are connected (tightly or loosely), how combining datasets can improve annotation quality, the principles of evidence combination, and ontologies.


Introductory reading

Masuya et al. (2011) The RIKEN integrated database of mammals. Nucleic Acids Res 39:D861-70. (pmid: 21076152)

The RIKEN integrated database of mammals (http://scinets.org/db/mammal) is the official undertaking to integrate its mammalian databases produced from multiple large-scale programs that have been promoted by the institute. The database integrates not only RIKEN's original databases, such as FANTOM, the ENU mutagenesis program, the RIKEN Cerebellar Development Transcriptome Database and the Bioresource Database, but also imported data from public databases, such as Ensembl, MGI and biomedical ontologies. Our integrated database has been implemented on the infrastructure of publication medium for databases, termed SciNetS/SciNeS, or the Scientists' Networking System, where the data and metadata are structured as a semantic web and are downloadable in various standardized formats. The top-level ontology-based implementation of mammal-related data directly integrates the representative knowledge and individual data records in existing databases to ensure advanced cross-database searches and reduced unevenness of the data management operations. Through the development of this database, we propose a novel methodology for the development of standardized comprehensive management of heterogeneous data sets in multiple databases to improve the sustainability, accessibility, utility and publicity of the data of biomedical information.


Contents

Cross referencing

Biological identifiers can be cross-referenced only if their semantics allow mapping from one to the other...
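
As a minimal sketch of what such a mapping could look like, assume a hypothetical table of cross-reference records that link identifiers in one namespace to identifiers in another. The namespaces (`gene`, `protein`), the identifiers and the helper names below are invented placeholders for illustration, not accessions or functions from any real database.

```python
from collections import defaultdict

# Hypothetical cross-reference records: (source namespace, source ID,
# target namespace, target ID). The identifiers are made-up placeholders.
XREFS = [
    ("gene", "G0001", "protein", "P0001"),
    ("gene", "G0001", "protein", "P0002"),  # one gene, two protein products
    ("gene", "G0002", "protein", "P0003"),
]


def build_index(xrefs):
    """Index the records in both directions so either ID can be looked up."""
    index = defaultdict(set)
    for src_ns, src_id, dst_ns, dst_id in xrefs:
        index[(src_ns, src_id, dst_ns)].add(dst_id)
        index[(dst_ns, dst_id, src_ns)].add(src_id)
    return index


def map_id(index, ns, identifier, target_ns):
    """Return all identifiers in target_ns linked to the given identifier."""
    return sorted(index.get((ns, identifier, target_ns), set()))


index = build_index(XREFS)
print(map_id(index, "gene", "G0001", "protein"))  # ['P0001', 'P0002']
print(map_id(index, "protein", "P0003", "gene"))  # ['G0002']
print(map_id(index, "gene", "G0003", "protein"))  # []
```

The lookup returns a list rather than a single value because a mapping between namespaces need not be one-to-one: whether one identifier corresponds to zero, one or several identifiers elsewhere depends on what each identifier actually denotes.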


Evidence combination

Annotation quality can be improved if evidence can be used from varying sources. But how can such evidence be weighted and combined? How can confidence scores be constructed?
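
One simple way to approach these questions, shown here only as a sketch, is to treat each source's score as the probability that the annotation is correct and to assume the sources are independent. The combined confidence is then 1 minus the product of (1 - p) over all sources, a scheme often called noisy-OR. The independence assumption, the source names and the scores below are illustrative assumptions, not a method prescribed by this page.

```python
import math


def combine_independent(scores):
    """Noisy-OR combination: 1 - prod(1 - p_i) over all sources."""
    scores = list(scores)
    for p in scores:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"score out of range [0, 1]: {p}")
    return 1.0 - math.prod(1.0 - p for p in scores)


# Hypothetical confidence scores for one annotation from three sources.
evidence = {"coexpression": 0.5, "genetic_interaction": 0.5, "text_mining": 0.2}
combined = combine_independent(evidence.values())
print(round(combined, 3))  # 0.8
```

Under this scheme every additional supporting source can only raise the combined score, and no source can lower it. If sources are correlated or can contradict each other, a different model (for example, weighting sources by their measured reliability) would be needed.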


Ontologies

Ontologies describe the semantics of a field of knowledge. They are indispensable for data integration across non-uniform databases. Read more about Ontologies here.
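
To illustrate how an ontology can support integration, the following sketch models a toy ontology as a set of "is_a" relations and checks whether two annotation terms, perhaps taken from different databases, are related through the hierarchy. The term names and relations are a simplified, hypothetical fragment chosen for illustration and are not an excerpt from a real ontology.

```python
# Hypothetical "is_a" relations: child term -> set of parent terms.
IS_A = {
    "mitotic cell cycle": {"cell cycle"},
    "meiotic cell cycle": {"cell cycle"},
    "cell cycle": {"cellular process"},
    "cellular process": set(),
}


def ancestors(term, is_a):
    """Return every term reachable from `term` by following is_a links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_related(term_a, term_b, is_a):
    """True if the terms are equal or one is an ancestor of the other."""
    return (
        term_a == term_b
        or term_a in ancestors(term_b, is_a)
        or term_b in ancestors(term_a, is_a)
    )


print(sorted(ancestors("mitotic cell cycle", IS_A)))
# ['cell cycle', 'cellular process']
print(is_related("cell cycle", "meiotic cell cycle", IS_A))  # True
print(is_related("mitotic cell cycle", "meiotic cell cycle", IS_A))  # False
```

Two records annotated at different levels of detail can thus be recognized as compatible, while sibling terms that merely share a parent are kept apart.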


Further reading and resources

Wang et al. (2011) Integrating multiple types of data to predict novel cell cycle-related genes. BMC Syst Biol 5 Suppl 1:S9. (pmid: 21689484)

BACKGROUND: Cellular functions depend on genetic, physical and other types of interactions. As such, derived interaction networks can be utilized to discover novel genes involved in specific biological processes. Epistatic Miniarray Profile, or E-MAP, which is an experimental platform that measures genetic interactions on a genome-wide scale, has successfully recovered known pathways and revealed novel protein complexes in Saccharomyces cerevisiae (budding yeast). RESULTS: By combining E-MAP data with co-expression data, we first predicted a potential cell cycle related gene set. Using Gene Ontology (GO) function annotation as a benchmark, we demonstrated that the prediction by combining microarray and E-MAP data is generally >50% more accurate in identifying co-functional gene pairs than the prediction using either data source alone. We also used transcription factor (TF)-DNA binding data (Chip-chip) and protein phosphorylation data to construct a local cell cycle regulation network based on potential cell cycle related gene set we predicted. Finally, based on the E-MAP screening with 48 cell cycle genes crossing 1536 library strains, we predicted four unknown genes (YPL158C, YPR174C, YJR054W, and YPR045C) as potential cell cycle genes, and analyzed them in detail. CONCLUSION: By integrating E-MAP and DNA microarray data, potential cell cycle-related genes were detected in budding yeast. This integrative method significantly improves the reliability of identifying co-functional gene pairs. In addition, the reconstructed network sheds light on both the function of known and predicted genes in the cell cycle process. Finally, our strategy can be applied to other biological processes and species, given the availability of relevant data.

Sauro & Bergmann (2008) Standards and ontologies in computational systems biology. Essays Biochem 45:211-22. (pmid: 18793134)

With the growing importance of computational models in systems biology there has been much interest in recent years to develop standard model interchange languages that permit biologists to easily exchange models between different software tools. In the present chapter two chief model exchange standards, SBML (Systems Biology Markup Language) and CellML are described. In addition, other related features including visual layout initiatives, ontologies and best practices for model annotation are discussed. Software tools such as developer libraries and basic editing tools are also introduced, together with a discussion on the future of modelling languages and visualization tools in systems biology.

Strömbäck et al. (2007) A review of standards for data exchange within systems biology. Proteomics 7:857-67. (pmid: 17370264)

The rapid increase in experimental data within systems biology has increased the need for exchange of data to allow analysis and comparison of larger datasets. This has resulted in a need for standardized formats for representation of such results and currently many formats for representation of data have been developed or are under development. In this paper, we give an overview of the current state of available standards and ontologies within systems biology. We focus on XML-based standards for exchange of data and give a thorough description of similarities and differences of currently available formats. For each of these, we discuss how the important concepts such as substances, interactions, and experimental data can be represented. In particular, we note that the purpose of a standard is often visible in the structures it provides for the representation of data. A clear purpose is also crucial for the success of a standard. Moreover, we note that the development of representation formats is parallel to the development of ontologies and the recent trend is that representation formats make more and more use of available ontologies.

Kahlem et al. (2009) ENFIN--A European network for integrative systems biology. C R Biol 332:1050-8. (pmid: 19909926)

Integration of biological data of various types and the development of adapted bioinformatics tools represent critical objectives to enable research at the systems level. The European Network of Excellence ENFIN is engaged in developing an adapted infrastructure to connect databases, and platforms to enable both the generation of new bioinformatics tools and the experimental validation of computational predictions. With the aim of bridging the gap existing between standard wet laboratories and bioinformatics, the ENFIN Network runs integrative research projects to bring the latest computational techniques to bear directly on questions dedicated to systems biology in the wet laboratory environment. The Network maintains internally close collaboration between experimental and computational research, enabling a permanent cycling of experimental validation and improvement of computational prediction methods. The computational work includes the development of a database infrastructure (EnCORE), bioinformatics analysis methods and a novel platform for protein function analysis FuncNet.