Data integration


This page is a placeholder, or under current development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


This page discusses how data elements are connected (tightly or loosely), how datasets can be combined to improve annotation quality, the principles of evidence combination, and ontologies.


Introductory reading

Masuya et al. (2011) The RIKEN integrated database of mammals. Nucleic Acids Res 39:D861-70. (pmid: 21076152)

The RIKEN integrated database of mammals (http://scinets.org/db/mammal) is the official undertaking to integrate its mammalian databases produced from multiple large-scale programs that have been promoted by the institute. The database integrates not only RIKEN's original databases, such as FANTOM, the ENU mutagenesis program, the RIKEN Cerebellar Development Transcriptome Database and the Bioresource Database, but also imported data from public databases, such as Ensembl, MGI and biomedical ontologies. Our integrated database has been implemented on the infrastructure of publication medium for databases, termed SciNetS/SciNeS, or the Scientists' Networking System, where the data and metadata are structured as a semantic web and are downloadable in various standardized formats. The top-level ontology-based implementation of mammal-related data directly integrates the representative knowledge and individual data records in existing databases to ensure advanced cross-database searches and reduced unevenness of the data management operations. Through the development of this database, we propose a novel methodology for the development of standardized comprehensive management of heterogeneous data sets in multiple databases to improve the sustainability, accessibility, utility and publicity of the data of biomedical information.


Cross referencing

Biological identifiers can be cross-referenced only if their semantics allow one to be mapped to the other...
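As a minimal sketch of why semantics matter for cross-referencing, the Python example below builds a lookup table between two identifier namespaces and only returns a mapping when it is unambiguous. The identifiers (GENE:0001, PROT:A1, ...) and the function names are hypothetical and chosen for illustration; they do not come from any real database.

```python
from collections import defaultdict

# Hypothetical cross-reference records: (gene identifier, protein identifier).
# One gene may map to several proteins (e.g. isoforms), so the mapping
# gene -> protein is not always one-to-one.
XREFS = [
    ("GENE:0001", "PROT:A1"),
    ("GENE:0001", "PROT:A2"),
    ("GENE:0002", "PROT:B1"),
]


def build_index(pairs):
    """Return (forward, reverse) lookup tables for a list of ID pairs."""
    forward = defaultdict(set)
    reverse = defaultdict(set)
    for source_id, target_id in pairs:
        forward[source_id].add(target_id)
        reverse[target_id].add(source_id)
    return forward, reverse


def map_unambiguous(identifier, index):
    """Return the single mapped identifier, or None if the identifier
    is unknown or maps to more than one target."""
    targets = index.get(identifier, set())
    if len(targets) == 1:
        return next(iter(targets))
    return None


forward, reverse = build_index(XREFS)
print(map_unambiguous("PROT:A1", reverse))    # protein -> gene is unique here
print(map_unambiguous("GENE:0001", forward))  # gene -> protein is ambiguous
```

In this toy table a protein identifier always resolves to one gene, whereas a gene identifier may not resolve to one protein; whether such a one-to-many mapping is acceptable depends on what the identifiers are meant to denote.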


Evidence combination

Annotation quality can be improved by drawing on evidence from multiple sources. But how should such evidence be weighted and combined? How can confidence scores be constructed?
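One common heuristic for combining evidence, shown here only as an illustration and not as a method prescribed by this page, is the "noisy-OR" rule: if each source reports a probability-like score and the sources are assumed to be independent, the combined confidence is one minus the probability that every source is wrong. The function name, the optional per-source weights and the example scores are assumptions made for this sketch.

```python
import math


def combine_noisy_or(scores, weights=None):
    """Combine per-source confidence scores in [0, 1] with a noisy-OR rule.

    Assumes the sources are independent. Each optional weight in [0, 1]
    discounts the corresponding source (0 ignores it, 1 trusts it fully).
    """
    if weights is None:
        weights = [1.0] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("scores and weights must have the same length")
    for value in list(scores) + list(weights):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores and weights must lie in [0, 1]")
    # Probability that all sources are wrong, under independence.
    all_wrong = math.prod(1.0 - w * s for s, w in zip(scores, weights))
    return 1.0 - all_wrong


# Two hypothetical sources, each moderately confident in the same annotation.
print(combine_noisy_or([0.5, 0.5]))
# The same scores, with the second source fully discounted.
print(combine_noisy_or([0.5, 0.5], weights=[1.0, 0.0]))
```

The independence assumption is the weak point of this rule: sources that share underlying data will be counted more than once, which is one reason weighting schemes are usually calibrated against a benchmark.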


Ontologies

Ontologies describe the semantics of a field of knowledge. They are indispensable for data integration across non-uniform databases. Read more about Ontologies here.
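To illustrate how an ontology can help when databases annotate at different levels of detail, the sketch below stores a toy "is a" hierarchy and checks whether two terms are compatible, meaning that they are equal or one is an ancestor of the other. The hierarchy, the term names and the function names are assumptions made for this example and are not taken from any actual ontology release.

```python
# Toy "is a" hierarchy: each term maps to the set of its direct parents.
IS_A = {
    "mitotic cell cycle": {"cell cycle"},
    "cell cycle": {"cellular process"},
    "DNA replication": {"cellular process"},
    "cellular process": set(),
}


def ancestors(term, is_a):
    """Return all terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_compatible(term_a, term_b, is_a):
    """True if the terms are equal or one is an ancestor of the other."""
    return (
        term_a == term_b
        or term_a in ancestors(term_b, is_a)
        or term_b in ancestors(term_a, is_a)
    )


# One database says "cell cycle", another says "mitotic cell cycle":
print(is_compatible("cell cycle", "mitotic cell cycle", IS_A))
# Sibling branches are not treated as compatible:
print(is_compatible("DNA replication", "mitotic cell cycle", IS_A))
```

Without the shared hierarchy the two annotations in the first call would look like a mismatch; with it, the more general term can be recognized as consistent with the more specific one.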


Further reading and resources

Chen et al. (2013) Integrating human omics data to prioritize candidate genes. BMC Med Genomics 6:57. (pmid: 24344781)

BACKGROUND: The identification of genes involved in human complex diseases remains a great challenge in computational systems biology. Although methods have been developed to use disease phenotypic similarities with a protein-protein interaction network for the prioritization of candidate genes, other valuable omics data sources have been largely overlooked in these methods. METHODS: With this understanding, we proposed a method called BRIDGE to prioritize candidate genes by integrating disease phenotypic similarities with such omics data as protein-protein interactions, gene sequence similarities, gene expression patterns, gene ontology annotations, and gene pathway memberships. BRIDGE utilizes a multiple regression model with lasso penalty to automatically weight different data sources and is capable of discovering genes associated with diseases whose genetic bases are completely unknown. RESULTS: We conducted large-scale cross-validation experiments and demonstrated that more than 60% known disease genes can be ranked top one by BRIDGE in simulated linkage intervals, suggesting the superior performance of this method. We further performed two comprehensive case studies by applying BRIDGE to predict novel genes and transcriptional networks involved in obesity and type II diabetes. CONCLUSION: The proposed method provides an effective and scalable way for integrating multi omics data to infer disease genes. Further applications of BRIDGE will be benefit to providing novel disease genes and underlying mechanisms of human diseases.

Mason et al. (2014) Characterizing multi-omic data in systems biology. Adv Exp Med Biol 799:15-38. (pmid: 24292960)

In today's biology, studies have shifted to analyzing systems over discrete biochemical reactions and pathways. These studies depend on combining the results from scores of experimental methods that analyze DNA; mRNA; noncoding RNAs, DNA, RNA, and protein interactions; and the nucleotide modifications that form the epigenome into global datasets that represent a diverse array of "omics" data (transcriptional, epigenetic, proteomic, metabolomic). The methods used to collect these data consist of high-throughput data generation platforms that include high-content screening, imaging, flow cytometry, mass spectrometry, and nucleic acid sequencing. Of these, the next-generation DNA sequencing platforms predominate because they provide an inexpensive and scalable way to quickly interrogate the molecular changes at the genetic, epigenetic, and transcriptional level. Furthermore, existing and developing single-molecule sequencing platforms will likely make direct RNA and protein measurements possible, thus increasing the specificity of current assays and making it possible to better characterize "epi-alterations" that occur in the epigenome and epitranscriptome. These diverse data types present us with the largest challenge: how do we develop software systems and algorithms that can integrate these datasets and begin to support a more democratic model where individuals can capture and track their own medical information through biometric devices and personal genome sequencing? Such systems will need to provide the necessary user interactions to work with the trillions of data points needed to make scientific discoveries. Here, we describe novel approaches in the genesis and processing of such data, models to integrate these data, and the increasing ubiquity of self-reporting and self-measured genomics and health data.

Wang et al. (2013) Accelerating cancer systems biology research through Semantic Web technology. Wiley Interdiscip Rev Syst Biol Med 5:135-51. (pmid: 23188758)

Cancer systems biology is an interdisciplinary, rapidly expanding research field in which collaborations are a critical means to advance the field. Yet the prevalent database technologies often isolate data rather than making it easily accessible. The Semantic Web has the potential to help facilitate web-based collaborative cancer research by presenting data in a manner that is self-descriptive, human and machine readable, and easily sharable. We have created a semantically linked online Digital Model Repository (DMR) for storing, managing, executing, annotating, and sharing computational cancer models. Within the DMR, distributed, multidisciplinary, and inter-organizational teams can collaborate on projects, without forfeiting intellectual property. This is achieved by the introduction of a new stakeholder to the collaboration workflow, the institutional licensing officer, part of the Technology Transfer Office. Furthermore, the DMR has achieved silver level compatibility with the National Cancer Institute's caBIG, so users can interact with the DMR not only through a web browser but also through a semantically annotated and secure web service. We also discuss the technology behind the DMR leveraging the Semantic Web, ontologies, and grid computing to provide secure inter-institutional collaboration on cancer modeling projects, online grid-based execution of shared models, and the collaboration workflow protecting researchers' intellectual property.

Antezana et al. (2013) The emergence of Semantic Systems Biology. N Biotechnol 30:286-90. (pmid: 23165099)

Over the past decade the biological sciences have been widely embracing Systems Biology and its various data integration approaches to discover new knowledge. Molecular Systems Biology aims to develop hypotheses based on integrated, or modelled data. These hypotheses can be subsequently used to design new experiments for testing, leading to an improved understanding of the biology; a more accurate model of the biological system and therefore an improved ability to develop hypotheses. During the same period the biosciences have also eagerly taken up the emerging Semantic Web as evidenced by the dedicated exploitation of Semantic Web technologies for data integration and sharing in the Life Sciences. We describe how these two approaches merged in Semantic Systems Biology: a data integration and analysis approach complementary to model-based Systems Biology. Semantic Systems Biology augments the integration and sharing of knowledge, and opens new avenues for computational support in quality checking and automated reasoning, and to develop new, testable hypotheses.

Wang et al. (2011) Integrating multiple types of data to predict novel cell cycle-related genes. BMC Syst Biol 5 Suppl 1:S9. (pmid: 21689484)

BACKGROUND: Cellular functions depend on genetic, physical and other types of interactions. As such, derived interaction networks can be utilized to discover novel genes involved in specific biological processes. Epistatic Miniarray Profile, or E-MAP, which is an experimental platform that measures genetic interactions on a genome-wide scale, has successfully recovered known pathways and revealed novel protein complexes in Saccharomyces cerevisiae (budding yeast). RESULTS: By combining E-MAP data with co-expression data, we first predicted a potential cell cycle related gene set. Using Gene Ontology (GO) function annotation as a benchmark, we demonstrated that the prediction by combining microarray and E-MAP data is generally >50% more accurate in identifying co-functional gene pairs than the prediction using either data source alone. We also used transcription factor (TF)-DNA binding data (Chip-chip) and protein phosphorylation data to construct a local cell cycle regulation network based on potential cell cycle related gene set we predicted. Finally, based on the E-MAP screening with 48 cell cycle genes crossing 1536 library strains, we predicted four unknown genes (YPL158C, YPR174C, YJR054W, and YPR045C) as potential cell cycle genes, and analyzed them in detail. CONCLUSION: By integrating E-MAP and DNA microarray data, potential cell cycle-related genes were detected in budding yeast. This integrative method significantly improves the reliability of identifying co-functional gene pairs. In addition, the reconstructed network sheds light on both the function of known and predicted genes in the cell cycle process. Finally, our strategy can be applied to other biological processes and species, given the availability of relevant data.

Kahlem et al. (2009) ENFIN--A European network for integrative systems biology. C R Biol 332:1050-8. (pmid: 19909926)

Integration of biological data of various types and the development of adapted bioinformatics tools represent critical objectives to enable research at the systems level. The European Network of Excellence ENFIN is engaged in developing an adapted infrastructure to connect databases, and platforms to enable both the generation of new bioinformatics tools and the experimental validation of computational predictions. With the aim of bridging the gap existing between standard wet laboratories and bioinformatics, the ENFIN Network runs integrative research projects to bring the latest computational techniques to bear directly on questions dedicated to systems biology in the wet laboratory environment. The Network maintains internally close collaboration between experimental and computational research, enabling a permanent cycling of experimental validation and improvement of computational prediction methods. The computational work includes the development of a database infrastructure (EnCORE), bioinformatics analysis methods and a novel platform for protein function analysis FuncNet.

Sauro & Bergmann (2008) Standards and ontologies in computational systems biology. Essays Biochem 45:211-22. (pmid: 18793134)

With the growing importance of computational models in systems biology there has been much interest in recent years to develop standard model interchange languages that permit biologists to easily exchange models between different software tools. In the present chapter two chief model exchange standards, SBML (Systems Biology Markup Language) and CellML are described. In addition, other related features including visual layout initiatives, ontologies and best practices for model annotation are discussed. Software tools such as developer libraries and basic editing tools are also introduced, together with a discussion on the future of modelling languages and visualization tools in systems biology.

Strömbäck et al. (2007) A review of standards for data exchange within systems biology. Proteomics 7:857-67. (pmid: 17370264)

The rapid increase in experimental data within systems biology has increased the need for exchange of data to allow analysis and comparison of larger datasets. This has resulted in a need for standardized formats for representation of such results and currently many formats for representation of data have been developed or are under development. In this paper, we give an overview of the current state of available standards and ontologies within systems biology. We focus on XML-based standards for exchange of data and give a thorough description of similarities and differences of currently available formats. For each of these, we discuss how the important concepts such as substances, interactions, and experimental data can be represented. In particular, we note that the purpose of a standard is often visible in the structures it provides for the representation of data. A clear purpose is also crucial for the success of a standard. Moreover, we note that the development of representation formats is parallel to the development of ontologies and the recent trend is that representation formats make more and more use of available ontologies.