EDA


EDA (Exploratory Data Analysis)


This page is a placeholder, or under current development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


Exploratory Data Analysis (EDA) is a collection of statistical strategies that assist in preparing data for further processing and in generating hypotheses for rigorous follow-up. It draws on methods from both descriptive and inferential statistics and, since one of its core objectives is to find relationships within datasets, it relies on a large variety of data visualization techniques. (See also: Exploratory data analysis)
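As a minimal sketch of the descriptive side of EDA, the following Python snippet computes a five-number summary, flags candidate outliers, and measures the linear relationship between two variables. The function names, the 1.5 x IQR outlier rule, and the sample values are illustrative assumptions and are not taken from any of the tools or papers referenced on this page.

```python
import math
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a sequence of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical measurements for two variables across seven samples.
var_a = [2.1, 2.4, 2.2, 2.8, 3.0, 2.6, 9.5]
var_b = [1.0, 1.3, 1.1, 1.6, 1.9, 1.4, 5.2]

print("summary of var_a:", five_number_summary(var_a))
print("candidate outliers in var_a:", iqr_outliers(var_a))
print("Pearson r between var_a and var_b:", pearson_r(var_a, var_b))
```

A flagged value is only a candidate for inspection: whether it reflects a defect in the data or a real effect is exactly the kind of question exploratory analysis is meant to raise.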



Introductory reading

Wu & Wu (2010) Exploration, visualization, and preprocessing of high-dimensional data. Methods Mol Biol 620:267-84. (pmid: 20652508)

The rapid advances in biotechnology have given rise to a variety of high-dimensional data. Many of these data, including DNA microarray data, mass spectrometry protein data, and high-throughput screening (HTS) assay data, are generated by complex experimental procedures that involve multiple steps such as sample extraction, purification and/or amplification, labeling, fragmentation, and detection. Therefore, the quantity of interest is not directly obtained and a number of preprocessing procedures are necessary to convert the raw data into the format with biological relevance. This also makes exploratory data analysis and visualization essential steps to detect possible defects, anomalies or distortion of the data, to test underlying assumptions and thus ensure data quality. The characteristics of the data structure revealed in exploratory analysis often motivate decisions in preprocessing procedures to produce data suitable for downstream analysis. In this chapter we review the common techniques in exploring and visualizing high-dimensional data and introduce the basic preprocessing procedures.


Strategies

...


Visualization

Krzywinski et al. (2009) Circos: an information aesthetic for comparative genomics. Genome Res 19:1639-45. (pmid: 19541911)

We created a visualization tool called Circos to facilitate the identification and analysis of similarities and differences arising from comparisons of genomes. Our tool is effective in displaying variation in genome structure and, generally, any other kind of positional relationships between genomic intervals. Such data are routinely produced by sequence alignments, hybridization arrays, genome mapping, and genotyping studies. Circos uses a circular ideogram layout to facilitate the display of relationships between pairs of positions by the use of ribbons, which encode the position, size, and orientation of related genomic elements. Circos is capable of displaying data as scatter, line, and histogram plots, heat maps, tiles, connectors, and text. Bitmap or vector images can be created from GFF-style data inputs and hierarchical configuration files, which can be easily generated by automated tools, making Circos suitable for rapid deployment in data analysis and reporting pipelines.


Data Reduction

...


Model Based Exploration

...


Further reading and resources

Schreiber (2008) Visualization. Methods Mol Biol 453:441-50. (pmid: 18712318)

Visualization is a powerful method to present and analyze a large amount of data. It is increasingly important in bioinformatics and is used for exploring different types of molecular biological data, such as structural information, high throughput data, and biochemical networks. This chapter gives a brief introduction to visualization methods for bioinformatics and presents two commonly used techniques in detail: heatmaps and force-directed network layout.

Foulkes & Au (2011) R statistical tools for gene discovery. Methods Mol Biol 760:73-90. (pmid: 21779991)

A wide assortment of R tools are available for exploratory data analysis in high-dimensional settings and are easily applicable to data arising from population-based genetic association studies. In this chapter we illustrate the application of three such approaches, namely conditional inference trees, random forests, and logic regression. Through applications to simulated data, we explore the relative utility of each approach for uncovering underlying association between genetic polymorphisms and a quantitative trait.

Teo (2010) Exploratory data analysis in large-scale genetic studies. Biostatistics 11:70-81. (pmid: 19828557)

Genome-wide association studies (GWAS) have become the method of choice for investigating the genetic basis of common diseases and complex traits. The immense scale of these experiments is unprecedented, involving thousands of samples and up to a million variables. The careful execution of exploratory data analysis (EDA) prior to the actual genotype-phenotype association analysis is crucial as this identifies problematic samples and poorly assayed genetic polymorphisms that, if undetected, can compromise the outcome of the experiment. EDA of such large-scale genetic data sets thus requires specialized numerical and graphical strategies, and this article provides a review of the current exploratory tools commonly used in GWAS.

Azuaje et al. (2005) Non-linear mapping for exploratory data analysis in functional genomics. BMC Bioinformatics 6:13. (pmid: 15661072)

BACKGROUND: Several supervised and unsupervised learning tools are available to classify functional genomics data. However, relatively less attention has been given to exploratory, visualisation-driven approaches. Such approaches should satisfy the following factors: Support for intuitive cluster visualisation, user-friendly and robust application, computational efficiency and generation of biologically meaningful outcomes. This research assesses a relaxation method for non-linear mapping that addresses these concerns. Its applications to gene expression and protein-protein interaction data analyses are investigated. RESULTS: Publicly available expression data originating from leukaemia, round blue-cell tumours and Parkinson disease studies were analysed. The method distinguished relevant clusters and critical analysis areas. The system does not require assumptions about the inherent class structure of the data, its mapping process is controlled by only one parameter and the resulting transformations offer intuitive, meaningful visual displays. Comparisons with traditional mapping models are presented. As a way of promoting potential, alternative applications of the methodology presented, an example of exploratory data analysis of interactome networks is illustrated. Data from the C. elegans interactome were analysed. Results suggest that this method might represent an effective solution for detecting key network hubs and for clustering biologically meaningful groups of proteins. CONCLUSION: A relaxation method for non-linear mapping provided the basis for visualisation-driven analyses using different types of data. This study indicates that such a system may represent a user-friendly and robust approach to exploratory data analysis. It may allow users to gain better insights into the underlying data structure, detect potential outliers and assess assumptions about the cluster composition of the data.

Ivakhno & Armstrong (2007) Non-linear dimensionality reduction of signaling networks. BMC Syst Biol 1:27. (pmid: 17559646)

BACKGROUND: Systems wide modeling and analysis of signaling networks is essential for understanding complex cellular behaviors, such as the biphasic responses to different combinations of cytokines and growth factors. For example, tumor necrosis factor (TNF) can act as a proapoptotic or prosurvival factor depending on its concentration, the current state of signaling network and the presence of other cytokines. To understand combinatorial regulation in such systems, new computational approaches are required that can take into account non-linear interactions in signaling networks and provide tools for clustering, visualization and predictive modeling. RESULTS: Here we extended and applied an unsupervised non-linear dimensionality reduction approach, Isomap, to find clusters of similar treatment conditions in two cell signaling networks: (I) apoptosis signaling network in human epithelial cancer cells treated with different combinations of TNF, epidermal growth factor (EGF) and insulin and (II) combination of signal transduction pathways stimulated by 21 different ligands based on AfCS double ligand screen data. For the analysis of the apoptosis signaling network we used the Cytokine compendium dataset where activity and concentration of 19 intracellular signaling molecules were measured to characterise apoptotic response to TNF, EGF and insulin. By projecting the original 19-dimensional space of intracellular signals into a low-dimensional space, Isomap was able to reconstruct clusters corresponding to different cytokine treatments that were identified with graph-based clustering. In comparison, Principal Component Analysis (PCA) and Partial Least Squares - Discriminant analysis (PLS-DA) were unable to find biologically meaningful clusters. We also showed that by using Isomap components for supervised classification with k-nearest neighbor (k-NN) and quadratic discriminant analysis (QDA), apoptosis intensity can be predicted for different combinations of TNF, EGF and insulin. Prediction accuracy was highest when early activation time points in the apoptosis signaling network were used to predict apoptosis rates at later time points. Extended Isomap also outperformed PCA on the AfCS double ligand screen data. Isomap identified more functionally coherent clusters than PCA and captured more information in the first two-components. The Isomap projection performs slightly worse when more signaling networks are analyzed; suggesting that the mapping function between cues and responses becomes increasingly non-linear when large signaling pathways are considered. CONCLUSION: We developed and applied extended Isomap approach for the analysis of cell signaling networks. Potential biological applications of this method include characterization, visualization and clustering of different treatment conditions (i.e. low and high doses of TNF) in terms of changes in intracellular signaling they induce.

Speed (2011) Mathematics. A correlation for the 21st century. Science 334:1502-3. (pmid: 22174235)

Reshef et al. (2011) Detecting novel associations in large data sets. Science 334:1518-24. (pmid: 22174245)

Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R²) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.

(Also see: Supporting Online Material)
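
To illustrate why dependence measures beyond linear correlation are of interest in exploratory work, the sketch below compares Pearson's r with a crude mutual information estimate on a deterministic but non-linear relationship (y = x²). This is not an implementation of MIC: the fixed equal-width binning, the choice of four bins, and the function names are simplifying assumptions made for illustration only.

```python
import math
from collections import Counter


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def binned_mutual_information(xs, ys, bins=4):
    """Mutual information in bits after equal-width binning of both variables."""

    def to_bins(values):
        low, high = min(values), max(values)
        width = (high - low) / bins or 1.0  # guard against constant input
        return [min(int((v - low) / width), bins - 1) for v in values]

    bx, by = to_bins(xs), to_bins(ys)
    n = len(xs)
    joint, count_x, count_y = Counter(zip(bx, by)), Counter(bx), Counter(by)
    # Sum p(i, j) * log2(p(i, j) / (p(i) * p(j))) over the occupied cells.
    return sum(
        (c / n) * math.log2(c * n / (count_x[i] * count_y[j]))
        for (i, j), c in joint.items()
    )


# A symmetric parabola: y is fully determined by x, yet not linearly related.
xs = [i / 10 for i in range(-20, 21)]
ys = [x * x for x in xs]

print("Pearson r:", pearson_r(xs, ys))
print("binned mutual information (bits):", binned_mutual_information(xs, ys))
```

On this data Pearson's r is essentially zero while the binned mutual information is clearly positive, which is the kind of relationship a purely linear measure would miss.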