Revision as of 20:43, 26 January 2012
EDA (Exploratory Data Analysis)
Exploratory Data Analysis (EDA) is a collection of statistical strategies that assist in preparing data for further processing and in generating hypotheses for rigorous follow-up. It draws on methods from both descriptive and inferential statistics and, since one of its core objectives is to find relationships within datasets, it employs a large variety of data visualization techniques. (See also: Exploratory data analysis)
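As a minimal sketch of the descriptive side of EDA, the following Python fragment summarizes one variable, flags possible outliers, and measures the linear relationship between two variables. The function names (`summarize`, `iqr_outliers`, `pearson`) and the 1.5 × IQR rule are illustrative assumptions, not something this page prescribes; a real analysis would normally add plots such as histograms and scatterplots.

```python
import math
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
        "stdev": statistics.stdev(values),
    }


def iqr_outliers(values):
    """Flag values beyond 1.5 interquartile ranges from the quartiles.

    The 1.5 * IQR fence is a common convention, assumed here for illustration.
    """
    stats = summarize(values)
    spread = stats["q3"] - stats["q1"]
    low = stats["q1"] - 1.5 * spread
    high = stats["q3"] + 1.5 * spread
    return [v for v in values if v < low or v > high]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)
```

Flagged values are candidates for inspection rather than automatic removal; deciding whether an extreme value is an error or a finding is part of the exploratory step.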
Further reading and resources
Teo (2010) Exploratory data analysis in large-scale genetic studies. Biostatistics 11:70-81. (pmid: 19828557)
Genome-wide association studies (GWAS) have become the method of choice for investigating the genetic basis of common diseases and complex traits. The immense scale of these experiments is unprecedented, involving thousands of samples and up to a million variables. The careful execution of exploratory data analysis (EDA) prior to the actual genotype-phenotype association analysis is crucial as this identifies problematic samples and poorly assayed genetic polymorphisms that, if undetected, can compromise the outcome of the experiment. EDA of such large-scale genetic data sets thus requires specialized numerical and graphical strategies, and this article provides a review of the current exploratory tools commonly used in GWAS.
Azuaje et al. (2005) Non-linear mapping for exploratory data analysis in functional genomics. BMC Bioinformatics 6:13. (pmid: 15661072)
BACKGROUND: Several supervised and unsupervised learning tools are available to classify functional genomics data. However, relatively less attention has been given to exploratory, visualisation-driven approaches. Such approaches should satisfy the following factors: Support for intuitive cluster visualisation, user-friendly and robust application, computational efficiency and generation of biologically meaningful outcomes. This research assesses a relaxation method for non-linear mapping that addresses these concerns. Its applications to gene expression and protein-protein interaction data analyses are investigated. RESULTS: Publicly available expression data originating from leukaemia, round blue-cell tumours and Parkinson disease studies were analysed. The method distinguished relevant clusters and critical analysis areas. The system does not require assumptions about the inherent class structure of the data, its mapping process is controlled by only one parameter and the resulting transformations offer intuitive, meaningful visual displays. Comparisons with traditional mapping models are presented. As a way of promoting potential, alternative applications of the methodology presented, an example of exploratory data analysis of interactome networks is illustrated. Data from the C. elegans interactome were analysed. Results suggest that this method might represent an effective solution for detecting key network hubs and for clustering biologically meaningful groups of proteins. CONCLUSION: A relaxation method for non-linear mapping provided the basis for visualisation-driven analyses using different types of data. This study indicates that such a system may represent a user-friendly and robust approach to exploratory data analysis. It may allow users to gain better insights into the underlying data structure, detect potential outliers and assess assumptions about the cluster composition of the data.
Wu & Wu (2010) Exploration, visualization, and preprocessing of high-dimensional data. Methods Mol Biol 620:267-84. (pmid: 20652508)
The rapid advances in biotechnology have given rise to a variety of high-dimensional data. Many of these data, including DNA microarray data, mass spectrometry protein data, and high-throughput screening (HTS) assay data, are generated by complex experimental procedures that involve multiple steps such as sample extraction, purification and/or amplification, labeling, fragmentation, and detection. Therefore, the quantity of interest is not directly obtained and a number of preprocessing procedures are necessary to convert the raw data into the format with biological relevance. This also makes exploratory data analysis and visualization essential steps to detect possible defects, anomalies or distortion of the data, to test underlying assumptions and thus ensure data quality. The characteristics of the data structure revealed in exploratory analysis often motivate decisions in preprocessing procedures to produce data suitable for downstream analysis. In this chapter we review the common techniques in exploring and visualizing high-dimensional data and introduce the basic preprocessing procedures.