Data mining




This page is a placeholder, or under current development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


Data mining (or knowledge discovery) is a collection of methods for discovering patterns of interest in large datasets. Like Exploratory Data Analysis, it aims to approach the data without preconceived notions about what the patterns could be. Unlike Exploratory Data Analysis, it relies less on visualization, which lets the investigator discover patterns by inspection, and more on computable descriptions of patterns. Such strategies are especially well suited to situations in which good visual presentations are hard to devise, such as text mining or the mining of phenotype descriptions, or in which the data are very high-dimensional. There is, however, significant overlap between the two concepts. This page focusses predominantly on text mining, but David Reshef's work on the maximal information coefficient deserves special mention.


Introductory reading

Speed (2011) Mathematics. A correlation for the 21st century. Science 334:1502-3. (pmid: 22174235)



Contents

...


Further reading and resources

Reshef et al. (2011) Detecting novel associations in large data sets. Science 334:1518-24. (pmid: 22174245)

Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R²) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.
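
To give a feel for the idea behind MIC, the following is a minimal sketch in Python of a much simpler, related quantity: the mutual information of two variables on a single fixed grid, normalized by the logarithm of the number of bins per axis. It is not the published MIC algorithm, which searches over many grid resolutions and bin placements and reports the maximum normalized score. The function name `grid_mutual_information`, the equal-width binning, and the default of four bins are assumptions made purely for illustration.

```python
import math
from collections import Counter


def grid_mutual_information(xs, ys, bins=4):
    """Normalized mutual information of paired samples on one fixed grid.

    Illustrative only: the real MIC maximizes a score of this kind over
    many grids, whereas this sketch uses a single equal-width
    bins x bins grid (an assumption, not the published method).
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")

    def to_bin(value, lo, hi):
        # Map a value to a bin index in 0 .. bins-1 (equal-width bins).
        if hi == lo:
            return 0
        return min(int((value - lo) / (hi - lo) * bins), bins - 1)

    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    cells = [(to_bin(x, x_lo, x_hi), to_bin(y, y_lo, y_hi))
             for x, y in zip(xs, ys)]

    n = len(cells)
    joint = Counter(cells)                 # counts per grid cell
    px = Counter(i for i, _ in cells)      # counts per x bin
    py = Counter(j for _, j in cells)      # counts per y bin

    mi = 0.0
    for (i, j), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with raw counts
        mi += p_xy * math.log2(count * n / (px[i] * py[j]))

    # Dividing by log2(bins) scales the score to the range 0..1.
    return mi / math.log2(bins)


xs = list(range(100))
linear = [2 * x + 1 for x in xs]
parabola = [(x - 49.5) ** 2 for x in xs]

print(grid_mutual_information(xs, linear))
print(grid_mutual_information(xs, parabola))
```

On this toy data both the linear and the parabolic relationship receive a score well above zero, whereas a linear correlation coefficient would be close to zero for the parabola. This illustrates why an information-based measure can detect associations that are "both functional and not".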

Clegg & Shepherd (2008) Text mining. Methods Mol Biol 453:471-91. (pmid: 18712320)

One of the fastest-growing fields in bioinformatics is text mining: the application of natural language processing techniques to problems of knowledge management and discovery, using large collections of biological or biomedical text such as MEDLINE. The techniques used in text mining range from the very simple (e.g., the inference of relationships between genes from frequent proximity in documents) to the complex and computationally intensive (e.g., the analysis of sentence structures with parsers in order to extract facts about protein-protein interactions from statements in the text). This chapter presents a general introduction to some of the key principles and challenges of natural language processing, and introduces some of the tools available to end-users and developers. A case study describes the construction and testing of a simple tool designed to tackle a task that is crucial to almost any application of text mining in bioinformatics--identifying gene/protein names in text and mapping them onto records in an external database.
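
The "very simple" end of this range, inferring candidate relationships between genes from how often they are mentioned together, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cooccurrence_counts` is hypothetical, the gene symbols `GENEA`, `GENEB` and `GENEC` and the three sentences are made-up placeholders, and matching is by exact, case-insensitive symbol, which ignores the synonym and ambiguity problems that real gene-name recognition has to solve.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, gene_names):
    """Count, for each pair of gene names, the documents mentioning both.

    Matching is a whole-word, case-insensitive search for each symbol.
    Pairs are returned as alphabetically sorted tuples.
    """
    patterns = {
        name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for name in gene_names
    }
    counts = Counter()
    for text in documents:
        # Names found at least once in this document, in sorted order
        found = sorted(n for n, p in patterns.items() if p.search(text))
        # Every unordered pair of found names co-occurs once here
        counts.update(combinations(found, 2))
    return counts


# Placeholder texts; not taken from any real abstract.
abstracts = [
    "GENEA was studied together with GENEB in this screen.",
    "Both GENEB and GENEA appear in the candidate list.",
    "GENEC was measured separately.",
]

pairs = cooccurrence_counts(abstracts, ["GENEA", "GENEB", "GENEC"])
for pair, n_docs in pairs.most_common():
    print(pair, n_docs)
```

Counting at the document level is the coarsest choice; restricting co-occurrence to the same sentence, or to a window of a few words, usually gives fewer but more specific candidate pairs.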

Krallinger et al. (2010) Analysis of biological processes and diseases using text mining approaches. Methods Mol Biol 593:341-82. (pmid: 19957157)

A number of biomedical text mining systems have been developed to extract biologically relevant information directly from the literature, complementing bioinformatics methods in the analysis of experimentally generated data. We provide a short overview of the general characteristics of natural language data, existing biomedical literature databases, and lexical resources relevant in the context of biomedical text mining. A selected number of practically useful systems are introduced together with the type of user queries supported and the results they generate. The extraction of biological relationships, such as protein-protein interactions as well as metabolic and signaling pathways using information extraction systems, will be discussed through example cases of cancer-relevant proteins. Basic strategies for detecting associations of genes to diseases together with literature mining of mutations, SNPs, and epigenetic information (methylation) are described. We provide an overview of disease-centric and gene-centric literature mining methods for linking genes to phenotypic and genotypic aspects. Moreover, we discuss recent efforts for finding biomarkers through text mining and for gene list analysis and prioritization. Some relevant issues for implementing a customized biomedical text mining system will be pointed out. To demonstrate the usefulness of literature mining for the molecular oncology domain, we implemented two cancer-related applications. The first tool consists of a literature mining system for retrieving human mutations together with supporting articles. Specific gene mutations are linked to a set of predefined cancer types. The second application consists of a text categorization system supporting breast cancer-specific literature search and document-based breast cancer gene ranking. Future trends in text mining emphasize the importance of community efforts such as the BioCreative challenge for the development and integration of multiple systems into a common platform provided by the BioCreative Metaserver.

Groth et al. (2010) Phenoclustering: online mining of cross-species phenotypes. Bioinformatics 26:1924-5. (pmid: 20562418)

SUMMARY: Recently, several methods for analyzing phenotype data have been published, but only few are able to cope with data sets generated in different studies, with different methods, or for different species. We developed an online system in which more than 300 000 phenotypes from a wide variety of sources and screening methods can be analyzed together. Clusters of similar phenotypes are visualized as networks of highly similar phenotypes, inducing gene groups useful for functional analysis. This system is part of PhenomicDB, providing the world's largest cross-species phenotype data collection with a tool to mine its wealth of information. AVAILABILITY: Freely available at http://www.phenomicdb.de

Groth et al. (2011) Phenotype mining for functional genomics and gene discovery. Methods Mol Biol 760:159-73. (pmid: 21779996)

In gene prediction, studying phenotypes is highly valuable for reducing the number of locus candidates in association studies and to aid disease gene candidate prioritization. This is due to the intrinsic nature of phenotypes to visibly reflect genetic activity, making them potentially one of the most useful data types for functional studies. However, systematic use of these data has begun only recently. 'Comparative phenomics' is the analysis of genotype-phenotype associations across species and experimental methods. This is an emerging research field of utmost importance for gene discovery and gene function annotation. In this chapter, we review the use of phenotype data in the biomedical field. We will give an overview of phenotype resources, focusing on PhenomicDB--a cross-species genotype-phenotype database--which is the largest available collection of phenotype descriptions across species and experimental methods. We report on its latest extension by which genotype-phenotype relationships can be viewed as graphical representations of similar phenotypes clustered together ('phenoclusters'), supplemented with information from protein-protein interactions and Gene Ontology terms. We show that such 'phenoclusters' represent a novel approach to group genes functionally and to predict novel gene functions with high precision. We explain how these data and methods can be used to supplement the results of gene discovery approaches. The aim of this chapter is to assist researchers interested in understanding how phenotype data can be used effectively in the gene discovery field.