Clustering

From "A B C"
Clustering and Classification


This page is a placeholder or is currently under development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


Clustering and classification are conceptually related statistical techniques: clustering attempts to identify groupings in existing data, whereas classification asks how best to assign new data to existing groups. However, the techniques employed are quite different. Clustering techniques seek partitions in which members of the same set are more closely related to each other than to members of different sets. They are often divided into connectivity-based approaches (e.g. hierarchical clustering) and centroid-based approaches (e.g. K-means), but other approaches, such as density-based clustering (e.g. DBSCAN) or the flow-based MCL algorithm, are increasingly important in our field. Classification techniques rely heavily on machine learning methods, often in a Bayesian framework: neural networks, support vector machines, hidden Markov models, decision trees, etc.
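
The distinction between the two tasks can be illustrated with a minimal sketch in Python. It is not a recommended implementation: the toy data, the function names (nearest, kmeans), and the choice to start the centroids at the first k points are assumptions made for illustration only. Clustering is shown with a plain K-means (Lloyd's algorithm); classification is reduced to its simplest form, assigning a new point to the nearest existing centroid, which stands in for the more capable machine learning methods named above.

```python
import math


def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def kmeans(points, k, iterations=100):
    """Plain K-means (Lloyd's algorithm).

    Assumption for this sketch: centroids are initialised to the first k
    points, which keeps the example deterministic. Real applications
    usually use several random restarts or a smarter initialisation.
    """
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach every point to its closest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:  # converged: nothing moved
            break
        centroids = updated
    # Final labels, consistent with the returned centroids.
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical toy data: two well-separated groups of 2D points.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.9), (0.8, 1.1), (4.8, 5.1)]

# Clustering: discover the groups in the existing data.
centroids, labels = kmeans(points, k=2)
print("cluster labels:", labels)

# Classification (simplest form): assign new data to the existing groups.
new_point = (4.5, 4.7)
print("new point assigned to cluster", nearest(new_point, centroids))
```

Note that K-means requires the number of clusters to be chosen in advance and its result can depend on the starting centroids; connectivity-based, density-based, and flow-based methods make different assumptions.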



Introductory reading

Nugent & Meila (2010) An overview of clustering applied to molecular biology. Methods Mol Biol 620:369-404. (pmid: 20652512)

In molecular biology, we are often interested in determining the group structure in, e.g., a population of cells or microarray gene expression data. Clustering methods identify groups of similar observations, but the results can depend on the chosen method's assumptions and starting parameter values. In this chapter, we give a broad overview of both attribute- and similarity-based clustering, describing both the methods and their performance. The parametric and nonparametric approaches presented vary in whether or not they require knowing the number of clusters in advance as well as the shapes of the estimated clusters. Additionally, we include a biclustering algorithm that incorporates variable selection into the clustering procedure. We finish with a discussion of some common methods for comparing two clustering solutions (possibly from different methods). The user is advised to devote time and attention to determining the appropriate clustering approach (and any corresponding parameter values) for the specific application prior to analysis.


Further reading and resources

Xu & Wunsch (2010) Clustering algorithms in biomedical research: a review. IEEE Rev Biomed Eng 3:120-54. (pmid: 22275205)

Applications of clustering algorithms in biomedical research are ubiquitous, with typical examples including gene expression data analysis, genomic sequence analysis, biomedical document mining, and MRI image analysis. However, due to the diversity of cluster analysis, the differing terminologies, goals, and assumptions underlying different clustering algorithms can be daunting. Thus, determining the right match between clustering algorithms and biomedical applications has become particularly important. This paper is presented to provide biomedical researchers with an overview of the status quo of clustering algorithms, to illustrate examples of biomedical applications based on cluster analysis, and to help biomedical researchers select the most suitable clustering algorithms for their own applications.

van Dongen & Abreu-Goodger (2012) Using MCL to extract clusters from networks. Methods Mol Biol 804:281-95. (pmid: 22144159)

MCL is a general purpose cluster algorithm for both weighted and unweighted networks. The algorithm utilises network topology as well as edge weights, is highly scalable and has been applied in a wide variety of bioinformatic methods. In this chapter, we give protocols and case studies for clustering of networks derived from, respectively, protein sequence similarities and gene expression profile correlations.

Frades & Matthiesen (2010) Overview on techniques in cluster analysis. Methods Mol Biol 593:81-107. (pmid: 19957146)

Clustering is the unsupervised, semisupervised, and supervised classification of patterns into groups. The clustering problem has been addressed in many contexts and disciplines. Cluster analysis encompasses different methods and algorithms for grouping objects of similar kinds into respective categories. In this chapter, we describe a number of methods and algorithms for cluster analysis in a stepwise framework. The steps of a typical clustering analysis process include sequentially pattern representation, the choice of the similarity measure, the choice of the clustering algorithm, the assessment of the output, and the representation of the clusters.

Yona et al. (2009) Comparing algorithms for clustering of expression data: how to assess gene clusters. Methods Mol Biol 541:479-509. (pmid: 19381534)

Clustering is a popular technique commonly used to search for groups of similarly expressed genes using mRNA expression data. There are many different clustering algorithms and the application of each one will usually produce different results. Without additional evaluation, it is difficult to determine which solutions are better. In this chapter we discuss methods to assess algorithms for clustering of gene expression data. In particular, we present a new method that uses two elements: an internal index of validity based on the MDL principle and an external index of validity that measures the consistency with experimental data. Each one is used to suggest an effective set of models, but it is only the combination of both that is capable of pinpointing the best model overall. Our method can be used to compare different clustering algorithms and pick the one that maximizes the correlation with functional links in gene networks while minimizing the error rate. We test our methods on several popular clustering algorithms as well as on clustering algorithms that are specially tailored to deal with noisy data. Finally, we propose methods for assessing the significance of individual clusters and study the correspondence between gene clusters and biochemical pathways.

McLachlan et al. (2008) Clustering. Methods Mol Biol 453:423-39. (pmid: 18712317)

Clustering techniques are used to arrange genes in some natural way, that is, to organize genes into groups or clusters with similar behavior across relevant tissue samples (or cell lines). These techniques can also be applied to tissues rather than genes. Methods such as hierarchical agglomerative clustering, k-means clustering, the self-organizing map, and model-based methods have been used. This chapter focuses on mixtures of normals to provide a model-based clustering of tissue samples (gene signatures) and gene profiles.

Alonzo & Pepe (2007) Development and evaluation of classifiers. Methods Mol Biol 404:89-116. (pmid: 18450047)

Diagnostic tests, medical tests, screening tests, biomarkers, and prediction rules are all types of classifiers. This chapter introduces methods for classifier development and evaluation. We first introduce measures of classification performance including sensitivity, specificity, and receiver operating characteristic (ROC) curves. We then review some issues in the design of studies to assess and compare the performance of classifiers. Approaches for using the data to estimate and compare classifier accuracy are then introduced. Next, methods for combining multiple classifiers into a single classifier are presented. Lastly, we discuss other important aspects of classifier development and evaluation. The methods presented are illustrated with real data.