Statistics

This page is a placeholder, or under current development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


Statistics is indispensable for describing and analysing large data sets, such as those commonly encountered in computational biology. Broadly, three areas apply: descriptive statistics, which summarizes the features of a data set; inferential statistics, which quantifies the significance of observations; and probability theory, which provides the theoretical basis for drawing such inferences. In practice we often apply procedures of Exploratory Data Analysis (EDA) to find interesting features in our data and to devise hypotheses and strategies for their analysis.
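
The readings below use R, so as a first orientation here is a minimal R sketch of descriptive and exploratory steps; every number and variable name is illustrative only:

  # Simulated measurements; all numbers are illustrative
  set.seed(112358)
  x <- rnorm(100, mean = 10, sd = 2)

  # Descriptive statistics: location and spread
  summary(x)   # five-number summary plus the mean
  sd(x)        # standard deviation

  # Exploratory Data Analysis: look at the data before testing anything
  hist(x, breaks = 20, main = "Sample distribution", xlab = "value")
  boxplot(x)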


Introductory reading

Nicholls (2011) What do we know?: simple statistical techniques that help. Methods Mol Biol 672:531-81. (pmid: 20838984)

[ PubMed ] [ DOI ] An understanding of simple statistical techniques is invaluable in science and in life. Despite this, and despite the sophistication of many concerning the methods and algorithms of molecular modeling, statistical analysis is usually rare and often uncompelling. I present here some basic approaches that have proved useful in my own work, along with examples drawn from the field. In particular, the statistics of evaluations of virtual screening are carefully considered.


Further reading and resources

Johnson (2013) Revised standards for statistical evidence. Proc Natl Acad Sci USA 110:19313-7. (pmid: 24218581)

[ PubMed ] [ DOI ] Recent advances in Bayesian hypothesis testing have led to the development of uniformly most powerful Bayesian tests, which represent an objective, default class of Bayesian hypothesis tests that have the same rejection regions as classical significance tests. Based on the correspondence between these two classes of tests, it is possible to equate the size of classical hypothesis tests with evidence thresholds in Bayesian tests, and to equate P values with Bayes factors. An examination of these connections suggests that recent concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of significance tests at unjustifiably high levels of significance. To correct this problem, evidence thresholds required for the declaration of a significant finding should be increased to 25-50:1, and to 100-200:1 for the declaration of a highly significant finding. In terms of classical hypothesis tests, these evidence standards mandate the conduct of tests at the 0.005 or 0.001 level of significance.

Xu et al. (2010) Categorical data analysis in experimental biology. Dev Biol 348:3-11. (pmid: 20826130)

[ PubMed ] [ DOI ] The categorical data set is an important data class in experimental biology and contains data separable into several mutually exclusive categories. Unlike measurement of a continuous variable, categorical data cannot be analyzed with methods such as the Student's t-test. Thus, these data require a different method of analysis to aid in interpretation. In this article, we will review issues related to categorical data, such as how to plot them in a graph, how to integrate results from different experiments, how to calculate the error bar/region, and how to perform significance tests. In addition, we illustrate analysis of categorical data using experimental results from developmental biology and virology studies.

Cumming et al. (2007) Error bars in experimental biology. J Cell Biol 177:7-11. (pmid: 17420288)

[ PubMed ] [ DOI ] Error bars commonly appear in figures in publications, but experimental biologists are often unsure how they should be used and interpreted. In this article we illustrate some basic features of error bars and explain how they can help communicate data and assist correct interpretation. Error bars may show confidence intervals, standard errors, standard deviations, or other quantities. Different types of error bars give quite different information, and so figure legends must make clear what error bars represent. We suggest eight simple rules to assist with effective use and interpretation of error bars.
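
As an illustration (not taken from the paper), a small base-R sketch that draws mean +/- SEM error bars on invented data; substituting sd(g) or a confidence half-width for sem gives the other bar types the paper discusses:

  set.seed(42)
  groups <- list(ctrl = rnorm(8, mean = 10, sd = 2),
                 trt  = rnorm(8, mean = 13, sd = 2))
  m   <- sapply(groups, mean)
  sem <- sapply(groups, function(g) sd(g) / sqrt(length(g)))

  bp <- barplot(m, ylim = c(0, max(m + sem) * 1.2), ylab = "mean +/- SEM")
  arrows(bp, m - sem, bp, m + sem, angle = 90, code = 3, length = 0.1)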



Explorations in Statistics series

Curran-Everett (2008) Explorations in statistics: standard deviations and standard errors. Adv Physiol Educ 32:203-8. (pmid: 18794241)

[ PubMed ] [ DOI ] Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This series in Advances in Physiology Education provides an opportunity to do just that: we will investigate basic concepts in statistics using the free software package R. Because this series uses R solely as a vehicle with which to explore basic concepts in statistics, I provide the requisite R commands. In this inaugural paper we explore the essential distinction between standard deviation and standard error: a standard deviation estimates the variability among sample observations whereas a standard error of the mean estimates the variability among theoretical sample means. If we fail to report the standard deviation, then we fail to fully report our data. Because it incorporates information about sample size, the standard error of the mean is a misguided estimate of variability among observations. Instead, the standard error of the mean provides an estimate of the uncertainty of the true value of the population mean.
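
A minimal R sketch of the distinction, with invented numbers: the SD describes scatter among observations, the SEM describes scatter among sample means, and a simulation confirms the two roles:

  set.seed(1)
  n <- 25
  x <- rnorm(n, mean = 100, sd = 15)

  sd(x)             # variability among the observations themselves
  sd(x) / sqrt(n)   # SEM: estimated variability of the sample mean

  # Empirical check: the SD of many sample means approaches the SEM
  means <- replicate(10000, mean(rnorm(n, mean = 100, sd = 15)))
  sd(means)         # close to 15 / sqrt(25) = 3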

Curran-Everett (2009) Explorations in statistics: hypothesis tests and P values. Adv Physiol Educ 33:81-6. (pmid: 19509391)

[ PubMed ] [ DOI ] Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This second installment of Explorations in Statistics delves into test statistics and P values, two concepts fundamental to the test of a scientific null hypothesis. The essence of a test statistic is that it compares what we observe in the experiment to what we expect to see if the null hypothesis is true. The P value associated with the magnitude of that test statistic answers this question: if the null hypothesis is true, what proportion of possible values of the test statistic are at least as extreme as the one I got? Although statisticians continue to stress the limitations of hypothesis tests, there are two realities we must acknowledge: hypothesis tests are ingrained within science, and the simple test of a null hypothesis can be useful. As a result, it behooves us to explore the notions of hypothesis tests, test statistics, and P values.
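
A minimal R sketch (invented data, not from the paper) that computes a two-sample t statistic and its P value by hand and confirms the result with t.test():

  set.seed(2)
  n <- 12
  a <- rnorm(n, mean = 10, sd = 2)
  b <- rnorm(n, mean = 12, sd = 2)

  # Pooled-variance t statistic: observed difference vs. expected scatter
  sp2   <- ((n - 1) * var(a) + (n - 1) * var(b)) / (2 * n - 2)
  t.obs <- (mean(a) - mean(b)) / sqrt(sp2 * (2 / n))
  2 * pt(-abs(t.obs), df = 2 * n - 2)       # two-sided P value
  t.test(a, b, var.equal = TRUE)$p.value    # identical, via t.test()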

Curran-Everett (2009) Explorations in statistics: confidence intervals. Adv Physiol Educ 33:87-90. (pmid: 19509392)

[ PubMed ] [ DOI ] Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This third installment of Explorations in Statistics investigates confidence intervals. A confidence interval is a range that we expect, with some level of confidence, to include the true value of a population parameter such as the mean. A confidence interval provides the same statistical information as the P value from a hypothesis test, but it circumvents the drawbacks of that hypothesis test. Even more important, a confidence interval focuses our attention on the scientific importance of some experimental result.
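
A minimal R sketch (invented data) of a 95% confidence interval for a mean, once by formula and once via t.test():

  set.seed(3)
  x  <- rnorm(20, mean = 50, sd = 8)
  m  <- mean(x)
  se <- sd(x) / sqrt(length(x))
  m + qt(c(0.025, 0.975), df = length(x) - 1) * se   # mean +/- t * SEM
  t.test(x)$conf.int                                 # the same interval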

Curran-Everett (2009) Explorations in statistics: the bootstrap. Adv Physiol Educ 33:286-92. (pmid: 19948676)

[ PubMed ] [ DOI ] Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This fourth installment of Explorations in Statistics explores the bootstrap. The bootstrap gives us an empirical approach to estimate the theoretical variability among possible values of a sample statistic such as the sample mean. The appeal of the bootstrap is that we can use it to make an inference about some experimental result when the statistical theory is uncertain or even unknown. We can also use the bootstrap to assess how well the statistical theory holds: that is, whether an inference we make from a hypothesis test or confidence interval is justified.
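
A minimal R sketch of the bootstrap idea on invented data: resample with replacement, recompute the statistic, and examine the spread of the resampled values:

  set.seed(4)
  x <- rnorm(30, mean = 5, sd = 2)
  boot.means <- replicate(10000, mean(sample(x, replace = TRUE)))
  sd(boot.means)                           # bootstrap estimate of the SEM
  sd(x) / sqrt(length(x))                  # theoretical estimate, for comparison
  quantile(boot.means, c(0.025, 0.975))    # simple percentile interval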

Curran-Everett (2010) Explorations in statistics: power. Adv Physiol Educ 34:41-3. (pmid: 20522895)

[ PubMed ] [ DOI ] Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This fifth installment of Explorations in Statistics revisits power, a concept fundamental to the test of a null hypothesis. Power is the probability that we reject the null hypothesis when it is false. Four things affect power: the probability with which we are willing to reject, by mistake, a true null hypothesis, the magnitude of the difference we want to be able to detect, the variability of the underlying population, and the number of observations in our sample. In an application to an Institutional Animal Care and Use Committee or to the National Institutes of Health, we define power to justify the sample size we propose.
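
The four determinants of power that the paper lists map directly onto the arguments of base R's power.t.test(); a one-line sketch with invented numbers:

  # alpha, detectable difference, population SD, and n determine power
  power.t.test(n = 20, delta = 1.5, sd = 2, sig.level = 0.05)$power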

Curran-Everett (2010) Explorations in statistics: correlation. Adv Physiol Educ 34:186-91. (pmid: 21098385)

[ PubMed ] [ DOI ] Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This sixth installment of Explorations in Statistics explores correlation, a familiar technique that estimates the magnitude of a straight-line relationship between two variables. Correlation is meaningful only when the two variables are true random variables: for example, if we restrict in some way the variability of one variable, then the magnitude of the correlation will decrease. Correlation cannot help us decide if changes in one variable result in changes in the second variable, if changes in the second variable result in changes in the first variable, or if changes in a third variable result in concurrent changes in the first two variables. Correlation can help provide us with evidence that study of the nature of the relationship between x and y may be warranted in an actual experiment in which one of them is controlled.
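
A minimal R sketch (invented data) of estimating a correlation, and of the restriction-of-range effect the paper describes:

  set.seed(5)
  x <- rnorm(50)
  y <- 0.7 * x + rnorm(50, sd = 0.7)   # linearly related, with noise
  cor(x, y)                            # Pearson correlation coefficient
  cor.test(x, y)                       # confidence interval and P value

  # Restricting the range of one variable shrinks the correlation
  keep <- abs(x) < 0.5
  cor(x[keep], y[keep])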

Curran-Everett (2011) Explorations in statistics: regression. Adv Physiol Educ 35:347-52. (pmid: 22139769)

[ PubMed ] [ DOI ] Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This seventh installment of Explorations in Statistics explores regression, a technique that estimates the nature of the relationship between two things for which we may only surmise a mechanistic or predictive connection. Regression helps us answer three questions: does some variable Y depend on another variable X; if so, what is the nature of the relationship between Y and X; and for some value of X, what value of Y do we predict? Residual plots are an essential component of a thorough regression analysis: they help us decide if our statistical regression model of the relationship between Y and X is appropriate.
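
A minimal R sketch (invented data) covering the paper's three questions: fit, residual plot, prediction:

  set.seed(6)
  x <- runif(40, 0, 10)
  y <- 2 + 0.8 * x + rnorm(40, sd = 1)
  fit <- lm(y ~ x)
  summary(fit)                                # slope, intercept, uncertainty

  plot(fitted(fit), resid(fit),               # residual plot: look for structure
       xlab = "fitted", ylab = "residuals")
  abline(h = 0, lty = 2)

  predict(fit, newdata = data.frame(x = 5))   # predicted Y at X = 5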

Curran-Everett (2012) Explorations in statistics: permutation methods. Adv Physiol Educ 36:181-7. (pmid: 22952255)

[ PubMed ] [ DOI ] Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This eighth installment of Explorations in Statistics explores permutation methods, empiric procedures we can use to assess an experimental result (to test a null hypothesis) when we are reluctant to trust statistical theory alone. Permutation methods operate on the observations (the data) we get from an experiment. A permutation procedure answers this question: out of all the possible ways we can rearrange the observations we got, in what proportion of those arrangements is the sample statistic we care about at least as extreme as the one we got? The answer to that question is the P value.
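
A minimal R sketch of a two-sample permutation test on invented data, implementing exactly the question the abstract poses:

  set.seed(7)
  a <- rnorm(10, mean = 10, sd = 2)
  b <- rnorm(10, mean = 12, sd = 2)
  obs  <- mean(a) - mean(b)
  pool <- c(a, b)

  # Rearrange the labels many times; how often is the difference as extreme?
  perm <- replicate(10000, {
    i <- sample(length(pool), length(a))
    mean(pool[i]) - mean(pool[-i])
  })
  mean(abs(perm) >= abs(obs))   # two-sided permutation P value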


General

Nick (2007) Descriptive statistics. Methods Mol Biol 404:33-52. (pmid: 18450044)

[ PubMed ] [ DOI ] Statistics is defined by the Medical Subject Headings (MeSH) thesaurus as the science and art of collecting, summarizing, and analyzing data that are subject to random variation. The two broad categories of summarizing and analyzing data are referred to as descriptive and inferential statistics. This chapter considers the science and art of summarizing data where descriptive statistics and graphics are used to display data. In this chapter, we discuss the fundamentals of descriptive statistics, including describing qualitative and quantitative variables. For describing quantitative variables, measures of location and spread, for example the standard deviation, are presented along with graphical presentations. We also discuss distributions of statistics, for example the variance, as well as the use of transformations. The concepts in this chapter are useful for uncovering patterns within the data and for effectively presenting the results of a project.

Tu (2007) Basic principles of statistical inference. Methods Mol Biol 404:53-72. (pmid: 18450045)

[ PubMed ] [ DOI ] In this chapter, we discuss the fundamental principles behind two of the most frequently used statistical inference procedures: confidence interval estimation and hypothesis testing; both procedures are constructed on the sampling distributions that we have learned in previous chapters. To better understand these inference procedures, we focus on the logic of statistical decision making and the role that experimental data play in the decision process. Numerical examples are used to illustrate the implementation of the discussed procedures. This chapter also introduces some of the most important concepts associated with confidence interval estimation and hypothesis testing, including P values, significance level, power, sample size, and two types of errors. We conclude the chapter with a brief discussion on statistical and practical significance of test results.

Perkins (2007) Statistical inference on categorical variables. Methods Mol Biol 404:73-88. (pmid: 18450046)

[ PubMed ] [ DOI ] Categorical data are data that capture a characteristic of an experimental unit (such as a tissue specimen) rather than a numerical value. In this chapter, we first describe types of categorical data (nominal and ordinal) and how these types of data are distributed (binomial, multinomial, and independent multinomial). Next, methods for estimation and making statistical inferences for categorical data in commonly seen situations are presented. This includes approximation of the binomial distribution with a normal distribution, estimation and inference for one and two binomial samples, inference for 2 x 2 and R x C contingency tables, and estimation of sample size. Relevant data examples, along with discussions of which study designs generated the data, are presented throughout the chapter.
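
A minimal R sketch (invented counts) of inference on a 2 x 2 contingency table with base R:

  counts <- matrix(c(30, 10,
                     18, 22),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(group   = c("A", "B"),
                                   outcome = c("yes", "no")))
  chisq.test(counts)    # normal-approximation based test
  fisher.test(counts)   # exact test, preferable for small counts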

Wittkowski & Song (2010) Nonparametric methods for molecular biology. Methods Mol Biol 620:105-53. (pmid: 20652502)

[ PubMed ] [ DOI ] In 2003, the completion of the Human Genome Project (1) together with advances in computational resources (2) were expected to launch an era where the genetic and genomic contributions to many common diseases would be found. In the years following, however, researchers became increasingly frustrated as most reported 'findings' could not be replicated in independent studies (3). To improve the signal/noise ratio, it was suggested to increase the number of cases to be included to tens of thousands (4), a requirement that would dramatically restrict the scope of personalized medicine. Similarly, there was little success in elucidating the gene-gene interactions involved in complex diseases or even in developing criteria for assessing their phenotypes. As a partial solution to these enigmata, we here introduce a class of statistical methods as the 'missing link' between advances in genetics and informatics. As a first step, we provide a unifying view of a plethora of nonparametric tests developed mainly in the 1940s, all of which can be expressed as u-statistics. Then, we will extend this approach to reflect categorical and ordinal relationships between variables, resulting in a flexible and powerful approach to deal with the impact of (1) multiallelic genetic loci, (2) poly-locus genetic regions, and (3) oligo-genetic and oligo-genomic collaborative interactions on complex phenotypes.

Alonzo & Pepe (2007) Development and evaluation of classifiers. Methods Mol Biol 404:89-116. (pmid: 18450047)

[ PubMed ] [ DOI ] Diagnostic tests, medical tests, screening tests, biomarkers, and prediction rules are all types of classifiers. This chapter introduces methods for classifier development and evaluation. We first introduce measures of classification performance including sensitivity, specificity, and receiver operating characteristic (ROC) curves. We then review some issues in the design of studies to assess and compare the performance of classifiers. Approaches for using the data to estimate and compare classifier accuracy are then introduced. Next, methods for combining multiple classifiers into a single classifier are presented. Lastly, we discuss other important aspects of classifier development and evaluation. The methods presented are illustrated with real data.
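
A minimal base-R sketch (simulated scores and labels, not the chapter's data) computing sensitivity and specificity at several thresholds; sweeping the threshold traces out the ROC curve:

  set.seed(8)
  labels <- rep(c(1, 0), each = 50)
  scores <- c(rnorm(50, mean = 1), rnorm(50, mean = 0))   # higher = "positive"

  for (t in c(0.0, 0.5, 1.0)) {
    pred <- as.integer(scores > t)
    sens <- sum(pred == 1 & labels == 1) / sum(labels == 1)
    spec <- sum(pred == 0 & labels == 0) / sum(labels == 0)
    cat(sprintf("threshold %.1f: sensitivity %.2f, specificity %.2f\n",
                t, sens, spec))
  }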

Berman (2007) Comparison of means. Methods Mol Biol 404:117-42. (pmid: 18450048)

[ PubMed ] [ DOI ] This chapter describes statistical methods to test for differences between means or other measures of central tendency of 2 or more populations. Parametric tests and nonparametric tests are included. Methods for pairwise comparisons when more than 2 groups are being compared are included.
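
A minimal R sketch (invented data) of a parametric test, its nonparametric counterpart, and adjusted pairwise comparisons across three groups:

  set.seed(9)
  g1 <- rnorm(15, 10, 2); g2 <- rnorm(15, 12, 2); g3 <- rnorm(15, 11, 2)
  t.test(g1, g2)        # parametric comparison of two means
  wilcox.test(g1, g2)   # nonparametric counterpart (rank-sum test)

  vals <- c(g1, g2, g3)
  grp  <- factor(rep(c("g1", "g2", "g3"), each = 15))
  pairwise.t.test(vals, grp, p.adjust.method = "holm")   # adjusted pairwise tests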

Eberly (2007) Correlation and simple linear regression. Methods Mol Biol 404:143-64. (pmid: 18450049)

[ PubMed ] [ DOI ] This chapter highlights important steps in using correlation and simple linear regression to address scientific questions about the association of two continuous variables with each other. These steps include estimation and inference, assessing model fit, the connection between regression and ANOVA, and study design. Examples in microbiology are used throughout. This chapter provides a framework that is helpful in understanding more complex statistical techniques, such as multiple linear regression, linear mixed effects models, logistic regression, and proportional hazards regression.

Eberly (2007) Multiple linear regression. Methods Mol Biol 404:165-87. (pmid: 18450050)

[ PubMed ] [ DOI ] This chapter describes multiple linear regression, a statistical approach used to describe the simultaneous associations of several variables with one continuous outcome. Important steps in using this approach include estimation and inference, variable selection in model building, and assessing model fit. The special cases of regression with interactions among the variables, polynomial regression, regressions with categorical (grouping) variables, and separate slopes models are also covered. Examples in microbiology are used throughout.

Ip (2007) General linear models. Methods Mol Biol 404:189-211. (pmid: 18450051)

[ PubMed ] [ DOI ] This chapter presents the general linear model as an extension to the two-sample t-test, analysis of variance (ANOVA), and linear regression. We illustrate the general linear model using two-way ANOVA as a prime example. The underlying principle of ANOVA, which is based on the decomposition of the value of an observed variable into grand mean, group effect and random noise, is emphasized. Further into this chapter, the F test is introduced as a means to test for the strength of group effect. The procedure of F test for identifying a parsimonious set of factors in explaining an outcome of interest is also described.
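
A minimal R sketch (invented data) of a two-way ANOVA with an interaction, fitted and F-tested with base R's aov():

  set.seed(10)
  d <- expand.grid(strain = c("wt", "mut"), medium = c("rich", "minimal"))
  d <- d[rep(1:4, each = 6), ]   # six replicates per cell
  d$growth <- 10 + 2 * (d$strain == "mut") -
              1.5 * (d$medium == "minimal") + rnorm(nrow(d))
  fit <- aov(growth ~ strain * medium, data = d)
  summary(fit)   # F tests for the main effects and the interaction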

Oberg & Mahoney (2007) Linear mixed effects models. Methods Mol Biol 404:213-34. (pmid: 18450052)

[ PubMed ] [ DOI ] Statistical models provide a framework in which to describe the biological process giving rise to the data of interest. The construction of this model requires balancing adequate representation of the process with simplicity. Experiments involving multiple (correlated) observations per subject do not satisfy the assumption of independence required for most methods described in previous chapters. In some experiments, the amount of random variation differs between experimental groups. In other experiments, there are multiple sources of variability, such as both between-subject variation and technical variation. As demonstrated in this chapter, linear mixed effects models provide a versatile and powerful framework in which to address research objectives efficiently and appropriately.
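
A minimal R sketch of a random-intercept model for repeated measurements per subject; it assumes the lme4 package is installed (install.packages("lme4")), and all data are simulated:

  library(lme4)   # assumed installed; not part of base R
  set.seed(11)
  subj <- factor(rep(1:10, each = 4))        # 10 subjects, 4 measurements each
  x    <- rep(1:4, times = 10)
  u    <- rnorm(10, sd = 2)                  # between-subject variation
  y    <- 5 + 0.8 * x + u[as.integer(subj)] + rnorm(40, sd = 0.5)
  fit  <- lmer(y ~ x + (1 | subj))           # random intercept per subject
  summary(fit)   # fixed effect of x; between-subject and residual variance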

Grady (2007) Analysis of change. Methods Mol Biol 404:261-71. (pmid: 18450054)

[ PubMed ] [ DOI ] When the same subjects or laboratory animals are observed across a set of different conditions or over time, we are usually interested in studying change. In these study designs, each subject serves as its own control. In this chapter, we consider different ways to assess change over time, for example, analyses for evaluating changes from a baseline condition. Study designs and analyses for single group studies and studies with two groups are discussed in detail. Examples come from published data. Statistical methods used in the examples include paired t-tests and analysis of covariance. The use of difference scores is discussed relative to analysis of covariance.

Nick & Campbell (2007) Logistic regression. Methods Mol Biol 404:273-301. (pmid: 18450055)

[ PubMed ] [ DOI ] The Medical Subject Headings (MeSH) thesaurus used by the National Library of Medicine defines logistic regression models as "statistical models which describe the relationship between a qualitative dependent variable (that is, one which can take only certain discrete values, such as the presence or absence of a disease) and an independent variable." Logistic regression models are used to study effects of predictor variables on categorical outcomes and normally the outcome is binary, such as presence or absence of disease (e.g., non-Hodgkin's lymphoma), in which case the model is called a binary logistic model. When there are multiple predictors (e.g., risk factors and treatments) the model is referred to as a multiple or multivariable logistic regression model and is one of the most frequently used statistical models in medical journals. In this chapter, we examine both simple and multiple binary logistic regression models and present related issues, including interaction, categorical predictor variables, continuous predictor variables, and goodness of fit.
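
A minimal R sketch (simulated data) of a multiple binary logistic regression with base R's glm():

  set.seed(12)
  n <- 200
  age     <- rnorm(n, mean = 50, sd = 10)
  treated <- rbinom(n, 1, 0.5)
  disease <- rbinom(n, 1, plogis(-4 + 0.07 * age - 0.8 * treated))

  fit <- glm(disease ~ age + treated, family = binomial)
  summary(fit)      # log-odds coefficients
  exp(coef(fit))    # odds ratios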

Jiang & Fine (2007) Survival analysis. Methods Mol Biol 404:303-18. (pmid: 18450056)

[ PubMed ] [ DOI ] This chapter introduces some fundamental results in survival analysis. We first describe what censored failure time data are and how to interpret the failure time distribution. Two nonparametric methods for estimating the survival curve, the life table estimator and the Kaplan-Meier estimator, are demonstrated. We then discuss the two-sample problem and the usage of the log-rank test for comparing survival distributions between groups. Lastly, we discuss in some detail the proportional hazards model, which is a semiparametric regression model specifically developed for censored data. All methods are illustrated with artificial or real data sets.
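
A minimal R sketch of the chapter's methods, using the survival package (a recommended package that ships with R) on simulated data:

  library(survival)
  set.seed(13)
  grp   <- rep(c("A", "B"), each = 30)
  time  <- rexp(60, rate = ifelse(grp == "A", 0.10, 0.05))
  event <- rbinom(60, 1, 0.8)                # 1 = event observed, 0 = censored

  fit <- survfit(Surv(time, event) ~ grp)    # Kaplan-Meier estimator
  plot(fit, lty = 1:2, xlab = "time", ylab = "survival")
  survdiff(Surv(time, event) ~ grp)          # log-rank test
  coxph(Surv(time, event) ~ grp)             # proportional hazards model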

Glickman & van Dyk (2007) Basic Bayesian methods. Methods Mol Biol 404:319-38. (pmid: 18450057)

[ PubMed ] [ DOI ] In this chapter, we introduce the basics of Bayesian data analysis. The key ingredients to a Bayesian analysis are the likelihood function, which reflects information about the parameters contained in the data, and the prior distribution, which quantifies what is known about the parameters before observing data. The prior distribution and likelihood can be easily combined to form the posterior distribution, which represents total knowledge about the parameters after the data have been observed. Simple summaries of this distribution can be used to isolate quantities of interest and ultimately to draw substantive conclusions. We illustrate each of these steps of a typical Bayesian analysis using three biomedical examples and briefly discuss more advanced topics, including prediction, Monte Carlo computational methods, and multilevel models.
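
A minimal R sketch of the conjugate beta-binomial case, where prior and likelihood combine in closed form; all numbers invented:

  # Observe 7 successes in 10 trials; uniform Beta(1, 1) prior on the rate
  a0 <- 1; b0 <- 1
  k  <- 7; n  <- 10
  a1 <- a0 + k; b1 <- b0 + n - k                       # posterior: Beta(8, 4)
  qbeta(c(0.025, 0.975), a1, b1)                       # 95% credible interval
  curve(dbeta(x, a1, b1), 0, 1, ylab = "density")      # posterior
  curve(dbeta(x, a0, b0), 0, 1, add = TRUE, lty = 2)   # prior, for comparison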

Wilkinson (2007) Bayesian methods in bioinformatics and computational systems biology. Brief Bioinformatics 8:109-16. (pmid: 17430978)

[ PubMed ] [ DOI ] Bayesian methods are valuable, inter alia, whenever there is a need to extract information from data that are uncertain or subject to any kind of error or noise (including measurement error and experimental error, as well as noise or random variation intrinsic to the process of interest). Bayesian methods offer a number of advantages over more conventional statistical techniques that make them particularly appropriate for complex data. It is therefore no surprise that Bayesian methods are becoming more widely used in the fields of genetics, genomics, bioinformatics and computational systems biology, where making sense of complex noisy data is the norm. This review provides an introduction to the growing literature in this area, with particular emphasis on recent developments in Bayesian bioinformatics relevant to computational systems biology.

D'Agostino (2007) Overview of missing data techniques. Methods Mol Biol 404:339-52. (pmid: 18450058)

[ PubMed ] [ DOI ] Missing data frequently arise in the course of research studies. Understanding the mechanism that led to the missing data is important in order for investigators to be able to perform analyses that will lead to proper inference. This chapter will review different missing data mechanisms, including random and non-random mechanisms. Basic methods will be presented using examples to illustrate approaches to analyzing data in the presence of missing data.

Case & Ambrosius (2007) Power and sample size. Methods Mol Biol 404:377-408. (pmid: 18450060)

[ PubMed ] [ DOI ] In this chapter, we discuss the concept of statistical power and show how the sample size can be chosen to ensure a desired power. Power is the probability of rejecting the null hypothesis when the null hypothesis is false, that is the probability of saying there is a difference when a difference actually exists. An underpowered study does not have a sufficiently large sample size to answer the research question of interest. An overpowered study has too large a sample size and wastes resources. We will show how the power and required sample size can be calculated for several common types of studies, mention software that can be used for the necessary calculations, and discuss additional considerations.
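
power.t.test() in base R also solves for the sample size that achieves a desired power; a short sketch with invented numbers:

  # n per group needed to detect a difference of 1.5 (SD 2) with 80% power
  power.t.test(delta = 1.5, sd = 2, sig.level = 0.05, power = 0.8)$n
  # halving the detectable difference roughly quadruples the required n
  power.t.test(delta = 0.75, sd = 2, sig.level = 0.05, power = 0.8)$n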

Sabatti (2007) Avoiding false discoveries in association studies. Methods Mol Biol 376:195-211. (pmid: 17984547)

[ PubMed ] [ DOI ] We consider the problem of controlling false discoveries in association studies. We assume that the design of the study is adequate so that the "false discoveries" are potentially due only to random chance, not to confounding or other flaws. Under this premise, we review the statistical framework for hypothesis testing and correction for multiple comparisons. We consider in detail the currently accepted strategies in linkage analysis. We then examine the underlying similarities and differences between linkage and association studies and document some of the most recent methodological developments for association mapping.
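
A minimal R sketch of multiple-comparison correction with base R's p.adjust(), on invented P values:

  p <- c(0.001, 0.008, 0.020, 0.040, 0.300)   # invented raw P values
  p.adjust(p, method = "bonferroni")          # controls the family-wise error rate
  p.adjust(p, method = "BH")                  # controls the false discovery rate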

Berman & Gullion (2007) Working with a statistician. Methods Mol Biol 404:489-503. (pmid: 18450064)

[ PubMed ] [ DOI ] This chapter presents some guidelines for working with a statistician, beginning with when and why you should consult one. We emphasize the importance of good communication between the statistician and the client and the need for clearly defined tasks and a timetable. Other considerations, such as security, confidentiality, and business arrangements are discussed.