Difference between revisions of "Statistics"

From "A B C"

Revision as of 22:32, 25 January 2012

Statistics


This page is a placeholder, or under current development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


Statistics is indispensable for describing and analysing large data sets, such as those commonly encountered in computational biology. Broadly, three areas apply: descriptive statistics, to summarize features of a data set; inferential statistics, to quantify the significance of observations; and probability theory, to provide the theoretical basis for drawing such inferences. In practice, we often apply procedures of Exploratory Data Analysis to find interesting features of our data and to devise hypotheses and strategies for its analysis.
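The descriptive side of this division of labour can be illustrated in a few lines of code. A minimal sketch (Python standard library only; the data values are invented for illustration):

```python
# Descriptive statistics: summarize the location and spread of a sample.
import statistics

data = [4.1, 3.8, 5.0, 4.4, 4.7, 3.9, 4.3]  # e.g. measured values

n = len(data)
mean = statistics.mean(data)       # measure of location
median = statistics.median(data)   # robust measure of location
sd = statistics.stdev(data)        # measure of spread (sample SD)

print(f"n = {n}  mean = {mean:.2f}  median = {median:.2f}  sd = {sd:.2f}")
```

Inferential statistics and probability theory then enter when we ask whether such summaries could plausibly have arisen by chance.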


Further reading and resources

Nicholls (2011) What do we know?: simple statistical techniques that help. Methods Mol Biol 672:531-81. (pmid: 20838984)

An understanding of simple statistical techniques is invaluable in science and in life. Despite this, and despite the sophistication of many concerning the methods and algorithms of molecular modeling, statistical analysis is usually rare and often uncompelling. I present here some basic approaches that have proved useful in my own work, along with examples drawn from the field. In particular, the statistics of evaluations of virtual screening are carefully considered.

Nick (2007) Descriptive statistics. Methods Mol Biol 404:33-52. (pmid: 18450044)

Statistics is defined by the Medical Subject Headings (MeSH) thesaurus as the science and art of collecting, summarizing, and analyzing data that are subject to random variation. The two broad categories of summarizing and analyzing data are referred to as descriptive and inferential statistics. This chapter considers the science and art of summarizing data where descriptive statistics and graphics are used to display data. In this chapter, we discuss the fundamentals of descriptive statistics, including describing qualitative and quantitative variables. For describing quantitative variables, measures of location and spread, for example the standard deviation, are presented along with graphical presentations. We also discuss distributions of statistics, for example the variance, as well as the use of transformations. The concepts in this chapter are useful for uncovering patterns within the data and for effectively presenting the results of a project.

Tu (2007) Basic principles of statistical inference. Methods Mol Biol 404:53-72. (pmid: 18450045)

In this chapter, we discuss the fundamental principles behind two of the most frequently used statistical inference procedures: confidence interval estimation and hypothesis testing; both procedures are constructed on the sampling distributions that we have learned in previous chapters. To better understand these inference procedures, we focus on the logic of statistical decision making and the role that experimental data play in the decision process. Numerical examples are used to illustrate the implementation of the discussed procedures. This chapter also introduces some of the most important concepts associated with confidence interval estimation and hypothesis testing, including P values, significance level, power, sample size, and two types of errors. We conclude the chapter with a brief discussion on statistical and practical significance of test results.
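By way of illustration (not from the chapter), a large-sample 95% confidence interval for a mean can be computed in a few lines. This sketch uses the normal-approximation quantile z = 1.96; for small samples a t quantile would be preferred. The sample values are invented:

```python
import math
import statistics

def mean_ci95(data):
    """Large-sample 95% confidence interval for the mean
    (normal approximation, z = 1.96)."""
    m = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(len(data))
    return m - 1.96 * se, m + 1.96 * se

sample = [5.2, 4.9, 5.5, 5.1, 4.8, 5.3, 5.0, 5.2]
lo, hi = mean_ci95(sample)
# A hypothesized mean outside (lo, hi) would be rejected at the
# 5% level by the corresponding two-sided z test.
```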

Perkins (2007) Statistical inference on categorical variables. Methods Mol Biol 404:73-88. (pmid: 18450046)

Categorical data are data that capture a characteristic of an experimental unit (such as a tissue specimen) rather than a numerical value. In this chapter, we first describe types of categorical data (nominal and ordinal) and how these types of data are distributed (binomial, multinomial, and independent multinomial). Next, methods for estimation and making statistical inferences for categorical data in commonly seen situations are presented. This includes approximation of the binomial distribution with a normal distribution, estimation and inference for one and two binomial samples, inference for 2 x 2 and R x C contingency tables, and estimation of sample size. Relevant data examples, along with discussions of which study designs generated the data, are presented throughout the chapter.
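For the 2 x 2 case mentioned here, the Pearson chi-square statistic can be computed directly from observed and expected counts. A sketch (Python; the counts are invented):

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2 x 2 table
    [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# e.g. treated/untreated (rows) vs. responder/non-responder (columns)
stat = chi_square_2x2(20, 10, 10, 20)
# compare stat against the chi-square distribution with 1 df
```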

Wittkowski & Song (2010) Nonparametric methods for molecular biology. Methods Mol Biol 620:105-53. (pmid: 20652502)

In 2003, the completion of the Human Genome Project (1) together with advances in computational resources (2) were expected to launch an era where the genetic and genomic contributions to many common diseases would be found. In the years following, however, researchers became increasingly frustrated as most reported 'findings' could not be replicated in independent studies (3). To improve the signal/noise ratio, it was suggested to increase the number of cases to be included to tens of thousands (4), a requirement that would dramatically restrict the scope of personalized medicine. Similarly, there was little success in elucidating the gene-gene interactions involved in complex diseases or even in developing criteria for assessing their phenotypes. As a partial solution to these enigmata, we here introduce a class of statistical methods as the 'missing link' between advances in genetics and informatics. As a first step, we provide a unifying view of a plethora of nonparametric tests developed mainly in the 1940s, all of which can be expressed as u-statistics. Then, we will extend this approach to reflect categorical and ordinal relationships between variables, resulting in a flexible and powerful approach to deal with the impact of (1) multiallelic genetic loci, (2) poly-locus genetic regions, and (3) oligo-genetic and oligo-genomic collaborative interactions on complex phenotypes.
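The u-statistic view is quite concrete: the Mann-Whitney U, for instance, is literally a count over pairs. A sketch of that computation (Python; the two samples are invented):

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U for sample x versus sample y, written directly
    as a u-statistic: count pairs where x_i exceeds y_j; ties count 1/2."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

u = mann_whitney_u([1.2, 2.3, 3.1], [0.9, 1.0, 2.0])
# u / (len(x) * len(y)) estimates P(X > Y), here 8/9
```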

Alonzo & Pepe (2007) Development and evaluation of classifiers. Methods Mol Biol 404:89-116. (pmid: 18450047)

Diagnostic tests, medical tests, screening tests, biomarkers, and prediction rules are all types of classifiers. This chapter introduces methods for classifier development and evaluation. We first introduce measures of classification performance including sensitivity, specificity, and receiver operating characteristic (ROC) curves. We then review some issues in the design of studies to assess and compare the performance of classifiers. Approaches for using the data to estimate and compare classifier accuracy are then introduced. Next, methods for combining multiple classifiers into a single classifier are presented. Lastly, we discuss other important aspects of classifier development and evaluation. The methods presented are illustrated with real data.
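Sensitivity and specificity at a given score threshold are easily computed by hand; sweeping the threshold over all scores traces out the ROC curve. A sketch (Python; scores and labels invented):

```python
def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity of the rule 'score >= threshold is
    called positive'. labels: 1 = truly positive, 0 = truly negative."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]
sens, spec = sens_spec(scores, labels, 0.5)
```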

Berman (2007) Comparison of means. Methods Mol Biol 404:117-42. (pmid: 18450048)

This chapter describes statistical methods to test for differences between means or other measures of central tendency of 2 or more populations. Parametric tests and nonparametric tests are included. Methods for pairwise comparisons when more than 2 groups are being compared are included.
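A representative parametric test from this family is Welch's two-sample t test, whose statistic is simple to compute. A sketch (Python; the two samples are invented, and only the statistic is shown, not the p-value):

```python
import math
import statistics

def welch_t(x, y):
    """Welch two-sample t statistic (does not assume equal variances)."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    se = math.sqrt(vx / len(x) + vy / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se

t = welch_t([5.1, 4.9, 5.3, 5.0], [4.2, 4.4, 4.1, 4.5])
# compare |t| to a t distribution (Welch-Satterthwaite df) for a p-value
```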

Eberly (2007) Correlation and simple linear regression. Methods Mol Biol 404:143-64. (pmid: 18450049)

This chapter highlights important steps in using correlation and simple linear regression to address scientific questions about the association of two continuous variables with each other. These steps include estimation and inference, assessing model fit, the connection between regression and ANOVA, and study design. Examples in microbiology are used throughout. This chapter provides a framework that is helpful in understanding more complex statistical techniques, such as multiple linear regression, linear mixed effects models, logistic regression, and proportional hazards regression.
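The least-squares estimates at the heart of simple linear regression follow directly from sums of cross-products. A sketch (Python; x and y values invented):

```python
def least_squares(x, y):
    """Slope b and intercept a of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return a, b

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = least_squares(x, y)
```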

Eberly (2007) Multiple linear regression. Methods Mol Biol 404:165-87. (pmid: 18450050)

This chapter describes multiple linear regression, a statistical approach used to describe the simultaneous associations of several variables with one continuous outcome. Important steps in using this approach include estimation and inference, variable selection in model building, and assessing model fit. The special cases of regression with interactions among the variables, polynomial regression, regressions with categorical (grouping) variables, and separate slopes models are also covered. Examples in microbiology are used throughout.

Ip (2007) General linear models. Methods Mol Biol 404:189-211. (pmid: 18450051)

This chapter presents the general linear model as an extension to the two-sample t-test, analysis of variance (ANOVA), and linear regression. We illustrate the general linear model using two-way ANOVA as a prime example. The underlying principle of ANOVA, which is based on the decomposition of the value of an observed variable into grand mean, group effect and random noise, is emphasized. Further into this chapter, the F test is introduced as a means to test for the strength of group effect. The procedure of F test for identifying a parsimonious set of factors in explaining an outcome of interest is also described.
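The decomposition described here can be sketched for the one-way case: the F statistic is the between-group mean square over the within-group mean square (Python; the three groups are invented):

```python
def one_way_f(groups):
    """One-way ANOVA F statistic: between-group mean square divided
    by within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                     for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

f = one_way_f([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# compare f against the F distribution with (k-1, n-k) df
```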

Oberg & Mahoney (2007) Linear mixed effects models. Methods Mol Biol 404:213-34. (pmid: 18450052)

Statistical models provide a framework in which to describe the biological process giving rise to the data of interest. The construction of this model requires balancing adequate representation of the process with simplicity. Experiments involving multiple (correlated) observations per subject do not satisfy the assumption of independence required for most methods described in previous chapters. In some experiments, the amount of random variation differs between experimental groups. In other experiments, there are multiple sources of variability, such as both between-subject variation and technical variation. As demonstrated in this chapter, linear mixed effects models provide a versatile and powerful framework in which to address research objectives efficiently and appropriately.

Grady (2007) Analysis of change. Methods Mol Biol 404:261-71. (pmid: 18450054)

When the same subjects or laboratory animals are observed across a set of different conditions or over time, we are usually interested in studying change. In these study designs, each subject serves as its own control. In this chapter, we consider different ways to assess change over time, for example, analyses for evaluating changes from a baseline condition. Study designs and analyses for single group studies and studies with two groups are discussed in detail. Examples come from published data. Statistical methods used in the examples include paired t-tests and analysis of covariance. The use of difference scores is discussed relative to analysis of covariance.
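The difference-score idea is easy to make concrete: a paired t test is just a one-sample test on the within-subject differences. A sketch (Python; the before/after measurements are invented):

```python
import math
import statistics

def paired_t(before, after):
    """Paired t statistic computed on difference scores (after - before):
    mean difference divided by its standard error."""
    d = [a - b for b, a in zip(before, after)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

t = paired_t([10.0, 12.0, 11.0, 13.0], [11.5, 13.0, 12.5, 13.5])
# compare t to a t distribution with n - 1 df
```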

Nick & Campbell (2007) Logistic regression. Methods Mol Biol 404:273-301. (pmid: 18450055)

The Medical Subject Headings (MeSH) thesaurus used by the National Library of Medicine defines logistic regression models as "statistical models which describe the relationship between a qualitative dependent variable (that is, one which can take only certain discrete values, such as the presence or absence of a disease) and an independent variable." Logistic regression models are used to study effects of predictor variables on categorical outcomes and normally the outcome is binary, such as presence or absence of disease (e.g., non-Hodgkin's lymphoma), in which case the model is called a binary logistic model. When there are multiple predictors (e.g., risk factors and treatments) the model is referred to as a multiple or multivariable logistic regression model and is one of the most frequently used statistical models in medical journals. In this chapter, we examine both simple and multiple binary logistic regression models and present related issues, including interaction, categorical predictor variables, continuous predictor variables, and goodness of fit.
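The mechanics of interpreting a fitted logistic model can be sketched briefly: the logistic function maps log-odds to a probability, and exponentiating a coefficient gives an odds ratio. The coefficients below are hypothetical, not from any fitted model:

```python
import math

def logistic(eta):
    """Inverse link of the logistic model: probability from log-odds."""
    return 1 / (1 + math.exp(-eta))

# Hypothetical fitted model: log-odds = b0 + b1 * x
b0, b1 = -2.0, 0.8
p = logistic(b0 + b1 * 3.0)   # predicted probability at x = 3
odds_ratio = math.exp(b1)     # change in odds per unit increase in x
```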

Jiang & Fine (2007) Survival analysis. Methods Mol Biol 404:303-18. (pmid: 18450056)

This chapter introduces some fundamental results in survival analysis. We first describe what censored failure time data are and how to interpret the failure time distribution. Two nonparametric methods for estimating the survival curve, the life table estimator and the Kaplan-Meier estimator, are demonstrated. We then discuss the two-sample problem and the usage of the log-rank test for comparing survival distributions between groups. Lastly, we discuss in some detail the proportional hazards model, which is a semiparametric regression model specifically developed for censored data. All methods are illustrated with artificial or real data sets.
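The Kaplan-Meier estimator itself is short enough to sketch: at each event time, the survival probability is multiplied by (1 - events/at-risk). A minimal version (Python; the follow-up times are invented, with 1 = event and 0 = censored):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimates at each distinct event time."""
    pairs = sorted(zip(times, events))
    n = len(pairs)
    s = 1.0
    curve = []
    i = 0
    while i < n:
        t = pairs[i][0]
        d = sum(e for tt, e in pairs if tt == t)  # events at time t
        if d > 0:
            s *= 1 - d / (n - i)  # n - i subjects still at risk at t
            curve.append((t, s))
        while i < n and pairs[i][0] == t:         # advance past time t
            i += 1
    return curve

curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```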

Glickman & van Dyk (2007) Basic Bayesian methods. Methods Mol Biol 404:319-38. (pmid: 18450057)

In this chapter, we introduce the basics of Bayesian data analysis. The key ingredients to a Bayesian analysis are the likelihood function, which reflects information about the parameters contained in the data, and the prior distribution, which quantifies what is known about the parameters before observing data. The prior distribution and likelihood can be easily combined to form the posterior distribution, which represents total knowledge about the parameters after the data have been observed. Simple summaries of this distribution can be used to isolate quantities of interest and ultimately to draw substantive conclusions. We illustrate each of these steps of a typical Bayesian analysis using three biomedical examples and briefly discuss more advanced topics, including prediction, Monte Carlo computational methods, and multilevel models.
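The prior-times-likelihood step is especially transparent in the conjugate Beta-Binomial case, where the update reduces to adding counts. A sketch (Python; the trial counts are invented):

```python
def beta_binomial_update(a, b, successes, failures):
    """Conjugate update: a Beta(a, b) prior combined with a binomial
    likelihood gives a Beta(a + successes, b + failures) posterior."""
    return a + successes, b + failures

# Flat Beta(1, 1) prior; observe 7 successes in 10 trials.
a_post, b_post = beta_binomial_update(1, 1, 7, 3)
post_mean = a_post / (a_post + b_post)  # posterior mean of the rate
```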

Wilkinson (2007) Bayesian methods in bioinformatics and computational systems biology. Brief Bioinformatics 8:109-16. (pmid: 17430978)

Bayesian methods are valuable, inter alia, whenever there is a need to extract information from data that are uncertain or subject to any kind of error or noise (including measurement error and experimental error, as well as noise or random variation intrinsic to the process of interest). Bayesian methods offer a number of advantages over more conventional statistical techniques that make them particularly appropriate for complex data. It is therefore no surprise that Bayesian methods are becoming more widely used in the fields of genetics, genomics, bioinformatics and computational systems biology, where making sense of complex noisy data is the norm. This review provides an introduction to the growing literature in this area, with particular emphasis on recent developments in Bayesian bioinformatics relevant to computational systems biology.

D'Agostino (2007) Overview of missing data techniques. Methods Mol Biol 404:339-52. (pmid: 18450058)

Missing data frequently arise in the course of research studies. Understanding the mechanism that led to the missing data is important in order for investigators to be able to perform analyses that will lead to proper inference. This chapter will review different missing data mechanisms, including random and non-random mechanisms. Basic methods will be presented using examples to illustrate approaches to analyzing data in the presence of missing data.
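Two of the simplest approaches to missing values can be sketched in a few lines: complete-case analysis and single mean imputation (the latter is easy but understates variability). The data are invented, with None marking a missing observation:

```python
def complete_case(values):
    """Complete-case analysis: drop missing observations (None)."""
    return [v for v in values if v is not None]

def mean_impute(values):
    """Single mean imputation: replace each missing value by the
    mean of the observed values."""
    observed = complete_case(values)
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in values]

raw = [2.0, None, 3.0, 4.0, None, 3.0]
cc = complete_case(raw)
imputed = mean_impute(raw)
```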

Case & Ambrosius (2007) Power and sample size. Methods Mol Biol 404:377-408. (pmid: 18450060)

In this chapter, we discuss the concept of statistical power and show how the sample size can be chosen to ensure a desired power. Power is the probability of rejecting the null hypothesis when the null hypothesis is false; that is, the probability of saying there is a difference when a difference actually exists. An underpowered study does not have a sufficiently large sample size to answer the research question of interest. An overpowered study has too large a sample size and wastes resources. We will show how the power and required sample size can be calculated for several common types of studies, mention software that can be used for the necessary calculations, and discuss additional considerations.
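For the two-sample comparison of means, the standard normal-approximation sample-size formula is compact enough to sketch. The effect size and SD below are invented; z_alpha = 1.96 and z_beta = 0.84 correspond to two-sided alpha = 0.05 and 80% power:

```python
import math

def n_per_group(delta, sigma, z_alpha=1.96, z_beta=0.84):
    """Approximate sample size per group for a two-sample z test of
    means, detecting a difference delta when the SD is sigma."""
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

n_needed = n_per_group(delta=0.5, sigma=1.0)
# i.e. to detect a half-SD difference with 80% power at alpha = 0.05
```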

Sabatti (2007) Avoiding false discoveries in association studies. Methods Mol Biol 376:195-211. (pmid: 17984547)

We consider the problem of controlling false discoveries in association studies. We assume that the design of the study is adequate so that the "false discoveries" are potentially due only to random chance, not to confounding or other flaws. Under this premise, we review the statistical framework for hypothesis testing and correction for multiple comparisons. We consider in detail the currently accepted strategies in linkage analysis. We then examine the underlying similarities and differences between linkage and association studies and document some of the most recent methodological developments for association mapping.
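A widely used correction for multiple comparisons in this setting is the Benjamini-Hochberg step-up procedure, which is short enough to sketch (Python; the p-values are invented):

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure: return the p-values
    declared significant at false discovery rate q."""
    m = len(pvalues)
    cutoff = 0.0
    for i, p in enumerate(sorted(pvalues), start=1):
        if p <= i * q / m:   # largest p meeting its rank-scaled threshold
            cutoff = p
    return [p for p in pvalues if p <= cutoff]

hits = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.74], q=0.05)
```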

Berman & Gullion (2007) Working with a statistician. Methods Mol Biol 404:489-503. (pmid: 18450064)

This chapter presents some guidelines for working with a statistician, beginning with when and why you should consult one. We emphasize the importance of good communication between the statistician and the client and the need for clearly defined tasks and a timetable. Other considerations, such as security, confidentiality, and business arrangements, are discussed.