RPR-Data-Imputation
Imputation
(Types of missingness, Imputation of missing data)
Abstract:
In this unit, you'll learn how to impute missing values in data sets.
When performing data analytics, the data set being used may contain missing elements. Continuing analysis of the data without modification, or simply deleting incomplete data, can cause bias in the results. Data imputation aims to "fill in" those missing elements by analyzing existing data to try and estimate what those elements might be.
Objectives:
Outcomes:
Deliverables:
Prerequisites:
Evaluation
Evaluation: NA
Contents
Hastie & Tibshirani discuss missing data in the (short) section 9.6 of chapter 9 (Additive Models, Trees, and Related Models)[1].
A good introduction to the topic emphasizing the different types of missingness is Chapter 25 - Missing Data imputation, in Gelman A and Hill J. (2006) Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
There are many reasons why values could be missing in a dataset, but how to handle this situation is not straightforward. Naturally this depends on the type of data.
- If the data comprises repeated measurements of the same property, missing values can be inferred from the observed values, if necessary. These types of replacements consider data in one row of a typical dataset.
- If the data comprises independent, random samples from a population[2], missing values can be replaced based on the distribution of values in the population. These types of replacements consider the data in one column of a typical dataset - "univariate analysis". This is true even if there are multiple categories of values such as age, birthmonth and left- or right-handedness (which are independent as far as we can tell).[3] However, it is actually the exception that particular types of observations in a study are indeed independent.
- If there are correlations between data items - for example between age, height, and weight - we are in the domain of "multivariate analysis" and we have two options:
- either we exclude cases for which we don't have all values. This is called "listwise deletion" or "Complete Case Analysis (CCA)". This is often considered a conservative approach, but it may be overly strict and remove observations that are actually informative or important. More importantly, if a class of cases has a higher probability of missing observations, removing those cases will introduce a bias into our dataset. Nevertheless, CCA is the default for many statistical analysis methods.
- or we try to make an informed guess about the missing value from the available values. This is done by multivariate regression. There is much to say about that, and solving this problem is under active development. Thus most of what you will find about the subject concerns ingenious regression methods. This does not necessarily mean that most of your data requires them. Besides the technical challenges of multivariate statistics, one issue is that naive imputation by regression assigns the value that is expected, and thus artificially strengthens the correlation. Modern approaches may therefore add noise to avoid changing the error distribution, or run the imputation multiple times ("multiple imputation") and consider what confidence we have in the result.
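The effect of naive regression imputation on the correlation can be illustrated with a small simulation. This is only a sketch in base R: the height/weight relationship, the 40% missing fraction and all variable names are assumptions made up for the illustration, and the "stochastic" variant is just one simple way of adding noise.

```r
# Sketch: deterministic vs. stochastic regression imputation (base R only).
set.seed(1)
n <- 500
height <- rnorm(n, mean = 170, sd = 10)
# hypothetical linear relationship with a lot of scatter
weight <- 70 + 0.9 * (height - 170) + rnorm(n, sd = 12)

# Remove 40% of the weights completely at random.
miss <- sample(n, size = 0.4 * n)
weight_obs <- weight
weight_obs[miss] <- NA

# Fit the regression on the complete cases (lm() drops NA rows by default).
fit <- lm(weight_obs ~ height)
pred <- predict(fit, newdata = data.frame(height = height[miss]))

# (a) Deterministic: impute the expected value, i.e. a point on the line.
w_det <- weight_obs
w_det[miss] <- pred

# (b) Stochastic: add noise with the residual standard deviation of the fit.
w_sto <- weight_obs
w_sto[miss] <- pred + rnorm(length(miss), sd = sigma(fit))

cor_true <- cor(height, weight)
cor_det  <- cor(height, w_det)
cor_sto  <- cor(height, w_sto)
round(c(true = cor_true, deterministic = cor_det, stochastic = cor_sto), 3)
```

In this setup the deterministic variant should report a visibly stronger correlation than the original data, while the stochastic variant stays close to it.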
Types of Missingness
The literature distinguishes three types of missingness: Missing Completely at Random (MCAR), when all data points have the same probability to be missing; Missing at Random (MAR), when some cases have a higher probability for data to be missing, but these can be modelled from other data in our dataset; and Missing Not At Random (MNAR), when there is a specific non-random process behind the missing data, and this process can't be inferred from observed data (cf. Gelman 550).
MCAR - Missing Completely At Random
MCAR implies that all values in the data set have the same probability of being missing. If this is the case, using CCA is justified, because removing incomplete cases does not create bias. However, since the number of samples for analysis decreases, statistical power also decreases.
MAR - Missing At Random
This is a case where the missingness in the data set occurs at random, but the probability of a missing value depends on the observed data, or on other known aspects of the dataset. MAR is a somewhat weaker assumption than MCAR. Imputing MAR values can be done after stratifying subpopulations and taking their underlying value distributions into account separately. This is the major strength of the imputation packages we discuss below.
MNAR - Missing Not At Random
This is a case where you have to be careful when attempting to impute data. Since the missing values are not occurring at random, there can be relationships and patterns that are not shown in the data set. As such, it is unwise to perform any imputations without fully understanding the underlying factors and assumptions behind the non-random missingness. For example, let's say extremely small values in a dataset are left out on purpose in a study. The missing data will be MNAR, and this will be difficult to notice if the user of the data is unfamiliar with the methodology of data collection for this data set. [4][5] Attempting to impute MNAR data can therefore create values that depart a great deal from what they should be.
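To make the three mechanisms concrete, here is a minimal simulation sketch in base R. The variables age and income, the missingness probabilities and the sample size are all hypothetical choices for the illustration. The same variable is deleted under each mechanism, and the complete-case mean is compared with the mean of the full data.

```r
# Sketch: simulating MCAR, MAR and MNAR for one variable (base R only).
set.seed(42)
n <- 10000
age <- rnorm(n, mean = 50, sd = 10)            # always observed
income <- 30 + 0.5 * age + rnorm(n, sd = 5)    # will receive the NAs

# MCAR: every value has the same 30% chance of being missing.
mcar <- ifelse(runif(n) < 0.3, NA, income)

# MAR: the chance of being missing depends only on the observed variable age.
mar <- ifelse(runif(n) < plogis((age - 50) / 5), NA, income)

# MNAR: the chance of being missing depends on the value of income itself.
mnar <- ifelse(runif(n) < plogis((income - 55) / 3), NA, income)

means <- c(full = mean(income),
           MCAR = mean(mcar, na.rm = TRUE),
           MAR  = mean(mar,  na.rm = TRUE),
           MNAR = mean(mnar, na.rm = TRUE))
round(means, 2)

# Under MAR the bias can be modelled away with the observed variable:
# regress on age using the complete cases, then predict for everyone.
fit <- lm(mar ~ age)
mar_adjusted <- mean(predict(fit, newdata = data.frame(age = age)))
```

The complete-case mean should be close to the full mean only under MCAR. Under MAR it is biased, but the regression on age recovers it. Under MNAR there is no observed variable that explains the missingness, so the same trick is not available.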
Confusing? Here is a short, readable, conversational article that may clarify the points for you:
Bhaskaran & Smeeth (2014) What is the difference between missing completely at random and missing at random? Int J Epidemiol 43:1336-9. (pmid: 24706730)
The terminology describing missingness mechanisms is confusing. In particular the meaning of 'missing at random' is often misunderstood, leading researchers faced with missing data problems away from multiple imputation, a method with considerable advantages. The purpose of this article is to clarify how 'missing at random' differs from 'missing completely at random' via an imagined dialogue between a clinical researcher and statistician.
Task:
- Open RStudio and load the ABC-units R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
- Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
- Type init() if requested.
- Open the file RPR-Data-Imputation.R and follow the instructions.
Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.
1.2 Types of Imputation Methods
There are many different imputation packages available in R. I have selected two popular imputation methods for this learning unit: the mice() function from package 'mice', and aregImpute() function from package 'Hmisc'.
1.2.1 The "mice" R package overview
- The "mice" R package is short for Multivariate Imputation by Chained Equations (More background on Markov Chain Monte Carlo/MCMC)[6]. It allows us to gain access to the md.pattern() and mice() functions, essential for missing data analytics and imputation. See resources of this page for documentation. [7]
- mice assumes MAR. It uses a strategy where it looks at each individual column, and predicts each of the missing values in the column from all other columns. Then, the algorithm goes through all of the columns over and over with MCMC to come up with potential imputation sets. For numerical values, the mice function uses predictive mean matching, or 'pmm', to select the possible values to fill in. mice can also handle factor and binary values.[8]
- Because mice creates multiple imputations (the number m can be specified by you, but defaults to 5), multiple sets of possible values to impute will be generated by the end of the process. You can decide to either take one of the imputed sets as the result, or fit a model to each set and pool all the results together. [9]
- Another advantage of using mice to impute values is that it is a highly customizable method. For example, while mice defaults to 'pmm' for numerical imputations, if you feel like you have a better way to predict values, you can use that in place of pmm. You can even change the order in which the columns are visited. [10]
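The idea behind predictive mean matching can be sketched in a few lines of base R. This is a simplified illustration, not the code mice runs: the function name pmm_sketch, the donor count k and the test data are hypothetical, the predictor columns are assumed to be complete, and the random draw of regression coefficients that a proper pmm includes is left out.

```r
# Sketch: simplified predictive mean matching for one numeric column.
pmm_sketch <- function(y, X, k = 5) {
  obs <- !is.na(y)
  dat <- data.frame(y = y, X)
  # regression of y on all predictors, fitted on the observed cases only
  fit <- lm(y ~ ., data = dat[obs, , drop = FALSE])
  pred <- predict(fit, newdata = dat)          # predicted means for all rows
  y_imp <- y
  for (i in which(!obs)) {
    # donors: the k observed cases whose prediction is closest to case i
    donors <- order(abs(pred[obs] - pred[i]))[seq_len(min(k, sum(obs)))]
    # borrow the actually observed value of one randomly chosen donor
    y_imp[i] <- y[obs][donors[sample.int(length(donors), 1)]]
  }
  y_imp
}

set.seed(7)
X <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
y <- 2 * X$x1 - X$x2 + rnorm(50, sd = 0.5)
y[c(3, 17, 29, 44)] <- NA
y_filled <- pmm_sketch(y, X)
```

Because every imputed value is copied from an observed case, pmm never produces values outside the observed range, which is one reason it is a robust default for numerical data.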
1.2.2 The aregImpute() method overview (from Hmisc R package)
- The 'Hmisc' R package allows us to gain access to the aregImpute() function, which will be used in this learning unit.
- aregImpute() uses bootstrapping to approximate drawing predicted values from a Bayesian predictive distribution, and pmm to impute values. A different bootstrap sample is used for each of the multiple imputations. Instead of going column-by-column like mice, it starts by taking random selections of missing elements. [11]
- aregImpute() handles larger data sets better than mice as a result of the difference in their approaches. [12]
1.3 Which method should I use?
Use 'mice' and its package to impute data when...
- Data set is reasonably sized
- You want to find more about your data set
- You require high customization with your imputation methodology (e.g. you want to use a custom method instead of pmm)
Use 'aregImpute()' and 'Hmisc' package when...
- Data set is comparatively larger
- You have good understanding of the data and do not require multiple customization and adjustments
2. Data Imputation with R - Imputation method specifics
Now that you have an understanding of missing data and data imputation methods, let's get started on actual application of these methods. Please refer back to Section 2 of the R-script.
2.1 Running MICE on a small synthetic data set
For this section, we'll continue using the synthetic data created in the last section to explore the functions in the 'mice' package. As a reminder: since the synthetic data we created does not contain any missing values, I used prodNA(), a function from the missForest package, to randomly introduce NAs into the dataset. Here, we set the proportion to 5%. Note that prodNA() removes values completely at random, so despite its name, df_MAR is strictly speaking an MCAR data set. Using prodNA() is inspired by Analytics Vidhya's introduction to R imputation packages.[13]
df_MAR <- prodNA(synth_df, noNA = 0.05)
An important feature in mice is md.pattern(). This function allows you to gain detailed understanding of your data set[14][15]. Running this line of code below
md.pattern(df_MAR)
will yield a table that looks like this:
|   | ct_A | ct_C | ct_D | ct_B |   |
|---|---|---|---|---|---|
| 8 | 1 | 1 | 1 | 1 | 0 |
| 2 | 1 | 1 | 1 | 0 | 1 |
|   | 0 | 0 | 0 | 2 | 2 |
Here each row denotes a pattern of missing values, where 1 means observed and 0 means missing. The first column gives the number of rows with that pattern, the last column gives the number of variables that are missing in the pattern, and the bottom row gives the number of missing values per variable. In this case, it shows that there are 8 observations (rows of data) where none of the data are missing; there are 2 missing values in column ct_B, but the other variables remain intact. Therefore, there are 2 missing data points in this data set. As you can see, md.pattern() is good for picking up any potential irregularities when you're dealing with a data set that you are unfamiliar with, and helps you determine if the data presented is suitable for imputation.
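The same counts can be cross-checked without mice. The following base R sketch builds a hypothetical 10 x 4 data frame that mimics the table above (the values are random and are not the unit's synth_df) and tabulates the missingness patterns by hand.

```r
# Sketch: counting missing values and missingness patterns in base R.
set.seed(3)
df <- data.frame(ct_A = rpois(10, 20), ct_B = rpois(10, 20),
                 ct_C = rpois(10, 20), ct_D = rpois(10, 20))
df$ct_B[c(4, 9)] <- NA                    # two missing values in ct_B

na_per_column <- colSums(is.na(df))       # missing values per variable
n_complete <- sum(complete.cases(df))     # rows without any NA

# One string per row, 1 = observed and 0 = missing (as in md.pattern()).
pattern <- apply(!is.na(df), 1,
                 function(r) paste(as.integer(r), collapse = ""))
pattern_counts <- table(pattern)

na_per_column
n_complete
pattern_counts
```

The pattern string follows the column order of the data frame, whereas md.pattern() sorts the columns by their number of missing values.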
Exercise 1: Examine the "airquality" data set, which should already be available after importing the 'datasets' package. How many NA's are there? How many instances are there where data for all columns are present but not Ozone?
Now we get to actually run mice() to come up with a set of candidate imputation values.
The line below shows how we do this with our synthetic data with 2 missing values. Take df_MAR as our object, set the number of imputation sets (m) to 5 (this is the default), set the maximum number of iterations to 50, and select 'pmm' as our method since we are dealing with numerical values. The following code for imputation with MICE is based on mice's R documentation and R-bloggers' Imputing data with MICE unit.
imputed_data <- mice(df_MAR, m=5, maxit = 50, method = "pmm", seed = 100)
This creates a list that we can "complete". The list contains 5 different imputations that you can use to "complete" the missing set with. Here I selected 1, but you can pick any number from 1-5.
df_MAR <- complete(imputed_data, 1)
Now, our missing values have been filled in. After they've been filled in, we can validate our imputations by comparing the correlation matrices.
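One way such a comparison could look is sketched below in base R. The helper max_cor_shift() and the simulated data are hypothetical, and the missing cells are filled with the column mean purely as a stand-in so that the example runs without mice. With the unit's data you would compare the original synthetic data frame with the completed one instead.

```r
# Sketch: comparing correlation matrices before and after imputation.
# Largest absolute difference between two correlation matrices.
max_cor_shift <- function(reference, completed) {
  max(abs(cor(reference) - cor(completed)))
}

set.seed(11)
n <- 200
a <- rnorm(n)
truth <- data.frame(ct_A = a,
                    ct_B = a + rnorm(n, sd = 0.5),   # correlated with ct_A
                    ct_C = rnorm(n))
with_na <- truth
with_na$ct_B[sample(n, 60)] <- NA

# Stand-in "imputation" for illustration only: fill with the column mean.
filled <- with_na
filled$ct_B[is.na(filled$ct_B)] <- mean(with_na$ct_B, na.rm = TRUE)

shift <- max_cor_shift(truth, filled)
cor_before <- cor(truth)["ct_A", "ct_B"]
cor_after  <- cor(filled)["ct_A", "ct_B"]
round(c(before = cor_before, after = cor_after, max_shift = shift), 3)
```

A good imputation keeps the shift small. Mean filling is used here precisely because it is a poor method: it weakens the correlation between ct_A and ct_B, and the check picks that up. For real data, where the true values are unknown, cor(x, use = "complete.obs") on the incomplete data can serve as the reference.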
We can also take a look at visual representations. I used the xyplot() method that mice provides for its imputation objects (it builds on the 'lattice' package). Note pch and cex are design parameters. Make sure that the variable with missing values precedes the ~. The code to perform xyplot is again based on the R-bloggers post.
xyplot(imputed_data, ct_B ~ ct_A + ct_C + ct_D, pch=2, cex = 0.5)
On this plot, the pink denotes imputed values; in general, you'd want the pink to align with the blue (existing values)[16].
Of course, we can use our multiple imputation sets and fit a linear model to the data. For mice, this is a sequence of fitting with the with() function and then pooling the fits with the pool() function[17]. Use summary() to get an overview of the fit. Again, the code is based on the R-bloggers post.
fit <- with(data = imputed_data, lm(ct_B ~ ct_A + ct_C + ct_D))
summary(pool(fit))
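What pooling does with the m separate fits can be worked through by hand. pool() combines the results with Rubin's rules; the sketch below applies these rules in base R to one regression coefficient. The five estimates and standard errors are made-up numbers, not output from the unit's data, and the degrees-of-freedom adjustment that pool() also performs is omitted.

```r
# Sketch: Rubin's rules for one coefficient across m = 5 imputed data sets.
est <- c(1.92, 2.05, 1.98, 2.11, 1.94)   # hypothetical estimates
se  <- c(0.30, 0.28, 0.31, 0.29, 0.30)   # hypothetical standard errors
m <- length(est)

pooled_est <- mean(est)                  # pooled point estimate
within  <- mean(se^2)                    # average within-imputation variance
between <- var(est)                      # variance between the imputations
total   <- within + (1 + 1 / m) * between
pooled_se <- sqrt(total)
t_value <- pooled_est / pooled_se

round(c(estimate = pooled_est, std.error = pooled_se, t = t_value), 4)
```

The between-imputation term is what distinguishes multiple imputation from using a single completed data set: the more the imputations disagree, the larger the pooled standard error becomes.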
Finally, let's look at the example where 30% of the values are NA. The imputation does not perform as well as in the previous examples, as there are fewer observed values to base the imputation on. This is important: know when there is simply too much data loss, so that the results of an imputation would not really tell you much.
imputed_data_thirty <- mice(df_MAR_thirty, m=5, maxit = 50,
method = "pmm", seed = 100)
df_MAR_thirty_completed <- complete(imputed_data_thirty,2)
cor(df_MAR_thirty_completed)
This concludes section 2. In the next section (2.2 here, but section 3 in the R script), we will look at a real data set, GSE4987 yeast cell cycle data.
2.2 Running MICE on a small subset of the GSE4987 data set
This section is more of a demonstration of how the mice() functions work with a real set of data. Work through it yourself for practice with the mice package.
Exercise 2: Let's move along with our 'airquality' data set. Run mice() on airquality with the same parameters as the ones in the script (feel free to play around with the other parameters on the side). Make sure to set the seed to 100 so we have consistent results. After you have the list of imputed data, select set #2 and use it to fill in the NAs in airquality. View it to make sure the NAs have been filled. Then, fit and pool the data. What is the t value for Day?
This concludes section 3 of the learning unit. Move on to section 4 for the final part, which deals with large datasets using Hmisc.
2.3 Large datasets
As mentioned earlier, mice does not cope well with large data sets. Here we discuss running Hmisc's aregImpute() on a large subset (all 6216 features, 10 samples at a time). You can try running mice on it, but your machine will most likely not be able to complete the run. This is where Hmisc's aregImpute() comes in. My code for aregImpute(), impute.transcan(), and completion of imputation is inspired by Analytics Vidhya's post, with some help from a stackexchange post on formatting the formula for aregImpute(), which was posted by user 'Gurkenhals'.
We are going with the GSE4987 data with all of its observations and 10 samples. Follow along with the R script to see how to set up the formula for imputation in aregImpute(). When we have the formula ready, it's time to impute with aregImpute().
imputed_large_GSE_data <- aregImpute(formula = impute_colNames, data = large_GSE_dataset_df, n.impute = 5)
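The formula object passed to aregImpute() above is a one-sided formula that lists the columns to use. The R script shows the actual construction; one possible way to assemble such a formula from column names is sketched below. The sample IDs other than GSM112148 and the name impute_formula are hypothetical placeholders.

```r
# Sketch: building a one-sided formula from column names (base R only).
# Hypothetical sample (column) names; only GSM112148 appears in this unit.
sample_ids <- c("GSM112148", "GSM112149", "GSM112150")

# Paste the names together as "~ a + b + c" and convert to a formula.
impute_formula <- as.formula(paste("~", paste(sample_ids, collapse = " + ")))
impute_formula

# The variables the formula refers to, for checking.
all.vars(impute_formula)
```

Building the formula programmatically avoids typing out many column names by hand. Column names that are not syntactically valid R names would have to be wrapped in backticks before pasting.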
The difference between Hmisc and mice is that Hmisc uses the function impute.transcan() for what mice does with complete()[18].
imputed_transcan <- impute.transcan(imputed_large_GSE_data, data = large_GSE_dataset_df, imputation = 2, list.out = TRUE, pr = FALSE, check = FALSE)
completed_large_GSE_dataset <- as.data.frame(do.call(cbind, imputed_transcan))
completed_large_GSE_dataset <- completed_large_GSE_dataset[,colnames(large_GSE_dataset_df), drop = FALSE]
Check with
summary(completed_large_GSE_dataset)
to see that the missing values were indeed imputed.
Exercise 3: I've done imputations for columns 1-10 in the R script using Hmisc. Help me out; do imputations for columns 11-20. Keep all other parameters the same. Does the median change for GSM112148 after imputation? If so, what is the new median?
This concludes the learning unit for RPR-Imputation. See the next section for exercise solutions.
3. Solutions for Exercises
- Exercise 1:
>md.pattern(airquality)
# Answer: This should show 44 missing values in total, of which 35 follow the pattern where only Ozone is missing.
- Exercise 2:
>exercise2_impute_data <- mice(airquality, m=5, maxit = 50, method = "pmm", seed = 100)
>exercise2_completed_data <- complete(exercise2_impute_data, 2)
>View(exercise2_completed_data)
>exercise2_fit_data <- with(data = exercise2_impute_data, lm(Wind ~ Temp+Month+Day+Solar.R+Ozone))
>summary(pool(exercise2_fit_data))
# Answer: 0.3180647
- Exercise 3:
Simply change the subsetting from 1:10 to 11:20 for the code in section 3. Compare the values between summary(large_GSE_dataset) and summary(completed_large_GSE_dataset). The median does not change for GSM112148, which remains as 0.005550.
Further reading, links and resources
- First and foremost, the R guidebook: R for Data Science
- A quick Wikipedia overview for Data Imputation: Imputation (Statistics)
- Kowarik and Templ (2016) give a good overview of the pros and cons of various R imputation packages and how these relate to their own package VIM. See: Kowarik A. and Templ M. (2016) Imputation with the R Package VIM. J Stat Soft 74. (Download PDF).
- A good tutorial on how to use MICE in R (code makes many references): Imputing missing data with R; MICE package
- MICE documentation: Package 'mice'
- Hmisc documentation: Package 'Hmisc'
- A fantastic document from Columbia University for the different methods of data imputation, along with R code and plots: Missing-data imputation
Chiu et al. (2013) Missing value imputation for microarray data: a comprehensive comparison study and a web tool. BMC Syst Biol 7 Suppl 6:S12. (pmid: 24565220)
BACKGROUND: Microarray data are usually peppered with missing values due to various reasons. However, most of the downstream analyses for microarray data require complete datasets. Therefore, accurate algorithms for missing value estimation are needed for improving the performance of microarray data analyses. Although many algorithms have been developed, there are many debates on the selection of the optimal algorithm. The studies about the performance comparison of different algorithms are still incomprehensive, especially in the number of benchmark datasets used, the number of algorithms compared, the rounds of simulation conducted, and the performance measures used. RESULTS: In this paper, we performed a comprehensive comparison by using (I) thirteen datasets, (II) nine algorithms, (III) 110 independent runs of simulation, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, the effects of different types of microarray datasets on the performance of each imputation algorithm were evaluated. Second, we discussed whether the datasets from different species have different impact on the performance of different algorithms. To assess the performance of each algorithm fairly, all evaluations were performed using three types of measures. Our results indicate that the performance of an imputation algorithm mainly depends on the type of a dataset but not on the species where the samples come from. In addition to the statistical measure, two other measures with biological meanings are useful to reflect the impact of missing value imputation on the downstream data analyses. Our study suggests that local-least-squares-based methods are good choices to handle missing values for most of the microarray datasets. CONCLUSIONS: In this work, we carried out a comprehensive comparison of the algorithms for microarray missing value imputation. Based on such a comprehensive comparison, researchers could choose the optimal algorithm for their datasets easily. Moreover, new imputation algorithms could be compared with the existing algorithms using this comparison strategy as a standard protocol. In addition, to assist researchers in dealing with missing values easily, we built a web-based and easy-to-use imputation tool, MissVIA (http://cosbi.ee.ncku.edu.tw/MissVIA), which supports many imputation algorithms. Once users upload a real microarray dataset and choose the imputation algorithms, MissVIA will determine the optimal algorithm for the users' data through a series of simulations, and then the imputed results can be downloaded for the downstream data analyses.
Notes
- ↑ Hastie T, Tibshirani R. and Friedman J. (2009) The Elements of Statistical Learning. Data Mining, Inference, and Prediction. 2nd edition. Springer. (pdf)
- ↑ Such independent, random samples are called iid in statistical texts - independent and identically distributed - i.e. the probability of taking a sample does not depend on the other samples (e.g. family members are not "independent" in a survey of religious beliefs in a population), and it does not depend on the value itself (e.g. non-responders will be underrepresented in a street survey on flu-shot efficacy, because they are at home in bed).
- ↑ Or are they? Do older people often think they are right-handed because they have been taught that way in school?
- ↑ https://www.theanalysisfactor.com/mar-and-mcar-missing-data/
- ↑ http://www.stat.columbia.edu/~gelman/arm/missing.pdf
- ↑ https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
- ↑ https://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/
- ↑ https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
- ↑ https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
- ↑ https://cran.r-project.org/web/packages/mice/mice.pdf
- ↑ https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
- ↑ https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
- ↑ https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
- ↑ https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
- ↑ https://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/
- ↑ https://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/
- ↑ https://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/
- ↑ https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
About ...
Authors:
- Greg Huang <gregoryhuang2005@gmail.com> (Initial contents development, BCB410 2018)
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2017-09-17
Modified:
- 2018-01-31
Version:
- 1.1
Version history:
- 1.1 Rewrite
- 1.0 BCB420 submission by Greg Huang
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.