RPR-Data-Imputation


Imputation

(Types of missingness, Imputation of missing data)


 


Abstract:

In this unit, you'll learn how to impute missing values in data sets.

When performing data analytics, the data set being used may contain missing elements. Continuing analysis of the data without modification, or simply deleting incomplete data, can cause bias in the results. Data imputation aims to "fill in" those missing elements by analyzing existing data to try and estimate what those elements might be.



Objectives:

  • Introduce the concept of data imputation and its applications
  • Explore the different types of missing data and available imputation methods
  • Become familiar with the MICE (Multivariate Imputation by Chained Equations) R package
  • Become familiar with the Hmisc R package

Outcomes:

  • Ability to utilize both MICE and Hmisc packages in R for data imputation
  • Thorough understanding of the types of missing data and the ability to recognize their differences
  • Generate multiple data sets with data imputed from various packages

Deliverables:

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Prerequisites:


 


Caution!

This unit is under development. There is some content here, but it is incomplete and/or may change significantly: links may lead nowhere, the content is likely to be rearranged, and objectives, deliverables etc. may be incomplete or missing. Do not work with this material until it is updated to "live" status.


 



 


Contents

 

Before beginning this unit, please go to the BCB410-DataScience repository on GitHub and open the RPR_Data_Imputation.R script. Load the appropriate packages, and have Section 1 of the script ready.


The introduction to data imputation and the types of missingness is primarily based on an article from Columbia University on missing-data imputation, with some references to R blogs.
Data imputation is a technique used to "fill in" missing data in a data set. For example, the table below is missing 3 data points; in R, missing values are denoted NA. [1]

Height Weight Age
170    NA     19
178    78     26
182    98     NA
164    44     NA

When encountering missing data like these, the most straightforward approach (and incidentally, the one many programs default to) is to simply remove every row that contains missing data, known as list-wise deletion or Complete Case Analysis (CCA). In most cases, however, this practice biases subsequent analyses and wastes data, since it also discards the valid data points in those same rows. [2][3]
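As a quick illustration, list-wise deletion is one line of base R. The data frame below mirrors the toy table above (it is constructed here for illustration, not taken from the course script); note how much valid data gets discarded along with the NAs:

```r
# Toy data frame mirroring the Height/Weight/Age table above
df <- data.frame(Height = c(170, 178, 182, 164),
                 Weight = c(NA,   78,  98,  44),
                 Age    = c(19,   26,  NA,  NA))

complete.cases(df)      # FALSE TRUE FALSE FALSE: only row 2 is fully observed

df_cca <- na.omit(df)   # list-wise deletion (Complete Case Analysis)
nrow(df_cca)            # 1: removing 3 NAs also threw away 6 observed values
```

With only 3 missing cells, CCA discards three quarters of the rows, which is exactly the data loss this section warns about.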

It is therefore more favourable to impute the data, i.e. to estimate what the missing values should be, whenever possible. There are many ways to do this; which method to use depends on conditions that we'll detail later in this learning unit.[4]

In this section, you will be exploring different ways to impute these missing data points.


 

1.1 Types of Missing Data

 

There are three major types of missing data:

1.1.1 MCAR, Missing Completely At Random
This is a rare case in real life. Essentially, it means that every unit in the data set has the same likelihood of being missing: the chance that a value is missing is determined by some fixed probability that does not depend on any other column in the data. In other words, the observed data give us no way to infer what the missing values are. In this case, performing complete case analysis actually does not introduce bias.[5]

1.1.2 MAR, Missing At Random
This is a case where the probability that a value is missing depends only on the observed data, not on the missing values themselves. As such, we can use the existing observed data to impute the missing units. This learning unit will use data from this group for its examples. [6][7]

1.1.3 MNAR, Missing Not At Random
This is a case where you have to be careful when attempting to impute data. Since the missing units are not missing at random, there may be relationships and patterns that are not visible in the data set. It is therefore unwise to perform any imputation without fully understanding the underlying factors and assumptions behind the non-random missingness. For example, suppose extremely small values were deliberately left out of a study's data set. The missing data would be MNAR, and this would be difficult to detect for anyone unfamiliar with how the data were collected. [8][9] Attempting to impute MNAR data can therefore, understandably, produce values that depart substantially from what they should be.
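To make the distinction concrete, here is a small base-R simulation of my own (not part of the course script) that generates MCAR and MAR missingness in the same variable; the variable and parameter names are illustrative:

```r
set.seed(42)
n <- 1000
x <- rnorm(n)              # fully observed covariate
y <- 2 * x + rnorm(n)      # variable that will lose values

# MCAR: every value of y has the same 20% chance of being missing
y_mcar <- y
y_mcar[runif(n) < 0.2] <- NA

# MAR: the chance that y is missing depends on the *observed* x
# (here, y is more likely to be missing when x is large)
y_mar <- y
y_mar[runif(n) < plogis(2 * x)] <- NA

mean(is.na(y_mcar))        # roughly 0.2, unrelated to x
mean(x[is.na(y_mar)])      # noticeably larger than ...
mean(x[!is.na(y_mar)])     # ... the mean of x where y is observed
```

Under MAR, comparing x between rows with and without y already reveals the pattern, which is exactly the information an imputation method can exploit.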


There is a quick demo in Section 1 of the R script that demonstrates the differences between these types of missingness, as well as the list-wise deletion method (CCA). Since MCAR rarely happens and essentially requires a completely random data set, I will not be demoing MCAR here.

The take-away message from this subsection: before you impute data, your first task is to treat 'missingness' as an observation in its own right, and try to determine whether there is a pattern to it.
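In base R, treating missingness as an observation can be as simple as working with is.na() indicators. A small toy sketch (the data and column names are made up for illustration):

```r
# Does missingness in Weight relate to the observed Age column?
df <- data.frame(Age    = c(19, 26, 31, 45, 52, 60),
                 Weight = c(70, NA, 65, NA, NA, 80))

colSums(is.na(df))          # NA counts per column: Age 0, Weight 3

miss <- is.na(df$Weight)    # missingness indicator, one entry per row
tapply(df$Age, miss, mean)  # mean Age for observed vs. missing Weight

# A clear gap between the two means hints at MAR rather than MCAR;
# with real data, use a formal test (e.g. t.test) and far more rows.
```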



 

1.2 Types of Imputation Methods

 

There are many different imputation packages available in R. I have selected two popular imputation methods for this learning unit: the mice() function from the package 'mice', and the aregImpute() function from the package 'Hmisc'.

1.2.1 The "mice" R package overview

  • The "mice" R package is short for Multivariate Imputation by Chained Equations (more background on Markov Chain Monte Carlo/MCMC)[10]. It provides the md.pattern() and mice() functions, essential for missing-data analytics and imputation. See the resources on this page for documentation. [11]
  • mice assumes MAR. It works column by column, predicting the missing values in each column from all of the other columns; the algorithm then cycles through the columns repeatedly (an MCMC-style scheme) to produce candidate imputation sets. For numerical values, mice uses predictive mean matching ('pmm') to propose values to fill in. It can also handle factor and binary variables.[12]
  • Because mice generates multiple imputation sets (you can set the number, m; it defaults to 5), you end up with several candidate completions of the data. You can either take one of them as the result, or pool them all together and build a model. [13]
  • Another advantage of mice is that it is highly customizable. For example, while it defaults to 'pmm' for numerical imputations, you can substitute your own prediction method if you feel you have a better one, and you can even change the order in which the columns are visited. [14]
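The workflow in these bullets can be sketched end to end on nhanes, a small example data set that ships with the mice package (the parameter values below are illustrative choices, not prescriptions):

```r
library(mice)

md.pattern(nhanes)                        # inspect the missingness pattern

imp <- mice(nhanes, m = 5, maxit = 10,    # 5 imputation sets, 10 iterations
            method = "pmm", seed = 1,     # predictive mean matching
            printFlag = FALSE)

one_set <- complete(imp, 1)               # option A: take one imputed set
anyNA(one_set)                            # FALSE: all gaps are filled

fit <- with(imp, lm(chl ~ age + bmi))     # option B: fit on all 5 sets ...
summary(pool(fit))                        # ... and pool the results
```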


1.2.2 The aregImpute() method overview (from Hmisc R package)

  • The 'Hmisc' R package provides the aregImpute() function, which will be used in this learning unit.
  • aregImpute() uses bootstrapping to draw predicted values from a Bayesian model, and pmm to impute values. A different bootstrap sample is used for each imputation. Instead of going column by column like mice, it starts by taking random selections of the missing elements. [15]
  • As a result of this difference in approach, aregImpute() handles larger data sets better than mice. [16]
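A minimal, self-contained sketch of that workflow on made-up data (the variable names and parameters are illustrative; the impute.transcan() call follows the same pattern used later in this unit):

```r
library(Hmisc)

set.seed(1)
a <- rnorm(100)
d <- data.frame(a = a,
                b = a + rnorm(100, sd = 0.5),  # b is related to a
                c = rnorm(100))
d$b[sample(100, 15)] <- NA                 # knock 15 values out of b

# All variables go on the right-hand side of the formula
imp <- aregImpute(~ a + b + c, data = d, n.impute = 5)

# Pull imputation set 2 back into a completed data frame
filled <- impute.transcan(imp, data = d, imputation = 2,
                          list.out = TRUE, pr = FALSE, check = FALSE)
completed <- as.data.frame(do.call(cbind, filled))
anyNA(completed)                           # FALSE
```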


 

1.3 Which method should I use?

 

Use 'mice' and its package to impute data when...

  • Data set is reasonably sized
  • You want to find out more about your data set
  • You require high customization with your imputation methodology (e.g. you want to use a custom method instead of pmm)

Use 'aregImpute()' and 'Hmisc' package when...

  • Data set is comparatively larger
  • You have a good understanding of the data and do not require much customization or adjustment


 

2. Data Imputation with R - Imputation method specifics

 

Now that you have an understanding of missing data and data imputation methods, let's get started on actually applying them. Please refer to Section 2 of the R script.


 

2.1 Running MICE on a small synthetic data set

 

For this section, we'll continue using the synthetic data created in the last section to explore the functions in the 'mice' package. A refresher: since the synthetic data we created contains no missing values, I used prodNA(), a function from the missForest package, to introduce NAs at random; here we set the proportion to 5%. Using prodNA() is inspired by Analytics Vidhya's introduction to R imputation packages.[17]

df_MAR <- prodNA(synth_df, noNA = 0.05)

An important feature of mice is md.pattern(). This function gives you a detailed view of the missingness in your data set[18][19]. Running the line of code below

md.pattern(df_MAR)

will yield a table that looks like this:

  ct_A ct_C ct_D ct_B
8    1    1    1    1 0
2    1    1    1    0 1
     0    0    0    2 2

Each row denotes a pattern of missingness: the left column counts the rows showing that pattern, a 1 means the variable is observed, a 0 means it is missing, and the right column counts the missing variables in the pattern. Here, 8 observations (rows of data) have no missing values, and 2 observations are missing ct_B while the other variables remain intact; that makes 2 missing data points in this data set. As you can see, md.pattern() is good for picking up potential irregularities when you're dealing with an unfamiliar data set, and helps you determine whether the data are suitable for imputation.


Exercise 1: Examine the "airquality" data set, which should already be available after importing the 'datasets' package. How many NA's are there? How many instances are there where data for all columns are present but not Ozone?


Now we get to actually run mice() to generate a set of candidate imputation values. The line below shows how we do it with our synthetic data and its 2 missing values: take df_MAR as our data, set the number of imputation sets (m) to 5 (the default), set the maximum iterations to 50, and select 'pmm' as the method since we are dealing with numerical values. The following code is based on mice's R documentation and R-bloggers' "Imputing missing data with mice" post.

imputed_data <- mice(df_MAR, m=5, maxit = 50, method = "pmm", seed = 100)

This creates an object (of class 'mids') holding 5 different imputation sets that we can use to "complete" the missing data. Here I selected set 1, but you can pick any number from 1-5.

df_MAR <- complete(imputed_data, 1)

Now our missing values have been filled in, and we can validate the imputations by comparing correlation matrices.
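Here is a self-contained sketch of that check on toy data (the object names are mine, not the script's): compare the correlation in the complete data, in the data with NAs, and after imputation.

```r
library(mice)

set.seed(100)
n <- 200
x <- rnorm(n)
y <- 0.8 * x + rnorm(n, sd = 0.6)
full <- data.frame(x, y)

holey <- full
holey$y[sample(n, 20)] <- NA                    # punch 10% holes in y

imp    <- mice(holey, m = 5, method = "pmm", seed = 100, printFlag = FALSE)
filled <- complete(imp, 1)

cor(full$x,  full$y)                            # ground truth
cor(holey$x, holey$y, use = "complete.obs")     # before imputation, NAs dropped
cor(filled$x, filled$y)                         # after imputation
# The closer the last value is to the first, the better the imputation.
```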

We can also take a look visually. I used xyplot(), which the 'mice' package provides for imputed ('mids') objects; note that pch and cex are plotting parameters. Make sure the variable with missing values appears to the left of the ~. The code for xyplot is again based on the R-bloggers post.

xyplot(imputed_data, ct_B ~ ct_A + ct_C + ct_D, pch=2, cex = 0.5)

On this plot, the pink denotes imputed values; in general, you'd want the pink to align with the blue (existing values)[20].

Of course, we can also use all of our imputation sets and fit a linear model to the data. For mice, this is a sequence of fitting with the with() function and then pooling the fits with the pool() function[21]. Use summary() to get an overview of the pooled fit. Again, the code follows the R-bloggers post.

fit <- with(data = imputed_data, lm(ct_B ~ ct_A + ct_C + ct_D))
summary(pool(fit))


Finally, let's look at an example with 30% NAs. The imputation does not perform as well as in the previous examples, since there are fewer observed values to base the imputation on. This is important: recognize when there is simply too much data loss, because then your imputed results really wouldn't be telling you much.

imputed_data_thirty <- mice(df_MAR_thirty, m=5, maxit = 50,
                     method = "pmm", seed = 100)
df_MAR_thirty_completed <- complete(imputed_data_thirty,2)
cor(df_MAR_thirty_completed)

This concludes section 2. In the next section (2.2 here, but section 3 in the R script), we will look at a real data set, GSE4987 yeast cell cycle data.


 

2.2 Running MICE on a small subset of the GSE4987 data set

 

This section is more of a demonstration of how the mice() functions work with a real set of data. Work through it yourself for practice with the mice package.


Exercise 2: Let's move along with our 'airquality' data set. Run mice() on airquality with the same parameters as in the script (feel free to play around with the other parameters on the side). Make sure to set the seed to 100 so we get consistent results. Once you have the imputed data, select set #2 and load it back to fill in the NAs in airquality. View it to make sure the NAs have been filled. Then fit and pool the data. What is the t value for Day?


This concludes section 3 of the learning unit. Move on to section 4 for the final part, which is dealing with large datasets using HMisc.


 

2.3 Running aregImpute() on large data sets

 

As mentioned earlier, mice does not cope well with large data sets. Here we discuss running Hmisc's aregImpute() on a large subset (all 6216 features, 10 samples at a time). You can try running mice, but your machine will almost certainly struggle with it; this is where aregImpute() comes in. My code for aregImpute(), impute.transcan(), and completing the imputation is based on Analytics Vidhya's post, with some help from a Stack Exchange post by user 'Gurkenhals' on formatting the formula for aregImpute().

We are going with the GSE4987 data with all of its observations and 10 samples. Follow along with the R script to see how to set up the formula for imputation in aregImpute(). When we have the formula ready, it's time to impute with aregImpute().

imputed_large_GSE_data <- aregImpute(formula = impute_colNames, data = large_GSE_dataset_df, n.impute = 5)

One difference between Hmisc and mice is that Hmisc uses the function impute.transcan() to play the role of mice's complete()[22].

imputed_transcan <- impute.transcan(imputed_large_GSE_data, data = large_GSE_dataset_df, imputation = 2, list.out = TRUE, pr = FALSE, check = FALSE)
completed_large_GSE_dataset <- as.data.frame(do.call(cbind, imputed_transcan))
completed_large_GSE_dataset <- completed_large_GSE_dataset[,colnames(large_GSE_dataset_df), drop = FALSE]

Check with

summary(completed_large_GSE_dataset)

to see that these missing values were indeed imputed.


Exercise 3: I've done imputations for columns 1-10 in the R script using Hmisc. Help me out; do imputations for columns 11-20. Keep all other parameters the same. Does the median change for GSM112148 after imputation? If so, what is the new median?


This concludes the learning unit for RPR-Imputation. See the next section for exercise solutions.


 

3. Solutions for Exercises

 
  • Exercise 1:
>md.pattern(airquality)
# Answer: This should show 44 total missing values, 35 of which have the pattern of only Ozone missing.
  • Exercise 2:
>exercise2_impute_data <- mice(airquality, m=5, maxit = 50, method = "pmm", seed = 100)
>exercise2_completed_data <- complete(exercise2_impute_data, 2)
>View(exercise2_completed_data)
>exercise2_fit_data <- with(data = exercise2_impute_data, lm(Wind ~ Temp+Month+Day+Solar.R+Ozone))
>summary(pool(exercise2_fit_data))
# Answer: 0.3180647
  • Exercise 3:

Simply change the subsetting from 1:10 to 11:20 in the Section 3 code. Compare the values between summary(large_GSE_dataset) and summary(completed_large_GSE_dataset). The median for GSM112148 does not change; it remains 0.005550.


 

Notes

  1. http://www.stat.columbia.edu/~gelman/arm/missing.pdf
  2. http://www.stat.columbia.edu/~gelman/arm/missing.pdf
  3. https://www.statmethods.net/stats/correlations.html
  4. http://www.stat.columbia.edu/~gelman/arm/missing.pdf
  5. https://www.theanalysisfactor.com/mar-and-mcar-missing-data/
  6. https://www.theanalysisfactor.com/mar-and-mcar-missing-data/
  7. http://www.stat.columbia.edu/~gelman/arm/missing.pdf
  8. https://www.theanalysisfactor.com/mar-and-mcar-missing-data/
  9. http://www.stat.columbia.edu/~gelman/arm/missing.pdf
  10. https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
  11. https://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/
  12. https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
  13. https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
  14. https://cran.r-project.org/web/packages/mice/mice.pdf
  15. https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
  16. https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
  17. https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
  18. https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
  19. https://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/
  20. https://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/
  21. https://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/
  22. https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/

Further reading, links and resources


 


 




 

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.



 

About ...
 
Authors:

  • Greg Huang <gregoryhuang2005@gmail.com> (Initial contents development, BCB410 2018)
  • Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-09-17

Modified:

2017 09 - 2018 01

Version:

1.1

Version history:

  • 1.1 Formatting for Course Wiki
  • 1.0 BCB410 submission by Greg Huang

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.