Missing data
Overview
Types of Missing Data
Strategies for Handling Missing Data
Software Applications and Examples
Sources of Missing Data
◦ Item non-response
Missing value for any given item
◦ Scale non-response
Missing value for any given scale
Often a result of item non-response
◦ Attrition
Missing value (item and/or scale) for any given time point
◦ Data entry error
Observed value not included
So I have missing data…what’s the big deal?
◦ Missing data, no matter how minimal, can (and probably do)
result in biased results
◦ Statistical power
◦ Validity
How much missing data is “problematic”? Depends on who you ask…
Answer #1
ANY
Answer #2
It's never “too much”
Optimal methods can easily accommodate 50% missing data
Answer #3
>5% (Schafer, 1999)
>10% (Bennett, 2001)
>20% (Peng, et al., 2006)
Answer #4 (Widaman, 2006)
1%-2% (Negligible)
5%-10% (Minor)
10%-25% (Moderate)
25%-50% (High)
>50% (Excessive)
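As a rough illustration, Widaman's (2006) labels can be attached to a variable's missingness rate. The function names are hypothetical, and the treatment of the gaps between the published bands (for example, 2% to 5%) is an assumption of this sketch; the slide does not say how such values should be labeled.

```python
def percent_missing(values):
    """Return the percentage of entries that are None."""
    if not values:
        raise ValueError("values must be non-empty")
    return 100.0 * sum(v is None for v in values) / len(values)


def widaman_label(pct):
    """Map a missingness percentage to Widaman's (2006) descriptive label.

    Assumption: each band includes its upper bound, and a value that falls
    in an unlisted gap (e.g. 3%) takes the next higher label.
    """
    if pct <= 2:
        return "Negligible"
    if pct <= 10:
        return "Minor"
    if pct <= 25:
        return "Moderate"
    if pct <= 50:
        return "High"
    return "Excessive"


# Hypothetical variable with one of four values missing (25%).
example_label = widaman_label(percent_missing([1, None, 3, 4]))
```

Under these assumptions a variable with 25% missing is labeled "Moderate"; a different rule for the boundaries would be equally defensible.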
Missing Completely at Random (MCAR)
Missing at Random (MAR)
Not Missing at Random (NMAR)
Missing Completely at Random (MCAR)
◦ Missing values on Y are
unrelated to Y itself or to any
other variable in the analysis
◦ Cases with missing data can
be treated as a random subset
of the entire sample
◦ Best case scenario; difficult to
ascertain
Missing at Random (MAR)
◦ Missing values on Y are
related to X but not to Y
◦ Missing values on Y are
random (random effect) after
controlling for X (systematic
effect)
◦ Can test systematic effect but
not random effect
Not Missing at Random (NMAR)
◦ Missing values on Y are
related to Y itself
◦ Missing data are “non-ignorable”
◦ Difficult to ascertain; difficult to
manage
Testing for MCAR
◦ Little’s Test of MCAR
Omnibus χ² test of all specified variables
If significant, data are not MCAR
May be MAR or NMAR
If not significant, MCAR is a reasonable working assumption (the test cannot prove it)
Available in SPSS under “Missing Value Analysis” and as a
SAS Macro
Testing for MAR
◦ Create a “dummy” variable for not missing/missing on the variable of interest
◦ Conduct statistical tests to see if other relevant variables are associated with
values of the new variable
Binomial logistic regression
χ2 test of independence
t-tests
◦ If significant relationships are found, the data are consistent with MAR; these
variables need to be included in any analyses
◦ If no significant relationships are found, then you have more work to do
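The dummy-variable check can be sketched with a χ² test of independence between a fully observed binary variable and the missingness indicator. This is a minimal illustration using only the standard library; the data and function names are hypothetical, no continuity correction is applied, and the test would fail with a division by zero if an expected cell count were zero.

```python
import math


def missing_indicator(values):
    """Dummy variable: 1 if the value is missing (None), else 0."""
    return [1 if v is None else 0 for v in values]


def chi_square_2x2(group, indicator):
    """Pearson chi-square test of independence for two 0/1 variables.

    Returns (chi2, p). With df = 1 the p-value equals erfc(sqrt(chi2 / 2)).
    """
    n = len(group)
    counts = [[0, 0], [0, 0]]
    for g, m in zip(group, indicator):
        counts[g][m] += 1
    row = [sum(r) for r in counts]
    col = [counts[0][j] + counts[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (counts[i][j] - expected) ** 2 / expected
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical data: x is a fully observed 0/1 variable, y has missing values.
x = [0] * 10 + [1] * 10
y = [5.0] * 9 + [None] + [5.0] * 4 + [None] * 6
chi2, p = chi_square_2x2(x, missing_indicator(y))
```

In this made-up sample, 1 of 10 cases is missing on y when x = 0 and 6 of 10 when x = 1, so the test is significant at the .05 level and x would need to be carried into the analysis.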
If not MCAR or MAR, does that mean it is NMAR?
◦ Not necessarily…
Might still be MAR but you haven’t found the right
indicator variable
◦ Consider other potentially relevant variables and test against
the missing data “dummy” variable
Patterns of missing data
◦ Monotone pattern
Variables v1-vj can be ordered so that if data are missing on a given variable,
they are also missing on all successive variables
VERY common with longitudinal data
◦ Non-monotone pattern
Patterns of missing data are arbitrary
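One way to check which pattern a data set follows is sketched below. It assumes missing values are coded as None and that rows are cases; the function name is hypothetical. Columns are sorted from least to most missing, which is a valid ordering whenever a monotone ordering exists.

```python
def is_monotone(rows):
    """Return True if the missing-data pattern in `rows` is monotone.

    `rows` is a list of equal-length lists; None marks a missing value.
    Columns are reordered from least to most missing; the pattern is
    monotone if, in that order, every value after a case's first missing
    value is also missing.
    """
    n_cols = len(rows[0])
    n_missing = [sum(r[j] is None for r in rows) for j in range(n_cols)]
    order = sorted(range(n_cols), key=lambda j: n_missing[j])
    for r in rows:
        seen_missing = False
        for j in order:
            if r[j] is None:
                seen_missing = True
            elif seen_missing:
                return False
    return True


# Hypothetical three-wave study: dropouts never return (attrition).
attrition = [[1, 2, 3], [1, 2, None], [1, None, None]]
# Hypothetical arbitrary pattern: one case skips wave 2 but returns at wave 3.
arbitrary = [[1, 2, 3], [1, None, 3], [1, 2, None]]
```

Here `is_monotone(attrition)` is True and `is_monotone(arbitrary)` is False.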
Deletion Methods
◦ Remove cases with missing values
Non-Stochastic Methods
◦ Replace missing values with “known” values
Stochastic Methods
◦ Replace missing values with estimated values
List-Wise Deletion
◦ Mechanism
Deletes cases from analysis with missing data on any variable (even if
that variable isn’t part of the analysis)
Only uses “complete cases”
◦ Pros
Easy to implement
Works for any kind of statistical analysis
If data are MCAR, does not introduce any bias in parameter estimates
Standard error estimates are appropriate
◦ Cons
May delete a large proportion of cases, resulting in loss of statistical
power
May introduce bias if MAR but not MCAR
Pair-Wise Deletion
◦ Mechanism
Deletes cases when missing data on a specific variable
involved in parameter estimation
Uses all available information for each estimation,
independent of information available for other estimations
◦ Pros
Approximately unbiased if MCAR
Uses all available information
◦ Cons
Standard errors are incorrect
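The difference between the two deletion methods can be seen by counting the cases each one uses. The data and function names below are hypothetical, and None marks a missing value.

```python
def listwise(rows):
    """Keep only complete cases (no None in any variable)."""
    return [r for r in rows if all(v is not None for v in r)]


def pairwise_n(rows, i, j):
    """Number of cases usable for a statistic involving only columns i and j."""
    return sum(r[i] is not None and r[j] is not None for r in rows)


# Hypothetical data with three variables.
data = [
    [1.0, 2.0, 3.0],
    [2.0, None, 1.0],
    [3.0, 4.0, None],
    [4.0, 5.0, 6.0],
    [None, 1.0, 2.0],
]

# Cases used by list-wise deletion.
n_complete = len(listwise(data))
# Cases used by pair-wise deletion for each pair of variables.
n_pairs = {(i, j): pairwise_n(data, i, j)
           for i in range(3) for j in range(i + 1, 3)}
```

List-wise deletion keeps 2 of the 5 cases, while each pair of variables has 3 usable cases. Each pair-wise statistic rests on a different subsample, so there is no single N on which to base standard errors.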
Mean Imputation
◦ Mechanism
All missing values on a given variable are replaced by the
sample mean for that variable
◦ Pros
Leaves sample mean of non-missing values unchanged
◦ Cons
Often leads to biased parameter estimates (e.g., variances)
Usually leads to standard error estimates that are biased
downward
Treats imputed data as real data, ignores inherent uncertainty
in imputed values.
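A small numeric sketch of why mean imputation biases variances: the mean is preserved, but every imputed value sits exactly at the mean, so the spread shrinks. The scores are hypothetical.

```python
import statistics


def mean_impute(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = statistics.fmean(observed)
    return [m if v is None else v for v in values]


# Hypothetical scores with two missing values.
scores = [10.0, 12.0, None, 14.0, None, 16.0]
observed = [v for v in scores if v is not None]
imputed = mean_impute(scores)

mean_before = statistics.fmean(observed)
mean_after = statistics.fmean(imputed)
var_before = statistics.variance(observed)
var_after = statistics.variance(imputed)
```

In this example the mean stays at 13 while the sample variance drops from about 6.67 to 4.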
Individual Mean Imputation
◦ Mechanism
Scale scores are computed by taking the mean of non-missing values
Ex: Respondent answered 8 of 10 questions on Miller Anxiety
Scale – Compute Scale score by taking mean of available cases
◦ Pros
All available information for a given individual is used in the
estimation of missing values
◦ Cons
Assumes the items with missing values are similar in difficulty or
extremity to items with non-missing data
May lead to biased scores
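The scale-score computation in the example above might look like the following. The item responses are made up, and the `min_answered` threshold is a hypothetical safeguard rather than something the slide prescribes.

```python
def scale_score(items, min_answered=1):
    """Mean of the answered items; None if too few items were answered."""
    answered = [v for v in items if v is not None]
    if len(answered) < min_answered:
        return None
    return sum(answered) / len(answered)


# Hypothetical respondent who answered 8 of 10 items.
responses = [3, 4, None, 2, 5, 3, None, 4, 3, 4]
score = scale_score(responses, min_answered=8)
```

The eight answered items sum to 28, so the scale score is 3.5.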
Regression
◦ Mechanism
Missing values are replaced by “predicted” values derived
from multiple regression (MR) using all relevant variables
◦ Pros
Predicted values maintain relationships among variables
◦ Cons
Predicted values are “perfect” (they fall exactly on the regression line),
which inflates estimates of association such as correlations and R²
Stochastic Regression (aka “Simple Imputation”)
◦ Mechanism
Similar to non-stochastic regression in that the available data are used to
predict missing values
Adds a random value to the predicted value by sampling from a
normal distribution with a mean of zero and variance equal to the
residual variance of the regression equation
◦ Pros
Improvement over Non-Stochastic methods
Provides unbiased variance estimates
◦ Cons
Only uses a single estimation step and may produce inaccurate or
unusual values
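The contrast between non-stochastic and stochastic regression imputation can be sketched with a single predictor. This is a simplified illustration with hypothetical data and names: x is assumed fully observed, and the random residual is drawn from a normal distribution with mean zero and the residual standard deviation of the complete-case regression.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (sse / (len(xs) - 2)) ** 0.5


def regression_impute(xs, ys, stochastic=False, seed=0):
    """Fill None entries in ys from x; optionally add a random residual."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b, resid_sd = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    rng = random.Random(seed)
    filled = []
    for x, y in zip(xs, ys):
        if y is None:
            y = a + b * x  # predicted value on the regression line
            if stochastic:
                y += rng.gauss(0.0, resid_sd)  # draw from N(0, residual variance)
        filled.append(y)
    return filled


# Hypothetical data: x fully observed, y missing for the last three cases.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.4, 3.8, 5.3, None, None, None]
deterministic = regression_impute(x, y)
with_noise = regression_impute(x, y, stochastic=True, seed=1)
```

The deterministic values lie exactly on the fitted line; the stochastic values scatter around it, which restores some of the variability that the "perfect" predictions remove.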
Expectation Maximization (EM)
◦ Mechanism
2-step iterative process
Step 1: Expectation
Use parameter values (initially based on complete-case data) to estimate
values for missing data
Step 2: Maximization
Use complete-case data and estimated values for missing data to estimate
new model parameters
Repeat until results converge (Successive iterations will not yield different
parameters)
◦ Pros
Minimizes bias in parameter estimates (larger samples yield less bias)
Ideal for exploratory and reliability analyses
◦ Cons
Initial estimates based on list-wise deletion (doesn’t use all available data)
Biased standard errors (minimized with larger samples)
Less efficient than FIML for hypothesis testing
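The two-step cycle can be sketched for the simplest case: two variables assumed bivariate normal, with x fully observed and y partly missing. This follows the textbook formulation, in which the E-step fills in expected sufficient statistics (including the conditional variance of y) rather than single values; the function name and data are hypothetical.

```python
import statistics


def em_bivariate(xs, ys, tol=1e-10, max_iter=1000):
    """EM estimates for a bivariate normal when some ys are None.

    xs must be fully observed. Returns (mu_y, var_y, cov_xy) as
    maximum-likelihood (divide-by-n) estimates.
    """
    n = len(xs)
    mu_x = statistics.fmean(xs)
    var_x = sum((x - mu_x) ** 2 for x in xs) / n

    # Starting values from the complete cases only.
    cc = [(x, y) for x, y in zip(xs, ys) if y is not None]
    cx = statistics.fmean([x for x, _ in cc])
    mu_y = statistics.fmean([y for _, y in cc])
    var_y = sum((y - mu_y) ** 2 for _, y in cc) / len(cc)
    cov_xy = sum((x - cx) * (y - mu_y) for x, y in cc) / len(cc)

    for _ in range(max_iter):
        # E-step: expected y, y^2 and x*y for each case given current parameters.
        slope = cov_xy / var_x
        resid_var = var_y - slope * cov_xy
        sum_y = sum_yy = sum_xy = 0.0
        for x, y in zip(xs, ys):
            if y is None:
                ey = mu_y + slope * (x - mu_x)
                eyy = ey ** 2 + resid_var  # includes the conditional variance
            else:
                ey, eyy = y, y ** 2
            sum_y += ey
            sum_yy += eyy
            sum_xy += x * ey

        # M-step: update the parameters from the expected sufficient statistics.
        new_mu_y = sum_y / n
        new_var_y = sum_yy / n - new_mu_y ** 2
        new_cov_xy = sum_xy / n - mu_x * new_mu_y
        change = max(abs(new_mu_y - mu_y), abs(new_var_y - var_y),
                     abs(new_cov_xy - cov_xy))
        mu_y, var_y, cov_xy = new_mu_y, new_var_y, new_cov_xy
        if change < tol:  # successive iterations no longer differ
            break
    return mu_y, var_y, cov_xy


# Hypothetical data: y is missing for the three largest x values.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.4, 3.8, 5.3, None, None, None]
mu_y, var_y, cov_xy = em_bivariate(x, y)
```

In this example the complete-case mean of y is 3.12, while the converged estimate is close to 4.635, because the cases with missing y have the largest x values and x predicts y.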
Full Information Maximum Likelihood (FIML)
◦ Mechanism
Directly estimates parameters using all observed data for every case
◦ Pros
Only requires a single step for imputation and analysis
Uses all available data even if some cases are missing data
Unbiased standard errors
Can be used with smaller samples (N<100)
◦ Cons
All variables related to missing data need to be included in the analysis
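For reference, the quantity that FIML maximizes is usually written as a sum of casewise log-likelihoods. The notation below is an assumption added here (it is not on the slide) and presumes multivariate normality: y_i holds the k_i observed values for case i, and μ_i and Σ_i are the matching sub-vector and sub-matrix of the model-implied mean vector and covariance matrix.

```latex
\log L(\mu, \Sigma) = \sum_{i=1}^{N} \left[
  -\frac{k_i}{2}\log(2\pi)
  - \frac{1}{2}\log\lvert\Sigma_i\rvert
  - \frac{1}{2}\,(y_i - \mu_i)^{\top}\,\Sigma_i^{-1}\,(y_i - \mu_i)
\right]
```

Each case contributes whatever it has observed, which is why no case needs to be deleted or imputed.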
Multiple Imputation (MI)
◦ Mechanism
Creates multiple data sets using stochastic regression
Minimum of 3-5 recommended, but no limit on maximum (Schafer, 1997)
Each data set will be slightly different because of the random component
Parameters are estimated for each data set and then averaged
◦ Pros
Produces unbiased parameter estimates
Produces unbiased standard errors
Easy to include auxiliary variables
◦ Cons
Labor intensive
Can be difficult to integrate multiple data sets
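Combining results across imputed data sets is commonly done with Rubin's rules: the point estimates are averaged, and the pooled variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). A minimal sketch with hypothetical estimates:

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets (Rubin's rules).

    estimates: the m point estimates; variances: their squared standard
    errors. Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")
    q_bar = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)       # average sampling variance
    between = statistics.variance(estimates)   # variance across imputations
    total = within + (1 + 1 / m) * between
    return q_bar, total ** 0.5


# Hypothetical slopes and standard errors from m = 5 imputed data sets.
slopes = [-1.20, -1.25, -1.18, -1.30, -1.22]
ses = [0.25, 0.26, 0.24, 0.27, 0.25]
estimate, se = pool_rubin(slopes, [s ** 2 for s in ses])
```

The pooled standard error is larger than the average of the individual standard errors because it also reflects the uncertainty introduced by imputation.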
Comparison of Stochastic Methods
Good: Stochastic Regression
Better: Expectation-Maximization
Best: Multiple Imputation; Full Information Maximum Likelihood
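Software support for each strategy (table; cell entries not recoverable from the slide)
Rows: SPSS/PASW, SAS, AMOS/MPLUS/LISREL
Columns: Deletion, Non-Stochastic Replacement, Simple Imputation, EM, FIML, MI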
Modeling problematic child behavior outcomes
N=181
Original data set missing 4 observations (<.5%)
New data set created for purpose of demonstration
Predictors
◦ Positive Parenting
◦ Social Skills
◦ Interpartner Violence
◦ Child Sex
◦ Little’s Test of MCAR can be obtained as part of
PASW “Missing Values Analysis”
Little's MCAR test: Chi-Square = 36.014,
DF = 18, Sig. = .007
Conclude that data are not MCAR (not surprising given that I did not
delete values in a random manner)
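The reported significance level can be checked by hand. For even degrees of freedom, the upper-tail probability of a χ² statistic has a closed form, which the sketch below evaluates with the standard library only; the function name is hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail p-value of a chi-square statistic for even df.

    For even df the tail probability has the closed form
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term = total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Little's MCAR test from the output above: Chi-Square = 36.014, DF = 18.
p_overall = chi2_sf_even_df(36.014, 18)
```

The result agrees with the reported Sig. = .007 to three decimal places.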
Test of MAR can be conducted by creating new dichotomous variable for “Not
Missing/Missing” and using it as the outcome variable in a logistic regression
model
Most interested in missing data on outcome variable in this example, but method
is not limited to that
Conclude that pattern of missing data is related to Gender
Variables in the Equation
Step 1(a)      B        S.E.     Wald     df    Sig.    Exp(B)
Gender         3.091    1.046    8.726    1     .003    22.003
Parenting      .074     .087     .718     1     .397    1.076
Skills         .010     .023     .195     1     .658    1.010
Aggression     -.003    .022     .024     1     .877    .997
Constant       -9.058   2.936    9.516    1     .002    .000
a. Variable(s) entered on step 1: Gender, Parenting, Skills, Aggression.
Little's MCAR test for Boys: Chi-Square = 8.338, DF = 14, Sig. = .871*
Little's MCAR test for Girls: Chi-Square = 13.026, DF = 18, Sig. = .790*
*We can conclude that data are MCAR within each group.
Gender must be included in any missing data analysis to minimize bias.
Variable Summary(a,b)
                        Missing N   Missing %   Valid N   Mean      Std. Deviation
Behavior                59          32.6%       122       55.75     10.333
Positive Parenting      44          24.3%       137       18.4293   3.04990
Interpartner Violence   36          19.9%       145       12.77     12.229
Social Skills           27          14.9%       154       51.75     11.501
a. Maximum number of variables shown: 25
b. Minimum percentage of missing values for variable to be included: 10.0%
Although the pattern is not
monotone, these cases only make
up a very small %
PASW provides several options for handling missing data
The add-on module for “Missing Values Analysis” allows
you to implement several different strategies
simultaneously
◦ In addition to saving time, comparison output is provided for
means, SDs, and correlation/covariance matrices
Available options:
◦ List-wise deletion
◦ Pair-wise deletion
◦ Stochastic regression
◦ EM
[Screenshot: “Missing Values Analysis” dialog]
◦ Enter continuous and categorical variables
◦ Choose strategies
◦ Additional options
The “Multiple Imputation” option is part of
the basic PASW package
◦ Provides numerous options
Choose # of iterations
Choose estimation method
(monotone vs. non-monotone patterns)
Create new data sets
[Screenshots: “Multiple Imputation” dialogs]
◦ Enter all variables to use in imputation (model + auxiliary)
◦ Choose # of iterations
◦ Create a new data set with imputed data
◦ Note: PASW allows you to run analysis on all imputed sets simultaneously
◦ “Automatic” is the default; can manually select method based on pattern of missing data
◦ If your data include interactions, so should your imputation model
Multiple Imputation available in PRELIS under “Statistics”
◦ I have included both model and auxiliary variables
◦ Select estimation method
EM -> monotone
MCMC -> non-monotone
◦ Decide how to handle cases when all data are missing
◦ Output is a “complete” data set for analysis
An alternative to MI is to use FIML estimation with the original data set containing missing values
◦ LISREL will default to this option if there are missing data
Regression results: Complete data vs. List-Wise and Pair-Wise deletion (SE = Std. Error)
                        Complete               List-Wise              Pair-Wise
                        B       SE     Sig.    B       SE     Sig.    B       SE     Sig.
(Constant)              83.71   5.29   .000    91.47   6.57   .000    91.34   7.01   .000
Child's Sex             -.75    1.38   .586    -.64    1.72   .709    -.58    1.79   .748
Positive Parenting      -1.03   .22    .000    -1.27   .27    .000    -1.34   .28    .000
Social Skills           -.20    .06    .001    -.26    .08    .001    -.21    .08    .000
Interpartner Violence   .14     .06    .024    .10     .07    .136    .07     .07    .006
Regression results: Complete data vs. Mean Substitution and Simple Imputation (SE = Std. Error)
                        Complete               Mean Substitution      Simple Imputation
                        B       SE     Sig.    B       SE     Sig.    B       SE     Sig.
(Constant)              83.71   5.29   .000    85.37   5.21   .000    80.87   6.01   .000
Child's Sex             -.75    1.38   .586    -.42    1.19   .709    -.18    1.48   .904
Positive Parenting      -1.03   .22    .000    -1.17   .22    .000    -1.06   .24    .000
Social Skills           -.20    .06    .001    -.16    .05    .001    -.12    .06    .049
Interpartner Violence   .14     .06    .024    .07     .05    .136    .05     .06    .390
Regression results: Complete data vs. EM-PASW, MCMC-LISREL, and FIML-LISREL (SE = Std. Error)
                        Complete               EM-PASW                MCMC-LISREL            FIML-LISREL
                        B       SE     Sig.    B       SE     Sig.    B       SE     Sig.    B       SE     Sig.
(Constant)              83.71   5.29   .000    91.52   4.99   .000    92.96   5.32   .000    88.83   5.86   .000
Child's Sex             -.75    1.38   .586    -.35    1.16   .761    -.18    1.59   .359    -.23    .79    .799
Positive Parenting      -1.03   .22    .000    -1.36   .21    .000    -1.24   .26    .000    -1.19   .26    .000
Social Skills           -.20    .06    .001    -.22    .05    .000    -.23    .06    .000    -.25    .07    .000
Interpartner Violence   .14     .06    .024    .09     .05    .073    .11     .06    .051    .11     .06    .076
The goal of handling missing data is to find values close to the “real” (but
absent) values. (T or F)
◦ FALSE – the goal is to estimate unbiased standard errors and parameter
estimates
Which is more important – amount of missing data or type of missing data?
◦ Both are important, but type is more important than amount
List-wise deletion is a good strategy for handling missing data. (T or F)
◦ TRUE – if data are MCAR; if not MCAR, then there are better alternatives
There are no “good” strategies for handling data that are NMAR. (T or F)
◦ TRUE – but FIML is considered to yield the least biased results
Deletion is the only strategy for handling missing categorical data.
(T or F)
◦ FALSE – can use both non-stochastic and stochastic methods
If using multiple imputation, it is best to include all available variables.
(T or F)
◦ FALSE – only include variables related to those with missing data
Values such as “not applicable”, “not sure”, “I don’t know”, etc. should
be treated as missing data. (T or F)
◦ FALSE – if you included these as possible response categories, then they
constitute valid responses (i.e., they are not missing)
List-wise deletion is better than non-stochastic imputation. (T or F)
◦ TRUE – if data are MCAR, unless you are using a small sample with minimal
power
Missing data should only be imputed for predictor variables and never for
outcome variables. (T or F)
◦ DEPENDS – if you have good auxiliary variables for the outcome variable,
then you should impute on the outcome variable; otherwise you should not
impute.
Values such as “not applicable”, “not sure”, “I don’t know”, etc. can be treated
as missing data. (T or F)
◦ TRUE – IF you have a strong theoretical argument that a different response
would have been obtained under different circumstances
The most important factor in choosing a strategy is the type of missing data.
(T or F)
◦ TRUE
Analyses should always be conducted and reported using data with and without
missing values. (T or F)
◦ TRUE
Causes (actual and/or hypothesized) of missing data
should be discussed
The amount of missing data and the strategy used to
handle it should be reported
Results of analyses with and without missing data
should be discussed
The most appropriate strategy should be used
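Strategy by Type of Missing Data (table; cell entries not recoverable from the slide)
Rows (strategy): List-wise Deletion, Pair-wise Deletion, Non-stochastic Replacement, Simple Imputation, EM, FIML, Multiple Imputation
Columns (type of missing data): MCAR, MAR, NMAR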
Allison, P. D. (2001). Missing data. Thousand Oaks, CA: Sage Publications.
Bennett, D.A. (2001). How can I deal with missing data in my study? Australian and New
Zealand Journal of Public Health, 25, 464-469.
Little, R.J.A. (1988). A test of missing completely at random for multivariate data with missing
values. Journal of the American Statistical Association, 83, 1198-1202.
Little, R.J.A., & Rubin, D.B. (1987). Statistical analysis with missing data. New York: John
Wiley & Sons.
Peng, C.Y., Harwell, M., Liou, S.M., & Ehman, L.H. (2006). Advances in missing data methods
and implications for educational research. In S. Sawilowsky (Ed.), Real data analysis
(pp. 31-78). Greenwich, CT: Information Age.
Schafer, J.L. (1997). Analysis of incomplete multivariate data. Thousand Oaks, CA: Sage.
Schafer, J.L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research,
8, 3-15.
Schlomer, G.L., Bauman, S., & Card, N.A. (2010). Best practices for missing data management
in counseling psychology. Journal of Counseling Psychology, 57(1), 1-10.
Widaman, K.F. (2006). Missing data: What to do with or without them. Monographs of the
Society for Research in Child Development, 71(3), 42-64.