SSC Case Study 2002 Handling Missing Data Tao Sun Lena Zhang


DEPARTMENT OF
MATHEMATICS AND STATISTICS
SSC Case Study 2002
Handling Missing Data
Tao Sun
Lena Zhang
Yaqing Chen
Francisco Aguirre
Presentation Outline
Objective:
Compare different approaches to handling missing data from a
practitioner’s point of view
1. Preliminary analysis
• Various plots
2. Assessing the missing pattern
• Spearman rank correlation, logistic regression
3. Data analysis with missing data - Multiple Imputation
• Random hot deck imputation with bootstrap
• PROC MI and MIANALYZE (SAS)
• Transcan function (Hmisc library in S-plus or R)
4. Conclusions
5. Further work
SSC Conference Hamilton Ontario May 2002
Preliminary analysis
RESPONSE OVERVIEW
[Figure: Histogram of observed responses DVHST94 (x-axis: DVHST94, 0.4 to 1.0; y-axis: count, 0 to 600)]
Sample size: 2389
Males: 1097 (45.9%)
Females: 1292 (54.1%)
Observed: 1691
Missing: 698 (28.8%)
Mean: 0.9129
• The response variable is highly skewed to the left.
Preliminary analysis
• 8 covariates in total, first 4 shown here.
• There appears to be a pattern of two clusters in the response
DVHST94 (below 0.5 and above 0.5).
• DVBMI94 appears to have some “wild” values (= 96)
– 43 observations, all males (3.9% of the male sample)
– Wild values were replaced with the mean DVBMI94 of males
– DVBMI94 transformation:
NEW.DVBMI94 = abs(DVBMI94 - 22)
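The DVBMI94 clean-up described above could be sketched roughly as follows. This is only an illustration in Python, not the authors' code: the toy records, the "M"/"F" coding of sex, and the choice to compute the male mean from the non-wild values only are assumptions, since the slide does not say how the mean was computed.

```python
from statistics import mean

WILD_CODE = 96       # value flagged as "wild" on the slide
REFERENCE_BMI = 22   # centre used in the slide's transformation


def clean_and_transform_bmi(bmi, sex):
    """Replace wild male BMI codes by the male mean, then take abs(BMI - 22)."""
    # Assumption: the male mean is taken over males with a non-wild value.
    male_valid = [b for b, s in zip(bmi, sex) if s == "M" and b != WILD_CODE]
    male_mean = mean(male_valid)
    cleaned = [male_mean if (s == "M" and b == WILD_CODE) else b
               for b, s in zip(bmi, sex)]
    # NEW.DVBMI94 = abs(DVBMI94 - 22)
    return [abs(b - REFERENCE_BMI) for b in cleaned]


# Toy records (hypothetical values, not taken from the survey).
bmi = [20.0, 26.0, 96, 24.0]
sex = ["M", "M", "M", "F"]
new_bmi = clean_and_transform_bmi(bmi, sex)
print(new_bmi)
```

Note that mean substitution leaves the 43 replaced records with identical values, which is worth keeping in mind when reading the NEW.DVBMI94 results.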
Preliminary analysis
• There are no obvious linear patterns between the covariates and
the response DVHST94
• DVPP94 is recoded as dichotomous:
NEW.DVPP94 = 0 (91% of observations)
NEW.DVPP94 > 0 (9% of observations)
• The AGEGRP covariate is recoded to NEW.AGE:
NEW.AGE = mid range value (AGEGRP) - 20
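A minimal sketch of the two recodings, in Python. The 0/1 coding of the dichotomous variable and the five-year age bounds in the example are assumptions made for illustration; the slide gives only the rule, and the NEW.AGE values 2, 7, ..., 42 reported later are consistent with mid-range ages 22, 27, ..., 62.

```python
def recode_dvpp94(dvpp94):
    """Dichotomise DVPP94: 0 stays 0, any positive value becomes 1 (assumed coding)."""
    return 0 if dvpp94 == 0 else 1


def recode_age(age_low, age_high):
    """NEW.AGE = mid-range value of the age group minus 20."""
    return (age_low + age_high) / 2 - 20


# Hypothetical age-group bounds, chosen so that the mid-range is 22 and 62.
youngest = recode_age(20, 24)
oldest = recode_age(60, 64)
print(youngest, oldest)
print(recode_dvpp94(0), recode_dvpp94(3))
```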
Preliminary analysis
[Figure: Mean DVHST94 by level of each covariate (x-axis: mean DVHST94, about 0.84 to 0.92), with the group sizes N listed below]
Variable      Level               N
NEW.AGE       2                 309
              7                 283
              12                383
              17                296
              22                173
              27                132
              32                 61
              37                 29
              42                 25
SEX           Female            857
              Male              834
DVHHIN94      [ 1, 7)           635
              7                 259
              [ 8,10)           435
              [10,11]           362
DVSMKT94      1                 535
              2                  57
              3                  44
              4                 305
              5                 155
              6                 595
NEW.DVPP94    DVPP94 > 0        172
              DVPP94 = 0       1519
NUMCHRON      0                 814
              1                 491
              [2,9]             386
VISITS        [ 0, 3)           487
              [ 3, 6)           433
              [ 6,12)           382
              [12,94]           389
NEW.WT6       [0.0547,0.447)    424
              [0.4473,0.824)    422
              [0.8239,1.430)    420
              [1.4297,7.445]    425
NEW.DVBMI94   [0.0, 1.6)        453
              [1.6, 3.1)        383
              [3.1, 6.1)        433
              [6.1,18.0]        422
Overall                        1691
Preliminary analysis
• Strength of marginal
relationships between the
covariates and the
response using generalized
Spearman chi-square
Assessing the missing
pattern
• The missing pattern of the
response does not appear
to depend on the sampling
weights
Assessing the missing
pattern
• The missing values
depend on age
[Figure: Missing response DVHST94 vs NEW.AGE. Total sample size (left axis, 0 to 500) and % missing values (right axis, 0% to 100%) by NEW.AGE (2 to 42)]
Assessing the missing
pattern
LOGISTIC REGRESSION
Coefficients:
                  Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)      -5.058793    0.367083  -13.781   < 2e-16 ***
NEW.AGE           0.181625    0.007524   24.140   < 2e-16 ***
SEXMale          -0.847947    0.131475   -6.450  1.12e-10 ***
DVHHIN94          0.047828    0.026768    1.787    0.0740 .
DVSMKT94         -0.015131    0.031662   -0.478    0.6327
NEW.DVPP94 = 0    0.233188    0.226732    1.028    0.3037
NUMCHRON         -0.087992    0.048783   -1.804    0.0713 .
VISITS            0.012483    0.006563    1.902    0.0572 .
NEW.WT6          -0.043935    0.077407   -0.568    0.5703
NEW.DVBMI94      -0.015622    0.017299   -0.903    0.3665
[Figure: Missing response DVHST94 vs Gender. Stacked counts of Missing and Observed responses for Male, Female and Total (y-axis: 0 to 3000)]
% missing for males: 24%
% missing for females: 34%
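As a reading aid, the z values and p-values in the logistic regression output, and the missing percentages by gender, can be rechecked from the numbers on the slides. The sketch below is Python written for this transcript and is not part of the original analysis. It assumes the usual Wald test (z = estimate / standard error, two-sided normal p-value), and takes the observed counts by sex (834 males, 857 females) and the sample sizes (1097 males, 1292 females) from the earlier slides.

```python
import math


def wald_test(estimate, std_error):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / std_error
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


def percent_missing(n_total, n_observed):
    """Percentage of units whose response is missing."""
    return 100 * (n_total - n_observed) / n_total


# DVHHIN94 row of the coefficient table (rounded inputs, so small
# differences from the printed z and p are expected).
z_income, p_income = wald_test(0.047828, 0.026768)

# Odds ratio of a missing response per unit increase in NEW.AGE.
odds_ratio_age = math.exp(0.181625)

pct_male = percent_missing(1097, 834)
pct_female = percent_missing(1292, 857)

print(round(z_income, 3), round(p_income, 3))
print(round(odds_ratio_age, 2))
print(round(pct_male), round(pct_female))
```

The odds ratio of about 1.2 per unit of NEW.AGE is one way to express how strongly missingness increases with age in this fit.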
Multiple imputation
Methods:
– Random Hot Deck MI with Bootstrap
– SAS PROC MI and PROC MIANALYZE
– Function TRANSCAN in S-plus from the Hmisc library
(Frank Harrell)
Multiple Imputation
• IMPUTATION:
Impute the missing entries of the incomplete data set B times, resulting
in B complete data sets.
• ANALYSIS:
Analyze each of the B completed data sets using weighted least squares.
• POOLING:
Integrate the B analysis results into a final result. Simple rules exist for
combining the B analyses.
[Diagram: INCOMPLETE DATA → (imputation) → IMPUTED DATA → (analysis) → ANALYSIS RESULTS → (pooling) → FINAL RESULTS]
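The pooling step can be illustrated with Rubin's combining rules for a single coefficient. This is a minimal Python sketch with made-up inputs; the function name and the toy numbers are hypothetical and are not taken from the study.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Combine B completed-data analyses of one scalar parameter.

    estimates:        the B point estimates
    within_variances: the B squared standard errors
    """
    n_imputations = len(estimates)
    pooled = mean(estimates)                # average of the B estimates
    within = mean(within_variances)         # average within-imputation variance
    between = variance(estimates)           # sample variance, divisor B - 1
    total = within + (1 + 1 / n_imputations) * between
    return pooled, total


# Toy input: three imputations of one regression coefficient.
pooled, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total_var)
```

The between-imputation term is what distinguishes multiple imputation from single imputation: it carries the extra uncertainty due to the missing values into the final standard error.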
Random hot-deck MI with Bootstrap
[Diagram: the data are split into observed and missing responses. Each missing response is estimated by choosing randomly, with replacement, from the observed responses (probability ~ weights). Observed and estimated responses together form a complete data set.]
Replicate 1: analysis of the complete data gives $(U_1, R_1)$ (within variance, R-square)
Same procedure for B = 1000 replicates
Replicate 1000: $(U_{1000}, R_{1000})$ (within variance, R-square)
Pooled estimate: $\hat{\beta} = \mathrm{mean}(\hat{\beta}_i)$
Total variance $= \bar{U}_B + \frac{B+1}{B} Be_B$
where $\bar{U}_B = \sum_{i=1}^{B} U_i / B$ (within variance)
and $Be_B = \sum_{i=1}^{B} (\hat{\beta}_i - \hat{\beta})(\hat{\beta}_i - \hat{\beta})' / (B - 1)$ (between variance)
Estimated R-square: $\bar{R} = \sum_{i=1}^{B} R_i / B$
Compute 95% CI for judging
significance of predictors
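The imputation step of the random hot deck might look roughly like the sketch below. It is a simplified Python illustration, not the authors' implementation: it assumes that "probability ~ weights" means donors are drawn with probability proportional to their sampling weights, it uses `None` to mark a missing response, and the toy data are invented.

```python
import random


def random_hot_deck(y, weights, n_replicates, seed=0):
    """Return n_replicates completed copies of y.

    Each missing entry (None) is filled with an observed response drawn
    at random, with replacement, with probability proportional to the
    donor's weight (an assumed reading of the slide).
    """
    rng = random.Random(seed)
    donors = [v for v in y if v is not None]
    donor_weights = [w for v, w in zip(y, weights) if v is not None]
    completed = []
    for _ in range(n_replicates):
        filled = [v if v is not None
                  else rng.choices(donors, weights=donor_weights, k=1)[0]
                  for v in y]
        completed.append(filled)
    return completed


# Toy data (hypothetical): two of five responses are missing.
y = [0.9, None, 0.8, None, 1.0]
w = [1.0, 2.0, 1.0, 1.0, 3.0]
completed_sets = random_hot_deck(y, w, n_replicates=5, seed=1)
for data in completed_sets:
    print(data)
```

Each completed data set would then be analyzed separately and the results pooled. Because donors are chosen without reference to the covariates, the imputed values carry no information about age or sex.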
PROC MI & MIANALYZE Method
PROC MI
1. By default generates 5 imputed values for each missing value
2. Imputation method: MCMC (Markov Chain Monte Carlo)
• EM algorithm determines initial values
• MCMC repeatedly simulates the distribution of interest from which the
imputed values are drawn
3. Assumption: the data follow a multivariate normal distribution
PROC REG
• Fits five weighted linear regression models to the five complete data sets
obtained from PROC MI (using the BY _Imputation_ statement)
PROC MIANALYZE
• Reads the parameter estimates and associated covariance matrix from the
analysis performed on the multiply imputed data sets and derives valid
statistics for the parameters
TRANSCAN (S-plus, Hmisc)
Frank Harrell
Transforms continuous and categorical variables to have maximum
correlation with the best linear combination of the other variables.
It approximates the multiple imputation algorithm described by
Rubin’s Bayesian bootstrap.
• Draws a sample of size r from r non-missing residuals.
• Chooses a sample of size m from this sample of size r with
replacement. m is the number of missing values.
• Generates imputed values with the linear imputation model and the
bootstrapped residuals.
[Diagram: $(Y_{obs}, X_{obs})$ → LS → residuals $(\hat{\varepsilon}_1, ..., \hat{\varepsilon}_r)$ → Bootstrap → $(\hat{\varepsilon}'_1, ..., \hat{\varepsilon}'_r)$ → Bootstrap → $(\hat{\varepsilon}^*_1, ..., \hat{\varepsilon}^*_m)$]
This algorithm is repeated B times to obtain the multiple imputed data
sets that are analyzed using WLS with the function lm.
Advantage:
• Does not need normality assumption or symmetry of residuals.
• Does shrinkage to avoid overfitting
Disadvantage:
• “Freezes” the imputation model before drawing the multiple imputations.
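The residual-resampling idea on this slide can be sketched as below. This is a deliberately reduced Python illustration and not Hmisc's transcan: it uses a plain least-squares line with one covariate, omits the variable transformations and the shrinkage, and all names and data are hypothetical.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares intercept and slope for one covariate."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def impute_once(x_obs, y_obs, x_mis, rng):
    """One set of imputations from doubly bootstrapped residuals."""
    intercept, slope = fit_line(x_obs, y_obs)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x_obs, y_obs)]
    # Step 1: sample r residuals with replacement from the r observed ones.
    first_draw = rng.choices(residuals, k=len(residuals))
    # Step 2: sample m residuals with replacement from that sample.
    second_draw = rng.choices(first_draw, k=len(x_mis))
    # Step 3: imputed value = model prediction + bootstrapped residual.
    return [intercept + slope * xi + e for xi, e in zip(x_mis, second_draw)]


# Toy check on exactly linear data (y = 1 + 2x): residuals are zero,
# so the imputed values fall on the fitted line.
rng = random.Random(0)
imputed = impute_once([0, 1, 2, 3], [1, 3, 5, 7], [4, 5], rng)
print(imputed)
```

Calling `impute_once` B times with the same fitted line corresponds to the "frozen" imputation model listed as the disadvantage: the regression coefficients are not redrawn between imputations.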
Comparing imputation methods
Estimates with standard errors in parentheses; * marks significant predictors.
                  S-plus TRANSCAN      SAS PROC MI          Bootstrap (random hot deck)   Available data only
(Intercept)        0.8495 (0.0135) *    0.9281 (0.01)   *    0.8711 (0.0128) *             0.861  (0.012)  *
NEW.AGE           -0.0039 (0.0004) *   -0.0016 (0.0004) *   -0.0006 (0.0002) *            -0.0013 (0.0003) *
SEX (Male=1)       0.0045 (0.0045)      0.0023 (0.0045)      0.0031 (0.0055)               0.0037 (0.0049)
DVHHIN94           0.0083 (0.0016) *    0.0061 (0.0012) *    0.0029 (0.0007) *             0.0051 (0.0011) *
NEW.DVBMI94       -0.0001 (0.0007)     -0.0005 (0.0008)     -0.0006 (0.0005)              -0.0007 (0.0007)
DVSMKT94           0.0012 (0.0014)      0.0009 (0.0013)      0.0019 (0.0008) *             0.0012 (0.0012)
NEW.DVPP94(=0)     0.0904 (0.0085) *    0.0717 (0.0092) *    0.0531 (0.0089) *             0.0686 (0.0081) *
NUMCHRON          -0.0174 (0.0022) *   -0.0123 (0.0023) *   -0.0079 (0.0013) *            -0.013  (0.0021) *
VISITS            -0.0026 (0.0003) *   -0.0023 (0.0003) *   -0.0017 (0.0002) *            -0.0023 (0.0003) *
Mean R-square      0.33                 0.193                0.093                         0.183
Ranking:
1. TRANSCAN (Advantage: shrinkage correction to prevent overfitting)
2. PROC MI (Drawback: normality assumption)
3. Bootstrap random hot deck (does not use the information of the covariates)
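One plausible reading of the asterisks is that a predictor is flagged when its approximate 95% interval (estimate ± 1.96 standard errors) excludes zero. The slides do not state the exact rule, so the Python sketch below is an assumption-based check applied to the S-plus TRANSCAN column only.

```python
def is_significant(estimate, std_error, z_crit=1.96):
    """True when the approximate 95% interval excludes zero."""
    lower = estimate - z_crit * std_error
    upper = estimate + z_crit * std_error
    return lower > 0 or upper < 0


# (estimate, standard error) pairs copied from the S-plus TRANSCAN column.
transcan = {
    "(Intercept)": (0.8495, 0.0135),
    "NEW.AGE": (-0.0039, 0.0004),
    "SEX (Male=1)": (0.0045, 0.0045),
    "DVHHIN94": (0.0083, 0.0016),
    "NEW.DVBMI94": (-0.0001, 0.0007),
    "DVSMKT94": (0.0012, 0.0014),
    "NEW.DVPP94(=0)": (0.0904, 0.0085),
    "NUMCHRON": (-0.0174, 0.0022),
    "VISITS": (-0.0026, 0.0003),
}

flagged = [name for name, (est, se) in transcan.items()
           if is_significant(est, se)]
print(flagged)
```

Under this rule six TRANSCAN terms are flagged, which matches the number of asterisks shown for that column.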
Significant variables
[Figure: Six panels (Intercept, DVHHIN94, NEW.DVPP94(=0), VISITS, NEW.AGE, NUMCHRON), each comparing the results for that term across S-plus TRANSCAN, SAS PROC MI, Random Hot Deck (Bootstrap) and Complete observations]
Conclusions about the
missing pattern
• The missing values of the response variable DVHST94
are not MCAR. The probability of a missing response depends
primarily on the age and sex covariates, so the
missing values are treated as MAR.
Conclusions about
multiple imputation
• The Transcan function appeared to perform better than
PROC MI for imputing and analyzing this data set, given
its non-normality.
• Random hot deck MI with bootstrap gave significantly
biased results. This approach does not take into account
the information provided by the covariates and is therefore
not appropriate for data that are MAR.
Conclusions about the
data analysis
• The health status of the population tends to decrease with age.
• People with higher income tend to have better health than people
with lower income.
• People with lower health status demand more medical services
(visits to a doctor).
• People who are prone to depression have lower health status.
• Smoking does not appear to have a decisive influence on
health status.
Future work
• GLM could be used to model the categorical response GQ.H1
using a multinomial logistic model to impute the missing
categorical responses
• Interactions of the significant variables with the insignificant
variables should be explored in order to further assess the
concomitant effects (e.g. smoking and depression).
Acknowledgements:
Special thanks to professor Peggy Ng and George Monette for their
support.