**** 1 - Stata
Jin is designed
Dr. Huber
by
Texas A&M HSC
Korean Female Colon Cancer
Risk factor: Smoking Habits

Range                           Non-event n   Non-event %   Event n   Event %   HR      95% CI        P
Missing                         1449400       79.57         4071      95.70     -       -             -
No smoking                      351896        19.32         93        2.19      1.000   1.000-1.000   1.000
Smoked before, but quit         4611          0.25          21        0.49      1.174   1.058-1.303   0.0025
Currently, 1/2 pack             8735          0.48          38        0.89      0.948   0.828-1.084   0.4339
Currently, 1/2-One pack         5534          0.30          26        0.61      0.991   0.901-1.09    0.8457
Currently, More than One pack   1410          0.08          5         0.12      1.015   0.894-1.153   0.8162

Is smoking protective? Not sure, b/c of huge missing!!
Missing mechanisms differ by why data are missing:
1. Missing Completely At Random (MCAR)
: missingness depends neither on the observed values nor on the missing values
2. Missing At Random (MAR)
: missingness depends only on the observed values
3. Not Missing At Random (NMAR)
: missingness depends both on the observed values and on the missing values
The mechanism affects the effectiveness and the bias of methods for missing data.
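The three mechanisms can also be written as conditions on the probability of being missing. The notation below is an assumption (the slide gives no symbols): R is the missingness indicator, Y_obs the observed values, and Y_mis the missing values.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}) \\
\text{NMAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \text{ still depends on } Y_{\text{mis}}
\end{align*}
```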
Methods for missing data (older methods, single imputation, multiple imputation)
1. Complete Case Analysis (CCA)
2. Available Case Analysis (ACA)
3. Mean imputation
4. Expectation-Maximization (EM)
5. Multiple Imputation (MI)
Only CCA and MI are used here.
1. Complete Case Analysis (CCA)
Y1    Y2    Y3
140   .     20
31    25    .
10    35    40
25    48    57
30    49    60
35    55    65
37    47    70
140   32    30
42    65    40
50    200   20
Steps:
1. Delete all cases with missing values on Y1, Y2, Y3
2. Analyze the remaining cases
Notes:
1. CCA = NOT using any method of handling missing data
2. By deleting cases, power will be decreased (b/c of the reduced sample size)
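The two CCA steps can be sketched in a few lines of plain Python. This is only an illustration: the toy rows, the use of `None` for a missing value, and the helper names `complete_cases` and `column_mean` are assumptions, not part of the slides.

```python
# Hypothetical toy data; None marks a missing value.
rows = [
    {"Y1": 12, "Y2": None, "Y3": 7},
    {"Y1": 30, "Y2": 21, "Y3": None},
    {"Y1": 10, "Y2": 35, "Y3": 40},
    {"Y1": 25, "Y2": 48, "Y3": 57},
]

def complete_cases(data, variables):
    """Step 1: drop every case with a missing value on any listed variable."""
    return [r for r in data if all(r[v] is not None for v in variables)]

def column_mean(data, variable):
    """Step 2 (example analysis): mean of one variable over the kept cases."""
    values = [r[variable] for r in data]
    return sum(values) / len(values)

kept = complete_cases(rows, ["Y1", "Y2", "Y3"])
print(len(rows), "cases before deletion,", len(kept), "after")
print("mean of Y1 on complete cases:", column_mean(kept, "Y1"))
```

Half of the toy cases are lost here even though each deleted case is missing only one value, which is the loss of power noted above.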
2. Multiple Imputation (MI)
MI has 3 steps:
(1) Imputation Step
(2) Analysis Step
(3) Combination Step
2. MI
(1) Imputation Step
[Tables: the original data (Y, X1, X2, with missing values shown as ".") and the imputed data indexed by Imputation Number 1 to 5, i.e. “5 complete datasets”]
2. MI
(2) Analysis Step
* A standard statistical procedure (here, regression) is run on each of the 5 complete datasets separately.
Analyzed 5 times:

Columns: Obs, Imputation Number, Label of model, Type of statistics, Variable names for rows of estimated COV, Dependent variable, Root mean squared error, Intercept, X1, X2, Y

Obs  Imp.  Model   Type   Row name   Dep. var.  RMSE    Intercept   X1       X2      Y
1    1     MODEL1  PARMS             Y          9.49    417.91      -7.96    -1.64   -1
2    1     MODEL1  COV    Intercept  Y          9.49    722.00      -15.61   -3.26   .
3    1     MODEL1  COV    X1         Y          9.49    -15.61      0.34     0.07    .
4    1     MODEL1  COV    X2         Y          9.49    -3.26       0.07     0.02    .
5    2     MODEL1  PARMS             Y          11.80   405.16      -7.81    -1.53   -1
6    2     MODEL1  COV    Intercept  Y          11.80   1052.74     -23.16   -4.60   .
7    2     MODEL1  COV    X1         Y          11.80   -23.16      0.52     0.10    .
8    2     MODEL1  COV    X2         Y          11.80   -4.60       0.10     0.02    .
9    3     MODEL1  PARMS             Y          3.86    233.43      -4.31    -0.80   -1
10   3     MODEL1  COV    Intercept  Y          3.86    28.82       -0.66    -0.12   .
11   3     MODEL1  COV    X1         Y          3.86    -0.66       0.02     0.00    .
12   3     MODEL1  COV    X2         Y          3.86    -0.12       0.00     0.00    .
13   4     MODEL1  PARMS             Y          1.76    221.04      -4.17    -0.74   -1
14   4     MODEL1  COV    Intercept  Y          1.76    5.20        -0.12    -0.02   .
15   4     MODEL1  COV    X1         Y          1.76    -0.12       0.00     0.00    .
16   4     MODEL1  COV    X2         Y          1.76    -0.02       0.00     0.00    .
17   5     MODEL1  PARMS             Y          1.46    215.80      -4.08    -0.71   -1
18   5     MODEL1  COV    Intercept  Y          1.46    3.36        -0.08    -0.01   .
19   5     MODEL1  COV    X1         Y          1.46    -0.08       0.00     0.00    .
20   5     MODEL1  COV    X2         Y          1.46    -0.01       0.00     0.00    .
2. MI
(3) Combination Step
> The results from the 5 datasets are combined into ONE using the combination equations.
1. Combined estimate:
2. Variance Total:
3. Var. Within:
4. Var. Between:
5. DF:
6. Fraction missing Info. :
7. Confidence Interval:
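The equations for the seven items are not in the transcript. The standard combination rules (Rubin's rules) are usually written as below; that the slide used exactly these forms and symbols is an assumption. Here m is the number of imputations, and \hat{Q}_i and \hat{U}_i are the estimate and its variance from the i-th completed dataset.

```latex
\begin{align*}
\text{Combined estimate:}\quad & \bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i \\
\text{Variance Total:}\quad & T = \bar{U} + \left(1+\frac{1}{m}\right)B \\
\text{Var. Within:}\quad & \bar{U} = \frac{1}{m}\sum_{i=1}^{m}\hat{U}_i \\
\text{Var. Between:}\quad & B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{Q}_i-\bar{Q}\right)^2 \\
\text{DF:}\quad & \nu = (m-1)\left(1+\frac{\bar{U}}{(1+1/m)\,B}\right)^{2} \\
\text{Fraction missing Info.:}\quad & \hat{\lambda} = \frac{r + 2/(\nu+3)}{r+1},
\qquad r = \frac{(1+1/m)\,B}{\bar{U}} \\
\text{Confidence Interval:}\quad & \bar{Q} \pm t_{\nu,\,1-\alpha/2}\,\sqrt{T}
\end{align*}
```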
combined to 1 result
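As a rough sketch of the combination step, the function below pools m estimates with the standard Rubin's rules. The function name `pool` and the returned keys are assumptions. The example numbers are the X1 coefficients and their variances as read from the analysis-step output, rounded to two decimals, so the result is only illustrative. The sketch assumes the within and between variances are both nonzero, and it leaves out the confidence interval because the t quantile is not in the Python standard library.

```python
import math

def pool(estimates, variances):
    """Combine m point estimates and their variances into one result."""
    m = len(estimates)
    q_bar = sum(estimates) / m                                     # combined estimate
    within = sum(variances) / m                                    # within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    total = within + (1 + 1 / m) * between                         # total variance
    r = (1 + 1 / m) * between / within                             # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2                                # degrees of freedom
    fmi = (r + 2 / (df + 3)) / (r + 1)                             # fraction of missing information
    return {"estimate": q_bar, "within": within, "between": between,
            "total": total, "se": math.sqrt(total), "df": df, "fmi": fmi}

# X1 coefficient and its variance from each of the 5 completed datasets.
x1_estimates = [-7.96, -7.81, -4.31, -4.17, -4.08]
x1_variances = [0.34, 0.52, 0.02, 0.00, 0.00]

result = pool(x1_estimates, x1_variances)
for name, value in result.items():
    print(name, round(value, 4))
```

Most of the total variance in this example comes from the between-imputation part, since the five X1 estimates differ far more from each other than their individual variances suggest.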
* Comparison of methods to handle missing values (O = yes, X = no)

Criteria                                   CCA   ACA   Mean Imputation   EM method   Multiple Imputation
Unbiased Parameter Estimation under MCAR   O     X     X                 O           O
Unbiased Parameter Estimation under MAR    X     X     X                 O           O
Unbiased Parameter Estimation under MNAR   X     X     X                 X           X
Good Estimates Variability                 X     X     X                 X           O
Best Statistical Power                     X     O     O                 O           O

MI is the BEST!!
- Excellent estimation
- Variance among ‘M’ est. b/c multiply imputed data
- By not deleting any cases
(1) Imputation step of MI
: imputation mechanisms for substituting missing values

Pattern                      Type         Normality   Imputation mechanisms
Univariate, Monotone         Continuous   O           Regression
Univariate, Monotone         Continuous   X           Predictive Mean Matching
Multivariate, Not Monotone   Continuous   -           MCMC

MCMC is NOT tested for the Univariate pattern.
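To make Predictive Mean Matching more concrete, here is a simplified sketch for one partly missing variable z and one observed predictor x. It is not Stata's implementation. A full PMM also draws the regression parameters at random before matching, which is omitted here. The function names, the donor pool size `k`, and the toy data are assumptions.

```python
import random

def fit_line(xs, zs):
    """Least-squares intercept and slope of z on x."""
    n = len(xs)
    mx = sum(xs) / n
    mz = sum(zs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    slope = sxz / sxx
    return mz - slope * mx, slope

def pmm_impute(xs, zs, k=3, rng=None):
    """Fill each missing z (None) with the observed z of a nearby donor."""
    rng = rng or random.Random(0)
    observed = [(x, z) for x, z in zip(xs, zs) if z is not None]
    intercept, slope = fit_line([x for x, _ in observed], [z for _, z in observed])
    # Each donor carries its predicted mean and its actually observed value.
    donors = [(intercept + slope * x, z) for x, z in observed]
    completed = []
    for x, z in zip(xs, zs):
        if z is not None:
            completed.append(z)
            continue
        target = intercept + slope * x          # predicted mean of the missing case
        nearest = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
        completed.append(rng.choice(nearest)[1])
    return completed

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
z = [2.1, 3.9, None, 8.2, None, 12.1]
print(pmm_impute(x, z))
```

Because every imputed value is copied from an observed case, the imputations stay within the range of the real data, which is why PMM does not rely on normality.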
Simulated Data
* 3000 obs. are generated on Z1 and X1,…,X6 (all variables are continuous)
  (Xs: observed variables; Z1: partly missing var.)
* Z1 and X1,…,X6 are drawn from a multivariate normal dist. with Means = 0 and Correlation =

      z1        x1        x2        x3        x4        x5        x6
z1    1.0000
x1    0.7655    1.0000
x2    0.2764    0.3233    1.0000
x3    0.0509    0.0351    0.5352    1.0000
x4    0.1612    0.1415    -0.0063   -0.0738   1.0000
x5    0.2924    0.3581    0.8062    -0.0640   0.0441    1.0000
x6    0.1052    0.1124    -0.0061   -0.0764   0.1157    0.0420    1.0000
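One common way to draw data like this is to multiply independent standard normal draws by the Cholesky factor of the correlation matrix. The sketch below does that in plain Python for only the z1, x1, x2 block of the matrix to keep it short. The function names and the seed are assumptions, and the slides do not say which method was actually used to generate the data.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L * L^T = a (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def draw_mvn(corr, n_obs, seed=0):
    """Draw n_obs rows with mean 0, unit variance, and correlation corr."""
    rng = random.Random(seed)
    low = cholesky(corr)
    p = len(corr)
    data = []
    for _ in range(n_obs):
        e = [rng.gauss(0.0, 1.0) for _ in range(p)]
        data.append([sum(low[i][k] * e[k] for k in range(i + 1)) for i in range(p)])
    return data

def correlation(u, v):
    """Sample Pearson correlation of two equally long lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# z1, x1, x2 block of the correlation matrix above.
corr = [[1.0, 0.7655, 0.2764],
        [0.7655, 1.0, 0.3233],
        [0.2764, 0.3233, 1.0]]
sample = draw_mvn(corr, 3000)
z1 = [row[0] for row in sample]
x1 = [row[1] for row in sample]
print("sample corr(z1, x1):", round(correlation(z1, x1), 3))
```

With 3000 obs. the sample correlation should land close to the target 0.7655, though not exactly on it.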
Example Data
(“A Predictive Study of Coronary Heart Disease”)
* 3154 obs. (all variables are continuous)
- Missing variable: Systolic Blood Pressure (Mean: 128.63)
- Observed variables (Means): DBP (82.02), height (69.78), weight (169.95), age (46.28), BMI (24.52), and Cholesterol (226.37)
* Correlation =

         sbp       dbp       height    weight    age       bmi       chol
sbp      1.0000
dbp      0.7700    1.0000
height   0.0156    0.0070    1.0000
weight   0.2513    0.2940    0.5333    1.0000
age      0.1701    0.1440    -0.0919   -0.0331   1.0000
bmi      0.2878    0.3428    -0.0633   0.8079    0.0256    1.0000
chol     0.1231    0.1296    -0.0889   0.0085    0.0892    0.0706    1.0000
1. Missing Mechanisms
(Z1 (SBP) is deleted to 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% missing)
1) MCAR: Z1 (SBP) deleted at random
2) MAR: Z1 (SBP) deleted after sorting by one of the Xs (obs. var.)
3) NMAR: Z1 (SBP) deleted after sorting by Z1 (SBP) itself
2. Bias is mainly measured by
RMSE (Root Mean Square Error) = Sqrt(Variance of Estimates + Bias^2)
: captures both the accuracy and the variability of the estimates and compares them in the same units.
* True value = Mean of Z1 (SBP) at 0% missing
* Estimate = Mean of Z1 (SBP) at 10% to 80% missing after MI
When RMSE is “smaller” → Estimation is “better”
3. The methods to deal with missing values (to measure effectiveness of MI)
- Complete Case Analysis (CCA)
- Multiple Imputation (MI)
4. Imputation numbers
M = 10, 20, 30, 40, and 50
5. Imputation model
(z1 = x1 x2 x3 x4 x5 x6): all variables
(z1 = x1 x2 x5): variables highly corr. to z1
(z1 = x3 x4 x6): variables rarely corr. to z1
The z1 = x1 x2 x5 model is the best model b/c it has the smallest RMSE.
6. Imputation Mechanisms
- Regression method
- PMM
- MCMC
7. 500 repetitions of each MI (to reduce random variability of imputation)
ex) M=10 * 500 reps. → Average them → Mean of Est. for M=10
…
M=50 * 500 reps. → Average them → Mean of Est. for M=50
8. Statistical Software
Stata 11 (Multiple Imputation)
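The deletion schemes in item 1 and the RMSE in item 2 can be sketched together in plain Python. This is a minimal illustration under stated assumptions, not the Stata code behind the slides: the data are hypothetical, only the complete-case mean is estimated (no MI), and after sorting the cases with the largest values are deleted (the slides do not say which end was removed). The names `delete_z`, `cca_mean`, and `rmse` are made up for this sketch.

```python
import math
import random

def delete_z(rows, mechanism, proportion, seed=0):
    """rows: list of (x, z). Returns a copy with z set to None for deleted cases."""
    n_delete = round(len(rows) * proportion)
    idx = list(range(len(rows)))
    if mechanism == "MCAR":
        random.Random(seed).shuffle(idx)                    # delete at random
    elif mechanism == "MAR":
        idx.sort(key=lambda i: rows[i][0], reverse=True)    # sort by observed x
    elif mechanism == "NMAR":
        idx.sort(key=lambda i: rows[i][1], reverse=True)    # sort by z itself
    else:
        raise ValueError(mechanism)
    drop = set(idx[:n_delete])
    return [(x, None if i in drop else z) for i, (x, z) in enumerate(rows)]

def cca_mean(rows):
    """Complete-case estimate: mean of the z values that were not deleted."""
    kept = [z for _, z in rows if z is not None]
    return sum(kept) / len(kept)

def rmse(estimates, true_value):
    """Sqrt(variance of estimates + bias^2), as defined in item 2."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - true_value
    variance = sum((e - mean_est) ** 2 for e in estimates) / n
    return math.sqrt(variance + bias ** 2)

# Hypothetical data: x is fully observed, z is correlated with x.
rng = random.Random(1)
data = []
for _ in range(1000):
    x = rng.gauss(0.0, 1.0)
    data.append((x, 0.77 * x + rng.gauss(0.0, 0.64)))
true_mean = sum(z for _, z in data) / len(data)   # mean of z at 0% missing

for mechanism in ("MCAR", "MAR", "NMAR"):
    estimates = [cca_mean(delete_z(data, mechanism, 0.3, seed=s)) for s in range(50)]
    print(mechanism, "RMSE of complete-case mean:", round(rmse(estimates, true_mean), 3))
```

Only the MCAR deletion changes with the seed here; the sorted deletions are deterministic, so their RMSE is just the absolute bias.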
[Figure: RMSE vs. proportion of missing data, CCA vs. MI; panels: MCAR, MAR, NMAR]
Under MCAR and MAR, both CCA and MI are good.
After changing the scale of the Y axis: under all missing mechanisms, MI is better than CCA.
As the percent of missing increases, RMSEs increase linearly, and so does the diff. of RMSE b/w CCA and MI.
> With a high amount of missing, use Multiple Imputation.
[Figure: RMSE vs. proportion of missing data (10% to 80%) for 10, 20, 30, 40, and 50 imputations; panels: MCAR, MAR, NMAR; callout: “Similar”]
Under NMAR, MI gives biased est. at 80% missing (regardless of imputation #) b/c of large RMSE ≒ 1 SD of data (= 0.99).
Under MCAR and MAR, MI is good!
The 5 lines (M=10 to M=50) go together and look like 1 line.
> No difference among diff. imputation numbers (m) = 10, 20, 30, 40, 50.
[Figure: RMSE vs. proportion of missing data for reg, pmm, and mcmc; panels: MCAR, MAR, NMAR; callout: “MCMC/ Reg.”]

Missing mechanism   Normality    Theory       Practically (MI)
MCAR                Normal       Regression   All imputation mechanisms
MAR                 Normal       Regression   All imputation mechanisms (Reg. slightly better)
NMAR                Not Normal   PMM          Regression, MCMC

*Normal assumption may not be important under NMAR.
*MCMC is good under all missing mechanisms.
Thus, MCMC can be used in univariate and continuous missing.
1. Under MCAR and MAR, theoretically Reg. should be better because of normality, but all methods are good. However, the Reg. method is slightly better under MAR.
2. Under NMAR, even though normality is not met, the Reg. method is better than PMM.
[Figure: RMSE vs. proportion of missing data (10% to 80%), CCA vs. MI; panels: MCAR, MAR, NMAR; callout: “MI better”]
Under MCAR and MAR, both CCA and MI are good.
After changing the scale of the Y axis: under MCAR, MAR, and NMAR, MI produced significantly less biased values than CCA.
As the percent of missing increases, RMSEs increase linearly, and so does the diff. of RMSE b/w CCA and MI.
> With a high amount of missing, Multiple Imputation is preferable.
[Figure: RMSE vs. proportion of missing data (10% to 80%) for 10, 20, 30, 40, and 50 imputations; panels: MCAR, MAR, NMAR; callout: “Similar”]
Under NMAR, MI did not do well at 80% missing (regardless of imputation # and percent of missing) due to large RMSE ≒ 1 SD of data (= 15.11).
Under MCAR and MAR, MI produces unbiased est.
No difference among increased imputation numbers 10, 20, 30, 40, 50.
> Increased imputation numbers have no sign. effect on correcting bias for data with these characteristics.
[Figure: RMSE vs. proportion of missing data (10% to 80%) for reg, pmm, and mcmc; panels: MCAR, MAR, NMAR; callout: “MCMC/ Reg.”]

Missing mechanism   Normality    Theory   Practically (MI)
MCAR                Not Normal   PMM      All imputation mechanisms
MAR                 Not Normal   PMM      All imputation mechanisms (PMM method slightly better)
NMAR                Not Normal   PMM      Regression, MCMC

*Normal assumption may be important only under MAR.
*MCMC is good to use under MCAR, MAR, and NMAR.
Thus, MCMC can be used not only in multivariate and continuous missing, but also in univariate and continuous missing.
1. Under MCAR and MAR, theoretically PMM should be better because the normal assumption is broken, but all methods are good. However, the PMM method is slightly better under MAR.
2. Under NMAR, even though normality is not met, Reg. has lower RMSE than PMM.
Conclusion
1. Multiple Imputation (MI) > Complete Case Analysis (CCA), always.
2. No significant difference among imputation numbers in my data.
3. Under MCAR and MAR, MI produces unbiased estimates even at a high amount of missing.
4. However, under NMAR, the estimation by MI is also biased at a high amount of missing.
5. MCMC is good for univariate and continuous missing under MCAR, MAR, and NMAR.