Transcript Slide 1

Missing Data
What do we mean by missing data?
• Missing observations which were intended to be collected but were:
– Never collected
– Lost accidentally
– Wrongly collected and so deleted
• Can affect outcomes and/or explanatory variables
Effect of Missing Data
• Can cause
– Biased estimates (e.g. means, regression parameters)
– Biased standard errors, resulting in incorrect p-values and confidence intervals (CIs)
Missing data mechanism
1. Missing Completely At Random : MCAR
– Missing does not depend on observed or
unobserved values
– E.g. full blood count (FBC) missing because a tube of blood is accidentally broken
– E.g. blood pressure (BP) missing due to a broken machine
Missing data mechanism
2. Missing At Random : MAR
– Missing depends on observed data, but not
on the unobserved data.
– E.g. 18-25 year olds are less likely to respond to a follow-up postal questionnaire, as they are more likely to change address several times. Missingness then depends on age, which is observed
Missing data mechanism
3. Missing Not At Random: MNAR
– Given all available observed information, the
probability of being missing still depends on
the unobserved data
– E.g. a patient misses an appointment because they feel ill. This illness (e.g. flu) is related to the measurement intended to be made (e.g. temperature)
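The three mechanisms can be compared with a small simulation. This is a minimal sketch, not part of the original slides: the variables x and y, the missingness probabilities and the sample size are all invented for illustration. It shows what happens to the complete-case mean of y when missingness is unrelated to the data (MCAR), depends only on the observed x (MAR), or depends on the unobserved y itself (MNAR).

```python
import random
import statistics

def observed_mean(mechanism, n=20000, seed=1):
    """Mean of y among records where y is observed, under one missingness mechanism."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        x = rng.gauss(0, 1)        # fully observed covariate (hypothetical)
        y = x + rng.gauss(0, 1)    # variable subject to missingness; true mean is 0
        if mechanism == "MCAR":
            p_missing = 0.3                    # same probability for every record
        elif mechanism == "MAR":
            p_missing = 0.6 if x > 0 else 0.1  # depends only on the observed x
        elif mechanism == "MNAR":
            p_missing = 0.6 if y > 0 else 0.1  # depends on the unobserved y itself
        else:
            raise ValueError(mechanism)
        if rng.random() >= p_missing:
            kept.append(y)
    return statistics.fmean(kept)

results = {m: observed_mean(m) for m in ("MCAR", "MAR", "MNAR")}
for mechanism, mean_y in results.items():
    print(f"{mechanism}: complete-case mean of y = {mean_y:.3f} (true mean 0)")
```

In this sketch the MCAR mean stays close to the true value, while the MAR and MNAR means are pulled downwards. Under MAR the bias could be corrected using x; under MNAR it could not be corrected from the observed data alone.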
The Assumptions
– Cannot tell for certain from the data at hand whether the missing values are MCAR, MAR or MNAR; in particular, MNAR cannot be ruled out
– Can distinguish between MCAR and MAR
– MAR can be made more plausible by looking at associations between missingness and the observed values of the explanatory variables
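One simple way to look at such associations is to compare an observed variable between records with and without a missing value. The sketch below uses invented toy records (the field names age and bp and all values are assumptions, not study data). A clear difference between the two groups would argue against MCAR, but it cannot show that the data are MAR rather than MNAR.

```python
import statistics

# Toy records (invented for illustration): age is complete, bp may be missing (None).
records = [
    {"age": 19, "bp": None}, {"age": 22, "bp": None}, {"age": 24, "bp": 118},
    {"age": 35, "bp": 121}, {"age": 41, "bp": None}, {"age": 47, "bp": 130},
    {"age": 53, "bp": 128}, {"age": 60, "bp": 135}, {"age": 68, "bp": 141},
]

def mean_by_missingness(rows, target, covariate):
    """Mean of an observed covariate, split by whether `target` is missing."""
    missing = [r[covariate] for r in rows if r[target] is None]
    present = [r[covariate] for r in rows if r[target] is not None]
    return statistics.fmean(missing), statistics.fmean(present)

age_missing, age_present = mean_by_missingness(records, "bp", "age")
print(f"mean age when bp is missing:  {age_missing:.1f}")
print(f"mean age when bp is observed: {age_present:.1f}")
```

In practice a formal comparison (e.g. a t-test or a logistic regression of a missingness indicator on the observed variables) would be used in place of the raw means.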
Simple methods to handle missing data
• Complete Case (CC) analysis
• Mean imputation
• Regression imputation
• Stochastic imputation
Problem: makes results too certain
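The "too certain" problem can be seen with mean imputation on a handful of numbers. This is a minimal sketch with toy values chosen for illustration: filling the gaps with the observed mean leaves the mean unchanged but shrinks the standard error, because the filled-in values are treated as if they had really been observed.

```python
import math
import statistics

x1 = [32.4, None, 26.7, 13.3, None, 10.1]   # toy values; None = missing

observed = [v for v in x1 if v is not None]
mean_obs = statistics.fmean(observed)
filled = [mean_obs if v is None else v for v in x1]   # mean imputation

def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

se_cc = standard_error(observed)
se_mean_imputed = standard_error(filled)
print(f"complete cases:  mean {mean_obs:.2f}, SE {se_cc:.2f}")
print(f"mean imputation: mean {statistics.fmean(filled):.2f}, SE {se_mean_imputed:.2f}")
```

The smaller standard error after imputation reflects no new information, only the pretence that the imputed values are real data.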
Multiple Imputation (MI)
• Under the MAR assumption, gives less biased estimates and SEs than CC analysis
• Covers many different data structures
• Never the absolute best thing to do
Multiple Imputation (MI)
Original data (. = missing):
ID   x1     x2
1    32.4   204
2    .      5.8
3    26.7   308
4    13.3   15.9
5    .      10.4
6    10.1   6.0

Imputed data set 1:
ID   x1     x2
1    32.4   204
2    14.2   5.8
3    26.7   308
4    13.3   15.9
5    6.8    10.4
6    10.1   6.0

Imputed data set 2:
ID   x1     x2
1    32.4   204
2    5.6    5.8
3    26.7   308
4    13.3   15.9
5    12.2   10.4
6    10.1   6.0
Key Idea behind Imputation
• Express our uncertainty about missing
data by creating ‘m’ imputed data sets
• Analyse each of these in usual way
• Combine estimates using particular rules
(Rubin’s rules)
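The combining step can be sketched as follows. Rubin's rules average the m estimates and add the between-imputation variance, inflated by (1 + 1/m), to the average within-imputation variance. The function name and the example numbers are invented for illustration. Ratio measures such as hazard ratios are usually pooled on the log scale, so the example inputs are log hazard ratios.

```python
import math
import statistics

def rubins_rules(estimates, std_errors):
    """Pool m estimates and their standard errors using Rubin's rules."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean([se ** 2 for se in std_errors])  # mean within-imputation variance
    between = statistics.variance(estimates)                   # between-imputation variance
    total = within + (1 + 1 / m) * between
    if between > 0:
        # Rubin's degrees of freedom for the t reference distribution
        dof = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    else:
        dof = math.inf
    return pooled, math.sqrt(total), dof

# Hypothetical log hazard ratios and SEs from m = 5 imputed data sets
log_hrs = [-0.30, -0.35, -0.32, -0.28, -0.34]
ses = [0.14, 0.14, 0.14, 0.14, 0.14]

pooled, pooled_se, dof = rubins_rules(log_hrs, ses)
print(f"pooled log HR {pooled:.3f}, SE {pooled_se:.4f}, dof {dof:.1f}")
print(f"pooled HR {math.exp(pooled):.3f}")
```

The pooled standard error is larger than any single within-imputation standard error. This is how the uncertainty about the missing values enters the final result.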
Key Idea behind Imputation
• Two variables: X1 and X2
– X1 missing in some records
– X2 not missing, observed in every unit
• Learn relationship between X1 and X2
• Complete the data set by drawing the missing observations from the distribution of X1 given X2 (X1 | X2)
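The drawing step can be illustrated with a deliberately simplified sketch. It assumes that X1 given X2 is normal around a straight line, fits that line to the complete records, and draws each missing X1 as the fitted value plus random noise. The toy data, function names and seed are assumptions for illustration. Unlike proper multiple imputation, this version does not also draw the regression parameters, so it understates the uncertainty somewhat.

```python
import math
import random
import statistics

# Toy data as (id, x1, x2); None marks a missing x1. x2 is always observed.
rows = [
    (1, 32.4, 204.0), (2, None, 5.8), (3, 26.7, 308.0),
    (4, 13.3, 15.9), (5, None, 10.4), (6, 10.1, 6.0),
]

def fit_line(pairs):
    """Least-squares fit of x1 on x2; returns intercept, slope and residual SD."""
    xs = [x2 for x2, _ in pairs]
    ys = [x1 for _, x1 in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, math.sqrt(sse / (len(pairs) - 2))

def impute_once(data, rng):
    """Return one completed copy of the data, drawing missing x1 from X1 | X2."""
    complete = [(x2, x1) for _, x1, x2 in data if x1 is not None]
    intercept, slope, resid_sd = fit_line(complete)
    filled = []
    for rec_id, x1, x2 in data:
        if x1 is None:
            x1 = intercept + slope * x2 + rng.gauss(0, resid_sd)  # random draw, not the fitted mean
        filled.append((rec_id, x1, x2))
    return filled

rng = random.Random(2024)
imputed_sets = [impute_once(rows, rng) for _ in range(5)]   # m = 5 imputed data sets
for k, data_set in enumerate(imputed_sets, start=1):
    print(k, [round(x1, 1) for _, x1, _ in data_set])
```

Observed values are identical across the completed data sets. Only the imputed entries vary, and that variation represents the uncertainty about the missing data.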
Example 1
• Longitudinal Breast Cancer study
– Outcome: Early death or disease recurrence
– Explanatory variables: age, meno, tam
• Cox regression
How much is missing?
variables with no mv's: id meno rectime censrec _st _d _t _t0 lnt

Variable     | type     obs   mv   variable label
-------------+-----------------------------------------------
age          | float    554  132   age, years
tam          | byte     557  129   hormonal therapy
-------------------------------------------------------------
N: 686
CC Analysis
Cox regression -- Breslow method for ties

No. of subjects =          452               Number of obs   =       452
No. of failures =          193
Time at risk    =  1412.848734
                                             LR chi2(3)      =      5.15
Log likelihood  =   -1073.9288               Prob > chi2     =    0.1613

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    .993877   .0108284    -0.56   0.573     .9728787    1.015328
         tam |    .723719   .1162513    -2.01   0.044      .528252     .991514
        meno |   1.312512   .2877824     1.24   0.215       .85402    2.017151
------------------------------------------------------------------------------
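The columns of this output are linked by simple arithmetic on the log hazard ratio scale. The sketch below reworks the tam row under two assumptions that the slides do not state: that the reported standard error of the hazard ratio is the delta-method value HR × SE(log HR), and that the confidence interval is computed on the log scale and then exponentiated.

```python
import math
from statistics import NormalDist

hr, se_hr = 0.723719, 0.1162513      # tam row: hazard ratio and its standard error

log_hr = math.log(hr)
se_log_hr = se_hr / hr               # assumed delta method: SE(HR) = HR * SE(log HR)
z = log_hr / se_log_hr
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

z_crit = NormalDist().inv_cdf(0.975) # about 1.96 for a 95% interval
ci_low = math.exp(log_hr - z_crit * se_log_hr)
ci_high = math.exp(log_hr + z_crit * se_log_hr)

print(f"z = {z:.2f}, p = {p_value:.3f}, 95% CI = ({ci_low:.4f}, {ci_high:.4f})")
```

Under these assumptions the calculation agrees with the z, P>|z| and interval columns reported for tam.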
MI in Practice
• Stata: ICE
– Multiple Imputation by Chained Equations
(MICE)
• Univariate imputation - uvis
• Multivariate imputation - ice
[Figure: density of age (years); graphs by agemiss]
MI Analysis
mim: stcox age tam meno

Multiple-imputation estimates (stcox)             Imputations =       5
                                                  Minimum obs =     686
                                                  Minimum dof =    69.9

------------------------------------------------------------------------------
          _t | Haz. Rat.  Std. Err.      t    P>|t|    [95% Conf. Int.]    FMI
-------------+----------------------------------------------------------------
         age |   .985514   .010088    -1.43   0.158    .965598   1.00584  0.247
         tam |   .724898   .101434    -2.30   0.023     .54933   .956578  0.191
        meno |   1.42128   .276051     1.81   0.072    .968226   2.08633  0.160
------------------------------------------------------------------------------
Summary
• Most studies will have missing data
• MI is suitable: it gives less biased estimates and SEs under MCAR and MAR
• MI is a useful tool for dealing with missing data