Lecture 20
Missing Data and Random Effects Modelling
Lecture Contents
• What is missing data?
• Simple ad-hoc methods.
• Types of missing data (MCAR, MAR, MNAR).
• Principled methods.
• Multiple imputation.
• Methods that respect the random effects structure.
Thanks to James Carpenter (LSHTM) for many slides!!
Dealing with missing data
• Why is this necessary?
• Missing data are common.
• However, they are usually inadequately handled in both
epidemiological and experimental research.
• For example, Wood et al. (2004) reviewed 71 recently
published BMJ, JAMA, Lancet and NEJM papers.
• 89% had partly missing outcome data.
• In 37 trials with repeated outcome measures, 46%
performed complete case analysis.
• Only 21% reported sensitivity analysis.
What do we mean by missing data?
• Missing data are observations that we intended to make but did not.
For example, an individual may only respond to certain questions in a
survey, or may not respond at all to a particular wave of a longitudinal
survey. In the presence of missing data, our goal remains to make
inferences that apply to the population targeted by the complete sample
- i.e. the goal remains what it would have been had we seen the
complete data.
• However, both making inferences and performing the analysis are
now more complex. We will see we need to make assumptions in
order to draw inferences, and then use an appropriate computational
approach for the analysis.
• We will avoid adopting computationally simple solutions (such as
just analysing complete data or carrying forward the last observation
in a longitudinal study) which generally lead to misleading
inferences.
What are missing data?
• In practice the data consist of (a) the observations actually made
(where '?' denotes a missing observation):

              Variable
Unit    1     2     3     4     5     6     7
1       1     2     3.4   4.5   ?     10    1.2
2       1     3     ?     ?     B     12    ?
3       2     ?     2.6   ?     C     15    0

• and (b) the pattern of missing values (1 = observed, 0 = missing):

              Variable
Unit    1     2     3     4     5     6     7
1       1     1     1     1     0     1     1
2       1     1     0     0     1     1     0
3       1     0     1     0     1     1     1
Inferential Framework
When it comes to analysis, whether we adopt a frequentist or a
Bayesian approach, the likelihood is central.
In these slides, for convenience, we discuss
issues from a frequentist perspective, although
often we use appropriate Bayesian
computational strategies to approximate
frequentist analyses.
Classical Approach
• The actual sampling process involves the 'selection' of the missing
values, as well as of the units. So to complete the process of inference
in a justifiable way we need to take this into account.
Bayesian Framework
• Posterior Belief ∝ Prior Belief × Likelihood.
• Here the likelihood is a measure of comparative support for different
models given the data. It requires a model for the observed data, and
as with classical inference this must involve aspects of the way in
which the missing data have been selected (i.e. the missingness
mechanism).
What do we mean by valid inference
when we have missing data?
• We have already noted that missing data are
observations we intended to make but did not. Thus, the
sampling process now involves both the selection of the
units, AND ALSO the process by which observations
become missing - the missingness mechanism.
• It follows that for valid inference, we need to take
account of the missingness mechanism.
• By valid inference in a frequentist framework we mean
that the quantities we calculate from the data have the
usual properties. In other words, estimators are
consistent, confidence intervals attain nominal coverage,
p-values are correct under the null hypothesis, and so
on.
Assumptions
• We distinguish between item and unit nonresponse
(missingness). For item missingness, values can be
missing on response (i.e. outcome) variables and/or on
explanatory (i.e. design/covariate/exposure/confounder)
variables.
• Missing data can affect the properties of estimators (for
example, means, percentages, percentiles, variances,
ratios, regression parameters and so on). Missing data
can also affect inferences, i.e. the properties of tests and
confidence intervals, and Bayesian posterior
distributions.
• A critical determinant of these effects is the way in which
the probability of an observation being missing (the
missingness mechanism) depends on other variables
(measured or not) and on its own value.
• In contrast with the sampling process, which is usually
known, the missingness mechanism is usually unknown.
Assumptions
• The data alone cannot usually definitively tell us the
sampling process.
• Likewise, the missingness pattern, and its relationship to
the observations, cannot definitively identify the
missingness mechanism.
• The additional assumptions needed to allow the
observed data to be the basis of inferences that would
have been available from the complete data can usually
be expressed in terms of either
• 1. the relationship between selection of missing
observations and the values they would have taken, or
• 2. the statistical behaviour of the unseen data.
• These additional assumptions are not subject to
assessment from the data under analysis; their
plausibility cannot be definitively determined from the
data at hand.
Assumptions
• The issues surrounding the analysis of data sets with
missing values therefore centre on assumptions. We
have to
• 1. decide which assumptions are reasonable and
sensible in any given setting;
- contextual/subject matter information will be central to
this
• 2. ensure that the assumptions are transparent;
• 3. explore the sensitivity of inferences/conclusions to the
assumptions, and
• 4. understand which assumptions are associated with
particular analyses.
Getting computation out of the way
• The above implies it is sensible to use approaches that
make weak assumptions, and to seek computational
strategies to implement them. However, often
computationally simple strategies are adopted, which
make strong assumptions, which are subsequently hard
to justify.
• Classic examples are completers analysis (i.e. only
including units with fully observed data in the analysis)
and last observation carried forward. The latter is
sometimes advocated in longitudinal studies, and
replaces a unit's unseen observations at a particular
wave with their last observed values, irrespective of the
time that has elapsed between the two waves.
Conclusions (1)
Missing data introduce an element of ambiguity into statistical analysis,
which is different from the traditional sampling imprecision. While
sampling imprecision can be reduced by increasing the sample size,
this will usually only increase the number of missing observations!
As discussed in the preceding sections, the issues surrounding the
analysis of incomplete datasets turn out to centre on assumptions
and computation.
• The assumptions concern the relationship between the reason for
the missing data (i.e. the process, or mechanism, by which the data
become missing) and the observations themselves (both observed
and unobserved).
• Unlike say in regression, where we can use the residuals to check
on the assumption of normality, these assumptions cannot be
verified from the data at hand.
• Sensitivity analysis, where we explore how our conclusions change
as we change the assumptions, therefore has a central role in the
analysis of missing data.
Simple, ad-hoc methods and
their shortcomings
• In contrast to principled methods, these usually
create a single 'complete' dataset, which is
analysed as if it were the fully observed data.
• Unless certain, fairly strong, assumptions are
true, the answers are invalid.
• We briefly review the following methods:
• Analysis of completers only.
• Imputation of simple mean.
• Imputation of regression mean.
• Creating an extra category.
Completers analysis
• The data below have one missing observation: variable 2 on unit 10.
• Completers analysis deletes all units with incomplete data from the
analysis (here, unit 10).
Unit    Variable 1    Variable 2
1       3.4           5.67
2       3.9           4.81
3       2.6           4.93
4       1.9           6.21
5       2.2           6.83
6       3.3           5.61
7       1.7           5.45
8       2.4           4.94
9       2.8           5.73
10      3.6           ?
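For concreteness, here is a minimal completers-analysis sketch in Python/pandas using the table above (the column names v1 and v2 are just illustrative labels; the original practical uses MLwiN rather than Python):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "v1": [3.4, 3.9, 2.6, 1.9, 2.2, 3.3, 1.7, 2.4, 2.8, 3.6],
    "v2": [5.67, 4.81, 4.93, 6.21, 6.83, 5.61, 5.45, 4.94, 5.73, np.nan],
})

# Completers analysis: drop every unit with any missing value.
completers = df.dropna()
print(completers.shape)   # (9, 2) -- unit 10 has been discarded
```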
What’s wrong with completers
analysis?
• It is inefficient.
• It is problematic in regression when covariate values are
missing and models with several sets of explanatory
variables need to be compared. Either we keep changing
the size of the data set, as we add/remove explanatory
variables with missing observations, or we use the
(potentially very small, and unrepresentative) subset of
the data with no missing values.
• When the missing observations are not a completely
random selection of the data, a completers analysis will
give biased estimates and invalid inferences.
Simple mean imputation
• We replace missing data with the arithmetic average of
the observed data for that variable. In the table of 10
cases this will be 5.58.
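A minimal sketch of this in Python/pandas, using variable 2 from the table of 10 cases above:

```python
import numpy as np
import pandas as pd

# Variable 2 from the completers-analysis table; unit 10 is missing.
v2 = pd.Series([5.67, 4.81, 4.93, 6.21, 6.83, 5.61, 5.45, 4.94, 5.73, np.nan])

print(round(v2.mean(), 2))         # 5.58 -- pandas ignores the NaN
v2_imputed = v2.fillna(v2.mean())  # simple mean imputation
```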
Why not?
• This approach is clearly inappropriate for categorical
variables.
• It does not lead to proper estimates of measures of
association or regression coefficients. Rather,
associations tend to be diluted.
• In addition, variances will be wrongly estimated (typically
under estimated) if the imputed values are treated as
real. Thus inferences will be wrong too.
Regression mean imputation
• Here, we use the completers to calculate the regression of the
incomplete variable on the other complete variables. Then, we
substitute the predicted mean for each unit with a missing value. In
this way we use information from the joint distribution of the
variables to make the imputation.
• To perform regression imputation, we first regress variable 2 on
variable 1 (note, it doesn't matter which of these is the 'response' in
the model of interest). In our example, we use simple linear
regression:
• V2 = α + β V1 + e.
• Using units 1-9, we find that α = 6.56 and β = - 0.366, so the
regression relationship is
• Expected value of V2 = 6.56 - 0.366V1.
• For unit 10, this gives
• 6.56 - 0.366 x 3.6 = 5.24.
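A short sketch reproducing this calculation (Python/NumPy; np.polyfit is used here purely as a convenient way to fit the simple linear regression on the nine complete cases):

```python
import numpy as np

# Variables 1 and 2 from the completers-analysis table; unit 10's value
# of variable 2 is missing.
v1 = np.array([3.4, 3.9, 2.6, 1.9, 2.2, 3.3, 1.7, 2.4, 2.8, 3.6])
v2 = np.array([5.67, 4.81, 4.93, 6.21, 6.83, 5.61, 5.45, 4.94, 5.73, np.nan])

obs = ~np.isnan(v2)                              # units 1-9
beta, alpha = np.polyfit(v1[obs], v2[obs], 1)    # slope, then intercept
print(round(alpha, 2), round(beta, 3))           # 6.56 -0.366

# Regression mean imputation: substitute the predicted mean for unit 10.
v2_imputed = np.where(obs, v2, alpha + beta * v1)
print(round(v2_imputed[-1], 2))                  # 5.24
```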
Regression mean imputation:
Why/Why Not?
• Regression mean imputation can generate
unbiased estimates of means, associations ad
regression coefficients in a much wider range of
settings than simple mean imputation.
• However, one important problem remains. The
variability of the imputations is too small, so the
estimated precision of regression coefficients will
be wrong and inferences will be misleading.
Creating an extra category
• When a categorical variable has missing values it is common practice
to add an extra 'missing value' category. In the example below, the
missing values, denoted '?', have been given the category 3.

Unit    Variable 1    Variable 2
1       3.4           1
2       3.9           1
3       2.6           1
4       1.9           1
5       2.2           ? → 3
6       3.3           2
7       1.7           2
8       2.4           2
9       2.8           ? → 3
10      3.6           ? → 3
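A minimal sketch of this recoding for variable 2 above (Python/pandas, purely illustrative):

```python
import numpy as np
import pandas as pd

v2 = pd.Series([1, 1, 1, 1, np.nan, 2, 2, 2, np.nan, np.nan])

# Recode every missing value as an extra category, 3.
v2_extra = v2.fillna(3).astype(int)
print(v2_extra.tolist())   # [1, 1, 1, 1, 3, 2, 2, 2, 3, 3]
```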
Creating an extra category
This is bad practice because:
• the impact of this strategy depends on how
missing values are divided among the real
categories, and how the probability of a value
being missing depends on other variables;
• very dissimilar classes can be lumped into one
group;
• severe bias can arise, in any direction, and
• when used to stratify for adjustment (or correct
for confounding) the completed categorical
variable will not do its job properly.
Some notation
• The data
We denote the data we intended to collect by Y, and we partition this
into
Y = {Yo, Ym},
where Yo is observed and Ym is missing. Note that some variables in Y
may be outcomes/responses, and some may be explanatory
variables/covariates. Depending on the context these may all refer to
one unit, or to an entire dataset.
• Missing value indicator
Corresponding to every observation Y, there is a missing value
indicator R, defined as
R = 1 if Y is observed, and R = 0 otherwise.
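As a small illustration, the missing value indicator for the earlier three-unit example can be built like this (Python/pandas; the column labels v1-v7 are just illustrative):

```python
import numpy as np
import pandas as pd

# Y: the data we intended to collect (np.nan marks the entries of Ym;
# variable 5 is categorical in the earlier example).
Y = pd.DataFrame({
    "v1": [1, 1, 2],             "v2": [2, 3, np.nan],
    "v3": [3.4, np.nan, 2.6],    "v4": [4.5, np.nan, np.nan],
    "v5": [np.nan, "B", "C"],    "v6": [10, 12, 15],
    "v7": [1.2, np.nan, 0],
})

# R: the missing value indicator, 1 where Y is observed and 0 otherwise.
R = Y.notna().astype(int)
print(R.to_numpy())
# [[1 1 1 1 0 1 1]
#  [1 1 0 0 1 1 0]
#  [1 0 1 0 1 1 1]]
```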
Missing value mechanism
• The key question for analyses with missing data is,
under what circumstances, if any, do the analyses we
would perform if the data set were fully observed lead to
valid answers? As before, 'valid' means that effects and
their SE's are consistently estimated, tests have the
correct size, and so on, so inferences are correct.
• The answer depends on the missing value mechanism.
• This is the probability that a set of values are missing
given the values taken by the observed and missing
observations, which we denote by
Pr(R | Yo, Ym).
Examples of missing value
mechanisms
1. The chance of non-response to questions about
income usually depends on the person's
income.
2. Someone may not be at home for an interview
because they are at work.
3. The chance of a subject leaving a clinical trial
may depend on their response to treatment.
4. A subject may be removed from a trial if their
condition is insufficiently controlled.
Missing Completely at Random
(MCAR)
• Suppose the probability of an observation being missing
does not depend on observed or unobserved
measurements. In mathematical terms, we write this as
• Pr(R | Yo, Ym) = Pr(R)
• Then we say that the observation is Missing Completely
At Random, which is often abbreviated to MCAR. Note
that in a sample survey setting MCAR is sometimes
called uniform non-response.
• If data are MCAR, then consistent results with missing
data can be obtained by performing the analyses we
would have used had there been no missing data,
although there will generally be some loss of information.
In practice this means that, under MCAR, the analysis of
only those units with complete data gives valid
inferences.
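A small simulation sketch of MCAR (Python/NumPy; the distribution and the 30% missingness rate are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=10.0, scale=2.0, size=100_000)   # the complete data

# MCAR: the chance of an observation being missing depends on nothing
# in the data (here, a constant 30% chance for every unit).
observed = rng.random(y.size) > 0.3

print(round(y.mean(), 2), round(y[observed].mean(), 2))
# Both are close to 10: under MCAR, analysing only the complete cases
# still gives valid answers, just with some loss of information.
```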
Missing At Random (MAR)
• After considering MCAR, a second question
naturally arises. That is, what are the most
general conditions under which a valid analysis
can be done using only the observed data, and
no information about the missing value
mechanism, Pr(R | Yo, Ym)? The answer to this
is when, given the observed data, the
missingness mechanism does not depend on
the unobserved data. Mathematically,
• Pr(R | Yo, Ym) = Pr(R | Yo).
• This is termed Missing At Random, abbreviated
MAR.
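A small simulation sketch of MAR, where missingness in y depends only on a fully observed x (again an illustrative Python/NumPy example, not part of the practical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=n)                      # always observed
y = 2.0 + 1.5 * x + rng.normal(size=n)      # subject to missingness

# MAR: Pr(R | Yo, Ym) = Pr(R | Yo); here the chance that y is missing
# depends only on the observed x, not on y itself.
p_missing = 1.0 / (1.0 + np.exp(-2.0 * x))  # larger x -> more likely missing
observed = rng.random(n) > p_missing

print(round(y.mean(), 2), round(y[observed].mean(), 2))
# The complete-case mean of y is clearly too small, even though the data
# are MAR, because missingness is linked to x.

# A valid analysis can still use only the observed data: the regression of
# y on x among the completers is consistent, and averaging its predictions
# over all x recovers the overall mean of y.
slope, intercept = np.polyfit(x[observed], y[observed], 1)
print(round((intercept + slope * x).mean(), 2))   # close to 2.0
```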
Missing Not At Random (MNAR)
• When neither MCAR nor MAR hold, we say the data are
Missing Not At Random, abbreviated MNAR. In the
likelihood setting (see end of previous section) the
missingness mechanism is termed non-ignorable.
What this means is
• Even accounting for all the available observed
information, the reason for observations being missing
still depends on the unseen observations themselves.
• To obtain valid inference, a joint model of both Y and R is
required (that is a joint model of the data and the
missingness mechanism).
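A small simulation sketch of MNAR, where missingness depends on the unseen value itself (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=10.0, scale=2.0, size=100_000)

# MNAR: the chance of y being missing depends on the value y would have
# taken (e.g. high earners not reporting their income).
p_missing = 1.0 / (1.0 + np.exp(-(y - 10.0)))
observed = rng.random(y.size) > p_missing

print(round(y.mean(), 2), round(y[observed].mean(), 2))
# The observed-data mean is noticeably too small, and no analysis of the
# observed y values alone can correct this without a model for R.
```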
MNAR (continued)
Unfortunately
• We cannot tell from the data at hand whether the
missing observations are MCAR, MNAR or MAR
(although we can distinguish between MCAR and MAR).
• In the MNAR setting it is very rare to know the
appropriate model for the missingness mechanism.
• Hence the central role of sensitivity analysis; we must
explore how our inferences vary under assumptions of
MAR, MNAR, and under various models. Unfortunately,
this is often easier said than done, especially under the
time and budgetary constraints of many applied projects.
Principled methods
• These all have the following in common:
• No attempt is made to replace a missing value directly.
i.e. we do not pretend to 'know' the missing values.
• Rather: available information (from the observed data
and other contextual considerations) is combined with
assumptions not dependent on the observed data.
• This is used to either generate statistical information about each
missing value (e.g. distributional information: given what we have
observed, the missing observation has a normal distribution with mean a
and variance b, where the parameters can be estimated from the data),
and/or to generate information about the missing value mechanism.
Principled methods
• The great range of ways in which these
can be done leads to the plethora of
approaches to missing values. Here are
some broad classes of approach:
• Wholly model based methods.
• Simple stochastic imputation.
• Multiple stochastic imputation.
• Weighted methods. (not covered here)
Wholly model based methods
A full statistical model is written down for the complete data.
Analysis (whether frequentist or Bayesian) is based on the likelihood.
Assumptions must be made about the missing data mechanism:
If it is assumed MCAR or MAR, no explicit model is needed for it.
Otherwise this model must be included in the overall formulation.
Such likelihood analyses require some form of integration (averaging)
over the missing data. Depending on the setting this can be done
implicitly or explicitly, directly or indirectly, analytically or numerically.
The statistical information on the missing data is contained in the
model. Examples of this would be the use of linear mixed models
under MAR in SAS PROC MIXED or MLwiN.
We will examine this in the practical.
Simple stochastic imputation
• Instead of replacing a value with a mean, a random draw
is made from some suitable distribution.
• Provided the distribution is chosen appropriately,
consistent estimators can be obtained from methods that
would work with the whole data set.
• Very important in the large survey setting where draws
are made from units with complete data that are 'similar'
to the one with missing values (donors).
• There are many variations on this hot-deck approach.
• Implicitly they use non-parametric estimates of the
distribution of the missing data: typically need very large
samples.
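A very simple hot-deck-style sketch (Python/pandas; here 'similar' units are simply those in the same fully observed group, a choice made purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# A toy data set: 'group' is fully observed, 'y' has missing values.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "y":     [1.2, 1.4, np.nan, 3.1, np.nan, 3.4],
})

# Hot-deck style stochastic imputation: each missing y is replaced by a
# random draw from the observed y values of 'similar' units (donors),
# where 'similar' here simply means the same group.
def hot_deck(col):
    donors = col.dropna().to_numpy()
    return col.apply(lambda v: rng.choice(donors) if np.isnan(v) else v)

df["y_imputed"] = df.groupby("group")["y"].transform(hot_deck)
print(df)
```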
Simple stochastic imputation
• Although the resulting estimators can behave well, for
precision (and inference) account must be taken of the
source of the imputations (i.e. there is no 'extra' data).
This implies that the usual complete data estimators of
precision can't be used. Thus, for each particular class of
estimator (e.g. mean, ratio, percentile) each type of
imputation has an associated variance estimator that
may be design based (i.e. using the sampling structure
of the survey) or model based, or model assisted (i.e.
using some additional modelling assumptions). These
variance estimators can be very complicated and are not
convenient for generalization.
Multiple (stochastic) imputation
• This is very similar to the single stochastic imputation
method, except there are many ways in which draws can
be made (e.g. hot-deck non-parametric, model based).
The crucial difference is that, instead of completing the
data once, the imputation process is repeated a small
number of times (typically 5-10). Provided the draws are
done properly, variance estimation (and hence
constructing valid inferences) is much more
straightforward.
• The observed variability among the estimates from each
imputed data set is used in modifying the complete data
estimates of precision. In this way, valid inferences are
obtained under missing at random.
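To show how the within- and between-imputation variability are combined, here is a minimal sketch of Rubin's combining rules with made-up estimates from m = 5 imputed data sets (the numbers are purely illustrative):

```python
import numpy as np

# Illustrative estimates and squared standard errors of the same quantity
# from m = 5 imputed data sets.
estimates = np.array([2.31, 2.25, 2.40, 2.28, 2.35])
variances = np.array([0.050, 0.048, 0.052, 0.049, 0.051])
m = len(estimates)

# Rubin's rules: the pooled estimate is the average of the per-imputation
# estimates, and the total variance adds the average within-imputation
# variance to the (inflated) between-imputation variance.
pooled = estimates.mean()
W = variances.mean()               # within-imputation variance
B = estimates.var(ddof=1)          # between-imputation variance
total_variance = W + (1.0 + 1.0 / m) * B

print(round(pooled, 3), round(np.sqrt(total_variance), 3))
```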
Why do multiple imputation?
• One of the main problems with the single
stochastic imputation methods is the need for
developing appropriate variance formulae for
each different setting. Multiple imputation
attempts to provide a procedure that can get the
appropriate measures of precision relatively
simply in (almost) any setting.
• It was developed by Rubin in a survey setting
(where it feels very natural) but has more
recently been used more widely.
Missing Data and Random effects
models
In the practical we will consider two approaches:
• Model based MCMC estimation of a multivariate response model.
• Generating multiple imputations from this model (using MCMC) that can
then be used to fit further models using any estimation method.
Information on practical
• Practical introduces MVN models in
MLwiN using MCMC.
• Two education datasets.
- Firstly, two responses that are components within GCSE science
exams, for which we consider model based approaches.
- Secondly, a six-response dataset from Hungary, for which we consider
multiple imputation.
Other approaches to missing data
• IGLS estimation of MVN models is available in
MLwiN. Here the algorithm treats the MVN
model as a special case of a univariate Normal
model and so there are no overheads for
missing data (assuming MAR).
• WinBUGS has great flexibility with missing data.
The MLwiN->WinBUGS interface will allow you
to do the same model based approach as in the
practical.
• It can however also be used to incorporate
imputation models as part of the model.
Plug for www.missingdata.org.uk
James Carpenter has developed MLwiN macros
that perform multiple imputation using MCMC.
These build around the MCMC features in the
practical but run an imputation model
independent of the actual model of interest.
See www.missingdata.org.uk for further details
including variants of these slides and WinBUGS
practicals.