Transcript Slide 1

Lecture 9:
Marginal Logistic Regression
Model and GEE
(Chapter 8)
Marginal Logistic Regression Model and GEE
Marginal models are suitable for estimating
population-average parameters
• For example, in the Indonesian study, a marginal
model can be used to address questions such as:
– What is the prevalence of respiratory infection in
children as a function of age?
– Is the prevalence of respiratory infection greater
in the sub-population of children with vitamin A
deficiency?
– How does the association of vitamin A deficiency
and respiratory infection change with age?
• The scientific objective is to characterize and
contrast populations of children.
Marginal Models for Binary Responses:
Logistic Regression
Model for the Mean
Model for the Association
Marginal odds ratio,
Marginal Odds Ratio
A value greater than 1 indicates positive association
Two possible specifications:
Degree of association is the same for all pairs of
observations from the same subject
Degree of association is inversely proportional to the
time between observations from the same subject
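The formulas for the mean and association models did not survive in this transcript. A hedged reconstruction in standard notation is given here, assuming $\mu_{ij} = \Pr(Y_{ij} = 1)$, covariates $x_{ij}$, observation times $t_{ij}$, and association parameters $\alpha_0, \alpha_1$ (these symbols are chosen for illustration and are not recovered from the slides):

```latex
\begin{align*}
\text{Model for the mean:}\quad
  & \operatorname{logit}(\mu_{ij}) = \log\frac{\mu_{ij}}{1-\mu_{ij}} = x_{ij}'\beta \\
\text{Marginal odds ratio:}\quad
  & \mathrm{OR}(Y_{ij}, Y_{ik}) =
    \frac{\Pr(Y_{ij}=1, Y_{ik}=1)\,\Pr(Y_{ij}=0, Y_{ik}=0)}
         {\Pr(Y_{ij}=1, Y_{ik}=0)\,\Pr(Y_{ij}=0, Y_{ik}=1)} \\
\text{Same for all pairs:}\quad
  & \log \mathrm{OR}(Y_{ij}, Y_{ik}) = \alpha_0 \\
\text{Inverse in time:}\quad
  & \log \mathrm{OR}(Y_{ij}, Y_{ik}) = \alpha_0 + \alpha_1\,|t_{ij} - t_{ik}|^{-1}
\end{align*}
```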
Parameter Interpretation in Logistic
Regression: ICHS Study
Marginal Logistic Regression
Parameter Interpretation in Logistic
Regression: ICHS Study
Logistic Regression with Random Effects
Parameter Interpretation in Logistic
Regression: ICHS Study
Transition Logistic Regression Model
Parameter Interpretation in Logistic
Regression: ICHS Study
Transition Logistic Regression Model (cont’d)
Maximum Likelihood Estimation of β in GLM:
Cross-sectional Data
• If Yi is binary or a count, we specify the likelihood
function and estimate the parameters of interest
using Maximum Likelihood Estimation
Maximum Likelihood Estimation of β in GLM:
Cross-sectional Data
• For example, if Y is binary, i.e.:
…we estimate β0 and β1 by maximizing
Maximum Likelihood Estimation of β in GLM:
Cross-sectional Data
• For example, if Y is a count, i.e.:
…we estimate β0 and β1 by maximizing
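For reference, the quantities maximized in the binary and count examples can be written as log-likelihoods. This is a sketch assuming a single covariate $x_i$, a logit link for the binary case, and a log link for the count case; the slides' own formulas are not present in the transcript.

```latex
\begin{align*}
\text{Binary:}\quad
  & Y_i \sim \mathrm{Bernoulli}(\mu_i), \quad
    \operatorname{logit}(\mu_i) = \beta_0 + \beta_1 x_i, \\
  & \ell(\beta_0, \beta_1) = \sum_{i} \bigl[\, y_i \log \mu_i + (1 - y_i)\log(1 - \mu_i) \,\bigr] \\
\text{Count:}\quad
  & Y_i \sim \mathrm{Poisson}(\mu_i), \quad
    \log(\mu_i) = \beta_0 + \beta_1 x_i, \\
  & \ell(\beta_0, \beta_1) = \sum_{i} \bigl[\, y_i \log \mu_i - \mu_i - \log(y_i!) \,\bigr]
\end{align*}
```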
Maximum Likelihood Estimation of β in GLM:
Cross-sectional Data
In general, we have:
$S(\beta) = \sum_{i} \left(\frac{\partial \mu_i}{\partial \beta}\right)' v_i^{-1} \, (y_i - \mu_i(\beta)) = 0$
which is called the score equation. Solving the
score equation is equivalent to maximizing the
likelihood function. Solutions to the score
equation are not available in closed form, and so
require an iterative procedure called the
iteratively weighted least squares (IWLS)
algorithm…
Maximum Likelihood Estimation of β in GLM:
Cross-sectional Data
Main ideas of IWLS
1. $\mu_i = E[Y_i]$, $v_i = \mathrm{var}(Y_i) = v(\mu_i)$
2. Choose $\beta$ to make $\mu_i(\beta)$ close to $y_i$ on average
3. Weight $y_i$ by $v_i^{-1}$
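The IWLS ideas above can be sketched for a logistic regression with an intercept and one covariate. This is a minimal illustration in plain Python, not the slides' own code; the function name `iwls_logistic` and the toy data are assumptions made for the example.

```python
import math

def iwls_logistic(x, y, n_iter=25, tol=1e-10):
    """Fit logit(mu_i) = b0 + b1 * x_i by iteratively weighted least squares."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Accumulate the weighted normal equations (X'WX) b = X'Wz.
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi
            mu = 1.0 / (1.0 + math.exp(-eta))  # mu_i = E[Y_i]
            v = mu * (1.0 - mu)                # v_i = v(mu_i)
            # Working response: the residual is scaled by 1 / v_i.
            z = eta + (yi - mu) / v
            # For the logit link, d(mu)/d(eta) = v_i, so the weight
            # (d mu / d eta)^2 / v_i on the working response equals v_i.
            w = v
            s_w += w
            s_wx += w * xi
            s_wxx += w * xi * xi
            s_wz += w * z
            s_wxz += w * xi * z
        det = s_w * s_wxx - s_wx ** 2
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

# Hypothetical toy data: binary covariate and binary response.
x_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_toy = [0, 0, 0, 1, 0, 1, 1, 1]
b0_hat, b1_hat = iwls_logistic(x_toy, y_toy)
```

Each pass recomputes the means and variances at the current coefficients and then solves a weighted least squares problem, which is why no closed-form solution is needed.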
GEE Estimation of β in GLM:
Longitudinal Data
In the case of a linear regression model with the
assumption of normality, the extension from ordinary
linear regression to longitudinal problems was
facilitated by thinking about a multivariate normal
distribution.
By specifying a model for the mean E[Yi] and the
model for the covariance matrix Vi, we can fully
specify the multivariate normal distribution:
$Y_i \sim \mathrm{MVN}(E[Y_i], V_i)$
and use MLE.
GEE Estimation of β in GLM:
Longitudinal Data
Unfortunately, if the elements of Yi are counts or
binary responses, we cannot naturally extend the
Bernoulli or Poisson distributions to take
correlation into account. Multivariate extensions of these
distributions are quite complex (except for biostat
students!).
The main impediments with binary and count data are:
1. There are no convenient multivariate generalizations
of the necessary probability distributions
2. Population-average and subject-specific approaches
do not lead to the same model for the mean
response
GEE
(applies only to marginal models)
• Under a GEE approach, we forget about trying to
specify a model for the whole multivariate
distribution of a data vector. Instead, the idea is
to just model the mean response E[Yi] and the
covariance matrix Vi of a data vector as in the
normal case.
• GEE is based on the concept of “estimating
equations” and provides a very general approach
for analyzing correlated responses that can be
discrete or continuous.
GEE
• The idea behind GEE is to generalize and extend
the usual likelihood equations for a GLM with a
univariate response by incorporating the
covariance matrix of the vector of responses Y
• For the case of linear models, the Generalized
Least Squares (GLS) estimator for the vector of
regression coefficients is a special case of the
GEE approach
GEE
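• In the absence of a convenient likelihood to work
with, it is sensible to estimate β by solving the
following multivariate equation:
$\sum_{i=1}^{N} D_i' V_i^{-1} (y_i - \mu_i(\beta)) = 0$
where $D_i = \partial \mu_i / \partial \beta$ and $V_i$ is the working
covariance matrix of $Y_i$.
Note: with continuous data, the estimate
from this score equation reduces to the
MLE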
GEE (cont’d)
• The method of generalized estimating equations
provides consistent estimates for the mean
parameter when a model for the correlation may
not be reliably specified.
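• The GEE, $\sum_{i=1}^{N} D_i' V_i^{-1} (y_i - \mu_i(\beta)) = 0$,
is a multivariate generalization of the
score equation
$\sum_{i} (\partial \mu_i / \partial \beta)' v_i^{-1} (y_i - \mu_i(\beta)) = 0$
used to maximize the
likelihood function under a GLM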
GLM for Longitudinal Data (GEE)
In summary:
1. For GEE models, we specify a GLM for the mean
response
2. We choose a working structure for the covariance
matrix Vi, ranging from independence to completely
unstructured
3. The estimates of β and their standard errors will
be consistent (i.e. unbiased for large sample
size).
4. If the specification of Vi is correct, then the GEE
solution is the maximum likelihood estimate.
GEE
One important property of the GLM family is that the
score function depends only on the mean
and variance of Yi. Therefore the estimating
equation:
$\sum_{i=1}^{N} D_i' V_i^{-1} (y_i - \mu_i(\beta)) = 0$
can be used to estimate the regression coefficients
for any choices of link and variance functions,
whether or not they correspond to a particular
member of the exponential family. This equation
is the generalized estimating equation.
GEE Properties
• $\hat{\beta}$ is nearly efficient relative to the maximum
likelihood estimate of $\beta$, provided that var(Yi)
has been reasonably approximated.
• GEE is the maximum likelihood score equation
for multivariate Gaussian data, and for binary
data, when var(Yi) is correctly specified.
• $\hat{\beta}$ is consistent for $\beta$, even when var(Yi)
is incorrectly specified.
What we need to specify for
implementing GEE
$E[Y_{ij}] = \mu_{ij}$
$g(\mu_{ij}) = \eta_{ij} = X_{ij}'\beta$ (model for the mean)
$\mathrm{Var}(Y_{ij}) = v(\mu_{ij})$ (known variance function)
$\mathrm{Corr}(Y_i)$: working correlation matrix, a model for the
pairwise correlations among the responses
Working covariance matrix:
$V_i = A_i^{1/2} \, \mathrm{Corr}(Y_i) \, A_i^{1/2}$, where $A_i = \mathrm{diag}(v(\mu_{ij}))$
$V_i$ is called the working covariance matrix
to distinguish it from the true underlying
covariance of $Y_i$
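As a small illustration of how the working covariance matrix is assembled, the sketch below builds $V_i = A_i^{1/2} \, \mathrm{Corr}(Y_i) \, A_i^{1/2}$ for binary responses with an exchangeable working correlation. The function names, the fitted means, and the correlation value 0.3 are hypothetical choices for the example.

```python
import math

def exchangeable_corr(n, alpha):
    """n x n working correlation: 1 on the diagonal, alpha elsewhere."""
    return [[1.0 if j == k else alpha for k in range(n)] for j in range(n)]

def working_covariance(mu, corr):
    """V_i = A_i^{1/2} R A_i^{1/2} with A_i = diag(v(mu_ij)), v(mu) = mu (1 - mu)."""
    sd = [math.sqrt(m * (1.0 - m)) for m in mu]  # square roots of diag(A_i)
    n = len(mu)
    return [[sd[j] * corr[j][k] * sd[k] for k in range(n)] for j in range(n)]

mu_i = [0.2, 0.5, 0.8]  # hypothetical fitted means for one subject
V_i = working_covariance(mu_i, exchangeable_corr(3, 0.3))
```

The diagonal of `V_i` holds the binomial variances, and each off-diagonal entry is the working correlation scaled by the two standard deviations.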
GEE
Minimize:
$\sum_{i=1}^{N} (y_i - \mu_i(\beta))' V_i^{-1} (y_i - \mu_i(\beta))$
GEE equations:
$\sum_{i=1}^{N} D_i' V_i^{-1} (y_i - \mu_i(\beta)) = 0$, where $[D_i]_{jk} = \partial \mu_{ij} / \partial \beta_k$
$\hat{\beta}$: solution of the GEE equations
Properties of GEE Estimates
•The GEE estimator is consistent whether or
not the within-subject
associations/correlations have been correctly
modelled.
•That is, for the GEE estimator to provide a
valid estimate of the true β, we only require
that the model for the mean response has
been correctly specified.
Asymptotic distribution of the GEE
estimator
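• In large samples, the GEE estimator is multivariate
normal with
$E[\hat{\beta}] = \beta$
$\mathrm{cov}(\hat{\beta}) = B^{-1} M B^{-1}$
$B = \sum_{i=1}^{N} D_i' V_i^{-1} D_i$
$M = \sum_{i=1}^{N} D_i' V_i^{-1} \, \mathrm{cov}(Y_i) \, V_i^{-1} D_i$, where $\mathrm{cov}(Y_i)$ is the true covariance matrix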
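Sandwich estimate of $\mathrm{cov}(\hat{\beta})$
$\widehat{\mathrm{cov}}(\hat{\beta}) = \hat{B}^{-1} \hat{M} \hat{B}^{-1}$
Bread: $\hat{B} = \sum_{i=1}^{N} \hat{D}_i' \hat{V}_i^{-1} \hat{D}_i$
Meat: $\hat{M} = \sum_{i=1}^{N} \hat{D}_i' \hat{V}_i^{-1} (Y_i - \hat{\mu}_i)(Y_i - \hat{\mu}_i)' \hat{V}_i^{-1} \hat{D}_i$
Here $(Y_i - \hat{\mu}_i)(Y_i - \hat{\mu}_i)'$ is the consistent estimate of the true
covariance matrix of $Y_i$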
Link to Stata command xtgee for
continuous data
$\mu_i(\beta) = X_i \beta$, $D_i = X_i$
substitute into GEE equations, to get…
• xtgee, identity link, corr(exch)
$\hat{\beta} = \left(\sum_{i=1}^{N} X_i' V_i^{-1} X_i\right)^{-1} \left(\sum_{i=1}^{N} X_i' V_i^{-1} y_i\right)$
$\mathrm{cov}(\hat{\beta}) = \left(\sum_{i=1}^{N} X_i' V_i^{-1} X_i\right)^{-1}$
• Use Weighted Least Squares for $\mathrm{Cov}(\hat{\beta})$
Link to Stata command xtgee for
continuous data (cont’d)
• xtgee, identity link, corr(exch), robust
$\hat{\beta} = \left(\sum_{i=1}^{N} X_i' V_i^{-1} X_i\right)^{-1} \left(\sum_{i=1}^{N} X_i' V_i^{-1} y_i\right)$
$\mathrm{cov}(\hat{\beta}) = \left(\sum_{i=1}^{N} X_i' V_i^{-1} X_i\right)^{-1} \left(\sum_{i=1}^{N} X_i' V_i^{-1} \, \mathrm{Cov}(y_i) \, V_i^{-1} X_i\right) \left(\sum_{i=1}^{N} X_i' V_i^{-1} X_i\right)^{-1}$
• Use Sandwich Estimator for $\mathrm{Cov}(\hat{\beta})$
Link to Stata command xtgee for
binary data
$\mu_i(\beta) = \frac{e^{X_i \beta}}{1 + e^{X_i \beta}}$
$[D_i]_{jk} = \frac{\partial \mu_{ij}}{\partial \beta_k} = \frac{X_{ijk} \, e^{\sum_{l=1}^{K} X_{ijl} \beta_l}}{\left(1 + e^{\sum_{l=1}^{K} X_{ijl} \beta_l}\right)^2}$
• Substitute into GEE equation. But no closed-form
solution, so need iterative procedure.
• Difference between using robust or not is analogous to
continuous data.
• xtgee, logit link, corr(exch)
• xtgee, logit link, corr(exch), robust
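Because the logit-link GEE has no closed-form solution, it is solved iteratively. The sketch below uses Fisher scoring on $\sum_i D_i' V_i^{-1} (y_i - \mu_i(\beta)) = 0$ under simplifying assumptions: two observations per subject, two coefficients, and an exchangeable working correlation `alpha` held fixed (a real fit such as xtgee would also estimate it). The function name and data are hypothetical.

```python
import math

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inv2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def gee_logistic(X_list, y_list, alpha, n_iter=50, tol=1e-10):
    """Fisher scoring for a logit-link GEE with fixed exchangeable correlation."""
    beta = [0.0, 0.0]
    for _ in range(n_iter):
        B = [[0.0, 0.0], [0.0, 0.0]]  # sum_i D_i' V_i^{-1} D_i
        U = [[0.0], [0.0]]            # sum_i D_i' V_i^{-1} (y_i - mu_i)
        for X, y in zip(X_list, y_list):
            mu = [1.0 / (1.0 + math.exp(-(x[0] * beta[0] + x[1] * beta[1])))
                  for x in X]
            v = [m * (1.0 - m) for m in mu]  # v(mu_ij)
            # [D_i]_jk = X_ijk * mu_ij * (1 - mu_ij)
            D = [[x[0] * vj, x[1] * vj] for x, vj in zip(X, v)]
            cov01 = alpha * math.sqrt(v[0] * v[1])
            V = [[v[0], cov01], [cov01, v[1]]]  # A^{1/2} R A^{1/2}
            DtVinv = matmul(transpose(D), inv2(V))
            resid = [[yj - m] for yj, m in zip(y, mu)]
            BD = matmul(DtVinv, D)
            u = matmul(DtVinv, resid)
            B = [[B[a][b] + BD[a][b] for b in range(2)] for a in range(2)]
            U = [[U[a][0] + u[a][0]] for a in range(2)]
        step = matmul(inv2(B), U)
        beta = [beta[0] + step[0][0], beta[1] + step[1][0]]
        if abs(step[0][0]) + abs(step[1][0]) < tol:
            break
    return beta

# Hypothetical crossover-style data: columns are (intercept, treatment).
X_ab = [[1.0, 1.0], [1.0, 0.0]]  # drug first, then placebo
X_ba = [[1.0, 0.0], [1.0, 1.0]]  # placebo first, then drug
X_list = [X_ab] * 4 + [X_ba] * 4
y_list = [[1, 1], [1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 1], [1, 0]]
beta_hat = gee_logistic(X_list, y_list, alpha=0.3)
```

Robust standard errors would be obtained from the same ingredients by forming the sandwich $\hat{B}^{-1} \hat{M} \hat{B}^{-1}$ at the converged `beta`, as in the continuous case.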
Using GEE
Bottom Line
If the scientific focus is on the regression coefficients β:
1. Focus on modeling the mean structure
2. Use a reasonable approximation of the covariance
structure
3. Check the inferences for β by comparing β’s
robust standard errors with respect to different
covariance assumptions
4. If the β’s standard errors differ substantially, a
more careful treatment of the covariance model
might be necessary.
Maximum Likelihood Estimation for
Binary Data
Example: 2x2 crossover trial
Data from the 2x2 crossover trial on cerebrovascular deficiency adapted
from Jones and Kenward, where treatment A and B are active drug and
placebo, respectively: the outcome indicates whether an
electrocardiogram was judged abnormal (0) or normal (1).
Example: 2x2 crossover trial (cont’d)
Goal: To compare the effect of an active drug (A)
and a placebo (B) on cerebrovascular deficiency
• 34 patients received A followed by B
• 33 patients received B followed by A
• Yij = 1 if normal electrocardiogram reading
At period 1:
Example: 2x2 crossover trial (cont’d)
Calculate MLE of odds ratios separately for period 1 and period 2.
Odds ratio of being normal for the active drug versus the placebo is:
This estimate is larger than 1, and therefore indicates that the active
drug produces a higher proportion of normal readings.
However, the estimate is not statistically significant.
Should we compare (???) the data for Periods 1 and 2?
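The period-specific odds ratio calculation can be sketched as follows. The trial's 2x2 tables are not reproduced in this transcript, so the counts below are purely hypothetical and chosen only to show the arithmetic; the function name is also an assumption.

```python
import math

def odds_ratio_wald(a, b, c, d, z=1.96):
    """Odds ratio and Wald interval from a 2x2 table.

    a, b = normal / abnormal readings on the active drug
    c, d = normal / abnormal readings on placebo
    """
    or_hat = (a * d) / (b * c)                    # MLE of the odds ratio
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, (lower, upper)

# Hypothetical counts, NOT the trial data.
or_hat, (ci_low, ci_high) = odds_ratio_wald(20, 10, 15, 15)
```

An interval that contains 1 corresponds to an estimate that is not statistically significant at the chosen level, which is the situation described above.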
Example: 2x2 crossover trial (cont’d)
This approach has several limitations:
1. It ignores the carry-over effect, i.e. the effect of the treatment at period
1 might influence the response at period 2 (treatment x period
interaction)
2. The two responses for the same subject are likely to be correlated
3. In fact, the odds ratio
is estimated to be
So, let’s use GEE to estimate a population average odds ratio, taking
into account within-subject correlation
Example: 2x2 crossover trial (cont’d)
GEE Approach
• We combine data from both periods.
• We can analyze a 2x2 crossover trial as a
longitudinal study with ni = n = 2 and m = 67.
Example: 2x2 crossover trial (cont’d)
GEE Approach (cont’d)
• Fit a logistic regression model:
exp(0.57) - 1 = 0.77
→ Population-average odds of a normal reading are estimated to be 77%
higher when using the drug as compared to the placebo
exp(3.56) = 35
→ Subjects with normal responses at the first visit have odds of a normal
reading at the next visit that are about 35 times higher than those whose
first response was abnormal
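The two interpretations come from exponentiating the logit-scale coefficients quoted on the slide (0.57 and 3.56). A short check of that arithmetic, with variable names chosen here for illustration:

```python
import math

# Coefficients as quoted on the slide (logit scale).
treatment_coef = 0.57        # marginal (GEE) treatment effect
previous_response_coef = 3.56  # effect of the first-visit response

# Percentage increase in the odds of a normal reading on the drug.
pct_increase = (math.exp(treatment_coef) - 1) * 100

# Odds ratio comparing subjects whose first response was normal vs abnormal.
transition_odds_ratio = math.exp(previous_response_coef)
```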
In summary
1. Model 1 includes the treatment x period
interaction (little support from the data), and
estimates marginal odds ratio by GEE
2. Model 2 drops the period x treatment interaction
and estimates the marginal odds ratio by GEE
3. Model 3 assumes that the marginal odds ratio is
1; here the estimate is 0.56, with standard error 0.38
(much larger than under Models 1 and 2)
Note: If we fit Model 3, but using robust standard
errors, then we obtain similar results to the GEE
approach.
• The prevalence of respiratory infections in six consecutive
quarters reveals a positive seasonal trend with a summer
maximum
• The prevalence of xerophthalmia also indicates some
seasonality with a winter maximum
Example: Respiratory Infections
• 275 children in Indonesia were examined for up
to six consecutive quarters for the presence of
respiratory infections (i=1,…,m=275; j=1,…,6
visits).
• Goals of the analysis
• Determine whether prevalence of respiratory
infection is higher among children who suffer
from xerophthalmia (an ocular manifestation
of chronic vitamin A deficiency).
• Estimate the change of respiratory infection
with age.
• Consider seasonality as a potential confounder.
Cross-Sectional Analysis
Model 1: First visit only
•Look only at the data from the first visit
•Fit a logistic regression model of respiratory
infection on xerophthalmia and age, adjusting for
other covariates
•We find a strong non-linear cross-sectional age
effect on the prevalence of respiratory infection
•Cross-sectional analysis suggests that the
prevalence of respiratory infection increases from
age 12 months and reaches its peak at age 20
months before starting to decline
Cross-Sectional Analysis
Model 2: All visits + controlling for seasonality
•Look at data from all visits
•Fit a logistic regression model of respiratory
infection on xerophthalmia and age, adjust for
other covariates
•We still find a strong non-linear cross-sectional
age effect on prevalence of respiratory infection
• The age coefficients in Model 2 can be
interpreted as weighted averages of the cross-sectional
age coefficients for each visit.
Longitudinal Analysis
Here we want to distinguish the contributions of
cross-sectional and longitudinal information to
the estimated relationship of respiratory infection
and age.
Longitudinal Analysis
Model 3: Separate CS from LDA
•Separate differences among sub-populations of
children at different ages and a fixed time (CS)
from changes in children over time (LD)
Longitudinal Analysis
Model 3: Separate CS from LDA
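One common way to separate cross-sectional from longitudinal information is to split each child's age into age at the first visit and time since the first visit, so that the linear predictor takes a form such as $\beta_0 + \beta_C \, \mathrm{age}_{i1} + \beta_L (\mathrm{age}_{ij} - \mathrm{age}_{i1})$. The slide's own equation is not in the transcript, so this form is an assumption. The sketch below only constructs the two covariates; the function name and ages are hypothetical.

```python
def split_age(ages_by_child):
    """Return rows (child, visit, baseline_age, follow_up) for each observation.

    baseline_age carries the cross-sectional (between-child) information;
    follow_up = age at visit - baseline_age carries the longitudinal
    (within-child) information.
    """
    rows = []
    for child, ages in ages_by_child.items():
        baseline = ages[0]  # age at the first visit
        for visit, age in enumerate(ages, start=1):
            rows.append((child, visit, baseline, age - baseline))
    return rows

# Hypothetical ages in months at quarterly visits.
rows = split_age({"c1": [12, 15, 18], "c2": [30, 33]})
```

Fitting a marginal logistic regression on `baseline_age` and `follow_up` as separate covariates then yields distinct cross-sectional and longitudinal coefficients.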
Longitudinal Analysis
Model 4: Separates CS from LDA + controlling for
seasonality
Summary of Results
•Pattern of convex relationship between age and the risk of respiratory
infection appears to coincide with the pattern of seasonality.
•If we include harmonic terms, then the longitudinal parameters
(corresponding to the follow-up in the table) are not statistically
significant.
•The longitudinal information (i.e. variation over time of respiratory
infections versus variations over time of age) is highly confounded by
seasonality.
•Therefore, in the presence of a strong seasonal signal, we can learn little
about the effects of aging from data collected over an 18-month period if
we restrict our attention to longitudinal information.
•However, much can be learned by comparing children at different ages so
long as we can assume that there are not cohort effects confounding the
inferences about age.
•BE CAREFUL – in longitudinal analysis, always look for time-varying
confounders!
Table 8.7. Logistic regressions of the prevalence of respiratory infection
on age and xerophthalmia, adjusting for gender, season, and height for age.
Models 1 and 2 estimate cross-sectional effects; Models 3 and 4 distinguish
cross-sectional from longitudinal effects. Models 2-4 are fitted using the
alternating logistic regression implementation of GEE.
[Table not reproduced in the transcript. Slide annotations: βC, βL;
Model 1: CS, 1st visit only; Model 2: CS, all visits; Models 3 and 4: LD.]
The association between RI and Xero. is positive, although not
statistically significant at the 5% level.
Model 2
Prevalence of respiratory infection increases from age 12 months and
reaches its peak at 20 months before starting to decline.
The age coefficients in
Model 2 can be interpreted
as a weighted average of
the cross-sectional age
coefficients from each visit.
The risk of RI declines in the first 7 to 8 months of follow-up, before
rising later in life.
(ageij – ageik)