#### Lecture Transcript: Introduction to Logistic Regression and Generalized Linear Models

```
Introduction to logistic regression and Generalized Linear Models

Karen Bandeen-Roche, PhD
Department of Biostatistics
Johns Hopkins University
July 14, 2011

Introduction to Statistical Measurement and Modeling
Data motivation

- Osteoporosis data
- Scientific question: Can we detect osteoporosis earlier and more safely?
- Some related statistical questions:
  - How does the risk of osteoporosis vary as a function of measures commonly used to screen for osteoporosis?
  - Does age confound the relationship of screening measures with osteoporosis risk?
  - Do ultrasound and DPA measurements discriminate osteoporosis risk independently of each other?
Outline

- Why we need to generalize linear models
- Generalized Linear Model specification
  - Systematic, random model components
  - Maximum likelihood estimation
- Logistic regression as a special case of GLM
  - Systematic model / interpretation
  - Inference
  - Example
Regression for categorical outcomes

Why not just apply linear regression to categorical Y's?

- The linear model (A1) will often be unreasonable.
- The assumption of equal variances (A3) will nearly always be unreasonable.
- The assumption of normality will never be reasonable.
Introduction: Regression for binary outcomes

Yi = 1{event occurs for sampling unit i}
   = 1 if the event occurs
   = 0 otherwise

pi = probability that the event occurs for sampling unit i
   := Pr{Yi = 1}

- Begin by generalizing the random model (A5):
- Probability mass function: Bernoulli
    Pr{Yi = 1} = pi;  Pr{Yi = 0} = 1 - pi
    (all other y occur with 0 probability)
    fYi(y) := Pr{Yi = y} = pi^y (1 - pi)^(1-y),  y in {0, 1}
Binary regression

- By assuming Bernoulli: (A3) is definitely not reasonable
  - Var(Yi) = pi(1 - pi)
  - The variance is not constant; rather, it is a function of the mean
- Systematic model
  - The goal remains to describe E[Yi|xi]
  - The expectation of Bernoulli Yi is pi
  - To achieve a reasonable linear model (A1), describe some function of E[Yi|xi] as a linear function of covariates:
      g(E[Yi|xi]) = xi'β
  - Some common g: log, logit log{p/(1-p)}, probit
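# A sketch, not from the slides, of those three link functions applied
# to a probability p = E[Y|x], using only the Python standard library.
import math
from statistics import NormalDist

def log_link(p):
    return math.log(p)

def logit_link(p):
    return math.log(p / (1 - p))      # the log odds of p

def probit_link(p):
    return NormalDist().inv_cdf(p)    # inverse standard normal CDF

# All three map probabilities in (0, 1) onto a real-valued scale
# suitable for equating with the linear predictor x'beta.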
General framework: Generalized Linear Models

- Random model
  - Y ~ a density or mass function, fY, not necessarily normal
  - Technical aside: fY within the “exponential family”
- Systematic model
  - g(E[Yi|xi]) = xi'β = ηi
  - “g” = “link function”; “xi'β” = “linear predictor”
- Reference: Nelder JA, Wedderburn RWM. Generalized linear models. Journal of the Royal Statistical Society, Series A 1972; 135: 370-384.
Types of Generalized Linear Models

Model (link function)  | Response               | Distribution   | Regression Coef Interp
-----------------------|------------------------|----------------|---------------------------------------
Linear                 | Continuous             | Gaussian       | Change in ave(Y) per unit change in X
Logistic               | Binary                 | Binomial       | Log odds ratio
Log-linear             | Times to events/counts | Poisson        | Log relative rate
Proportional hazards   | Times to events        | Semiparametric | Log hazard
Estimation

- The estimate β̂ maximizes L(β, a; y, X) = product over i = 1,...,n of fYi(yi; μ(xi, β), a)
- General method: maximum likelihood (Fisher)
  - Given {Y1,...,Yn} distributed with joint density or mass function fY(y; θ), a likelihood function L(θ; y) is any function (of θ) that is proportional to fY(y; θ).
  - If sampling is random, {Y1,...,Yn} are statistically independent, and L(θ; y) ∝ the product of the individual fYi.
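# A sketch, with hypothetical data, of that independence fact for
# Bernoulli observations: the likelihood is the product of the
# individual mass functions, so the log likelihood is a sum.
import math

def bernoulli_loglik(p, ys):
    """log L(p; y) = sum_i [ y_i log p + (1 - y_i) log(1 - p) ]."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y in ys)

ys = [1, 0, 1, 1, 0]                          # hypothetical 0/1 outcomes
product = 1.0
for y in ys:                                  # product of individual fYi
    product *= (0.6 ** y) * (0.4 ** (1 - y))
# math.log(product) agrees with bernoulli_loglik(0.6, ys)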
Maximum likelihood

- The maximum likelihood estimate (MLE), θ̂, maximizes L(θ; y):

  [Figure: L(θ) plotted over the parameter space Θ, with its maximum attained at θ̂]

- Under broad assumptions, MLEs are asymptotically
  - Unbiased (consistent)
  - Efficient (most precise / lowest variance)
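# A sketch of maximization in the simplest case, with hypothetical
# data: for i.i.d. Bernoulli(p), a grid search over p recovers the
# familiar MLE, the sample proportion.
import math

def bernoulli_loglik(p, ys):
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y in ys)

ys = [1, 0, 0, 1, 1, 0, 1, 1]                  # hypothetical outcomes
grid = [i / 1000 for i in range(1, 1000)]      # candidate p values
p_hat = max(grid, key=lambda p: bernoulli_loglik(p, ys))
# p_hat equals the sample mean sum(ys)/len(ys) = 0.625, which lies
# exactly on the grid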
Logistic regression

- Yi binary with pi = Pr{Yi = 1}
  - Example: Yi = 1{person i diagnosed with heart disease}
- Simple logistic regression (1 covariate)
  - Random model: Bernoulli / binomial
  - Systematic model: log{pi/(1 - pi)} = β0 + β1xi
    - log odds; logit(pi)
- Parameter interpretation
  - β0 = log(heart disease odds) in the subpopulation with x = 0
  - β1 = log{p(x+1)/(1 - p(x+1))} - log{p(x)/(1 - p(x))}
Logistic regression: interpretation notes

- β1 = log{p(x+1)/(1 - p(x+1))} - log{p(x)/(1 - p(x))}
     = log[ {p(x+1)/(1 - p(x+1))} / {p(x)/(1 - p(x))} ]
- exp(β1) = {p(x+1)/(1 - p(x+1))} / {p(x)/(1 - p(x))}
  = odds ratio for the association of prevalent heart disease with each (say) one-year increment in age
  = factor by which the odds of heart disease increase or decrease with each 1-year cohort of age
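# A worked sketch of this interpretation; the coefficient beta1 = 0.08
# per year of age and the baseline risk 0.10 are hypothetical.
import math

beta1 = 0.08                        # hypothetical log odds ratio
odds_ratio = math.exp(beta1)        # odds multiply by this per year

def odds(p):
    return p / (1 - p)

p_x = 0.10                          # hypothetical risk at age x
odds_next = odds(p_x) * odds_ratio  # odds at age x + 1
p_next = odds_next / (1 + odds_next)
# the log of the ratio of the two odds recovers beta1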
Multiple logistic regression

- Systematic model: log{pi/(1 - pi)} = β0 + β1xi1 + ... + βpxip
- Parameter interpretation
  - β0 = log(heart disease odds) in the subpopulation with all x = 0
  - βj = difference in log outcome odds comparing subpopulations that differ by 1 on xj and whose values on all other covariates are the same
    - “Adjusting for,” “controlling for” the other covariates
- One can define variables contrasting outcome odds differences between groups, nonlinear relationships, interactions, etc., just as in linear regression
Logistic regression - prediction

- Translation from ηi to pi:
  - log{pi/(1 - pi)} = β0 + β1xi1 + ... + βpxip
  - Then pi = exp(β0 + β1xi1 + ... + βpxip) / [1 + exp(β0 + β1xi1 + ... + βpxip)]
            = exp(ηi) / [1 + exp(ηi)]
    = the logistic function of ηi
- The graph of pi versus ηi has a sigmoid shape
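# A sketch of the logistic function above, using only the standard
# library.
import math

def logistic(eta):
    """Inverse of the logit: maps the linear predictor into (0, 1)."""
    return math.exp(eta) / (1 + math.exp(eta))

# logit(logistic(eta)) returns eta, and logistic(0) = 0.5, the
# midpoint of the sigmoid.
check = math.log(logistic(2.0) / (1 - logistic(2.0)))   # ~2.0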
GLMs - Inference

- The negative inverse Hessian matrix of the log likelihood estimates the variance-covariance matrix of β̂
- SE(β̂j) is obtained as the square root of the jth diagonal entry
- Typically, β̂ is substituted for β in this calculation
- “Wald” inference applies the paradigm from Lecture 2:
    Z = (β̂j - β0j) / SE(β̂j) is asymptotically ~ N(0,1) under H0: βj = β0j
  - Z provides a test statistic for H0: βj = β0j versus HA: βj ≠ β0j
  - β̂j ± z(1-α/2) SE(β̂j) = (L, U) is a (1-α)x100% CI for βj
  - {exp(L), exp(U)} is a (1-α)x100% CI for exp(βj)
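# A numeric sketch of Wald inference; the estimate, standard error,
# and null value below are all hypothetical.
import math

beta_hat, se_hat, beta_null = 0.08, 0.03, 0.0   # hypothetical values
z = (beta_hat - beta_null) / se_hat             # Wald statistic
z_crit = 1.959964                               # z_{0.975} for alpha = 0.05
L = beta_hat - z_crit * se_hat                  # 95% CI for beta_j
U = beta_hat + z_crit * se_hat
ci_odds_ratio = (math.exp(L), math.exp(U))      # 95% CI for exp(beta_j)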
GLMs: “Global” Inference

- Analog: F-testing in linear regression
  - The only difference: log likelihoods replace sums of squares
- The hypothesis to be tested is H0: βj1 = ... = βjk = 0
  - Fit the model excluding xj1,...,xjk; save its -2 log likelihood = LS
  - Fit the “full” (or larger) model adding xj1,...,xjk to the smaller model; save its -2 log likelihood = LL
  - Test statistic: S = LS - LL
  - Distribution under the null hypothesis: χ² with k degrees of freedom
  - Define the rejection region based on this distribution, compute S, and reject or not according to whether S falls in the rejection region
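# A numeric sketch of the global test; the -2 log likelihoods and
# k = 3 excluded covariates are hypothetical.
Ls = 250.4              # -2 log likelihood, smaller model (hypothetical)
LL = 241.1              # -2 log likelihood, larger model (hypothetical)
S = Ls - LL             # likelihood ratio test statistic
chi2_crit = 7.815       # upper 5% point of chi-square with 3 df
reject = S > chi2_crit  # here S is about 9.3, so H0 is rejected at 0.05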
GLMs: “Global” Inference (continued)

- Many programs refer to “deviance” rather than -2 log likelihood
  - This quantity equals the difference in -2 log likelihoods between one's fitted model and a “saturated” model
  - Deviance measures “fit”
  - Differences in deviances can be substituted for differences in -2 log likelihood in the method given on the previous page
- Likelihood ratio tests have appealing optimality properties
Outline: A few more topics

- Model checking: residuals, influence points
- ML can be written as an iteratively reweighted least squares algorithm
- Predictive accuracy
- The framework generalizes easily
Main Points

- Generalized linear modeling provides a flexible regression framework for a variety of response types
  - Continuous, categorical measurement scales
  - Probability distributions tailored to the outcome
  - Systematic model to accommodate measurement range, interpretation
- Logistic regression
  - Binary responses (yes, no)
  - Bernoulli / binomial distribution
  - Regression coefficients as log odds ratios for the association between predictors and outcomes
Main Points (continued)

- Generalized linear modeling accommodates description, inference, and adjustment with the same flexibility as linear modeling
- Inference
  - “Wald”: statistical tests and confidence intervals via parameter estimator standardization
  - “Likelihood ratio” / “global”: via comparison of log likelihoods from nested models
```