Applied Logistic Regression:

Logistic Regression
(An Introduction)
Objective:
Modeling a binary response (success, failure) or probability through
a set of predictors X1…Xp
Let’s consider the following example. In 1846, a group of people
(the Donner Party) attempted to cross the Sierra desert and
mountains (Journal of Anthropological Research 46 (1990), 223-42,
and The Statistical Sleuth (1997) by Ramsey & Schafer).
The available data for the journey of these adults are: survival
(yes/no), age (continuous, in years) and gender (categorical):
Y (survival)   age   Gender
No             23    Male
Yes            40    Female
Yes            40    Male
.              .     .
.              .     .
Yes            25    Female
In formulating the logistic regression model, we need to focus on the
fact that the response is binary or a fraction of successes. Assuming each
Y is Bernoulli (binary), the assumptions are very different from those of
a typical linear regression model with a continuous response and errors.
Here we have
$E(Y \mid X_1, \ldots, X_p) = \pi$
$V(Y \mid X_1, \ldots, X_p) = \pi(1 - \pi)$
As in multiple linear regression, it is advantageous if we can express our
predictors in a linear manner:
$\log[\pi_i/(1 - \pi_i)] = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi}$
The logit function $\log(\pi/(1 - \pi))$ is the log-odds, or the log of the
odds of a success as opposed to a failure. The linear right-hand side is
often written as $\eta(x)$.
The inverse function
$\pi = \exp(\eta(x)) / [1 + \exp(\eta(x))]$
is the direct connection between the predictors in $\eta(x)$ and the
population proportion $\pi$.
Since $\pi$ is $E(Y \mid X_1, \ldots, X_p)$, we are ‘linking’ the mean of Y
to the predictors through the logit function and, in this way, it’s a
‘generalized’ linear model.
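The logit and its inverse can be sketched in a few lines of Python. This is only an illustration; the function names `logit` and `inv_logit` are chosen here and do not come from the slides.

```python
import math

def logit(p):
    """Log-odds of a probability p, valid for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Logistic (inverse logit) function: maps any real eta into (0, 1)."""
    return math.exp(eta) / (1.0 + math.exp(eta))

# Small probabilities give negative logits, large ones give positive logits,
# and applying inv_logit to a logit recovers the original probability.
for p in (0.01, 0.5, 0.99):
    eta = logit(p)
    print(p, round(eta, 3), round(inv_logit(eta), 3))
```

Note that a probability of 0.5 corresponds to a logit of exactly 0 (even odds).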
Why use the logit and logistic functions?
The logit or log-odds ranges from negative infinity to positive
infinity, while the probability $\pi$ itself ranges from 0 to 1.
The logit scale is also intuitively reasonable, since extremely small
probabilities give very negative log-odds (logits) and very high
probabilities give very large positive log-odds (logits).
Note that for the S-shaped logistic function $\pi(x) = \exp[\eta(x)]/[1 + \exp(\eta(x))]$,
the linear part of the model $\eta(x)$ may involve one predictor or several.
Also in support of this S-shaped logistic function is the fact that a
monotonic nonlinear relationship of this form often exists between a
predictor x (e.g. age) and the probability of the event $\pi(x)$. In this
case of a single predictor, the interpretation of the slope
(rate of change) depends on its sign and magnitude. Linear
approximations for certain values of x are also possible.
The odds ratio and the meaning of coefficients in
logistic regression
The odds of survival, P(‘survive’)/P(‘not survive’) or P(Y=1)/P(Y=0), is
$\pi/(1 - \pi)$, and we’re modeling $\log(\pi/(1 - \pi))$ as a function of X1, ..., Xp.
Suppose we originally have only one categorical predictor X1
(gender, for example), so that our model is:
$\log(\pi/(1 - \pi)) = \beta_0 + \beta_1 X_1$
Assuming we code X1 = 0 for male and X1 = 1 for female, the log-odds
that the response (survival) is 1 as compared to 0 (not survived) for
males is $\beta_0$. The odds of survival for males is $e^{\beta_0}$
(written as $\exp(\beta_0)$). For females the log-odds is $\beta_0 + \beta_1$,
so $\exp(\beta_1)$ is the odds ratio of survival for females relative to males.
Now consider adding a continuous predictor (age, X2) so that our
model is the additive model $\log(\pi/(1 - \pi)) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$,
and let’s interpret some possible odds ratios.
The odds ratio of survival for a 45 year old woman as compared to a
30 year old woman:
$\exp[\beta_0 + \beta_1(1) + \beta_2(45)] / \exp[\beta_0 + \beta_1(1) + \beta_2(30)] = \exp[(45 - 30)\beta_2] = \exp[15\beta_2]$
In general, for fixed values of the other predictors (in this case
gender is held constant while age changes), the odds ratio
comparing Xk = A with Xk = B is $\exp[(A - B)\beta_k]$.
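A quick numerical check of this cancellation, as a sketch only. The coefficient values below are made up for illustration and are not estimates from the Donner data; the helper `odds` is a name introduced here.

```python
import math

def odds(b0, b1, b2, x1, x2):
    """Odds of success under the additive model log-odds = b0 + b1*x1 + b2*x2."""
    return math.exp(b0 + b1 * x1 + b2 * x2)

# Illustrative (made-up) coefficients, NOT estimates from the data
b0, b1, b2 = 0.5, 1.0, -0.1

# Odds ratio for a 45 year old woman versus a 30 year old woman (x1 = 1 for both)
ratio = odds(b0, b1, b2, x1=1, x2=45) / odds(b0, b1, b2, x1=1, x2=30)

# Shortcut: the intercept and gender terms cancel, leaving exp[(A - B) * b2]
shortcut = math.exp((45 - 30) * b2)

print(ratio, shortcut)  # the two agree up to floating-point rounding
```

Changing `b0` or `b1` leaves the ratio unchanged, which is what "holding the other predictors fixed" means here.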
Let’s return to the earlier example:
20 of 45 survived the journey. 10 of 15 females survived, while only 10 of
30 males survived. The logistic regression model fit with both predictors,
as done by SAS’s PROC LOGISTIC (or GENMOD), is as follows:
Analysis of Maximum Likelihood Estimates
Parameter      DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept       1    1.6331    1.1102           2.1637            0.1413
gender    1     1    1.5973    0.7555           4.4699            0.0345
gender    0     0    0         .                .                 .
age             1   -0.0782    0.0373           4.3988            0.0360
The estimated odds ratio of survival for females relative to males (of
equal/constant age) is $\exp(\hat\beta_{gender}) = \exp(1.5973) = 4.94$. This
certainly seems reasonable given the larger fraction of females surviving.
A point estimate for the odds ratio of a 45 year old woman surviving as
opposed to a 30 year old woman: $\exp[15\hat\beta_{age}] = \exp[(15)(-0.0782)] = 0.309$
(note that gender is held constant).
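These two point estimates are easy to reproduce from the reported coefficients. The variable names are chosen here for illustration; the numbers are the estimates from the SAS output.

```python
import math

beta_gender = 1.5973   # reported estimate for gender (female vs male)
beta_age = -0.0782     # reported estimate for age (per year)

# Odds ratio, female vs male, at the same age
or_female_vs_male = math.exp(beta_gender)

# Odds ratio, 45 year old vs 30 year old, same gender
or_45_vs_30 = math.exp((45 - 30) * beta_age)

print(round(or_female_vs_male, 2))  # 4.94
print(round(or_45_vs_30, 3))        # 0.309
```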
Looking at the actual probability of survival for a particular gender and
age, what was the estimated probability of surviving the journey for a
25 year old male?
$\mathrm{logit}(\hat\pi_i) = 1.633 + 1.5973(0) - 0.0782(25) = -0.322$
Using the logistic function (our inverse link), the estimated
probability for such a man to have survived the journey was
$\exp(-0.322)/[1 + \exp(-0.322)] = 0.420$.
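The same calculation as a small function, as a sketch that assumes the coding 0 = male, 1 = female used in the text. The name `predicted_probability` is hypothetical; the default coefficients are the reported estimates.

```python
import math

def predicted_probability(gender, age,
                          b0=1.6331, b_gender=1.5973, b_age=-0.0782):
    """Estimated survival probability from the fitted additive model.

    gender: 0 = male, 1 = female (coding assumed from the text).
    """
    eta = b0 + b_gender * gender + b_age * age   # linear predictor (logit)
    # 1 / (1 + exp(-eta)) is algebraically the same as exp(eta) / (1 + exp(eta))
    return 1.0 / (1.0 + math.exp(-eta))

p_male_25 = predicted_probability(gender=0, age=25)
print(round(p_male_25, 3))  # 0.42
```

Calling the function with `gender=1` and the same age gives the corresponding estimate for a woman, which is higher because the gender coefficient is positive.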
Returning to the Donner survival example, the estimates shown above in the
Analysis of Maximum Likelihood Estimates output were obtained by maximum
likelihood. Maximum likelihood estimation (using a binomial model for the
response) is used by SAS and, for the given data, the most likely estimates
for $\beta_0$, $\beta_{gender}$, $\beta_{age}$ are 1.6331, 1.5973 and -0.0782
respectively.
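To make "most likely" concrete, here is a sketch of the Bernoulli log-likelihood that maximum likelihood estimation maximizes. The full data set is not reproduced in these slides, so the example only evaluates the four rows shown in the data excerpt earlier; it does not refit the model, and the function name `log_likelihood` is chosen here.

```python
import math

# The four rows displayed in the data excerpt: (survived, age, gender),
# with gender coded 0 = male, 1 = female as in the text.
rows = [(0, 23, 0), (1, 40, 1), (1, 40, 0), (1, 25, 1)]

def log_likelihood(b0, b_gender, b_age, data):
    """Bernoulli log-likelihood: sum of y*log(p) + (1 - y)*log(1 - p)."""
    total = 0.0
    for y, age, gender in data:
        eta = b0 + b_gender * gender + b_age * age
        p = 1.0 / (1.0 + math.exp(-eta))          # inverse logit
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Contribution of these four rows at the reported estimates
ll_excerpt = log_likelihood(1.6331, 1.5973, -0.0782, rows)
print(ll_excerpt)
```

With all 45 observations in `rows`, the reported estimates would be the coefficient values that make this sum as large as possible.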
What about inference and test statistics such as those above?
Testing & Inference
In the long run, we believe that these maximum likelihood estimates will
be approximately normally distributed (for reasonably large samples).
This approximate sampling distribution (and the standard error for $\hat\beta_j$)
leads to tests of $H_0: \beta_j = 0$ that are analogous to those in multiple
linear regression. The statistic $\hat\beta_j / se(\hat\beta_j)$ is a Wald Z
statistic, and its square $[\hat\beta_j / se(\hat\beta_j)]^2$ is a Wald
chi-square statistic on 1 df (as in the SAS output above).
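The Wald statistics and p-values in the SAS output can be approximately reproduced from the rounded estimates and standard errors. This is a sketch using only the Python standard library; `wald_test` is a name chosen here, and small differences from the SAS values are expected because the inputs are rounded.

```python
import math

def wald_test(estimate, std_error):
    """Return the Wald Z statistic, Wald chi-square (1 df) and its p-value."""
    z = estimate / std_error
    chi_square = z * z
    # P(chi-square_1 > z^2) = P(|Z| > |z|) = erfc(|z| / sqrt(2)) for standard normal Z
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, chi_square, p_value

reported = [("Intercept", 1.6331, 1.1102),
            ("gender", 1.5973, 0.7555),
            ("age", -0.0782, 0.0373)]

for name, est, se in reported:
    z, chi2, p = wald_test(est, se)
    print(name, round(z, 3), round(chi2, 3), round(p, 4))
```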
In this example, both predictors (gender, age) are significant using this
approximate test. Assuming this approximate normal distribution for the
estimate, a 95% confidence interval for $\beta_{age}$ (the change in log-odds
for a unit change in age) is $-0.078 \pm (1.96)(0.0373)$, or (-.1511, -.00489).
Exponentiating the point estimate gives exp(-.078) = 0.9249 (the
multiplicative change in the odds for a 1-year increase in age).
Exponentiating the interval endpoints gives the 95% Wald confidence
limits (0.860, 0.995) for the odds ratio (given by SAS’s PROC
LOGISTIC for the fitted model).
Odds Ratio Estimates
Effect           Point Estimate   95% Wald Confidence Limits
Gender 0 vs 1    4.940            1.124    21.716
Age              0.925            0.860     0.995
Note that this interval (0.860, 0.995) doesn’t contain 1, and this
corresponds to the earlier evidence (p=.036) against $H_0: \beta_{age} = 0$. The
endpoints of this approximate confidence interval (C.I.) both being
less than 1 suggest less likely survival with increased age.
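The Wald limits for the odds ratios can be recomputed from the estimates and standard errors. A minimal sketch, assuming the usual 1.96 multiplier for 95% intervals; `wald_ci` is a name chosen here, and results may differ from SAS in the last digit because the inputs are rounded.

```python
import math

def wald_ci(estimate, std_error, z=1.96):
    """Approximate 95% Wald interval on the log-odds scale."""
    return estimate - z * std_error, estimate + z * std_error

# Age: interval for the log-odds change per year, then for the odds ratio
age_lo, age_hi = wald_ci(-0.0782, 0.0373)
age_or_ci = (math.exp(age_lo), math.exp(age_hi))

# Gender: same steps with the gender estimate and standard error
gen_lo, gen_hi = wald_ci(1.5973, 0.7555)
gender_or_ci = (math.exp(gen_lo), math.exp(gen_hi))

print([round(v, 3) for v in age_or_ci])
print([round(v, 3) for v in gender_or_ci])
```

Exponentiating preserves the ordering of the endpoints, so an interval for the coefficient that excludes 0 always becomes an odds ratio interval that excludes 1.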
What is an approximate C.I. for the odds ratio of survival on this journey
for a difference of 25 years in age (gender held constant)?
The approximate C.I. for $\beta_{age}$ (the change in log-odds for a unit
change in age) was (-.1511, -.00489).
The endpoints of the desired interval for the log-odds are
[(25)(-.1511), (25)(-.00489)] = [-3.78, -0.122]. Back-transforming, this
interval gives a 95% confidence interval for the odds ratio of
(.023, .885).
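The scaling step can be sketched as a small helper: multiply both log-odds endpoints by the age difference, then exponentiate. The name `odds_ratio_ci` is hypothetical; the interval passed in is the unit-change interval quoted in the text.

```python
import math

def odds_ratio_ci(log_odds_ci, delta):
    """Odds ratio C.I. for a change of `delta` units in the predictor.

    log_odds_ci: (lower, upper) interval for the coefficient (unit change).
    """
    lower, upper = log_odds_ci
    # Sorting keeps the endpoints in order even if delta is negative
    scaled_lo, scaled_hi = sorted((delta * lower, delta * upper))
    return math.exp(scaled_lo), math.exp(scaled_hi)

ci_25_years = odds_ratio_ci((-0.1511, -0.00489), delta=25)
print(ci_25_years)
```

With `delta=1` the helper simply exponentiates the unit-change interval, which gives the per-year odds ratio limits.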
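SAS code (additive model):
PROC GENMOD is similar, but differences in syntax and capabilities
(e.g. model selection) exist.
title 'Donner Party Survival Example';
title2 'Proc Logistic Results';
/* no DATA= option is given, so the most recently created data set is used */
proc logistic;
  /* gender is a classification variable with GLM (dummy) coding */
  class gender / param=GLM descending;
  /* model the probability that survival = 1 from gender and age */
  model survival (event='1') = gender age;
run;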