What you've always wanted to know about logistic regression analysis, but were afraid to ask...
February 1, 2010
Gerrit Rooks
Sociology of Innovation
Innovation Sciences & Industrial Engineering
Phone: 5509
email: [email protected]
This Lecture
• Why logistic regression analysis?
• The logistic regression model
• Estimation
• Goodness of fit
• An example
What's the difference between 'normal' regression and logistic regression?
Regression analysis:
– Relate one or more independent (predictor) variables to a dependent (outcome) variable
What's the difference between 'normal' regression and logistic regression?
• Often you will be confronted with outcome variables that are dichotomous:
– success vs failure
– employed vs unemployed
– promoted or not
– sick or healthy
– pass or fail an exam
Example
Relationship between hours studied for an exam and success:

Hours   # Failed exam   # Passed exam   Total # students   Prob. pass exam
28      4               2               6                  .33
29      3               2               5                  .40
30      2               7               9                  .78
31      2               7               9                  .78
32      4               16              20                 .80
33      1               14              15                 .93
Linear regression analysis
Why is this wrong? A straight line can predict probabilities below 0 or above 1, and a dichotomous outcome violates the normality and constant-variance assumptions of linear regression.
Logistic Regression
The better alternative
The logistic regression equation
Predicting probabilities:

P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + \varepsilon_i)}}

The left-hand side is the predicted probability (always between 0 and 1); the linear predictor in the exponent is similar to regression analysis.
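As a concrete illustration, here is a minimal sketch of this equation in Python (NumPy assumed; the coefficient values are taken from the study-hours example later in this lecture):

```python
import numpy as np

def predicted_probability(x, b0, b1):
    """The logistic regression equation: P(Y) = 1 / (1 + e^-(b0 + b1*x)).
    The result is always between 0 and 1, whatever x is."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

# Probability of passing after 30 study hours, using coefficients
# from the worked example further on (b0 = -6.44, b1 = 0.39)
print(predicted_probability(30, b0=-6.44, b1=0.39))  # about 0.99
```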
The Logistic function
Sometimes authors rearrange the model:

P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + \varepsilon_i)}} = \frac{e^{(b_0 + b_1 X_1 + \varepsilon_i)}}{1 + e^{(b_0 + b_1 X_1 + \varepsilon_i)}}

or also

\ln\left(\frac{p(y=1)}{1 - p(y=1)}\right) = c_0 + c_1 x_1 + c_2 x_2 + \ldots + c_n x_n
How do we estimate coefficients?
Maximum-likelihood estimation
• Parameters are estimated by 'fitting' models, based on the available predictors, to the observed data
• The chosen model is the one that fits the data best, i.e. is closest to the data
• Fit is determined by the so-called log-likelihood statistic
Maximum likelihood estimation
The log-likelihood statistic:

LL = \sum_{i=1}^{N} \left\{ Y_i \ln(P(Y_i)) + (1 - Y_i) \ln[1 - P(Y_i)] \right\}

Large (strongly negative) values of LL indicate poor fit of the model.
HOWEVER, THIS STATISTIC CANNOT BE USED TO EVALUATE THE FIT OF A SINGLE MODEL
An example to illustrate maximum likelihood and the log-likelihood statistic
Suppose we know hours spent studying and the outcome of an exam:

Quantity of study hours   Outcome
3                         0
34                        1
17                        0
6                         0
12                        0
15                        1
26                        1
29                        1
In ML different values for the parameters are 'tried'
Let's look at two possibilities: (1) b0 = 0 and b1 = 0.05; (2) b0 = -6.44 and b1 = 0.39.

P(Y) = \frac{1}{1 + e^{-(0.05 X_1)}}        P(Y) = \frac{1}{1 + e^{-(-6.44 + 0.39 X_1)}}

Quantity of study hours   Outcome   Predicted probability    Predicted probability
                                    (b0 = 0; b1 = 0.05)      (b0 = -6.44; b1 = 0.39)
3                         0         .53                      .01
34                        1         .85                      .99
17                        0         .71                      .53
6                         0         .57                      .02
12                        0         .65                      .14
15                        1         .68                      .34
26                        1         .79                      .97
29                        1         .81                      .99
We are now able to calculate the log-likelihood statistic:

LL = \sum_{i=1}^{N} \left\{ Y_i \ln(P(Y_i)) + (1 - Y_i) \ln[1 - P(Y_i)] \right\}

Quantity of study hours   Outcome   Predicted probability   LL contribution
                                    (b0 = 0; b1 = 0.05)     (b0 = 0; b1 = 0.05)
3                         0         .53                     -.75
34                        1         .85                     -.16
17                        0         .71                     -1.24
6                         0         .57                     -.84
12                        0         .65                     -1.05
15                        1         .68                     -.39
26                        1         .79                     -.24
29                        1         .81                     -.21
Two models and their log-likelihood statistic
Based on a clever algorithm, the model with the best fit (LL closest to 0) is chosen:

Outcome   Pr (b0 = 0;   LL (b0 = 0;   Pr (b0 = -6.44;   LL (b0 = -6.44;
          b1 = 0.05)    b1 = 0.05)    b1 = 0.39)        b1 = 0.39)
0         .53           -.75          .01               -.01
1         .85           -.16          .99               -.01
0         .71           -1.24         .53               -.75
0         .57           -.84          .02               -.02
0         .65           -1.05         .14               -.15
1         .68           -.39          .34               -1.08
1         .79           -.24          .97               -.03
1         .81           -.21          .99               -.01
Sum                     -4.88                           -2.0716
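A sketch of this calculation in Python (NumPy assumed), reproducing the two sums above:

```python
import numpy as np

hours   = np.array([3, 34, 17, 6, 12, 15, 26, 29])
outcome = np.array([0, 1,  0, 0,  0,  1,  1,  1])

def log_likelihood(y, x, b0, b1):
    """LL = sum over observations of y*ln(P) + (1-y)*ln(1-P)."""
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(log_likelihood(outcome, hours, b0=0.0, b1=0.05))    # about -4.88
print(log_likelihood(outcome, hours, b0=-6.44, b1=0.39))  # about -2.04
# (small differences from the slide's -2.0716 come from rounded coefficients)
```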
After estimation
How do I determine significance?

P(Y) = \frac{1}{1 + e^{-(-6.44 + 0.39 \cdot \text{studyhours})}}

• Obviously SPSS does all the work for you
• How to interpret the output of SPSS? Two major issues:
1. Overall model fit
– Between-model comparisons
– Pseudo R-square
– Predictive accuracy / classification test
2. Coefficients
– Wald test
– Likelihood ratio test
– Odds ratios
Model fit: between-model comparison
The log-likelihood ratio test statistic can be used to test the fit of a model:

\chi^2 = 2[LL(\text{New}) - LL(\text{Baseline})]

where LL(New) is the model fit of the full model and LL(Baseline) is the model fit of the reduced model. The test statistic has a chi-square distribution.
Model fit
The log-likelihood ratio test statistic can be used to test the fit of a model:

\chi^2 = 2[LL(\text{New}) - LL(\text{Baseline})]

Full model: P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}        Reduced model: P(Y) = \frac{1}{1 + e^{-(b_0)}}
Between-model comparison
• Estimate a null (baseline) model: P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}
• Estimate an improved model, which contains more variables: P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + b_2 X_2)}}
• Assess the difference in -2LL between the models (see the sketch below): \chi^2 = 2[LL(\text{New}) - LL(\text{Baseline})]
• This difference follows a chi-square distribution, with degrees of freedom = # estimated parameters in the proposed model - # estimated parameters in the null model
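A sketch of this comparison in Python (scipy assumed); the log-likelihood values and parameter counts here are illustrative, not taken from the slides:

```python
from scipy.stats import chi2

ll_baseline = -51.0   # LL of the null model (illustrative)
ll_new      = -24.0   # LL of the model with extra predictors (illustrative)
k_baseline, k_new = 1, 3   # number of estimated parameters per model

chi_square = 2 * (ll_new - ll_baseline)   # likelihood-ratio test statistic
df = k_new - k_baseline                   # degrees of freedom
p_value = chi2.sf(chi_square, df)         # upper tail of the chi-square distribution
print(chi_square, df, p_value)
```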
Overall model fit
R and R²

R^2 = \frac{\text{SS due to regression}}{\text{Total SS}} = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}

R² in multiple regression is a measure of the variance explained by the model.
Overall model fit
Pseudo R²

R^2_{\text{LOGIT}} = \frac{-2LL(\text{Model})}{-2LL(\text{Original})}

where -2LL(Model) is based on the log-likelihood of the model that you want to test, and -2LL(Original) on the log-likelihood of the model before any predictors were entered.
Just like in multiple regression, logit R² ranges from 0.0 to 1.0.
– Cox and Snell: cannot theoretically reach 1
– Nagelkerke: adjusted so that it can reach 1
NOTE: R² in logistic regression tends to be (even) smaller than in multiple regression.
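A sketch of the Cox & Snell and Nagelkerke variants in Python, assuming their standard textbook formulas (the log-likelihood values and sample size are illustrative):

```python
import numpy as np

ll_null  = -51.0   # LL before any predictors were entered (illustrative)
ll_model = -24.0   # LL of the model you want to test (illustrative)
n = 75             # sample size (illustrative)

# Cox & Snell: 1 - (L0/L1)^(2/n); cannot theoretically reach 1
r2_cox_snell = 1 - np.exp((2.0 / n) * (ll_null - ll_model))

# Nagelkerke: Cox & Snell rescaled by its maximum value so it can reach 1
r2_nagelkerke = r2_cox_snell / (1 - np.exp((2.0 / n) * ll_null))

print(r2_cox_snell, r2_nagelkerke)
```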
What is a small or large R and R²?
Strength of correlation:
Small    0.10 to 0.29
Medium   0.30 to 0.49
Large    0.50 to 1.00
Overall model fit
Classification table: how well does the model predict outcomes?

SPSS output:
Classification Table(a)
                                        Predicted
Observed                                Missed Penalty   Scored Penalty   Percentage Correct
Step 1   Missed Penalty                 30               5                85.7
         Scored Penalty                 7                33               82.5
         Overall Percentage                                               84.0
a. The cut value is .500

This means that if the model predicts a score with a probability above .50 (e.g. .51), the prediction is counted as a score; a probability below .50 counts as a miss.
Testing significance of coefficients
The Wald statistic: not really good

\text{Wald} = \frac{b}{SE_b}

(the estimate divided by the standard error of the estimate)
• In linear regression analysis this kind of statistic, following a t-distribution, is used to test significance
• In logistic regression something similar exists
• However, when b is large, the standard error tends to become inflated; the Wald statistic is then underestimated, so Type II errors are more likely (see the sketch below)
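A small sketch of the Wald test in Python (scipy assumed), using the coefficient for previous from the SPSS output further on; note that SPSS reports the squared form of the statistic:

```python
from scipy.stats import norm

b, se = 0.065, 0.022           # 'previous' coefficient from the SPSS output below
z = b / se                     # Wald statistic in its z form
wald = z ** 2                  # SPSS reports the squared (chi-square) form
p_value = 2 * norm.sf(abs(z))  # two-sided p-value
print(z, wald, p_value)        # wald about 8.7 (SPSS shows 8.609 from unrounded inputs)
```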
Likelihood ratio test
An alternative way to test the significance of a coefficient
To avoid Type II errors, for some variables it is best to use the likelihood ratio test:

\chi^2 = 2[LL(\text{With}) - LL(\text{Without})]

Model with the variable: P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}        Model without the variable: P(Y) = \frac{1}{1 + e^{-(b_0)}}
Before we go to the example
A recap
• Logistic regression
– dichotomous outcome
– logistic function
– log-likelihood / maximum likelihood
• Model fit
– likelihood ratio test (compare LL of models)
– pseudo R-square
– classification table
– Wald test
Illustration with SPSS
• Penalty kicks data, variables:
– Scored: outcome variable, 0 = penalty missed, 1 = penalty scored
– Pswq: degree to which a player worries
– Previous: percentage of penalties scored by a particular player in their career
A parallel fit outside SPSS is sketched below.
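SPSS does the estimation for you, but the same model can be fitted elsewhere. Below is a hedged sketch with Python's statsmodels; the file name penalty.csv and the lower-case column names are assumptions, not part of the original data set:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical file; columns assumed: scored, pswq, previous
df = pd.read_csv("penalty.csv")

X = sm.add_constant(df[["previous", "pswq"]])  # add the intercept term
model = sm.Logit(df["scored"], X).fit()        # maximum-likelihood estimation

print(model.summary())       # coefficients, standard errors, z (Wald) tests
print(np.exp(model.params))  # Exp(B): the odds ratios
```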
SPSS OUTPUT Logistic Regression

Case Processing Summary
Unweighted Cases(a)                      N    Percent
Selected Cases    Included in Analysis   75   100.0
                  Missing Cases          0    .0
                  Total                  75   100.0
Unselected Cases                         0    .0
Total                                    75   100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value    Internal Value
Missed Penalty    0
Scored Penalty    1

These tables tell you something about the number of observations and missing cases.
Block 0: Beginning Block
This table is based on the empty model, i.e. only the constant in the model:

P(Y) = \frac{1}{1 + e^{-(b_0)}}

Classification Table(a,b)
                                        Predicted
Observed                                Missed Penalty   Scored Penalty   Percentage Correct
Step 0   Missed Penalty                 0                35               .0
         Scored Penalty                 0                40               100.0
         Overall Percentage                                               53.3
a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation
                   B      S.E.   Wald   df   Sig.   Exp(B)
Step 0   Constant  .134   .231   .333   1    .564   1.143

Variables not in the Equation
                              Score    df   Sig.
Step 0   Variables  previous  34.109   1    .000
                    pswq      34.193   1    .000
         Overall Statistics   41.558   2    .000

These variables will be entered in the model later on.
Block 1: Method = Enter

Omnibus Tests of Model Coefficients
                  Chi-square   df   Sig.
Step 1   Step     54.977       2    .000
         Block    54.977       2    .000
         Model    54.977       2    .000

This is the test statistic \chi^2 = 2[LL(\text{New}) - LL(\text{Baseline})] for the new model against the baseline. The Block row is useful to check the significance of individual coefficients (see Field).

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      48.662(a)           .520                   .694
a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.

Dividing the -2 Log likelihood by -2 gives the LL of the new model. Note that Nagelkerke is larger than Cox & Snell.
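These numbers fit together; a quick arithmetic check (Python):

```python
minus_2ll_new = 48.662     # Model Summary: -2 Log likelihood of the new model
model_chi_square = 54.977  # Omnibus Tests: Model chi-square

# chi-square = -2LL(baseline) - (-2LL(new)), so the baseline (empty) model had:
minus_2ll_baseline = minus_2ll_new + model_chi_square
print(minus_2ll_baseline)  # 103.639
```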
Block 1: Method = Enter (continued)

Classification Table(a)
                                        Predicted
Observed                                Missed Penalty   Scored Penalty   Percentage Correct
Step 1   Missed Penalty                 30               5                85.7
         Scored Penalty                 7                33               82.5
         Overall Percentage                                               84.0
a. The cut value is .500

Predictive accuracy has improved (was 53%).

Variables in the Equation
                     B       S.E.    Wald    df   Sig.   Exp(B)
Step 1(a)  previous  .065    .022    8.609   1    .003   1.067
           pswq      -.230   .080    8.309   1    .004   .794
           Constant  1.280   1.670   .588    1    .443   3.598
a. Variable(s) entered on step 1: previous, pswq.

B holds the estimates; S.E. the standard errors of the estimates; Sig. the significance based on the Wald statistic; Exp(B) the change in odds (odds ratio).
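Exp(B) is simply e raised to the coefficient B; a quick check (Python, NumPy assumed):

```python
import numpy as np

b = np.array([0.065, -0.230, 1.280])  # previous, pswq, Constant
print(np.exp(b))                      # about [1.067, 0.794, 3.597], matching Exp(B)
```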
How is the classification table constructed?

Classification Table(a)
                                        Predicted
Observed                                Missed Penalty   Scored Penalty   Percentage Correct
Step 1   Missed Penalty                 30               5                85.7
         Scored Penalty                 7                33               82.5
         Overall Percentage                                               84.0
a. The cut value is .500

The off-diagonal cells (5 and 7) are wrong predictions. Using the coefficients from the Variables in the Equation table above, the predicted probability is:

\text{Pred. } P(Y) = \frac{1}{1 + e^{-(1.28 + 0.065 \cdot \text{previous} - 0.230 \cdot \text{pswq})}}
How is the classification table constructed?

\text{Pred. } P(Y) = \frac{1}{1 + e^{-(1.28 + 0.065 \cdot \text{previous} - 0.230 \cdot \text{pswq})}}

pswq   previous   scored   Predicted prob.
18     56         1        .68
17     35         1        .41
20     45         0        .40
10     42         0        .85
How is the classification table constructed?

pswq   previous   scored   Predicted prob.   Predicted
18     56         1        .68               1
17     35         1        .41               0
20     45         0        .40               0
10     42         0        .85               1

Classification Table(a)
                                        Predicted
Observed                                Missed Penalty   Scored Penalty   Percentage Correct
Step 1   Missed Penalty                 30               5                85.7
         Scored Penalty                 7                33               82.5
         Overall Percentage                                               84.0
a. The cut value is .500
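Putting the whole construction together, a sketch in Python (NumPy assumed) using the four example rows and the rounded coefficients above:

```python
import numpy as np

pswq     = np.array([18, 17, 20, 10])
previous = np.array([56, 35, 45, 42])
scored   = np.array([1, 1, 0, 0])

# Predicted probability from the fitted equation
p = 1.0 / (1.0 + np.exp(-(1.28 + 0.065 * previous - 0.230 * pswq)))

predicted = (p >= 0.5).astype(int)  # the cut value is .500

# Cross-tabulate observed vs predicted, as in the SPSS classification table
for obs in (0, 1):
    for pred in (0, 1):
        n = int(np.sum((scored == obs) & (predicted == pred)))
        print(f"observed={obs}, predicted={pred}: {n}")
```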