
Statistical Analysis
SC504/HS927
Spring Term 2008
Introduction to Logistic Regression
Dr. Daniel Nehring
Outline

• Preliminaries: the SPSS syntax
• Linear regression and logistic regression
• OLS with a binary dependent variable
• Principles of logistic regression
• Interpreting logistic regression coefficients
• Advanced principles of logistic regression (for self-study)

Source: http://privatewww.essex.ac.uk/~dfnehr
PRELIMINARIES
The SPSS syntax

• Simple programming language allowing access to all SPSS operations
• Access to operations not covered in the main interface
• Accessible through syntax windows
• Accessible through ‘Paste’ buttons in every window of the main interface
• Documentation available in the ‘Help’ menu
Using SPSS syntax files

• Saved in a separate file format through the syntax window
• Run commands by highlighting them and pressing the arrow button
• Comments can be entered into the syntax
• Copy-paste operations make the syntax easy to learn
• The syntax is always preferable to the main interface: it keeps a log of your work and makes mistakes easy to identify and correct (see the sketch below)
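As a minimal sketch of what a syntax file might contain (the variable name smoker is hypothetical, standing in for any variable in the active dataset):

  * Comments start with an asterisk and end with a full stop.
  * Tabulate a variable, then list its summary statistics.
  FREQUENCIES VARIABLES=smoker.
  DESCRIPTIVES VARIABLES=smoker.

Highlighting these lines and pressing the run (arrow) button executes them and sends the results to the output window.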
PART I
Simple linear regression

• Relation between 2 continuous variables

  y = α + β₁x₁

[Figure: fitted regression line on a plot of y against x; β₁ is the slope]

• Regression coefficient β₁
  - Measures association between y and x
  - Amount by which y changes on average when x changes by one unit
• Least squares method
Multiple linear regression

• Relation between a continuous variable and a set of i continuous variables

  y = α + β₁x₁ + β₂x₂ + ... + βᵢxᵢ

• Partial regression coefficients βᵢ
  - Amount by which y changes on average when xᵢ changes by one unit and all the other x's remain constant
  - Measures association between xᵢ and y adjusted for all other x's
Multiple linear regression

  y = α + β₁x₁ + β₂x₂ + ... + βᵢxᵢ

• y: the predicted value (response variable; dependent variable)
• x₁, ..., xᵢ: predictor variables (explanatory variables; independent variables)
OLS with a binary dependent variable

• Binary variables can take only 2 possible values:
  - yes/no (e.g. educated to degree level, smoker/non-smoker)
  - success/failure (e.g. of a medical treatment)
• Coded 1 or 0 (by convention 1 = yes/success)
• Using OLS with a binary dependent variable, predicted values can be interpreted as probabilities and so are expected to lie between 0 and 1
• But nothing constrains the regression model to predict values between 0 and 1; predictions less than 0 or greater than 1 are possible and have no logical interpretation (see the illustration below)
• Approaches that ensure predicted values lie between 0 and 1, such as logistic regression, are required
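To see the problem, take a hypothetical OLS fit with made-up coefficients, p̂ = −0.3 + 0.02 × age. At age 70 it predicts p̂ = 1.1, a ‘probability’ greater than 1, and at age 10 it predicts p̂ = −0.1, less than 0.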
Fitting the equation to the data

• Linear regression: least squares
• Logistic regression: maximum likelihood
• Likelihood function
  - estimates parameters with the property that the likelihood (probability) of the observed data is higher than for any other parameter values
  - practically easier to work with the log-likelihood:

  L(β) = ln ℓ(β) = Σᵢ₌₁ⁿ { yᵢ ln π(xᵢ) + (1 − yᵢ) ln[1 − π(xᵢ)] }

  where π(xᵢ) is the probability that yᵢ = 1 given xᵢ.
Maximum Likelihood Estimation (MLE)

• OLS cannot be used for logistic regression, since the relationship between the dependent and independent variables is non-linear
• MLE is used instead to estimate the coefficients on the independent variables (the parameters)
• Of all possible values of these parameters, MLE chooses those under which the model would have been most likely to generate the observed sample
Logistic regression

• Models the relationship between a set of variables xᵢ
  - dichotomous (yes/no)
  - categorical (social class, ...)
  - continuous (age, ...)
  and a dichotomous (binary) variable Y
PART II
Logistic regression (1)

• ‘Logistic regression’ or ‘logit’
• p is the probability of an event occurring
• 1 − p is the probability of the event not occurring
• p can take any value from 0 to 1
• the odds of the event occurring = p / (1 − p)
• the dependent variable in a logistic regression is the natural log of the odds: ln[ p / (1 − p) ]
Logistic regression (2)

• ln[ p / (1 − p) ] can take any value, while p will always range from 0 to 1
• the equation to be estimated is:

  ln[ p / (1 − p) ] = a + b₁x₁ + b₂x₂ + ... + bₖxₖ
Logistic regression (3)

  ln[ P(y|x) / (1 − P(y|x)) ] = a + βx        ← the logit of P(y|x)

• Logistic transformation:

  P(y|x) = e^(a+βx) / (1 + e^(a+βx))
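The second equation follows from the first by solving for P(y|x); a quick sketch of the algebra, writing P for P(y|x):

  P / (1 − P) = e^(a+βx)               (exponentiate both sides)
  P = e^(a+βx) − P e^(a+βx)            (multiply out)
  P (1 + e^(a+βx)) = e^(a+βx)          (collect the P terms)
  P = e^(a+βx) / (1 + e^(a+βx))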
Predicting p

let
  z = a + b₁x₁ + b₂x₂ + ... + bₖxₖ
then, to predict p for individual i,

  pᵢ / (1 − pᵢ) = e^(a + b₁x₁ᵢ + b₂x₂ᵢ + ... + bₖxₖᵢ) = e^(zᵢ)

  pᵢ = e^(zᵢ) / (1 + e^(zᵢ))
Logistic function (1)

[Figure: S-shaped logistic curve; probability of event y on the vertical axis, from 0.0 to 1.0, plotted against x]

  P(y|x) = e^(a+βx) / (1 + e^(a+βx))
PART III
Interpreting logistic regression coefficients

• the intercept is the value of the log odds when all independent variables are zero
• each slope coefficient is the change in the log odds from a 1-unit increase in the independent variable, controlling for the effects of the other variables
• two problems:
  - log odds are not easy to interpret
  - the change in the log odds from a 1-unit increase in one independent variable depends on the values of the other independent variables
• but the exponent of b (eᵇ) does not depend on the values of the other independent variables, and it is the odds ratio (see the illustration below)
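For instance, with a made-up coefficient b = 0.405, eᵇ ≈ 1.5: each 1-unit increase in that variable multiplies the odds of the event by about 1.5, whatever values the other variables take.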
Odds ratio

• odds ratio for the coefficient on a dummy variable, e.g. female = 1 for women, 0 for men
  - odds ratio = ratio of the odds of the event occurring for women to the odds of its occurring for men
  - odds for women are eᵇ times the odds for men
General rules for interpreting logistic regression coefficients

• if b₁ > 0, X₁ increases p
• if b₁ < 0, X₁ decreases p
• if the odds ratio > 1, X₁ increases p
• if the odds ratio < 1, X₁ decreases p
• if the CI for b₁ includes 0, X₁ does not have a statistically significant effect on p
• if the CI for the odds ratio includes 1, X₁ does not have a statistically significant effect on p
An example: modelling the relationship between disability, age and income in the 65+ population

• dependent variable = presence of disability (1 = yes, 0 = no)
• independent variables:
  - X₁: age in years in excess of 65 (i.e. 65 → 0, 70 → 5)
  - X₂: whether the person has a low income (in the lowest third of the income distribution)
• data: Health Survey for England, 2000
Example: logistic regression estimate for probability of being disabled, people aged 65+

                   Coeff (b)   Odds ratio   p-value   95% CI on coeff     95% CI on odds ratio
  constant          -0.912                  0.000     -1.129 to -0.696
  age                0.078     1.081        0.000      0.060 to  0.095     1.062 to 1.100
  has low income    -0.270     0.764        0.003     -0.515 to -0.024     0.597 to 0.976

Applying the rules above: every coefficient CI excludes 0 (equivalently, every odds-ratio CI excludes 1), so all three estimates are statistically significant at the 5% level.

source: estimated from the Health Survey for England, 2000
PART IV
Odds, log odds, odds ratios and probabilities

  odds = p / (1 − p)

  log odds = ln[ p / (1 − p) ] = a + b₁x₁ + b₂x₂ + ... + bₖxₖ

  odds ratio for variable k = e^(bₖ)

  probability = e^(a + b₁x₁ + b₂x₂ + ... + bₖxₖ) / (1 + e^(a + b₁x₁ + b₂x₂ + ... + bₖxₖ))
Odds, odds ratios and probability of disability among non-low-income people aged 65+

[Figure: probability, odds and odds ratio (compared with age 65) of disability plotted against age from 65 to 90, using the model above. All three rise with age: at 65 the probability is 0.29, the odds 0.40 and the odds ratio 1.000; by 90 the probability is 0.74, the odds 2.82 and the odds ratio 7.029.]
Odds, odds ratios and probabilities

• pⱼ = 0.2, i.e. a 20% probability
• oddsⱼ = 0.2/(1 − 0.2) = 0.2/0.8 = 0.25
• pₖ = 0.4
• oddsₖ = 0.4/0.6 = 0.67
• relative probability (risk) pⱼ/pₖ = 0.2/0.4 = 0.5
• odds ratio oddsⱼ/oddsₖ = 0.25/0.67 = 0.37
• the odds ratio is not equal to the relative probability (risk), except approximately when pⱼ and pₖ are small (see below)
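To see the small-probability approximation with made-up numbers: if pⱼ = 0.02 and pₖ = 0.04, the relative risk is 0.5 and the odds ratio is (0.02/0.98)/(0.04/0.96) ≈ 0.49, almost identical; with the larger probabilities above, the two (0.37 versus 0.5) clearly diverge.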
Points to note from logit example.xls

• if you see an odds ratio of e.g. 1.5 for a dummy variable indicating female, beware of saying ‘women have a probability 50% higher than men’; only if both probabilities are small can you say this
• it is better to calculate probabilities for example cases and compare these
Predicting p

let
  z = a + b₁x₁ + b₂x₂ + ... + bₖxₖ
then, to predict p for individual i,

  pᵢ / (1 − pᵢ) = e^(a + b₁x₁ᵢ + b₂x₂ᵢ + ... + bₖxₖᵢ) = e^(zᵢ)

  pᵢ = e^(zᵢ) / (1 + e^(zᵢ))
E.g.: Predicting a probability from our model

Predict disability for someone on low income aged 75:
• Add up the linear equation (age enters as years over 65, so age 75 gives x₁ = 10):
  z = −0.912 + 0.078 × 10 + (−0.270) × 1 = −0.402
• Take its exponent to get the odds of being disabled:
  odds = e^(−0.402) = 0.669
• Put the odds over 1 + the odds to give the probability:
  p = 0.669 / 1.669 ≈ 0.40, i.e. about a 40 per cent chance of being disabled
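The same calculation can be scripted in SPSS syntax. A minimal sketch, assuming the active dataset holds hypothetical variables age (in years) and lowinc (1 = low income, 0 = otherwise), with the coefficients hard-coded from the table above:

  * Linear predictor: intercept + age effect + low-income effect.
  COMPUTE z = -0.912 + 0.078*(age - 65) - 0.270*lowinc.
  * Convert the log odds into a predicted probability.
  COMPUTE p_hat = EXP(z) / (1 + EXP(z)).
  EXECUTE.

In practice the model itself would be fitted with the LOGISTIC REGRESSION command, which can also save predicted probabilities directly.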
Goodness of fit in logistic regressions

• based on improvements in the likelihood of observing the sample
• use a chi-square test with the test statistic

  χ² = −2 ln( L_R / L_U )

• where R and U indicate the restricted and unrestricted models
  - unrestricted: all independent variables in the model
  - restricted: all or a subset of the variables excluded from the model (their coefficients restricted to be 0)
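As a made-up numerical illustration: if the restricted (intercept-only) model has log-likelihood ln L_R = −250 and the unrestricted model has ln L_U = −240, then χ² = −2 (ln L_R − ln L_U) = 20, which is compared against a chi-square distribution with degrees of freedom equal to the number of coefficients restricted to 0.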
Statistical significance of coefficient estimates in logistic regressions

• Calculated using standard errors, as in OLS:

  t = b̂ / se(b̂)

• for large n, |t| > 1.96 means that the coefficient is statistically significantly different from 0 at the 5% level, i.e. p ≤ 0.05
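Checking this against the age coefficient in the earlier example: its 95% CI of (0.060, 0.095) implies se(b̂) ≈ (0.095 − 0.060) / (2 × 1.96) ≈ 0.009, so t ≈ 0.078 / 0.009 ≈ 8.7, far above 1.96 and consistent with the reported p-value of 0.000.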
95% confidence intervals for logistic regression coefficient estimates

  b̂ ± 1.96 × se(b̂)

• For CIs of odds ratios, calculate the CIs for the coefficients and take the exponents of their endpoints
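For the age coefficient above, 0.078 ± 1.96 × 0.009 gives roughly (0.060, 0.095); exponentiating the endpoints gives (e^0.060, e^0.095) ≈ (1.062, 1.100), matching the odds-ratio CI in the example table.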