Logistic Regression I

Outline

• Introduction to maximum likelihood estimation (MLE)
• Introduction to Generalized Linear Models
• The simplest logistic regression (from a 2x2 table)—illustrates how the math works…
• Step-by-step examples
• Dummy variables
  – Confounding and interaction
Introduction to Maximum Likelihood Estimation

A little coin problem…
You have a coin that you know is biased towards heads, and you want to know what the probability of heads (p) is.
YOU WANT TO ESTIMATE THE UNKNOWN PARAMETER p.

Data
You flip the coin 10 times and the coin comes up heads 7 times. What's your best guess for p?
Can we agree that your best guess for p is .7, based on the data?
The Likelihood Function

What is the probability of our data—seeing 7 heads in 10 coin tosses—as a function of p?
The number of heads in 10 coin tosses is a binomial random variable with N=10 and p=(unknown) p.

P(7 heads) = (10 choose 7) p^7 (1-p)^3 = [10!/(7!3!)] p^7 (1-p)^3

This function is called a LIKELIHOOD FUNCTION. It gives the likelihood (or probability) of our data as a function of our unknown parameter p.
The Likelihood Function

P(7 heads) = (10 choose 7) p^7 (1-p)^3 = [10!/(7!3!)] p^7 (1-p)^3

We want to find the p that maximizes the probability of our data (or, equivalently, that maximizes the likelihood function).
THE IDEA: We want to find the value of p that makes our data the most likely, since it's what we saw!
Maximizing a function…

Here comes the calculus…
Recall: How do you maximize a function?
1. Take the log of the function.
   --Turns a product into a sum, for ease of taking derivatives. [The log of a product equals the sum of logs: log(a*b*c) = log a + log b + log c, and log(a^c) = c*log a.]
2. Take the derivative with respect to p.
   --The derivative with respect to p gives the slope of the tangent line for all values of p (at any point on the function).
3. Set the derivative equal to 0 and solve for p.
   --Find the value of p where the slope of the tangent line is 0—this is a horizontal line, so it must occur at the peak or the trough.
1. Take the log of the likelihood function.

Likelihood = (10 choose 7) p^7 (1-p)^3 = [10!/(7!3!)] p^7 (1-p)^3

log Likelihood = log[10!/(7!3!)] + 7 log p + 3 log(1-p)
2. Take the derivative with respect to p.

(d/dp) log Likelihood = 0 + 7/p - 3/(1-p)

Jog your memory:
*derivative of a constant is 0
*derivative of 7f(x) is 7f'(x)
*derivative of log x is 1/x
*chain rule
3. Set the derivative equal to 0 and solve for p.

7/p - 3/(1-p) = 0
[7(1-p) - 3p] / [p(1-p)] = 0
7(1-p) = 3p
7 - 7p = 3p
7 = 10p
p = 7/10
RECAP:

log Likelihood = log[10!/(7!3!)] + 7 log p + 3 log(1-p)
(d/dp) log Likelihood = 0 + 7/p - 3/(1-p)
7/p - 3/(1-p) = 0
[7(1-p) - 3p] / [p(1-p)] = 0
7(1-p) = 3p
7 - 7p = 3p
7 = 10p
p = 7/10
The actual maximum value of the likelihood might not be very high.

Value of the Likelihood = (10 choose 7)(.7)^7(.3)^3 = 120(.7)^7(.3)^3 = .267

Here, the -2 log likelihood (which will become useful later) is:
-2(log likelihood) = -2(ln(.267)) = 2.64

Thus, the MLE of p is .7.
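The coin example can be checked numerically. Here is a minimal sketch (plain Python, no statistics libraries) that evaluates the binomial likelihood on a grid of candidate p values and picks the maximizer:

```python
import math

def likelihood(p, heads=7, n=10):
    """Binomial likelihood of observing `heads` heads in `n` tosses, as a function of p."""
    return math.comb(n, heads) * p**heads * (1 - p)**(n - heads)

# Evaluate the likelihood on a fine grid of candidate p values and take the argmax.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=likelihood)        # MLE: should land on 0.7

max_lik = likelihood(p_hat)              # maximum value of the likelihood (about .267)
neg2loglik = -2 * math.log(max_lik)      # the -2 log likelihood (about 2.64)
```

The grid search stands in for the calculus: the argmax over the grid agrees with the closed-form answer p = 7/10.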
So, we've managed to prove the obvious here! But many times, it's not obvious what your best guess for a parameter is!
MLE tells us what the most likely values are of regression coefficients, odds ratios, averages, differences in averages, etc.
{Getting the variance of that best-guess estimate is much trickier, but it's based on the second derivative—for another time ;-) }
Generalized Linear Models

Twice the generality!
• The generalized linear model is a generalization of the general linear model.
• SAS uses PROC GLM for general linear models.
• SAS uses PROC GENMOD for generalized linear models.
Recall: linear regression

• Requires normally distributed response variables and homogeneity of variances.
• Uses least squares estimation to estimate parameters.
  – Finds the line that minimizes total squared error around the line:
  – Sum of Squared Error (SSE) = Σ(Yi - (α + βxi))^2
  – Minimize the squared error function: set derivative[Σ(Yi - (α + βxi))^2] = 0 and solve for α, β
Why generalize?

General linear models require normally distributed response variables and homogeneity of variances. Generalized linear models do not. The response variables can be binomial, Poisson, or exponential, among others.
Example: The Bernoulli (binomial) distribution

[Figure: lung cancer (yes/no) plotted against smoking (cigarettes/day)]

Could model probability of lung cancer…

p = α + β1*X

[Figure: the probability of lung cancer (p), from 0 to 1, plotted against smoking (cigarettes/day)]

But why might this not be best modeled as linear?
Alternatively…

log(p/(1-p)) = α + β1*X

Logit function
The Logit Model

ln[ P(D/Xi) / (1 - P(D/Xi)) ] = α + r(Xi, β)

Bolded variables represent vectors.
Left-hand side: logit function (log odds).
α: baseline odds.
r(Xi, β): linear function of risk factors and covariates for individual i: β1x1 + β2x2 + β3x3 + β4x4 …
Example

ln[ P(D/smokes; 23 years old; 140 lbs) / (1 - P(D/smokes; 23 years old; 140 lbs)) ] = α + βage(23) + βweight(140) + βsmoke(1)

Left-hand side: logit function (log odds of disease or outcome).
α: baseline odds.
The right-hand side is a linear function of risk factors and covariates for individual i: β1x1 + β2x2 + β3x3 + β4x4 …

→ odds of disease for a 23-year-old, 140-lb smoker:

P(D/smokes; 23 years old; 140 lbs) / (1 - P(D/smokes; 23 years old; 140 lbs)) = e^(α + βage(23) + βweight(140) + βsmoke(1))
Relating odds to probabilities

odds → algebra → probability

P(D/Xi) / (1 - P(D/Xi)) = e^(α + r(Xi, β))

algebraic manipulation:

P(D/Xi) = [1 - P(D/Xi)] e^(α + r(Xi, β))
P(D/Xi) = e^(α + r(Xi, β)) - P(D/Xi) e^(α + r(Xi, β))
P(D/Xi) + P(D/Xi) e^(α + r(Xi, β)) = e^(α + r(Xi, β))

→ P(D/Xi) = e^(α + r(Xi, β)) / (1 + e^(α + r(Xi, β)))
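The logit and its inverse from the derivation above can be written as two one-line functions (a sketch in plain Python):

```python
import math

def logit(p):
    """Log odds: log(p / (1 - p))."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Probability from log odds: e^x / (1 + e^x)."""
    return math.exp(x) / (1 + math.exp(x))

# The two functions are inverses: probability -> log odds -> probability.
p_back = inv_logit(logit(0.25))
```

Note that inv_logit(0) = 0.5: a linear predictor of 0 corresponds to even odds.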
Relating odds to probabilities

odds → algebra → probability

P(D/smokes; 23 years old; 140 lbs) / (1 - P(D/smokes; 23 years old; 140 lbs)) = e^(α + βage(23) + βweight(140) + βsmoke(1))

algebra:

P(D) = (1 - P(D)) e^(α + βage(23) + βweight(140) + βsmoke(1))
P(D) = e^(α + βage(23) + βweight(140) + βsmoke(1)) - P(D) e^(α + βage(23) + βweight(140) + βsmoke(1))
P(D) + P(D) e^(α + βage(23) + βweight(140) + βsmoke(1)) = e^(α + βage(23) + βweight(140) + βsmoke(1))

→ P(D/smokes; 23 years old; 140 lbs) = e^(α + βage(23) + βweight(140) + βsmoke(1)) / (1 + e^(α + βage(23) + βweight(140) + βsmoke(1)))
Individual Probability Functions

Probabilities associated with each individual's outcome:

If i developed disease:
P(D/Xi) = e^(α + r(Xi, β)) / (1 + e^(α + r(Xi, β)))

If i did NOT develop disease:
P(~D/Xi) = 1 - e^(α + r(Xi, β)) / (1 + e^(α + r(Xi, β))) = 1 / (1 + e^(α + r(Xi, β)))

Example:
P(D/smokes; 23 years old; 140 lbs) = e^(α + βage(23) + βweight(140) + βsmoke(1)) / (1 + e^(α + βage(23) + βweight(140) + βsmoke(1)))
The Likelihood Function

The likelihood function is an equation for the joint probability of the observed events as a function of β.

Likelihood Function:

L = ∏(all cases) P(D=1/Xi) × ∏(all controls) P(D=0/Xi)
  = ∏(all cases) [ e^(α + r(Xi, β)) / (1 + e^(α + r(Xi, β))) ] × ∏(all controls) [ 1 / (1 + e^(α + r(Xi, β))) ]
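The likelihood is easiest to handle on the log scale. A minimal sketch of the log likelihood for a logistic model (plain Python; each row of X includes a leading 1 for the intercept α):

```python
import math

def log_likelihood(beta, X, y):
    """Log of the likelihood function: sum of log P(D=1|Xi) over cases
    plus sum of log P(D=0|Xi) over controls."""
    ll = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, xi))   # linear predictor alpha + r(Xi, beta)
        p = math.exp(eta) / (1 + math.exp(eta))      # P(D=1|Xi)
        ll += math.log(p) if yi == 1 else math.log(1 - p)
    return ll
```

An optimizer (or calculus, as in the 2x2 example later in these slides) would then maximize this function over β.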
Maximum Likelihood Estimates of β

Take the log of the likelihood function to change the product to a sum.
Maximize the function (just basic calculus):
1. Take the derivative of the log likelihood function.
2. Set the derivative equal to 0.
3. Solve for β.
“Adjusted” Odds Ratio Interpretation

OR = (odds of disease for the exposed) / (odds of disease for the unexposed)
   = e^(α + βalcohol(1) + βsmoking(1)) / e^(α + βalcohol(0) + βsmoking(1))
   = [e^α · e^βalcohol(1) · e^βsmoking(1)] / [e^α · e^βalcohol(0) · e^βsmoking(1)]
   = e^βalcohol(1) / 1
   = e^βalcohol
Adjusted odds ratio, continuous predictor

OR = (odds of disease for the exposed) / (odds of disease for the unexposed)
   = e^(α + βalcohol(1) + βsmoking(1) + βage(29)) / e^(α + βalcohol(1) + βsmoking(1) + βage(19))
   = [e^α · e^βalcohol(1) · e^βsmoking(1) · e^βage(29)] / [e^α · e^βalcohol(1) · e^βsmoking(1) · e^βage(19)]
   = e^βage(29) / e^βage(19)
   = e^βage(10)
Practical Interpretation

e^(β̂rf · x) = OR for the risk factor of interest

The odds of disease increase multiplicatively by e^β for every one-unit increase in the exposure, controlling for other variables in the model.
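As a numerical illustration of the multiplicative interpretation (using the PSA coefficient 0.0502 that appears in a later slide as the example value):

```python
import math

beta_hat = 0.0502   # example logistic regression coefficient (per 1 mg/ml of PSA)

or_per_unit = math.exp(beta_hat)       # odds multiply by e^beta per one-unit increase
or_per_10 = math.exp(10 * beta_hat)    # per 10-unit increase: e^(10*beta) = (e^beta)^10
```

Because the effect is multiplicative on the odds scale, the 10-unit OR is the one-unit OR raised to the 10th power.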
Simple Logistic Regression: 2x2 Table
(courtesy Hosmer and Lemeshow)

              Exposure=1                                Exposure=0
Disease = 1   P(D/E) = e^(α+β1) / (1 + e^(α+β1))        P(D/~E) = e^α / (1 + e^α)
Disease = 0   P(~D/E) = 1 / (1 + e^(α+β1))              P(~D/~E) = 1 / (1 + e^α)
Odds Ratio for simple 2x2 Table
(courtesy Hosmer and Lemeshow)

OR = [ e^(α+β1)/(1+e^(α+β1)) ÷ 1/(1+e^(α+β1)) ] / [ e^α/(1+e^α) ÷ 1/(1+e^α) ]
   = e^(α+β1) / e^α
   = e^((α+β1) - α)
   = e^β1
Example 1: CHD and Age (2x2)
(from Hosmer and Lemeshow)

              >=55 yrs   <55 years
CHD Present   21         22
CHD Absent    6          51
The Logit Model

log[ P(D) / (1 - P(D)) ] = α + β1·X1

X1 = 1 if exposed (older); 0 if unexposed (younger)
The Likelihood

L(α, β1) = [e^(α+β1)/(1+e^(α+β1))]^21 × [1/(1+e^(α+β1))]^6 × [e^α/(1+e^α)]^22 × [1/(1+e^α)]^51
The Log Likelihood

Recall: log(e^(a+b)) = log(e^a · e^b) = log e^a + log e^b = a + b

L(α, β1) = [e^(α+β1)/(1+e^(α+β1))]^21 × [1/(1+e^(α+β1))]^6 × [e^α/(1+e^α)]^22 × [1/(1+e^α)]^51

→ log L(α, β1) = 21(α+β1) - 21 log(1+e^(α+β1)) + 0 - 6 log(1+e^(α+β1)) + 22α - 22 log(1+e^α) + 0 - 51 log(1+e^α)
Derivative(s) of the log likelihood

log L(α, β1) = 21(α+β1) - 21 log(1+e^(α+β1)) + 0 - 6 log(1+e^(α+β1)) + 22α - 22 log(1+e^α) + 0 - 51 log(1+e^α)

d[log L(β1)]/dβ1 = 21 - 21 e^(α+β1)/(1+e^(α+β1)) - 6 e^(α+β1)/(1+e^(α+β1))

d[log L(α)]/dα = 22 - 22 e^α/(1+e^α) - 51 e^α/(1+e^α)
Maximize α

22 - 22 e^α/(1+e^α) - 51 e^α/(1+e^α) = 0
22 - 73 e^α/(1+e^α) = 0
22(1+e^α) = 73 e^α
22 = 51 e^α
e^α = 22/51 = odds of disease in the unexposed (<55)
Maximize 1
  1
27e
21 
0
  1
1 e
27e  1  21(1  e  1 )
6e  1  21
21
e

6
21
21
21x51
1
6
6
e 


 OR

22
e
6 x 22
51
  1
Hypothesis Testing: H0: β=0

1. The Wald test:

Z = (β̂ - 0) / asymptotic standard error(β̂)

(The null value of beta is 0: no association.)

2. The Likelihood Ratio test:

-2 ln[ L(reduced) / L(full) ] = -2 ln(L(reduced)) - [-2 ln(L(full))] ~ χ²p

Reduced = reduced model with k parameters; Full = full model with k+p parameters.
Hypothesis Testing: H0: β=0

1. What is the Wald test here?

Z = ln[(51×21)/(6×22)] / sqrt(1/51 + 1/6 + 1/21 + 1/22) = 3.96

2. What is the Likelihood Ratio test here?
– Full model = includes age variable
– Reduced model = includes only intercept

Maximum likelihood for the reduced model ought to be (.43)^43 × (.57)^57 (43 cases/57 controls)… does MLE yield this?…
The Reduced Model

log[ P(D) / (1 - P(D)) ] = α

Likelihood value for reduced model:

L(α) = [e^α/(1+e^α)]^43 × [1/(1+e^α)]^57

log L(α) = 43α - 43 log(1+e^α) - 57 log(1+e^α)

d log L(α)/dα = 43 - 100 e^α/(1+e^α) = 0
43 + 43 e^α = 100 e^α
43 = 57 e^α
e^α = 43/57 = .75 = marginal odds of CHD!
α = ln(.75) = -.28

L(α = -.28) = (.75/1.75)^43 × (1/1.75)^57 = (.43)^43 × (.57)^57 = 2.1×10^-30
Likelihood value of full model

L(α, β1) = [ (21/6) / (1 + 21/6) ]^21 × [ 1 / (1 + 21/6) ]^6 × [ (22/51) / (1 + 22/51) ]^22 × [ 1 / (1 + 22/51) ]^51
         = (3.5/4.5)^21 × (1/4.5)^6 × (.43/1.43)^22 × (1/1.43)^51
         = 2.43×10^-26
Finally the LR…

-2 ln[ L(reduced) / L(full) ] = -2 ln(2.1×10^-30) - [-2 ln(2.43×10^-26)] = 136.7 - 117.96 = 18.7

Compare with the Wald test: 18.7 vs. (3.96)^2 = 15.7.
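Both test statistics can be reproduced from the table counts (plain Python sketch):

```python
import math

a, b, c, d = 21, 22, 6, 51   # CHD present (>=55, <55); CHD absent (>=55, <55)

# Wald test: Z = ln(OR) / SE, with SE = sqrt(1/a + 1/b + 1/c + 1/d)
z = math.log((a * d) / (b * c)) / math.sqrt(1/a + 1/b + 1/c + 1/d)

# Likelihood ratio test: -2 ln L(reduced) - [-2 ln L(full)]
loglik_reduced = 43 * math.log(43 / 100) + 57 * math.log(57 / 100)
loglik_full = (a * math.log(a / (a + c)) + c * math.log(c / (a + c))
               + b * math.log(b / (b + d)) + d * math.log(d / (b + d)))
lr = -2 * loglik_reduced - (-2 * loglik_full)
```

The two statistics test the same null hypothesis but need not agree exactly in finite samples, which is why 18.7 and (3.96)^2 differ here.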
Example 2: >2 exposure levels (dummy coding)
(From Hosmer and Lemeshow)

CHD status   White   Black   Hispanic   Other
Present      5       20      15         10
Absent       20      10      10         10
SAS CODE

data race;
input chd race_2 race_3 race_4 number;
datalines;
0 0 0 0 20
1 0 0 0 5
0 1 0 0 10
1 1 0 0 20
0 0 1 0 10
1 0 1 0 15
0 0 0 1 10
1 0 0 1 10
;
run;
proc logistic data=race descending;
weight number;
model chd = race_2 race_3 race_4;
run;

Note the use of "dummy variables." The "baseline" category is white here.
What's the likelihood here?

L(β) = [e^αwhite/(1+e^αwhite)]^5 × [1/(1+e^αwhite)]^20
     × [e^(αwhite+βblack)/(1+e^(αwhite+βblack))]^20 × [1/(1+e^(αwhite+βblack))]^10
     × [e^(αwhite+βhisp)/(1+e^(αwhite+βhisp))]^15 × [1/(1+e^(αwhite+βhisp))]^10
     × [e^(αwhite+βother)/(1+e^(αwhite+βother))]^10 × [1/(1+e^(αwhite+βother))]^10

In this case there is more than one unknown beta (regression coefficient)—so the symbol β represents a vector of beta coefficients.
SAS OUTPUT – model fit

Criterion   Intercept Only   Intercept and Covariates
AIC         140.629          132.587
SC          140.709          132.905
-2 Log L    138.629          124.587

Testing Global Null Hypothesis: BETA=0

Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   14.0420      3    0.0028
Score              13.3333      3    0.0040
Wald               11.7715      3    0.0082
SAS OUTPUT – regression coefficients

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -1.3863    0.5000           7.6871            0.0056
race_2      1    2.0794     0.6325           10.8100           0.0010
race_3      1    1.7917     0.6455           7.7048            0.0055
race_4      1    1.3863     0.6708           4.2706            0.0388
SAS output – OR estimates

The LOGISTIC Procedure
Odds Ratio Estimates

Effect   Point Estimate   95% Wald Confidence Limits
race_2   8.000            2.316    27.633
race_3   6.000            1.693    21.261
race_4   4.000            1.074    14.895

Interpretation:
8x increase in odds of CHD for black vs. white
6x increase in odds of CHD for Hispanic vs. white
4x increase in odds of CHD for other vs. white
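With dummy coding, each fitted OR is just the cross-product ratio of that group's 2x2 sub-table against the baseline (white), which is why the point estimates come out as whole numbers here. A plain Python check:

```python
counts = {             # (CHD present, CHD absent) per race group
    "white": (5, 20),  # baseline category
    "black": (20, 10),
    "hispanic": (15, 10),
    "other": (10, 10),
}

base_present, base_absent = counts["white"]
odds_ratios = {
    race: (present * base_absent) / (absent * base_present)
    for race, (present, absent) in counts.items()
    if race != "white"
}
```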
Example 3: Prostate Cancer Study
(same data as from lab 3)

• Question: Does PSA level predict tumor penetration into the prostatic capsule (yes/no)? (This is a bad outcome, meaning the tumor has spread.)
• Is this association confounded by race?
• Does race modify this association (interaction)?
1. What's the relationship between PSA (continuous variable) and capsule penetration (binary)?

[Figure: scatterplot of capsule (yes/no, 0.0–1.0) vs. PSA (mg/ml), psa from 0 to 140]
Mean PSA per quintile vs. proportion capsule=yes → S-shaped?

[Figure: proportion with capsule=yes (0.18–0.70) vs. PSA (mg/ml), 0 to 50]
Logit plot of psa predicting capsule, by quintiles → linear in the logit?

[Figure: estimated logit (0.04–0.17) vs. psa, 0 to 50]
psa vs. proportion, by decile…

[Figure: proportion with capsule=yes (0.1–0.9) vs. PSA (mg/ml), 0 to 70]
Logit vs. psa, by decile

[Figure: estimated logit plot of psa predicting capsule in the data set kristin.psa; Est. logit 0.04–0.44 vs. psa 0 to 70; m = number of events, M = number of cases]
Model: capsule = psa

Testing Global Null Hypothesis: BETA=0

Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   49.1277      1    <.0001
Score              41.7430      1    <.0001
Wald               29.4230      1    <.0001

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -1.1137    0.1616           47.5168           <.0001
psa         1    0.0502     0.00925          29.4230           <.0001
Model: capsule = psa race

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -0.4992    0.4581           1.1878            0.2758
psa         1    0.0512     0.00949          29.0371           <.0001
race        1    -0.5788    0.4187           1.9111            0.1668

No indication of confounding by race, since the regression coefficient for psa is essentially unchanged in magnitude (0.0502 → 0.0512).
Model: capsule = psa race psa*race

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -1.2858    0.6247           4.2360            0.0396
psa         1    0.0608     0.0280           11.6952           0.0006
race        1    0.0954     0.5421           0.0310            0.8603
psa*race    1    -0.0349    0.0193           3.2822            0.0700

Evidence of effect modification by race (p=.07).
STRATIFIED BY RACE:

---------------------------- race=0 ----------------------------

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -1.1904    0.1793           44.0820           <.0001
psa         1    0.0608     0.0117           26.9250           <.0001

---------------------------- race=1 ----------------------------

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -1.0950    0.5116           4.5812            0.0323
psa         1    0.0259     0.0153           2.8570            0.0910
How to calculate ORs from a model with an interaction term

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -1.2858    0.6247           4.2360            0.0396
psa         1    0.0608     0.0280           11.6952           0.0006
race        1    0.0954     0.5421           0.0310            0.8603
psa*race    1    -0.0349    0.0193           3.2822            0.0700

Increased odds for every 5 mg/ml increase in PSA:

If white (race=0): e^(5×.0608) = 1.36
If black (race=1): e^(5×(.0608-.0349)) = 1.14