9. Logistic regression


Logistic Regression
for binary outcomes
In linear regression, Y is continuous.
In logistic regression, Y is binary (0, 1); the average of Y is the probability P.
Can’t use linear regression since:
1. Y can’t be linearly related to Xs.
2. Y does NOT have a Gaussian (normal)
distribution around “mean” P.
We need a “linearizing” transformation and a
non-Gaussian error model.
Since 0 ≤ P ≤ 1,
we might use odds = P/(1-P).
Odds have no “ceiling” but have a “floor” of zero.
So we use the logit transformation
ln(P/(1-P)) = ln(odds) = logit(P)
Logit does not have a floor or ceiling.
Model:
logit = ln(P/(1-P)) = β0 + β1X1 + β2X2 + … + βkXk
or
odds = e^(β0 + β1X1 + β2X2 + … + βkXk) = e^logit
Since P = odds/(1 + odds) and odds = e^logit,
P = e^logit/(1 + e^logit) = 1/(1 + e^(-logit))
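As a quick illustration, here is a minimal Python sketch of the logit and its inverse (the function names are ours, not from the slides):

    import math

    def logit(p):
        # log odds: ln(p / (1 - p)), defined for 0 < p < 1
        return math.log(p / (1.0 - p))

    def inv_logit(x):
        # back-transform: P = 1 / (1 + e^(-logit))
        return 1.0 / (1.0 + math.exp(-x))

    print(logit(0.8))             # P = 0.8 -> odds = 4 -> logit = ln(4), about 1.386
    print(inv_logit(logit(0.8)))  # recovers 0.8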
[Figure: P vs logit. P = risk (0.0 to 1.0) plotted against logit = log odds (-4 to 4).]
If ln(odds) = β0 + β1X1 + β2X2 + … + βkXk
then
odds = (e^β0)(e^(β1X1))(e^(β2X2)) … (e^(βkXk))
or
odds = (base odds) × OR1 × OR2 × … × ORk
The model is multiplicative on the odds scale.
(Base odds are the odds when all Xs = 0.)
ORi = odds ratio for the ith X
Interpreting β coefficients
Example: Dichotomous X
X = 0 for males, X=1 for females
logit(P) = β0 + β1 X
M: X = 0, logit(Pm) = β0
F: X = 1, logit(Pf) = β0 + β1
logit(Pf) – logit(Pm) = β1
log(OR) = β1, so e^β1 = OR
Example: P is proportion with disease
logit(P) = β0 + β1 age + β2 sex
“sex” is coded 0 for M, 1 for F
The OR for F vs M for disease is e^β2 if both are the same age.
e^β1 is the factor by which the odds of disease are multiplied for a one-year increase in age.
(e^β1)^k = e^(kβ1) is the OR for a k-year change in age between two groups with the same gender.
Example: P is the proportion with an MI
Predictors: age in years
htn = hypertension (1 = yes, 0 = no)
smoke = smoking (1 = yes, 0 = no)
logit(P) = β0 + β1 age + β2 htn + β3 smoke
Q: We want the OR for a 40-year-old with hypertension vs an otherwise identical 30-year-old without hypertension.
A: (β0 + 40β1 + β2 + β3 smoke) – (β0 + 30β1 + β3 smoke) = 10β1 + β2 = log OR.
OR = e^(10β1 + β2)
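A tiny numerical check of this calculation; the coefficient values below are made up for illustration and are not from the slides:

    import math

    # hypothetical coefficients for logit(P) = b0 + b1*age + b2*htn + b3*smoke
    b1, b2 = 0.06, 0.90          # invented values for age and hypertension

    log_or = 10 * b1 + b2        # 40-year-old with htn vs 30-year-old without, same smoking
    print(math.exp(log_or))                   # OR = e^(10*b1 + b2)
    print(math.exp(10 * b1) * math.exp(b2))   # same number: the product of the two ORs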
Interactions
P is proportion with CHD
S: 1 = smoking, 0 = non-smoking.  D: 1 = drinking, 0 = non-drinking.
logit(P) = β0 + β1S + β2D + β3SD
Referent category is S = 0, D = 0
S  D  odds               OR
0  0  e^β0               OR00 = 1 = e^β0/e^β0
1  0  e^(β0+β1)          OR10 = e^β1
0  1  e^(β0+β2)          OR01 = e^β2
1  1  e^(β0+β1+β2+β3)    OR11 = e^(β1+β2+β3)
When will OR11 = OR10 × OR01? If and only if β3 = 0.
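A small sketch of that last check, with invented coefficients:

    import math

    # hypothetical coefficients for logit(P) = b0 + b1*S + b2*D + b3*S*D
    b1, b2, b3 = 0.7, 0.5, 0.3       # made-up values

    OR10 = math.exp(b1)              # smoker, non-drinker vs neither
    OR01 = math.exp(b2)              # drinker, non-smoker vs neither
    OR11 = math.exp(b1 + b2 + b3)    # smoker and drinker vs neither

    print(OR11, OR10 * OR01)         # these differ here; equal only when b3 = 0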
Interpretation example
Potential predictors (13) of in-hospital
infection mortality (yes or no)
Crabtree et al., JAMA, 8 Dec 1999, No. 22, 2143-2148
Gender (female or male)
Age in years
APACHE score (0-129)
Diabetes (y/n)
Renal insufficiency / Hemodialysis (y/n)
Intubation / mechanical ventilation (y/n)
Malignancy (y/n)
Steroid therapy (y/n)
Transfusions (y/n)
Organ transplant (y/n)
WBC - count
Max temperature - degrees
Days from admission to treatment (> 7 days)
Factors Associated With Mortality for All Infections
Characteristic        Odds Ratio (95% CI)   p value
Incr APACHE score     1.15 (1.11-1.18)      <.001
Transfusion (y/n)     4.15 (2.46-6.99)      <.001
Increasing age        1.03 (1.02-1.05)      <.001
Malignancy            2.60 (1.62-4.17)      <.001
Max temperature       0.70 (0.58-0.85)      <.001
Adm to treat > 7 d    1.66 (1.05-2.61)      0.03
Female (y/n)          1.32 (0.90-1.94)      0.16
*APACHE = Acute Physiology & Chronic Health Evaluation Score
Diabetes complications - Descriptive stats
Table of obese by diabetes complication
                 diabetes complication
obese           no (0)   yes (1)   Total   % yes
no (0)            56       28        84    28/84 = 33%
yes (1)           20       41        61    41/61 = 67%
Total             76       69       145
% obese          26%      59%
RR = 2.0, OR = 4.1
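The RR and OR above can be reproduced directly from the 2×2 counts; a short sketch:

    # counts from the table above
    a, b = 41, 20    # obese:     complication yes, complication no
    c, d = 28, 56    # not obese: complication yes, complication no

    risk_obese     = a / (a + b)          # 41/61, about 0.67
    risk_not_obese = c / (c + d)          # 28/84, about 0.33
    RR = risk_obese / risk_not_obese      # about 2.0
    OR = (a / b) / (c / d)                # (41/20) / (28/56), about 4.1

    print(round(RR, 1), round(OR, 1))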
Fasting glucose (“fast glu”), mg/dl
                  n    min    median   mean     max
No complication  76   70.0     90.0     91.2   112.0
Complication     69   75.0    114.0    155.9   353.0,  p < 0.001

Steady state glucose (“steady glu”), mg/dl
                  n    min    median   mean     max
No complication  76   29.0    105.0    114.0   273.0
Complication     69   60.0    257.0    261.5   480.0,  p =
Diabetes complication
Parameter    DF    beta     SE(b)   Chi-Square   p
Intercept     1   -14.70    3.231     20.706    <.0001
obese         1     0.328   0.615      0.285    0.5938
Fast glu      1     0.108   0.031     12.456    0.0004
Steady glu    1     0.023   0.005     18.322    <.0001
Log odds of diabetes complication =
-14.7 + 0.328 obese + 0.108 fast glu + 0.023 steady glu
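A sketch of how a model like this could be fit in Python with statsmodels. The data below are simulated from the fitted equation above purely so the code runs; the real study data are not given in the slides.

    import numpy as np
    import statsmodels.api as sm

    # made-up data, simulated from the fitted equation above, for illustration only
    rng = np.random.default_rng(0)
    n = 145
    obese      = rng.integers(0, 2, n)
    fast_glu   = rng.normal(110, 25, n)      # mg/dl, invented distribution
    steady_glu = rng.normal(150, 60, n)      # mg/dl, invented distribution
    true_logit = -14.7 + 0.328*obese + 0.108*fast_glu + 0.023*steady_glu
    y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

    X = sm.add_constant(np.column_stack([obese, fast_glu, steady_glu]))
    fit = sm.Logit(y, X).fit()
    print(fit.params)                  # estimates of b0, b_obese, b_fast, b_steady
    print(np.exp(fit.params[1:]))      # fitted odds ratios
    print(np.exp(fit.conf_int()[1:]))  # 95% Wald confidence limits on the OR scale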
Statistical significance of the βs
Linear regression: t = b/SE → p value
Logistic regression: χ² = (b/SE)² → p value
For a (95%) CI for the OR, first form the CI for β on the log scale:
b – 1.96 SE, b + 1.96 SE
Then take antilogs of each end:
e^(b – 1.96 SE), e^(b + 1.96 SE)
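For example, using the obese coefficient from the table above (b = 0.328, SE = 0.615), this arithmetic reproduces the Wald limits shown in the next table:

    import math

    b, se = 0.328, 0.615                 # obese coefficient and its SE
    OR    = math.exp(b)                  # about 1.39
    lower = math.exp(b - 1.96 * se)      # about 0.42
    upper = math.exp(b + 1.96 * se)      # about 4.63
    print(round(OR, 3), round(lower, 3), round(upper, 3))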
Diabetes complications
Odds Ratio Estimates
Effect        Point Estimate      95% Wald Confidence Limits
obese         e^0.328 = 1.388       0.416    4.631
Fast glu      e^0.108 = 1.114       1.049    1.182
Steady glu    e^0.023 = 1.023       1.012    1.033
Model fit - Linear vs Logistic regression
k variables, n observations

Variation   df     sum of squares or deviance
Model        k     G
Error      n-k     D
Total      n-1     T  (fixed)

Yi = ith observation, Ŷi = prediction for the ith obs

statistic           Linear regr      Logistic regr
D/(n-k)             Residual SDe     Mean deviance
Σ[(Yi-Ŷi)/Ŷ]²       -                Hosmer-L χ²
Corr(Y,Ŷ)²          R²               Cox-Snell R²
G/T                 R²               Pseudo R²

Good regression models have large G and
small D. For logistic regression, D/(n-k),
the mean deviance, should be near 1.0.
There are two versions of the R² for
logistic regression.
Goodness of fit: Deviance
Deviance in logistic regression is like SS in linear regression.
            df    -2 log L   p value
Model (G)     3    117.21    < 0.001
Error (D)   141     83.46
Total (T)   144    200.67
mean deviance = 83.46/141 = 0.59
(want mean deviance to be ≤ 1)
R²pseudo = G/Total = 117/201 = 0.58,  R²CS = 0.554
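These fit summaries are simple arithmetic on the deviance table; a short sketch using the numbers above:

    # -2 log L values from the deviance table above
    G, D, T  = 117.21, 83.46, 200.67
    df_error = 141

    mean_deviance = D / df_error     # about 0.59; want this near (or below) 1
    r2_pseudo     = G / T            # about 0.58
    print(round(mean_deviance, 2), round(r2_pseudo, 2))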
Goodness of fit: H-L chi-square
Compare observed vs model-predicted
(expected) frequencies by decile of predicted risk.
decile   total   obs y   exp y   obs no   exp no
1          16      0      0.23     16      15.8
2          15      0      0.61     15      14.4
3          15      0      1.31     15      13.7
…
8          16     15     15.6       1      0.40
9          23     23     23.0       0      0.00
chi-square = 9.89, df = 7, p = 0.1946
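A minimal sketch of how a Hosmer-Lemeshow statistic is assembled from such a decile table (a generic function; since the full table is not reproduced above, no example call is shown):

    def hosmer_lemeshow(total, obs_y, exp_y):
        # total, obs_y, exp_y: one entry per decile (group) of predicted risk
        # sums (observed - expected)^2 / expected over the "yes" and "no" cells
        # (assumes every expected count is nonzero)
        chi2 = 0.0
        for n, o, e in zip(total, obs_y, exp_y):
            obs_no, exp_no = n - o, n - e
            chi2 += (o - e) ** 2 / e + (obs_no - exp_no) ** 2 / exp_no
        return chi2   # compare to a chi-square with (number of groups - 2) df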
Goodness of fit vs R²
Interpretation when goodness of fit is acceptable but R² is poor:
Do we need to include interactions or transform the X variables in the model?
Do we need to obtain more X variables?
Sensitivity & Specificity
                True pos   True neg
Classify pos       a          b
Classify neg       c          d
Total             a+c        b+d
Sensitivity = a/(a+c),  false neg = c/(a+c)
Specificity = d/(b+d),  false pos = b/(b+d)
Accuracy = W × sensitivity + (1-W) × specificity
Any good classification rule,
including a logistic model, should
have high sensitivity & specificity.
In logistic regression, we choose a cutpoint Pc:
predict positive if P > Pc,
predict negative if P < Pc.
Diabetes complication
logit(Pi) = -14.7+0.328 obese+0.108 fast glu
+0.023 steady glu
Pi = 1/(1+ exp(-logit))
Compute Pi for all observations; find the value of
Pi (call it P0) that maximizes
accuracy = 0.5 sensitivity + 0.5 specificity.
This is an ROC analysis using the logit (or Pi).
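A rough sketch of this cutpoint search in plain Python. The variable names are ours; it assumes y is a list of 0/1 outcomes and p_hat a list of predicted probabilities, with both outcome classes present:

    def avg_accuracy(p_hat, y, cut):
        # accuracy = 0.5*sensitivity + 0.5*specificity at cutpoint `cut`
        tp = sum(p > cut for p, yi in zip(p_hat, y) if yi == 1)
        fn = sum(p <= cut for p, yi in zip(p_hat, y) if yi == 1)
        tn = sum(p <= cut for p, yi in zip(p_hat, y) if yi == 0)
        fp = sum(p > cut for p, yi in zip(p_hat, y) if yi == 0)
        return 0.5 * tp / (tp + fn) + 0.5 * tn / (tn + fp)

    def best_cutpoint(p_hat, y):
        # P0: the predicted probability that maximizes the average accuracy
        return max(set(p_hat), key=lambda c: avg_accuracy(p_hat, y, c))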
ROC for logistic model
Diabetes model accuracy
Logit = 0.447,  P0 = e^0.447/(1 + e^0.447) = 0.61
            True comp   True no comp
Pred yes        55           11
Pred no         14           65
Total           69           76
Sens = 55/69 = 79.7%,  Spec = 65/76 = 85.5%
Accuracy = (79.7% + 85.5%)/2 = 82.6%
C statistic (report this)
n0 = number of negatives, n1 = number of positives
Form all n0 × n1 (positive, negative) pairs.
Concordant if predicted P for the Y=1 member > predicted P for the Y=0 member.
Discordant if predicted P for the Y=1 member < predicted P for the Y=0 member.
C = (num concordant + 0.5 × num ties) / (n0 × n1)
C = 0.949 for the diabetes complication model
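This definition translates almost line for line into code; a brief sketch (again assuming y holds the 0/1 outcomes and p_hat the predicted probabilities):

    def c_statistic(p_hat, y):
        # C = (concordant + 0.5*ties) / (n0 * n1) over all (positive, negative) pairs
        pos = [p for p, yi in zip(p_hat, y) if yi == 1]
        neg = [p for p, yi in zip(p_hat, y) if yi == 0]
        concordant = sum(1 for p1 in pos for p0 in neg if p1 > p0)
        ties       = sum(1 for p1 in pos for p0 in neg if p1 == p0)
        return (concordant + 0.5 * ties) / (len(pos) * len(neg))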
Logistic model is also a
discriminant model (LDA)
[Figure: Histograms of logit(P) scores for each group (frequency vs logit(P), roughly -4 to 4).]
Poisson Regression
Y is a count: a small non-negative integer 0, 1, 2, …
Model:
ln(mean Y) = β0 + β1X1 + β2X2 + … + βkXk
so
mean Y = exp(β0 + β1X1 + β2X2 + … + βkXk)
dY/dXi = βi × (mean Y), so βi = (dY/dXi)/(mean Y)
100 βi is (approximately) the percent change in mean Y per unit
change in Xi
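A one-line check of that percent-change reading, with an invented coefficient (100 βi is the usual approximation; the exact multiplicative change is e^βi):

    import math

    b1 = 0.04                            # made-up Poisson coefficient
    print(100 * b1)                      # about 4% per unit change in X1 (approximation)
    print(100 * (math.exp(b1) - 1))      # exact percent change, about 4.08%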
End
Equation for logit = log odds = depression “score”
logit = -1.8259 + 0.8332 female + 0.3578 chron ill - 0.0299 income
odds of depression = e^logit,  risk = odds/(1 + odds)
Coding:
Female: 0 for M, 1 for F
Chron ill: 0 for no, 1 for yes
Income in 1000s
Example: Depression (y/n)
Model for depression
term         coeff = β    SE        p value
Intercept     -1.8259     0.4495    0.0001
female         0.8332     0.3882    0.0319
chron ill      0.3578     0.3300    0.2782
income        -0.0299     0.0135    0.0268
Female and chron ill are binary; income is in 1000s.
ORs
term         coeff = β    OR = e^β
Intercept     -1.8259      --
female         0.8332      2.301
chron ill      0.3578      1.430
income        -0.0299      0.971
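Putting the depression model together, a small sketch that turns the fitted equation into a predicted risk (the covariate values in the example call are invented):

    import math

    def depression_risk(female, chron_ill, income_thousands):
        # fitted equation from the slides: logit -> odds -> risk
        logit = -1.8259 + 0.8332*female + 0.3578*chron_ill - 0.0299*income_thousands
        odds = math.exp(logit)
        return odds / (1 + odds)

    # e.g., a woman with a chronic illness and income of 20 (thousand)
    print(depression_risk(female=1, chron_ill=1, income_thousands=20))   # about 0.23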