Introduction to Logistic Regression Analysis
Dr Tuan V. Nguyen
Garvan Institute of Medical Research
Sydney, Australia
Introductory example 1

Gender difference in preference for white wine. A group of
57 men and 167 women were asked to state their preference
for a new white wine. The results are as follows:

Gender   Like   Dislike   ALL
Men        23        34    57
Women      35       132   167
ALL        58       166   224

Question: Is there a gender effect on the preference?
Introductory example 2
Fat concentration and preference. 435 samples of a sauce of
various fat concentrations were tasted by consumers. There were
two outcomes: like or dislike. The results are as follows:

Concentration   Like   Dislike   ALL
1.35              13        0     13
1.60              19        0     19
1.75              67        2     69
1.85              45        5     50
1.95              71        8     79
2.05              50       20     70
2.15              35       31     66
2.25               7       49     56
2.35               1       12     13

Question: Is there an effect of fat concentration on the preference?
Consideration …

The question in example 1 can be addressed by a “traditional”
analysis such as the z-statistic or the Chi-square test.

The question in example 2 is more difficult to handle, as the
factor (fat concentration) was a continuous variable while the
outcome was a categorical variable (like or dislike).

However, there is a much better and more systematic method
to analyse these data: logistic regression.
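For example 1, the “traditional” Chi-square analysis mentioned above takes one line in R (a sketch; the object name `tab` is ours, and the continuity correction is turned off to match the large-sample z-statistic):

```r
# 2 x 2 table of gender by preference from example 1
tab <- matrix(c(23, 34, 35, 132), nrow=2, byrow=TRUE,
              dimnames=list(Gender=c("Men", "Women"),
                            Preference=c("Like", "Dislike")))
chisq.test(tab, correct=FALSE)   # tests for a gender effect on preference
```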
Odds and odds ratio

Let P be the probability of preference, then the odds of
preference is: O = P / (1-P)
Gender   Like   Dislike   ALL   P(like)
Men        23        34    57     0.403
Women      35       132   167     0.209
ALL        58       166   224     0.259


O(men) = 0.403 / 0.597 = 0.676
O(women) = 0.209 / 0.791 = 0.265
Odds ratio: OR = O(men) / O(women) = 0.676 / 0.265 = 2.55
(Meaning: the odds of preference is 2.55 times higher in men than in women)
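As a sketch, the same odds and odds ratio can be reproduced in R from the counts in the table:

```r
p.men <- 23/57                       # P(like) in men, 0.403
p.women <- 35/167                    # P(like) in women, 0.209
o.men <- p.men / (1 - p.men)         # odds in men, approx 0.676
o.women <- p.women / (1 - p.women)   # odds in women, approx 0.265
o.men / o.women                      # odds ratio, approx 2.55
```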
Meanings of odds ratio

OR > 1: the odds of preference is higher in men
than in women

OR < 1: the odds of preference is lower in men than
in women

OR = 1: the odds of preference in men is the same
as in women

How to assess the “significance” of OR?
Computing variance of odds ratio

The significance of OR can be tested by calculating
its variance.

The variance of OR can be indirectly calculated by
working with logarithmic scale:
1. Convert OR to log(OR)
2. Calculate the variance of log(OR)
3. Calculate the 95% confidence interval of log(OR)
4. Convert back to the 95% confidence interval of OR

Computing variance of odds ratio
Gender   Like   Dislike
Men        23        34
Women      35       132
ALL        58       166


OR = (23/34)/ (35/132) = 2.55
Log(OR) = log(2.55) = 0.937

Variance of log(OR):
V = 1/23 + 1/34 + 1/35 + 1/132 = 0.109

Standard error of log(OR)
SE = sqrt(0.109) = 0.330

95% confidence interval of log(OR)
0.937 ± 1.96(0.330) = 0.289 to 1.584

Convert back to 95% confidence interval of OR
Exp(0.289) = 1.33 to Exp(1.584) = 4.87
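The four steps above can be carried out in a few lines of R (a sketch using the counts from the table):

```r
or <- (23/34) / (35/132)                 # OR = 2.55
log.or <- log(or)                        # log(OR) = 0.937
se <- sqrt(1/23 + 1/34 + 1/35 + 1/132)   # SE of log(OR) = 0.330
exp(log.or + c(-1.96, 1.96) * se)        # 95% CI of OR: about 1.33 to 4.87
```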
Logistic analysis by R
sex <- c(1, 2)
like <- c(23, 35)
dislike <- c(34, 132)
total <- like + dislike
prob <- like / total
logistic <- glm(prob ~ sex,
    family="binomial", weights=total)
Gender   Like   Dislike
Men        23        34
Women      35       132
ALL        58       166
> summary(logistic)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.5457     0.5725   0.953  0.34044
sex          -0.9366     0.3302  -2.836  0.00456 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7.8676e+00  on 1  degrees of freedom
Residual deviance: 2.2204e-15  on 0  degrees of freedom
AIC: 13.629
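The sex coefficient (-0.9366) ties back to the hand calculation: sex is coded 1 for men and 2 for women, so exp(-(-0.9366)) = exp(0.9366) ≈ 2.55, the odds ratio found earlier. Continuing from the fit above (a sketch):

```r
exp(-coef(logistic)["sex"])                     # OR, men vs women: about 2.55
exp(-rev(confint.default(logistic)["sex", ]))   # Wald 95% CI: about 1.33 to 4.87
```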
Logistic regression model for continuous factor

Concentration   Like   Dislike   % like
1.35              13        0     1.00
1.60              19        0     1.00
1.75              67        2     0.971
1.85              45        5     0.900
1.95              71        8     0.899
2.05              50       20     0.714
2.15              35       31     0.530
2.25               7       49     0.125
2.35               1       12     0.077

[Figure: probability of liking plotted against fat concentration (1.35 to 2.45); the observed proportion liking falls steadily as fat concentration increases.]
Analysis by using R

conc <- c(1.35, 1.60, 1.75, 1.85, 1.95,
          2.05, 2.15, 2.25, 2.35)
like <- c(13, 19, 67, 45, 71, 50, 35, 7, 1)
dislike <- c(0, 0, 2, 5, 8, 20, 31, 49, 12)
total <- like + dislike
prob <- like / total
plot(prob ~ conc, pch=16, xlab="Concentration")

[Figure: scatter plot of the observed proportion liking against concentration.]
Logistic regression model for continuous factor – model

Let p = probability of preference. The logit of p is:

logit(p) = log[p / (1 - p)]

Model: logit(p) = a + b(FAT)
where a is the intercept and b is the slope, both of which have to be
estimated from the data.
Analysis by using R
logistic <- glm(prob ~ conc, family="binomial",
    weights=total)
summary(logistic)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.78226  -0.69052   0.07981   0.36556   1.36871

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   22.708      2.266  10.021   <2e-16 ***
conc         -10.662      1.083  -9.849   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 198.7115  on 8  degrees of freedom
Residual deviance:   8.5568  on 7  degrees of freedom
AIC: 37.096
Logistic regression model for continuous factor – interpretation

The odds ratio associated with each 0.1 increase in
fat concentration was 2.90 (95% CI: 2.34, 3.59).

Interpretation: each 0.1 increase in fat concentration
was associated with a 2.9-fold increase in the odds of
disliking the product. Since the 95% confidence interval
excludes 1, this association was statistically significant
at the p < 0.05 level.
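Continuing from the glm fit above, the 2.90 figure and its confidence interval can be recovered from the slope (a sketch; the sign is flipped because the model predicts the probability of liking, so the odds of disliking rise by exp(0.1 × 10.662) per 0.1 increase):

```r
b <- coef(summary(logistic))["conc", "Estimate"]      # -10.662
se <- coef(summary(logistic))["conc", "Std. Error"]   # 1.083
exp(-0.1 * b)                            # OR of disliking per 0.1 increase: 2.90
exp(-0.1 * (b + c(1.96, -1.96) * se))    # 95% CI: about 2.35 to 3.59
```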
Multiple logistic regression
Outcome: fracture (fx: 0 = no, 1 = yes)
Predictor variables: age, bmi, bmd, ictp, pinp
Question: Which variables are important for fracture?
id    fx   age   bmi       bmd     ictp    pinp
1      1    79   24.7252   0.818   9.170   37.383
2      1    89   25.9909   0.871   7.561   24.685
3      1    70   25.3934   1.358   5.347   40.620
4      1    88   23.2254   0.714   7.354   56.782
5      1    85   24.6097   0.748   6.760   58.358
6      0    68   25.0762   0.935   4.939   67.123
7      0    70   19.8839   1.040   4.321   26.399
8      0    69   25.0593   1.002   4.212   47.515
9      0    74   25.6544   0.987   5.605   26.132
10     0    79   19.9594   0.863   5.204   60.267
...
137    0    64   38.0762   1.086   5.043   32.835
138    1    80   23.3887   0.875   4.086   23.837
139    0    67   25.9455   0.983   4.328   71.334
Multiple logistic regression: R analysis
setwd("c:/works/stats")
fracture <- read.table("fracture.txt",
    header=TRUE, na.strings=".")
names(fracture)
fulldata <- na.omit(fracture)
attach(fulldata)
temp <- glm(fx ~ ., family="binomial",
    data=fulldata)
search <- step(temp)
summary(search)
Bayesian Model Average (BMA) analysis
library(BMA)
xvars <- fulldata[, 3:7]
y <- fx
bma.search <- bic.glm(xvars, y, strict=FALSE,
    OR=20, glm.family="binomial")
summary(bma.search)
imageplot.bma(bma.search)
Bayesian Model Average (BMA) analysis
> summary(bma.search)
Call:
Best 5 models (cumulative posterior probability = 0.8836):

           p!=0     EV        SD      model 1   model 2   model 3   model 4   model 5
Intercept  100   -2.85012   2.8651    -3.920    -1.065    -1.201    -8.257    -0.072
age         15.3   0.00845   0.0261      .         .         .       0.063      .
bmi         21.7  -0.02302   0.0541      .         .       -0.116      .      -0.070
bmd         39.7  -1.34136   1.9762      .       -3.499      .         .      -2.696
ictp       100.0   0.64575   0.1699    0.606     0.687     0.680     0.554     0.714
pinp         5.7  -0.00037   0.0041      .         .         .         .         .

nVar                                      1         2         2         2         3
BIC                                  -525.044  -524.939  -523.625  -522.672  -521.032
post prob                               0.307     0.291     0.151     0.094     0.041
Bayesian Model Average (BMA) analysis
> imageplot.bma(bma.search)

[Figure: "Models selected by BMA" – an image plot with the variables age,
bmi, bmd, ictp and pinp on the rows and models 1–5 on the columns; ictp
is included in every selected model.]
Summary of main points

The logistic regression model is used to analyse the
association between a binary outcome and one or
more determinants.

The determinants can be binary, categorical or
continuous measurements.

The model is logit(p) = log[p / (1-p)] = a + bX,
where X is a factor, and a and b must be estimated
from observed data.
Summary of main points

Exp(b) is the odds ratio associated with a one-unit
increase in the determinant X.

The logistic regression model can be extended to
include many determinants:
logit(p) = log[p / (1-p)] = a + b1X1 + b2X2 + b3X3 + …
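As a sketch of this extended model, a fit with several determinants (using the fracture data and the variable names defined on the earlier slides) looks like:

```r
# multi-predictor logistic regression on the fracture data
model <- glm(fx ~ age + bmi + bmd + ictp + pinp,
             family="binomial", data=fulldata)
summary(model)
exp(coef(model))   # odds ratios per one-unit increase in each determinant
```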