Introduction to Logistic Regression Analysis Dr Tuan V. Nguyen Garvan Institute of Medical Research Sydney, Australia.
Introductory example 1
Gender difference in preference for white wine. A group of 57 men and 167 women were asked to state their preference for a new white wine. The results are as follows:

Gender   Like   Dislike   ALL
Men       23      34       57
Women     35     132      167
ALL       58     166      224

Question: Is there a gender effect on the preference?

Introductory example 2
Fat concentration and preference. 435 samples of a sauce of various fat concentrations were tasted by consumers. There were two outcomes: like or dislike. The results are as follows:

Concentration   Like   Dislike   ALL
1.35             13       0       13
1.60             19       0       19
1.75             67       2       69
1.85             45       5       50
1.95             71       8       79
2.05             50      20       70
2.15             35      31       66
2.25              7      49       56
2.35              1      12       13

Question: Is there an effect of fat concentration on the preference?

Consideration
The question in example 1 can be addressed by "traditional" analyses such as the z-statistic or the Chi-square test. The question in example 2 is more difficult to handle, because the factor (fat concentration) is a continuous variable while the outcome is a categorical variable (like or dislike). However, there is a much better and more systematic method to analyze both data sets: logistic regression.

Odds and odds ratio
Let P be the probability of preference; then the odds of preference is: O = P / (1-P)

Gender   Like   Dislike   ALL   P(like)
Men       23      34       57    0.403
Women     35     132      167    0.209
ALL       58     166      224    0.259

Omen = 0.403 / 0.597 = 0.676
Owomen = 0.209 / 0.791 = 0.265
Odds ratio: OR = Omen / Owomen = 0.676 / 0.265 = 2.55
(Meaning: the odds of preference is 2.55 times higher in men than in women)

Meanings of odds ratio
OR > 1: the odds of preference is higher in men than in women
OR < 1: the odds of preference is lower in men than in women
OR = 1: the odds of preference in men is the same as in women
How to assess the "significance" of OR?

Computing variance of odds ratio
The significance of OR can be tested by calculating its variance.
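As a quick check, the odds ratio and its 95% confidence interval can be computed directly from the 2x2 counts in a few lines of R (a sketch; the same numbers are derived step by step below):

```r
# Counts from the 2x2 table: men (like, dislike), women (like, dislike)
men.like <- 23; men.dislike <- 34
wom.like <- 35; wom.dislike <- 132

or <- (men.like/men.dislike) / (wom.like/wom.dislike)  # odds ratio, about 2.55
log.or <- log(or)                                      # work on the log scale
se <- sqrt(1/men.like + 1/men.dislike + 1/wom.like + 1/wom.dislike)
ci <- exp(log.or + c(-1.96, 1.96) * se)   # 95% CI of OR, about 1.34 to 4.87
round(c(OR = or, lower = ci[1], upper = ci[2]), 2)
```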
The variance of OR can be calculated indirectly by working on the logarithmic scale:
Convert OR to log(OR)
Calculate the variance of log(OR)
Calculate the 95% confidence interval of log(OR)
Convert back to the 95% confidence interval of OR

Computing variance of odds ratio

Gender   Like   Dislike
Men       23      34
Women     35     132
ALL       58     166

OR = (23/34) / (35/132) = 2.55
log(OR) = log(2.55) = 0.937
Variance of log(OR): V = 1/23 + 1/34 + 1/35 + 1/132 = 0.109
Standard error of log(OR): SE = sqrt(0.109) = 0.330
95% confidence interval of log(OR): 0.937 ± 1.96 × 0.330 = 0.290 to 1.584
Convert back to the 95% confidence interval of OR: exp(0.290) = 1.34 to exp(1.584) = 4.87

Logistic analysis by R

sex <- c(1, 2)
like <- c(23, 35)
dislike <- c(34, 132)
total <- like + dislike
prob <- like/total
logistic <- glm(prob ~ sex, family="binomial", weights=total)

Gender   Like   Dislike
Men       23      34
Women     35     132
ALL       58     166

> summary(logistic)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.5457     0.5725   0.953  0.34044
sex          -0.9366     0.3302  -2.836  0.00456 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7.8676e+00 on 1 degree of freedom
Residual deviance: 2.2204e-15 on 0 degrees of freedom
AIC: 13.629

Logistic regression model for continuous factor

Concentration   Like   Dislike   P(like)
1.35             13       0      1.00
1.60             19       0      1.00
1.75             67       2      0.971
1.85             45       5      0.900
1.95             71       8      0.899
2.05             50      20      0.714
2.15             35      31      0.530
2.25              7      49      0.125
2.35              1      12      0.077

[Figure: probability of liking plotted against fat concentration; the proportion liking falls steeply at concentrations above about 2.0]

Analysis by using R

conc <- c(1.35, 1.60, 1.75, 1.85, 1.95, 2.05, 2.15, 2.25, 2.35)
like <- c(13, 19, 67, 45, 71, 50, 35, 7, 1)
dislike <- c(0, 0, 2, 5, 8, 20, 31, 49, 12)
total <- like + dislike
prob <- like/total
plot(prob ~ conc, pch=16, xlab="Concentration")

Logistic regression model for continuous factor - model
Let p = probability of preference. The logit of p is: logit(p) = log[p / (1 - p)]
Model: logit(p) = a + b(FAT), where a is the intercept and b is the slope, both of which have to be estimated from the data.

Analysis by using R

logistic <- glm(prob ~ conc, family="binomial", weights=total)
summary(logistic)

Deviance Residuals:
     Min       1Q   Median       3Q      Max
-1.78226 -0.69052  0.07981  0.36556  1.36871

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   22.708      2.266  10.021   <2e-16 ***
conc         -10.662      1.083  -9.849   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)
Null deviance: 198.7115 on 8 degrees of freedom
Residual deviance: 8.5568 on 7 degrees of freedom
AIC: 37.096

Logistic regression model for continuous factor - interpretation
The odds ratio associated with each 0.1 increase in fat concentration was 2.90 (95% CI: 2.34, 3.59). Interpretation: each 0.1 increase in fat concentration was associated with a 2.90-fold increase in the odds of disliking the product.
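The per-0.1 odds ratio quoted above can be reproduced from the fitted coefficients; a minimal sketch (it simply refits the model shown on this slide):

```r
# Refit the fat-concentration model from the slide
conc <- c(1.35, 1.60, 1.75, 1.85, 1.95, 2.05, 2.15, 2.25, 2.35)
like <- c(13, 19, 67, 45, 71, 50, 35, 7, 1)
dislike <- c(0, 0, 2, 5, 8, 20, 31, 49, 12)
total <- like + dislike
prob <- like / total

logistic <- glm(prob ~ conc, family = "binomial", weights = total)
b  <- coef(logistic)["conc"]                                 # about -10.66
se <- summary(logistic)$coefficients["conc", "Std. Error"]   # about 1.08

# Odds ratio for disliking per 0.1 increase in fat concentration
or.dislike <- exp(-0.1 * b)                    # about 2.90
ci <- exp(-0.1 * (b + c(1.96, -1.96) * se))    # about 2.34 to 3.59
```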
Since the 95% confidence interval excludes 1, this association was statistically significant at the p < 0.05 level.

Multiple logistic regression
Outcome: fracture (fx: 0=no, 1=yes)
Predictor variables: age, bmi, bmd, ictp, pinp
Question: Which variables are important for fracture?

id    fx   age   bmi       bmd    ictp   pinp
1     1    79    24.7252   0.818  9.170  37.383
2     1    89    25.9909   0.871  7.561  24.685
3     1    70    25.3934   1.358  5.347  40.620
4     1    88    23.2254   0.714  7.354  56.782
5     1    85    24.6097   0.748  6.760  58.358
6     0    68    25.0762   0.935  4.939  67.123
7     0    70    19.8839   1.040  4.321  26.399
8     0    69    25.0593   1.002  4.212  47.515
9     0    74    25.6544   0.987  5.605  26.132
10    0    79    19.9594   0.863  5.204  60.267
...
137   0    64    38.0762   1.086  5.043  32.835
138   1    80    23.3887   0.875  4.086  23.837
139   0    67    25.9455   0.983  4.328  71.334

Multiple logistic regression: R analysis

setwd("c:/works/stats")
fracture <- read.table("fracture.txt", header=TRUE, na.strings=".")
names(fracture)
fulldata <- na.omit(fracture)
attach(fulldata)
temp <- glm(fx ~ age + bmi + bmd + ictp + pinp, family="binomial", data=fulldata)
search <- step(temp)
summary(search)

Bayesian Model Average (BMA) analysis

library(BMA)
xvars <- fulldata[, 3:7]
y <- fx
bma.search <- bic.glm(xvars, y, strict=FALSE, OR=20, glm.family="binomial")
summary(bma.search)
imageplot.bma(bma.search)

Bayesian Model Average (BMA) analysis

> summary(bma.search)
Call:
Best 5 models (cumulative posterior probability = 0.8836):

            p!=0    EV        SD      model 1   model 2   model 3   model 4   model 5
Intercept  100    -2.85012   2.8651   -3.920    -1.065    -1.201    -8.257    -0.072
age         15.3   0.00845   0.0261    .         .         .         0.063     .
bmi         21.7  -0.02302   0.0541    .         .        -0.116     .        -0.070
bmd         39.7  -1.34136   1.9762    .        -3.499     .         .        -2.696
ictp       100.0   0.64575   0.1699    0.606     0.687     0.680     0.554     0.714
pinp         5.7  -0.00037   0.0041    .         .         .         .         .

nVar                                   1         2         2         2         3
BIC                                 -525.044  -524.939  -523.625  -522.672  -521.032
post prob                              0.307     0.291     0.151     0.094     0.041

Bayesian Model Average (BMA) analysis
> imageplot.bma(bma.search)
[Figure: image plot of the models selected by BMA, showing which of age, bmi, bmd, ictp and pinp enter each of the top models; ictp appears in all of them]

Summary of main points
The logistic regression model is used to analyze the association between a binary outcome and one or many determinants. The determinants can be binary, categorical or continuous measurements.
The model is logit(p) = log[p / (1-p)] = a + bX, where X is a factor, and a and b must be estimated from observed data.

Summary of main points
exp(b) is the odds ratio associated with a one-unit increment in the determinant X.
The logistic regression model can be extended to include many determinants:
logit(p) = log[p / (1-p)] = a + bX1 + gX2 + dX3 + ...
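The summary relations above can be illustrated numerically. A small sketch using R's built-in qlogis/plogis, with the intercept and slope taken from the fat-concentration model fitted earlier:

```r
# logit(p) = log[p / (1 - p)]; the inverse transform recovers p
p <- 0.403                                   # P(like) among men, from example 1
logit.p <- log(p / (1 - p))                  # same as qlogis(p)
p.back <- exp(logit.p) / (1 + exp(logit.p))  # same as plogis(logit.p)

# For a fitted model logit(p) = a + b*X, exp(b) is the odds ratio for a
# one-unit increase in X. Using a = 22.708 and b = -10.662 (fat model):
a <- 22.708; b <- -10.662
p.at <- function(x) plogis(a + b * x)
p.at(1.75)   # high probability of liking at a low fat concentration
p.at(2.25)   # low probability of liking at a high fat concentration
```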