Logistic Regression and Generalized Linear Models:

Download Report

Transcript Logistic Regression and Generalized Linear Models:

Logistic Regression and
Generalized Linear Models:
Blood Screening,
Women’s Role in Society,
and Colonic Polyps
Blood Screening
• ESR measurements (erythrocyte
sedimentation rate)
– Study checks for the rate of increase of ESR
when (blood proteins) fibrinogen and globulin
increase
• Looking for an association between the probability
of an ESR reading > 20mm/hr and the levels of the
two plasma proteins.
• Less than 20mm/hr indicates a healthy individual
Multiple Regression Doesn’t Work
• Not Normally distributed
– The response variable is binary.
– In fact, the distribution is Binomial. The can be
seen by looking at how the error term relates
to the probability. If y = 1 then the error term is
1- P(y=1). We will assume for our Random
Variables Y that
Logistic Regression, a Generalized
Linear Model
• Modelling the expected value of the
response requires a transformation
– This chapter is about the logit transformation
of a model, which is of the form
• If the response variable is 1, the log odds of the
response is the logit function of a probability.
• Fixing all explanatory variables but xj, exp(Bj)
represents the odds that the response variable is 1
when xj increases by 1.
R commands
• Plasma_glm_1 <- glm(ESR ~
fibrinogen, data = plasma,
family = binomial())
–
•
Layout(matrix(1:2, ncol = 2))
–
•
Fits the model, in this example
specifying the logistic function is
implied because of the binomial()
parameter is already passed.
Lets you choose the layout of your
graph screen.
Cdplot(ESR ~ fibrinogen, data =
plasma)
• Cdplot(ESR ~ globulin, data =
plasma)
–
cdplot “plots conditional densities
describing how the conditional
distribution of a categorical variable y
changes over a numerical variable x”
(help(“cdplot”))
Interpretation
• The area of the dark region is the
probability that ESR < 20 mm/hr, and this
decreases as the protein levels increase
• There is not much shape to the globulin
density function.
R Commands
• Confint(plasma_glm_1, parm = “fibrinogen”)
– Gives a confidence interval
– Output:
• Waiting for profiling to be done...
•
2.5 % 97.5 %
• 0.3389465 3.9988602
• Exp(coef(plasma_glm_1)[“fibrinogen”])
– This is necessary to do the reverse
transformation to get the model coefficients
– Output:
• fibrinogen
• 6.215715
R Commands
• Summary(plasma_glm_1)
– Output:
• Deviance Residuals:
•
Min
1Q Median
3Q
Max
• -0.9298 -0.5399 -0.4382 -0.3356 2.4794
•
•
•
•
•
•
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.8451 2.7703 -2.471 0.0135 *
fibrinogen 1.8271 0.9009 2.028 0.0425 *
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
• (Dispersion parameter for binomial family taken to be 1)
•
Null deviance: 30.885 on 31 degrees of freedom
• Residual deviance: 24.840 on 30 degrees of freedom
R Commands
• Exp(confint(plasma_glm_1, parm =
“fibrinogen”))
– Output:
• 2.5 % 97.5 %
• 1.403468 54.535954
• Plasma_glm_2 <- glm(ESR ~ fibrinogen +
globulin, data = plasma, family = binomial())
– Use logistic regression for both variables
fibrinogen and globulin
• Summary(Plasma_glm_2)
R Commands
• Summary(Plasma_glm_2)
– Output
• Deviance Residuals:
•
Min
1Q Median
3Q
Max
• -0.9683 -0.6122 -0.3458 -0.2116 2.2636
•
•
•
•
•
•
•
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.7921 5.7963 -2.207 0.0273 *
fibrinogen 1.9104 0.9710 1.967 0.0491 *
globulin
0.1558 0.1195 1.303 0.1925
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
• (Dispersion parameter for binomial family taken to be 1)
•
Null deviance: 30.885 on 31 degrees of freedom
• Residual deviance: 22.971 on 29 degrees of freedom
Interpretation
• large confidence interval is because there
are not many observations in all, and even
fewer ESR > 20mm/hr
• The globulin coefficient is basically zero
R Commands
•
Anova(plasma_glm_1,
plasma_glm_2, test = “Chisq”)
– Compares the two models one of just
fibrinogen, the other of both
fibrinogen and globulin
– Why Chisquared test?
• ANOVA assumes normal distribution
for each data set. Why is this OK?
– Output
• Analysis of Deviance Table
• Model 1: ESR ~ fibrinogen
• Model 2: ESR ~ fibrinogen +
globulin
• Resid. Df Resid. Dev Df Deviance
P(>|Chi|)
• 1
30 24.8404
• 2
29 22.9711 1 1.8692
0.1716
•
Comparing the residual deviance
between the two models, we see that
the difference is not significant. We
must surmise that there is no
association between globulin and
ESR level.
R Commands
• Prob <- predict(plasma_glm_1, type = “response”
– The model of just fibrinogen useful in creating a neat looking
bubble plot. First we must take the predicted values from the first
model and use them to determine the size of the bubbles
• Plot(globulin ~ fibrinogen, data = plasma, xlim = c(2, 6),
ylim = c(25, 50), pch = “.”)
– Plots the second model
• Symbols(plasma$fibrinogen, plasma$globulin, circles =
prob, add = TRUE)
– Uses the values of the first model to create different bubble
sizes.
Bubbles size is Probability
• The plot shows how
the probability of
having an ESR >
20mm/hr increases
as fibrinogen
increases
Generalized Linear Model
• Unifies the logistic regression, Analysis of
Variance and multiple regression
techniques.
• GLMs have three essential parts
– An error distribution- the distribution of the
response variable.
• Normal for Analysis of Variance and Multiple
Regression
• Binomial for Logistic Regression
Main parts of GLM (continued)
• A link function which links the explanatory
variables to the expected value of the
response.
– Logit function for logistic regression
– Identity function for ANOVA and multiple
regression
• A variance function which shows the
dependency of the response variable
variability on the mean
Measure of Fit
• The deviance shows how well the model
fits the data
• Comparing two model’s deviances
– Use a likelihood ratio test
– Compare using Chi-square distribution
Women’s Role in Society
• Response to “Women should take care of
running their homes and leave the running
the country up to men”.
• Factors: Education, Sex
• Response: Agree or Disagree
– data is presented as categories with counts
for each education, sex combination
R Commands
• Womensrole_glm_1 <- glm(cbind(agree,
disagree) ~ sex + education, data =
womensrole, family= binomial())
– This uses the cbind function to change data
from two responses to one response that is a
matrix of agree, disagree counts.
• Summary(womensrole_glm_1)
– We can see that education is a significant
factor to the response.
Summary Output
•
•
•
Deviance Residuals:
Min
1Q Median
3Q
Max
-2.72544 -0.86302 -0.06525 0.84340 3.13315
•
•
•
•
•
•
•
•
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.50937 0.18389 13.646 <2e-16 ***
Why is there a P-value for the intercept?
sexFemale -0.01145 0.08415 -0.136 0.892
education -0.27062 0.01541 -17.560 <2e-16 ***
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
•
(Dispersion parameter for binomial family taken to be 1)
•
•
Null deviance: 451.722 on 40 degrees of freedom
Residual deviance: 64.007 on 38 degrees of freedom
Declaring a function in R
• Myplot <- function(role.fitted) {
– Declares the function myplot and passes it an object
“role.fitted” to be used in the function
• f <- womensrole$sex == “Female”
– Stores everything in the data with sex = femail in f
• plot(womensrole$education, role.fitted, type =
"n", ylab = "probability of agreeing", xlab =
"Education", ylim = c(0,1))
– Plots education against the role.fitted object which will
be some predicted values from our GLM defined as
• Role.fitted1 <- predict(womensrole_glm_1, type =
“response”)
Myplot function (cont)
• lines(womensrole$education[!f], role.fitted[!f], lty
= 1)
• lines(womensrole$education[f], role.fitted[f], lty =
2)
– These graph the lines, one for males and one for
femails. Lty indicates the kind of line (in this case solid
or dotted)
• lgtxt <- c("Fitted (Males)", "Fitted (Females)")
• legend("topright", lgtxt, lty = 1:2, bty = "n")
– These add a legend for each line.
Myplot Function (cont)
• y<-womensrole$agree/
(womensrole$agree +
womensrole$disagree)
– A basic calculation of the proportion of women
that agree
• text(womensrole$education, y, ifelse(f,
"\\VE", "\\MA"), family = "HersheySerif",
cex = 1.25)
– Plots y and not y using male/female symbols
Interpretation
• The two fitted lines
indicate that sex does
not change a
probability of agreeing
vs education
• The symbols of
unfitted observations
may indicate an
interaction between
sex and education
MyPlot Sex:Education
• By running the same analysis,
e.g.
– Womensrole_glm_2 <glm(cbind(agree, disagree) ~
sex*education, data =
womensrole, family = binomial())
– Summary(womensrole_glm_2)
• This summary shows a
significant p-value for the
sex:education interaction
– Role.fitted2 <predict(womensrole_glm_1, type
= “response”)
– Myplot(role.fitted2)
• The plot shows that less
education is associated with
agreement that women belong in
the home.
The Deviance Residual
•
One of the many methods for
checking the adequacy of the
model fit
– The deviance residual is the
square root of the part of each
observation that contributes to the
deviance
•
Res <- residuals(
womensrole_glm_2, type =
“deviance”)
– Pulls the residuals for many other
models as well
•
Plot(predict(womensrole_glm_2),
res, xlab = “fitted values”, ylab =
“Residuals”, lim =
max(abs(res))*c(-1,1))
– No visible pattern: fit appears ok
•
Abline(h = 0, lty = 2)
– Adds a dotted line at height 0
Drug Treatment Testing
Famlilial Andenomatous Polyposis (FAP)
• Counts of colonic polyps after 12 months
of treatment
– Don’t want to be the guy that did that
• Placebo controlled (Binary factor)
• Age
GLM Analysis
• Multiple Regression won’t work.
– Count data strictly positive
– Normality is not probable
• Poisson Regression- GLM with a log link
function
– Ensures a Poisson Distribution
– Ensures positive fitted amounts
• R command:
– Polyps_glm_1 <- glm(number ~ treat + age, data =
polyps, family = poisson())
Model Summary
•
•
Call:
glm(formula = number ~ treat + age, family = poisson(), data = polyps)
•
•
•
Deviance Residuals:
Min
1Q Median
3Q
Max
-4.2212 -3.0536 -0.1802 1.4459 5.8301
•
•
•
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.529024 0.146872 30.84 < 2e-16 ***
• treatdrug -1.359083 0.117643 -11.55 < 2e-16 ***
• age
-0.038830 0.005955 -6.52 7.02e-11 ***
•
•
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
•
(Dispersion parameter for poisson family taken to be 1)
•
•
•
Null deviance: 378.66 on 19 degrees of freedom
Residual deviance: 179.54 on 17 degrees of freedom
AIC: 273.88
•
Number of Fisher Scoring iterations: 5
Model Problem: Overdispersion
• In previous models the variance can be seen as
dependent solely on the mean
– E.G. Binomial, Poisson
• In practice, this doesn’t always work.
– Sometimes the raw data points are not independent,
there is some correlation, in this case, possible
“clustering”
• Compare residual deviance and degrees of
freedom to determine
– These should be basically equal
Model Solution: Quasi-Likelihood
• This procedure estimates the other factors
that might contribute to the variance
• R Command
– Polyps_glm_2 <- glm(number ~ treat + age,
data = polyps, family = quasipoisson())
– Summary(polyps_glm_2)
• The coefficients are still significant, but
less so
Homework
• Please run through the commands for the
myplot function (pg 100) and send me the
command script.
– Please send me what you thought of the
presentation, give me a grade and add any
constructive criticism.
• [email protected]