Logistic Regression using SAS
prepared by
Voytek Grus
for
SAS user group, Halifax
February 24, 2006
What is Logistic Regression?
• Regression Analysis where the response variable Y is discrete and represents
either categories or counts. There are no restrictions on predictors.
Example data:
Y      | X1   | X2 | X3
green  | 46.8 | 1  | No
yellow | 15.9 | 0  | No
red    | 51.8 | 0  | Yes
• Linear regression equation of the type yi=α+βxi+ εi is not appropriate …
• … but, as in linear regression analysis, logistic regression is used to
– test the statistical significance of the relationship between response and predictor variables
– predict the category of an outcome given its predictors
• Falls into the category of generalized linear models and either complements or offers a flexible alternative to
– Multiple linear regression – similarity in equations and statistical diagnostics
– Contingency tables (cross tabulation)
– Loglinear models
– Discriminant analysis – answers similar questions but is less restrictive
• A relatively new statistical tool for the analysis of categorical data
– Contingency tables – 1900s
– Regression analysis – 1970s
– Loglinear models – 1975
– Logistic regression – late 70s/early 80s, but became more popular in the 90s
Fields of application
• Health sciences – questions about disease: yes or no?
• Social sciences – deal with a great many dichotomous variables: employed vs unemployed, married vs unmarried, etc.
– Attitude to work based on demographic or behavioral predictors
– Racial bias in judicial decisions, etc.
• Political science:
– Which party will voters vote for, and why?
– Which voters will vote for a particular party?
– Public opinion polls
• Economics and marketing – used to study consumer choice:
– Banks use it to assess the credit rating of customers
– Some regulators require that utilities submit customer choice studies on energy conservation options
– Choice of mode of transportation
– Used in demand forecasting
PART I
Conceptual Framework of
Logistic Regression
Why not use OLS to estimate the categorical
response equation?
• Multiple linear regression of a categorical response variable does not satisfy two of the assumptions of the linear model (3 and 5 below) necessary to produce unbiased and efficient coefficients.
1. Linearity of coefficients: yi = α + βxi + εi
2. E(εi) = 0
3. Heteroscedasticity: var(εi) ≠ σ2
• E(yi) = 1·P(yi=1) + 0·P(yi=0) = pi = α + βxi
• var(εi) = var(yi) = pi(1 − pi) = (α + βxi)(1 − α − βxi)
4. Errors are uncorrelated: cov(εi, εj) = 0
5. Errors are not normally distributed: εi ~ Binomial
– Errors take on only two values, εi = 1 − α − βxi or εi = 0 − α − βxi, because yi is bounded by 0 and 1.
6. As a result
– coefficient estimates are no longer efficient
– standard error estimates are no longer consistent
– estimated values of the response variable Y may be implausible, because the linear function is unbounded (estimates can fall outside the (0, 1) interval) even though the binary regression is a linear probability model: E(yi) = pi = α + βxi
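The out-of-bounds problem described above can be shown numerically. This is a minimal sketch with made-up data (not from the slides): a simple OLS fit of a binary y on x, whose fitted "probabilities" escape the [0, 1] interval.

```python
# Fit the linear probability model E(y_i) = a + b*x_i by closed-form OLS
# on a binary response; illustrative data only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 1, 1, 1, 1]  # binary response

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

fitted = [a + b * x for x in xs]
print([round(f, 3) for f in fitted])
# The largest fitted value exceeds 1 -- implausible as a probability.
print(max(fitted) > 1)
```

Any steeper slope or wider x range makes the violation worse; the linear function is unbounded while probabilities are not.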
Logit transformation – a remedy to the violation of OLS
assumptions
• Instead of estimating the linear equation yi = α + β1xi1 + β2xi2 + … + βkxik + εi
• we can apply the logit transformation: log[pi/(1 − pi)] = α + β1xi1 + β2xi2 + … + βkxik,
where pi/(1 − pi) is the odds that the event y = 1 will occur.
• Consequences:
– pi = exp(α + β1xi1 + … + βkxik)/(1 + exp(α + β1xi1 + … + βkxik)), which happens to be the cumulative logistic distribution function.
– No matter what the coefficients are, pi is always between 0 and 1.
– The absence of εi complicates statistical analysis: standardized coefficients?
– The derivative with respect to x is a function of p: dpi/dxi = βpi(1 − pi); it reflects the changing slope of the S-curve, making interpretation of coefficients difficult. One needs to be cautious when interpreting coefficients from the probability perspective.
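The two key consequences of the logit transformation can be sketched numerically. The coefficients below are hypothetical, chosen only to illustrate that the inverse logit is bounded and that the slope βp(1 − p) peaks where p = 0.5.

```python
import math

def inv_logit(z):
    """Cumulative logistic distribution function: maps any real z into (0, 1)."""
    return math.exp(z) / (1.0 + math.exp(z))

a, b = -2.0, 0.5  # hypothetical intercept and slope

# p stays strictly between 0 and 1 for any x
probs = [inv_logit(a + b * x) for x in (-10, 0, 4, 10)]
print(probs)
assert all(0 < p < 1 for p in probs)

# the marginal effect dp/dx = b*p*(1-p) changes along the S-curve
# and is largest where p = 0.5 (here at x = -a/b = 4)
slopes = [b * p * (1 - p) for p in probs]
print(max(slopes))  # = b * 0.5 * 0.5 = 0.125
```

This changing slope is exactly why a logit coefficient has no single "effect on probability": the same β moves p a lot near p = 0.5 and very little in the tails.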
Alternatives to the logit transformation in the context of
latent variables: probit and complementary log-log
• In a perfect world there is a model for a continuous response variable zi; the dichotomous logit model is only its simplification. There is a true equation zi = α0 + α1xi1 + α2xi2 + … + αkxik + σεi, but zi cannot be observed: it is latent. Instead we observe a dichotomous y whose values of 1 and 0 depend on whether zi crosses a threshold. Y's relationship with the predictors X depends on the probability distribution of ε.
• The assumed distribution of ε helps determine standardized coefficients.

Link                  | Distribution of ε                  | Standard deviation of ε | Link function (inverse of CDF of ε) | CDF of ε
Logit                 | Logistic: f(ε) = eε/(1 + eε)2      | σ = π/√3 ≈ 1.8138       | ƒ(p) = log(p/(1 − p))               | F(x) = ex/(1 + ex)
Probit                | Standard normal                    | σ = 1                   | ƒ(p) = Φ-1(p)                       | Φ(x) = (2π)-1/2 ∫-∞..x exp(−z2/2) dz
Complementary log-log | Double exponential (extreme value) | σ = π/√6 ≈ 1.28         | ƒ(p) = log(−log(1 − p))             | F(x) = 1 − exp(−exp(x))
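Two of the links in the table can be sketched directly (probit is omitted because its inverse CDF has no closed form). Each link maps a probability to the real line and its CDF maps back; the sketch also shows that, unlike the symmetric logit, the complementary log-log link is asymmetric around p = 0.5.

```python
import math

# Link f and its inverse (the CDF F) for logit and complementary log-log.
def logit(p):       return math.log(p / (1 - p))
def inv_logit(x):   return math.exp(x) / (1 + math.exp(x))

def cloglog(p):     return math.log(-math.log(1 - p))
def inv_cloglog(x): return 1 - math.exp(-math.exp(x))

# round trip: F(f(p)) = p for both links
for p in (0.1, 0.5, 0.9):
    assert abs(inv_logit(logit(p)) - p) < 1e-12
    assert abs(inv_cloglog(cloglog(p)) - p) < 1e-12

print(logit(0.5))    # 0.0 -- logit is symmetric around p = 0.5
print(cloglog(0.5))  # log(log 2), nonzero -- cloglog is not
```

The asymmetry is why the complementary log-log model is preferred when one outcome is rare or when the data arise from an extreme-value process.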
Logistic regression in the context of the generalized
linear models

Type of regression    | Link                  | Link function           | Distribution of Y       | Regression model                    | Error distribution       | Estimation procedure
Linear regression     | Identity              | ƒ(μ) = μ                | Normal                  | E(Y) = XTβ                          | Normal                   | OLS
Logistic regression   | Logit                 | ƒ(p) = log(p/(1 − p))   | Binomial or multinomial | E(Y) = exp(XTβ)/(1 + exp(XTβ))      | Binomial                 | ML, sometimes WLS
Logistic regression   | Probit                | ƒ(p) = Φ-1(p)           | Binomial or multinomial | E(Y) = Φ(XTβ)                       | Binomial                 | ML, sometimes WLS
Logistic regression   | Complementary log-log | ƒ(p) = log(−log(1 − p)) | Gompertz (extreme value)| E(Y) = 1 − exp(−exp(XTβ))           | Poisson, binomial, etc.  | ML
Poisson regression    | Log-linear            | ƒ(μ) = log(μ)           | Poisson                 | E(Y) = exp(XTβ)                     | Poisson                  | ML
?                     | Inverse               | ƒ(μ) = 1/μ              | Gamma                   | E(Y) = 1/(XTβ)                      | Gamma                    | ML
Log-linear regression | Cumulative log-log    | ƒ(p) = log(−log(1 − p)) | Gompertz (extreme value)| E(Y) = 1 − exp(−exp(XTβ))           | Poisson, binomial, etc.  | ML
I. Logistic regression compared to ordinary linear
regression

Analytical tool | Ordinary linear regression | Logistic regression
Coefficient interpretation | In general, coefficients have meaning. | Coefficients have no intuitive meaning except for the sign; use the adjusted odds ratio (ecoefficient) instead.
Coefficient confidence intervals and hypothesis testing | t test; partial and sequential F tests | Wald confidence interval / profile-likelihood confidence interval / maximum-likelihood interval
Global hypothesis testing, H0 vs H1 | F test = (SSreg/1)/(SSres/(n − 2)) | Likelihood ratio test: Λ = max[lik(θ), θ ∈ ω0] / max[lik(θ), θ ∈ Ω]; under the null H0, −2 log Λ ~ χ21. Also the Wald chi-square statistic and the score test.
Goodness of fit | R2 = SSreg/SStotal = 1 − SSres/SStotal; AIC, adjusted R2, PRESS | Deviance ("is there a better model than this one?") = −2 log(max[lik(θ)] fitted / max[lik(θ)] saturated); global chi-square ("is this model better than nothing?") Σcells (Oi − Ei)2/Ei ~ χ2; Hosmer-Lemeshow test; ROC curve
Model (variable) selection method | Direct, forward, backward, stepwise; MAXR, MINR, RSQUARE, ADJRSQ, Mallows Cp | Direct, forward, backward, stepwise, score
Multicollinearity | Detect collinear variables or groups of variables using PROC REG (TOL, VIF, COLLINOINT) and/or PROC CORR. | The same.
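The e^coefficient rule in the table's first row can be checked numerically. This is a minimal sketch with hypothetical coefficients: in a logit model a one-unit increase in x multiplies the odds p/(1 − p) by exp(β), regardless of where on the curve you start.

```python
import math

a, b = -1.0, 0.7  # hypothetical intercept and slope

def odds(x):
    # odds p/(1-p) = exp(linear predictor) under the logit model
    return math.exp(a + b * x)

# moving x from 2 to 3 multiplies the odds by exactly exp(b)
ratio = odds(3.0) / odds(2.0)
print(ratio, math.exp(b))
assert abs(ratio - math.exp(b)) < 1e-9

# the same ratio holds at any starting point
assert abs(odds(10.0) / odds(9.0) - math.exp(b)) < 1e-9
```

This constancy of the odds ratio is what makes e^β the preferred interpretation, in contrast to the effect on p itself, which varies along the S-curve.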
II. Logistic regression compared to ordinary linear
regression

Analytical tool | Ordinary linear regression | Logistic regression
Influence diagnostics, residuals, predictive power, etc. | DFBETAS, DFFITS, Cook's distance, studentized residuals, partial residual plots, predicted values | DFBETAS, DIFCHISQ, DIFDEV; residuals: deviance, Pearson, raw; predicted values
Non-constant error variance | May transform the response Y to stabilize variance (log(Y), 1/Y, sqrt(Y)) or run WLS. | Over-dispersion/under-dispersion: lack of fit due to an underspecified model or dependence of observations.
Autocorrelation / dependence of observations | Cause: correlation of ε's in time-series regression. Durbin-Watson to diagnose; use autoregression to combat. | Cause: clustered or longitudinal data. Use GEE estimation or conditional logit analysis.
Estimation | OLS | ML, WLS, OLS
Unobserved heterogeneity (heterogeneity shrinkage) | – | Coefficients are related to the underlying continuous model: βj = αj/σ. The random disturbance may reflect omitted explanatory variables; include predictors known to be important even in the absence of statistical significance.
PART II
SAS Application of Logistic
Regression
Summary of SAS procedures for logistic regression
analysis
• Binary logit analysis:
– PROCs: LOGISTIC, GENMOD, CATMOD, PROBIT, MDC, NLMIXED
• Multinomial logit analysis
– Predictors are characteristics of the individual
• Nominal (no ordering of Y's): PROC LOGISTIC; PROC CATMOD
• Ordinal (inherent ordering of Y's): PROC LOGISTIC; PROC CATMOD; PROC GENMOD
• Conditional logit analysis
– Predictors are characteristics of the response alternatives
• Can use PROC MDC and PROC PHREG.
– Logit analysis of clustered data:
• PROC LOGISTIC (or PROC PHREG)
• PROC GENMOD (GEE)
Binary Logit Models
• PROC LOGISTIC at its simplest: main-effects model
1. Individual-level data:
   PROC LOGISTIC DATA=input;
     FREQ frequency; /* optional */
     MODEL y = X1 X2;
   RUN;
2. Grouped data:
   PROC LOGISTIC DATA=input;
     MODEL events/trials = X1 X2;
   RUN;
• PROC LOGISTIC with more features:
   PROC LOGISTIC DATA=lrdata.penalty DESCENDING;
     CLASS culp;
     MODEL death = blackd|whitvic|culp / STB LACKFIT AGGREGATE RSQ
           LINK=LOGIT TECHNIQUE=NEWTON CLODDS=PL CLODDS=WALD
           SELECTION=STEPWISE SCALE=WILLIAMS CORRB INFLUENCE IPLOTS;
     UNITS culp=2 / DEFAULT=1;
     OUTPUT OUT=results PRED=phat LOWER=lb UPPER=up
            RESCHI=stres DFBETAS=dfs;
   RUN;
• PROC GENMOD at its simplest:
   PROC GENMOD DATA=lrdata.penalty;
     MODEL y = X1 X2 / DIST=BINOMIAL;
   RUN;
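What TECHNIQUE=NEWTON asks PROC LOGISTIC to do internally can be sketched on a toy case. This is an illustration, not SAS's implementation: Newton-Raphson on the log-likelihood of an intercept-only logit model, whose MLE of p is simply the sample proportion.

```python
import math

ys = [1, 0, 1, 1, 0, 1, 1, 1]  # toy binary data: 6 successes out of 8

a = 0.0  # intercept on the log-odds scale, starting value
for _ in range(25):
    p = math.exp(a) / (1 + math.exp(a))
    score = sum(ys) - len(ys) * p    # first derivative of the log-likelihood
    info = len(ys) * p * (1 - p)     # observed information (negative 2nd derivative)
    a += score / info                # Newton-Raphson step

p_hat = math.exp(a) / (1 + math.exp(a))
print(a, p_hat)  # a converges to log(3), p_hat to 6/8 = 0.75
```

With predictors the same iteration runs on a vector of coefficients using the score vector and information matrix, which is also where the Wald standard errors come from.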
Multinomial Logit Models
• Multinomial logit for a nominal response (generalized logit)
– The binary logit transformation log(pi/(1 − pi)) does not extend directly to more than 2 categories.
– Instead, with category k as the baseline, k − 1 equations are estimated: log(pij/pik) = αj + βjxi, where j = 1, 2, …, k − 1.
• Multinomial logit for an ordinal response (cumulative, adjacent categories, continuation ratio)
– The inherent ordering of the Y responses allows relaxing the assumption of multiple odds equations.
– Estimate k − 1 equations for the odds of the cumulative probabilities Fij:
• log(Fij/(1 − Fij)) = αj + βxi – all coefficients except the intercepts stay the same.
– Because there is a hierarchy in the categories of the response variable:
• the model is easier to estimate and interpret
• hypothesis tests are more powerful
• there is one coefficient per predictor but k − 1 intercepts.
• Available tools in SAS:
1. PROC LOGISTIC DATA=lrdata.wallet;
     MODEL wallet = male business punish explain / LINK=GLOGIT; /* or link=clogit */
   RUN;
2. PROC CATMOD DATA=lrdata.wallet;
     DIRECT male business punish explain;
     MODEL wallet = male business punish explain / NOITER PRED;
   RUN;
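The generalized (baseline-category) logit described above can be sketched numerically. The coefficients are hypothetical: given k − 1 equations log(pij/pik) = αj + βjxi, the category probabilities are recovered by exponentiating and normalizing (the baseline gets exp(0) = 1).

```python
import math

def glogit_probs(x, params):
    """params: list of (a_j, b_j) for j = 1..k-1; category k is the baseline."""
    scores = [math.exp(a + b * x) for a, b in params] + [1.0]
    total = sum(scores)
    return [s / total for s in scores]

# three categories, two hypothetical logit equations against the baseline
p = glogit_probs(1.0, [(0.2, 0.5), (-0.4, 1.0)])
print(p)
assert abs(sum(p) - 1.0) < 1e-12        # probabilities sum to 1

# each fitted logit is recovered from the probabilities
assert abs(math.log(p[0] / p[2]) - (0.2 + 0.5 * 1.0)) < 1e-12
assert abs(math.log(p[1] / p[2]) - (-0.4 + 1.0 * 1.0)) < 1e-12
```

This is the model PROC LOGISTIC fits with LINK=GLOGIT; each non-baseline category gets its own αj and βj.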
Conditional logit Models
• Consumer choice studies
– Consumer taste preferences, choice of mode of transportation, locational characteristics for a retail store, etc.
– Conditional logit:
   proc mdc;
     model decision = x1 x2 / type=clogit choice=(mode 1 2 3);
     id pid;
   run;
– Nested logit:
   proc mdc data=newdata;
     model decision = ttime / type=nlogit choice=(mode 1 2 3) covest=hess;
     id pid;
     utility u(1,) = ttime;
     nest level(1) = (1 2 @ 1, 3 @ 2),
          level(2) = (1 2 @ 1);
   run;
• Analysis of clustered data
– Observations within clusters can often be dependent: longitudinal data, students clustered in classrooms or schools, husbands and wives clustered in families, etc.
– Dependent observations produce underestimated standard errors, overestimated test statistics, and inefficient coefficient estimates.
– Remedies: GEE (PROC GENMOD), conditional logit (PROC LOGISTIC or PROC PHREG), and other methods such as mixed models or hybrids of the above.
Consumer choice modeling: nested logit example
• Decision tree:
   Top
     Level 2: 1 (public)  -> Level 1: 1 (plane), 2 (train), 3 (bus)
     Level 2: 2 (private) -> Level 1: 4 (car)
• Example:
   proc mdc data=travel2 maxit=200 outest=a;
     model choice = ttime time cost / type=nlogit choice=(mode);
     id id;
     utility u(1, 1 2 3 @ 1) = ttime time cost,
             u(1, 4 @ 2) = time cost;
     nest level(1) = (1 2 3 @ 1, 4 @ 2),
          level(2) = (1 2 @ 1);
   run;
Literature
• Logistic Regression Using the SAS System by Paul D. Allison (4th edition, August 2003)
• Categorical Data Analysis Using the SAS System by Maura E. Stokes, Charles S. Davis, and Gary G. Koch (4th edition, January 2005)
• Multivariate Statistical Methods by B. Tabachnik (1996)
• SAS Help examples
Questions?