Transcript: LRDV Models

Regression Models
for Predicting Limited
Range Dependent Variables
David A. Harrison
Smeal College of Business
[email protected]
CARMA Consortium Webcast, 2006-2007 Program
Limited Range Dependent Variables
• Organizational research questions often involve
criteria (DVs) that don’t form smooth, tidy,
unimodal piles, especially in short time periods
– Firm level
• mergers, acquisitions, spin-offs, bankruptcies, other strategic
and structural changes
– Team level
• member exits and entries, dissolution, task completion,
settlement or impasse
– Individual level
• grievance-filing, accidents or injuries, promotion,
commendation, dismissal or turnover
Problematic Data
• Such LRDVs create estimation, prediction, and
interpretation problems when used as criteria in
typical, multiple linear regression (MLR) models
– nonsensical point predictions (negative counts,
probabilities > 1.0)
– incorrect standard errors, which create ...
– (mostly) overly optimistic statistical tests
• LRDV regression models fit the bill
– built for these troublesome situations
– now in all major stat packages, but still hard to use
correctly, because of quirky distinctions from MLR
What are LRDV Model Situations?
• Criterion is discrete or bounded in severe way;
small number of values; truncated or “censored”
• Dichotomous (binary: 0, 1)
– predicting “if”; variance reflects the event happening (1)
vs. not (0) across entities (ind’s, grp’s, org’s) being studied
• Count (0, 1, 2, 3, ...)
– predicting “how often” up to finite point
• As dichotomous combined w/“time until” event
occurs, becomes an event history LRDV model
– predicting “when”
LRDV Models Are Related
• All (can) deal with low base rate processes in any
single time period
– only small proportion of entities have event occur
• few 1’s, lotsa 0’s; “if”
– sometimes events can repeat (absenteeism)
• form counts, but many low numbers: “how often”
– observation window covers event for most or all,
interested in how long it takes
• lotsa 1’s; if and time until: “when”
• All are part of generalized linear model, most are
from family of log-linear models
Why Not Just Transform?
• LRDV models not merely techniques for “ugly-looking,” lopsided, or skewed DVs
– discreteness, choppiness of DV is cardinal feature;
truncation (data look “sliced” at floor or ceiling)
possible
– if DV only takes 0, 1 then no reasonable transformation
creates more than two values
– if DV only takes 0, 1, 2, 3, 4, ... k, still can’t create more
than k transformed values; regression will predict
impossible values below 0, above k
– usually, whopper number of 0’s or “ski-slope,”
scree-pile look of data, not just asymmetric
Why Not Just Use MLR?
• Use of standard MLR model can and does lead to
incorrect conclusions
– MLR model with j = 1, 2, ... p predictors is familiar:
Y-hat = b0 + b1X1 + b2X2 + ... + bpXp
or more typically:
Yi = b0 + b1X1i + b2X2i + ... + bpXpi + ei
• For LRDVs,
– link function not linear
– errors don’t follow proper shape
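A quick illustration of the first problem, as a minimal sketch with simulated data (not the webcast’s): fitting ordinary least squares to a 0/1 criterion with Python’s statsmodels produces point predictions outside [0, 1].

import numpy as np
import statsmodels.api as sm

# Simulated binary DV: event occurs when a latent logistic index is positive
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + rng.logistic(size=200) > 0).astype(int)

# Ordinary MLR (the "linear probability model") on the 0/1 criterion
ols = sm.OLS(y, sm.add_constant(x)).fit()
grid = sm.add_constant(np.linspace(-4.0, 4.0, 9))
print(ols.predict(grid))  # some "probabilities" fall below 0 or above 1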
What the #%!@ is a Link Function?
• All regression models, including MLR, are really
two sets of assumptions, about a ...
1) “Link Function”, g: how expected value of DV (Y-hat)
depends on combo of predictors; what has to be done so
Y-hat is connected to X’s in a linear, additive way
– for MLR, the link function is so simple it’s usually not
even mentioned; g equals the “identity function,” which
means nothing gets done to transform Y-hat:
g(Y-hat) = Y-hat = b0 + b1X1 + b2X2 + ... + bpXp
– not so for LRDV data and models; their link functions
are non-linear and more complex
Dude. More assumptions?
• ... and an ...
2) “Error Structure”, ei: what the stack of residuals looks like
Yi = b0 + b1X1i + b2X2i + ... + bpXpi + ei
– for MLR, assumed error structure for ei’s is normal,
bell-shaped, with mean of zero and constant variance;
necessary for constructing proper statistical tests
– again, not so for LRDV data and models; their error
structure is not constant over X, double-lumpy, even a
“martingale” distribution (think wings)
Basic Models for LRDVs
• Dichotomous DVs
– Logistic Regression: “Logit” is Y-hat
– Normal Probability Regression: “Probit” is Y-hat
• Count DVs
– Poisson Regression
– Negative Binomial Regression
• Event History DVs
– non-parametric: survival curves
– parametric: Weibull, Gompertz, other
– *semi-parametric: Cox regression, most flexible
Dichotomous DVs
• Criterion takes only two values, no ordering (Y is
nominal: yes/no; happened/didn’t; 0/1)
• Applying MLR to dichotomous DVs is the “linear
probability model,” which assumes a change in X has the
same slope of effect on Y, regardless of where on X the
change occurs; not possible as predicted Y approaches 0 or
1 (relation can no longer be linear)
• Logistic or probit regression broadly applied
• Logistic regression more popular in applied
psychology, management literatures
Logistic Regression Model
• Predicting “if”; Y can be either 1 or 0
• Probability (Y=1|X) divided by Probability (Y=0|X)
is the “odds”
• Log of the odds, or log-odds, is the “logit”:
logit[Y] = log[Prob(Y=1)/Prob(Y=0)]
• Logit is “link function” for this LRDV regression
logit[Y] = b0 + b1X1 + b2X2 + ... + bpXp + e
• Another way to write it:
odds for Y = exp[b0 + b1X1 + b2X2 + ... + bpXp + e]
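A minimal sketch of fitting this model in Python with statsmodels; the data are simulated and the predictor roles hypothetical, not the webcast’s.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(298, 2)))       # intercept + 2 predictors
true_logit = X @ np.array([-0.5, 1.0, -0.6])
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))   # 0/1 criterion

fit = sm.Logit(y, X).fit(disp=0)
print(fit.params)          # b-hats on the log-odds (logit) scale
print(np.exp(fit.params))  # the same effects expressed as odds ratios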
Probit Regression Model
• Predicting “if”; Y can be either 1 or 0
• Probability (Y=1|X) divided by Probability (Y=0|X)
is still the “odds”
• Inverse cumulative normal distribution, Ф⁻¹, gives the “probit”:
probit[Y] = Ф⁻¹[Prob(Y=1)]
• Probit is “link function” for this regression
Ф⁻¹[Prob(Y=1)] = b0 + b1X1 + b2X2 + ... + bpXp + e
• Slopes and shapes very similar to (re-scaled by 1.7)
logit; differences in extreme tails
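A sketch comparing the two links on the same simulated data; the ratio of logistic to probit slopes should land near the 1.7 rescaling just noted.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = sm.add_constant(rng.normal(size=1000))
y = rng.binomial(1, 1 / (1 + np.exp(-(x @ np.array([0.2, 0.9])))))

logit_fit = sm.Logit(y, x).fit(disp=0)
probit_fit = sm.Probit(y, x).fit(disp=0)
print(logit_fit.params / probit_fit.params)  # ratios typically near 1.6-1.8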
Comparing Slopes and Shapes
[Figure: Y = (predicted) probability vs. X = predictor (centered on 0, range -3.0 to 3.0), comparing linear, standard logistic, re-scaled logistic, and probit curves; the S-shaped curves stay between 0 and 1 while the linear line does not]
LRDV Assumptions
• Same as MLR:
– random sampling from defined population
– Yi are independent observations
– no complete dependencies / perfect correlations among
linear combinations of X’s
– decent Yi = 1 base rate (.5 is optimal) and no L.O.V.E.
(left out variables error: unobserved heterogeneity is a booger)
• Different from MLR
– link function non-linear & correct (Box-Tidwell check)
– errors are heteroskedastic (largest variance when
Prob[Y=1] approaches .5, smaller in tails)
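A rough sketch of the Box-Tidwell style check (assuming a strictly positive predictor): add an X*ln(X) term to the logit and see whether it is significant, which would flag non-linearity in the logit.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0.5, 5.0, size=400)                 # must be > 0 for ln(X)
y = rng.binomial(1, 1 / (1 + np.exp(-(x - 2.5))))

design = sm.add_constant(np.column_stack([x, x * np.log(x)]))
fit = sm.Logit(y, design).fit(disp=0)
print(fit.pvalues[2])  # small p-value => linearity in the logit is suspect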
Residuals and Diagnostics
• Most programs create “Pearson” or “standardized”
(SPSS) or “chi-square” (SAS) residuals; squared and
summed to make badness-of-fit index
• Badness-of-fit is distributed as χ2 with n-p-1 degrees
of freedom; big values are ... well ... bad
• Hosmer & Lemeshow (1989) χ2 test “discretizes”
predictors; but, non-significant test is not a sign of
nirvana, esp. in small samples with continuous X’s
• Diagnostics such as Cook’s distance, leverage,
influence similar to MLR
Goodness of Fit
• “Deviance” statistic is like sum of squared errors in
MLR; distributed as a χ2
– deviance for target model: Dp; deviance for null model
(no predictors): D0
– because null model is “nested” within target model, D0 –
Dp is test of overall fit for including p predictors
– D0 – Dp = G2, want large; likelihood ratio statistic
• “model chi-square” in SPSS
• “chi-square for covariates” in SAS
• No R2; best substitute: pseudo R2 = 1 - exp[-(G2/N)]
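A sketch of these quantities from a fitted statsmodels Logit (simulated data), using the slide’s pseudo R2 formula; llf and llnull are the target- and null-model log-likelihoods.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = sm.add_constant(rng.normal(size=300))
y = rng.binomial(1, 1 / (1 + np.exp(-(x @ np.array([-0.3, 0.8])))))

fit = sm.Logit(y, x).fit(disp=0)
D0, Dp = -2 * fit.llnull, -2 * fit.llf    # null and target deviances
G2 = D0 - Dp                              # likelihood ratio ("model chi-square")
pseudo_r2 = 1 - np.exp(-G2 / len(y))      # the slide's substitute for R2
print(G2, pseudo_r2)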
Statistical Tests
• For each b-hat in model, three types of tests:
– Wald, Score, and G2 (likelihood ratio) difference
– Wald and Score are χ2 (1 df) for that predictor, Xp, but can
have overly large standard errors when the metric of Xp is
large; hence, make more Type II errors
– G2 difference means running model with, without Xp and
subtracting χ2s
• z-test often seen in stat package output is just square
root of chi-square
• G2 difference can be used to test sets of predictors,
such as in mediation tests
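A sketch of the G2 difference test for a single predictor, on simulated data: fit the model with and without Xp and compare deviances against χ2 with 1 df.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 2))
y = rng.binomial(1, 1 / (1 + np.exp(-(0.7 * X[:, 0] + 0.4 * X[:, 1]))))

full = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
reduced = sm.Logit(y, sm.add_constant(X[:, :1])).fit(disp=0)
g2_diff = 2 * (full.llf - reduced.llf)        # drop-in-deviance for X2
print(g2_diff, stats.chi2.sf(g2_diff, df=1))  # statistic and p-value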
Interpreting Coefficients
• Sign of coefficient (b-hat) means same as MLR
– if Xp reduces chance of Y=1, negative sign
– watch out for switched coefficient signs in SAS!
• Magnitude of coefficient is not very meaningful;
must exponentiate to convert back to odds ratio:
– expected change in odds ratio for unit increase in Xp is
exp[b-hat]; always positive, compare to 1.0
– if exp[b-hat] = 1.2, a unit change in Xp yields a 1.2/1.0
odds ratio, or 20% increase in odds of Y=1
– if exp[b-hat] = .8, a unit change in Xp yields a .8/1.0 odds
ratio, or 20% decrease in odds of Y=1
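In code, converting b-hats (and their confidence limits) to the odds-ratio scale is one line of exponentiation; a minimal sketch with simulated data.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = sm.add_constant(rng.normal(size=400))
y = rng.binomial(1, 1 / (1 + np.exp(-(x @ np.array([0.0, 0.18])))))

fit = sm.Logit(y, x).fit(disp=0)
print(np.exp(fit.params))      # exp(0.18) ~ 1.20: ~20% higher odds per unit of X
print(np.exp(fit.conf_int()))  # odds-ratio CI; compare the limits to 1.0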
First Empirical Example
• Turnover of 298 employees working in bakery-delis
in region of national grocery chain
• Fairly high turnover rate, even for industry
• Collected demographic (from archives) and
questionnaire (linked by ID code) data
• Followed turnover for 18 mos. afterward
• Special interest in unit cohesiveness (“Grp
cohesion”) as turnover driver in snowball effect:
when one plans to quit (“TO Cognitions”), others
also go in close work teams
Employee Turnover Main Effects
                    LINEAR                   LOGISTIC                  PROBIT
                    Coeff  Effect  Test      Coeff   Effect  Test      Coeff   Effect  Test
Predictor           b      ry.x    t         b       Odds    χ2        b       Odds    χ2
Tenure (mos.)       -.01   -.05    .90       -.03**  .98     6.01**    -.01**  .98     5.93**
TO Cognitions       .04**  .38     7.17**    .44**   1.55    43.68**   .24**   1.46    43.36**
Grp Cohesion        .29**  .15     2.95**    .09**   1.10    8.28**    .05**   1.09    8.32**
D0 (null)                                    223.58                    219.03
Dp (target)                                  144.82                    146.23
F or G2             16.14**                  78.76**                   72.80**
(pseudo) R2         .25                      .23                       .22
Tricky Interactions
• One exception to rule of equivalent signs for MLR
and LRDVs is higher-order terms (see Huselid &
Day, 1991; Ganzach, 2000, examples)
– in LRDV, Xp already interacts with itself in predicting Y,
because slope isn’t constant
– for MLR, an interaction term shows change in absolute
risk for Xp over levels of moderator Z
– for LRDVs, an interaction term shows change in relative
risk for Xp (versus one unit less on Xp) across levels of
moderator Z
• Bottom line is signs can differ; stick with LRDV
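A sketch of the product-term setup (hypothetical predictor X and moderator Z, simulated data); the logistic and OLS interaction b-hats estimate different things and can even disagree in sign.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x, z = rng.normal(size=400), rng.normal(size=400)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * x + 0.3 * z + 0.4 * x * z))))

design = sm.add_constant(np.column_stack([x, z, x * z]))
logit_fit = sm.Logit(y, design).fit(disp=0)
ols_fit = sm.OLS(y, design).fit()
print(logit_fit.params[3], ols_fit.params[3])  # relative- vs. absolute-risk terms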
Higher-Order Effects
                    LINEAR                   LOGISTIC                  PROBIT
                    Coeff  Effect  Test      Coeff   Effect  Test      Coeff   Effect  Test
Predictor           b      ry.x    t         b       Odds    χ2        b       Odds    χ2
Logit TO Cogs       -.14** -.20    4.04**    -.51    .60     .98       -.25    .68     .94
Box TO Cogs         .09**  .23     4.60**    .38     1.47    2.08      .19     1.36    2.01
TO Cogs x GC        .02**  .60     2.77**    -.00    1.00    .01       .00     1.00    .01
(snowball effect?)
D0 (null)                                    223.58                    219.03
Dp (target)                                  144.82                    146.23
F or G2             16.14**                  78.76**                   72.80**
(pseudo) R2         .25                      .23                       .22
Basic Models for Count DV’s
• What if modeling repeated event? (e.g., multiple
turnover spells across individual history)
• Poisson Regression
– most frequent count (mode) is the lowest value; “ski slope” of
monotone decrease
– mean of count is same as variance
• Overdispersed Poisson Regression
– variance larger than mean (frequently the case)
• Negative Binomial Regression
– mode can be higher than minimum; early bump
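A sketch of both count models via statsmodels GLM (simulated store-level quit counts, not the webcast’s data); a deviance-to-df ratio well above 1 hints at overdispersion and favors the negative binomial.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = sm.add_constant(rng.normal(size=281))
counts = rng.poisson(np.exp(x @ np.array([0.2, 0.5])))   # "how often" DV

pois = sm.GLM(counts, x, family=sm.families.Poisson()).fit()
negbin = sm.GLM(counts, x, family=sm.families.NegativeBinomial()).fit()
print(pois.deviance / pois.df_resid)   # >> 1 suggests overdispersion
print(negbin.params)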
Second Empirical (Count) Example
[Figure: histogram of frequency of stores (0-120) by number of employees quitting (0-10), with an exponential curve fitted to the national, store-level turnover distribution for bakery-delis; counts decline in a “ski slope” from a mode at zero]
What if Timing of Event Matters?
• Realm of event history models
– “failure time” models (engineering)
– “survival” models (medicine)
– “hazard” or “risk” models (from many disciplines)
• Not same as event studies
– latter models use dichotomous event as independent
variable, explain variance in continuous DV
– former models use continuous and other independent
variables to explain time-dependent rate of a
dichotomous DV
Models for Event History DV’s
• Event can be any change in the observed entity
• Predicting “when” (the “if” takes place); Yi can be
either 0 or 1, also have Ti: time until Y changes from
0 to 1 for entity i
• Non-parametric
– compare “survival curves” across levels of Xp; essentially
comparing pile of Ti’s for each Xp value
• Parametric
– fully specified model that has link function and mandates
particular pattern over time; Weibull, Gompertz, others
Major Choices about Time
• Is time measured continuously or discretely?
• What are finest increments of time (ties between
entities’ event times create estimation problems)?
• How is T defined? Clearly state starting point, ts,
and ending point, te, of one’s design: ts and te are
edges of “observation window”
• Best if ts could be defined at “origin point” of time
(e.g., first day of work; founding of firm); reduces
left censoring
• Best if te could be defined to reduce right censoring
as much as possible
Time-Based Censoring
Observation window runs from ts to te, spanning 2003-2005:

            Start Date -- Quit Date    Observation Type
Person 1    Aug 2003 -- Mar 2004       Uncensored
Person 2    Jan 2004 -- Nov 2004       Uncensored
Person 3    Mar 2003 -- Jul 2004       Left-Censored
Person 4    Mar 2004 -- May 2005       Right-Censored
Person 5    Apr 2003 -- Jan 2005       Doubly-Censored
Hazard Rates
• LRDV in these time-dependent models is really a
“hazard rate”: hi(t)
• In continuous time, it is the instantaneous rate of
change from Yi = 0 to 1 in any succeeding (but
really, really small) interval of time
• In discrete time, it is the probability or risk that
Yi = 1 will occur in the period between t and t + 1,
given it hasn’t already happened yet
• Discrete-time hazard rate models can be arranged as
logistic or probit regressions (see Singer & Willett,
1991; Harrison et al., 1996)
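A sketch of that arrangement, in the spirit of Singer & Willett: expand each entity into one record per period at risk (dropping out after the event), then fit an ordinary logit on the person-period file. All names and numbers here are hypothetical.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
rows = []
for i in range(300):                     # 300 hypothetical entities
    x = rng.normal()
    for t in range(1, 13):               # up to 12 monthly risk periods
        p = 1 / (1 + np.exp(-(-2.5 + 0.05 * t + 0.6 * x)))
        event = rng.random() < p
        rows.append({"period": t, "x": x, "y": int(event)})
        if event:
            break                        # entity leaves the risk set
pp = pd.DataFrame(rows)

fit = smf.logit("y ~ period + x", data=pp).fit(disp=0)
print(fit.params)  # effects on the log-odds of the per-period hazard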
(Baseline) Hazard Rate Example
Empirical Hazard Rate for Turnover
(with 95% Confidence Intervals)
[Figure: empirical hazard rate (0 to 0.10) plotted by month, 1-15, from start of study (or post-study employment), with 95% confidence intervals]
Cox Regression
• Most flexible event history model
• One of 100 most cited papers in science history
• Assumes any two entities have proportional
hazard rates; allows clever estimation
log[hi(t)] = log[h0(t)] + b1X1i + b2X2i + ... + bpXpi
(log[h0(t)] is the baseline hazard rate)
• Baseline hazard can be wild! It is not estimated, but is
a kind of Y-intercept changing over time; every
entity’s hazard rate is a shift up or down from it,
shift reflects linear combo of b’s and X’s
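A minimal Cox regression sketch using the lifelines package (assumed available); the data frame and column names are hypothetical stand-ins, and the baseline hazard is left unspecified, as the model intends.

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(10)
df = pd.DataFrame({
    "months": rng.exponential(10.0, size=200),   # Ti: time observed
    "quit": rng.binomial(1, 0.7, size=200),      # 1 = event, 0 = right-censored
    "cohesion": rng.normal(size=200),
})

cph = CoxPHFitter()
cph.fit(df, duration_col="months", event_col="quit")
cph.print_summary()  # exp(coef) column gives the hazard ratio per unit of X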
Proportional Hazards
[Figure: empirical hazard rates for turnover (with 95% confidence intervals) by month, 1-15, from start of study (or post-study employment); person i’s curve sits proportionally above person j’s]
Assumptions and Estimation
• Proportionality of hazards is a biggie; can be tested
with residual plots
• No L.O.V.E. also a biggie, can create mistaken
appearance of changing hazard rate
• Cox regression uses partial likelihood, but is proven to
have nearly optimal properties
• Treatment of ties (with same Ti value) can be
important, esp. in small samples; use Efron option
• Left censoring is trouble; minimize via choice of ts; right
censoring not so bad, but need small proportion of
censored Ti’s or large samples
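lifelines can probe the proportionality assumption directly via its residual-based check (a sketch continuing the earlier hypothetical setup); it also handles tied event times with Efron’s approximation by default, in line with the slide’s advice.

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(11)
df = pd.DataFrame({
    "months": rng.exponential(10.0, size=200),
    "quit": rng.binomial(1, 0.7, size=200),
    "x": rng.normal(size=200),
})
cph = CoxPHFitter().fit(df, duration_col="months", event_col="quit")
cph.check_assumptions(df)   # flags covariates whose effects drift over time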
Tests and Interpretations
• Tests, caveats same as logistic regression (phew!)
• Interpretation also same, but effect is on hazard ratio
instead of odds ratio
– negative coefficient: Xp decreases hazard rate (shifts
entire curve downward across time)
– positive coefficient: Xp increases hazard rate
(shifts entire curve upward across time)
– hazard rate shifts up or down for Xp in the amount of
exp[b-hat]
• What if Xp changes over time? Need time-varying
covariate model; easy to build, harder to interpret
Third Empirical Example
• Turnover of 2194 children attending day-care
centers in a regional chain
• Again fairly high turnover rate, even for industry
• Observation window is ts = customer start date; all
other data from organizational archives
• Followed turnover for 12 mos. afterward
• Special interest in firm-customer (parent)
relationship: younger children get broader care;
multiple, prior attending children from family means
stronger tie; teacher turnover weakens tie
Customer Turnover Results
                    LINEAR                   LOGISTIC                   COX
                    Coeff  Effect  Test      Coeff    Effect  Test      Coeff    Effect  Test
Predictor           b      ry.x    t         b        Odds    χ2        b        Hazard  χ2
Infant Class        -9.58  -.01    -1.68     -2.34**  .10     12.42**   -2.04**  .13     13.32**
Sib Enrolled        -.62   -.00    -.26      -.29**   .75     10.48**   -.61**   .54     6.65**
Tchr Turnover       1.52*  .02     2.33*     .22**    1.24    8.37**    .21**    1.24    15.35**
D0 (null)                                    326.08                     933.06
Dp (target)                                  275.85                     863.25
F or G2             10.80**                  50.23**                    69.71**
(pseudo) R2         .35                      .19                        .26
Some Overall Limitations
• Single, observable criterion
– there are some multivariate models, but not as well
developed and not as likely in major packages
– STATA and LIMDEP have advanced techniques
– hierarchical LRDV models now being worked out
• No L.O.V.E. is huge, consequential assumption
• Effects of measurement error largely unexplored, not
yet pinned down
• Takes a bit more explaining for most org audiences
A Few General References
• Aldrich, J.H., & Nelson, F.D. 1984. Linear probability, logit, and
probit models. Newbury Park, CA: Sage.
• Allison, P.D. 1984. Event history analysis: Regression for
longitudinal event data. Newbury Park, CA: Sage.
• Harrison, D. A. 2001. Structure and timing in limited range
dependent variables: Regression models for predicting if and when.
In F. Drasgow & N. Schmitt (Eds.), Measuring and analyzing
behavior in organizations, 531-568. San Francisco: Jossey-Bass.
• Hosmer, D.W., & Lemeshow, S. 1989. Applied logistic regression.
New York: Wiley.
• Long, J. S. 1997. Regression models for categorical and limited
dependent variables. Newbury Park, CA: Sage.
• Menard, S. 1995. Applied logistic regression. Thousand Oaks,
CA: Sage.