Transcript: LRDV Models
Regression Models for Predicting Limited Range Dependent Variables
David A. Harrison, Smeal College of Business, [email protected]
CARMA Consortium Webcast, 2006-2007 Program

Limited Range Dependent Variables
• Organizational research questions often involve criteria (DVs) that don't form smooth, tidy, unimodal piles, especially in short time periods
  – Firm level: mergers, acquisitions, spin-offs, bankruptcies, other strategic and structural changes
  – Team level: member exits and entries, dissolution, task completion, settlement or impasse
  – Individual level: grievance-filing, accidents or injuries, promotion, commendation, dismissal or turnover

Problematic Data
• Such LRDVs create estimation, prediction, and interpretation problems when used as criteria in typical, multiple linear regression (MLR) models
  – nonsensical point predictions (negative counts, probabilities > 1.0)
  – incorrect standard errors, which create ...
  – (mostly) overly optimistic statistical tests
• LRDV regression models fit the bill
  – built for these troublesome situations
  – now in all major stat packages, but still hard to use correctly, because of quirky distinctions from MLR

What are LRDV Model Situations?
• Criterion is discrete or bounded in a severe way; small number of values; truncated or "censored"
• Dichotomous (binary: 0, 1)
  – predicting "if"; variance reflects whether the event happens (1) vs. not (0) across the entities (individuals, groups, orgs) being studied
• Count (0, 1, 2, 3, ...)
  – predicting "how often" up to a finite point
• A dichotomous DV combined with "time until" the event occurs becomes an event history LRDV model
  – predicting "when"

LRDV Models Are Related
• All (can) deal with low base rate processes in any single time period
  – only a small proportion of entities have the event occur
    • few 1's, lotsa 0's; "if"
  – sometimes events can repeat (absenteeism)
    • form counts, but many low numbers: "how often"
  – observation window covers the event for most or all entities; interested in how long it takes
    • lotsa 1's; "if" and time until: "when"
• All are part of the generalized linear model; most are from the family of log-linear models

Why Not Just Transform?
• LRDV models are not merely techniques for "ugly-looking," lopsided, or skewed DVs
  – discreteness, choppiness of the DV is the cardinal feature; truncation (data look "sliced" at a floor or ceiling) possible
  – if the DV only takes 0, 1 then no reasonable transformation creates more than two values
  – if the DV only takes 0, 1, 2, 3, 4, ... k, still can't create more than k transformed values; regression will predict impossible values below 0, above k
  – usually, a whopper number of 0's or the "ski-slope," scree-pile look of the data, not just asymmetry

Why Not Just Use MLR?
• Use of the standard MLR model can and does lead to incorrect conclusions
  – MLR model with j = 1, 2, ... p predictors is familiar:
    Y-hat = b0 + b1X1 + b2X2 + ... + bpXp
    or, more typically:
    Yi = b0 + b1X1i + b2X2i + ... + bpXpi + ei
• For LRDVs,
  – the link function is not linear
  – the errors don't follow the proper shape
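The first of those problems is easy to see directly. Below is a minimal sketch (synthetic data, Python with statsmodels; all variable names are illustrative, not from the webcast examples) of MLR fit to a 0/1 criterion, the "linear probability model" discussed later:

```python
# A minimal sketch (synthetic data): OLS fit to a dichotomous DV produces
# nonsensical point predictions -- "probabilities" below 0 and above 1.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + rng.logistic(size=200) > 0).astype(int)   # 0/1 criterion

lpm = sm.OLS(y, sm.add_constant(x)).fit()          # linear probability model
grid = np.linspace(-4, 4, 9)
print(lpm.predict(sm.add_constant(grid)))          # < 0 and > 1 at the extremes
```

The logistic and probit links introduced next keep predicted probabilities inside the 0-1 bounds.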
What the #%!@ is a Link Function?
• All regression models, including MLR, are really two sets of assumptions, about a ...
1) "Link Function," g: how the expected value of the DV (Y-hat) depends on the combo of predictors; what has to be done so Y-hat is connected to the X's in a linear, additive way
• For MLR, the link function is so simple it's usually not even mentioned; g is the "identity function," which means nothing gets done to transform Y-hat:
  g(Y-hat) = Y-hat = b0 + b1X1 + b2X2 + ... + bpXp
• Not so for LRDV data and models. Their link functions are non-linear and more complex.

Dude. More Assumptions?
• ... and an ...
2) "Error Structure," ei: what the stack of residuals looks like
  Yi = b0 + b1X1i + b2X2i + ... + bpXpi + ei
• For MLR, the assumed error structure for the ei's is normal, bell-shaped, with a mean of zero and constant variance; necessary for constructing proper statistical tests
• Again, not so for LRDV data and models. Their error structure is not constant over X, double-lumpy, even a "martingale" distribution (think wings)

Basic Models for LRDVs
• Dichotomous DVs
  – Logistic Regression: "logit" is Y-hat
  – Normal Probability Regression: "probit" is Y-hat
• Count DVs
  – Poisson Regression
  – Negative Binomial Regression
• Event History DVs
  – non-parametric: survival curves
  – parametric: Weibull, Gompertz, other
  – *semi-parametric: Cox regression, most flexible

Dichotomous DVs
• Criterion takes only two values, no ordering (Y is nominal: yes/no; happened/didn't; 0/1)
• Applying MLR to dichotomous DVs is the "linear probability model"; it assumes a change in X has the same slope of effect on Y regardless of where on X the change occurs; not possible when Y approaches 0 or 1 (no longer linear)
• Logistic or probit regression broadly applied
• Logistic regression more popular in the applied psychology and management literatures

Logistic Regression Model
• Predicting "if"; Y can be either 1 or 0
• Probability(Y=1|X) divided by Probability(Y=0|X) is the "odds"
• Log of the odds, or log-odds, is the "logit":
  logit[Y] = log[Prob(Y=1)/Prob(Y=0)]
• Logit is the "link function" for the DV regression:
  logit[Y] = b0 + b1X1 + b2X2 + ... + bpXp + e
• Another way to write it:
  odds for Y = exp[b0 + b1X1 + b2X2 + ... + bpXp + e]

Probit Regression Model
• Predicting "if"; Y can be either 1 or 0
• The inverse of the cumulative normal distribution, Ф⁻¹, is the "probit":
  probit[Y] = Ф⁻¹[Prob(Y=1)]
• Probit is the "link function" for the DV regression:
  Ф⁻¹[Prob(Y=1)] = b0 + b1X1 + b2X2 + ... + bpXp + e
• Slopes and shapes very similar to the (re-scaled by 1.7) logit; differences in the extreme tails

Comparing Slopes and Shapes
[Figure: Y = (predicted) probability plotted against X = predictor (centered on 0), from -3.0 to +3.0, for four models: linear, standard logistic, re-scaled logistic, and probit. The linear line runs outside the 0-1 bounds; the S-shaped logistic and probit curves do not.]
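The 1.7 re-scaling is easy to verify by fitting both links to the same data. A minimal sketch (synthetic data, Python with statsmodels; names and coefficients are illustrative assumptions):

```python
# A minimal sketch (synthetic data): logit and probit slopes track each
# other up to roughly the 1.7 re-scaling factor noted above.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
X = sm.add_constant(rng.normal(size=(n, 2)))
y = (X @ [0.5, 1.0, -0.8] + rng.logistic(size=n) > 0).astype(int)

logit_fit = sm.Logit(y, X).fit(disp=False)
probit_fit = sm.Probit(y, X).fit(disp=False)
print(logit_fit.params)          # b-hats on the log-odds scale
print(1.7 * probit_fit.params)   # re-scaled probit b-hats: very close
```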
LRDV Assumptions
• Same as MLR:
  – random sampling from a defined population
  – Yi are independent observations
  – no complete dependencies / perfect correlations among linear combinations of X's
  – decent Yi = 1 base rate (.5 is optimal) and no L.O.V.E. (left-out variable error; unobserved heterogeneity is a booger)
• Different from MLR:
  – link function is non-linear & must be correct (Box-Tidwell check)
  – errors are heteroskedastic (largest variance when Prob[Y=1] approaches .5, smaller in the tails)

Residuals and Diagnostics
• Most programs create "Pearson" or "standardized" (SPSS) or "chi-square" (SAS) residuals; squared and summed to make a badness-of-fit index
• Badness-of-fit is distributed as χ2 with n-p-1 degrees of freedom; big values are ... well ... bad
• Hosmer & Lemeshow (1989) χ2 test "discretizes" predictors; but a non-significant test is not a sign of nirvana, esp. in small samples with continuous X's
• Diagnostics such as Cook's distance, leverage, influence are similar to MLR

Goodness of Fit
• "Deviance" statistic is like the sum of squared errors in MLR; distributed as a χ2
  – deviance for target model: Dp; deviance for null model (no predictors): D0
  – because the null model is "nested" within the target model, D0 - Dp is a test of overall fit for including the p predictors
  – D0 - Dp = G2, want large; likelihood ratio statistic
    • "model chi-square" in SPSS
    • "chi-square for covariates" in SAS
• No R2; best substitute: pseudo R2 = 1 - exp(-G2/N)

Statistical Tests
• For each b-hat in the model, three types of tests:
  – Wald, Score, and G2 (likelihood ratio) difference
  – Wald and Score are χ2 (1 df) for that predictor, Xp, but can have overly large standard errors when the metric of X is large; hence, they make more Type II errors
  – G2 difference means running the model with and without Xp and subtracting the χ2's
• z-test often seen in stat package output is just the square root of the chi-square
• G2 difference can be used to test sets of predictors, such as in mediation tests

Interpreting Coefficients
• Sign of a coefficient (b-hat) means the same as in MLR
  – if Xp reduces the chance of Y=1, negative sign
  – watch out for switched coefficient signs in SAS!
• Magnitude of a coefficient is not very meaningful; must exponentiate to convert back to an odds ratio:
  – expected change in the odds ratio for a unit increase in Xp is exp[b-hat]; always positive, compare to 1.0
  – if exp[b-hat] = 1.2, a unit change in Xp yields a 1.2/1.0 odds ratio, or a 20% increase in the odds of Y=1
  – if exp[b-hat] = .8, a unit change in Xp yields a .8/1.0 odds ratio, or a 20% decrease in the odds of Y=1

First Empirical Example
• Turnover of 298 employees working in bakery-delis in a region of a national grocery chain
• Fairly high turnover rate, even for the industry
• Collected demographic (from archives) and questionnaire (linked by ID code) data
• Followed turnover for 18 mos. afterward
• Special interest in unit cohesiveness ("Grp Cohesion") as a turnover driver in a snowball effect: when one plans to quit ("TO Cognitions"), others also go in close work teams

Employee Turnover Main Effects

                 LINEAR                    LOGISTIC                   PROBIT
Predictor        b        ry.x    t        b        Odds    χ2        b        Odds    χ2
Tenure (mos.)    -.01     -.05    .90      -.03**   .98     6.01**    -.01**   .98     5.93**
TO Cognitions    .04**    .38     7.17**   .44**    1.55    43.68**   .24**    1.46    43.36**
Grp Cohesion     .29**    .15     2.95**   .09**    1.10    8.28**    .05**    1.09    8.32**
D0 (null)                                  223.58                     219.03
Dp (target)                                144.82                     146.23
F or G2          16.14**                   78.76**                    72.80**
(pseudo) R2      .25                       .23                        .22
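The D0, Dp, G2, and pseudo R2 rows in the table above can be computed from any logistic fit. A minimal sketch (synthetic data, Python with statsmodels and scipy; not the bakery-deli data):

```python
# A minimal sketch (synthetic data) of the deviance-based fit statistics:
# for ungrouped binary data, deviance = -2 * log-likelihood, so
# G2 = D0 - Dp is the likelihood-ratio chi-square on p df, and
# pseudo R2 = 1 - exp(-G2/N).
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
n = 298
X = sm.add_constant(rng.normal(size=(n, 3)))           # p = 3 predictors
y = (X @ [0.2, 0.9, -0.5, 0.3] + rng.logistic(size=n) > 0).astype(int)

target = sm.Logit(y, X).fit(disp=False)                # target model
null = sm.Logit(y, np.ones((n, 1))).fit(disp=False)    # intercept only

D0, Dp = -2 * null.llf, -2 * target.llf                # deviances
G2 = D0 - Dp                                           # "model chi-square"
print(G2, stats.chi2.sf(G2, df=3))                     # overall test, p = 3 df
print(1 - np.exp(-G2 / n))                             # pseudo R2
```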
Tricky Interactions
• One exception to the rule of equivalent signs for MLR and LRDVs is higher-order terms (see Huselid & Day, 1991; Ganzach, 2000, for examples)
  – in an LRDV model, Xp already interacts with itself in predicting Y, because the slope isn't constant
  – for MLR, an interaction term shows the change in absolute risk for Xp over levels of the moderator Z
  – for LRDVs, an interaction term shows the change in relative risk for Xp (versus one unit less on Xp) across levels of the moderator Z
• Bottom line is the signs can differ; stick with LRDV

Higher-Order Effects

                 LINEAR                    LOGISTIC                   PROBIT
Predictor        b        ry.x    t        b        Odds    χ2        b        Odds    χ2
Logit TO Cogs    -.14**   -.20    4.04**   -.51     .60     .98       -.25     .68     .94
Box TO Cogs      .09**    .23     4.60**   .38      1.47    2.08      .19      1.36    2.01
TO Cogs x GC     .02**    .60     2.77**   -.00     1.00    .01       .00      1.00    .01
(snowball effect?)
D0 (null)                                  223.58                     219.03
Dp (target)                                144.82                     146.23
F or G2          16.14**                   78.76**                    72.80**
(pseudo) R2      .25                       .23                        .22

Basic Models for Count DVs
• What if modeling a repeated event? (e.g., multiple turnover spells across an individual's history)
• Poisson Regression
  – most common count (the mode) is the lowest number; "ski slope" of monotone decrease
  – mean of the count is the same as its variance
• Overdispersed Poisson Regression
  – variance larger than the mean (frequently the case)
• Negative Binomial Regression
  – mode can be higher than the minimum; early bump

Second Empirical (Count) Example
[Figure: histogram of the number of employees quitting per store (0-10), with an exponential curve fitted to the national, store-level turnover distribution for bakery-delis; frequencies decrease monotonically from 99 stores with zero quits down to one or two stores at the highest counts.]
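A minimal sketch (synthetic overdispersed counts, Python with statsmodels; the data-generating numbers are illustrative assumptions) of the Poisson vs. negative binomial choice just described:

```python
# A minimal sketch (synthetic data): Poisson regression forces
# mean = variance; negative binomial regression adds a dispersion
# parameter for the common variance > mean ("overdispersed") case.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
X = sm.add_constant(rng.normal(size=(n, 1)))
mu = np.exp(X @ [0.3, 0.6])                     # log link for the count mean
y = rng.negative_binomial(2, 2 / (2 + mu))      # overdispersed counts, mean mu

pois = sm.Poisson(y, X).fit(disp=False)
negbin = sm.NegativeBinomial(y, X).fit(disp=False)
print(pois.params)                # b-hats on the log-count scale
print(negbin.params)              # last entry is the dispersion (alpha)
print(np.exp(negbin.params[:2]))  # exp[b-hat]: multiplicative effect on counts
```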
What if Timing of the Event Matters?
• Realm of event history models
  – "failure time" models (engineering)
  – "survival" models (medicine)
  – "hazard" or "risk" models (from many disciplines)
• Not the same as event studies
  – the latter use a dichotomous event as an independent variable to explain variance in a continuous DV
  – the former use continuous and other independent variables to explain the time-dependent rate of a dichotomous DV

Models for Event History DVs
• Event can be any change in the observed entity
• Predicting "when" (the "if" takes place); Yi can be either 0 or 1, and we also have Ti: time until Y changes from 0 to 1 for entity i
• Non-parametric
  – compare "survival curves" across levels of Xp; essentially comparing the pile of Ti's for each Xp value
• Parametric
  – fully specified model that has a link function and mandates a particular pattern over time; Weibull, Gompertz, others

Major Choices about Time
• Is time measured continuously or discretely?
• What are the finest increments of time (ties between entities' event times create estimation problems)?
• How is T defined? Clearly state the starting point, ts, and ending point, te, of one's design: ts and te are the edges of the "observation window"
• Best if ts can be defined at the "origin point" of time (e.g., first day of work; founding of the firm); reduces left censoring
• Best if te can be defined to reduce right censoring as much as possible

Time-Based Censoring

           Start Date   Quit Date   Observation Type
Person 1   Aug 2003     Mar 2004    Uncensored
Person 2   Jan 2004     Nov 2004    Uncensored
Person 3   Mar 2003     Jul 2004    Left-Censored
Person 4   Mar 2004     May 2005    Right-Censored
Person 5   Apr 2003     Jan 2005    Doubly-Censored

[Figure: timeline spanning 2003-2005, with ts and te marking the edges of the observation window around each person's spell.]

Hazard Rates
• The LRDV in these time-dependent models is really a "hazard rate": hi(t)
• In continuous time, it is the instantaneous rate of change from Yi = 0 to 1 in any succeeding (but really, really small) interval of time
• In discrete time, it is the probability or risk that Yi = 1 will occur in the period between t and t + 1, given it hasn't already happened
• Discrete-time hazard rate models can be arranged as logistic or probit regressions (see Singer & Willett, 1991; Harrison et al., 1996)

(Baseline) Hazard Rate Example
[Figure: empirical hazard rate for turnover (with 95% confidence intervals), plotted over months 1-15 from the start of the study (or post-study employment).]

Cox Regression
• Most flexible event history model
• One of the 100 most cited papers in the history of science
• Assumes any two individuals have proportional hazard rates; allows clever estimation:
  log[hi(t)] = log[h0(t)] + b1X1i + b2X2i + ... + bpXpi
  where h0(t) is the baseline hazard rate
• The baseline hazard can be wild! It is not estimated, but is a kind of Y-intercept changing over time; every entity's hazard rate is a shift up or down from it, and the shift reflects the linear combo of b's and X's

Proportional Hazards
[Figure: the empirical hazard curve for turnover shown for person i and person j; person i's curve sits proportionally above person j's across all 15 months.]

Assumptions and Estimation
• Proportionality of hazards is a biggie; can be tested with residual plots
• No L.O.V.E. is also a biggie; it can create the mistaken appearance of a changing hazard rate
• Cox regression uses partial likelihood, but it has been proven to have nearly optimal properties
• Treatment of ties (entities with the same Ti value) can be important, esp. in small samples; use the Efron option
• Left censoring is trouble; minimize it with the choice of ts; right censoring is not so bad, but need a small proportion of censored Ti's or large samples

Tests and Interpretations
• Tests, caveats same as logistic regression (phew!)
• Interpretation also same, but the effect is on the hazard ratio instead of the odds ratio
  – negative coefficient: Xp decreases the hazard rate (shifts entire curve downward across time)
  – positive coefficient: Xp increases the hazard rate (shifts entire curve upward across time)
  – hazard rate shifts up or down for Xp by the amount exp[b-hat]
• What if Xp changes over time? Need a time-varying covariate model; easy to build, harder to interpret
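A minimal sketch (synthetic spell data, Python with the lifelines package; column names and the data-generating process are illustrative assumptions) of fitting a Cox regression with right-censored observations:

```python
# A minimal sketch (synthetic data) of Cox regression: right-censored
# spells get event = 0, and exp(coef) is read as a hazard ratio.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(4)
n = 400
x = rng.normal(size=n)
t_event = rng.exponential(scale=np.exp(-0.7 * x))   # hazard rises with x
t_end = rng.exponential(scale=1.5, size=n)          # end of observation window

df = pd.DataFrame({
    "x": x,
    "duration": np.minimum(t_event, t_end),         # observed spell length
    "event": (t_event <= t_end).astype(int),        # 1 = quit observed
})

# lifelines handles tied event times with Efron's method by default
cph = CoxPHFitter().fit(df, duration_col="duration", event_col="event")
cph.print_summary()   # exp(coef) column: hazard ratio per unit change in x
```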
Third Empirical Example
• Turnover of 2194 children attending day-care centers in a regional chain
• Again a fairly high turnover rate, even for the industry
• Observation window: ts = customer start date; all other data from organizational archives
• Followed turnover for 12 mos. afterward
• Special interest in the firm-customer (parent) relationship: younger children get broader care; multiple prior attending children from a family means a stronger tie; teacher turnover weakens the tie

Customer Turnover Results

                 LINEAR                    LOGISTIC                   COX
Predictor        b        ry.x    t        b        Odds    χ2        b        Hazard   χ2
Infant Class     -9.58    -.01    -1.68    -2.34**  .10     12.42**   -2.04**  .13      13.32**
Sib Enrolled     -.62     -.00    -.26     -.29**   .75     10.48**   -.61**   .54      6.65**
Tchr Turnover    1.52*    .02     2.33*    .22**    1.24    8.37**    .21**    1.24     15.35**
D0 (null)                                  326.08                     933.06
Dp (target)                                275.85                     863.25
F or G2          10.80**                   50.23**                    69.71**
(pseudo) R2      .35                       .19                        .26

Some Overall Limitations
• Single, observable criterion
  – there are some multivariate models, but they are not as well developed and not as likely to be in major packages
  – STATA and LIMDEP have advanced techniques
  – hierarchical LRDV models now being worked out
• No L.O.V.E. is a huge, consequential assumption
• Effects of measurement error largely unexplored, not pinned down
• Takes a bit more explaining for most organizational audiences

A Few General References
• Aldrich, J. H., & Nelson, F. D. 1984. Linear probability, logit, and probit models. Newbury Park, CA: Sage.
• Allison, P. D. 1984. Event history analysis: Regression for longitudinal event data. Newbury Park, CA: Sage.
• Harrison, D. A. 2001. Structure and timing in limited range dependent variables: Regression models for predicting if and when. In F. Drasgow & N. Schmitt (Eds.), Measuring and analyzing behavior in organizations: 531-568. San Francisco: Jossey-Bass.
• Hosmer, D. W., & Lemeshow, S. 1989. Applied logistic regression. New York: Wiley.
• Long, J. S. 1997. Regression models for categorical and limited dependent variables. Thousand Oaks, CA: Sage.
• Menard, S. 1995. Applied logistic regression analysis. Thousand Oaks, CA: Sage.