Qualitative and Limited Dependent Variable Analysis III


ECON 6002 Econometrics Memorial University of Newfoundland

Qualitative and Limited Dependent Variable Models

Adapted from Vera Tabakova’s notes

16.1 Models with Binary Dependent Variables

16.2 The Logit Model for Binary Choice

16.3 Multinomial Logit

16.4 Conditional Logit

16.5 Ordered Choice Models

16.6 Models for Count Data

16.7 Limited Dependent Variables

Principles of Econometrics, 3rd Edition Slide 16-2

When the dependent variable in a regression model is a count of the number of occurrences of an event, the outcome variable is y = 0, 1, 2, 3, … These numbers are actual counts, and thus different from the ordinal numbers of the previous section. Examples include:

• The number of trips to a physician a person makes during a year.

• The number of fishing trips taken by a person during the previous year.

• The number of children in a household.

• The number of automobile accidents at a particular intersection during a month.

• The number of televisions in a household.

• The number of alcoholic drinks a college student takes in a week.

If Y is a Poisson random variable, then its probability function is

$$f(y) = P(Y = y) = \frac{e^{-\lambda}\lambda^{y}}{y!}, \quad y = 0, 1, 2, \ldots \tag{16.27}$$

The parameter $\lambda$ is the “rate”. It is the mean of Y and is also equal to the variance. In the regression setting

$$E(Y) = \lambda = \exp(\beta_1 + \beta_2 x) \tag{16.28}$$

This choice defines the Poisson regression model for count data.
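A minimal Python sketch of (16.27) and (16.28), using only the standard library. The coefficient values and the value of x are made up for illustration and are not estimates from these notes. The sketch checks numerically that the probabilities sum to one and that the mean and the variance both equal the rate.

```python
import math


def poisson_pmf(y: int, lam: float) -> float:
    """P(Y = y) = exp(-lam) * lam**y / y!, as in (16.27)."""
    return math.exp(-lam) * lam ** y / math.factorial(y)


def poisson_rate(beta1: float, beta2: float, x: float) -> float:
    """lam = exp(beta1 + beta2 * x), as in (16.28)."""
    return math.exp(beta1 + beta2 * x)


# Hypothetical coefficients and regressor value, for illustration only.
lam = poisson_rate(beta1=0.5, beta2=0.1, x=3.0)

# Sum far enough into the tail that the truncation error is negligible.
support = range(100)
total = sum(poisson_pmf(y, lam) for y in support)
mean = sum(y * poisson_pmf(y, lam) for y in support)
variance = sum((y - mean) ** 2 * poisson_pmf(y, lam) for y in support)

print(f"lambda = {lam:.4f}, mean = {mean:.4f}, variance = {variance:.4f}")
```

Because the rate is an exponential of the linear index, it is positive for any coefficient values, which is why this functional form is convenient for counts.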

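If we observe 3 individuals, one facing one event and the other two facing two events each, the likelihood is the product of the three Poisson probabilities:

$$L(\beta_1, \beta_2) = P(Y = y_1) \times P(Y = y_2) \times P(Y = y_3)$$

$$\ln L(\beta_1, \beta_2) = \ln P(Y = y_1) + \ln P(Y = y_2) + \ln P(Y = y_3)$$

For each observation

$$\ln P(Y = y) = \ln\left[\frac{e^{-\lambda}\lambda^{y}}{y!}\right] = -\lambda + y\ln(\lambda) - \ln(y!) = -\exp(\beta_1 + \beta_2 x) + y(\beta_1 + \beta_2 x) - \ln(y!)$$

so that, for a sample of N observations,

$$\ln L(\beta_1, \beta_2) = \sum_{i=1}^{N} \left\{ -\exp(\beta_1 + \beta_2 x_i) + y_i(\beta_1 + \beta_2 x_i) - \ln(y_i!) \right\}$$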
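The log-likelihood above can be sketched in a few lines of Python. The counts follow the three-individual example in the text (one, two and two events); the x values are hypothetical, since the slides do not give them. The grid search is only a crude stand-in for a proper optimizer, used here to show that the function has an interior maximum.

```python
import math


def poisson_loglik(beta1, beta2, xs, ys):
    """Sum over i of -exp(b1 + b2*x_i) + y_i*(b1 + b2*x_i) - ln(y_i!)."""
    total = 0.0
    for x, y in zip(xs, ys):
        index = beta1 + beta2 * x
        # math.lgamma(y + 1) equals ln(y!)
        total += -math.exp(index) + y * index - math.lgamma(y + 1)
    return total


# Counts as described in the text; x values are hypothetical.
ys = [1, 2, 2]
xs = [0.0, 1.0, 2.0]

# Crude grid search over [-2, 2] x [-2, 2] in steps of 0.02.
grid = [i / 50 for i in range(-100, 101)]
best = max((poisson_loglik(b1, b2, xs, ys), b1, b2) for b1 in grid for b2 in grid)

print("max ln L = %.4f at beta1 = %.2f, beta2 = %.2f" % best)
```

In practice the maximization is done by Newton-type iterations, which is what the "Iteration" lines in Stata output report.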

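With estimates $\tilde\beta_1$ and $\tilde\beta_2$, the predicted conditional mean of y at $x_0$ is

$$\widehat{E(y_0)} = \tilde\lambda_0 = \exp(\tilde\beta_1 + \tilde\beta_2 x_0)$$

$$\widehat{\Pr}(Y = y) = \frac{\exp(-\tilde\lambda_0)\,\tilde\lambda_0^{\,y}}{y!}, \quad y = 0, 1, 2, \ldots$$

So now you can calculate the predicted probability of a certain number y of events.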

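The marginal effect of a continuous x on the expected count is

$$\frac{\partial E(y_i)}{\partial x_i} = \lambda_i \beta_2 \tag{16.29}$$

You may prefer to express this marginal effect as a %:

$$\frac{\%\Delta E(y_i)}{\Delta x_i} = 100\,\frac{\partial E(y_i)/E(y_i)}{\partial x_i} = 100\,\beta_2\,\%$$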

If there is a dummy involved, be careful. Remember that with

$$E(y_i) = \lambda_i = \exp(\beta_1 + \beta_2 x_i + \delta D_i)$$

the conditional means are

$$E(y_i \mid D_i = 0) = \exp(\beta_1 + \beta_2 x_i)$$

$$E(y_i \mid D_i = 1) = \exp(\beta_1 + \beta_2 x_i + \delta)$$

so the percentage change in the conditional mean is

$$100\left[\frac{\exp(\beta_1 + \beta_2 x_i + \delta) - \exp(\beta_1 + \beta_2 x_i)}{\exp(\beta_1 + \beta_2 x_i)}\right]\% = 100\left[e^{\delta} - 1\right]\%$$

which is identical to the effect of a dummy in the log-linear model we saw under OLS.
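A short Python sketch comparing the exact percentage effect of a dummy, 100(e^δ − 1), with the naive reading 100·δ that would apply to a continuous regressor. The coefficient values are hypothetical and only illustrate how the two diverge as the coefficient grows in absolute value.

```python
import math


def pct_effect_dummy(delta: float) -> float:
    """Exact % change in E(y) when D goes from 0 to 1: 100*(exp(delta) - 1)."""
    return 100.0 * (math.exp(delta) - 1.0)


def pct_effect_continuous(beta2: float) -> float:
    """% change in E(y) per unit change in a continuous x: 100*beta2."""
    return 100.0 * beta2


# Hypothetical coefficient values, for illustration only.
for coef in (0.05, 0.30, -0.50):
    naive = pct_effect_continuous(coef)
    exact = pct_effect_dummy(coef)
    print(f"coef = {coef:+.2f}: 100*coef = {naive:7.2f}%, 100*(exp(coef) - 1) = {exact:7.2f}%")
```

For small coefficients the two numbers are close; for larger ones the naive reading understates positive effects and overstates negative ones.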

Extensions: overdispersion

Under a plain Poisson the mean of the count is assumed to be equal to the variance (equidispersion). This will often not hold: real-life data are often overdispersed. For example:

• a few women will have many affairs and many women will have few

• a few travelers will make many trips to a park and many will make few

• etc.

Extensions: overdispersion

. use "C:\bbbECONOMETRICS\Rober\GRAD\GROSMORNE.dta", clear

. poisson visits Travelcost educat income

Iteration 0: log likelihood = -1321.4696
Iteration 2: log likelihood = -1321.4665

Log likelihood = -1321.4665           Pseudo R2     = 0.0210

. poisson persontrip Travelcost educat income, nolog

                                      LR chi2(3)    = 671.71
                                      Prob > chi2   = 0.0000
Log likelihood = -2541.5165           Pseudo R2     = 0.1167

  persontrip |     Coef.   Std. Err.       z   P>|z|   [95% Conf. Interval]
-------------+--------------------------------------------------------------
      income | -.0019933   .0007191    -2.77   0.006   -.0034027  -.0005839
      income | -.0014578   .0004404    -3.31   0.001    -.002321  -.0005946
       _cons |  2.144476   .0688666    31.14   0.000      2.0095   2.279452

Extensions: negative binomial

Under a plain Poisson the mean of the count is assumed to be equal to the variance (equidispersion). If the data are overdispersed, the Poisson will inflate your t-ratios, making you think that your model works better than it actually does.

You can use a Negative Binomial model instead (nbreg), or even a Generalised Negative Binomial (gnbreg), which will allow you to model the overdispersion parameter as a function of covariates of your choice.

You can also test for overdispersion, to check whether the problem is significant.

Extensions: negative binomial

. sum visits

    Variable |  Obs       Mean   Std. Dev.   Min   Max
-------------+------------------------------------------
      visits |  966   1.416149    1.718147     1    26
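A quick back-of-the-envelope check in Python, using the mean and standard deviation reported by `sum visits`. This compares the unconditional variance with the unconditional mean; it is only a rough first look, because the Poisson assumption is about the variance conditional on the regressors.

```python
# Summary statistics copied from the `sum visits` output.
n_obs = 966
mean_visits = 1.416149
sd_visits = 1.718147

variance_visits = sd_visits ** 2
ratio = variance_visits / mean_visits  # would be close to 1 under equidispersion

print(f"variance = {variance_visits:.3f}, mean = {mean_visits:.3f}, ratio = {ratio:.3f}")
```

A ratio well above one, as here, is a first hint that a plain Poisson may be too restrictive for these data.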

Extensions: negative binomial

. nbreg persontrip Travelcost educat income, nolog

Negative binomial regression          Number of obs = 919
                                      LR chi2(3)    = 236.04
Dispersion = mean                     Prob > chi2   = 0.0000
Log likelihood = -2038.1155           Pseudo R2     = 0.0547

  persontrip |     Coef.   Std. Err.       z   P>|z|   [95% Conf. Interval]
-------------+--------------------------------------------------------------
  Travelcost | -.7135986   .0489137   -14.59   0.000   -.8094676  -.6177295
      educat | -.0218888   .0248201    -0.88   0.378   -.0705353   .0267578
      income | -.0014357   .0006578    -2.18   0.029   -.0027249  -.0001465
       _cons |  1.994577      .1037    19.23   0.000    1.791329   2.197826
-------------+--------------------------------------------------------------
    /lnalpha | -1.190022   .0724583                    -1.332038  -1.048006
       alpha |  .3042145   .0220429                     .2639388   .3506361

Likelihood-ratio test of alpha=0: chibar2(01) = 1006.80 Prob>=chibar2 = 0.000

Extensions: excess zeros

Often the number of zeros in the sample cannot be accommodated properly by a Poisson or Negative Binomial model: both would underpredict them. There is said to be an “excess zeros” problem.

You can then use hurdle models, zero-inflated models or zero-augmented models to accommodate the extra zeros.

Extensions: excess zeros

nbvargr is a very useful command: it plots the observed proportions against the fitted Poisson and negative binomial probabilities.

[Figure: observed proportion, Poisson probability and negative binomial probability plotted against k (0 to 10); mean = 3.296; overdispersion = 5.439]

Extensions: excess zeros

Hurdle, zero-inflated and zero-augmented models also allow you to have one process driving whether the value is zero or strictly positive and a different process driving the value of the strictly positive count.

EXAMPLES:

• Number of extramarital affairs versus gender

• Number of children before marriage versus religiosity

In the continuous case we have similar models (e.g. Cragg’s Model); an example is the size of insurance claims from fires versus the age of the building.

Extensions: excess zeros

Hurdle Models

A hurdle model is a modified count model in which there are two processes, one generating the zeros and one generating the positive values. The two models are not constrained to be the same. In the hurdle model a binomial probability model governs the binary outcome of whether a count variable has a zero or a positive value. If the value is positive, the "hurdle is crossed," and the conditional distribution of the positive values is governed by a zero-truncated count model.

Example: smokers versus non-smokers, if you are a smoker you will smoke!
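The two-part structure of a hurdle model can be sketched in Python as follows. This is a minimal illustration assuming a Poisson for the positive counts; the probability of a zero and the rate are hypothetical constants here, whereas in an estimated model each would depend on covariates (for example through a logit and an exponential index).

```python
import math


def poisson_pmf(y, lam):
    return math.exp(-lam) * lam ** y / math.factorial(y)


def hurdle_pmf(y, p_zero, lam):
    """Hurdle Poisson: a binary model gives P(y = 0) = p_zero; positive
    counts follow a zero-truncated Poisson with parameter lam."""
    if y == 0:
        return p_zero
    truncated = poisson_pmf(y, lam) / (1.0 - math.exp(-lam))  # P(y | y > 0)
    return (1.0 - p_zero) * truncated


# Hypothetical parameter values, for illustration only.
p_zero, lam = 0.6, 2.5
probs = [hurdle_pmf(y, p_zero, lam) for y in range(60)]

print([round(p, 4) for p in probs[:5]], "sum =", round(sum(probs), 6))
```

Note that all zeros come from the binary part: once the hurdle is crossed the count is at least one, which matches the smoker example.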

Extensions: excess zeros

Hurdle Models

In Stata, Joseph Hilbe’s downloadable ado HPLOGIT will work, although it does not allow for two different sets of variables, just two different sets of coefficients.

Extensions: excess zeros

Zero-inflated models (initially suggested by D. Lambert) attempt to account for excess zeros in a subtly different way.

In this model there are two kinds of zeros, "true zeros" and excess zeros. Zero-inflated models also estimate two equations, one for the count model and one for the excess zeros.

The key difference is that the count model allows zeros now. It is not a truncated count model, but allows for “corner solutions”.

Example: meat eaters (who sometimes just did not eat meat that week) versus vegetarians, who never do.
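A minimal Python sketch of a zero-inflated Poisson probability function, to make the "two kinds of zeros" concrete. The mixing probability `pi` and the rate `lam` are hypothetical constants; in an estimated model `pi` would typically come from a logit in the inflation covariates and `lam` from the count equation.

```python
import math


def poisson_pmf(y, lam):
    return math.exp(-lam) * lam ** y / math.factorial(y)


def zip_pmf(y, pi, lam):
    """Zero-inflated Poisson: with probability pi the observation is an
    'always zero' (excess zero); otherwise it is drawn from Poisson(lam),
    which can itself produce a zero."""
    base = (1.0 - pi) * poisson_pmf(y, lam)
    return pi + base if y == 0 else base


# Hypothetical parameter values, for illustration only.
pi, lam = 0.4, 2.5
p0_plain = poisson_pmf(0, lam)
p0_zip = zip_pmf(0, pi, lam)
total = sum(zip_pmf(y, pi, lam) for y in range(60))
mean_zip = sum(y * zip_pmf(y, pi, lam) for y in range(60))

print(f"P(0) plain Poisson = {p0_plain:.4f}, P(0) ZIP = {p0_zip:.4f}, ZIP mean = {mean_zip:.4f}")
```

Unlike the hurdle model, the count part is not truncated here: a meat eater can still record a zero in a given week.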

Extensions: excess zeros

. webuse fish

We want to model how many fish are being caught by fishermen at a state park. Visitors are asked how long they stayed, how many people were in the group, whether there were children in the group, and how many fish were caught.

Some visitors do not fish at all, but there is no data on whether a person fished or not. Some visitors who did fish did not catch any fish (and admitted it), so there are excess zeros in the data because of the people that did not fish.

Extensions: excess zeros

. histogram count, discrete freq

[Figure: frequency histogram of count, horizontal axis from 0 to 150]

Lots of zeros!

Extensions: excess zeros

. zip naffairs age male relig , inflate( age male relig ) vuong nolog

Zero-inflated Poisson regression      Number of obs = 601
                                      Nonzero obs   = 150
                                      Zero obs      = 451
Inflation model = logit               LR chi2(3)    = 29.67
Log likelihood = -810.055             Prob > chi2   = 0.0000

             |     Coef.   Std. Err.       z   P>|z|   [95% Conf. Interval]
-------------+--------------------------------------------------------------
naffairs     |
         age |   .015609   .0038029     4.10   0.000    .0081555   .0230625
        male | -.1598035   .0686006    -2.33   0.020   -.2942583  -.0253487
       relig | -.0971114   .0292688    -3.32   0.001   -.1544772  -.0397456
       _cons |  1.581638   .1577305    10.03   0.000    1.272492   1.890784
-------------+--------------------------------------------------------------
inflate      |
         age |  -.019041   .0104841    -1.82   0.069   -.0395895   .0015075
        male | -.1791471   .1948003    -0.92   0.358   -.5609488   .2026546
       relig |  .2884574   .0841492     3.43   0.001    .1235281   .4533867
       _cons |  .9322364   .3901503     2.39   0.017    .1675558   1.696917

Vuong test of zip vs. standard Poisson: z = 11.66 Pr>z = 0.0000

Extensions: excess zeros

. zinb naffairs age male relig , inflate( age male relig ) vuong nolog

Zero-inflated negative binomial regression   Number of obs = 601
                                             Nonzero obs   = 150
                                             Zero obs      = 451
Inflation model = logit                      LR chi2(3)    = 8.92
Log likelihood = -726.405                    Prob > chi2   = 0.0304

             |     Coef.   Std. Err.       z   P>|z|   [95% Conf. Interval]
-------------+--------------------------------------------------------------
naffairs     |
         age |  .0258188   .0107692     2.40   0.017    .0047115    .046926
        male | -.2214886   .1660362    -1.33   0.182   -.5469135   .1039364
       relig | -.1472717   .0749567    -1.96   0.049   -.2941842  -.0003593
       _cons |  1.273196   .3874106     3.29   0.001    .5138849   2.032506
-------------+--------------------------------------------------------------
inflate      |
         age |  -.014892   .0113465    -1.31   0.189   -.0371308   .0073468
        male | -.2309299   .2091759    -1.10   0.270   -.6409071   .1790474
       relig |   .274744   .0904315     3.04   0.002    .0975014   .4519865
       _cons |  .6673066    .433002     1.54   0.123   -.1813618   1.515975
-------------+--------------------------------------------------------------
    /lnalpha | -.2743069   .2532933    -1.08   0.279   -.7707527   .2221388
       alpha |  .7600988   .1925279                     .4626647   1.248745

Vuong test of zinb vs. standard negative binomial: z = 2.82 Pr>z = 0.0024

Extensions: truncation

• Count data can be truncated too (usually at zero)

• So ztp and ztnb can accommodate that

• Example: you interview visitors at the recreational site, so they all made at least that one trip

• In the continuous case we would have to use the truncreg command
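To see what truncation at zero does to a Poisson count, here is a small Python sketch of the zero-truncated probability function and its mean. The rate value is hypothetical. The point is that the mean of the observed (truncated) counts exceeds the underlying rate, so ignoring the truncation distorts the estimates.

```python
import math


def ztp_pmf(y, lam):
    """P(Y = y | Y > 0) for a Poisson(lam) count, defined for y = 1, 2, ..."""
    if y < 1:
        return 0.0
    return math.exp(-lam) * lam ** y / (math.factorial(y) * (1.0 - math.exp(-lam)))


def ztp_mean(lam):
    """E(Y | Y > 0) = lam / (1 - exp(-lam)), which exceeds lam."""
    return lam / (1.0 - math.exp(-lam))


lam = 1.2  # hypothetical rate, for illustration only
total = sum(ztp_pmf(y, lam) for y in range(1, 80))
numeric_mean = sum(y * ztp_pmf(y, lam) for y in range(1, 80))

print(f"rate = {lam:.3f}, truncated mean = {ztp_mean(lam):.3f}, numeric check = {numeric_mean:.3f}")
```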

Extensions: truncation

• This model works much better and showcases the bias in the previous estimates:

. ztp persontrip Travelcost educat income, nolog

Zero-truncated Poisson regression     Number of obs = 919
                                      LR chi2(3)    = 885.68
                                      Prob > chi2   = 0.0000
Log likelihood = -2412.6552           Pseudo R2     = 0.1551

  persontrip |     Coef.   Std. Err.       z   P>|z|   [95% Conf. Interval]
-------------+--------------------------------------------------------------
  Travelcost | -1.380461   .0571736   -24.15   0.000   -1.492519  -1.268403
      educat | -.0170332   .0175026    -0.97   0.330   -.0513376   .0172712
      income | -.0013521    .000473    -2.86   0.004   -.0022791  -.0004251
       _cons |  2.278878   .0728394    31.29   0.000    2.136116   2.421641

The estimated Consumer Surplus is now smaller.

Extensions: truncation

• Now accounting for overdispersion

. ztnb persontrip Travelcost educat income, nolog

Zero-truncated negative binomial regression   Number of obs = 919
                                              LR chi2(3)    = 263.89
Dispersion = mean                             Prob > chi2   = 0.0000
Log likelihood = -1866.326                    Pseudo R2     = 0.0660

  persontrip |     Coef.   Std. Err.       z   P>|z|   [95% Conf. Interval]
-------------+--------------------------------------------------------------
  Travelcost | -1.079011    .068793   -15.68   0.000   -1.213843  -.9441795
      educat | -.0216377   .0322941    -0.67   0.503    -.084933   .0416576
      income | -.0016369   .0008563    -1.91   0.056   -.0033152   .0000413
       _cons |  2.015503   .1344308    14.99   0.000    1.752024   2.278983
-------------+--------------------------------------------------------------
    /lnalpha | -.6368613    .101849                    -.8364818  -.4372409
       alpha |    .52895    .053873                      .433232   .6458158

Likelihood-ratio test of alpha=0: chibar2(01) = 1092.66 Prob>=chibar2 = 0.000

Extensions: truncation and endogenous stratification

• Example: you interview visitors at the recreational site, so they all made at least that one trip

• You interview patients at the doctor’s office about how often they visit the doctor

• You ask people on George St. how often they go to George St.

• Then you are oversampling “frequent visitors” and biasing your estimates, perhaps substantially

Extensions: truncation and endogenous stratification

• It turns out to be very easy to deal with a Truncated and Endogenously Stratified Poisson Model (as shown by Shaw, 1988): simply run a plain Poisson on “Count-1” (in Stata: poisson on the corrected count)

• It is more complex if there is overdispersion, though

Extensions: truncation and endogenous stratification

• Very easy to deal with a Truncated and Endogenously Stratified Poisson Model

. poisson persontripminusone Travelcost educat income, nolog

Poisson regression                    Number of obs = 919
                                      LR chi2(3)    = 1071.95
                                      Prob > chi2   = 0.0000
Log likelihood = -2474.3262           Pseudo R2     = 0.1780

persontrip~e |     Coef.   Std. Err.       z   P>|z|   [95% Conf. Interval]
-------------+--------------------------------------------------------------
  Travelcost | -1.657986   .0620722   -26.71   0.000   -1.779646  -1.536327
      educat | -.0202144   .0191574    -1.06   0.291   -.0577622   .0173333
      income | -.0016285   .0005184    -3.14   0.002   -.0026446  -.0006124
       _cons |  2.191885   .0792934    27.64   0.000    2.036473   2.347298

The estimated Consumer Surplus is now much smaller.

Extensions: truncation and endogenous stratification

• Endogenously Stratified Negative Binomial Model (as shown by Shaw, 1988; Englin and Shonkwiler, 1995):

. nbstrat persontrip Travelcost educat income, nolog

Negative Binomial with Endogenous Stratification   Number of obs = 919
                                                   Wald chi2(3)  = 283.49
Log likelihood = -1837.3183                        Prob > chi2   = 0.0000

  persontrip |     Coef.   Std. Err.       z   P>|z|   [95% Conf. Interval]
-------------+--------------------------------------------------------------
  Travelcost | -1.152915   .0695958   -16.57   0.000   -1.289321   -1.01651
      educat | -.0229483   .0318753    -0.72   0.472   -.0854228   .0395261
      income | -.0017368   .0008447    -2.06   0.040   -.0033923  -.0000813
       _cons |  1.189429   .1561017     7.62   0.000    .8834757   1.495383
-------------+--------------------------------------------------------------
    /lnalpha |   .092944   .1482435     0.63   0.531    -.197608   .3834959
       alpha |    1.0974   .1626825                     .8206915   1.467406

AIC Statistic = 4.007                 BIC Statistic = -6243.307
Deviance      = 0.000                 Dispersion    = 0.000

Even after accounting for overdispersion, the CS estimate is relatively low.
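The slides compare consumer surplus (CS) across models without stating the formula. A common result for count-data travel cost models with an exponential mean is that CS per trip equals −1/β on the travel cost variable; the Python sketch below assumes that formula (it is not taken from these notes) and applies it to the Travelcost coefficients reported in the outputs. The results are in whatever units Travelcost is measured in, which the slides do not specify.

```python
# Travelcost coefficients copied from the Stata outputs in these notes.
travelcost_coef = {
    "nbreg": -0.7135986,
    "ztp": -1.380461,
    "ztnb": -1.079011,
    "poisson on count-1": -1.657986,
    "nbstrat": -1.152915,
}


def cs_per_trip(beta_tc: float) -> float:
    """Assumed formula: CS per trip = -1 / beta_tc for an exponential-mean demand."""
    if beta_tc >= 0:
        raise ValueError("the travel cost coefficient must be negative")
    return -1.0 / beta_tc


cs = {model: cs_per_trip(b) for model, b in travelcost_coef.items()}
for model, value in cs.items():
    print(f"{model:>20}: CS per trip = {value:.4f} (Travelcost units)")
```

Under this assumption, a more negative travel cost coefficient translates directly into a smaller CS estimate, which is the pattern the slide comments describe.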

Extensions: truncation and endogenous stratification •How do we calculate the pseudo-R2 for this model???


Extensions: truncation and endogenous stratification

• GNBSTRAT will also allow you to model the overdispersion parameter in this case, just as gnbreg did for the plain case

NOTE: what is the exposure?

• Count models often need to deal with the fact that the counts may be measured over observation periods of different length (in terms of time or some other relevant dimension)

• For example, the number of accidents is recorded for 50 different intersections. However, the number of vehicles that pass through the intersections can vary greatly. Five accidents for 30,000 vehicles is very different from five accidents for 1,500 vehicles

• Count models account for these differences by including the log of the exposure variable in the model, with its coefficient constrained to be one

• The use of exposure is often superior to analyzing rates as response variables, because it makes use of the correct probability distributions
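The role of the exposure term can be illustrated with a short Python sketch. The two traffic volumes come from the example in the text; the coefficients and the covariate value are hypothetical. Entering ln(exposure) with a coefficient of one makes the expected count proportional to exposure, so the model is effectively a model of the rate.

```python
import math


def expected_count(beta1, beta2, x, exposure):
    """E(y) = exp(b1 + b2*x + 1.0*ln(exposure)) = exposure * exp(b1 + b2*x)."""
    return math.exp(beta1 + beta2 * x + 1.0 * math.log(exposure))


# The two intersections from the text: five accidents each, very different traffic.
accidents = 5
for vehicles in (30_000, 1_500):
    rate = accidents / vehicles
    print(f"{vehicles:>6} vehicles: {rate * 1000:.3f} accidents per 1,000 vehicles")

# Hypothetical coefficients: same covariate value, different exposure.
low = expected_count(-8.0, 0.1, 2.0, exposure=1_500)
high = expected_count(-8.0, 0.1, 2.0, exposure=30_000)
print(f"expected counts: {low:.3f} vs {high:.3f} (ratio {high / low:.1f})")
```

In Stata this corresponds to the exposure() option of the count regression commands.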

16.7.1 Censored Data

Figure 16.3 Histogram of Wife’s Hours of Work in 1975


Having censored data means that a substantial fraction of the observations on the dependent variable take a limit value. The regression function is no longer given by (16.30):

$$E(y \mid x) = \beta_1 + \beta_2 x \tag{16.30}$$

The least squares estimators of the regression parameters obtained by running a regression of y on x are biased and inconsistent; least squares estimation fails.


With truncation, we only observe the value of the regressors when the dependent variable takes a certain value (usually a positive one instead of zero).

With censoring, we observe in principle the value of the regressors for everyone, but not the value of the dependent variable for those whose dependent variable takes a value beyond the limit.

Consider the latent variable model

$$y_i^* = \beta_1 + \beta_2 x_i + e_i = -9 + x_i + e_i \tag{16.31}$$

Assume $e_i \sim N(0, \sigma^2 = 16)$. The observed values are

$$y_i = \begin{cases} 0 & \text{if } y_i^* \le 0 \\ y_i^* & \text{if } y_i^* > 0 \end{cases}$$

• Create N = 200 random values of $x_i$ that are spread evenly (or uniformly) over the interval [0, 20]. These we will keep fixed in further simulations.

• Obtain N = 200 random values $e_i$ from a normal distribution with mean 0 and variance 16.

• Create N = 200 values of the latent variable.

• Obtain N = 200 values of the observed $y_i$ using

$$y_i = \begin{cases} 0 & \text{if } y_i^* \le 0 \\ y_i^* & \text{if } y_i^* > 0 \end{cases}$$
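The four simulation steps can be sketched in Python with the standard library. The seed is arbitrary and the `ols` helper is written here only for this illustration; the latent model parameters (−9, 1, variance 16) are those of the text. The numbers will differ from the textbook figures because the random draws differ.

```python
import random

random.seed(12345)  # arbitrary seed so that the run is reproducible

N = 200
BETA1, BETA2, SIGMA = -9.0, 1.0, 4.0  # latent model (16.31); variance 16 means sigma = 4


def ols(xs, ys):
    """Intercept and slope of a simple least squares regression of y on x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b2 = sxy / sxx
    return ybar - b2 * xbar, b2


x = [random.uniform(0.0, 20.0) for _ in range(N)]            # step 1
e = [random.gauss(0.0, SIGMA) for _ in range(N)]             # step 2
y_star = [BETA1 + BETA2 * xi + ei for xi, ei in zip(x, e)]   # step 3: latent variable
y = [max(0.0, ys) for ys in y_star]                          # step 4: censor at zero

b_latent = ols(x, y_star)
b_censored = ols(x, y)
positive = [(xi, yi) for xi, yi in zip(x, y) if yi > 0]
b_positive = ols([p[0] for p in positive], [p[1] for p in positive])

print("latent y*:        intercept %.3f, slope %.3f" % b_latent)
print("censored y:       intercept %.3f, slope %.3f" % b_censored)
print("positive y only:  intercept %.3f, slope %.3f" % b_positive)
```

Least squares on the latent variable recovers values near −9 and 1, while both regressions on the observed data give a slope that is too flat, which is the failure of least squares the text describes.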

Figure 16.4 Uncensored Sample Data and Regression Function


Figure 16.5 Censored Sample Data, and Latent Regression Function and Least Squares Fitted Line

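Least squares fitted lines (standard errors in parentheses):

$$\hat y_i = b_1 + b_2 x_i, \qquad (\text{se})\ (.3706)\ (.0326) \tag{16.32a}$$

$$\hat y_i = b_1 + b_2 x_i, \qquad (\text{se})\ (1.2055)\ (.0827) \tag{16.32b}$$

Monte Carlo average of the estimates over NSAM samples:

$$E_{MC}(b_k) = \frac{1}{NSAM} \sum_{m=1}^{NSAM} b_{k(m)} \tag{16.33}$$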

The maximum likelihood procedure is called Tobit in honor of James Tobin, winner of the 1981 Nobel Prize in Economics, who first studied this model.

The probit probability that $y_i = 0$ is:

$$P[y_i = 0] = P[y_i^* \le 0] = 1 - \Phi\left[(\beta_1 + \beta_2 x_i)/\sigma\right]$$

The likelihood function is

$$L(\beta_1, \beta_2, \sigma) = \prod_{y_i = 0} \left\{ 1 - \Phi\left(\frac{\beta_1 + \beta_2 x_i}{\sigma}\right) \right\} \times \prod_{y_i > 0} \left\{ (2\pi\sigma^2)^{-1/2} \exp\left( -\frac{1}{2\sigma^2} (y_i - \beta_1 - \beta_2 x_i)^2 \right) \right\}$$
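The Tobit likelihood above translates directly into a log-likelihood function. The Python sketch below is a minimal illustration for censoring at zero; the six data points are hypothetical, and the function is only evaluated at two parameter vectors rather than maximized.

```python
import math


def norm_cdf(z):
    """Standard normal cdf, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_loglik(beta1, beta2, sigma, xs, ys):
    """Log of the Tobit likelihood for data censored at zero."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = beta1 + beta2 * x
        if y <= 0:
            # limit observation: ln[1 - Phi(mu / sigma)]
            total += math.log(1.0 - norm_cdf(mu / sigma))
        else:
            # non-limit observation: ln of the normal density of y
            total += -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2.0 * sigma ** 2)
    return total


# Tiny hypothetical data set with two censored observations.
xs = [2.0, 5.0, 9.0, 12.0, 15.0, 18.0]
ys = [0.0, 0.0, 1.5, 2.0, 7.0, 8.5]

ll_a = tobit_loglik(-9.0, 1.0, 4.0, xs, ys)  # parameters of the simulated latent model
ll_b = tobit_loglik(0.0, 0.0, 4.0, xs, ys)   # a flat line through zero

print(f"ln L at (-9, 1, 4) = {ll_a:.4f}; ln L at (0, 0, 4) = {ll_b:.4f}")
```

A maximum likelihood routine would search over the three parameters for the largest value of this function.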

The maximum likelihood estimator is consistent and asymptotically normal, with a known covariance matrix.

Using the artificial data the fitted values are:

$$\tilde y_i = \tilde\beta_1 + \tilde\beta_2 x_i, \qquad (\text{se})\ (1.0970)\ (.0790) \tag{16.34}$$

In the Tobit model the marginal effect of x on E(y | x) is

$$\frac{\partial E(y \mid x)}{\partial x} = \beta_2 \Phi\left(\frac{\beta_1 + \beta_2 x}{\sigma}\right) \tag{16.35}$$

Because the cdf values are positive, the sign of the coefficient does tell the direction of the marginal effect, just not its magnitude. If $\beta_2 > 0$, as x increases the cdf function approaches 1, and the slope of the regression function approaches that of the latent variable model.
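A small Python sketch of (16.35), evaluated with the parameters of the simulated latent model from the text (−9, 1 and σ = 4). The x values chosen for the loop are arbitrary points in the interval [0, 20].

```python
import math


def norm_cdf(z):
    """Standard normal cdf, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_marginal_effect(beta1, beta2, sigma, x):
    """dE(y|x)/dx = beta2 * Phi((beta1 + beta2*x) / sigma), as in (16.35)."""
    return beta2 * norm_cdf((beta1 + beta2 * x) / sigma)


# Parameters of the simulated latent model: y* = -9 + x + e, sigma = 4.
effects = {x: tobit_marginal_effect(-9.0, 1.0, 4.0, x) for x in (0.0, 9.0, 20.0)}
for x, effect in effects.items():
    print(f"x = {x:4.1f}: marginal effect = {effect:.4f}")
```

The effect is always a fraction of the latent slope and moves toward it as x grows, in line with the statement above.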

Figure 16.6 Censored Sample Data, and Regression Functions for Observed and Positive y values

$$HOURS = \beta_1 + \beta_2 EDUC + \beta_3 EXPER + \beta_4 AGE + \beta_5 KIDSL6 + e \tag{16.36}$$

$$\frac{\partial E(HOURS)}{\partial EDUC} = \tilde\beta_2 \tilde\Phi = 26.34$$

• Problem: our sample is not a random sample. The data we observe are “selected” by a systematic process for which we do not account.

• Solution: a technique called Heckit, named after its developer, Nobel Prize winning econometrician James Heckman.

• The econometric model describing the situation is composed of two equations. The first is the selection equation that determines whether the variable of interest is observed.

$$z_i^* = \gamma_1 + \gamma_2 w_i + u_i, \quad i = 1, \ldots, N \tag{16.37}$$

$$z_i = \begin{cases} 1 & z_i^* > 0 \\ 0 & \text{otherwise} \end{cases} \tag{16.38}$$

• The second equation is the linear model of interest. It is

$$y_i = \beta_1 + \beta_2 x_i + e_i, \quad i = 1, \ldots, n, \quad N > n \tag{16.39}$$

$$E[y_i \mid z_i^* > 0] = \beta_1 + \beta_2 x_i + \beta_\lambda \lambda_i, \quad i = 1, \ldots, n \tag{16.40}$$

where

$$\lambda_i = \frac{\phi(\gamma_1 + \gamma_2 w_i)}{\Phi(\gamma_1 + \gamma_2 w_i)} \tag{16.41}$$

• The estimated “Inverse Mills Ratio” is

$$\tilde\lambda_i = \frac{\phi(\tilde\gamma_1 + \tilde\gamma_2 w_i)}{\Phi(\tilde\gamma_1 + \tilde\gamma_2 w_i)}$$

• The estimating equation is

$$y_i = \beta_1 + \beta_2 x_i + \beta_\lambda \tilde\lambda_i + v_i, \quad i = 1, \ldots, n \tag{16.42}$$
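The inverse Mills ratio is simple to compute once the probit index is available. The Python sketch below uses hypothetical probit estimates and hypothetical values of w; it only illustrates the first-step calculation that feeds the extra regressor into (16.42).

```python
import math


def norm_pdf(z):
    """Standard normal density, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cdf, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def inverse_mills_ratio(index):
    """lambda = phi(index) / Phi(index), with index = gamma1 + gamma2 * w."""
    return norm_pdf(index) / norm_cdf(index)


# Hypothetical probit estimates and values of w, for illustration only.
gamma1, gamma2 = -0.5, 0.8
imr = {w: inverse_mills_ratio(gamma1 + gamma2 * w) for w in (-1.0, 0.0, 1.0, 3.0)}
for w, value in imr.items():
    print(f"w = {w:+.1f}: IMR = {value:.4f}")
```

The ratio falls as the index rises: observations that were very likely to be selected need only a small correction.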

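The least squares estimated wage equation is

$$\ln(WAGE) = b_1 + b_2\,EDUC + .0157\,EXPER, \qquad R^2 = .1484 \tag{16.43}$$

The estimated probit model of the selection equation is

$$\hat P(z_i = 1) = \Phi(\tilde\gamma_1 + \tilde\gamma_2\,AGE + .0838\,EDUC - .3139\,KIDS - 1.3939\,MTR)$$

and the estimated inverse Mills ratio is

$$IMR = \tilde\lambda = \frac{\phi(\tilde\gamma_1 + \tilde\gamma_2\,AGE + .0838\,EDUC - .3139\,KIDS - 1.3939\,MTR)}{\Phi(\tilde\gamma_1 + \tilde\gamma_2\,AGE + .0838\,EDUC - .3139\,KIDS - 1.3939\,MTR)}$$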

The two-step estimated wage equation, including the IMR, is

$$\ln(WAGE) = b_1 + b_2\,EDUC + .0163\,EXPER - .8664\,IMR \tag{16.44}$$

• The maximum likelihood estimated wage equation is

$$\ln(WAGE) = \tilde\beta_1 + \tilde\beta_2\,EDUC + .0118\,EXPER, \qquad (t\text{-stat})\ (2.84)\ (3.96)\ (2.87)$$

The standard errors based on the full information maximum likelihood procedure are smaller than those yielded by the two-step estimation method.

Keywords

• binary choice models
• censored data
• conditional logit
• count data models
• feasible generalized least squares
• Heckit
• identification problem
• independence of irrelevant alternatives (IIA)
• index models
• individual and alternative specific variables
• individual specific variables
• latent variables
• likelihood function
• limited dependent variables
• linear probability model
• logistic random variable
• logit
• log-likelihood function
• marginal effect
• maximum likelihood estimation
• multinomial choice models
• multinomial logit
• odds ratio
• ordered choice models
• ordered probit
• ordinal variables
• Poisson random variable
• Poisson regression model
• probit
• selection bias
• tobit model
• truncated data

• Survival analysis (time-to-event data analysis)

• Multivariate probit (biprobit, triprobit, mvprobit)

• Hoffmann, 2004 for all topics

• Long, S. and J. Freese for all topics

• Cameron and Trivedi’s book for count data

Agresti, A. (2001) Categorical Data Analysis (2nd ed). New York: Wiley.