Transcript Document

SKEMA Ph.D programme
2010-2011
Class 6
Qualitative Dependent
Variable Models
Lionel Nesta
Observatoire Français des Conjonctures Economiques
[email protected]
Structure of the class
1. The linear probability model
2. Maximum likelihood estimations
3. Binary logit models and some other models
4. Multinomial models
5. Ordered multinomial models
6. Count data models
The Linear Probability Model
The linear probability model

When the dependent variable is binary (0/1, for example, Y=1 if the
firm innovates, 0 otherwise), OLS is called the linear probability
model.
Y = β0 + β1x1 + β2x2 + u

How should one interpret βj? Provided that OLS4 – E(u|X)=0 – holds
true, then:
E(Y|X) = β0 + β1x1 + β2x2
The linear probability model
Y follows a Bernoulli distribution with expected value P. This model is
called the linear probability model because its expected value,
conditional on X, and written E(Y|X), can be interpreted as the
conditional probability of the occurrence of Y given values of X.

E(Y | X)  Pr(Y  1| X)
1  E(Y | X)  Pr(Y  0 | X)

β measures the variation of the probability of success for a one-unit
variation of X (ΔX=1)
E(Y | X) Pr(Y  1| X)


 Pr(Y  1| X)
X
X
Limits of the linear probability model (1)
Non normality of errors

OLS6 : The error term is independent of all RHS and follows a
normal distribution with zero mean and variance σ²
u ~ Normal(0, σ²)
Since, for a given X, the error can only take two values (the conditional probability or its complement to unity), it follows a Bernoulli-type distribution, not a normal distribution.
Limits of the linear probability model (1)
Non normality of errors

[Figure: density of the LPM residuals, which is clearly non-normal]
Limits of the linear probability model (2)
Heteroskedastic errors

OLS5 : The variance of the error term, u, conditional on RHS, is
the same for all values of RHS
Var(u | x1, x2, …, xk) = σ²
The error term is itself distributed Bernoulli, and its variance
depends on X. Hence it is heteroskedastic
Var(u) = P(1 − P) = E(Y|X) × (1 − E(Y|X))
Limits of the linear probability model (2)
Heteroskedastic errors

[Figure: residuals plotted against fitted values, with a spread that varies with the level of the prediction]
Limits of the linear probability model (3)
Fallacious predictions

By definition, a probability is always in the unit interval [0;1]
0 ≤ E(Y|X) ≤ 1

But OLS does not guarantee this condition


Predictions may lie outside the bounds [0;1]
The marginal effect is constant, since P = E(Y|X) grows linearly with X. This is not very realistic (e.g. the probability of giving birth conditional on the number of children already born)
Limits of the linear probability model (3)
Fallacious predictions

[Figure: density of the fitted values, part of which lies above 1 (fallacious predictions)]
Limits of the linear probability model (4)
A downward bias in the coefficient of determination R²

Observed values are 1 or 0, whereas predictions should lie
between 0 and 1: [0;1].

Comparing predicted with observed values, the goodness of fit as assessed by the R² is systematically low.
Limits of the linear probability model (4)
Fallacious predictions

[Figure: observed values (0 or 1) plotted against fitted values; predictions outside [0;1] lower the R²]
Limits of the linear probability model (4)
1. Non normality of errors: u ~ Normal(0, σ²) is violated
2. Heteroskedastic errors: Var(u | x1, x2, …, xk) = σ² is violated
3. Fallacious predictions: 0 ≤ E(Y|X) ≤ 1 is not guaranteed
4. A downward bias in the R²
Overcoming the limits of the LPM
1. Non normality of errors: increase sample size
2. Heteroskedastic errors: use robust estimators
3. Fallacious predictions: perform non linear or constrained regressions
4. A downward bias in the R²: do not use it as a measure of goodness of fit
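As an illustration of points 2 and 3 above, a linear probability model with heteroskedasticity-robust standard errors could be run as follows (a minimal sketch, assuming the innovation data used later in this class, with inno as the 0/1 dependent variable):

* linear probability model: OLS on a binary outcome, robust standard errors
regress inno lrdi lassets spe biotech, robust
* the linear predictions may fall outside [0;1]
predict p_lpm, xb
summarize p_lpm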
Persistent use of LPM

Although it has limits, the LPM is still used:
1. In the process of data exploration (early stages of the research)
2. It is a good indicator of the marginal effect for the representative observation (at the mean)
3. When dealing with very large samples, least squares avoid the complications involved in maximum likelihood techniques:
 Time of computation
 Endogeneity and panel data problems
The LOGIT Model
Probability, odds and logit

We need to explain the occurrence of an event: the LHS
variable takes two values : y={0;1}.

In fact, we need to explain the probability of occurrence of
the event, conditional on X: P(Y=y | X) ∈ [0 ; 1].

OLS estimations are not adequate, because predictions
can lie outside the interval [0 ; 1].


We need to transform a real number, say z ∈ ]−∞;+∞[, into P(Y=y | X) ∈ [0 ; 1].
The logistic transformation links a real number z ∈ ]−∞;+∞[ to P(Y=y | X) ∈ [0 ; 1]. It is also called the link function.
The logit link function
Let us make sure that the transformation of z lies between 0 and 1:

z ∈ ]−∞ ; +∞[
e^z ∈ ]0 ; +∞[
e^z / (1 + e^z) ∈ ]0 ; 1[ ,   since   0 < e^z < 1 + e^z

e^z / (1 + e^z) is called the logit link function
The logit model
Hence the probability of any event to occur is :
P(y = 1 | z) = e^z / (1 + e^z)

P(y = 0 | z) = 1 − P(y = 1 | z) = 1 − e^z / (1 + e^z) = 1 / (1 + e^z)
But what is z?
The odds ratio
The odds ratio is defined as the ratio of the probability and its
complement. Taking the log yields z. Hence z is the log transform of
the odds ratio.
P / (1 − P) = [ e^z / (1 + e^z) ] / [ 1 / (1 + e^z) ] = e^z

ln[ P / (1 − P) ] = z
This has two important characteristics:
1. z ∈ ]−∞;+∞[ while P(Y=1) ∈ [0 ; 1]
2. The probability is not linear in z (the plot linking z with P(Y=1) is not a straight line)
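This mapping between probabilities and log odds can be checked numerically (a quick sketch using Stata's built-in logit() and invlogit() functions, with the value 0.80 taken from the table below):

* log odds associated with a probability of 0.80
display logit(0.80)
* and back from the log odds to the probability
display invlogit(logit(0.80))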
Probability, odds and logit
P(Y=1)    Odds     p(y=1)/(1−p(y=1))    Ln(odds)
0.01      1/99     0.01                 −4.60
0.03      3/97     0.03                 −3.48
0.05      5/95     0.05                 −2.94
0.20      20/80    0.25                 −1.39
0.30      30/70    0.43                 −0.85
0.40      40/60    0.67                 −0.41
0.50      50/50    1.00                  0.00
0.60      60/40    1.50                  0.41
0.70      70/30    2.33                  0.85
0.80      80/20    4.00                  1.39
0.95      95/5     19.0                  2.94
0.97      97/3     32.3                  3.48
0.99      99/1     99.0                  4.60
The logit transformation

The preceding table matches levels of probability with the odds ratio.
The probability varies between 0 and 1; the odds varies between 0 and +∞; the log of the odds varies between −∞ and +∞.
Notice that the distribution of the log of the odds is symmetrical.
[Figure: logistic probability density of the log odds ratio, symmetrical around zero]

[Figure: P(y=1 | z) plotted against z, an S-shaped curve showing that the probability is not linear in z]
The logit link function
The whole trick that overcomes the OLS problem is then to posit:

z = α + β1x1 + … + βkxk = Xβ

Hence  e^z / (1 + e^z)  is rewritten  e^(Xβ) / (1 + e^(Xβ))
But how can we estimate the above equation knowing that
we do not observe z?
Maximum likelihood estimations

OLS cannot be of much help here. We will use Maximum Likelihood Estimation (MLE) instead.

MLE is an alternative to OLS. It consists of finding the parameter values which are most consistent with the data we have.

In Statistics, the likelihood is defined as the joint probability to
observe a given sample, given the parameters involved in the
generating function.

One way to distinguish between OLS and MLE is as follows:
OLS adapts the model to the data you have : you only have one model
derived from your data. MLE instead supposes there is an infinity of
models, and chooses the model most likely to explain your data.
Likelihood functions

Let us assume that you have a sample of n random observations.
Let f(yi ) be the probability that yi = 1 or yi = 0. The joint probability to
observe jointly n values of yi is given by the likelihood function:
f(y1, y2, …, yn) = ∏_{i=1}^{n} f(yi)

We need to specify the function f(.). It comes from the empirical discrete distribution of an event that can have only two outcomes: a success (yi = 1) or a failure (yi = 0). This is the binomial distribution. With a single trial, it reduces to the Bernoulli form:

f(yi) = C(1, yi) p^(yi) (1 − p)^(1−yi) = p^(yi) (1 − p)^(1−yi)
Likelihood functions

Knowing p (as the logit), having defined f(.), we come up with the
likelihood function:
L(y) = ∏_{i=1}^{n} f(yi) = ∏_{i=1}^{n} p^(yi) (1 − p)^(1−yi)

L(y, z) = ∏_{i=1}^{n} f(yi, z) = ∏_{i=1}^{n} [ e^z / (1 + e^z) ]^(yi) [ 1 / (1 + e^z) ]^(1−yi)

L(y, x, β) = ∏_{i=1}^{n} f(yi, X, β) = ∏_{i=1}^{n} [ e^(Xβ) / (1 + e^(Xβ)) ]^(yi) [ 1 / (1 + e^(Xβ)) ]^(1−yi)
Log likelihood (LL) functions

The log transform of the likelihood function (the log likelihood) is
much easier to manipulate, and is written:
LL(y, z) = Σ_{i=1}^{n} yi z − Σ_{i=1}^{n} ln(1 + e^z)

LL(y, x, β) = Σ_{i=1}^{n} yi Xβ − Σ_{i=1}^{n} ln(1 + e^(Xβ))

LL(y, x, β) = − Σ_{i=1}^{n} [ ln(1 + e^(Xβ)) − yi Xβ ]
Maximum likelihood estimations

The LL function can yield an infinity of values for the
parameters β.

Given the functional form of f(.) and the n observations at
hand, which values of parameters β maximize the
likelihood of my sample?

In other words, what are the most likely values of my
unknown parameters β given the sample I have?
Maximum likelihood estimations
The LL is globally concave and has a maximum. The gradient is used
to compute the parameters of interest, and the hessian is used to
compute the variance-covariance matrix.
∂LL/∂β = Σ_{i=1}^{n} (yi − Λi) xi = 0    where   Λi = e^z / (1 + e^z)

∂²LL/∂β∂β′ = − Σ_{i=1}^{n} Λi (1 − Λi) xi xi′
However, there is no analytical solution to this non-linear problem. Instead, we rely on an optimization algorithm (Newton-Raphson).
You can imagine that the computer generates candidate values of β, computes a likelihood value for each vector of values, and then chooses the vector β for which the likelihood is highest.
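To see what the maximized log likelihood means in practice, it can be recomputed by hand from the fitted probabilities (a minimal sketch, assuming the inno logit estimated later in this class):

* recompute the log likelihood of a fitted logit by hand
quietly logit inno lrdi lassets spe biotech
predict phat, pr
generate ll_i = inno*ln(phat) + (1-inno)*ln(1-phat)
quietly summarize ll_i
display r(sum)   // should match the log likelihood reported by -logit-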
Example: Binary Dependent Variable

We want to explore the factors affecting the probability of being a successful innovator (inno = 1): why?

352 (81.7%) innovate and 79 (18.3%) do not.

The odds of carrying out a successful innovation is about
4 against 1 (as 352/79=4.45).

The log of the odds is 1.494 (z = 1.494)

For the sample (and the population?) of firms the
probability of being innovative is four times higher than
the probability of NOT being innovative
Logistic Regression with STATA
 Instruction Stata : logit
logit y x1 x2 x3 … xk
[if] [weight] [, options]
Options
 noconstant : estimates the model without the constant
 robust : estimates robust variances, in particular in the presence of heteroskedasticity
 if : allows us to select the observations to include in the analysis
 weight : allows us to weight the observations differently
Logistic Regression with STATA

Let’s start and run a constant only model

logit inno
Goodness of fit, parameter estimates, standard errors and z values:

. logit inno

Iteration 0:   log likelihood = -205.30803

Logistic regression                    Number of obs =    431
                                       LR chi2(0)    =   0.00
                                       Prob > chi2   =      .
Log likelihood = -205.30803            Pseudo R2     = 0.0000

        inno |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
       _cons |  1.494183   .1244955    12.00   0.000    1.250177     1.73819
Interpretation of Coefficients

What does this simple model tell us ?

Remember that we need to use the logit formula to transform
the logit into a probability:

P(Y = 1 | X) = e^(Xβ) / (1 + e^(Xβ))
Interpretation of Coefficients
P = e^1.494 / (1 + e^1.494) = 0.817

The constant 1.494 must be interpreted as the log of the odds ratio.

Using the logit link function, the average probability to innovate is obtained with:

display exp(_b[_cons])/(1+exp(_b[_cons]))

We find exactly the empirical sample value: 81.7%
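This is simply the share of innovating firms in the sample (a trivial check using the counts reported earlier):

* 352 innovators out of 431 firms
display 352/431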
Interpretation of Coefficients

A positive coefficient indicates that the probability of innovation
success increases with the corresponding explanatory variable.

A negative coefficient implies that the probability to innovate
decreases with the corresponding explanatory variable.

Warning! One of the problems encountered in interpreting probabilities is their non-linearity: the probabilities do not vary in the same way according to the level of the regressors.

This is the reason why, in practice, one usually computes the probability of the event occurring at the average point of the sample, as sketched below.
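One way to obtain this probability at the sample means, once the full model below has been estimated (a sketch, assuming the SPost prvalue command used later in this class is installed; margins requires Stata 11 or later):

* predicted probability with all regressors held at their sample means
prvalue, rest(mean)
* built-in alternative
margins, atmeans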
Interpretation of Coefficients

Let’s run the more complete model
 logit inno lrdi lassets spe biotech
. logit inno lrdi lassets spe biotech

Iteration 0:   log likelihood = -205.30803
Iteration 1:   log likelihood = -167.71312
Iteration 2:   log likelihood = -163.57746
Iteration 3:   log likelihood = -163.45376
Iteration 4:   log likelihood = -163.45352

Logistic regression                    Number of obs =    431
                                       LR chi2(4)    =  83.71
                                       Prob > chi2   = 0.0000
Log likelihood = -163.45352            Pseudo R2     = 0.2039

        inno |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
        lrdi |  .7527497   .2110683     3.57   0.000    .3390634    1.166436
     lassets |   .997085   .1368534     7.29   0.000    .7288574    1.265313
         spe |  .4252844   .4204924     1.01   0.312   -.3988654    1.249434
     biotech |  3.799953    .577509     6.58   0.000    2.668056     4.93185
       _cons | -11.63447   1.937191    -6.01   0.000   -15.43129   -7.837643
Interpretation of Coefficients
P = e^(−11.63 + 0.75·lrdi + 0.99·lassets + 0.43·spe + 3.79·biotech) / (1 + e^(−11.63 + 0.75·lrdi + 0.99·lassets + 0.43·spe + 3.79·biotech))

Using the sample mean values of lrdi, lassets, spe and biotech, we compute the conditional probability:

P = e^1.953 / (1 + e^1.953) = 0.8758
Marginal Effects



It is often useful to know the marginal effect of a regressor on the probability
that the event occur (innovation)
As the probability is a non-linear function of explanatory variables, the
change in probability due to a change in one of the explanatory variables is
not identical if the other variables are at the average, median or first quartile,
etc. level.
prvalue provides the predicted probabilities of a logit model (or of any other model):

prvalue
prvalue, x(lassets=10) rest(mean)
prvalue, x(lassets=11) rest(mean)
prvalue, x(lassets=12) rest(mean)
prvalue, x(lassets=10) rest(median)
prvalue, x(lassets=11) rest(median)
prvalue, x(lassets=12) rest(median)
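In recent Stata versions, the built-in margins command produces comparable predicted probabilities (a sketch, assuming Stata 11 or later):

* predicted probabilities at lassets = 10, 11, 12, other regressors at their means
margins, at(lassets=(10 11 12)) atmeans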
Marginal Effects

prchange provides the marginal effect of each explanatory variable, for a variety of changes around the desired values

prchange [varlist] [if] [in range], x(variables_and_values) rest(stat)

prchange
prchange, fromto
prchange, fromto x(size=10.5) rest(mean)
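A built-in alternative for marginal effects evaluated at the sample means (a sketch, assuming Stata 11 or later; older versions would use mfx compute instead):

* marginal effects of all regressors at the sample means
margins, dydx(*) atmeans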
Goodness of Fit Measures

In ML estimations, there is no such measure as the R2

But the log likelihood can be used to assess the goodness of fit. Note the following:
 The higher the number of observations, the lower the joint probability, and the more the LL goes towards −∞
 Given the number of observations, the better the fit, the higher the LL (since the LL is always negative, the closer to zero it is)
The philosophy is to compare two models by looking at their LL values. One is meant to be the constrained model, the other one the unconstrained model.
Goodness of Fit Measures

A model is said to be constrained when the observer sets the parameters associated with some variables to zero.
A model is said to be unconstrained when the observer releases this assumption and allows the parameters associated with some variables to differ from zero.
For example, we can compare two models, one with no explanatory variables and one with all our explanatory variables. The one with no explanatory variables implicitly assumes that all parameters are equal to zero. Hence it is the constrained model, because we (implicitly) constrain the parameters to be nil.
The likelihood ratio test (LR test)

The most commonly used measure of goodness of fit in ML estimations is the likelihood ratio, which compares the unconstrained model with the constrained model. Twice the difference in their log likelihoods is distributed χ²:

LR = 2 (ln L_unc − ln L_c)

If the difference in the LL values is small (large), it is because the set of explanatory variables brings in insignificant (significant) information. The null hypothesis H0 is that the explanatory variables bring no significant information.

High LR values will lead the observer to reject H0 and accept the alternative hypothesis Ha that the set of explanatory variables does significantly explain the outcome.
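After a logit estimation, the same statistic can be recovered from the stored results (a minimal sketch; e(ll) and e(ll_0) are the log likelihoods of the fitted and constant-only models stored by logit):

* LR test of the full model against the constant-only model
quietly logit inno lrdi lassets spe biotech
display 2*(e(ll) - e(ll_0))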
The McFadden Pseudo R²

We also use the McFadden pseudo R² (1973). Its interpretation is analogous to the OLS R², but it is biased downward and remains generally low.
The pseudo R² also compares the unconstrained model with the constrained model, and lies between 0 and 1:

Pseudo R²_MF = (ln L_c − ln L_unc) / ln L_c = 1 − ln L_unc / ln L_c
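The pseudo R² can be recomputed from the same stored results (a minimal sketch):

* McFadden pseudo R-squared by hand
display 1 - e(ll)/e(ll_0)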
Goodness of Fit Measures
Constrained model

. logit inno
Log likelihood = -205.30803     (Number of obs = 431, LR chi2(0) = 0.00, Pseudo R2 = 0.0000)
(constant only: _cons = 1.494183, std. err. = .1244955, z = 12.00)

Unconstrained model

. logit inno lrdi lassets spe biotech, nolog
Log likelihood = -163.45352     (Number of obs = 431, LR chi2(4) = 83.71, Prob > chi2 = 0.0000, Pseudo R2 = 0.2039)
(coefficient estimates as reported above)

LR = 2 (ln L_unc − ln L_c) = 2 × (−163.5 − (−205.3)) = 83.8

Ps.R²_MF = 1 − ln L_unc / ln L_c = 1 − (−163.5) / (−205.3) = 0.204
Other usage of the LR test

The LR test can also be generalized to compare any two models, the constrained one being nested in the unconstrained one.
Any variable which is added to a model can be tested for its explanatory power as follows:

logit [constrained model]
est store [name1]
logit [unconstrained model]
est store [name2]
lrtest name2 name1
Goodness of Fit Measures
. logit inno lrdi lassets spe, nolog

Logistic regression                    Number of obs =    431
                                       LR chi2(3)    =  26.93
                                       Prob > chi2   = 0.0000
Log likelihood = -191.84522            Pseudo R2     = 0.0656

        inno |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
        lrdi |  .9275668   .1979951     4.68   0.000    .5395037     1.31563
     lassets |  .3032756   .0792032     3.83   0.000    .1480402    .4585111
         spe |  .3739987   .3800765     0.98   0.325   -.3709376    1.118935
       _cons | -.4703812   .9313494    -0.51   0.614   -2.295793     1.35503

. est store model1

. logit inno lrdi lassets spe biotech, nolog
(output as reported above: log likelihood = -163.45352, LR chi2(4) = 83.71, Pseudo R2 = 0.2039)

. est store model2

. lrtest model2 model1

Likelihood-ratio test                        LR chi2(1)  =  56.78
(Assumption: model1 nested in model2)        Prob > chi2 = 0.0000

LR test on the added variable (biotech):
LR = 2 (ln L_unc − ln L_c) = 2 × (−163.5 − (−191.8)) = 56.8
Quality of predictions

Lastly, one can compare the quality of the prediction with
the observed outcome variable (dummy variable).

One must assume that when the probability is higher than 0.5, the prediction is that the event will occur (the most likely outcome).

And then one can compare how good the prediction is as
compared with the actual outcome variable.

STATA does this for us:

estat class
Quality of predictions
. estat class

Logistic model for inno

              -------- True --------
Classified |        D           ~D   |     Total
-----------+--------------------------+----------
     +     |      337           51   |       388
     -     |       15           28   |        43
-----------+--------------------------+----------
   Total   |      352           79   |       431

Classified + if predicted Pr(D) >= .5
True D defined as inno != 0

Sensitivity                     Pr( +| D)   95.74%
Specificity                     Pr( -|~D)   35.44%
Positive predictive value       Pr( D| +)   86.86%
Negative predictive value       Pr(~D| -)   65.12%

False + rate for true ~D        Pr( +|~D)   64.56%
False - rate for true D         Pr( -| D)    4.26%
False + rate for classified +   Pr(~D| +)   13.14%
False - rate for classified -   Pr( D| -)   34.88%

Correctly classified                        84.69%
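The overall hit rate can be verified directly from the classification table (a trivial check):

* correctly classified = (337 true positives + 28 true negatives) / 431
display (337 + 28)/431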
Other Binary Choice models

The Logit model is only one way of modeling binary
choice models

The Probit model is another way of modeling binary choice models. It is actually more widely used than the logit model and assumes a normal distribution (not a logistic one) for the z values.

The complementary log-log model is used when the occurrence of the event is very rare, the distribution of z being asymmetric.
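As an illustration of how close these link functions typically are, one could fit all three on the same data and compare the predicted probabilities (a minimal sketch, assuming the inno equation used throughout this class):

* fit the three binary models and compare their predicted probabilities
quietly logit inno lrdi lassets spe biotech
predict p_logit, pr
quietly probit inno lrdi lassets spe biotech
predict p_probit, pr
quietly cloglog inno lrdi lassets spe biotech
predict p_cloglog, pr
summarize p_logit p_probit p_cloglog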
Other Binary Choice models

Probit model
Pr(Y  1| X)    Xβ   
z


e
 z2 2
2
 dz  
Xβ
 Xβ  2
2
e

Complementary log-log model
Pr(Y  1| X)    Xβ   1  exp   exp( Xβ) 
2
 dz  
Xβ

  t   dz
Likelihood functions and Stata commands

Logit:        L(y, x, β) = ∏_{i=1}^{n} f(yi, xi, β) = ∏_{i=1}^{n} [ e^(Xβ) / (1 + e^(Xβ)) ]^(yi) [ 1 / (1 + e^(Xβ)) ]^(1−yi)

Probit:       L(y, x, β) = ∏_{i=1}^{n} f(yi, xi, β) = ∏_{i=1}^{n} [ Φ(Xβ) ]^(yi) [ 1 − Φ(Xβ) ]^(1−yi)

Log-log comp: L(y, x, β) = ∏_{i=1}^{n} f(yi, xi, β) = ∏_{i=1}^{n} [ 1 − exp(−exp(Xβ)) ]^(yi) [ exp(−exp(Xβ)) ]^(1−yi)

Example
logit inno rdi lassets spe pharma
probit inno rdi lassets spe pharma
cloglog inno rdi lassets spe pharma
[Figure: probability density functions of the logit, probit and complementary log-log transformations]

[Figure: cumulative distribution functions of the logit, probit and complementary log-log transformations]
Comparison of models
                      OLS          Logit        Probit       C log-log
Ln(R&D intensity)     0.110        0.752        0.422        0.354
                      [3.90]***    [3.57]***    [3.46]***    [3.13]***
ln(Assets)            0.125        0.997        0.564        0.493
                      [8.58]***    [7.29]***    [7.53]***    [7.19]***
Spe                   0.056        0.425        0.224        0.151
                      [1.11]       [1.01]       [0.98]       [0.76]
Biotech Dummy         0.442        3.799        2.120        1.817
                      [7.49]***    [6.58]***    [6.77]***    [6.51]***
Constant              -0.843       -11.634      -6.576       -6.086
                      [3.91]**     [6.01]***    [6.12]***    [6.08]***
Observations          431          431          431          431

Absolute t value in brackets (OLS), z value for other models.
* 10%, ** 5%, *** 1%
Comparison of marginal effects
                      OLS       Logit     Probit    C log-log
Ln(R&D intensity)     0.110     0.082     0.090     0.098
ln(Assets)            0.125     0.110     0.121     0.136
Specialisation        0.056     0.046     0.047     0.042
Biotech Dummy         0.442     0.368     0.374     0.379

For the logit, probit and cloglog models, marginal effects have been computed for a one-unit variation (around the mean) of the variable at stake, holding all other variables at their sample mean values.
Multinomial LOGIT Models
Multinomial models
Let us now focus on the case where the dependent variable has
several outcomes (or is multinomial). For example, innovative firms
may need to collaborate with other organizations. One can code this
type of interactions as follows

Collaborate with university (modality 1)

Collaborate with large incumbent firms (modality 2)

Collaborate with SMEs (modality 3)

Do it alone (modality 4)
Or, studying firm survival

Survival (modality 1)

Liquidation (modality 2)

Mergers & acquisition (modality 3)
Multinomial models
One could first perform three logistic regressions as follows:

ln[ P(Y=1|X) / (1 − P(Y=1|X)) ] = β0^(1) + β1^(1) x1 + … + βm^(1) xm
ln[ P(Y=2|X) / (1 − P(Y=2|X)) ] = β0^(2) + β1^(2) x1 + … + βm^(2) xm
ln[ P(Y=3|X) / (1 − P(Y=3|X)) ] = β0^(3) + β1^(3) x1 + … + βm^(3) xm
Where 1 = survival, 2 = liquidation, 3 = M&A.
1. Open the file mlogit.dta
2. Estimate for each type of outcome the conditional probability
of the event for the representative firm
- time (log_time)
- size (log_labour)
- firm age (entry_age)
- spin out (spin_out)
- cohort (cohort_*)
The need for multinomial models

ln[ P(Y=1|X) / (1 − P(Y=1|X)) ] = β0^(1) + β1^(1) x1 + … + βm^(1) xm
ln[ P(Y=2|X) / (1 − P(Y=2|X)) ] = β0^(2) + β1^(2) x1 + … + βm^(2) xm
ln[ P(Y=3|X) / (1 − P(Y=3|X)) ] = β0^(3) + β1^(3) x1 + … + βm^(3) xm

Estimating these three logits separately yields, for the representative firm:

P(Y = 1 | X) = 0.8771
P(Y = 2 | X) = 0.0398
P(Y = 3 | X) = 0.0679

Σ_k P(Y = k | X) = 0.9848 ≠ 1
Multinomial models
First, the sum of all conditional probabilities should add up to unity:

Σ_{j=0}^{k} P(Y = j | X) = 1

Second, one outcome serves as the base, so only the remaining modalities need to be estimated, since:

P(Y = 0 | X) = 1 − Σ_{j=1}^{k} P(Y = j | X)
Multinomial logit models
Third, the multinomial model is a simultaneous (as opposed to
sequential) estimation model comparing the odds of each modality
with respect to all others. With three outcomes, we have:

ln[ P(Y=1|X) / P(Y=0|X) ] = β0^(1|0) + β1^(1|0) x1 + … + βm^(1|0) xm
ln[ P(Y=2|X) / P(Y=0|X) ] = β0^(2|0) + β1^(2|0) x1 + … + βm^(2|0) xm
ln[ P(Y=1|X) / P(Y=2|X) ] = β0^(1|2) + β1^(1|2) x1 + … + βm^(1|2) xm
Multinomial logit models
Note that there is redundancy, since:

ln[ P(Y=1|X) / P(Y=2|X) ] = ln[ P(Y=1|X) / P(Y=0|X) ] − ln[ P(Y=2|X) / P(Y=0|X) ]

with

ln[ P(Y=1|X) / P(Y=0|X) ] = xβ^(1|0) ;  ln[ P(Y=2|X) / P(Y=0|X) ] = xβ^(2|0) ;  ln[ P(Y=1|X) / P(Y=2|X) ] = xβ^(1|2)

so that  xβ^(1|2) = xβ^(1|0) − xβ^(2|0)

Fourth, the multinomial logit model estimates the (k − 1) outcome equations subject to the following constraint:

β^(1|2) = β^(1|0) − β^(2|0)
Multinomial logit models
With k outcomes, the probability of occurrence of event j reads:

P(Y = j | X) = e^(x β^(j|0)) / Σ_{j=0}^{k} e^(x β^(j|0))

By convention, outcome 0 is the base outcome.
Multinomial logit models
Note that

x β^(j|j) = ln[ P(Y=j|X) / P(Y=j|X) ] = ln(1) = 0 ,   so for all x and all j:   β^(j|j) = 0

Hence:

P(Y = j | X) = e^(x β^(j|0)) / Σ_{j=0}^{k} e^(x β^(j|0)) = e^(x β^(j|0)) / ( 1 + Σ_{j=1}^{k} e^(x β^(j|0)) )

P(Y = 0 | X) = 1 / ( 1 + Σ_{j=1}^{k} e^(x β^(j|0)) )
Binomial logit as multinomial logit
Let us rewrite the probability of event that Y=1
P(Y = 1 | X) = e^(xβ) / (1 + e^(xβ))

P(Y = 1 | X) = e^(x β^(1|0)) / (1 + e^(x β^(1|0))) = e^(x β^(1|0)) / ( e^(x β^(0|0)) + e^(x β^(1|0)) ) = e^(x β^(1|0)) / Σ_{k∈{0,1}} e^(x β^(k|0))
The binomial logit is a special case of the multinomial logit where only two outcomes are being analyzed.
Likelihood functions

Let us assume that you have a sample of n random observations.
Let f(yj ) be the probability that yi = j. The joint probability to observe
jointly n values of yj is given by the likelihood function:
f(y1, y2, …, yn) = ∏_{i=1}^{n} f(yi)

We need to specify function f(.). It comes from the empirical discrete
distribution of an event that can have several outcomes. This is the
multinomial distribution. Hence:
f(yi) = p0^(d_i0) · p1^(d_i1) · … · pj^(d_ij) · … · pk^(d_ik) = ∏_{j∈K} pj^(d_ij)

where d_ij equals 1 if yi = j and 0 otherwise.
The maximum likelihood function

The maximum likelihood function reads:

L(y) = ∏_{i=1}^{n} f(yi) = ∏_{i=1}^{n} ∏_{j=1}^{k} pj^(d_ij)

L(y) = ∏_{i=1}^{n} f(yi, xi, β) = ∏_{i=1}^{n} [ ( 1 / (1 + Σ_{j=1}^{k} e^(x β^(j|0))) )^(d_i0) · ∏_{j=1}^{k} ( e^(x β^(j|0)) / (1 + Σ_{j=1}^{k} e^(x β^(j|0))) )^(d_ij) ]
The maximum likelihood function
The log transform of the likelihood yields:

LL(y, x, β^(j|0)) = Σ_{i=1}^{n} [ d_i0 ln( 1 / (1 + Σ_{j=1}^{k} e^(x_i β^(j|0))) ) + Σ_{j=1}^{k} d_ij ln( e^(x_i β^(j|0)) / (1 + Σ_{j=1}^{k} e^(x_i β^(j|0))) ) ]

LL(y, x, β^(j|0)) = Σ_{i=1}^{n} [ Σ_{j=1}^{k} d_ij x_i β^(j|0) − ln( 1 + Σ_{j=1}^{k} e^(x_i β^(j|0)) ) ]

LL(y, x, β^(j|0)) = Σ_{i=1}^{n} Σ_{j=1}^{k} d_ij x_i β^(j|0) − Σ_{i=1}^{n} ln( 1 + Σ_{j=1}^{k} e^(x_i β^(j|0)) )
Multinomial logit models
 Stata Instruction : mlogit
mlogit y x1 x2 x3 … xk
[if] [weight] [, options]
 Options : noconstant : omits the constant
robust : controls for heteroskedasticity
 if : select observations
 weight : weights observations
Multinomial logit models
use mlogit.dta, clear
mlogit type_exit log_time log_labour entry_age entry_spin cohort_*
[Stata output: goodness of fit statistics, parameter estimates, standard errors and z values; the base outcome, chosen by Stata, is the one with the highest empirical frequency]
Interpretation of coefficients
The interpretation of coefficients always refers to the base category.
Does the probability of being bought out decrease over time? No! Relative to survival, the probability of being bought out decreases over time.
Interpretation of coefficients
The interpretation of coefficients always refers to the base category.
Is the probability of being bought out lower for spinoffs? No! Relative to survival, the probability of being bought out is lower for spinoffs.
Interpretation of coefficients
1|0

 2|0

1|2 

 2|0 

1|0 

 2|1

Relative to liquidation, the probability
of being bought-out is higher for
spinoff
lincom [boughtout]entry_spin – [death]entry_spin
Changing base outcome
mcross provides other estimates by changing the base outcome
Mind the new base outcome!!
Being bought-out relative to
liquidation
Relative to liquidation, the
probability of being bought-out is
higher for spinoff
Changing base outcome
mcross provides other estimates by changing the base outcome
And we observe the same results
as before
Independence of irrelevant alternatives - IIA

 The model assumes that each pair of outcomes is independent of all other alternatives. In other words, the other alternatives are irrelevant.
 From a statistical viewpoint, this is tantamount to assuming independence of the error terms across pairs of alternatives.
 A simple way to test the IIA property is to estimate the model dropping one modality (the restricted model), and to compare the parameters with those of the complete model.
 If IIA holds, the parameters should not change significantly.
 If IIA does not hold, the parameters should change significantly.
Independence of irrelevant alternatives - IIA

H0: The IIA property is valid
H1: The IIA property is not valid

H = (β̂_R − β̂_C)′ [ Var(β̂_R) − Var(β̂_C) ]^(−1) (β̂_R − β̂_C)

The H statistic (H stands for Hausman) follows a χ² distribution with M degrees of freedom (M being the number of parameters).
STATA application: the IIA test


H0: The IIA property is valid
H1: The IIA property is not valid
mlogtest, hausman
[Stata output of mlogtest, hausman: one Hausman test per omitted outcome]
Application of the IIA test


H0: The IIA property is valid
H1: The IIA property is not valid
mlogtest, hausman
We compare the parameters of the model "liquidation relative to bought-out" estimated simultaneously with "survival relative to bought-out", with the parameters of the model "liquidation relative to bought-out" estimated without "survival relative to bought-out".
Application of the IIA test


H0: The IIA property is valid
H1: The IIA property is not valid
mlogtest, hausman
The conclusion is that outcome survival
significantly alters the choice between
liquidation and bought-out.
In fact, for a company, being bought out must be seen as a way to remain active, at the cost of losing control over economic decisions, notably investment.
Ordered Multinomial
LOGIT Models
Ordered multinomial models
Let us now concentrate on the case where the dependent variable is
a discrete integer which indicates an intensity. Opinion surveys make
extensive use of such so-called Likert scales:
 Obstacles to innovation (scale of 1 to 5)
 Intensity of collaboration (scale of 1 to 5)
 Marketing surveys (from "do not like" (1) to "like" (7))
 Student grades
 Opinion tests
 Etc.
Ordered multinomial models
Such variables depict a vertical, quantitative scale, so that one can think of them as describing the interval in which an unobserved latent variable y* lies:

y = 1 if y*_i ≤ α1
y = 2 if α1 < y*_i ≤ α2
y = 3 if α2 < y*_i ≤ α3
…
y = k if α_(k−1) < y*_i

where the αj are unknown bounds to be estimated.
Ordered multinomial models
We assume that the latent variable y* is a linear combination of the set of all explanatory variables:

y*_i = x_i β + u_i

where u_i follows a cumulative distribution function F(.). The probabilities associated with each observed outcome y (not y*) then follow the cdf F(.). Let us look at the probability that y = 1:

P(y = 1) = P(y*_i ≤ α1)
P(y = 1) = P(x_i β + u_i ≤ α1)
P(y = 1) = P(u_i ≤ α1 − x_i β)
P(y = 1) = Λ(α1 − x_i β) = e^(α1 − x_i β) / (1 + e^(α1 − x_i β))
Ordered multinomial models
The probability that y = 2 is:

P(y = 2) = P(y*_i ≤ α2) − P(y*_i ≤ α1) = Λ(α2 − x_i β) − Λ(α1 − x_i β) = e^(α2 − x_i β) / (1 + e^(α2 − x_i β)) − e^(α1 − x_i β) / (1 + e^(α1 − x_i β))

Altogether we have:

P(Y = 1) = Λ(α1 − x_i β)
P(Y = 2) = Λ(α2 − x_i β) − Λ(α1 − x_i β)
P(Y = 3) = Λ(α3 − x_i β) − Λ(α2 − x_i β)
…
P(Y = k) = 1 − Λ(α_(k−1) − x_i β)
Probability in an ordered model

[Figure: density of u_i partitioned by the cutoffs α1 − xβ, α2 − xβ, α3 − xβ, …, α(k−1) − xβ; the areas between cutoffs correspond to P(y=1), P(y=2), P(y=3), …, P(y=k)]
The likelihood function

The likelihood function is
L(y, x, β, α) = ∏_{i=1}^{n} ∏_{j=1}^{k} [ F(αj − x_i β) − F(α_(j−1) − x_i β) ]^(d_ij)

with F(α0 − x_i β) = 0 and F(αk − x_i β) = 1
The likelihood function
If ui follows a logistic distribution, the (log) likelihood function reads :
L(y, x, β, α) = ∏_{i=1}^{n} ∏_{j=1}^{k} [ e^(αj − x_i β) / (1 + e^(αj − x_i β)) − e^(α_(j−1) − x_i β) / (1 + e^(α_(j−1) − x_i β)) ]^(d_ij)

and hence

LL(y, x, β, α) = Σ_{i=1}^{n} Σ_{j=1}^{k} d_ij ln[ e^(αj − x_i β) / (1 + e^(αj − x_i β)) − e^(α_(j−1) − x_i β) / (1 + e^(α_(j−1) − x_i β)) ]
Ordered multinomial logit models
 Stata Instruction : ologit
ologit y x1 x2 x3 … xk
[if] [weight] [, options]
 Options : noconstant : omits the constant
robust : controls for heteroskedasticity
 if : select observations
 weight : weights observations
Ordered multinomial models
use est_var_qual.dta, clear
ologit innovativeness size rdi spe biotech
[Stata output: goodness of fit statistics, estimated parameters and cutoff points]
Interpretation of coefficients

A positive (negative) sign indicates a positive relationship between
the independent variable and the order (or rank)

How does one interpret the cutoff values? The model is:
Score = x_i β + u_i

What then is the probability that Y = 1: P(Y = 1)? In other words, what is the probability that the score is lower than the first cutoff point?

P(y = 1) = P(x_i β + u_i ≤ α1) = P(u_i ≤ α1 − x_i β)

With the estimated values (x_i β ≈ 270.5 and α1 ≈ 268.6, so α1 − x_i β ≈ −1.95):

P(y = 1) = e^(−1.95) / (1 + e^(−1.95)) = .1245
Interpretation of coefficients

What is the probability that Y = 2: P(Y = 2)?

P(y ≤ 1) = P(x_i β + u_i ≤ α1) = P(u_i ≤ −1.95) = e^(−1.95) / (1 + e^(−1.95)) = .1245

P(y ≤ 2) = P(x_i β + u_i ≤ α2) = P(270.5 + u_i ≤ 269.3) = P(u_i ≤ −1.2) = e^(−1.2) / (1 + e^(−1.2)) ≈ .2321

P(Y = 2) = F(α2 − x_i β) − F(α1 − x_i β)
P(Y = 2) = .2321 − .1245
P(Y = 2) = .1076
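These cumulative probabilities can be checked with Stata's built-in invlogit() function (a quick sketch using the rounded cutoff differences above):

* P(Y=2) as a difference of two cumulative logistic probabilities
display invlogit(-1.2) - invlogit(-1.95)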
STATA computation of pred. prob.
prvalue computes the predicted probabilities
Count Data Models Part 1
The Poisson Model
Count data models
Let us now focus on outcomes counting the number of occurrences of a given event: the number of innovations, the number of patents, of inventions, and so on.
Again, OLS fails to meet the constraint that the prediction must be nil or positive. To explain count variables, we assume that the dependent variable follows a Poisson distribution.
Poisson models
Let Y be a random count variable. The probability that Y be equal to the integer yi is given by the Poisson probability density:

P(Y = yi) = e^(−λi) λi^(yi) / yi! ,   yi = 0, 1, 2, …

with E(Y) = Var(Y) = λi

To introduce the set of explanatory variables in the model, we condition λi on them and impose the following log linear form:

λi = e^(x_i β)   i.e.   ln λi = x_i β
Poisson distributions
[Figure: Poisson probability distributions for λ = 0.8, 1.5, 2.9 and 10.5, over counts 0 to 20]
The likelihood function

The (log) likelihood function reads:

L(y, λ) = ∏_{i=1}^{n} e^(−λi) λi^(yi) / yi!

and hence

LL(y, x, β) = Σ_{i=1}^{n} [ yi x_i β − e^(x_i β) − ln(yi!) ]
Poisson models
 Stata Instruction : poisson
poisson y x1 x2 x3 … xk
[if] [weight] [, options]
 Options : noconstant : omits the constant
robust : controls for heteroskedasticity
 if : select observations
 weight : weights observations
Poisson models
use est_var_qual.dta, clear
poisson patent lrdi lassets spe biotech
. poisson patent lrdi lassets spe biotech

Iteration 0:   log likelihood = -3549.9316
Iteration 1:   log likelihood = -3549.8433
Iteration 2:   log likelihood = -3549.8433

Poisson regression                     Number of obs =     431
                                       LR chi2(4)    = 1766.11
                                       Prob > chi2   =  0.0000
Log likelihood = -3549.8433            Pseudo R2     =  0.1992

      patent |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
        lrdi |   .699484   .0326467    21.43   0.000    .6354977    .7634704
     lassets |  .4705428   .0133588    35.22   0.000    .4443601    .4967256
         spe |  .7623891   .0441729    17.26   0.000    .6758118    .8489664
     biotech |  1.073271    .051571    20.81   0.000    .9721939    1.174349
       _cons |  -3.12651   .1971912   -15.86   0.000   -3.512997   -2.740022
Interpretation of coefficients
If variables are entered in log, one can interpret the
coefficients as elasticities:

∂ ln λi / ∂ ln x = β ,   since   ln λi = x_i β   and   ∂ ln x = ∂x / x
A one % increase in firm size is associated with a .47% increase in
the expected number of patents
Interpretation of coefficients

If variables are entered in log, one can interpret the coefficients as elasticities.

A one % increase in R&D investment is associated with a .69% increase in the expected number of patents.
Interpretation of coefficients
If variables are not entered in log, the interpretation
changes
100 × (eβ – 1)
A one-point rise in the degree of specialisation is associated with a
113% increase in the expected number of patents
Interpretation of coefficients
For dummy variables, the interpretation changes slightly
Biotechnology firms have an expected number of patents which is
191% higher than pharmaceutical companies.
Interpretation of coefficients
All variables are very significant
… but …
    Variable |   Obs        Mean    Std. Dev.   Min   Max
      patent |   431    10.83295      17.622      0   202

E(Y) ≠ Var(Y)
Count Data Models Part 2
Negative Binomial Models
Negative binomial models
Generally, the Poisson model is not valid, due to the presence of overdispersion in the data. This violates the assumption of equality between the mean and the variance of the dependent variable implied by the Poisson model.

The negative binomial model treats this problem by adding an unobserved heterogeneity term ui to the log linear form:

ln νi = ln λi + ln ui = x_i β + εi

P(Y = yi) = e^(−λi ui) (λi ui)^(yi) / yi!
Negative binomial models
The density of yi is obtained by integrating out ui:

f(Y = yi | xi) = ∫_0^∞ [ e^(−λi ui) (λi ui)^(yi) / yi! ] g(ui) dui   with   g(ui) = (θ^θ / Γ(θ)) ui^(θ−1) e^(−θ ui)

Assuming that ui is distributed Gamma with mean 1, the density of yi reads:

f(Y = yi | xi) = [ Γ(θ + yi) / (Γ(yi + 1) Γ(θ)) ] × ( λi / (λi + θ) )^(yi) × ( θ / (λi + θ) )^(θ)
Likelihood Functions

L(y, λ, θ) = ∏_{i=1}^{n} [ Γ(θ + yi) / (Γ(yi + 1) Γ(θ)) ] × ( λi / (λi + θ) )^(yi) × ( θ / (λi + θ) )^(θ)

LL(y, x, β, θ) = Σ_{i=1}^{n} [ yi x_i β − (yi + θ) ln( e^(x_i β) + θ ) + θ ln θ + ln Γ(θ + yi) − ln Γ(yi + 1) − ln Γ(θ) ]

where the overdispersion parameter reported by Stata is α = 1/θ (α = 0 corresponds to the Poisson model)
Negative binomial models
 Stata Instruction: nbreg
nbreg y x1 x2 x3 … xk
[if] [weight] [, options]
 Options : noconstant : omits the constant
robust : controls for heteroskedasticity
 if : select observations
 weight : weights observations
Negative binomial models
use est_var_qual.dta, clear
nbreg patent lrdi lassets spe biotech
. nbreg patent lrdi lassets spe biotech

Negative binomial regression           Number of obs =    431
                                       LR chi2(4)    = 122.62
Dispersion     = mean                  Prob > chi2   = 0.0000
Log likelihood = -1374.339             Pseudo R2     = 0.0427

      patent |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
        lrdi |  .7823229   .1121344     6.98   0.000    .5625434    1.002102
     lassets |  .6167106   .0600372    10.27   0.000    .4990399    .7343813
         spe |  .8390795   .1950958     4.30   0.000    .4566988     1.22146
     biotech |  1.515035   .2384884     6.35   0.000    1.047606    1.982464
       _cons | -5.179659   .8629357    -6.00   0.000   -6.870982   -3.488337
    /lnalpha |   .323967   .0759342                     .1751387    .4727953
       alpha |  1.382602   .1049868                     1.191411    1.604473

Likelihood-ratio test of alpha=0:  chibar2(01) = 4351.01  Prob>=chibar2 = 0.000

(alpha is the overdispersion parameter; the likelihood-ratio test of alpha=0 is the overdispersion test)
Interpretation of coefficients
If variables are entered in log, one can still interpret the
coefficients as elasticities
A one % increase in firm size is associated with a .61% increase in
the expected number of patents
Interpretation of coefficients
If variables are entered in log, one can still interpret the
coefficients as elasticities
A one % increase in R&D investment is associated with a .78%
increase in the expected number of patents
Interpretation of coefficients
If variables are not entered in log, the interpretation
changes
100 × (eβ – 1)
A one-point rise in the degree of specialisation is associated with a
129% increase in the expected number of patents
Interpretation of coefficients
For dummy variables, the interpretation follows the same
transformation
100 × (eβ – 1)
Biotechnology firms have an expected number of patents which is
352% higher than pharmaceutical companies.
Overdispersion test
We use the LR test to compare the negative binomial
model with the Poisson model
LR = 2 (ln L_NBREG − ln L_PRM) = 2 × (−1374.3 − (−3549.8)) = 4351.0

The results indicate that the probability of rejecting H0 wrongly is almost nil (H0: alpha = 0). Hence there is overdispersion in the data and, as a consequence, one should use the negative binomial model.
Larger standard errors and lower z values
                 Poisson      NegBin
      lrdi         0.699       0.782
                   21.43        6.98
   lassets         0.471       0.617
                   35.22       10.27
       spe         0.762       0.839
                   17.26        4.30
   biotech         1.073       1.515
                   20.81        6.35
     _cons        -3.127      -5.180
                  -15.86       -6.00
lnalpha
     _cons                     0.324
                                4.27
alpha                          1.383
N                    431         431

legend: b/t
Extensions
ML estimators

All models can be extended to a panel context to take full
account of unobserved heterogeneity
 Fixed effect
 Random effects

Heckman models
 Selection bias
 Two equations, one on the probability to be observed

Survival models
 Discrete time (complementary log-log, logit)
 Continuous time (Cox model)