Heckman and Max Likelihood

Multinomial Logit & Ordered Probit
Multinomial Logit
• Is used when the outcome categories cannot be ordered. An
example is choice of holiday: (i) beach, (ii) mountain,
(iii) culture. Each individual goes on just
one holiday.
• We will examine this within the context of insurance
data. The exact meaning does not matter, just treat it
like holiday data. But for a clue, load the data and type:
use http://www.stata-press.com/data/r11/sysdsn1.dta, clear
describe
summ *ins*
label list insure
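. tab2 insure insure
-> tabulation of insure by insure

           |              insure
    insure | Indemnity    Prepaid   Uninsure |     Total
-----------+---------------------------------+----------
 Indemnity |       294          0          0 |       294
   Prepaid |         0        277          0 |       277
  Uninsure |         0          0         45 |        45
-----------+---------------------------------+----------
     Total |       294        277         45 |       616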
There are 3 options: those who prepay, those who are not insured and those who are
covered by an indemnity
generate site1=site==1
generate site2=site==2
generate site3=site==3
NOW TYPE: mlogit insure age male nonwhite site2 site3
Multinomial logistic regression                   Number of obs   =        615
                                                  LR chi2(10)     =      42.99
                                                  Prob > chi2     =     0.0000
Log likelihood = -534.36165                       Pseudo R2       =     0.0387

------------------------------------------------------------------------------
      insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Prepaid      |
         age |   -.011745   .0061946    -1.90   0.058    -.0238862    .0003962
        male |   .5616934   .2027465     2.77   0.006     .1643175    .9590693
    nonwhite |   .9747768   .2363213     4.12   0.000     .5115955    1.437958
       site2 |   .1130359   .2101903     0.54   0.591    -.2989296    .5250013
       site3 |  -.5879879   .2279351    -2.58   0.010    -1.034733   -.1412433
       _cons |   .2697127   .3284422     0.82   0.412    -.3740222    .9134476
-------------+----------------------------------------------------------------
Uninsure     |
         age |  -.0077961   .0114418    -0.68   0.496    -.0302217    .0146294
        male |   .4518496   .3674867     1.23   0.219     -.268411     1.17211
    nonwhite |   .2170589   .4256361     0.51   0.610    -.6171725     1.05129
       site2 |  -1.211563   .4705127    -2.57   0.010    -2.133751   -.2893747
       site3 |  -.2078123   .3662926    -0.57   0.570    -.9257327     .510108
       _cons |  -1.286943   .5923219    -2.17   0.030    -2.447872   -.1260135
------------------------------------------------------------------------------
(insure==Indemnity is the base outcome)
Note there are two equations: one to explain those who opt for ‘prepaid’ and a second for those
who opt for ‘uninsure’.
• But there are three choices, so why two
equations? If you know the determinants
of two of the choices, the third follows
by default.
• It can also be viewed as the default (base) choice
against which the other two are being
compared.
• Here the default case is the first, indemnity.
Could we change it? YES.
• mlogit insure age male nonwhite site2 site3, base(2)
This will change the default case to the second
option.
Iteration 0:   log likelihood = -555.85446
Iteration 1:   log likelihood = -534.72983
Iteration 2:   log likelihood = -534.36536
Iteration 3:   log likelihood = -534.36165
Iteration 4:   log likelihood = -534.36165

Multinomial logistic regression                   Number of obs   =        615
                                                  LR chi2(10)     =      42.99
                                                  Prob > chi2     =     0.0000
Log likelihood = -534.36165                       Pseudo R2       =     0.0387

------------------------------------------------------------------------------
      insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Indemnity    |
         age |    .011745   .0061946     1.90   0.058    -.0003962    .0238862
        male |  -.5616934   .2027465    -2.77   0.006    -.9590693   -.1643175
    nonwhite |  -.9747768   .2363213    -4.12   0.000    -1.437958   -.5115955
       site2 |  -.1130359   .2101903    -0.54   0.591    -.5250013    .2989296
       site3 |   .5879879   .2279351     2.58   0.010     .1412433    1.034733
       _cons |  -.2697127   .3284422    -0.82   0.412    -.9134476    .3740222
-------------+----------------------------------------------------------------
Uninsure     |
         age |   .0039489   .0115994     0.34   0.734    -.0187855    .0266832
        male |  -.1098438   .3651883    -0.30   0.764    -.8255998    .6059122
    nonwhite |  -.7577178   .4195759    -1.81   0.071    -1.580071    .0646357
       site2 |  -1.324599   .4697954    -2.82   0.005    -2.245381   -.4038165
       site3 |   .3801756   .3728188     1.02   0.308    -.3505358    1.110887
       _cons |  -1.556656   .5963286    -2.61   0.009    -2.725438    -.387873
------------------------------------------------------------------------------
(insure==Prepaid is the base outcome)
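Changing the base outcome changes the reported coefficients but not the fitted model. The Python sketch below is an illustration only: it plugs the coefficients from the two mlogit outputs into the standard multinomial logit probability formula and checks that both parameterisations give the same predicted probabilities. The example person (a 40-year-old white man at site 1) and the helper name mlogit_probs are assumptions made for the illustration, not part of the original material.

```python
import math

def mlogit_probs(x, coefs):
    """Multinomial logit probabilities for one person.

    coefs maps each outcome to its coefficient list (same order as x).
    The base outcome is represented by all-zero coefficients.
    """
    index = {k: sum(b * v for b, v in zip(beta, x)) for k, beta in coefs.items()}
    denom = sum(math.exp(v) for v in index.values())
    return {k: math.exp(v) / denom for k, v in index.items()}

# Regressor order: age, male, nonwhite, site2, site3, _cons
base_indemnity = {
    "Indemnity": [0, 0, 0, 0, 0, 0],
    "Prepaid":   [-.011745, .5616934, .9747768, .1130359, -.5879879, .2697127],
    "Uninsure":  [-.0077961, .4518496, .2170589, -1.211563, -.2078123, -1.286943],
}
base_prepaid = {
    "Indemnity": [.011745, -.5616934, -.9747768, -.1130359, .5879879, -.2697127],
    "Prepaid":   [0, 0, 0, 0, 0, 0],
    "Uninsure":  [.0039489, -.1098438, -.7577178, -1.324599, .3801756, -1.556656],
}

# Hypothetical person: age 40, male, white, site 1 (the last 1 is the constant)
x = [40, 1, 0, 0, 0, 1]

p_base1 = mlogit_probs(x, base_indemnity)
p_base2 = mlogit_probs(x, base_prepaid)
for outcome in p_base1:
    print(outcome, round(p_base1[outcome], 4), round(p_base2[outcome], 4))
```

Each coefficient in the base(2) output is the difference between the corresponding base(1) coefficient and the Prepaid coefficient, which is why the two columns printed above agree up to rounding of the reported estimates.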
Data also comes from:
• use http://www.stata-press.com/data/r11/sysdsn1.dta
• mlogit insure age male nonwhite
Clear, set memory and load data
clear
set mem 100000
use "http://staff.bath.ac.uk/hssjrh/oprob.dta"
Describe pers
. describe pers

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------
persit5yr       double  %10.0g     qa5        QA5 PERSONAL SITUATION - FIVE
                                                YEARS AGO
• The variable relates to a person’s situation and
how it has changed over the last five years.
• Let us look at it.
• Type: tab2 pers pers
The most common response was
improved, but for over half of the
sample this was not the case
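         QA5 PERSONAL |
     SITUATION - FIVE |      QA5 PERSONAL SITUATION - FIVE YEARS AGO
            YEARS AGO |  Improved  Stayed ab  Got worse         DK |     Total
----------------------+--------------------------------------------+----------
             Improved |    11,178          0          0          0 |    11,178
Stayed about the same |         0      9,533          0          0 |     9,533
            Got worse |         0          0      8,418          0 |     8,418
                   DK |         0          0          0        301 |       301
----------------------+--------------------------------------------+----------
                Total |    11,178      9,533      8,418        301 |    29,430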
Ordered probit
• We use this when we have discrete data and
when it is ordered. In this case
• 1 best (improved)
• 2 next best (stayed about the same)
• 3 worst (got worse).
The ordering is clear.
Change in personal situation
• Assume an underlying, continuous variable relating
to changes in the individual’s personal situation.
• If this underlying variable is to the left of μ1, we classify the
outcome as ‘1’: the individual’s position has improved.
• If this underlying variable is to the right of μ2, we classify the
outcome as ‘3’: the individual’s position has got worse.
• In between these two values, we classify the outcome as ‘2’:
the individual’s position has stayed the same.
• You might say: surely ‘stay the same’ is one
specific value (perhaps 0), with anything to the left
of this having improved and anything to the right
having got worse.
• But it is common to assume a range of values
which denote too small a change to count as
either ‘improved’ or ‘got worse’, and the boundaries
of this range are μ1 and μ2.
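The threshold idea can be written compactly. The notation below ($y^*$ for the underlying variable, $x\beta$ for the part explained by the regressors, $\varepsilon$ for a standard normal error and $\Phi$ for its distribution function) is the usual textbook notation for the ordered probit and is assumed here rather than taken from the slides:

```latex
y^* = x\beta + \varepsilon, \qquad \varepsilon \sim N(0,1)

y =
\begin{cases}
1 \ (\text{improved})      & \text{if } y^* \le \mu_1 \\
2 \ (\text{stayed the same}) & \text{if } \mu_1 < y^* \le \mu_2 \\
3 \ (\text{got worse})     & \text{if } y^* > \mu_2
\end{cases}

\Pr(y=1) = \Phi(\mu_1 - x\beta), \qquad
\Pr(y=2) = \Phi(\mu_2 - x\beta) - \Phi(\mu_1 - x\beta), \qquad
\Pr(y=3) = 1 - \Phi(\mu_2 - x\beta)
```

The three probabilities sum to one, and a larger $x\beta$ shifts probability away from ‘1’ and towards ‘3’.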
Do the estimation.
• Simply use oprobit rather than regress.
oprobit persi lgnipc male age agesq rlaw estonia
village town selfemp marrd educ2 unemp manual
if age<98 & age>17 & persi<4
This regresses persi on a set of right-hand-side
variables (we do not have to write the full name,
persit5yr, as this is the only variable in the data
set beginning with persi).
if age<98 & age>17 & persi<4
This limits the regression to individuals older
than 17 and under 98, and also cuts out those
who answered ‘don’t know’ (coded 4) for persi.
The results
Ordered probit regression                         Number of obs   =      25751
                                                  LR chi2(13)     =    4392.02
                                                  Prob > chi2     =     0.0000
Log likelihood = -25990.573                       Pseudo R2       =     0.0779

------------------------------------------------------------------------------
   persit5yr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      lgnipc |  -.0766027   .0209432    -3.66   0.000    -.1176506   -.0355548
        male |  -.0249916   .0147207    -1.70   0.090    -.0538437    .0038604
         age |   .0513208   .0025145    20.41   0.000     .0463924    .0562492
       agesq |  -.0322755   .0025142   -12.84   0.000    -.0372033   -.0273478
        rlaw |  -.2455444    .011504   -21.34   0.000    -.2680919    -.222997
     estonia |   -.869435   .0417246   -20.84   0.000    -.9512136   -.7876564
     village |   .0524684   .0184945     2.84   0.005     .0162199     .088717
        town |   .0338535   .0182975     1.85   0.064     -.002009    .0697159
     selfemp |  -.0974318   .0293755    -3.32   0.001    -.1550067   -.0398569
       marrd |  -.1534421    .015842    -9.69   0.000    -.1844918   -.1223923
       educ2 |  -.1429612   .0090999   -15.71   0.000    -.1607966   -.1251257
       unemp |   .6080104   .0313593    19.39   0.000     .5465472    .6694736
      manual |   .0821374   .0193907     4.24   0.000     .0441323    .1201426
-------------+----------------------------------------------------------------
       /cut1 |  -.6563796    .078922                     -.8110638   -.5016953
       /cut2 |   .3095595   .0788323                       .155051     .464068
------------------------------------------------------------------------------
Ordered probit regression                         Number of obs   =      25751
                                                  LR chi2(13)     =    4392.02
                                                  Prob > chi2     =     0.0000
Log likelihood = -25990.573                       Pseudo R2       =     0.0779
The summary output shows the number of observations, the log likelihood
and the likelihood ratio statistic. A pseudo R2 is exactly that, a pseudo measure rather
than a true R2, and we may cover it in the lectures later. It is rarely very high in ordered probit.
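As a rough guide to where these summary numbers come from: the LR chi2 is twice the gain in log likelihood over a model with no regressors, and the pseudo R2 that Stata reports is McFadden's, one minus the ratio of the two log likelihoods. The Python sketch below checks this against figures quoted in this document. The function name is hypothetical, and for the ordered probit the constant-only log likelihood is not shown on the slides, so it is backed out from the reported LR chi2 (an assumption of the sketch).

```python
def lr_and_pseudo_r2(ll_null, ll_model):
    """Likelihood ratio statistic and McFadden pseudo R2."""
    lr = 2 * (ll_model - ll_null)
    pseudo_r2 = 1 - ll_model / ll_null
    return lr, pseudo_r2

# Multinomial logit on the insurance data: the iteration 0 log likelihood
# is the constant-only model, the final iteration is the fitted model.
lr_mlogit, r2_mlogit = lr_and_pseudo_r2(-555.85446, -534.36165)
print(round(lr_mlogit, 2), round(r2_mlogit, 4))

# Ordered probit: back out the constant-only log likelihood from LR chi2.
ll_oprobit = -25990.573
ll_oprobit_null = ll_oprobit - 4392.02 / 2
lr_oprobit, r2_oprobit = lr_and_pseudo_r2(ll_oprobit_null, ll_oprobit)
print(round(lr_oprobit, 2), round(r2_oprobit, 4))
```

The rounded results reproduce the LR chi2 and Pseudo R2 lines of the two outputs.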
------------------------------------------------------------------------------
   persit5yr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      lgnipc |  -.0766027   .0209432    -3.66   0.000    -.1176506   -.0355548
        male |  -.0249916   .0147207    -1.70   0.090    -.0538437    .0038604
         age |   .0513208   .0025145    20.41   0.000     .0463924    .0562492
       agesq |  -.0322755   .0025142   -12.84   0.000    -.0372033   -.0273478
        rlaw |  -.2455444    .011504   -21.34   0.000    -.2680919    -.222997
     estonia |   -.869435   .0417246   -20.84   0.000    -.9512136   -.7876564
     village |   .0524684   .0184945     2.84   0.005     .0162199     .088717
        town |   .0338535   .0182975     1.85   0.064     -.002009    .0697159
     selfemp |  -.0974318   .0293755    -3.32   0.001    -.1550067   -.0398569
------------------------------------------------------------------------------
Remember: the lower the dependent variable (persi...), the better the person has
done (1 for improved, 3 for got worse).
So a negative coefficient indicates that as that variable increases, the person
tends to have been doing better.
OK. The self-employed have been doing better, as have people in Estonia (?).
Those in countries with a good rule of law have done better, and those in richer
countries too (lgnipc: log gross national income per capita).
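------------------------------------------------------------------------------
   persit5yr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       marrd |  -.1534421    .015842    -9.69   0.000    -.1844918   -.1223923
       educ2 |  -.1429612   .0090999   -15.71   0.000    -.1607966   -.1251257
       unemp |   .6080104   .0313593    19.39   0.000     .5465472    .6694736
      manual |   .0821374   .0193907     4.24   0.000     .0441323    .1201426
------------------------------------------------------------------------------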
Married people and educated people have been doing better but the unemployed and
manual workers worse.
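Impact of age
------------------------------------------------------------------------------
   persit5yr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0513208   .0025145    20.41   0.000     .0463924    .0562492
       agesq |  -.0322755   .0025142   -12.84   0.000    -.0372033   -.0273478
------------------------------------------------------------------------------
The impact of age is thus 0.0513*AGE - 0.0322*AGE*AGE/100
(AGE*AGE/100 because this is how age squared was calculated).
So the impact is:
AGE     IMPACT
25      1.0812
40      1.5368
55      1.8474
70      2.0132
As people get older the probability of things getting worse increases. WHY?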
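The age impact table can be reproduced with a few lines of arithmetic. This Python sketch uses the coefficients rounded as on the slide (0.0513 and 0.0322); the function name age_impact is an assumption of the sketch. It also reports the ten-year changes at two points and the age at which the fitted quadratic stops rising.

```python
B_AGE = 0.0513    # coefficient on age, rounded as on the slide
B_AGESQ = 0.0322  # coefficient on agesq, where agesq = age*age/100

def age_impact(age):
    """Contribution of age to the ordered probit index."""
    return B_AGE * age - B_AGESQ * age * age / 100

for age in (25, 40, 55, 70):
    print(age, age_impact(age))

# Ten-year changes early and late in life
change_20_30 = age_impact(30) - age_impact(20)
change_50_60 = age_impact(60) - age_impact(50)
print(change_20_30, change_50_60)

# Age at which the quadratic peaks (derivative equal to zero)
turning_point = B_AGE / (2 * B_AGESQ / 100)
print(turning_point)
```

With these estimates the impact rises throughout the working-age range, but by less per decade at older ages, and the turning point lies near age 80.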
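And finally
------------------------------------------------------------------------------
   persit5yr |      Coef.   Std. Err.                     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       /cut1 |  -.6563796    .078922                     -.8110638   -.5016953
       /cut2 |   .3095595   .0788323                       .155051     .464068
------------------------------------------------------------------------------
These are the estimates of μ1 and μ2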
• If for an individual the predicted value from
the regression is less than -0.6564, then they
would be predicted to be categorised as ‘1’:
position improved.
• If for an individual the predicted value from
the regression is greater than 0.3096, then
they would be predicted to be categorised as
‘3’: position has got worse.
• And if the predicted value lies between these
two values, then the predicted category is ‘2’:
no change.
Let us calculate some examples. First do
the regression and store the coefficient
vector as cy
oprobit persi lgnipc male age agesq rlaw estonia
village town selfemp marrd educ2 unemp
manual if age<98 & age>17 & persi<4
matrix cy= e(b)
• Then calculate
scalar py50 =cy[1,1]*3.0 + cy[1,2]*1 + cy[1,3]*
50 + cy[1,4]* 50*50/100 + cy[1,5]*5+
cy[1,6]*0 + cy[1,7]*1 + cy[1,8]*0 + cy[1,9]*0 +
cy[1,10]*1 + cy[1,11]*4 + cy[1,12]*0 +
cy[1,13]*0
• cy[1,1] is the coefficient on lgnipc. The average
value for this is 3.0.
• cy[1,2] is the coefficient on male. Let us code
this as 1, i.e. we are predicting for a man.
• The other characteristics are 50 years old,
country with the highest level of rule of law
(5), etc.
. display py50
-.39618871
This lies between -0.6564 and 0.3096, the two critical values and hence
this person would be predicted to be ‘no change’
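The same calculation can be checked outside Stata. The Python sketch below is an illustration under the assumption that the coefficients are typed in from the printed output (so they are rounded to seven digits); the names coef, person, linear_index and classify are hypothetical. It rebuilds the index for the 50-year-old man and applies the threshold rule described above.

```python
# Coefficients copied from the ordered probit output (rounded as printed)
coef = {
    "lgnipc": -.0766027, "male": -.0249916, "age": .0513208,
    "agesq": -.0322755, "rlaw": -.2455444, "estonia": -.869435,
    "village": .0524684, "town": .0338535, "selfemp": -.0974318,
    "marrd": -.1534421, "educ2": -.1429612, "unemp": .6080104,
    "manual": .0821374,
}
CUT1, CUT2 = -.6563796, .3095595

# Characteristics used on the slide: a 50-year-old married man in a village,
# lgnipc = 3.0, rule of law = 5, educ2 = 4, all other dummies zero.
person = {
    "lgnipc": 3.0, "male": 1, "age": 50, "agesq": 50 * 50 / 100,
    "rlaw": 5, "estonia": 0, "village": 1, "town": 0, "selfemp": 0,
    "marrd": 1, "educ2": 4, "unemp": 0, "manual": 0,
}

def linear_index(p):
    """Sum of coefficient times value over all regressors."""
    return sum(coef[name] * p[name] for name in coef)

def classify(xb):
    """Threshold rule: 1 below cut1, 3 above cut2, otherwise 2."""
    if xb < CUT1:
        return 1
    if xb > CUT2:
        return 3
    return 2

py50 = linear_index(person)
print(py50, classify(py50))
```

Because the coefficients are rounded, the result agrees with Stata's display of py50 only to about five decimal places.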
Now let us try the same person, but aged 30.
scalar py30 =cy[1,1]*3.0 + cy[1,2]*1 + cy[1,3]* 30 + cy[1,4]* 30*30/100 +
cy[1,5]*5+ cy[1,6]*0 + cy[1,7]*1 + cy[1,8]*0 + cy[1,9]*0 + cy[1,10]*1 +
cy[1,11]*4 + cy[1,12]*0 + cy[1,13]*0
. display py30
-.90619611
This is less than the lower critical value of -0.6564 hence this person would be
predicted to have improved.
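Comparing the index with the cut points gives a single predicted category; the ordered probit also implies a probability for each category. The Python sketch below, which assumes the standard ordered probit formulas with a standard normal error and uses hypothetical helper names, computes those probabilities for the two index values displayed above.

```python
import math

CUT1, CUT2 = -.6563796, .3095595

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def category_probs(xb):
    """Probabilities of 1 (improved), 2 (same), 3 (worse) given the index."""
    p1 = norm_cdf(CUT1 - xb)
    p3 = 1 - norm_cdf(CUT2 - xb)
    p2 = 1 - p1 - p3
    return p1, p2, p3

probs50 = category_probs(-.39618871)  # py50 from the slide
probs30 = category_probs(-.90619611)  # py30 from the slide
print("age 50:", [round(p, 3) for p in probs50])
print("age 30:", [round(p, 3) for p in probs30])
```

For the 30-year-old, ‘improved’ has a probability above one half. For the 50-year-old the three probabilities are much closer together, so the threshold rule is best read as a summary of where the index falls rather than as a confident prediction.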
• No one has ever analysed this before and
there may be a paper in it.
• That people’s situation gets worse as they age
is not surprising, once they reach say 50. But
these results suggest it is also so for those aged 30
vis-à-vis 20, not only for 60 vis-à-vis 50.
• Perhaps we should try a spline on this just to
check the quadratic form on age is not
misleading.
• And why do educated people fare better?
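One way to carry out the spline check mentioned above is to replace age and agesq with piecewise-linear age terms, so that each age band gets its own slope. The Python sketch below only shows how such terms are built; the knots at 30 and 50 are purely illustrative assumptions and the function name is hypothetical. In Stata, the mkspline command creates variables of this kind.

```python
def linear_spline(age, knots):
    """Split age into one term per segment defined by the knots.

    With knots (30, 50) the terms are: years up to 30, years between
    30 and 50, and years above 50. The terms always sum to age.
    """
    terms = []
    lower = None
    for i, knot in enumerate(knots):
        if i == 0:
            terms.append(min(age, knot))
        else:
            terms.append(max(min(age, knot), lower) - lower)
        lower = knot
    terms.append(max(age, lower) - lower)
    return terms

for age in (25, 40, 60):
    print(age, linear_spline(age, (30, 50)))
```

If the slopes estimated on these terms tell the same story as the quadratic, the concern about functional form is reduced.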
Ordered Logit ‘by hand’
program myologit
    args lnf xb a1 a2
    * The contribution to the log likelihood at each level of y
    quietly replace `lnf' = ln(1/(1+exp(-`a1' + `xb'))) if $ML_y1 == 1
    quietly replace `lnf' = ln(1/(1+exp(-`a2' + `xb')) - 1/(1+exp(-`a1' + `xb'))) if $ML_y1 == 2
    quietly replace `lnf' = ln(1 - 1/(1+exp(-`a2' + `xb'))) if $ML_y1 == 3
end
* specify the method (lf) and the name of your evaluator (myologit)
* followed by the equation(s) in parentheses and then the cutpoints.
ml model lf myologit (xb: insure = age male nonwhite ) /a1 /a2
ml check
ml search
ml maximize,iterate(50)
ologit insure age male nonwhite
oprobit insure age male nonwhite
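convergence not achieved
                                                  Number of obs   =        615
                                                  Wald chi2(3)    =      15.91
Log likelihood = -547.75513                       Prob > chi2     =     0.0012

------------------------------------------------------------------------------
      insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb           |
         age |  -.0087368   .0055974    -1.56   0.119    -.0197076    .0022339
        male |   .5056461   .1826912     2.77   0.006      .147578    .8637142
    nonwhite |   .5615129   .1958493     2.87   0.004     .1776553    .9453705
       _cons |   3.866424   .2924209    13.22   0.000      3.29329    4.439558
-------------+----------------------------------------------------------------
a1           |
       _cons |   3.622825   .1567336    23.11   0.000     3.315632    3.930017
-------------+----------------------------------------------------------------
a2           |
       _cons |    6.30395          .        .       .            .           .
------------------------------------------------------------------------------
Warning: convergence not achieved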
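The estimation does not converge and the second cut-off point has no standard error.
The xb equation includes a constant, which cannot be identified separately from the
two cut-off points; only the differences are pinned down: 3.622825 - 3.866424 = -.243599
and 6.30395 - 3.866424 = 2.437526, which are the /cut1 and /cut2 that ologit reports.
Dropping the constant from the xb equation (the nocons option) should remove the
redundancy. But the slope coefficients per se are the same as if we use the ologit command:
. ologit insure age male nonwhite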
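Iteration 0:   log likelihood = -555.85446
Iteration 1:   log likelihood = -547.76723
Iteration 2:   log likelihood = -547.75513
Iteration 3:   log likelihood = -547.75513

Ordered logistic regression                       Number of obs   =        615
                                                  LR chi2(3)      =      16.20
                                                  Prob > chi2     =     0.0010
Log likelihood = -547.75513                       Pseudo R2       =     0.0146

------------------------------------------------------------------------------
      insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0087368   .0055974    -1.56   0.119    -.0197076    .0022339
        male |   .5056461   .1826912     2.77   0.006      .147578    .8637142
    nonwhite |   .5615129   .1958493     2.87   0.004     .1776553    .9453705
-------------+----------------------------------------------------------------
       /cut1 |  -.2435994   .2619071                     -.7569278    .2697289
       /cut2 |   2.437526   .2924209                      1.864391     3.01066
------------------------------------------------------------------------------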
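The link between the hand-coded likelihood and the ologit output can be illustrated numerically. The Python sketch below mirrors the three branches of the myologit evaluator and shows that adding a constant to the index while shifting both cut-off points by the same amount leaves the log likelihood unchanged, which is why the constant in the xb equation cannot be pinned down. The four observations are made-up values for illustration only, and the function names are hypothetical.

```python
import math

def logistic(z):
    return 1 / (1 + math.exp(-z))

def ologit_lnf(y, xb, a1, a2):
    """Log likelihood contribution of one observation (y is 1, 2 or 3)."""
    if y == 1:
        return math.log(logistic(a1 - xb))
    if y == 2:
        return math.log(logistic(a2 - xb) - logistic(a1 - xb))
    return math.log(1 - logistic(a2 - xb))

# Slopes reported by both ml and ologit: age, male, nonwhite
SLOPES = (-.0087368, .5056461, .5615129)

# Made-up observations: (insure, age, male, nonwhite)
data = [(1, 30, 0, 0), (2, 45, 1, 0), (3, 25, 1, 1), (2, 60, 0, 1)]

def total_loglik(cons, a1, a2):
    total = 0.0
    for y, age, male, nonwhite in data:
        xb = cons + SLOPES[0] * age + SLOPES[1] * male + SLOPES[2] * nonwhite
        total += ologit_lnf(y, xb, a1, a2)
    return total

# ologit parameterisation: no constant, cut points as reported
ll_ologit = total_loglik(0.0, -.2435994, 2.437526)
# ml parameterisation: constant in xb plus the a1 and a2 constants
ll_ml = total_loglik(3.866424, 3.622825, 6.30395)
print(ll_ologit, ll_ml)
```

The two totals agree up to the rounding of the printed estimates, so the non-converged ml run and ologit describe the same fitted model.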
See also: http://www.ats.ucla.edu/stat/stata/code/ml_maximize.htm
use http://www.stata-press.com/data/r11/sysdsn1.dta
mlogit insure age male nonwhite
ologit persi lgnipc male age agesq rlaw estonia village town selfemp
marrd educ2 unemp manual if age<98 & age>17 & persi<4
program myologit
    args lnf xb a1 a2
    * The contribution to the likelihood at each level of y
    quietly replace `lnf' = ln(1/(1+exp(-`a1' + `xb'))) if $ML_y1 == 1
    quietly replace `lnf' = ln(1/(1+exp(-`a2' + `xb')) - 1/(1+exp(-`a1' + `xb'))) if $ML_y1 == 2
    quietly replace `lnf' = ln(1 - 1/(1+exp(-`a2' + `xb'))) if $ML_y1 == 3
end
ologit insure age male nonwhite
oprobit insure age male nonwhite
ml model lf myologit (xb: insure = age male nonwhite ) /a1 /a2
ml check
ml search
ml maximize