DANE PANELOWE - Uniwersytet Warszawski
PANEL DATA
Development
Workshop
What are we going to do today?
1. Panels – introduction and data properties
2. How to measure distance
3. What comes first: trade or GDP?
4. What else affects trade?
5. Role of currency?
Why panel data?
What is the sense of panel data?
pooled data in econometrics
panels in econometrics
long or wide?
fixed or random effects?
Gravity model
All that theory is cool, but transport costs matter and market size matters: => push and pull
– Isard (1954), logs by Tinbergen (1962) [what if there were no barriers? "missing trade"], Linnemann (1966) [standard macro approach]
– Anderson (1979) [first theoretical model – expenditure based]
– Helpman-Krugman (1985) [intra-industry trade]
– Bergstrand (1985) [general equilibrium, one country/one factor]
– Bergstrand (1989) [H-O model with Linder hypothesis]
Simplest model
Variables:
– Explained: bilateral trade
– Explanatory: GDP, populations, distance
reg lntrade lngdp lnpop lndist
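      Source |       SS       df       MS              Number of obs =    1074
-------------+------------------------------           F(  3,  1070) = 1742.78
       Model |  2462.80855     3  820.936185           Prob > F      =  0.0000
    Residual |  504.022113  1070  .471048704           R-squared     =  0.8301
-------------+------------------------------           Adj R-squared =  0.8296
       Total |  2966.83067  1073  2.76498664           Root MSE      =  .68633

------------------------------------------------------------------------------
     lntrade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lngdp |   1.050343   .0695453    15.10   0.000     .9138822    1.186803
      lndist |  -1.364936   .0408487   -33.41   0.000    -1.445088   -1.284783
       lnpop |   .3443573   .0719854     4.78   0.000     .2031089    .4856058
       _cons |    2.82369   .4257506     6.63   0.000     1.988289    3.659091
------------------------------------------------------------------------------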
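How to read these numbers: a short worked example, assuming (as the variable names suggest, but the slides do not state) that all variables are natural logs, so the coefficients are elasticities. The fitted pooled equation and the effect of doubling distance, holding GDP and population fixed, would then be:

```latex
\ln(\text{trade}) = 2.824 + 1.050\,\ln(\text{gdp}) + 0.344\,\ln(\text{pop}) - 1.365\,\ln(\text{dist})

% 1% higher GDP  ->  roughly 1.05% more trade
% doubling distance multiplies predicted trade by 2^{-1.365}:
\frac{\text{trade}(2\cdot\text{dist})}{\text{trade}(\text{dist})} = 2^{-1.365} \approx 0.39
```

So, under this pooled specification, a country pair twice as far apart is predicted to trade about 61% less.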
Panel data
Data collected for the same group of subjects over several years => there may be something consistently specific about particular subjects
$E(y_{it} \mid x_{it}, c_i) = x_{it}\beta + c_i$, with $c_i$ unobserved and constant over time
$c_i$ is called the unobserved (individual) effect or component, also unobserved (individual) heterogeneity
Panel data
$y_{it} = x_{it}\beta + c_i + u_{it}$ and $E(u_{it}) = 0$
If $c_i$ is nonzero and correlated with $x_{it}$, running $y_{it} = x_{it}\beta + u_{it}$ has an omitted variable problem, so the estimates are biased
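The omitted variable problem can be seen in a small simulation. This is only a sketch on made-up data (the variable names id, t, c, x, y and all numbers are assumptions, not the workshop dataset): the regressor is built to be correlated with the individual effect, and the true slope is 1.

```stata
* Simulated panel: 200 units observed for 10 periods (all values are made up)
clear
set seed 12345
set obs 200
generate id = _n
generate c = rnormal()                // unobserved, time-constant effect c_i
expand 10                             // 10 rows per unit
bysort id: generate t = _n            // period index within unit
generate x = 0.8*c + rnormal()        // regressor correlated with c_i
generate y = x + c + rnormal()        // true slope on x is 1

regress y x                           // pooled OLS ignores c_i: slope pushed above 1
xtset id t
xtreg y x, fe                         // within estimator: slope close to 1
```

In this design the pooled slope should settle near 1.5 rather than 1, because x picks up part of the effect of c.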
Panel results – same as OLS?
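Random-effects GLS regression                   Number of obs      =      1074
Group variable: id                              Number of groups   =        91

R-sq:  within  = 0.5289                         Obs per group: min =         6
       between = 0.8342                                        avg =      11.8
       overall = 0.8158                                        max =        12

                                                Wald chi2(3)       =   1549.73
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
     lntrade |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lngdp |   1.776719   .0571014    31.12   0.000     1.664803    1.888636
      lndist |  -1.239047   .1286976    -9.63   0.000    -1.491289   -.9868039
       lnpop |  -.3058137   .1083628    -2.82   0.005    -.5182008   -.0934266
       _cons |  -.3401835   1.054482    -0.32   0.747     -2.40693    1.726562
-------------+----------------------------------------------------------------
     sigma_u |  .61522954
     sigma_e |  .29088637
         rho |  .81729472   (fraction of variance due to u_i)
------------------------------------------------------------------------------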
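Where does rho come from? A quick check using the reported sigma_u and sigma_e (rounded here to four digits):

```latex
\rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}
     = \frac{0.6152^2}{0.6152^2 + 0.2909^2}
     = \frac{0.3785}{0.3785 + 0.0846}
     \approx 0.817
```

That is, about 82% of the unexplained variance is attributed to differences between groups rather than to variation over time within a group.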
Panel results – same as OLS?
    Variable |     OLS           PANEL
-------------+---------------------------
       lngdp |  1.0503428     1.7767192
      lndist | -1.3649357    -1.2390467
       lnpop |  .34435733    -.30581371
       _cons |  2.8236899    -.34018353
Panel data – how to get results?
Same data, same question, but the "something" we observe consists of groups followed over time
Stata needs to be told what the panel dimensions are
1. Set of commands (older syntax):
   iis grouping_var
   tis time_var
2. xtset grouping_var time_var
3. tsset grouping_var time_var
(they are all equivalent)
Once the data are declared as a panel: compare xtsum with sum
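A minimal sketch of declaring a panel and comparing the two summaries, on a toy dataset built on the spot. The variable name year and all values are assumptions for illustration; in the workshop data the grouping variable is id.

```stata
* Toy panel: 5 groups, 4 years each (id, year and lntrade values are made up)
clear
set seed 2024
set obs 5
generate id = _n
expand 4
bysort id: generate year = 2000 + _n
generate lntrade = rnormal()

xtset id year          // declare the group and time dimensions
xtdescribe             // pattern of observations per group
summarize lntrade      // one overall mean and standard deviation
xtsum lntrade          // splits the variation into between and within
```

The point of xtsum is the split: the between line describes variation across group means, the within line describes variation over time around each group's own mean.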
Panel data – not as simple as that…
$y_{it} = x_{it}\beta + \nu_{it}$ and $\nu_{it} = c_i + u_{it}$
Case 1 - $E(x_{it} c_i) = 0$
– Individual effects and explanatory variables uncorrelated
– OLS consistent – no need for panel
Case 2 - $E(x_{it} c_i) = 0$ – RANDOM EFFECTS
– regressors not correlated with individual effects; the effects are treated as part of the error term, and GLS uses the resulting within-group correlation to gain efficiency
Case 3 - $E(x_{it} c_i) \neq 0$ – FIXED EFFECTS
– regressors correlated with individual effects
Need to choose somehow…
Panel regression
Do not forget the menus in Stata
To find out how to do panel regressions in Stata:
Statistics => Longitudinal/panel data
– Many options already covered: xtset, sum, des, tab (check them out)
– Also: linear models
Simplest code
– xtreg lntrade lnpop lngdp lndist        (RANDOM)
– xtreg lntrade lnpop lngdp lndist, fe    (FIXED)
Testing for FIXED vs RANDOM
Need to know if $E(x_{it} c_i) = 0$
If $E(x_{it} c_i) \neq 0$, then assuming that $E(x_{it} c_i) = 0$ leads to bias
If $E(x_{it} c_i) = 0$, then estimating a model that allows $E(x_{it} c_i) \neq 0$ is inefficient (too many parameters estimated)
In plain English:
– If RE is true, running FE leads to inefficient estimators
– If FE is true, running RE leads to biased estimators
An idea for a test: compare the coefficients and see if they are "the same". If yes – RE is better (efficient)
This is the HAUSMAN TEST. The Breusch-Pagan LM test (xttest0) asks a different question: is Var(u) = 0, i.e. random effects vs pooled OLS?
hausman fe re
                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |       fe           re         Difference          S.E.
-------------+----------------------------------------------------------------
       lngdp |    1.043304     1.776719       -.7334154        .0799723
       lnpop |     13.0698    -.3058137        13.37561        1.391313
------------------------------------------------------------------------------
                           b = consistent under Ho and Ha; obtained from xtreg
            B = inconsistent under Ha, efficient under Ho; obtained from xtreg

    Test:  Ho:  difference in coefficients not systematic

                  chi2(2) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =       78.60
                Prob>chi2 =      0.0000
                (V_b-V_B is not positive definite)
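The command hausman fe re only works if both sets of results were stored under those names first. A sketch of the full sequence on simulated data (all names and values are made up; x is correlated with the individual effect by construction, so the test should reject RE here):

```stata
* Simulated panel where the random-effects assumption fails by construction
clear
set seed 4321
set obs 200
generate id = _n
generate c = rnormal()
expand 10
bysort id: generate t = _n
generate x = 0.8*c + rnormal()
generate y = x + c + rnormal()
xtset id t

quietly xtreg y x, fe
estimates store fe             // keep FE results under the name "fe"
quietly xtreg y x, re
estimates store re             // keep RE results under the name "re"
xttest0                        // Breusch-Pagan LM: Var(u) = 0? (RE vs pooled OLS)
hausman fe re                  // Hausman: FE vs RE, consistent estimator first
```

The order of arguments matters: the estimator that is consistent under both hypotheses (FE) goes first, the efficient one (RE) second.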
xttest0
Breusch and Pagan Lagrangian multiplier test for random effects

        lntrade[id,t] = Xb + u[id] + e[id,t]

        Estimated results:
                         |       Var     sd = sqrt(Var)
                ---------+-----------------------------
                 lntrade |   2.764987       1.662825
                       e |   .0846149       .2908864
                       u |   .3785074       .6152295

        Test:   Var(u) = 0
                             chibar2(01) =  3397.25
                          Prob > chibar2 =   0.0000
Let’s interpret FE panel estimator
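Fixed-effects (within) regression               Number of obs      =      1074
Group variable: id                              Number of groups   =        91

R-sq:  within  = 0.5693                         Obs per group: min =         6
       between = 0.5846                                        avg =      11.8
       overall = 0.5648                                        max =        12

                                                F(2,981)           =    648.39
corr(u_i, Xb)  = -0.9941                        Prob > F           =    0.0000

------------------------------------------------------------------------------
     lntrade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lngdp |   1.043304   .0982657    10.62   0.000     .8504686    1.236139
      lndist |          0  (omitted)
       lnpop |    13.0698   1.395526     9.37   0.000     10.33124    15.80836
       _cons |  -54.17363   4.665788   -11.61   0.000    -63.32971   -45.01756
-------------+----------------------------------------------------------------
     sigma_u |  9.9186897
     sigma_e |  .29088637
         rho |  .99914066   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(90, 981) =   124.35             Prob > F = 0.0000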
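Why is lndist omitted? The within estimator subtracts each group's time average from every variable, which removes $c_i$ together with anything else that does not change over time. A sketch in the notation used on the earlier slides, with bars denoting averages over $t$ for a given $i$:

```latex
\bar{y}_i = \bar{x}_i\beta + c_i + \bar{u}_i
\qquad\Longrightarrow\qquad
y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)\beta + (u_{it} - \bar{u}_i)

% distance between two countries does not change over time, so
\text{lndist}_{it} - \overline{\text{lndist}}_i = 0 \quad \text{for all } t
```

A regressor that is identically zero after the transformation carries no information, so its coefficient cannot be estimated and Stata drops it.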
How do we know if it makes sense?
Is it different from the pooled estimator, or does it only look different?
What if we add country effects to a pooled estimation? Let’s try
areg lntrade lnpop lngdp lndist, absorb(id)
Some we know from the literature and some from experience
– Linear or in logs? Maybe also non-linear terms and
interactions, trade or export share, etc.
– Should we do fixed or random effects?
– Are we interested in differences across time or across
countries? Between and within R2 tell a different story, no?
What do our models say?
areg lntrade lngdp lndist lnpop, a(id)
note: lndist omitted because of collinearity
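Linear regression, absorbing indicators           Number of obs   =       1074
                                                  F(   2,    981) =     648.39
                                                  Prob > F        =     0.0000
                                                  R-squared       =     0.9720
                                                  Adj R-squared   =     0.9694
                                                  Root MSE        =     0.2909

------------------------------------------------------------------------------
     lntrade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lngdp |   1.043304   .0982657    10.62   0.000     .8504686    1.236139
      lndist |          0  (omitted)
       lnpop |    13.0698   1.395526     9.37   0.000     10.33124    15.80836
       _cons |  -54.17363   4.665788   -11.61   0.000    -63.32971   -45.01756
-------------+----------------------------------------------------------------
          id |     F(90, 981) =    124.348   0.000          (91 categories)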
OLS, RE, FE, AREG - comparison
    Variable |     OLS            re            fe           AREG
-------------+------------------------------------------------------
       lngdp |  1.0503428     1.7767192     1.0433037     1.0433037
      lndist | -1.3649357    -1.2390467     (omitted)     (omitted)
       lnpop |  .34435733    -.30581371     13.069798     13.069798
       _cons |  2.8236899    -.34018353    -54.173633    -54.173633
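The fe and AREG columns coincide because both fit a separate intercept for every id: xtreg, fe removes it by demeaning, areg by absorbing a dummy per group. A sketch on simulated data that checks this (all names and values are made up; z plays the role of a time-invariant regressor such as lndist):

```stata
* Simulated panel with a time-invariant regressor z (values are made up)
clear
set seed 777
set obs 100
generate id = _n
generate double c = rnormal()
generate double z = rnormal()          // constant within unit, like lndist
expand 8
bysort id: generate t = _n
generate double x = 0.5*c + rnormal()
generate double y = 2*x + z + c + rnormal()
xtset id t

xtreg y x z, fe                        // z is dropped: no within-unit variation
scalar b_fe = _b[x]                    // keep the FE slope for comparison
areg y x z, absorb(id)                 // OLS with one dummy per id absorbed
display "xtreg, fe: " b_fe "   areg: " _b[x]
```

Both commands should report the same slope on x and both should drop z.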
What if there are FEs?
One idea for estimating a model with FE without estimating the FE themselves is to first-difference the model
– $y_{it} = x_{it}\beta + c_i + u_{it}$ => $\Delta y_{it} = \Delta x_{it}\beta + \Delta u_{it}$
We lose all time-invariant effects (as in FE)
Interpretation of the coefficients is different
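The step behind the first-difference equation, written out. The trend term $\delta t$ in the last lines is an added assumption, included only to show what a constant in the differenced regression would stand for:

```latex
y_{it}     = x_{it}\beta     + c_i + u_{it}
y_{i,t-1}  = x_{i,t-1}\beta  + c_i + u_{i,t-1}

% subtracting the second line from the first, c_i cancels:
\Delta y_{it} = \Delta x_{it}\beta + \Delta u_{it}

% if the level equation also contained a linear trend \delta t, then
\Delta(\delta t) = \delta
% so the constant of the differenced regression estimates the trend slope
```

Each group loses its first observation in the process, so a panel with 1074 observations in 91 groups leaves 1074 - 91 = 983 differenced observations.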
FD estimation…
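      Source |       SS       df       MS              Number of obs =     983
-------------+------------------------------           F(  2,   980) =    1.90
       Model |   .14862047     2  .074310235           Prob > F      =  0.1498
    Residual |  38.2886732   980  .039070075           R-squared     =  0.0039
-------------+------------------------------           Adj R-squared =  0.0018
       Total |  38.4372936   982  .039141847           Root MSE      =  .19766

------------------------------------------------------------------------------
   D.lntrade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lngdp |
         D1. |  -.0821923   .1301412    -0.63   0.528    -.3375798    .1731952
             |
      lndist |
         D1. |          0  (omitted)
             |
       lnpop |
         D1. |  -2.734188   1.492163    -1.83   0.067    -5.662391    .1940149
             |
       _cons |   .1004433    .009844    10.20   0.000     .0811255    .1197611
------------------------------------------------------------------------------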
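Output of this shape comes from regressing differenced variables with Stata's D. operator; the exact command used for the slide is not shown, so the following is only a sketch on simulated data (all names and values are made up; the true slope on x is 1):

```stata
* Simulated panel; z is time-invariant, x is correlated with the effect c
clear
set seed 99
set obs 200
generate id = _n
generate c = rnormal()
generate z = rnormal()                 // time-invariant regressor
expand 10
bysort id: generate t = _n
generate x = 0.8*c + rnormal()
generate y = x + z + c + rnormal()
xtset id t                             // required before using the D. operator

regress D.y D.x D.z                    // D.z is all zeros, so it is omitted
xtreg y x z, fe                        // within estimator for comparison
```

With well-behaved simulated errors both estimators recover a slope near 1; on real data they can differ a lot, as the slides show, which is itself a hint about the specification.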
And if we wanted to keep time-invariant (TI) effects?
$y_{it} = z_i\gamma + x_{it}\beta + c_i + u_{it}$
FE and FD eliminate $\gamma$
But if $E(z_i' c_i) = 0$ we can estimate $\gamma$ with the Hausman-Taylor estimator
But that is for more advanced users …
xthtaylor lntrade lndist lnpop lngdp, endog(lngdp) constant(lndist)
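Hausman-Taylor estimation                       Number of obs      =      1074
Group variable: id                              Number of groups   =        91

                                                Obs per group: min =         6
                                                               avg =      11.8
                                                               max =        12

Random effects u_i ~ i.i.d.                     Wald chi2(3)       =   1305.69
                                                Prob > chi2        =    0.0000

------------------------------------------------------------------------------
     lntrade |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
TVexogenous  |
       lnpop |   6.697837   .9950836     6.73   0.000     4.747509    8.648165
TVendogenous |
       lngdp |   1.414057   .0793424    17.82   0.000     1.258548    1.569565
TIexogenous  |
      lndist |   2.249499   2.196748     1.02   0.306    -2.056048    6.555046
             |
       _cons |  -48.74691   16.72393    -2.91   0.004     -81.5252   -15.96861
-------------+----------------------------------------------------------------
     sigma_u |  9.7558279
     sigma_e |   .2905903
         rho |  .99911356   (fraction of variance due to u_i)
------------------------------------------------------------------------------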
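The three blocks in the output (TVexogenous, TVendogenous, TIexogenous) correspond to how the command splits the regressors. A sketch of the model behind it, where the subscripts 1 (exogenous) and 2 (endogenous) are notation added here and the assignment of variables follows the options endog(lngdp) and constant(lndist):

```latex
y_{it} = x_{1,it}\beta_1 + x_{2,it}\beta_2 + z_{1,i}\gamma_1 + c_i + u_{it}

% x_1 = lnpop  (time-varying, assumed uncorrelated with c_i)
% x_2 = lngdp  (time-varying, allowed to be correlated with c_i)
% z_1 = lndist (time-invariant, assumed uncorrelated with c_i)
E(x_{1,it}\, c_i) = 0, \qquad E(z_{1,i}\, c_i) = 0, \qquad E(x_{2,it}\, c_i) \neq 0
```

Roughly, the estimator instruments the endogenous regressor with within-group deviations of the time-varying variables and with group means of the exogenous ones, which is what lets the coefficient on lndist survive. Whether lnpop and lndist really are uncorrelated with the country-pair effect is an assumption of this specification, not something the output can confirm.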
Do more complex estimators
always make more sense?
    Variable |     OLS            re            fe           AREG           HT
-------------+--------------------------------------------------------------------
       lngdp |  1.0503428     1.7767192     1.0433037     1.0433037     1.4140566
      lndist | -1.3649357    -1.2390467     (omitted)     (omitted)     2.2494989
       lnpop |  .34435733    -.30581371     13.069798     13.069798     6.6978367
       _cons |  2.8236899    -.34018353    -54.173633    -54.173633    -48.746905
What do you find in the do-file?
1. Declare panel, run simplest models, do graphs, etc.
2. Run diagnostics
3. Learn more
Next - huge problem - endogeneity
What comes first:
– do rich countries trade more, or are they rich because they trade more?
– how to get around this problem?
What is it that we want?
– Cross country differences?
– Time evolutions within one country?
– Test theory?