Week 1 - University of Essex


SC968
Panel data methods for sociologists
Lecture 1, part 1
A review of concepts for regression modelling
Or: things you should know already
Overview

- Models
  - OLS, logit and probit
  - Mathematically and practically
  - Interpretation of results, measures of fit and regression diagnostics
  - Model specification
  - Post-estimation commands
- STATA competence
Ordinary Least Squares (OLS)

  yi = α + xi1·β1 + xi2·β2 + xi3·β3 + ……… + xiK·βK + εi

where
- yi is the value of the dependent variable for individual i (the LHS variable)
- α is the intercept (constant)
- xi1 is the value of explanatory variable 1 for person i
- β1 is the coefficient on variable 1
- εi is the residual (disturbance, error term)
- the total number of explanatory variables (RHS variables or regressors) is K

Examples
- yi = mental health; x1 = sex, x2 = age, x3 = marital status, x4 = employment status, x5 = physical health
- yi = hourly pay; x1 = sex, x2 = age, x3 = education, x4 = job tenure, x5 = industry, x6 = region
OLS

  yi = α + xi1·β1 + xi2·β2 + xi3·β3 + ……… + xiK·βK + εi

In vector form:

  yi = xi'β + εi

where xi is the vector of explanatory variables and β is the vector of coefficients:

  yi = [xi1 xi2 xi3 . . xiK] * [β1 β2 β3 . . βK]' + εi

Note: you will often see x'β written as xβ

In matrix form, stacking all N observations:

  y = Xβ + ε

where y is the N×1 vector of outcomes, X is the N×K matrix of regressors, β is the K×1 vector of coefficients and ε is the N×1 vector of disturbances.
OLS

- Also called "linear regression"
- Assumes the dependent variable is a linear combination of the explanatory variables, plus a disturbance
- "Least squares": the β's are estimated so as to minimise the sum of the squared ε's:

  min Σ(εi)²   i.e.   min {ε'ε}

- The solution is b = (X'X)⁻¹ X'y
Assumptions

- Residuals have zero mean: E(εi) = 0
- It follows that the ε's and X's are uncorrelated: E(εi | Xi) = 0, which implies E(εi·Xi) = 0
  - Violated if a regressor is endogenous
  - E.g., number of children in female labour supply models
  - Cure by (e.g.) Instrumental Variables
- Homoscedasticity: all ε's have the same variance: Var(εi) = σ²
  - Classic example of a violation: food consumption and income
  - Cure by using weighted least squares
- Nonautocorrelation: ε's uncorrelated with each other: E(εi·εj) = 0 for i ≠ j
  - Violated in data sets where the same individual appears multiple times
  - Adjust standard errors: 'cluster' option in STATA
- Disturbances are iid (independently and identically distributed: zero mean, constant variance, often assumed normal)
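A minimal Stata sketch of two ways to respond to violations of these assumptions (the variable names incm, female, age, age2 and pid follow the worked example later in this lecture):

. reg incm female age age2                    // plain OLS
. estat hettest                               // Breusch-Pagan test for heteroscedasticity
. reg incm female age age2, cluster(pid)      // standard errors robust to repeated observations on the same person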
When is OLS appropriate?

- When you have a continuous dependent variable
  - E.g., you would use it to estimate regressions for height, but not for whether a person has a university degree.
- When the assumptions are not obviously violated
- As a first step in research to get ball-park estimates
  - We will use them a lot for this purpose

Worked examples

- Coefficients, P-values, t-statistics
- Measures of fit (R-squared, adjusted R-squared)
- Thinking about specification
- Post-estimation commands
- Regression diagnostics.
A note on the data

- All examples (in lectures and practicals) are drawn from a 20% sample of the British Household Panel Survey (BHPS) – more about the data later!
First worked example

- Summarize monthly earned income
- Monthly labour income, for people whose labour income is >= £1
- For illustrative purposes only. Not an example of good practice.

. sum incm if age >= 17 & age <= 64, d

                            incm
-------------------------------------------------------------
      Percentiles      Smallest
 1%          43              1
 5%         156           1.25
10%    268.6667              2
25%    615.3333       2.416667      Obs                16696
50%    1073.088                     Sum of Wgt.        16696
75%        1690        Largest      Mean            1282.831
90%    2471.848       9207.083      Std. Dev.       1008.308
95%    3061.355       9333.333      Variance         1016685
99%    5003.849          10000      Skewness         2.19295
                         10000      Kurtosis        11.94321
. reg incm female age age2 partner ed_sec ed_deg mth_int if age >= 17 & age <= 64

      Source |       SS       df       MS              Number of obs =   16458
-------------+------------------------------           F(  7, 16450) =  957.92
       Model |  4.8145e+09     7  687785597            Prob > F      =  0.0000
    Residual |  1.1811e+10 16450  718000.667           R-squared     =  0.2896
-------------+------------------------------           Adj R-squared =  0.2893
       Total |  1.6626e+10 16457  1010245.5            Root MSE      =  847.35

        incm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -594.9641   13.26812   -44.84   0.000    -620.9711   -568.9571
         age |   101.0994   3.859657    26.19   0.000     93.53401    108.6647
        age2 |  -1.155281   .0479992   -24.07   0.000    -1.249364   -1.061197
     partner |   155.7992   16.62703     9.37   0.000     123.2085      188.39
      ed_sec |   380.5032   14.36582    26.49   0.000     352.3446    408.6618
      ed_deg |   1076.674   20.54526    52.40   0.000     1036.403    1116.945
     mth_int |  -5.059072   4.036446    -1.25   0.210    -12.97094      2.8528
       _cons |   -819.931   78.80064   -10.41   0.000    -974.3888   -665.4732

Annotations:
- The Source table is the analysis of variance (ANOVA) table; MS = SS/df
- R-squared = Model SS / Total SS
- Root MSE = sqrt(Residual MS)
- t-stat = coefficient / standard error
- F( 7, 16450) tests whether all coefficients except the constant are jointly zero
- [95% Conf. Interval] = coefficient + or – 1.96 standard errors
What do the results tell us?

(Same regression output as on the previous slide.)

- All coefficients except month of interview are significant
- 29% of variation explained
- Being female reduces income by nearly £600 per month
- Income goes up with age and then down
- 16458 observations….. oops, this is from panel data, so there are repeated observations on individuals.
- Add ,cluster(pid) as an option
. reg incm female age age2 partner ed_sec ed_deg mth_int if age >= 17 & age <= 64, cluster(pid)

Linear regression                                      Number of obs =   16458
                                                       F(  7,  2465) =  135.26
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.2896
                                                       Root MSE      =  847.35

                              (Std. Err. adjusted for 2466 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
        incm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -594.9641   31.81172   -18.70   0.000    -657.3445   -532.5836
         age |   101.0994   7.323088    13.81   0.000     86.73932    115.4594
        age2 |  -1.155281   .0933813   -12.37   0.000    -1.338395   -.9721666
     partner |   155.7992   30.87227     5.05   0.000     95.26099    216.3375
      ed_sec |   380.5032   30.36746    12.53   0.000     320.9549    440.0516
      ed_deg |   1076.674   64.45131    16.71   0.000     950.2898    1203.058
     mth_int |  -5.059072   4.126102    -1.23   0.220    -13.15006    3.031912
       _cons |   -819.931   132.8455    -6.17   0.000    -1080.431   -559.4306

- Coefficients, R-squared etc. are unchanged from the previous specification
- But standard errors are adjusted: standard errors are larger, t-statistics are lower
Let's get rid of the "month" variable

. reg incm female age age2 partner ed_sec ed_deg if age >= 17 & age <= 64, cluster(pid)

Linear regression                                      Number of obs =   16460
                                                       F(  6,  2466) =  156.78
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.2895
                                                       Root MSE      =  847.33

                              (Std. Err. adjusted for 2467 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
        incm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -594.8596   31.80682   -18.70   0.000    -657.2304   -532.4887
         age |   100.9827   7.325995    13.78   0.000       86.617    115.3485
        age2 |  -1.153834   .0934155   -12.35   0.000    -1.337015   -.9706534
     partner |   155.5618   30.87778     5.04   0.000     95.01275    216.1109
      ed_sec |   381.0247   30.36183    12.55   0.000     321.4874     440.562
      ed_deg |   1076.837   64.44019    16.71   0.000     950.4745    1203.199
       _cons |  -866.2836   125.9787    -6.88   0.000    -1113.319   -619.2486
Think about the female coefficient a bit more. Could it be to do with women working shorter hours?

Control for weekly hours of work

. reg incm female age age2 partner ed_sec ed_deg hrsm if age >= 17 & age <= 64, cluster(pid)

Linear regression                                      Number of obs =   13998
                                                       F(  7,  2262) =  247.67
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.4580
                                                       Root MSE      =  690.95

                              (Std. Err. adjusted for 2263 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
        incm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -314.6874   34.32954    -9.17   0.000    -382.0081   -247.3667
         age |   79.55289   6.372918    12.48   0.000     67.05551    92.05027
        age2 |   -.873335   .0817518   -10.68   0.000    -1.033651   -.7130186
     partner |   148.0265   26.07885     5.68   0.000     96.88551    199.1675
      ed_sec |     340.68   26.67171    12.77   0.000     288.3764    392.9835
      ed_deg |   996.7434   59.88369    16.64   0.000     879.3107    1114.176
        hrsm |   5.654682   .2467777    22.91   0.000     5.170747    6.138616
       _cons |  -1495.805   111.8223   -13.38   0.000     -1715.09    -1276.52

Is the coefficient on hours of work reasonable?
- £5.65 for every additional hour worked – certainly in the right ball park.
Looking at 2 specifications together

Without hours (N = 16460, R-squared = 0.2895) versus with hours (N = 13998, R-squared = 0.4580):

              Without hrsm               With hrsm
              Coef.      Robust SE       Coef.      Robust SE
   female   -594.8596    31.80682      -314.6874    34.32954
      age    100.9827    7.325995       79.55289    6.372918
     age2   -1.153834    .0934155       -.873335    .0817518
  partner    155.5618    30.87778       148.0265    26.07885
   ed_sec    381.0247    30.36183         340.68    26.67171
   ed_deg    1076.837    64.44019       996.7434    59.88369
     hrsm           .           .       5.654682    .2467777
    _cons   -866.2836    125.9787      -1495.805    111.8223

- R-squared jumps from 29% to 46%
- Coefficient on female goes from -595 to -315
- Almost half the effect of gender is explained by women's shorter hours of work
- Age, partner and education coefficients are also reduced in magnitude, for similar reasons
- Number of observations reduces from 16460 to 13998 – missing data on hours
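A convenient way to lay two specifications out side by side in Stata is to store each one and tabulate them together; a minimal sketch (the model names nohours and withhours are illustrative):

. quietly reg incm female age age2 partner ed_sec ed_deg if age >= 17 & age <= 64, cluster(pid)
. estimates store nohours
. quietly reg incm female age age2 partner ed_sec ed_deg hrsm if age >= 17 & age <= 64, cluster(pid)
. estimates store withhours
. estimates table nohours withhours, se stats(N r2)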
Interesting post-estimation activities

(Using the regression with hours of work from the previous slide: N = 13998, F(7, 2262) = 247.67, R-squared = 0.4580, Root MSE = 690.95.)

What age does income peak?
- Income = (other terms) + β1*age + β2*age2
- d(Income)/d(age) = β1 + 2*β2*age
- The derivative is zero when age = -β1/(2*β2)
- = -79.552/(-0.873*2) = 45.5
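The same turning point, with a delta-method standard error, can also be obtained directly after the regression; a minimal sketch using nlcom (coefficient names as in the output above):

. nlcom -_b[age] / (2*_b[age2])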
Is the effect of university qualifications statistically different from the effect of secondary education?

. test ed_sec = ed_deg

 ( 1)  ed_sec - ed_deg = 0

       F(  1,  2262) =   110.75
            Prob > F =    0.0000
A closer look at the "couple" coefficient

. bysort female: reg incm age age2 partner ed_sec ed_deg hrsm if age >= 17 & age <= 64, cluster(pid)

-> female = 0

Linear regression                                      Number of obs =    6776
                                                       F(  6,  1095) =  115.53
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.3452
                                                       Root MSE      =  787.93

                              (Std. Err. adjusted for 1096 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
        incm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   113.3119   10.56356    10.73   0.000      92.5848     134.039
        age2 |  -1.257366   .1316253    -9.55   0.000    -1.515633      -.9991
     partner |    213.351    46.9817     4.54   0.000     121.1667    305.5354
      ed_sec |   356.7472   41.91151     8.51   0.000     274.5113    438.9832
      ed_deg |   1082.255   89.21241    12.13   0.000     907.2087    1257.302
        hrsm |   3.930412   .3784925    10.38   0.000      3.18776    4.673065
       _cons |  -1907.107   175.5681   -10.86   0.000    -2251.595    -1562.62

-> female = 1

Linear regression                                      Number of obs =    7222
                                                       F(  6,  1166) =  125.30
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.4830
                                                       Root MSE      =  564.65

                              (Std. Err. adjusted for 1167 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
        incm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   56.20989   7.327688     7.67   0.000     41.83296    70.58682
        age2 |  -.6229372   .0961605    -6.48   0.000    -.8116041   -.4342702
     partner |   84.15365   29.27677     2.87   0.004      26.7126    141.5947
      ed_sec |   277.2823   31.66175     8.76   0.000     215.1619    339.4026
      ed_deg |   819.3002   73.74637    11.11   0.000     674.6098    963.9906
        hrsm |   6.806946   .3051015    22.31   0.000     6.208337    7.405556
       _cons |  -1382.844   133.0607   -10.39   0.000    -1643.909   -1121.779

- Men benefit much more than women from being in a couple.
- Other coefficients also differ between men and women, but with the current specification we can't test whether the differences are significant.
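One way to test formally whether the men's and women's coefficients differ is to pool the sexes and add an interaction term; a minimal sketch for the "couple" coefficient (the variable name fem_partner is illustrative):

. gen fem_partner = female*partner
. reg incm female age age2 partner fem_partner ed_sec ed_deg hrsm if age >= 17 & age <= 64, cluster(pid)
. test fem_partner        // is the partner effect different for women?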
Logit and Probit

- Developed for discrete (categorical) dependent variables
- E.g., psychological morbidity, whether one has a job…. Think of other examples.
- Outcome variable is always 0 or 1. Estimate:

  Pr(Y = 1) = F(X, β)
  Pr(Y = 0) = 1 - F(X, β)

- OLS (the linear probability model) would set F(X, β) = X'β + ε
- Inappropriate because:
  - Heteroscedasticity: the outcome variable is always 0 or 1, so ε only takes the value -x'β or 1-x'β
  - More seriously, one cannot constrain estimated probabilities to lie between 0 and 1.
Logit and Probit

- Looking for a function which lies between 0 and 1:
- Cumulative normal distribution: Probit model

  Pr(Y = 1) = ∫ from -∞ to x'β of φ(t) dt = Φ(x'β)

- Logistic distribution: Logit (logistic) model

  Pr(Y = 1) = e^(x'β) / (1 + e^(x'β)) = Λ(x'β)

- They are very similar! Note how both functions lie between 0 and 1 (vertical axis)
- http://www.gseis.ucla.edu/courses/ed231c/notes3/probit1.html
Maximum likelihood estimation

- Likelihood function: product of
  - Pr(y=1) = F(x'β) for all observations where y=1
  - Pr(y=0) = 1 - F(x'β) for all observations where y=0
  - (think of the probability of flipping exactly four heads and two tails in six coin tosses)
- Log likelihood written as

  ln L = Σ over j in S of w_j·ln F(x_j'β)  +  Σ over j not in S of w_j·ln[1 - F(x_j'β)]

  where S is the set of observations with y=1 and the w_j are weights
- Estimated using an iterative procedure
  - STATA chooses starting values for the β's
  - Computes the slopes of the likelihood function at these values
  - Adjusts the β's accordingly
  - Stops when the slope of the LF is ≈ 0
  - Can take time!
Let's look at whether a person works

. tab jbstat, m

  current economic |
          activity |      Freq.     Percent        Cum.
-------------------+-----------------------------------
   missing or wild |         13        0.03        0.03
                -7 |         66        0.18        0.21
      not answered |          2        0.01        0.22
     self-employed |      2,204        5.87        6.08
          employed |     14,702       39.15       45.24
        unemployed |      1,120        2.98       48.22
           retired |      4,726       12.59       60.80
   maternity leave |        320        0.85       61.66
       family care |      1,964        5.23       66.89
  ft studt, school |      1,394        3.71       70.60
  lt sick, disabld |      1,057        2.81       73.41
   gvt trng scheme |         67        0.18       73.59
             other |        124        0.33       73.92
                 . |      9,793       26.08      100.00
-------------------+-----------------------------------
             Total |     37,552      100.00

. gen byte work = (jbstat == 1 | jbstat == 2) if jbstat >= 1 & jbstat != .
Logit regression: whether have a job

All the iterations:

. logit work female age age2 badhealth partner ed_sec ed_deg nkids if age >= 22 & age <= 60, cluster(pid)

Iteration 0:   log pseudolikelihood = -9174.0313
Iteration 1:   log pseudolikelihood = -7909.9067
Iteration 2:   log pseudolikelihood = -7838.4288
Iteration 3:   log pseudolikelihood = -7838.2372
Iteration 4:   log pseudolikelihood = -7838.2372

Logistic regression                               Number of obs   =     17268
                                                  Wald chi2(8)    =    613.59
                                                  Prob > chi2     =    0.0000
Log pseudolikelihood = -7838.2372                 Pseudo R2       =    0.1456

                              (Std. Err. adjusted for 2430 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
        work |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -.8001156    .090802    -8.81   0.000    -.9780842    -.622147
         age |   .3578242   .0282831    12.65   0.000     .3023904    .4132581
        age2 |  -.0046546   .0003504   -13.28   0.000    -.0053414   -.0039679
   badhealth |  -.5213826   .0404068   -12.90   0.000    -.6005785   -.4421868
     partner |   .4681257   .0943383     4.96   0.000     .2832261    .6530253
      ed_sec |    .602653   .0900282     6.69   0.000     .4262009    .7791051
      ed_deg |   .8734892   .1468462     5.95   0.000      .585676    1.161302
       nkids |   -.477714   .0391116   -12.21   0.000    -.5543714   -.4010567
       _cons |  -3.666352   .5216913    -7.03   0.000    -4.688848   -2.643856

Notes:
- The chi-squared statistic is 2 * (LL of this model – LL of the null model)
- Pseudo R2: interpret like R-squared, but it is computed differently
- From these coefficients we can tell whether the estimated effects are positive or negative
- Whether they're significant
- Something about effect sizes – but it is difficult to draw inferences from the coefficients themselves
Comparing logit and probit

               Logit     Probit    Probit * 1.6
    female    -0.800     -0.455       -0.728
       age     0.358      0.206        0.330
      age2    -0.005     -0.003       -0.004
 badhealth    -0.521     -0.300       -0.479
   partner     0.468      0.284        0.455
    ed_sec     0.603      0.343        0.548
    ed_deg     0.873      0.476        0.762
     nkids    -0.478     -0.275       -0.441
     _cons    -3.666     -2.112       -3.380

- Scaling factor proposed by Amemiya (1981)
- Multiply Probit coefficients by 1.6 to get an approximation to the Logit coefficients
- Other authors have suggested a factor of 1.8
Marginal effects

- After logit or Probit estimation, type "mfx" into the command line
- Calculates marginal effects of each of the RHS variables on the dependent variable
  - Slope of the function for continuous variables
  - Effect of a change from 0 to 1 in a dummy variable
  - Can also calculate elasticities
- By default, calculates mfx at the means of the RHS variables
- Can also calculate at medians, or at specified points

. mfx

Marginal effects after logit
      y  = Pr(work) (predict)
         = .81734048

variable |     dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]       X
---------+------------------------------------------------------------------
 female* |  -.1182405      .013     -9.09  0.000  -.143723 -.092758   .525712
     age |   .0534214      .004     13.35  0.000   .045578  .061265   39.8687
    age2 |  -.0006949    .00005    -13.90  0.000  -.000793 -.000597   1705.71
badhea~h |  -.0778398    .00633    -12.29  0.000  -.090249 -.065431   2.12746
partner* |   .0755794    .01644      4.60  0.000   .043352  .107806   .770848
 ed_sec* |   .0866619    .01255      6.91  0.000   .062063  .111261   .398077
 ed_deg* |    .105861    .01407      7.52  0.000   .078281  .133441   .134063
   nkids |  -.0713203     .0059    -12.08  0.000  -.082891 -.059749   .732221
---------+------------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1
Marginal effects

               Logit     Probit       OLS
  female*     -0.118     -0.122    -0.114
      age      0.053      0.056     0.057
     age2     -0.001     -0.001    -0.001
 badhea~h     -0.078     -0.081    -0.086
 partner*      0.076      0.082     0.075
  ed_sec*      0.087      0.090     0.094
  ed_deg*      0.106      0.109     0.118
    nkids     -0.071     -0.075    -0.077
 Constant          .          .    -0.045

- Logit and Probit mfx are very similar indeed
- OLS (the linear probability model) is actually not too bad
Odds ratios

- Only an option with logit
- Type "or" after the comma, as an option
- Reports odds ratios: that is, how many times more (or less) likely the outcome becomes
  - if the variable is 1 rather than 0, in the case of a dichotomous variable
  - for each unit increase of the variable, for a continuous variable
- Results > 1 show increased odds (the outcome becomes more likely), results < 1 show a decrease

. logit work female age age2 badhealth partner ed_sec ed_deg nkids if age >= 22 & age <= 60, cluster(pid) or

(iteration log as before)

Logistic regression                               Number of obs   =     17268
                                                  Wald chi2(8)    =    613.59
                                                  Prob > chi2     =    0.0000
Log pseudolikelihood = -7838.2372                 Pseudo R2       =    0.1456

                              (Std. Err. adjusted for 2430 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
        work | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |    .449277   .0407952    -8.81   0.000     .3760308    .5367907
         age |   1.430214   .0404509    12.65   0.000     1.353089    1.511735
        age2 |   .9953562   .0003488   -13.28   0.000     .9946728      .99604
   badhealth |   .5936991   .0239895   -12.90   0.000     .5484942    .6426296
     partner |   1.596998    .150658     4.96   0.000     1.327405    1.921345
      ed_sec |   1.826959   .1644779     6.69   0.000     1.531428    2.179521
      ed_deg |   2.395254   .3517339     5.95   0.000     1.796205    3.194091
       nkids |   .6201995    .024257   -12.21   0.000     .5744333    .6696121
Other post-estimation commands

- Likelihood ratio test "lrtest"
  - Adding an extra variable to the RHS always increases the likelihood
  - But, does it add "enough" to the likelihood?
  - The LR test compares L0 and L1 (Lrestricted and Lunrestricted): the statistic 2(lnL1 - lnL0) is chi-squared with d.f. equal to the number of variables you are dropping.
  - Null hypothesis: the restricted specification.
  - Only works on nested models, i.e. where the RHS variables in one model are a subset of the RHS variables in the other.
- How to do it (see the sketch below)
  - Run the full model
  - Type "estimates store NAME"
  - Run a smaller model
  - Type "estimates store ANOTHERNAME"
  - ….. and so on for as many models as you like
  - Type "lrtest NAME ANOTHERNAME"
- Be careful…..
  - Sample sizes must be the same for both models
  - This won't happen if the dropped variable is missing for some observations
  - Solve the problem by running the biggest model first and using e(sample)
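A minimal sketch of that workflow (variable and model names are illustrative, following the worked example on the next slide):

. logit work age age2 badhealth partner ed_sec ed_deg nkids r_*      // biggest model first
. estimates store ALL
. quietly logit work age age2 badhealth partner ed_sec ed_deg nkids if e(sample)   // same estimation sample
. estimates store SMALLER
. lrtest ALL SMALLER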
. do "C:\DOCUME~1\maria\LOCALS~1\Temp\STD03000000.tmp"
.
logit
LR test - example
work age age2 badhealth partner ed_sec ed_deg nkids r_*
Iteration 0:
Iteration 1:
Iteration 2:
Iteration

Similar3:but
Iteration 4:
if age >= 21 & age <= 60 & wave == 15
log likelihood = -548.06325
log likelihood = -480.90757
log likelihood = -477.04783
logidentical
likelihood
= -477.02974
not
regression
to previous
log likelihood = -477.02974
examples

Add regional variables, decide which ones to keep
Logistic regression
Number of obs

Looks as though Scotland might stay, also possibly SW,
NW, N
LR chi2(14)
=
=
Prob > chi2
=
Log
likelihood
= age
-477.02974
=
.
logit work
age2 badhealth partner ed_secPseudo
ed_degR2
nkids r_*
work
Coef.
age
age2
badhealth
partner
ed_sec
ed_deg
nkids
r_lon
r_mid
r_sw
r_nw
r_nth
r_wls
r_sco
_cons
.3093037
-.0039101
-.5337105
.244436
.7737744
1.356818
-.3589658
-.5363941
-.3796683
-.7379424
-.6369382
-.6270993
-.4251579
-1.183256
-3.042685
Std. Err.
.0599295
.000729
.0893498
.1984392
.1749884
.2734631
.0813107
.3247594
.2807851
.34484
.3140179
.2940544
.3862621
.3413128
1.133771
.
estimates store ALL
.
quietly logit
.
estimates store DROPREG
.
quietly logit
.
estimates store KEEP4
.
quietly logit
.
estimates store KEEPSCOT
z
5.16
-5.36
-5.97
1.23
4.42
4.96
-4.41
-1.65
-1.35
-2.14
-2.03
-2.13
-1.10
-3.47
-2.68
P>|z|
0.000
0.000
0.000
0.218
0.000
0.000
0.000
0.099
0.176
0.032
0.043
0.033
0.271
0.001
0.007
1066
142.07
0.0000
0.1296
if
age >= 21 & age <= 60 & wave == 15
[95% Conf. Interval]
.1918441
-.0053389
-.708833
-.1444977
.4308035
.8208405
-.5183318
-1.172911
-.929997
-1.413816
-1.252402
-1.203435
-1.182218
-1.852216
-5.264836
.4267633
-.0024814
-.358588
.6333698
1.116745
1.892796
-.1995998
.1001226
.1706604
-.0620684
-.0214744
-.0507633
.3319019
-.5142949
-.8205342
work age age2 badhealth partner ed_sec ed_deg nkids
if e(sample)
work age age2 badhealth partner ed_sec ed_deg nkids r_sco r_sw r_nw r_nth if e(sample)
work age age2 badhealth partner ed_sec ed_deg nkids r_sco
if e(sample)
LR test - example

. lrtest ALL DROPREG
Likelihood-ratio test                             LR chi2(7)  =     14.19
(Assumption: DROPREG nested in ALL)               Prob > chi2 =    0.0479
                                                  => REJECT nested specification

. lrtest ALL KEEP4
Likelihood-ratio test                             LR chi2(3)  =      3.34
(Assumption: KEEP4 nested in ALL)                 Prob > chi2 =    0.3422
                                                  => DON'T REJECT nested spec

. lrtest ALL KEEPSCOT
Likelihood-ratio test                             LR chi2(6)  =      7.60
(Assumption: KEEPSCOT nested in ALL)              Prob > chi2 =    0.2689

. lrtest KEEP4 KEEPSCOT
Likelihood-ratio test                             LR chi2(3)  =      4.26
(Assumption: KEEPSCOT nested in KEEP4)            Prob > chi2 =    0.2347

. lrtest KEEPSCOT DROPREG
Likelihood-ratio test                             LR chi2(1)  =      6.59
(Assumption: DROPREG nested in KEEPSCOT)          Prob > chi2 =    0.0102

- Reject dropping all regional variables against keeping the full set
- Don't reject dropping all but 4, over keeping the full set
- Don't reject dropping all but Scotland, over keeping the full set
- Don't reject dropping all but Scotland, over dropping all but 4
- [and just to check: DO reject dropping all regional variables against dropping all but Scotland]
Again, the specification is illustrative only

- This is not an example of a "finished" labour supply model!
- How could one improve the model?

Model specification

- Theoretical considerations
- Empirical considerations
- Parsimony
- Stepwise regression techniques
- Regression diagnostics
- Interpreting results
- Spotting "unreasonable" results
Other models

Other models to be aware of, but not covered on this course:

- Ordered models (ologit, oprobit) for ordered outcomes
  • Levels of education
  • Number of children
  • Excellent, good, fair or poor health
- Multinomial models (mlogit, mprobit) for multiple outcomes with no obvious ordering
  • Working in the public, private or voluntary sector
  • Choice of nursery, childminder or playgroup for pre-school care
- Heckman selection model
  • For modelling two-stage processes
  • Earnings, conditional on having a job at all
  • Having a job is modelled as a probit, earnings are modelled as OLS
  • Used particularly for women's earnings
- Tobit model for censored or truncated data
  • Typically, for data where there are lots of zeros
  • Expenditure on rarely-purchased items, e.g. cars
  • Children's weights, in an experiment where the scales broke and gave a minimum reading of 10kg
Competence in STATA

- Best results in this course if you already know how to use STATA competently.
- Check you know how to
  - Get data into STATA (use and using commands)
  - Manipulate data (merge, append, rename, drop, save)
  - Describe your data (describe, tabulate, table)
  - Create new variables (gen, egen)
  - Work with subsets of data (if, in, by)
  - Do basic regressions (regress, logit, probit)
  - Run sessions interactively and in batch mode
  - Organise your datasets and do-files so you can find them again.
- If you can't do these, upgrade your knowledge ASAP!
- Could enroll in STATA net course 101
  - Costs $110
  - ESRC might pay
  - Courses run regularly
  - www.stata.com
SC968
Panel data methods for sociologists
Lecture 1, part 2
Introducing Longitudinal Data
Overview

- Cross-sectional and longitudinal data
- Types of longitudinal data
- Types of analysis possible with panel data
- Data management – merging, appending, long and wide forms
- Simple models using longitudinal data
Cross-sectional and longitudinal data

- First, draw the distinction between macro- and micro-level data
  - Micro level: firms, individuals
  - Macro level: local authorities, travel-to-work areas, countries, commodity prices
- Both may exist in cross-sectional or longitudinal forms
- We are interested in micro-level data
- But macro-level variables are often used in conjunction with micro-data
Cross-sectional data

- Contains information collected at a given point in time
- (More strictly, during a given time window)
  • Workplace Industrial Relations Survey (WIRS)
  • General Household Survey (GHS)
- Many cross-sectional surveys are repeated annually, but on different individuals

Longitudinal data

- Contains repeated observations on the same subjects
Types of longitudinal data

- Time-series data
  - Repeated observations at regular intervals
  - E.g., commodity prices, exchange rates
- Cohort studies
  - Repeated interviews at irregular intervals
  - UK cohort studies: NCDS (1958), BCS70 (1970), MCS (2000)
- "Panel" surveys
  - Repeated interviews at regular intervals
  - Usually annual intervals, sometimes two-yearly
  - BHPS, ECHP, PSID, SOEP
  - Panels are more expensive to collect
- Some surveys have both cross-sectional and panel elements
  - LFS, EU-SILC both have a "rolling panel" element
Other sources of longitudinal data

- Retrospective data (e.g. work or relationship history)
- Linkage with external data (e.g., tax or benefit records) – particularly in Scandinavia
- May be present in both cross-sectional and longitudinal data sets
Analysis with longitudinal data

- The "snapshot" versus the "movie"
- Essentially, longitudinal data allow us to observe how events evolve
- Study "flows" as well as "stocks".
- Example: unemployment
  - Cross-sectional analysis shows a steady 5% unemployment rate
  - Does this mean that everyone is unemployed one year out of five?
  - That 5% of people are unemployed all the time?
  - Or something in between?
  - Very different implications for equality, social policy, etc
The BHPS

- Interviews about 10,000 adults in about 6,000 households
- Interviews repeated annually
- People are followed when they move
- People join the sample if they move in with a sample member
- Household-level information collected from the "head of household"
- Individual-level information collected from people aged 17+
- Young people aged 11-16 fill in a youth questionnaire
- BHPS is being upgraded to Understanding Society
  - Much larger and wider-ranging survey
  - BHPS sample being retained as part of the US sample
- Data set used for this course is a 20% sample of the BHPS, with selected variables
The BHPS

- All files are prefixed with a letter indicating the year
  - 1991: a, 1992: b ………. and so on, so far up to p
  - All variables within each file are also prefixed with this letter
- Several files each year, containing different information
  - hhsamp   information on sample households
  - hhresp   household-level information on households that actually responded
  - indall   info on all individuals in responding households
  - indresp  info on respondents to the main questionnaire (adults)
  - egoalt   file showing the relationship of household members to one another
  - income   incomes
- Extra files each year containing derived variables
  - Work histories, net income files
- And others with occasional modules, e.g. life histories in wave 2
  - bjobhist blifemst bmarriag bcohabit bchildnt
Some BHPS files

Wave 1 (a):   aindall.dta 768.1k    aindresp.dta 10.7M    ahhresp.dta 1626.3k   ahhsamp.dta 330.6k
              aincome.dta 1066.4k   aegoalt.dta 541.3k    ajobhist.dta 303.8k

Wave 2 (b):   bindsamp.dta 635.3k   bindall.dta 978.2k    bindresp.dta 11.0M    bhhresp.dta 1499.7k
              bhhsamp.dta 257.1k    bincome.dta 1073.0k   begoalt.dta 546.5k    bjobhist.dta 237.8k
  Extra modules in Wave 2:
              bchildad.dta 23.5k    bchildnt.dta 284.4k   bcohabit.dta 34.3k    blifemst.dta 766.4k   bmarriag.dta 272.4k

Wave 3 (c):   cindsamp.dta 624.3k   cindall.dta 975.6k    cindresp.dta 11.0M    chhresp.dta 1539.0k
              chhsamp.dta 287.4k    cincome.dta 1008.9k   cegoalt.dta 542.2k    cjobhist.dta 237.8k   clifejob.dta 1675.0k

Wave 4 (d):   dindsamp.dta 616.7k   dindall.dta 943.7k    dindresp.dta 11.2M    dhhresp.dta 1508.9k
              dhhsamp.dta 301.9k    dincome.dta 1019.7k   degoalt.dta 531.8k    djobhist.dta 245.0k
              dyouth.dta 129.0k     (youth module introduced 1994)

Cross-wave identifiers (for following sample members across waves):
              xwaveid.dta 4977.3k   xwlsten.dta 1027.7k
Person and household identifiers

- BHPS (along with other panels such as ECHP and SOEP) is a household survey – so everyone living in a sample household becomes a sample member
- Need identifiers to
  1. Associate the same individual with him- or herself in different waves
  2. Link members of the same household with each other in the same wave – the HID identifier
- Note: no such thing as a longitudinal household!
  - Household composition changes, household location changes…..
  - HID is a cross-sectional concept only!
What it looks like: 4 waves of data, sorted by pid and wave.

. list pid wave hgsex age jbstat mastat in 1/30, clean

            pid   wave    hgsex   age   jbstat     mastat
  1.   10019057      1   female    59   retired    never ma
  2.   10019057      2   female    60   retired    never ma
  3.   10019057      3   female    61   retired    never ma
  4.   10019057      4   female    62   retired    never ma
  5.   10028005      1     male    30   employed   never ma
  6.   10028005      2     male    31   employed   never ma
  7.   10028005      3     male    32   employed   never ma
  8.   10028005      4     male    33   employed   never ma
  9.   10042571      1     male    59   unemploy   never ma
 10.   10042571      3     male    60   lt sick,   never ma
 11.   10042571      4     male    62   retired    never ma
 12.   10051538      1   female    22   unemploy   never ma
 13.   10051538      2   female    23   family c   never ma
 14.   10051538      3   female    24   unemploy   never ma
 15.   10051538      4   female    25   family c   never ma
 16.   10051562      1   female     4   .          .
 17.   10051562      2   female     5   .          .
 18.   10051562      3   female     6   .          .
 19.   10051562      4   female     7   .          .
 20.   10059377      1   female    46   employed   never ma
 21.   10059377      2   female    47   employed   never ma
 22.   10059377      3   female    48   employed   never ma
 23.   10059377      4   female    49   self-emp   never ma
 24.   10064966      1     male    70   retired    widowed
 25.   10064966      2     male    70   retired    widowed
 26.   10064966      3     male    71   retired    widowed
 27.   10064966      4     male    72   retired    widowed
 28.   10076166      1   female    77   retired    widowed
 29.   10076166      2   female    78   retired    widowed
 30.   10076166      3   female    79   retired    widowed

- Observations in rows, variables in columns. Blue stripes on the slide show where one individual ends and another begins
- Person 10042571 is not present at the 2nd wave
- Person 10051562 is a child, so no data on job or marital status
- (Can also use the ,nol option)
. list pid wave hgsex age jbstat mastat in 1/30, clean nol

            pid   wave   hgsex   age   jbstat   mastat
  1.   10019057      1       2    59        4        6
  2.   10019057      2       2    60        4        6
  3.   10019057      3       2    61        4        6
  4.   10019057      4       2    62        4        6
  5.   10028005      1       1    30        2        6
  6.   10028005      2       1    31        2        6
  7.   10028005      3       1    32        2        6
  8.   10028005      4       1    33        2        6
  9.   10042571      1       1    59        3        6
 10.   10042571      3       1    60        8        6
 11.   10042571      4       1    62        4        6
 12.   10051538      1       2    22        3        6
 13.   10051538      2       2    23        6        6
 14.   10051538      3       2    24        3        6
 15.   10051538      4       2    25        6        6
 16.   10051562      1       2     4        .        .
 17.   10051562      2       2     5        .        .
 18.   10051562      3       2     6        .        .
 19.   10051562      4       2     7        .        .
 20.   10059377      1       2    46        2        6
 21.   10059377      2       2    47        2        6
 22.   10059377      3       2    48        2        6
 23.   10059377      4       2    49        1        6
 24.   10064966      1       1    70        4        3
 25.   10064966      2       1    70        4        3
 26.   10064966      3       1    71        4        3
 27.   10064966      4       1    72        4        3
 28.   10076166      1       2    77        4        3
 29.   10076166      2       2    78        4        3
 30.   10076166      3       2    79        4        3
Joining data sets together

[The slides show a data listing of pid, wave, hgsex, age, jbstat, mastat, hlghq1 and hlstat for the first 50 observations, illustrating the two ways of joining data sets: appending adds extra observations (more rows for the same variables), while merging adds extra variables (more columns, here hlghq1 and hlstat) for the same observations.]

Adding extra observations: the "append" command
Adding extra variables: the "merge" command
Whether appending or merging

- The data set you are using at the time is called the "master" data
- The data set you want to merge it with is called the "using" data
- Make sure you can identify observations properly beforehand
- Make sure you can identify observations uniquely afterwards

Appending

- Use this command to add more observations
- Relatively easy
- Check first that you are really adding observations you don't already have (or that if you are adding duplicates, you really want to do this)
- Syntax: append using using_data
- STATA simply sticks the "using" data on the end of the "master" data
- STATA re-orders the variables if necessary.
- If the using data contain variables not present in the master data, STATA sets the values of these variables to missing for the observations that came from the master data (and vice versa if the master data contain variables not present in the using data)
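A minimal sketch of appending two waves into one long file (the file names wave1_data and wave2_data are illustrative; with BHPS files the wave-prefixed variable names would need renaming first):

. use wave1_data, clear          // "master" data
. gen wave = 1
. append using wave2_data        // "using" data stuck on the end
. replace wave = 2 if wave == .  // tag the appended observations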
Merging is more complicated

- Use "merge" to add more variables to a data set

Master data: age.dta (unsorted)      Using data: sex.dta
  pid    wave   age                    pid    wave   sex
  28005   1     30                     19057   1     female
  19057   1     59                     19057   3     female
  28005   2     31                     28005   1     male
  19057   3     61                     28005   2     male
  19057   4     62                     28005   4     male
  28005   4     33                     42571   1     male
                                       42571   3     male

First, make sure both data sets are sorted the same way:

  use sex.dta
  sort pid wave
  save, replace
  use age.dta
  sort pid wave
Merging

Master data: age.dta (sorted)        Using data: sex.dta
  pid    wave   age                    pid    wave   sex
  19057   1     59                     19057   1     female
  19057   3     61                     19057   3     female
  19057   4     62                     28005   1     male
  28005   1     30                     28005   2     male
  28005   2     31                     28005   4     male
  28005   4     33                     42571   1     male
                                       42571   3     male

- Notice that the two data sets don't contain exactly the same observations

. merge 1:1 pid wave using sex

- New in STATA this year: the 1:1 shows you are expecting one "using" observation for each "master" observation

  pid     wave   age   sex      _merge
  19057    1     59    female      3
  19057    3     61    female      3
  19057    4     62    .           1
  28005    1     30    male        3
  28005    2     31    male        3
  28005    4     33    male        3
  42571    1     .     male        2
  42571    3     .     male        2

- STATA creates a variable called _merge after merging
  • 1: observation in master but not using data
  • 2: observation in using but not master data
  • 3: observation in both data sets
- Options available for discarding some observations – see manual
More on merging

- The previous example showed one-to-one merging
- Not every observation was in both data sets, but every observation in the master data was matched with a maximum of one observation in the using data – and vice versa.
- Many-to-one merging:
  - Household-level data sets contain only one observation per household (usually < 1 per person)
  - Regional data (e.g., regional unemployment data): usually one observation per region
  - Sample syntax: merge m:1 hid wave using hhinc_data

Individual-level (master) data:        Household-level (using) data:
  hid    pid     age                     hid    h/h income
  1604   19057   59                      1604    780
  2341   28005   30                      2341   1501
  3569   42571   59                      3569    268
  4301   51538   22                      4301    394
  4301   51562    4                      4956   1601
  4956   59377   46                      5421    225
  5421   64966   70                      6363    411
  6363   76166   77                      6827    743
  6827   81763   71
  6827   81798   72

Merged result:
  hid    pid     age   h/h income
  1604   19057   59      780
  2341   28005   30     1501
  3569   42571   59      268
  4301   51538   22      394
  4301   51562    4      394
  4956   59377   46     1601
  5421   64966   70      225
  6363   76166   77      411
  6827   81763   71      743
  6827   81798   72      743

One-to-many merging

- Job and relationship files contain one observation per episode (potentially > 1 per person)
- Income files contain one observation per source of income (potentially > 1 per person)
- Sample syntax: merge 1:m pid wave using births_data
Long and wide forms

- The data we have here are in "long" form
- One row for each person/wave combination
- From a few slides back: the listing of pid, wave, hgsex and age for the first 30 observations (the same eight individuals shown earlier, one row per person per wave)
Wide form

- However, it is also possible to put longitudinal data into "wide" form
- One observation per person, with different variables relating to different years of data
- Sex doesn't change [usually]; age1 is age at wave 1, and so on

            pid     sex1    age1   age2   age3   age4
  1.   10019057   female      59     60     61     62
  2.   10028005     male      30     31     32     33
  3.   10042571     male      59      .     60     62
  4.   10051538   female      22     23     24     25
  5.   10051562        .       4      5      6      7
  6.   10059377   female      46     47     48     49
  7.   10064966     male      70     70     71     72
  8.   10076166   female      77     78     79     79
The reshape command

- Switching from long to wide:
  • reshape wide [stubnames], i(id) j(year)
- In the BHPS, this becomes
  • reshape wide [stubnames], i(pid) j(wave)
- What are stub names?
  - They are a list of variables which vary between years
  - Variables like sex or ethnicity would not normally be included in this list
- Switching from wide to long: exactly the opposite
  • reshape long [stubnames], i(id) j(wave)
- Lots more info and examples in the STATA manual (see the sketch below)
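A minimal sketch using the variables from the example above (age is the stub name; sex is carried along unchanged; the file name long_data is illustrative):

. use long_data, clear                  // one row per person/wave
. reshape wide age, i(pid) j(wave)      // creates age1 age2 age3 age4, one row per person
. reshape long age, i(pid) j(wave)      // back to one row per person/wave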
Simple models using longitudinal data

- Auto-regressive and time-lagged models
- Models of change

But first: the GHQ

- Used for lots of analysis in the lectures and practical sessions
- General Health Questionnaire
- Different versions: the BHPS carries the GHQ-12, with 12 questions.
- Have you recently:
  - been able to concentrate on whatever you're doing?
  - lost much sleep over worry?
  - felt that you were playing a useful part in things?
  - felt capable of making decisions about things?
  - felt constantly under strain?
  - felt you couldn't overcome your difficulties?
  - been able to enjoy your normal day to day activities?
  - been able to face up to problems?
  - been feeling unhappy or depressed?
  - been losing confidence in yourself?
  - been thinking of yourself as a worthless person?
  - been feeling reasonably happy, all things considered?
- Answer each question on a 4-point scale: not at all - no more than usual - rather more - much more
GHQ

 (ghq) 1: likert |      Freq.     Percent        Cum.
-----------------+-----------------------------------
 missing or wild |        582        2.10        2.10
proxy respondent |      1,202        4.33        6.43
               0 |         77        0.28        6.70
               1 |        109        0.39        7.10
               2 |        149        0.54        7.63
               3 |        288        1.04        8.67
               4 |        504        1.82       10.49
               5 |        867        3.12       13.61
               6 |      2,229        8.03       21.64
               7 |      2,265        8.16       29.80
               8 |      2,355        8.48       38.28
               9 |      2,426        8.74       47.02
              10 |      2,259        8.14       55.16
              11 |      2,228        8.03       63.19
              12 |      2,478        8.93       72.11
              13 |      1,316        4.74       76.85
              14 |      1,115        4.02       80.87
              15 |        876        3.16       84.03
              16 |        714        2.57       86.60
              17 |        635        2.29       88.89
              18 |        499        1.80       90.68
              19 |        439        1.58       92.27
              20 |        381        1.37       93.64
              21 |        318        1.15       94.78
              22 |        276        0.99       95.78
              23 |        264        0.95       96.73
              24 |        220        0.79       97.52
              25 |        134        0.48       98.00
              26 |        103        0.37       98.38
              27 |         96        0.35       98.72
              28 |         59        0.21       98.93
              29 |         66        0.24       99.17
              30 |         47        0.17       99.34
              31 |         47        0.17       99.51
              32 |         35        0.13       99.64
              33 |         26        0.09       99.73
              34 |         20        0.07       99.80
              35 |         29        0.10       99.91
              36 |         26        0.09      100.00
-----------------+-----------------------------------
           Total |     27,759      100.00

- This is HLGHQ1 in the BHPS
- Sum of scores: the LIKERT scale
- We recode values < 0 to missing, and rename it LIKERT
- Consider it as a continuous variable
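A minimal sketch of that recode (hlghq1 is the BHPS variable tabulated above; the negative values are its missing-data codes):

. gen LIKERT = hlghq1 if hlghq1 >= 0    // "missing or wild" and "proxy respondent" codes become missing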
GHQ – subjective wellbeing

 (ghq) 2: caseness |      Freq.     Percent        Cum.
-------------------+-----------------------------------
   missing or wild |        582        2.10        2.10
  proxy respondent |      1,202        4.33        6.43
                 0 |     13,222       47.63       54.06
                 1 |      3,569       12.86       66.92
                 2 |      2,189        7.89       74.80
                 3 |      1,581        5.70       80.50
                 4 |      1,143        4.12       84.61
                 5 |        933        3.36       87.98
                 6 |        720        2.59       90.57
                 7 |        561        2.02       92.59
                 8 |        529        1.91       94.50
                 9 |        417        1.50       96.00
                10 |        385        1.39       97.38
                11 |        343        1.24       98.62
                12 |        383        1.38      100.00

- This is HLGHQ2: the caseness scale
- Recodes answers 3-4 as 1, and adds them up
- Scores above 2 are used to indicate psychological morbidity
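A minimal sketch of constructing a binary morbidity indicator from the caseness score (the variable name ghq_case is illustrative):

. gen byte ghq_case = (hlghq2 > 2) if hlghq2 >= 0 & hlghq2 != .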
Time-lagged models

Start with a simple OLS model. The Likert score is a measure of psychological wellbeing derived from a battery of questions.

. reg LIKERT age age2 female ue_sick partner if age >= 18

      Source |       SS       df       MS              Number of obs =   25108
-------------+------------------------------           F(  5, 25102) =  278.23
       Model |   37842.892     5   7568.5784           Prob > F      =  0.0000
    Residual |  682847.462 25102  27.2029106           R-squared     =  0.0525
-------------+------------------------------           Adj R-squared =  0.0523
       Total |  720690.354 25107  28.7047578           Root MSE      =  5.2156

      LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0797637   .0111716     7.14   0.000     .0578667    .1016607
        age2 |  -.0007342   .0001119    -6.56   0.000    -.0009535    -.000515
      female |   1.593608   .0661958    24.07   0.000      1.46386    1.723356
     ue_sick |   3.562843   .1249977    28.50   0.000      3.31784    3.807846
     partner |   -.044241   .0788756    -0.56   0.575    -.1988419    .1103598
       _cons |   8.298458   .2374816    34.94   0.000      7.83298    8.763936
Generate lagged variable

. sort pid wave
. capture drop LIKERT_lag
. gen LIKERT_lag = LIKERT[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
(14738 missing values generated)

. * check:
. list pid wave LIKERT LIKERT_lag in 1/30, clean

            pid   wave   LIKERT   LIKERT~g
  1.   10019057      1        7          .
  2.   10019057      2       12          7
  3.   10019057      3       10         12
  4.   10019057      4       11         10
  5.   10019057      5       12         11
  6.   10019057      6        .         12
  7.   10019057      7       12          .
  8.   10019057      8       12         12
  9.   10019057      9       12         12
 10.   10019057     10       12         12
 11.   10019057     11       12         12
 12.   10019057     12       12         12
 13.   10019057     13        .         12
 14.   10019057     14       11          .
 15.   10019057     15       11         11
 16.   10028005      1        7          .
 17.   10028005      2        8          7
 18.   10028005      3       12          8
 19.   10028005      4        7         12
 20.   10028005      5        8          7
 21.   10028005      6        8          8
 22.   10028005      7        7          8
 23.   10028005      8        9          7
 24.   10028005      9        7          9
 25.   10028005     10        8          7
 26.   10028005     11        7          8
 27.   10028005     12       12          7
 28.   10028005     13        9         12
 29.   10028005     14       13          9
 30.   10042571      1       11          .

NB: the 1/30 here is just so it will fit on the page. You should check many more observations than this!
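An alternative sketch using Stata's panel time-series operators, which return a missing lag automatically when the previous wave is absent, once the panel structure has been declared:

. xtset pid wave
. gen LIKERT_lag2 = L.LIKERT     // same idea as LIKERT_lag above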
OLS, with lagged dependent variable

. reg LIKERT LIKERT_lag age age2 female ue_sick partner if age >= 18

      Source |       SS       df       MS              Number of obs =   21463
-------------+------------------------------           F(  6, 21456) = 1285.40
       Model |  163104.604     6  27184.1006           Prob > F      =  0.0000
    Residual |  453760.485 21456  21.1484193           R-squared     =  0.2644
-------------+------------------------------           Adj R-squared =  0.2642
       Total |  616865.089 21462  28.7421997           Root MSE      =  4.5987

      LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  LIKERT_lag |   .4752892   .0060424    78.66   0.000     .4634456    .4871327
         age |   .0272471   .0108394     2.51   0.012     .0060011    .0484931
        age2 |  -.0002391   .0001079    -2.22   0.027    -.0004506   -.0000276
      female |   .8414271   .0638746    13.17   0.000     .7162282     .966626
     ue_sick |   2.128451   .1222784    17.41   0.000     1.888777    2.368126
     partner |   .0967488   .0759926     1.27   0.203    -.0522022    .2456999
       _cons |   4.593374   .2365749    19.42   0.000      4.12967    5.057079

- It is also possible to include lagged explanatory variables
- R-squared rockets from 5% to 26%
- Big and very significant coefficient on the lagged variable
- Coefficient on "ue_sick" falls from 3.6 to 2.1
Models of change

Start with an OLS model [simplified, but imagine more variables]:

  yi = α + β·xi + …… + εi

Separate model for each year – the suffix denotes the year:

  yi1 = α + β·xi1 + …… + εi1
  yi2 = α + β·xi2 + …… + εi2

Subtract the 1st from the 2nd model:

  (yi2 - yi1) = β·(xi2 - xi1) + …… + (εi2 - εi1)

Or, express in terms of change:

  Δyi = β·Δxi + …… + Δεi
Generate difference variables

capture drop dif*
sort pid wave
gen dif_LIKERT  = LIKERT  - LIKERT[_n-1]  if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_age     = age     - age[_n-1]     if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_age2    = age2    - age2[_n-1]    if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_female  = female  - female[_n-1]  if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_ue_sick = ue_sick - ue_sick[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_partner = partner - partner[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1

Check you understand why dif_female will [very nearly] always be zero
Check for sensible results!

. list pid wave age dif_age in 1/30, clean

            pid   wave   age   dif_age
  1.   10019057      1    59         .
  2.   10019057      2    60         1
  3.   10019057      3    61         1
  4.   10019057      4    62         1
  5.   10019057      5    63         1
  6.   10019057      6    64         1
  7.   10019057      7    65         1
  8.   10019057      8    66         1
  9.   10019057      9    67         1
 10.   10019057     10    67         0
 11.   10019057     11    68         1
 12.   10019057     12    69         1
 13.   10019057     13    71         2
 14.   10019057     14    71         0
 15.   10019057     15    73         2
 16.   10028005      1    30         .
 17.   10028005      2    31         1
 18.   10028005      3    32         1
 19.   10028005      4    33         1
 20.   10028005      5    34         1
 21.   10028005      6    35         1
 22.   10028005      7    36         1
 23.   10028005      8    37         1
 24.   10028005      9    38         1
 25.   10028005     10    39         1
 26.   10028005     11    40         1
 27.   10028005     12    41         1
 28.   10028005     13    42         1
 29.   10028005     14    43         1
 30.   10042571      1    59         .
More checking….

. list pid wave LIK dif_LIK in 1/30, clean

            pid   wave   LIKERT   dif_LI~T
  1.   10019057      1        7          .
  2.   10019057      2       12          5
  3.   10019057      3       10         -2
  4.   10019057      4       11          1
  5.   10019057      5       12          1
  6.   10019057      6        .          .
  7.   10019057      7       12          .
  8.   10019057      8       12          0
  9.   10019057      9       12          0
 10.   10019057     10       12          0
 11.   10019057     11       12          0
 12.   10019057     12       12          0
 13.   10019057     13        .          .
 14.   10019057     14       11          .
 15.   10019057     15       11          0
 16.   10028005      1        7          .
 17.   10028005      2        8          1
 18.   10028005      3       12          4
 19.   10028005      4        7         -5
 20.   10028005      5        8          1
 21.   10028005      6        8          0
 22.   10028005      7        7         -1
 23.   10028005      8        9          2
 24.   10028005      9        7         -2
 25.   10028005     10        8          1
 26.   10028005     11        7         -1
 27.   10028005     12       12          5
 28.   10028005     13        9         -3
 29.   10028005     14       13          4
 30.   10042571      1       11          .
Obvious problems

- Interview timing may mean a difference of 100% in the age difference variable (a gap of 0 or 2 years instead of 1)
- Most differences are zero
- Moving into unemployment or partnership is given equal and opposite weighting to moving out. There is no real reason why this should be the case
- There are MUCH better ways to use these data!
- Nevertheless, let's proceed!
Results

(The first panel repeats the levels regression of LIKERT shown earlier: N = 25108, R-squared = 0.0525, with coefficients of .0797637 on age, -.0007342 on age2, 1.593608 on female, 3.562843 on ue_sick, -.044241 on partner and a constant of 8.298458.)

. reg dif_LIKERT dif_age - dif_partner if age >= 18

      Source |       SS       df       MS              Number of obs =   21461
-------------+------------------------------           F(  4, 21456) =   44.58
       Model |  5058.58867     4  1264.64717           Prob > F      =  0.0000
    Residual |  608713.117 21456  28.3702982           R-squared     =  0.0082
-------------+------------------------------           Adj R-squared =  0.0081
       Total |  613771.706 21460  28.6007319           Root MSE      =  5.3264

  dif_LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     dif_age |   .3757715   .1227943     3.06   0.002     .1350856    .6164574
    dif_age2 |  -6.24e-07   .0000212    -0.03   0.977    -.0000422     .000041
 dif_ue_sick |   1.857857    .149281    12.45   0.000     1.565255    2.150458
 dif_partner |  -.5948999   .1780776    -3.34   0.001    -.9439452   -.2458545
       _cons |  -.3104285   .1364378    -2.28   0.023    -.5778567   -.0430003

- The coefficient on the age increase is roughly equal and opposite to the constant
- Female drops out (its difference is essentially always zero)
- Coefficients on sick and partner are significant