Some thoughts on statistical modelling in Stata

Scottish Social Survey Network:
Master Class 1
Data Analysis with Stata
Dr Vernon Gayle and Dr Paul Lambert
23rd January 2008, University of Stirling
The SSSN is funded under Phase II of the ESRC
Research Development Initiative
Handling coefficients
1) Some general issues
(Some thoughts on statistical modelling in Stata, and some
tricks and tips …)
2) Using Quasi-variance
Statistical Modelling Process
Model formulation [make assumptions]
Model fitting [quantify systematic relationships & random variation]
(Model criticism) [review assumptions]
Model interpretation [assess results]
Davies and Dale, 1994 p.5
Building Models
REMEMBER – Real data is much more messy,
badly behaved (people do odd stuff), harder to
interpret etc. than the data used in books and at
workshops
Building Models
• Always be guided by substantive theory
(the economists are good at this – but a bit rigid)
• Form of the outcome variable (or process)
• Main effects – more complicated models later
• Don’t use stepwise regression
(stepwise, pr(.05):regress wage married children educ age)
• An example…
A regression model
GHS Data
Y = age left education (years)
X Vars
Female
Social Class
(Advantaged; Lower Supervisory; Semi-routine; Routine)
Age (centred at 40)
Regression Estimates
Model              A       B       C       D       E
Female           -0.32           -0.34           -0.27
Age (40)                 -0.06   -0.06           -0.05
Supervisory                              -1.83   -1.85
SemiRoutine                              -1.98   -1.88
Routine                                  -2.40   -2.33
Constant         17.52   17.5    17.75   18.22   18.54
Linear Regression Models
• A 1 unit change in X leads to a β change in Y
• The β estimates are stable across models A to E, apart from minor,
insignificant random variation (survey data)
• This holds as long as the X vars are uncorrelated
(a classical regression assumption)
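The comparison of models A to E can be sketched with Stata's estimates store and estimates table. This is a minimal illustration only: it assumes the nlsw88 example dataset shipped with Stata rather than the GHS data, so grade, married, age and race are stand-in variables, and the model names simply mirror the slide.

```stata
* Illustrative only: nlsw88 stands in for the GHS data
sysuse nlsw88, clear
* Centre age at 40, as on the slide
generate age40 = age - 40

regress grade i.married
estimates store A
regress grade age40
estimates store B
regress grade i.married age40
estimates store C
regress grade i.race
estimates store D
regress grade i.married age40 i.race
estimates store E

* Coefficients side by side, two decimal places
estimates table A B C D E, b(%9.2f)
```

Reading across a row of the resulting table shows how far each coefficient moves as other variables are added.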
A logit model (non-linear)
GHS Data
Y = Graduate / Non Graduate
X Vars
Female
Social Class
(Advantaged; Lower Supervisory; Semi-routine; Routine)
Age (centred at 40)
Estimates Logit (log scale)
Parameterization ??
Model              A       B       C       D       E
Female           -0.24           -0.23           -0.20
Age (40)                 -0.03   -0.03           -0.04
Supervisory                              -1.46   -1.52
Semi-Routine                             -1.82   -1.87
Routine                                  -2.65   -2.70
Constant         -0.90   -0.80   -0.68   -0.39   -0.04
Logit Model
• Estimates on a log scale
• The β estimates the change in the log odds of y=1 when X1
shifts from X1=0 to X1=1
• Even when the X vars are uncorrelated, including
additional variables can lead to changes in β
estimates
• The β estimates the effect given all other X vars in the
model
• Fixed variance in the logit model (π²/3)
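The point about uncorrelated X vars can be checked by simulation. The sketch below is an illustration under stated assumptions, not part of the original analysis: the seed, the sample size and the true coefficients (1 and 2) are arbitrary choices. Two independent predictors are drawn, a binary outcome is generated from a logit model, and the model is fitted with and without x2.

```stata
clear
set seed 12345
set obs 20000
* Two independent (uncorrelated) predictors
generate x1 = rnormal()
generate x2 = rnormal()
* True model on the log-odds scale: coefficients 1 and 2 (arbitrary)
generate y = runiform() < invlogit(x1 + 2*x2)
* x2 omitted: the estimate for x1 is pulled towards zero
logit y x1
* x2 included: the estimate for x1 should be close to 1
logit y x1 x2
```

Because the residual variance is fixed at π²/3, leaving out x2 rescales the whole model, so the x1 estimate changes even though x1 and x2 are uncorrelated.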
Non-Linear Models
• Be sensible about how you parameterize them
• Be careful interpreting them…
• Don’t throw variables in like a ‘bull in a china shop’
• Model checking – make sure you understand how
the ‘left hand side’ (lhs) is working
• Some bad examples using SARs (large dataset with
many significant X variables)
Reference
• A technical explanation of the issue is
given in
Davies, R.B. (1992) ‘Sample Enumeration
Methods for Model Interpretation’ in P.G.M.
van der Heijden, W. Jansen, B. Francis and
G.U.H. Seeber (eds) Statistical Modelling,
Elsevier.
A Few Tricks
• The outreg command was written by John Luke Gallup and
appears in the Stata Technical Bulletin #59; outreg2,
by Roy Wada, builds on it
• You can download outreg2 from within Stata
• Outputs regression results in a more
‘publishable’ format e.g. in a Word document
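A minimal sketch of the workflow, assuming the auto example dataset shipped with Stata and a hypothetical output file name (results):

```stata
* One-off installation of the user-written command from the SSC archive
ssc install outreg2

* Illustrative model on the auto dataset shipped with Stata
sysuse auto, clear
regress price mpg weight

* Write the results to a file that Word can open
outreg2 using results, word replace
```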
A Few Tricks
• statsby – tells Stata to collect statistics for a
command across a by list
• Attractive because it saves the data simply, and
can be used in Graphs
• In our experience statsby can save a lot of
manual editing of results
• You can re-run the models with small
adjustments; subsequent operations such as
graph generation can be better automated
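A short sketch of statsby, again assuming the auto example dataset rather than any data used in this class:

```stata
sysuse auto, clear
* Fit the same regression separately for each value of foreign and
* replace the data in memory with one row of results per group
statsby _b _se, by(foreign) clear: regress price weight
list
* The saved coefficients are ordinary variables, so they can be graphed
graph bar (mean) _b_weight, over(foreign)
```

Changing the model and re-running the do-file regenerates both the results dataset and the graph, which is where the saving in manual editing comes from.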
Using Quasi-variance to
Communicate Sociological Results
from Statistical Models
Vernon Gayle & Paul S. Lambert
University of Stirling
Gayle and Lambert (2007) Sociology, 41(6):1191-1208.
“One of the useful things about mathematical
and statistical models [of educational
realities] is that, so long as one states the
assumptions clearly and follows the rules
correctly, one can obtain conclusions which
are, in their own terms, beyond reproach.
The awkward thing about these models is
the snares they set for the casual user; the
person who needs the conclusions, and
perhaps also supplies the data, but is
untrained in questioning the assumptions….
…What makes things more difficult is that, in
trying to communicate with the casual user,
the modeller is obliged to speak his or her
language – to use familiar terms in an
attempt to capture the essence of the model.
It is hardly surprising that such an enterprise
is fraught with difficulties, even when the
attempt is genuinely one of honest
communication rather than compliance with
custom or even subtle indoctrination”
(Goldstein 1993, p. 141).
A little biography (or narrative)…
• Since being at the Centre for Applied Stats in 1998/9 I
have been thinking about the issue of model
presentation
• Done some work on Sample Enumeration Methods
with Richard Davies
• Summer 2004 (with David Steele’s help) began to
think about “quasi-variance”
• Summer 2006 began writing a paper with Paul
Lambert
Statistical Models
• Statistical models offer an attractive way for
sociological researchers to summarize patterns from
social survey datasets
• They offer techniques to summarize the joint relative
effects of several different variables in a research
study
• This is achieved by estimating statistical values
(‘parameters’ or ‘coefficient estimates’) that indicate
the magnitude and direction of the effect of each
explanatory variable
• The appropriate sociological interpretation of the
parameter estimates from statistical models is by no
means trivial
The Reference Category Problem
• In standard statistical models the effects of a
categorical explanatory variable are assessed
by comparison to one category (or level) that
is set as a benchmark against which all other
categories are compared
• The benchmark category is usually referred to
as the ‘reference’ or ‘base’ category
The Reference Category Problem
An example of Some English Government
Office Regions
0 = North East of England
---------------------------------------------------------------
1 = North West England
2 = Yorkshire & Humberside
3 = East Midlands
4 = West Midlands
5 = East of England
Government Office Region
Table 1: Logistic regression prediction that self-rated health is ‘good’
(Parameter estimates for model 1)
                             1        2          3        4
                             Beta     Standard   Prob.    95% Confidence
                                      Error               Intervals
No Higher qualifications     -        -          -        -       -
Higher Qualifications        0.65     0.0056     <.001    0.64    0.66
Males                        -        -          -        -       -
Females                      -0.20    0.0041     <.001    -0.21   -0.20
North East                   -        -          -        -       -
North West                   0.09     0.0102     <.001    0.07    0.11
Yorkshire & Humberside       0.12     0.0107     <.001    0.10    0.14
East Midlands                0.15     0.0111     <.001    0.13    0.17
West Midlands                0.13     0.0106     <.001    0.11    0.15
East of England              0.32     0.0107     <.001    0.29    0.34
South East                   0.36     0.0101     <.001    0.34    0.38
South West                   0.26     0.0109     <.001    0.24    0.28
Inner London                 0.17     0.0122     <.001    0.15    0.20
Outer London                 0.27     0.0111     <.001    0.25    0.29
Constant                     0.48     0.0090     <.001    0.46    0.50
                             Beta     Standard   Prob.    95% Confidence
                                      Error               Intervals
North East                   -        -          -        -       -
North West                   0.09     0.0102     <.001    0.07    0.11
Yorkshire & Humberside       0.12     0.0107     <.001    0.10    0.14
Conventional Confidence Intervals
• Since these confidence intervals overlap we might be beguiled
into concluding that the two regions are not significantly
different to each other
• However, this conclusion represents a common
misinterpretation of regression estimates for categorical
explanatory variables
• These confidence intervals are not estimates of the difference
between the North West and Yorkshire and Humberside, but
instead they indicate the difference between each category and
the reference category (i.e. the North East)
• Critically, there is no confidence interval for the reference
category because it is forced to equal zero
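Formally Testing the Difference Between Parameters
$$ t = \frac{\hat{\beta}_2 - \hat{\beta}_3}{\mathrm{s.e.}(\hat{\beta}_2 - \hat{\beta}_3)} $$
The banana skin is here! (the standard error of the difference, in the denominator)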
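Standard Error of the Difference
$$ \mathrm{s.e.}(\hat{\beta}_2 - \hat{\beta}_3) = \sqrt{\operatorname{var}(\hat{\beta}_2) + \operatorname{var}(\hat{\beta}_3) - 2\operatorname{cov}(\hat{\beta}_2, \hat{\beta}_3)} $$
var(β̂2): Variance North West (s.e.²)
var(β̂3): Variance Yorkshire & Humberside (s.e.²)
cov(β̂2, β̂3): Only available in the variance covariance matrix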
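In Stata the covariance term does not have to be looked up by hand. The sketch below is a hedged illustration on the nlsw88 example dataset (not the data behind Table 1), with race standing in for Government Office Region: estat vce lists the variance covariance matrix after estimation, and lincom and test compare two non-reference categories directly.

```stata
* Illustrative only: race stands in for Government Office Region
sysuse nlsw88, clear
logit collgrad i.race age

* Variance covariance matrix of the parameter estimates
estat vce

* Difference between two non-reference categories, with its standard error
lincom 2.race - 3.race

* Equivalent Wald test that the two coefficients are equal
test 2.race = 3.race
```

Both commands use the full matrix, so only the analyst holding the fitted model can run them; a reader of the published table cannot.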
Table 2: Variance Covariance Matrix of Parameter Estimates for the Govt Office Region variable in Model 1
Columns: 1 = North West, 2 = Yorkshire & Humberside, 3 = East Midlands, 4 = West Midlands, 5 = East England, 6 = South East, 7 = South West, 8 = Inner London, 9 = Outer London
Row                         1          2          3          4          5          6          7          8          9
1 North West                .00010483
2 Yorkshire & Humberside    .00007543  .00011543
3 East Midlands             .00007543  .00007543  .00012312
4 West Midlands             .00007543  .00007543  .00007543  .00011337
5 East England              .00007544  .00007543  .00007543  .00007543  .0001148
6 South East                .00007545  .00007544  .00007544  .00007544  .00007545  .00010268
7 South West                .00007544  .00007543  .00007544  .00007543  .00007544  .00007546  .00011802
8 Inner London              .00007552  .00007548  .0000755   .00007547  .00007554  .00007572  .00007558  .00015002
9 Outer London              .00007547  .00007545  .00007546  .00007545  .00007548  .00007555  .00007549  .00007598  .00012356
Diagonal entries are variances; off-diagonal entries are covariances.
Standard Error of the Difference
$$ 0.0083 = \sqrt{0.00010483 + 0.00011543 - 2(0.00007543)} $$
0.00010483: Variance North West (s.e.²)
0.00011543: Variance Yorkshire & Humberside (s.e.²)
0.00007543: Covariance, only available in the variance covariance matrix
Formal Tests
t = -0.03 / 0.0083 = -3.6
Wald χ² = (-0.03 / 0.0083)² = 12.97; p = 0.0003
Remember – earlier because the two sets of
confidence intervals overlapped we could wrongly
conclude that the two regions were not
significantly different to each other
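The arithmetic can be checked with Stata's display command. This is a small sketch using only the figures quoted above; the results should match the reported 0.0083 and p = 0.0003 up to rounding.

```stata
* Standard error of the difference from the variance covariance matrix entries
display sqrt(0.00010483 + 0.00011543 - 2*0.00007543)
* p-value for the reported Wald chi-squared statistic on 1 degree of freedom
display chi2tail(1, 12.97)
```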
Comment
• Only the primary analyst has the
opportunity to make formal comparisons
• Reporting the matrix is seldom, if ever, feasible
in paper-based publications
• In a model with q parameters there would, in
general, be ½q (q-1) covariances to report
Firth’s Method (made simple)
$$ \text{s.e. difference} \approx \sqrt{\operatorname{quasivar}(\hat{\beta}_2) + \operatorname{quasivar}(\hat{\beta}_3)} $$
Table 1: Logistic regression prediction that self-rated health is ‘good’ (Parameter estimates for model 1, featuring
conventional regression results, and quasi-variance statistics)
                             1        2          3        4                 5
                             Beta     Standard   Prob.    95% Confidence    Quasi-Variance
                                      Error               Intervals
No Higher qualifications     -        -          -        -       -         -
Higher Qualifications        0.65     0.0056     <.001    0.64    0.66      -
Males                        -        -          -        -       -         -
Females                      -0.20    0.0041     <.001    -0.21   -0.20     -
North East                   -        -          -        -       -         0.0000755
North West                   0.09     0.0102     <.001    0.07    0.11      0.0000294
Yorkshire & Humberside       0.12     0.0107     <.001    0.10    0.14      0.0000400
Firth’s Method (made simple)
$$ \text{s.e. difference} \approx \sqrt{\operatorname{quasivar}(\hat{\beta}_2) + \operatorname{quasivar}(\hat{\beta}_3)} $$
$$ 0.0083 = \sqrt{0.0000294 + 0.0000400} $$
t = (0.09-0.12) / 0.0083 = -3.6
Wald χ² = (-0.03 / 0.0083)² = 12.97; p = 0.0003
These results are identical to the results calculated by
the conventional method
The QV based ‘comparison intervals’ no longer overlap
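Two quick checks show how the quasi-variances relate to the conventional results. These are illustrative calculations from the rounded figures in the tables above, not output from the original analysis. The comparison interval is assumed here to be the estimate plus or minus 1.96 quasi standard errors, and the subscript NE (North East, the reference category) is notation introduced for this sketch.

```latex
% Check 1: the conventional variance of North West (a contrast with the
% North East reference category) is recovered from two quasi-variances
\operatorname{var}(\hat{\beta}_2) \approx \operatorname{quasivar}(\hat{\beta}_{NE}) + \operatorname{quasivar}(\hat{\beta}_2)
  = 0.0000755 + 0.0000294 = 0.0001049
\qquad \sqrt{0.0001049} \approx 0.0102

% Check 2: comparison intervals for North West and Yorkshire & Humberside
\hat{\beta}_2 \pm 1.96\sqrt{0.0000294} = 0.09 \pm 0.0106 \;\Rightarrow\; (0.079,\ 0.101)
\hat{\beta}_3 \pm 1.96\sqrt{0.0000400} = 0.12 \pm 0.0124 \;\Rightarrow\; (0.108,\ 0.132)
```

The first check reproduces the conventional standard error of 0.0102 for North West; the second gives intervals that do not overlap, in line with the formal test.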
Firth QV Calculator (on-line)
Information from the Variance-Covariance
Matrix Entered into the Data Window (Model 1)
0
0 0.00010483
0 0.00007543 0.00011543
0 0.00007543 0.00007543 0.00012312
0 0.00007543 0.00007543 0.00007543 0.00011337
0 0.00007544 0.00007543 0.00007543 0.00007543 0.00011480
0 0.00007545 0.00007544 0.00007544 0.00007544 0.00007545 0.00010268
0 0.00007544 0.00007543 0.00007544 0.00007543 0.00007544 0.00007546 0.00011802
0 0.00007552 0.00007548 0.00007550 0.00007547 0.00007554 0.00007572 0.00007558 0.00015002
0 0.00007547 0.00007545 0.00007546 0.00007545 0.00007548 0.00007555 0.00007549 0.00007598 0.00012356
Conclusion –
We should start using this method
Benefits
• Overcomes the reference category problem when presenting models
• Provides reliable results (even though based on an approximation)
• Easy(ish) to calculate
Costs
• Extra column in results
• Time convincing colleagues that this is a good thing
Conclusion –
Why have we told you this…
• Categorical X vars are ubiquitous
• Interpretation of coefficients is critical to
sociological analyses
– Subtleties / slipperiness
– Emphasis often on precision rather than
communication (e.g. in economics)