UKSUG10.DSouza
A Stata program for
calibration weighting
John D’Souza
National Centre for Social Research
Outline
Description of calibration
Adjust selection weights so that a weighted
sample exactly matches the population
Generalizes post-stratification
Several methods: Linear, logistic …
SAS, GenStat
A new Stata program
Limitations and extensions
Sampling
Selection weights: $d_k = 1/P(\text{person } k \text{ is chosen})$
Sample frame variables $X_{k1}, \dots, X_{kJ}$ with known population totals $P_1, \dots, P_J$.
Horvitz-Thompson estimator of $P_i$:
$\sum_{k \in S} d_k X_{ki} \approx P_i$ for $i = 1, 2, \dots, J$.
Calibration: adjust $d_k$ to get calibration weights, $w_k$, giving exact equality:
$\sum_{k \in S} w_k X_{ki} = P_i$ for $i = 1, 2, \dots, J$.
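As a small illustration of the gap that calibration closes, the sketch below computes selection weights and Horvitz-Thompson totals for a toy sample. It is written in Python rather than Stata so that it is self-contained; the data and the names `prob`, `X`, `P`, `ht` and `gap` are made up for this example.

```python
# Hypothetical toy sample: selection probabilities and two frame variables.
prob = [0.1, 0.1, 0.2, 0.25]           # P(person k is chosen)
X = [                                   # rows are people, columns are X_k1, X_k2
    [1, 0],
    [0, 1],
    [1, 0],
    [0, 1],
]
P = [16, 13]                            # assumed known population totals

d = [1 / p for p in prob]               # selection weights d_k = 1 / P(k is chosen)

# Horvitz-Thompson estimate of each total: sum over k of d_k * X_ki
ht = [sum(dk * row[i] for dk, row in zip(d, X)) for i in range(len(P))]

# The estimates are close to P but not equal; calibration removes this gap.
gap = [p - h for p, h in zip(P, ht)]
```

Here the weighted totals are 15 and 14 against population totals of 16 and 13, so the calibration weights have to move the first total up and the second down.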
Example: School Census
Variables include
Age, Gender, Ethnic Group, Exam results
Type of School, Region
Pupil’s Free School Meal eligibility
We calibrate to J variables, e.g.
Boy (binary)
Girl (binary)
Region (e.g. four categories)
FSM eligibility (binary)
J = 1 + 1 + (4 - 1) + 1 = 6
Special case: post-stratification
Simplest case:
One categorical variable
Easy to deal with (post-stratification)
svyset , poststrata() postweight()
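Post-stratification on one categorical variable can be sketched in a few lines: within each category, every weight is scaled by the ratio of the population total to the weighted sample total. The Python below is an illustration with made-up data, not the internals of `svyset`; the names `pop`, `stratum_total` and `w` are assumptions.

```python
from collections import defaultdict

# Hypothetical sample: each person has a selection weight and a region.
d = [10.0, 10.0, 5.0, 5.0, 20.0]
region = ["N", "N", "S", "S", "S"]
pop = {"N": 25.0, "S": 45.0}            # assumed known population counts

# Weighted sample total in each post-stratum
stratum_total = defaultdict(float)
for dk, r in zip(d, region):
    stratum_total[r] += dk

# Scale every weight in a stratum by (population total / weighted total)
w = [dk * pop[r] / stratum_total[r] for dk, r in zip(d, region)]

# Check: the adjusted weights reproduce the population counts exactly
check = defaultdict(float)
for wk, r in zip(w, region):
    check[r] += wk
```

After the adjustment the weights in region N sum to 25 and those in region S sum to 45.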
More general case:
Several variables (categorical and numerical)
Deville and Särndal (1992).
Minimize the “distance” between w and d
subject to the J calibration constraints.
Linear calibration: Minimize
$\sum_{k \in S} (w_k - d_k)^2 / d_k$
Involves solving J simultaneous linear equations
Logistic calibration: Minimize
$\sum_{k \in S} \big( w_k \log(w_k / d_k) - w_k + d_k \big)$
Involves solving J simultaneous non-linear equations
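For the linear distance, the constrained minimum has the closed form $w_k = d_k (1 + x_k^{\top} \lambda)$, where $\lambda$ solves the J linear equations $\big(\sum_{k \in S} d_k x_k x_k^{\top}\big) \lambda = P - \sum_{k \in S} d_k x_k$. The Python sketch below works through that on a toy sample. It is an illustration of the technique, not the code of the Stata program described in these slides; the helper names `solve` and `linear_calibrate` and the data are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular or nearly singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def linear_calibrate(d, X, P):
    """Return w_k = d_k * (1 + x_k . lam) with sum_k w_k x_ki = P_i."""
    J = len(P)
    # T = sum_k d_k x_k x_k' (J x J); t = sum_k d_k x_k (current weighted totals)
    T = [[sum(dk * x[i] * x[j] for dk, x in zip(d, X)) for j in range(J)]
         for i in range(J)]
    t = [sum(dk * x[i] for dk, x in zip(d, X)) for i in range(J)]
    lam = solve(T, [p - ti for p, ti in zip(P, t)])
    return [dk * (1 + sum(l * xi for l, xi in zip(lam, x)))
            for dk, x in zip(d, X)]


# Hypothetical data: columns are boy, girl, FSM
d = [10.0, 10.0, 10.0, 10.0]
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
P = [22.0, 20.0, 18.0]

w = linear_calibrate(d, X, P)
totals = [sum(wk * x[i] for wk, x in zip(w, X)) for i in range(3)]
```

With these inputs the calibrated weights are 9.5, 12.5, 8.5 and 11.5, and the weighted totals equal P. Nothing in the linear method stops a weight from going negative when the required adjustment is large.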
GenStat, SAS, Stata
GenStat and SAS
Methods: linear, logistic and bounded.
Estimation: GenStat gives SEs.
SAS handles categorical variables directly.
Enter as indicator variables in GenStat.
Stata
Post-stratification (calibration to one
categorical variable). Gives SEs.
No routine for general calibration.
A new Stata program
Typical syntax.
* Population totals, in the same order as the variables in marginals()
matrix M = (10000, 10000, 3000, 4000, 3000, 8000)
calibrate , entrywt(w1) exitwt(w2) poptot(M) ///
    marginals(boy girl FSM ireg1-ireg3) ///
    method(linear) print(final)
10,000 boys, 10,000 girls, 3,000 FSM
Variables boy, girl and FSM are binary
Categorical variable region (4 categories) is turned into 4 binary indicator variables. Only 3 are entered in the syntax, to avoid collinearity: the 4 indicators sum to boy + girl.
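The total for the omitted fourth region is implied by the others, because every pupil is either a boy or a girl and sits in exactly one region. A short Python check of that arithmetic, using the totals from the matrix M above (the variable names are for illustration only):

```python
# Population totals in the order boy, girl, FSM, ireg1, ireg2, ireg3
M = [10000, 10000, 3000, 4000, 3000, 8000]
boy, girl, fsm, r1, r2, r3 = M

population = boy + girl                 # every pupil is counted exactly once
r4 = population - (r1 + r2 + r3)        # implied total for the fourth region
```

This gives a fourth-region total of 5,000. Entering a fourth indicator would add a constraint that is already determined by the other six, which would make the system of calibration equations singular.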
Output
Variable   Pop total   Weighted (entrywt)   Weighted (exitwt)   R
boy        10000       9619.7188            10000                .21373408
girl       10000       10380.281            10000                .13733883
FSM        3000        2915.4929            3000                 .04710333
ireg1      4000        4056.3379            4000                -.19511394
ireg2      3000        3197.1831            3000                -.24808005
ireg3      8000        8507.042             8000                -.2391432
Options
Options available to:
Control amount of output/graphs
Set max number of iterations/tolerance
Methods
linear, logistic, bounded linear and nonresp
(blinear sets bounds for $w_k/d_k$. GenStat and SAS have something very similar.)
(nonresp adjusts for non-response – see below)
Limitations (1)
Solves the equations by finding a
matrix inverse
1. Won’t work if J is large
2. Can have problems with singular or nearly
singular matrices
3. Iterative methods (logistic, blinear) won’t
always converge
No obvious solution to 1. Problems 2 and 3 are usually down to problems with the data
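To make the three limitations concrete, here is a hedged Python sketch of a Newton iteration for the logistic distance, which leads to weights of the form $w_k = d_k \exp(x_k^{\top} \lambda)$. It is restricted to two calibration variables so that the matrix solve can be written out by Cramer's rule, and it is not the code of the Stata program. The function name `logistic_calibrate`, the defaults for `tol` and `max_iter`, and the data are assumptions.

```python
import math


def logistic_calibrate(d, X, P, tol=1e-10, max_iter=50):
    """Newton iteration for w_k = d_k * exp(x_k . lam), two variables only."""
    lam = [0.0, 0.0]
    for _ in range(max_iter):
        w = [dk * math.exp(lam[0] * x[0] + lam[1] * x[1])
             for dk, x in zip(d, X)]
        # Residual of the calibration equations: P_i - sum_k w_k x_ki
        r = [P[i] - sum(wk * x[i] for wk, x in zip(w, X)) for i in range(2)]
        if max(abs(ri) for ri in r) < tol:
            return w
        # Jacobian H = sum_k w_k x_k x_k' (2 x 2, symmetric)
        a = sum(wk * x[0] * x[0] for wk, x in zip(w, X))
        b = sum(wk * x[0] * x[1] for wk, x in zip(w, X))
        c = sum(wk * x[1] * x[1] for wk, x in zip(w, X))
        det = a * c - b * b
        if abs(det) < 1e-12:
            raise ValueError("singular or nearly singular matrix")
        # Newton step: solve H * step = r by Cramer's rule
        lam[0] += (c * r[0] - b * r[1]) / det
        lam[1] += (a * r[1] - b * r[0]) / det
    raise RuntimeError("no convergence within max_iter iterations")


# Hypothetical data: columns are boy, FSM
d = [10.0, 10.0, 10.0, 10.0]
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
P = [22.0, 18.0]

w = logistic_calibrate(d, X, P)
totals = [sum(wk * x[i] for wk, x in zip(w, X)) for i in range(2)]
```

Each pass rebuilds and solves a J x J system, which is where a large J, a singular matrix or a failure to converge would show up. Because the weights are an exponential multiple of $d_k$, they stay positive.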
Limitations (2)
We need to recode categorical
variables (SAS doesn’t)
Stata: tab region, gen(ireg)
More complicated (eg two-phase)
problems aren’t handled directly
Need a bit of syntax to handle this
Other packages can handle this directly
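The recoding step is mechanical. As a language-neutral illustration of what `tab region, gen(ireg)` produces, the Python sketch below builds one 0/1 column per region level and then drops the last one before calibration. The data and the dictionary names `ireg` and `marginals` are made up for this example.

```python
region = [1, 1, 2, 1, 3, 1, 3, 2, 4]    # hypothetical region codes
levels = sorted(set(region))            # distinct levels in ascending order

# One 0/1 column per level, named ireg1, ireg2, ... in sorted order
ireg = {f"ireg{j}": [int(r == lev) for r in region]
        for j, lev in enumerate(levels, start=1)}

# Keep only the first three indicators as calibration variables
marginals = {name: col for name, col in ireg.items() if name != "ireg4"}

# Each person belongs to exactly one region, so the four columns sum to 1
row_sums = [sum(col[k] for col in ireg.values()) for k in range(len(region))]
```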
Extension: Standard errors
Calibration weights are often incorrectly
treated as selection weights.
calibrate , entrywt(w1) exitwt(w2) poptot(M) ///
marginals(boy girl FSM ireg1-ireg3)
calibmean , selwt(w1) calibwt(w2) yvar(y) ///
marginals(boy girl FSM ireg1-ireg3) ///
psu(school) designops(strata(region))
This generalizes the poststrata() option of Stata’s svyset
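The slides do not show how calibmean computes its standard errors. As a hedged sketch of why the two kinds of weight must be kept apart, the usual linearization result for calibration estimators (Deville and Särndal 1992) replaces each $y_k$ by a regression residual before the design-based variance formula is applied. The symbols $\hat{B}$ and $e_k$ are introduced here for illustration and do not appear in the slides.

```latex
\hat{B} = \Big( \sum_{k \in S} d_k x_k x_k^{\top} \Big)^{-1} \sum_{k \in S} d_k x_k y_k,
\qquad
e_k = y_k - x_k^{\top} \hat{B},
\qquad
\widehat{\operatorname{Var}}\Big( \sum_{k \in S} w_k y_k \Big)
\approx
\widehat{\operatorname{Var}}_{\text{design}}\Big( \sum_{k \in S} w_k e_k \Big)
```

Treating $w_k$ as if it were a selection weight amounts to using $y_k$ in place of $e_k$, which typically overstates the variance when the calibration variables predict $y$ well.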
Extension: Method nonresp (1)
Example
Select schools, then classes, then pupils
Assume all schools respond, pupils might not
Variables available on responders. (Pop totals
available)
Gender, Exam results, FSM, Region
Variables available for the whole sample, including non-responders. (Pop totals not available)
PTratio: Pupil-teacher ratio
topset: Is pupil in the top set?
Extension: Method nonresp (2)
     serial   region   topset   outc   sex   FSM
     --------------------------------------------
  1.   1001        1        1      0     .     .
  2.   1002        1        0      1     1     0
  3.   1003        2        0      0     .     .
  4.   1004        1        0      1     1     1
  5.   1005        3        1      0     .     .
     --------------------------------------------
  6.   1006        1        0      1     0     1
  7.   1007        3        1      1     1     0
  8.   1008        2        1      0     .     .
  9.   1009        1        0      1     1     0
Extension: Method nonresp (3)
Population totals are unknown for some variables, but the variables themselves are available for the whole sample (including non-responders)
calibrate , entrywt(w1) exitwt(w2) poptot(M) ///
marginals(boy girl FSM ireg1-ireg3) ///
method(nonresp) outc(outc) ///
svars(PTratio topset)
Responders are weighted to pop totals on “marginals” and to selected sample totals on “svars”
(Lundström & Särndal, 2005)
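The targets for this method combine two sources: known population totals for the marginals, and totals estimated from the full selected sample for the svars. The Python sketch below builds such a target vector on made-up data. It reads "selected sample totals" as selection-weighted totals over responders and non-responders together, which is an assumption, as are the names `pop_totals`, `svar_target` and `targets`.

```python
# Hypothetical selected sample: selection weight, response outcome, one svar
d      = [10.0, 10.0, 20.0, 20.0, 10.0]
outc   = [0, 1, 0, 1, 1]                # 1 = responded
topset = [1, 0, 0, 0, 1]                # known for the whole selected sample

pop_totals = [40.0, 30.0]               # assumed known totals for boy, girl

# Target for the svar: selection-weighted total over the FULL selected sample
svar_target = sum(dk * t for dk, t in zip(d, topset))

# Responders are then calibrated to this combined vector of targets
targets = pop_totals + [svar_target]

# Before calibration, the responder-only weighted total falls short of it
resp_total = sum(dk * t for dk, t, o in zip(d, topset, outc) if o == 1)
```

In this toy sample the full-sample total for topset is 20 but responders account for only 10 of it, so calibration raises the weights of responding top-set pupils to make up the difference.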
Conclusions
We’ve found the program can handle many
practical problems
Easy to calculate SEs (but theory assumes no
non-response)
Method nonresp isn’t available in many
packages
We don’t have to calibrate to population
totals
Eg, calibrate Wave n+1 of a survey to totals from
Wave n
Calibrate one sample to look like another
Questions
References
Deville, J.-C. and Särndal, C.-E. 1992. Calibration estimators in survey sampling. Journal of the American Statistical Association 87: 376-382.
Background and theory behind calibration
Lundström, S. and Särndal, C.-E. 2005. Estimation in Surveys with Nonresponse. Wiley.
Deals with non-response
Singh, A.C. and Mohl, C.A. 1996. Understanding calibration estimators in survey sampling. Survey Methodology 22: 107-115.
Discusses several methods of doing bounded calibration