UKSUG10.DSouza


A Stata program for
calibration weighting
John D’Souza
National Centre for Social Research
Outline

Description of calibration
  - Adjust selection weights so that the weighted sample exactly matches the population
  - Generalizes post-stratification
  - Several methods: linear, logistic, …



SAS, GenStat
A new Stata program
Limitations and extensions
Sampling



Selection weights: $d_k = 1/P(\text{person } k \text{ is chosen})$
Sample frame variables $X_{k1}, \dots, X_{kJ}$ with known population totals $P_1, \dots, P_J$.
Horvitz-Thompson estimator of $P_i$:
$\sum_{k \in S} d_k X_{ki} \approx P_i$ for $i = 1, 2, \dots, J$.

Calibration: adjust $d_k$ to get calibration weights $w_k$, giving exact equality:
$\sum_{k \in S} w_k X_{ki} = P_i$ for $i = 1, 2, \dots, J$.
Example: School Census
Variables include
  - Age, Gender, Ethnic Group, Exam results
  - Type of School, Region
  - Pupil’s Free School Meal (FSM) eligibility
We calibrate to J variables, e.g.:
  - Boy (binary)
  - Girl (binary)
  - Region (e.g. four categories)
  - FSM eligibility (binary)
J = 1 + 1 + (4 - 1) + 1 = 6
Special case: post-stratification


Simplest case:
  - One categorical variable
  - Easy to deal with (post-stratification)
  - svyset , poststrata() postweight()
More general case:
  - Several variables (categorical and numerical)
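
As a rough illustration of the simplest case, the sketch below post-stratifies a toy sample on region with Stata's built-in svyset options. The data, the weights and the variable names (school, w1, y, regpop) are invented for illustration. The region totals follow the school example in this talk, with 5,000 for the fourth region assumed as the remainder of 20,000 pupils.

```stata
* Toy sample: 8 pupils in 4 schools (all values invented for illustration)
clear
input int school byte region double w1 double y
1 1 2000 52
1 2 2500 61
2 1 2200 47
2 3 4100 58
3 2 1000 66
3 4 2400 49
4 3 4300 55
4 4 2300 63
end

* Known population size of each region, stored as a variable
generate double regpop = .
replace regpop = 4000 if region == 1
replace regpop = 3000 if region == 2
replace regpop = 8000 if region == 3
replace regpop = 5000 if region == 4   // assumed: 20,000 minus the other three

* Declare the design: schools are PSUs, w1 is the selection weight,
* and the weights are post-stratified to the region totals
svyset school [pweight = w1], poststrata(region) postweight(regpop)

* Post-stratified estimate of the mean of y, with a design-based SE
svy: mean y
```

Because the post-stratum totals are part of the svyset declaration, every later svy command uses them for both the point estimate and the standard error.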
Deville and Särndal (1992)
Minimize the “distance” between w and d subject to the J calibration constraints.
Linear calibration: minimize
$\sum_{k \in S} (w_k - d_k)^2 / d_k$
Involves solving J simultaneous linear equations.
Logistic calibration: minimize
$\sum_{k \in S} \left( w_k \log(w_k/d_k) - w_k + d_k \right)$
Involves solving J simultaneous non-linear equations.
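
For readers who want the intermediate step, both minimizations can be worked through with Lagrange multipliers. The vector notation is introduced here and is not on the slides: $x_k$ holds the J calibration variables of person k, $P$ the J population totals, and $\lambda$ the J multipliers (scaled to absorb constant factors).

$$
\begin{aligned}
\text{Linear:}\quad & w_k = d_k\,(1 + x_k^\top \lambda), &
\Big(\sum_{k \in S} d_k x_k x_k^\top\Big)\lambda &= P - \sum_{k \in S} d_k x_k \\
\text{Logistic:}\quad & w_k = d_k \exp(x_k^\top \lambda), &
\sum_{k \in S} d_k \exp(x_k^\top \lambda)\, x_k &= P
\end{aligned}
$$

In the linear case the J equations are linear in $\lambda$, so one matrix inversion gives the weights. In the logistic case $\lambda$ sits inside the exponential, so the equations have to be solved iteratively.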
GenStat, SAS, Stata

GenStat and SAS
  - Methods: linear, logistic and bounded.
  - Estimation: GenStat gives SEs.
  - SAS handles categorical variables directly; in GenStat they are entered as indicator variables.

Stata
  - Post-stratification (calibration to one categorical variable). Gives SEs.
  - No routine for general calibration.
A new Stata program

Typical syntax.
matrix M=[10000, 10000, 3000, 4000, 3000, 8000]
calibrate , entrywt(w1) exitwt(w2) poptot(M) ///
marginals(boy girl FSM ireg1-ireg3) ///
method(linear) print(final)

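Population totals in M: 10,000 boys, 10,000 girls, 3,000 FSM-eligible pupils, then 4,000, 3,000 and 8,000 pupils in regions 1 to 3.
  - Variables boy, girl and FSM are binary
  - Categorical variable region (4 categories) is turned into 4 binary indicator variables. Only 3 are entered in the syntax, to avoid collinearity (the four indicators sum to boy + girl).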
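
The slides do not show the program's internals, so the following is only a minimal Mata sketch of the textbook closed form for linear calibration, w_k = d_k(1 + x_k'λ). The toy data, the reduced set of three calibration variables and the output variable name w2 are assumptions made for illustration.

```stata
* Toy sample of 6 pupils; w1 is the selection weight (values invented)
clear
input byte boy byte girl byte FSM double w1
1 0 1 1400
1 0 0 4000
1 0 0 4200
0 1 1 1500
0 1 0 4300
0 1 0 4600
end

mata:
X = st_data(., ("boy", "girl", "FSM"))   // n x J matrix of calibration variables
d = st_data(., "w1")                     // n x 1 vector of selection weights
P = (10000 \ 10000 \ 3000)               // J x 1 vector of population totals

T      = cross(X, d, X)                  // J x J matrix: sum of d_k * x_k * x_k'
lambda = invsym(T) * (P - cross(X, d))   // solve the J linear equations
w      = d :* (1 :+ X*lambda)            // calibrated weights w_k = d_k(1 + x_k'lambda)

cross(X, w)'                             // weighted totals, should reproduce P'
st_store(., st_addvar("double", "w2"), w)
end
```

The matrix inverse in this sketch is the step that fails when the calibration variables are collinear, which is why one indicator per categorical variable has to be left out.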
Output
Variable   Pop total   Weighted (entrywt)   Weighted (exitwt)   R
boy        10000       9619.7188            10000               .21373408
girl       10000       10380.281            10000               .13733883
FSM        3000        2915.4929            3000                .04710333
ireg1      4000        4056.3379            4000                -.19511394
ireg2      3000        3197.1831            3000                -.24808005
ireg3      8000        8507.042             8000                -.2391432
Options


Options are available to:
  - Control the amount of output and graphs
  - Set the maximum number of iterations and the tolerance
Methods
  - linear, logistic, bounded linear (blinear) and nonresp
  - blinear sets bounds for $w_k/d_k$; GenStat and SAS have something very similar
  - nonresp adjusts for non-response (see below)
Limitations (1)

Solves the equations by finding a matrix inverse
1. Won’t work if J is large
2. Can have problems with singular or nearly singular matrices
3. Iterative methods (logistic, blinear) won’t always converge

No obvious solution to 1. Problems 2 and 3 are usually down to problems with the data.
Limitations (2)


We need to recode categorical variables (SAS doesn’t)
  - Stata: tab region, gen(ireg)
More complicated (e.g. two-phase) problems aren’t handled directly
  - Need a bit of syntax to handle this
  - Other packages can handle this directly
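
The recoding step can be sketched on toy data as follows. The data and the coding of sex (1 = boy, 2 = girl) are assumptions made for illustration; only tab region, gen(ireg) comes from the slide.

```stata
* Toy data: 6 pupils with a 4-category region and a coded sex variable
* (the coding 1 = boy, 2 = girl is an assumption for illustration)
clear
input byte region byte sex
1 1
2 2
3 1
4 2
1 2
3 1
end

tabulate region, generate(ireg)     // creates ireg1, ireg2, ireg3, ireg4
generate byte boy  = (sex == 1)
generate byte girl = (sex == 2)

* The four region indicators always sum to boy + girl (both equal 1 for
* every pupil), which is why only ireg1-ireg3 are passed to marginals()
egen byte nreg = rowtotal(ireg1-ireg4)
assert nreg == boy + girl
```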
Extensions: Standard errors
Calibration weights are often incorrectly treated as selection weights.
calibrate , entrywt(w1) exitwt(w2) poptot(M) ///
    marginals(boy girl FSM ireg1-ireg3)
calibmean , selwt(w1) calibwt(w2) yvar(y) ///
    marginals(boy girl FSM ireg1-ireg3) ///
    psu(school) designops(strata(region))
This generalizes the poststrata() option of Stata’s svyset.
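
Why the distinction matters can be sketched with the standard linearization result for calibration estimators (Deville and Särndal, 1992), stated here for a total. Whether calibmean implements exactly this form is not stated on the slides; the symbols $x_k$, $\hat{B}$ and $e_k$ are notation introduced here.

$$
\hat{Y}_{cal} = \sum_{k \in S} w_k y_k, \qquad
e_k = y_k - x_k^\top \hat{B}, \qquad
\hat{B} = \Big(\sum_{k \in S} d_k x_k x_k^\top\Big)^{-1} \sum_{k \in S} d_k x_k y_k
$$

The variance of $\hat{Y}_{cal}$ is approximately the design variance of $\sum_{k \in S} d_k e_k$, the weighted total of the residuals. Treating $w_k$ as a selection weight instead uses $y_k$ in place of $e_k$, which overstates the standard error when the calibration variables predict y well.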
Extension: Method nonresp (1)
Example
  - Select schools, then classes, then pupils
  - Assume all schools respond; pupils might not
Variables available on responders (population totals available)
  - Gender, Exam results, FSM, Region
Variables available on non-responders (population totals not available)
  - PTratio: pupil-teacher ratio
  - topset: is the pupil in the top set?
Extension: Method nonresp (2)
       serial   region   topset   outc   sex   FSM
     -----------------------------------------------
  1.     1001        1        1      0     .     .
  2.     1002        1        0      1     1     0
  3.     1003        2        0      0     .     .
  4.     1004        1        0      1     1     1
  5.     1005        3        1      0     .     .
     -----------------------------------------------
  6.     1006        1        0      1     0     1
  7.     1007        3        1      1     1     0
  8.     1008        2        1      0     .     .
  9.     1009        1        0      1     1     0
Extension: Method nonresp (3)
Population totals unknown, but variables are available on all the sample (including non-responders)
calibrate , entrywt(w1) exitwt(w2) poptot(M) ///
    marginals(boy girl FSM ireg1-ireg3) ///
    method(nonresp) outc(outc) ///
    svars(PTratio topset)
Responders are weighted to population totals on “marginals” and to selected-sample totals on “svars” (Lundström & Särndal, 2005)
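
The two sets of constraints can be written out as follows. The notation is introduced here, since the slide gives only the verbal description: $r \subseteq S$ is the set of responders, $x_k$ holds the “marginals” variables with known population totals $P$, and $z_k$ holds the “svars” variables observed on the whole selected sample $S$.

$$
\sum_{k \in r} w_k x_k = P, \qquad
\sum_{k \in r} w_k z_k = \sum_{k \in S} d_k z_k
$$

The right-hand side of the second constraint is the Horvitz-Thompson estimate from the full selected sample, so it stands in for a population total that is not known.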
Conclusions




We’ve found the program can handle many practical problems
Easy to calculate SEs (but the theory assumes no non-response)
Method nonresp isn’t available in many packages
We don’t have to calibrate to population totals
  - E.g. calibrate Wave n+1 of a survey to totals from Wave n
Calibrate one sample to look like another
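
One way the Wave n targets might be built is sketched below, using Stata's total command to turn weighted Wave n totals into the matrix passed to poptot(). The toy data and the weight name w are invented for illustration, and the calibrate call is left as a comment because it depends on the user-written program described in this talk.

```stata
* Toy "Wave n" data with its final weight w (all values invented)
clear
input byte boy byte girl byte FSM double w
1 0 1 2500
1 0 0 2500
0 1 0 3000
0 1 1 2000
end

* Weighted totals from Wave n become the calibration targets
total boy girl FSM [pweight = w]
matrix M = e(b)          // 1 x 3 row vector: 5000, 5000, 4500
matrix list M

* With the Wave n+1 data in memory, M could then be passed to the program
* using the syntax shown in the talk (not run here):
*   calibrate , entrywt(w1) exitwt(w2) poptot(M) marginals(boy girl FSM)
```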
Questions
References



Deville, J.-C. and Särndal, C.-E. 1992. Calibration estimators in survey sampling. Journal of the American Statistical Association 87: 376-382.
  - Background and theory behind calibration
Lundström, S. and Särndal, C.-E. 2005. Estimation in Surveys with Nonresponse. Wiley.
  - Deals with non-response
Singh, A.C. and Mohl, C.A. 1996. Understanding calibration estimators in survey sampling. Survey Methodology 22: 107-115.
  - Discusses several methods of doing bounded calibration