Variance Estimation in Complex Surveys Drew Hardin Kinfemichael Gedif


So far
Variance for an estimated mean and total under
– SRS, stratified, and cluster (single-stage, multistage) designs, etc.
Variance for estimating a ratio of two means under
– SRS (we used the linearization method)

What about other cases?
Variance for estimators that are not linear combinations of means and totals
– Ratios
Variance for estimating other statistics from complex surveys
– Median, quantiles, functions of EMF, etc.
Other approaches are necessary

Outline
Variance estimation methods
– Linearization
– Random group methods
– Balanced repeated replication (BRR)
– Resampling techniques: jackknife, bootstrap
Adapting to complex surveys
‘Hot’ research areas
References

Linearization (Taylor Series Methods)
We have seen this before (with the ratio estimator, and in other courses).
Suppose our statistic is nonlinear. It can often be approximated by a linear function using Taylor's theorem.
We know how to calculate variances of linear functions of means and totals.

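Linearization (Taylor Series Methods)
Linearize:
$$h(\hat t_1, \ldots, \hat t_k) \approx h(t_1, \ldots, t_k) + \sum_{j=1}^{k} \left.\frac{\partial h(c_1, \ldots, c_k)}{\partial c_j}\right|_{t_1, \ldots, t_k} (\hat t_j - t_j)$$
Calculate the variance (partial derivatives evaluated at $(t_1, \ldots, t_k)$):
$$V\left[h(\hat t_1, \ldots, \hat t_k)\right] \approx \left(\frac{\partial h}{\partial t_1}\right)^2 V(\hat t_1) + \cdots + \left(\frac{\partial h}{\partial t_k}\right)^2 V(\hat t_k) + 2 \sum_{i<j} \frac{\partial h}{\partial t_i} \frac{\partial h}{\partial t_j} \mathrm{Cov}(\hat t_i, \hat t_j)$$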
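
As an illustration of the linearization recipe, here is a minimal Python sketch for the ratio of two sample means under SRS. It assumes the finite population correction can be ignored and plugs sample estimates into the partial derivatives; the function name `ratio_linearization_variance` is hypothetical and not from the slides.

```python
import numpy as np

def ratio_linearization_variance(y, x):
    """Linearization variance of B_hat = ybar / xbar under SRS (fpc ignored)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(y)
    ybar, xbar = y.mean(), x.mean()
    b_hat = ybar / xbar
    # Partial derivatives of h(c1, c2) = c1 / c2, evaluated at the estimates
    dh_dy = 1.0 / xbar
    dh_dx = -ybar / xbar**2
    # Estimated variances and covariance of the two sample means
    cov = np.cov(y, x, ddof=1) / n
    variance = (dh_dy**2 * cov[0, 0]
                + dh_dx**2 * cov[1, 1]
                + 2 * dh_dy * dh_dx * cov[0, 1])
    return b_hat, variance
```

For the ratio, this expression is algebraically the same as the sample variance of the linearized residuals (y_i - B_hat x_i) / xbar divided by n, which is the form usually quoted for the ratio estimator.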

Linearization (Taylor Series Methods)
Pro:
– Can be applied in general sampling designs
– Theory is well developed
– Software is available
Con:
– Finding partial derivatives may be difficult
– A different method is needed for each statistic
– The function of interest may not be expressible as a smooth function of population totals or means
– Accuracy depends on the quality of the linearization approximation
Random Group Methods
Based on the concept of replicating the survey design.
It is usually not possible to simply go out and replicate the survey.
However, the survey can often be divided into R groups so that each group forms a miniature version of the survey.

Random Group Methods
[Figure: Stratum 1 to Stratum 5, each divided into random groups numbered 1 to 8. Caption: Treat as miniature sample.]

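Unbiased Estimator (Average of Samples)
$$\hat V_1(\tilde\theta) = \frac{1}{R} \cdot \frac{\sum_{r=1}^{R} (\hat\theta_r - \tilde\theta)^2}{R - 1}$$
Slightly Biased Estimator (All Data)
$$\hat V_2(\hat\theta) = \frac{1}{R} \cdot \frac{\sum_{r=1}^{R} (\hat\theta_r - \hat\theta)^2}{R - 1}$$
Here $\hat\theta_r$ is the estimate from random group $r$, $\tilde\theta = \sum_{r=1}^{R} \hat\theta_r / R$ is the average of the group estimates, and $\hat\theta$ is the estimate computed from all the data.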
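
The two random group estimators can be sketched in a few lines of Python. This is a minimal illustration that assumes the R group estimates have already been computed; the function name `random_group_variance` is hypothetical.

```python
import numpy as np

def random_group_variance(group_estimates, full_estimate=None):
    """Random group variance estimates from R group-level estimates.

    Returns (theta_tilde, v1, v2); v2 is None when no full-sample
    estimate is supplied.
    """
    theta_r = np.asarray(group_estimates, dtype=float)
    R = len(theta_r)
    theta_tilde = theta_r.mean()  # average of the group estimates
    # V1: spread of the group estimates around their own average
    v1 = np.sum((theta_r - theta_tilde) ** 2) / (R * (R - 1))
    if full_estimate is None:
        return theta_tilde, v1, None
    # V2: spread of the group estimates around the all-data estimate
    v2 = np.sum((theta_r - full_estimate) ** 2) / (R * (R - 1))
    return theta_tilde, v1, v2
```

Because the average minimizes the sum of squared deviations, the second estimate is never smaller than the first.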
Random Group Methods

Pro:
– Easy to calculate
– General method (can also be used for non-smooth functions)

Con:
– Assumes independent groups (a problem when N is small)
– Small number of groups (particularly if one stratum is sampled only a few times)
– The survey design must be replicated in each random group (the strata and clusters must remain the same)
Resampling and Replication Methods

Balanced Repeated Replication (BRR)
– Special case when $n_h = 2$
Jackknife (Quenouille (1949); Tukey (1958))
Bootstrap (Efron (1979); Shao and Tu (1995))
These methods
– Extend the idea of the random group method
– Allow replicate groups to overlap
– Are all-purpose methods
– Asymptotic properties?
Balanced Repeated Replication
Suppose we had sampled 2 per stratum.
There are $2^H$ ways to pick 1 from each stratum.
Each combination could be treated as a sample.
Pick R samples.

Balanced Repeated Replication

Which samples should we include?
– Assign each value either 1 or –1 within the stratum
– Select samples that are orthogonal to one another to create balance
– You can use the design matrix for a fractional factorial
– Specify a vector $a_r$ of 1, –1 values, one for each stratum

Estimator
$$\hat V_{BRR}(\hat\theta) = \frac{1}{R} \sum_{r=1}^{R} \left[\hat\theta(a_r) - \hat\theta\right]^2$$
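
The BRR steps above can be sketched as follows. This is a minimal illustration, assuming two PSU-level values per stratum, known stratum weights W_h, and a statistic of the form g(sum_h W_h ybar_h). The balanced set of half-samples is taken from the columns of a Sylvester-type Hadamard matrix, which is one common choice and not necessarily the one used in the slides; all function names are hypothetical.

```python
import numpy as np

def sylvester_hadamard(order):
    """Hadamard matrix of a power-of-two order (Sylvester construction)."""
    H = np.array([[1]])
    while H.shape[0] < order:
        H = np.block([[H, H], [H, -H]])
    return H

def brr_variance(y, weights, g=lambda m: m):
    """BRR variance of theta_hat = g(sum_h W_h * ybar_h), two PSUs per stratum.

    y       : array of shape (H, 2), the two PSU values in each stratum
    weights : array of shape (H,), the stratum weights W_h
    """
    y = np.asarray(y, dtype=float)
    W = np.asarray(weights, dtype=float)
    n_strata = y.shape[0]
    # Smallest power of two strictly greater than the number of strata
    R = 1
    while R <= n_strata:
        R *= 2
    # Skip the all-ones first column; remaining columns are mutually orthogonal
    A = sylvester_hadamard(R)[:, 1:n_strata + 1]
    theta_hat = g(np.sum(W * y.mean(axis=1)))
    # a_rh = +1 picks the first PSU of stratum h, a_rh = -1 picks the second
    picks = np.where(A == 1, y[:, 0], y[:, 1])  # shape (R, H)
    theta_r = np.array([g(np.sum(W * row)) for row in picks])
    return theta_hat, np.mean((theta_r - theta_hat) ** 2)
```

For a linear statistic (g the identity), the balance condition makes the cross-stratum terms cancel, so the BRR value equals the usual stratified estimate sum_h W_h^2 (y_h1 - y_h2)^2 / 4.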
Balanced Repeated Replication

Pro:
– Relatively few computations
– Asymptotically equivalent to linearization methods for smooth functions of population totals and quantiles
– Can be extended to use weights

Con:
– Requires 2 PSUs per stratum (can be extended with more complex schemes)
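The Jackknife
SRS with replacement

Quenouille (1949); Tukey (1958); Shao and Tu (1995)
Let $\hat\theta^{(i)}$ be the estimator of $\theta$ after omitting the $i$th observation.
Jackknife estimate:
$$\tilde\theta_J = \sum_{i=1}^{n} \tilde\theta_i / n, \quad \text{where } \tilde\theta_i = n\hat\theta - (n - 1)\hat\theta^{(i)}$$
Jackknife estimator of $V(\hat\theta)$:
$$V_J(\hat\theta) = \frac{n - 1}{n} \sum_{i=1}^{n} \left(\hat\theta^{(i)} - \hat\theta^{(\cdot)}\right)^2 = \frac{1}{n(n - 1)} \sum_{i=1}^{n} \left(\tilde\theta_i - \tilde\theta_J\right)^2, \quad \text{where } \hat\theta^{(\cdot)} = \sum_{i=1}^{n} \hat\theta^{(i)} / n$$

For stratified SRS without replacement, see Jones (1974).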
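
A minimal Python sketch of the delete-one jackknife for an SRS follows, computing the variance both from the leave-one-out estimates and from the pseudovalues. The function name `jackknife` and the choice to pass the estimator as a callable are illustrative assumptions.

```python
import numpy as np

def jackknife(sample, estimator):
    """Delete-one jackknife for an SRS.

    Returns (theta_J, v_loo, v_pseudo): the jackknife estimate and the
    variance computed in the leave-one-out and pseudovalue forms.
    """
    x = np.asarray(sample, dtype=float)
    n = len(x)
    theta_hat = estimator(x)
    # Leave-one-out estimates theta_hat^(i)
    loo = np.array([estimator(np.delete(x, i)) for i in range(n)])
    # Pseudovalues: n * theta_hat - (n - 1) * theta_hat^(i)
    pseudo = n * theta_hat - (n - 1) * loo
    theta_j = pseudo.mean()
    v_loo = (n - 1) / n * np.sum((loo - loo.mean()) ** 2)
    v_pseudo = np.sum((pseudo - theta_j) ** 2) / (n * (n - 1))
    return theta_j, v_loo, v_pseudo
```

With `estimator=np.mean`, the pseudovalues are the observations themselves, so both forms reduce to s^2 / n.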
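The Jackknife
Stratified multistage design
In stratum $h$, delete one PSU at a time.
Let $\hat\theta^{(hi)}$ be the estimator of the same form as $\hat\theta$ when PSU $i$ of stratum $h$ is omitted.
Jackknife estimate:
$$\bar y^{(hi)} = \sum_{h' \neq h} W_{h'} \bar y_{h'} + W_h \frac{n_h \bar y_h - y_{hi}}{n_h - 1}, \quad \text{where } \hat\theta^{(hi)} = g(\bar y^{(hi)})$$
Or using pseudovalues:
$$\tilde\theta^{(hi)} = n_h \hat\theta - (n_h - 1)\hat\theta^{(hi)}$$
$$\tilde\theta_J^{(I)} = \sum_{h=1}^{L} \sum_{i=1}^{n_h} \tilde\theta^{(hi)} / n; \qquad \tilde\theta_J^{(II)} = \frac{1}{L} \sum_{h=1}^{L} \frac{1}{n_h} \sum_{i=1}^{n_h} \tilde\theta^{(hi)}$$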
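The Jackknife
Stratified multistage design

Different formulae for $V(\hat\theta)$:
$$V_L(\hat\theta) = \sum_{h=1}^{L} \frac{n_h - 1}{n_h} \sum_{i=1}^{n_h} \left(\hat\theta^{(hi)} - \hat\theta^{\text{method}}\right)^2$$
where $\hat\theta^{\text{method}}$ can be $\hat\theta^{(h)}$, $\hat\theta$, $\sum_{h=1}^{L} \sum_{i=1}^{n_h} \hat\theta^{(hi)} / n$, or $\sum_{h=1}^{L} \hat\theta^{(h)} / L$.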
Using the pseudovalues:
$$V_L(\hat\theta) = \sum_{h=1}^{L} \frac{1}{n_h (n_h - 1)} \sum_{i=1}^{n_h} \left(\tilde\theta^{(hi)} - \tilde\theta_J^{(j)}\right)^2, \quad j = I, II$$
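
The delete-one-PSU jackknife for a stratified design can be sketched as below. This is a minimal illustration that assumes PSU-level values y_hi, known stratum weights W_h, a statistic g of the weighted mean, and centering at the full-sample estimate (one of the centering choices listed above). The function name is hypothetical.

```python
import numpy as np

def stratified_jackknife_variance(strata, weights, g=lambda m: m):
    """Delete-one-PSU jackknife variance of theta_hat = g(sum_h W_h * ybar_h).

    strata  : list of 1-D sequences of PSU values y_hi, one per stratum
    weights : stratum weights W_h
    """
    strata = [np.asarray(s, dtype=float) for s in strata]
    W = np.asarray(weights, dtype=float)
    means = np.array([s.mean() for s in strata])
    total = np.sum(W * means)
    theta_hat = g(total)
    variance = 0.0
    for h, s in enumerate(strata):
        n_h = len(s)
        # Stratum mean with PSU i removed: (n_h * ybar_h - y_hi) / (n_h - 1)
        loo_means = (n_h * means[h] - s) / (n_h - 1)
        # ybar^(hi): swap in the reduced stratum-h mean, keep the other strata
        ybar_hi = total - W[h] * means[h] + W[h] * loo_means
        theta_hi = np.array([g(v) for v in ybar_hi])
        # Centered at the full-sample estimate theta_hat
        variance += (n_h - 1) / n_h * np.sum((theta_hi - theta_hat) ** 2)
    return theta_hat, variance
```

For a linear statistic this reproduces the customary stratified estimate sum_h W_h^2 s_h^2 / n_h, which is a convenient sanity check before applying it to a nonlinear g.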
The Jackknife
Asymptotics

Krewski and Rao (1981)
Based on the concept of a sequence of finite populations $\{\Pi_L\}_{L=1}^{\infty}$ with $L$ strata in $\Pi_L$.

Under conditions C1-C6 given in the paper, as $L \to \infty$:
i) $n^{1/2}(\hat\theta - \theta) \to_d N(0, \sigma^2)$
ii) $n V_{\text{method}}(\hat\theta) \to_p \sigma^2$
iii) $T_{\text{method}} = \dfrac{\hat\theta - \theta}{\sqrt{V_{\text{method}}(\hat\theta)}} \to_d N(0, 1)$
where "method" is the variance estimator used (linearization, BRR, jackknife).
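The Bootstrap
Naïve bootstrap

Efron (1979); Rao and Wu (1988); Shao and Tu (1995)
Resample $\{y_{hi}^*\}_{i=1}^{n_h}$ with replacement in stratum $h$.
Estimate, for $b = 1, 2, \ldots, B$:
$$\bar y_h^{*(b)} = n_h^{-1} \sum_i y_{hi}^{*(b)}, \quad \bar y^{*(b)} = \sum_h W_h \bar y_h^{*(b)}, \quad \hat\theta^{*(b)} = g(\bar y^{*(b)})$$
Variance:
$$\hat V_{NBS}(\hat\theta^*) = E_*\left(\hat\theta^* - E_*(\hat\theta^*)\right)^2$$
– Or approximate by
$$\hat V^*_{NBS}(\hat\theta^*) = \frac{1}{B - 1} \sum_{b=1}^{B} \left(\hat\theta^{*(b)} - \hat\theta^{*(\cdot)}\right)^2, \quad \text{where } \hat\theta^{*(\cdot)} = \sum_{b=1}^{B} \hat\theta^{*(b)} / B$$

The estimator is not a consistent estimator of the variance of a general nonlinear statistic.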
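The Bootstrap
Naïve bootstrap

For $\hat\theta^* = \sum_h W_h \bar y_h^* = \bar y^*$:
$$\mathrm{Var}(\bar y^*) = \sum_h \frac{W_h^2}{n_h} \left(\frac{n_h - 1}{n_h}\right) s_h^2$$
Comparing with
$$\mathrm{Var}(\bar y) = \sum_h \frac{W_h^2}{n_h} s_h^2$$
the ratio $\mathrm{Var}(\bar y^*) / \mathrm{Var}(\bar y)$ does not converge to 1 for bounded $n_h$.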
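
The shortfall of the naïve bootstrap in the linear case can be checked numerically. The sketch below is illustrative only: it assumes known stratum weights and PSU-level values, the function names are hypothetical, and the Monte Carlo routine uses NumPy's `default_rng` with a fixed seed for repeatability.

```python
import numpy as np

def customary_variance(strata, weights):
    """Usual estimator sum_h W_h^2 * s_h^2 / n_h for the stratified mean."""
    return sum(w**2 * np.var(s, ddof=1) / len(s)
               for s, w in zip(strata, weights))

def naive_bootstrap_exact(strata, weights):
    """Closed-form naive bootstrap variance of the stratified mean."""
    return sum(w**2 / len(s) * (len(s) - 1) / len(s) * np.var(s, ddof=1)
               for s, w in zip(strata, weights))

def naive_bootstrap_mc(strata, weights, g=lambda m: m, B=4000, seed=0):
    """Monte Carlo naive bootstrap variance of g(sum_h W_h * ybar_h*)."""
    rng = np.random.default_rng(seed)
    strata = [np.asarray(s, dtype=float) for s in strata]
    W = np.asarray(weights, dtype=float)
    theta_star = np.empty(B)
    for b in range(B):
        # Resample n_h values with replacement, independently in each stratum
        means = [rng.choice(s, size=len(s), replace=True).mean() for s in strata]
        theta_star[b] = g(np.sum(W * np.array(means)))
    return theta_star.var(ddof=1)
```

With two PSUs in every stratum, the closed-form bootstrap variance is exactly half the customary estimate, however many strata there are.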
The Bootstrap
Modified bootstrap
Resample $\{y_{hi}^*\}_{i=1}^{m_h}$, $m_h \geq 1$, with replacement in stratum $h$.
Calculate:
$$\tilde y_{hi} = \bar y_h + \frac{m_h^{1/2}}{(n_h - 1)^{1/2}} (y_{hi}^* - \bar y_h)$$
$$\tilde y_h = \sum_{i=1}^{m_h} \tilde y_{hi} / m_h, \quad \tilde y = \sum_h W_h \tilde y_h, \quad \tilde\theta = g(\tilde y)$$

Variance:
$$\hat V_{MBS}(\tilde\theta) = E_*\left(\tilde\theta - E_*(\tilde\theta)\right)^2$$
– Can be approximated with Monte Carlo
– For the linear case, it reduces to the customary unbiased variance estimator

Here $m_h < n_h$.
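
A Monte Carlo version of the modified (rescaled) bootstrap might look like the sketch below. It assumes with-replacement sampling within strata (no finite population correction), known weights W_h, and user-chosen resample sizes m_h. The function name and argument layout are hypothetical.

```python
import numpy as np

def rescaled_bootstrap_variance(strata, weights, m, g=lambda v: v,
                                B=4000, seed=0):
    """Monte Carlo approximation to the modified bootstrap variance.

    strata  : list of 1-D sequences of values y_hi, one per stratum
    weights : stratum weights W_h
    m       : resample sizes m_h, one per stratum
    """
    rng = np.random.default_rng(seed)
    strata = [np.asarray(s, dtype=float) for s in strata]
    W = np.asarray(weights, dtype=float)
    theta_tilde = np.empty(B)
    for b in range(B):
        y_tilde_h = np.empty(len(strata))
        for h, s in enumerate(strata):
            n_h, m_h, ybar_h = len(s), m[h], s.mean()
            y_star = rng.choice(s, size=m_h, replace=True)
            # Rescale deviations by sqrt(m_h / (n_h - 1)) before averaging
            scale = np.sqrt(m_h / (n_h - 1))
            y_tilde_h[h] = np.mean(ybar_h + scale * (y_star - ybar_h))
        theta_tilde[b] = g(np.sum(W * y_tilde_h))
    return theta_tilde.var(ddof=1)
```

For a linear g, the result should be close to sum_h W_h^2 s_h^2 / n_h for any choice of m_h, up to Monte Carlo error.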
More on bootstrap

The method can be extended to stratified SRS without replacement by simply changing $\tilde y_{hi}$ to
$$\tilde y_{hi} = \bar y_h + \frac{m_h^{1/2}}{(n_h - 1)^{1/2}} (1 - f_h)^{1/2} (y_{hi}^* - \bar y_h)$$
where $f_h = n_h / N_h$ is the sampling fraction in stratum $h$.
– For $m_h = n_h - 1$, this method reduces to the naïve BS
– For $n_h = 2$, $m_h = 1$, the method reduces to the random half-sample replication method
– For $n_h > 3$, choice of $m_h$ … see Rao and Wu (1988)

Simulation
Rao and Wu (1988)
– The jackknife and linearization intervals showed substantial bias for nonlinear statistics in one-sided intervals
– The bootstrap performs best for one-sided intervals (especially when $m_h = n_h - 1$)
– For two-sided intervals, the three methods perform similarly in coverage probabilities
– The jackknife and linearization methods are more stable than the bootstrap
– B = 200 is sufficient
‘Hot’ topics
– Jackknife with non-smooth functions (Rao and Sitter 1996)
– Two-phase variance estimation (Graubard and Korn 2002; Rubin-Bleuer and Schiopu-Kratina 2005)
– Estimating function (EF) bootstrap method (Rao and Tausi 2004)

Software
OSIRIS – BRR, Jackknife
SAS – Linearization
Stata – Linearization
SUDAAN – Linearization, Bootstrap, Jackknife
WesVar – BRR, Jackknife, Bootstrap
References:
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.
Graubard, B. I., and Korn, E. L. (2002). Inference for superpopulation parameters using sample surveys. Statistical Science, 17, 73-96.
Krewski, D., and Rao, J. N. K. (1981). Inference from stratified samples: Properties of linearization, jackknife, and balanced replication methods. Annals of Statistics, 9, 1010-1019.
Quenouille, M. H. (1949). Problems in plane sampling. Annals of Mathematical Statistics, 20, 355-375.
Rao, J. N. K., and Wu, C. F. J. (1988). Resampling inference with complex survey data. JASA, 83, 231-241.
Rao, J. N. K., and Tausi, M. (2004). Estimating function variance estimation under stratified multistage sampling. Communications in Statistics, 33, 2087-2095.
Rao, J. N. K., and Sitter, R. R. (1996). Discussion of Shao's paper. Statistics, 27, 246-247.
Rubin-Bleuer, S., and Schiopu-Kratina, I. (2005). On the two-phase framework for joint model and design-based framework. Annals of Statistics (to appear).
Shao, J., and Tu, D. (1995). The Jackknife and Bootstrap. New York: Springer-Verlag.
Tukey, J. W. (1958). Bias and confidence in not-quite large samples. Annals of Mathematical Statistics, 29, 614.
Not cited in the presentation:
Wolter, K. M. (1985). Introduction to Variance Estimation. New York: Springer-Verlag.
Shao, J. (1996). Resampling methods in sample surveys. Invited paper, Statistics, 27, 203-237, with discussion, 237-254.
