Transcript Document

Evaluating Methods of Standard
Error Estimation for Use with
the Current Population Survey’s
Public Use Data
The Hawaii Coverage For All Technical Workshop
Honolulu, Hawaii
February 7, 2003
Presented by:
Michael Davern, Ph.D.
University of Minnesota
Division of Health Services Research and Policy
School of Public Health
Supported by a grant from The Robert Wood Johnson Foundation
This paper is a Work in Progress
• Paper is co-authored with:
James Lepkowski, University of Michigan
Gestur Davidson, University of Minnesota/SHADAC
Arthur Jones Jr., US Census Bureau
Lynn A. Blewett, University of Minnesota/SHADAC
• Estimates have not cleared final Census Review
– Estimates are therefore PRELIMINARY
– We hope to present it at AAPOR in May of 2003
The Problem:
• CPS is a complex survey
– Sample Design information is necessary to estimate
appropriate standard errors
– Important components of the sampling design are not
released to the public
• Public use data are widely used by policy-makers
and academics
– Significance tests in research are likely biased due to
inadequate standard error estimation
– These significance tests provide important rules for
“evidence” in the policy analysis and academic
literature
The Result:
• Thus what constitutes “evidence” in policy
analysis and academic journals, and the
inferences drawn from that evidence, may not be
valid
• In other words: What we know from research
using Census Bureau public use data products may
not be accurate enough to be useful
• In a quick search we found over 50 journal articles
in the top social science journals that used Census
Bureau public use data.
The Analysis:
• We identified four approaches to estimating
the standard error on the public use data
– The Simple Random Sample (SRS) approach
– Generalized variance parameter (GVP)
approach (Census Bureau’s Standard)
– Robust variance estimation (aka sandwich
estimator or Huber-White estimator)
– Taylor Series with a stratum and cluster variable
defined
The Data:
• The CPS uses a complex sampling design with the
following features:
– Country is divided into Primary Sampling Units
• A PSU is a county or group of contiguous counties
• “Self-representing” PSUs are Metro Areas that are selected with
certainty
• Non-self-representing PSUs are sampled through a stratification
process within each state
– Within PSUs, groups of housing units are identified
and called Ultimate Sampling Units (USUs)
The Data:
– On average 4 housing units are selected from a
USU using a systematic sampling method
– Information is collected on everyone within a
selected household
– Due to the rotation schedule, about 45 percent of
the households interviewed in a given month of the
CPS were also interviewed in the same month of the
previous year.
The Variables and Standard Error
Estimation
• We estimate state rates of health insurance coverage
and poverty, and state average income
• We estimate the standard errors for these
rates/averages in the following manner:
– SRS uses normalized weights and conventional
calculations to determine standard errors
– GVP approach uses the parameters in the Source and
Accuracy Statement from the Census Bureau to correct
for the complex sampling design (this is the technique
used by the Census Bureau)
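As a rough illustration of the SRS and GVP calculations described above, the sketch below computes an SRS-style standard error for a weighted proportion and a GVP-style standard error for a percentage. The function names, the toy data, and the values of `b` and `base` are hypothetical; real `b` parameters come from the Source and Accuracy Statement. The GVP expression sqrt((b / x) * p * (100 - p)) is the general form that statement gives for percentages, and the sketch assumes no further state-level adjustment factors.

```python
import math

def srs_se_proportion(y, w):
    """SRS-style SE of a weighted proportion.

    Weights are normalized to sum to the sample size n, then the
    conventional binomial formula sqrt(p * (1 - p) / n) is applied.
    """
    n = len(y)
    scale = n / sum(w)                      # normalization factor
    w_norm = [wi * scale for wi in w]       # normalized weights sum to n
    p = sum(wi * yi for wi, yi in zip(w_norm, y)) / n
    return p, math.sqrt(p * (1.0 - p) / n)

def gvp_se_percent(p_pct, base, b):
    """GVP-style SE of a percentage (in percentage points).

    p_pct : estimated percentage (0-100)
    base  : weighted number of persons in the denominator
    b     : generalized variance parameter (hypothetical value below)
    """
    return math.sqrt((b / base) * p_pct * (100.0 - p_pct))

# Toy data: 1 = insured, 0 = uninsured, with made-up person weights.
y = [1, 1, 1, 0]
w = [1000.0, 1000.0, 2000.0, 1000.0]
p_hat, se_srs = srs_se_proportion(y, w)

# Hypothetical inputs: 90 percent covered, base of 1,000,000 persons, b = 2500.
se_gvp = gvp_se_percent(90.0, 1_000_000.0, 2500.0)

print(p_hat, se_srs, se_gvp)
```

Neither calculation uses any cluster or stratum information: the SRS version sees only the sample size, and the GVP version sees only the published parameter and the size of the base.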
Standard Error Estimation
– Robust standard errors use the person weights to account
for the degree of heterogeneity in the probability of
selection
– Taylor Series on the Public Use file uses the ‘Lowest’
level of identifiable geography as the stratum variable
and household as the cluster variable
• Lowest level of identifiable geography is either:
– (1) largest 250 MSAs,
– (2) Other counties with over 100,000 in population,
– (3) non-MSA and non-identified county within a state
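The robust and Taylor Series approaches can both be sketched with one linearization routine for a weighted mean. This is a minimal sketch, not the code used for the estimates in this presentation: the function name, the toy data, and the decision to skip strata with a single cluster are all assumptions. The "robust" call treats every person as their own cluster in a single stratum, which is one common way to write the Huber-White estimator for a weighted mean; the "Taylor Series" call passes a stratum variable and uses the household as the cluster.

```python
import math
from collections import defaultdict

def linearized_se_mean(y, w, stratum, cluster):
    """Taylor Series (linearized) SE of the weighted mean sum(w*y)/sum(w)."""
    total_w = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    # Linearized score for each person
    z = [wi * (yi - mean) / total_w for wi, yi in zip(w, y)]
    # Add up scores within each cluster, separately by stratum
    totals = defaultdict(lambda: defaultdict(float))
    for zi, h, c in zip(z, stratum, cluster):
        totals[h][c] += zi
    var = 0.0
    for clusters in totals.values():
        n_h = len(clusters)
        if n_h < 2:
            continue  # assumption: a single-cluster stratum adds no variance
        zbar = sum(clusters.values()) / n_h
        var += n_h / (n_h - 1) * sum((t - zbar) ** 2 for t in clusters.values())
    return mean, math.sqrt(var)

# Toy data: two households whose members agree perfectly (hypothetical).
y = [1, 1, 0, 0]
w = [1.0, 1.0, 1.0, 1.0]
household = [1, 1, 2, 2]
geography = ["A", "A", "A", "A"]   # stand-in for lowest identifiable geography

# Robust: one stratum, each person is their own cluster
_, se_robust = linearized_se_mean(y, w, ["all"] * len(y), list(range(len(y))))
# Taylor Series on the public use file: geography as stratum, household as cluster
_, se_taylor = linearized_se_mean(y, w, geography, household)

print(se_robust, se_taylor)
```

In this toy case the household-clustered standard error is larger than the robust one because members of the same household carry the same information, which is the pattern the clustered approach is meant to capture.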
The Standard Error “Standard”
• Ultimate Cluster Method is the current standard
way to estimate standard errors for survey data
– Taylor series combined with an identified ultimate
cluster and stratum variable
– The Ultimate cluster for the CPS is the PSU
– We used the Census internal data that has the PSU
identifiers
• In the Taylor Series the state is the stratum and the PSU is the
cluster (except DC)
These Results are Preliminary
and Subject to Internal Census
Bureau Review
Please do not cite our work without
permission
Table 1
State Health Insurance Coverage Rates and Standard Error Computation Comparisons by Year: 2001

                                             Percent Change From SRS Method
                  2001       Simple Random   Robust       Taylor Series   Generalized   Taylor Series
                  Coverage   Sample (SRS)    Variance     On Public Use   Variance      On Internal
                  Estimate   Standard Error  Estimation   File            Estimation    Census File
United States     85.39%     0.08%           21.83%       81.49%          -7.34%        567.74%
Hawaii            90.4%      0.53%           5.68%        56.74%          -15.39%       -4.75%
Rhode Island      92.3%      0.45%           10.97%       58.56%          -21.62%       -2.24%
Vermont           90.4%      0.52%           22.29%       49.80%          -21.60%       1.69%
Illinois          86.4%      0.39%           9.12%        58.40%          -12.05%       335.20%
New York          84.5%      0.34%           7.70%        50.80%          -17.40%       524.75%
California        80.5%      0.30%           8.96%        73.53%          -5.51%        542.96%
Average Change                               8.18%        54.18%          -16.92%       137.71%

Source: 2002 Current Population Survey Annual Demographic Supplements
Table 2
State Poverty Rates and Standard Error Computation Comparisons by Year: 2001

                                             Percent Change From SRS Method
                  2001       Simple Random   Robust       Taylor Series   Generalized   Taylor Series
                  Poverty    Sample (SRS)    Variance     On Public Use   Variance      On Internal
                  Rate       Standard Error  Estimation   File            Estimation    Census File
United States     11.67%     0.07%           19.92%       107.39%         101.68%       335.16%
Arizona           14.6%      0.64%           4.59%        91.37%          95.69%        -12.83%
Iowa              7.4%       0.44%           8.65%        61.38%          81.98%        10.09%
Indiana           8.5%       0.45%           8.45%        78.05%          72.58%        49.52%
Hawaii            11.4%      0.57%           4.84%        86.49%          84.15%        191.51%
Michigan          9.4%       0.37%           6.50%        70.03%          78.60%        407.62%
Pennsylvania      9.6%       0.34%           7.41%        82.21%          79.53%        444.93%
New York          14.2%      0.32%           4.28%        76.41%          79.77%        516.39%
Average Change                               6.56%        76.66%          80.82%        189.57%

Source: 2002 Current Population Survey Annual Demographic Supplements
Table 3
Average State Earned Income and Standard Error Computation Comparisons by Year: 2001

                                             Percent Change From SRS Method
                  2001       Simple Random   Robust       Taylor Series   Generalized   Taylor Series
                  Average    Sample (SRS)    Variance     On Public Use   Variance      On Internal
                  Income     Standard Error  Estimation   File            Estimation    Census File
United States     29,089     120             0.59%        23.04%          199.18%       206.66%
Hawaii            26,607     717             0.03%        4.40%           69.15%        7.44%
Arkansas          22,448     708             3.28%        2.89%           209.33%       19.25%
Oklahoma          23,850     741             -1.84%       -2.40%          161.53%       37.23%
Delaware          31,758     993             9.62%        9.32%           35.49%        256.51%
New York          30,778     677             8.99%        17.55%          160.20%       315.80%
Washington        29,413     680             4.70%        8.39%           352.57%       339.78%
Average Change                               5.82%        7.16%           152.73%       122.73%

Source: 2002 Current Population Survey Annual Demographic Supplements
Findings
• Health Insurance Coverage on Average:
– Robust is 8% larger than SRS
– Taylor Series public use file is 54% larger than
SRS
– GVP is 17% smaller than SRS
– Taylor Series on internal file is 138% larger than
SRS
Findings
• Percent in Poverty on Average:
– Robust is 7% larger than SRS
– Taylor Series public use file is 77% larger than
SRS
– GVP is 81% larger than SRS
– Taylor Series on internal file is 190% larger than
SRS
Findings
• Individual (adult) Income on Average:
– Robust is 6% larger than SRS
– Taylor Series public use file is 7% larger than
SRS
– GVP is 153% larger than SRS
– Taylor Series on internal file is 123% larger than
SRS
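To make the "percent larger than SRS" figures concrete, the sketch below converts a percent change back into an implied standard error, using the preliminary United States health insurance row of Table 1 (SRS standard error 0.08 percentage points, GVP -7.34%, Taylor Series on the internal file +567.74%). The function names are hypothetical, and the implied values inherit the rounding of the table entries.

```python
def pct_change_from_srs(se_method, se_srs):
    """Percent change of a method's SE relative to the SRS SE."""
    return 100.0 * (se_method / se_srs - 1.0)

def implied_se(se_srs, pct_change):
    """SE implied by an SRS SE and a reported percent change from SRS."""
    return se_srs * (1.0 + pct_change / 100.0)

se_srs_us = 0.08                                   # percentage points
se_gvp_us = implied_se(se_srs_us, -7.34)           # GVP approach
se_internal_us = implied_se(se_srs_us, 567.74)     # Taylor Series, internal file

# Half-widths of approximate 95 percent confidence intervals
half_width_srs = 1.96 * se_srs_us
half_width_internal = 1.96 * se_internal_us

print(se_gvp_us, se_internal_us, half_width_srs, half_width_internal)
```

Under these inputs the 95 percent interval around the national coverage rate widens from roughly ±0.16 to roughly ±1.05 percentage points when the SRS standard error is replaced by the one from the internal file.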
Discussion
• GVPs are all over the board compared to the Standard Error
“Standard”
– GVP Std. Errors are too high for income, too low for poverty, and
way too low for health insurance
• Robust Std. Error estimates are consistently too small
– The main cause of standard error inflation is not differential
probability of selection but rather intra-cluster correlation
• To the extent households have a high intra-cluster
correlation, the Taylor Series is better than the 3 other
public use file estimates
– Poverty and health insurance have high intra-household correlations
but not individual income
Discussion
• Larger states are likely to have increased numbers of PSUs
in the Census Internal file than are recognized in the Public
Use File (where we only see their aggregation)
• By their very construction, the increased number of PSUs
results in more “within-PSU” homogeneity being recognized:
– States with more PSUs in the internal data have much higher Std.
Errors (using the “Standard”) than are currently being estimated
– Greater homogeneity within PSUs or households reduces the
“effective” sample size (there is less ‘independent’ information than
the full sample size would suggest)
• The consequences are especially pronounced for health insurance
and poverty estimates, as expected.
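The link between within-cluster homogeneity and "effective" sample size can be sketched with Kish's design effect, deff = 1 + (m - 1) * rho, where m is the average cluster size and rho is the intra-cluster correlation. This is a textbook approximation that assumes equal-sized clusters, not a calculation from this analysis; the function names and the values of m, rho, and n are hypothetical. The last function treats the squared ratio of a complex-design SE to the SRS SE as an approximate design effect.

```python
import math

def kish_deff(avg_cluster_size, rho):
    """Kish design effect for equal-sized clusters."""
    return 1.0 + (avg_cluster_size - 1.0) * rho

def effective_sample_size(n, deff):
    """Sample size of an SRS carrying the same information."""
    return n / deff

def deff_from_pct_change(pct_change_from_srs):
    """Approximate design effect implied by an SE that is
    pct_change_from_srs percent larger than the SRS SE."""
    ratio = 1.0 + pct_change_from_srs / 100.0
    return ratio ** 2

# Hypothetical household clustering: 2.5 persons per household, rho = 0.8
deff_household = kish_deff(2.5, 0.8)
se_inflation = math.sqrt(deff_household)           # SE multiplier vs. SRS
n_eff = effective_sample_size(10_000, deff_household)

# Average health insurance change for Taylor Series on the internal file
deff_internal = deff_from_pct_change(137.71)

print(deff_household, se_inflation, n_eff, deff_internal)
```

Read this way, a standard error that is about 138 percent larger than SRS corresponds to a design effect near 5.7, so the sample carries roughly the information of an SRS less than one fifth its size.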
Conclusion
• Census is not going to release PSU identifiers to
the public
• The data are widely used for important policy and
academic research
– Work done on the public use file has biased standard
errors, so its inferences may not meet the
statistical standard for evidence
• Therefore, I feel it is the responsibility of the
Census Bureau to improve its GVPs or come up
with a better substitute
– What is currently offered is inadequate
SHADAC Contact Information
www.shadac.org
2221 University Avenue, Suite 345
Minneapolis Minnesota 55414
(612) 624-4802
Principal Investigator: Lynn Blewett, Ph.D.
Co-Principal Investigator: Kathleen Call, Ph.D.
Center Director: Kelli Johnson, M.B.A.
Senior Research Associate: Timothy Beebe, Ph.D.
Research Associate: Michael Davern, Ph.D.