A ‘Microdata for Research’ sample from a New Zealand census Mike Camden [email protected] Statistics New Zealand www.stats.govt.nz.


A ‘Microdata for Research’ sample
from a New Zealand census
Mike Camden
[email protected]
Statistics New Zealand
www.stats.govt.nz
1
Researchers accessing microdata in NZ:
The researchers
• Datalab: output
• Offsite: dataset
• Remote Access: code, output
• CURFs: CURF on CD
[Diagram labels: detail in data; Govt only; ease of access]
2
The census dataset
About 100 output variables (all categorical):
ID; Geographic Location; Family & Household;
Demographics: Sex, Age, Residence, Ethnicity, Origin, Income, Employment
3 820 749 people (the census-night population)
The CURF is a subset of this …
where CURF = Confidentialised Unit Record File
3
The census dataset and its CURF
About 100 output variables (all categorical):
ID; Geographic Location; Family & Household;
Demographics: Sex, Age, Residence, Ethnicity, Origin, Income, Employment
3 820 749 people (the census-night population)
The CURF: 33 variables, all categorical, some collapsed; 250 values modified; 76 415 people (2%)
We want this CURF to be both useful and safe!
But there are several possible results …
4
The possible results …
[Diagram: Usability plotted against Safety, giving four quadrants]
Useful, Unsafe     Useful, Safe
Useless, Unsafe    Useless, Safe
We hope we’ve got ours up here! (the Useful, Safe corner)
5
How to get safety (we hope)
and not lose usefulness:
• Choose the variables carefully,
but location and household are sad losses
• Collapse variables carefully, to preserve important groups
• Choose a small sample size, and still get good estimates
• Use Special Uniques to find rogue records and variables,
and change a tiny fraction of the dataset
6
Example: carefully collapsed categories:
AgeGroup has:
a few (8) large (5% +) categories
useful life-stage categories
7
For future census CURFs,
we’ll rethink:
• Including location and household variables, at expense of others
• Collapsing of categories
• Sample size
8
One measure of usefulness:
reliability of counts
• Here’s a cell in a table: 5% of the population is in it.
• The CURF will give an answer ± its sampling error: 5% ± 0.08%.
• This is what the sampling error looks like for other population %’s:
[Diagram labels: 5%, 100%]
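The ± 0.08% quoted for a 5% cell can be checked with the usual standard error of a sample proportion. This is only a minimal sketch: it assumes simple random sampling without replacement (the CURF's own sampling design is more controlled), and the function name `se_proportion` is illustrative, not something from the slides.

```python
import math

N = 3_820_749   # census-night population (from the slides)
n = 76_415      # CURF size, about 2% of N (from the slides)

def se_proportion(p, n, N=None):
    """Standard error of a sample proportion under simple random sampling.

    If N is given, the finite population correction (N - n) / (N - 1) is applied.
    """
    var = p * (1 - p) / n
    if N is not None:
        var *= (N - n) / (N - 1)
    return math.sqrt(var)

se = se_proportion(0.05, n, N)
print(f"5% cell: SE = {100 * se:.2f} percentage points")
```

Calling `se_proportion` with other values of `p` gives the sampling error for cells of other sizes.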
9
Let’s fix curf size at 2%:
• Let pop proportion go from 0% to 5%:
What happens to sampling error of p?
[Figure: Expected proportion p ± its SE (axis 0 to 0.06) against population proportion k/N (axis 0 to 0.06)]
A 2% sample, with 76 000 people, gives good estimates
10
The CURF expresses NZ’s diversity:
• We have 5 Yes/No Ethnicity variables
• Special Uniques process set some to No
• The CURF adds some sampling error
• 5 variables give 32 (= 2^5) combinations …
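The count of 32 can be confirmed by listing every Yes/No pattern across the five variables. A small sketch, assuming one letter per variable in the order E, M, P, A, O (the letters used as CombEth labels in the table of combined ethnicities); the empty label stands for "No" on all five.

```python
from itertools import product

codes = "EMPAO"  # one letter per Yes/No ethnicity variable (order assumed)

combos = []
for flags in product([False, True], repeat=len(codes)):
    # Keep the letter of every variable that is set to Yes
    label = "".join(c for c, on in zip(codes, flags) if on)
    combos.append(label)

print(len(combos))
print(sorted(combos, key=lambda s: (len(s), s)))
```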
11
The 32 combined ethnicities:
CombEth  Census %   CURF %     Diff
(blank)    4.1568   4.1890   0.0322
E         69.8481  69.9431   0.0949
M          7.7240   7.7302   0.0062
P          4.3746   4.3473  -0.0273
A          5.9069   5.9962   0.0893
O          0.5608   0.5889   0.0281
EM         5.0730   5.0186  -0.0544
EP         0.7895   0.7289  -0.0606
EA         0.3374   0.3311  -0.0063
EO         0.1005   0.0851  -0.0154
MP         0.4090   0.4410   0.0320
MA         0.0528   0.0327  -0.0200
MO         0.0066   0.0052  -0.0014
PA         0.0751   0.0694  -0.0058
PO         0.0032   0.0039   0.0007
AO         0.0058   0.0013  -0.0045
EMP        0.3697   0.3128  -0.0570
EMA        0.0894   0.0746  -0.0148
EMO        0.0199   0.0236   0.0036
EPA        0.0397   0.0249  -0.0148
EPO        0.0032   0.0039   0.0007
EAO        0.0027   0.0013  -0.0014
MPA        0.0199   0.0131  -0.0069
MPO        0.0005   0.0013   0.0008
MAO        0.0003   0.0013   0.0010
PAO        0.0002   0.0000  -0.0002
EMPA       0.0254   0.0249  -0.0005
EMPO       0.0020   0.0052   0.0033
EMAO       0.0018   0.0013  -0.0005
EPAO       0.0005   0.0000  -0.0005
MPAO       0.0004   0.0000  -0.0004
EMPAO      0.0002   0.0000  -0.0002
Callouts on the table:
Single ethnicities only
This variable makes some of us unique.
Differences come from
- Special Uniques process
- Sampling error
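One rough way to judge the size of these differences is to compare each one with the sampling error of its cell. The sketch below does this for three rows copied from the table above. It assumes simple random sampling with a binomial standard error, which is a simplification: the CURF's controlled sampling and the Special Uniques process are not modelled.

```python
import math

n = 76_415  # CURF size (from the slides)

# (Census %, CURF %) for three rows copied from the table
rows = {"E": (69.8481, 69.9431), "M": (7.7240, 7.7302), "EM": (5.0730, 5.0186)}

z_scores = {}
for label, (census_pct, curf_pct) in rows.items():
    p = census_pct / 100
    se_pct = 100 * math.sqrt(p * (1 - p) / n)   # SE in percentage points
    diff = curf_pct - census_pct
    z_scores[label] = diff / se_pct
    print(f"{label}: diff = {diff:+.4f}, SE = {se_pct:.4f}, diff/SE = {z_scores[label]:+.2f}")
```

Under these assumptions, the three differences all sit within one standard error of zero.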
12
Census curf unique records
• 74.4% of records for NZ adults
have unique combinations of values
across all 33 vars
They’re Population Uniques
• If someone is unique in the CURF, are
they also unique in the population?
What is Pr(PU|SU) ?
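Pr(PU|SU) can be explored by simulation: draw a sample, find the records that are unique in it, and see what share are also unique in the population. The sketch below uses a purely synthetic population of integer keys from a skewed distribution. That population, the sample fractions and the function name are all assumptions for illustration; nothing here uses census data or the method behind the slides.

```python
import random
from collections import Counter

def pr_pu_given_su(population, fraction, rng):
    """Share of sample-unique records that are also population unique."""
    pop_counts = Counter(population)
    n = round(fraction * len(population))
    sample = rng.sample(population, n)          # simple random sample, no replacement
    sample_counts = Counter(sample)
    sample_uniques = [key for key, c in sample_counts.items() if c == 1]
    if not sample_uniques:
        return float("nan")
    pop_uniques = sum(1 for key in sample_uniques if pop_counts[key] == 1)
    return pop_uniques / len(sample_uniques)

rng = random.Random(1)
# Synthetic population: one key per record, drawn from a heavy-tailed distribution
population = [int(rng.paretovariate(1.0)) for _ in range(20_000)]

for fraction in (0.02, 0.2, 1.0):
    print(fraction, round(pr_pu_given_su(population, fraction, rng), 3))
```

When the sample is the whole population, every sample unique is a population unique, so the estimate is 1.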
13
Is a Sample Unique
also a Population Unique?
[Figure: Pr(PU|SU) (axis 0.6 to 1.0) vs Sample Size (%) (axis 0 to 100)]
14
We volunteered ten researchers
to assess safety and usefulness …
• Useful?
“no more variables needed”
“keep the household relationships”
“I’d like Region” “I’d never use Region”
“I get down to small numbers (employment by
ethnicity) and worry about small sample size”
“the CURF will be a major asset to researchers”
• Safe?
“I am quite confident that our identity
has been protected as much as possible”
15
How big a sample?
• 1% is Too Small
We have sample surveys like that already:
- Household Labour Force Survey
- Income Survey
- SoFIE
• 3% is Too Big
Disclosure risk up, variability down only a bit
• Whole-number %s are best
for the sampling method
Suggestions please???
16
Usability, Safety and Sample Size:
[Figure: Variability of Ests and Disclosure Risk vs Curf Size as % of Population. Curves: Variability, Disclosure Risk. x-axis: Curf Size (%), ticks 0.00 to 0.04, with existing surveys marked; y-axis: 0.00% to 0.10%]
17
Our ‘controlled’ sampling method:
We used:
A sort on Sex, AgeGroup (8 groups), AreaUnit (not in CURF)
then a grouping into 100s
then a systematic sample from each 100
This gives great proportions for these ‘controlled’ variables
and may help related variables (ethnicities, urban/rural etc)
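The sort, group and sample steps can be sketched as follows. The slides do not say how many records were taken from each 100 or how the start point was chosen. This sketch assumes 2 per 100 (to give a 2% sample) and a random start within each group, and it runs on synthetic records rather than census data. The names `controlled_sample`, `block_size` and `per_block` are illustrative only.

```python
import random

def controlled_sample(records, sort_key, block_size=100, per_block=2, rng=None):
    """Sort, cut into consecutive blocks, then systematically sample each block."""
    rng = rng or random.Random()
    ordered = sorted(records, key=sort_key)
    step = block_size // per_block
    sample = []
    for start in range(0, len(ordered), block_size):
        block = ordered[start:start + block_size]
        offset = rng.randrange(step)          # random start within the first interval
        sample.extend(block[offset::step])    # every step-th record after the start
    return sample

rng = random.Random(0)
# Synthetic people with the three 'controlled' variables
people = [
    {"sex": rng.choice(["Female", "Male"]),
     "age_group": rng.randrange(8),
     "area_unit": rng.randrange(50)}
    for _ in range(10_000)
]
curf = controlled_sample(
    people,
    sort_key=lambda r: (r["sex"], r["age_group"], r["area_unit"]),
    rng=rng,
)
print(len(curf), len(people))
```

Because each block of 100 consecutive sorted records contributes the same number of sampled records, the sample count for any run of the sort order can differ from 2% of its population count only at the blocks that straddle its boundaries.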
18
Diffs in Counts: CURF – Expected
Variation with random sampling
and independence; ±1, ±2 SDs
Controlled variables give tiny differences ( ≤ ±1)
Others show little drop in variation
19
Conclusions:
• Making a CURF both Useful and Safe needs:
Cunning
Contracts, Co-operation
Confidence
• Controlling the sampling improves counts:
for controlled variables: spectacularly
for other variables:
minimally!
• See
www.stats.govt.nz/CURF
20
The slides from here on are for background
A MiniCurf:
Sex     AgeGp  Tenure     Birthplace   Income(k$)  Unique??
Female  0-4    X          NZ           X           n
Female  0-4    X          NZ           X           n
Female  15-19  Don't own  Oceania      70+         y
Female  15-19  Own        UK+Ireland   Loss,Zero   y
Female  65+    NEI        OtherEurope  70+         y
Male    0-4    X          NZ           X           n
Male    0-4    X          NZ           X           n
Male    15-19  Don't own  Asia         25+-30      y
Male    25-34  Own        NZ           30+-40      y
Male    25-34  Own        NZ           NotStated   y
NEI = Not Elsewhere Indicated; X = Missing (structural)
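The Unique?? column can be reproduced by counting how often each full combination of values occurs. A minimal sketch, with the ten MiniCurf records typed in by hand as tuples; the variable names are illustrative.

```python
from collections import Counter

# The ten MiniCurf records: (Sex, AgeGp, Tenure, Birthplace, Income)
minicurf = [
    ("Female", "0-4",   "X",         "NZ",          "X"),
    ("Female", "0-4",   "X",         "NZ",          "X"),
    ("Female", "15-19", "Don't own", "Oceania",     "70+"),
    ("Female", "15-19", "Own",       "UK+Ireland",  "Loss,Zero"),
    ("Female", "65+",   "NEI",       "OtherEurope", "70+"),
    ("Male",   "0-4",   "X",         "NZ",          "X"),
    ("Male",   "0-4",   "X",         "NZ",          "X"),
    ("Male",   "15-19", "Don't own", "Asia",        "25+-30"),
    ("Male",   "25-34", "Own",       "NZ",          "30+-40"),
    ("Male",   "25-34", "Own",       "NZ",          "NotStated"),
]

counts = Counter(minicurf)
# A record is unique when its combination of values appears exactly once
unique_flags = ["y" if counts[rec] == 1 else "n" for rec in minicurf]
print(unique_flags)
# → ['n', 'n', 'y', 'y', 'y', 'n', 'n', 'y', 'y', 'y']
```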
21
2001 Census Statement of Confidentiality
Only people authorised by the Statistics
Act 1975 are allowed to see your
individual information.
They must use it only for statistical
purposes, such as the preparation of
summary statistics about groups.
We’re working within this.
22
Overseas Practice
Agency                           Filename                             Sample size             Vars  Pop
Office for National Statistics   Sample of Anonymised Records (SARs)  2% (individuals)        48    60M
                                                                      1% (households)         45
US Census Bureau                 Public Use Microdata Files (PUMs)    1% and 5% (households)  all   280M
Statistics Canada                Public Use Microdata Files (PUMFs)   2.8% (individuals)      122   30M
Australian Bureau of Statistics  CURF                                 1% (dwellings, indiv)   69    19M
                                                                      1% (non-pvte, indiv)    39
23
How big are NZ’s Area Units??
[Figure: Area Units: frequency distribution of population size. x-axis: population size, in classes 0-, 200-, 400- etc, ticks 0 to 8000]
Mean = 2 000
There are lots of very small population Area Units
24
Diffs in Counts: CURF – Expected
• Counts in the cells for controlled variables:
Sex* AgeGroup * AreaUnit are ≤ ±1 out
25
Diffs in Counts: CURF – Expected
[Figure: Difference vs Expected for selected variables. Series: Sex*Age*Income. x-axis: Expected, 0 to 6000; y-axis: Difference, -200 to 200]
IncomeGroup is related to Sex and AgeGroup
but is still quite variable
26
Curf behaviour: Tables of counts
Counts by Tenure and Sex (from the MiniCurf)
Tenure\Sex  Female  Male
Don't own   1       2
NEI         1       0
Own         1       1
X           2       2
• For the population we have:
Population size = N
Sample size = n
• For any cell we have:
Population count = k
Sample count = x, and let p = x/n
• So x behaves as a …. Ummmmm …
27
… is x a hypergeometric??
[Diagram: the population (N) contains the people with the property of interest (k); the curf (n) contains x of them]
28
What happens to sampling error
of p as curf size increases?
• If we believe:
x is hypergeometric; parameters N, n, k
x is approx binomial, parameters k/N, n
x is approx Poisson, parameter nk/N
(when k/N is small)
• Let p = x/n = proportion of curf in cell.
Then SE(p) = expression /√n
so it declines gracefully as n increases
[Figure: SE of p (for k/N = .01) against curf size (sample fraction n/N, 0 to 0.06); y-axis 0 to 0.0005]
29
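The three models for x can be compared directly through their standard variances: n·P·(1−P)·(N−n)/(N−1) for the hypergeometric, n·P·(1−P) for the binomial, and n·P for the Poisson, where P = k/N. The sketch below takes k/N = .01 as in the plot and the census-night population as N. The function name `se_p` and the chosen sample fractions are assumptions for illustration.

```python
import math

N = 3_820_749          # census-night population (from the slides)
k = round(0.01 * N)    # a cell holding 1% of the population (k/N = .01)

def se_p(n, model):
    """SE of p = x/n when x is hypergeometric, binomial or Poisson."""
    P = k / N
    if model == "hypergeometric":
        var_x = n * P * (1 - P) * (N - n) / (N - 1)
    elif model == "binomial":
        var_x = n * P * (1 - P)
    elif model == "poisson":
        var_x = n * P
    else:
        raise ValueError(model)
    return math.sqrt(var_x) / n

for frac in (0.01, 0.02, 0.03, 0.05):
    n = round(frac * N)
    print(frac, [f"{se_p(n, m):.6f}" for m in ("hypergeometric", "binomial", "poisson")])
```

For a small cell and a small sample fraction the three values are close, and each falls roughly in proportion to 1/√n as the curf grows.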