A ‘Microdata for Research’ sample from a New Zealand census Mike Camden [email protected] Statistics New Zealand www.stats.govt.nz.
Slide 2: Researchers accessing microdata in NZ
The researchers:
• Datalab output
• Offsite dataset
• Remote Access code output
• CURFs CURF on CD
[Diagram labels: detail in data; Govt only; ease of access]

Slide 3: The census dataset
About 100 output variables (all categorical):
• ID
• Geographic Location
• Family & Household
• Demographics: Sex, Age, Residence, Ethnicity, Origin, Income, Employment
3 820 749 people (the census-night population)
The CURF is a subset of this … where CURF = Confidentialised Unit Record File

Slide 4: The census dataset and its CURF
The census dataset:
• About 100 output variables (all categorical): ID, Geographic Location, Family & Household, Demographics: Sex, Age, Residence, Ethnicity, Origin, Income, Employment
• 3 820 749 people (the census-night population)
The CURF:
• 33 variables: all categorical, some collapsed
• 250 values modified
• 76 415 people; 2%
We want this CURF to be both useful and safe! But there are several possible results …

Slide 5: The possible results …
[Diagram: four quadrants on Usability and Safety axes: Useful, Unsafe; Useless, Unsafe; Useful, Safe; Useless, Safe]
We hope we’ve got ours up here!

Slide 6: How to get safety (we hope) and not lose usefulness:
• Choose the variables carefully
  but location and household are sad losses
• Collapse variables carefully
  to preserve important groups
• Choose a small sample size
  and still get good estimates
• Use Special Uniques to find rogue records and variables
  and change a tiny fraction of the dataset

Slide 7: Example: carefully collapsed categories
AgeGroup has:
• a few (8) large (5%+) categories
• useful life-stage categories

Slide 8: For future census CURFs, we’ll rethink:
• Including location and household variables at the expense of others
• Collapsing of categories
• Sample size

Slide 9: One measure of usefulness: reliability of counts
• Here’s a cell in a table: 5% of the population is in it.
• The CURF will give an answer ± its sampling error: 5% ± 0.08%.
• This is what the sampling error looks like for other population %’s:
[Figure: sampling error for population percentages up to 100%]

Slide 10: Let’s fix CURF size at 2%:
• Let pop proportion go from 0% to 5%: What happens to sampling error of p?
[Figure: Expected proportion p ± its SE (vertical axis, 0 to 0.06) vs Population proportion k/N (horizontal axis, 0 to 0.06)]
A 2% sample, with 76 000 people, gives good estimates

Slide 11: The CURF expresses NZ’s diversity:
• We have 5 Yes/No Ethnicity variables
• Special Uniques process set some to No
• The CURF adds some sampling error
• 5 variables give 32 (= 2^5) combinations …

Slide 12: The 32 combined ethnicities:
CombEth   Census %   CURF %    Diff
(none)     4.1568     4.1890    0.0322
E         69.8481    69.9431    0.0949
M          7.7240     7.7302    0.0062
P          4.3746     4.3473   -0.0273
A          5.9069     5.9962    0.0893
O          0.5608     0.5889    0.0281
EM         5.0730     5.0186   -0.0544
EP         0.7895     0.7289   -0.0606
EA         0.3374     0.3311   -0.0063
EO         0.1005     0.0851   -0.0154
MP         0.4090     0.4410    0.0320
MA         0.0528     0.0327   -0.0200
MO         0.0066     0.0052   -0.0014
PA         0.0751     0.0694   -0.0058
PO         0.0032     0.0039    0.0007
AO         0.0058     0.0013   -0.0045
EMP        0.3697     0.3128   -0.0570
EMA        0.0894     0.0746   -0.0148
EMO        0.0199     0.0236    0.0036
EPA        0.0397     0.0249   -0.0148
EPO        0.0032     0.0039    0.0007
EAO        0.0027     0.0013   -0.0014
MPA        0.0199     0.0131   -0.0069
MPO        0.0005     0.0013    0.0008
MAO        0.0003     0.0013    0.0010
PAO        0.0002     0.0000   -0.0002
EMPA       0.0254     0.0249   -0.0005
EMPO       0.0020     0.0052    0.0033
EMAO       0.0018     0.0013   -0.0005
EPAO       0.0005     0.0000   -0.0005
MPAO       0.0004     0.0000   -0.0004
EMPAO      0.0002     0.0000   -0.0002
[Annotation on the table: Single ethnicities only]
This variable makes some of us unique.
Differences come from:
- Special Uniques process
- Sampling error

Slide 13: Census CURF unique records
• 74.4% of records for NZ adults have unique combinations of values across all 33 vars
  They’re Population Uniques
• If someone is unique in the CURF, are they also unique in the population?
  What is Pr(PU|SU)?

Slide 14: Is a Sample Unique also a Population Unique?
[Figure: Pr(PU|SU) (vertical axis, 0.6 to 1.0) vs Sample Size (%) (horizontal axis, 0 to 100)]

Slide 15: We volunteered ten researchers to assess safety and usefulness …
• Useful?
  “no more variables needed”
  “keep the household relationships”
  “I’d like Region”
  “I’d never use Region”
  “I get down to small numbers (employment by ethnicity) and worry about small sample size”
  “the CURF will be a major asset to researchers”
• Safe?
  “I am quite confident that our identity has been protected as much as possible”

Slide 16: How big a sample?
• 1% is Too Small
  We have sample surveys like that already:
  - Household Labour Force Survey
  - Income Survey
  - SoFIE
• 3% is Too Big
  Disclosure risk up, variability down only a bit
• Whole-number %s are best for the sampling method
Suggestions please???

Slide 17: Usability, Safety and Sample Size:
[Figure: Variability of Ests and Disclosure Risk (vertical axis, 0.00% to 0.10%) vs CURF Size as % of Population (horizontal axis, 0.00 to 0.04); series: Variability, Disclosure Risk; annotation: Existing surveys]

Slide 18: Our ‘controlled’ sampling method:
We used:
• A sort on Sex, AgeGroup (8 groups), AreaUnit (not in CURF)
• then a grouping into 100s
• then a systematic sample from each 100
This gives great proportions for these ‘controlled’ variables and may help related variables (ethnicities, urban/rural etc)

Slide 19: Diffs in Counts: CURF – Expected
[Figure: variation with random sampling and independence; ±1, ±2 SDs]
Controlled variables give tiny differences (≤ ±1)
Others show little drop in variation

Slide 20: Conclusions:
• Making a CURF both Useful and Safe needs: Cunning, Contracts, Co-operation, Confidence
• Controlling the sampling improves counts:
  for controlled variables: spectacularly
  for other variables: minimally!
• See www.stats.govt.nz/CURF

Slide 21: The slides from here on are for background
A MiniCurf:
Sex     AgeGp  Tenure     Birthplace   Income(k$)  Unique??
Female  0-4    X          NZ           X           n
Female  0-4    X          NZ           X           n
Female  15-19  Don't own  Oceania      70+         y
Female  15-19  Own        UK+Ireland   Loss,Zero   y
Female  65+    NEI        OtherEurope  70+         y
Male    0-4    X          NZ           X           n
Male    0-4    X          NZ           X           n
Male    15-19  Don't own  Asia         25+-30      y
Male    25-34  Own        NZ           30+-40      y
Male    25-34  Own        NZ           NotStated   y
NEI = Not Elsewhere Indicated; X = Missing (structural)

Slide 22: 2001 Census Statement of Confidentiality
Only people authorised by the Statistics Act 1975 are allowed to see your individual information. They must use it only for statistical purposes, such as the preparation of summary statistics about groups.
We’re working within this.

Slide 23: Overseas Practice
Agency                           Filename                                Sample size                Vars  Pop
Office for National Statistics   Sample of Anonymised Records (SARs)     2% (individuals)           48    60M
                                                                         1% (households)            45
US Census Bureau                 Public Use Microdata Files (PUMs)       1% and 5% (households)     all   280M
Statistics Canada                Public Use Microdata Files (PUMFs)      2.8% (individuals)         122   30M
Australian Bureau of Statistics  CURF                                    1% (dwellings, indiv)      69    19M
                                                                         1% (non-pvte, indiv)       39

Slide 24: How big are NZ’s Area Units??
[Figure: Area Units: frequency distribution of population size; horizontal axis: Population size, in classes 0-, 200-, 400- etc, up to 8000]
Mean = 2 000
There are lots of very small population Area Units

Slide 25: Diffs in Counts: CURF – Expected
• Counts in the cells for controlled variables: Sex * AgeGroup * AreaUnit are ≤ ±1 out

Slide 26: Diffs in Counts: CURF – Expected
[Figure: Difference vs Expected for selected variables; series: Sex*Age*Income; vertical axis: Difference (-200 to 200); horizontal axis: Expected (0 to 6000)]
IncomeGroup is related to Sex and AgeGroup but is still quite variable

Slide 27: CURF behaviour: Tables of counts
Counts by Tenure and Sex (from the MiniCurf):
Tenure\Sex  Female  Male
Don't own   1       2
NEI         1       0
Own         1       1
X           2       2
• For the population we have:
  Population size = N
  Sample size = n
• For any cell we have:
  Population count = k
  Sample count = x, and let p = x/n
• So x behaves as a … Ummmmm …

Slide 28: … is x a hypergeometric??
[Diagram: The population … N; The CURF … n; The people with the property of interest … k; their overlap with the CURF … x]

Slide 29: What happens to sampling error of p as CURF size increases?
• If we believe:
  x is hypergeometric; parameters N, n, k
  x is approx binomial; parameters k/N, n
  x is approx Poisson; parameter nk/N (when k/N is small)
• Let p = x/n = proportion of CURF in cell.
  Then SE(p) = expression /√n
  so it declines gracefully as n increases
[Figure: SE of p (for k/N = .01) (vertical axis, 0 to 0.0005) vs CURF size: sample fraction n/N (horizontal axis, 0 to 0.06)]
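
The slides leave SE(p) as "expression /√n". As a rough sketch, here is one way to fill it in with the textbook variance formulas for the three models named above. This is an illustration under the slides' stated assumptions, not necessarily the calculation Statistics New Zealand used. The function name se_of_p and the choice of a 5% cell are hypothetical; N and n are the figures quoted in the slides.

```python
import math

def se_of_p(N, n, k, model="hypergeometric"):
    """Standard error of p = x/n, the sample proportion in one table cell.

    N: population size, n: sample (CURF) size, k: population count in the cell.
    """
    P = k / N  # population proportion in the cell
    if model == "hypergeometric":
        # Sampling without replacement: includes the finite population correction.
        var_x = n * P * (1 - P) * (N - n) / (N - 1)
    elif model == "binomial":
        var_x = n * P * (1 - P)
    elif model == "poisson":
        # Only a reasonable approximation when k/N is small.
        var_x = n * P
    else:
        raise ValueError(f"unknown model: {model}")
    return math.sqrt(var_x) / n

N = 3_820_749        # census-night population (from the slides)
n = 76_415           # CURF size, about 2% of N (from the slides)
k = round(0.05 * N)  # illustrative cell holding 5% of the population

results = {m: se_of_p(N, n, k, m)
           for m in ("hypergeometric", "binomial", "poisson")}
for model, se in results.items():
    print(f"{model:>14}: SE(p) = {100 * se:.3f}%")
```

For this cell all three models give a standard error that rounds to 0.08%, in line with the "5% ± 0.08%" quoted earlier. The hypergeometric value is slightly the smallest because of the finite population correction.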