The availability of Dutch census microdata Eric Schulte Nordholt Senior researcher and project leader of the Census Statistics Netherlands Division Social and Spatial Statistics Department Support.

The availability of Dutch census microdata
Eric Schulte Nordholt
Senior researcher and project leader of the Census
Statistics Netherlands
Division Social and Spatial Statistics
Department Support and Development
Section Research and Development
Workshop on Communication and Dissemination of Census Results in Geneva
16 May 2008
Contents
• Historical introduction
• Registers used for the virtual census
• Micro linkage
• Social Statistical Database
• Publicity about Dutch censuses
• Harmonisation
• Microdata availability
• Statistical Disclosure Control
Historical introduction
Till 1899: Ministry of Home Affairs
1899: 8th Census
1971: 14th Census
Till 1995: more and more surveys
Last twelve years:
moving to a register-based statistical office
Reasons:
• Unwillingness (non-response)
• Reduction of response burden
• Reduction of expenses
Registers used for the virtual census
External registers (maintenance by register holders):
• Population Register (PR), 16 million records
demographic variables: sex, date of birth, marital status, country of birth etc.
• Fiscal administration (FIBASE),
jobs, 7.2 million records and
pensions and life insurance benefits, 2.7 million records
• Social Security administrations, 2 million records,
auxiliary information integration process
Internal registers (maintenance by Statistics Netherlands):
• Jobs file (employees), 6.5 million records and
• Self-employed persons, 790 thousand records
dates of job, branch of economic activity
• General Business Register, 600,000 records
size class, (economic) activity
• Housing Register, about 7 million records
housing variables
Micro linkage
• Linkage key:
Registers
Social security and Fiscal number (SoFi), unique
since 26 November 2007: Citizen Service Number
Surveys
Sex, date of birth,
address (postal code and house number)
• Linkage key replaced by RIN-person
• Linkage strategy
Optimizing number of matches
Minimizing number of mismatches and missed matches
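The linkage steps above can be sketched as follows. This is a minimal illustration, not the actual procedure of Statistics Netherlands: the record layouts, the helper names (`assign_rin`, `link_register`, `link_survey`) and the sequential RIN values are all hypothetical. Register records are matched on the SoFi number; survey records are matched on sex, date of birth, postal code and house number; after linkage the original key is dropped and only the RIN-person remains.

```python
from itertools import count

# Hypothetical Population Register records with made-up values.
population_register = [
    {"sofi": "111", "sex": "F", "dob": "1970-01-31", "postcode": "2490AA", "house_no": 1},
    {"sofi": "222", "sex": "M", "dob": "1965-06-15", "postcode": "2490AA", "house_no": 3},
]

SURVEY_KEY = ("sex", "dob", "postcode", "house_no")


def assign_rin(register):
    """Give each person a meaningless RIN-person and build both lookup tables."""
    rin_counter = count(1)
    by_sofi, by_survey_key = {}, {}
    for rec in register:
        rin = next(rin_counter)
        by_sofi[rec["sofi"]] = rin
        key = tuple(rec[k] for k in SURVEY_KEY)
        # A survey key is only usable if it points to exactly one person.
        by_survey_key[key] = rin if key not in by_survey_key else None
    return by_sofi, by_survey_key


def link_register(records, by_sofi):
    """Replace the SoFi number by RIN-person; unmatched records get rin=None."""
    linked = []
    for rec in records:
        out = {k: v for k, v in rec.items() if k != "sofi"}
        out["rin"] = by_sofi.get(rec["sofi"])
        linked.append(out)
    return linked


def link_survey(records, by_survey_key):
    """Replace the survey linkage key by RIN-person (key variables are dropped
    in this sketch; they remain available through the register)."""
    linked = []
    for rec in records:
        key = tuple(rec[k] for k in SURVEY_KEY)
        out = {k: v for k, v in rec.items() if k not in SURVEY_KEY}
        out["rin"] = by_survey_key.get(key)
        linked.append(out)
    return linked


by_sofi, by_survey_key = assign_rin(population_register)
jobs = link_register([{"sofi": "222", "branch": "retail"}], by_sofi)
lfs = link_survey(
    [{"sex": "F", "dob": "1970-01-31", "postcode": "2490AA", "house_no": 1, "employed": True}],
    by_survey_key,
)
```

Records that cannot be matched, or whose survey key is ambiguous, end up with `rin=None`, which makes missed matches easy to count when tuning the linkage strategy.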
Social Statistical Database
Social Statistical Database (SSD):
Set of integrated microdata files with coherent and detailed demographic and socio-economic data on persons, households, jobs and benefits
No remaining internally conflicting information
SSD-set:
• Population Register (backbone)
• Integrated jobs file
• Integrated file of (social and other) benefits
• Surveys, e.g. LFS
Combining element: RIN-person
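How RIN-person acts as the combining element can be illustrated with a simple join onto the Population Register backbone. This is only a sketch under strong assumptions: the function name `combine_on_rin` and all field names are hypothetical, and each file is assumed to hold at most one record per person, which is not true of a real jobs file.

```python
def combine_on_rin(backbone, *files):
    """Left-join each file onto the backbone by RIN-person.

    Records whose RIN-person is not in the backbone are ignored.
    """
    combined = {rec["rin"]: dict(rec) for rec in backbone}
    for file in files:
        for rec in file:
            if rec["rin"] in combined:
                combined[rec["rin"]].update(rec)
    return list(combined.values())


backbone = [{"rin": 1, "sex": "F"}, {"rin": 2, "sex": "M"}]
jobs_file = [{"rin": 2, "branch": "retail"}]
benefits_file = [{"rin": 1, "benefit": "pension"}, {"rin": 3, "benefit": "pension"}]

ssd = combine_on_rin(backbone, jobs_file, benefits_file)
```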
Publicity about Dutch censuses
The Dutch Virtual Census of 2001 was a successful alternative to a traditional census
Tables: http://www.cbs.nl/enGB/menu/themas/dossiers/volkstellingen/publicaties/2005virtual-dutch-census-art.htm
Book: http://www.cbs.nl/enGB/menu/themas/dossiers/volkstellingen/publicaties/2001-b57e-pub.htm
Harmonisation (1)
More information about the Dutch traditional censuses (including those of 1960 and 1971):
http://www.volkstellingen.nl/en/
For 1960 and 1971 the same variables as for 2001
• if not available: constructed based on existing variables in census data
Variables not internationally harmonised (e.g. sex, age, marital status, household position, country of birth, economic status, household size and country of citizenship)
• same classification and priority rules as for 2001
Harmonisation (2)
Household size and country of citizenship:
• missing for 1960
Religious denomination (philosophy of life):
• only for 1960 and 1971
Place of residence one year prior to the census:
• only for 2001
International classifications
• Branch of current economic activity: ISIC / NACE
• Occupation: ISCO-COM
• Level of educational attainment: ISCED
Harmonisation (3)
Variable                                          1960  1971  2001
Sex                                               X     X     X
Age                                               X     X     X
Country of citizenship                            -     X     X
Marital status                                    X     X     X
Household position                                X     X     X
Religious denomination                            X     X     -
Country of birth                                  X     X     X
Household size                                    -     X     X
Place of residence one year prior to the census   -     -     X
Economic status                                   X     X     X
Level of educational attainment                   X     X     X
Occupation                                        X     X     X
Branch of current economic activity               X     X     X
(X = available, - = not available)
Microdata availability
One percent samples for three years (1960, 1971 and 2001)
IPUMS (Integrated Public Use Microdata Series):
http://www.ipums.org/international/index.html
Weighting to population totals
Protecting according to rules for public use microdata files with Mu-ARGUS
Microdata sets for all three years available for research!
DANS (Data Archiving and Networked Services):
http://www.dans.knaw.nl/en/
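Weighting a sample to population totals can be illustrated with simple post-stratification. The slides do not say which weighting method was used, so this is an assumption; the strata, the counts and the function name `poststratify` are invented for the example.

```python
from collections import Counter


def poststratify(sample, population_totals, stratum_of):
    """Return one weight per sample record so that the weights add up to the
    known population total within each stratum."""
    sample_counts = Counter(stratum_of(rec) for rec in sample)
    return [
        population_totals[stratum_of(rec)] / sample_counts[stratum_of(rec)]
        for rec in sample
    ]


# Made-up sample and population totals.
sample = [{"sex": "F"}, {"sex": "F"}, {"sex": "M"}]
totals = {"F": 210, "M": 95}
weights = poststratify(sample, totals, lambda rec: rec["sex"])
```

In a one percent sample, weights of this kind would lie close to 100 for every stratum.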
Statistical Disclosure Control (1)
Microdata under contract (MUC):
1. No direct identifiers
2. Rule against spontaneous recognition: each combination of an extremely identifying variable, a very identifying variable and an identifying variable should occur at least 100 times in the population
3. Extension of this rule: maximum level of detail of some variables (occupation, level of education, branch of economic activity) is determined by the most detailed direct regional variable
4. Each region that can be distinguished in the microdata should contain at least 10,000 inhabitants
5. No direct regional variables in panel data
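Rule 2 above amounts to a frequency check on every E x V x I combination. The following sketch counts the combinations in a population file and reports those below the threshold. It is not how Mu-ARGUS implements the rule; the function name `rare_combinations`, the field names and the toy population are assumptions.

```python
from collections import Counter
from itertools import product


def rare_combinations(population, extremely, very, identifying, threshold=100):
    """Return {(E var, V var, I var): [value combinations that occur fewer
    than `threshold` times in the population]} for every violating triple."""
    violations = {}
    for e, v, i in product(extremely, very, identifying):
        counts = Counter((p[e], p[v], p[i]) for p in population)
        rare = sorted(combo for combo, n in counts.items() if n < threshold)
        if rare:
            violations[(e, v, i)] = rare
    return violations


# Toy population: 150 identical common records and 3 rare ones.
population = (
    [{"region": "A", "sex": "F", "age": 30}] * 150
    + [{"region": "B", "sex": "M", "age": 85}] * 3
)
result = rare_combinations(population, ["region"], ["sex"], ["age"])
```

A non-empty result signals that some variable in the triple needs recoding to fewer categories, or that some values need suppressing, before release under contract.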
Statistical Disclosure Control (2)
Identifying variables
• Direct (formal) identifiers
• Name, address, citizen service number, …
• Indirect identifiers, differentiated into
• Extremely identifying (E)
• Very identifying (V)
• Identifying (I)
[Diagram with labels E, V and I]
Statistical Disclosure Control (3)
Examples of identifying variables
• Extremely identifying:
• Regional variables (residence, work, …)
• Very identifying:
• Sex, nationality
+ Extremely identifying variables
• Identifying:
• Age, occupation, education
+ Very identifying variables
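The nesting on this slide (every extremely identifying variable also counts as very identifying, and every very identifying variable also counts as identifying) can be written down as a small lookup. The classification is taken from the examples above; the numeric levels and the function name `variables_at_least` are illustrative assumptions.

```python
# Higher number = more identifying.
LEVEL = {"E": 3, "V": 2, "I": 1}

# Classification taken from the examples on this slide.
CLASSIFICATION = {
    "region of residence": "E",
    "region of work": "E",
    "sex": "V",
    "nationality": "V",
    "age": "I",
    "occupation": "I",
    "education": "I",
}


def variables_at_least(level):
    """Variables whose own level is `level` or stronger."""
    return sorted(
        name for name, own in CLASSIFICATION.items() if LEVEL[own] >= LEVEL[level]
    )


very_identifying = variables_at_least("V")
```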
[Diagram with labels E, V and I]
Statistical Disclosure Control (4)
Public use microdata files:
1. Microdata must be at least one year old
2. No direct identifiers or direct regional variables
3. Only one kind of indirect regional variable. Values of indirect regional variables sufficiently scattered. Each area should contain at least 200,000 persons in the target population and should consist of municipalities from at least six of the twelve provinces. No dominating municipality in any area.
4. At most 15 indirect identifiers
5. No sensitive variables
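The area conditions in rule 3 can be checked mechanically, as in the sketch below. The slide does not define "dominating", so the `max_share` parameter (largest municipality holds at most half of the area's persons) is purely an assumption, as are the function name `check_area` and the placeholder municipality and province names.

```python
def check_area(municipalities, min_persons=200_000, min_provinces=6, max_share=0.5):
    """Check one area, given as a list of (name, province, persons) tuples.

    max_share is an ASSUMED definition of 'dominating'; the source gives none.
    """
    total = sum(persons for _, _, persons in municipalities)
    provinces = {province for _, province, _ in municipalities}
    largest = max(persons for _, _, persons in municipalities)
    return {
        "enough_persons": total >= min_persons,
        "enough_provinces": len(provinces) >= min_provinces,
        "no_dominating_municipality": largest <= max_share * total,
    }


# Six equally sized municipalities, each in a different (placeholder) province.
scattered_area = [(f"M{i}", f"P{i}", 40_000) for i in range(1, 7)]
# One large municipality on its own.
single_city = [("M1", "P1", 300_000)]

scattered_check = check_area(scattered_area)
single_check = check_area(single_city)
```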
Statistical Disclosure Control (5)
Public use microdata files (continued):
6. Sampling weights should not provide additional identifying information
7. Rule against spontaneous recognition: at least 200,000 individuals in the population for each category of an identifying variable
8. Another rule against spontaneous recognition: at least 1000 individuals in the population for each category in the crossing of two identifying variables
9. At least 5 households per combination of categories of household variables
10. Records should be in random order
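Rules 7, 8 and 10 lend themselves to a short sketch. Population counts are estimated here by summing sample weights, which is an assumption (the rules refer to counts in the population itself); the function names `small_categories` and `shuffled`, the field names and the toy records are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def small_categories(records, variables, uni_min=200_000, bi_min=1_000, weight="weight"):
    """Flag categories whose estimated population count falls below the
    thresholds of rule 7 (one variable) and rule 8 (crossing of two)."""
    uni, bi = Counter(), Counter()
    for rec in records:
        w = rec[weight]
        for v in variables:
            uni[(v, rec[v])] += w
        for a, b in combinations(variables, 2):
            bi[(a, rec[a], b, rec[b])] += w
    too_small_uni = sorted(k for k, n in uni.items() if n < uni_min)
    too_small_bi = sorted(k for k, n in bi.items() if n < bi_min)
    return too_small_uni, too_small_bi


def shuffled(records, seed=None):
    """Rule 10: return the records in random order, leaving the input untouched."""
    out = list(records)
    random.Random(seed).shuffle(out)
    return out


# Toy records; each weight stands for an estimated number of persons.
records = [
    {"sex": "F", "marital": "married", "weight": 300_000},
    {"sex": "M", "marital": "married", "weight": 299_500},
    {"sex": "M", "marital": "widowed", "weight": 500},
]
flagged_uni, flagged_bi = small_categories(records, ["sex", "marital"])
release_order = shuffled(records, seed=0)
```

Flagged categories would typically be merged with neighbouring categories until both lists come back empty.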
Statistical Disclosure Control (6)
Microdata for remote analyses
• Remote execution:
Scripts are sent (on line) to Statistics Netherlands and applied to the microdata; SDC is applied before returning the results
(Compare with on-site microdata)
• Remote access:
On-line access to confidentialized microdata sets
(Compare with microdata under contract or on-site)
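The remote execution flow (a script is applied to microdata the researcher never sees, and SDC is applied before results are returned) might look like the sketch below. The slides do not say which output checks are used, so the cell-suppression rule and its `min_cell` threshold are hypothetical, as are the function names.

```python
from collections import Counter


def remote_execute(script, microdata, min_cell=10):
    """Run a user-supplied analysis on the microdata and apply a simple
    output check before returning the result.

    min_cell is a HYPOTHETICAL threshold; the source does not specify one.
    The script is expected to return a frequency table {category: count}.
    """
    table = script(microdata)
    # Suppress (replace by None) every cell based on too few records.
    return {cat: (n if n >= min_cell else None) for cat, n in table.items()}


def count_by_sex(records):
    """Example of a script a researcher might submit."""
    return dict(Counter(rec["sex"] for rec in records))


microdata = [{"sex": "F"}] * 12 + [{"sex": "M"}] * 4
result = remote_execute(count_by_sex, microdata)
```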
Time for questions and discussion