Record matching for census purposes in the Netherlands Eric Schulte Nordholt Senior researcher and project leader of the Census Statistics Netherlands Division Social and Spatial.

Record matching for census
purposes in the Netherlands
Eric Schulte Nordholt
Senior researcher and project leader of the Census
Statistics Netherlands
Division Social and Spatial Statistics
Department Support and Development
Section Research and Development
[email protected]
Joint UNECE/Eurostat Meeting on Population and Housing
Censuses in Astana
4-6 June 2007
Contents
• History of the Dutch Census
• Data sources
• Micro linkage
• Micro integration
• Social Statistical Database
• Estimation aspects
• Statistical confidentiality
• Conclusions
History of the Dutch Census
TRADITIONAL CENSUS
Ministry of Home Affairs:
1829, 1839, 1849, 1859, 1869, 1879 and 1889
Statistics Netherlands:
1899, 1909, 1920, 1930, 1947, 1960 and 1971
Unwillingness (nonresponse) and cost reduction →
no more Traditional Censuses
ALTERNATIVE: VIRTUAL CENSUS
1981 and 1991: Population Register and surveys
Development in the 1990s: more registers →
2001: integrated set of registers and surveys, SSD
Data sources
Registers:
• Population Register (PR), 16 million records
demographic variables: sex, age, household status etc.
• Jobs file, employees, 6.5 million records, and
self-employed persons, 790 thousand records
dates of job, branch of economic activity
• Fiscal administration (FIBASE)
jobs, 7.2 million records, and
pensions and life insurance benefits, 2.7 million records
• Social Security administrations, 2 million records,
auxiliary information for the integration process
Surveys:
• Survey on Employment and Earnings (SEE),
3 million records, working hours, place of work
• Labour Force Survey (LFS), 2 years: 230,000 records
education, occupation, (economic) activity
Matching process
– Matching of registers and datasets to a self-constructed
Central Matching File
– Records are identified by a surrogate identifier
(RIN)
– One unique table RIN-Social Security Number
– Minimal set of identifying variables
– Every step in the process is a deterministic
match
Statistics Netherlands’ backbone of persons
The Central Matching File (April 2007)
46,436,060 records; 16,334,210 unique persons
• Social security number (sofi): < 0.03% unknown for 1995-2007
• Date of birth: < 0.5% unknown month and/or day
• Gender: always available
• Postal code: < 0.05% unknown
• House number: < 0.05% unknown
• RIN Person: always available
• RIN Address: always available
• Time frame of variable validity: always available
Matching process
1. Social security number matching
Check on date of birth and gender
A valid match when no more than one of the
variables year, month, day of birth and gender
differs
else
2. Matching using other variables like postal
code, house number, date of birth, gender
All keys must match
else
3. Match on social security number without any
control on other variables
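The three-step cascade above can be sketched in code. This is a minimal illustration, not the production system of Statistics Netherlands: the `Person` record layout, the function names, and the requirement that step 2 yields exactly one candidate are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Person:
    sofi: Optional[str]          # social security number, None if unknown
    birth: Tuple[int, int, int]  # (year, month, day)
    gender: str
    postal_code: Optional[str]
    house_number: Optional[str]
    rin: Optional[int] = None    # surrogate identifier, set only in the backbone


def differences(a: Person, b: Person) -> int:
    """Count differing check variables: year, month, day of birth and gender."""
    diffs = sum(x != y for x, y in zip(a.birth, b.birth))
    return diffs + (a.gender != b.gender)


def address_key(p: Person):
    """Key used in step 2: postal code, house number, date of birth, gender."""
    return (p.postal_code, p.house_number, p.birth, p.gender)


def match(record: Person, backbone: Iterable[Person]):
    """Return (rin, step) for the first step that succeeds, else (None, None)."""
    backbone = list(backbone)
    by_sofi = {p.sofi: p for p in backbone if p.sofi is not None}
    candidate = by_sofi.get(record.sofi) if record.sofi is not None else None

    # Step 1: sofi number matches and at most one check variable differs.
    if candidate is not None and differences(record, candidate) <= 1:
        return candidate.rin, 1

    # Step 2: all other keys must match.
    # Assumption of this sketch: the match must also be unique.
    key = address_key(record)
    if None not in key:
        hits = [p for p in backbone if address_key(p) == key]
        if len(hits) == 1:
            return hits[0].rin, 2

    # Step 3: fall back to the sofi number without any further control.
    if candidate is not None:
        return candidate.rin, 3
    return None, None
```

Each step is an exact (deterministic) comparison, so the same input always yields the same RIN and the step number can be stored as a quality indicator for the match.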
Micro data with Surrogate Identifier
[Diagram: surveys and registers arrive with a direct identifier, which a de-identification table replaces with the surrogate identifier (RIN). In the Statistics Netherlands production environment, the Social Statistics Database links the Municipal Population Register by RIN to data on employment, income, jobs, education, social security, etc. Micro data Services (preparation and documentation) provides de-identified micro data, including a selection from the Municipal Population Register: RIN, YearMonthBirth, gender, municipality, civil status.]
Example
                                                        Records      %
Employment and Wages survey 2003                      3,801,246  100.0
Total matched                                         3,747,976   98.6
  1. Sofi number, year of birth, month, day, gender   3,577,090   94.1
  2. Postal code, year of birth, month, day, gender     164,267    4.3
  3. Sofi number                                          6,619    0.2
Not matched                                              53,270    1.4
  Valid sofi number                                      21,194    0.6
    valid postal code                                     5,799    0.2
    invalid postal code                                  10,294    0.3
    non-resident                                          5,101    0.1
  Unknown or invalid sofi number                         32,076    0.8
    valid postal code                                     8,718    0.2
    invalid postal code                                  20,052    0.5
    non-resident                                          3,306    0.1
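The figures in this example add up, which can be verified with a few lines of arithmetic. The sketch below only recomputes the totals and percentages from the counts on the slide; the variable and function names are illustrative.

```python
# Counts taken from the example slide (Employment and Wages survey 2003).
TOTAL = 3_801_246
matched_by_step = {1: 3_577_090, 2: 164_267, 3: 6_619}
not_matched = {
    "valid sofi number": 21_194,
    "unknown or invalid sofi number": 32_076,
}


def pct(n, total=TOTAL):
    """Share of the survey records, in percent, rounded to one decimal."""
    return round(100 * n / total, 1)


total_matched = sum(matched_by_step.values())
total_not_matched = sum(not_matched.values())

for step, n in matched_by_step.items():
    print(f"step {step}: {n} records, {pct(n)}%")
print(f"matched: {total_matched} ({pct(total_matched)}%)")
print(f"not matched: {total_not_matched} ({pct(total_not_matched)}%)")
```

Almost all matches are found in step 1; steps 2 and 3 together add about 4.5 percentage points.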
Micro integration (1)
The aim of micro integration is:
– To check the linked data and modify incorrect
records,
– In such a way that the results that are to be
published are of higher quality than the
original sources
Micro integration (2)
To fulfil this demand an integrated process of:
• data editing,
• derivation of statistical variables,
• and imputation
is executed
Micro integration (3)
Constraints and limitations:
- Only variables that are to be published are
micro integrated
- Identity rules are necessary, e.g. the same
variable in two sources or a relationship
between two or more variables in one or more
sources
- No mass imputation
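As an illustration of the simplest kind of identity rule (the same variable present in two sources), the sketch below lists the persons for whom two linked sources disagree. The data layout (dictionaries keyed by RIN) and the variable name `job_start` are hypothetical; how a conflict is then resolved is not specified here and is left out.

```python
def conflicts(source_a, source_b, variable):
    """Return the RINs present in both sources whose values for `variable` differ."""
    return sorted(
        rin
        for rin in source_a.keys() & source_b.keys()
        if source_a[rin][variable] != source_b[rin][variable]
    )


# Hypothetical linked records from two sources, keyed by RIN.
jobs_file = {1: {"job_start": "2003-01-01"}, 2: {"job_start": "2003-03-01"}}
fiscal = {1: {"job_start": "2003-01-01"}, 2: {"job_start": "2003-04-01"}}

print(conflicts(jobs_file, fiscal, "job_start"))
```

Only the records flagged this way need editing or imputation, which fits the constraint that no mass imputation takes place.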
Social Statistical Database (SSD)
Social Statistical Database (SSD):
Set of integrated microdata files with coherent
and detailed demographic and socio-economic
data on persons, households, jobs and benefits
No remaining internally conflicting information
SSD set:
• Population Register (backbone)
• Integrated jobs file
• Integrated file of (social and other) benefits
• Surveys, e.g. LFS
Combining element: RIN-person
Core and satellites (1)
[Diagram: the SSD core surrounded by satellites]
Core and satellites (2)
Core:
• contains only integral register information
• contains the most important demographic and
socio-economic information
• contains only information that is used in at
least two satellites
Core and satellites (3)
Satellites are produced in two steps:
• Copying and derivation of the relevant
information from the core SSD
• Adding of the unique information on a specific
theme from registers and surveys
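The two steps can be pictured as a join on RIN-person. The sketch below is an assumption-laden illustration: the core and the theme source are modelled as dictionaries keyed by RIN, and the function and variable names are hypothetical.

```python
def build_satellite(core, theme_source, core_variables):
    """Build a satellite from the core and one theme-specific source."""
    # Step 1: copy the relevant variables from the core SSD.
    satellite = {
        rin: {v: record[v] for v in core_variables}
        for rin, record in core.items()
    }
    # Step 2: add the theme-specific information, linked by RIN.
    for rin, theme_record in theme_source.items():
        if rin in satellite:
            satellite[rin].update(theme_record)
    return satellite


core = {
    1: {"sex": "F", "age": 30, "municipality": "A"},
    2: {"sex": "M", "age": 45, "municipality": "B"},
}
education_survey = {1: {"education": "higher"}}

satellite = build_satellite(core, education_survey, ["sex", "age"])
```

Persons without theme information (RIN 2 here) keep only their core variables, which is the usual situation when the theme source is a sample survey.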
Conclusions SSD
The SSD diminishes the administrative burden
The SSD increases
– The efficiency of statistics production
– The accuracy of statistical outputs
– The relevance of social statistics
– The possibilities for social policy research
Estimation aspects
– Surveys are samples from the population
– If surveys are enriched with register
information, estimates of the register variables from
the enriched survey will be inconsistent
with the counts from the entire register
– Statistics Netherlands developed the method of
consistent and repeated weighting to solve
these inconsistencies
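The basic idea behind such weighting can be shown with a simple calibration to one register margin (post-stratification). This is not the repeated weighting method itself, which is more elaborate; it is only a sketch of why reweighting removes the inconsistency, with hypothetical data and function names.

```python
from collections import defaultdict


def calibrate(sample, register_counts, key):
    """Rescale weights so weighted counts of `key` equal the register counts."""
    weighted = defaultdict(float)
    for row in sample:
        weighted[row[key]] += row["weight"]
    return [
        {**row, "weight": row["weight"] * register_counts[row[key]] / weighted[row[key]]}
        for row in sample
    ]


def weighted_count(sample, key, value):
    """Weighted number of persons in the sample with row[key] == value."""
    return sum(row["weight"] for row in sample if row[key] == value)


# Hypothetical survey of 4 persons from a register population of 100.
survey = [
    {"sex": "M", "weight": 25.0},
    {"sex": "M", "weight": 25.0},
    {"sex": "M", "weight": 25.0},
    {"sex": "F", "weight": 25.0},
]
register_counts = {"M": 60, "F": 40}

calibrated = calibrate(survey, register_counts, "sex")
```

Before calibration the survey estimates 75 men against 60 in the register; afterwards the weighted counts reproduce the register counts, and survey variables estimated with the new weights are consistent with that margin.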
Statistical confidentiality
[Diagram: administrative sources (IDs, variables, characteristics) and household surveys (IDs, variables) are linked through identifiers (PINs, sex, date of birth, address) to the PERSONS BACKBONE, which covers the full range of all persons as from 1995]
IDs in sources are replaced by random
Record Identification Numbers (RINs)
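A minimal sketch of such a replacement is given below. The class, the field name `sofi`, and the size of the RIN range are assumptions made for illustration; the actual de-identification procedure is not described on the slide.

```python
import secrets


class DeidentificationTable:
    """Maps each direct identifier to one random, meaningless RIN."""

    def __init__(self):
        self._rin_by_id = {}
        self._used = set()

    def rin_for(self, direct_id):
        if direct_id not in self._rin_by_id:
            rin = secrets.randbelow(10**9)
            while rin in self._used:  # redraw in the rare case of a collision
                rin = secrets.randbelow(10**9)
            self._used.add(rin)
            self._rin_by_id[direct_id] = rin
        return self._rin_by_id[direct_id]

    def deidentify(self, record, id_field="sofi"):
        """Return a copy of the record with the direct identifier replaced by a RIN."""
        out = {k: v for k, v in record.items() if k != id_field}
        out["rin"] = self.rin_for(record[id_field])
        return out


table = DeidentificationTable()
a = table.deidentify({"sofi": "111", "income": 100})
b = table.deidentify({"sofi": "111", "job": "teacher"})
c = table.deidentify({"sofi": "222", "income": 200})
```

Because the RIN is random, it carries no information about the person, yet records `a` and `b` can still be linked since the same direct identifier always receives the same RIN. The table itself must be kept separate from the micro data.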
Conclusions
• Matching is relatively cheap
• Matching is relatively quick (short production
time)
• Micro integration remains important
• The SSD has found its place in the organisation
• Repeated weighting method guarantees
consistent estimates
• Statistical confidentiality aspects have become
very important
Time for questions and discussion