Data Imputation and Estimation for the Austrian Register Based Census Test. Reinhard Fiedler, Peter Schodl. April 23rd, 2008. © STATISTIK AUSTRIA, www.statistik.at, 06.11.2015.

Wir bewegen Informationen (We move information)
Data Imputation and Estimation for the
Austrian Register Based Census Test
Reinhard Fiedler
Peter Schodl
April 23rd, 2008
© STATISTIK AUSTRIA
www.statistik.at
06.11.2015
STATISTIK AUSTRIA
Welcome
Introduction
Background Information
  Time Plan
  Pros and Cons
  Registers used for RBCT (Register Based Census Test)
Estimation procedures
  Record Linkage
  Estimation
  Hot-deck technique
  Clustering
Background Information
Time Plan
2001: last conventional census
31.10.2006: reference date for RBCT
April 2008: first report RBCT
2010: first register-based census
Background Information
Pros and cons
Pros:
  economic efficiency
  faster
  can be carried out more often
  less burden on respondents
Cons:
  incomplete data
  inconsistent data
  timeliness
  privacy
Background Information
Registers used for RBCT
8 base registers, e.g.
  Central Population Register (CPR)
  Central Social Security Register (CSSR)
  Register of Educational Attainment
7 comparison registers for cross-checks, e.g.
  Register of Family Allowance
  Register of Social Welfare
Linkage by unique keys
  Branch-specific identification number (bPK), a specific personal code
  Social Security Number (RBCT)
Background Information
Missing data
Low missing rates (covered by more than one data source)
  Sex (<1% missing)
  Date of birth (<1% missing)
Medium to high missing rates
  Marital status (11% missing)
  Graduates (7% missing)
Not included in any register
  Occupation (100% missing)
Estimation Strategy
Record Linkage
For all registers
Estimation
Marital status (high missing rate)
Occupation (not contained in any register)
Graduates (immigrants since last census)
Record Linkage
Problem: imperfect linkage of registers
Wrong or missing keys
Attributes used:
Date of birth
Address
Nationality
Sex
Standardization of notations
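The fallback described on this slide can be sketched as follows. This is a minimal illustration, not the procedure actually used for the RBCT: the `Record` fields, the `standardize` rules (including the single "str." abbreviation) and the "accept only unambiguous matches" rule are all assumptions made for the example.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    key: Optional[str]   # unique key (e.g. bPK); may be missing or wrong
    birth_date: str      # ISO date string, e.g. "1975-03-14"
    address: str
    nationality: str
    sex: str


def standardize(text: str) -> str:
    """Standardize notation: drop accents, lowercase, collapse whitespace,
    and expand one hypothetical abbreviation ("str." -> "strasse")."""
    text = text.replace("ß", "ss")
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = " ".join(text.lower().split())
    return text.replace("str.", "strasse")


def attribute_key(rec: Record) -> tuple:
    """Comparison key built from the attributes named on the slide."""
    return (rec.birth_date, standardize(rec.address),
            standardize(rec.nationality), standardize(rec.sex))


def link(base, other):
    """Link by unique key first; fall back to attributes if the key fails."""
    by_key = {r.key: r for r in other if r.key}
    by_attrs = {}
    for r in other:
        by_attrs.setdefault(attribute_key(r), []).append(r)

    pairs = []
    for b in base:
        if b.key and b.key in by_key:
            pairs.append((b, by_key[b.key]))
            continue
        candidates = by_attrs.get(attribute_key(b), [])
        if len(candidates) == 1:  # assumption: accept only unambiguous matches
            pairs.append((b, candidates[0]))
    return pairs


base = [
    Record("k1", "1980-01-01", "Ring 1", "AT", "M"),
    Record(None, "1975-03-14", "Hauptstr. 5", "AT", "F"),   # key missing
]
other = [
    Record("k1", "1980-01-01", "Ring 1", "AT", "M"),
    Record("zz", "1975-03-14", "Hauptstraße 5", "at", "f"),
]
pairs = link(base, other)
print(len(pairs), "of", len(base), "records linked")
```

The second record has no usable key, but after standardization its date of birth, address, nationality and sex agree with exactly one record in the other register, so it is still linked.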
Record Linkage
Example: Current school enrolment
Record linkage reduces the number of school-age people without a current
school enrolment by 40%
Estimation
Occupation and graduates
Graduates
  Source: RBCT itself
  6,600,000 people with graduation
  409,000 people with missing graduation
Occupation
  Source: Labour Force Survey (quarterly sample survey)
  About 35,000 people with occupation in the survey
  3,800,000 people with missing occupation (all working persons)
Estimation
Basic idea:
  Same procedure for the estimation of occupation and graduates
  Estimation at person level
  Target distribution
  Groups are built so that the distribution in the source is transferred to
  the corresponding group in the target
  Groups are formed by attributes with a significant influence on the
  target variable
Hot-deck technique
Example:
In the Labour Force Survey, 1000 people aged 30 to 34 living in Tyrol form one deck:
  200 with occupation A
  300 with occupation B
  500 with occupation C
The resulting weighting scheme is applied to all people within the corresponding deck in the RBCT:
  20% probability for occupation A
  30% probability for occupation B
  50% probability for occupation C
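The example above can be written out as a short sketch. It assumes the hot-deck step amounts to drawing each missing value at random with the donor shares as probabilities; the function names, the fixed seed and the number of recipients (10,000) are hypothetical and only serve the illustration.

```python
import random
from collections import Counter


def deck_distribution(donor_values):
    """Relative frequency of each value among the donors of one deck."""
    counts = Counter(donor_values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def impute_deck(n_recipients, donor_values, rng):
    """Draw one value per recipient, weighted by the donor distribution."""
    dist = deck_distribution(donor_values)
    values = list(dist)
    weights = [dist[v] for v in values]
    return rng.choices(values, weights=weights, k=n_recipients)


# Donors: the deck "aged 30 to 34, living in Tyrol" from the slide.
donors = ["A"] * 200 + ["B"] * 300 + ["C"] * 500
dist = deck_distribution(donors)
print(dist)  # probabilities 0.2, 0.3 and 0.5 for A, B and C

# Recipients: people in the same deck of the register data (size is made up).
rng = random.Random(0)
imputed = impute_deck(10_000, donors, rng)
print(Counter(imputed))
```

Because the values are drawn at random, the imputed shares only approximate 20%, 30% and 50%; the approximation improves as the deck in the target grows.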
Which attributes have influence?
Graduates
  Age
  Status in employment
  Sex
  Nationality
  Urban / rural environment
Occupation
  Age
  Region
  Status in employment
  NACE of employment
  Sex
  Level of educational achievement
  Nationality
Clustering
Groups must not be too small
No donor for many persons
Wrong distribution
Example:
Source:
  Tyrol, male, 87 years, German nationality: 10 Persons
    5 occupation A (50% A)
    5 occupation B (50% B)
  Tyrol, female, 87 years, German nationality: 1 Person
    1 occupation B (100% B)
Target:
  Tyrol, male, 87 years, German nationality: 1000 Persons → 500 A, 500 B
  Tyrol, female, 87 years, German nationality: 1000 Persons → 1000 B
  Tyrol, 87 years, German nationality (male and female combined):
    500 occupation A
    1500 occupation B
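The arithmetic behind this example can be checked with a few lines. The sketch works with expected counts rather than random draws, and the pooled alternative at the end is only an illustration of why merging groups helps; it is not taken from the slides.

```python
# Donor counts per deck, as on the slide.
source = {
    "male": {"A": 5, "B": 5},
    "female": {"B": 1},
}
target_sizes = {"male": 1000, "female": 1000}


def transfer(source_counts, target_size):
    """Expected target counts if the donor shares are applied to the target."""
    total = sum(source_counts.values())
    return {occ: target_size * n / total for occ, n in source_counts.items()}


# Separate decks: the single female donor determines 1000 target persons.
per_group = {g: transfer(source[g], target_sizes[g]) for g in source}
combined = {}
for result in per_group.values():
    for occ, n in result.items():
        combined[occ] = combined.get(occ, 0) + n
print(combined)  # 500 A and 1500 B, i.e. 75% B

# Illustration only: pooling both decks uses all 11 donors (5 A, 6 B).
pooled_source = {"A": 5, "B": 6}
pooled = transfer(pooled_source, sum(target_sizes.values()))
print({occ: round(n) for occ, n in pooled.items()})  # about 909 A and 1091 B
```

With separate decks, occupation B reaches 75% in the target although it accounts for only 6 of the 11 donors (about 55%) in the source.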
Clustering
Groups must not be too big
Correct distribution only on highest level,
incorrect distribution on lower levels
Example:
2 groups: male / female.
The distributions for males and for females are transferred to the target,
so the distribution of source and target is the same for males and for females.
But: the distribution by region, age, … can be incorrect!
→ Optimal groups by cluster analysis
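The problem with groups that are too big can be made concrete with a toy calculation. The donor counts below are invented for the illustration (they do not come from the RBCT or the Labour Force Survey), and the example looks at one deck only.

```python
# Hypothetical donor counts by (sex, region); occupation depends on region.
source_counts = {
    ("male", "Tyrol"): {"A": 80, "B": 20},
    ("male", "Vienna"): {"A": 20, "B": 80},
}


def share_of_a(counts):
    """Share of occupation A in a dictionary of counts."""
    return counts["A"] / (counts["A"] + counts["B"])


# One big deck: all males pooled, region ignored.
pooled = {"A": 0, "B": 0}
for counts in source_counts.values():
    for occupation, n in counts.items():
        pooled[occupation] += n

# Every male in the target gets A with this probability, whatever his region.
deck_share = share_of_a(pooled)

for (sex, region), counts in source_counts.items():
    print(region, "source share of A:", share_of_a(counts),
          "imputed share of A:", deck_share)
```

At the level of the deck (all males) source and target agree, but within each region the imputed share of A is 50%, while the source shows 80% in one region and 20% in the other.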
Clustering
Occupation
First clustering:
Variables with many values (age, nationality, region,…)
Second clustering:
Since the groups after first clustering are too small, the groups
are clustered again
[Figure: groups by age and nationality; panels "First aggregation" and "Second aggregation"]
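The slides do not say which clustering method was used. As a stand-in, the following sketch shows the simplest way to enforce a minimum group size: neighbouring groups along one ordered variable are merged until each merged group has enough donors. This is a plain threshold rule, not a cluster analysis, and the age bands, donor counts and the threshold of 50 are hypothetical.

```python
def merge_small_groups(counts, min_size):
    """Merge neighbouring groups until every merged group has at least
    min_size donors. counts is an ordered list of (label, n_donors)."""
    merged = []
    labels, total = [], 0
    for label, n in counts:
        labels.append(label)
        total += n
        if total >= min_size:
            merged.append((labels, total))
            labels, total = [], 0
    if labels:
        # Leftover groups at the end are too small: attach them to the
        # previous merged group if there is one.
        if merged:
            prev_labels, prev_total = merged.pop()
            merged.append((prev_labels + labels, prev_total + total))
        else:
            merged.append((labels, total))
    return merged


# Hypothetical donor counts per age band after a first grouping.
age_bands = [("15-19", 12), ("20-24", 40), ("25-29", 55),
             ("30-34", 60), ("85-89", 3)]
groups = merge_small_groups(age_bands, min_size=50)
for labels, total in groups:
    print(labels, total)
```

A real cluster analysis would additionally take into account how similar the distributions of the merged groups are, which this rule ignores.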
Results
Graduates:
  No second clustering; 457 groups with about 14,000 persons in each group
  At the highest level, never more than 2.4% deviation from the
  Labour Force Survey 2006
Occupation:
  65 groups after second clustering, with about 500 persons in each group
  At the highest level, never more than 1.7% deviation from the
  traditional 2001 census
  At medium levels, never more than 3.2% deviation