UNECE Workshop on Census Technology for SPECA and CIS member countries (Astana, 7-8 June 2007) Technology for census data coding, editing and imputation Paolo Valente.


UNECE Workshop on Census Technology
for SPECA and CIS member countries
(Astana, 7-8 June 2007)
Technology for census data coding,
editing and imputation
Paolo Valente (UNECE)
Paolo Valente - UNECE Statistical Division
Slide 1
Content:
1. Coding
2. Editing and imputation
Reference material:
Handbook on Census Management for Population and Housing Censuses (Chapter IV, sections D-F)
Handbook on Population and Housing Census Editing
Slide 2
1. Census data coding
Questions:
1. How did you code the data in the last census?
2. Were you satisfied or not with coding?
3. What problems did you find in coding?
4. Any problems with specific variables?
Slide 3
Census data coding

Data coding = Assigning classification codes
to the responses written on the census form

Coding systems:
a) Manual
b) Computer assisted
c) Automatic
d) Mix of a), b) or c)

Coding methodologies:
a) Simple (1 or 2 words): e.g. birth place
b) Structured (> 1 question): e.g. occupation
c) Hierarchical: e.g. address
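
The three coding methodologies can be illustrated with a minimal Python sketch. Every code value, index entry and function name below is invented for illustration and does not come from any official classification.

```python
# Hypothetical code books; all code values are invented for illustration.
BIRTH_PLACE_CODES = {"astana": "BP01", "almaty": "BP02"}

# Structured coding: the occupation code depends on more than one question
# (here: job title plus industry of the employer).
OCCUPATION_CODES = {
    ("driver", "transport"): "D1",
    ("driver", "agriculture"): "D2",
}

# Hierarchical coding: each level narrows the search for the next one.
ADDRESS_CODES = {"akmola": {"code": "11", "districts": {"arshaly": "03"}}}


def code_simple(response):
    """Simple coding: one short response maps directly to one code."""
    return BIRTH_PLACE_CODES.get(response.strip().lower())


def code_structured(job_title, industry):
    """Structured coding: combine the answers to several questions."""
    key = (job_title.strip().lower(), industry.strip().lower())
    return OCCUPATION_CODES.get(key)


def code_hierarchical(region, district):
    """Hierarchical coding: resolve the region first, then the district."""
    region_entry = ADDRESS_CODES.get(region.strip().lower())
    if region_entry is None:
        return None
    district_code = region_entry["districts"].get(district.strip().lower())
    if district_code is None:
        return None
    return region_entry["code"] + district_code
```

A response that is not found returns None, which in practice would be routed to a clerk.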
Slide 4
Manual data coding

Clerks identify the code using “code books” and write it on the census form for later processing

Pros:
- Easy to implement
- No technology needed

Cons:
- Time consuming
- Labor intensive
- Risk of inconsistency
Slide 5
Computer-assisted coding



Coding is assisted by a computerized system
Code books are computer-based
How it works:
1) Coder types only a few characters
2) System displays the list of matching entries
3) Coder chooses the right code
4) Code is automatically recorded by the system
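
The four steps can be sketched in Python as follows. This is only an illustration: the code book entries, the codes and the function names are hypothetical, and a real system would add a user interface and coding rules.

```python
# Hypothetical computer-based code book: description -> classification code.
CODE_BOOK = {
    "teacher, primary school": "T01",
    "teacher, secondary school": "T02",
    "technician, electrical": "T10",
    "driver, bus": "D01",
}


def matching_entries(typed, code_book=CODE_BOOK):
    """Step 2: return the (description, code) pairs that start with the typed text."""
    prefix = typed.strip().lower()
    return sorted(
        (description, code)
        for description, code in code_book.items()
        if description.startswith(prefix)
    )


def record_choice(record, variable, candidates, chosen_index):
    """Steps 3 and 4: store the code picked by the coder in the record."""
    _description, code = candidates[chosen_index]
    record[variable] = code
    return record


candidates = matching_entries("tea")  # step 1: the coder types "tea"
person = record_choice({}, "occupation", candidates, 1)  # coder picks the 2nd entry
```

Here the coder never types the code itself, which is where the gain in consistency over manual coding comes from.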
Slide 6
Computer-assisted coding

Pros:
- Efficiency
- Good quality
- Particularly suitable for structured coding (possibility to include coding rules)

Cons:
- Relatively complex system
- Long development time
- Relatively high cost
Slide 7
Automatic coding



- Based on computerized algorithms
- No human intervention
- Text is captured by ICR (intelligent character recognition) and matched against indexes
- The system assigns a score to each matched response:
  - If the score is above a certain level, the response is accepted
  - If the score is below that level, human intervention is needed (computer-assisted coding)
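
The score-and-threshold logic can be sketched in Python. This is not how ACTR or any other production system works: it swaps in the similarity ratio from Python's standard `difflib` module as the score, and the index entries, codes and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical index of known responses and their codes.
INDEX = {"kazakhstan": "KZ", "kyrgyzstan": "KG", "tajikistan": "TJ"}
ACCEPT_THRESHOLD = 0.85  # assumed cut-off, to be tuned on real data


def auto_code(captured_text, index=INDEX, threshold=ACCEPT_THRESHOLD):
    """Return (code, score, status) for one response captured by ICR."""
    text = captured_text.strip().lower()
    best_entry, best_score = None, 0.0
    for entry in index:
        score = SequenceMatcher(None, text, entry).ratio()
        if score > best_score:
            best_entry, best_score = entry, score
    if best_entry is not None and best_score >= threshold:
        return index[best_entry], best_score, "accepted"
    # Below the threshold: route the response to computer-assisted coding.
    return None, best_score, "manual review"
```

A misspelled response such as "Kazakstan" still scores high enough to be accepted, while an unrelated string is sent to a human coder.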
Slide 8
Automatic coding

- Matching rates depend on the algorithms used and the type of variable
- Maximum matching rates in ideal circumstances:
  - For simple variables (birth place), approx. 80%
  - For complex variables (occupation, industry), approx. 50%
- All unmatched responses have to be processed with computer-assisted coding
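
The matching rates above translate directly into the workload left for computer-assisted coding. The short Python calculation below uses a hypothetical volume of one million responses per variable; only the 80% and 50% rates come from the slide.

```python
def residual_workload(total_responses, matching_rate):
    """Number of responses left for computer-assisted coding."""
    if not 0.0 <= matching_rate <= 1.0:
        raise ValueError("matching_rate must be between 0 and 1")
    return round(total_responses * (1.0 - matching_rate))


# Hypothetical volume of 1,000,000 responses per variable.
birth_place_left = residual_workload(1_000_000, 0.80)  # simple variable
occupation_left = residual_workload(1_000_000, 0.50)   # complex variable
```

Even in ideal circumstances, one response in five for a simple variable and one in two for a complex variable would still need a human coder.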
Slide 9
Automatic coding

Pros:
- High efficiency
- Good quality (if system developed accurately)
- Consistency
- Particularly suitable for structured coding (possibility to include coding rules)

Cons:
- Very complex system
- Long development time
- High cost
- Risk of systematic errors in case of faults in matching algorithms or indexes
Slide 10


Coding: practices in the 2000 round
- In general, CIS countries used manual coding
- About half of UNECE countries used automatic coding, in combination with computer-assisted or manual coding
- In most cases the software was developed in-house
- Software for automatic coding: ACTR (Automated Coding by Text Recognition), developed by Statistics Canada and also used by Italy and the UK
- Integrated software system, including computer-assisted coding: CSPro (US Census Bureau)
See “Measuring Population and Housing”, Chapter III
Slide 11
Coding in the 2010 census round
Questions:
1. What are your plans for coding the data of the next census?
2. Are you considering computer-assisted coding?
3. Why? …or why NOT?
Slide 12
2. Editing and imputation
Questions on editing:
1. Which data did you edit in the last census?
2. How did you edit the data?
3. Did you have any problems?
Slide 13
2. Editing and imputation
Questions on imputation:
1. Did you impute any missing data?
If yes:
2. For which variables?
3. What method and software did you use?
4. Did you produce statistics on imputation rates?
Slide 14
Editing and imputation

Editing = Detecting and correcting
errors in census data

Imputation = assigning values to
missing data

The two concepts are related and the two
terms are sometimes used in different ways
Slide 15
Editing and imputation

Different types of errors:
- Coverage errors (e.g. omissions, duplicates)
- Enumerator errors
- Respondent errors
- Coding errors
- Data entry errors
but also…
- Editing errors!
Slide 16
Editing and imputation

It is important not only to detect errors, but also to identify their causes, in order to take appropriate measures and improve overall quality
Objectives of editing and imputation:
- Improve quality of census data
- Facilitate analysis of census data
- Identify types and sources of errors
Slide 17
Editing and imputation

Dilemma: what should be edited and what should NOT be edited?

- Complex editing systems can be difficult and expensive to implement, and in some cases may introduce distortions
- Go for a relatively simple editing system!
Slide 18
Editing and imputation

In general, the editing system should be:
- Minimalist (only obvious errors)
- Automated (as much as possible)
- Systematic
- Compliant with other NSI procedures
- Compliant with international standards
Slide 19
Editing and imputation
General guidelines for editing:
- Make as few changes as possible
- Eliminate obvious inconsistencies
- Supply entries for erroneous or missing items, using as a guide other entries for the housing unit, the person, or other persons in the household or a comparable group
Slide 20
Editing and imputation
Example of inconsistent information 1:

Reference person and spouse have same sex
Slide 21
Editing and imputation
Example of inconsistent information 2:

Excessive age difference between mother and children
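
Both examples of inconsistent information can be written as simple edit checks. The Python sketch below follows the two rules as stated on the slides. The field names and the age-gap bounds of 15 and 50 years are assumptions made for illustration; each census would set its own limits.

```python
# Assumed plausibility bounds for the mother-child age difference (years).
MIN_AGE_GAP = 15
MAX_AGE_GAP = 50


def same_sex_couple_flag(reference_person, spouse):
    """Example 1: flag the pair if both report the same sex."""
    return reference_person["sex"] == spouse["sex"]


def mother_child_age_flag(mother, child, min_gap=MIN_AGE_GAP, max_gap=MAX_AGE_GAP):
    """Example 2: flag the pair if the age difference is outside the bounds."""
    gap = mother["age"] - child["age"]
    return gap < min_gap or gap > max_gap


reference_person = {"sex": "M", "age": 40}
spouse = {"sex": "M", "age": 38}
mother = {"sex": "F", "age": 70}
child = {"sex": "F", "age": 5}

flags = {
    "spouse_sex": same_sex_couple_flag(reference_person, spouse),
    "mother_child_age": mother_child_age_flag(mother, child),
}
```

A raised flag only says that the records disagree with the rule; deciding which value to change is a separate step.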
Slide 22
Editing and imputation
Editing approaches:
- Top-down: items are edited in sequence, from first to last
- Multiple variable (Fellegi-Holt):
  1. A set of statements and relationships among variables is checked in the household
  2. The edit keeps track of all false statements
  3. The system assesses how best to change the data
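
The multiple-variable approach can be sketched in Python as below. This is a strongly simplified illustration, not the full Fellegi-Holt method: it works on a single person record rather than a household, the three edit rules and field names are hypothetical, and it only searches for the smallest set of fields involved in every failed edit, without deriving the implied edits that the complete method requires.

```python
from itertools import combinations

# Each edit: (name, fields involved, predicate that is True when the record passes).
EDITS = [
    ("child_not_married", ("age", "marital_status"),
     lambda r: not (r["age"] < 15 and r["marital_status"] == "married")),
    ("child_not_employed", ("age", "employment"),
     lambda r: not (r["age"] < 15 and r["employment"] == "employed")),
    ("spouse_is_married", ("relationship", "marital_status"),
     lambda r: not (r["relationship"] == "spouse" and r["marital_status"] != "married")),
]


def failed_edits(record, edits=EDITS):
    """Steps 1 and 2: check every statement and keep track of the false ones."""
    return [(name, fields) for name, fields, passes in edits if not passes(record)]


def fields_to_change(failed):
    """Step 3: smallest set of fields that touches every failed edit (brute force)."""
    all_fields = sorted({f for _, fields in failed for f in fields})
    for size in range(1, len(all_fields) + 1):
        for candidate in combinations(all_fields, size):
            if all(set(candidate) & set(fields) for _, fields in failed):
                return set(candidate)
    return set()


record = {"age": 9, "marital_status": "married",
          "employment": "employed", "relationship": "child"}
failed = failed_edits(record)
to_change = fields_to_change(failed)
```

In this record two edits fail, and both involve age, so changing that single field is preferred over changing marital status and employment. A top-down system would instead have handled each item in questionnaire order.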
Slide 23
Editing and imputation
Imputation methods:
- Static imputation (or “cold deck”)
  - Used mainly for missing values only
  - Value assigned from a predetermined set, or from a distribution of valid responses
  - The set of values does not change over time
- Dynamic imputation (or “hot deck”)
  - Used for missing or inconsistent values
  - Value assigned from a “donor” with similar characteristics, which changes constantly
  - Response imputations change over time
See “Handbook on Census Editing”, Ch. II.E and Annex V
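
The difference between the two methods can be shown with a minimal Python sketch. It assumes a sequential hot deck in which the most recent valid response in the same class (here, the same sex) becomes the donor; the records, the starting values and the function names are hypothetical.

```python
def impute_cold_deck(records, variable, class_var, cold_deck):
    """Static imputation: the lookup table never changes."""
    out = []
    for rec in records:
        rec = dict(rec)
        if rec[variable] is None:
            rec[variable] = cold_deck[rec[class_var]]
        out.append(rec)
    return out


def impute_hot_deck(records, variable, class_var, initial_deck):
    """Dynamic imputation: the deck is refreshed by each valid response."""
    deck = dict(initial_deck)
    out = []
    for rec in records:
        rec = dict(rec)
        if rec[variable] is None:
            rec[variable] = deck[rec[class_var]]
        else:
            deck[rec[class_var]] = rec[variable]  # this record becomes the donor
        out.append(rec)
    return out


records = [
    {"sex": "F", "age": 34},
    {"sex": "F", "age": None},
    {"sex": "M", "age": None},
    {"sex": "F", "age": 71},
    {"sex": "F", "age": None},
]
start_values = {"F": 30, "M": 30}  # predetermined values, assumed for the example

cold_ages = [r["age"] for r in impute_cold_deck(records, "age", "sex", start_values)]
hot_ages = [r["age"] for r in impute_hot_deck(records, "age", "sex", start_values)]
```

With the cold deck every missing age receives the same predetermined value, while with the hot deck the imputed value follows the latest donor, so it changes as processing moves through the file.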
Slide 24
Editing and imputation

Types of edits:
- Fatal edits identify errors with certainty
- Query edits identify suspected errors
- Structure edits check coverage and relations between different units: persons, households, housing units, enumeration areas etc.
- Edits for population and housing items
See “Handbook on Census Editing”, Chapters III, IV and V
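
As an illustration of how fatal, query and structure edits differ, the Python sketch below applies one hypothetical rule of each type. The sex codes, the age limit of 100 and the field names are assumptions, not rules taken from the handbook.

```python
VALID_SEX_CODES = {"M", "F"}  # assumed coding of the sex item
QUERY_AGE_LIMIT = 100         # assumed age above which a value is only suspicious


def fatal_edit_failures(person):
    """Fatal edits: the value is certainly wrong."""
    failures = []
    if person["age"] < 0:
        failures.append("age is negative")
    if person["sex"] not in VALID_SEX_CODES:
        failures.append("invalid sex code")
    return failures


def query_edit_failures(person):
    """Query edits: the value is possible but suspicious."""
    failures = []
    if person["age"] > QUERY_AGE_LIMIT:
        failures.append("age above query limit")
    return failures


def structure_edit_failures(household):
    """Structure edit: a household should have exactly one reference person."""
    refs = sum(1 for p in household if p["relationship"] == "reference person")
    return [] if refs == 1 else [f"{refs} reference persons in household"]


household = [
    {"relationship": "reference person", "age": 104, "sex": "F"},
    {"relationship": "reference person", "age": -3, "sex": "X"},
]
fatal = [fatal_edit_failures(p) for p in household]
query = [query_edit_failures(p) for p in household]
structure = structure_edit_failures(household)
```

Fatal failures would normally be corrected, while query failures are candidates for review rather than automatic change.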
Slide 25
Editing and imputation
Practices in 2000 round
- Most ECE countries (33 out of 40) performed computer-supported editing, including several CIS countries
- 22 countries performed automatic imputations
- Most countries developed specific software
- Some countries used SAS, Oracle, SQL, CSPro
See “Measuring Population and Housing”, Chapter III
Slide 26
Editing and imputation
Plans for 2010 round
Questions:
- What are your plans for editing and imputation?
- What editing approaches/methods are you considering?
Slide 27
Editing and imputation
Plans for 2010 round
Questions:
- For which variables would you consider imputation of missing values?
Slide 28