The RELAIS project - CROS

Results Of A Project On Record Linkage,
Statistical Matching And Micro Integration:
The ESSnet On Data Integration
Mauro Scanu
Istat
ESSnet workshop: Roma 4 December 2012
ESSnet “Data integration”
An ESSnet on data integration was launched in December 2009
Partners: Italy, Netherlands, Norway, Poland, Spain, Switzerland
Duration: 2 years
Summary webpage: http://www.essnet-portal.eu/essnetprojects/ongoing-essnet-projects/data-integration
Objective: to spread data integration know-how in the ESS
WP1: state-of-the-art update
WP2: methodological developments
WP3: software issues
WP4: case studies
WP5: dissemination (on-the-job training courses on specific methods, one
course, one final workshop, contacts with other ESSnets, ISI conference)
What kind of “data integration”?
We focused on the statistical methods of data integration:
• Record linkage: look for the same unit in two sources
• Statistical matching: look for joint information on variables
observed in two samples with no units in common
and on the methods that make the integrated data set useful for
statistical analysis
• Micro integration processing
Micro integration
Micro-integration is the method that aims at improving data quality
in combined sources by searching for and correcting errors at the
unit level, in such a way that:
• the validity and reliability of the statistical outcomes are optimized,
• only one figure on one phenomenon is published,
• variables from different sources can be combined, and
• accurate longitudinal outcomes can be published.
A model for possible errors in the joint analysis of two data sets
[Figure: flow diagram with two columns, Measurement and Representation.
Measurement boxes: administrative concept; operationalisation (administrative concept); response (administrative concept); corrected response (statistical concept).
Representation boxes: target population; registered population elements; linked population elements; post-linking corrections.
Error labels: validity, measurement error, processing error, coverage errors, linking error, correction error.
Final box: register outcome.]
Methodological developments and case studies
Methodological developments
• Consistency at the micro level
• Consistency at the macro level
• Statistical matching
• Record linkage
Case studies
• Constructing a register combining different sources on the
topic employment
• Construct the educational attainment variable
Consistency at the micro level
Example: business record with data from two sources
Problem: inconsistency among figures due to edit constraints

Variable | Name | Sample value | Register value
x1 | Profit | 330 |
x2 | Employees (Number of employees) | 20 | 25
x3 | Turnover main (Turnover main activity) | 1000 |
x4 | Turnover other (Turnover other activities) | 30 |
x5 | Turnover (Total turnover) | 1030 | 950
x6 | Wages (Costs of wages and salaries) | 200 |
x7 | Other costs | 500 |
x8 | Total costs | 700 | 800
Consistency at the micro level
edit-rules:
• a1: x1 - x5 + x8 = 0 (Profit = Turnover – Total Costs)
• a2: x5 - x3 - x4 = 0 (Turnover = Turnover main + Turnover other)
• a3: x8 - x6 - x7 = 0 (Total Costs = Wages + Other costs)
Objective: we need to combine the different pieces of information,
survey values, register values and edit-rules, to obtain a record
such that
• the record contains the register values for variables for which
these are available and
• the record satisfies the edit constraints.
\tilde{x} = \arg\min_{x} D(x, x_0)
s.t. A\tilde{x} = 0
Results: different distances have been tested and applied; presence
of “hard” and “soft” constraints
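The minimum-adjustment idea can be sketched in a few lines. The sketch below assumes a plain least-squares distance D, treats the three edit rules a1-a3 as hard equality constraints, and keeps the register values (read here as x2 = 25, x5 = 950, x8 = 800) fixed. The function names and the small linear solver are illustrative choices, not the project's implementation, and the rules are assumed to be linearly independent on the adjustable variables.

```python
def solve(M, v):
    """Solve M z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def least_squares_adjust(x0, A, fixed):
    """Minimise sum((x - x0)^2) subject to A x = 0.

    Entries whose index is in `fixed` (the register values) are not changed.
    """
    n = len(x0)
    free = [j for j in range(n) if j not in fixed]
    # Violation of each edit rule at the starting record.
    r = [sum(a[j] * x0[j] for j in range(n)) for a in A]
    # Constraint matrix restricted to the adjustable variables.
    Af = [[a[j] for j in free] for a in A]
    # Normal equations (Af Af^T) lam = r for the Lagrange multipliers.
    G = [[sum(p * q for p, q in zip(ri, rj)) for rj in Af] for ri in Af]
    lam = solve(G, r)
    x = list(x0)
    for k, j in enumerate(free):
        x[j] = x0[j] - sum(Af[i][k] * lam[i] for i in range(len(A)))
    return x


# Starting record: sample values, overwritten by register values where available.
x0 = [330, 25, 1000, 30, 950, 200, 500, 800]
A = [
    [1, 0, 0, 0, -1, 0, 0, 1],    # a1: x1 - x5 + x8 = 0
    [0, 0, -1, -1, 1, 0, 0, 0],   # a2: x5 - x3 - x4 = 0
    [0, 0, 0, 0, 0, -1, -1, 1],   # a3: x8 - x6 - x7 = 0
]
adjusted = least_squares_adjust(x0, A, fixed={1, 4, 7})
print(adjusted)
```

With these inputs the adjusted record has x1 = 150, x3 = 960, x4 = -10, x6 = 250 and x7 = 550. The negative value for x4 suggests why a plain least-squares distance may be unsuitable for such records, which fits the remark that different distances were tested.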
Consistency at the macro level
Example: produce a set of hypercubes for the census, estimating
each hypercube from a different data source (or combination of
data sources), and imposing that each joint distribution available
from two hypercubes is the same (Dutch virtual census).
Available method: consistent repeated weighting. However, this
method harmonizes one table at a time, and it becomes harder the
larger the number of tables to reconcile (i.e. the larger the number
of constraints to impose)
Solution: the consistency problem is reduced to an optimization
problem:
1. Compute each hypercube
2. Look for the nearest hypercubes to the observed ones, under
constraints on the equality of the joint distribution of the variables
in common between each pair of hypercubes
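As a toy illustration of step 2, the sketch below reconciles two small two-way tables that share their row variable. It assumes a Euclidean distance and a single pair of tables, in which case the nearest tables with equal row margins have a closed form: the gap between the two margins is spread evenly over all cells of that row in both tables. The function name and the numbers are hypothetical. The real problem involves many hypercubes and is solved as one large optimization.

```python
def reconcile_on_common_margin(t1, t2):
    """Return the tables closest (in least squares) to t1 and t2
    whose row margins agree. Rows index the variable in common."""
    n_b, n_c = len(t1[0]), len(t2[0])
    new1, new2 = [], []
    for row1, row2 in zip(t1, t2):
        gap = sum(row1) - sum(row2)
        # Lagrange multiplier of the margin constraint for this row.
        lam = gap / (n_b + n_c)
        new1.append([v - lam for v in row1])
        new2.append([v + lam for v in row2])
    return new1, new2


# Hypothetical estimates from two sources: (A x B) and (A x C).
table_ab = [[40, 60], [30, 70]]
table_ac = [[50, 20, 36], [45, 25, 20]]
adj_ab, adj_ac = reconcile_on_common_margin(table_ab, table_ac)
for r1, r2 in zip(adj_ab, adj_ac):
    print(sum(r1), sum(r2))
```

Each adjusted cell moves by the same amount within a row, so the interior structure of the observed tables is disturbed as little as possible in the least-squares sense.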
Statistical matching
Example: produce an estimate of the joint distribution of a pair of
variables observed in distinct data sets, with no units in common
(e.g. expenditures in one survey and income in another)
Available method: statistical matching – the idea is to avoid creating
a fictitious synthetic data set with joint observations of the variables
at the unit level, and instead to estimate the distribution from the
available data sets. The absence of joint information produces uncertainty
(e.g. Fréchet bounds)
Objective: compare some of the alternatives available in the literature
(file concatenation, calibration)
Results: file concatenation seems attractive, although sometimes
difficult to apply when dealing with samples drawn according to
complex survey designs
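The uncertainty mentioned above can be made concrete with the Fréchet bounds for two categorical variables. If only the marginal distributions are estimated, each joint cell probability p_ij must lie between max(0, p_i + q_j - 1) and min(p_i, q_j). The sketch below computes these bounds. The function name and the marginal values are made up for illustration.

```python
def frechet_bounds(p_x, p_y):
    """Lower and upper Fréchet bounds for every cell of the joint
    distribution of X and Y, given only their marginal probabilities."""
    return [
        [(max(0.0, px + py - 1.0), min(px, py)) for py in p_y]
        for px in p_x
    ]


# Hypothetical marginals, e.g. expenditure class and income class.
p_expenditure = [0.7, 0.3]
p_income = [0.6, 0.4]
bounds = frechet_bounds(p_expenditure, p_income)
for row in bounds:
    print(row)
```

The wider the interval for a cell, the less the two separate samples say about the joint behaviour of the variables.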
Record linkage
Example: a data set on enterprises with
information on the enterprise
economic situation and patents
Available methods: record linkage
Results: we tested a new Bayesian
approach (Tancredi and Liseo, Annals
of Applied Statistics, 2011). A
comparison on real data shows that its
results are similar to those obtained
with the Fellegi and Sunter approach.
The main advantage is the possibility of
directly estimating parameters related to
variables observed separately in the two
sources, without creating a
linked data file
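For reference, the Fellegi and Sunter approach used as the benchmark scores each candidate pair by a likelihood-ratio weight. The sketch below assumes conditional independence between the comparison fields and uses made-up m and u probabilities and thresholds. It is not the Bayesian method of Tancredi and Liseo, nor the Relais implementation.

```python
import math


def fs_weight(pattern, m, u):
    """Log2 likelihood-ratio weight of an agreement pattern.

    pattern[k] is True when field k agrees; m[k] and u[k] are the
    probabilities of agreement among matches and among non-matches.
    """
    weight = 0.0
    for agree, mk, uk in zip(pattern, m, u):
        if agree:
            weight += math.log2(mk / uk)
        else:
            weight += math.log2((1 - mk) / (1 - uk))
    return weight


def classify(weight, lower, upper):
    """Three-way decision rule with two thresholds."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


# Hypothetical parameters for two comparison fields (e.g. name, address).
m = [0.9, 0.8]
u = [0.1, 0.2]
for pattern in [(True, True), (True, False), (False, False)]:
    w = fs_weight(pattern, m, u)
    print(pattern, round(w, 3), classify(w, lower=-3.0, upper=3.0))
```

In practice the m and u probabilities are estimated from the data, and the thresholds are chosen to control the two error rates.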
Case study 1
Objective:
1. construct a register combining different sources on the topic
“employment”.
2. evaluate the quality of the register-based employment statistics of
Norway by comparing them with the LFS at small area level. The
comparison was made at table level rather than with the traditional
approach based on unit misclassification.
Assumptions and tools: no bias in LFS and no variance in REG. A
multilevel model was used to estimate the MSE of the register-based
statistics. Using the empirical best linear unbiased predictor (EBLUP)
estimators from the multilevel model, we are able to compare the
MSEs, and find that the register-based method mostly outperforms
LFS at municipality level.
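Under the two stated assumptions, the comparison reduces to the usual decomposition of the mean squared error. The notation is introduced here for illustration and is not taken from the slides: \hat\theta_a denotes an estimator for area a.

```latex
\mathrm{MSE}(\hat\theta_a) = \mathrm{Var}(\hat\theta_a) + \mathrm{Bias}(\hat\theta_a)^2

% No bias in LFS, no variance in REG:
\mathrm{MSE}(\hat\theta_a^{\mathrm{LFS}}) = \mathrm{Var}(\hat\theta_a^{\mathrm{LFS}}),
\qquad
\mathrm{MSE}(\hat\theta_a^{\mathrm{REG}}) = \mathrm{Bias}(\hat\theta_a^{\mathrm{REG}})^2
```

The register-based statistic therefore outperforms the LFS in an area whenever its squared bias is smaller than the sampling variance of the LFS estimate there.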
Case study 2
Objective:
• construct the educational attainment variable with the use of
register data (7), instead of using only a sample survey (e.g. LFS).
Description of the case study:
1. Description of the sources to integrate
2. Micro integration (attention is given to representativeness of the files,
reference date and consistency)
3. Estimates (the estimator combines the part available from registers
and the part from LFS – see picture). Kuijvenhoven and Scholtus
(2011) show the conditions under which the combined estimator
has a lower mean squared error (MSE) than a direct sample-based
estimator.
4. Measurement of accuracy
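One simple form such a combined estimator could take is sketched below. It assumes that people covered by the registers are counted directly and that the LFS weights are used only for respondents not found in the registers. The field names, weights and data are hypothetical, and the estimator studied by Kuijvenhoven and Scholtus (2011) may differ in its details.

```python
def combined_estimate(register_levels, lfs_respondents, level):
    """Estimated number of people with the given education level.

    register_levels: one education level per person covered by registers.
    lfs_respondents: dicts with keys "level", "weight", "in_register".
    """
    # Register part: a plain count, with no sampling error.
    register_part = sum(1 for v in register_levels if v == level)
    # Survey part: weighted total over respondents outside the registers.
    survey_part = sum(
        r["weight"]
        for r in lfs_respondents
        if not r["in_register"] and r["level"] == level
    )
    return register_part + survey_part


register_levels = ["high", "low", "high", "medium"]
lfs_respondents = [
    {"level": "high", "weight": 120.0, "in_register": False},
    {"level": "low", "weight": 80.0, "in_register": False},
    {"level": "high", "weight": 95.0, "in_register": True},
]
estimate_high = combined_estimate(register_levels, lfs_respondents, "high")
print(estimate_high)
```

Only the survey part carries sampling variance, so the gain over a direct sample-based estimator depends on how much of the population the registers cover.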
Case study 2
[Figure: picture of the combined register and LFS estimator]
Other case studies and methodological developments
Case studies
• Integration of small and medium enterprises with fiscal statements
• First Steps in Profiling Italian Patenting Enterprises
Methodological developments
• editing errors in the relations between units when linking economic
data sets to a population frame
• handling incompleteness after linkage to a population frame:
incoherence in unit types, variables and periods
• bootstrapping combined estimators based on register and survey
data
Software issues
Record linkage
Relais, a software tool developed at Istat, has been improved with some
pre-processing techniques
Statistical matching
StatMatch, an R package, has been improved with tools for assessing
uncertainty, and with a vignette
Workshop and course
Some dissemination tools:
Three on-the-job training courses (Poland, UK, Latvia). Data sets
used for these on-the-job training courses are anonymous and can
be used for training and testing purposes.
One course on statistical matching, record linkage and micro
integration (people from 13 EU, 2 candidate countries, plus
Eurostat and ECB)
One workshop: Madrid 25-26 November 2011
http://www.ine.es/e/essnetdi_ws2011.html
5 invited speakers (W. E. Winkler, B. Liseo, P.L. Conti, M. Lenk, E.
Golata), 15 contributed papers from Europe, Australia and USA.
http://www.essnet-portal.eu/di/wp5-dissemination-towards-ess/finalworkshop-madrid-november-2011
Project members
Cristina Casciano, Nicoletta Cibella, Paolo Consolini, Marco Di Zio,
Marcello D’Orazio, Marco Fortini, Daniela Ichim, Filippo Oropallo,
Laura Peci, Francesca Romana Pogelli, Mauro Scanu, Monica
Scannapieco, Giovanni Seri, Tiziana Tuoto, Luca Valentino, Jeroen
Pannekoek, Arnout van Delden, Bart Bakker, Paul Knottnerus,
Léander Kuijvenhoven, Frank Linder, Nino Mushkudiani, Dominique van
Roon, Eric Schulte Nordholt, Jean-Pierre Renfer, Daniel Kilchmann,
Marcin Szymkowiak, Adam Ambroziak, Dehnel Grażyna,
Tomasz Józefowski, Tomasz Klimanek, Jacek Kowalewski, Ewa
Kowalka, Andrzej Młodak, Artur Owczarkowski, Jan Paradysz,
Wojciech Roszka, Pietrzak Beata Rynarzewska, Magdalena
Zakrzewska, Francisco Hernandez Jimenez, Gervasio-Luís Fernández
Trasobares, Miguel Guigó Pérez, Johan Fosen, Li-Chun Zhang