Transcript Title - Istat.it
Assessing Quality for Integration Based Data
M. Denk, W. Grossmann (Institute for Scientific Computing)
Contents
• Introduction
• Data Generating Processes
• Data Quality for Integration Based Production
• Assessing Quality for Integration Based Data
• Conclusions
Introduction – Aspects of Quality
• Quality is discussed from two different points of view
– The Processing View
• What methods can be used in the production of statistics?
• Specific statistical techniques for specific statistics
• Development of models of best practice or standards
– The Reporting View
• What should quality reports look like?
Introduction – Reporting View
• Numerous formats for quality reports
– SDSS, DQAF, FedStats, StatCan, …
• Logic of the proposals follows so-called hyperdimensions
– For example, the ESS:
• Institutional Arrangements
• Core Statistical Processes
• Dimensions for Statistical Output
– Inside the hyperdimensions, so-called quality dimensions
• Relevance, Accuracy, Timeliness, Accessibility, …
Introduction – Reporting View
• Not so much agreement about the dimensions
• Possible reason: different methods / levels of conceptualization
– Concepts as mental entities
• e.g. quality dimensions in DQAF
– Concepts as meanings of general terms
• e.g. quality elements in DQAF
– Concepts as units of knowledge
• e.g. quality indicators of DQAF
– Concepts as abstractions of kinds, attributes, or properties
• measurable quantities like sampling error, …
Introduction – Reporting View
• Stronger matching of the processing view and the reporting view seems necessary
– The starting point can be the attributes and properties of statistical processes necessary for assessing quality
• From basic quality concepts we build higher-level elements by aggregation
• Prerequisite for the definition of the necessary basic quality concepts:
– Empirical analysis of different production processes
• The final result is a User Oriented Quality Certificate
Data Generating Processes
• We can distinguish two broad classes of data generating processes
– The survey based data generating process
– The integration based data generating process
Data Generating Processes – Survey based
• Most considerations about reporting quality start from the traditional survey process
– Characteristics of the traditional survey process:
• One well-defined target population (e.g. persons)
• A rather homogeneous method of data collection (e.g. questionnaire)
• A more or less linear sequence of processing steps (e.g. data cleaning, data editing, data imputation, output)
• The final output is one output file
Data Generating Processes – Integration based
• Many statistics do not follow such a linear production scheme
– Examples: indices, numerous balance sheets, National Accounts, …
• Common characteristic: data are produced from many different sources
• Let us call such processes integration based processes
• Data produced in this way are called integration based data
Data Generating Processes – Integration based
– Characteristics of integration based data processing
• Population:
– The underlying population may be split into segments
» Example: expenditures for education: government, private enterprises, households
– Often more than one population is involved, possibly also one population at different times
» Example: calculation of indices
Data Generating Processes – Integration based
– Characteristics of integration based data processing
• Data collection:
– Data collection differs between segments and populations
– Often the collected data are the output of already existing data products
• The main processing activities are alignment procedures making the different sources comparable
• The output may be a set of organized data files
Data Generating Processes – Workflow View
• Workflow for the Survey Process
[Workflow diagram] Register → Sampling → Data Collection (plus Additional Data) → Editing, Imputation, Transformation → Final Micro File → Final Tables
Data Generating Processes – Workflow View
• Workflow for the Integration Based Process
[Workflow diagram] Data Sources 1, 2, and 3 each pass through Selection, Editing, Preparation; the prepared sources are combined by Integration by Matching and Integration by Merging, with Imputation, Computation, Transformation, and Editing steps in between, yielding the Final Data Files and an Output Table
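The branching workflow above can be modeled as composable processing steps; a minimal sketch in Python, where the source contents and the step logic are illustrative assumptions, not from the talk:

```python
# Minimal sketch of an integration based workflow: each source passes
# through its own preparation steps before the sources are integrated.
# Record contents and step logic are illustrative assumptions.

def prepare(records):
    """Selection, editing, preparation: keep complete records only."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate_by_matching(*sources):
    """Integration by matching: join records that share a unit id."""
    merged = {}
    for source in sources:
        for r in source:
            merged.setdefault(r["id"], {}).update(r)
    return list(merged.values())

# Two hypothetical sources describing the same units
source1 = [{"id": 1, "value": 10}, {"id": 2, "value": None}]
source2 = [{"id": 1, "turnover": 5}, {"id": 3, "turnover": 7}]

final = integrate_by_matching(prepare(source1), prepare(source2))
print(final)
```

The point of the sketch is structural: unlike the survey workflow, there is one preparation pipeline per source, and the integration step is where the sources first meet.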
Data Quality for Integration Based Production
• Two important aspects of data quality
– Content quality
• Are the measured “concepts” really the target “concepts”?
– Production quality
• Are the methods used sound?
Data Quality for Integration Based Production – Content Quality
• Main reasons for lack of content quality
– Slight differences in the measurement of the variables (“concepts”) in the case of reuse of already existing data
– Example:
» Transport of goods on Austrian rails
» Transport of goods according to data from the railway authorities (not taking into account that the transport may partly use German rails)
– Slight differences in the definition of the segments in the underlying population
Data Quality for Integration Based Production – Content Quality
• Conclusion: using data already collected for other purposes often yields only proxy variables for the intended variables
• Question: does this coincide with your mental concept of the term “Non-Sampling Error”?
• Manuals of international organizations are often rather vague with respect to such problems
Data Quality for Integration Based Production – Content Quality
• Possible strategies for a solution
– Statistical models for aligning the concepts
– More detailed description of the concepts, using additional variables that characterize the differences as formal properties of the data
– More detailed description of the underlying populations, using additional variables that characterize the differences
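The first strategy, aligning concepts by a statistical model, can in the simplest case be a correction factor; a minimal sketch using the rail transport example, where the foreign-rail share of 0.12 is an illustrative assumption (e.g. estimated from auxiliary data), not a real figure:

```python
# Sketch: aligning a proxy variable with the target concept.
# The railway authority reports total transport of goods, but part of it
# runs on German rails; the target concept is transport on Austrian rails.
# The foreign-rail share (0.12) is an illustrative assumption.

def align_transport(reported_tonnes, foreign_rail_share):
    """Estimate transport on Austrian rails from the proxy figure."""
    return reported_tonnes * (1 - foreign_rail_share)

print(align_transport(1000.0, 0.12))
```

In practice the alignment model would be richer (e.g. route-level data), but the structure is the same: the proxy variable plus additional variables characterizing the difference yield an estimate of the target concept.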
Data Quality for Integration Based Production – Processing Quality
• Elements of processing quality
– Quality of the methods used for the different components of the integration based statistic
• This implies that we do not have one method of collection, one editing step, one imputation, … but many activities of each kind
– Quality of the methods used in the integration process
• Alignment of variables in order to overcome differences in concepts
• Standard activities such as plausibility checking, editing, and imputation necessary for the integration activities
Assessing Quality for Integration Based Data
• If we know the quality of all the components used in the integration process, we have to think about the transmission of quality in the integration steps
• The starting point should be an “Authentic Data System”
– All data used in the integration process
– Quality information about the different data sets of the system
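Such an Authentic Data System can be sketched as a structure that pairs every dataset used in the integration with its quality information; the dataset names, field names, and values below are illustrative assumptions:

```python
# Sketch of an "Authentic Data System": every dataset used in the
# integration is stored together with its quality information.
# Names, fields, and values are illustrative assumptions.
authentic_data_system = {
    "railway_authority_2007": {
        "data": [{"id": 1, "tonnes": 1000.0}],
        "quality": {"coverage": "high", "concept_match": "proxy"},
    },
    "household_survey_2007": {
        "data": [{"id": 2, "expenditure": 350.0}],
        "quality": {"coverage": "medium", "sampling_error_cv": 0.04},
    },
}

def quality_report(system):
    """Collect the quality information for all datasets of the system."""
    return {name: entry["quality"] for name, entry in system.items()}

print(quality_report(authentic_data_system))
```

Keeping the quality metadata next to the data is what makes the later compilation and calculation steps possible without going back to the original producers.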
Assessing Quality for Integration Based Data
• Distinguish two types of quality transmission
– Quality compilation
• Methods for representing the quality of the overall product
– Quality calculation
• Algorithms for assessing quality
• In both cases we need
– Methods for assessing quality
– Models of best practice / standards
Assessing Quality for Integration Based Data – Quality Compilations
• In some cases the best we can do is a better representation of the quality dimensions of the components used
– Distribution of quality indicators
– Concentration of quality indicators
Assessing Quality for Integration Based Data – Quality Compilations
– Example: coverage for integration based data
• Structure of the integrated sources together with coverage information:
Source 1: high; Source 2: high; Source 3: medium; Source 4: low; Source 5: very low; Source 6: high; Source 7: high
Assessing Quality for Integration Based Data – Quality Compilations
– Coverage distribution
[Bar chart: proportion of coverages (0% to 80%) over the categories very low, low, medium, high]
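The coverage distribution can be computed directly from the list of source coverages; a minimal sketch in Python, using the seven sources listed on the previous slide:

```python
from collections import Counter

# Distribution of a quality indicator over the integrated sources,
# using the coverage values of the seven sources from the example.
coverages = ["high", "high", "medium", "low", "very low", "high", "high"]

counts = Counter(coverages)
distribution = {level: counts[level] / len(coverages)
                for level in ["very low", "low", "medium", "high"]}
print(distribution)
```

The same pattern works for any categorical quality indicator attached to the sources, not just coverage.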
Assessing Quality for Integration Based Data – Quality Compilations
– Coverage concentration with respect to the target concept
[Line chart: cumulative coverage (0 to 1) against proportion of population (0 to 1)]
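A concentration curve of this kind can be computed by visiting the sources from best to worst coverage and accumulating population shares; a minimal sketch, where the population shares and coverage scores are illustrative assumptions rather than figures from the talk:

```python
# Sketch: cumulative coverage ("concentration") curve for integrated
# sources. Population shares and coverage scores are illustrative.
sources = [
    ("Source 1", 0.30, 0.95),  # (name, share of target population, coverage)
    ("Source 2", 0.25, 0.90),
    ("Source 3", 0.20, 0.60),
    ("Source 4", 0.15, 0.30),
    ("Source 5", 0.10, 0.10),
]

def concentration_curve(sources):
    """Return (proportion of population, cumulative coverage) points,
    visiting the sources from best to worst coverage."""
    ordered = sorted(sources, key=lambda s: s[2], reverse=True)
    total = sum(share * cov for _, share, cov in ordered)
    points, pop, covered = [(0.0, 0.0)], 0.0, 0.0
    for _, share, cov in ordered:
        pop += share
        covered += share * cov
        points.append((round(pop, 4), round(covered / total, 4)))
    return points

curve = concentration_curve(sources)
for p, c in curve:
    print(p, c)
```

The stronger the curve bows above the diagonal, the more the well-covered part of the population is concentrated in a few good sources.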
Assessing Quality for Integration Based Data – Quality Calculations
• In most cases the methods will not be formulas but advanced statistical procedures for the different quality dimensions
– Examples:
• Measurement of accuracy using variances, standard errors, or the coefficient of variation
– Could be done using the bootstrap (e.g. applied to indices by NSO-GB)
• Simulation techniques
• Sensitivity analysis (“robustness”)
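As a concrete instance of the bootstrap approach mentioned above, the standard error of a statistic can be estimated by resampling with replacement; a minimal sketch with illustrative data (the sample values are not from the talk):

```python
import random
import statistics

def bootstrap_se(data, stat=statistics.mean, reps=2000, seed=1):
    """Bootstrap standard error of a statistic: resample with
    replacement and take the std. deviation of the replicates."""
    rng = random.Random(seed)
    n = len(data)
    replicates = [stat([rng.choice(data) for _ in range(n)])
                  for _ in range(reps)]
    return statistics.stdev(replicates)

# Illustrative sample, e.g. index components
sample = [102.0, 98.5, 101.2, 99.8, 100.4, 97.9, 103.1, 100.9]
print(round(bootstrap_se(sample), 3))
```

The attraction for integration based statistics is that `stat` can be any function of the data, including a whole index computation, so no closed-form variance formula is needed.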
Conclusions
• Assessing the quality of integration based statistics needs
– A clear separation of content based quality and processing based quality
– Better documentation / representation of complex production processes, using workflow models
– Documentation of the authentic data file
– Definition of best practice / standards for integration processes
– Algorithms for calculating quality dimensions
– Methods for the representation of quality indicators