Dealing with Big Data for Official Statistics:
IT Issues
Giulio Barcaroli
Stefano De Francisci
Monica Scannapieco
Donato Summa
Istat – Italian National Institute
of Statistics
Outline
1. Background: Istat Big Data strategy and
experimental projects
2. IT issues in experimental projects
3. Final remarks
Monica Scannapieco, Dealing with Big Data for Official Statistics: IT Issues, MSIS 2014
Istat Big Data Strategy - 1
Istat (the Italian National Institute of Statistics) set up a
technical Commission
with the objective of
orienting investments on Big Data adoption
in statistical production processes
Duration: from February 2013 to
February 2015
Members coming from different
areas: Official Statistics, Academia,
Private Sector
Objective of the talk
I will NOT deal with (just) technological issues
I will deal instead (mainly) with IT methodological
issues
Example:
MapReduce-Hadoop:
Open Source Framework
Map-Reduce Programming Model:
are Official Statistics (OS) methods mapreduce-able?
Current research is investigating the mapreduce-ability of (classes of) computational problems
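As an illustration of what "mapreduce-able" can mean for a statistical method, the sketch below computes a mean and a population variance from per-partition summaries. It is a minimal single-machine sketch, not Hadoop code: the function names and the sample data are hypothetical, and the sum-of-squares formula is chosen for brevity although it can be numerically unstable on large values.

```python
from functools import reduce

def map_partition(values):
    """Map step: summarise one partition as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def combine(a, b):
    """Reduce step: merge two partial summaries (associative and commutative)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(partitions):
    """Combine all partition summaries, then derive mean and population variance."""
    n, s, ss = reduce(combine, map(map_partition, partitions))
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance

# Hypothetical data, already split into three partitions
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, variance = mean_and_variance(partitions)
```

The method fits the paradigm because each partition can be summarised independently and the summaries merge in any order. Statistics such as an exact median do not decompose this directly, which is the kind of question the slide raises.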
Istat Big Data Strategy - 2
The Commission will release a strategy for Big Data
adoption
Three experimental projects launched and monitored by
the Commission:
Persons and Places
Labour Market Estimation based on Google Trends
ICT Usage in enterprises based on Internet as a Data
Source (IaD)
Status: advanced implementation (first results already
available)
Persons and Places
Purpose
Production of the origin/destination matrix of daily mobility for
purpose of work and study at the spatial granularity of
municipalities starting from phone (tracking) data
Actors involved in the project
Istat
National Research Council
University of Pisa
Methodology
Inference of population mobility profiles from GSM Call Data
Records (CDRs)
Comparison with data derived from administrative sources
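To make the target output concrete, the sketch below builds an origin/destination matrix from simplified CDR-like records. It is an illustration only: the record layout, the night/day hour windows and the "most frequent municipality" rule are all assumptions introduced here, not the inference method used in the project.

```python
from collections import Counter, defaultdict

NIGHT_HOURS = set(range(0, 7)) | set(range(21, 24))  # assumed "home" hours
WORK_HOURS = set(range(9, 18))                        # assumed "work/study" hours

def most_common_place(places):
    """Return the most frequent municipality, or None for an empty list."""
    return Counter(places).most_common(1)[0][0] if places else None

def od_matrix(cdrs):
    """cdrs: iterable of (user_id, hour_of_day, municipality) tuples."""
    night, day = defaultdict(list), defaultdict(list)
    for user, hour, municipality in cdrs:
        if hour in NIGHT_HOURS:
            night[user].append(municipality)
        elif hour in WORK_HOURS:
            day[user].append(municipality)
    matrix = Counter()
    for user in night:
        origin = most_common_place(night[user])
        destination = most_common_place(day.get(user, []))
        if destination is not None:  # skip users with no daytime activity
            matrix[(origin, destination)] += 1
    return matrix

# Hypothetical records: (user, hour, municipality)
cdrs = [
    ("u1", 2, "Pisa"), ("u1", 10, "Firenze"), ("u1", 11, "Firenze"),
    ("u2", 23, "Pisa"), ("u2", 14, "Pisa"),
    ("u3", 5, "Lucca"), ("u3", 12, "Pisa"),
    ("u4", 3, "Lucca"),
]
matrix = od_matrix(cdrs)
```

Each cell counts the users whose inferred origin and destination fall in that pair of municipalities; this is the table that would then be compared with the administrative sources.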
Labour Market Estimation
Purpose
Test the usage of Google Trends for forecasting and
nowcasting purposes in the Labour Force domain
Actors involved in the project
Istat: Central Methodology Sector and Labour Force
Survey
Methodology
Autoregressive model vs. usage of Google Trends
as prediction models
Comparison extended to macroeconomic prediction
models
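The comparison can be pictured with a toy example: an AR(1) model against the same model with a search index added as a regressor. This is a sketch under stated assumptions, not the project's models: both series are invented, the fit is plain least squares via the normal equations, and the error is measured in-sample, where the larger model can never do worse. A real evaluation would compare out-of-sample forecasts.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (small problems only)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def rmse(X, y, beta):
    """Root mean squared error of the fitted values."""
    errors = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]
    return (sum(e * e for e in errors) / len(errors)) ** 0.5

# Hypothetical monthly series: an unemployment rate and a search-volume index
y = [10.0, 10.4, 10.3, 10.9, 10.8, 11.3, 11.9, 11.8, 12.1, 12.5]
g = [50.0, 55.0, 53.0, 60.0, 58.0, 65.0, 70.0, 68.0, 72.0, 75.0]

target = y[1:]
X_ar = [[1.0, y[t - 1]] for t in range(1, len(y))]        # AR(1) only
X_gt = [[1.0, y[t - 1], g[t]] for t in range(1, len(y))]  # AR(1) + search index

beta_ar, beta_gt = ols(X_ar, target), ols(X_gt, target)
rmse_ar, rmse_gt = rmse(X_ar, target, beta_ar), rmse(X_gt, target, beta_gt)
```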
ICT Usage in Enterprises
Purpose:
Evaluate the possibility of adopting Web scraping and text
mining techniques for estimates on the usage of ICT by
enterprises and public institutions
Actors involved in the project:
Istat: Survey on the ICT Usage in Enterprises
Cineca (Consortium of Italian universities,
National Research Council and Ministry of Education and
Research)
Methodology
Scraping of web sites for data extraction
Supervised classification task
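The data-extraction step can be sketched with the Python standard library alone. The example below pulls the visible text out of an already-downloaded page; fetching the page is left out, and the class name, the sample HTML and the keyword check are hypothetical illustrations rather than the tools used in the project.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def page_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Hypothetical page of an enterprise web site
html = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Acme Srl</h1><p>Order online from our <a href='/shop'>web shop</a>.</p>"
    "<script>var x=1;</script></body></html>"
)
text = page_text(html)
mentions_online_orders = "order online" in text.lower()  # naive illustrative feature
```

Text extracted this way would be the input to the supervised classification task; a keyword check like the last line is only a stand-in for the features a real classifier would use.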
Features of Experimental Projects
1: Persons & Places
• Data source: mobility data
• Scenario (impact on the production process): deep impact, source replaces traditional sampling and collection
2: Google Trends
• Data source: Web search records
• Scenario (impact on the production process): considerable impact, estimation phase
3: ICT Usage
• Data source: Web data extraction
• Scenario (impact on the production process): limited impact, subset of data gathered by using IaD
Key technologies (across the three projects):
• Scraping
• NoSQL
• Machine learning libraries
• Google Trends libraries
• MapReduce/Hadoop (marked "future" for one project and "future?" for another)
Statistical Phases for Big Data Management
[Figure: statistical phases selected for Big Data management, with the labels "Inversion of the two phases" and "Collapsed phases"]
• Principal selected phases
• Inversion due to the fact that the “traditional” design phase is no longer present for Big Data
• Collapse due to the fact that the same methods can be used for both phases
• Other phases, e.g. Dissemination, not (yet) involved in Big Data
Collect: IT issues - 1
Access to Big Data sources:
Type 1: Access control mechanisms that the
Big Data provider deliberately sets up, and/or
Type 2: Technological barriers
Google Trends:
Absence of APIs prevents a software
program from accessing GT data
Not possible to foresee the usage of such a
facility in production processes
Collect: IT issues - 2
ICT Usage: Both type 1 and type 2 problems
8,647 URLs of enterprises’ Web sites, but
only about 5,600 (roughly 65%) were actually accessed
Type 1: Scrapers deliberately blocked, e.g. by
mechanisms in place to verify human access to
sites, like CAPTCHA
Type 2: Usage of some technologies, like
Adobe Flash, simply prevented access to
contents
Design: IT Issues - 1
Even if a traditional survey design cannot
take place, the problem of “understanding”
the data is still present
Semantic extraction techniques
Knowledge representation and natural
language processing
E.g.: FRED (http://wit.istc.cnr.it/stlabtools/fred) makes it possible to extract an ontology from
sentences in natural language
Design: IT Issues - 2
ICT Usage:
Human inspection refined by:
Some NLP techniques: tokenization,
stemming, stopword removal, etc.
Semantic enrichment by semantic
dictionaries (WordNet)
Images: tag extraction
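The NLP steps named above can be sketched in a few lines. This is a deliberately crude illustration: the stopword list and the suffix rules are tiny, English-only assumptions made for the example, and the suffix stripping is not a real stemming algorithm such as Porter's.

```python
import re

STOPWORDS = {"the", "and", "our", "from", "of", "a", "to", "in"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "es", "s")  # crude suffix stripping, not a real stemmer

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stopwords, then stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

tokens = preprocess("Ordering products from our online shops")
# tokens == ["order", "product", "online", "shop"]
```

The normalised tokens are what a downstream classifier would count as features; semantic enrichment with a dictionary such as WordNet would be a further step not shown here.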
Process/Analyse: IT Issues - 1
Big size, possibly solvable by Map-Reduce
algorithms
Model absence, possibly solvable by
learning techniques
Privacy constraints, solvable by privacy-preserving techniques
Process/Analyse: IT Issues - 2
Map-Reduce algorithms: problem of
formulating algorithms that can be
implemented according to such a
paradigm -> “mapreduce-ability”
Recent state of the art Map-Reduce
algorithms for:
Basic graph problems, e.g. minimum spanning trees,
triangle counting and matching
Combinatorial optimization, e.g. maximum coverage,
densest subgraph, and k-means clustering
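As an example of how one of these problems is cast into the paradigm, the sketch below expresses a single k-means iteration as a map phase followed by a reduce phase. It runs on one machine and only mimics the data flow; the function names and the sample points are hypothetical, and it is not taken from the state-of-the-art algorithms the slide refers to.

```python
def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def map_phase(points, centroids):
    """Emit (centroid index, (point, 1)) for every point."""
    for point in points:
        yield nearest(point, centroids), (point, 1)

def reduce_phase(pairs, centroids):
    """Sum points and counts per key, then average to obtain the new centroids."""
    totals = {}
    for key, (point, count) in pairs:
        acc, n = totals.get(key, ((0.0,) * len(point), 0))
        totals[key] = (tuple(a + p for a, p in zip(acc, point)), n + count)
    # A centroid that received no points is left where it was
    return [tuple(a / totals[i][1] for a in totals[i][0]) if i in totals else c
            for i, c in enumerate(centroids)]

def kmeans_step(points, centroids):
    """One k-means iteration expressed as map followed by reduce."""
    return reduce_phase(map_phase(points, centroids), centroids)

# Hypothetical two-dimensional data and starting centroids
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_step(points, centroids)
```

The map phase needs only the current centroids, so points can be processed on separate nodes, and the per-key sums in the reduce phase can be merged in any order. Repeating the step until the centroids stop moving gives the full algorithm.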
Process/Analyse: IT Issues - 3
Persons and Places:
Match mobility-related data with data stored in
Istat archives
Record linkage problem should be solved
(future task)
Model Absence: neither survey-based nor
“traditional” model-based approaches
directly applicable to Big Data
Possible usage of machine learning
techniques
Process/Analyse: IT Issues - 4
ICT Usage:
Supervised learning methods used to learn the
YES/NO answers to the questionnaire (including
classification trees, random forests and adaptive
boosting)
Persons and Places:
Unsupervised learning technique, namely SOM
(Self Organizing Map) to learn mobility profiles
E.g. “free city users” vs. “embedded city users”
(more confidently estimated by deterministic
constraints)
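For readers unfamiliar with Self Organizing Maps, the sketch below trains a very small one-dimensional SOM in pure Python. Everything specific in it is an assumption made for illustration: the two-feature "profiles", the number of units, the learning-rate and radius schedules, and the deterministic initialisation. It does not reproduce the features or configuration used in Persons and Places.

```python
import math

def best_unit(weights, x):
    """Index of the unit whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))

def train_som(data, n_units, epochs=50, lr0=0.5, radius0=0.5):
    """Train a 1-D SOM; learning rate and neighbourhood radius decay linearly."""
    # Deterministic initialisation from evenly spaced data points (for reproducibility)
    step = max(1, len(data) // n_units)
    weights = [list(data[min(i * step, len(data) - 1)]) for i in range(n_units)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr, radius = lr0 * frac, max(radius0 * frac, 1e-9)
        for x in data:
            b = best_unit(weights, x)
            for i, w in enumerate(weights):
                # Gaussian neighbourhood: units near the winner move more
                h = math.exp(-((i - b) ** 2) / (2 * radius ** 2))
                weights[i] = [wj + lr * h * (xj - wj) for wj, xj in zip(w, x)]
    return weights

# Hypothetical profiles: (share of daytime calls, share of night-time calls)
data = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15),
        (0.1, 0.9), (0.2, 0.8), (0.15, 0.85)]
weights = train_som(data, n_units=2)
assigned = [best_unit(weights, x) for x in data]
```

After training, each unit's weight vector acts as a prototype profile, and `assigned` tells which prototype each user resembles most. No labels are needed, which is what makes the technique unsupervised.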
Process/Analyse: IT Issues - 5
Privacy constraints could potentially be solved in
a proactive way by relying on techniques that
work on anonymous data
Privacy-preserving data integration, e.g.
[DMKM-2004]
Privacy-preserving data mining, e.g. [TKDE 2004]
Persons and Places: Anonymous matching of
CDRs with Istat archives via privacy-preserving
record linkage
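One very simple way to match records without exchanging raw identifiers is for both parties to apply the same keyed hash and compare the results. The sketch below shows that idea only; it is not the technique of the cited works, it handles exact matches only (no tolerance for typos), and the key, the identifiers and the payloads are hypothetical.

```python
import hashlib
import hmac

def pseudonymise(identifier, secret_key):
    """Keyed hash (HMAC-SHA256) of a normalised identifier."""
    normalised = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()

def pseudonymise_table(records, secret_key):
    """Replace each identifier key by its pseudonym; run by each data owner locally."""
    return {pseudonymise(k, secret_key): v for k, v in records.items()}

def link(table_a, table_b):
    """Exact-match linkage on pseudonyms, in a deterministic order."""
    return [(table_a[h], table_b[h]) for h in sorted(table_a.keys() & table_b.keys())]

key = b"shared-secret-agreed-out-of-band"  # hypothetical shared key

# Hypothetical data held by two different parties
operator = {"ID-001": "commuter", "ID-002": "resident"}
archive = {"id-001 ": "municipality: Pisa", "ID-003": "municipality: Lucca"}

operator_side = pseudonymise_table(operator, key)  # computed by the operator
archive_side = pseudonymise_table(archive, key)    # computed by the archive owner
matches = link(operator_side, archive_side)
```

Only the pseudonymised tables need to be brought together for the `link` step. The protection depends entirely on keeping the key secret, so this should be read as pseudonymisation rather than full anonymisation.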
Concluding Remarks
Illustration of some IT issues considered
as relevant for Big Data adoption by OS
on the basis of practical experiences
Probably technology is not an issue but IT
methodology is an issue!!!
Some IT issues also share some statistical
methodological aspects
Other relevant IT issues:
Event data management
On-line analytics
Thank you for the attention!