Dealing with Big Data for Official Statistics: IT Issues Giulio Barcaroli Stefano De Francisci Monica Scannapieco Donato Summa Istat – Italian National Institute of Statistics.

Dealing with Big Data for Official Statistics: IT Issues
Giulio Barcaroli
Stefano De Francisci
Monica Scannapieco
Donato Summa
Istat – Italian National Institute of Statistics
Outline
1. Background: Istat Big Data strategy and experimental projects
2. IT issues in experimental projects
3. Final remarks
Monica Scannapieco, Dealing with Big Data for Official Statistics: IT Issues, MSIS 2014
Istat Big Data Strategy - 1
• Istat (the Italian National Institute of Statistics) set up a technical Commission with the objective of orienting investments on Big Data adoption in statistical production processes
• Duration: from February 2013 to February 2015
• Members coming from different areas: Official Statistics, Academia, Private Sector
Objective of the talk
• I will NOT deal with (just) technological issues
• I will deal instead (mainly) with IT methodological issues
• Example:
  • MapReduce/Hadoop: open source framework
  • MapReduce programming model: are Official Statistics (OS) methods mapreduce-able?
  • Current research is investigating the mapreduce-ability of (classes of) computational problems
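To make the question concrete, here is a minimal sketch of what "mapreduce-able" means for one simple statistic. It is an illustration only, not taken from the Istat projects: the function names (map_partition, reduce_pair, finalize) and the toy partitions are assumptions. Each partition is summarised as (count, sum, sum of squares), and the summaries are merged with an associative, commutative reduce, so the result does not depend on how the data are split.

```python
from functools import reduce

def map_partition(values):
    """Map step: summarise one partition as (count, sum, sum of squares)."""
    n = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    return (n, s, ss)

def reduce_pair(a, b):
    """Reduce step: merge two partial summaries (associative and commutative)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def finalize(summary):
    """Turn the merged summary into mean and population variance."""
    n, s, ss = summary
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance

# Toy data split into three partitions, as a cluster would hold it.
partitions = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]
mean, variance = finalize(reduce(reduce_pair, map(map_partition, partitions)))
```

A median, by contrast, cannot be obtained by merging per-partition medians, which is why mapreduce-ability has to be checked method by method.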
Istat Big Data Strategy - 2
• The Commission will release a strategy for Big Data adoption
• Three experimental projects launched and monitored by the Commission:
  • Persons and Places
  • Labour Market Estimation based on Google Trends
  • ICT Usage in enterprises based on Internet as a Data Source (IaD)
• Status: advanced implementation (first results already available)
Persons and Places
• Purpose
  • Production of the origin/destination matrix of daily mobility for purposes of work and study, at the spatial granularity of municipalities, starting from phone (tracking) data
• Actors involved in the project
  • Istat
  • National Research Council
  • University of Pisa
• Methodology
  • Inference of population mobility profiles from GSM Call Data Records (CDRs)
  • Comparison with data derived from administrative sources
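As a rough illustration of how an origin/destination matrix could be derived from CDRs, the sketch below uses a deliberately simple heuristic. Everything in it is an assumption made for the example and not the project's actual method: the record layout (user id, hour, municipality), the night and daytime windows, and the rule that the most frequent municipality in each window is taken as origin and destination.

```python
from collections import Counter, defaultdict

# Hypothetical CDR layout: (anonymised user id, hour of day 0-23, municipality).
cdrs = [
    ("u1", 2, "Pisa"), ("u1", 23, "Pisa"), ("u1", 10, "Livorno"), ("u1", 15, "Livorno"),
    ("u2", 1, "Lucca"), ("u2", 11, "Pisa"), ("u2", 14, "Pisa"), ("u2", 22, "Lucca"),
    ("u3", 3, "Pisa"), ("u3", 12, "Pisa"),
]

def is_night(hour):
    """Assumed night window (22:00-05:59), used as a proxy for 'at home'."""
    return hour >= 22 or hour < 6

def is_daytime(hour):
    """Assumed daytime window (09:00-16:59), a proxy for 'at work or study'."""
    return 9 <= hour < 17

def od_matrix(records):
    """Count users per (origin, destination) pair of municipalities."""
    night = defaultdict(Counter)
    day = defaultdict(Counter)
    for user, hour, municipality in records:
        if is_night(hour):
            night[user][municipality] += 1
        elif is_daytime(hour):
            day[user][municipality] += 1
    matrix = Counter()
    for user in night:
        if user in day:  # users seen in only one window are skipped
            origin = night[user].most_common(1)[0][0]
            destination = day[user].most_common(1)[0][0]
            matrix[(origin, destination)] += 1
    return matrix

matrix = od_matrix(cdrs)
```

The resulting counts are what would then be compared, cell by cell, with the matrix derived from administrative sources.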
Labour Market Estimation
• Purpose
  • Test the usage of Google Trends for forecasting and nowcasting purposes in the Labour Force domain
• Actors involved in the project
  • Istat: Central Methodology Sector and Labour Force Survey
• Methodology
  • Comparison of prediction models: autoregressive model vs. model using Google Trends
  • Comparison extended to macroeconomic prediction models
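The kind of comparison described above can be sketched as follows. This is an assumption-laden toy, not the project's models: the series are synthetic (g stands in for a search-volume index, y for the target series), the coefficients are arbitrary, the models are a first-order autoregression with and without g as an extra regressor, and the fit is compared in-sample only, whereas a real evaluation would use out-of-sample forecasts.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(rows, y):
    """Least-squares coefficients via the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def mse(rows, y, beta):
    """Mean squared error of the fitted values."""
    preds = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

# Synthetic data only: y depends on its own lag and on the index g.
g = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
y = [10.0]
for t in range(1, len(g)):
    y.append(2.0 + 0.5 * y[t - 1] + 0.3 * g[t])

targets = y[1:]
rows_ar = [[1.0, y[t - 1]] for t in range(1, len(y))]               # AR(1) only
rows_gt = [[1.0, y[t - 1], float(g[t])] for t in range(1, len(y))]  # AR(1) + index

beta_ar = ols(rows_ar, targets)
beta_gt = ols(rows_gt, targets)
mse_ar = mse(rows_ar, targets, beta_ar)
mse_gt = mse(rows_gt, targets, beta_gt)
```

Because the synthetic series is built from the index, the second model fits it almost exactly; with real data the interesting question is how much of that gain survives out of sample.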
ICT Usage in Enterprises
• Purpose
  • Evaluate the possibility of adopting Web scraping and text mining techniques for estimates on the usage of ICT by enterprises and public institutions
• Actors involved in the project
  • Istat: Survey on the ICT Usage in Enterprises
  • Cineca (Consortium of Italian universities, National Research Council and Ministry of Education and Research)
• Methodology
  • Scraping of Web sites for data extraction
  • Supervised classification task
Features of Experimental Projects
1: Persons & Places
• DATA SOURCE: Mobility data
• SCENARIO (IMPACT ON THE PRODUCTION PROCESS): Deep impact: source replaces traditional sampling and collection
2: Google Trends
• DATA SOURCE: Web search record
• SCENARIO (IMPACT ON THE PRODUCTION PROCESS): Considerable impact: estimation phase
3: ICT Usage
• DATA SOURCE: Web data extraction
• SCENARIO (IMPACT ON THE PRODUCTION PROCESS): Limited impact: subset of data gathered by using IaD
KEY TECHNOLOGIES
• Scraping
• NoSQL
• Machine learning
• Machine learning libraries
• Google Trends libraries
• MapReduce/Hadoop (future)
• MapReduce/Hadoop (future?)
Statistical Phases for Big Data Management
[Figure: statistical phases, annotated "Inversion of the two phases" and "Collapsed phases"]
• Principal selected phases
• Inversion due to the fact that the “traditional” design phase is no longer present for Big Data
• Collapse due to the fact that the same methods can be used for both phases
• Other phases, e.g. Dissemination, not (yet) involved in Big Data
Collect: IT issues - 1
• Access to Big Data sources:
  • Type 1: Access control mechanisms deliberately set up by the Big Data provider, and/or
  • Type 2: Technological barriers
• Google Trends:
  • Absence of APIs, which prevents access to Google Trends (GT) data from a software program
  • Not possible to foresee the usage of such a facility in production processes
Collect: IT issues - 2
• ICT Usage: both type 1 and type 2 problems
  • 8,647 URLs of enterprises’ Web sites, but only about 5,600 were actually accessed
  • Type 1: Scrapers deliberately blocked, e.g. by mechanisms that verify human access to sites, like CAPTCHA
  • Type 2: Usage of some technologies like Adobe Flash simply prevented access to contents
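A scraper can at least record which kind of barrier it ran into. The sketch below is a crude heuristic written for illustration, not the tooling used in the project: the markers it looks for ("captcha" in attribute values, .swf or shockwave-flash in object/embed tags) and the min_text threshold are assumptions, and real sites will need more rules.

```python
from html.parser import HTMLParser

class BarrierDetector(HTMLParser):
    """Collects crude signals of access barriers from one HTML page."""

    def __init__(self):
        super().__init__()
        self.has_captcha = False
        self.has_flash = False
        self.text_chars = 0
        self._skip = 0  # depth inside <script>/<style>, whose text is not content

    def handle_starttag(self, tag, attrs):
        values = " ".join(v.lower() for _, v in attrs if v)
        if "captcha" in values:
            self.has_captcha = True
        if tag in ("object", "embed") and (".swf" in values or "shockwave-flash" in values):
            self.has_flash = True
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_chars += len(data.strip())

def classify(html, min_text=40):
    """Return 'type 1', 'type 2' or 'accessible' for a fetched page."""
    detector = BarrierDetector()
    detector.feed(html)
    if detector.has_captcha:
        return "type 1"  # deliberate access control
    if detector.has_flash and detector.text_chars < min_text:
        return "type 2"  # content locked inside a plug-in
    return "accessible"

captcha_page = '<html><body><form><div class="g-recaptcha"></div></form></body></html>'
flash_page = ('<html><body><object data="intro.swf" '
              'type="application/x-shockwave-flash"></object></body></html>')
plain_page = ('<html><body><p>We sell industrial valves and accept online '
              'orders from our web shop.</p></body></html>')
```

Counting the pages in each class gives a breakdown of the URLs that could not be accessed by barrier type.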
Design: IT Issues - 1
• Even if a traditional survey design cannot take place, the problem of “understanding” the data is still present
• Semantic extraction techniques
• Knowledge representation and natural language processing
• E.g.: FRED (http://wit.istc.cnr.it/stlabtools/fred) makes it possible to extract an ontology from sentences in natural language
Design: IT Issues - 2
• ICT Usage:
  • Human inspection refined by:
    • Some NLP techniques: tokenization, stemming, stopword removal, etc.
    • Semantic enrichment by semantic dictionaries (WordNet)
    • Images: tag extraction
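The three NLP steps named above can be sketched in a few lines. This is a toy pipeline under stated assumptions, not the project's code: the stopword list is a tiny illustrative one, the stemmer is a crude suffix stripper rather than a real algorithm such as Porter's, and English text is used only for readability.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a full one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "our", "we", "is", "are", "for"}

# Crude suffix stripper, NOT the Porter algorithm; suffixes are tried in order.
SUFFIXES = ("ing", "es", "s")

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stopwords, then stem what is left."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

terms = preprocess("We are selling products online and accepting orders")
```

The resulting terms are the kind of features a supervised classifier would then receive.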
Process/Analyse: IT Issues - 1
• Big size, possibly solvable by Map-Reduce algorithms
• Model absence, possibly solvable by learning techniques
• Privacy constraints, solvable by privacy-preserving techniques
Process/Analyse: IT Issues - 2
• Map-Reduce algorithms: problem of formulating algorithms that can be implemented according to such a paradigm -> “mapreduce-ability”
• Recent state-of-the-art Map-Reduce algorithms for:
  • Basic graph problems, e.g. minimum spanning trees, triangle counting and matching
  • Combinatorial optimization, e.g. maximum coverage, densest subgraph, and k-means clustering
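For k-means, one iteration maps naturally onto the paradigm. The sketch below runs a single iteration in plain Python as an illustration; it is not one of the published algorithms referred to above, the function names and toy data are assumptions, and the shuffle step that a real framework performs between map and reduce is simulated by concatenating the emitted pairs.

```python
def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def map_phase(partition, centroids):
    """Map: emit (cluster index, (point, 1)) for every point in a partition."""
    return [(nearest(p, centroids), (p, 1)) for p in partition]

def reduce_phase(pairs, centroids):
    """Reduce: sum points and counts per cluster, then average."""
    sums = {}
    for key, (point, count) in pairs:
        total, n = sums.get(key, ([0.0] * len(point), 0))
        sums[key] = ([t + x for t, x in zip(total, point)], n + count)
    # A cluster that received no points keeps its old centroid.
    return [
        tuple(t / sums[i][1] for t in sums[i][0]) if i in sums else centroids[i]
        for i in range(len(centroids))
    ]

def kmeans_step(partitions, centroids):
    """One k-means iteration: map every partition, then reduce by cluster."""
    pairs = [pair for part in partitions for pair in map_phase(part, centroids)]
    return reduce_phase(pairs, centroids)

# Toy data held in two partitions, with two starting centroids.
partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_step(partitions, centroids)
```

The step works because per-cluster sums and counts can be merged in any order; the full algorithm repeats it until the centroids stop moving.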
Process/Analyse: IT Issues - 3
• Persons and Places:
  • Match mobility-related data with data stored in Istat archives
  • Record linkage problem should be solved (future task)
• Model absence: neither survey-based nor “traditional” model-based approaches directly applicable to Big Data
  • Possible usage of machine learning techniques
Process/Analyse: IT Issues - 4
• ICT Usage:
  • Supervised learning methods used to learn the YES/NO answers to the questionnaire (including classification trees, random forests and adaptive boosting)
• Persons and Places:
  • Unsupervised learning technique, namely SOM (Self-Organizing Map), to learn mobility profiles
  • E.g. “free city users” vs. “embedded city users” (more confidently estimated by deterministic constraints)
Process/Analyse: IT Issues - 5
• Privacy constraints could potentially be solved in a proactive way by relying on techniques that work on anonymous data
  • Privacy-preserving data integration, e.g. [DMKM-2004]
  • Privacy-preserving data mining, e.g. [TKDE 2004]
• Persons and Places: anonymous matching of CDRs with Istat archives via privacy-preserving record linkage
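One simple way to match records without exchanging identifiers in clear is for each party to replace them with a keyed hash and to join on the hashes. The sketch below shows that idea only; it is not the technique of the cited works, and the key, the identifying fields (name, birth date) and the payloads are hypothetical. It supports exact matches only, so it does not tolerate typos in the identifiers.

```python
import hashlib
import hmac

def normalise(name, birth_date):
    """Canonical form of the identifying fields before hashing."""
    return f"{' '.join(name.lower().split())}|{birth_date}"

def pseudonym(secret_key, name, birth_date):
    """Keyed hash (HMAC-SHA256) of the normalised identifier."""
    message = normalise(name, birth_date).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()

def link(records_a, records_b):
    """Join two pseudonymised datasets on equal pseudonyms."""
    index = {pid: payload for pid, payload in records_b}
    return [(payload, index[pid]) for pid, payload in records_a if pid in index]

KEY = b"shared-secret-agreed-out-of-band"  # hypothetical key

# Each party pseudonymises its own records before exchanging them.
operator_side = [(pseudonym(KEY, "Mario  Rossi", "1980-01-31"), {"profile": "commuter"})]
archive_side = [
    (pseudonym(KEY, "mario rossi", "1980-01-31"), {"municipality": "Pisa"}),
    (pseudonym(KEY, "Anna Bianchi", "1975-06-02"), {"municipality": "Lucca"}),
]
matches = link(operator_side, archive_side)
```

Normalising before hashing matters: without it, the differences in case and spacing in the example would produce different pseudonyms and the match would be lost.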
Concluding Remarks
• Illustration of some IT issues considered relevant for Big Data adoption by Official Statistics (OS), on the basis of practical experiences
• Probably technology is not an issue, but IT methodology is an issue!!!
• Some IT issues also share some statistical methodological aspects
• Other relevant IT issues:
  • Event data management
  • On-line analytics
Thank you for the attention!