Dealing with Big Data for Official Statistics: IT Issues Giulio Barcaroli Stefano De Francisci Monica Scannapieco Donato Summa Istat – Italian National Institute of Statistics.
Outline
1. Background: Istat Big Data strategy and experimental projects
2. IT issues in experimental projects
3. Final remarks

Monica Scannapieco, Dealing with Big Data for Official Statistics: IT Issues, MSIS 2014

Istat Big Data Strategy - 1
• Istat (the Italian National Institute of Statistics) set up a technical Commission to guide investments in Big Data adoption in statistical production processes
• Duration: from February 2013 to February 2015
• Members come from different areas: official statistics, academia, private sector

Objective of the talk
• I will NOT deal with (just) technological issues
• I will deal instead (mainly) with IT methodological issues
• Example:
  • MapReduce/Hadoop: open source framework
  • MapReduce programming model: are official statistics (OS) methods mapreduce-able?
  • Current research is investigating the mapreduce-ability of (classes of) computational problems

Istat Big Data Strategy - 2
• The Commission will release a strategy for Big Data adoption
• Three experimental projects launched and monitored by the Commission:
  • Persons and Places
  • Labour Market Estimation based on Google Trends
  • ICT Usage in Enterprises based on Internet as a Data Source (IaD)
• Status: advanced implementation (first results already available)

Persons and Places
• Purpose: production of the origin/destination matrix of daily mobility for work and study purposes, at the spatial granularity of municipalities, starting from phone (tracking) data
• Actors involved in the project: Istat, National Research Council, University of Pisa
• Methodology:
  • Inference of population mobility profiles from GSM Call Data Records (CDRs)
  • Comparison with data derived from administrative sources

Labour Market Estimation
• Purpose: test the usage of Google Trends for forecasting and nowcasting purposes in the Labour Force domain
• Actors involved in the project: Istat (Central Methodology Sector and Labour Force Survey)
• Methodology:
  • Autoregressive model vs. usage of Google Trends as prediction models
  • Comparison extended to macroeconomic prediction models

ICT Usage in Enterprises
• Purpose: evaluate the possibility of adopting Web scraping and text mining techniques for estimates on the usage of ICT by enterprises and public institutions
• Actors involved in the project:
  • Istat: Survey on the ICT Usage in Enterprises
  • Cineca (consortium of Italian universities, National Research Council and Ministry of Education and Research)
• Methodology:
  • Scraping of web sites for data extraction
  • Supervised classification task

Features of Experimental Projects
1: Persons & Places
• Data source: mobility data
• Scenario (impact on the production process): deep impact, the source replaces traditional sampling and collection
• Key technologies: machine learning libraries; MapReduce/Hadoop (future)
2: Google Trends
• Data source: Web search record
• Scenario (impact on the production process): considerable impact on the estimation phase
• Key technologies: Google Trends
3: ICT Usage
• Data source: Web data extraction
• Scenario (impact on the production process): limited impact, a subset of data is gathered by using IaD
• Key technologies: scraping; NoSQL; machine learning libraries; MapReduce/Hadoop (future?)

Statistical Phases for Big Data Management
[Figure: principal selected phases, annotated "Inversion of the two phases" and "Collapsed phases"]
• Inversion is due to the fact that the “traditional” design phase is no longer present for Big Data
• Collapse is due to the fact that the same methods can be used for both phases
• Other phases, e.g.
Dissemination, not (yet) involved in Big Data

Collect: IT Issues - 1
• Problems in accessing Big Data sources:
  • Type 1: access control mechanisms deliberately set up by the Big Data provider, and/or
  • Type 2: technological barriers
• Google Trends:
  • Absence of APIs, which prevents software programs from accessing Google Trends (GT) data
  • Not possible to foresee the usage of such a facility in production processes

Collect: IT Issues - 2
• ICT Usage: both type 1 and type 2 problems
  • 8,647 URLs of enterprises' Web sites, but only about 5,600 were actually accessed
  • Type 1: scrapers deliberately blocked, e.g. by mechanisms that verify human access to sites, like CAPTCHA
  • Type 2: usage of some technologies, like Adobe Flash, simply prevented access to contents

Design: IT Issues - 1
• Even if a traditional survey design cannot take place, the problem of “understanding” the data is still present
• Semantic extraction techniques
• Knowledge representation and natural language processing
• E.g.: FRED (http://wit.istc.cnr.it/stlabtools/fred) allows extracting an ontology from sentences in natural language

Design: IT Issues - 2
• ICT Usage: human inspection refined by:
  • Some NLP techniques: tokenization, stemming, stopword removal, etc.
  • Semantic enrichment by semantic dictionaries (WordNet)
  • Images: tag extraction

Process/Analyse: IT Issues - 1
• Big size, possibly solvable by MapReduce algorithms
• Model absence, possibly solvable by learning techniques
• Privacy constraints, solvable by privacy-preserving techniques

Process/Analyse: IT Issues - 2
• MapReduce algorithms: problem of formulating algorithms that can be implemented according to this paradigm -> “mapreduce-ability”
• Recent state-of-the-art MapReduce algorithms for:
  • Basic graph problems, e.g. minimum spanning trees, triangle counting and matching
  • Combinatorial optimization, e.g. maximum coverage, densest subgraph, and k-means clustering

Process/Analyse: IT Issues - 3
• Persons and Places: match mobility-related data with data stored in Istat archives
  • Record linkage problem should be solved (future task)
• Model absence: neither survey-based nor “traditional” model-based approaches are directly applicable to Big Data
  • Possible usage of machine learning techniques

Process/Analyse: IT Issues - 4
• ICT Usage: supervised learning methods used to learn the YES/NO answers to the questionnaire (including classification trees, random forests and adaptive boosting)
• Persons and Places: unsupervised learning technique, namely SOM (Self-Organizing Map), to learn mobility profiles
  • E.g. “free city users” vs. “embedded city users” (more confidently estimated by deterministic constraints)

Process/Analyse: IT Issues - 5
• Privacy constraints could potentially be solved in a proactive way by relying on techniques that work on anonymous data
  • Privacy-preserving data integration, e.g. [DMKM-2004]
  • Privacy-preserving data mining, e.g. [TKDE 2004]
• Persons and Places: anonymous matching of CDRs with Istat archives via privacy-preserving record linkage

Concluding Remarks
• Illustration of some IT issues considered relevant for Big Data adoption by official statistics (OS), on the basis of practical experiences
• Probably technology is not an issue, but IT methodology is!
• Some IT issues also share some statistical methodological aspects
• Other relevant IT issues:
  • Event data management
  • On-line analytics

Thank you for the attention!
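
The “mapreduce-ability” question raised under Process/Analyse can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not code from the Istat projects: it assumes the data are already split into partitions, it runs in a single process rather than on Hadoop, and the names Moments, map_partition, combine and mapreduce_mean_variance are hypothetical. Mean and variance are mapreduce-able because each partition can be summarised independently (map) and the summaries can be merged with an associative operation (reduce).

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Moments:
    """Partial aggregate of one partition: count, mean, sum of squared deviations."""
    n: int
    mean: float
    m2: float


def map_partition(values: Iterable[float]) -> Moments:
    """Map step: summarise one partition without looking at any other partition."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        # Welford's single-pass update of mean and squared deviations
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return Moments(n, mean, m2)


def combine(a: Moments, b: Moments) -> Moments:
    """Reduce step: merge two partial aggregates (pairwise formula of Chan et al.)."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Moments(n, mean, m2)


def mapreduce_mean_variance(partitions: List[List[float]]) -> Tuple[float, float]:
    """Return (mean, sample variance) computed from per-partition summaries only."""
    total = reduce(combine, map(map_partition, partitions), Moments(0, 0.0, 0.0))
    variance = total.m2 / (total.n - 1) if total.n > 1 else float("nan")
    return total.mean, variance


if __name__ == "__main__":
    # Hypothetical toy data split across three partitions
    print(mapreduce_mean_variance([[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]))
```

The design point is that combine only needs the three numbers in each Moments value, so the result does not depend on how the records were partitioned. A statistic such as the exact median does not admit an equally simple constant-size mergeable summary, which is the kind of obstacle the mapreduce-ability question points at.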