Transcript Slide 1
Historical Data Integration based on Collective Intelligence
Vladimir Zadorozhny
Graduate Information Science and Technology Program
School of Information Sciences, University of Pittsburgh
NADM Group
V. Zadorozhny, WHD Colloquium, March 27, 2012

Challenge
(Diagram: diverse, heterogeneous, semi-structured data sources feed the WHD data integration infrastructure, which produces consolidated structured information.)

Web of Data?
• Linked Data (LD): using the Web to create typed links between data from different sources
• Linked Data uses RDF (Resource Description Framework) to make typed statements (triples)
• Expected result: a Web of Data that extends the Web with a global data space connecting diverse domains (people, companies, publications, etc.)
• In general, the Web of Data has the potential (still questionable) to support loose data coupling, which may enable more efficient data utilization
While WHD can use LD and related Web mashup technologies to some extent, it would be premature to rely on the Linked Data infrastructure.

Dataverse Network?
• An open source application to publish, share, reference, extract and analyze research data, which makes it easier to make data available to others
• "Dataverse owners can upload any file type and format (excel, txt, pdf, doc, etc.), and the files will be stored and made available in the original format" (http://thedata.org/files/dataversehandout.pdf)
• Information consumers must still integrate the data sources themselves to perform analysis across multiple "dataverses".
While WHD aims to be part of the Dataverse Network, it does not encourage users to contribute data in ANY format. Instead, users integrate their data into the WHD repository as they submit it. To summarize, the WHD infrastructure crowdsources the data integration task, not just the data contribution task.

General WHD Architecture
(Architecture diagram. Actors: Information Providers, Information Consumers. Components: Data Submission System, Wrapper Generation, Wrapper Registration, Wrappers, Internal Data Reliability Assessment, External Data Reliability Assessment, Data Fusion. Data labels: heterogeneous historical data sources, structured homogeneous historical data, annotated historical data, fused historical data.)

Simple Scenario
Query against the WHD infrastructure: select * from Population
Extendable target schema (relational is not mandatory): Source | Location | From | To | Population
Options: keep data remotely, or materialize data

Data Source: s1 (xl), wrapped into the target schema:
Source | Location | From | To | Population
s1 | Mauritania | 01/01/1950 | 12/31/1950 | 692,000
s1 | Mauritania | 01/01/1960 | 12/31/1960 | 892,000
s1 | Senegal | 01/01/1950 | 12/31/1950 | 2,543,000
s1 | Senegal | 01/01/1960 | 12/31/1960 | 3,277,000
Wrapper mapping for s1:
Territories -> Location
Population -> Population
Data Aggregation -> Total
Year -> From, To

Data Source: s2 (doc), wrapped into the target schema:
Source | Location | From | To | Population
s2 | Liberia | 01/01/1950 | 12/31/1950 | 824,000
s2 | Liberia | 01/01/1960 | 12/31/1960 | 1,052,000
s2 | Ivory Coast | 01/01/1950 | 12/31/1950 | 2,505,000
s2 | Ivory Coast | 01/01/1960 | 12/31/1960 | 3,692,000
Wrapper mapping for s2:
region -> Location
Population -> Population
Data Aggregation -> Total
Year -> From, To
Text of source s2: According to the 2006 revision of the World Population Prospects, the total population in the region of Liberia in 1950 was 824,000. The average annual population growth over the following ten years was 2.5 percent. For Ivory Coast those numbers are 2,505,000 and 3.6, respectively.

Big Picture: continuously growing infrastructure (à la Wikipedia)
(Diagram: WHD Infrastructure with Data Collection, Data Curation and Data Utilization.)
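
The wrapper mappings in the Simple Scenario (for example Territories -> Location and Year -> From, To) can be illustrated with a small sketch. This is only an illustration under stated assumptions: the WHD prototype defines wrappers as Pentaho Kettle transformations, not Python, and the names `S1_MAPPING` and `wrap_record` are hypothetical. It assumes a yearly figure covers the full calendar year, as the From/To values on the slide suggest.

```python
from datetime import date

# Hypothetical field mapping for source s1, taken from the slide:
# Territories -> Location, Population -> Population.
S1_MAPPING = {"Territories": "Location", "Population": "Population"}


def wrap_record(source_id, record, mapping, year_field="Year"):
    """Translate one source row into the target schema
    (Source, Location, From, To, Population)."""
    target = {"Source": source_id}
    for src_field, dst_field in mapping.items():
        target[dst_field] = record[src_field]
    # Year -> From, To: assume the figure covers the whole calendar year.
    year = int(record[year_field])
    target["From"] = date(year, 1, 1)
    target["To"] = date(year, 12, 31)
    return target


row = wrap_record(
    "s1",
    {"Territories": "Mauritania", "Year": 1950, "Population": 692000},
    S1_MAPPING,
)
```

A second source such as s2 would need only a different mapping dictionary (region -> Location), which is what lets each data provider describe their own source while the target schema stays the same.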

WHD Prototype
• Built by a group of graduate IS students as a special project in the Advanced Data Management class (INFSCI2711)
• Content Management → Pligg (open source content management system based on Apache, PHP, and MySQL)
• Data Integration Engine → Pentaho Kettle (open source data integration engine; Java-based GUI and command-line tools; XML-based data transformation files)
• Data providers download the wrapper-generating software, configure wrappers on their workstation (using preconfigured templates), and register the wrappers on the WHD server

(Screenshot labels: Data Source, Data Transformation, Transformed Data, XML Wrapper)

Data Reliability Assessment and Data Fusion
• Systems based on crowdsourcing require mechanisms to ensure data quality.
• The WHD infrastructure will support efficient data curation strategies based on advanced data reliability assessment and data fusion methods.
• As the system continuously receives new historical reports, WHD estimates the reliability of these data, and the estimate evolves as new evidence arrives.
• WHD uses a measure of the inconsistency caused by a report to assess its internal reliability.
• WHD also allows users to submit subjective feedback on the reliability of data, which is used to assess external reliability.
• WHD uses subjective logic to combine the internal and external reliability assessments.

Historical Data: Redundancy

Temporal Overlaps
t1 | source_ref1 | Measles | NYC | 10/10/1900 | 10/10/1920 | 700
t2 | source_ref2 | Measles | NYC | 10/20/1910 | 10/30/1930 | 300
(Timeline of Measles reports, 1900 to 1930: 700 cases reported for 1900 to 1920, 300 cases reported for 1910 to 1930.)
Total number of Measles cases in New York City from 1900 to 1930: 700 + 300 = 1000 ???
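
The sum above is questionable because the reporting periods of t1 and t2 intersect, so some cases may be counted twice. A minimal sketch of how such pairs could be flagged follows. It assumes closed date intervals and exact matches on disease and location; `Report` and `temporally_overlap` are illustrative names, not part of the WHD system.

```python
from datetime import date
from typing import NamedTuple


class Report(NamedTuple):
    rid: str
    source: str
    disease: str
    location: str
    start: date
    end: date
    cases: int


def temporally_overlap(a, b):
    """True if both reports describe the same disease and location
    and their closed date intervals intersect."""
    same_subject = a.disease == b.disease and a.location == b.location
    return same_subject and a.start <= b.end and b.start <= a.end


t1 = Report("t1", "source_ref1", "Measles", "NYC",
            date(1900, 10, 10), date(1920, 10, 10), 700)
t2 = Report("t2", "source_ref2", "Measles", "NYC",
            date(1910, 10, 20), date(1930, 10, 30), 300)

overlap = temporally_overlap(t1, t2)   # the intervals share 1910 to 1920
naive_total = t1.cases + t2.cases      # the questionable sum from the slide
```

When `overlap` is true, the naive total can only be treated as an upper bound; how to apportion cases within the shared period is exactly the kind of decision the data fusion step has to make.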
Temporal overlap between t1 and t2: both reports cover 10/20/1910 to 10/10/1920.

Spatial Overlaps
t3 | source_ref1 | Smallpox | NY | 10/20/1900 | 10/20/1920 | 500
t4 | source_ref1 | Smallpox | NYC | 10/30/1920 | 10/30/1930 | 600
(Timeline of Smallpox reports, 1900 to 1930: 500 cases (NY) for 1900 to 1920, 600 cases (NYC) for 1920 to 1930.)
Total number of Smallpox cases in New York State from 1900 to 1930: 500 + 600 = 1100 ???
Spatial overlap between t3 and t4: NYC is part of NY.

Naming Overlaps
t5 | source_ref1 | Yellow fever | NY | 10/10/1900 | 10/10/1920 | 700
t6 | source_ref2 | Hepatitis | NY | 10/10/1900 | 10/10/1920 | 700
t7 | source_ref4 | Hepatitis B | NY | 10/20/1910 | 10/30/1930 | 300
Total number of Hepatitis cases in New York State from 1920 to 1930: 700 + 700 + 300 = 1700 ???
Naming overlap between t5, t6 and t7

Historical Data: Inconsistency
(Figure: redundant and inconsistent Measles reports in NYC plotted over time, with reported values 200, 400, ..., 300 and reports R1: 700 and R2: 500.)

Information Consumer Toolset (ICTS): Data Visualization Dashboard
ICTS: Map Exhibits and Timeline Widgets
ICTS: Motion Chart Animation

Conclusion
• We explore a novel approach to reliable, large-scale historical data integration based on collective intelligence
• We implement this approach in the WHD infrastructure for consolidating heterogeneous historical data
• Major challenge: how to engage a large community of researchers to share their data and collectively resolve data heterogeneities in a continuously growing, large-scale, distributed historical repository?
– contributions from CHAI members (only a small fraction of Wikipedia users contribute the information that ensures its growth)
– as the infrastructure evolves, users may become interested in “embedding” their data in a larger context to perform global analysis and to use WHD tools
– open development platform (extendable data transformation library and toolsets)

Acknowledgements
Doctoral Students: Ying-Feng Hsu, Julian Lee
Graduate IS Students (WHD system development team): Andrew Barnett (team leader), Andrew Entin, Thomas Junker, Jidapa Kraisangka, Han Liao, Eric Miller, Ye Peng, Evan Pulgino, Henry Quattrone, Mark Swartz, Miao Tan, Liu Yuchen, Lihong Zhang