Transcript Slide 1

Historical Data Integration based on
Collective Intelligence
Vladimir Zadorozhny
Graduate Information Science and Technology Program
School of Information Sciences
University of Pittsburgh
NADM Group
V. Zadorozhny
WHD Colloquium, March 27, 2012
Challenge
Diverse, Heterogeneous, Semi-structured Data Sources
→ WHD Data Integration Infrastructure
→ Consolidated Structured Information
Web of Data?
• Linked Data: using the Web to create typed links between data
from different sources
• Linked Data uses RDF (Resource Description Framework) to make
typed statements (triples)
• Expected result: a Web of Data that extends the Web with a global
data space connecting diverse domains (people, companies,
publications, etc.)
• In general, the Web of Data has the potential (still questionable) to
support loose data coupling, which may facilitate more efficient data
utilization
→ While WHD can utilize LD and related Web mashup
technologies to some extent, it would be premature to rely
upon the Linked Data infrastructure
Dataverse Network?
• An open source application to publish, share, reference, extract and
analyze research data that facilitates making data available to others
• "Dataverse owners can upload any file type and format (excel, txt, pdf,
doc, etc.), and the files will be stored and made available in the
original format" (http://thedata.org/files/dataversehandout.pdf)
• Information consumers must still integrate the data sources themselves
to perform analysis across multiple "dataverses".
→ While WHD aims to be a part of the Dataverse Network, it
would not encourage users to contribute data in ANY format.
Instead, users integrate their data into the WHD repository while
submitting the data.
→ To summarize, the WHD infrastructure crowdsources the data
integration task, not just the data contribution task.
General WHD Architecture
[Architecture diagram. Labeled components, from providers to consumers:
Information Providers; Heterogeneous historical data sources; multiple
Wrappers (Wrapper Generation, Wrapper Registration); Data Submission
System; Structured homogeneous historical data; Internal Data Reliability
Assessment and External Data Reliability Assessment; Annotated historical
data; Data Fusion; Fused historical data; Information Consumers]
Simple Scenario
Query: select * from Population
WHD Infrastructure
Extendable Target Schema (relational is not mandatory); data can be kept
remotely or materialized:
Source | Location    | From       | To         | Population
s1     | Mauritania  | 01/01/1950 | 12/31/1950 | 692,000
s1     | Mauritania  | 01/01/1960 | 12/31/1960 | 892,000
s1     | Senegal     | 01/01/1950 | 12/31/1950 | 2,543,000
s1     | Senegal     | 01/01/1960 | 12/31/1960 | 3,277,000
s2     | Liberia     | 01/01/1950 | 12/31/1950 | 824,000
s2     | Liberia     | 01/01/1960 | 12/31/1960 | 1,052,000
s2     | Ivory Coast | 01/01/1950 | 12/31/1950 | 2,505,000
s2     | Ivory Coast | 01/01/1960 | 12/31/1960 | 3,692,000
Data Source: s1 (xl)
Wrapper Mapping:
Territories -> Location
Population -> Population
Year -> From, To
Data Aggregation -> Total
Data Source: s2 (doc)
Wrapper Mapping:
region -> Location
Population -> Population
Year -> From, To
Data Aggregation -> Total
Text of s2:
According to the 2006 revision of the World Population Prospects
the total population in the region of Liberia in 1950 was 824,000.
The average population growth percent per year for the following
ten years was 2.5. For Ivory Coast those numbers are 2,505,000
and 3.6, respectively.
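The wrapper mapping for source s1 could be sketched in Python roughly as follows. This is only an illustration, not the WHD implementation: the function name `wrap_s1_row`, the dictionary-based row format, and the choice to expand a year into a full calendar-year interval are assumptions based on the mapping labels on the slide (Territories -> Location, Population -> Population, Year -> From, To).

```python
from datetime import date

# Hypothetical row format for source s1; the field names follow the
# slide's mapping labels, the values are taken from the slide.
S1_ROWS = [
    {"Territories": "Mauritania", "Year": 1950, "Population": 692_000},
    {"Territories": "Senegal", "Year": 1960, "Population": 3_277_000},
]


def wrap_s1_row(row, source="s1"):
    """Map one s1 row onto the target schema
    (Source, Location, From, To, Population)."""
    year = row["Year"]
    return {
        "Source": source,
        "Location": row["Territories"],    # Territories -> Location
        "From": date(year, 1, 1),          # Year -> From (assumed: Jan 1)
        "To": date(year, 12, 31),          # Year -> To (assumed: Dec 31)
        "Population": row["Population"],   # Population -> Population
    }


target_rows = [wrap_s1_row(r) for r in S1_ROWS]
for r in target_rows:
    print(
        r["Source"],
        r["Location"],
        r["From"].strftime("%m/%d/%Y"),
        r["To"].strftime("%m/%d/%Y"),
        f'{r["Population"]:,}',
    )
```

A wrapper for s2 would need a text-extraction step first, since s2 is a prose document rather than a spreadsheet.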
Big Picture: continuously growing
infrastructure (a la Wikipedia)
WHD Infrastructure
Data Utilization
Data Curation
Data Collection
WHD Prototype
• Group of graduate IS students: special project in Advanced Data
Management class (INFSCI2711)
• Content Management → Pligg (open source content management
system; Apache, PHP, and MySQL based)
• Data Integration Engine → Pentaho Kettle (open source data
integration engine; Java-based GUI and command-line tools; XML-based
data transformation file)
• Data providers:
– download Wrapper Generating Software
– configure wrappers on their workstation (using
preconfigured templates)
– register wrappers on WHD Server
[Screenshot: XML Wrapper. Data Source → Data Transformation → Transformed Data]
Data Reliability Assessment and
Data Fusion
• The systems based on crowdsourcing require mechanisms to
ensure data quality.
• WHD Infrastructure will support efficient data curation strategies
based on advanced data reliability assessment and data fusion
methods.
• As the system continuously receives new historical reports, WHD
estimates the reliability of this data, which evolves with respect to new
evidence.
• WHD uses a measure of the inconsistency caused by a report to assess
its internal reliability.
• WHD also allows users to submit their subjective feedback on
reliability of data to assess external reliability.
• WHD utilizes subjective logic to combine internal and external
reliability assessments.
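As a rough illustration of combining two reliability assessments with subjective logic, the sketch below fuses an "internal" and an "external" opinion using the cumulative fusion (consensus) operator from the subjective logic literature. The slides do not say which operator WHD uses, so the choice of operator, the `Opinion` class, and the numeric values are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """Binomial opinion: belief + disbelief + uncertainty = 1."""
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5

    def expected(self) -> float:
        # Projected probability that the report is reliable.
        return self.belief + self.base_rate * self.uncertainty


def cumulative_fusion(a: Opinion, b: Opinion) -> Opinion:
    """Cumulative fusion of two opinions about the same report."""
    k = a.uncertainty + b.uncertainty - a.uncertainty * b.uncertainty
    if k == 0:
        # Both opinions are dogmatic (zero uncertainty). Averaging them
        # with equal weights is a simplification chosen for this sketch.
        return Opinion(
            (a.belief + b.belief) / 2,
            (a.disbelief + b.disbelief) / 2,
            0.0,
            a.base_rate,
        )
    return Opinion(
        (a.belief * b.uncertainty + b.belief * a.uncertainty) / k,
        (a.disbelief * b.uncertainty + b.disbelief * a.uncertainty) / k,
        (a.uncertainty * b.uncertainty) / k,
        a.base_rate,
    )


# Illustrative values only: internal (inconsistency-based) and
# external (user-feedback-based) opinions about one report.
internal = Opinion(belief=0.6, disbelief=0.2, uncertainty=0.2)
external = Opinion(belief=0.5, disbelief=0.1, uncertainty=0.4)
fused = cumulative_fusion(internal, external)
print(fused, round(fused.expected(), 3))
```

The fused opinion has lower uncertainty than either input, which matches the idea that reliability estimates firm up as evidence accumulates.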
Historical Data: Redundancy
Temporal Overlaps
t1 | source_ref1 | Measles | NYC | 10/10/1900 | 10/10/1920 | 700
t2 | source_ref2 | Measles | NYC | 10/20/1910 | 10/30/1930 | 300
[Timeline, 1900-1930: Measles reports of 700 cases (t1) and 300 cases (t2)]
Total number of Measles cases in New York City from 1900 to 1930:
700 + 300 = 1000 ??? Temporal overlap between t1 and t2
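The temporal overlap between t1 and t2 can be detected with a simple interval intersection. The sketch below is a minimal illustration; the tuple layout and the function name `temporal_overlap` are assumptions, while the dates and counts come from the slide.

```python
from datetime import date

# (id, disease, location, from, to, cases), values from the slide.
t1 = ("t1", "Measles", "NYC", date(1900, 10, 10), date(1920, 10, 10), 700)
t2 = ("t2", "Measles", "NYC", date(1910, 10, 20), date(1930, 10, 30), 300)


def temporal_overlap(a, b):
    """Return the (start, end) interval shared by two reports, or None."""
    start = max(a[3], b[3])
    end = min(a[4], b[4])
    return (start, end) if start <= end else None


overlap = temporal_overlap(t1, t2)
if overlap is not None:
    # Naively adding the two counts would double count this interval.
    print("Overlap from", overlap[0], "to", overlap[1])
```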
Spatial Overlaps
t3 | source_ref1 | Smallpox | NY | 10/20/1900 | 10/20/1920 | 500
t4 | source_ref1 | Smallpox | NYC | 10/30/1920 | 10/30/1930 | 600
[Timeline, 1900-1930: Smallpox reports of 500 cases (NY) and 600 cases (NYC)]
Total number of Smallpox cases in New York State from 1900 to 1930:
500 + 600 = 1100 ??? Spatial overlap between t3 and t4
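Spatial overlaps such as NY versus NYC could be flagged by checking whether one location contains the other. The sketch below assumes a small hypothetical containment table (`PARENT`); a real system would need a proper gazetteer, and the slides do not describe how WHD resolves locations.

```python
# Hypothetical containment table: child location -> parent location.
PARENT = {"NYC": "NY"}


def ancestors(location):
    """Return all locations that contain the given location."""
    chain = []
    while location in PARENT:
        location = PARENT[location]
        chain.append(location)
    return chain


def spatially_related(loc_a, loc_b):
    """True if the locations are equal or one contains the other."""
    return loc_a == loc_b or loc_a in ancestors(loc_b) or loc_b in ancestors(loc_a)


# t3 is reported for NY (the state), t4 for NYC (a city within it).
t3_location, t4_location = "NY", "NYC"
print(spatially_related(t3_location, t4_location))
```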
Naming Overlaps
t5 | source_ref1 | Yellow fever | NY | 10/10/1900 | 10/10/1920 | 700
t6 | source_ref2 | Hepatitis | NY | 10/10/1900 | 10/10/1920 | 700
t7 | source_ref4 | Hepatitis B | NY | 10/20/1910 | 10/30/1930 | 300
Total number of Hepatitis cases in New York State from 1900 to 1930:
700 + 700 + 300 = 1700 ??? Naming overlap between t5, t6 and t7
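Naming overlaps could be surfaced by mapping each reported name to a canonical term and grouping reports that share it. In the sketch below, the `CANONICAL_NAME` table is a hypothetical vocabulary that simply mirrors the grouping suggested by the slide (t5, t6 and t7 treated as overlapping); it is not a medical classification and not necessarily how WHD resolves names.

```python
from collections import defaultdict

# Hypothetical vocabulary mirroring the slide's grouping of names.
CANONICAL_NAME = {
    "Yellow fever": "Hepatitis",
    "Hepatitis": "Hepatitis",
    "Hepatitis B": "Hepatitis",
}

# (id, reported disease name, cases), values from the slide.
reports = [
    ("t5", "Yellow fever", 700),
    ("t6", "Hepatitis", 700),
    ("t7", "Hepatitis B", 300),
]


def group_by_canonical_name(rows):
    """Group report ids by the canonical name of the reported disease."""
    groups = defaultdict(list)
    for report_id, disease, _cases in rows:
        groups[CANONICAL_NAME.get(disease, disease)].append(report_id)
    return dict(groups)


# Any canonical name with more than one report is a candidate overlap.
naming_overlaps = {
    name: ids
    for name, ids in group_by_canonical_name(reports).items()
    if len(ids) > 1
}
print(naming_overlaps)  # {'Hepatitis': ['t5', 't6', 't7']}
```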
Historical Data: Inconsistency
Redundant and Inconsistent:
[Timeline figure: Measles reports in NYC along a time axis, with
values 200, 400, ..., 300, and overlapping reports R1: 700 and R2: 500]
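One hypothetical way to quantify the inconsistency caused by an aggregate report is the relative gap between its value and the sum of the finer-grained reports it covers. The slides do not define WHD's actual measure, and the figure layout is not recoverable, so both the formula and the assumption that R1 covers the reports with values 200 and 400 are purely illustrative.

```python
def relative_inconsistency(aggregate_value, component_values):
    """Relative gap between an aggregate report and the sum of the
    component reports it is assumed to cover (0.0 means consistent)."""
    if aggregate_value <= 0:
        raise ValueError("aggregate_value must be positive")
    return abs(aggregate_value - sum(component_values)) / aggregate_value


# Assumed arrangement: R1 (700) spans the reports with values 200 and 400.
r1_score = relative_inconsistency(700, [200, 400])
print(round(r1_score, 3))
```

A score like this could feed the internal reliability estimate: the larger the gap a report introduces, the lower its assessed reliability.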
Information Consumer Toolset:
Data Visualization Dashboard
ICTS: Map Exhibits and Timeline Widgets
ICTS: Motion Chart Animation
Conclusion
• We explore a novel approach to reliable, large-scale historical data
integration based on collective intelligence
• We implement this approach in the WHD infrastructure for
consolidating heterogeneous historical data
• Major challenge: how to engage a large community of researchers
to share their data and collectively resolve the data
heterogeneities in a continuously growing large-scale distributed
historical repository?
– contributions from CHAI members (only a small fraction of Wikipedia
users contributes information to ensure its growth)
– as the infrastructure evolves, users may become interested in
“embedding” their data in a larger context to perform global analysis
and to utilize WHD tools
– open development platform (extendable data transformation library and
toolsets)
Acknowledgements
Doctoral Students:
Ying-Feng Hsu
Julian Lee
Graduate IS Students (WHD system development team):
Andrew Barnett (team leader)
Andrew Entin
Thomas Junker
Jidapa Kraisangka
Han Liao
Eric Miller
Ye Peng
Evan Pulgino
Henry Quattrone
Mark Swartz
Miao Tan
Liu Yuchen
Lihong Zhang