
Information Integration:
A Status Report
Alon Halevy
University of Washington, Seattle
IJCAI 2003
[Figure: a mediated schema for genomic data. Entities: Phenotype, Gene, Sequenceable Entity, Protein, Experiment, Nucleotide Sequence, Microarray Experiment, Structured Vocabulary. Sources: OMIM, SwissProt, HUGO, GeneClinics, LocusLink, GO, Entrez, GEO.]
Query: For the micro-array experiment I just ran, what are the
related nucleotide sequences and for what protein do they code?
Motivation and Activity
Application areas of data integration:
Enterprise information integration ($$)
The government
Data sources on the web
Scientific data sharing.
Several data sharing architectures:
Virtual data integration, warehousing, message passing, web services.
Many research projects:
Mine: Information Manifold, Tukwila, LSD, Piazza.
EII: a new industry buzzword.
Today’s Agenda
Recent progress
Mediation languages
Query processing (XML and other)
Some lessons from the commercial world.
Current challenges
Enabling large-scale data sharing: peer-data management systems.
The age-old problem: semantic heterogeneity.
A new agenda item for AI: corpus-based KR.
AI is more vital than ever for progress here!
Mediation Languages
Goal: a language for specifying semantic relationships (not full FOL).
[Figure: a query Q posed over the mediated schema is reformulated into queries Q’ over each source.]
Assume: data at the sources is structured (or seems so).
Global-as-View (GAV)
Actor(x,y) :- R1(x,y,z)
Actor(x,y) :- R2(x,z), R3(z,y)
[Figure: a mediated schema (Title, Actor, …) defined over sources R1, R2, R3, R4, R5.]
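The two GAV rules above define the mediated relation Actor as a union of queries over the sources. A minimal Python sketch of evaluating them, assuming small in-memory source relations; the sample tuples and the function name `actor_gav` are illustrative and do not come from the talk:

```python
# Hypothetical source contents (illustrative only).
R1 = {("Casablanca", "Bogart", 1942)}   # R1(x, y, z)
R2 = {("Vertigo", "m17")}               # R2(x, z)
R3 = {("m17", "Stewart")}               # R3(z, y)


def actor_gav(r1, r2, r3):
    """Evaluate Actor(x, y) as the union of the two GAV rules."""
    # Actor(x, y) :- R1(x, y, z)
    from_r1 = {(x, y) for (x, y, _z) in r1}
    # Actor(x, y) :- R2(x, z), R3(z, y)   (join on z)
    from_join = {(x, y) for (x, z) in r2 for (z2, y) in r3 if z == z2}
    return from_r1 | from_join


actor = actor_gav(R1, R2, R3)
print(sorted(actor))
```

Because each mediated relation is written directly as a query over the sources, answering a query in GAV amounts to unfolding these definitions.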
Local-as-View (LAV,GLAV)
R1(x,y,z) :- Title(x,y), Actor(x,z), y < 1970
R5(x,y,z) :- Movie(x,y,"French")
[Figure: a mediated schema (Title, Actor, …) with sources R1, R2, R3, R4, R5, each described as a view over it.]
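In LAV each source is described as a view over the mediated schema, so answering a query means reasoning from source tuples back to mediated relations. One standard technique for this is inverse rules; the sketch below applies it to the R1 description above. It is an illustration under assumptions, not the algorithm of any particular system: the sample tuples, the reading of y as a year, and the names `invert_r1` and `titles_with_actor` are hypothetical.

```python
# Hypothetical contents of source R1(x, y, z): title, year, actor.
R1 = {("Casablanca", 1942, "Bogart"), ("Vertigo", 1958, "Stewart")}


def invert_r1(r1):
    """Inverse rules for R1(x,y,z) :- Title(x,y), Actor(x,z), y < 1970."""
    title = {(x, y) for (x, y, _z) in r1}   # Title(x, y) :- R1(x, y, z)
    actor = {(x, z) for (x, _y, z) in r1}   # Actor(x, z) :- R1(x, y, z)
    return title, actor


def titles_with_actor(r1, before_year):
    """Q(x, z) :- Title(x, y), Actor(x, z), y < before_year."""
    title, actor = invert_r1(r1)
    return {(x, z) for (x, y) in title for (x2, z) in actor
            if x == x2 and y < before_year}


answers = titles_with_actor(R1, 1950)
print(sorted(answers))
```

Since R1 only holds tuples with y < 1970, the answers obtained this way are sound but may be incomplete with respect to the mediated schema.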
Mediation Languages:
Summary
A lot of nice theory and practical algorithms.
Careful choice of expressive power mattered.
Algorithms for answering queries using views
are in every commercial DBMS.
Description Logics – also an attractive
formalism for mediation.
Bottleneck is coming up with the mapping
expressions.
Outline
Recent progress
Mediation languages
Query processing (XML and other)
Some lessons from the commercial world.
Current challenges
Enabling large-scale data sharing: peer-data management systems.
The age-old problem: semantic heterogeneity.
A new agenda item for AI: corpus-based KR.
Adaptive Query Processing
Problem: no stats, network unstable
Cannot ‘Plan and then execute’
Need to adapt plan during execution.
Ideas already in:
Ingres (1976) (early database system)
Interleaving planning and execution (AI)
Key questions: when to adapt, and at what granularity:
For every tuple? At materialization points?
See [Ives et al. 2002] for our solution.
Convergent Query Processing
[Ives et al., 2002]
Join In-stock, Orders, Shipping (I ⋈ O ⋈ S)
[Figure: separate plans join the partitions I0, O0, S0; I1, O1, S1; and I2, O2, S2. A “cleanup” query plan combines the results across partitions.]
XML Query Processing
XML facilitates integration.

Mediator query processor may manipulate XML
directly.
Challenges:


XML is not flat, but nested; Path queries.
Can be irregular; doesn’t adhere to a strict
schema.
Progress:


Defining and optimizing XQuery.
Going back and forth: XML to relational.
The Commercial World
Some startups:

Nimble, MetaMatrix, Calixa, Composite, Enosys
Big guys making announcements:


IBM, BEA, MS, (Oracle still being defiant).
Integration technology in different layers:

E.g., reporting companies want it (Actuate)
Progress: analysts have buzzword -- EII.
Challenges:



Integration with EAI?
Yet another middleware?
Horizontal vs. vertical?
What Worked?
Performance was not an issue.
Tools, tools, tools

For managing sources and creating
mediated schemas.
XML query processing was needed.
Concordance: need common keys to
join sources:

Active research area!
Outline
Recent progress
Mediation languages
Query processing (XML and other)
Some lessons from the commercial world.
Current challenges
Enabling large-scale data sharing: peer-data management systems.
The age-old problem: semantic heterogeneity.
A new agenda item for AI: corpus-based KR.
Limitations of Mediated Schema
[Figure: every query Q must be posed over one central mediated schema, which is reformulated into queries Q’ over each source.]
Peer Data-Management
PDMS: a network of peers (data sources)
Peers can:
Export base data, or combinations of data
Serve as logical mediators for other peers
A peer can be both a server and a client.
Semantic relationships are specified locally (between small sets of peers).
This is a Semantic Web (different angle)
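Network of Mappings (Piazza)
[Figure: peers CiteSeer, UW, Stanford, DBLP, Roma, Vienna, and Paris connected by GAV, LAV, and GLAV mappings. A query Q posed at one peer is reformulated into Q’ and Q’’ at other peers.]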
Advantages of PDMS
No need for a central mediated schema.
Can map data opportunistically, as is most
convenient.
Queries are posed using the peer’s schema.
Answers come from anywhere in the system.
Infrastructure for Semantic Web applications
This is not P2P file sharing.


Data has rich semantics
Membership is not as dynamic.
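Because mappings are specified only between neighboring peers, a query posed at one peer reaches data elsewhere by following chains of mappings. A minimal sketch of that reachability step, assuming a hypothetical mapping graph over the Piazza peers named earlier (the actual edges are not given here, and real mappings are directional) and ignoring the reformulation of the query itself:

```python
from collections import deque

# Hypothetical mappings between peers (the edges are assumptions).
MAPPINGS = {
    "UW": ["Stanford", "DBLP"],
    "Stanford": ["UW", "CiteSeer"],
    "DBLP": ["UW", "Roma"],
    "CiteSeer": ["Stanford"],
    "Roma": ["DBLP", "Paris"],
    "Paris": ["Roma", "Vienna"],
    "Vienna": ["Paris"],
}


def mapping_paths(start, mappings):
    """Breadth-first search: shortest mapping chain from start to each peer."""
    paths = {start: [start]}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        for neighbor in mappings.get(peer, []):
            if neighbor not in paths:
                paths[neighbor] = paths[peer] + [neighbor]
                queue.append(neighbor)
    return paths


paths = mapping_paths("UW", MAPPINGS)
print(paths["Vienna"])
```

Each path found this way is a sequence of mappings along which a query over the starting peer's schema would have to be reformulated, which is why long or redundant paths make query answering expensive.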
Schema Mediation for PDMS
When can LAV and GAV be combined to form such a network structure? (Semantics not yet obvious.)
[ICDE-03], [WWW-03 for XML]
[Figure: the peer network of CiteSeer, UW, Stanford, DBLP, Roma, Vienna, and Paris, connected by GAV, LAV, and GLAV mappings.]
Efficient Query Answering
Problems:
• redundant paths
• expensive reformulation.
Possible solution:
• Pre-compose some paths
[Figure: the peer network of CiteSeer, UW, Stanford, DBLP, Roma, Vienna, and Paris, with a query Q reformulated along several mapping paths.]
Mapping Composition
[Madhavan and Halevy, VLDB 2003]
Incredibly subtle!
In general, composition can be an infinite set of GLAV formulas.
Results:
Finite in many cases
Even when infinite, often has a finite, useful encoding.
Hence, compositions can usually be pre-optimized.
Other Research Issues
Intelligent data placement
Management of mapping networks
Improving networks: finding additional connections.
Handling inconsistencies
[Figure: a peer network of CiteSeer, UW, Stanford, DBLP, Saarbruecken, Berlin, and Leipzig.]
PDMS-Related Projects
Hyperion (Toronto)
PeerDB (Singapore)
Local relational models (Trento)
Edutella (Hannover, Germany)
Semantic Gossiping (EPFL, Lausanne)
Raccoon (UC Irvine)
Orchestra (Ives, U. Penn)
Outline
Recent progress
Mediation languages
Query processing (XML and other)
Some lessons from the commercial world.
Current challenges
Enabling large-scale data sharing: peer-data management systems.
The age-old problem: semantic heterogeneity.
A new agenda item for AI: corpus-based KR.
Schema/Ontology Matching
[Figure: a data consumer queries a mediator whose schema is (Hotel, Restaurant, AdventureSports, HistoricalSites). The data sources use their own schemas, e.g. (Hotel, Gaststätte, Brauerei, Kathedrale) and (Lodges, Restaurants, Beaches, Volcanoes).]
Schema heterogeneity: a key roadblock for information integration
Different data sources speak their own schema
Mapping is key to any data sharing architecture
Schema Matching
Inventory Database A
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B
Books(Title, ISBN, Price, DiscountPrice, Edition)
Authors(ISBN, FirstName, LastName)
BookCategories(ISBN, Category)
CDCategories(ASIN, Category)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
Artists(ASIN, ArtistName, GroupName)
Schema matching: discovering correspondences between similar elements
Eventually… BooksAndMusic(x:Title,…) = Books(x:Title,…) ∪ CDs(x:Album,…)
Typical Approaches
Multiple sources of evidence in the schemas
Schema element names
BooksAndCDs/Categories ~ BookCategories/Category
Descriptions and documentation
ItemID: unique identifier for a book or a CD
ISBN: unique identifier for any book
Data types, data instances
DateTime ≠ Integer, addresses have similar formats
Schema structure
All books have similar attributes
In isolation, techniques are incomplete or brittle
Use domain knowledge
Combine multiple techniques to exploit all available evidence
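The idea of combining several sources of evidence can be sketched as a weighted average of independent matchers. The two matchers, the weights, and the declared column types below are illustrative assumptions, not the techniques or parameters of any system described in this talk:

```python
from difflib import SequenceMatcher


def name_score(a, b):
    """Similarity of two element names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def type_score(type_a, type_b):
    """1.0 when the declared data types agree, else 0.0."""
    return 1.0 if type_a == type_b else 0.0


def combined_score(elem_a, elem_b, w_name=0.6, w_type=0.4):
    """Weighted combination of the two matchers (weights are assumptions)."""
    (name_a, type_a), (name_b, type_b) = elem_a, elem_b
    return (w_name * name_score(name_a, name_b)
            + w_type * type_score(type_a, type_b))


def best_match(elem, candidates):
    """Return the candidate with the highest combined score."""
    return max(candidates, key=lambda c: combined_score(elem, c))


# Elements as (name, type); the types are hypothetical.
source = ("SuggestedPrice", "decimal")
targets = [("Price", "decimal"), ("Edition", "integer"), ("Title", "string")]
match = best_match(source, targets)
print(match)
```

Neither matcher is reliable alone: names can differ for the same concept and many unrelated columns share a type. Combining them lets agreement on one kind of evidence compensate for weakness in the other.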
Philosophy of Solutions
Effective schema matching requires a
principled combination of techniques.
Like human experts, the matcher should
improve over time
LSD:



Mapping data sources to a mediated schema.
Use a few mappings as training examples to learn
hypotheses for elements of the mediated schema.
See [Doan et al., SIGMOD-2001, MLJ-2003]
Next step: corpus-based matching.
Corpus-Based Matching
[Figure: a corpus of book and inventory schemas (a collection of schemas and mappings), with elements such as Music, Books, Authors, Items, Artists, Information, Publisher, Literature, CDs, and Categories.]
Identify common concepts and patterns
Books, Authors, Publishers, …
Books: Title, Author, Price, Publisher
Reuse extracted information to match new schemas
Mapping Knowledge Base
Learners: extract knowledge from schemas and mappings
Name Learner (NL), Data Type Learner (DTL), Data Instances Learner (DIL), Structure Learner (SL), Description Learner (DL), Meta Learner (ML)
Learned models: for each unique element in any schema.
[Figure: each element C1 … CN in the Mapping Knowledge Base has its own learned models NL, DIL, DTL, DL, SL, and ML.]
Schemas and mappings: accumulated over time
Preliminary results: Corpus is useful
[Figure: Shipping domain. Average number of matches for schema pairs P1a to P4b, comparing Only MKB with Only BASIC.]
With and without the corpus
[Figure: Inventory domain. Recall for schema pairs P1a to P4b, comparing MKB, BASIC, and COMB.]
Outline
Recent progress
Mediation languages
Query processing (XML and other)
Some lessons from the commercial world.
Current challenges
Enabling large-scale data sharing: peer-data management systems.
The age-old problem: semantic heterogeneity.
A new agenda item for AI: corpus-based KR.
Corpus vs. Traditional KR
A large corpus of uncoordinated
knowledge fragments
vs.
Carefully designed knowledge base
Can a corpus offer a more attractive
solution for some KR problems?
Pause: KR vs. Corpus
Knowledge base:
Hard to engineer, brittle at the boundaries
Only one way of saying things.
Corpus:
“Easier” to build, coverage not predefined.
Many views of the domain.
See proceedings for full argument.
Corpus-based KR
Contents:
Schemas, ontologies, meta-data, data, queries, mappings.
Collect statistics on the corpus:
How often does a word appear as a relation name?
When it does, what tend to be the attribute names? What other tables are there?
Support a KR-style interface on the corpus (OKBC-like)
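The statistics listed above can be gathered with simple counting. A minimal sketch, assuming the corpus is a list of schemas, each a mapping from relation name to attribute names; the toy corpus and the function name `corpus_statistics` are hypothetical:

```python
from collections import Counter, defaultdict

# Toy corpus: each schema maps relation names to attribute lists (made up).
CORPUS = [
    {"Books": ["Title", "ISBN", "Price"], "Authors": ["ISBN", "LastName"]},
    {"Books": ["Title", "Author", "Publisher"], "CDs": ["Album", "Price"]},
    {"Items": ["Title", "Price"], "Authors": ["Name"]},
]


def corpus_statistics(corpus):
    """Count relation names, their attributes, and co-occurring relations."""
    relation_count = Counter()
    attributes = defaultdict(Counter)
    co_relations = defaultdict(Counter)
    for schema in corpus:
        for relation, attrs in schema.items():
            relation_count[relation] += 1
            attributes[relation].update(attrs)
            co_relations[relation].update(r for r in schema if r != relation)
    return relation_count, attributes, co_relations


relation_count, attributes, co_relations = corpus_statistics(CORPUS)
print(relation_count["Books"])             # schemas with a Books relation
print(attributes["Books"].most_common(1))  # most frequent Books attribute
print(dict(co_relations["Books"]))         # tables seen alongside Books
```

Counts like these are the raw material for a KR-style interface: a question such as "what attributes does a Books relation usually have?" becomes a lookup over the accumulated counters.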
Other Applications of C-B-KR
Question answering on the web
Focused crawling
Natural language interfaces to DBs
Schema and ontology authoring
Semantic query optimization.
Whenever we need knowledge to help
us rank multiple answers/plans.
Example Queries
How are two terms related?
GPA(studentID, $value),
Student(studentID, GPA, address)
Find different ways of saying the same thing:
Class(Lexus, Luxury)
LuxuryCar(Lexus, Toyota)
When do two terms play similar roles?
IJCAIReview(p1, rev2, accept)
AIJReferees(round2, p3, rev4, reject)
Challenges for C-B-KR
Building the corpus.
How focused should the corpus be?
Is human tuning needed or helpful?
How do we accommodate inference?
How do we leverage traditional KR?
Summary
The vision: data authoring, querying and
sharing by everyone.

We got the plumbing to work. To go further, we
need AI techniques.
Challenge: cross the structure chasm:



It’s hard to author & query structured data!
PDMS: architecture for ad-hoc sharing.
Ontology/schema matching is key!
Are we providing the right tools?

Corpus-based knowledge representation.
We need benchmarks!
Some References
www.cs.washington.edu/homes/alon
Piazza: ICDE03, WWW03, VLDB-03
The Structure Chasm: CIDR-03
Mediation surveys: VLDB Journal 01

Lenzerini tutorial.
Schema matching:

Rahm and Bernstein, VLDB Journal 01.
Workshops: IJCAI, Semantic Web Conf.
Teaching integration to undergraduates:
SIGMOD Record, September, 2003.