Interactive Query Formulation over Web Service-Accessed Sources Michalis Petropoulos Alin Deutsch Yannis Papakonstantinou ACM SIGMOD, June 2006

Download Report

Transcript Interactive Query Formulation over Web Service-Accessed Sources Michalis Petropoulos Alin Deutsch Yannis Papakonstantinou ACM SIGMOD, June 2006

Interactive Query Formulation
over Web Service-Accessed Sources
Michalis Petropoulos
Alin Deutsch
Yannis Papakonstantinou
ACM SIGMOD, June 2006
Large-Scale Data Integration Systems

End User

Developer
Web
Domain
Application
Domain

Application

Integration
Engineer

Integration
Domain
Source
Domain
• CNET’s Top Combinations
• CNET’s Search Desktops
• PCWorld’s Product Finder
Web Forms
& Reports
Mediator
Web
Service

• CNET
• PCWorld
Application
Compatible Combinations
of Computers, Routers
and Printers
Integrated
Schema
Web
Web
Service Service
…
Source
Owner
Source
Schema
Data
Source
Data
Source
Source
Schema
Computer
Portals
…
• Dell Computers by CPU
• Cisco Routers by Rate
• HP Printers by Speed
• Dell Computers
• Cisco Routers
• HP Printers
2
Large-Scale Data Integration Systems

End User

Developer
Web
Domain
Web Forms
& Reports
Application
Domain

Application

Integration
Engineer

Integration
Domain
Source
Domain
Mediator

What queries can the
mediator answer for me?
CLIDE
Application
Integrated
Schema
Web
Service
Web
Web
Service Service
Data
Source
Source
Data
Schema
Source
…
Source
Owner
Source
Schema
…
3
Running Example
Parameterized Views
Dell
Cisco
Schema
Schema
Computers(cid, cpu, ram, price)
NetCards(cid, rate, standard, interface)
Views
V1 ComByCpu(cpu)  (Computer)*
SELECT DISTINCT Com1.*
FROM Computers Com1
WHERE Com1.cpu=cpu
Routers(rate, standard, price, type)
Views
Computers
V3 RouWired()  (Router)*
for a given cpu
SELECT DISTINCT Rou1.*
FROM Routers Rou1
WHERE Rou1.type='Wired'
Computers & NetCards
 (Router)*
for a given V4
cpuRouWireless()
& rate
V2 ComNetByCpuRate(cpu, rate) 
(Computer, NetCard)*
SELECT DISTINCT Com1.*, Net1.*
FROM Computers Com1, Network Net1
WHERE Com1.cid=Net1.cid
AND Com1.cpu=cpu
AND Net1.rate=rate
Wired
Routers
Wireless
Routers
SELECT DISTINCT Rou1.*
FROM Routers Rou1
WHERE Rou1.type='Wireless'
Conjunctive Queries CQ
•
•
Equality & Comparison Conditions
Parameters
4
Running Example
Integrated Schema


Developer
Application
Mediator
V1
V2
Dell
Integrated
Schema
V3
V4
• Integrated schema puts together
the Dell and Cisco schemas
Attribute Associations
• (Computers.cid, NetCards.cid)
• (NetCards.rate, Routers.rate)
• (NetCards.standard, Routers.standard)
Cisco
5
Sophisticated Mediators Make
Feasible Queries Hard to Predict
Feasible Queries FQ
• Equivalent CQ query rewritings using the views
• Might involve more than one views
• Order might matter
Query: Infeasible
Feasible
Get all Computers
‘P4’ Computers, together with their NetCards
and their compatible ‘Wireless’ Routers
Computers.*
NetCards.*
Routers.*
A123 P4 512 400 A123 10 .11b USB 10 .11b 50 Wireless
B123 P4 1024 550 B123 54 .11g USB 54 .11g 120 Wireless
B
Routers.*
10 .11b 50 Wireless
54 .11g 120 Wireless
E
Computers.*
NetCards.*
512 400 A123 10 .11b USB
B123 P4 1024 550 B123 54 .11g USB
MediatorA123 P4
D
Mediator
A
RouWireless()
V4
C
ComNetByCpuRate(‘P4’, ‘10’)
V1
ComNetByCpuRate(‘P4’, ‘54’)
V2
6
Problem
1. Large number of sources
2. Large number of views (web-services)
3. Mediator capabilities
Developer formulates an application query
 Is an application query feasible?
 If not, how do I know which ones are feasible?
Previous options:
–
–
The developer had to browse the view definitions and
somehow formulate a feasible query
Or formulate queries until a feasible one is found
(trial-and-error)
No system-provided guidance
7
The CLIDE Solution
CLIDE


Developer
Application
Mediator
V1
V2
Dell
Integrated
Schema
V3
A query formulation interface,
which interactively guides the
developer toward feasible queries
by employing a coloring scheme
V4
Cisco
8
QBE-Like Interfaces
Microsoft SQL-Server
9
CLIDE Interface
Feasibility Flag
Selection Boxes
• Table, selection, projection and join actions
• Feasibility Flag
• Color-based suggestions
10
Example Interaction
Snapshot 1
Yellow  required action
– All feasible queries require this action
White  optional action
– Feasible queries can be formulated
w/ or w/o these actions
11
Example Interaction
Snapshot 2
Blue  required choice of action
– At least one of them is required to be
taken in order to reach a feasible
query
Join Lines:
• Only yellow and blue are displayed
• Must appear in Attribute Associations
13
CLIDE Properties
• Completeness of Suggestions
– Every feasible query can be formulated by
performing yellow and blue actions at every step
• Minimality of Suggestions
– At every step, only a minimal number of actions is
suggested, i.e., the ones that are needed to
preserve completeness
• Rapid Convergence By Following
Suggestions
– The shortest sequence of actions from a query to
any feasible query consists of suggested actions
14
Example Interaction
Snapshot 3
• *  any other constant
• Red  prohibited action
– Does not appear in any feasible query
– Lead to “Dead End” state
15
Example Interaction
Snapshot 4
ram price rate interface price
512 400 10
USB
50
1024 550 54
USB
120
F
Mediator
A
Routers.*
10 .11b 512 Wireless
54 .11g 1024 Wireless
RouWireless()
V4
B
D
Computers.*
NetCards.*
A123 P4 512 400 A123 10 .11b 50
B123 P4 1024 550 B123 54 .11g 120
ComNetByCpuRate(‘P4’, rate)
V2
E
16
CLIDE Properties
• Completeness of Suggestions
– Every feasible query can be formulated by
performing yellow and blue actions at every step
• Minimality of Suggestions
– At every step, only a minimal number of actions is
suggested, i.e., the ones that are needed to
preserve completeness
• Rapid Convergence By Following
Suggestions
– The shortest sequence of actions from a query to
any feasible query consists of suggested actions
17
Interaction Graph
s
Com1
Com1.ram
Com1.price
Selection
Action
Table
Action
Join Action
Com1.cpu=‘P4’
Net1
Com1.cid=Net1.cid
Rou1
…
…
…
…
…
…
…
…
…
…
• Nodes are queries
– One for each qCQ
• Edges are actions
– Table, selection, projection and join actions
• Green nodes are feasible queries
• Infinitely big structure
– All CQ queries
– All possible combinations of actions formulating them
18
Interaction Graph: Colors
• Yellow action 
– Every path from current node n to a
feasible node contains 
• Blue action 
– At least one path from current node
n to a feasible node nF contains 
– There is not path from n to nF that
contains a feasible node
Com1.cid
Com1.cpu
• Red action 
…
Current
Node
…
– No path to a feasible node contains 
Com1.cid=*
…
…
Com1.cpu=*
Net1
Current
Node
Com1.cid=Net1.cid
Net1.rate=’54Mbps’
…
…
…
…
Com1.ram=*
…
…
…
Net1
Com1.cpu=*
Net1.rate=’54Mbps’
Rou1
Com1.cid=Net1.cid
Net1.rate=Rou1.rate
…
…
…
…
…
…
Rou1
Com1.cid=Net1.cid
…
…
Com1.price=*
Com2
…
Com2
Com2.cid=Net1.cid
Com2.cpu=‘P4’
Net1.rate=‘54Mbps’
…
…
…
…
…
19
CLIDE Properties
• Rapid Convergence By Following
Suggestions
– The shortest path from the current node to any
feasible node consists of suggested actions
20
CLIDE Architecture

Actions
Front-End
Current Query
Colored Actions + Feasibility Flag
User
Back-End
Color Algorithm
Seed Queries SQ
Parameters Algorithm
Closest Feasible Queries FQC
Closest Feasible Queries Algorithm
Aliases Collapse Rule
Maximally-Contained Rewriter
Schemas
Views
Minimal Feasible
Extension Queries
Attribute
Associations
• Back-End invoked every time the user performs an action
– i.e., the user arrives at a new node in the interactions graph
21
Color Determined
By a Finite Set of Feasible Queries
Challenge: Infinitely Many Feasible Queries
?
…
…
n
…
…
…
…
Closest Feasible
Queries FQC
Solution: • Finite Set of Feasible Queries
• Closest Feasible Queries FQC
• FQC is sufficient to color actions
Challenge: How far can the Closest Feasible Queries FQC be?
22
Closest Feasible Queries FQC Algorithm
Solution: Based on Maximally Contained Queries FQMC
Maximally Contained
Queries FQMC
…
…
n
…
…
…
…
Closest Feasible
Queries FQC
• Compute maximally contained queries FQMC
• Radius pL is the longest path to a maximally contained query
• Theorem: All FQC queries are reachable
via a path of length p  pL
Challenge: The set of nodes within pL can be too many to explore
23
Closest Feasible Queries FQC Algorithm
Solution: Find the Closest Feasible Queries Directly
Maximally Contained
Queries FQMC
…
…
n
…
…
…
…
Closest Feasible
Queries FQC
More feasible nodes
• Theorem: All queries in FQMC are in FQC
• But not all queries in FQC are in FQMC
• Collapse Aliases to compute FQC \ FQMC
24
Remaining Issues of the Algorithm
• Color remaining actions white or red
• Handling projections
• Dealing with parameterized queries
25
CLIDE Implementation Issues
• Maximally contained queries need to be minimized
(affect CLIDE’s rapid convergence and minimality)
 Implemented the minimization module from scratch
 MiniCon computes mappings between the query and
the views
 Exposed these mappings to guide minimization
• Optimizations to bring the response time below 3 sec
 Amortize the computation across multiple interactions
steps by observing that query is changed minimally at
every step
 Instead of calling MiniCon from scratch at every step,
we incrementally compute the closest feasible queries
26
Thank you
Play with our demo:
http://www.clide.info
27
Maximally Contained Queries FQMC
Maximally Contained Query
Query: Q1
Get all Computers
Query: Q2
Get all Computers
with a given cpu
Not Maximally
Maximally
Contained
Contained
Query
Query: Q4
Q3
Get all Computers
with a given ram
cpu & ram
• Assuming fixed SELECT clause (projection list)
• Covered extensively in literature
– MiniCon, Bucket, InverseRules Algorithms
• FQMC is finite
28
Closest Feasible Queries FQC Algorithm
Solution: Collapse Aliases
…
Closest
Feasible
Queries FQC
n
…
Maximally
Contained
Feasible
Queries FQMC
…
…
…
…
• Collapse Aliases to compute FQC \ FQMC
• Check satisfiability
29
Color Algorithm
Yellow and Blue
• An action  is colored based on which closest feasible
queries it appear in
• Yellow, if  appears in all queries in FQC
• Blue, if  appears in at least one (but not all) query in FQC
White and Red
• Attach Maximum Projection Lists to Closest Feasible Queries
– Projections that can be added to a feasible query, without
compromising feasibility
• Projection  is white if in the maximum projection list
• Color selections based on projections
30
CLIDE Implementation & Optimizations
Maximally-Contained Rewriter
Front-End
Current Query
Minimal Feasible ExtensionColored
QueriesActions
FQME + Feasibility Flag
+ Maximum Projection Lists
Back-End
Redundant Actions Removal
Color Algorithm
Maximally-Contained Feasible Extension Queries
+ Maximum
Projection
Lists
Seed
Queries
SQ
Redundant Queries Removal
Parameters Algorithm
Feasible Extension Queries
Closest
Feasible
Queries
FQC
+ Maximum
Projection
Lists
Views Expansion
Closest Feasible
Queries Algorithm
Aliases Collapse Rule
Maximally-Contained
Feasible Queries over Views
Minimal Feasible
+ Containment Mappings
Extension Queries
Maximally-Contained Rewriter
MiniCon
Containment Mappings Logging
Attribute
Schemas
Views
Associations
• Views expansion introduce redundancy
– Affects CLIDE’s rapid convergence and minimality
• Efficient containment test crucial to redundancy removal
31
CLIDE Performance
Chains of Stars
Time (sec)
7
6
112
168
224
280
5
4
Views
Views
Views
Views
3
2
(2
,1
)
(2
,2
)
(4
,3
)
(4
,4
)
(6
,5
)
(6
,6
)
(8
,7
)
(8
,8
(1 )
0,
(1 9)
0,
1
(1 0)
2,
1
(1 1)
2,
1
(1 2)
4,
1
(1 3)
4,
14
)
1
0
(# of Joins, # of Selections) in Current Query
C1
B
B11
… … ……
A
B2Bi1
Bi
BiM
BK
… … …
…
• Schema
Queries
Views
C
C1C i1
i
A-span = 7
CiM
B-span = 3
C1C
L
Selections = 4,6,8,10
32