Interactive Query Formulation over Web Service-Accessed Sources Michalis Petropoulos Alin Deutsch Yannis Papakonstantinou ACM SIGMOD, June 2006
Download
Report
Transcript Interactive Query Formulation over Web Service-Accessed Sources Michalis Petropoulos Alin Deutsch Yannis Papakonstantinou ACM SIGMOD, June 2006
Interactive Query Formulation
over Web Service-Accessed Sources
Michalis Petropoulos
Alin Deutsch
Yannis Papakonstantinou
ACM SIGMOD, June 2006
Large-Scale Data Integration Systems
End User
Developer
Web
Domain
Application
Domain
Application
Integration
Engineer
Integration
Domain
Source
Domain
• CNET’s Top Combinations
• CNET’s Search Desktops
• PCWorld’s Product Finder
Web Forms
& Reports
Mediator
Web
Service
• CNET
• PCWorld
Application
Compatible Combinations
of Computers, Routers
and Printers
Integrated
Schema
Web
Web
Service Service
…
Source
Owner
Source
Schema
Data
Source
Data
Source
Source
Schema
Computer
Portals
…
• Dell Computers by CPU
• Cisco Routers by Rate
• HP Printers by Speed
• Dell Computers
• Cisco Routers
• HP Printers
2
Large-Scale Data Integration Systems
End User
Developer
Web
Domain
Web Forms
& Reports
Application
Domain
Application
Integration
Engineer
Integration
Domain
Source
Domain
Mediator
What queries can the
mediator answer for me?
CLIDE
Application
Integrated
Schema
Web
Service
Web
Web
Service Service
Data
Source
Source
Data
Schema
Source
…
Source
Owner
Source
Schema
…
3
Running Example
Parameterized Views
Dell
Cisco
Schema
Schema
Computers(cid, cpu, ram, price)
NetCards(cid, rate, standard, interface)
Views
V1 ComByCpu(cpu) (Computer)*
SELECT DISTINCT Com1.*
FROM Computers Com1
WHERE Com1.cpu=cpu
Routers(rate, standard, price, type)
Views
Computers
V3 RouWired() (Router)*
for a given cpu
SELECT DISTINCT Rou1.*
FROM Routers Rou1
WHERE Rou1.type='Wired'
Computers & NetCards
(Router)*
for a given V4
cpuRouWireless()
& rate
V2 ComNetByCpuRate(cpu, rate)
(Computer, NetCard)*
SELECT DISTINCT Com1.*, Net1.*
FROM Computers Com1, Network Net1
WHERE Com1.cid=Net1.cid
AND Com1.cpu=cpu
AND Net1.rate=rate
Wired
Routers
Wireless
Routers
SELECT DISTINCT Rou1.*
FROM Routers Rou1
WHERE Rou1.type='Wireless'
Conjunctive Queries CQ
•
•
Equality & Comparison Conditions
Parameters
4
Running Example
Integrated Schema
Developer
Application
Mediator
V1
V2
Dell
Integrated
Schema
V3
V4
• Integrated schema puts together
the Dell and Cisco schemas
Attribute Associations
• (Computers.cid, NetCards.cid)
• (NetCards.rate, Routers.rate)
• (NetCards.standard, Routers.standard)
Cisco
5
Sophisticated Mediators Make
Feasible Queries Hard to Predict
Feasible Queries FQ
• Equivalent CQ query rewritings using the views
• Might involve more than one views
• Order might matter
Query: Infeasible
Feasible
Get all Computers
‘P4’ Computers, together with their NetCards
and their compatible ‘Wireless’ Routers
Computers.*
NetCards.*
Routers.*
A123 P4 512 400 A123 10 .11b USB 10 .11b 50 Wireless
B123 P4 1024 550 B123 54 .11g USB 54 .11g 120 Wireless
B
Routers.*
10 .11b 50 Wireless
54 .11g 120 Wireless
E
Computers.*
NetCards.*
512 400 A123 10 .11b USB
B123 P4 1024 550 B123 54 .11g USB
MediatorA123 P4
D
Mediator
A
RouWireless()
V4
C
ComNetByCpuRate(‘P4’, ‘10’)
V1
ComNetByCpuRate(‘P4’, ‘54’)
V2
6
Problem
1. Large number of sources
2. Large number of views (web-services)
3. Mediator capabilities
Developer formulates an application query
Is an application query feasible?
If not, how do I know which ones are feasible?
Previous options:
–
–
The developer had to browse the view definitions and
somehow formulate a feasible query
Or formulate queries until a feasible one is found
(trial-and-error)
No system-provided guidance
7
The CLIDE Solution
CLIDE
Developer
Application
Mediator
V1
V2
Dell
Integrated
Schema
V3
A query formulation interface,
which interactively guides the
developer toward feasible queries
by employing a coloring scheme
V4
Cisco
8
QBE-Like Interfaces
Microsoft SQL-Server
9
CLIDE Interface
Feasibility Flag
Selection Boxes
• Table, selection, projection and join actions
• Feasibility Flag
• Color-based suggestions
10
Example Interaction
Snapshot 1
Yellow required action
– All feasible queries require this action
White optional action
– Feasible queries can be formulated
w/ or w/o these actions
11
Example Interaction
Snapshot 2
Blue required choice of action
– At least one of them is required to be
taken in order to reach a feasible
query
Join Lines:
• Only yellow and blue are displayed
• Must appear in Attribute Associations
13
CLIDE Properties
• Completeness of Suggestions
– Every feasible query can be formulated by
performing yellow and blue actions at every step
• Minimality of Suggestions
– At every step, only a minimal number of actions is
suggested, i.e., the ones that are needed to
preserve completeness
• Rapid Convergence By Following
Suggestions
– The shortest sequence of actions from a query to
any feasible query consists of suggested actions
14
Example Interaction
Snapshot 3
• * any other constant
• Red prohibited action
– Does not appear in any feasible query
– Lead to “Dead End” state
15
Example Interaction
Snapshot 4
ram price rate interface price
512 400 10
USB
50
1024 550 54
USB
120
F
Mediator
A
Routers.*
10 .11b 512 Wireless
54 .11g 1024 Wireless
RouWireless()
V4
B
D
Computers.*
NetCards.*
A123 P4 512 400 A123 10 .11b 50
B123 P4 1024 550 B123 54 .11g 120
ComNetByCpuRate(‘P4’, rate)
V2
E
16
CLIDE Properties
• Completeness of Suggestions
– Every feasible query can be formulated by
performing yellow and blue actions at every step
• Minimality of Suggestions
– At every step, only a minimal number of actions is
suggested, i.e., the ones that are needed to
preserve completeness
• Rapid Convergence By Following
Suggestions
– The shortest sequence of actions from a query to
any feasible query consists of suggested actions
17
Interaction Graph
s
Com1
Com1.ram
Com1.price
Selection
Action
Table
Action
Join Action
Com1.cpu=‘P4’
Net1
Com1.cid=Net1.cid
Rou1
…
…
…
…
…
…
…
…
…
…
• Nodes are queries
– One for each qCQ
• Edges are actions
– Table, selection, projection and join actions
• Green nodes are feasible queries
• Infinitely big structure
– All CQ queries
– All possible combinations of actions formulating them
18
Interaction Graph: Colors
• Yellow action
– Every path from current node n to a
feasible node contains
• Blue action
– At least one path from current node
n to a feasible node nF contains
– There is not path from n to nF that
contains a feasible node
Com1.cid
Com1.cpu
• Red action
…
Current
Node
…
– No path to a feasible node contains
Com1.cid=*
…
…
Com1.cpu=*
Net1
Current
Node
Com1.cid=Net1.cid
Net1.rate=’54Mbps’
…
…
…
…
Com1.ram=*
…
…
…
Net1
Com1.cpu=*
Net1.rate=’54Mbps’
Rou1
Com1.cid=Net1.cid
Net1.rate=Rou1.rate
…
…
…
…
…
…
Rou1
Com1.cid=Net1.cid
…
…
Com1.price=*
Com2
…
Com2
Com2.cid=Net1.cid
Com2.cpu=‘P4’
Net1.rate=‘54Mbps’
…
…
…
…
…
19
CLIDE Properties
• Rapid Convergence By Following
Suggestions
– The shortest path from the current node to any
feasible node consists of suggested actions
20
CLIDE Architecture
Actions
Front-End
Current Query
Colored Actions + Feasibility Flag
User
Back-End
Color Algorithm
Seed Queries SQ
Parameters Algorithm
Closest Feasible Queries FQC
Closest Feasible Queries Algorithm
Aliases Collapse Rule
Maximally-Contained Rewriter
Schemas
Views
Minimal Feasible
Extension Queries
Attribute
Associations
• Back-End invoked every time the user performs an action
– i.e., the user arrives at a new node in the interactions graph
21
Color Determined
By a Finite Set of Feasible Queries
Challenge: Infinitely Many Feasible Queries
?
…
…
n
…
…
…
…
Closest Feasible
Queries FQC
Solution: • Finite Set of Feasible Queries
• Closest Feasible Queries FQC
• FQC is sufficient to color actions
Challenge: How far can the Closest Feasible Queries FQC be?
22
Closest Feasible Queries FQC Algorithm
Solution: Based on Maximally Contained Queries FQMC
Maximally Contained
Queries FQMC
…
…
n
…
…
…
…
Closest Feasible
Queries FQC
• Compute maximally contained queries FQMC
• Radius pL is the longest path to a maximally contained query
• Theorem: All FQC queries are reachable
via a path of length p pL
Challenge: The set of nodes within pL can be too many to explore
23
Closest Feasible Queries FQC Algorithm
Solution: Find the Closest Feasible Queries Directly
Maximally Contained
Queries FQMC
…
…
n
…
…
…
…
Closest Feasible
Queries FQC
More feasible nodes
• Theorem: All queries in FQMC are in FQC
• But not all queries in FQC are in FQMC
• Collapse Aliases to compute FQC \ FQMC
24
Remaining Issues of the Algorithm
• Color remaining actions white or red
• Handling projections
• Dealing with parameterized queries
25
CLIDE Implementation Issues
• Maximally contained queries need to be minimized
(affect CLIDE’s rapid convergence and minimality)
Implemented the minimization module from scratch
MiniCon computes mappings between the query and
the views
Exposed these mappings to guide minimization
• Optimizations to bring the response time below 3 sec
Amortize the computation across multiple interactions
steps by observing that query is changed minimally at
every step
Instead of calling MiniCon from scratch at every step,
we incrementally compute the closest feasible queries
26
Thank you
Play with our demo:
http://www.clide.info
27
Maximally Contained Queries FQMC
Maximally Contained Query
Query: Q1
Get all Computers
Query: Q2
Get all Computers
with a given cpu
Not Maximally
Maximally
Contained
Contained
Query
Query: Q4
Q3
Get all Computers
with a given ram
cpu & ram
• Assuming fixed SELECT clause (projection list)
• Covered extensively in literature
– MiniCon, Bucket, InverseRules Algorithms
• FQMC is finite
28
Closest Feasible Queries FQC Algorithm
Solution: Collapse Aliases
…
Closest
Feasible
Queries FQC
n
…
Maximally
Contained
Feasible
Queries FQMC
…
…
…
…
• Collapse Aliases to compute FQC \ FQMC
• Check satisfiability
29
Color Algorithm
Yellow and Blue
• An action is colored based on which closest feasible
queries it appear in
• Yellow, if appears in all queries in FQC
• Blue, if appears in at least one (but not all) query in FQC
White and Red
• Attach Maximum Projection Lists to Closest Feasible Queries
– Projections that can be added to a feasible query, without
compromising feasibility
• Projection is white if in the maximum projection list
• Color selections based on projections
30
CLIDE Implementation & Optimizations
Maximally-Contained Rewriter
Front-End
Current Query
Minimal Feasible ExtensionColored
QueriesActions
FQME + Feasibility Flag
+ Maximum Projection Lists
Back-End
Redundant Actions Removal
Color Algorithm
Maximally-Contained Feasible Extension Queries
+ Maximum
Projection
Lists
Seed
Queries
SQ
Redundant Queries Removal
Parameters Algorithm
Feasible Extension Queries
Closest
Feasible
Queries
FQC
+ Maximum
Projection
Lists
Views Expansion
Closest Feasible
Queries Algorithm
Aliases Collapse Rule
Maximally-Contained
Feasible Queries over Views
Minimal Feasible
+ Containment Mappings
Extension Queries
Maximally-Contained Rewriter
MiniCon
Containment Mappings Logging
Attribute
Schemas
Views
Associations
• Views expansion introduce redundancy
– Affects CLIDE’s rapid convergence and minimality
• Efficient containment test crucial to redundancy removal
31
CLIDE Performance
Chains of Stars
Time (sec)
7
6
112
168
224
280
5
4
Views
Views
Views
Views
3
2
(2
,1
)
(2
,2
)
(4
,3
)
(4
,4
)
(6
,5
)
(6
,6
)
(8
,7
)
(8
,8
(1 )
0,
(1 9)
0,
1
(1 0)
2,
1
(1 1)
2,
1
(1 2)
4,
1
(1 3)
4,
14
)
1
0
(# of Joins, # of Selections) in Current Query
C1
B
B11
… … ……
A
B2Bi1
Bi
BiM
BK
… … …
…
• Schema
Queries
Views
C
C1C i1
i
A-span = 7
CiM
B-span = 3
C1C
L
Selections = 4,6,8,10
32