Optimization of Multi-Domain Queries on the Web
Daniele Braga, Stefano Ceri, Florian Daniel, Davide Martinenghi
Dipartimento di Elettronica e Informazione – Politecnico di Milano
VLDB 2008
2009. 02. 19.
Presented by Babar Tareen, IDS Lab., Seoul National University
Based on Conference Presentation
Center for E-Business Technology
Seoul National University
Seoul, Korea
Multi-Domain Queries
Queries that can be answered by combining knowledge from
two or more domains
Example
Where can I attend an interesting database workshop close to a
sunny beach?
Who are the strongest experts on service computing based upon
their recent publication record and accepted European projects?
Can I spend an April week-end in a city served by a low-cost
direct flight from Milano offering a Mahler's symphony?
Copyright 2008 by CEBT
Intro
General-purpose search engines (e.g. Yahoo, Google)
Very large search space, yet
Not able to index deep Web data
Domain-specific search engines (e.g. an airline’s flight search
form, Amazon’s book search facility)
Typically of high quality, but
Limited to restricted domains
We lack the ability to answer multi-domain queries
Scenario: a multi-domain query
Reference query:
“Find all database conferences in the next six months in locations
where the average temperature is at least 28°C and for
which a cheap travel solution including a luxury accommodation
exists.”
Answering this query requires:
– Finding interesting conferences in the desired timeframe via online
services by the scientific community;
– Understanding whether the conference location is served by low-cost
flights;
– Finding luxury hotels close to the conference location with
available rooms; and
– Checking the expected average temperature of the location
In general:
“Given a query over a set of services, find the query plan that
minimizes the expected execution cost according to a given metric
in order to obtain the best k answers.”
Overall Picture
Preliminaries – (1)
Characteristics of information sources (services)
Search services: return answers in ranking order
Exact services: indistinguishable tuples (no ranking)
Services have access patterns
– Combination of input and output parameters corresponding to
different ways of invocation
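An access pattern can be pictured as a labeling of each service parameter as input or output. The following is a minimal sketch under that reading; the class name, the flight service, and its parameter names are hypothetical and not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPattern:
    """One way of invoking a service: which parameters are inputs, which are outputs."""
    inputs: frozenset
    outputs: frozenset

    def callable_with(self, bound):
        """The pattern is usable once every input parameter has a known value."""
        return self.inputs <= set(bound)


# Hypothetical flight service offering two alternative access patterns.
by_route = AccessPattern(frozenset({"origin", "destination", "date"}),
                         frozenset({"price"}))
by_origin = AccessPattern(frozenset({"origin", "date"}),
                          frozenset({"destination", "price"}))

known = {"origin", "date"}
print(by_route.callable_with(known))   # destination is still unbound
print(by_origin.callable_with(known))
```

With only origin and date known, just the second pattern can be invoked, which is why the choice of pattern constrains the order of services in a plan.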
Preliminaries – (2)
Characteristics of information sources (services)
Expected result size per invocation (ERSPI):
– Proliferative services (ERSPI > 1)
– Selective services (0 ≤ ERSPI ≤ 1)
Chunking/paging of result sets: bulk vs. chunked services
Joins
Can be considered system services
ERSPI of a join: depends on the selectivity of the join condition
and on the ERSPIs of the joined services
– Product of the ERSPI values of the services multiplied by the
selectivity of the join condition
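The join rule above is simple arithmetic, so a small sketch may help. The function names and the numbers are illustrative assumptions; only the product formula and the selective/proliferative thresholds come from the slide.

```python
def join_erspi(erspi_a, erspi_b, selectivity):
    """ERSPI of a join: product of the services' ERSPIs times the join selectivity."""
    return erspi_a * erspi_b * selectivity


def classify(erspi):
    """Proliferative if ERSPI > 1, selective if 0 <= ERSPI <= 1."""
    if erspi < 0:
        raise ValueError("ERSPI cannot be negative")
    return "proliferative" if erspi > 1 else "selective"


# Illustrative values: two proliferative services joined by a restrictive condition.
combined = join_erspi(20, 5, 0.005)
print(combined, classify(combined))
```

Joining two proliferative services can still yield a selective operator when the join condition is restrictive enough.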
Preliminaries – (3)
Query plan: indicates the invocations of services and their
conjunctive composition through joins
Represented as directed acyclic graphs (DAGs)
Nodes = atoms in the conjunctive query (service, join)
Arcs = precedence constraints + data flows
Joins: join strategy + number of fetches per service
Directed Acyclic Graph
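A query plan of this kind can be sketched as a mapping from each node to the nodes that must complete before it. The plan below is a hypothetical arrangement of the services from the reference query, not the plan shown in the presentation; it uses the standard-library graphlib module (Python 3.9+) to group nodes into stages that could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: node -> set of nodes that must finish first (incoming arcs).
plan = {
    "conference": set(),
    "weather": {"conference"},
    "flight": {"weather"},
    "hotel": {"weather"},
    "join": {"flight", "hotel"},
}


def execution_stages(plan):
    """Group nodes into stages; nodes in the same stage have no arc between them."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


for stage in execution_stages(plan):
    print(stage)
```

In this sketch the flight and hotel nodes share a stage because no arc connects them, while the other nodes are forced into sequence by their precedence constraints.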
Preliminaries – (4)
Cost metrics: associate a cost to a plan
Sum cost metric = sum of the costs of each operator
Execution time metric = expected time from query input to
result output
Request-response cost metric = special case of sum cost metric
where each invocation has a cost of 1
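The three metrics can be compared on a toy plan. This is a sketch under stated assumptions: the per-service call counts, costs, and durations are invented, and the execution time is approximated as the longest path through the plan, which may differ from the exact definition used by the authors.

```python
def sum_cost(calls, cost_per_call):
    """Sum cost metric: total over all operators of (number of calls x cost per call)."""
    return sum(calls[s] * cost_per_call[s] for s in calls)


def request_response_cost(calls):
    """Special case of the sum cost metric where every invocation costs 1."""
    return sum_cost(calls, {s: 1 for s in calls})


def execution_time(plan, duration):
    """Assumed approximation: longest path through the plan (node -> predecessors)."""
    finish = {}

    def visit(node):
        if node not in finish:
            latest = max((visit(p) for p in plan[node]), default=0)
            finish[node] = latest + duration[node]
        return finish[node]

    return max(visit(n) for n in plan)


calls = {"conference": 1, "weather": 20, "flight": 5}
print(request_response_cost(calls))
print(sum_cost(calls, {"conference": 2, "weather": 1, "flight": 3}))

plan = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(execution_time(plan, {"a": 1, "b": 5, "c": 2, "d": 1}))
```

Under the longest-path assumption, the slower of two parallel branches determines the execution time, whereas the sum-based metrics count both branches in full.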
Optimization Approach
Exploring a highly combinatorial solution space
1st Phase: selection of a query rewriting such that every
service is called with one of its available access patterns
2nd Phase: selection of query plan
3rd Phase: assignment of the exact number of fetches to be
performed over chunked services
Services, access patterns, queries
Web services and access patterns:
Services with alternative access patterns
• The example query (in Datalog-like syntax):
Query plans
Representation as DAGs
Placing a node = invoking the respective service/join
Two nodes connected by an arc = sequential execution
Two nodes without connection = parallel execution
Graphical notation (note the parallel vs. pipe join):
Join strategies for parallel joins
Nested loop: one service “dominates” the other
Merge-scan: no a-priori distinction of services
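One way to picture the two strategies is as different orders of visiting pairs of result chunks from the two joined services. The sketch below rests on that assumption: the nested loop exhausts the inner service for each chunk of the dominating one, and the merge-scan advances over both services evenly along diagonals. The function names are hypothetical.

```python
def nested_loop_order(n_outer, n_inner):
    """The outer service dominates: each of its chunks is paired with every inner chunk."""
    return [(i, j) for i in range(n_outer) for j in range(n_inner)]


def merge_scan_order(n_a, n_b):
    """Neither service dominates: chunk pairs are visited diagonal by diagonal."""
    pairs = [(i, j) for i in range(n_a) for j in range(n_b)]
    return sorted(pairs, key=lambda p: (p[0] + p[1], p[0]))


print(nested_loop_order(2, 3))
print(merge_scan_order(2, 3))
```

For ranked search services, the diagonal order reaches pairs of highly ranked chunks from both sides early, while the nested loop favors the top chunks of the dominating service.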
Annotated query plans
In order to estimate the number of tuples in output, we further
need to know:
The number of tuples in output of each service
The number of fetches for each chunked service
The join strategy for each parallel join
The final annotation is the output of the optimization
Instrumented branch and bound
Access pattern selection
Heuristic: “Bound is better” = the more input fields in the access
pattern, the better
Query plan selection
Heuristic: “Selective and parallel are better” = selective services in
series (with increasing ERSPI) and proliferative services in parallel
Chunked service selection
Heuristic: “Greedy and square are better” = either we increment
the number of fetches to chunked services individually (greedy) or
together (square)
Example annotations (from the slide figure):
Possible service combinations:
α1 has more input fields than α2
Not feasible: City would need to be an input parameter to the query!
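The two ways of assigning fetches to chunked services can be sketched as follows. This illustrates only the fetch-assignment heuristics, not the branch and bound search itself, and it relies on assumptions that are not in the slides: the estimated number of answers is taken to be the product of fetched tuples times a join selectivity, and the greedy variant picks the service with the best relative gain per unit cost. All names and numbers are hypothetical.

```python
def estimated_answers(fetches, chunk_sizes, selectivity):
    """Assumed estimate: product of tuples fetched per service times join selectivity."""
    tuples = 1
    for service, n_fetches in fetches.items():
        tuples *= n_fetches * chunk_sizes[service]
    return tuples * selectivity


def square_fetches(chunk_sizes, selectivity, k):
    """Square: increment the fetches of all chunked services together."""
    fetches = {s: 1 for s in chunk_sizes}
    while estimated_answers(fetches, chunk_sizes, selectivity) < k:
        for s in fetches:
            fetches[s] += 1
    return fetches


def greedy_fetches(chunk_sizes, selectivity, k, cost):
    """Greedy: increment one service at a time (assumed rule: best gain per cost)."""
    fetches = {s: 1 for s in chunk_sizes}
    while estimated_answers(fetches, chunk_sizes, selectivity) < k:
        # One more fetch on s multiplies the estimate by (f + 1) / f.
        best = max(fetches, key=lambda s: (1 / fetches[s]) / cost[s])
        fetches[best] += 1
    return fetches


sizes = {"flight": 10, "hotel": 5}
print(square_fetches(sizes, 0.01, 9))
print(greedy_fetches(sizes, 0.01, 9, {"flight": 1, "hotel": 1}))
```

With these invented numbers the greedy assignment stops one fetch earlier than the square one, since it does not have to advance both services in lockstep.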
Final annotation of query plan
Execution time cost metric:
Service characterization:
Fetching factors:
Annotated query plan
Query execution
Execution environment
Service registration: signature, patterns, ERSPI, response times,
chunk sizes, indication of join strategy,...
Service orchestration: query execution
Multi-threading: to leverage parallelisms
Logical caching (speed + elimination of duplicates)
No cache = each call individually repeated
One-call cache = caching of the last call to each service
Optimal cache = all calls to all services are cached
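The effect of the three cache settings on the number of remote calls can be sketched by replaying a sequence of invocations. The function name, the mode strings, and the request trace are hypothetical; only the three cache policies come from the slide.

```python
def count_remote_calls(requests, mode):
    """Count calls that actually reach a service.

    requests: sequence of (service, input) pairs in invocation order.
    mode: "none", "one-call", or "optimal".
    """
    if mode not in ("none", "one-call", "optimal"):
        raise ValueError(f"unknown cache mode: {mode}")
    remote = 0
    last = {}     # one-call cache: most recent input per service
    seen = set()  # optimal cache: every (service, input) pair seen so far
    for service, arg in requests:
        if mode == "one-call" and service in last and last[service] == arg:
            continue  # same input as the previous call to this service
        if mode == "optimal" and (service, arg) in seen:
            continue  # this exact call was already made at some point
        remote += 1
        last[service] = arg
        seen.add((service, arg))
    return remote


trace = [("weather", "Rome"), ("weather", "Rome"),
         ("weather", "Oslo"), ("weather", "Rome")]
for mode in ("none", "one-call", "optimal"):
    print(mode, count_remote_calls(trace, mode))
```

The one-call cache only helps when repeated inputs arrive back to back, whereas the optimal cache also absorbs repetitions separated by other calls.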
# of calls under varying cache settings
Results of the optimal plan
Screenshot of the prototype query engine
Conclusion
In this work, we have
defined a formal model for the optimization of multi-domain queries
over web services (conjunctive queries)
defined query plans similar to relational physical access plans
derived an optimization technique based on a classical branch and
bound technique
given experimental evidence that the proposed model fits real-world
settings (existing web services and wrapped ones)
Next
Generic query engine + declarative rep. of query plans
User interface for the mashup of services/queries
Discussion
Very Simple Experimental Setup
No details about Semi-automatically generated Wrappers
How to decide which service to select for a specific domain?
How to map Input Output parameters between different
services?
If we have to pre-program the system for new domains, it is
like developing a special purpose application
How effective is the system for answering Multi-Domain
Queries?