Optimization of Multi-Domain Queries on the Web

Daniele Braga, Stefano Ceri, Florian Daniel, Davide Martinenghi
Dipartimento di Elettronica e Informazione – Politecnico di Milano
VLDB 2008
2009. 02. 19.
Presented by Babar Tareen, IDS Lab., Seoul National University
Based on Conference Presentation
Center for E-Business Technology
Seoul National University
Seoul, Korea
Multi-Domain Queries
 Queries that can be answered by combining knowledge from
two or more domains
 Example

Where can I attend an interesting database workshop close to a
sunny beach?

Who are the strongest experts on service computing based upon
their recent publication record and accepted European projects?

Can I spend an April week-end in a city served by a low-cost
direct flight from Milano offering a Mahler's symphony?
Copyright © 2008 by CEBT
Intro
 General-purpose search engines (e.g. Yahoo, Google)

Very large search space, yet

Not able to index deep Web data
 Domain-specific search engines (e.g. an airline’s flight search
form, Amazon’s book search facility)

Typically of high quality, but

Limited to restricted domains
 We lack the ability to answer multi-domain queries
Scenario: a multi-domain query
•
Reference query:
–
“Find all database conferences in the next six months in locations
where the average temperature is at least 28°C and for which a
cheap travel solution including a luxury accommodation exists.”
•
Answering this query requires:
–
Finding interesting conferences in the desired timeframe via online
services by the scientific community;
–
Understanding whether the conference location is served by
low-cost flights;
–
Finding luxury hotels close to the conference location with
available rooms; and
–
Checking the expected average temperature of the location
•
In general:
–
“Given a query over a set of services, find the query plan that
minimizes the expected execution cost according to a given metric
in order to obtain the best k answers.”
Overall Picture
Preliminaries – (1)
 Characteristics of information sources (services)

Search services: return answers in ranked order

Exact services: indistinguishable tuples (no ranking)

Services have access patterns
–
Combination of Input and Output parameters corresponding to
different ways of invocation
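A minimal sketch of how services and their access patterns could be modelled, assuming a pattern is just a set of input parameters plus a set of output parameters. The class names, the `Flight` service, and its parameter names are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPattern:
    """One way of invoking a service: which parameters go in, which come out."""
    inputs: frozenset
    outputs: frozenset


@dataclass(frozen=True)
class Service:
    name: str
    is_search: bool   # True: ranked answers; False: exact service (no ranking)
    patterns: tuple   # alternative access patterns


def callable_patterns(service, bound):
    """Return the access patterns whose inputs are all already bound."""
    return [p for p in service.patterns if p.inputs <= bound]


# Hypothetical service with two alternative access patterns.
flight = Service(
    name="Flight",
    is_search=True,
    patterns=(
        AccessPattern(frozenset({"from_city", "to_city", "date"}),
                      frozenset({"price"})),
        AccessPattern(frozenset({"from_city", "date"}),
                      frozenset({"to_city", "price"})),
    ),
)

# Only the second pattern is usable when the destination is not yet known.
usable = callable_patterns(flight, bound={"from_city", "date"})
```

A pattern is usable only once every one of its inputs has a value, either from the user query or from the output of a service invoked earlier.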
Preliminaries – (2)
 Characteristics of information sources (services)

Expected result size per invocation (ERSPI):
–
proliferative services (ERSPI > 1)
–
selective services (0 ≤ ERSPI ≤ 1)

Chunking/paging of result sets: bulk vs. chunked services
 Joins

Can be considered system services

ERSPI of a join: depends on the selectivity of the join condition
and on the ERSPIs of the joined services
–
Computed as the product of the ERSPI values of the services
multiplied by the selectivity of the join condition
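The rule for the ERSPI of a join can be written down directly. The sketch below uses made-up ERSPI and selectivity values; the function names are illustrative.

```python
from math import prod


def is_proliferative(erspi):
    """A service is proliferative if ERSPI > 1, selective otherwise."""
    return erspi > 1


def join_erspi(service_erspis, selectivity):
    """ERSPI of a join: product of the services' ERSPIs times the selectivity."""
    return prod(service_erspis) * selectivity


# Illustrative values: one proliferative and one selective service.
example = join_erspi([20.0, 0.5], selectivity=0.1)
```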
Preliminaries – (3)
 Query plan: indicates the invocations of services and their
conjunctive composition through joins

Represented as directed acyclic graphs (DAGs)

Nodes = atoms in the conjunctive query (service, join)

Arcs = precedence constraints + data flows

Joins: join strategy + number of fetches per service
Directed Acyclic Graph
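A small sketch of a query plan as a DAG, assuming each node is stored with the set of nodes that must complete before it (its incoming arcs). The node names echo the reference query but the plan shape is hypothetical, not the plan chosen in the paper. Nodes that become ready together have no arc between them and could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: node -> nodes that must finish first (incoming arcs).
plan = {
    "Conference": set(),
    "Weather": {"Conference"},
    "Flight": {"Conference"},
    "Hotel": {"Conference"},
    "Join": {"Flight", "Hotel"},
}


def execution_stages(plan):
    """Group nodes into stages; nodes in one stage have no mutual precedence."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not acyclic
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = execution_stages(plan)
```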
Preliminaries – (4)
 Cost metrics: associate a cost to a plan

Sum cost metric = sum of the costs of each operator

Execution time metric = expected time from query input to
result output

Request-response cost metric = special case of the sum cost metric
where each invocation has a cost of 1
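The three metrics can be sketched as follows. The sum and request-response metrics follow the definitions above; modelling execution time as the longest path through the plan DAG is an assumption made here for illustration, and all numbers are invented.

```python
def sum_cost(invocations, cost_per_call):
    """Sum cost metric: total cost over all operator invocations."""
    return sum(n * cost_per_call[op] for op, n in invocations.items())


def request_response_cost(invocations):
    """Special case of the sum cost metric: every invocation costs 1."""
    return sum_cost(invocations, {op: 1 for op in invocations})


def execution_time(plan, duration):
    """Assumed model: finish time of the longest path through the plan DAG."""
    finish = {}

    def finish_time(node):
        if node not in finish:
            start = max((finish_time(p) for p in plan[node]), default=0.0)
            finish[node] = start + duration[node]
        return finish[node]

    return max(finish_time(node) for node in plan)


# Invented numbers for a hypothetical three-service plan.
invocations = {"Conference": 1, "Flight": 20, "Hotel": 20}
cost_per_call = {"Conference": 2.0, "Flight": 1.5, "Hotel": 1.0}
plan = {"Conference": set(), "Flight": {"Conference"}, "Hotel": {"Conference"}}
duration = {"Conference": 1.0, "Flight": 3.0, "Hotel": 2.0}

total = sum_cost(invocations, cost_per_call)
calls = request_response_cost(invocations)
elapsed = execution_time(plan, duration)
```

Under this model, parallel branches contribute only their slowest member to the execution time, while the sum cost metric counts every invocation regardless of overlap.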
Optimization Approach
 Exploring a highly combinatorial solution space

1st Phase: selection of a query rewriting such that every
service is called with one of its available access patterns

2nd Phase: selection of query plan

3rd Phase: assignment of the exact number of fetches to be
performed over chunked services
Services, access patterns, queries
 Web services and access patterns:
Services with alternative access patterns
• The example query (in Datalog-like syntax):
Query plans
 Representation as DAGs

Placing a node = invoking the respective service/join

Two nodes connected by an arc = sequential execution

Two nodes without connection = parallel execution
 Graphical notation (note the parallel vs. pipe join):
Join strategies for parallel joins
 Nested loop: one service “dominates” the other
 Merge-scan: no a-priori distinction of services
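One way to picture the difference is the order in which pairs of result chunks from the two services are combined. The sketch below is an assumption, not the paper's definition: nested loop scans all chunks of the second service for each chunk of the dominating one, while merge-scan is rendered here as a diagonal walk that advances both services evenly.

```python
def nested_loop_order(n_outer, n_inner):
    """For each chunk of the dominating (outer) service, scan every inner chunk."""
    return [(i, j) for i in range(n_outer) for j in range(n_inner)]


def merge_scan_order(n_x, n_y):
    """Assumed diagonal order: chunk pairs sorted by i + j, so neither side leads."""
    pairs = [(i, j) for i in range(n_x) for j in range(n_y)]
    return sorted(pairs, key=lambda p: (p[0] + p[1], p[0]))


nested = nested_loop_order(3, 3)
merged = merge_scan_order(3, 3)
```

With three chunks per service, the nested loop exhausts the second service before touching the second chunk of the first, whereas the diagonal order reaches chunk 1 of both services within the first three pairs.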
Annotated query plans
 In order to estimate the number of tuples in output, we further
need to know:

The number of tuples in output of each service

The number of fetches for each chunked service

The join strategy for each parallel join
 The final annotation is the output of the optimization
Instrumented branch and bound
 Access pattern selection

Heuristic: “Bound is better” = the more input fields in the access
pattern, the better
–
Possible service combinations: α1 has more input fields than α2
–
Not feasible: City would need to be an input parameter to the query!
 Query plan selection

Heuristic: “Selective and parallel are better” = selective services in
series (with increasing ERSPI) and proliferative services in parallel
 Chunked service selection

Heuristic: “Greedy and square are better” = either we increment
the number of fetches to chunked services individually (greedy) or
together (square)
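The two fetch-assignment heuristics can be sketched as alternative step functions. The output estimate (product of fetched tuples times join selectivity) and the greedy selection criterion (pick the single increment with the largest estimated output) are assumptions for illustration; the slide only states that fetches grow individually or together. Service names and numbers are invented.

```python
from math import prod


def estimated_answers(fetches, chunk_size, selectivity):
    """Assumed estimate: product of fetched tuples times join selectivity."""
    return prod(fetches[s] * chunk_size[s] for s in fetches) * selectivity


def square_step(fetches, chunk_size, selectivity):
    """Square: increment the fetches of all chunked services together."""
    return {s: f + 1 for s, f in fetches.items()}


def greedy_step(fetches, chunk_size, selectivity):
    """Greedy: increment only the one service that raises the estimate most."""
    candidates = [{**fetches, s: fetches[s] + 1} for s in sorted(fetches)]
    return max(candidates,
               key=lambda c: estimated_answers(c, chunk_size, selectivity))


def fetches_for_k(k, chunk_size, selectivity, step):
    """Grow the fetch counts until at least k answers are expected."""
    fetches = {s: 1 for s in chunk_size}
    while estimated_answers(fetches, chunk_size, selectivity) < k:
        fetches = step(fetches, chunk_size, selectivity)
    return fetches


chunk_size = {"Flight": 8, "Hotel": 4}  # invented chunk sizes
square = fetches_for_k(8, chunk_size, 0.125, square_step)
greedy = fetches_for_k(8, chunk_size, 0.125, greedy_step)
```

In this invented setting the greedy variant reaches the target with three fetches in total, while the square variant needs four.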
Final annotation of query plan
Execution time cost metric:
Service characterization:
Fetching factors:
Annotated query plan
Query execution
 Execution environment

Service registration: signature, patterns, ERSPI, response times,
chunk sizes, indication of join strategy, ...

Service orchestration: query execution

Multi-threading: to leverage parallelism
 Logical caching (speed + elimination of duplicates)

No cache = each call individually repeated

One-call cache = caching of the last call to each service

Optimal cache = all calls to all services are cached
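The effect of the three cache settings on the number of service calls can be sketched with a simple counter. The request sequence and the `Weather` service are hypothetical; a request is modelled as a (service, input) pair.

```python
def count_calls(requests, cache):
    """Count real service calls for cache in {'none', 'one-call', 'optimal'}."""
    calls = 0
    last = {}     # one-call cache: most recent input per service
    seen = set()  # optimal cache: every (service, input) ever called
    for service, arg in requests:
        if cache == "one-call" and last.get(service) == arg:
            continue  # same input as the previous call to this service
        if cache == "optimal" and (service, arg) in seen:
            continue  # this exact call was already made at some point
        calls += 1
        last[service] = arg
        seen.add((service, arg))
    return calls


# Hypothetical sequence of requests issued during plan execution.
requests = [("Weather", "Rome"), ("Weather", "Rome"),
            ("Weather", "Paris"), ("Weather", "Rome")]

counts = {mode: count_calls(requests, mode)
          for mode in ("none", "one-call", "optimal")}
```

The one-call cache only helps when identical inputs arrive back to back, which is why the final request for the first city still triggers a call under that setting.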
Number of calls under varying cache settings
Results of the optimal plan
 Screenshot of the prototype query engine
Conclusion
 In this work, we have

defined a formal model for the optimization of multi-domain queries
over web services (conjunctive queries)

defined query plans similar to relational physical access plans

derived an optimization technique based on classical branch and
bound

given experimental evidence that the proposed model fits real-world
settings (existing web services and wrapped ones)
 Next

Generic query engine + declarative representation of query plans

User interface for the mashup of services/queries
Discussion
 Very Simple Experimental Setup
 No details about Semi-automatically generated Wrappers
 How to decide which service to select for a specific domain?
 How to map Input Output parameters between different
services?
 If we have to pre-program the system for new domains, it is
like developing a special purpose application
 How effective is the system for answering Multi-Domain
Queries?