Top-K Query Processing Techniques for Distributed Environments

An Overview of Distributed
Top-K Ranking Algorithms
30-min presentation by
Demetris Zeinalipour
Lecturer
School of Pure and Applied Sciences
Open University of Cyprus
Friday, December 12th, 2008, 16:00-16:30
Communication Systems Group (CSG), ETH Zurich, Switzerland
http://www.cs.ucy.ac.cy/~dzeina/
Top-k Queries: Introduction
• Top-K queries are a long-studied topic in the database and information retrieval communities.
• The main objective has been to return the K highest-ranked answers quickly and efficiently.
• A Top-K query returns the subset of most relevant answers, instead of ALL answers, for two reasons:
  – i) to minimize the cost associated with retrieving all answers (e.g., disk, network, etc.)
  – ii) to maximize the quality of the answer set, such that the user is not overwhelmed with irrelevant results
Demetris Zeinalipour (Open University of Cyprus)
Top-k Queries: Then
SELECT TOP-2 pictures
FROM PICTURES
WHERE SIMILAR(picture, { [query image] })
[Figure: Query Processing]
Assumptions
• The data is available locally on disks or over a “high-speed”, “always-on” network
Trade-off
• Clients want to get the right answers quickly
• Service Providers want to consume the least possible resources
Top-k Queries: Now
In-Network Top-k Query Processing
[Figure: network of nodes reporting to a Base Station]
A few motivating queries:
• Snapshot Query: Find the K nodes with the highest temperature values
• Continuous Query: For the next hour, continuously report the K rooms with the highest average temperature
• Historic Query (nodes store all data locally): Find the K nodes with the highest average temperature during the last 6 months
Top-k Queries: Now
• Assume a cluster of n=5 Web-servers
• Each server maintains locally a replica of the same m=5 static Web-pages
• When a web page is accessed by a client, the respective server increases a local hit counter by one
[Figure: a client requests http://www.amazon.com/ and the serving Web-server increments its local counter (Hits++). Figure labels: Hits, PageID (M), Timestamps, (N) Web-servers.]
Scoring Table (one ranked list of PageID, Hits per Web-server; TOP-K is the average over the 5 servers)
v1        v2        v3        v4        v5        TOP-K
o3, .99   o1, .91   o1, .92   o3, .74   o3, .67   o3, 4.05/5 = .81
o1, .66   o3, .90   o3, .75   o1, .56   o4, .67   o1, 3.63/5 = .73
o0, .63   o0, .61   o4, .70   o2, .56   o1, .58   o4, 2.07/5 = .41
o2, .48   o4, .07   o2, .16   o0, .28   o2, .54   o0, 1.88/5 = .38
o4, .44   o2, .01   o0, .01   o4, .19   o0, .35   o2, 1.75/5 = .35
TOP-1 Query: “Find the webpage with the highest number of hits across all servers”
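The TOP-K column of the scoring table can be reproduced with a few lines of Python. This is only an illustrative sketch: the dictionary `HITS` and the function `average_scores` are names introduced here, and the computation is the naive centralized one that gathers every list in one place, which is exactly what the distributed algorithms later in the talk try to avoid.

```python
# Scores from the slide's scoring table: server -> {page: normalized hits}.
HITS = {
    "v1": {"o3": .99, "o1": .66, "o0": .63, "o2": .48, "o4": .44},
    "v2": {"o1": .91, "o3": .90, "o0": .61, "o4": .07, "o2": .01},
    "v3": {"o1": .92, "o3": .75, "o4": .70, "o2": .16, "o0": .01},
    "v4": {"o3": .74, "o1": .56, "o2": .56, "o0": .28, "o4": .19},
    "v5": {"o3": .67, "o4": .67, "o1": .58, "o2": .54, "o0": .35},
}


def average_scores(hits):
    """Return (page, average score) pairs, best first (naive, centralized)."""
    n = len(hits)
    totals = {}
    for per_server in hits.values():
        for page, score in per_server.items():
            totals[page] = totals.get(page, 0.0) + score
    ranking = [(page, total / n) for page, total in totals.items()]
    return sorted(ranking, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for page, avg in average_scores(HITS):
        print(f"{page}: {avg:.2f}")
```

The first entry of the returned ranking is the TOP-1 answer, o3 with an average of about .81.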
Presentation Outline
A. Introduction
B. Centralized Top-K Query Processing
   • The Threshold Algorithm (TA)
C. Distributed Top-K Query Processing
   • The Threshold Join Algorithm (TJA)
   • Experimentation using 75 workstations
D. Other Applications of Top-K Queries
   • Distributed Spatio-temporal Trajectory Retrieval (UB-K and UBLB-K Algorithms)
   • In-Network Top-K Views (MINT Views)
Centralized Top-K Query Processing
Fagin’s* Threshold Algorithm (TA) (In ACM PODS’02)
* Concurrently developed by 3 groups
The most widely recognized algorithm for Top-K Query Processing in database systems
TA Algorithm
v1       v2       v3       v4       v5
o3, 99   o1, 91   o1, 92   o3, 74   o3, 67
o1, 66   o3, 90   o3, 75   o1, 56   o4, 67
o0, 63   o0, 61   o4, 70   o2, 56   o1, 58
o2, 48   o4, 07   o2, 16   o0, 28   o2, 54
o4, 44   o2, 01   o0, 01   o4, 19   o0, 35
1) Access the n lists in parallel, one row at a time.
2) When some object oi is seen, perform a random access to the other lists to find the complete score for oi.
3) Do the same for all objects in the current row.
4) Now compute the threshold τ as the sum of scores in the current row.
5) The algorithm stops after K objects have been found with a score above τ.
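The five steps above can be sketched in Python as follows. This is a minimal single-process illustration, not the original implementation: the names `LISTS` and `threshold_algorithm` are introduced here, random access is simulated with a dictionary lookup, and all lists are assumed to have the same length and to contain every object.

```python
# Ranked lists from the slide (scores scaled by 100), best score first.
LISTS = {
    "v1": [("o3", 99), ("o1", 66), ("o0", 63), ("o2", 48), ("o4", 44)],
    "v2": [("o1", 91), ("o3", 90), ("o0", 61), ("o4", 7), ("o2", 1)],
    "v3": [("o1", 92), ("o3", 75), ("o4", 70), ("o2", 16), ("o0", 1)],
    "v4": [("o3", 74), ("o1", 56), ("o2", 56), ("o0", 28), ("o4", 19)],
    "v5": [("o3", 67), ("o4", 67), ("o1", 58), ("o2", 54), ("o0", 35)],
}


def threshold_algorithm(lists, k):
    """Return (top-k pairs, final threshold, rows read)."""
    lookup = {name: dict(ranked) for name, ranked in lists.items()}
    depth = len(next(iter(lists.values())))
    seen = {}  # object -> complete score
    top, tau = [], None
    for row in range(depth):
        # Steps 1-3: sorted access to the current row of every list,
        # then random access to all lists for each newly seen object.
        for ranked in lists.values():
            obj, _ = ranked[row]
            if obj not in seen:
                seen[obj] = sum(scores[obj] for scores in lookup.values())
        # Step 4: the threshold is the sum of the scores in the current row.
        tau = sum(ranked[row][1] for ranked in lists.values())
        # Step 5: stop once K objects score above the threshold.
        top = sorted(seen.items(), key=lambda kv: kv[1], reverse=True)[:k]
        if len(top) == k and all(score > tau for _, score in top):
            return top, tau, row + 1
    return top, tau, depth


if __name__ == "__main__":
    print(threshold_algorithm(LISTS, 1))
```

For the TOP-1 query on the slide's data the sketch stops after two rows with o3 (score 405) and a final threshold of 354.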
Centralized Top-K: The TA Algorithm (Example)
v1       v2       v3       v4       v5       TOP-K (objects seen so far)
o3, 99   o1, 91   o1, 92   o3, 74   o3, 67   o3, 405
o1, 66   o3, 90   o3, 75   o1, 56   o4, 67   o1, 363
o0, 63   o0, 61   o4, 70   o2, 56   o1, 58   o4, 207
o2, 48   o4, 07   o2, 16   o0, 28   o2, 54
o4, 44   o2, 01   o0, 01   o4, 19   o0, 35
Iteration 1 Threshold
τ (1st row) = 99 + 91 + 92 + 74 + 67 => τ = 423
Have we found K=1 objects with a score above τ? => NO
Iteration 2 Threshold
τ (2nd row) = 66 + 90 + 75 + 56 + 67 => τ = 354
Have we found K=1 objects with a score above τ? => YES! (o3, 405, i.e., 4.05/5 = .81)
Why is the threshold correct?
It gives us the maximum score for the objects we have not seen yet (<= τ)
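The bound can be written out explicitly. The notation below is introduced here for illustration: s_i(o) is the score of object o in list v_i, and \underline{s}_i is the score in the current row of v_i. Because every list is sorted in descending order, an object that has not been seen in any list must sit below the current row in all of them:

```latex
\[
  S(o) \;=\; \sum_{i=1}^{n} s_i(o)
  \;\le\; \sum_{i=1}^{n} \underline{s}_i \;=\; \tau
  \qquad \text{for every object } o \text{ not yet seen.}
\]
% Iteration 2 of the example: the unseen objects are o0 and o2, so
\[
  S(o_0) = 188 \le 354, \qquad S(o_2) = 175 \le 354, \qquad 354 < S(o_3) = 405 .
\]
```

No unseen object can therefore overtake o3, which is why the algorithm may stop after the second row.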
The Centralized Join Algorithm (CJA)
• Problem: How can we overcome the arbitrary number of phases of the Threshold Algorithm?
• Naive solution:
  – Perform the computation in one phase: each node sends its complete list of scores
  – Each intermediate node forwards all received lists
• Disadvantage
  – Overwhelming amount of messages.
  – Huge Query Response Time
[Figure: TOP-1 query over a routing tree rooted at the sink v1; the complete lists of v2, v3, v4 and v5 are forwarded unmerged, hop by hop, towards v1.]
The Staged Join Algorithm (SJA)
• Improved Solution: Aggregate the lists before these are forwarded to the parent
• This is the In-network aggregation approach
• Advantage: Only O(n) messages
• Disadvantage: Each message is still very large (i.e., the complete list)
[Figure: TOP-1 query over a routing tree rooted at the sink v1; each node merges the lists of its subtree (labels 5; 4,5; 3; 2,3,4,5; 1,2,3,4,5) into a single list before forwarding it to its parent.]
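The difference between forwarding every list (CJA) and merging lists at each hop (SJA) can be illustrated with a small simulation. Everything structural here is an assumption made for the sketch: the routing tree in `PARENT` (v1 as sink, v2 below it, v3 and v4 below v2, v5 below v4) is one reading of the figure, and the function names are hypothetical.

```python
from collections import Counter

# Per-node scores from the slides (scaled by 100).
LISTS = {
    "v1": {"o3": 99, "o1": 66, "o0": 63, "o2": 48, "o4": 44},
    "v2": {"o1": 91, "o3": 90, "o0": 61, "o4": 7, "o2": 1},
    "v3": {"o1": 92, "o3": 75, "o4": 70, "o2": 16, "o0": 1},
    "v4": {"o3": 74, "o1": 56, "o2": 56, "o0": 28, "o4": 19},
    "v5": {"o3": 67, "o4": 67, "o1": 58, "o2": 54, "o0": 35},
}

# Assumed routing tree (child -> parent); v1 is the sink.
PARENT = {"v2": "v1", "v3": "v2", "v4": "v2", "v5": "v4"}


def depth(node):
    """Number of hops from node to the sink."""
    hops = 0
    while node in PARENT:
        node = PARENT[node]
        hops += 1
    return hops


def cja_list_transmissions():
    """CJA: every complete list travels, unmerged, all the way to the sink."""
    return sum(depth(node) for node in LISTS)


def sja(root="v1"):
    """SJA: each node merges its subtree and sends one list to its parent."""
    children = {}
    for child, parent in PARENT.items():
        children.setdefault(parent, []).append(child)
    messages = 0

    def aggregate(node):
        nonlocal messages
        total = Counter(LISTS[node])
        for child in children.get(node, []):
            total.update(aggregate(child))  # add the child's merged scores
            messages += 1                   # one message per tree edge
        return total

    return aggregate(root), messages


if __name__ == "__main__":
    totals, messages = sja()
    print(cja_list_transmissions(), messages, totals.most_common(1))
```

Under the assumed tree, CJA moves 8 complete lists across links while SJA sends 4 merged lists, one per non-sink node, and both arrive at o3 with 405.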
Threshold Join Algorithm (TJA)
• TJA is our 3-phase algorithm that optimizes top-k query execution in distributed (hierarchical) environments.
• Advantage:
  – It usually completes in 2 phases.
  – It never completes in more than 3 phases (LB Phase, HJ Phase and CL Phase)
  – It is therefore highly appropriate for distributed environments
• “The Threshold Join Algorithm for Top-k Queries in Distributed Sensor Networks”, D. Zeinalipour-Yazti et al., In VLDB’s DMSN’05.
• “Finding the K Highest-Ranked Answers in a Distributed Network”, D. Zeinalipour-Yazti et al., Computer Networks, Elsevier, 2008.
Step 1 - LB (Lower Bound) Phase
1) LB Phase
• Recursively send the K highest objectIDs of each node to the sink.
• Each intermediate node performs a union of the received results (defined as τ)
[Figure: TJA routing tree rooted at the sink v1; each intermediate node forwards the union (U) of its own and its children's top-K objectIDs. Legend: Empty Oij, Occupied Oij.]
Query: TOP-1
v1       v2       v3       v4       v5
o3, 99   o1, 91   o1, 92   o3, 74   o3, 67
o1, 66   o3, 90   o3, 75   o1, 56   o4, 67
o0, 63   o0, 61   o4, 70   o2, 56   o1, 58
o2, 48   o4, 07   o2, 16   o0, 28   o2, 54
o4, 44   o2, 01   o0, 01   o4, 19   o0, 35
LB result: Ltotal = {1,3}, i.e., τ = {o3, o1}
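A minimal sketch of the LB phase on the slide's data follows. It is a flat simulation: the hierarchy is ignored because a union gives the same set regardless of the order in which intermediate nodes merge. The names `LISTS` and `lb_phase` are introduced here for illustration.

```python
# Ranked lists from the slide (scores scaled by 100), best score first.
LISTS = {
    "v1": [("o3", 99), ("o1", 66), ("o0", 63), ("o2", 48), ("o4", 44)],
    "v2": [("o1", 91), ("o3", 90), ("o0", 61), ("o4", 7), ("o2", 1)],
    "v3": [("o1", 92), ("o3", 75), ("o4", 70), ("o2", 16), ("o0", 1)],
    "v4": [("o3", 74), ("o1", 56), ("o2", 56), ("o0", 28), ("o4", 19)],
    "v5": [("o3", 67), ("o4", 67), ("o1", 58), ("o2", 54), ("o0", 35)],
}


def lb_phase(lists, k):
    """Union of the K highest-ranked objectIDs of every node."""
    tau = set()
    for ranked in lists.values():
        tau.update(obj for obj, _ in ranked[:k])
    return tau


if __name__ == "__main__":
    print(sorted(lb_phase(LISTS, 1)))
```

For K=1 the union is {o1, o3}, matching τ on the slide.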
Step 2 – HJ (Hierarchical Join) Phase
2) HJ Phase
• Disseminate τ to all nodes
• Each node sends back the objects in τ, together with all objects whose local score is above that of its lowest-scoring objectID in τ
• Before sending the objects, each node tags as incomplete the scores that could not be computed exactly
[Figure: TJA routing tree rooted at the sink v1; partial results are joined (+) hop by hop on the way to the sink, and the result set grows to {1,3,4}. Legend: Empty Oij, Occupied Oij, Incomplete Oij.]
v1       v2       v3       v4       v5       HJ
o3, 99   o1, 91   o1, 92   o3, 74   o3, 67   o3, 405 (complete)
o1, 66   o3, 90   o3, 75   o1, 56   o4, 67   o1, 363 (complete)
o0, 63   o0, 61   o4, 70   o2, 56   o1, 58   o4', 354 (incomplete)
o2, 48   o4, 07   o2, 16   o0, 28   o2, 54
o4, 44   o2, 01   o0, 01   o4, 19   o0, 35
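The HJ column can be reproduced with the sketch below. It encodes one reading of the phase that is consistent with the slide's numbers, and that reading is an assumption: each node reports the objects in τ plus every object scoring strictly above its lowest-scoring τ object, and a missing contribution is bounded from above by that node's cutoff. The names `LISTS` and `hj_phase` are hypothetical.

```python
# Ranked lists from the slide (scores scaled by 100), best score first.
LISTS = {
    "v1": [("o3", 99), ("o1", 66), ("o0", 63), ("o2", 48), ("o4", 44)],
    "v2": [("o1", 91), ("o3", 90), ("o0", 61), ("o4", 7), ("o2", 1)],
    "v3": [("o1", 92), ("o3", 75), ("o4", 70), ("o2", 16), ("o0", 1)],
    "v4": [("o3", 74), ("o1", 56), ("o2", 56), ("o0", 28), ("o4", 19)],
    "v5": [("o3", 67), ("o4", 67), ("o1", 58), ("o2", 54), ("o0", 35)],
}


def hj_phase(lists, tau):
    """Return (complete scores, upper bounds of incomplete objects)."""
    cutoffs, reported = {}, {}
    for node, ranked in lists.items():
        local = dict(ranked)
        # Lowest local score among the objects named in tau.
        cutoff = min(local[obj] for obj in tau)
        cutoffs[node] = cutoff
        reported[node] = {o: s for o, s in ranked if o in tau or s > cutoff}

    complete, incomplete = {}, {}
    for obj in set().union(*reported.values()):
        total, missing = 0, False
        for node in lists:
            if obj in reported[node]:
                total += reported[node][obj]
            else:
                # Not reported, so the local score is at most the cutoff.
                total += cutoffs[node]
                missing = True
        (incomplete if missing else complete)[obj] = total
    return complete, incomplete


if __name__ == "__main__":
    print(hj_phase(LISTS, {"o3", "o1"}))
```

With τ = {o3, o1} the complete scores are o3 = 405 and o1 = 363, and o4 is tagged incomplete with the bound 67 + 66 + 90 + 75 + 56 = 354, as on the slide.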
Step 3 – CL (Cleanup) Phase
• Have we found K objects with a complete score that is above all incomplete scores?
  – Yes: The answer has been found!
  – No: Find the complete score for each incomplete object (all in a single batch phase)
• CL ensures correctness
• This phase is rarely required in practice!
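The test that decides whether the CL phase is needed can be sketched as a small predicate. The function name `needs_cleanup` and its dictionary inputs (object to complete score, object to incomplete upper bound) are assumptions made for this illustration.

```python
def needs_cleanup(complete, incomplete, k):
    """True if the CL phase must fetch exact scores for incomplete objects."""
    top = sorted(complete.values(), reverse=True)[:k]
    if len(top) < k:
        return True  # fewer than K complete scores are known
    highest_incomplete = max(incomplete.values(), default=float("-inf"))
    # The K-th complete score must lie above every incomplete score.
    return top[-1] <= highest_incomplete


if __name__ == "__main__":
    # HJ result from the slides: o3 and o1 complete, o4' incomplete.
    print(needs_cleanup({"o3": 405, "o1": 363}, {"o4": 354}, 1))
```

For the running TOP-1 example the predicate is false, so TJA finishes after two phases.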
Experimental Evaluation
• We have implemented a P2P middleware in Java (sockets + binary transfer protocol).
• We tested our implementation with a network of 1000 real nodes using 75 Linux workstations.
• We use a trace-driven experimentation methodology with data from an Environmental Monitoring Facility in Washington / Oregon
Summary of Findings
• Bytes: CJA = 10xTJA; SJA = 3xTJA
• Time: TJA: 3.7s [LB: 1.0s, HJ: 2.7s, CL: 0.08s]; SJA: 8.2s; CJA: 18.6s
• Messages: TJA: 259, SJA: 183, CJA: 246
Application 2: SpatioTemporal Similarity Search
• Similarity Search: Given a query Q, find the degree of similarity (Euclidean distance, DTW, LCSS) between Q and a set of m target trajectories {A1,A2,…,Am}.
• Each Ai (i<=m) is segmented into a number of non-overlapping cells {C1,C2,…,Cn} that maintain the local subsequences.
• Challenge: How can we find the K most similar trajectories to Q without pulling together all subsequences?
[Figure: x-y plane divided into cells, each with an Access Point; trajectories A1 and A2 of moving objects and the query Q. Figure labels: trajectories, cell, moving object, Access Point, G.]
“Distributed Spatio-Temporal Similarity Search”, D. Zeinalipour-Yazti, S. Lin, D. Gunopulos, ACM 15th Conference on Information and Knowledge Management (ACM CIKM 2006), November 6-11, Arlington, VA, USA, pp. 14-23, August 2006.
Application 2: Spatiotemporal Query Processing
Solution Outline
• Each cell computes a lower bound and an upper bound on the matching of Q to its local subsequences.
• The distributed scoring table now contains score bounds (lower, upper) rather than exact scores.
• We have proposed two iterative algorithms: UB-K and UBLB-K, which combine these score bounds.
• UB-K and UBLB-K find the K most similar trajectories to Q without pulling together the distributed subsequences.
Distributed scoring table (n cells as columns, m trajectories as rows; METADATA holds the per-trajectory sums of the bounds)
v1 (id,lb,ub)   v2 (id,lb,ub)   v3 (id,lb,ub)   METADATA (id,lb,ub)
A2,3,6          A4,4,5          A4,1,3          A4,10,18
A0,4,8          A2,5,6          A0,6,10         A2,13,19
A4,5,10         A0,5,7          A2,5,7          A0,15,25
A7,7,9          A3,5,6          A9,6,7          A3,20,27
A3,8,11         A9,8,10         A3,7,10         A9,22,26
A9,8,9          A7,12,13        A7,11,13        A7,30,35
....            ....            ....            ....
[Figure: x-y plane divided into cells, each with an Access Point; trajectories A1 and A2 of moving objects and the query Q.]
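The METADATA column of the scoring table is the per-trajectory sum of the cell bounds, which the sketch below reproduces for the rows visible on the slide (the elided "...." rows are left out). It illustrates only how bounds combine and is not an implementation of UB-K or UBLB-K; `CELLS` and `combine_bounds` are names introduced here.

```python
# Visible rows of the slide's table: cell -> {trajectory: (lb, ub)}.
CELLS = {
    "v1": {"A2": (3, 6), "A0": (4, 8), "A4": (5, 10),
           "A7": (7, 9), "A3": (8, 11), "A9": (8, 9)},
    "v2": {"A4": (4, 5), "A2": (5, 6), "A0": (5, 7),
           "A3": (5, 6), "A9": (8, 10), "A7": (12, 13)},
    "v3": {"A4": (1, 3), "A0": (6, 10), "A2": (5, 7),
           "A9": (6, 7), "A3": (7, 10), "A7": (11, 13)},
}


def combine_bounds(cells):
    """Sum the lower and the upper bounds of each trajectory over all cells."""
    totals = {}
    for bounds in cells.values():
        for trajectory, (lb, ub) in bounds.items():
            cur_lb, cur_ub = totals.get(trajectory, (0, 0))
            totals[trajectory] = (cur_lb + lb, cur_ub + ub)
    return totals


if __name__ == "__main__":
    for trajectory, (lb, ub) in sorted(combine_bounds(CELLS).items(),
                                       key=lambda item: item[1]):
        print(trajectory, lb, ub)
```

Sorted by lower bound, the output lists A4 (10, 18) first and A7 (30, 35) last, the same order as the METADATA column.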
Application 3: MINT
• MINT: a framework for optimizing the execution of continuous monitoring queries in sensor networks.
• “MINT Views: Materialized In-Network Top-k Views in Sensor Networks”, D. Zeinalipour-Yazti, P. Andreou, P. Chrysanthis and G. Samaras, In IEEE 8th International Conference on Mobile Data Management, Mannheim, Germany, May 7-11, 2007
Query: Find the K=1 rooms with the highest average temperature
MINT Views: Problem
Objective: To prune away tuples locally at each sensor such that messaging is minimized.
Naïve Solution: Each node eliminates any tuple with a score lower than its top-1 result.
[Figure: sensor network with the tuples (D,76.5), (C,75), (B,41) and (B,40)]
Problem: We received an incorrect answer, i.e., (D,76.5) instead of (C,75).
MINT Views: Main Idea
• Bound above each tuple with its maximum possible value.
• K-covered Bound-set: Includes all the objects which have an upper bound (vub) greater than or equal to the kth highest lower bound (τ), i.e., vub >= τ
[Figure: each object drawn as an interval from vlb to vub on the sum axis, with the threshold τ marked]
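The K-covered bound-set rule can be sketched as follows. The room bounds in `BOUNDS` are made-up values chosen only to exercise the rule, and `k_covered_bound_set` is a hypothetical name; the sketch assumes the "greater than or equal" reading of the definition.

```python
# Hypothetical (lower bound, upper bound) pairs per room, for illustration.
BOUNDS = {
    "A": (70.0, 80.0),
    "B": (40.0, 45.0),
    "C": (75.0, 78.0),
    "D": (60.0, 76.5),
}


def k_covered_bound_set(bounds, k):
    """Return (tau, objects whose upper bound reaches tau).

    tau is the k-th highest lower bound. An object whose upper bound is
    below tau can never enter the top-k, so it can be pruned locally.
    """
    lower_bounds = sorted((lb for lb, _ in bounds.values()), reverse=True)
    tau = lower_bounds[k - 1]
    kept = {obj for obj, (_, ub) in bounds.items() if ub >= tau}
    return tau, kept


if __name__ == "__main__":
    print(k_covered_bound_set(BOUNDS, 1))
```

With these illustrative numbers and K=1, τ is 75.0 and only room B is pruned, because its upper bound of 45.0 cannot reach τ.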
An Overview of Distributed
Top-K Ranking Algorithms
Thank you!
Demetris Zeinalipour
This presentation is available at:
http://www2.cs.ucy.ac.cy/~dzeina/talks.html
Related Publications available at:
http://www2.cs.ucy.ac.cy/~dzeina/publications.htm
Backup Slides
Main Findings:
• Dataset: Environmental Measurements from atmospheric monitoring stations in Washington & Oregon (2003-2004).
• Query: Find the K timestamps on which the average temperature across all stations was maximum.
• Network: Random Graph (degree=4, diameter=10)
• Evaluation Criteria: i) Bytes, ii) Time, iii) Messages
Experimental Results
TJA requires one order of magnitude fewer bytes than CJA!
Experimental Results
TJA: 3.7sec [ LB:1.0sec, HJ:2.7sec, CL:0.08sec ]
SJA: 8.2sec
CJA: 18.6sec
Experimental Results
Messages: TJA: 259, CJA: 246, SJA: 183
Although TJA consumes more messages than SJA, these are small-size messages
The TPUT Algorithm
Q: TOP-1
v1       v2       v3       v4       v5
o3, 99   o1, 91   o1, 92   o3, 74   o3, 67
o1, 66   o3, 90   o3, 75   o1, 56   o4, 67
o0, 63   o0, 61   o4, 70   o2, 56   o1, 58
o2, 48   o4, 07   o2, 16   o0, 28   o2, 54
o4, 44   o2, 01   o0, 01   o4, 19   o0, 35
[Figure: the TPUT phases P1, P2 and P3, annotated with the partial sums o1=183, o3=240 and the later results o3=405, o1=363, o2'=158, o4'=137, o0'=124]
Phase 1: o1 = 91+92 = 183, o3 = 99+67+74 = 240
τ = (Kth highest score (partial) / n) => 240 / 5 => τ = 48
Phase 2: Have we computed K exact scores?
Computed Exactly: [o3, o1]
Incompletely Computed: [o4, o2, o0]
Drawback: The threshold is uniform (too coarse)
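The numbers on this slide can be reproduced with the sketch below, which covers only the two phases spelled out above. It assumes that in Phase 2 every node reports all objects whose local score is at least τ, which is the reading that yields o2' = 158. The names `LISTS` and `tput_phases` are introduced here for illustration.

```python
# Ranked lists from the slide (scores scaled by 100), best score first.
LISTS = {
    "v1": [("o3", 99), ("o1", 66), ("o0", 63), ("o2", 48), ("o4", 44)],
    "v2": [("o1", 91), ("o3", 90), ("o0", 61), ("o4", 7), ("o2", 1)],
    "v3": [("o1", 92), ("o3", 75), ("o4", 70), ("o2", 16), ("o0", 1)],
    "v4": [("o3", 74), ("o1", 56), ("o2", 56), ("o0", 28), ("o4", 19)],
    "v5": [("o3", 67), ("o4", 67), ("o1", 58), ("o2", 54), ("o0", 35)],
}


def tput_phases(lists, k):
    """Return (phase-1 partial sums, tau, exact scores, incomplete sums)."""
    n = len(lists)
    # Phase 1: every node reports its local top-k; the sink adds them up.
    partial = {}
    for ranked in lists.values():
        for obj, score in ranked[:k]:
            partial[obj] = partial.get(obj, 0) + score
    kth_highest = sorted(partial.values(), reverse=True)[k - 1]
    tau = kth_highest / n  # uniform per-node threshold

    # Phase 2: every node reports all objects scoring at least tau.
    sums, counts = {}, {}
    for ranked in lists.values():
        for obj, score in ranked:
            if score >= tau:
                sums[obj] = sums.get(obj, 0) + score
                counts[obj] = counts.get(obj, 0) + 1
    exact = {o: s for o, s in sums.items() if counts[o] == n}
    incomplete = {o: s for o, s in sums.items() if counts[o] < n}
    return partial, tau, exact, incomplete


if __name__ == "__main__":
    print(tput_phases(LISTS, 1))
```

For K=1 the sketch yields the partial sums o3 = 240 and o1 = 183, τ = 48, exact scores for o3 and o1, and incomplete sums for o2, o4 and o0. Because the same τ applies at every node, v1 still has to report o0 and o2, which illustrates the coarseness noted above.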
TJA vs. TPUT
MINT Views: Experimentation
• We obtained a real trace of atmospheric data collected by UC-Berkeley on the Great Duck Island (Maine) in 2002.
• We then performed a trace-driven experimentation using XBow's TELOSB sensor.
• Our query was as follows:
  – SELECT TOP-K area, Avg(temp)
  – FROM sensors
  – GROUP BY area
[Figure: experimental results; value labels 77%, 39%, 34%, 0%, 12%]