Transcript pptx

Scalable Data Partitioning
Techniques for Parallel
Sliding Window Processing
over Data Streams
DMSN 2011
Cagri Balkesen & Nesime Tatbul
Talk Outline
• Intro & Motivation
• Stream Partitioning Techniques
– Basic window partitioning
– Batch partitioning
– Pane-based partitioning
• Ring-based Query Evaluation
• Experimental Evaluation
• Conclusions & Future Work
Intro & Motivation
[Figure: DSMS (Data Stream Management System) overview]
Architectural Overview
[Figure: input stream → Split stage (split node) → parallel Query nodes → Merge stage (merge node) → output stream; example QoS: latency < 5 seconds, disorder < 3 tuples]
• Classical Split-Merge pattern from parallel databases (sketched below)
• Adjustable parallelism level, d
• QoS on max latency & order
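The pattern can be sketched in a few lines. The Python sketch below is illustrative only: it assumes a simple round-robin split, and the names split_merge, route, and query are made up for this illustration; the actual split policies are the partitioning techniques discussed on the following slides.

from typing import Callable, Iterable, List

def split_merge(stream: Iterable, d: int,
                route: Callable[[object, int], List[int]],
                query: Callable[[List[object]], List[object]]) -> List[object]:
    # Split stage: route every input tuple to one or more of the d query nodes.
    partitions: List[List[object]] = [[] for _ in range(d)]
    for t in stream:
        for node in route(t, d):
            partitions[node].append(t)
    # Query nodes: each partition is processed independently (parallelizable).
    results = [query(p) for p in partitions]
    # Merge stage: combine the per-node outputs into one output stream.
    return [r for node_out in results for r in node_out]

# Illustration: round-robin split of 10 tuples over d = 3 nodes, identity query.
print(split_merge(range(10), d=3, route=lambda t, d: [t % d], query=list))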
Related Work: How to Partition?
• Content-sensitive
– Flux: Fault-tolerant, Load-balancing eXchange [1,2]
– Uses the query's group-by values to partition
– Needs explicit load balancing due to skewed data
• Content-insensitive
– GDSM: Window-based parallelization (fixed-size tumbling windows) [3]
– Win-Distribute: Partition at window boundaries
– Win-Split: Partition each window into equi-length sub-windows
• The Problem:
– How to handle sliding windows?
– How to handle queries with no group-by, or with only a few groups?
[1] Flux: An Adaptive Partitioning Operator for Continuous Query Systems, ICDE‘03
[2] Highly-Available, Fault-Tolerant, Parallel Dataflows, SIGMOD ‘04
[3] Customizable Parallel Execution of Scientific Stream Queries, VLDB ‘05
Stream Partitioning Techniques
Approach 1: Basic Sliding Window Partitioning
• Chunk the stream into independently processable units
– Window-aware splitting of the stream
• Each window has an id & tuples are marked with
– (first-winid, last-winid, is-win-closer)
• Tuples are replicated for each of their windows (see the sketch below)
[Figure: Split replicates tuples t1-t10 into overlapping windows W1-W4 on Node1-Node3; w = 6 units, s = 2 units, replication = w/s = 6/2 = 3]
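A minimal sketch of this splitting step, assuming count-based windows over integer tuple positions and a simple win_id % d node assignment (both are assumptions of this illustration, not prescribed by the slide):

def window_ids(p, w, s):
    # 0-based ids of all sliding windows [i*s, i*s + w) that contain position p.
    first = max(0, (p - w) // s + 1)
    last = p // s
    return range(first, last + 1)

def basic_window_split(positions, w, s, d):
    # Replicate each tuple into every window it belongs to and route each copy
    # to the node owning that window (illustrative rule: win_id % d).
    for p in positions:
        wins = list(window_ids(p, w, s))
        for i in wins:
            marks = {"first_winid": wins[0], "last_winid": wins[-1],
                     "is_win_closer": p == i * s + w - 1}
            yield (i % d, i, p, marks)

# w = 6, s = 2: in steady state every tuple is replicated to w/s = 3 windows.
for node, win, p, marks in basic_window_split(range(10), w=6, s=2, d=3):
    print(f"tuple {p} -> window {win} on node {node}: {marks}")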
Approach 1: Basic Sliding Window Partitioning
The problem with basic sliding window partitioning:
• Tuples belong to many windows, depending on the slide
• Excessive replication of tuples, one copy per window
• Increases the output data volume of the split
[Figure: same example as above (w = 6, s = 2); every tuple is sent to w/s = 3 nodes]
Approach 2: Batch-based Partitioning
• Batch several windows together to reduce replication
• "Batch-window": wb = w + (B-1)*s ; sb = B*s
– All tuples in a batch go to the same partition
– Only tuples overlapping between batches are replicated
• Replication reduced to wb/sb partitions on average, instead of w/s (see the sketch below)
[Figure: windows w1-w8 grouped into batches B1, B2]
Definitions: w = window size, s = slide size, B = batch size
Example: w = 3, s = 1, B = 3 → wb = 5, sb = 3; replication drops from 3 to 5/3
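A small sketch of the batch arithmetic, under the same count-based assumptions as before; a batch behaves exactly like a sliding window of size wb that slides by sb, so the same id computation applies:

def batch_ids(p, w, s, B):
    # A batch of B consecutive windows acts like one window of size
    # wb = w + (B - 1) * s that slides by sb = B * s.
    wb, sb = w + (B - 1) * s, B * s
    first = max(0, (p - wb) // sb + 1)
    last = p // sb
    return list(range(first, last + 1)), wb, sb

# Slide example: w = 3, s = 1, B = 3 -> wb = 5, sb = 3; the average number of
# copies per tuple drops from w/s = 3 to wb/sb = 5/3.
for p in range(12):
    ids, wb, sb = batch_ids(p, w=3, s=1, B=3)
    print(f"tuple {p}: batches {ids}")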
The Panes Technique
• Divide overlapping windows into disjoint panes
• Reduce cost by sub-aggregation and sharing
• Each window has w/gcd(w,s) panes of size gcd(w,s)
• Query is decomposed into pane-level (PLQ) & window-level (WLQ) queries (sketched below)
[Figure: disjoint panes p1-p8 shared across overlapping windows w1-w5]
[1] No Pane, No Gain: Efficient Evaluation of Sliding Window Aggregates over Data Streams, SIGMOD Record ‘05
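As a concrete illustration of the PLQ/WLQ decomposition, here is a minimal sketch for a decomposable aggregate such as SUM; count-based windows are assumed and the function names are illustrative:

from math import gcd

def pane_level(stream, w, s, agg=sum):
    # PLQ: sub-aggregate each disjoint pane of size gcd(w, s).
    gp = gcd(w, s)
    panes, buf = [], []
    for t in stream:
        buf.append(t)
        if len(buf) == gp:
            panes.append(agg(buf))
            buf = []
    return panes

def window_level(panes, w, s, combine=sum):
    # WLQ: combine w/gcd pane results per window, sliding by s/gcd panes.
    gp = gcd(w, s)
    ppw, pps = w // gp, s // gp
    return [combine(panes[i:i + ppw])
            for i in range(0, len(panes) - ppw + 1, pps)]

# w = 6, s = 2 -> pane size 2, 3 panes per window; a windowed SUM is just the
# sum of the window's 3 pane sums.
panes = pane_level(range(12), w=6, s=2)
print(panes)                          # [1, 5, 9, 13, 17, 21]
print(window_level(panes, w=6, s=2))  # [15, 27, 39, 51]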
Approach 3: Pane-based Partitioning
• Mark each tuple with pane-id + win-id
– Treat panes as tumbling windows with wp = sp = gcd(w,s)
• Route tuples to a node based on pane-id (see the sketch below)
• Nodes compute the PLQ over their pane tuples
• Combine all PLQ results of a window to form the WLQ result
– Needs an organized topology of nodes
– We propose organizing the nodes in a ring
[Figure: Split routes panes to Node1-Node3; w = 6 units, s = 2 units]
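A sketch of the split side, again assuming count-based positions; the pane_id % d routing rule here is only a placeholder, since the actual assignment of panes to nodes follows the ring organization described on the next slide:

from math import gcd

def pane_route(positions, w, s, d):
    # Mark each tuple with its pane id and window ids and route it, without
    # replication, to a single node chosen from the pane id (placeholder rule).
    gp = gcd(w, s)                  # panes tumble: wp = sp = gcd(w, s)
    for p in positions:
        pane_id = p // gp
        first_win = max(0, (p - w) // s + 1)
        last_win = p // s
        yield (pane_id % d, pane_id, (first_win, last_win), p)

# w = 6, s = 2, d = 3: every tuple is sent to exactly one node.
for node, pane, wins, p in pane_route(range(10), w=6, s=2, d=3):
    print(f"tuple {p}: pane {pane} -> node {node}, windows {wins}")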
Ring-based Query Evaluation
[Figure: ring topology over Node1-Node3; Split routes pane tuples P1, P2, ... to the nodes, partial pane results (R3, R5, R7, ...) are forwarded around the ring, and completed window results W1-W3 are sent to Merge; example with W = 6, S = 4 tuples, pane size P = gcd(6,4) = 2 tuples]
• High amount of pipelined result sharing among nodes
• Organized communication topology (simulated in the sketch below)
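The sketch below simulates the figure's example (W = 6, S = 4, pane size 2, three nodes) under two simplifying assumptions of this illustration: window i is owned by node i % d, and a pane's PLQ result is computed on the node owning the first window that contains it. Each window's owner then combines its local pane results with the pane results forwarded by its ring predecessor.

from math import gcd

def first_window(q, ww, sw):
    # Index of the first window containing pane q; window i covers panes
    # [i*sw, i*sw + ww).
    return max(0, (q - ww) // sw + 1)

def ring_eval(pane_results, w, s, d, combine=sum):
    # Assumed assignment: window i -> node i % d; pane q's PLQ result is
    # produced on the node that owns the first window containing q.
    gp = gcd(w, s)
    ww, sw = w // gp, s // gp            # window size / slide in panes
    n_windows = (len(pane_results) - ww) // sw + 1
    out = []
    for i in range(n_windows):
        owner = i % d
        needed = range(i * sw, i * sw + ww)
        sources = {q: first_window(q, ww, sw) % d for q in needed}
        out.append((i, owner, combine(pane_results[q] for q in needed), sources))
    return out

# Slide example: W = 6, S = 4 tuples, pane size gcd(6, 4) = 2, d = 3 nodes.
panes = [1, 5, 9, 13, 17, 21, 25]        # PLQ (sum) results of panes p0..p6
for win, owner, result, src in ring_eval(panes, w=6, s=4, d=3):
    print(f"W{win + 1} on Node{owner + 1}: sum = {result}, pane -> node {src}")

In this run the only non-local pane result each window needs (p2 for W2, p4 for W3) is produced on the owner's immediate ring predecessor, which is the pipelined result sharing the slide refers to.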
Assignment of Windows and Panes to Nodes
• All pane results arrive only from predecessor nodes
• Pane results sent to the successor are only locally computed pane results
– Each node is assigned n consecutive windows
– Minimum n such that the two conditions above hold (searched for by the sketch below)
Definitions: ww : window size in # of panes, sw : slide size in # of panes
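Rather than stating a closed-form expression for the minimum n, the following sketch finds it by brute force, checking the two conditions above under the same illustrative assumptions as the previous sketch (blocks of n consecutive windows assigned round-robin around the ring, each pane owned by the node of the first window that contains it):

def owner_of_window(i, n, d):
    # Block round-robin: n consecutive windows per node around the ring.
    return (i // n) % d

def min_consecutive_windows(ww, sw, d, check_windows=1000):
    # Smallest n such that every non-local pane result a window needs is owned
    # by the immediate ring predecessor of the window's owner.
    def first_window(q):
        return max(0, (q - ww) // sw + 1)
    for n in range(1, ww + 1):
        ok = True
        for i in range(check_windows):
            owner = owner_of_window(i, n, d)
            pred = (owner - 1) % d
            for q in range(i * sw, i * sw + ww):
                if owner_of_window(first_window(q), n, d) not in (owner, pred):
                    ok = False
                    break
            if not ok:
                break
        if ok:
            return n
    return None

# ww = 3 panes, sw = 2 panes (the W = 6, S = 4 example), d = 3 nodes -> n = 1.
print(min_consecutive_windows(ww=3, sw=2, d=3))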
Flexible Result Merging
[Figure: spectrum of merge strategies, from fully ordered (* k = 0) through k-ordered to FIFO]
• k-ordered: k-ordering constraint [1], a certain amount of disorder is allowed
– Definition: for any tuple s, any tuple s' arriving at least k+1 tuples after s satisfies s'.A ≥ s.A (see the sketch below)
[1] Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams. ACM TODS ‘04
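As an illustration of the k-ordered case, the sketch below uses a reorder buffer of k+1 tuples: if the merged stream satisfies the k-ordering constraint on attribute A, this buffer is enough to emit a fully ordered output. k = 0 degenerates to a pass-through (the fully-ordered end of the spectrum), while plain FIFO corresponds to skipping the buffer entirely. The function name and the list-based input are illustrative.

import heapq

def enforce_full_order(stream, k, key=lambda t: t):
    # Reorder buffer: if any tuple arriving >= k+1 positions after s has
    # key >= key(s), then buffering k+1 tuples yields a fully ordered output.
    heap = []
    for pos, t in enumerate(stream):
        heapq.heappush(heap, (key(t), pos, t))   # pos breaks key ties safely
        if len(heap) > k:
            yield heapq.heappop(heap)[2]
    while heap:
        yield heapq.heappop(heap)[2]

# A merged stream with bounded disorder (k = 2) becomes fully ordered.
merged = [1, 3, 2, 5, 4, 7, 6, 8]
print(list(enforce_full_order(merged, k=2)))   # [1, 2, 3, 4, 5, 6, 7, 8]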
Experimental Evaluation
• Implementation of the techniques in Borealis
• Workload adapted from the Linear Road Benchmark
– Slightly modified segment-statistics queries
– Basic aggregation functions with different window/slide ratios
Scalability of Split Operator
[Figure: maximum input rate (tuples/second) vs. window-size/slide ratio (window overlap)]
• Pane-partitioning: cost & throughput stay constant regardless of the overlap ratio
• Window- and batch-partitioning: cost increases and throughput decreases as overlap increases
• Batching reduces the excessive replication of window-partitioning
Scalability of Partitioning Techniques
* w/s = overlap ratio = 100
• Pane-based partitioning scales close to linearly until the split node is saturated
– Per-tuple cost is constant
• Window- and batch-based partitioning: extremely high replication
– The split is not saturated, but throughput scales very slowly
Summary & Conclusions
1) Window-based
2) Batch-based
3) Pane-based
• Pane-based partitioning is the technique of choice
– Avoids tuple replication
– Incurs less overhead in split and aggregate
– Scales close to linearly
Ongoing & Future Work
• Generalization of the framework
• Support for adaptivity at runtime
• Extending the complexity of query plans
• Extending the performance analysis & experiments
Thank You!