Transcript slides

Parallel Evaluation of
Conjunctive Queries
Paraschos Koutris and Dan Suciu
University of Washington
PODS 2011, Athens
Motivation
• Massive parallelism is necessary nowadays for handling huge
amounts of data
• Parallelism has been popularized in various forms:
• The MapReduce architecture
• Languages on the top of MapReduce: PigLatin, Hive
• Systems for data analytics: Dremel, SCOPE
• What is a good theoretical model to capture computation in
such massively parallel systems?
Today’s Parallel Models
• Classic models for parallelism:
• Circuit complexity, PRAM (Parallel Random Access Machines)
• The BSP (Bulk-Synchronous Parallel) model [Valiant, ‘90]
• The LogP model [Culler et al., ‘93]
• The main bottlenecks: Communication + Synchronization + Data Skew

                               Communication         Synchronization  Data Skew
[Afrati and Ullman, EDBT’10]   minimize              1 step           n/a
[Karloff et al., SODA’10]      implicit restriction  minimize         memory O(n^ε), ε < 1
[Hellerstein, SIGMOD’10]
  (Coordination Complexity)    n/a                   minimize         n/a
Our Approach                   O(n)                  minimize         load balancing
Our Approach
• Strict bounds on communication and data skew
• Minimize synchronization
• Parallel complexity = # synchronization steps
• Example:
• Algorithms A and B process the same amount of data
• Algorithm B is more efficient than algorithm A
[Figure: executions of Algorithm A vs. Algorithm B]
The Massively Parallel Model
• A universe U, a relational schema and a database instance D
• P servers: each relation R is partitioned into R1 , R2 , …, RP
• A value a from U is generic:
• Copy a
• Test for equality: is a = b ?
• Feed to a hash function : h(a), h’(a, b)
• hash functions can be chosen randomly at the beginning
• Computation proceeds in parallel steps, each with 3 phases:
• Broadcast Phase: The P servers exchange some data B globally,
shared among all servers. We require size(B) = O(n^ε), ε < 1
• Communication Phase: Each server sends data to other servers
• Computation Phase: local computation
• An algorithm for a query Q is load balanced if the expected
maximum load is O(n / P) where
• n = size of input + output data
Datalog Notation for MP
• R(@s,x,y): the fragment of relation R stored at server s
• Broadcasting to all servers:
R(@*,x) :- S(@s,x),T(@s,x)
• Point-to-point communication using a hash function h:
R(@h(x,y),x,y,z) :- S(@s,x,y,z),T(@s,x)
• Local computation at server s:
R(@s,x,y) :- S(@s,x,y),T(@s,x)
Intersection Q(x):-R(x),S(x)
Communication Phase
R2(@h(x),x) :- R(@s,x)
S2(@h(x),x) :- S(@s,x)
Computation Phase
Q(@s,x) :- R2(@s,x), S2(@s,x)
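As a concrete illustration, this one-step plan can be simulated in Python. The server arrays, hash function, and data below are illustrative assumptions, not the authors' code:

```python
# Sketch of the one-step MP plan for Q(x) :- R(x), S(x):
# route every value to server h(x), then intersect locally.

P = 4  # number of servers (illustrative)

def h(x):
    # shared hash function, chosen before the computation starts
    return hash(x) % P

def intersection_step(R, S):
    # Communication phase: each value goes to server h(x)
    servers_R = [set() for _ in range(P)]
    servers_S = [set() for _ in range(P)]
    for x in R:
        servers_R[h(x)].add(x)
    for x in S:
        servers_S[h(x)].add(x)
    # Computation phase: each server intersects its local fragments
    result = set()
    for s in range(P):
        result |= servers_R[s] & servers_S[s]
    return result

print(sorted(intersection_step({1, 2, 3, 5}, {2, 3, 4})))  # [2, 3]
```

Because equal values always hash to the same server, no value of the intersection can be missed, and distinct values spread across the P servers.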
The Main Result
• We study relational queries which are:
• Conjunctive: conjunction of atoms
• Full: every variable must appear in the head of the query
• Q : Which full conjunctive queries can be answered by a load
balanced algorithm in one MP step?
Main Theorem
• Every tall-flat conjunctive query can be evaluated in one MP
step by a load balanced algorithm
• Conversely, if a query is not tall-flat, then any algorithm
consisting of one MP step cannot be load balanced
Tall-Flat Queries
• Tall Queries:
Q(x,y,z):- R(x),S(x,y),T(x,y,z)
• Flat Queries:
Q(x,y,z,w) :- R(x,y),S(x,z),T(x,w)
• Combine them to get the tall-flat queries:
L(x1,x2,x3,x4,y1,y2,y3) :−
    R1(x1),                 -- tall part
    R2(x1,x2),
    R3(x1,x2,x3),
    R4(x1,x2,x3,x4),
    S1(x1,x2,x3,x4,y1),     -- flat part
    S2(x1,x2,x3,x4,y2),
    S3(x1,x2,x3,x4,y3)
Outline
• Algorithms for
• Semijoin
• Flat Queries
• Tall Queries
• Combine for Tall-Flat Queries
• Impossible Queries
Semijoin: a naïve approach
• Semijoin operator Q(x,y):- R(x), S(x,y)
• Communication Phase: send tuples S(a,b),R(a) to server h(a)
• Computation Phase: locally perform the semijoin
[Figure: hashing S-tuples to servers by h(x). Most servers receive only one
or two tuples, but the server for the frequent value x = 0 receives five
— the naïve approach is not load balanced.]
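The skew effect can be seen in a tiny simulation (the data and server count below are illustrative assumptions):

```python
# Sketch of why naive hashing is not load balanced: if one value of x
# dominates S, the single server h(x) receives almost all of the data.

P = 4

def h(x):
    return hash(x) % P

def max_load(tuples):
    # count how many tuples each server receives under naive hashing
    loads = [0] * P
    for (x, y) in tuples:
        loads[h(x)] += 1
    return max(loads)

skewed = [(0, i) for i in range(100)]    # every tuple shares x = 0
uniform = [(i, i) for i in range(100)]   # 100 distinct x values

print(max_load(skewed))   # 100: one server gets everything
print(max_load(uniform))  # 25: roughly n / P per server
```

With uniform keys the maximum load is close to the ideal n/P; with one heavy value it degenerates to n on a single server, which is exactly what the broadcast phase is designed to detect.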
A better approach
Semijoin
Broadcast Phase
compute the frequent values: F = frequent(S)
Communication Phase
R2(@h(x),x)      :- R(@s,x), not F(@s,x)
S2(@h(x),x,y)    :- S(@s,x,y), not F(@s,x)
R2(@*,x)         :- R(@s,x), F(@s,x)
S2(@h2(x,y),x,y) :- S(@s,x,y), F(@s,x)
Computation Phase
Q(@s,x,y)        :- R2(@s,x), S2(@s,x,y)
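The skew-aware routing can be sketched in Python as follows. The function names and test data are illustrative assumptions; only the routing policy mirrors the rules above:

```python
# Sketch of the skew-aware semijoin Q(x,y) :- R(x), S(x,y):
# infrequent x goes to h(x); for frequent x, R-values are broadcast
# and S-tuples are spread out by a second hash h2(x, y).

P = 4

def h(x):
    return hash(x) % P

def h2(x, y):
    return hash((x, y)) % P

def route_semijoin(R, S, F):
    # servers[s] collects ('R'/'S', tuple) messages received by server s
    servers = [[] for _ in range(P)]
    for x in R:
        if x in F:
            for s in range(P):                       # broadcast frequent R-values
                servers[s].append(('R', x))
        else:
            servers[h(x)].append(('R', x))
    for (x, y) in S:
        if x in F:
            servers[h2(x, y)].append(('S', (x, y)))  # spread frequent S-tuples
        else:
            servers[h(x)].append(('S', (x, y)))
    return servers

def compute(servers):
    # each server performs the semijoin on its local data
    out = set()
    for msgs in servers:
        Rs = {t for rel, t in msgs if rel == 'R'}
        Ss = {t for rel, t in msgs if rel == 'S'}
        out |= {(x, y) for (x, y) in Ss if x in Rs}
    return out

R = {0, 1}
S = {(0, i) for i in range(10)} | {(1, 9), (2, 7)}
print(sorted(compute(route_semijoin(R, S, F={0}))))
```

Broadcasting the (few) frequent R-values is cheap, while the (many) S-tuples with a frequent x are spread evenly by h2, so no server is overloaded.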
• Same approach as SkewJoin in PigLatin
• Computing frequent elements: given a relation R(x, …), find
the values of x with frequency above a threshold τ
• Sampling
• Local Counting
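One of the two options, local counting, can be sketched as below (the per-server count-then-sum formulation is an assumption about how the broadcast phase would aggregate, not the paper's exact procedure):

```python
# Sketch of finding frequent values by local counting: each server counts
# its own fragment, and the counts are summed during the broadcast phase.

from collections import Counter

def frequent(fragments, threshold):
    # fragments: one list of x-values per server
    total = Counter()
    for frag in fragments:
        total.update(Counter(frag))   # local counts, then global sum
    return {x for x, c in total.items() if c > threshold}

fragments = [[0, 0, 1], [0, 2], [0, 3, 0]]
print(frequent(fragments, threshold=3))  # {0}: it appears 5 times in total
```

Sampling trades this exact count for less communication, at the price of a small probability of misclassifying values near the threshold.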
The Broadcast Phase
• Do we really need a broadcast phase before distributing the
data to the servers?
Theorem
Any algorithm computing a semijoin in 1 MP step without a
broadcast phase is not load balanced
• The purpose of the broadcast phase is to extract information
on the data distribution (e.g. identify the frequent values)
Full Join
Join Q(x,y,z):-R(x,y),S(y,z)
Communication Phase
CASE: frequent(R)
HR(@h(x,y),x,y)  :- R(@s,x,y), RF(y)
DS(@*,y,z)       :- S(@s,y,z), RF(y)
CASE: frequent(S), not frequent(R)
HS(@h2(y,z),y,z) :- S(@s,y,z), SF(y), not RF(y)
DR(@*,x,y)       :- R(@s,x,y), SF(y), not RF(y)
CASE: not frequent(R), not frequent(S)
TR(@h3(y),x,y)   :- R(@s,x,y), not RF(y), not SF(y)
TS(@h3(y),y,z)   :- S(@s,y,z), not RF(y), not SF(y)
Computation Phase
J1(@s,x,y,z) :- HR(@s,x,y), DS(@s,y,z)
J2(@s,x,y,z) :- DR(@s,x,y), HS(@s,y,z)
J3(@s,x,y,z) :- TR(@s,x,y), TS(@s,y,z)
Q(@s,x,y,z)  :- J1(@s,x,y,z); J2(@s,x,y,z); J3(@s,x,y,z)
• Similar idea to [Xu et al., SIGMOD ‘08]
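The three-case routing can be sketched in Python. The relation names, hash functions, and data below are illustrative assumptions; the case analysis follows the rules above:

```python
# Sketch of the skew-aware join Q(x,y,z) :- R(x,y), S(y,z):
# the case analysis on whether y is frequent in R and/or S decides
# which side is hashed and which side is broadcast.

P = 4
h = lambda t: hash(t) % P   # one hash function per case, folded into h

def route(R, S, RF, SF):
    # servers[s] collects ('R'/'S', tuple) messages
    servers = [[] for _ in range(P)]
    for (x, y) in R:
        if y in RF:
            servers[h((x, y))].append(('R', (x, y)))  # hash the skewed side
        elif y in SF:
            for s in range(P):                        # broadcast against frequent S
                servers[s].append(('R', (x, y)))
        else:
            servers[h(y)].append(('R', (x, y)))       # plain hash join on y
    for (y, z) in S:
        if y in RF:
            for s in range(P):                        # broadcast against frequent R
                servers[s].append(('S', (y, z)))
        elif y in SF:
            servers[h((y, z))].append(('S', (y, z)))
        else:
            servers[h(y)].append(('S', (y, z)))
    return servers

def join(servers):
    # each server joins its local R- and S-tuples on y
    out = set()
    for msgs in servers:
        Rs = [t for rel, t in msgs if rel == 'R']
        Ss = [t for rel, t in msgs if rel == 'S']
        out |= {(x, y, z) for (x, y) in Rs for (y2, z) in Ss if y == y2}
    return out

R = {(1, 'a'), (2, 'a'), (3, 'b')}
S = {('a', 10), ('a', 11), ('b', 12)}
print(sorted(join(route(R, S, RF={'a'}, SF=set()))))
```

In each case the larger, skewed side is spread across servers by a finer hash while the smaller side is replicated, keeping every server's load near n/P.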
Flat Queries
• How can we extend the above ideas to compute flat queries?
Q(x,y,z,w) :- R(x,y),S(x,z),T(x,w)
• We introduce a second step in the broadcast phase to find the
frequent values that are guaranteed to appear in the final result
• Why is this necessary? Suppose a is frequent in R and S but does
not appear in T:
• the cost of replicating the a-tuples would not be justified by the
output size
• The idea generalizes for any flat query, with only 2 broadcast
steps
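The check performed by the second broadcast step can be sketched as a simple membership test (a simplified formulation under assumed names, not the paper's exact procedure):

```python
# Sketch of the second broadcast step's check for a flat query such as
# Q(x,y,z,w) :- R(x,y), S(x,z), T(x,w): only replicate a frequent value a
# if it appears in every relation, so the replication cost is paid back
# by actual output tuples.

def worth_replicating(a, relations):
    # relations: one set of x-values per atom of the query
    return all(a in rel for rel in relations)

R, S, T = {0, 1}, {0, 2}, {0, 1, 2}
print(worth_replicating(0, [R, S, T]))  # True: 0 appears in R, S, and T
print(worth_replicating(1, [R, S, T]))  # False: 1 is missing from S
```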
Tall Queries
• Compute a tall query
Q(x,y,z) :- R(x),S(x,y),T(x,y,z)
• Construct a decision tree to decide whether a tuple will be
hashed (and how) or broadcast
• Example: a tuple t = S(a,b)
[Decision tree for the tuple t = S(a,b): tests of the form "is x frequent in
S?", "is x frequent in T?", "is (x,y) frequent in T?" determine whether t is
sent to @h(x), @h(x,y), @h(x,y,z), or broadcast to all servers.]
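The flavor of such a decision tree can be sketched as follows. The specific tests below are illustrative assumptions; the paper's actual tree is more refined, but the principle is the same: route on the coarsest hash key whose value is not frequent downstream, and broadcast only as a last resort:

```python
# Hedged sketch of decision-tree routing for Q(x,y,z) :- R(x), S(x,y), T(x,y,z).
# For an S-tuple (a, b): prefer the coarse hash h(x); refine to h(x, y) if
# x is frequent in T; broadcast only if even (x, y) is frequent in T.

P = 4

def route_S_tuple(a, b, freq_x_in_T, freq_xy_in_T):
    if a not in freq_x_in_T:
        return ('hash', hash(a) % P)          # destination @h(x)
    if (a, b) not in freq_xy_in_T:
        return ('hash', hash((a, b)) % P)     # destination @h(x,y)
    return ('broadcast', None)                # frequent at every level

print(route_S_tuple(1, 2, freq_x_in_T=set(), freq_xy_in_T=set()))
```

Each refinement spreads the tuples of a frequent value over more servers, so the expected maximum load stays at O(n/P).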
The Main Algorithm
• Reminder: A tall-flat query consists of a tall and a flat part
• Tall-query techniques (decision tree) handle the tall part
• Flat-query techniques handle the flat part
• We can thus design an algorithm which computes any tall-flat
query in 1 MP step (with a 2-step broadcast phase)
Main Theorem (Part 1)
• Every tall-flat conjunctive query can be evaluated in one MP
step by a load balanced algorithm
Impossibility Theorems
Lemma 1
The query
RST(x,y):- R(x),S(x,y),T(y)
cannot be computed in 1 MP step by a load balanced algorithm
Lemma 2
The query
J(x,y):- R(x),S(x),T(y)
cannot be computed in 1 MP step by a load balanced algorithm
Main Theorem (Part 2)
A query that is not tall-flat cannot be computed in 1 MP step by a load
balanced algorithm
Open Questions
• How can we leverage data statistics (e.g. relation sizes, value
distributions) to design better MP algorithms?
• What is the minimum number of parallel steps for any query?
• What is the parallel complexity of other classes of queries (e.g.
with union, projections)?
• At what point does it become more expensive in practice to
have a broadcast phase instead of 2 steps?
Questions?