Transcript slides
Parallel Evaluation of Conjunctive Queries
Paraschos Koutris and Dan Suciu
University of Washington
PODS 2011, Athens

Motivation
• Massive parallelism is necessary nowadays for handling huge amounts of data
• Parallelism has been popularized in various forms:
  • The MapReduce architecture
  • Languages on top of MapReduce: Pig Latin, Hive
  • Systems for data analytics: Dremel, SCOPE
• What is a good theoretical model to capture computation in such massively parallel systems?

Today's Parallel Models
• Classic models for parallelism:
  • Circuit complexity, PRAM (Parallel Random Access Machines)
  • The BSP (Bulk-Synchronous Parallel) model [Valiant, '90]
  • The LogP model [Culler et al., '93]
• The main bottlenecks: communication + synchronization + data skew

|                                                    | Communication        | Synchronization | Data Skew            |
| [Afrati and Ullman, EDBT'10]                       | minimize             | 1 step          | n/a                  |
| [Karloff et al., SODA'10]                          | implicit restriction | minimize        | memory O(n^ε), ε < 1 |
| [Hellerstein, SIGMOD'10] (Coordination Complexity) | n/a                  | minimize        | n/a                  |
| Our approach                                       | O(n)                 | minimize        | load balancing       |

Our Approach
• Strict bounds on communication and data skew
• Minimize synchronization
• Parallel complexity = # of synchronization steps
• Example:
  • Algorithms A and B process the same amount of data
  • Algorithm B is more efficient than algorithm A
  (figure: the synchronization steps of Algorithm A vs. Algorithm B)

The Massively Parallel Model
• A universe U, a relational schema, and a database instance D
• P servers: relation R is partitioned into R1, R2, …, RP
• A value a from U is generic:
  • Copy a
  • Test for equality: is a = b?
  • Feed it to a hash function: h(a), h'(a, b)
  • Hash functions can be chosen randomly at the beginning
• Computation proceeds in parallel steps, each with 3 phases:
  • Broadcast Phase: the P servers exchange some data B globally, shared among all servers.
    We require size(B) = O(n^ε), ε < 1
  • Communication Phase: each server sends data to other servers
  • Computation Phase: local computation
• An algorithm for a query Q is load balanced if the expected maximum load is O(n / P), where n = size of the input + output data

Datalog Notation for MP
• R(@s,x,y): the fragment of relation R stored at server s
• Broadcasting to all servers:
  R(@*,x) :- S(@s,x), T(@s,x)
• Point-to-point communication using a hash function h:
  R(@h(x,y),x,y,z) :- S(@s,x,y,z), T(@s,x)
• Local computation at server s:
  R(@s,x,y) :- S(@s,x,y), T(@s,x)

Intersection
• Q(x) :- R(x), S(x)
• Communication Phase:
  R2(@h(x),x) :- R(@s,x)
  S2(@h(x),x) :- S(@s,x)
• Computation Phase:
  Q(@s,x) :- R2(@s,x), S2(@s,x)

The Main Result
• We study relational queries which are:
  • Conjunctive: a conjunction of atoms
  • Full: every variable must appear in the head of the query
• Question: which full conjunctive queries can be answered by a load balanced algorithm in one MP step?

Main Theorem
• Every tall-flat conjunctive query can be evaluated in one MP step by a load balanced algorithm
• Conversely, if a query is not tall-flat, then no algorithm consisting of one MP step can be load balanced

Tall-Flat Queries
• Tall queries: Q(x,y,z) :- R(x), S(x,y), T(x,y,z)
• Flat queries: Q(x,y,z,w) :- R(x,y), S(x,z), T(x,w)
• Combine them to get the tall-flat queries:
  L(x1,x2,x3,x4,y1,y2,y3) :-
    R1(x1), R2(x1,x2), R3(x1,x2,x3), R4(x1,x2,x3,x4),           (tall part)
    S1(x1,x2,x3,x4,y1), S2(x1,x2,x3,x4,y2), S3(x1,x2,x3,x4,y3)  (flat part)

Outline
• Algorithms for:
  • Semijoin
  • Flat queries
  • Tall queries
  • Combining them for tall-flat queries
• Impossible queries

Semijoin: a naïve approach
• Semijoin operator Q(x,y) :- R(x), S(x,y)
• Communication Phase: send tuples S(a,b) and R(a) to server h(a)
• Computation Phase: locally perform the semijoin
• Example (the value 0 is frequent in S):
  S(0,a), S(0,c), S(0,d), S(0,e), S(0,w) all hash to the same server;
  S(1,c), S(2,a), S(2,b), S(3,d), S(3,f), S(4,a), S(5,a) are spread out by hashing
• Load balanced?
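To see why the answer is no, here is a minimal simulation of the naïve communication phase (Python; the skewed instance of S, the server count P, and the hash function are invented for illustration and are not from the talk):

```python
from collections import Counter

P = 4  # number of servers

# A skewed instance: the value 0 is frequent in S(x, y)
S = [(0, y) for y in range(1000)] + [(x, 0) for x in range(1, 101)]
R = [(x,) for x in range(200)]

def h(a):
    # stand-in for a randomly chosen hash function
    return hash(a) % P

# Naive communication phase: every tuple with key a goes to server h(a)
load = Counter()
for (x, y) in S:
    load[h(x)] += 1
for (x,) in R:
    load[h(x)] += 1

n = len(S) + len(R)
print("loads:", dict(load))
print("max load:", max(load.values()), "ideal n/P:", n / P)
# The server receiving h(0) holds at least the 1000 tuples S(0, *),
# far above the ideal n/P, so the algorithm is not load balanced.
```

All tuples sharing the frequent value 0 land on the single server h(0), no matter how the hash function is chosen.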
  ✗ No: the tuples carrying the frequent value all land on the same server.

Semijoin: a better approach
• Broadcast Phase:
  compute the frequent values: set F = frequent(S)
• Communication Phase:
  R2(@h(x),x) :- R(@s,x), not F(@s,x)
  S2(@h(x),x,y) :- S(@s,x,y), not F(@s,x)
  R2(@*,x) :- R(@s,x), F(@s,x)
  S2(@h2(x,y),x,y) :- S(@s,x,y), F(@s,x)
• Computation Phase:
  Q(@s,x,y) :- R2(@s,x), S2(@s,x,y)
• Same approach as the skewed join in Pig Latin
• Computing frequent elements: given a relation R(x,…), find the values of x with frequency above a threshold τ
  • Sampling
  • Local counting

The Broadcast Phase
• Do we really need a broadcast phase before distributing the data to the servers?
• Theorem: any algorithm computing a semijoin in 1 MP step without a broadcast phase is not load balanced
• The purpose of the broadcast phase is to extract information about the data distribution (e.g., to identify the frequent values)

Full Join
• Q(x,y,z) :- R(x,y), S(y,z)
• Communication Phase:
  Case: y frequent in R
    HR(@h(x,y),x,y) :- R(@s,x,y), RF(y)
    DS(@*,y,z) :- S(@s,y,z), RF(y)
  Case: y frequent in S, not frequent in R
    HS(@h2(y,z),y,z) :- S(@s,y,z), SF(y), not RF(y)
    DR(@*,x,y) :- R(@s,x,y), SF(y), not RF(y)
  Case: y not frequent in R, not frequent in S
    TR(@h3(y),x,y) :- R(@s,x,y), not RF(y), not SF(y)
    TS(@h3(y),y,z) :- S(@s,y,z), not RF(y), not SF(y)
• Computation Phase:
  J1(@s,x,y,z) :- HR(@s,x,y), DS(y,z)
  J2(@s,x,y,z) :- DR(x,y), HS(@s,y,z)
  J3(@s,x,y,z) :- TR(@s,x,y), TS(@s,y,z)
  Q(@s,x,y,z) :- J1(@s,x,y,z); J2(@s,x,y,z); J3(@s,x,y,z)
• Similar idea to [Xu et al., SIGMOD '08]

Flat Queries
• How can we extend the above ideas to compute flat queries?
  Q(x,y,z,w) :- R(x,y), S(x,z), T(x,w)
• We introduce a second step in the broadcast phase to find the frequent values that definitely appear in the final result
• Why would frequent values that do not appear in the result be a problem?
• Suppose a is frequent in R and S but does not appear in T
• The cost of replicating the a-tuples would not be justified by the output size
• The idea generalizes to any flat query, with only 2 broadcast steps

Tall Queries
• Compute a tall query Q(x,y,z) :- R(x), S(x,y), T(x,y,z)
• Construct a decision tree that decides whether a tuple will be hashed (and how) or broadcast
• Example: a tuple t = S(a,b)
  (figure: a decision tree testing whether x is frequent in S, whether x is frequent in T, and whether (x,y) is frequent in T; depending on the answers, a tuple is sent to @h(x), @h(x,y), @h(x,y,z), or broadcast; here t = S(a,b) is sent to h(a,b))

The Main Algorithm
• Reminder: a tall-flat query consists of a tall part and a flat part
• Tall-query techniques (the decision tree) handle the tall part
• Flat-query techniques handle the flat part
• We can thus design an algorithm that computes any tall-flat query in 1 MP step (with a 2-step broadcast phase)

Main Theorem (Part 1)
• Every tall-flat conjunctive query can be evaluated in one MP step by a load balanced algorithm

Impossibility Theorems
• Lemma 1: the query RST(x,y) :- R(x), S(x,y), T(y) cannot be computed in 1 MP step by a load balanced algorithm
• Lemma 2: the query J(x,y) :- R(x), S(x), T(y) cannot be computed in 1 MP step by a load balanced algorithm
• Main Theorem (Part 2): no query that is not tall-flat can be computed in 1 MP step by a load balanced algorithm

Open Questions
• How can we leverage data statistics (e.g., relation sizes, value distributions) to design better MP algorithms?
• What is the minimum number of parallel steps needed for a given query?
• What is the parallel complexity of other classes of queries (e.g., with unions or projections)?
• At what point does it become more expensive in practice to have a broadcast phase instead of 2 steps?

Questions??
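As a closing illustration, the skew-aware semijoin from the "better approach" slide can be sketched in Python. This is a sketch only: the threshold τ, the data, and the hash functions h and h2 are invented for this example, and frequent values are counted exactly rather than by sampling or local counting.

```python
from collections import Counter, defaultdict

P = 4     # number of servers
TAU = 50  # hypothetical frequency threshold for "frequent"

# Same skewed instance as before: the value 0 is frequent in S(x, y)
S = [(0, y) for y in range(1000)] + [(x, 0) for x in range(1, 101)]
R = [(x,) for x in range(200)]

def h(a):     return hash(a) % P          # hashes a single value
def h2(a, b): return hash((a, b)) % P     # hashes a pair of values

# Broadcast phase: F = frequent(S), values of x with frequency > TAU
freq = Counter(x for (x, y) in S)
F = {x for x, c in freq.items() if c > TAU}

servers = defaultdict(lambda: {"R2": set(), "S2": set()})

# Communication phase, mirroring the four Datalog rules:
for (x,) in R:
    if x in F:                       # R2(@*,x)      :- R(@s,x), F(x)
        for s in range(P):
            servers[s]["R2"].add(x)
    else:                            # R2(@h(x),x)   :- R(@s,x), not F(x)
        servers[h(x)]["R2"].add(x)
for (x, y) in S:
    if x in F:                       # S2(@h2(x,y),x,y) :- S(@s,x,y), F(x)
        servers[h2(x, y)]["S2"].add((x, y))
    else:                            # S2(@h(x),x,y)    :- S(@s,x,y), not F(x)
        servers[h(x)]["S2"].add((x, y))

# Computation phase: Q(@s,x,y) :- R2(@s,x), S2(@s,x,y)
Q = {(x, y) for srv in servers.values()
            for (x, y) in srv["S2"] if x in srv["R2"]}

loads = [len(srv["R2"]) + len(srv["S2"]) for srv in servers.values()]
print("answer size:", len(Q), "max load:", max(loads))
```

Because the S-tuples with the frequent value are spread by h2(x, y) while R broadcasts that value, no single server receives all of the skewed key, unlike the naïve version.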