Sampling Based Range Partition for Big Data Analytics
+ Some Extras
Milan Vojnović, Microsoft Research Cambridge, United Kingdom
Joint work with Charalampos Tsourakakis, Bozidar Radunovic, Zhenming Liu, Fei Xu, Jingren Zhou
INQUEST Workshop, September 2012
Big Data Analytics
• Our goal: innovation in the area of algorithms for large-scale computations, to move the frontier of the computer science of big data
• Some figures of scale
  – Peta/tera bytes of online services data processed daily
  – 200M tweets per day (Twitter)
  – 1B content pieces shared per day (Facebook)
  – 8,000 exabytes of global data by 2015 (The Economist)
Research Agenda
[Diagram: database queries, machine learning, and optimization layered over a distributed computing system]
Outline
• Range Partition – with Fei Xu and Jingren Zhou
• Count Tracking – with Zhenming Liu and Bozidar Radunovic
• Graph Partitioning (def. only) – with Charalampos Tsourakakis and Bozidar Radunovic
Range Partition

[Diagram: data items from k sites, e.g. keys 120, 1024, 24, 8, …, routed into m ranges 1-100, 101-250, …, 950-1024]

• Special interest: balanced range partition
Range Partition Requirements
• Given 𝜖, 𝛿, and desired relative partition sizes (p_1, p_2, …, p_m)
• 𝜖-accurate range partition: Q_i ≤ (1 + 𝜖) p_i n for all i, with probability at least 1 − 𝛿
  – Q_i = number of data items assigned to range i
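To make the definition concrete, here is a small check of the 𝜖-accuracy condition for a given set of partition boundaries (the helper name and left-closed boundary convention are illustrative, not from the talk):

```python
import bisect

def is_eps_accurate(data, boundaries, p, eps):
    """Check Q_i <= (1 + eps) * p_i * n for every range i, where range i
    holds the items falling between consecutive sorted boundaries."""
    n = len(data)
    counts = [0] * len(p)
    for x in data:
        counts[min(bisect.bisect_left(boundaries, x), len(p) - 1)] += 1
    return all(q <= (1 + eps) * pi * n for q, pi in zip(counts, p))

# A perfectly balanced split of 1..100 into m = 4 ranges is 0.1-accurate
print(is_eps_accurate(list(range(1, 101)), [25, 50, 75], [0.25] * 4, 0.1))  # True
```

A skewed boundary choice (e.g. a first boundary at 10) would overload the second range and fail the check.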
Two Approaches
• Sampling based methods
  – Take a sample of data items
  – Compute partition boundaries using the sample
• Quantile summary methods
  – At each node compute a local quantile summary
  – Merge at the coordinator node
Related Work
• Sampling based estimation of histograms studied by Chaudhuri, Motwani and Narasayya (ACM SIGMOD 1998)
  – Required sample size: Õ(m/𝜖²)
• Communication cost to draw s samples without replacement (Tirthapura and Woodruff, 2011): O(k log(n/s) / log(1 + k/s))
  – For s ≥ k: O(s log(n/s)); otherwise O(k log(n/s) / log(k/s))
Related Work (cont’d)
• Quantile summaries based approach (Greenwald and Khanna, 2001)
  – Communication cost: Õ(m/𝜖)
• Pros
  – Deterministic guarantee
• Cons
  – Requires sorting of data items
  – Largest frequency of an item must be at most 2𝜖
Problem
• Range partition the data while making one pass through it, with minimal communication between the coordinator and the sites
Sampling Based Method
• Collect t samples and partition using the samples
• Pros
  – Simplicity, scalability
• Cons
  – How many samples to take from each site?
  – Data size imbalance: the number of data input records per machine may differ from one machine to another

[Diagram: sites 1, 2, …, k report samples to a coordinator]
Data Sizes Imbalance
Dataset    | Records | Bytes | Sites
DataSet-1  | 62M     | 150G  | 262
DataSet-2  | 37M     | 25G   | 80
DataSet-3  | 13M     | 0.26G | 1
DataSet-4  | 7M      | 1.2T  | 301
DataSet-5  | 106M    | 7T    | 5652
Origins of Data Sizes Imbalance
• JOIN
  SELECT * FROM A INNER JOIN B ON A.KEY == B.KEY ORDER BY COL
• Lookup table
  – If the record value of column X is in the lookup table, then return the row
• UNPIVOT
  – Input: Col 1 = 1, 2, …; Col 2 = (2, 3), (3, 9, 8, 13), …
  – Output: (1,2), (1,3), (2,3), (2,9), …
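The UNPIVOT example above can be reproduced in a few lines; it also shows why the operator creates imbalance, since each input row can expand into a different number of output records (function name illustrative):

```python
def unpivot(rows):
    """Expand each (key, list-of-values) row into one (key, value)
    pair per value -- the UNPIVOT operation from the slide."""
    return [(key, v) for key, values in rows for v in values]

rows = [(1, [2, 3]), (2, [3, 9, 8, 13])]
print(unpivot(rows))  # [(1, 2), (1, 3), (2, 3), (2, 9), (2, 8), (2, 13)]
```

Two input rows of equal size here yield output groups of size 2 and 4, so a partitioning that was balanced before the operator need not be balanced after it.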
Weighted Sampling Scheme
• SAMPLE: each site reports a random sample of t/k data items and its total number of items
• MERGE: a summary is created by adding each data item from site i n_i times
• PARTITION: use the summary to determine partition boundaries

Note: a site reports its total number of data items only once it is available, i.e. after the site has made one pass through its local data
SAMPLE

[Diagram: each site i = 1, …, k sends (n_i, S_i), with S_i = {a_i,1, a_i,2, …, a_i,t}, to the coordinator]
MERGE

[Diagram: the coordinator merges (n_1, S_1), …, (n_k, S_k) into a single summary S, including n_i replicas of each item from S_i, e.g. S = {…, a_i,2, a_i,2, …, a_i,2, …}]
PARTITION

[Diagram: the empirical CDF of the data summary S, cut to produce ranges 1-5]
Sufficient Sample Size
• Assume 𝜙 < 𝜋𝜖. For sample size

  t ≥ 2(1 − 𝜋 + 𝛼) / [(𝜋𝜖)² (1 − 𝜙/(𝜋𝜖))²] · polylog(1/𝛿, m, n, k)

  the scheme yields an 𝜖-accurate range partition with probability ≥ 1 − 𝛿
• 𝛼 ≥ (1/4) ((1/k) Σ_{i=1}^k (k n_i / n)² − 1)
• 𝜙 = largest frequency of a data value
• 𝜋 = min_i p_i
Constant Factor Imbalance
• Suppose that for some 𝜌 ≥ 1, max_i n_i ≤ 𝜌 n/k
• Then 𝛼 = (𝜌² − 1)/4
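A quick numeric sanity check that, for an allocation satisfying max_i n_i ≤ 𝜌 n/k, the imbalance quantity that 𝛼 must dominate is indeed covered by (𝜌² − 1)/4 (helper name and the example counts are made up):

```python
def alpha_lower_bound(counts):
    """The quantity alpha must dominate:
    (1/4) * ((1/k) * sum((k * n_i / n)^2) - 1)."""
    k, n = len(counts), sum(counts)
    return 0.25 * (sum((k * ni / n) ** 2 for ni in counts) / k - 1)

# A skewed allocation with max_i n_i <= rho * n/k for rho = 2
counts = [200, 100, 100, 100]   # n = 500, n/k = 125, max = 200 <= 2 * 125
print(alpha_lower_bound(counts), (2 ** 2 - 1) / 4)
```

Here the bound evaluates to 0.03, comfortably below (𝜌² − 1)/4 = 0.75, and it equals 0 exactly when all sites hold n/k items.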
Proof Outline
• p_e ≤ Σ_{i=1}^m Pr[q_i > (1 + 𝜖) p_i]
• q_i = Σ_{j=1}^k (n_j / n) q_{i,j} / 𝜈_j
• Large deviation analysis of the error exponent
Performance

[Plot: results for DataSet-1; 100K data records per range, 𝛿 = 0.1]
Performance (cont’d)
[Plots for a = 1, a = 2, a = 4]
• n = 100,000, k = 4, m = 5, 𝜖 = 0.1
• n_1 ≥ n_2 = ⋯ = n_k = n_1/a, a ≥ 1
Summary for Range Partitioning
• Novel weighted sampling scheme
• Provable performance guarantees
• Simple and practical
  – Code transferred to Cosmos
• More info: Sampling Based Range Partition Methods for Big Data Analytics, Vojnović, Xu, Zhou, MSR-TR-2012-18, Mar 2012
Outline
• Range Partition – with Fei Xu and Jingren Zhou
• Count Tracking – with Zhenming Liu and Bozidar Radunovic
• Graph Partitioning (def. only) – with Charalampos Tsourakakis and Bozidar Radunovic
SUM Tracking Problem
• Maintain an estimate Ŝ_t of S_t: (1 − 𝜖) S_t ≤ Ŝ_t ≤ (1 + 𝜖) S_t
• SUM: S_t = Σ_{i≤t} X_i

[Diagram: k sites receive updates X_1, X_2, X_3, X_4, X_5, …]
SUM Tracking
[Plot: S_t over time t, tracked within the envelope (1 − 𝜖) S_t to (1 + 𝜖) S_t]
Applications
• Ex 1: database queries
  SELECT SUM(AdBids) FROM Ads
• Ex 2: iterative solving over input data
  x_{t+1} = x_t + 𝛾 f(x_t, 𝜉_t)
State of the Art
• Count tracking [Huang, Yi and Zhang, 2011]
  – Worst-case input, monotonic sum
  – Expected total communication: O((√k/𝜖) log n) messages
• Lower bound for worst-case input [Arackaparambil, Brody and Chakrabarti, 2009]
  – Expected total communication: Ω(n/k) messages
The Challenge
• Q: What are communication-efficient algorithms for the sum tracking problem with random input streams?
  – Random permutation
  – Random i.i.d.
  – Fractional Brownian motion
Communication Complexity Bounds

• Lower bound: Ω(√(kn)/𝜖)
• Upper bound: O((√(kn)/𝜖) log n)
• Sublinear in n; the “price of non-monotonicity” over the monotonic case: √n log n
Communication Complexity Bounds: Unknown Drift Case

• Input: i.i.d. Bernoulli, P(X_i = 1) = (1 + 𝜇)/2 = 1 − P(X_i = −1), with 𝜇 ∈ [−1, 1] an unknown drift parameter
• Expected total communication: Õ((√k/𝜖) min{1/|𝜇|, √n}) messages
• Generalizes the monotonic case to the constant drift case
Our Tracker Algorithm
• Each site reports to the coordinator upon receiving a value update at time t with probability

  p_t = min{𝛼 n log 𝛽 / (𝜖 S_t)², 1}

• Sync all: whenever the coordinator receives an update from a site, all sites report their current local sums
[Diagram: each site i maintains its local sum S_i; the coordinator maintains S = S_1 + … + S_k; an update X_i at site i triggers a report (M_i = 1) with probability p_t]
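A toy single-process simulation of the reporting idea follows. The exact form of p_t is not fully legible in this transcript, so the rule below (report probability shrinking like 1/(𝜖S_t)², with a made-up constant c) should be read as illustrative of the flavor, not as the talk's algorithm:

```python
import random

def track_sum(stream, k, eps, c=10.0):
    """Toy simulation: each update lands on a random site, which reports
    with probability p_t; every report triggers a sync in which the
    coordinator re-learns all k local sums (counted as k messages)."""
    local = [0.0] * k
    estimate, messages = 0.0, 0
    for x in stream:
        i = random.randrange(k)
        local[i] += x
        s = abs(sum(local))                          # current |S_t|
        p = min(c / max((eps * s) ** 2, 1.0), 1.0)   # decays like 1/(eps*S_t)^2
        if random.random() < p:
            estimate = sum(local)                    # sync all local sums
            messages += k
    return estimate, messages

random.seed(0)
stream = [random.choice([-1, 1]) for _ in range(5000)]
print(track_sum(stream, k=8, eps=0.1))
```

The qualitative effect matches the slide: while |S_t| is small a relative-error guarantee is cheap to maintain (sync often), and as |S_t| grows the reporting probability, and hence the message rate, drops.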
Two Applications
• Second Frequency Moment
• Bayesian Linear Regression
App 1: Second Frequency Moment
• Input: a_t = (𝛼_t, z_t)
• Counter of value i: m_i(t) = Σ_{s≤t: 𝛼_s=i} z_s
• Second frequency moment: F_2(t) = Σ_{i∈[m]} m_i²(t)
• Goal: track F_2(t) within relative accuracy 𝜖 > 0
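The definitions above translate directly into a brute-force computation of F_2 from a stream of (𝛼_t, z_t) pairs (function name and the toy stream are illustrative):

```python
from collections import Counter

def second_frequency_moment(stream):
    """F2(t) = sum_i m_i(t)^2, where m_i(t) is the sum of the
    z_s with alpha_s = i seen so far."""
    m = Counter()
    for alpha, z in stream:
        m[alpha] += z
    return sum(v * v for v in m.values())

stream = [(1, 1), (2, 1), (1, 1), (3, 1), (2, -1)]
print(second_frequency_moment(stream))  # m = {1: 2, 2: 0, 3: 1} -> 5
```

Note that because the z_t can be negative, the counters m_i(t), and hence F_2(t), are non-monotonic; this is exactly what rules out the monotonic count-tracking machinery.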
AMS Sketch

• S_t^{i,j} = Σ_{s≤t} z_s h(𝛼_s) = Σ_{a∈[m]} h(a) m_a(t), with a {−1, +1}-valued hash h
• S_t^i = (1/s_1) Σ_j (S_t^{i,j})²
• Ŝ_t = median_i(S_t^i)
• For s_1 = 16/𝜖² and s_2 = 2 log(1/𝛿), Ŝ_t is within (1 ± 𝜖) F_2(t) w.p. ≥ 1 − 𝛿
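A minimal sketch of the AMS estimator described above, with fully random per-value signs standing in for the 4-wise independent hashing that the formal guarantee needs (all names and parameters are illustrative):

```python
import random
import statistics

def ams_estimate(stream, m, s1, s2, seed=0):
    """AMS sketch: s2 independent groups of s1 counters; each counter
    accumulates sum z_s * h(alpha_s) for a random {-1,+1} hash h;
    square and average within a group, then median across groups."""
    rng = random.Random(seed)
    group_means = []
    for _ in range(s2):
        squares = []
        for _ in range(s1):
            h = {a: rng.choice([-1, 1]) for a in range(m)}
            c = sum(z * h[alpha] for alpha, z in stream)
            squares.append(c * c)
        group_means.append(sum(squares) / s1)
    return statistics.median(group_means)

# 10 distinct values, each with total count 100 -> F2 = 10 * 100^2 = 100000
stream = [(a % 10, 1) for a in range(1000)]
print(ams_estimate(stream, m=10, s1=64, s2=5))
```

Each counter is a ±1-signed sum, so its square is an unbiased estimate of F_2; averaging reduces variance and the median boosts the success probability.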
App 1: Second Frequency Moment (cont’d)

• Sum tracking: S_{t+1}^{i,j} = S_t^{i,j} + z_t h(𝛼_t)
• Expected total communication: Õ(min{(√k/𝜖²) √n, n})
App 2: Bayesian Linear Regression
• Feature vector x_t ∈ R^d, output y_t = w^T x_t + N(0, 𝛽⁻¹)
• A_t = [x_1, …, x_t]^T
• Prior w ∼ N(m_0, S_0), posterior w ∼ N(m_t, S_t)
App 2: Bayesian Linear Regression (cont’d)

• Posterior mean and precision:
  m_t = S_t (S_0⁻¹ m_0 + 𝛽 A_t^T y_t)
  S_t⁻¹ = S_0⁻¹ + 𝛽 A_t^T A_t
• Sum tracking: S_{t+1}⁻¹ = S_t⁻¹ + 𝛽 x_{t+1} x_{t+1}^T
• Under random permutation input, the expected communication cost = O(d² (√k/𝜖) √n log n)
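For intuition, the scalar (d = 1) special case of the posterior update is a pure running sum, which is what makes it a sum-tracking instance (the helper and the noise-free toy data are illustrative):

```python
def posterior_update(points, m0, s0_inv, beta):
    """Scalar (d = 1) instance of the recursive update: the posterior
    precision is the running sum s_inv += beta * x^2 (the sum-tracking
    component), and the mean is m_t = S_t * (S_0^{-1} m0 + beta * sum x*y)."""
    s_inv, weighted = s0_inv, s0_inv * m0
    for x, y in points:
        s_inv += beta * x * x
        weighted += beta * x * y
    return weighted / s_inv, s_inv

# Noise-free data from y = 2x: the posterior mean approaches the true slope
points = [(x, 2.0 * x) for x in range(1, 6)]
print(posterior_update(points, m0=0.0, s0_inv=1.0, beta=1.0))
```

With x = 1..5 the precision ends at 1 + 55 = 56 and the mean at 110/56 ≈ 1.96, pulled slightly toward the prior mean 0; an 𝜖-approximate tracked precision yields an approximate posterior at the coordinator.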
Summary for Sum Tracking
• Studied the sum tracking problem with non-monotonic distributed streams under random permutation, random i.i.d., and fractional Brownian motion inputs
• Proposed a novel algorithm with nearly optimal communication complexity
• Details: ACM PODS 2012
Outline
• Range Partition – with Fei Xu and Jingren Zhou
• Count Tracking – with Zhenming Liu and Bozidar Radunovic
• Graph Partitioning (def. only) – with Charalampos Tsourakakis and Bozidar Radunovic
Problem
• Partition a graph with two objectives
  – Sparsely connected components
  – Balanced number of vertices per component
• Applications
  – Parallel processing
  – Community detection
Problem (cont’d)

[Diagram: a graph streamed into components 1, 2, 3, …, k]

• Requirements
  – Streaming algorithm
  – Single pass / incremental
  – Efficient computation
• Desired
  – Approximation guarantees
  – Average-case efficiency
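For intuition about the problem setting, here is a generic greedy streaming baseline of the kind the next slide's algorithm is compared against: place each arriving vertex in the component holding most of its already-placed neighbors, subject to a capacity cap for balance. This is not the talk's algorithm (whose details are deferred), and all names are illustrative:

```python
def greedy_stream_partition(vertices, edges, k, capacity):
    """Single-pass heuristic: assign each vertex (in stream order) to the
    component containing most of its already-placed neighbors, among the
    components that still have room; break ties toward smaller components."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    part, sizes = {}, [0] * k
    for v in vertices:
        scores = [sum(1 for u in adj.get(v, ()) if part.get(u) == c)
                  for c in range(k)]
        best = max((c for c in range(k) if sizes[c] < capacity),
                   key=lambda c: (scores[c], -sizes[c]))
        part[v] = best
        sizes[best] += 1
    return part

# Two triangles joined by one edge: a natural 2-way partition
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(greedy_stream_partition(range(6), edges, k=2, capacity=3))
```

On this toy input the heuristic recovers the two triangles as components, cutting a single edge while keeping the component sizes equal, which illustrates the two objectives on the Problem slide.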
Summary for Graph Partitioning
• Designed a streaming algorithm whose average-case performance appears superior to previously proposed online heuristics
• Provable approximation guarantees
• More details available soon