Sampling Based Range Partition for Big Data Analytics
+ Some Extras
Milan Vojnović, Microsoft Research Cambridge, United Kingdom
Joint work with Charalampos Tsourakakis, Bozidar Radunovic, Zhenming Liu, Fei Xu, Jingren Zhou
INQUEST Workshop, September 2012
Big Data Analytics
• Our goal: innovation in the area of algorithms for large-scale computations, to move the frontier of the computer science of big data
• Some figures of scale
  – Peta/tera bytes of online services data processed daily
  – 200M tweets per day (Twitter)
  – 1B content pieces shared per day (Facebook)
  – 8,000 exabytes of global data by 2015 (The Economist)
Research Agenda
[Diagram: database queries, machine learning, and optimization layered over a distributed computing system]
Outline
• Range Partition – with Fei Xu and Jingren Zhou
• Count Tracking – with Zhenming Liu and Bozidar Radunovic
• Graph Partitioning (def. only) – with Charalampos Tsourakakis and Bozidar Radunovic
Range Partition

[Diagram: data items from k sites, e.g. keys 120, 1024, 24, 8, …, routed into m ranges 1-100, 101-250, …, 950-1024]

• Special interest: balanced range partition
Range Partition Requirements
• Given 𝜖, 𝛿, and desired relative partition sizes (p_1, p_2, …, p_m)
• 𝜖-accurate range partition: Q_i ≤ (1 + 𝜖) p_i n for all i, with probability at least 1 − 𝛿
  – Q_i = number of data items assigned to range i
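To make the definition concrete, here is a small check of the 𝜖-accuracy condition for a given set of partition boundaries (the helper name and left-closed boundary convention are illustrative, not from the talk):

```python
import bisect

def is_eps_accurate(data, boundaries, p, eps):
    """Check Q_i <= (1 + eps) * p_i * n for every range i, where range i
    holds the items falling between consecutive sorted boundaries."""
    n = len(data)
    counts = [0] * len(p)
    for x in data:
        counts[min(bisect.bisect_left(boundaries, x), len(p) - 1)] += 1
    return all(q <= (1 + eps) * pi * n for q, pi in zip(counts, p))

# A perfectly balanced split of 1..100 into m = 4 ranges is 0.1-accurate
print(is_eps_accurate(list(range(1, 101)), [25, 50, 75], [0.25] * 4, 0.1))  # True
```

A skewed boundary choice (e.g. a first boundary at 10) would overload the second range and fail the check.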
Two Approaches
• Sampling based methods
  – Take a sample of data items
  – Compute partition boundaries using the sample
• Quantile summary methods
  – At each node compute a local quantile summary
  – Merge at the coordinator node
Related Work
• Sampling based estimation of histograms studied by Chaudhuri, Motwani and Narasayya (ACM SIGMOD 1998)
  – Required sample size: Õ(m/𝜖²)
• Communication cost to draw s samples without replacement (Tirthapura and Woodruff, 2011): O(k log(n/s) / log(1 + k/s))
  – For s ≥ k: O(s log(n/s)); otherwise O(k log(n/s) / log(k/s))
Related Work (cont’d)
• Quantile summaries based approach (Greenwald and Khanna, 2001)
  – Communication cost: Õ(m/𝜖)
• Pros
  – Deterministic guarantee
• Cons
  – Requires sorting of data items
  – Largest frequency of an item must be at most 2𝜖
Problem
• Range partition the data while making one pass through it, with minimal communication between the coordinator and the sites
Sampling Based Method
• Collect t samples and partition using the samples
• Pros
  – Simplicity, scalability
• Cons
  – How many samples to take from each site?
  – Data size imbalance: the number of data input records per machine may differ from one machine to another

[Diagram: sites 1, 2, …, k report samples to a coordinator]
Data Sizes Imbalance
Dataset    | Records | Bytes | Sites
DataSet-1  | 62M     | 150G  | 262
DataSet-2  | 37M     | 25G   | 80
DataSet-3  | 13M     | 0.26G | 1
DataSet-4  | 7M      | 1.2T  | 301
DataSet-5  | 106M    | 7T    | 5652
Origins of Data Sizes Imbalance
• JOIN
  SELECT * FROM A INNER JOIN B ON A.KEY == B.KEY ORDER BY COL
• Lookup table
  – If the record value of column X is in the lookup table, then return the row
• UNPIVOT
  – Input: Col 1 = 1, 2, …; Col 2 = (2, 3), (3, 9, 8, 13), …
  – Output: (1,2), (1,3), (2,3), (2,9), …
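The UNPIVOT example above can be reproduced in a few lines; it also shows why the operator creates imbalance, since each input row can expand into a different number of output records (function name illustrative):

```python
def unpivot(rows):
    """Expand each (key, list-of-values) row into one (key, value)
    pair per value -- the UNPIVOT operation from the slide."""
    return [(key, v) for key, values in rows for v in values]

rows = [(1, [2, 3]), (2, [3, 9, 8, 13])]
print(unpivot(rows))  # [(1, 2), (1, 3), (2, 3), (2, 9), (2, 8), (2, 13)]
```

Two input rows of equal size here yield output groups of size 2 and 4, so a partitioning that was balanced before the operator need not be balanced after it.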
Weighted Sampling Scheme
• SAMPLE: each site reports a random sample of t/k data items and its total number of items
• MERGE: a summary is created by adding each data item from site i n_i times
• PARTITION: use the summary to determine partition boundaries

Note: a site reports its total number of data items only once it is available, i.e. after the site has made one pass through its local data
SAMPLE

[Diagram: each site i = 1, …, k sends (n_i, S_i), with S_i = {a_i,1, a_i,2, …, a_i,t}, to the coordinator]
MERGE

[Diagram: the coordinator merges (n_1, S_1), …, (n_k, S_k) into a single summary S, including n_i replicas of each item from S_i, e.g. S = {…, a_i,2, a_i,2, …, a_i,2, …}]
PARTITION

[Diagram: the empirical CDF of the data summary S, cut to produce ranges 1-5]
Sufficient Sample Size
• Assume 𝜙 < 𝜋𝜖. For sample size

  t ≥ 2(1 − 𝜋 + 𝛼) / [(𝜋𝜖)² (1 − 𝜙/(𝜋𝜖))²] · polylog(1/𝛿, m, n, k)

  the scheme yields an 𝜖-accurate range partition with probability ≥ 1 − 𝛿
• 𝛼 ≥ (1/4) ((1/k) Σ_{i=1}^k (k n_i / n)² − 1)
• 𝜙 = largest frequency of a data value
• 𝜋 = min_i p_i
Constant Factor Imbalance
• Suppose that for some 𝜌 ≥ 1, max_i n_i ≤ 𝜌 n/k
• Then 𝛼 = (𝜌² − 1)/4
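A quick numeric sanity check that, for an allocation satisfying max_i n_i ≤ 𝜌 n/k, the imbalance quantity that 𝛼 must dominate is indeed covered by (𝜌² − 1)/4 (helper name and the example counts are made up):

```python
def alpha_lower_bound(counts):
    """The quantity alpha must dominate:
    (1/4) * ((1/k) * sum((k * n_i / n)^2) - 1)."""
    k, n = len(counts), sum(counts)
    return 0.25 * (sum((k * ni / n) ** 2 for ni in counts) / k - 1)

# A skewed allocation with max_i n_i <= rho * n/k for rho = 2
counts = [200, 100, 100, 100]   # n = 500, n/k = 125, max = 200 <= 2 * 125
print(alpha_lower_bound(counts), (2 ** 2 - 1) / 4)
```

Here the bound evaluates to 0.03, comfortably below (𝜌² − 1)/4 = 0.75, and it equals 0 exactly when all sites hold n/k items.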
Proof Outline
• p_e ≤ Σ_{i=1}^m Pr[q_i > (1 + 𝜖) p_i]
• q_i = Σ_{j=1}^k (n_j / n) q_{i,j} / 𝜈_j
• Large deviation analysis of the error exponent
Performance

[Plot: results for DataSet-1; 100K data records per range, 𝛿 = 0.1]
Performance (cont’d)
[Plots for a = 1, a = 2, a = 4]
• n = 100,000, k = 4, m = 5, 𝜖 = 0.1
• n_1 ≥ n_2 = ⋯ = n_k = n_1/a, a ≥ 1
Summary for Range Partitioning
• Novel weighted sampling scheme
• Provable performance guarantees
• Simple and practical
  – Code transferred to Cosmos
• More info: Sampling Based Range Partition Methods for Big Data Analytics, Vojnović, Xu, Zhou, MSR-TR-2012-18, Mar 2012
Outline
• Range Partition – with Fei Xu and Jingren Zhou
• Count Tracking – with Zhenming Liu and Bozidar Radunovic
• Graph Partitioning (def. only) – with Charalampos Tsourakakis and Bozidar Radunovic
SUM Tracking Problem
• Maintain an estimate Ŝ_t of S_t: (1 − 𝜖) S_t ≤ Ŝ_t ≤ (1 + 𝜖) S_t
• SUM: S_t = Σ_{i≤t} X_i

[Diagram: k sites receive updates X_1, X_2, X_3, X_4, X_5, …]
SUM Tracking
[Plot: S_t over time t, tracked within the envelope (1 − 𝜖) S_t to (1 + 𝜖) S_t]
Applications
• Ex 1: database queries
  SELECT SUM(AdBids) FROM Ads
• Ex 2: iterative solving over input data
  x_{t+1} = x_t + 𝛾 f(x_t, 𝜉_t)
State of the Art
• Count tracking [Huang, Yi and Zhang, 2011]
  – Worst-case input, monotonic sum
  – Expected total communication: O((√k/𝜖) log n) messages
• Lower bound for worst-case input [Arackaparambil, Brody and Chakrabarti, 2009]
  – Expected total communication: Ω(n/k) messages
The Challenge
• Q: What are communication-efficient algorithms for the sum tracking problem with random input streams?
  – Random permutation
  – Random i.i.d.
  – Fractional Brownian motion
Communication Complexity Bounds

• Lower bound: Ω(√(kn)/𝜖)
• Upper bound: O((√(kn)/𝜖) log n)
• Sublinear in n; the “price of non-monotonicity” over the monotonic case: √n log n
Communication Complexity Bounds: Unknown Drift Case

• Input: i.i.d. Bernoulli, P(X_i = 1) = (1 + 𝜇)/2 = 1 − P(X_i = −1), with 𝜇 ∈ [−1, 1] an unknown drift parameter
• Expected total communication: Õ((√k/𝜖) min{1/|𝜇|, √n}) messages
• Generalizes the monotonic case to the constant drift case
Our Tracker Algorithm
• Each site reports to the coordinator upon receiving a value update at time t with probability

  p_t = min{𝛼 n log 𝛽 / (𝜖 S_t)², 1}

• Sync all: whenever the coordinator receives an update from a site, all sites report their current local sums
[Diagram: each site i maintains its local sum S_i; the coordinator maintains S = S_1 + … + S_k; an update X_i at site i triggers a report (M_i = 1) with probability p_t]
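A toy single-process simulation of the reporting idea follows. The exact form of p_t is not fully legible in this transcript, so the rule below (report probability shrinking like 1/(𝜖S_t)², with a made-up constant c) should be read as illustrative of the flavor, not as the talk's algorithm:

```python
import random

def track_sum(stream, k, eps, c=10.0):
    """Toy simulation: each update lands on a random site, which reports
    with probability p_t; every report triggers a sync in which the
    coordinator re-learns all k local sums (counted as k messages)."""
    local = [0.0] * k
    estimate, messages = 0.0, 0
    for x in stream:
        i = random.randrange(k)
        local[i] += x
        s = abs(sum(local))                          # current |S_t|
        p = min(c / max((eps * s) ** 2, 1.0), 1.0)   # decays like 1/(eps*S_t)^2
        if random.random() < p:
            estimate = sum(local)                    # sync all local sums
            messages += k
    return estimate, messages

random.seed(0)
stream = [random.choice([-1, 1]) for _ in range(5000)]
print(track_sum(stream, k=8, eps=0.1))
```

The qualitative effect matches the slide: while |S_t| is small a relative-error guarantee is cheap to maintain (sync often), and as |S_t| grows the reporting probability, and hence the message rate, drops.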
Two Applications
• Second Frequency Moment
• Bayesian Linear Regression
App 1: Second Frequency Moment
• Input: a_t = (𝛼_t, z_t)
• Counter of value i: m_i(t) = Σ_{s≤t: 𝛼_s=i} z_s
• Second frequency moment: F_2(t) = Σ_{i∈[m]} m_i²(t)
• Goal: track F_2(t) within relative accuracy 𝜖 > 0
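The definitions above translate directly into a brute-force computation of F_2 from a stream of (𝛼_t, z_t) pairs (function name and the toy stream are illustrative):

```python
from collections import Counter

def second_frequency_moment(stream):
    """F2(t) = sum_i m_i(t)^2, where m_i(t) is the sum of the
    z_s with alpha_s = i seen so far."""
    m = Counter()
    for alpha, z in stream:
        m[alpha] += z
    return sum(v * v for v in m.values())

stream = [(1, 1), (2, 1), (1, 1), (3, 1), (2, -1)]
print(second_frequency_moment(stream))  # m = {1: 2, 2: 0, 3: 1} -> 5
```

Note that because the z_t can be negative, the counters m_i(t), and hence F_2(t), are non-monotonic; this is exactly what rules out the monotonic count-tracking machinery.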
AMS Sketch

• S_t^{i,j} = Σ_{s≤t} z_s h(𝛼_s) = Σ_{a∈[m]} h(a) m_a(t), with a {−1, +1}-valued hash h
• S_t^i = (1/s_1) Σ_j (S_t^{i,j})²
• Ŝ_t = median_i(S_t^i)
• For s_1 = 16/𝜖² and s_2 = 2 log(1/𝛿), Ŝ_t is within (1 ± 𝜖) F_2(t) w.p. ≥ 1 − 𝛿
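A minimal sketch of the AMS estimator described above, with fully random per-value signs standing in for the 4-wise independent hashing that the formal guarantee needs (all names and parameters are illustrative):

```python
import random
import statistics

def ams_estimate(stream, m, s1, s2, seed=0):
    """AMS sketch: s2 independent groups of s1 counters; each counter
    accumulates sum z_s * h(alpha_s) for a random {-1,+1} hash h;
    square and average within a group, then median across groups."""
    rng = random.Random(seed)
    group_means = []
    for _ in range(s2):
        squares = []
        for _ in range(s1):
            h = {a: rng.choice([-1, 1]) for a in range(m)}
            c = sum(z * h[alpha] for alpha, z in stream)
            squares.append(c * c)
        group_means.append(sum(squares) / s1)
    return statistics.median(group_means)

# 10 distinct values, each with total count 100 -> F2 = 10 * 100^2 = 100000
stream = [(a % 10, 1) for a in range(1000)]
print(ams_estimate(stream, m=10, s1=64, s2=5))
```

Each counter is a ±1-signed sum, so its square is an unbiased estimate of F_2; averaging reduces variance and the median boosts the success probability.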
App 1: Second Frequency Moment (cont’d)

• Sum tracking: S_{t+1}^{i,j} = S_t^{i,j} + z_t h(𝛼_t)
• Expected total communication: Õ(min{(√k/𝜖²) √n, n})
App 2: Bayesian Linear Regression
• Feature vector x_t ∈ R^d, output y_t = w^T x_t + N(0, 𝛽⁻¹)
• A_t = [x_1, …, x_t]^T
• Prior w ∼ N(m_0, S_0), posterior w ∼ N(m_t, S_t)
App 2: Bayesian Linear Regression (cont’d)

• Posterior mean and precision:
  m_t = S_t (S_0⁻¹ m_0 + 𝛽 A_t^T y_t)
  S_t⁻¹ = S_0⁻¹ + 𝛽 A_t^T A_t
• Sum tracking: S_{t+1}⁻¹ = S_t⁻¹ + 𝛽 x_{t+1} x_{t+1}^T
• Under random permutation input, the expected communication cost = O(d² (√k/𝜖) √n log n)
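For intuition, the scalar (d = 1) special case of the posterior update is a pure running sum, which is what makes it a sum-tracking instance (the helper and the noise-free toy data are illustrative):

```python
def posterior_update(points, m0, s0_inv, beta):
    """Scalar (d = 1) instance of the recursive update: the posterior
    precision is the running sum s_inv += beta * x^2 (the sum-tracking
    component), and the mean is m_t = S_t * (S_0^{-1} m0 + beta * sum x*y)."""
    s_inv, weighted = s0_inv, s0_inv * m0
    for x, y in points:
        s_inv += beta * x * x
        weighted += beta * x * y
    return weighted / s_inv, s_inv

# Noise-free data from y = 2x: the posterior mean approaches the true slope
points = [(x, 2.0 * x) for x in range(1, 6)]
print(posterior_update(points, m0=0.0, s0_inv=1.0, beta=1.0))
```

With x = 1..5 the precision ends at 1 + 55 = 56 and the mean at 110/56 ≈ 1.96, pulled slightly toward the prior mean 0; an 𝜖-approximate tracked precision yields an approximate posterior at the coordinator.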
Summary for Sum Tracking
• Studied the sum tracking problem with non-monotonic distributed streams under random permutation, random i.i.d., and fractional Brownian motion inputs
• Proposed a novel algorithm with nearly optimal communication complexity
• Details: ACM PODS 2012
Outline
• Range Partition – with Fei Xu and Jingren Zhou
• Count Tracking – with Zhenming Liu and Bozidar Radunovic
• Graph Partitioning (def. only) – with Charalampos Tsourakakis and Bozidar Radunovic
Problem
• Partition a graph with two objectives
  – Sparsely connected components
  – Balanced number of vertices per component
• Applications
  – Parallel processing
  – Community detection
Problem (cont’d)

[Diagram: a graph streamed into components 1, 2, 3, …, k]

• Requirements
  – Streaming algorithm
  – Single pass / incremental
  – Efficient computation
• Desired
  – Approximation guarantees
  – Average-case efficiency
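For intuition about the problem setting, here is a generic greedy streaming baseline of the kind the next slide's algorithm is compared against: place each arriving vertex in the component holding most of its already-placed neighbors, subject to a capacity cap for balance. This is not the talk's algorithm (whose details are deferred), and all names are illustrative:

```python
def greedy_stream_partition(vertices, edges, k, capacity):
    """Single-pass heuristic: assign each vertex (in stream order) to the
    component containing most of its already-placed neighbors, among the
    components that still have room; break ties toward smaller components."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    part, sizes = {}, [0] * k
    for v in vertices:
        scores = [sum(1 for u in adj.get(v, ()) if part.get(u) == c)
                  for c in range(k)]
        best = max((c for c in range(k) if sizes[c] < capacity),
                   key=lambda c: (scores[c], -sizes[c]))
        part[v] = best
        sizes[best] += 1
    return part

# Two triangles joined by one edge: a natural 2-way partition
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(greedy_stream_partition(range(6), edges, k=2, capacity=3))
```

On this toy input the heuristic recovers the two triangles as components, cutting a single edge while keeping the component sizes equal, which illustrates the two objectives on the Problem slide.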
Summary for Graph Partitioning
• Designed a streaming algorithm whose average-case performance appears superior to previously proposed online heuristics
• Provable approximation guarantees
• More details available soon