Nonlinear Programming Models under Equality Constraints - Intelligent Data Systems Laboratory
Optimizing Joins in a Map-Reduce Environment
EDBT 2010
Presented by Foto Afrati, Jeffrey D. Ullman
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
2010-11-12
Summarized by Jaeseok Myung
Outline
Introduction
2-Way Join vs. Multi-Way Join
Optimization of Multi-Way Joins
Important Special Cases
Experiments
Conclusion
Center for E-Business Technology
Copyright 2010 by CEBT
IDS Lab. Seminar – 2/33
A Model for Cluster Computing
Files: A file is a set of tuples. It is stored in a file system such as GFS
Many processes can read and write a file in parallel
Assumption: infinite supply of processors
Any process (job) can be assigned to any one processor
The Cost Measure for MR Algorithms
The communication cost of a process is the size of the input to
the process
This paper does not count the output size for a process
– The output must be input to at least one other process
– The final output is much smaller than its input
The total communication cost is the sum of the communication
costs of all processes that constitute an algorithm
The elapsed communication cost is defined on the acyclic graph
of processes
Consider a path through this graph, and sum the communication
costs of the processes along that path
The maximum sum, over all paths is the elapsed communication
cost
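The two cost measures above can be illustrated with a small sketch. The process names, input sizes, and graph edges below are hypothetical and chosen only for illustration; the functions assume the process graph is acyclic, as stated above.

```python
from functools import lru_cache

def total_cost(input_size):
    """Total communication cost: sum of the input sizes of all processes."""
    return sum(input_size.values())

def elapsed_cost(input_size, edges):
    """Elapsed communication cost: the largest sum of input sizes
    along any path of the acyclic process graph."""
    successors = {p: [] for p in input_size}
    for src, dst in edges:
        successors[src].append(dst)

    @lru_cache(maxsize=None)
    def best_from(p):
        # Cost of the most expensive path that starts at process p.
        return input_size[p] + max((best_from(q) for q in successors[p]), default=0)

    return max(best_from(p) for p in input_size)

# Hypothetical job: two Map processes feeding two Reduce processes.
sizes = {"map1": 40, "map2": 60, "reduce1": 50, "reduce2": 30}
edges = [("map1", "reduce1"), ("map1", "reduce2"),
         ("map2", "reduce1"), ("map2", "reduce2")]

total = total_cost(sizes)             # 40 + 60 + 50 + 30
elapsed = elapsed_cost(sizes, edges)  # most expensive path: map2 -> reduce1
print(total, elapsed)
```

The output of each process is not counted separately: it shows up as the input size of the process that consumes it.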
In this paper,
We begin an investigation into optimization issues for algorithms
implemented in the MR environment
In particular, we are interested in algorithms that minimize the
total communication cost
We begin the study of 2-way and multi-way joins
We introduce the notion of a “share” for each attribute of the map-key. The product of the shares is a fixed constant k, which is the
number of Reduce processes we shall use to implement the join
The heart of the paper explores how to choose the map-key and
shares to minimize the communication cost
2-Way Join in MapReduce
R(A,B) ⋈ S(B,C)

Input R        Input S
A    B         B    C
a0   b0        b0   c0
a1   b1        b0   c1
a2   b2        b1   c2
…    …         …    …

Map output / Reduce input (key K is the B value, value V is the other attribute tagged with its relation)
K    V
b0   (a0, R)
b0   (c0, S)
b0   (c1, S)
b1   (a1, R)
b1   (c2, S)
…    …

Final output (after Reduce)
A    B    C
a0   b0   c0
a0   b0   c1
a1   b1   c2
…    …    …
2-Way Join in MapReduce
Suppose we use k Reduce processes
The output of any Map process with
key b is sent to the Reduce process
for hash value h(b)
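The routing rule above can be sketched in a few lines. This is a single-process simulation, not a Hadoop job: the function name `two_way_join` is hypothetical, and Python's built-in `hash` stands in for the hash function h.

```python
from collections import defaultdict

def two_way_join(R, S, k, h=hash):
    """Simulate R(A,B) join S(B,C) with k Reduce processes.
    A tuple with join value b is sent to Reduce process h(b) mod k."""
    # One dictionary per Reduce process: b -> (list of a values, list of c values)
    reducers = [defaultdict(lambda: ([], [])) for _ in range(k)]

    # Map phase: key = b, value = the other attribute tagged by its relation.
    for a, b in R:
        reducers[h(b) % k][b][0].append(a)
    for b, c in S:
        reducers[h(b) % k][b][1].append(c)

    # Reduce phase: for each key b, pair every R value with every S value.
    out = []
    for groups in reducers:
        for b, (a_values, c_values) in groups.items():
            out.extend((a, b, c) for a in a_values for c in c_values)
    return out

R = [("a0", "b0"), ("a1", "b1"), ("a2", "b2")]
S = [("b0", "c0"), ("b0", "c1"), ("b1", "c2")]
result = sorted(two_way_join(R, S, k=4))
print(result)
```

Because every tuple with the same B value lands on the same Reduce process, each input tuple is communicated exactly once, whatever value of k is chosen.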
Joining Several Relations at Once
R(A,B) ⋈ S(B,C) ⋈ T(C,D)
[Figure: the inputs R, S, and T go through one Map phase, the Reduce input, and one Reduce phase that produces the final output]
Joining Several Relations at Once
R(A,B)
S(B,C)
T(C,D)
Suppose we use $k = m^2$ Reduce processes for some m
Values of B and C will each be hashed to m buckets
Let h be a hash function with range 1, 2, …, m
Each tuple S(b, c) is sent to the Reduce process (h(b), h(c))
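A minimal simulation of this scheme is sketched below. It assumes the replication rule that goes with the hashing of S: an R tuple, which has no C value, is copied to all m buckets of C, and a T tuple, which has no B value, is copied to all m buckets of B. The function name is hypothetical, Python's built-in `hash` stands in for h, and buckets are numbered 0 to m-1 rather than 1 to m.

```python
from collections import defaultdict

def three_way_join(R, S, T, m, h=hash):
    """Simulate R(A,B) join S(B,C) join T(C,D) on an m x m grid of
    Reduce processes. Returns the joined tuples and the number of
    tuples communicated to the Reduce processes."""
    reducers = defaultdict(lambda: {"R": [], "S": [], "T": []})

    for b, c in S:                       # S(b, c) -> (h(b), h(c)), sent once
        reducers[(h(b) % m, h(c) % m)]["S"].append((b, c))
    for a, b in R:                       # R(a, b) -> (h(b), every C bucket)
        for j in range(m):
            reducers[(h(b) % m, j)]["R"].append((a, b))
    for c, d in T:                       # T(c, d) -> (every B bucket, h(c))
        for i in range(m):
            reducers[(i, h(c) % m)]["T"].append((c, d))

    out = []
    for part in reducers.values():       # each Reduce process joins locally
        for a, b in part["R"]:
            for b2, c in part["S"]:
                if b != b2:
                    continue
                for c2, d in part["T"]:
                    if c == c2:
                        out.append((a, b, c, d))

    communicated = sum(len(v) for part in reducers.values() for v in part.values())
    return out, communicated

R = [("a0", "b0"), ("a1", "b1")]
S = [("b0", "c0"), ("b1", "c1")]
T = [("c0", "d0"), ("c1", "d1"), ("c1", "d2")]
joined, communicated = three_way_join(R, S, T, m=2)
print(sorted(joined), communicated)
```

Each result tuple is produced by exactly one Reduce process, the one that received the matching S tuple, so no duplicate elimination is needed. The communicated count comes out as |S| + m|R| + m|T|.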
Joining Several Relations at Once
R(A,B) ⋈ S(B,C) ⋈ T(C,D)
Let h be a hash function with range 1, 2, …, m
S(b, c) -> (h(b), h(c))
R(a, b) -> (h(b), all)
T(c, d) -> (all, h(c))
Each Reduce process computes the join of the tuples it receives
[Figure: grid of Reduce processes indexed by h(b) = 0, …, 3 and h(c) = 0, …, 3, with $m = 4$ and $k = 4^2 = 16$ Reduce processes; the example marks h(S.b) = 2, h(S.c) = 1, h(R.b) = 2, and h(T.c) = 1]
Joining Several Relations at Once
R(A,B)
S(B,C)
T(C,D)
h(b) = one of { 0, 1, 2, …, 9 }, h(c) = one of { a, b, c, …, z }
The map-key is then one of
{ 0a, 0b, …, 0z, 1a, …, 1z, …, 9z }
For relation S
Each tuple (b, c) is sent once, with the single map-key h(b)h(c)
For relation R
Each tuple (a, b) is replicated to every map-key that starts with h(b): h(b)a, h(b)b, …
For relation T
Each tuple (c, d) is replicated to every map-key that ends with h(c): 0h(c), 1h(c), …
Outline
Introduction
2-Way Join vs. Multi-Way Join
Optimization of Multi-Way Joins
Formalization of the Optimization Problem
General Algorithm for Optimization
Important Special Cases
Experiments
Conclusion
Formalization of the Optimization Problem
R(A,B)
S(B,C)
T(A,C)
The communication cost: rc + sa + tb, where
r, s, t: # of tuples in relations R, S, T
a, b, c: # of buckets for the attributes (shares)
Why?
Consider a tuple (x, y) in relation R(A,B)
R has no attribute C, so (x, y) must be replicated and sent to c different reducers, one for each bucket of C
We must minimize the expression rc+sa+tb subject to the
constraint that abc=k
Each of a, b, and c must be a positive integer
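Because the shares must be positive integers whose product is k, one simple way to see the optimization problem concretely is to enumerate every factorization of k. This brute-force search is only an illustrative sketch and is not the method used in the paper; the function name and the relation sizes in the example are hypothetical.

```python
def best_integer_shares(r, s, t, k):
    """Return (cost, a, b, c) minimizing r*c + s*a + t*b
    over positive integers a, b, c with a*b*c == k."""
    best = None
    for a in range(1, k + 1):
        if k % a:
            continue                    # a must divide k
        for b in range(1, k // a + 1):
            if (k // a) % b:
                continue                # b must divide k / a
            c = k // (a * b)
            cost = r * c + s * a + t * b
            if best is None or cost < best[0]:
                best = (cost, a, b, c)
    return best

# Hypothetical sizes: R is much larger than S and T, so its
# replication factor c should be kept as small as possible.
best = best_integer_shares(r=1000, s=10, t=30, k=8)
print(best)
```

In this example the search settles on c = 1, meaning the large relation R is not replicated at all and the k Reduce processes are split between the shares a and b.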
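Chapter 11 Nonlinear Programming
▶ Nonlinear programming model under equality constraints
A nonlinear model with n decision variables (x1, x2, …, xn) and m equality constraints
Max. (or Min.) f(x1, x2, …, xn)
s. t.
g1(x1, x2, …, xn) = 0
g2(x1, x2, …, xn) = 0
:
gm(x1, x2, …, xn) = 0
Prof. Jin-Gyu Kang, Department of Industrial and Management Engineering, Hanbat National University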
Method of Lagrange multipliers
Lagrange multipliers are introduced into the original model to build a Lagrange function that links the objective function with the equality constraints. This converts the model into an unconstrained nonlinear programming model, whose extrema are then found.
Let λi be the Lagrange multiplier for the i-th constraint. The Lagrange function is
L(x1, x2, …, xn, λ1, λ2, …, λm)
= f(x1, x2, …, xn) + λ1[g1(x1, x2, …, xn)]
+ λ2[g2(x1, x2, …, xn)]
:
+ λm[gm(x1, x2, …, xn)]
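Necessary conditions for the method of Lagrange multipliers under equality constraints
Necessary conditions
For (x1, x2, …, xn) to be an optimal solution of the original model,
the Lagrange function L must satisfy the following conditions.
$\frac{\partial L}{\partial x_j} = 0, \quad j = 1, 2, \ldots, n$
$\frac{\partial L}{\partial \lambda_i} = 0, \quad i = 1, 2, \ldots, m$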
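Example model
Special-equipment production planning problem of S Machinery
• 1,000 units of special equipment are to be manufactured and supplied over the next two years
• The production cost per unit is estimated at 100 (in units of 10,000 won) this year and 80 next year
• If this year's and next year's production quantities differ, an additional cost proportional to the square of the difference is incurred
Let x1 be this year's production quantity and x2 next year's.
The additional cost C(x1, x2) is
$C(x_1, x_2) = \frac{(x_1 - x_2)^2}{100}$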
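Since total cost TC = regular production cost + additional cost,
the nonlinear programming model is
$\min\ TC(x_1, x_2) = 100x_1 + 80x_2 + \frac{(x_1 - x_2)^2}{100}$
s. t.
$x_1 + x_2 = 1{,}000$
Let λ be the Lagrange multiplier. Writing the constraint term with a minus sign (the sign of λ is a matter of convention), the Lagrange function is
$L(x_1, x_2, \lambda) = 100x_1 + 80x_2 + \frac{(x_1 - x_2)^2}{100} - \lambda(x_1 + x_2 - 1{,}000)$
Taking the partial derivatives with respect to x1, x2, and λ and setting each to 0 gives
$\frac{\partial L}{\partial x_1} = 100 + \frac{x_1 - x_2}{50} - \lambda = 0$
$\frac{\partial L}{\partial x_2} = 80 - \frac{x_1 - x_2}{50} - \lambda = 0$
$\frac{\partial L}{\partial \lambda} = -(x_1 + x_2 - 1{,}000) = 0$
• Solving these equations gives x1 = 250, x2 = 750, λ = 90, TC = 87,500 (in units of 10,000 won)
• Confirming that (x1, x2) = (250, 750) actually minimizes the total cost requires the second-order partial derivatives
• Meaning of the Lagrange multiplier λ = 90: at the optimum, producing one more unit of special equipment costs an additional 90 (the counterpart of the dual variable value in LP)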
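The production-planning example can be checked numerically. Its three stationarity conditions are linear in x1, x2, and λ, so a small exact solver is enough. The sketch below assumes the multiplier enters the Lagrange function with a minus sign, which is the convention under which λ comes out as +90; the helper `solve_linear` is a hypothetical name, not a library function.

```python
from fractions import Fraction

def solve_linear(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination over fractions."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Pick a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[-1] for row in M]

f = Fraction(1, 50)
# Unknowns (x1, x2, lam):
#   100 + (x1 - x2)/50 - lam = 0
#    80 - (x1 - x2)/50 - lam = 0
#   x1 + x2 = 1000
A = [[f, -f, -1],
     [-f, f, -1],
     [1, 1, 0]]
b = [-100, -80, 1000]

x1, x2, lam = solve_linear(A, b)
tc = 100 * x1 + 80 * x2 + (x1 - x2) ** 2 / 100
print(x1, x2, lam, tc)
```

This only confirms that the point is stationary; whether it is a minimum still depends on the second-order check.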
Problem Solving
Problem solving using the method of Lagrange Multipliers
Take derivatives with respect to the three variables a, b, c
Multiply the three equations
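The two steps above can be written out as follows. This is a sketch reconstructed from the cost expression rc + sa + tb and the constraint abc = k; the sign convention chosen for the multiplier λ is an assumption.

```latex
\begin{align*}
L &= rc + sa + tb - \lambda(abc - k) \\
\frac{\partial L}{\partial a} &= s - \lambda bc = 0, \quad
\frac{\partial L}{\partial b} = t - \lambda ac = 0, \quad
\frac{\partial L}{\partial c} = r - \lambda ab = 0 \\
sa &= tb = rc = \lambda abc = \lambda k \\
(sa)(tb)(rc) &= rst\,k = \lambda^3 k^3
  \;\Rightarrow\; \lambda = \sqrt[3]{rst/k^2} \\
a &= \frac{\lambda k}{s} = \sqrt[3]{krt/s^2}, \quad
b = \frac{\lambda k}{t} = \sqrt[3]{krs/t^2}, \quad
c = \frac{\lambda k}{r} = \sqrt[3]{kst/r^2} \\
rc + sa + tb &= 3\lambda k = 3\sqrt[3]{krst}
\end{align*}
```

The third line comes from multiplying the three derivative equations by a, b, and c respectively, so each of the three cost terms equals λk at the optimum.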
An Example for Understanding
R(A,B)
S(B,C)
T(A,C)
Example
k = 8 (# of Reduce processes)
r = s = t = 100M (# of tuples in relations)
# of buckets for the attributes (shares)
– $a = \sqrt[3]{krt/s^2} = \sqrt[3]{8} = 2$, b = 2, c = 2
The minimum communication cost
– rc + sa + tb = 600M
Meaning of solutions
– We can determine a, b, c to optimize the communication cost
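The numbers in this example can be reproduced with the closed-form shares. The formula for a is the one shown above; the formulas for b and c are obtained from it by symmetry and are an assumption of this sketch, as is the function name.

```python
def optimal_shares(r, s, t, k):
    """Real-valued shares minimizing r*c + s*a + t*b subject to a*b*c = k."""
    a = (k * r * t / s**2) ** (1 / 3)
    b = (k * r * s / t**2) ** (1 / 3)   # by symmetry with a (assumption)
    c = (k * s * t / r**2) ** (1 / 3)   # by symmetry with a (assumption)
    return a, b, c

k = 8
r = s = t = 100_000_000                 # 100M tuples in each relation

a, b, c = optimal_shares(r, s, t, k)
cost = r * c + s * a + t * b
print(a, b, c, cost)                    # shares close to 2, cost close to 600M
```

The shares are computed in floating point, so they are approximate. When the relation sizes differ, the results are generally not integers and have to be rounded to integer shares whose product is k.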
General Algorithm for Optimization
Questions
How can we select the map-key attributes?
– Dominated Attributes
What is the best # of buckets for each attribute?
– Lagrange Multiplier Methods
You can read section 3 of the paper
http://infolab.stanford.edu/~ullman/pub/join-mr.pdf
Outline
Introduction
2-Way Join vs. Multi-Way Join
Optimization of Multi-Way Joins
Important Special Cases
Star Join
Chain Join
Experiments
Conclusion
Special Cases
Star Joins
There is a fact table joined with several dimension tables
– Fact table F: F(A1, A2, …, An)
– Dimension tables Di: Di(Ai, Bi)
Chain Joins
A chain join is a join of the form R1(A0, A1) ⋈ R2(A1, A2) ⋈ … ⋈ Rn(An-1, An)
Star Joins
Example
k = abcd
Star Joins
Subtracting the equations from one another gives
sbcd = tacd = uabd = vabc
Dividing through by abcd gives
s/a = t/b = u/c = v/d
We can use these equations to solve for b, c, and d in terms of a
b = at/s, c = au/s, d = av/s
$k = a^4 tuv / s^3$ because k = abcd
$a = \sqrt[4]{ks^3/tuv}$, $b = \sqrt[4]{kt^3/suv}$, $c = \sqrt[4]{ku^3/stv}$, $d = \sqrt[4]{kv^3/stu}$
Generalization
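One way to state the general case is sketched below. It follows from the equal ratios s/a = t/b = u/c = v/d together with the constraint that the shares multiply to k: for n dimension tables of sizes d1, …, dn, the share of the i-th table's attribute is di · (k / (d1 d2 … dn))^(1/n). This closed form is derived here from the four-table case and is an assumption of the sketch, as are the function name and the example sizes.

```python
import math

def star_join_shares(dim_sizes, k):
    """Real-valued shares for a star join with n dimension tables.
    Each share is proportional to the size of its dimension table,
    and the shares multiply to k."""
    n = len(dim_sizes)
    scale = (k / math.prod(dim_sizes)) ** (1 / n)
    return [d * scale for d in dim_sizes]

# Hypothetical dimension-table sizes s, t, u, v and number of reducers k.
s, t, u, v = 10, 20, 40, 80
k = 16
shares = star_join_shares([s, t, u, v], k)
print(shares)
```

With four tables the first share agrees with the fourth-root formula for a given above. The resulting shares are generally fractional, so in practice they would have to be rounded to integers whose product is k.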
Experimental Settings
Multi-node cluster composed of 4 PCs
Debian GNU/Linux
3.0GHz dual-core CPU, 1GB RAM, 160GB HDD
1Gbps LAN
Tuning Hadoop Parameters
# of Reduce processes : 100
HDFS block size (max. size of each input split) : 128MB
Test Data Sets
Sizes of data sets, intermediate relations, and output
(unit: 1 million tuples)
Test Results
Processing times for the two methods
Conclusion
Proposed an algorithm for multi-way join that optimizes the
communication cost
How can we select the map-key attributes?
– Dominated Attributes
What is the best # of buckets for each attribute?
– Lagrange Multiplier Methods
Examined the algorithm with two common kinds of joins
Star-join
Chain-join