
5th International Workshop on Algorithms,
Models and Tools for Parallel Computing on
Heterogeneous Networks (HeteroPar'06)
Self-Adapting Scheduling for
Tasks with Dependencies
in Stochastic Environments
Ioannis Riakiotakis, Florina M. Ciorba, Theodore
Andronikos and George Papakonstantinou
National Technical University of Athens, Greece
[email protected]
www.cslab.ece.ntua.gr
Talk outline

• The problem
• Inter-slave synchronization mechanism
• Overview of the Distributed Trapezoid Self-Scheduling algorithm
• Self-Adapting Scheduling
• Stochastic environment modeling
• Results
• Conclusions & future work
2
The problem we study

• Scheduling problems with task dependencies
• Target systems: stochastic systems, i.e., real non-dedicated heterogeneous systems with fluctuating load
• Approach: adaptive dynamic load balancing
3
Algorithmic model & notations

for (i1=l1; i1<=u1; i1++) {
  ...
  for (in=ln; in<=un; in++) {
    S1(I);      /* loop body */
    ...
    Sk(I);
  }
  ...
}

• Perfectly nested loops
• Constant flow data dependencies
• General program statements within the loop body
• J – index space of an n-dimensional uniform dependence loop:
  J = {j ∈ N^n | lr ≤ ir ≤ ur, 1 ≤ r ≤ n}
• DS = {d1, ..., dp}, p ≥ n – set of dependence vectors
4
More notations

• P1, ..., Pm – slaves
• VPk – virtual computing power of slave Pk
• Σk=1..m VPk – total virtual computing power of the system
• qk – number of processes/jobs in the run-queue of slave Pk, reflecting its total load
• Ak = VPk / qk – available computing power of slave Pk
• Σk=1..m Ak – total available computing power of the system
5
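Assuming plain C (the talk's implementation language), the power notation above can be sketched as follows; the function names are mine, not the authors', and treating an empty run-queue as load 1 is my assumption:

```c
/* Available computing power of a slave: Ak = VPk / qk.
 * Treating qk < 1 as 1 is an assumption (the chunk itself
 * still occupies one run-queue slot). */
double available_power(double vp, int q) {
    return vp / (q > 0 ? q : 1);
}

/* Total available computing power of an m-slave system:
 * the sum of Ak over k = 1..m. */
double total_available_power(const double *vp, const int *q, int m) {
    double sum = 0.0;
    for (int k = 0; k < m; k++)
        sum += available_power(vp[k], q[k]);
    return sum;
}
```

For example, a slave with VPk = 1 and two jobs in its run-queue contributes Ak = 0.5 to the system total.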
Synchronization mechanism (1)
(IPDPS’06)

• u1 is called the synchronization dimension, noted us; synchronization points are introduced along us
• u2 is called the scheduling dimension, noted uc; chunks are formed along uc
• Ci – chunk size at the i-th scheduling step
• Vi – projection of Ci along the scheduling dimension uc

6
Synchronization mechanism (2)

• Current slave – the slave assigned chunk Ci
• Previous slave – the slave assigned chunk Ci-1
• SPj – synchronization point
• M – the number of SPs along the synchronization dimension us
• H – the interval between two SPs; H is the same for every chunk
• SCi,j – set of iterations of chunk Ci between SPj-1 and SPj
7
Synchronization mechanism (3)

[Figure: chunks Ci-1, Ci, Ci+1 on slaves Pk-1, Pk, Pk+1, with synchronization points SPj, SPj+1, SPj+2; shaded regions mark the sets of points computed at moments t and t+1 and the communication sets SCi-1,j+1 and SCi,j+1; arrows indicate communication]

• Ci-1 is assigned to Pk-1, Ci to Pk and Ci+1 to Pk+1
• When Pk reaches SPj+1, it sends to Pk+1 only the data Pk+1 requires (i.e., those iterations imposed by the existing dependence vectors)
• Afterwards, Pk receives from Pk-1 the data required for the current computation
• Slaves do not reach an SP at the same time → a wavefront execution fashion
• H should be chosen so as to maintain the comm/comp ratio < 1
8
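As a rough illustration of the bookkeeping above (helper names are hypothetical, not from the talk): with interval H along us, a chunk meets M = |us| / H synchronization points, and each segment SCi,j of a chunk with projection Vi holds Vi × H iterations.

```c
/* Number of synchronization points M along the synchronization
 * dimension, given its length |us| and the interval H.
 * Assumes H divides |us| evenly, as in the experiments (H = 100). */
int num_sync_points(int us_len, int H) {
    return us_len / H;
}

/* Iterations in one segment SCi,j of a chunk whose projection
 * along the scheduling dimension is Vi. */
long segment_iterations(int Vi, int H) {
    return (long)Vi * H;
}
```

A larger H means fewer, larger messages (lower communication overhead) but coarser wavefront overlap; the comm/comp < 1 rule bounds the choice.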
Overview of Distributed Trapezoid
Self-Scheduling (DTSS) Algorithm
(IPDPS’06)

• Divides the scheduling dimension into decreasing chunks
• First chunk is F = |uc| / (2 × Σk=1..m Ak), where:
    |uc| – the size of the scheduling dimension
    Σk=1..m Ak – the total available computing power of the system
• Last chunk is L = 1
• N = (2 × |uc|) / (F + L) – the number of scheduling steps
• D = (F − L) / (N − 1) – chunk decrement
• Ci = Ak × [F − D × (Sk-1 + (Ak − 1)/2)], where Sk-1 = A1 + … + Ak-1
• DTSS selects chunk sizes based on:
    the virtual computing power of a processor, VPk
    the number of processes in the run-queue of each processor, qk
9
Self-Adapting Scheduling – SAS (1)

• SAS is a self-scheduling scheme
• It is NOT a decrease-chunk algorithm
• Built upon the master-slave model
• Each chunk size is computed based on:
    the history of computation times of previous chunks on the particular slave
    the history of jobs in the run-queue of the particular slave
    the current number of jobs in the run-queue of the particular slave
• Targeted at stochastic systems
10
Self-Adapting Scheduling – SAS (2)

• Terminology used with SAS:
    Vk1, …, Vkj – the sizes of the first j chunks assigned to Pk, and tk1, …, tkj – their computation times
    Vk1 > … > Vkj, and Vk1 = 1
    qk1, …, qkj – the number of jobs in the run-queue of Pk when assigned its first j chunks
    μkj – the average time/iteration for the first j chunks of Pk
    t̂kj+1 – the estimated computation time for executing its (j+1)-th chunk
    tRef – the execution time of the first chunk of the problem (reference time); all processors are expected to compute their chunks within tRef
11
SAS – description (1)

• Master side:
Initialization:
(M.a) Register slaves; store each reported VPk and initial load qk1.
(M.b) Sort slaves in decreasing order of their VPk, considering VP1 = 1; assign the first chunk, Vk1, to each slave.
While there are unassigned iterations do:
(M.1) Receive a request from Pk and store its reported qkj+1 and tkj.
(M.2) Determine Vkj+1 = (tRef / t̂kj+1) × Vkj (# iterations), where
      t̂kj+1 = μkj × qkj+1 × Vkj (sec)
      μkj = (Σl=1..j tkl) / (Σl=1..j qkl × Vkl) (sec / (# iterations × # jobs))
* If Pk took more than tRef to compute its j-th chunk, its (j+1)-th chunk will be decreased, and vice versa.
12
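Step (M.2) above can be sketched in C as follows; the function names are assumptions, and the rounding to whole iterations and the minimum chunk of 1 are my choices, not stated in the talk:

```c
/* Average time per (iteration x job) over the first j chunks of
 * slave Pk: mu = (sum of t_l) / (sum of q_l * V_l), l = 1..j. */
double sas_mu(const double *t, const int *q, const int *V, int j) {
    double tsum = 0.0, qvsum = 0.0;
    for (int l = 0; l < j; l++) {
        tsum  += t[l];
        qvsum += (double)q[l] * V[l];
    }
    return tsum / qvsum;
}

/* Next chunk for Pk: estimate the time its current chunk size would
 * take under the newly reported load q_next, then rescale so the
 * (j+1)-th chunk finishes in about tRef. */
int sas_next_chunk(const double *t, const int *q, const int *V,
                   int j, int q_next, double tRef) {
    double mu    = sas_mu(t, q, V, j);
    double t_hat = mu * q_next * V[j - 1];    /* estimated time (sec) */
    double Vnext = (tRef / t_hat) * V[j - 1]; /* iterations */
    if (Vnext < 1.0) return 1;                /* at least one iteration */
    return (int)(Vnext + 0.5);                /* round to nearest */
}
```

This reproduces the note above: if the slave's load doubles (q_next twice as large), the estimated time doubles and the next chunk is halved.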
SAS – description (2)

• Slave side:
Initialization: Register with the master; report VPk and initial load qk1.
(S.1) Send a request for work to the master; report the current load (qkj+1) and the time (tkj) spent completing the previous chunk.
(S.2) Wait for reply/work.
      If there is no more work to do, terminate.
      Else receive the size of the next chunk (Vkj+1) and compute it.
(S.3) Exchange data at SPs as described in slide 7.
(S.4) Measure the completion time tkj+1 for chunk Vkj+1.
(S.5) Go to (S.1).

13
Stochastic Environment Modeling (1)

• Existing real non-dedicated systems have fluctuating load
• Fluctuating load is non-deterministic → stochastic process modeling
• The inter-arrival time of incoming foreign jobs is considered to be exponentially distributed (λ – arrival rate)
• The lifetime of incoming foreign jobs is considered to be exponentially distributed (μ – service rate)
14
Stochastic Environment Modeling (2)

[Figure: timeline of system load fluctuation on one slave running a parallel job while foreign jobs arrive and depart. Three regimes are shown against tRef: slow (inter-arrival time > tRef), medium (inter-arrival time ~ tRef, with λ ~ μ, i.e., inter-arrival time ~ service time) and fast (inter-arrival time < tRef); the run-queue length seen at each chunk assignment varies accordingly, e.g., qkj = 0+1, 1+1 or 2+1]
15
Implementation and testing setup

• The algorithms are implemented in C and C++
• MPI is used for master-slave and inter-slave communication
• The heterogeneous system consists of 7 dual-node machines (12+2 processors):
    3 Intel Pentium III machines, 1266 MHz with 1GB RAM (called zealots), assumed to have VPk = 1
    4 Intel Pentium III machines, 800 MHz with 256MB RAM (called kids), assumed to have VPk = 0.5 (one of them is the master)
• The interconnection network is Fast Ethernet at 100 Mbit/sec
• One real-life application: Floyd-Steinberg error-dithering computation
• The synchronization interval is H = 100
16
Results with fast, medium and slow
load fluctuations

[Figure: three plots of parallel times (sec) vs. number of processors (4–12) for the Floyd-Steinberg application with exponentially distributed foreign load. Panels: fast (SAS_Fast vs DTSS_Fast), medium (SAS_Medium vs DTSS_Medium) and slow (SAS_Slow vs DTSS_Slow), each also showing the unloaded runs SAS_Unloaded and DTSS_Unloaded]
17
Conclusions

• Loops with dependencies can now be dynamically scheduled on stochastic systems
• Adaptive load balancing algorithms efficiently compensate for the system’s heterogeneity and foreign load fluctuations for loops with dependencies
18
Future work

• Establish a model for predicting the optimal synchronization interval H and minimizing the communication
• Model the foreign load with other probability distributions and analyze the relationship between distribution type and performance
19
Thank you!
Questions?
20