A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches
Martin Burtscher (Department of Computer Science, Texas State University-San Marcos)
Hassan Rabeti (Department of Mathematics, Texas State University-San Marcos)
Problem: HPC is Hard to Exploit
HPC application writers are domain experts
They are typically not computer scientists and have little or no formal education in parallel programming
Parallel programming is difficult and error prone
Modern HPC systems are complex
They consist of interconnected compute nodes with multiple CPUs and one or more GPUs per node
They require parallelization at multiple levels (inter-node, intra-node, and accelerator) for best performance
Target Area: Iterative Local Searches
Important application domain
Widely used in engineering & real-time environments
Examples
All sorts of random restart greedy algorithms
Ant colony opt, Monte Carlo, n-opt hill climbing, etc.
ILS properties
Iteratively produce better solutions
Can exploit large amounts of parallelism
Often have an exponential search space (see the sketch below)
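To make the ILS pattern concrete, here is a minimal random-restart hill-climbing sketch in C for a toy objective (minimize (x-3)^2); the objective and step size are illustrative stand-ins, not taken from the paper. Each seed yields an independent restart, which is exactly the parallelism ILCS exploits.

#include <stdio.h>
#include <stdlib.h>

/* Toy objective; a stand-in for a real ILS objective such as tour length. */
static double f(double x) { return (x - 3.0) * (x - 3.0); }

/* Simple hill climbing: move toward the better neighbor until stuck. */
static double hill_climb(double x)
{
    const double step = 0.1;
    while (f(x - step) < f(x) || f(x + step) < f(x))
        x = (f(x - step) < f(x + step)) ? x - step : x + step;
    return x;
}

int main(void)
{
    double best_x = 0.0, best_q = f(0.0);
    for (long seed = 0; seed < 1000; seed++) {   /* independent restarts */
        srand((unsigned)seed);
        double start = rand() / (double)RAND_MAX * 20.0 - 10.0;
        double x = hill_climb(start);            /* improve until stuck */
        if (f(x) < best_q) { best_q = f(x); best_x = x; }  /* keep champion */
    }
    printf("best x = %.2f, f(x) = %.4f\n", best_x, best_q);
    return 0;
}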
Our Solution: ILCS Framework
Iterative Local Champion Search (ILCS) framework
Supports non-random restart heuristics
Genetic algorithms, tabu search, particle swarm opt, etc.
Simplifies implementation of ILS on parallel systems
Design goals
Ease of use and scalability
Framework benefits
Handles threading, communication, locking, resource allocation, heterogeneity, load balance, termination decision, and result recording (checkpointing)
User Interface
User writes 3 serial C functions and/or 3 single-GPU CUDA functions with some restrictions

size_t CPU_Init(int argc, char *argv[]);
void CPU_Exec(long seed, void const *champion, void *result);
void CPU_Output(void const *champion);
See paper for GPU interface and sample code
Framework runs Exec (map) functions in parallel
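As an illustration only, here is what the three CPU functions might look like for a toy problem. The exact ILCS contract (e.g., that CPU_Init returns the size of a solution in bytes, or how the framework ranks results) is specified in the paper, so treat the assumptions below as hypothetical.

#include <stdio.h>
#include <string.h>

typedef struct { long value; } Solution;   /* toy solution: bigger is better */

/* Assumed contract: parse arguments and return the solution size in bytes. */
size_t CPU_Init(int argc, char *argv[])
{
    (void)argc; (void)argv;                /* no setup needed for this toy */
    return sizeof(Solution);
}

/* Evaluate one seed: derive a candidate (a real code would run a local
   search here) and write it into result; champion points to the best
   solution found so far, enabling non-random restart heuristics. */
void CPU_Exec(long seed, void const *champion, void *result)
{
    Solution cand;
    cand.value = (seed * 2654435761L) & 0xffffL;   /* stand-in "search" */
    if (champion != NULL) {
        const Solution *ch = (const Solution *)champion;
        if (ch->value > cand.value) cand.value = ch->value;
    }
    memcpy(result, &cand, sizeof(cand));
}

/* Print the final champion. */
void CPU_Output(void const *champion)
{
    printf("best value: %ld\n", ((const Solution *)champion)->value);
}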
Internal Operation: Threading
[Diagram: ILCS threading structure; Fm = master function, Fc = user CPU code, Fg = user GPU code, h = GPU handler]
The ILCS master thread starts first and forks one worker thread per CPU core and one handler thread per GPU. The CPU workers repeatedly evaluate seeds with the user's CPU code and record their local optima. The GPU handlers launch the user's GPU code, sleep while the GPU worker threads evaluate seeds and record their local optima, and then record the results. The master/communication thread sporadically determines the global optimum via MPI and sleeps the rest of the time. A simplified sketch of this fork/evaluate/record pattern follows below.
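The structure above can be illustrated with a minimal pthreads sketch. This is not the ILCS implementation (which also spawns GPU handlers and an MPI communication thread); it only shows the fork/evaluate/record pattern under simplified, hypothetical assumptions (worker_main, evaluate, and the seed ranges are invented for the example).

#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4

/* Shared champion guarded by a mutex (hypothetical simplification). */
static long champion = -1;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the user's CPU_Exec: quality of a seed. */
static long evaluate(long seed) { return seed % 1000; }

/* Each worker evaluates a private range of seeds and records its
   local optimum into the shared champion at the end. */
static void *worker_main(void *arg)
{
    long id = (long)arg, local_best = -1;
    for (long seed = id * 1000; seed < (id + 1) * 1000; seed++) {
        long q = evaluate(seed);
        if (q > local_best) local_best = q;
    }
    pthread_mutex_lock(&lock);
    if (local_best > champion) champion = local_best;  /* record local opt */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t workers[NUM_WORKERS];
    /* Master forks one worker per core (here: a fixed count). */
    for (long i = 0; i < NUM_WORKERS; i++)
        pthread_create(&workers[i], NULL, worker_main, (void *)i);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);
    printf("champion quality: %ld\n", champion);
    return 0;
}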
Internal Operation: Seed Distribution
[Diagram: example seed distribution for 4 nodes, each with 4 CPU cores (a, b, c, d) and 2 GPUs (1, 2)]
Each node receives a contiguous chunk of the 64-bit seed range (0, 1, 2, ..., 2^64-1). Within its chunk, the CPU threads process seeds bottom up (one seed per thread at a time) while the GPUs process seeds top down (a strided range of seeds per GPU at a time); a code sketch of this chunking follows the benefits below.
Benefits
Balanced workload irrespective of the number of CPU cores or GPUs (or their relative performance)
Users can generate other distributions from the seeds
Any injective mapping results in no redundant evaluations
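A sketch in plain C of the chunking just described; node_chunk is a hypothetical helper, and the comments indicate how the CPUs and GPUs would walk each chunk from opposite ends.

#include <stdint.h>
#include <stdio.h>

/* Split the 64-bit seed space into one contiguous chunk per node. */
static void node_chunk(int node, int num_nodes, uint64_t *lo, uint64_t *hi)
{
    uint64_t chunk = UINT64_MAX / num_nodes;   /* approximate chunk size */
    *lo = (uint64_t)node * chunk;
    *hi = (node == num_nodes - 1) ? UINT64_MAX : *lo + chunk - 1;
}

int main(void)
{
    uint64_t lo, hi;
    for (int node = 0; node < 4; node++) {
        node_chunk(node, 4, &lo, &hi);
        /* CPUs consume seeds lo, lo+1, ... (bottom up);
           GPUs consume seeds hi, hi-1, ... (top down, strided per GPU).
           The workload balances wherever the two fronts meet. */
        printf("node %d: seeds %llu .. %llu\n", node,
               (unsigned long long)lo, (unsigned long long)hi);
    }
    return 0;
}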
Related Work
MapReduce/Hadoop/MARS and PADO
Their generality and features that ILS does not need incur overhead and steepen the learning curve
Some do not support accelerators; some require Java
The ILCS framework is optimized for ILS applications
Reduction is provided; it does not require multiple keys, needs no secondary storage to buffer data, directly supports non-random restart heuristics, allows early termination, works with GPUs and MICs, and targets everything from single-node workstations to HPC clusters
Evaluation Methodology
Three HPC systems (at TACC and NICS)

system     compute nodes    CPUs   CPU cores   CPU clock   GPUs   GPU cores   GPU clock
Keeneland            264     528       4,224     2.6 GHz    792     405,504     1.3 GHz
Ranger             3,936  15,744      62,976     2.3 GHz    n/a         n/a         n/a
Stampede           6,400  12,800     102,400     2.7 GHz   128*         n/a         n/a

Largest tested configuration

system     compute nodes   total CPUs   total GPUs   total CPU cores   total GPU cores
Keeneland            128          256          384             2,048           196,608
Ranger             2,048        8,192            0            32,768                 0
Stampede           1,024        2,048            0            16,384                 0
Sample ILS Codes
Traveling Salesman Problem (TSP)
Find the shortest tour
4 inputs from TSPLIB
2-opt hill climbing (sketched below)

Finite State Machine (FSM)
Find the best FSM configuration to predict hit/miss events
4 sizes (n = 3, 4, 5, 6)
Monte Carlo method
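A minimal sketch of the 2-opt move evaluation at the heart of the TSP code; the coordinates and tour are toy data, and dist is a hypothetical simplification (the following slides note that the real CPU code pre-computes distances in an O(n^2) table while the GPU code re-computes them in O(n) memory).

#include <stdio.h>
#include <math.h>

#define N 5  /* number of cities */

static const double cx[N] = {0, 1, 2, 2, 0};
static const double cy[N] = {0, 2, 1, 0, 1};

/* Euclidean distance, re-computed on the fly (GPU-style O(n) memory). */
static double dist(int a, int b)
{
    return sqrt((cx[a]-cx[b])*(cx[a]-cx[b]) + (cy[a]-cy[b])*(cy[a]-cy[b]));
}

/* Gain of the 2-opt move that reverses the tour segment between edges
   (i,i+1) and (j,j+1); a positive gain means a shorter tour. */
static double two_opt_gain(const int *tour, int i, int j)
{
    int a = tour[i], b = tour[(i + 1) % N];
    int c = tour[j], d = tour[(j + 1) % N];
    return (dist(a, b) + dist(c, d)) - (dist(a, c) + dist(b, d));
}

int main(void)
{
    int tour[N] = {0, 1, 2, 3, 4};
    /* Evaluate all 2-opt moves once; a hill climber would apply the best
       improving move and repeat until no move improves.  The j bound
       skips the degenerate pair (0, N-1), whose edges share a city. */
    for (int i = 0; i < N - 1; i++)
        for (int j = i + 2; j < N - (i == 0); j++)
            printf("move (%d,%d): gain %.3f\n", i, j, two_opt_gain(tour, i, j));
    return 0;
}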
FSM Transitions/Second Evaluated
[Chart: FSM transitions evaluated per second (trillions) for 3-, 4-, 5-, and 6-bit FSMs on Keeneland, Ranger, and Stampede]
Peak rate: 21,532,197,798,304 transitions/s. The 6-bit FSM runs into the GPU shared-memory limit. Ranger uses twice as many cores as Stampede.
TSP Tour-Changes/Second Evaluated
[Chart: TSP moves evaluated per second (trillions) for kroE100, ts225, rat575, and d1291 on Keeneland, Ranger, and Stampede]
Peak rate: 12,239,050,704,370 moves/s (counted relative to the serial CPU code). The GPU code re-computes distances and needs only O(n) memory; the CPU code pre-computes them in O(n^2) memory. Each core evaluates a tour change every 3.6 cycles.
TSP Moves/Second/Node Evaluated
[Chart: TSP moves evaluated per second per node (billions) for kroE100, ts225, rat575, and d1291 on Keeneland, Ranger, and Stampede]
The GPUs provide >90% of the performance on Keeneland.
ILCS Scaling on Ranger (FSM)
[Chart: FSM transitions evaluated per second (billions) versus compute nodes on Ranger, for 3- to 6-bit FSMs, log-log scale]
>99% parallel efficiency on 2048 nodes; the other two systems behave similarly.
ILCS Scaling on Ranger (TSP)
[Chart: TSP moves evaluated per second (billions) versus compute nodes on Ranger, for kroE100, ts225, rat575, and d1291, log-log scale]
>95% parallel efficiency on 2048 nodes; longer runs scale even better.
Intra-Node Scaling on Stampede (TSP)
[Chart: TSP moves evaluated per second (billions) versus 1 to 16 worker threads on Stampede, for kroE100, ts225, rat575, and d1291]
>98.9% parallel efficiency on 16 threads; the framework overhead is very small.
Tour Quality Evolution (Keeneland)
[Chart: deviation from the optimal tour length versus search step (1-29) on Keeneland, for kroE100, ts225, rat575, and d1291]
Quality depends on chance: ILS provides a good solution quickly, then progressively improves it.
Tour Quality after 6 Steps (Stampede)
[Chart: deviation from the optimal tour length after 6 steps versus compute nodes (1-1024) on Stampede, for kroE100, ts225, rat575, and d1291]
Larger node counts typically yield better results faster.
Summary and Conclusions
ILCS Framework
Automatic parallelization of iterative local searches
Provides MPI, OpenMP, and multi-GPU support
Checkpoints the currently best solution every few seconds
Scales very well (decentralized design)
Evaluation
2-opt hill climbing (TSP) and Monte Carlo method (FSM)
AMD + Intel CPUs, NVIDIA GPUs, and Intel MICs
ILCS source code is freely available
http://cs.txstate.edu/~burtscher/research/ILCS/
Work supported by NSF, NVIDIA and Intel; resources provided by TACC and NICS