A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches

Martin Burtscher (Department of Computer Science, Texas State University-San Marcos)
Hassan Rabeti (Department of Mathematics, Texas State University-San Marcos)

Problem: HPC is Hard to Exploit

- HPC application writers are domain experts
  - They are typically not computer scientists and have little or no formal education in parallel programming
  - Parallel programming is difficult and error prone
- Modern HPC systems are complex
  - They consist of interconnected compute nodes with multiple CPUs and one or more GPUs per node
  - They require parallelization at multiple levels (inter-node, intra-node, and accelerator) for best performance

Target Area: Iterative Local Searches

- Important application domain
  - Widely used in engineering and real-time environments
- Examples
  - All sorts of random-restart greedy algorithms
  - Ant colony optimization, Monte Carlo, n-opt hill climbing, etc.
- ILS properties (see the sketch after this list)
  - Iteratively produce better solutions
  - Can exploit large amounts of parallelism
  - Often have an exponential search space

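To make the ILS pattern concrete, here is a minimal, self-contained sketch of a random-restart hill climber on a toy 1-D objective. It is illustrative only, not code from this talk; the objective and all names are invented for the example.

    /* random-restart hill climbing on f(x) = (x - 3)^2 */
    #include <stdio.h>
    #include <stdlib.h>

    static double f(double x) { return (x - 3.0) * (x - 3.0); }

    int main(void)
    {
      double best_x = 0.0, best_f = 1e300;
      for (unsigned seed = 0; seed < 64; seed++) {    /* restarts are independent: */
        srand(seed);                                  /* this independence is the  */
        double x = (rand() % 2000 - 1000) / 100.0;    /* parallelism ILS exposes   */
        double step = 1.0;
        while (step > 1e-6) {                         /* local search: hill climb  */
          if      (f(x + step) < f(x)) x += step;
          else if (f(x - step) < f(x)) x -= step;
          else step *= 0.5;                           /* stuck: refine step size   */
        }
        if (f(x) < best_f) { best_f = f(x); best_x = x; }  /* keep the champion */
      }
      printf("champion: x = %.6f, f = %g\n", best_x, best_f);
      return 0;
    }
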
Our Solution: ILCS Framework

- Iterative Local Champion Search (ILCS) framework
  - Supports non-random-restart heuristics
    - Genetic algorithms, tabu search, particle swarm optimization, etc.
  - Simplifies the implementation of ILS on parallel systems
- Design goals
  - Ease of use and scalability
- Framework benefits
  - Handles threading, communication, locking, resource allocation, heterogeneity, load balancing, termination decision, and result recording (checkpointing)

User Interface

- The user writes 3 serial C functions and/or 3 single-GPU CUDA functions with some restrictions:

    size_t CPU_Init(int argc, char *argv[]);
    void CPU_Exec(long seed, void const *champion, void *result);
    void CPU_Output(void const *champion);

- See the paper for the GPU interface and sample code
- The framework runs the Exec (map) functions in parallel (a hedged example follows below)

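As an illustration, here is a hedged sketch of the three user functions for the same toy objective, f(x) = (x - 3)^2. Only the three signatures come from the slide; everything else is an assumption, in particular that CPU_Init returns the size in bytes of a solution record and that the framework selects the champion by comparing the results that CPU_Exec writes.

    #include <stdio.h>

    typedef struct { double x, cost; } Sol;  /* hypothetical solution record */

    static double f(double x) { return (x - 3.0) * (x - 3.0); }  /* toy objective */

    size_t CPU_Init(int argc, char *argv[])
    {
      (void)argc; (void)argv;   /* a real ILS code would read its input here */
      return sizeof(Sol);       /* assumed: bytes needed per solution record */
    }

    void CPU_Exec(long seed, void const *champion, void *result)
    {
      (void)champion;           /* a restart heuristic may ignore the champion; */
                                /* a GA or tabu search would refine it instead  */
      double x = (double)(seed % 2001 - 1000) / 100.0;  /* seeded start point */
      double step = 1.0;
      while (step > 1e-6) {     /* simple hill climbing to a local optimum */
        if      (f(x + step) < f(x)) x += step;
        else if (f(x - step) < f(x)) x -= step;
        else step *= 0.5;
      }
      ((Sol *)result)->x = x;       /* assumed: the framework keeps whichever */
      ((Sol *)result)->cost = f(x); /* result has the lowest cost             */
    }

    void CPU_Output(void const *champion)
    {
      Sol const *c = (Sol const *)champion;
      printf("best x = %f (cost %g)\n", c->x, c->cost);
    }
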
Internal Operation: Threading

[Diagram] On each node, the ILCS master thread forks one worker thread per CPU core and one handler thread per GPU. The CPU workers repeatedly run the user's CPU code to evaluate seeds and record their local optimum. The GPU handlers launch the user's GPU code, sleep while it runs, and record the result; on each GPU, the GPU worker threads evaluate seeds and record the local optimum. The master then serves as the communication thread: it sporadically determines the global optimum via MPI and sleeps in between.

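The threading structure just described can be sketched with POSIX threads roughly as follows. This is an illustrative stub, not the ILCS implementation: the worker and handler bodies are placeholders and the GPU count is assumed.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *cpu_worker(void *arg)   /* would loop: fetch seed, run user */
    {                                    /* CPU code, record local optimum   */
      printf("CPU worker %ld running user CPU code\n", (long)arg);
      return NULL;
    }

    static void *gpu_handler(void *arg)  /* would launch user GPU code, sleep */
    {                                    /* until it finishes, record result  */
      printf("GPU handler %ld launched user GPU code\n", (long)arg);
      return NULL;
    }

    int main(void)
    {
      long ncores = sysconf(_SC_NPROCESSORS_ONLN);  /* one worker per core */
      long ngpus = 2;                               /* assumed GPU count   */
      pthread_t tids[64];
      long n = 0;
      for (long c = 0; c < ncores && n < 62; c++)   /* master forks workers  */
        pthread_create(&tids[n++], NULL, cpu_worker, (void *)c);
      for (long g = 0; g < ngpus && n < 64; g++)    /* master forks handlers */
        pthread_create(&tids[n++], NULL, gpu_handler, (void *)g);
      /* the master now becomes the communication thread: it would
         sporadically exchange champions via MPI and sleep in between */
      for (long i = 0; i < n; i++) pthread_join(tids[i], NULL);
      return 0;
    }
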
Internal Operation: Seed Distribution

- E.g., 4 nodes, each with 4 CPU cores (a, b, c, d) and 2 GPUs (1, 2)
  - Each node gets a contiguous chunk of the 64-bit seed range: node 0's chunk starts at seed 0, node 1's at 2^62, node 2's at 2^63, and the last chunk ends at seed 2^64 - 1 (see the sketch after this list)
  - CPUs process their chunk bottom up, one seed per CPU thread at a time (a b c d a b c d ...)
  - GPUs process the chunk top down, a strided range of seeds per GPU at a time (... 1 2 1 2 1 2)
- Benefits
  - Balanced workload irrespective of the number of CPU cores or GPUs (or their relative performance)
  - Users can generate other distributions from the seeds
    - Any injective mapping results in no redundant evaluations

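A minimal sketch of this seed-distribution scheme follows; the type and function names are hypothetical and ILCS's actual bookkeeping may differ.

    #include <stdint.h>
    #include <stdatomic.h>

    typedef struct {
      uint64_t lo, hi;                /* this node's chunk: [lo, hi]  */
      atomic_uint_fast64_t next_cpu;  /* CPU seeds grow up from lo    */
      atomic_uint_fast64_t next_gpu;  /* GPU seeds grow down from hi  */
    } SeedChunk;

    void chunk_init(SeedChunk *c, uint64_t node, uint64_t nodes)
    {
      uint64_t size = UINT64_MAX / nodes;  /* equal share of the seed range */
      c->lo = node * size;
      c->hi = (node == nodes - 1) ? UINT64_MAX : c->lo + size - 1;
      atomic_init(&c->next_cpu, c->lo);
      atomic_init(&c->next_gpu, c->hi);
    }

    /* CPU worker: claims one seed at a time, bottom up */
    uint64_t next_cpu_seed(SeedChunk *c)
    {
      return atomic_fetch_add(&c->next_cpu, 1);
    }

    /* GPU handler: claims a block of `threads` seeds, top down; GPU thread t
       then evaluates seed base - t, i.e., a strided slice of the block.
       A real implementation must also detect when the two counters meet. */
    uint64_t next_gpu_base(SeedChunk *c, uint64_t threads)
    {
      return atomic_fetch_sub(&c->next_gpu, threads);
    }

Because the two counters advance toward each other, faster devices simply claim more of the chunk, which produces the balanced workload noted above without any tuning for relative CPU/GPU speed.
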
Related Work

- MapReduce/Hadoop/MARS and PADO
  - Their generality and features that ILS does not need incur overhead and steepen the learning curve
  - Some do not support accelerators; some require Java
- The ILCS framework is optimized for ILS applications
  - It provides reduction, requires no multiple keys, needs no secondary storage to buffer data, directly supports non-random-restart heuristics, allows early termination, works with GPUs and MICs, and targets everything from single-node workstations to HPC clusters

Evaluation Methodology

- Three HPC systems (at TACC and NICS):

  system    | compute nodes |   CPUs | CPU cores | CPU clock |  GPUs | GPU cores | GPU clock
  Keeneland |           264 |    528 |     4,224 |   2.6 GHz |   792 |   405,504 |   1.3 GHz
  Ranger    |         3,936 | 15,744 |    62,976 |   2.3 GHz |   n/a |       n/a |       n/a
  Stampede  |         6,400 | 12,800 |   102,400 |   2.7 GHz |  128* |       n/a |       n/a

  * Intel Xeon Phi (MIC) coprocessors rather than GPUs

- Largest tested configuration:

  system    | compute nodes | total CPUs | total GPUs | total CPU cores | total GPU cores
  Keeneland |           128 |        256 |        384 |           2,048 |         196,608
  Ranger    |         2,048 |      8,192 |          0 |          32,768 |               0
  Stampede  |         1,024 |      2,048 |          0 |          16,384 |               0

Sample ILS Codes

- Traveling Salesman Problem (TSP)
  - Find the shortest tour
  - 4 inputs from TSPLIB
  - 2-opt hill climbing (see the sketch after this list)
- Finite State Machine (FSM)
  - Find the best FSM configuration to predict hit/miss events
  - 4 sizes (n = 3, 4, 5, 6)
  - Monte Carlo method

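For reference, the heart of a 2-opt hill climber is the constant-time evaluation of one tour change. A hedged sketch follows; the function name and the distance-matrix layout are assumptions, not the paper's code.

    /* Length change from a 2-opt move that reverses the tour segment
       between positions i+1 and j (0 <= i < j < n); dist is a row-major
       n-by-n distance matrix. A negative delta means a shorter tour. */
    double move_delta(const double *dist, const int *tour, int n, int i, int j)
    {
      int a = tour[i], b = tour[(i + 1) % n];  /* edge (a,b) is removed */
      int c = tour[j], d = tour[(j + 1) % n];  /* edge (c,d) is removed */
      return dist[a * n + c] + dist[b * n + d]   /* edges added   */
           - dist[a * n + b] - dist[c * n + d];  /* edges removed */
    }

The precomputed O(n^2) distance matrix used here matches the CPU variant mentioned in the charts below; the GPU variant recomputes distances on the fly to stay within O(n) memory.
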
FSM Transitions/Second Evaluated
[Chart: transitions evaluated per second (trillions) on Keeneland, Ranger, and Stampede for the 3-bit through 6-bit FSMs; peak rate 21,532,197,798,304 transitions/s. Annotations: "GPU shmem limit"; Ranger uses twice as many cores as Stampede.]

TSP Tour-Changes/Second Evaluated
[Chart: moves evaluated per second (trillions) on Keeneland, Ranger, and Stampede for the kroE100, ts225, rat575, and d1291 inputs; peak rate 12,239,050,704,370 moves/s, based on the serial CPU code. The GPU re-computes distances (O(n) memory) whereas the CPU pre-computes them (O(n^2) memory); each core evaluates a tour change every 3.6 cycles.]

TSP Moves/Second/Node Evaluated
[Chart: moves evaluated per second per node (billions) on Keeneland, Ranger, and Stampede for kroE100, ts225, rat575, and d1291; the GPUs provide >90% of the performance on Keeneland.]

ILCS Scaling on Ranger (FSM)
[Chart: transitions evaluated per second (billions, log scale) vs. number of compute nodes for the 3-bit through 6-bit FSMs; >99% parallel efficiency on 2048 nodes. The other two systems behave similarly.]

ILCS Scaling on Ranger (TSP)
[Chart: moves evaluated per second (billions, log scale) vs. number of compute nodes for kroE100, ts225, rat575, and d1291; >95% parallel efficiency on 2048 nodes. Longer runs scale even better.]

Intra-Node Scaling on Stampede (TSP)
[Chart: moves evaluated per second (billions) vs. number of worker threads (1 to 16) for kroE100, ts225, rat575, and d1291; >98.9% parallel efficiency on 16 threads, i.e., the framework overhead is very small.]

Tour Quality Evolution (Keeneland)
[Chart: deviation from the optimal tour length (0% to 8%) over search steps 1 to 29 for kroE100, ts225, rat575, and d1291; quality depends on chance: ILS provides a good solution quickly, then progressively improves it.]

Tour Quality after 6 Steps (Stampede)
[Chart: deviation from the optimal tour length (0% to 9%) vs. number of compute nodes (1 to 1024) after 6 steps for kroE100, ts225, rat575, and d1291; larger node counts typically yield better results faster.]

Summary and Conclusions

- ILCS framework
  - Automatic parallelization of iterative local searches
    - Provides MPI, OpenMP, and multi-GPU support
  - Checkpoints the currently best solution every few seconds
  - Scales very well (decentralized)
- Evaluation
  - 2-opt hill climbing (TSP) and Monte Carlo method (FSM)
  - AMD and Intel CPUs, NVIDIA GPUs, and Intel MICs
- ILCS source code is freely available
  - http://cs.txstate.edu/~burtscher/research/ILCS/

Work supported by NSF, NVIDIA, and Intel; resources provided by TACC and NICS.