A Parallel GPU Version of the Traveling Salesman Problem
Molly A. O’Neil, Dan Tamir, and Martin Burtscher*
Department of Computer Science
The Traveling Salesman Problem
Common combinatorial optimization problem
Wire routing, logistics, robot arm movement, etc.
Given n cities, find shortest Hamiltonian tour
Must visit all cities exactly once and end in first city
Usually expressed as a graph problem
We use complete, undirected, planar, Euclidean graph
Vertices represent cities
Edge weights reflect distances
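As a minimal sketch of the objective, the function below sums the edge weights along a closed tour. The row-major `dist` layout and the name `tour_length` are assumptions for illustration, not taken from the original code, and the distance matrix is assumed to be precomputed.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the closed tour tour[0] -> tour[1] -> ... -> tour[n-1] -> tour[0].
 * dist is an n*n row-major matrix of symmetric edge weights (assumed layout). */
long tour_length(const int *dist, const int *tour, size_t n)
{
    assert(n >= 2);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        size_t from = (size_t)tour[i];
        size_t to = (size_t)tour[(i + 1) % n];  /* wrap around to end in the first city */
        total += dist[from * n + to];
    }
    return total;
}
```

For four cities on the corners of a 3 x 4 rectangle, visiting the corners in order gives length 14, while a tour that crosses both diagonals gives 18.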
A Parallel GPU Version of the Traveling Salesman Problem
July 2011
TSP Algorithm
Finding the optimal solution is NP-hard
Heuristic algorithms used to approximate solution
We use an iterative hill climbing search algorithm
Generate k random initial tours (k climbers)
Iteratively refine them until local minimum reached
In each iteration, apply the best opt-2 (2-opt) move
Find best pair of edges (a,b) and (c,d) such that replacing them with (a,d) and (b,c) minimizes tour length
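A single opt-2 move can be sketched as follows, assuming the tour is stored as an array of city indices and the distance matrix is symmetric. The labeling a = tour[i], b = tour[i+1], d = tour[j], c = tour[j+1] is one reading that matches the edge names above; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Change in tour length if edges (a,b) and (c,d) are replaced by (a,d) and (b,c),
 * where a = tour[i], b = tour[i+1], d = tour[j], c = tour[(j+1) % n]. */
long two_opt_delta(const int *dist, const int *tour, size_t n, size_t i, size_t j)
{
    assert(i + 1 < j && j < n);
    size_t a = (size_t)tour[i], b = (size_t)tour[i + 1];
    size_t d = (size_t)tour[j], c = (size_t)tour[(j + 1) % n];
    return (long)dist[a * n + d] + dist[b * n + c]
         - (long)dist[a * n + b] - dist[c * n + d];
}

/* Apply the move by reversing the segment tour[i+1 .. j] in place. */
void two_opt_apply(int *tour, size_t i, size_t j)
{
    for (size_t lo = i + 1, hi = j; lo < hi; lo++, hi--) {
        int tmp = tour[lo];
        tour[lo] = tour[hi];
        tour[hi] = tmp;
    }
}
```

A negative return value means the move shortens the tour; reversing the segment between the two removed edges reconnects the tour with the two new edges.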
GPU Requirements
Lots of data parallelism
Need 10,000s of ‘independent’ threads
Sufficient memory access regularity
Sets of 32 threads should have ‘nice’ access patterns
Sufficient code regularity
Sets of 32 threads should follow the same control flow
Plenty of data reuse
At least O(n²) operations on O(n) data
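One common way to obtain 'nice' access patterns is to interleave per-thread arrays so that neighboring threads touch neighboring addresses. The sketch below shows only the index arithmetic; the layout and the name `interleaved_index` are assumptions for illustration and are not stated to be what TSP_GPU uses.

```c
#include <assert.h>
#include <stddef.h>

/* Index of element `pos` of thread `tid`'s array when the arrays of
 * `nthreads` threads are interleaved element by element. */
size_t interleaved_index(size_t pos, size_t tid, size_t nthreads)
{
    assert(tid < nthreads);
    return pos * nthreads + tid;
}
```

With this layout, the 32 threads of a set that all read element `pos` of their own array access 32 consecutive locations.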
TSP_GPU Implementation
Assuming 100-city problems & 100,000 climbers
Climbers are independent, can be run in parallel
Plenty of data parallelism
Potential load imbalance
Different number of steps required to reach local minimum
Every step determines best of 4851 opt-2 moves
Same control flow (but different data)
Coalesced memory access patterns
O(n²) operations on O(n) data
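The count of 4851 is consistent with one move per pair of tour positions (i, j) with j ≥ i + 2 for n = 100 cities. This indexing is an assumption for illustration; the slides do not show the loop bounds.

```latex
\[
\sum_{i=0}^{n-3} (n-2-i) \;=\; \frac{(n-1)(n-2)}{2} \;=\; \frac{99 \cdot 98}{2} \;=\; 4851 \qquad (n = 100)
\]
```

Each step therefore performs on the order of n² move evaluations on a tour of only n entries.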
Code Optimizations
Key code section: finding best opt-2 move
Doubly nested loop
Only computes difference in tour length, not absolute length
Highly optimized to minimize memory accesses
“Caches” rest of data in registers
Requires only 6 clock cycles per move on a Xeon CPU core
Local minimum compared to best solution so far
Best solution updated if needed, otherwise tour is discarded
Other small optimizations (see paper)
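The key section can be sketched as a doubly nested loop that evaluates only length differences and keeps the outer-loop values in locals. This is a simplified illustration, not the optimized TSP_GPU code: the loop bounds, the row-major `dist` layout, and the names `best_move` and `climb` are assumptions. With these bounds the scan happens to visit 4851 position pairs when n = 100.

```c
#include <assert.h>
#include <stddef.h>

/* Scan all position pairs i, j with j >= i + 2 and return the most negative
 * length change; *bi and *bj receive the best move (unchanged if none improves). */
static long best_move(const int *dist, const int *tour, size_t n,
                      size_t *bi, size_t *bj)
{
    long best = 0;
    for (size_t i = 0; i + 2 < n; i++) {
        /* Keep the first edge in locals so the inner loop re-reads less memory. */
        size_t a = (size_t)tour[i], b = (size_t)tour[i + 1];
        long d_ab = dist[a * n + b];
        for (size_t j = i + 2; j < n; j++) {
            size_t d = (size_t)tour[j], c = (size_t)tour[(j + 1) % n];
            /* Difference in tour length only; the absolute length is never recomputed. */
            long delta = (long)dist[a * n + d] + dist[b * n + c]
                       - d_ab - dist[d * n + c];
            if (delta < best) { best = delta; *bi = i; *bj = j; }
        }
    }
    return best;
}

/* Apply best moves until none shortens the tour; returns the total length change. */
long climb(const int *dist, int *tour, size_t n)
{
    assert(n >= 3);
    long total = 0, delta;
    size_t i = 0, j = 0;
    while ((delta = best_move(dist, tour, n, &i, &j)) < 0) {
        /* Reverse tour[i+1 .. j] to reconnect the tour with the two new edges. */
        for (size_t lo = i + 1, hi = j; lo < hi; lo++, hi--) {
            int tmp = tour[lo]; tour[lo] = tour[hi]; tour[hi] = tmp;
        }
        total += delta;
    }
    return total;
}
```

Because every applied move strictly shortens an integer-weighted tour, the loop terminates in a local minimum; a caller would then compare the resulting length with the best solution found so far.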
GPU Optimizations
Random tours generated in parallel on GPU
Minimizes data transfer to GPU
(CPU only generates the distance matrix and prints the result)
2D distance matrix resident in shared memory
Ensures hits in software-controlled fast data cache
Tours copied to local memory in chunks of 1024
Enables accessing them with coalesced loads & stores
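Deriving each initial tour from the climber's index alone is one way the tours could be produced without transferring them from the host. The sketch below uses a Fisher-Yates shuffle driven by a small linear congruential generator; the generator constants, the seeding scheme, and the name `random_tour` are assumptions for illustration, not details of TSP_GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small linear congruential generator; the constants are a common
 * textbook choice and are not taken from the original code. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;  /* drop the weakest low-order bits */
}

/* Fill tour[0..n-1] with a pseudo-random permutation of 0..n-1 that is
 * determined entirely by the climber's id (hypothetical seeding scheme). */
void random_tour(int *tour, size_t n, uint32_t climber_id)
{
    assert(n >= 1);
    uint32_t state = climber_id * 2654435761u + 1u;
    for (size_t i = 0; i < n; i++)
        tour[i] = (int)i;
    /* Fisher-Yates shuffle (modulo bias is ignored in this sketch). */
    for (size_t i = n - 1; i > 0; i--) {
        size_t k = lcg_next(&state) % (i + 1);
        int tmp = tour[i]; tour[i] = tour[k]; tour[k] = tmp;
    }
}
```

Since the result depends only on `climber_id`, every climber can build its own starting tour independently and reproducibly.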
Evaluation Method
Systems
NVIDIA Tesla C2050 GPU (1.15 GHz, 14 SMs with 32 PEs each)
Nautilus supercomputer (2.0 GHz 8-core X7550 Xeons)
Datasets
Five 100-city inputs from TSPLIB
Implementations
CUDA (GPU), Pthreads (CPU), serial C (CPU)
Use almost identical code for finding best opt-2 move
Runtime Comparison (kroE100 Input)
[Figure: runtimes in ms (log scale) for the sequential CPU code, the pthreads CPU code versus number of threads (1 to 128; median shown), and the CUDA GPU code; legend entries: pthreads, CUDA GPU, min, median, max. Data labels in printed order: 154684, 156413 (median), 78350, 39175, 19591, 9802, 4908, 4368, 2724, 2539, 2497.]
GPU is 7.8x faster than CPU with 8 cores
One GPU chip is as fast as 16 or 32 CPU chips
Speedup over Sequential Code
[Figure: Speedup over Serial (kroE100 Input). Speedup versus number of threads (pthreads, 1 to 256; median shown) and for the CUDA GPU; legend entries: pthreads, CUDA GPU, min, median, max. Data labels in ascending order: 1.0, 2.0, 3.9, 7.9, 15.8, 31.5, 35.4, 56.8, 60.9, 61.9.]
Pthreads code scales well to 32 threads (4 CPUs)
CPU performance fluctuates (NUMA), GPU stable
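The speedup labels can be cross-checked against the runtime labels with a one-line ratio. The pairing used here (154684 ms for the sequential code, 19591 ms for the 8-thread run, 2497 ms for the fastest run) is inferred from the order of the printed labels and should be treated as an assumption.

```c
#include <assert.h>

/* Speedup of a run relative to a baseline runtime (both in ms). */
double speedup(double baseline_ms, double run_ms)
{
    assert(baseline_ms > 0.0 && run_ms > 0.0);
    return baseline_ms / run_ms;
}
```

Under that pairing, `speedup(154684.0, 2497.0)` is about 61.9, in line with the largest speedup label, and `speedup(19591.0, 2497.0)` is about 7.8, in line with the stated advantage over 8 CPU cores.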
Solution Quality
TSPLIB Database            CUDA GPU Solution Quality
Name      Optimal Cost     Min. Tour Cost   Min. Tour #   Runtime (s)
kroA100   21,282           21,282           33,188        2.540
kroB100   22,141           22,141           5,969         2.499
kroC100   20,749           20,749           23,092        2.543
kroD100   21,294           21,294           32,142        2.497
kroE100   22,068           22,084           16,941        2.499
kroE100   22,068           22,068           117,583       4.952
Optimal tour found in 4 of 5 cases with 100,000 climbers
200,000 climbers find best solution in fifth case
Runtime independent of input and linear in climbers
Summary
TSP_GPU source code is freely available at
http://www.cs.txstate.edu/~burtscher/research/TSP_GPU/
TSP_GPU algorithm
Highly optimized implementation for GPUs
Evaluates almost 20 billion tour modifications per second on a single GPU (as fast as 32 8-core Xeons)
Produces high-quality results
May be better suited for GPUs than ant colony optimization (ACO) and genetic algorithm (GA) approaches
Acknowledgments
NSF TeraGrid (NICS), NVIDIA Corp., and Intel Corp.