
Evaluation of parallel particle swarm optimization algorithms within the CUDA™ architecture
Luca Mussi, Fabio Daolio, Stefano Cagnoni
Information Sciences, 181(20), 2011, pp. 4642-4657.
Presenter: Guan-Yu Chen
Outline
1. Particle swarm optimization (PSO)
2. PSO parallelization
3. The CUDA™ architecture
4. Parallel PSO within the CUDA™ architecture
5. Results
6. Final remarks
1. Particle swarm optimization (1/3)
• Kennedy & Eberhart (1995).
– Velocity function.
– Fitness function.
1. Particle swarm optimization (2/3)
• Velocity function
$$V(t) = w\,V(t-1) + C_1 R_1 [X_{lbest}(t-1) - X(t-1)] + C_2 R_2 [X_{gbest}(t-1) - X(t-1)]$$
$$X(t) = X(t-1) + V(t)$$
V: the velocity of a particle.
C1, C2: two positive constants.
w: inertia weight.
t: the time step (iteration index).
X: the position of a particle.
R1, R2: two random numbers uniformly drawn between 0 and 1.
Xlbest: the best-fitness position reached by the particle.
Xgbest: the best-fitness point ever found by the whole swarm.
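A minimal sequential sketch of these update equations in C (illustrative only, not the paper's code; the rand01 helper and array names are placeholders):

#include <stdlib.h>

/* Uniform random number in [0, 1] (placeholder helper). */
static float rand01(void) {
    return (float)rand() / (float)RAND_MAX;
}

/* One particle's velocity and position update, dimension by dimension. */
void pso_update_particle(float *x, float *v,
                         const float *x_lbest, const float *x_gbest,
                         int dim, float w, float c1, float c2) {
    for (int d = 0; d < dim; ++d) {
        float r1 = rand01(), r2 = rand01();
        v[d] = w * v[d]
             + c1 * r1 * (x_lbest[d] - x[d])   /* cognitive term */
             + c2 * r2 * (x_gbest[d] - x[d]);  /* social term    */
        x[d] += v[d];                          /* X(t) = X(t-1) + V(t) */
    }
}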
1. Particle swarm optimization (3/3)
• Fitness function
$$z = Z(X)$$
Z is the user-defined (problem-specific) objective function.
$$X^{*} = \arg\min_{X} Z(X)$$
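For example, the generalized Rastrigin function used later in the experiments (Section 5) can play the role of Z; a plain-C sketch (illustrative, not the paper's code):

#include <math.h>

/* Generalized Rastrigin benchmark: Z(X) = 10*D + sum_i (x_i^2 - 10*cos(2*pi*x_i)).
 * Global minimum Z(X*) = 0 at X* = (0, ..., 0). */
float rastrigin(const float *x, int dim) {
    const float two_pi = 6.28318530718f;
    float z = 10.0f * (float)dim;
    for (int i = 0; i < dim; ++i)
        z += x[i] * x[i] - 10.0f * cosf(two_pi * x[i]);
    return z;
}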
2. PSO parallelization
• Master-Slave paradigm.
• Island model (coarse-grained algorithms).
• Cellular model (fine-grained paradigm).
• Synchronous or asynchronous.
3. The CUDA™ architecture (1/5)
• CUDA™ (nVIDIA™, Nov. 2006).
– A handy tool to develop scientific programs oriented to massively parallel computation.
• Kernels → Grid → Thread blocks → Threads
• How many thread blocks for the problem?
• How many threads per thread block?
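A minimal sketch of how that hierarchy appears in code: a kernel is launched as a grid of thread blocks, each containing a fixed number of threads (names and sizes below are illustrative, not from the paper):

__global__ void scaleKernel(float *data, float factor, int n) {
    /* Each thread handles one element: block offset + thread offset. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

void scaleOnGpu(float *d_data, float factor, int n) {
    int threadsPerBlock = 128;                                   /* threads per thread block  */
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock; /* thread blocks in the grid */
    scaleKernel<<<numBlocks, threadsPerBlock>>>(d_data, factor, n);
}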
3. The CUDA™ architecture (2/5)
• Streaming Multiprocessors (SMs)
– 8 scalar processing cores,
– A number of fast 32-bit registers,
– A parallel data cache shared between all cores,
– A read-only constant cache,
– A read-only texture cache.
3. The CUDA™ architecture (3/5)
• SIMT (Single Instruction, Multiple Thread)
– The SM creates, manages, schedules, and executes groups (warps) of 32 parallel threads.
– The main difference from a SIMD (Single Instruction, Multiple Data) architecture is that SIMT instructions specify the whole execution and branching behavior of a single thread.
3. The CUDA™ architecture (4/5)
Each kernel should reflect the following structure:
a) Load data from global/texture memory;
b) Process data;
c) Store results back to global memory.
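A toy kernel following that three-step structure (an assumed example, not taken from the paper):

__global__ void saxpyKernel(const float *x, const float *y, float *out,
                            float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xi = x[i];            /* a) load data from global memory        */
        float yi = y[i];
        float r  = a * xi + yi;     /* b) process data (here, in registers)   */
        out[i]   = r;               /* c) store results back to global memory */
    }
}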
3. The CUDA™ architecture (5/5)
The most important specific programming guidelines:
a) Minimize data transfers between the host and the graphics card;
b) Minimize the use of global memory: shared memory should be preferred;
c) Ensure global memory accesses are coalesced whenever possible;
d) Avoid different execution paths within the same warp.
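As a sketch of guidelines (b)-(d), a block-wise sum reduction that reads global memory once with consecutive (coalesced) indices and then works entirely in shared memory; illustrative only, assumes the block size is a power of two and that the kernel is launched with blockDim.x * sizeof(float) bytes of dynamic shared memory:

__global__ void blockSumKernel(const float *in, float *blockSums, int n) {
    extern __shared__ float sdata[];                  /* shared memory, guideline (b) */
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;  /* consecutive threads read
                                                         consecutive addresses, (c)  */
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    /* Tree reduction in shared memory; threads of a warp follow the same
       branch pattern, limiting divergence, (d). */
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0];             /* single global write per block */
}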
4. Parallel PSO within the CUDA™ architecture
• The main obstacle to PSO parallelization is the dependence between the particles' updates.
SyncPSO
– Xgbest or Xlbest is updated only at the end of each generation.
RingPSO
– Relaxes the synchronization constraint.
– Allows the computation load to be distributed over all available SMs.
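To make the ring topology concrete: each particle only looks at its own best and those of its two ring neighbours, so no swarm-wide synchronization is needed at every step. A hedged sketch (assumed device helper, not the paper's kernel):

/* Index of the best neighbour of particle i in a ring of n particles
 * (minimization). Illustrative helper, not the paper's source. */
__device__ int ringNeighbourhoodBest(const float *bestFitness, int i, int n) {
    int left  = (i - 1 + n) % n;   /* ring wraps around the swarm */
    int right = (i + 1) % n;
    int best  = i;
    if (bestFitness[left]  < bestFitness[best]) best = left;
    if (bestFitness[right] < bestFitness[best]) best = right;
    return best;
}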
4.1 Basic parallel PSO design (1/2)
4.1 Basic parallel PSO design (2/2)
4.2 Multi-kernel parallel PSO algorithm (1/3)
posID = ( swarmID * n + particleID ) * D + dimensionID
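In a kernel, that flattened layout could be addressed as below (one thread block per particle, as stated on the next slide; mapping swarms to blockIdx.y and dimensions to threads is an assumption for illustration):

/* Sketch of the flattened global-memory layout:
 * posID = (swarmID * n + particleID) * D + dimensionID,
 * with n particles per swarm and D problem dimensions.
 * Index mapping below is illustrative, not the paper's source. */
__global__ void touchPositionsKernel(float *positions, int n, int D) {
    int swarmID     = blockIdx.y;    /* one swarm per grid row (assumption) */
    int particleID  = blockIdx.x;    /* one thread block per particle       */
    int dimensionID = threadIdx.x;   /* one thread per coordinate           */
    int posID = (swarmID * n + particleID) * D + dimensionID;
    positions[posID] += 0.0f;        /* consecutive threads access consecutive
                                        addresses, so accesses coalesce     */
}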
4.2 Multi-kernel parallel PSO algorithm (2/3)
• PositionUpdateKernel (1st kernel)
– Updates the particles' positions, scheduling a number of thread blocks equal to the number of particles.
• FitnessKernel (2nd kernel)
– Computes the fitness.
• BestUpdateKernel (3rd kernel)
– Updates Xgbest or Xlbest.
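A host-side sketch of how the three kernels could be chained once per generation (kernel signatures and launch geometry below are placeholders, not the paper's code; launches issued to the same stream execute in order):

/* Kernels assumed to be defined elsewhere (signatures are placeholders). */
__global__ void PositionUpdateKernel(float *pos, float *vel,
                                     const float *lbest, const float *gbest);
__global__ void FitnessKernel(const float *pos, float *fitness);
__global__ void BestUpdateKernel(const float *pos, const float *fitness,
                                 float *lbest, float *gbest);

void runMultiKernelPso(int numParticles, int D, int generations,
                       float *d_pos, float *d_vel, float *d_fit,
                       float *d_lbest, float *d_gbest) {
    for (int g = 0; g < generations; ++g) {
        /* one thread block per particle, one thread per dimension (assumption) */
        PositionUpdateKernel<<<numParticles, D>>>(d_pos, d_vel, d_lbest, d_gbest);
        FitnessKernel<<<numParticles, D>>>(d_pos, d_fit);
        BestUpdateKernel<<<1, numParticles>>>(d_pos, d_fit, d_lbest, d_gbest);
    }
    cudaDeviceSynchronize();   /* wait for the last generation to finish */
}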
4.2 Multi-kernel parallel PSO algorithm (3/3)
5. Results
Parameter settings: w = 0.729844 and C1 = C2 = 1.49618.
5.1 SyncPSO (1/2)
• Hardware: Asus GeForce EN8800GT GPU; Intel Core2 Duo™ CPU, 1.86 GHz.
a) 100 consecutive runs of a single swarm of 32, 64, and 128 particles on the 5-dimensional Rastrigin function, versus the number of generations.
b) How the time needed to run 10,000 generations of one swarm with 32, 64, and 128 particles scales with respect to the dimension of the generalized Rastrigin function (up to nine dimensions).
5.1 SyncPSO (2/2)
5.2 RingPSO (1/5)
• GPUs: nVIDIA™ Quadro FX 5800; Zotac GeForce GTX260 AMP 2 edition; Asus GeForce EN8800GT.
• SPSO on a 64-bit Intel(R) Core(TM) i7 CPU, 2.67 GHz.
1) The sequential SPSO version modified to implement the ring topology;
2) The ‘basic’ three-kernel version of RingPSO;
3) RingPSO implemented with two kernels only (one kernel which fuses BestUpdateKernel and PositionUpdateKernel, plus FitnessKernel).
5.2 RingPSO (2/5)
Sphere function, search domain [-100, 100]^D
5.2 RingPSO (3/5)
Rastrigin function, search domain [-5.12, 5.12]^D
5.2 RingPSO (4/5)
Rosenbrock function, search domain [-30, 30]^D
5.2 RingPSO (5/5)
6. Final remarks (1/2)
• SyncPSO is usually more than enough for any practical application.
• SyncPSO's usage of computation resources is very inefficient when only one or a few swarms need to be simulated.
• SyncPSO becomes inefficient when the problem size increases above a certain threshold.
6. Final remarks (2/2)
• The drawbacks of accessing global memory for the multi-kernel version are more than compensated by the advantages of parallelization.
• The speed-up for the multi-kernel version increases with problem size.
• Both versions are far better than the most recent results published on the same task.