PPT - Computer Science Club

Donald Pupecki
Why GPUs?
 GPUs are scaling faster than CPUs. [1]
 Intel recently wrote a paper claiming a two-year-old GPU was
“only 10x faster” than their top-end CPU. [2]
Why OpenCL?
 Portable.
 Can run on most parallel platforms:
   CPU
   GPU
   FPGA
   Cell B.E.
 Industry-wide support
 Bindings for many languages
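For context, here is a minimal host-side sketch of what that portability buys (assuming the standard OpenCL 1.x C API; error handling omitted). The same few calls pick up whatever device is present, whether CPU, GPU, or accelerator:

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    char name[256];

    /* Take the first available platform... */
    clGetPlatformIDs(1, &platform, NULL);
    /* ...and the first device of ANY type: CPU, GPU, or accelerator. */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("Running on: %s\n", name);

    /* The same kernel source is built at run time for whatever this is. */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    clReleaseContext(ctx);
    return 0;
}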
Why PSO?
 Particle Swarm Optimization
 Introduced in 1995 by Kennedy and Eberhart, later
modified by Shi [5][6]
 Inspired by BOIDS, a bird-flocking simulation by Craig
Reynolds [3][4]
 Provides good exploration and exploitation, though it
sometimes requires parameter tuning [7]
 A subset of Swarm Intelligence, which is itself a subset
of Evolutionary Computation
 Video: http://vimeo.com/17407010 [8]
Is it parallelizable?
 The PSO, like most evolutionary computations, is
iterative.
 General process:
   init
   while not term_conditions:
     update position, velocity
     recalculate fitness
     update local bests
Within each iteration there is a lot to do:
 Pick 2 random numbers
   Uniform random R1, R2 in [0,1]
 Update position/velocity for each dimension
   Vel_{i+1} = W*Vel_i + C1*R1*(Best_i - Pos_i) + C2*R2*(Gbest_i - Pos_i)
   Pos_{i+1} = Pos_i + Vel_{i+1}
 Update fitness
   Sphere, Rastrigin, Rosenbrock, Griewank, Schwefel, Ackley, or Sum of Powers
 Update best positions
   If fitness(Pos_i) beats fitness(Best_i), Best_i = Pos_i
   If fitness(Pos_i) beats fitness(Gbest), Gbest = Pos_i
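A minimal OpenCL sketch of that per-dimension update (kernel and argument names are hypothetical, not the actual OCLPSO source; R1/R2 are assumed to be precomputed uniform randoms, one pair per particle-dimension):

/* One work-item per (particle, dimension) pair. Arrays are laid out as
   particle-major: index = particle*dims + dimension. */
__kernel void update_pos_vel(__global float *pos,
                             __global float *vel,
                             __global const float *best,   /* personal bests */
                             __global const float *gbest,  /* global best, dims-long */
                             __global const float *r1,
                             __global const float *r2,
                             const float w,
                             const float c1,
                             const float c2,
                             const int dims)
{
    int gid = get_global_id(0);       /* particle*dims + dimension */
    int d   = gid % dims;             /* which dimension of gbest */

    float v = w * vel[gid]
            + c1 * r1[gid] * (best[gid] - pos[gid])
            + c2 * r2[gid] * (gbest[d]  - pos[gid]);

    vel[gid] = v;
    pos[gid] += v;                    /* Pos_{i+1} = Pos_i + Vel_{i+1} */
}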
Parallel Gotchas
 Bank conflicts[18]
 Divergent Warps
 Struct Sizes
 Coalescing
 Debugging
 SIMD
 No Branch Prediction
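The divergence pitfall is easy to illustrate with a sketch (function contents are illustrative only). On SIMD hardware with no branch prediction, work-items in one warp/wavefront that take different branches execute both paths serially:

__kernel void divergence_demo(__global const float *pos,
                              __global float *branchy,
                              __global float *branchless)
{
    int gid = get_global_id(0);

    /* Divergent: within a warp, both sides execute one after the other,
       with half the work-items idle each time. */
    if (pos[gid] > 0.0f)
        branchy[gid] = sin(pos[gid]);
    else
        branchy[gid] = cos(pos[gid]);

    /* Equivalent branch-free select keeps the whole warp on one path. */
    branchless[gid] = select(cos(pos[gid]), sin(pos[gid]),
                             (int)(pos[gid] > 0.0f));
}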
Goals: Implement More Test Functions
 The previous version of this work implemented only 4
test functions.
 More non-parabolic test functions could help confirm
the soundness of the algorithm.
Test Functions Implemented
 Sphere: unimodal, symmetric
 Rosenbrock: multimodal, nonsymmetric, banana shaped
 Rastrigin: highly multimodal, symmetric
 Griewank: regularly distributed minima
 Ackley: highly multimodal, funnel shaped
 Schwefel: globally distant minima
 Sum of Powers: unimodal, nonsymmetric
[14][15]
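As one concrete example, a device-side Rastrigin evaluation might look like this (a sketch following the standard definition from [14], not the OCLPSO source; it would be called from the fitness kernel):

/* Standard Rastrigin: f(x) = 10*n + sum_i (x_i^2 - 10*cos(2*pi*x_i)).
   Highly multimodal; global minimum f(0) = 0. */
float rastrigin(__global const float *x, int n)
{
    float sum = 10.0f * (float)n;
    for (int i = 0; i < n; ++i)
        sum += x[i] * x[i] - 10.0f * cos(2.0f * M_PI_F * x[i]);
    return sum;
}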
Goals: Shift Test Functions
 PSOs have a tendency to flock to the middle. [10]
 Remedied by the “region scaling” [10][11][12] or “center
offset” approach.
 Used center offset with an offset of 2.5, as sketched below.
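The center-offset idea in code (a sketch reusing the Rastrigin form above): evaluate f(x - o) instead of f(x), so the optimum moves from the origin, which the swarm is biased toward, to (o, ..., o):

#define OFFSET 2.5f   /* per the slide above */

float rastrigin_shifted(__global const float *x, int n)
{
    float sum = 10.0f * (float)n;
    for (int i = 0; i < n; ++i) {
        float xi = x[i] - OFFSET;    /* the center offset */
        sum += xi * xi - 10.0f * cos(2.0f * M_PI_F * xi);
    }
    return sum;
}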
Goals: More Parallelism
 Version 1.0 parallelizes fitness calculations only over
the number of particles.
 Dimensional calculations almost all contain products
or sums over all dimensions; these should be done in
parallel and combined with a reduction, as sketched below.
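A sketch of the classic local-memory tree reduction (assuming a power-of-two work-group size and one work-group per particle's dimensions; names are illustrative):

/* Each work-item loads one squared term, then partial sums are
   combined pairwise in log2(N) steps. */
__kernel void sum_of_squares(__global const float *x,
                             __global float *result,   /* one per group */
                             __local float *scratch)
{
    int lid  = get_local_id(0);
    int gid  = get_global_id(0);
    int size = get_local_size(0);

    scratch[lid] = x[gid] * x[gid];     /* e.g. the Sphere terms */
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int stride = size / 2; stride > 0; stride /= 2) {
        if (lid < stride)               /* sequential addressing avoids
                                           bank conflicts [18] */
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        result[get_group_id(0)] = scratch[0];
}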
Goals: Move Everything to Card
 Less than 1/10th of the time in the 1.0 version was spent
doing calculation; the rest went to host-device transfers,
as the profile below shows.
Method                Time      Data xferred
WriteBuffer           0.08559   2.34
WriteBuffer           0.07935   2.34
WriteBuffer           0.07701   0.23
WriteBuffer           0.07851   2.34
WriteBuffer           0.07641   0.08
rast__k3_ATI RV7701   0.04503   0
ReadBuffer            0.02778   2.34
ReadBuffer            0.02052   2.34
OCLPSO 1.1
 Get everything to the card!
 Needed to implement a LCRNG on card (see the sketch
after this list).
 Eliminated large data transfers.
 Added hill climber to fall back on.
 Got inconsistent results.
 A note on synchronization: barriers and memory fences
only work within a work-group, FP atomics need Compute
Capability 2.x hardware, and there is no global sync.
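A minimal sketch of an on-card LCRNG (the constants shown are the well-known Numerical Recipes LCG parameters, used here as an assumption; the slides do not give the parameters actually used):

/* Per-work-item linear congruential RNG: state lives in a global
   buffer, one seed per work-item, advanced mod 2^32 each call. */
inline float lcg_uniform(__global uint *state, int gid)
{
    uint s = state[gid] * 1664525u + 1013904223u;  /* LCG step */
    state[gid] = s;
    return (float)s / 4294967296.0f;               /* map to [0, 1) */
}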
OCL PSO 2.0
 Rewrite.
 Introduced the Fully Saturated concept.
 Compute units are now self-contained.
 All dimensions of all particles of all swarms are
evaluated in parallel, as the indexing sketch below shows.
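One plausible decoding of such a fully saturated 1-D launch, where the NDRange covers every (swarm, particle, dimension) triple at once (the layout and names are an illustrative assumption, not the OCLPSO source):

__kernel void saturated_step(__global float *pos,
                             const int num_particles,
                             const int dims)
{
    int gid      = get_global_id(0);
    int dim      = gid % dims;                    /* innermost index  */
    int particle = (gid / dims) % num_particles;  /* middle index     */
    int swarm    = gid / (dims * num_particles);  /* outermost index  */

    /* Per-dimension work for (swarm, particle, dim) would go here,
       e.g. the position/velocity update from the earlier sketch. */
}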
OCLPSO 2.0 vs. CUDA PSO 2.0
 The CUDA PSO [16][17] was submitted for the GECCO
competition on GPGPGPU.com.
 It features a ring topology for its bests calculation.
OCLPSO 2.0 vs. SPSO
 The 2006 Standard PSO was written by Clerc and
Kennedy.
 Pretty frills-free compared to newer (2010) Standard
PSO codes or TRIBES.
 Non-parallel but extremely efficient.
Conclusions
 OCL PSO 2.0 performs competitively with CUDA PSO
and outperforms SPSO.
 This is despite its lack of neighborhood bests (it uses
only a global best).
 This is also without the hill-climb component, as I
needed to strip out as much complexity as I could in
the final rewrite.
Future work
 Re-implement the hill climber; it was a great help on
peaks.
 Re-add neighborhoods.
 Implement on the newest generation of CUDA 2.x cards
supporting global floating-point atomics (see the sketch
after this list).
 Implement TRIBES on the GPU. [19]
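Why global FP atomics matter here: many work-groups race to update the global best fitness. Without native float atomics, OpenCL code falls back on an integer compare-and-swap loop like this common workaround (an illustration, not from the slides):

/* Atomically take the minimum of *addr (a float stored as int bits)
   and val, by spinning on integer compare-and-swap. */
inline void atomic_min_float(volatile __global int *addr, float val)
{
    int old_bits = *addr;
    /* Retry until our value is no longer smaller, or the swap lands. */
    while (val < as_float(old_bits)) {
        int prev = atomic_cmpxchg(addr, old_bits, as_int(val));
        if (prev == old_bits) break;   /* we won the race */
        old_bits = prev;               /* someone else updated; retry */
    }
}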
References
[1] Nvidia, “Tesla GPU Computing: Supercomputing at 1/10th the Cost,”
http://www.nvidia.com/docs/IO/96958/Tesla_Master_Deck.pdf, 2011.
[2] Nvidia, “Why Choose Tesla,” http://www.nvidia.com/object/why-choose-tesla.html, 2011.
[3] C. Reynolds, “Boids,” http://www.red3d.com/cwr/boids/
[4] Reynolds, Craig (1987), "Flocks, herds and schools: A distributed behavioral model.", SIGGRAPH '87:
Proceedings of the 14th annual conference on Computer graphics and interactive techniques
(Association for Computing Machinery): 25–34
[5] Kennedy, J.; Eberhart, R. (1995). "Particle Swarm Optimization". Proceedings of IEEE International
Conference on Neural Networks. IV. pp. 1942–1948. doi:10.1109/ICNN.1995.488968.
http://www.engr.iupui.edu/~shi/Coference/psopap4.html.
[6] Shi, Y.; Eberhart, R.C. (1998). "A modified particle swarm optimizer". Proceedings of IEEE
International Conference on Evolutionary Computation. pp. 69–73.
[7] Clerc, M.; Kennedy, J. (2002). "The particle swarm - explosion, stability, and convergence in a
multidimensional complex space". IEEE Transactions on Evolutionary Computation.
[8] M. Nobile. “Particle Swarm Optimization”, http://vimeo.com/17407010, 2011
References – Cont.
[9] M. Clerc, Confinements and Biases in Particle Swarm Optimization. http://clerc.maurice.free.fr/pso/, 2006.
[10] C. K. Monson and K. D. Seppi, "Exposing Origin-Seeking Bias in PSO," presented at GECCO'05,
Washington, DC, USA, 2005, pp. 241-248.
[11] P. J. Angeline. Using selection to improve particle swarm optimization. In Proceedings of the IEEE
Congress on Evolutionary Computation (CEC 1998), Anchorage, Alaska, USA, 1998.
[12] D. K. Gehlhaar and D. B. Fogel. Tuning evolutionary programming for conformationally flexible
molecular docking. In Evolutionary Programming, pages 419–429, 1996.
[13] D. Pupecki, “OpenCL PSO, Development, Benchmarking, Lessons, Future Work,” 2011.
http://web.cs.sunyit.edu/~pupeckd/oclpsopres.pdf
[14] Molga, M., Smutnicki, C., 2005. “Test functions for optimization needs”,
http://www.zsd.ict.pwr.wroc.pl/files/docs/functions.pdf
[15] Pohlheim, Hartmut. Genetic and Evolutionary Algorithm Toolbox for use with MATLAB. Technical
Report, Technical University Ilmenau, 1998.
[16] L. Mussi, F. Daolio, and S. Cagnoni. Evaluation of parallel particle swarm optimization algorithms
within the CUDA architecture. Information Sciences, 2011, in press.
References – Cont.
[17] L. Mussi, Y.S.G. Nashed, and S. Cagnoni. GPU-based Asynchronous Particle Swarm Optimization.
Proc. GECCO 2011, 2011.
[18] NVIDIA, OpenCL Programming Guide for the CUDA Architecture,
http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_OpenCL_ProgrammingGuide.pdf, 2011.
[19] M. Clerc. TRIBES - un exemple d’optimisation par essaim particulaire sans paramètres de contrôle.
In Optimisation par Essaim Particulaire (OEP 2003), Paris, France, 2003.