Transcript Précis

Large Matrix-Matrix Multiply on
PS3 clusters
15 September 2010
Mark Barnell, AFRL RITB
[email protected]
Dennis Fitzgerald, ITT
[email protected]
DISTRIBUTION STATEMENT A. Approved for public release; distribution unlimited. (Approval given by Public Affairs
Office, September 2010.)
Description
• Matrix-Matrix multiplication of large matrices
• > 100k x 100k
• Parallelized over a number of PS3s
• Maintained near peak performance on each Cell BE
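To give a sense of scale for the sizes quoted above, the following rough sketch computes the storage and operation count for a 100k x 100k product. It assumes single-precision (4-byte) elements and dense storage, neither of which the slide states; the function names are illustrative only.

```python
def matrix_bytes(rows, cols, bytes_per_element=4):
    """Storage for a dense rows x cols matrix.

    bytes_per_element=4 assumes single precision (an assumption, not from the slides).
    """
    return rows * cols * bytes_per_element


def matmul_flops(m, k, n):
    """Operations for C (m x n) = A (m x k) * B (k x n): one multiply and one add per term."""
    return 2 * m * k * n


n = 100_000
print(f"one {n} x {n} matrix: {matrix_bytes(n, n) / 1e9:.0f} GB")
print(f"operations for the full product: {matmul_flops(n, n, n):.1e}")
```

Under these assumptions a single operand is far larger than one console's main memory, which is why the work has to be split into blocks and spread over several PS3s.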
UNCLASSIFIED
Challenges
• Near peak computation rate on the Cell BE for
small matrix sizes
• Data and thread coordination between
PowerPC and Cell BE with near zero overhead
• Balanced IO with Cell BE’s peak FLOPS to
keep PS3 computationally busy
• Network performance sufficient to deliver
enough data to many PS3s
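The balance between IO and compute in the bullets above can be estimated with a simple model. The sketch below assumes square blocks, single-precision data, and that only the A and B blocks are moved for each block multiply while the C block stays resident; these are simplifying assumptions, not details taken from the slides. The 153 GFLOPS figure is the per-PS3 peak quoted in the results plot.

```python
def flops_per_byte(block, bytes_per_element=4):
    """Arithmetic intensity of one block x block multiply-accumulate.

    Assumes the A and B blocks are transferred and the C block stays in place.
    """
    flops = 2 * block**3
    bytes_moved = 2 * block**2 * bytes_per_element
    return flops / bytes_moved


def required_bandwidth_gbs(peak_gflops, block, bytes_per_element=4):
    """Data rate (GB/s) needed to keep a processor at peak_gflops for this block size."""
    return peak_gflops / flops_per_byte(block, bytes_per_element)


for block in (128, 1024, 8192):
    rate = required_bandwidth_gbs(153, block)
    print(f"block {block:>5}: {flops_per_byte(block):7.1f} flops/byte, {rate:.3f} GB/s needed")
```

In this model the required data rate falls in proportion to the block edge, so larger blocks ease the load on the network at the cost of more memory per node.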
Approach
• Core MM algorithm > 99% efficient (128x128)
– Daniel Hackenberg – Dresden
• PowerPC code to coordinate larger
rectangular matrices – Miriam Leeser –
Northeastern
• Multi-buffering & semaphores to reduce wait
time
• Blocked sub-matrix distribution with data
sized to balance compute and IO
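The combination of blocked sub-matrices, multi-buffering, and semaphores listed above can be illustrated with a small sketch. This is not the Cell BE code from the talk: it uses ordinary Python threads in place of the PowerPC and SPE cores, list slicing in place of DMA transfers, and hypothetical function names. It only shows the pattern, in which a loader fills one buffer slot while the compute loop works on another, with two semaphores counting the free and ready slots.

```python
import threading


def tile(mat, r0, c0, size):
    """Copy the sub-matrix of mat with top-left corner (r0, c0), at most size x size."""
    return [row[c0:c0 + size] for row in mat[r0:r0 + size]]


def tile_multiply_add(c, a_t, b_t, r0, c0):
    """Accumulate the product of tiles a_t and b_t into c at offset (r0, c0)."""
    for i, a_row in enumerate(a_t):
        c_row = c[r0 + i]
        for kk, a_ik in enumerate(a_row):
            for j, b_kj in enumerate(b_t[kk]):
                c_row[c0 + j] += a_ik * b_kj


def blocked_matmul_double_buffered(a, b, size=128, slots=2):
    """Blocked C = A * B with a loader thread feeding a ring of buffer slots."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    jobs = [(i0, j0, k0)
            for i0 in range(0, m, size)
            for j0 in range(0, n, size)
            for k0 in range(0, k, size)]
    buffers = [None] * slots
    free = threading.Semaphore(slots)  # slots the loader may fill
    ready = threading.Semaphore(0)     # slots the compute loop may consume

    def loader():
        for idx, (i0, j0, k0) in enumerate(jobs):
            free.acquire()             # wait until a slot is free
            buffers[idx % slots] = (i0, j0, tile(a, i0, k0, size), tile(b, k0, j0, size))
            ready.release()            # signal that the slot holds data

    worker = threading.Thread(target=loader)
    worker.start()
    for idx in range(len(jobs)):
        ready.acquire()                # wait for the next loaded tile pair
        i0, j0, a_t, b_t = buffers[idx % slots]
        tile_multiply_add(c, a_t, b_t, i0, j0)
        free.release()                 # hand the slot back to the loader
    worker.join()
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(blocked_matmul_double_buffered(a, b, size=2))
```

With two slots the loader can stay one tile pair ahead of the compute loop, so the consumer seldom waits for data as long as loading is no slower than multiplying.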
Results
[Figure: Matrix-Matrix Multiply GFLOPS. Y axis: GFLOPS (0 to 3500). X axis: Number of PS3s (1 to 21). Series: 48k x 48k; 48k x 240k; PS3 Max GFLOPS (153).]
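As a reading aid for the plot, the sketch below computes an ideal aggregate peak, assuming that the "PS3 Max GFLOPS (153)" legend entry denotes a per-console peak that scales linearly with the number of PS3s (an interpretation, not something the slide states). The efficiency helper is hypothetical and takes a measured value as input; no measured values are reproduced here.

```python
PS3_PEAK_GFLOPS = 153  # per-console figure taken from the plot legend


def aggregate_peak(n_ps3, per_ps3=PS3_PEAK_GFLOPS):
    """Ideal combined peak if every console ran at its own peak."""
    return n_ps3 * per_ps3


def parallel_efficiency(measured_gflops, n_ps3, per_ps3=PS3_PEAK_GFLOPS):
    """Fraction of the ideal combined peak reached by a measured rate."""
    return measured_gflops / aggregate_peak(n_ps3, per_ps3)


# Ideal peak at the odd cluster sizes marked on the x axis.
for n in range(1, 22, 2):
    print(n, aggregate_peak(n))
```

Under this reading the ideal line reaches 3213 GFLOPS at 21 consoles, which is consistent with a y axis that runs to 3500.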