Transcript Slide 1

Integration, Development and Results of the 500 Teraflop Heterogeneous Cluster (Condor)
11 September 2012
Integrity  Service  Excellence
Mark Barnell
Air Force Research Laboratory
DISTRIBUTION A. Approved for public release; distribution unlimited (88ABW-2011-3208, 06 Jun 2011)
1
Agenda
• Mission
• RI HPC-ARC & HPC Systems
• Condor Cluster
• Success and Results
• Future Work
• Conclusions
2
Exponentially Improving Price-Performance
Measured by AFRL-Rome HPCs
[Chart: price-performance of successive AFRL-Rome HPCs, log scale]
• 500 TFLOP Cell-GPGPU cluster: 250 TFLOPS/$M
• 53 TFLOP Cell cluster: 147 TFLOPS/$M
• Heterogeneous HPC (Xeon + FPGA): 81 TOPS/$M
• SKY (PowerPC): 200 GFLOPS/$M
• Intel Paragon (i860): 12 GFLOPS/$M
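A quick check of the headline price-performance point, using the approximately $2M Condor cost quoted later in this deck (an assumption here, since the chart itself does not show dollar amounts):

$$\frac{500\ \text{TFLOPS}}{\$2\text{M}} = 250\ \text{TFLOPS}/\$\text{M}$$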
3
Agenda
• Mission
• RI HPC-ARC & HPC Systems
• Condor Cluster
• Success and Results
• Future Work
• Conclusions
4
Mission
• Objective: Support CS&E R&D and HPC-to-the-field experiments by providing interactive access to hardware, software, and user services, with special attention to applications and missions supporting C4ISR.
• Technical Mission: Provide classical and unique, real-time, interactive HPC resources to the AF and DoD R&D community.
5
Agenda
• Mission
• RI HPC-ARC & HPC Systems
• Condor Cluster
• Success and Results
• Future Work
• Conclusions
6
HPC Facility Resources (May 2012)
Legend: HPC assets on the HPC DREN network; HPC SDREN assets
• Condor Cluster: 500 TFLOPS; funding: $2M HPCMP DHPI; online Nov 2010; urban surveillance, cognitive computing, quantum computing
• Cell BE Cluster: 53 TFLOPS peak performance
• HORUS: 22 TFLOPS; TTCP field experiments
• EMULAB: network emulation testbed
7
HPC Facility Resources – GPGPU Clusters
Legend: HPC GPGPU assets on the DREN network
• Condor Cluster: 500 TFLOPS; funding: $2M HPCMP DHPI; online Nov 2010; urban surveillance, cognitive computing, quantum computing
• ATI Cluster (Firebird): 32 TFLOPS; ATI FirePro V8800; online Jan 2011
• HORUS: 22 TFLOPS; TTCP field experiments
• Upgrade all Nvidia GPGPUs to C2050 & C2070 Tesla cards (June 2012)
• 30 Kepler cards (~90K) will give a 3x improvement (1.5 TFLOPS DP) at 220 W
• Condor is among the greenest HPCs in the world (1.25 GFLOPS/W, DP & SP)
• Redistribute 60 C1060 Tesla cards to other HPC and research sites (ASIC, UMASS, & ARSC)
8
Agenda
• Mission
• RI HPC-ARC & HPC Systems
• Condor Cluster
• Success and Results
• Future Work
• Conclusions
9
The Condor Cluster
FY10 DHPI key design considerations: price/performance & performance/Watt
1,716 Sony PlayStation 3s
• STI Cell Broadband Engine
  – PowerPC PPE
  – 6 SPEs
  – 256 MB RAM
84 head nodes
• 6 gateway access points
• 78 compute nodes
  – Intel Xeon X5650, dual-socket hexa-core
  – (2) NVIDIA Tesla GPGPUs
    • 54 nodes – (108) C2050
    • 24 nodes – (48) C2070/C2075
  – 24-48 GB RAM
10
Condor Cluster (500 TFLOPS)
Online: November 2010
• 263 TFLOPS from 1,716 PS3s
  – 153 GFLOPS/PS3
  – 78 subclusters of 22 PS3s
• 225 TFLOPS from server nodes
  – 84 server nodes (Intel Westmere X5650, dual-socket hexa-core, 12 cores each)
  – Dual GPGPUs in 78 server nodes
• Firebird Cluster (~32 TFLOPS)
• Cost: approx. $2M
• Sustained throughput benchmarks/applications YTD: Xeon X5650: 16.8 TFLOPS; Cell: 171.6 TFLOPS; C2050: 68.2 TFLOPS; C2070: 34 TFLOPS; CONDOR TOTAL: 290.6 TFLOPS
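As a consistency check on the peak figures above (using only numbers quoted on this slide):

$$1{,}716 \times 153\ \text{GFLOPS} \approx 262.5\ \text{TFLOPS}$$
$$263 + 225 = 488\ \text{TFLOPS},\ \text{roughly the advertised 500 TFLOPS peak}$$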
11
Condor Cluster Networks
10 GbE star-bonded hub
[Diagram: 10 GbE star-bonded network. Six racks (Rack 1-Rack 6), each with rack switches and servers, plus nodes labeled CS1-CS84 and CPS1-CPS78; bonded link groups (x13, x14, x22) connect the rack switches to a central switch hub and the Dell rack.]
12
Condor Cluster Networks
InfiniBand mesh, non-blocking, 20 Gb/s
(5) QLogic 12200 and (1) QLogic 12300 40 Gb/s InfiniBand switches (36 ports each)
[Diagram: InfiniBand mesh connecting Racks 1-6 (14 servers per rack) through switches A and B, with varying numbers of links between racks and switches.]
13
Condor Web Interface
14
Agenda
• Mission
• RI HPC-ARC & HPC Systems
• Condor Cluster
• Success and Results
• Future Work
• Conclusions
15
Solving Demanding, Real-Time Military Problems
• Occluded text recognition (example output: "…but beginning to perceive that the handcuffs were not for me and that the military had so far got…")
• Radar processing for high-resolution images
• Space object identification
16
RADAR Data Processing for
High Resolution Images
17
Optical Text Recognition Processing Performance
• Computing resources involved in this run:
  – 4 Condor servers – 32 Intel Xeon processor cores
  – 88 PlayStation 3s – 616 IBM Cell-BE processor cores
  – 40 Condor servers – 320 Intel Xeon processor cores
  – 880 PS3s – 6,160 IBM Cell-BE processor cores (21 pages/sec)
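If the 21 pages/sec figure is spread evenly across the 880 PS3s of the larger run (an assumption; the slide does not describe how pages were distributed), the implied per-console rate is:

$$\frac{21\ \text{pages/s}}{880\ \text{PS3s}} \approx 0.024\ \text{pages/s per PS3} \approx 42\ \text{s per page}$$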
18
Space Object Identification
[Figure: low-resolution frames reconstructed into a high-resolution image]
19
Matrix Multiply
[Chart: MAGMA-only, one-sided matrix factorization – GFLOPS vs. matrix size (0-12,000), Intel 5650 (12 cores) vs. Nvidia C2050; y-axis 0-500 GFLOPS]
[Chart: Matrix-matrix multiplication test, C2050 (MAGMA vs. CUBLAS) – GFLOPS vs. matrix size (0-12,000); y-axis 0-700 GFLOPS]
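For context, a minimal sketch of how a GEMM throughput curve like the ones above is typically measured with the standard cuBLAS API. The matrix size, the choice of double precision, the zero-filled test matrices, and the event-based timing are illustrative assumptions, not the actual Condor benchmark code:

```c
// dgemm_bench.cu -- minimal cuBLAS DGEMM throughput sketch (illustrative only)
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 8192;                 // square matrix size (assumed; fits in C2050 memory)
    const double alpha = 1.0, beta = 0.0;
    size_t bytes = (size_t)n * n * sizeof(double);

    double *A, *B, *C;
    cudaMalloc((void **)&A, bytes);
    cudaMalloc((void **)&B, bytes);
    cudaMalloc((void **)&C, bytes);
    // Contents are irrelevant for a throughput measurement; zero-fill for definiteness.
    cudaMemset(A, 0, bytes);
    cudaMemset(B, 0, bytes);
    cudaMemset(C, 0, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Warm-up call so one-time setup costs are not timed.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // DGEMM performs roughly 2*n^3 floating-point operations.
    double gflops = 2.0 * (double)n * n * n / (ms * 1e-3) / 1e9;
    printf("n=%d  time=%.2f ms  %.1f GFLOPS\n", n, ms, gflops);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```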
20
LAMMPS on GPUs
• Condor/Firebird provides access to next-generation hybrid CPU/GPU architectures
• Critical for understanding capability and operation prior to larger deployments
• Opportunity to study non-traditional applications of HPC, e.g., C4I applications
• CPU/GPU compute nodes provide significant raw computing power
• OpenCL N-body benchmark with 768K particles sustained ~2 TFLOPS using 4 Tesla C2050s or 3 FirePro V8800s (an illustrative kernel sketch follows the charts below)
• Production chemistry code (LAMMPS) shows speedup with minimal effort
• Original CPU code ported to OpenCL with limited source code modifications
• Exact double-precision algorithm runs on Nvidia and AMD nodes
• Overall platform capability increased by 2x (2.8x) without any GPU optimization
[Chart: OpenCL N-Body Benchmark¹ – loop time (sec) for 1-4 GPUs, Tesla C2050 vs. FirePro V8800; y-axis 0-2,000 sec]
[Chart: LAMMPS-OCL EAM Benchmark² (absolutely no GPU optimizations) – GFLOPS for Xeon X5660 (1, 4, 8 cores, MPI-modified), Tesla C2050 (1-4 GPUs), and FirePro V8800 (1-3 GPUs); y-axis 0-40 GFLOPS]
¹ BDT N-Body benchmark distributed with COPRTHR 1.1
² LAMMPS-OCL is a modified version of the LAMMPS molecular dynamics code ported to OpenCL by Brown Deer Technology
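The benchmark referenced above is the BDT N-body code distributed with COPRTHR (OpenCL, run on up to 4 GPUs). As a rough, single-GPU illustration of the brute-force O(N²) force loop such a benchmark times, here is a minimal CUDA analogue; the kernel structure, names, softening value, and placeholder data are assumptions, not the actual BDT/COPRTHR source:

```c
// nbody_sketch.cu -- illustrative all-pairs gravitational force kernel (not the BDT code)
#include <cstdio>
#include <cuda_runtime.h>

// Each thread accumulates the force on one particle from all N particles.
__global__ void nbody_forces(const float4 *pos,   // xyz = position, w = mass
                             float3 *force,       // output force per particle
                             int n, float soft2)  // soft2 = softening^2 (assumed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 pi = pos[i];
    float3 acc;
    acc.x = acc.y = acc.z = 0.0f;

    for (int j = 0; j < n; ++j) {
        float4 pj = pos[j];
        float dx = pj.x - pi.x;
        float dy = pj.y - pi.y;
        float dz = pj.z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz + soft2;
        float inv_r = rsqrtf(r2);
        float s = pj.w * inv_r * inv_r * inv_r;  // m_j / r^3
        acc.x += dx * s;
        acc.y += dy * s;
        acc.z += dz * s;
    }
    force[i] = acc;
}

int main(void) {
    const int n = 768 * 1024;  // 768K particles, as on the slide
    float4 *pos;
    float3 *force;
    cudaMalloc((void **)&pos, n * sizeof(float4));
    cudaMalloc((void **)&force, n * sizeof(float3));
    cudaMemset(pos, 0, n * sizeof(float4));  // placeholder data for the sketch

    dim3 block(256), grid((n + 255) / 256);
    nbody_forces<<<grid, block>>>(pos, force, n, 1e-4f);
    cudaDeviceSynchronize();
    printf("computed %d x %d interactions\n", n, n);

    cudaFree(pos);
    cudaFree(force);
    return 0;
}
```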
21
Agenda
• Mission
• RI HPC-ARC & HPC Systems
• Condor Cluster
• Success and Results
• Future Work
• Conclusions
22
Future Work
• Improved OTR applications
– Multiple languages
• Space Situation Awareness
– Heterogeneous algorithms
• Persistent Wide Area Surveillance
23
Autonomous Sensing in Persistent
Wide-Area Surveillance
• Cross-TD effort
– Investigate scalable, real-time and autonomous sensing technologies
– Develop a neuromorphic computing architecture for synthetic aperture radar
(SAR) imagery information exploitation
– Provide critical wide-area persistent surveillance capabilities including motion
detection, object recognition, areas-of-interest identification and predictive
sensing
24
Conclusions
– Valuable resource supporting the entire AFRL/RI, AFRL, and tri-service RDT&E community.
– Leading large GPGPU development and benchmarking tests.
– This investment is leveraged by many (130+) users.
– Technical benefits: faster, higher-fidelity problem solutions; multiple parallel solutions; heterogeneous application development.
25
Questions?
26