Slide deck transcript
Slide 1: Integration, Development and Results of the 500 Teraflop Heterogeneous Cluster (Condor)
11 September 2012
Mark Barnell, Air Force Research Laboratory
Integrity - Service - Excellence
DISTRIBUTION A. Approved for public release; distribution unlimited (88ABW-2011-3208, 06 Jun 2011)

Slide 2: Agenda
• Mission
• RI HPC-ARC & HPC Systems
• Condor Cluster
• Success and Results
• Future Work
• Conclusions

Slide 3: Exponentially Improving Price-Performance Measured by AFRL-Rome HPCs
[Chart: price-performance of successive AFRL-Rome HPC systems, plotted on a logarithmic scale from 10 to 1M]
• INTEL PARAGON (i860): 12 GFLOPS/$M
• SKY (PowerPC): 200 GFLOPS/$M
• Heterogeneous HPC, XEON + FPGA: 81 TOPS/$M
• 53 TFLOP Cell Cluster: 147 TFLOPS/$M
• 500 TFLOP Cell-GPGPU: 250 TFLOPS/$M

Slide 4: Agenda (section divider: Mission)

Slide 5: Mission
• Objective: Support CS&E R&D, along with HPC-to-the-field experiments, by providing interactive access to hardware, software and user services, with special attention to applications and missions supporting C4ISR.
• Technical Mission: Provide classical and unique, real-time, interactive HPC resources to the AF and DoD R&D community.

Slide 6: Agenda (section divider: RI HPC-ARC & HPC Systems)

Slide 7: HPC Facility Resources
[Diagram: HPC assets on the SDREN and DREN networks, May 2012]
• Condor Cluster: 500 TFLOPS; online Nov 2010; funding: $2M HPCMP DHPI; urban surveillance, cognitive computing, quantum computing
• Cell BE Cluster: 53 TFLOPS peak performance
• HORUS: 22 TFLOPS; TTCP field experiments
• EMULAB: network emulation testbed

Slide 8: HPC Facility Resources (GPGPU clusters)
[Diagram: GPGPU assets on the DREN network]
• Condor Cluster: 500 TFLOPS; online Nov 2010; funding: $2M HPCMP DHPI; urban surveillance, cognitive computing, quantum computing
• ATI Cluster (FirePro V8800): 32 TFLOPS; online Jan 2011
• HORUS: 22 TFLOPS; TTCP field experiments
Updates:
• Upgrade all NVIDIA GPGPUs to Tesla C2050 and C2070 cards, June 2012
• 30 Kepler cards (~$90K) will give a 3x improvement (1.5 TFLOPS DP) at 220 W
• Condor is among the greenest HPC systems in the world (1.25 GFLOPS/W, DP & SP)
• Redistribute 60 Tesla C1060 cards to other HPC and research sites (ASIC, UMASS & ARSC)

Slide 9: Agenda (section divider: Condor Cluster)

Slide 10: The Condor Cluster (FY10 DHPI)
Key design considerations: price/performance and performance/Watt
• 1,716 Sony PlayStation 3s
  – STI Cell Broadband Engine: PowerPC PPE plus 6 SPEs
  – 256 MB RAM
• 84 head nodes
  – 6 gateway access points, 78 compute nodes
  – Intel Xeon X5650, dual-socket hexa-core
  – (2) NVIDIA Tesla GPGPUs each: 54 nodes with (108) C2050s, 24 nodes with (48) C2070/C2075s
  – 24–48 GB RAM
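The per-PS3 and aggregate figures quoted on the next slide are consistent with these specs. A back-of-envelope check (the 25.6 GFLOPS single-precision peak per SPE at 3.2 GHz is a commonly cited Cell figure, not stated in the deck):

```latex
% Per-PS3 peak from the 6 available SPEs (assumed 25.6 GFLOPS SP each):
\[ 6 \times 25.6\,\text{GFLOPS} \approx 153.6\,\text{GFLOPS per PS3} \]
% Aggregate over the full complement of consoles, using the deck's 153 GFLOPS/PS3:
\[ 1716 \times 153\,\text{GFLOPS} \approx 262.5\,\text{TFLOPS} \approx 263\,\text{TFLOPS} \]
```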
Slide 11: Condor Cluster (500 TFLOPS)
Online: November 2010. Cost: approx. $2M.
• 263 TFLOPS from 1,716 PS3s
  – 153 GFLOPS per PS3
  – 78 subclusters of 22 PS3s each
• 225 TFLOPS from server nodes
  – 84 server nodes (Intel Westmere X5650, dual-socket hexa-core, 12 cores per node)
  – Dual GPGPUs in 78 of the server nodes
• Firebird Cluster: ~32 TFLOPS
• Sustained throughput on benchmarks/applications YTD: Xeon X5650 16.8 TFLOPS; Cell 171.6 TFLOPS; C2050 68.2 TFLOPS; C2070 34 TFLOPS. Condor total: 290.6 TFLOPS.

Slide 12: Condor Cluster Networks (10 GbE)
[Diagram: 10 GbE star-bonded hub topology. Six racks of 14 servers each (CS1–CS84) and the attached PS3 subclusters (CPS1–CPS78, 22 PS3s per subcluster) hang off per-rack switches in bonded configurations, which uplink to a central switch hub in the Dell rack.]

Slide 13: Condor Cluster Networks (InfiniBand)
[Diagram: non-blocking 20 Gb/s InfiniBand mesh across the six racks of 14 servers, built from (5) QLogic 12200 and (1) QLogic 12300 40 Gb/s 36-port InfiniBand switches, with varying numbers of inter-switch links between racks.]

Slide 14: Condor Web Interface
[Screenshot: the Condor web interface]

Slide 15: Agenda (section divider: Success and Results)

Slide 16: Solving Demanding, Real-Time Military Problems
• Occluded text recognition (sample output: "…but beginning to perceive that the handcuffs were not for me and that the military had so far got….")
• Radar processing for high-resolution images
• Space object identification

Slide 17: RADAR Data Processing for High Resolution Images
[Example radar imagery]
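The deck shows only example imagery here and does not say which image-formation algorithm was run on Condor. Purely as an illustration of why radar image formation maps well onto GPGPUs, the following is a minimal, hypothetical CUDA backprojection-style kernel: one thread per output pixel, each coherently summing range-compressed pulse data. Every name and parameter below is illustrative; this is not AFRL's code.

```cuda
// Hypothetical sketch only: minimal SAR backprojection, one thread per pixel.
// Each pixel sums the nearest range-bin sample from every pulse, rotated by
// the round-trip phase 4*pi*r/lambda. Interpolation and weighting omitted.
#include <cuda_runtime.h>

__global__ void backproject(const float2* profiles,  // [nPulses][nBins] range-compressed samples
                            const float3* apc,       // antenna phase center per pulse
                            int nPulses, int nBins,
                            float r0, float dr,      // range of first bin, bin spacing (m)
                            float fourPiOverLambda,  // 4*pi / wavelength
                            float2* image, int nx, int ny,
                            float pixSpacing)        // ground-plane pixel spacing (m)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= nx || py >= ny) return;

    // Ground-plane coordinates of this pixel, scene centered at the origin.
    float gx = (px - nx / 2) * pixSpacing;
    float gy = (py - ny / 2) * pixSpacing;

    float2 acc = make_float2(0.0f, 0.0f);
    for (int p = 0; p < nPulses; ++p) {
        float dx = gx - apc[p].x, dy = gy - apc[p].y, dz = -apc[p].z;
        float r = sqrtf(dx * dx + dy * dy + dz * dz);   // slant range to pixel
        int bin = __float2int_rn((r - r0) / dr);        // nearest range bin
        if (bin < 0 || bin >= nBins) continue;
        float2 s = profiles[p * nBins + bin];
        float sn, cs;
        sincosf(fourPiOverLambda * r, &sn, &cs);        // matched-phase rotation
        acc.x += s.x * cs - s.y * sn;
        acc.y += s.x * sn + s.y * cs;
    }
    image[py * nx + px] = acc;
}
```

Because every pixel is independent, kernels of this shape scale across many GPUs by tiling the image, which is the kind of parallelism the Condor server nodes expose.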
Slide 18: Optical Text Recognition Processing Performance
• Computing resources involved in this run:
  – 4 Condor servers / 32 Intel Xeon processor cores / 88 PlayStation 3s / 616 IBM Cell BE processor cores
  – 40 Condor servers / 320 Intel Xeon processor cores / 880 PS3s / 6,160 IBM Cell BE processor cores (21 pages/sec)

Slide 19: Space Object Identification
[Images: low-resolution input frames and the resulting high-resolution image]

Slide 20: Matrix Multiply
[Chart 1: MAGMA-only, one-sided matrix factorization; GFLOPS (0–500) vs. matrix size (0–12,000); series: Intel 5650 (12 cores) and NVIDIA C2050]
[Chart 2: matrix-matrix multiplication test on the C2050, MAGMA vs. CUBLAS; GFLOPS (0–700) vs. matrix size (0–12,000)]
(A sketch of how such GEMM data points are typically measured follows slide 21.)

Slide 21: LAMMPS on GPUs
• Condor/Firebird provides access to next-generation hybrid CPU/GPU architectures
• Critical for understanding the capability and operation prior to larger deployments
• Opportunity to study non-traditional applications of HPC, e.g., C4I applications
• CPU/GPU compute nodes provide significant raw computing power
  – OpenCL N-body benchmark [1] with 768K particles: sustained performance ~2 TFLOPS using 4 Tesla C2050s or 3 FirePro V8800s
• Production chemistry code (LAMMPS) shows speedup with minimal effort
  – Original CPU code ported to OpenCL with limited source-code modifications
  – Exact double-precision algorithm runs on both NVIDIA and AMD nodes
  – Overall platform capability increased by 2x (2.8x) without any GPU optimization
[Chart 1: OpenCL N-body benchmark (MPI-modified), GFLOPS for 1–4 GPUs, Tesla C2050 vs. FirePro V8800]
[Chart 2: LAMMPS-OCL EAM benchmark [2], loop time (sec) on Xeon X5660 (1–8 cores) vs. Tesla C2050 and FirePro V8800 (1–4 GPUs); absolutely no GPU optimizations]
[1] BDT N-body benchmark distributed with COPRTHR 1.1
[2] LAMMPS-OCL is a modified version of the LAMMPS molecular dynamics code, ported to OpenCL by Brown Deer Technology
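The GFLOPS curves on slide 20 come from MAGMA's and CUBLAS's testers; the deck does not include the harness. As a minimal sketch of how one such data point is commonly collected with the cuBLAS v2 API (the matrix size and all names here are illustrative), time an SGEMM and convert via the standard 2n^3 operation count:

```cuda
// Minimal cuBLAS SGEMM timing sketch (illustrative, not the deck's harness).
// GFLOPS = 2*n^3 / elapsed_seconds / 1e9 for an n x n x n multiply.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 8192;                        // one point on the matrix-size axis
    const size_t bytes = (size_t)n * n * sizeof(float);
    std::vector<float> h(n * (size_t)n, 1.0f); // dummy input data

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, h.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, h.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Warm up once so one-time initialization is not counted, then time one call.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("n=%d  %.1f GFLOPS\n", n, 2.0 * n * (double)n * n / (ms * 1e-3) / 1e9);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

The slide-21 N-body figure was measured with the BDT benchmark shipped with COPRTHR (OpenCL), which is not reproduced here. As a sketch of the all-pairs technique being measured, here is the textbook tiled formulation, written in CUDA for consistency with the example above rather than in OpenCL:

```cuda
// Illustrative all-pairs N-body force kernel (one thread per body), with the
// shared-memory tiling typically used on Fermi-class GPUs such as the C2050.
// Launch with blockDim.x == TILE so each thread stages exactly one body.
#include <cuda_runtime.h>

#define TILE 256

__global__ void nbodyForces(const float4* pos,  // xyz = position, w = mass
                            float4* acc, int n, float softening2)
{
    __shared__ float4 tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 pi = (i < n) ? pos[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float3 a = make_float3(0.f, 0.f, 0.f);

    for (int base = 0; base < n; base += TILE) {
        // Stage one tile of bodies into shared memory.
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();
        for (int k = 0; k < TILE && base + k < n; ++k) {
            float4 pj = tile[k];
            float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
            float r2 = dx * dx + dy * dy + dz * dz + softening2;
            float inv = rsqrtf(r2);
            float s = pj.w * inv * inv * inv;   // m_j / r^3
            a.x += dx * s; a.y += dy * s; a.z += dz * s;
        }
        __syncthreads();
    }
    if (i < n) acc[i] = make_float4(a.x, a.y, a.z, 0.f);
}
```

If the benchmark counts the conventional ~20 flops per body-body interaction (an assumption; the deck does not state its convention), 768K particles imply roughly (7.9 x 10^5)^2 x 20 ≈ 1.2 x 10^13 flops per force evaluation, i.e., about 6 seconds per step at the quoted ~2 TFLOPS.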
Slide 22: Agenda (section divider: Future Work)

Slide 23: Future Work
• Improved OTR applications
  – Multiple languages
• Space Situational Awareness
  – Heterogeneous algorithms
• Persistent Wide-Area Surveillance

Slide 24: Autonomous Sensing in Persistent Wide-Area Surveillance
• Cross-TD effort
  – Investigate scalable, real-time and autonomous sensing technologies
  – Develop a neuromorphic computing architecture for synthetic aperture radar (SAR) imagery information exploitation
  – Provide critical wide-area persistent surveillance capabilities, including motion detection, object recognition, areas-of-interest identification and predictive sensing

Slide 25: Conclusions
• A valuable resource supporting the entire AFRL/RI, AFRL and tri-service RDT&E community
• Leading large GPGPU development and benchmarking tests
• The investment is leveraged by many (130+) users
• Technical benefits: faster, higher-fidelity problem solutions; multiple parallel solutions; heterogeneous application development

Slide 26: Questions?
DISTRIBUTION STATEMENT A – Unclassified, Unlimited Distribution