Blue Gene Experience at the National Center for Atmospheric Research
October 4, 2006
Theron Voran
[email protected]
Computer Science Section, National Center for Atmospheric Research
Department of Computer Science, University of Colorado at Boulder

Why Blue Gene?
- Extreme scalability, balanced architecture, simple design
- Efficient energy usage
- A change from IBM Power systems at NCAR, but familiar:
  - Programming model
  - Chip (similar to Power4)
  - Linux on front-end and I/O nodes
- Interesting research platform

Outline
- System Overview
- Applications
- In the Classroom
- Scheduler Development
- TeraGrid Integration
- Other Current Research Activities

Frost Fun Facts
- Collaborative effort: Univ of Colorado at Boulder (CU), NCAR, Univ of Colorado at Denver
- Debuted in June 2005, tied for 58th place on the Top500
- 5.73 Tflops peak, 4.71 sustained
- 25 kW loaded power usage
- 4 front-ends, 1 service node
- 6 TB usable storage
- Why is it leaning?
[Photo: Henry Tufo and Rich Loft, with Frost]

System Internals
[Block diagram: the Blue Gene/L system-on-a-chip -- two PPC440 cores, each with a "double FPU" and 32k/32k L1 caches; shared L2 and 4 MB EDRAM L3 with ECC; multiported shared SRAM buffer; 144-bit-wide DDR controller with ECC (256 MB per node); link logic for the torus (6 out and 6 in, each at 1.4 Gbit/s), the tree (3 out and 3 in, each at 2.8 Gbit/s), 4 global barriers or interrupts, Gbit Ethernet, and JTAG access]

More Details
- Chips:
  - PPC440 @ 700 MHz, 2 cores per node
  - 512 MB memory per node
  - Coprocessor vs Virtual Node mode
  - 1:32 I/O-to-compute ratio
- Storage:
  - 4 Power5 systems as GPFS cluster
  - NFS export to BG/L I/O nodes
- Interconnects:
  - 3D Torus (154 MB/s one direction)
  - Tree (354 MB/s)
  - Global Interrupt
  - GigE
  - JTAG/IDO

Frost Utilization
[Chart: BlueGene/L (frost) usage, January through September 2006]

HOMME
- High Order Method Modeling Environment
- Spectral element dynamical core
- Proved scalable on other platforms
- Cubed-sphere topology
- Space-filling curves

HOMME Performance
- Ported in 2004 on the BG/L prototype at TJ Watson, with the eventual goal of a Gordon Bell submission in 2005
- Serial and parallel obstacles:
  - SIMD instructions
  - Eager vs adaptive routing
  - Mapping strategies
- Result: good scalability out to 32,768 processors (3 elements per processor)
[Figure: snake mapping on an 8x8x8 3D torus; a sketch of this kind of mapping follows the Other Applications slide below]

HOMME Scalability on 32 Racks
[Scaling plot]

Other Applications
- Popular codes on Frost: WRF, CAM, POP, CICE, MPIKAIA, EULAG, BOB, PETSc
- Used as a scalability test bed, in preparation for runs on the 20-rack BG/W system
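The snake mapping shown in the HOMME Performance figure refers to laying MPI tasks out along a boustrophedon path through the torus so that consecutive tasks land on nearest-neighbor nodes. The sketch below only illustrates that general idea; it is not the HOMME mapping code, and the function name and 8x8x8 partition shape are assumptions for the example.

    # Illustration only: enumerate the nodes of an nx x ny x nz torus partition in
    # "snake" (boustrophedon) order, so consecutive MPI ranks sit on neighboring nodes.
    def snake_order(nx, ny, nz):
        coords = []
        row = 0
        for z in range(nz):
            # reverse the y sweep on every other plane
            ys = range(ny) if z % 2 == 0 else range(ny - 1, -1, -1)
            for y in ys:
                # reverse the x sweep on every other row
                xs = range(nx) if row % 2 == 0 else range(nx - 1, -1, -1)
                for x in xs:
                    coords.append((x, y, z))
                row += 1
        return coords

    if __name__ == "__main__":
        mapping = snake_order(8, 8, 8)   # the 8x8x8 partition from the figure
        # each rank is one torus hop from the next: rank 0 -> (0,0,0), rank 1 -> (1,0,0), ...
        for rank in range(4):
            print(rank, mapping[rank])

An ordering like this, handed to the job launcher as an explicit task map, aims to keep HOMME's nearest-neighbor element exchanges on physical torus links.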
Classroom Access
- Henry Tufo's 'High Performance Scientific Computing' course at the University of Colorado
- Let students loose on 2048 processors
- Thinking BIG:
  - Throughput and latency studies
  - Scalability tests: Conway's Game of Life (a sketch of this kind of test appears at the end of these notes)
  - Final projects
- Feedback from 'novice' HPC users

Cobalt
- Component-Based Lightweight Toolkit
- Open-source resource manager and scheduler
- Developed by ANL along with NCAR/CU
- Component architecture:
  - Communication via XML-RPC (a minimal component sketch appears at the end of these notes)
  - Process manager, queue manager, scheduler
- ~3000 lines of Python code
- Manages traditional clusters also
- http://www.mcs.anl.gov/cobalt

Cobalt Architecture
[Diagram: Cobalt component architecture]

Cobalt Development Areas
- Scheduler improvements:
  - Efficient packing
  - Multi-rack challenges
  - Simulation ability
  - Tunable scheduling parameters
- Visualization:
  - Aid in scheduler development
  - Give users (and admins) a better understanding of machine allocation
- Accounting / project management and logging
- Blue Gene/P
- TeraGrid integration

NCAR joins the TeraGrid, June 2006

TeraGrid Testbed
[Network diagram: experimental environment (CU experimental storage cluster, computational cluster, CSS switch, Enterprise 6000 systems) and production environment (NETS switch), connected to the TeraGrid, NLR, the NCAR 1GB Net, FRGP, and the Datagrid]

TeraGrid Activities
- Grid-enabling Frost:
  - Common TeraGrid Software Stack (CTSS)
  - Grid Resource Allocation Manager (GRAM) and Cobalt interoperability
  - Security infrastructure
- Storage cluster:
  - 16 OSTs, 50-100 TB usable storage
  - 10G connectivity
  - GPFS-WAN
  - Lustre-WAN

Other Current Research Activities
- Scalability of CCSM components: POP, CICE
- Scalable solver experiments
- Efficient communication mapping
- Coupled climate models: petascale parallelism
- Meta-scheduling: across sites; Cobalt vs other schedulers
- Storage: PVFS2 + ZeptoOS, Lustre

Frost has been a success as a ...
- Research experiment: utilization rates
- Educational tool: classroom; fertile ground for grad students
- Development platform: petascale problems; systems work

Questions?
[email protected]
https://wiki.cs.colorado.edu/BlueGeneWiki
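Two short sketches follow, tied to the Classroom Access and Cobalt slides above. First, the classroom scalability tests used Conway's Game of Life. The students' actual codes are not reproduced here; the fragment below is a hypothetical mpi4py version of such a test, with a row-decomposed periodic grid, ghost-row exchange, and a simple timing loop. The grid size, decomposition, and all names are assumptions for illustration.

    # Hypothetical scalability-test sketch (not the course code): Conway's Game of
    # Life on an N x N periodic grid, rows split across MPI ranks via mpi4py.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    N = 512                          # global grid; assumed divisible by the rank count
    rows = N // size
    # local block plus one ghost row above and below
    grid = (np.random.random((rows + 2, N)) < 0.25).astype(np.uint8)
    up, down = (rank - 1) % size, (rank + 1) % size

    def step(g):
        # refresh ghost rows from the neighboring ranks (periodic in both directions)
        comm.Sendrecv(sendbuf=g[1, :], dest=up, recvbuf=g[-1, :], source=down)
        comm.Sendrecv(sendbuf=g[-2, :], dest=down, recvbuf=g[0, :], source=up)
        # count the 8 neighbors of every interior cell
        counts = np.zeros((rows, N), dtype=np.uint8)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dy == 0 and dx == 0:
                    continue
                counts += np.roll(g[1 + dy:rows + 1 + dy, :], dx, axis=1)
        alive = g[1:-1, :]
        g[1:-1, :] = ((counts == 3) | ((alive == 1) & (counts == 2))).astype(np.uint8)

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(50):
        step(grid)
    comm.Barrier()
    if rank == 0:
        print("50 steps of %dx%d Life on %d ranks: %.3f s" % (N, N, size, MPI.Wtime() - t0))

Run under an MPI launcher (for example, mpiexec -n 4 python life.py) and repeat at increasing rank counts to get the kind of throughput and scaling numbers the course exercises were after.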
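Second, the Cobalt slides describe components (process manager, queue manager, scheduler) talking to one another over XML-RPC. The toy server below only illustrates that communication style using Python's standard xmlrpc module (Python 3 spelling); it is not Cobalt's actual interface, and the component name, methods, and port are invented for the example.

    # Toy "queue manager" component exposing a couple of methods over XML-RPC.
    # Illustrative only -- not Cobalt's real component API.
    from xmlrpc.server import SimpleXMLRPCServer

    class ToyQueueManager:
        def __init__(self):
            self.jobs = []

        def add_job(self, spec):
            """Accept a job description (a dict) and return its id."""
            self.jobs.append(spec)
            return len(self.jobs) - 1

        def get_jobs(self):
            """Return all queued job descriptions."""
            return self.jobs

    if __name__ == "__main__":
        server = SimpleXMLRPCServer(("localhost", 9000), allow_none=True)
        server.register_instance(ToyQueueManager())
        server.serve_forever()

Another component, say a scheduler, would reach it with xmlrpc.client.ServerProxy("http://localhost:9000") and call add_job or get_jobs; Cobalt uses the same XML-RPC pattern between its process manager, queue manager, and scheduler components.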