Transcript Document
Gordon: NSF Flash-based System for Data-intensive Science
Mahidhar Tatineni
37th HPC User Forum, Seattle, WA, Sept. 14, 2010
PIs: Michael L. Norman, Allan Snavely
San Diego Supercomputer Center, University of California, San Diego

What is Gordon?
• A "data-intensive" supercomputer based on SSD flash memory and virtual shared memory software
• Emphasizes MEM and IOPS over FLOPS
• A system designed to accelerate access to the massive databases being generated in all fields of science, engineering, medicine, and social science
• The NSF's most recent Track 2 award to the San Diego Supercomputer Center (SDSC)
• Coming Summer 2011

Michael L. Norman – Principal Investigator; Director, SDSC
Allan Snavely – Co-Principal Investigator; Project Scientist

Why Gordon?
• Growth of digital data is exponential – a "data tsunami"
• Driven by advances in digital detectors, networking, and storage technologies
• Making sense of it all is the new imperative:
  • data analysis workflows
  • data mining
  • visual analytics
  • multiple-database queries
  • data-driven applications

Cosmological Dark Energy Surveys
Survey                          | Area (sq. deg.) | Start date | Image Data (PB) | Object Catalog (PB)
Pan-STARRS-1                    | 30,000          | 2009       | 1.5             | 0.1
Dark Energy Survey              | 5,000           | 2011       | 2.4             | 0.1
Pan-STARRS-4                    | 30,000          | 2012       | 20              | 0.2
Large Synoptic Survey Telescope | 20,000          | ~2015      | 60              | 30
Joint Dark Energy Mission       | 28,000          | ~2015      | ~60             | ~30
[Image: the accelerating universe]

Gordon is designed specifically for data-intensive HPC applications
• Such applications involve "very large data-sets or very large input-output requirements" (NSF Track 2D RFP)
• Two data-intensive application classes are important and growing:
  • Data Mining – "the process of extracting hidden patterns from data… with the amount of data doubling every three years, data mining is becoming an increasingly important tool to transform this data into information." (Wikipedia)
  • Data-Intensive Predictive Science – solution of scientific problems via simulations that generate large amounts of data

Red Shift: Data keeps moving further away from the CPU with every turn of Moore's Law
[Chart: latency in nanoseconds (log scale, 0.01 to 1000) versus year (1982–2007) for CPU cycle time, multi-core effective cycle time, memory access time, and disk access time; data due to Dean Klein of Micron]

The Memory Hierarchy of a Typical HPC Cluster
[Diagram: shared memory programming and message passing programming layers sit above disk I/O, separated by a latency gap]

The Memory Hierarchy of Gordon
[Diagram: the shared memory programming layer sits directly above disk I/O]

Gordon Architecture: "Supernode"
• 32 Appro Extreme-X compute nodes
  • Dual-processor Intel Sandy Bridge
  • 240 GFLOPS and 64 GB RAM per node
• 2 Appro Extreme-X I/O nodes
  • Intel SSD drives, 4 TB each
  • 560,000 IOPS
• ScaleMP vSMP virtual shared memory
  • 2 TB RAM aggregate
  • 8 TB SSD aggregate
[Diagram: 240 GF / 64 GB RAM compute nodes and 4 TB SSD I/O nodes tied together by vSMP memory virtualization]
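A quick arithmetic check of the supernode figures above (a minimal sketch, not part of the original slides; the per-node numbers are taken directly from the bullet list):

```python
# Sanity check of the Gordon "supernode" aggregates quoted above.
compute_nodes = 32        # Appro Extreme-X compute nodes per supernode
ram_per_node_gb = 64      # GB of RAM per compute node
gflops_per_node = 240     # peak GFLOPS per compute node
io_nodes = 2              # Appro Extreme-X I/O nodes per supernode
ssd_per_io_node_tb = 4    # TB of Intel SSD per I/O node

ram_aggregate_tb = compute_nodes * ram_per_node_gb / 1024       # -> 2 TB
ssd_aggregate_tb = io_nodes * ssd_per_io_node_tb                # -> 8 TB
tflops_per_supernode = compute_nodes * gflops_per_node / 1000   # -> 7.68 TFLOPS

print(f"RAM aggregate per supernode: {ram_aggregate_tb:.1f} TB")
print(f"SSD aggregate per supernode: {ssd_aggregate_tb} TB")
print(f"Peak compute per supernode:  {tflops_per_supernode:.2f} TFLOPS")
# 32 supernodes give ~245.8 TFLOPS, which the slides round to 245 TFLOPS.
print(f"Full machine (32 supernodes): {32 * tflops_per_supernode:.1f} TFLOPS")
```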
Gordon Architecture: Full Machine
• 32 supernodes = 1024 compute nodes
• Dual-rail QDR InfiniBand network
  • 3D torus (4x4x4)
• 4 PB rotating-disk parallel file system
  • >100 GB/s
[Diagram: 32 supernodes (SN) on the torus network with attached disk (D) units]

Gordon Aggregate Capabilities
Speed              | 245 TFLOPS
Mem (RAM)          | 64 TB
Mem (SSD)          | 256 TB
Mem (RAM+SSD)      | 320 TB
Ratio (MEM/SPEED)  | 1.31 bytes/flop
IO rate to SSDs    | 35 million IOPS
Network bandwidth  | 16 GB/s bi-directional
Network latency    | 1 μsec
Disk storage       | 4 PB
Disk IO bandwidth  | >100 GB/sec

Dash*: a working prototype of Gordon
* Available as a TeraGrid resource

DASH Architecture
[Diagram: 16 compute nodes, each with 2 Nehalem quad-core CPUs and 48 GB DDR3 memory, connected via InfiniBand to an I/O node containing a CPU and memory, RAID controllers/HBAs (Host Bus Adapters), and 16 Intel® X25-E 64 GB flash drives]

DASH Architecture: Just a cluster???
[Diagram: the same 16 compute nodes and I/O node on InfiniBand, viewed as an ordinary cluster]

DASH Architecture: Supernodes via vSMP
[Diagram: four vSMP supernodes, each aggregating 128 cores, 768 GB DRAM, and 1 TB of flash drives]

I/O Node: Original Configuration
• Achieved only 15% of the upper bound after exhaustive tuning; the embedded processor is the bottleneck.
[Diagram: software RAID 0 spanning two hardware RAID 0 sets, each RAID controller driving 8 x 64 GB flash drives]

I/O Node: New Configuration
• Achieved up to 80% of the upper bound. Peripheral hardware designed for spinning disks cannot satisfy flash drives!
[Diagram: software RAID 0 spanning four HBAs, each driving 4 x 64 GB flash drives]

I/O System Configuration
• CPU: 2 Intel® Nehalem quad-core 2.4 GHz
• Memory: 48 GB DDR3
• Flash drives: 16 Intel® X25-E 64 GB SLC SATA
• Benchmarks: IOR and XDD
• File system: XFS

DASH I/O Node Testing*
• Distributed shared memory + flash drives
• Prototype system: DASH
• Design space exploration with 16,800 tests (see the sketch after this slide):
  • stripe sizes
  • stripe widths
  • file systems
  • I/O schedulers
  • queue depths
• Revealed hardware and software issues:
  • RAID controller processor
  • MSI per-vector masking
  • file system cannot handle high IOPS
  • I/O schedulers designed for spinning disks
• Tuning during testing improved performance by about 9x.
* Detailed info in TeraGrid 2010 paper: "DASH-IO: an Empirical Study of Flash-based IO for HPC," Jiahua He, Jeffrey Bennett, and Allan Snavely.
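The 16,800-test figure above is the size of the cross-product of the tuning dimensions listed on the slide. The sketch below shows how such a sweep can be enumerated; the parameter values are hypothetical placeholders, not the ones actually used (those are documented in the TeraGrid 2010 paper).

```python
# Illustrative enumeration of an I/O tuning design space like the one swept
# on the DASH I/O nodes. The value lists below are hypothetical placeholders;
# the real study covered 16,800 combinations, these give only 1,200.
from itertools import product

stripe_sizes  = [64, 128, 256, 512, 1024]            # KB
stripe_widths = [2, 4, 8, 16]                        # drives per RAID-0 set
file_systems  = ["xfs", "ext3", "raw"]
io_schedulers = ["noop", "deadline", "cfq", "anticipatory"]
queue_depths  = [1, 4, 8, 16, 32]

configs = list(product(stripe_sizes, stripe_widths, file_systems,
                       io_schedulers, queue_depths))
print(f"{len(configs)} configurations in this hypothetical sweep")

# Each configuration would then be benchmarked (e.g. with IOR or XDD)
# and its random-read IOPS and bandwidth recorded for comparison.
for stripe_kb, width, fs, sched, qd in configs[:3]:
    print(f"stripe={stripe_kb}KB width={width} fs={fs} sched={sched} qd={qd}")
```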
The I/O Tuning Story
[Chart: normalized random read IOPS. The untuned default configuration is normalized to 1; the basic RAID and software RAID configurations, without and with the important tunings, reach 1.9, 5.6, and 12.4; a RAM drive reference point sits at 98.8]

DASH vSMP Node Tests
• STREAM memory benchmark
• GAMESS standard benchmark problem
• Early user examples:
  • Genomic sequence assembly code (Velvet), successfully run on Dash with 178 GB of memory used
  • CMMAP cloud data analysis with data transposition in memory; early tests run with ~130 GB of data

DASH – Applications using flash as scratch space
• Several HPC applications have a substantial per-core (local) I/O component, primarily scratch files. Examples are Abaqus, NASTRAN, and Q-Chem.
• Standard Abaqus test cases (S2A1, S4B) were run on Dash to compare performance between local hard disk and SSDs:
Test                   | HDD      | SSD
S2A1, 8 cores, 1 node  | 4058.2 s | 2964.9 s
S4B, 8 cores, 1 node*  | 12198 s  | 9604.8 s
* Lustre is not optimal for such I/O; the S4B test took 18920 s on Lustre.

Dash wins the Storage Challenge at SC09

3 Application Benchmarks
• Massive graph traversal
  • Semantic web search
• Astronomical database queries
  • Identification of transient objects
• Biological networks database queries
  • Protein interactions

Graph Networks Search (BFS)
• Breadth-first search on a 200 GB graph network (a minimal BFS sketch appears after the Trestles slide below)

NIH Biological Networks Pathway Analysis
• Interaction networks or graphs occur in many disciplines, e.g. epidemiology, phylogenetics, and systems biology
• Performance is limited by the latency of a database query and the aggregate amount of fast storage available

Biological Networks Query Timing
[Chart: biological networks query timing]

Palomar Transient Factory (PTF)
• Nightly wide-field surveys using the Palomar Schmidt telescope
• Image data sent to LBL for archive/analysis
• 100 new transients every minute
• Large, random queries across multiple databases for IDs

PTF-DB Transient Search
Query        | DASH-IO SSD  | Existing DB
Forward Q1   | 11 s (145x)  | 1600 s
Backward Q1  | 100 s (24x)  | 2400 s
Random queries requesting very small chunks of data about the candidate observations.

Trestles – Coming Soon!
• Trestles will work with and span the deployments of SDSC's recently introduced Dash system and the larger Gordon data-intensive system.
• To be configured by SDSC and Appro, Trestles is based on quad-socket, 8-core AMD Magny-Cours compute nodes connected via a QDR InfiniBand fabric.
• Trestles will have 324 nodes, 10,368 processor cores, a peak speed of 100 teraflop/s, and 38 terabytes of flash memory.
• Each of the 324 nodes will have 32 cores, 64 GB of DDR3 memory, and 120 GB of flash memory.
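For reference, the graph benchmark above uses a standard breadth-first search. The sketch below is a minimal in-memory version; the actual benchmark traversed a 200 GB graph resident on flash, which requires an out-of-core adjacency store rather than the Python dictionary used here.

```python
# Minimal breadth-first search over an adjacency-list graph.
# Illustrative only: the Dash benchmark traversed a 200 GB graph held on
# flash, so its adjacency lists live in on-flash storage, not a dict.
from collections import deque

def bfs(adjacency, source):
    """Return {vertex: hop distance from source} for all reachable vertices."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adjacency.get(u, ()):
            if v not in dist:          # first time v is reached
                dist[v] = dist[u] + 1
                frontier.append(v)
    return dist

# Toy usage:
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bfs(graph, "a"))   # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```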
Conclusions
• Gordon's architecture is customized for data-intensive applications, but built entirely from commodity parts
• Basically a Linux cluster with:
  • large RAM memory per core
  • a large amount of flash SSD
  • virtual shared memory software
  • 10 TB of storage accessible from a single core
  • shared memory parallelism for higher performance
• The Dash prototype accelerates real applications by 2-100x relative to disk, depending on memory access patterns
  • Random I/O is accelerated the most

Dash – STREAM on vSMP (test results)
[Chart: STREAM test results on the Dash vSMP node]
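For context on the STREAM results referenced above: STREAM measures sustainable memory bandwidth with simple vector kernels. Below is a minimal NumPy rendition of the triad kernel, offered only as an illustration; the vSMP tests used the standard STREAM benchmark itself, not this script.

```python
# Minimal NumPy version of the STREAM "triad" kernel (a = b + scalar * c),
# timed to estimate memory bandwidth. Illustrative only.
import time
import numpy as np

n = 20_000_000                      # ~160 MB per array of 8-byte doubles
b = np.random.rand(n)
c = np.random.rand(n)
scalar = 3.0

t0 = time.perf_counter()
a = b + scalar * c                  # triad: two reads and one write per element
elapsed = time.perf_counter() - t0

bytes_moved = 3 * n * 8             # three arrays of doubles touched
print(f"Triad bandwidth: {bytes_moved / elapsed / 1e9:.1f} GB/s")
```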