Transcript Précis
Threading Opportunities in High-Performance Flash-Memory Storage

Craig Ulmer, Sandia National Laboratories, California
Maya Gokhale, Lawrence Livermore National Laboratory

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Revolutionary Storage Technologies
• Storage-Intensive Supercomputing (SISC) at LLNL
  – System architectures for applications with massive datasets
  – New technologies: processing elements, networks, and storage
• NAND-Flash storage in high-performance computing
  – Flash chips have great potential: 100x access times, 10x bandwidth
  – However, few commercial products have delivered performance
• Exception: Fusion-io’s ioDrive
  – PCIe x4 card with 80-320 GB of flash
  – Theoretical read speed of 700 MB/s
  – Hardware allows many IOPs to be in-flight concurrently

Threaded I/O Microbenchmarks
• Observation: Increasing number of IOPs improves performance
  – Opposite of what we expect from hard drives
  – Due to flash memory packaging: chip is actually a die stack
• Implemented a set of I/O microbenchmarks to investigate
  – Threaded with mixed I/O characteristics (mostly read-only)
  – Currently: Block transfer, kNN, external sort, binary search
• Example: k-Nearest Neighbors (kNN)
  – Stream through all training vectors and find k vectors that are most similar to each input vector
  – Each thread works on portion of training data

kNN Results
• Single ioDrive vs. three SATA hard drives in RAID0
  – ioDrive provided a 3x improvement to end application
  – Small number of threads can have large impact with flash memory

[Figure: two plots of kNN run time. Left: Time (s) vs. Input Vectors per Pass (1 to 64) for SATA RAID and ioDrive with 1, 2, and 4 threads. Right: Time (s) vs. Threads (1 to 128) for SATA RAID and ioDrive, single pass and 32 vectors per pass.]
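
The kNN microbenchmark described above, in which each thread works on a portion of the training data, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`squared_distance`, `knn_partition`, `threaded_knn`) are hypothetical, squared Euclidean distance is an assumed similarity measure (the slides do not name one), and an in-memory list stands in for training vectors that the real benchmark streams from storage.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def squared_distance(a, b):
    # Assumed similarity measure: smaller squared Euclidean distance = more similar.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_partition(training, start, stop, query, k):
    # Scan one contiguous slice of the training data and keep the k best
    # (distance, index) pairs found in that slice.
    candidates = (
        (squared_distance(training[i], query), i) for i in range(start, stop)
    )
    return heapq.nsmallest(k, candidates)


def threaded_knn(training, query, k, num_threads):
    # Split the training data into roughly equal contiguous slices,
    # one per thread (the last slice may be shorter).
    n = len(training)
    chunk = max(1, (n + num_threads - 1) // num_threads)
    ranges = [(s, min(s + chunk, n)) for s in range(0, n, chunk)]

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = pool.map(
            lambda r: knn_partition(training, r[0], r[1], query, k), ranges
        )
        # Merge the per-thread results into the global k nearest neighbors.
        merged = heapq.nsmallest(k, (pair for part in partials for pair in part))

    return [index for _, index in merged]


if __name__ == "__main__":
    training = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.1, 0.0], [9.0, 9.0]]
    print(threaded_knn(training, [0.0, 0.0], k=2, num_threads=4))
```

In this sketch the threads only partition the distance computation. In the benchmark setting each slice would be read from the storage device, so several threads keep several I/O requests in flight at once, which is the behavior the slides report as beneficial for flash.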