Transcript Précis

Threading Opportunities in High-Performance Flash-Memory Storage
Craig Ulmer, Sandia National Laboratories, California
Maya Gokhale, Lawrence Livermore National Laboratory
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States
Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000
Revolutionary Storage Technologies
• Storage-Intensive Supercomputing (SISC) at LLNL
– System architectures for applications with massive datasets
– New technologies: processing elements, networks, and storage
• NAND-Flash storage in high-performance computing
– Flash chips have great potential: 100x faster access times, 10x higher bandwidth
– However, few commercial products have delivered that performance
• Exception: Fusion-io’s ioDrive
– PCIe x4 card with 80-320 GB of flash
– Theoretical read speed of 700 MB/s
– Hardware allows many IOPs to be in flight concurrently
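A minimal sketch of how an application might keep several I/O operations in flight at once, assuming a plain file stands in for the flash device. The names `read_block` and `threaded_read` and the 4096-byte block size are hypothetical and do not come from the slides. Each worker opens its own file handle so that reads issued by different threads can overlap.

```python
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096  # hypothetical block size, not taken from the slides


def read_block(path, block_index, block_size=BLOCK_SIZE):
    """Read one block through a private file handle (independent file position)."""
    with open(path, "rb") as f:
        f.seek(block_index * block_size)
        return f.read(block_size)


def threaded_read(path, num_threads, block_size=BLOCK_SIZE):
    """Read every block of the file from a pool of num_threads workers.

    Returns the total number of bytes read.
    """
    num_blocks = (os.path.getsize(path) + block_size - 1) // block_size
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        blocks = pool.map(lambda i: read_block(path, i, block_size),
                          range(num_blocks))
        return sum(len(b) for b in blocks)
```

Timing `threaded_read` for a range of `num_threads` values on a large file is one way to see whether a device benefits from concurrent requests. The operating system's page cache would need to be bypassed or flushed for such timings to reflect the device itself.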
Threaded I/O Microbenchmarks
• Observation: Increasing the number of concurrent IOPs improves performance
– Opposite of what we expect from hard drives
– Due to flash memory packaging: chip is actually a die stack
• Implemented a set of I/O microbenchmarks to investigate
– Threaded with mixed I/O characteristics (mostly read-only)
– Currently: Block transfer, kNN, external sort, binary search
• Example: k-Nearest Neighbors (kNN)
– Stream through all training vectors and find the k vectors that are most similar to each input vector
– Each thread works on a portion of the training data
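The partitioning described for kNN can be sketched as follows. This is an illustration under stated assumptions, not the authors' benchmark: the training vectors are held in an in-memory list rather than streamed from storage, similarity is taken to be squared Euclidean distance, only one input vector is handled per call, and the names `knn_partition` and `threaded_knn` are hypothetical. Each thread finds the k best candidates in its own slice, and the per-thread results are merged at the end.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def squared_distance(a, b):
    """Squared Euclidean distance (assumed similarity measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_partition(training, start, stop, query, k):
    """Return the k nearest (distance, index) pairs within training[start:stop]."""
    candidates = ((squared_distance(training[i], query), i)
                  for i in range(start, stop))
    return heapq.nsmallest(k, candidates)


def threaded_knn(training, query, k, num_threads):
    """Split the training set across threads, then merge the per-thread results.

    Returns the indices of the k nearest training vectors, nearest first.
    """
    n = len(training)
    # Slice size per thread; at least 1 so the range step is never zero.
    chunk = max(1, (n + num_threads - 1) // num_threads)
    bounds = [(s, min(s + chunk, n)) for s in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = pool.map(
            lambda b: knn_partition(training, b[0], b[1], query, k), bounds)
        merged = [pair for part in partials for pair in part]
    return [index for _, index in heapq.nsmallest(k, merged)]
```

In pure Python the distance computation is serialized by the interpreter lock, so this sketch shows the division of work rather than a speedup. In an I/O-bound version, each thread would read its own portion of the training data from storage, which is where concurrent requests to flash would matter.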
kNN Results
• Single ioDrive vs. three SATA hard drives in RAID 0
– ioDrive provided a 3x improvement to the end application
– A small number of threads can have a large impact with flash memory
[Figure: two plots of kNN run time. One plots Time (s) against Input Vectors per Pass (1 to 64) for SATA RAID and ioDrive at 1, 2, and 4 threads. The other plots Single Pass Time (s) against Threads (1 to 128) for SATA RAID and ioDrive at 32 vectors per pass.]