HPC Storage Current Status and Futures


HPC Storage
Current Status and Futures
Torben Kling Petersen, PhD
Principal Architect, HPC
1
Agenda ??
• Where are we today ??
• File systems
• Interconnects
• Disk technologies
• Solid state devices
• Solutions
• Final thoughts …
Pan Galactic Gargle Blaster
"Like having your brains smashed
out by a slice of lemon wrapped
around a large gold brick.”
2
Current Top10 …..
Rank | Name | Computer | Site | Total Cores | Rmax (GFlop/s) | Rpeak (GFlop/s) | Power (kW) | File system | Size | Perf
1 | Tianhe-2 | TH-IVB-FEP Cluster, Xeon E5-2692 12C 2.2GHz, TH Express-2, Intel Xeon Phi | National Super Computer Center in Guangzhou | 3,120,000 | 33,862,700 | 54,902,400 | 17,808 | Lustre/H2FS | 12.4 PB | ~750 GB/s
2 | Titan | Cray XK7, Opteron 6274 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x | DOE/SC/Oak Ridge National Laboratory | 560,640 | 17,590,000 | 27,112,550 | 8,209 | Lustre | 10.5 PB | 240 GB/s
3 | Sequoia | BlueGene/Q, Power BQC 16C 1.60GHz, Custom Interconnect | DOE/NNSA/LLNL | 1,572,864 | 17,173,224 | 20,132,659 | 7,890 | Lustre | 55 PB | 850 GB/s
4 | K computer | Fujitsu, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | RIKEN AICS | 705,024 | 10,510,000 | 11,280,384 | 12,659 | Lustre | 40 PB | 965 GB/s
5 | Mira | BlueGene/Q, Power BQC 16C 1.60GHz, Custom Interconnect | DOE/SC/Argonne National Laboratory | 786,432 | 8,586,612 | 10,066,330 | 3,945 | GPFS | 7.6 PB | 88 GB/s
6 | Piz Daint | Cray XC30, Xeon E5-2670 8C 2.6GHz, Aries interconnect, NVIDIA K20x | Swiss National Supercomputing Centre (CSCS) | 115,984 | 6,271,000 | 7,788,853 | 2,325 | Lustre | 2.5 PB | 138 GB/s
7 | Stampede | PowerEdge C8220, Xeon E5-2680 8C 2.7GHz, IB FDR, Intel Xeon Phi | TACC/Univ. of Texas | 462,462 | 5,168,110 | 8,520,112 | 4,510 | Lustre | 14 PB | 150 GB/s
8 | JUQUEEN | BlueGene/Q, Power BQC 16C 1.60GHz, Custom Interconnect | Forschungszentrum Juelich (FZJ) | 458,752 | 5,008,857 | 5,872,025 | 2,301 | GPFS | 5.6 PB | 33 GB/s
9 | Vulcan | BlueGene/Q, Power BQC 16C 1.60GHz, Custom Interconnect | DOE/NNSA/LLNL | 393,216 | 4,293,306 | 5,033,165 | 1,972 | Lustre | 55 PB | 850 GB/s
10 | SuperMUC | iDataPlex DX360M4, Xeon E5-2680 8C 2.70GHz, Infiniband FDR | Leibniz Rechenzentrum | 147,456 | 2,897,000 | 3,185,050 | 3,423 | GPFS | 10 PB | 200 GB/s
n.b. NCSA Blue Waters: 24 PB, 1100 GB/s (Lustre 2.1.3)
3
Lustre Roadmap
4
Other parallel file systems
• GPFS
– Running out of steam ??
– Let me qualify !! (and he then rambles on …..)
• Fraunhofer (FhGFS)
– Excellent metadata perf
– Many modern features
– No real HA
• Ceph
– New, interesting and with a LOT of good features
– Immature and with limited track record
• Panasas
– Still RAID 5 and running out of steam ….
5
Object based storage
• A traditional file system includes a hierarchy of files and
directories
• Accessed via a file system driver in the OS
• Object storage is “flat”; objects are located by direct reference
• Accessed via custom APIs
o Swift, S3, librados, etc.
• The difference boils down to two questions:
o How do you find files?
o Where do you store metadata?
• Object store + Metadata + driver is a filesystem
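As a minimal illustration of that last point, the sketch below layers a path-to-object map over a flat store to get file-system semantics. All class and function names are hypothetical, not any particular product's API.

```python
# Minimal sketch: a flat object store plus a metadata layer gives a "filesystem".
# All names here are hypothetical illustrations, not a real storage API.
import hashlib

class ObjectStore:
    """Flat namespace: objects are found by ID (direct reference), not by path."""
    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        oid = hashlib.sha256(data).hexdigest()   # the direct reference = object ID
        self._objects[oid] = data
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid]

class SimpleFS:
    """Metadata (path -> object ID) layered on top turns the store into a file system."""
    def __init__(self, store: ObjectStore):
        self.store = store
        self.namespace = {}                      # the "metadata service": path -> oid

    def write(self, path: str, data: bytes):
        self.namespace[path] = self.store.put(data)

    def read(self, path: str) -> bytes:
        return self.store.get(self.namespace[path])

fs = SimpleFS(ObjectStore())
fs.write("/home/user/results.dat", b"exascale")
assert fs.read("/home/user/results.dat") == b"exascale"
```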
6
Object Storage Backend: Why?
• It’s more flexible. Interfaces can be presented in other
ways, without the FS overhead. A generalized storage
architecture vs. a file system
• It’s more scalable. POSIX was never intended for
clusters, concurrent access, multi-level caching, ILM,
usage hints, striping control, etc.
• It’s simpler. With the file-system-isms removed, an elegant (= scalable, flexible, reliable) foundation can be laid
7
Elephants all the way down...
Most clustered file systems and object stores are built on local file systems –
… and inherit their problems
• Native FS
o XFS, ext4, ZFS, btrfs
• OS on FS
o Ceph on btrfs
o Swift on XFS
• FS on OS on FS
o CephFS on Ceph on btrfs
o Lustre on OSS on ext4
o Lustre on OSS on ZFS
8
The way forward ..
• ObjectStore-based solutions offer a lot of flexibility:
– Next-generation design, for exascale-level size, performance, and robustness
– Implemented from scratch
• "If we could design the perfect exascale storage system..."
– Not limited to POSIX
– Non-blocking availability of data
– Multi-core aware
– Non-blocking execution model with thread-per-core
– Support for non-uniform hardware
– Flash, non-volatile memory, NUMA
– Using abstractions, guided interfaces can be implemented
• e.g., for burst buffer management (pre-staging and de-staging); a sketch follows below
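As a rough illustration of what such a guided burst-buffer interface might do, here is a minimal pre-stage/de-stage sketch. The mount points and helper names are assumptions for illustration, not a real product API.

```python
# Minimal burst-buffer sketch (hypothetical paths and policy, not a real product API):
# pre-stage job inputs to a fast tier before a job, de-stage results afterwards.
import shutil
from pathlib import Path

FAST_TIER = Path("/mnt/burst_buffer")      # assumed NVRAM/flash tier mount point
PARALLEL_FS = Path("/mnt/lustre/project")  # assumed parallel file system

def pre_stage(inputs):
    """Copy job inputs from the parallel FS into the burst buffer."""
    staged = []
    for name in inputs:
        src, dst = PARALLEL_FS / name, FAST_TIER / name
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        staged.append(dst)
    return staged

def de_stage(outputs):
    """Drain job outputs from the burst buffer back to the parallel FS."""
    for name in outputs:
        shutil.copy2(FAST_TIER / name, PARALLEL_FS / name)
        (FAST_TIER / name).unlink()        # free the fast tier for the next job

# Typical flow around a job:
# pre_stage(["input/mesh.h5"]); run_job(); de_stage(["output/checkpoint.h5"])
```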
9
Interconnects (Disk and Fabrics)
• S-ATA 6 Gbit
• FC-AL 8 Gbit
• SAS 6 Gbit
• SAS 12 Gbit
• PCI-E direct attach
• Ethernet
• Infiniband
• Next gen interconnect…
10
12 Gbit SAS
• Doubles bandwidth compared to SAS 6 Gbit
• Triples the IOPS !!
• Same connectors and cables
• 4.8 GB/s in each direction with 9.6 GB/s following
– 2 streams moving to 4 streams
• 24 Gb/s SAS is on the drawing board ….
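A back-of-the-envelope check of the 4.8 GB/s figure, assuming 8b/10b encoding and a 4-lane wide port (protocol overheads ignored):

```python
# Back-of-the-envelope check of the 4.8 GB/s figure (assumes 8b/10b encoding
# and a 4-lane SAS wide port; exact protocol overheads are ignored).
line_rate_gbps = 12.0          # 12 Gbit/s per lane
encoding_efficiency = 8 / 10   # 8b/10b: 10 line bits carry 8 data bits
lanes = 4                      # typical wide port

per_lane_GBps = line_rate_gbps * encoding_efficiency / 8   # bits -> bytes
wide_port_GBps = per_lane_GBps * lanes
print(f"{per_lane_GBps:.1f} GB/s per lane, {wide_port_GBps:.1f} GB/s per direction")
# -> 1.2 GB/s per lane, 4.8 GB/s per direction; doubling the lane count
#    gives the 9.6 GB/s figure mentioned above.
```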
11
PCI-E direct attach storage
• M.2 Solid State Storage Modules
• Lowest latency/highest bandwidth
• Limitations in # of PCI-E lanes available
• Ivy Bridge has up to 40 lanes per chip
PCIe Arch | Raw Bit Rate | Data Encoding | Interconnect Bandwidth | BW/Lane/Direction | Total BW for x16 link
PCIe 1.x | 2.5 GT/s | 8b/10b | 2 Gb/s | ~250 MB/s | ~8 GB/s
PCIe 2.0 | 5.0 GT/s | 8b/10b | 4 Gb/s | ~500 MB/s | ~16 GB/s
PCIe 3.0 | 8.0 GT/s | 128b/130b | 8 Gb/s | ~1 GB/s | ~32 GB/s
PCIe 4.0 | ?? | ?? | ?? | ?? | ??
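The per-lane and x16 numbers above follow from the raw bit rate and the encoding efficiency; a quick sketch to reproduce them (protocol overhead ignored):

```python
# Reproduce the per-lane and x16 numbers in the table above
# (raw bit rate x encoding efficiency; protocol overhead ignored).
gens = {
    "PCIe 1.x": (2.5, 8 / 10),      # GT/s per lane, encoding efficiency
    "PCIe 2.0": (5.0, 8 / 10),
    "PCIe 3.0": (8.0, 128 / 130),
}
for name, (gts, eff) in gens.items():
    lane_GBps = gts * eff / 8               # usable GB/s per lane per direction
    x16_GBps = lane_GBps * 16 * 2           # 16 lanes, both directions
    print(f"{name}: ~{lane_GBps:.2f} GB/s per lane, ~{x16_GBps:.0f} GB/s total for x16")
```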
12
Ethernet – Still going strong ??
• Ethernet has now been around for 40 years !!!
• Currently around 41% of Top500 systems …
– 28% 1 GbE
– 13% 10 GbE
• 40 GbE shipping in volume
• 100 GbE being demonstrated
– Volume shipments expected in 2015
• 400 GbE and 1 TbE are on the drawing board
– 400 GbE planned for 2017
13
Infiniband …
14
Next gen interconnects …
• Intel acquired Qlogic Infiniband team …
• Intel acquired Cray’s interconnect technologies …
• Intel has published and shows silicon photonics …
• And this means WHAT ????
15
Disk drive technologies
16
Areal density futures
17
Disk write technology ….
18
Hard drive futures …
• Sealed Helium Drives (Hitachi)
– Higher density – 6 platters/12 heads
– Less power (~ 1.6W idle)
– Less heat (~ 4°C lower temp)
• SMR drives (Seagate)
– Denser packaging on current technology
– Aimed at read intensive application areas
• Hybrid drives (SSHD)
– Enterprise edition
– Transparent SSD/HDD combination (aka Fusion drives)
• eMLC + SAS
19
SMR drive deep-dive
20
Hard drive futures …
• HAMR drives (Seagate)
– Using a laser to heat the magnetic substrate
(Iron/Platinum alloy)
– Projected capacity – 30-60 TB / 3.5-inch drive …
– 2016 timeframe ….
• BPM (bit patterned media recording)
– Stores one bit per cell, as opposed to regular hard-drive
technology, where each bit is stored across a few
hundred magnetic grains
– Projected capacity – 100+ TB / 3.5 inch drive …
21
What about RAID ?
• RAID 5 – No longer viable
• RAID 6 – Still OK
– But rebuild times are becoming prohibitive (see the rough numbers after this list)
• RAID 10
– OK for SSDs and small arrays
• RAID Z/Z2 etc
– A choice but limited functionality on Linux
• Parity Declustered RAID
– Gaining a foothold everywhere
– But ….
PD-RAID ≠ PD-RAID ≠ PD-RAID ….
• No RAID ???
– Using multiple distributed copies works but …
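To put the rebuild concern into rough numbers, a back-of-the-envelope estimate with an assumed drive size and sustained rebuild rate (both purely illustrative):

```python
# Rough rebuild-time estimate for a traditional RAID 6 set
# (drive size and sustained rebuild rate are assumptions for illustration).
drive_tb = 6                      # assumed capacity of the failed drive
rebuild_MBps = 50                 # assumed sustained rebuild rate under production load

hours = drive_tb * 1e6 / rebuild_MBps / 3600
print(f"~{hours:.0f} hours to rebuild one {drive_tb} TB drive at {rebuild_MBps} MB/s")
# -> ~33 hours of degraded protection; parity-declustered RAID spreads the
#    rebuild across many drives precisely to shrink this window.
```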
22
Flash (NAND)
• Supposed to “Take over the World” [cf. Pinky and the Brain]
• But for high performance storage there are issues ….
– Price and density not following predicted evolution
– Reliability (even on SLC) not as expected
• Latency issues
– SLC access ~25µs, MLC ~50µs …
– Larger chips increase contention
• Once a flash die is accessed,
other dies on the same bus must wait
• Up to 8 flash dies share a bus (see the toy model below)
• Address translation, garbage collection and especially
wear leveling add significant latency
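A toy model of the contention point above, with an assumed channel speed (the ~25 µs read time and 8 dies per bus come from the slide; everything else is illustrative):

```python
# Toy model of channel contention (all numbers are illustrative assumptions):
# array reads on different dies overlap, but page transfers on the shared bus serialize.
array_read_us = 25                 # ~SLC page read time from the slide
page_bytes = 4 * 1024              # assumed page size
bus_MBps = 200                     # assumed flash channel speed
dies_per_bus = 8                   # from the slide

xfer_us = page_bytes / (bus_MBps * 1e6) * 1e6        # time one page holds the bus
worst_case_us = array_read_us + (dies_per_bus - 1) * xfer_us
print(f"one transfer: ~{xfer_us:.0f} us; a read stuck behind 7 others: ~{worst_case_us:.0f} us")
# -> the ~25 us device latency can stretch to well over 100 us before the FTL
#    (address translation, garbage collection, wear leveling) adds its own delays.
```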
23
Flash (NAND)
• MLC
– 3-4 bits per cell @ 10K duty cycles
• SLC
– 1 bit per cell @ 100K duty cycles
• eMLC
– 2 bits per cell @ 30K duty cycles
• Disk drive formats (S-ATA / SAS bandwidth limitations)
• PCI-E accelerators
• PCI-E direct attach
24
NV-RAM
• Flash is essentially NV-RAM but ….
• Phase Change Memory (PCM)
– Significantly faster and more dense than NAND
– Based on chalcogenide glass
• Thermal vs electronic process
– More resistant to external factors
• Currently the expected solution for burst buffers etc …
– but there’s always Hybrid Memory Cubes ……
25
Solutions …
• Size does matter …..
– 2014 – 2016: >20 proposals for 40+ PB file systems
– Running at 1 – 4 TB/s !!!!
• Heterogeneity is the new buzzword
– Burst buffers, data capacitors, cache off loaders …
• Mixed workloads are now taken seriously ….
• Data integrity is paramount
– T10-DIF/X is a decent start but … (the field layout is sketched after this list)
• Storage system resiliency is equally important
– PD-RAID needs to evolve and become system-wide
• Multi-tier storage as standard configs …
• Geographically distributed solutions commonplace
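For context on T10-DIF, each 512-byte block carries an extra 8 bytes of protection information (guard, application and reference tags). A minimal sketch of that layout in illustrative Python (not a driver implementation):

```python
# Minimal sketch of the 8-byte T10 DIF protection-information field appended
# to each 512-byte block (illustrative only, not a driver implementation).
import struct

def crc16_t10dif(data: bytes) -> int:
    # CRC-16/T10-DIF: polynomial 0x8BB7, init 0, bitwise reference implementation.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def dif_tuple(data_block: bytes, lba: int, app_tag: int = 0) -> bytes:
    assert len(data_block) == 512
    guard = crc16_t10dif(data_block)        # 2-byte guard tag: CRC-16 of the data
    ref = lba & 0xFFFFFFFF                  # 4-byte reference tag: low 32 bits of the LBA
    return struct.pack(">HHI", guard, app_tag, ref)   # 2 + 2 + 4 = 8 bytes

pi = dif_tuple(bytes(512), lba=1234)
print(pi.hex())   # guard tag | application tag | reference tag
```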
26
Final thoughts ???
• Object based storage
– Not an IF but a WHEN ….
– Flavor(s) still TBD – DAOS, Exascale10, XEOS, ….
• Data management core to any solution
– Self-aware data, real-time analytics, resource management ranging from the job scheduler to the disk block ….
– HPC storage = Big Data
• Live data –
– Cache ↔ Nearline ↔ Tier2 ↔ Tape? ↔ Cloud ↔ Ice
• Compute with storage ➜ Storage with compute
Storage is no longer a 2nd class citizen
27
Thank You
[email protected]
28