Transcript Document

Gordon: NSF Flash-based System for Data-intensive Science
Mahidhar Tatineni
37th HPC User Forum, Seattle, WA, Sept. 14, 2010
PIs: Michael L. Norman, Allan Snavely
San Diego Supercomputer Center, University of California, San Diego
What is Gordon?
• A “data-intensive” supercomputer based on SSD flash memory and virtual shared memory SW
• Emphasizes MEM and IOPS over FLOPS
• A system designed to accelerate access to massive databases being generated in all fields of science, engineering, medicine, and social science
• The NSF’s most recent Track 2 award to the San Diego Supercomputer Center (SDSC)
• Coming Summer 2011
Michael L. Norman, Principal Investigator (Director, SDSC)
Allan Snavely, Co-Principal Investigator (Project Scientist)
Why Gordon?
• Growth of digital data is exponential
  • “data tsunami”
• Driven by advances in digital detectors, networking, and storage technologies
• Making sense of it all is the new imperative:
  • data analysis workflows
  • data mining
  • visual analytics
  • multiple-database queries
  • data-driven applications
Cosmological Dark Energy Surveys

Survey | Area (sq. deg.) | Start date | Image Data (PB) | Object Catalog (PB)
Pan-STARRS-1 | 30,000 | 2009 | 1.5 | 0.1
Dark Energy Survey | 5,000 | 2011 | 2.4 | 0.1
Pan-STARRS-4 | 30,000 | 2012 | 20 | 0.2
Large Synoptic Survey Telescope | 20,000 | ~2015 | 60 | 30
Joint Dark Energy Mission | 28,000 | ~2015 | ~60 | ~30

[Image: the accelerating universe]
Gordon is designed specifically for data-intensive HPC applications
• Such applications involve “very large data-sets or very large input-output requirements” (NSF Track 2D RFP)
• Two data-intensive application classes are important and growing:
  • Data Mining: “the process of extracting hidden patterns from data… with the amount of data doubling every three years, data mining is becoming an increasingly important tool to transform this data into information.” (Wikipedia)
  • Data-Intensive Predictive Science: solution of scientific problems via simulations that generate large amounts of data
Red Shift: Data keeps moving further away from the CPU with every turn of Moore’s Law
[Chart: access time in nanoseconds (0.01 to 1000, log scale) vs. year (1982 to 2007) for CPU Cycle Time, Multi Core Effective Cycle Time, Memory Access Time, and Disk Access Time; data due to Dean Klein of Micron]
The Memory Hierarchy of a Typical HPC Cluster
[Diagram: shared memory programming within a node, message passing programming across nodes, then a latency gap before disk I/O]
The Memory Hierarchy of Gordon
[Diagram: Gordon’s memory hierarchy, from shared memory programming down to disk I/O, with flash closing the latency gap shown on the previous slide]
Gordon Architecture: “Supernode”
• 32 Appro Extreme-X compute nodes
  • Dual-processor Intel Sandy Bridge
  • 240 GFLOPS
  • 64 GB RAM
• 2 Appro Extreme-X I/O nodes
  • Intel SSD drives, 4 TB each
  • 560,000 IOPS
• ScaleMP vSMP virtual shared memory
  • 2 TB RAM aggregate
  • 8 TB SSD aggregate
[Diagram: 240 GF compute nodes with 64 GB RAM each and 4 TB SSD I/O nodes, aggregated by vSMP memory virtualization]
Gordon Architecture: Full Machine
• 32 supernodes = 1024 compute nodes
• Dual-rail QDR InfiniBand network
  • 3D torus (4x4x4)
• 4 PB rotating-disk parallel file system
  • >100 GB/s
[Diagram: 32 supernodes (SN) and disk I/O units (D) connected in a 4x4x4 3D torus]
Gordon Aggregate Capabilities

Speed | 245 TFLOPS
Mem (RAM) | 64 TB
Mem (SSD) | 256 TB
Mem (RAM+SSD) | 320 TB
Ratio (MEM/SPEED) | 1.31 bytes/flop
I/O rate to SSDs | 35 million IOPS
Network bandwidth | 16 GB/s bi-directional
Network latency | 1 μsec
Disk storage | 4 PB
Disk I/O bandwidth | >100 GB/sec
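As a rough cross-check, these aggregates follow from the per-node and per-supernode figures on the earlier slides; a minimal sketch, assuming 1024 compute nodes at 240 GFLOPS and 64 GB RAM each and 32 supernodes at 8 TB of SSD each:

```python
# Derive Gordon's aggregate capabilities from the per-node /
# per-supernode figures quoted on the earlier slides.
nodes = 1024              # 32 supernodes x 32 compute nodes
gflops_per_node = 240     # dual Sandy Bridge, per the supernode slide
ram_gb_per_node = 64
supernodes = 32
ssd_tb_per_supernode = 8  # 2 I/O nodes x 4 TB SSD

speed_tflops = nodes * gflops_per_node / 1000   # ~245 TFLOPS
ram_tb = nodes * ram_gb_per_node / 1024         # 64 TB
ssd_tb = supernodes * ssd_tb_per_supernode      # 256 TB
bytes_per_flop = (ram_tb + ssd_tb) / speed_tflops

print(speed_tflops, ram_tb, ssd_tb, round(bytes_per_flop, 2))
# -> 245.76 64.0 256 1.3   (slide quotes 1.31 using 245 TFLOPS)
```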
Dash*: a working prototype of Gordon
*Available as a TeraGrid resource
DASH Architecture
[Diagram: 16 compute nodes, each with 2 Nehalem quad-core CPUs and 48 GB DDR3 memory, connected over InfiniBand to an I/O node; the I/O node holds CPU & memory, RAID controllers/HBAs (Host Bus Adapters), and 16 Intel® X25-E 64 GB flash drives]
DASH Architecture: Just a cluster?
[Diagram: the same 16 compute nodes and I/O node on InfiniBand, viewed as an ordinary cluster]
DASH Architecture: Supernodes
[Diagram: compute nodes and an I/O node on InfiniBand, aggregated by the vSMP system into supernodes, each with 128 cores, 768 GB DRAM, and 1 TB of flash drives]
I/O Node: Original Configuration
Achieved only 15% of the upper bound after exhaustive tuning; the embedded processor is the bottleneck.
[Diagram: I/O node with CPU & memory running software RAID 0 across two hardware RAID 0 sets, each RAID controller driving 8 x 64 GB flash drives]
I/O Node: New Configuration
Achieved up to 80% of the upper bound. Peripheral hardware designed for spinning disks cannot satisfy flash drives!
[Diagram: I/O node with CPU & memory running software RAID 0 across four HBAs, each HBA driving 4 x 64 GB flash drives]
I/O System Configuration
• CPU: 2 Intel® Nehalem quad-core 2.4 GHz
• Memory: 48 GB DDR3
• Flash drives: 16 Intel® X25-E 64 GB SLC SATA
• Benchmarks: IOR and XDD (a toy stand-in is sketched below)
• File system: XFS
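For illustration only, the flavor of these measurements can be approximated with a tiny random-read loop; a minimal sketch, assuming a hypothetical test file on the flash array (the numbers in this talk come from IOR and XDD, not from this toy):

```python
# Minimal random-read IOPS microbenchmark (illustrative only).
import os
import random
import time

PATH = "/ssd/testfile"   # hypothetical pre-created file on the flash array
BLOCK = 4096             # 4 KiB reads, a common IOPS block size
N = 100_000              # number of random reads

fd = os.open(PATH, os.O_RDONLY)
size = os.fstat(fd).st_size
start = time.time()
for _ in range(N):
    offset = random.randrange(0, size - BLOCK)
    os.pread(fd, BLOCK, offset)          # one small read at a random offset
elapsed = time.time() - start
os.close(fd)

# Note: the page cache is not bypassed here; real benchmarks use O_DIRECT
# or working sets much larger than RAM.
print(f"{N / elapsed:.0f} random reads/s (single-threaded)")
```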
DASH I/O Node Testing*
• Distributed shared memory + flash drives
• Prototype system: DASH
• Design space exploration with 16,800 tests (the sweep pattern is sketched after this list):
  • Stripe sizes
  • Stripe widths
  • File systems
  • I/O schedulers
  • Queue depths
• Revealed hardware and software issues:
  • RAID controller processor
  • MSI per-vector masking
  • File system cannot handle high IOPS
  • I/O schedulers designed for spinning disks
• Tuning during testing improved performance by about 9x.

* Detailed info in TeraGrid 2010 paper: “DASH-IO: an Empirical Study of Flash-based IO for HPC”, Jiahua He, Jeffrey Bennett, and Allan Snavely
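The 16,800 tests come from sweeping combinations of the parameters listed above; a minimal sketch of such a Cartesian-product sweep, with placeholder parameter values rather than the sets used in the study:

```python
# Enumerate an I/O tuning design space as a Cartesian product of parameters.
# The value lists below are placeholders; the DASH-IO study swept its own
# stripe sizes, widths, file systems, schedulers, and queue depths.
from itertools import product

stripe_sizes  = ["64k", "256k", "1m"]
stripe_widths = [4, 8, 16]
file_systems  = ["xfs", "ext3"]
schedulers    = ["noop", "deadline", "cfq"]
queue_depths  = [1, 8, 32, 64]

configs = list(product(stripe_sizes, stripe_widths, file_systems,
                       schedulers, queue_depths))
print(len(configs), "configurations")   # 3*3*2*3*4 = 216 with these lists

# Each tuple defines one benchmark run to launch and record.
for size, width, fs, sched, qd in configs[:3]:
    print(f"run benchmark: stripe={size} width={width} fs={fs} "
          f"scheduler={sched} queue_depth={qd}")
```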
The I/O Tuning Story
[Bar chart: normalized random-read IOPS across configurations (Default, Basic tunings, Software RAID, Without RAID, Important tunings, RAM, Drive), ranging from 1 for the default setup through 1.9, 5.6, and 12.4, up to 98.8]
DASH vSMP Node Tests
• STREAM memory benchmark (triad kernel sketched below)
• GAMESS standard benchmark problem
• Early user examples:
  • Genomic sequence assembly code (Velvet). Successfully run on Dash with 178 GB of memory used.
  • CMMAP cloud data analysis with data transposition in memory. Early tests run with ~130 GB of data.
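For reference, the STREAM triad is a simple vector kernel (a = b + q*c); a rough NumPy sketch of that kernel, which only approximates the compiled, threaded STREAM benchmark actually run on the vSMP node:

```python
# Rough STREAM-triad-style bandwidth estimate (illustrative; the real
# STREAM benchmark is a compiled C/Fortran code, typically OpenMP-threaded).
import time
import numpy as np

n = 50_000_000                  # arbitrary array length (~1.2 GB total)
a = np.zeros(n)
b = np.random.rand(n)
c = np.random.rand(n)
scalar = 3.0

t0 = time.time()
a[:] = b + scalar * c           # triad: a = b + q*c
dt = time.time() - t0

bytes_moved = 3 * n * 8         # read b, read c, write a (8-byte doubles)
print(f"~{bytes_moved / dt / 1e9:.1f} GB/s (NumPy temporaries add extra traffic)")
```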
DASH – Applications using flash as scratch space
• Several HPC applications have a substantial per-core (local) I/O component, primarily scratch files. Examples are Abaqus, NASTRAN, and QChem.
• Standard Abaqus test cases (S2A1, S4B) were run on Dash to compare performance between local hard disk and SSDs.

Test | HDD | SSD
S2A1, 8 cores, 1 node | 4058.2 s | 2964.9 s
S4B, 8 cores, 1 node* | 12198 s | 9604.8 s

*Lustre is not optimal for such I/O; the S4B test took 18920 s there.
Dash wins the Storage Challenge at SC09
3 Application Benchmarks
• Massive graph traversal
  • Semantic web search
• Astronomical database queries
  • Identification of transient objects
• Biological networks database queries
  • Protein interactions
Graph Networks Search (BFS)
• Breadth-first search on a 200 GB graph network (a toy in-memory sketch follows)
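A minimal in-memory sketch of level-by-level breadth-first search; the SC09 benchmark ran BFS against a 200 GB graph held on flash, which this toy version does not attempt:

```python
# Minimal breadth-first search over an adjacency-list graph (in-memory
# toy version; the Dash benchmark traversed a 200 GB graph on flash).
from collections import deque

def bfs(adj, source):
    """Return hop distance from source to every reachable vertex."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    return dist

# Tiny example graph
adj = {0: [1, 2], 1: [3], 2: [3], 3: [4]}
print(bfs(adj, 0))   # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```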
NIH Biological Networks Pathway Analysis
• Interaction networks or graphs occur in many disciplines, e.g. epidemiology, phylogenetics, and systems biology
• Performance is limited by the latency of a database query and the aggregate amount of fast storage available
Biological Networks Query timing
Palomar Transient Factory (PTF)
• Nightly wide-field surveys using the Palomar Schmidt telescope
• Image data sent to LBL for archive/analysis
• 100 new transients every minute
• Large, random queries across multiple databases for IDs
PTF-DB Transient Search

System | Forward Q1 | Backward Q1
DASH-IO SSD | 11 s (145x) | 100 s (24x)
Existing DB | 1600 s | 2400 s

Random queries requesting very small chunks of data about the candidate observations.
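A minimal sketch of that access pattern, using a hypothetical SQLite table and column names rather than the actual PTF schema or database engine:

```python
# Illustrative point-query pattern: many small, random lookups by ID.
# Schema and names are hypothetical; PTF uses its own database design.
import random
import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for the candidate database
cur = conn.cursor()
cur.execute("CREATE TABLE candidates (id INTEGER PRIMARY KEY, ra REAL, dec REAL, mag REAL)")
cur.executemany("INSERT INTO candidates VALUES (?, ?, ?, ?)",
                [(i, random.uniform(0, 360), random.uniform(-90, 90),
                  random.uniform(14, 21)) for i in range(1, 100_001)])
conn.commit()

# Each query touches one tiny row at a random key -- the access pattern
# that favors low-latency flash over spinning disk.
for cid in random.sample(range(1, 100_001), 1000):
    cur.execute("SELECT ra, dec, mag FROM candidates WHERE id = ?", (cid,))
    cur.fetchone()
conn.close()
```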
Trestles – Coming Soon!
• Trestles will work with and span the deployments of SDSC’s recently introduced Dash system and the larger Gordon data-intensive system.
• To be configured by SDSC and Appro, Trestles is based on quad-socket, 8-core AMD Magny-Cours compute nodes connected via a QDR InfiniBand fabric.
• Trestles will have 324 nodes, 10,368 processor cores, a peak speed of 100 teraflop/s, and 38 terabytes of flash memory.
• Each of the 324 nodes will have 32 cores, 64 GB of DDR3 memory, and 120 GB of flash memory.
Conclusions
• Gordon’s architecture is customized for data-intensive applications, but built entirely from commodity parts
• Basically a Linux cluster with:
  • Large RAM memory per core
  • A large amount of flash SSD
  • Virtual shared memory software
  • 10 TB of storage accessible from a single core
  • Shared-memory parallelism for higher performance
• The Dash prototype accelerates real applications by 2-100x relative to disk, depending on memory access patterns
  • Random I/O is accelerated the most
Dash – STREAM on vSMP (test results)