TK. Petersen - Optimizing Performance of HPC Storage Systems


Optimizing Performance of HPC Storage Systems
Torben Kling Petersen, PhD
Principal Architect
High Performance Computing
Current Top10 …..
Rank | Name | Computer | Site | Total Cores | Rmax (GFlop/s) | Rpeak (GFlop/s) | Power (kW) | File system | Size | Perf
1 | Tianhe-2 | TH-IVB-FEP Cluster, Xeon E5-2692 12C 2.2GHz, TH Express-2, Intel Xeon Phi | National Super Computer Center in Guangzhou | 3,120,000 | 33,862,700 | 54,902,400 | 17,808 | Lustre/H2FS | 12.4 PB | ~750 GB/s
2 | Titan | Cray XK7, Opteron 6274 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x | DOE/SC/Oak Ridge National Laboratory | 560,640 | 17,590,000 | 27,112,550 | 8,209 | Lustre | 10.5 PB | 240 GB/s
3 | Sequoia | BlueGene/Q, Power BQC 16C 1.60GHz, Custom Interconnect | DOE/NNSA/LLNL | 1,572,864 | 17,173,224 | 20,132,659 | 7,890 | Lustre | 55 PB | 850 GB/s
4 | K computer | Fujitsu, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | RIKEN AICS | 705,024 | 10,510,000 | 11,280,384 | 12,659 | Lustre | 40 PB | 965 GB/s
5 | Mira | BlueGene/Q, Power BQC 16C 1.60GHz, Custom Interconnect | DOE/SC/Argonne National Lab. | 786,432 | 8,586,612 | 10,066,330 | 3,945 | GPFS | 7.6 PB | 88 GB/s
N/A | BlueWaters | Cray XK7, Opteron 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x | NCSA | - | - | - | - | Lustre | 24 PB | 1,100 GB/s
6 | Piz Daint | Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries interconnect, NVIDIA K20x | Swiss National Supercomputing Centre (CSCS) | 115,984 | 6,271,000 | 7,788,853 | 2,325 | Lustre | 2.5 PB | 138 GB/s
7 | Stampede | PowerEdge C8220, Xeon E5-2680 8C 2.7GHz, IB FDR, Intel Xeon Phi | TACC/Univ. of Texas | 462,462 | 5,168,110 | 8,520,112 | 4,510 | Lustre | 14 PB | 150 GB/s
8 | JUQUEEN | BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect | Forschungszentrum Juelich (FZJ) | 458,752 | 5,008,857 | 5,872,025 | 2,301 | GPFS | 5.6 PB | 33 GB/s
9 | Vulcan | BlueGene/Q, Power 16C 1.6GHz, Custom Interconnect | DOE/NNSA/LLNL | 393,216 | 4,293,306 | 5,033,165 | 1,972 | Lustre | 55 PB | 850 GB/s
10 | SuperMUC | iDataPlex DX360M4, Xeon E5-2680 8C 2.70GHz, Infiniband FDR | Leibniz Rechenzentrum | 147,456 | 2,897,000 | 3,185,050 | 3,423 | GPFS | 10 PB | 200 GB/s
Performance testing
Storage benchmarks
• IOR
• IOzone
• Bonnie++
• sgpdd-survey
• obdfilter-survey
• FIO
• dd/xdd
• Filebench
• dbench
• Iometer
• MDstat
• metarates …….
Lustre® Architecture – High Level
[Diagram: Lustre clients (1-100,000) connect over multiple network types (IB, X-GigE) through routers, with NFS and CIFS clients attached via a gateway, to Metadata Servers (MDS) with a Metadata Target (MDT) and Object Storage Servers (OSS, 1-1,000s) with Object Storage Targets (OST), backed by disk arrays and a SAN fabric.]
Dissecting benchmarking
The chain and the weakest link …
[Diagram: the I/O chain from client to disk, with a potential bottleneck at every link. Client side: memory, memory bus, CPU, PCI-E bus, file system client, MPI stack. Network: interconnect and routing (non-blocking fabric? TCP/IP overhead?). Server (OSS) side: CPU, memory, OS, PCI-E bus, SAS controller/expander, RAID controller (SW/HW), SAS port oversubscription, cabling, SAS or S-ATA disks, RAID sets, disk drive performance, and the file system itself.]
Only a balanced system will deliver performance …..
Server Side Benchmark
• obdfilter-survey is a Lustre benchmark tool that measures OSS and backend OST
performance; it does not measure LNet or client performance
• This makes it a good benchmark for isolating the server from the network and clients.
• Example of obdfilter-survey parameters
[root@oss1 ~]# nobjlo=1 nobjhi=1 thrlo=256 thrhi=256 size=65536 obdfilter-survey
• Parameters Defined
– size=65536 // file size in MB (2x controller memory is good practice)
– nobjhi=1 nobjlo=1 // number of files
– thrhi=256 thrlo=256 // number of worker threads when testing the OSS
• If you see results significantly lower than expected, rerun the test multiple times to
determine whether the low results are consistent or merely transient.
• This benchmark can also target individual OSTs. If an OSS node is performing lower
than expected, the cause may be a single OST performing poorly due to a drive issue,
a RAID array rebuild, etc.
[root@oss1 ~]# targets="fsname-OST0000 fsname-OST0002" nobjlo=1 nobjhi=1 thrlo=256 thrhi=256 size=65536 obdfilter-survey
Client Side Benchmark
• IOR uses MPI-IO to execute the benchmark across all nodes and mimics typical HPC
applications running on the clients
• Within IOR, one can configure the benchmark for File-Per-Process or Single-Shared-File
– File-Per-Process: creates a unique file per task; the most common way
to measure peak throughput of a Lustre parallel file system
– Single-Shared-File: creates a single file shared across all tasks running on
all clients
• Two primary modes for IOR (see the sketch below)
– Buffered: the default; takes advantage of the Linux page cache
on the client
– DirectIO: bypasses the Linux page cache and writes directly to the
file system
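• A minimal sketch (not from the original slides) of IOR invocations for the two layouts and two modes above, assuming an MPI launcher, a hypothetical host file clients.txt, and a Lustre mount at /mnt/lustre; the POSIX API is chosen here so the -B (O_DIRECT) flag applies, and all paths and process counts are illustrative
# File-Per-Process, buffered I/O (default): each task writes and reads its own file
mpirun -np 256 --hostfile clients.txt \
  ior -a POSIX -F -w -r -e -C -b 4g -t 4m -o /mnt/lustre/ior_fpp/testfile
# Single-Shared-File, DirectIO: all tasks share one file, bypassing the client page cache
mpirun -np 256 --hostfile clients.txt \
  ior -a POSIX -B -w -r -e -C -b 4g -t 4m -o /mnt/lustre/ior_ssf/testfile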
Typical Client Configuration
• At customer sites, typically all clients have the same
architecture, same number of CPU cores, and same
amount of memory.
• With a uniform client architecture, the parameters for IOR
are simpler to tune and optimize for benchmarking
• Example for 200 Clients
– Number of Cores per Client: 16 (# nproc)
– Amount of Memory per Client: 32 GB (cat /proc/meminfo)
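• A quick sketch for confirming client uniformity (the pdsh group name clients is hypothetical); dshbak -c coalesces identical output, so any non-uniform node stands out immediately
pdsh -g clients 'nproc; grep MemTotal /proc/meminfo' | dshbak -c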
IOR Rule of Thumb
• You always want to transfer 2x the aggregate memory of all clients used,
to avoid any client-side caching effects
• In our example:
– (200 clients * 32 GB) * 2 = 12,800 GB
• Total file size for the IOR benchmark will be 12.8 TB (see the sizing sketch below)
– NOTE: Typically all nodes are uniform.
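• A sketch of how the 12.8 TB total maps onto IOR parameters, assuming one MPI task per core (16 per client, as above); the 4 MB transfer size and the file path are illustrative
# 200 clients x 16 tasks/client = 3,200 MPI tasks
# 12,800 GB / 3,200 tasks = 4 GB per task, hence -b 4g
mpirun -np 3200 --hostfile clients.txt \
  ior -a POSIX -F -w -r -e -C -b 4g -t 4m -o /mnt/lustre/ior_run/testfile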
Lustre Configuration
Lustre Server Caching Description
• Lustre read_cache_enable
– controls whether data read from disk during a read request is kept in
memory and available for later read requests for the same data,
without having to re-read it from disk. By default, read cache is enabled
(read_cache_enable = 1).
• Lustre writethrough_cache_enable
– controls whether data sent to the OSS as a write request is kept in the
read cache and available for later reads, or if it is discarded from cache
when the write is completed. By default, writethrough cache is enabled
(writethrough_cache_enable = 1)
• Lustre readcache_max_filesize
– controls the maximum size of a file that both the read cache and
writethrough cache will try to keep in memory. Files larger than
readcache_max_filesize will not be kept in cache for either reads or
writes. By default, files of all sizes are cached (see the lctl sketch below).
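• A sketch (assumed, not taken from the slides) of inspecting and setting these OSS-side cache parameters with lctl; on Lustre 2.x servers of this era they live under the obdfilter namespace
[root@oss1 ~]# lctl get_param obdfilter.*.read_cache_enable obdfilter.*.writethrough_cache_enable
[root@oss1 ~]# lctl set_param obdfilter.*.readcache_max_filesize=1M   # cache only files of 1 MB or less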
Client Lustre Parameters
• Network Checksums
– Default is on, which impacts performance; disabling checksums is the
first thing we do when tuning for performance
• LRU Size
– Typically we disable this parameter
– Parameter used to control the number of client-side locks in an LRU queue
• Max RPCs in Flight
– Default is 8, increase to 32
– RPC is remote procedure call
– This tunable is the maximum number of concurrent RPCs in flight from
a client
• Max Dirty MB
– Default is 32; a good rule of thumb is 4x the value of max_rpcs_in_flight
– Defines the amount of dirty data, in MB, that can be written and
queued up on the client
• These are set per client with lctl (see the sketch below)
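• A sketch (assumed, not taken from the slides) of applying these client-side tunings with lctl on every client; the lru_size value is illustrative, since fixing it to a non-zero value is what turns off dynamic LRU sizing
lctl set_param osc.*.checksums=0               # disable network checksums
lctl set_param osc.*.max_rpcs_in_flight=32     # up from the default of 8
lctl set_param osc.*.max_dirty_mb=128          # 4x max_rpcs_in_flight
lctl set_param ldlm.namespaces.*.lru_size=128  # pin the client lock LRU instead of dynamic sizing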
Lustre Striping
• The default Lustre stripe size is 1 MB and the default stripe count is 1
– Each file is written to 1 OST with a stripe size of 1 MB
– When multiple files are created and written, the MDS will make a
best effort to distribute the load across all available OSTs
• The default stripe size and count can be changed. The smallest stripe size
is 64 KB, stripe size can be increased in 64 KB increments, and the stripe
count can be increased to include all OSTs
– Setting the stripe count to all OSTs means each file is created across
all OSTs. This is best when creating a single shared file from multiple
Lustre clients
• One can create multiple directories with various stripe sizes and counts
to optimize for performance (see the lfs sketch below)
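• A sketch (assumed, not taken from the slides) of per-directory striping with lfs; the directory names are hypothetical, and on clients of this era the stripe-size flag is -s (newer releases use -S)
lfs setstripe -s 1m -c 1 /mnt/lustre/fpp_dir    # 1 MB stripes, one OST per file
lfs setstripe -s 4m -c -1 /mnt/lustre/ssf_dir   # stripe each file across all OSTs (single shared file)
lfs getstripe /mnt/lustre/ssf_dir               # verify the layout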
Experimental setup & Results
Equipment used
• ClusterStor with 2x CS6000 SSUs
– 2TB NL-SAS Hitachi Drives
– 4U CMU
– Neo 1.2.1, HF applied
• Clients
– 64 Clients, 12 Cores, 24GB Memory, QDR
– Mellanox FDR core switch
– Lustre Client: 1.8.7
• Lustre
– Server version: 2.1.3
– 4 OSSes
– 16 OSTs (RAID 6)
Subset of test parameters
• Disk backend testing – obdfilter-survey
• Client based testing – IOR
– I/O mode
– I/O Slots per client
– IOR transfer size
– Number of Client threads
• Lustre tunings (server)
– writethrough cache enabled
– read cache enabled
– read cache max filesize = 1M
• Client settings
– LRU disabled
– Checksums disabled
– max RPCs in flight = 32
Lustre obdfilter-survey
# pdsh -g oss "TERM=linux thrlo=256 thrhi=256 nobjlo=1
nobjhi=1 rsz=1024K size=32768 obdfilter-survey"
cstor01n04: ost 4 sz 134217728K rsz 1024K obj 4 thr 1024
write 3032.89 [ 713.86, 926.89]
rewrite 3064.15 [ 722.83, 848.93]
read 3944.49 [ 912.83,1112.82]
cstor01n05: ost 4 sz 134217728K rsz 1024K obj 4 thr 1024
write 3022.43 [ 697.83, 819.86]
rewrite 3019.55 [ 705.15, 827.87]
read 3959.50 [ 945.20,1125.76]
This means that a single SSU has a
• write performance of 6,055 MB/s (75.9 MB/s per disk)
• read performance of 7,904 MB/s (98.8 MB/s per disk)
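• These totals are simply the sum of the two OSS results above (3,032.89 + 3,022.43 ≈ 6,055 MB/s write; 3,944.49 + 3,959.50 ≈ 7,904 MB/s read); the per-disk figures correspond to roughly 80 data drives per SSU, a divisor inferred here rather than stated on the slide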
Buffered I/O
[Charts: write and read performance, buffered I/O. For each of write and read there are two panels: performance (MB/s, scale to 14,000) versus number of threads (4 to 1536) with one curve per IOR transfer size (-t = 2 to 64 MB), and performance versus transfer size (1 to 64 MB) with one curve per process count (np = 4 to 1024).]
Direct I/O
[Charts: IOR file-per-process read and write performance, direct I/O. Same layout as the buffered I/O charts: throughput (MB/s, scale to 14,000) versus number of threads with one curve per transfer size (-t = 2 to 64 MB), and versus transfer size (1 to 64 MB) with one curve per process count (np = 4 to 1024).]
Summary
Reflections on the results
• Never trust marketing numbers …
• Testing all stages of the data pipeline is essential
• Optimal parameters and/or methodology for read and write
are seldom the same
• Real life applications can often be configured accordingly
• Balanced architectures will deliver performance
– Client based IOR performs within 5% of backend
– In excess of 750 MB/s per OST … -> 36 GB/s per rack …
• A well designed solution will scale linearly using Lustre
– cf. NCSA BlueWaters
Optimizing Performance of HPC Storage Systems
[email protected]
Thank You