Transcript Slide 1
Life Sciences and Meteorology Industries
High-Performance Computing Solutions and Success Stories
凌巍才
HPC Product Technology Consultant
Dell (China) Co., Ltd.
Confidential
Global Marketing
Agenda

• Life sciences high-performance computing solutions
  – GPU acceleration solutions
  – High-performance storage solutions
• WRF V3.3 (a meteorology industry application): testing and tuning on the Dell R720 server
  – gcc compiler
  – Intel compiler
• Success stories
Life Sciences: HPC GPU Solutions

In the life sciences, many users adopt GPU-accelerated solutions.
CPU + GPU Computing
HPCC GPU Heterogeneous Platform
Dell server solutions with GPU support (2012, 12th-generation servers)

External solutions pair PowerEdge C servers (C6220, C6145) with the C410x expansion chassis; internal solutions (R720, T620) host the GPUs in the server itself.

| Metric                   | C6220 (1:1) | C6220 (2:1) | C6145 (1:1) | C6145 (2:1) | T620 (internal) | R720 (internal) |
|--------------------------|-------------|-------------|-------------|-------------|-----------------|-----------------|
| GPU:Socket Ratio         | 1:1         | 2:1         | 1:1         | 2:1         | 2:1             | 1:1             |
| Total System Boards      | 8           | 4           | 4           | 2           | 1               | 1               |
| Total HIC                | 8           | 4           | 8           | 4           | 0               | 0               |
| IB Capable               | Yes         | Yes         | Yes         | Yes         | Yes*            | Yes             |
| Total GPU                | 16          | 16          | 16          | 16          | 4               | 2               |
| Per GPU B/W (PCIe lanes) | x8          | x4          | x8          | x4          | x4              | x16             |
| MSRP (M2075)             | $117,000    | $86,900     | $114,000    | $85,250     | $19,000         | $13,000         |
| Power Envelope (est.)    | 5.525 kW    | 4.118 kW    | 5.030 kW    | 3.802 kW    | TBD             | TBD             |
| Theoretical GFLOPs       | TBD         | TBD         | 9,326       | 8,932       | 2,431           | 1,401           |
| Est. GFLOPs              | TBD         | TBD         | 2,891       | 1,697       | TBD             | TBD             |
| GFLOPS/Rack U            | TBD         | TBD         | 413         | 339         | 486             | 701             |
| $/GFLOPS                 | TBD         | TBD         | 39          | 50          | 8               | 9               |
| Rack Size (U)            | 7           | 5           | 7           | 5           | 5               | 2               |
| GPU/Rack U               | 2.3         | 3.2         | 2.3         | 3.2         | 0.8             | 1.0             |
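The $/GFLOPS row is simply MSRP divided by estimated GFLOPs; a quick check on the two columns where both figures are populated reproduces the quoted values:

```shell
# $/GFLOPS = MSRP / estimated GFLOPs, for the two fully-populated columns.
awk 'BEGIN {
  printf "%.0f\n", 114000 / 2891   # MSRP $114,000, Est. 2,891 GFLOPs
  printf "%.0f\n",  85250 / 1697   # MSRP  $85,250, Est. 1,697 GFLOPs
}'
```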
GPU Expansion Chassis (External GPU Solution)
Dell PowerEdge C410x

PCIe expansion chassis connecting 1-8 hosts to 1-16 PCIe devices.
Great for: HPC including universities, oil & gas, biomed research, design, simulation, mapping, visualization, rendering, and gaming

• 3U chassis, 19" wide, 143 pounds
• PCI express modules: 10 front, 6 rear
• PCI form factors: HH/HL and FH/HL
• Up to 225W per module
• PCIe inputs: 8 PCIe x16 iPASS ports
• PCI fan-out options: x16 to 1 slot, x16 to 2 slots, x16 to 3 slots, x16 to 4 slots
• GPUs supported: NVIDIA M1060, M2050, M2070 (TBD)
• Thermals: high-efficiency 92mm fans; N+1 fan redundancy
• Management: on-board BMC; IPMI 2.0; dedicated management port
• Power supplies: 4 x 1400W hot-plug, high-efficiency PSUs; N+1 power redundancy
• Services vary by region: IT Consulting, Server and Storage Deployment, Rack Integration (US only), Support Services
PowerEdge C410x PCIe Module

• Serviceable PCIe module ("taco") capable of supporting any half-height/half-length (HH/HL) or full-height/half-length (FH/HL) cards
• FH/FL cards supported with extended PCIe module
• Future-proofing on next generations of NVIDIA and AMD ATI GPU cards

[Figure: PCIe module callouts: power connector for GPGPU card, LED, board-to-board connector for x16 Gen PCIe signals and power, GPU card]
PowerEdge C410x Configurations

• Enabling HPC applications to optimize the cost/performance equation off a single x16 connection

[Figure: four fan-out diagrams. In each, the host's x16 HIC connects via an iPass cable to a PCI switch in the C410x, which fans out to 1-4 GPUs.]

| Fan-out     | Configuration              | Density   |
|-------------|----------------------------|-----------|
| 1 GPU / x16 | (1) C410x + (2) C6100 = 7U | 8 GPU/7U  |
| 2 GPU / x16 | (1) C410x + (2) C6100 = 7U | 16 GPU/7U |
| 3 GPU / x16 | (1) C410x + (1) C6100 = 5U | 12 GPU/5U |
| 4 GPU / x16 | (1) C410x + (1) C6100 = 5U | 16 GPU/5U |

GPU/U ratios assume a PowerEdge C6100 host with 4 servers per 2U chassis.
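The GPU-per-U densities of the four fan-out configurations above are straightforward arithmetic (GPUs divided by rack units):

```shell
# GPU-per-U density for the four C410x/C6100 fan-out options.
awk 'BEGIN {
  printf "1 GPU/x16: %.1f GPU/U\n",  8 / 7
  printf "2 GPU/x16: %.1f GPU/U\n", 16 / 7
  printf "3 GPU/x16: %.1f GPU/U\n", 12 / 5
  printf "4 GPU/x16: %.1f GPU/U\n", 16 / 5
}'
```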
Flexibility of the PowerEdge C410x

• Increases to 8:1 possible with dual x16

[Figure: hosts with two x16 HICs, each connected by iPass cables to PCI switches in one or two C410x chassis, fanning out to up to 8 GPUs per host]
PowerEdge C6100 Configurations: "2:1 Sandwich"

[Figure: C6100 / C410x / C6100 stack]

Summary: C6100 "2:1 Sandwich"
• One Dell C410x (16 GPUs)
• Two C6100 (8 nodes)
• One x16 slot for each node to 2 GPUs
• 7U total
• 16 GPUs total
• 8 nodes total (2 GPUs per board)

Details
• Two C6100
  – 8 system boards
  – 2S Westmere, 12 DIMM slots, QDR IB, up to 6 drives per host
  – Single-port x16 HIC (iPASS)
• Single C410x
  – 16 GPUs (fully populated)
  – PCIe x8 per GPU
• Total space = 7U

Note: this configuration is equivalent to using the C6100 and the NVIDIA S2050, but this configuration is more dense.
PowerEdge C6100 Configurations: "4:1 Sandwich"

[Figure: C410x / C6100 stack]

Summary: C6100 "4:1 Sandwich"
• One Dell C410x (16 GPUs)
• One C6100 (4 nodes)
• One x16 slot for each node to 4 GPUs
• 5U total
• 16 GPUs total
• 4 nodes total (4 GPUs per board)

Details
• One C6100
  – 4 system boards
  – 2S Westmere, 12 DIMM slots, QDR IB, up to 6 drives per host
  – Single-port x16 HIC (iPASS)
• Single C410x
  – 16 GPUs (fully populated)
  – PCIe x4 per GPU
• Total space = 5U
PowerEdge C6100 Configurations: "8:1 Sandwich" (Possible Future Development)

[Figure: C410x / C6100 / C410x stack]

Summary: C6100 "8:1 Sandwich"
• Two Dell C410x (32 GPUs)
• One C6100 (4 nodes)
• One x16 slot for each node to 8 GPUs
• 8U total
• 32 GPUs total
• 4 nodes total (8 GPUs per board)

Details
• One C6100
  – 4 system boards
  – 2S Westmere, 12 DIMM slots, QDR IB, up to 6 drives per host
  – Single-port x16 HIC (iPASS)
• Two C410x
  – 32 GPUs (fully populated)
  – PCIe x2 per GPU
• Total space = 8U
• See later table for metrics
PowerEdge C6145 Configurations: "8:1 Sandwich" (5U of rack space)

[Figure: C6145 / C410x stack]

Summary: C6145 "8:1 Sandwich"
• One Dell C410x (16 GPUs)
• One C6145 (2 nodes)
• Two to four HIC slots per node, connecting to the 16-GPU C410x
• 5U total
• 16 GPUs total
• 2 nodes total (8 GPUs per board)

Details
• One C6145
  – 2 system boards
  – 4S MagnyCours, 32 DIMM slots, QDR IB, up to 12 drives per host
  – 3 x single-port x16 HIC (iPASS) + 1 x single-port onboard x16 HIC (iPASS)
• One C410x
  – 16 GPUs (fully populated)
  – PCIe x4-x8 per GPU
• Total space = 5U
PowerEdge C6145 Configurations: "16:1 Sandwich" (8U of rack space)

[Figure: C410x / C6145 / C410x stack]

Summary: C6145 "16:1 Sandwich"
• Two Dell C410x (32 GPUs)
• One C6145 (2 nodes)
• Four HIC slots for each node to 16 GPUs
• 8U total
• 32 GPUs total
• 2 nodes total (16 GPUs per board)

Details
• One C6145
  – 2 system boards
  – 4S MagnyCours, 32 DIMM slots, QDR IB, up to 12 drives per host
  – 3 x single-port x16 HIC (iPASS) + 1 x single-port onboard x16 HIC (iPASS)
• Two C410x
  – 32 GPUs (fully populated)
  – PCIe x4 per GPU
• Total space = 8U
PowerEdge C410x Block Diagram

[Figure: block diagram showing 16 GPUs, two switch levels (2x4 and 1x8), and x8 host connections]
C410x BMC Console Configuration Interface

[Figure: screenshot of the C410x BMC console]
Servers Supporting the GPU Expansion Chassis
HIC/C410x Support Matrix

• Dell external GPU solution support
  – A Hardware Interface Card (HIC) in a PCIe slot connects to external GPU(s) in the C410x
  – Dell "slot validates" NVIDIA interface cards to verify power, thermals, etc.

| Server          | C410x Support | Planned Support Date      |
|-----------------|---------------|---------------------------|
| C6100           | Yes           | Now                       |
| C6105           | RTS+          | Now (BIOS 1.7.1 or later) |
| C6145           | RTS           | Now                       |
| C1100           | Yes           | Now                       |
| Precision R5500 | Yes           | Now (disable SSC in BIOS) |
| R710            | Yes           | Now                       |
| M610x           | Yes           | Now                       |
| R410            | Yes           | Now                       |
| R720            | RTS           | RTS                       |
| R720xd          | RTS           | RTS                       |
| R620            | RTS           | RTS                       |
| C6220           | RTS           | RTS                       |
Life Sciences Application Test: GPU-HMMER

[Figure: GPU-HMMER, CPU vs. GPU. Wall-clock time (s) for CPU vs. C410x/C6100 (1 GPU) at HMM lengths 415, 983, 1419, and 2293; GPU speedups are 1.8X, 2.7X, 2.8X, and 2.9X respectively.]
GPU:Host Scaling: GPU-HMMER

[Figure: GPU-HMMER GPU scaling. Wall-clock time (s) at HMM lengths 415, 983, 1419, and 2293. Speedups over CPU: C410x/C6100 (1 GPU) 1.8X, (2 GPUs) 3.6X, (4 GPUs) 7.2X; internal 2-x16 (2 GPUs) 3.6X.]
GPU:Host Scaling: NAMD

[Figure: NAMD STMV benchmark, steps/second]

| Configuration            | Steps/second | Speedup |
|--------------------------|--------------|---------|
| CPU                      | 0.10         | 1.0X    |
| C410x / C6100 (1 GPU)    | 0.47         | 4.7X    |
| C410x / C6100 (2 GPUs)   | 0.82         | 8.2X    |
| C410x / C6100 (4 GPUs)   | 1.52         | 15.2X   |
| Internal 2-x16 (2 GPUs)  | 0.95         | 9.5X    |
GPU:Host Scaling: LAMMPS LJ-Cut

[Figure: LAMMPS Lennard-Jones GPU scaling. Wall-clock time (s) vs. number of particles (256,000; 500,000; 1,000,188). Speedups: C410x/C6100 (1 GPU) 8.5X, (2 GPUs) 13.5X, (4 GPUs) 14.4X; internal 2-x16 (2 GPUs) 14.0X.]
Life Sciences: Storage Solutions

Life sciences: growth rates of compute and data capacity
The Lustre Parallel File System

• Key Lustre components:
  1. Clients (compute nodes)
     – "Users" of the file system where applications run
     – The Dell HPC cluster
  2. Metadata Server (MDS)
     – Holds metadata information
  3. Object Storage Server (OSS)
     – Provides back-end storage for the users' files
     – Additional OSS units increase throughput linearly

[Figure: clients connected to one MDS and a row of OSS nodes]
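On the client side, the picture above reduces to a single mount against the MDS/MGS; a sketch, where "mds01", "o2ib", and "/lfs" are placeholder names for the management server, the LNET network, and the file-system name (the command is only printed here, not executed):

```shell
# Dry run: how a compute node would mount the Lustre file system.
# mds01, o2ib, and /lfs are placeholders; adding OSSes scales bandwidth
# without changing this client-side command.
echo mount -t lustre mds01@o2ib:/lfs /mnt/lustre
```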
InfiniBand (IPoIB) NFS Performance: Sequential Read

[Figure: NSS IPoIB sequential-read throughput (KB/s) vs. number of threads (nodes): 1, 2, 4, 8, 16, 24, 32, for NSS Small, Medium, and Large]

• Peaks:
  – NSS Small: 1 node doing IO (fairly level until 4 nodes)
  – NSS Medium: 4 nodes doing IO (not much drop-off)
  – NSS Large: 8 nodes doing IO (good performance over range)
InfiniBand (IPoIB) NFS Performance: Sequential Write

[Figure: NSS IPoIB sequential-write throughput (KB/s) vs. number of threads (nodes): 1, 2, 4, 8, 16, 24, 32, for NSS Small, Medium, and Large]

• Peaks:
  – NSS Small: 1 node doing IO (steady drop-off to 16 nodes)
  – NSS Medium: 2 nodes doing IO (good performance for up to 8 nodes)
  – NSS Large: 4 nodes doing IO (good performance over range)
WRF V3.3 Application Testing and Tuning
Dell Test Environment

• Dell R720
  – CPU: 2x Intel Sandy Bridge E5-2650
  – Memory: 8x 8GB (64GB total)
  – Hard disk: 2x 300GB 15K rpm (RAID 0)
• BIOS settings
  – Disable HT (Hyper-Threading)
  – Memory optimized
  – High performance enabled (power max)
• OS
  – Red Hat Enterprise Linux 6.3
gcc Test

• gcc, gfortran, g++
• zlib 1.2.5
• HDF5 1.8.8
• netCDF 4
• WRF V3.3
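The stack above is typically built bottom-up (zlib, then HDF5, then netCDF, then WRF), with WRF's configure script locating the libraries through environment variables. A minimal environment sketch, assuming a hypothetical install prefix /opt/wrf-libs:

```shell
# Hypothetical prefix; zlib, HDF5, and netCDF would be installed here first.
export DIR=/opt/wrf-libs
export CC=gcc FC=gfortran CXX=g++
# WRF's configure locates netCDF (and HDF5 beneath it) via these variables:
export NETCDF=$DIR/netcdf
export HDF5=$DIR/hdf5
export LDFLAGS="-L$DIR/zlib/lib"
export CPPFLAGS="-I$DIR/zlib/include"
echo "NETCDF=$NETCDF"
```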
Test Results

• wrf output: simulation period 2011-11-30 through 2011-12-05; wall-clock time 13h 9m 53s
  – wrf.exe started at: Sun Apr 29 09:35:36 CST 2012
  – wrf: SUCCESS COMPLETE WRF
  – wrf.exe completed at: Sun Apr 29 22:45:29 CST 2012
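The reported wall-clock time follows directly from the two timestamps in the log; a quick check with GNU date:

```shell
# Elapsed time between the wrf.exe start/end stamps from the run log.
start="Sun Apr 29 09:35:36 CST 2012"
end="Sun Apr 29 22:45:29 CST 2012"
s=$(date -d "$start" +%s)
e=$(date -d "$end" +%s)
d=$(( e - s ))
printf '%dh %dm %ds\n' $(( d / 3600 )) $(( d % 3600 / 60 )) $(( d % 60 ))
# prints "13h 9m 53s"
```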
Configuration File (configure.wrf)

# Settings for x86_64 Linux, gfortran compiler with gcc (smpar)
DMPARALLEL       = 1
OMPCPP           = -D_OPENMP
OMP              = -fopenmp
OMPCC            = -fopenmp
SFC              = gfortran
SCC              = gcc
CCOMP            = gcc
DM_FC            = mpif90 -f90=$(SFC)
DM_CC            = mpicc -cc=$(SCC)
FC               = $(SFC)
CC               = $(SCC) -DFSEEKO64_OK
LD               = $(FC)
RWORDSIZE        = $(NATIVE_RWORDSIZE)
PROMOTION        = # -fdefault-real-8 # uncomment manually
ARCH_LOCAL       = -DNONSTANDARD_SYSTEM_SUBR
CFLAGS_LOCAL     = -w -O3 -c -DLANDREAD_STUB
LDFLAGS_LOCAL    =
CPLUSPLUSLIB     =
ESMF_LDFLAG      = $(CPLUSPLUSLIB)
FCOPTIM          = -O3 -ftree-vectorize -ftree-loop-linear -funroll-loops
FCREDUCEDOPT     = $(FCOPTIM)
FCNOOPT          = -O0
FCDEBUG          = # -g $(FCNOOPT)
FORMAT_FIXED     = -ffixed-form
FORMAT_FREE      = -ffree-form -ffree-line-length-none
FCSUFFIX         =
BYTESWAPIO       = -fconvert=big-endian -frecord-marker=4
FCBASEOPTS_NO_G  = -w $(FORMAT_FREE) $(BYTESWAPIO)
FCBASEOPTS       = $(FCBASEOPTS_NO_G) $(FCDEBUG)
MODULE_SRCH_FLAG =
TRADFLAG         = -traditional
CPP              = /lib/cpp -C -P
AR               = ar
ARFLAGS          = ru
M4               = m4 -G
RANLIB           = ranlib
CC_TOOLS         = $(SCC)
wrf.out (excerpt)
….
WRF NUMBER OF TILES FROM OMP_GET_MAX_THREADS = 16
WRF TILE 1 IS 1 IE 250 JS 1 JE 10
WRF TILE 2 IS 1 IE 250 JS 11 JE 20
WRF TILE 3 IS 1 IE 250 JS 21 JE 30
WRF TILE 4 IS 1 IE 250 JS 31 JE 39
WRF TILE 5 IS 1 IE 250 JS 40 JE 48
WRF TILE 6 IS 1 IE 250 JS 49 JE 57
WRF TILE 7 IS 1 IE 250 JS 58 JE 66
WRF TILE 8 IS 1 IE 250 JS 67 JE 75
WRF TILE 9 IS 1 IE 250 JS 76 JE 84
WRF TILE 10 IS 1 IE 250 JS 85 JE 93
WRF TILE 11 IS 1 IE 250 JS 94 JE 102
WRF TILE 12 IS 1 IE 250 JS 103 JE 111
WRF TILE 13 IS 1 IE 250 JS 112 JE 120
WRF TILE 14 IS 1 IE 250 JS 121 JE 130
WRF TILE 15 IS 1 IE 250 JS 131 JE 140
WRF TILE 16 IS 1 IE 250 JS 141 JE 150
WRF NUMBER OF TILES = 16
…..
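The 16 tiles above match the OpenMP thread count reported by OMP_GET_MAX_THREADS; for this smpar build, that count is controlled by the environment:

```shell
# One OpenMP thread per physical core (HT disabled, 2x 8-core E5-2650).
# WRF's smpar build derives its default tile count from this value.
export OMP_NUM_THREADS=16
echo "OMP threads: $OMP_NUM_THREADS"
```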
System Resource Analysis: CPU

• CPU: (mpstat -P ALL)

Linux 2.6.32-257.el6.x86_64 (r720)   04/29/2012   _x86_64_   (16 CPU)

04:06:40 PM  CPU   %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %idle
04:06:40 PM  all  85.27   0.00  2.62     0.01  0.00   0.00    0.00    0.00  12.10
04:06:40 PM    0  85.71   0.00  2.58     0.01  0.00   0.00    0.00    0.00  11.69
04:06:40 PM    1  85.05   0.00  2.77     0.05  0.00   0.04    0.00    0.00  12.09
04:06:40 PM    2  85.26   0.00  2.69     0.00  0.00   0.00    0.00    0.00  12.05
04:06:40 PM    3  85.24   0.00  2.65     0.01  0.00   0.00    0.00    0.00  12.10
04:06:40 PM    4  87.36   0.00  1.90     0.00  0.00   0.00    0.00    0.00  10.73
04:06:40 PM    5  84.97   0.00  2.70     0.00  0.00   0.00    0.00    0.00  12.33
04:06:40 PM    6  85.23   0.00  2.64     0.00  0.00   0.00    0.00    0.00  12.13
04:06:40 PM    7  84.97   0.00  2.71     0.00  0.00   0.00    0.00    0.00  12.32
04:06:40 PM    8  85.33   0.00  2.60     0.00  0.00   0.00    0.00    0.00  12.06
04:06:40 PM    9  85.32   0.00  2.57     0.00  0.00   0.00    0.00    0.00  12.11
04:06:40 PM   10  84.88   0.00  2.77     0.00  0.00   0.00    0.00    0.00  12.35
04:06:40 PM   11  84.93   0.00  2.69     0.00  0.00   0.00    0.00    0.00  12.38
04:06:40 PM   12  85.16   0.00  2.62     0.00  0.00   0.00    0.00    0.00  12.21
04:06:40 PM   13  85.00   0.00  2.69     0.00  0.00   0.00    0.00    0.00  12.31
04:06:40 PM   14  84.91   0.00  2.75     0.00  0.00   0.00    0.00    0.00  12.34
04:06:40 PM   15  85.02   0.00  2.65     0.00  0.00   0.00    0.00    0.00  12.33
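Averaging the per-CPU %idle column reproduces the "all" row (12.10), confirming the load is spread evenly across all 16 cores:

```shell
# Mean %idle over the 16 CPUs, transcribed from the mpstat capture above.
awk 'BEGIN {
  n = split("11.69 12.09 12.05 12.10 10.73 12.33 12.13 12.32 12.06 12.11 12.35 12.38 12.21 12.31 12.34 12.33", v)
  for (i = 1; i <= n; i++) s += v[i]
  printf "%.2f\n", s / n
}'
# prints "12.10"
```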
System Resource Analysis: Memory

• Memory: (free)

             total       used       free     shared    buffers     cached
Mem:      65895488   32823072   33072416          0      38220   26885024
-/+ buffers/cache:    5899828   59995660
Swap:     66027512          0   66027512
System Resource Analysis: IO, HDD

• IO: (iostat)

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               9.01       125.71      2063.47    3096354   50823660
dm-0              0.64        12.63         1.99     311170      49016
dm-1              0.01         0.10         0.00       2576          0
dm-2            258.17       112.05      2061.48    2759698   50774616

• HDD: (df)

Filesystem                    1K-blocks     Used  Available Use% Mounted on
/dev/mapper/vg_r720-lv_root    51606140  5002372   43982328  11% /
tmpfs                          32947744       88   32947656   1% /dev/shm
/dev/sda1                        495844    37433     432811   8% /boot
/dev/mapper/vg_r720-lv_home   458559680 58258760  377007380  14% /home
Intel Test
Intel Links

• http://software.intel.com/en-us/articles/building-the-wrf-with-intel-compilers-on-linux-and-improving-performance-on-intel-architecture/
• http://software.intel.com/en-us/articles/wrf-and-wps-v311-installation-bkm-with-inter-compilers-and-intelr-mpi/
• http://www.hpcadvisorycouncil.com/pdf/WRF_Best_Practices.pdf
Intel Compilers Flags
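As a reference point, a typical configure.wrf fragment for an ifort/icc build on Sandy Bridge might look like the following; these flags are an assumption based on the Intel build articles linked above, not taken from this deck:

```makefile
# Assumed example (ifort/icc) -- illustrative only
SFC          = ifort
SCC          = icc
DM_FC        = mpif90 -f90=$(SFC)
DM_CC        = mpicc -cc=$(SCC)
FCOPTIM      = -O3 -xAVX
CFLAGS_LOCAL = -w -O3 -ip
# Reduced-precision options discussed on the tuning slide:
#   -fp-model fast=2 -no-prec-div -no-prec-sqrt
```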
Intel Tuning
http://software.intel.com/en-us/articles/performance-hints-for-wrf-on-intel-architecture/

1. Reducing MPI overhead:
   • -genv I_MPI_PIN_DOMAIN omp
   • -genv KMP_AFFINITY=compact
   • -perhost
2. Improving cache and memory bandwidth utilization:
   • numtiles = X
3. Using Intel® Math Kernel Library (MKL) DFT for polar filters:
   • Depending on workload, Intel® MKL DFT may provide up to 3x speedup of simulation speed
4. Speeding up computations by reducing precision:
   • -fp-model fast=2 -no-prec-div -no-prec-sqrt
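Item 1 combines into a launch line like the sketch below, assuming Intel MPI and a built wrf.exe; the rank count (-np) and ranks-per-host (-perhost) values are placeholders. The command is only printed here, not executed:

```shell
# Dry run: print an Intel MPI launch line with the pinning flags from item 1.
# -np and -perhost values are placeholders for a real cluster layout.
PIN_FLAGS="-genv I_MPI_PIN_DOMAIN omp -genv KMP_AFFINITY compact"
echo mpirun $PIN_FLAGS -perhost 2 -np 4 ./wrf.exe
```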
Success Stories

BGI (Beijing Genomics Institute)
Tsinghua University School of Life Sciences

Success References in Life Science

• Domestic (China)
  – Beijing Genome Institute (BGI)
  – Tsinghua University Life Institute
  – Beijing Normal University
  – Jiangsu Taicang Life Institute
  – The 4th Military Medical University
  – …
• International
  – David H. Murdock Research Institute
  – Virginia Bioinformatics Institute
  – University of Florida speeds up memory-intensive gene
  – UCSF
  – National Center for Supercomputing Applications
  – …
Thank you!