An Introduction of GridMPI
http://www.gridmpi.org/
Yutaka Ishikawa (1,2) and Motohiko Matsuda (2)
(1) University of Tokyo
(2) Grid Technology Research Center, AIST
(National Institute of Advanced Industrial Science and Technology)
This work is partially supported by the NAREGI project
2006/1/23
Yutaka Ishikawa, The University of Tokyo
1
Motivation
• MPI, Message Passing Interface, has been widely used to program
parallel applications.
• Users want to run such applications over the Grid environment
without any modifications of the program.
• However, the performance of existing MPI implementations does not scale
on the Grid environment.
[Figure: a single (monolithic) MPI application runs over the Grid environment, spanning computing resources at site A and site B connected by a wide-area network]
2006/1/23
Yutaka Ishikawa, The University of Tokyo
2
Motivation
• Focus on metropolitan-area, high-bandwidth environments:
10 Gbps, 500 miles (smaller than 10 ms one-way latency)
– We have already demonstrated, using an emulated WAN environment, that the
NAS Parallel Benchmark programs scale when the one-way latency is smaller than 10 ms.
Motohiko Matsuda, Yutaka Ishikawa, and Tomohiro Kudoh,
"Evaluation of MPI Implementations on Grid-connected Clusters using
an Emulated WAN Environment," CCGRID 2003, 2003.
[Figure: same configuration as on the previous slide]
2006/1/23
Yutaka Ishikawa, The University of Tokyo
3
Issues
• High Performance Communication Facilities for MPI on Long and Fat Networks
– TCP vs. MPI communication patterns (see the sketch after this slide)
• TCP: designed for streams; burst traffic.
• MPI: repeats the computation and communication phases; the traffic changes with the communication pattern.
– Network Topology
• Latency and Bandwidth
• Interoperability
– Most MPI library implementations use their own network protocol.
• Fault Tolerance and Migration
– To survive a site failure
• Security
[Figure: sites connected over the Internet]
2006/1/23
Yutaka Ishikawa, The University of Tokyo
4
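The TCP vs. MPI contrast above can be made concrete with a small sketch. The following is a hypothetical example (not taken from GridMPI) whose traffic alternates between an idle compute phase and an all-to-all burst, which is exactly the pattern a stream-oriented TCP stack handles poorly on a long and fat network.

```c
/* Hypothetical sketch (not GridMPI code) of the compute/communicate cycle
 * that produces bursty MPI traffic: the network is idle during the compute
 * phase and then sees an all-to-all burst. Build with any MPI compiler. */
#include <mpi.h>
#include <stdlib.h>

#define N     (1 << 14)   /* elements exchanged with each peer */
#define STEPS 10

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *sendbuf = calloc((size_t)N * nprocs, sizeof(double));
    double *recvbuf = calloc((size_t)N * nprocs, sizeof(double));

    for (int step = 0; step < STEPS; step++) {
        /* Computation phase: no traffic on the network. */
        for (long i = 0; i < (long)N * nprocs; i++)
            sendbuf[i] = sendbuf[i] * 0.5 + step;

        /* Communication phase: every process sends to every other process
         * at once, so the wide-area link sees a burst of traffic. */
        MPI_Alltoall(sendbuf, N, MPI_DOUBLE, recvbuf, N, MPI_DOUBLE,
                     MPI_COMM_WORLD);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```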
Issues
• High Performance Communication Facilities for MPI on Long and Fat Networks
– TCP vs. MPI communication patterns
• TCP: designed for streams; burst traffic.
• MPI: repeats the computation and communication phases; the traffic changes with the communication pattern.
– Network Topology
• Latency and Bandwidth
• Interoperability
– There are many MPI library implementations, and most of them use their own network protocol.
• Fault Tolerance and Migration
– To survive a site failure
• Security
[Figure: four sites, each using Vendor A's, B's, C's, and D's MPI library, connected over the Internet]
2006/1/23
Yutaka Ishikawa, The University of Tokyo
5
GridMPI Features
• MPI-2 implementation
• IMPI (Interoperable MPI) protocol and extension for Grid
– MPI-2
– New Collective protocols
– Checkpoint
• Integration of Vendor MPI
– IBM, Solaris, Fujitsu, and MPICH2
• High Performance TCP/IP implementation on Long and Fat
Networks
– Pacing the transmission rate so that burst transmission is controlled
according to the MPI communication pattern (see the sketch after this slide).
• Checkpoint
[Figure: Cluster X running Vendor MPI and Cluster Y running YAMPII, interconnected via IMPI]
2006/1/23
Yutaka Ishikawa, The University of Tokyo
6
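As an illustration of the pacing idea mentioned above, the sketch below throttles a plain TCP socket by sending fixed-size chunks separated by sleeps so that the average rate stays near a target. This is only a user-level approximation for explanation, not GridMPI's actual mechanism (precise software pacing, cited on the concluding slide, works at much finer granularity). The socket descriptor and the target rate are assumed inputs.

```c
/* Illustrative rate-pacing sketch, not GridMPI's implementation: send a
 * buffer in 64 KB chunks and sleep between chunks so that the average
 * transmission rate stays near rate_bps (bits per second). */
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <time.h>

static int paced_send(int sock, const char *buf, size_t len, double rate_bps)
{
    const size_t chunk = 64 * 1024;
    double gap = (double)chunk * 8.0 / rate_bps;   /* seconds per chunk */
    struct timespec ts;
    ts.tv_sec  = (time_t)gap;
    ts.tv_nsec = (long)((gap - (double)(time_t)gap) * 1e9);

    size_t off = 0;
    while (off < len) {
        size_t n = (len - off < chunk) ? len - off : chunk;
        ssize_t sent = send(sock, buf + off, n, 0);
        if (sent < 0)
            return -1;                 /* caller handles errno */
        off += (size_t)sent;
        nanosleep(&ts, NULL);          /* inter-chunk gap bounds the rate */
    }
    return 0;
}
```

In an application, paced_send() would replace a plain send() call when the burst rate has to be capped; the coarse sleep granularity is why a driver-level pacing mechanism is preferable in practice.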
Evaluation
• It is almost impossible to reproduce the communication behavior of a
wide-area network when evaluating performance.
• A WAN emulator, GtrcNET-1, is used to systematically examine
implementations, protocols, communication algorithms, etc.
GtrcNET-1 (developed at AIST, http://www.gtrc.aist.go.jp/gnet/)
• Injection of delay, jitter, errors, …
• Traffic monitoring and frame capture
• Four 1000Base-SX ports
• One USB port for the host PC
• FPGA (XC2V6000)
2006/1/23
Yutaka Ishikawa, The University of Tokyo
7
Experimental Environment
[Figure: two clusters of eight PCs each (Node0–Node7 and Node8–Node15), each behind a Catalyst 3750 switch, connected through the GtrcNET-1 WAN emulator]
• Bandwidth: 1 Gbps
• Delay: 0 ms – 10 ms
• CPU: Pentium 4 / 2.4 GHz, Memory: DDR400 512 MB
• NIC: Intel PRO/1000 (82547EI)
• OS: Linux 2.6.9-1.6 (Fedora Core 2)
• Socket buffer size: 20 MB (see the sketch after this slide)
2006/1/23
Yutaka Ishikawa, The University of Tokyo
8
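For reference, the 20 MB socket buffer listed above is the kind of per-socket setting applied with setsockopt; the sketch below shows the call, under the assumption that the kernel's net.core.rmem_max and net.core.wmem_max limits have been raised to allow it (otherwise the request is silently capped).

```c
/* Minimal sketch: request 20 MB send/receive buffers on a socket, as in
 * the experimental setup. Assumes the kernel limits (net.core.rmem_max,
 * net.core.wmem_max) already permit buffers of this size. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

static int set_socket_buffers(int sock)
{
    int size = 20 * 1024 * 1024;   /* 20 MB */

    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0)
        return -1;
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0)
        return -1;

    /* Report what was actually granted; Linux doubles the value it
     * returns to account for bookkeeping overhead. */
    int granted = 0;
    socklen_t optlen = sizeof(granted);
    if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &granted, &optlen) == 0)
        printf("SO_SNDBUF granted: %d bytes\n", granted);

    return 0;
}
```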
GridMPI vs. MPICH-G2 (1/4)
FT (Class B) of NAS Parallel Benchmarks 3.2
on 8 x 8 processes
[Chart: relative performance of FT(GridMPI) and FT(MPICH-G2) versus one-way delay, 0–12 msec]
2006/1/23
Yutaka Ishikawa, The University of Tokyo
9
GridMPI vs. MPICH-G2 (2/4)
IS (Class B) of NAS Parallel Benchmarks 3.2
on 8 x 8 processes
[Chart: relative performance of IS(GridMPI) and IS(MPICH-G2) versus one-way delay, 0–12 msec]
2006/1/23
Yutaka Ishikawa, The University of Tokyo
10
GridMPI vs. MPICH-G2 (3/4)
LU (Class B) of NAS Parallel Benchmarks 3.2
on 8 x 8 processes
[Chart: relative performance of LU(GridMPI) and LU(MPICH-G2) versus one-way delay, 0–12 msec]
2006/1/23
Yutaka Ishikawa, The University of Tokyo
11
GridMPI vs. MPICH-G2 (4/4)
NAS Parallel Benchmarks 3.2 Class B
on 8 x 8 processes
[Chart: relative performance of SP, BT, MG, and CG with GridMPI and MPICH-G2 versus one-way delay, 0–12 msec]
No parameters tuned in GridMPI
2006/1/23
Yutaka Ishikawa, The University of Tokyo
12
GridMPI on Actual Network
• NAS Parallel Benchmarks run using an 8-node (2.4 GHz) cluster at Tsukuba
and an 8-node (2.8 GHz) cluster at Akihabara (16 nodes in total)
• Comparing the performance with:
– the result using 16 nodes (2.4 GHz)
– the result using 16 nodes (2.8 GHz)
[Chart: relative performance of BT, CG, EP, FT, IS, LU, MG, and SP against the 2.4 GHz and 2.8 GHz baselines]
[Figure: JGN2 Network, 10 Gbps bandwidth, 1.5 msec RTT; Pentium-4 2.4 GHz x 8 connected by 1G Ethernet @ Tsukuba and Pentium-4 2.8 GHz x 8 connected by 1G Ethernet @ Akihabara, 60 km (40 mi.) apart]
2006/1/23
Yutaka Ishikawa, The University of Tokyo
13
Demonstration
• Easy installation
– Download the source
– Make it and set up configuration files
• Easy use
– Compile your MPI application
– Run it! (a minimal example follows this slide)
[Figure: the same JGN2 testbed as on the previous slide (Tsukuba and Akihabara clusters, 10 Gbps bandwidth, 1.5 msec RTT)]
2006/1/23
Yutaka Ishikawa, The University of Tokyo
14
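As a concrete version of the "compile your application and run it" step, here is a minimal MPI program; the mpicc/mpirun command names in the comment are the conventional MPI ones and are given only as an assumption, since an actual GridMPI installation may use different commands or options.

```c
/* hello.c -- minimal MPI program to test an installation.
 *
 * Conventional build/run commands (actual GridMPI command names and
 * options may differ; consult the GridMPI documentation):
 *   $ mpicc -o hello hello.c
 *   $ mpirun -np 4 ./hello
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
```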
NAREGI Software Stack (Beta Ver. 2006)
[Stack diagram: Grid-Enabled Nano-Applications; Grid Visualization; Data Grid; Grid Programming (Grid RPC, GridMPI); Grid PSE; Grid Workflow; Super Scheduler; Distributed Information Service; Grid VM; (Globus, Condor, UNICORE / OGSA / WSRF); High-Performance & Secure Grid Networking]
2006/1/23
Yutaka Ishikawa, The University of Tokyo
15
GridMPI Current Status
http://www.gridmpi.org/
• GridMPI version 0.9 has been released
– MPI-1.2 features are fully supported
– MPI-2.0 features are supported except for MPI-IO
and the one-sided communication primitives
– Conformance tests
• MPICH Test Suite: 0/142 (Fails/Tests)
• Intel Test Suite: 0/493 (Fails/Tests)
• GridMPI version 1.0 will be released this spring
– MPI-2.0 fully supported
2006/1/23
Yutaka Ishikawa, The University of Tokyo
16
Concluding Remarks
• GridMPI is integrated into the NAREGI package.
• GridMPI is not only a production implementation but also our research
vehicle for the Grid environment, in the sense that new ideas for the Grid
are implemented and tested in it.
• We are currently studying high-performance communication
mechanisms for long and fat networks:
– Modifications of TCP behavior
• M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, and Y. Ishikawa, "TCP
Adaptation for MPI on Long-and-Fat Networks," IEEE Cluster 2005, 2005.
– Precise software pacing
• R. Takano, T. Kudoh, Y. Kodama, M. Matsuda, H. Tezuka, and Y. Ishikawa,
"Design and Evaluation of Precise Software Pacing Mechanisms for Fast
Long-Distance Networks," PFLDnet 2005, 2005.
– Collective communication algorithms with respect to network latency
and bandwidth (see the sketch after this slide).
2006/1/23
Yutaka Ishikawa, The University of Tokyo
17
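To illustrate what a latency-aware collective can look like, the sketch below performs a two-level broadcast: once across the wide-area link between one leader per cluster, then inside each cluster over the fast local network. It is a generic textbook scheme, not GridMPI's algorithm; the cluster_id() mapping (a placeholder that assumes eight processes per site, as in the testbed) and a world-rank-0 root are assumptions made for the example.

```c
/* Generic two-level broadcast sketch (not GridMPI's algorithm): pay the
 * WAN latency only between per-cluster leaders, then fan out over the LAN.
 * Assumes the broadcast root is world rank 0. */
#include <mpi.h>

/* Placeholder assumption: eight processes per site, as in the testbed. */
static int cluster_id(int world_rank)
{
    return world_rank / 8;
}

int hierarchical_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* One communicator per cluster; key = world rank, so local rank 0 is
     * the smallest world rank in each cluster (and world rank 0 is the
     * leader of its own cluster). */
    MPI_Comm local;
    MPI_Comm_split(comm, cluster_id(rank), rank, &local);

    int local_rank;
    MPI_Comm_rank(local, &local_rank);

    /* Leader communicator: local rank 0 of every cluster. */
    MPI_Comm leaders;
    MPI_Comm_split(comm, local_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);

    /* Step 1: broadcast between leaders only, crossing the WAN once per
     * remote cluster. Leader rank 0 is world rank 0, the assumed root. */
    if (leaders != MPI_COMM_NULL) {
        MPI_Bcast(buf, count, type, 0, leaders);
        MPI_Comm_free(&leaders);
    }

    /* Step 2: broadcast inside each cluster over the low-latency LAN. */
    MPI_Bcast(buf, count, type, 0, local);

    MPI_Comm_free(&local);
    return MPI_SUCCESS;
}
```

Compared with a topology-unaware broadcast, this scheme sends each message across the high-latency link only once per remote cluster, which is the kind of latency/bandwidth trade-off the collective algorithms mentioned above are concerned with.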
BACKUP
2006/1/23
Yutaka Ishikawa, The University of Tokyo
18
GridMPI Version 1.0
[Architecture diagram: MPI API; IMPI; RPIM Interface; LACT Layer (Collectives); Request Layer; Request Interface; P2P Interface; Vendor MPI; O2G; MX; PMv2; IMPI; TCP/IP; Globus; SCore; rsh; ssh]
– YAMPII, developed at the University of Tokyo, is used
as the core implementation
– Intra-communication by YAMPII (TCP/IP, SCore)
– Inter-communication by IMPI (TCP/IP)
2006/1/23
Yutaka Ishikawa, The University of Tokyo
19
GridMPI vs. Others (1/2)
NAS Parallel Benchmarks 3.2 Class B
on 8 x 8 processes
[Chart: relative performance of FT, IS, LU, SP, BT, MG, and CG with GridMPI and MPICH-G2 versus one-way delay, 0–12 msec]
2006/1/23
Yutaka Ishikawa, The University of Tokyo
20
GridMPI vs. Others (1/2)
NAS Parallel Benchmarks 3.2 Class B
on 8 x 8 processes
[Chart: relative performance of BT, CG, LU, MG, and SP at 0 ms, 5 ms, and 10 ms delay for GridMPI, GridMPI (with PSP), MPICH, LAM/MPI, YAMPII, MPICH2, and MPICH-G2]
2006/1/23
Yutaka Ishikawa, The University of Tokyo
21
GridMPI vs. Others (2/2)
NAS Parallel Benchmarks 3.2 Class B
on 8 x 8 processes
[Chart: relative performance of FT and IS at 0 ms, 5 ms, and 10 ms delay for GridMPI, GridMPI (with PSP), MPICH, LAM/MPI, YAMPII, MPICH2, and MPICH-G2]
2006/1/23
Yutaka Ishikawa, The University of Tokyo
22
GridMPI vs. Others
NAS Parallel Benchmarks 3.2
on 16 x 16 processes
[Chart: relative performance of FT and IS at 0 ms, 2 ms, 5 ms, and 10 ms delay for GridMPI, GridMPI (with PSP), MPICH, LAM/MPI, YAMPII, MPICH2, and MPICH-G2]
2006/1/23
Yutaka Ishikawa, The University of Tokyo
23
GridMPI vs. Others
NAS Parallel Benchmarks 3.2
[Chart: relative performance of BT, CG, LU, MG, and SP at 0 ms, 5 ms, and 10 ms delay for GridMPI, GridMPI (with PSP), MPICH, LAM/MPI, YAMPII, MPICH2, and MPICH-G2]
2006/1/23
Yutaka Ishikawa, The University of Tokyo
24