Performance Evaluation of Adaptive MPI


Chao Huang (1), Gengbin Zheng (1), Sameer Kumar (2), Laxmikant Kale (1)
(1) University of Illinois at Urbana-Champaign
(2) IBM T. J. Watson Research Center

PPoPP 06, 7/16/2015
Motivation

Challenges
- Applications with dynamic nature: shifting workload, adaptive refinement, etc.
- Traditional MPI implementations: limited support for such dynamic applications

Adaptive MPI
- Virtual processes (VPs) via migratable objects
- Powerful run-time system that offers various novel features and performance benefits
Outline
- Motivation
- Design and Implementation
- Features and Benefits
  - Adaptive Overlapping
  - Automatic Load Balancing
  - Communication Optimizations
  - Flexibility and Overhead
- Conclusion
Processor Virtualization

Basic idea of processor virtualization:
- User specifies interaction between objects (VPs)
- RTS maps VPs onto physical processors
- Typically, the number of VPs >> P, to allow for various optimizations

[Figure: user view of interacting VPs vs. system implementation on physical processors]
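The mapping idea above can be sketched in a few lines of Python. This is a hypothetical round-robin placement for illustration only; AMPI's RTS uses richer, measurement-driven mappings, and the function name here is invented:

```python
def map_vps(num_vps, num_procs):
    """Round-robin mapping of virtual processes (VPs) onto physical
    processors. Having num_vps >> num_procs leaves the RTS room to
    migrate VPs later for load balance or communication locality."""
    return {vp: vp % num_procs for vp in range(num_vps)}

# 16 VPs on 4 processors: each processor hosts 4 VPs.
mapping = map_vps(num_vps=16, num_procs=4)
```

The key point is that the application is written in terms of VPs, so the same program runs unchanged as `num_procs` varies.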
AMPI: MPI with Virtualization

Each AMPI virtual process is implemented by a user-level thread embedded in a migratable object.

[Figure: MPI "processes" (VPs) mapped onto real processors]
Adaptive Overlap

Problem: gap between completion time and CPU overhead.
Solution: overlap communication with computation.

[Figure: completion time and CPU overhead of a 2-way ping-pong program on the Turing (Apple G5) cluster]
Adaptive Overlap

[Figure: timeline of a 3D stencil calculation with 1, 2, and 4 VPs per processor]
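A toy cost model (illustrative numbers, not measurements from the talk) shows why running several VPs per processor hides communication latency: while one VP waits on a message, the scheduler runs another VP's compute phase.

```python
def completion_time(vps_per_proc, compute, comm):
    """Toy model: each VP does `compute` units of work, then waits
    `comm` units for a message. With one VP per processor the wait is
    pure idle time; with several VPs, other VPs' compute phases fill
    the wait, so only the uncovered part of it is exposed."""
    total_compute = vps_per_proc * compute
    exposed_wait = max(0, comm - (vps_per_proc - 1) * compute)
    return total_compute + exposed_wait
```

With `compute=10` and `comm=8`, one VP finishes its unit of work in 18 (8 idle), while four VPs finish four units in 40, i.e. 10 per unit of work with no idle time at all.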
Automatic Load Balancing

Challenge
- Dynamically varying applications
- Load imbalance impacts overall performance

Solution
- Measurement-based load balancing
  - Scientific applications are typically iteration-based
  - The principle of persistence: recent load is a good predictor of near-future load
  - RTS collects CPU and network usage of VPs
- Load balancing by migrating threads (VPs)
  - Threads can be packed and shipped as needed
  - Different variations of load balancing strategies
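The measurement-based idea can be sketched as a greedy rebalancer: feed in the per-VP loads the RTS measured last iteration, and reassign the heaviest VPs first, each to the currently least-loaded processor. This is a simplified stand-in (a longest-processing-time heuristic), not AMPI's actual strategy code:

```python
import heapq

def greedy_rebalance(vp_loads, num_procs):
    """Assign VPs to processors from measured loads.

    vp_loads: dict mapping VP id -> measured load (e.g. CPU seconds).
    Returns a dict mapping VP id -> processor, built greedily:
    heaviest VP first, placed on the least-loaded processor so far."""
    heap = [(0.0, p) for p in range(num_procs)]  # (current load, proc)
    heapq.heapify(heap)
    assignment = {}
    for vp, load in sorted(vp_loads.items(), key=lambda kv: -kv[1]):
        proc_load, proc = heapq.heappop(heap)
        assignment[vp] = proc
        heapq.heappush(heap, (proc_load + load, proc))
    return assignment
```

By the principle of persistence, the mapping computed from last iteration's measurements remains good for the next iterations, so the cost of migrating threads is amortized.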
Automatic Load Balancing

Application: Fractography3D, which models fracture propagation in materials.

[Figure: CPU utilization of Fractography3D without vs. with load balancing]
Communication Optimizations

The AMPI run-time is capable of:
- Observing communication patterns
- Applying communication optimizations accordingly
- Switching between communication algorithms automatically

Examples
- Streaming strategy for point-to-point communication
- Collective optimizations
Streaming Strategy

Combining short messages to reduce per-message overhead.

[Figure: streaming strategy for point-to-point communication on the NCSA IA-64 cluster]
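The benefit of combining short messages falls out of the standard alpha-beta communication cost model: each message pays a fixed per-message overhead alpha plus a per-byte cost beta. The constants below are made up for illustration, not NCSA measurements:

```python
def send_cost(num_msgs, bytes_per_msg, alpha=10.0, beta=0.5):
    """Cost of sending each short message separately: every message
    pays the fixed overhead alpha plus beta per byte."""
    return num_msgs * (alpha + beta * bytes_per_msg)

def streamed_cost(num_msgs, bytes_per_msg, alpha=10.0, beta=0.5):
    """Cost with streaming: short messages to the same destination are
    combined into one send, so alpha is paid once for the whole batch."""
    return alpha + beta * num_msgs * bytes_per_msg
```

For many small messages the alpha term dominates, so streaming approaches a num_msgs-fold reduction of the per-message overhead at the price of slightly delayed delivery.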
Optimizing Collectives

- A number of optimizations have been developed to improve collective communication performance.
- An asynchronous collective interface allows higher CPU utilization during collectives: computation is only a small proportion of the elapsed time.

[Figure: time breakdown (Mesh vs. Mesh Compute) of an all-to-all operation using the Mesh library, for message sizes from 76 to 8076 bytes]
Virtualization Overhead

Compared with the performance benefits, the overhead is very small:
- Usually offset by the caching effect alone
- Better performance when the features are applied

[Figure: performance of point-to-point communication on the NCSA IA-64 cluster]
Flexibility

- AMPI runs on an arbitrary number of processors, including big runs on a few processors.
- Native MPI runs only with a specific number of MPI processes.

Execution time [sec] of a 3D stencil calculation of size 240^3 on Lemieux:

Procs         19    27    33    64    80    105   125   140   175   216   250   512
Adaptive MPI  42.4  30.5  24.6  15.6  12.6  10.9  10.8  10.6  9.39  8.63  7.55  5.46
Native MPI    -     29.4  -     14.2  -     -     9.12  -     -     8.07  -     5.52
Conclusion

Adaptive MPI provides the following benefits:
- Adaptive overlap
- Automatic load balancing
- Communication optimizations
- Flexibility
- Automatic checkpoint/restart mechanism
- Shrink/expand

AMPI is being used in real-world parallel applications and frameworks:
- Rocket simulation at CSAR
- FEM Framework

AMPI is portable to a variety of HPC platforms.
Future Work

Performance improvement
- Reducing overhead
- Intelligent communication strategy substitution
- Machine-topology-specific load balancing

Performance analysis
- More direct support for AMPI programs
Thank You!

AMPI is available for download at:
http://charm.cs.uiuc.edu/
Parallel Programming Lab, University of Illinois