Performance Evaluation of Adaptive MPI

Chao Huang (1), Gengbin Zheng (1), Sameer Kumar (2), Laxmikant Kale (1)
(1) University of Illinois at Urbana-Champaign
(2) IBM T. J. Watson Research Center

PPoPP 06 (7/16/2015)
Motivation

Challenges
- Applications with dynamic nature: shifting workload, adaptive refinement, etc.
- Traditional MPI implementations: limited support for such dynamic applications

Adaptive MPI
- Virtual processes (VPs) via migratable objects
- Powerful run-time system that offers various novel features and performance benefits
Outline
- Motivation
- Design and Implementation
- Features and Benefits
  - Adaptive Overlapping
  - Automatic Load Balancing
  - Communication Optimizations
  - Flexibility and Overhead
- Conclusion
Processor Virtualization

Basic idea of processor virtualization
- User specifies interaction between objects (VPs)
- RTS maps VPs onto physical processors
- Typically, the number of VPs >> P, to allow for various optimizations

[Figure: user view (interacting VPs) vs. system implementation (VPs mapped onto physical processors)]
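The many-VPs-per-processor mapping can be pictured with a toy sketch (plain Python; the round-robin rule is an illustrative assumption, not AMPI's actual mapping policy):

```python
def map_vps(num_vps, num_procs):
    """Assign virtual processes to physical processors round-robin.

    A toy stand-in for the RTS's initial mapping: with num_vps >> num_procs,
    each processor hosts several VPs, which the run-time can later migrate.
    """
    return {vp: vp % num_procs for vp in range(num_vps)}

mapping = map_vps(num_vps=16, num_procs=4)
vps_per_proc = [sum(1 for p in mapping.values() if p == proc)
                for proc in range(4)]
```

With 16 VPs on 4 processors, each processor starts with 4 VPs; the surplus of VPs is what gives the run-time room to rebalance and overlap.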
AMPI: MPI with Virtualization

Each AMPI virtual process is implemented by a user-level thread embedded in a migratable object.

[Figure: MPI "processes" (VPs) mapped onto real processors]
Adaptive Overlap

- Problem: gap between completion time and CPU overhead, i.e., the CPU idles while waiting for communication to complete
- Solution: overlap communication with computation

[Figure: Completion time and CPU overhead of a 2-way ping-pong program on the Turing (Apple G5) cluster]
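How multiple VPs per processor buy overlap can be sketched with a toy timing model (plain Python; the model and its parameters are illustrative assumptions, not measurements):

```python
def stencil_time(k, work=4.0, comm=4.0):
    """Toy model of one iteration on one processor hosting k VPs.

    The processor's total computation `work` is split among k VPs; while one
    VP waits on its communication (latency `comm`), the scheduler runs the
    other k-1 VPs' computation slices, so only the uncovered part of the
    communication shows up as idle time.
    """
    slice_ = work / k
    idle = max(0.0, comm - (k - 1) * slice_)
    return work + idle

serial = stencil_time(1)      # one VP per processor: no overlap possible
overlapped = stencil_time(4)  # four VPs per processor: communication mostly hidden
```

With the defaults, one VP per processor pays the full communication latency as idle time, while four VPs hide most of it behind the other VPs' computation.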
Adaptive Overlap

[Figure: Timeline of 3D stencil calculation with 1, 2, and 4 VPs per processor]
Automatic Load Balancing

Challenge
- Dynamically varying applications
- Load imbalance impacts overall performance

Solution
- Measurement-based load balancing: scientific applications are typically iteration-based (the principle of persistence), so the RTS collects CPU and network usage of VPs
- Load balancing by migrating threads (VPs): threads can be packed and shipped as needed
- Different variations of load balancing strategies
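The measurement-based idea can be sketched with a toy greedy strategy (plain Python; an illustrative stand-in, not one of AMPI's actual strategies): given per-VP loads measured over past iterations, place the heaviest VPs first, each on the currently least-loaded processor.

```python
import heapq

def rebalance(vp_loads, num_procs):
    """Greedy heaviest-first placement of VPs onto processors.

    vp_loads: dict mapping VP id -> measured load (e.g., CPU time).
    Returns a dict mapping VP id -> processor id.
    """
    # Min-heap of (accumulated load, processor id).
    heap = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    assignment = {}
    for vp, load in sorted(vp_loads.items(), key=lambda item: -item[1]):
        total, proc = heapq.heappop(heap)       # least-loaded processor
        assignment[vp] = proc
        heapq.heappush(heap, (total + load, proc))
    return assignment

vp_loads = {0: 5.0, 1: 1.0, 2: 3.0, 3: 3.0}
assignment = rebalance(vp_loads, num_procs=2)
per_proc = [sum(load for vp, load in vp_loads.items() if assignment[vp] == p)
            for p in range(2)]
```

In this example the four VPs end up split so that both processors carry an equal load, which is the migration decision the RTS would then enact by shipping threads.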
Automatic Load Balancing

Application: Fractography3D, which models fracture propagation in material.
Automatic Load Balancing

[Figure: CPU utilization of Fractography3D without vs. with load balancing]
Communication Optimizations

The AMPI run-time is capable of
- Observing communication patterns
- Applying communication optimizations accordingly
- Switching between communication algorithms automatically

Examples
- Streaming strategy for point-to-point communication
- Collectives optimizations
Streaming Strategy

Combining short messages to reduce per-message overhead.

[Figure: Streaming strategy for point-to-point communication on the NCSA IA-64 cluster]
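The combining idea can be sketched in plain Python (a toy stand-in for the actual streaming library; the class and the fixed batch-size flush rule are assumptions for illustration): buffer short messages per destination and ship them as one combined message, paying the per-message overhead once per batch.

```python
class StreamingSender:
    """Toy sketch of a streaming strategy for short messages."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffers = {}       # destination -> list of pending messages
        self.sent_batches = []  # record of (destination, combined batch)

    def send(self, dest, msg):
        """Buffer a short message; flush when the batch is full."""
        buf = self.buffers.setdefault(dest, [])
        buf.append(msg)
        if len(buf) >= self.batch_size:
            self.flush(dest)

    def flush(self, dest):
        """Ship all pending messages to `dest` as one combined message."""
        if self.buffers.get(dest):
            self.sent_batches.append((dest, self.buffers.pop(dest)))

sender = StreamingSender(batch_size=4)
for i in range(8):
    sender.send(0, i)
```

Eight short messages go out as two combined messages, so the per-message software overhead is paid twice instead of eight times; a real implementation would also flush on a timeout to bound latency.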
Optimizing Collectives

- A number of optimizations have been developed to improve collective communication performance
- An asynchronous collective interface allows higher CPU utilization during collectives: computation is only a small proportion of the elapsed time

[Figure: Time breakdown of an all-to-all operation using the Mesh library; time (ms) vs. message size (76 to 8076 bytes); legend: Mesh, Mesh Compute]
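The asynchronous pattern can be sketched generically (plain Python threads as a stand-in for the run-time; this is not AMPI's actual interface): start the collective, do independent computation while it is in flight, and wait only when the result is needed.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def all_to_all(data):
    """Stand-in for the collective's network phase."""
    time.sleep(0.01)  # pretend network time
    return sorted(data)

with ThreadPoolExecutor(max_workers=1) as pool:
    handle = pool.submit(all_to_all, [3, 1, 2, 0])  # start the collective
    overlap = sum(i * i for i in range(1000))       # independent computation meanwhile
    result = handle.result()                        # wait only when data is needed
```

Since the collective's elapsed time is mostly communication rather than computation, the CPU cycles recovered in the middle step are nearly the whole duration of the operation.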
Virtualization Overhead

- Compared with the performance benefits, the virtualization overhead is very small
- It is usually offset by the caching effect alone
- Performance improves further when the features above are applied

[Figure: Performance of point-to-point communication on the NCSA IA-64 cluster]
Flexibility

- Running on an arbitrary number of processors, while native MPI runs only with specific numbers of MPI processes
- Big runs on a few processors

Execution time (sec):

Procs         19    27    33    64    80    105   125   140   175   216   250   512
Adaptive MPI  42.4  30.5  24.6  15.6  12.6  10.9  10.8  10.6  9.39  8.63  7.55  5.46
Native MPI    -     29.4  -     14.2  -     -     9.12  -     -     8.07  -     5.52

3D stencil calculation of size 240^3 run on Lemieux.
Conclusion

Adaptive MPI supports the following benefits
- Adaptive overlap
- Automatic load balancing
- Communication optimizations
- Flexibility
- Automatic checkpoint/restart mechanism
- Shrink/expand

AMPI is being used in real-world parallel applications and frameworks
- Rocket simulation at CSAR
- FEM Framework

Portable to a variety of HPC platforms
Future Work

Performance improvement
- Reducing overhead
- Intelligent communication strategy substitution
- Machine-topology specific load balancing

Performance analysis
- More direct support for AMPI programs
Thank You!

AMPI is available for download at:
http://charm.cs.uiuc.edu/

Parallel Programming Lab at the University of Illinois