Transcript Slide 1

MPI Must Evolve or Die!
Al Geist
Oak Ridge National Laboratory
September 9, 2008
EuroPVM-MPI Conference, Dublin, Ireland
Heterogeneous multi-core
Research sponsored by ASCR; managed by UT-Battelle for the Department of Energy
Exascale software solutions can’t rely on “Then a miracle occurs”
The hardware developer describing the exascale design says, “Then a miracle occurs.”
The system software engineer replies, “I think you should be more explicit.”
Acknowledgements
Harness Research Project (Geist, Dongarra, Sunderam)
The same team that created PVM has continued the exploration of heterogeneous and adaptive computing.
I acknowledge the team members whose ideas and research on the Harness project are presented in this talk; apologies to anyone I missed:
Jelena Pješivac-Grbović
George Bosilca
Thara Angskun
Bob Manchek
Graham Fagg
June Denoto
Magdalena Slawinska
Jaroslaw Slawinski
Edgar Gabriel
Interesting observation: PVM use is starting to grow again.
Support questions have doubled in the past year.
We are even getting queries from HPC users who are desperate for fault tolerance.
Example of a Petaflops System - ORNL (late 2008)
Multi-core, homogeneous, multiple programming models
DOE Cray “Baker”
 1 petaflops system
 13,944 dual-socket, 8-core SMP “nodes” with 16 GB
 27,888 quad-core 2.3 GHz Barcelona processors (37 Gflops each)
 223 TB memory (2 GB/core)
 200+ GB/s disk bandwidth
 10 PB storage
 6.5 MW system power
 150 cabinets, 3,400 ft²
 Liquid-cooled cabinets
 Compute Node Linux operating system
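A quick consistency check on these figures: 13,944 nodes × 2 sockets = 27,888 quad-core processors; 27,888 processors × 37 Gflops ≈ 1.03 petaflops; and 13,944 nodes × 16 GB ≈ 223 TB of memory.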
MPI Dominates Petascale Communication
Survey of the top HPC open science applications:
[Chart: surveyed applications grouped into “must have MPI” and “can use MPI”]
The answer is MPI. What is the question?
Applications may continue to use MPI due to:
• Inertia – these codes take decades to create and validate
• Nothing better – developers need a BIG incentive to rewrite (a 50% gain is not enough)
Communication libraries are being changed to exploit the new petascale systems, giving applications more life.
• Hardware support for MPI is pushing this out even further
Business as usual has been to improve latency and/or bandwidth.
• But large-scale, many-core, heterogeneous architectures require us to think further outside the box
It is not business as usual inside petascale communication libraries:
• Hierarchical algorithms
• Hybrid algorithms
• Dynamic algorithm selection
• Fault tolerance
Hierarchical Algorithms
Hierarchical algorithm designs seek to consolidate information at different levels of the architecture to reduce the number of messages and the contention on the interconnect.
The PVM Project studied hierarchical collective algorithms using clusters of clusters (a simple 2-level model), where communication within a cluster was 10X faster than between clusters.
Architecture Levels:
Socket
Node
Board
Cabinet
Switch
System
Found improvements in the range of 2X-5X, but this was not pursued because HPC machines at the time had only one level.
This needs rethinking for petascale systems.
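To make the two-level idea concrete, here is a minimal sketch of a hierarchical broadcast in MPI. It is an illustration, not the Harness implementation: the node_id argument (how ranks are grouped onto nodes) and the assumption that the broadcast root is global rank 0 are simplifications supplied for the sketch.

/* Minimal sketch of a two-level (hierarchical) broadcast: the payload
 * crosses the interconnect only once per node, then fans out inside
 * each node.  node_id is assumed to map this rank to its node (e.g.
 * derived from the hostname); the broadcast root is assumed to be
 * global rank 0. */
#include <mpi.h>

int two_level_bcast(void *buf, int count, MPI_Datatype type,
                    MPI_Comm comm, int node_id)
{
    int rank, node_rank;
    MPI_Comm node_comm, leader_comm;

    MPI_Comm_rank(comm, &rank);

    /* Level 1: one sub-communicator per node (color = node id). */
    MPI_Comm_split(comm, node_id, rank, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Level 2: a sub-communicator holding one "leader" per node. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   rank, &leader_comm);

    /* Broadcast among the node leaders (inter-node traffic), then
     * within each node (fast local links or shared memory). */
    if (leader_comm != MPI_COMM_NULL) {
        MPI_Bcast(buf, count, type, 0, leader_comm);
        MPI_Comm_free(&leader_comm);
    }
    MPI_Bcast(buf, count, type, 0, node_comm);
    MPI_Comm_free(&node_comm);
    return MPI_SUCCESS;
}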
Hybrid Algorithms
Hybrid algorithm designs use different algorithms at different levels of the architecture, for example, using a shared-memory algorithm within a node or on an accelerator board (such as Cell), and a message-passing algorithm between nodes.
[Image: Roadrunner]
The PVM Project studied hybrid message-passing algorithms using heterogeneous parallel virtual machines, with communication optimized to the custom hardware within each computer.
Today all MPI implementations do this to some extent, but there is more to be done for new heterogeneous systems.
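As an application-level illustration of the hybrid idea (not the internals of any particular MPI library), the sketch below combines a shared-memory OpenMP reduction inside each node with message passing between nodes. It assumes one MPI rank per node with several OpenMP threads per rank.

/* Minimal sketch of a hybrid reduction: an OpenMP (shared-memory)
 * reduction inside the node, then MPI message passing between nodes.
 * Assumes one MPI rank per node with several OpenMP threads.
 * Compile with an MPI C compiler and -fopenmp. */
#include <mpi.h>
#include <omp.h>

double hybrid_sum(const double *local, int n, MPI_Comm comm)
{
    double node_sum = 0.0;

    /* Level 1: shared-memory algorithm within the node. */
    #pragma omp parallel for reduction(+ : node_sum)
    for (int i = 0; i < n; i++)
        node_sum += local[i];

    /* Level 2: message-passing algorithm between nodes. */
    double global_sum = 0.0;
    MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global_sum;
}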
Adaptive Communication Libraries
The algorithm is dynamically selected from a set of collective communication algorithms based on multiple metrics, such as:
• The number of tasks being sent to
• Where they are located in the system
• The size of the message being sent
• The physical topology and particular quirks of the system
The Harness Project explored adaptive MPI collectives: at run time, a decision function is invoked to select the “best” algorithm for a particular collective call (a minimal sketch of such a decision function follows the steps below).
Steps in the optimization process:
1. Implementation of the different MPI collective algorithms
2. Collection of MPI collective algorithm performance information (the “optimal” MPI collective operation implementation)
3. Decision / algorithm selection process
4. Decision function – automatically generate code based on step 3
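A minimal sketch of what such a generated decision function might look like is shown below. The algorithm names and switching points are illustrative placeholders; in the Harness work they come out of the measured performance data and the decision process in steps 2 and 3.

/* Illustrative sketch of a run-time decision function for MPI_Bcast.
 * The algorithm names and switching points are made-up placeholders;
 * real values would come from measured performance data. */
#include <stddef.h>

typedef enum {
    BCAST_LINEAR,            /* root sends to every rank in turn        */
    BCAST_BINOMIAL_TREE,     /* log2(P) rounds, good for small messages */
    BCAST_SCATTER_ALLGATHER  /* pipelined, good for large messages      */
} bcast_alg_t;

static bcast_alg_t choose_bcast(int comm_size, size_t msg_bytes)
{
    if (comm_size <= 8)
        return BCAST_LINEAR;            /* few tasks: keep it simple       */
    if (msg_bytes <= 8192)
        return BCAST_BINOMIAL_TREE;     /* small messages: latency-bound   */
    return BCAST_SCATTER_ALLGATHER;     /* large messages: bandwidth-bound */
}

An MPI library would call such a function inside MPI_Bcast and dispatch to the chosen implementation.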
Harness Adaptive Collective Communication
[Diagram, performed just once on a given machine: MPI collective algorithm implementations → performance modeling / exhaustive testing → optimal MPI collective implementation → decision process → decision function]
Decision/Algorithm Selection Process
Three different approaches were explored:
Parametric data modeling: use algorithm performance models (Hockney, LogGP, PLogP, …) to select the algorithm with the shortest predicted completion time (a small sketch of this approach follows the list).
Image encoding techniques: use graphics encoding algorithms to capture information about algorithm switching points.
Statistical learning methods: use statistical learning methods to find patterns in algorithm performance data and to construct decision systems.
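To illustrate the parametric approach, the sketch below uses the Hockney model T(m) = α + βm for a single point-to-point message and compares the predicted completion times of a flat and a binomial-tree broadcast. The α and β values and the two cost formulas are textbook approximations, not measurements from this work.

/* Sketch of parametric algorithm selection with the Hockney model
 * T(m) = alpha + beta*m per point-to-point message.  alpha (latency)
 * and beta (seconds per byte) would be measured on the target machine;
 * the values below are placeholders. */
#include <math.h>
#include <stddef.h>

static const double ALPHA = 2.0e-6;   /* placeholder latency: 2 us         */
static const double BETA  = 5.0e-10;  /* placeholder: ~2 GB/s bandwidth    */

/* Estimated completion time of a flat (linear) broadcast. */
static double t_bcast_linear(int p, size_t m)
{
    return (p - 1) * (ALPHA + BETA * (double)m);
}

/* Estimated completion time of a binomial-tree broadcast. */
static double t_bcast_binomial(int p, size_t m)
{
    return ceil(log2((double)p)) * (ALPHA + BETA * (double)m);
}

/* Pick whichever algorithm the model predicts to finish first. */
static int use_binomial(int p, size_t m)
{
    return t_bcast_binomial(p, m) < t_bcast_linear(p, m);
}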
Fault Tolerant Communication
The Harness Project is where FT-MPI was created to explore ways that MPI could be modified to allow applications to “run through” faults.
Accomplishments of the FT-MPI research:
• Define the behavior of MPI in case an error occurs
• Give the application the possibility to recover from a node failure
• A regular, non-fault-tolerant MPI program will run using FT-MPI
• Stick to the MPI-1 and MPI-2 specifications as closely as possible (e.g., no additional function calls)
• Provide notification to the application (see the sketch below)
• Provide recovery options for the application to exploit if desired
What FT-MPI does not do:
• Recover user data (e.g., automatic checkpointing)
• Provide transparent fault tolerance
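The notification side of this can be illustrated with standard MPI calls alone, as sketched below: replace the default abort-on-error handler and check return codes. Plain MPI-1/MPI-2 leaves the library state undefined after such an error; FT-MPI’s contribution is to define that behavior so the application can continue.

/* Sketch of the notification pattern "run-through" fault tolerance
 * needs, using only standard MPI calls: replace the default
 * abort-on-error handler and inspect return codes. */
#include <mpi.h>
#include <stdio.h>

void enable_error_returns(MPI_Comm comm)
{
    /* Without this, most MPI implementations abort the whole job. */
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
}

int checked_allreduce(double *in, double *out, int n, MPI_Comm comm)
{
    int rc = MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, comm);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "collective failed: %s\n", msg);
        /* Application-level recovery (next slide) starts here. */
    }
    return rc;
}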
FT-MPI recovery options
These are key to allowing MPI applications to “run through” faults.
FT-MPI developed a COMM_CREATE that can build a new MPI_COMM_WORLD.
Four options were explored (abort, blank, shrink, rebuild):
 ABORT: just abort, as other implementations do
 BLANK: leave a hole where the failed rank was
 SHRINK: re-order processes to make a contiguous communicator (some ranks change)
 REBUILD: re-spawn the lost processes and add them to MPI_COMM_WORLD
As a convenience, a fifth option to shrink or rebuild ALL communicators inside an application at once was also investigated.
A sketch of how an application might drive this recovery follows.
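Below is a minimal sketch of the application-side recovery loop these options enable. It deliberately avoids FT-MPI-specific calls; solver_step and restore_state are hypothetical application routines, and the comments paraphrase the four communicator modes described above.

/* Sketch of an application-level "run through" loop.  solver_step is
 * the application kernel (assumed); restore_state is a hypothetical
 * helper that reloads a checkpoint and returns the step to resume
 * from.  Neither is part of FT-MPI. */
#include <mpi.h>

int solver_step(MPI_Comm comm);      /* application kernel (assumed)    */
int restore_state(MPI_Comm comm);    /* hypothetical checkpoint reload  */

void run_through_faults(int nsteps)
{
    int step = 0;
    while (step < nsteps) {
        int rc = solver_step(MPI_COMM_WORLD);
        if (rc == MPI_SUCCESS) {
            step++;                          /* no failure this step */
            continue;
        }
        /* A peer failed.  What MPI_COMM_WORLD now looks like depends
         * on the recovery mode configured for the run:
         *   ABORT   - the job is terminated; nothing to handle here
         *   BLANK   - same size, the failed rank's slot is left as a hole
         *   SHRINK  - the communicator is compacted; some ranks change
         *   REBUILD - a replacement process is spawned into the world
         * In every survivable mode the application must re-learn its
         * place in the world and restore its own data; FT-MPI does not
         * recover user data for us. */
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        step = restore_state(MPI_COMM_WORLD);
    }
}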
Future of Fault Tolerant Communication
The fault-tolerant capabilities and datatypes in FT-MPI are now becoming part of the OpenMPI effort.
Fault tolerance is under consideration by the MPI Forum as a possible addition to the MPI-3 standard.
Getting Applications to Use this Stuff
All these new multi-core, heterogeneous algorithms and techniques are for naught if we don’t get the science teams to use them.
ORNL’s Leadership Computing Facility uses a couple of key methods to get the latest algorithms and system-specific features used by the science teams: Science Liaisons and Centers of Excellence.
Science Liaisons from the Scientific Computing Group are assigned to every science team on the leadership system. Their duties include:
• Scaling algorithms to the required size
• Application and library code optimization and scaling
• Exploiting parallel I/O and other technologies in apps
• More…
Centers of Excellence
ORNL has a Cray Center of Excellence and a Lustre Center of Excellence; one of their missions is to have vendor engineers engage directly with users to help them apply the latest techniques for scalable performance.
But having science liaisons and help from vendor engineers is not a scalable solution for the larger community of users, so we are creating…
Harness Workbench for Science Teams
Help the user by building a tool that can apply the basic knowledge of the developer, the administrator, and the vendor.
Available to LCF science liaisons this summer.
Eclipse (Parallel Tools Platform), integrated with the runtime.
Next Generation Runtime
Scalable Tool Communication Infrastructure (STCI)
The Harness runtime environment (underlying the Harness workbench, adaptive communication, and fault recovery) adopted the emerging open runtime environment, OpenRTE (underlying OpenMPI), which was generalized into the Scalable Tool Communication Infrastructure.
STCI provides high-performance, scalable, resilient, and portable communications and process control services for user and system tools:
 parallel run-time environments (MPI)
 application correctness tools
 performance analysis tools
 system monitoring and management
[Diagram of STCI building blocks: execution context, sessions, communications, persistence, security]
Petascale to Exascale Requires a New Approach:
Synergistically Developing Architecture and Algorithms Together
 Try to break the cycle of HW vendors throwing the latest giant system over the fence and leaving it to the system software guys and applications to figure out how to use the latest HW (billion-way parallelism at exascale).
 Try to get applications to rethink their algorithms, and even their physics, in order to better match what the HW can give them (the memory wall isn’t going away).
Meet in the middle: change what “balanced system” means.
Creating a Revolution in Evolution
The Institute for Advanced Architectures and Algorithms has been established as a joint Sandia/ORNL effort to facilitate the co-design of architectures and algorithms in order to create synergy in their respective evolutions.
Summary
It is not business as usual for petascale communication
 No longer just about improved latency and bandwidth
 But MPI is not going away
Communication libraries are adapting
 Hierarchical algorithms
 Hybrid algorithms
 Dynamically selected algorithms
 Allowing “run through” fault tolerance
But we have to get applications to use these new ideas
Going to exascale, communication needs a fundamental shift
 Break the deadly cycle of hardware being thrown “over the fence” for the software developers to figure out how to use
Is this crazy talk?
Evolve or Die
Questions?