Runtime Optimizations - Parallel Programming Laboratory


Adaptive MPI: Intelligent runtime
strategies and performance prediction via
simulation
Laxmikant Kale
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana Champaign
7/7/2015, ICL@UTK
PPL Mission and Approach
• To enhance Performance and Productivity in
programming complex parallel applications
– Performance: scalable to thousands of processors
– Productivity: of human programmers
– Complex: irregular structure, dynamic variations
• Approach: application-oriented yet CS-centered research
  – Develop enabling technology for a wide collection of apps
  – Develop, use, and test it in the context of real applications
• How?
  – Develop novel parallel programming techniques
  – Embody them into easy-to-use abstractions
  – So application scientists can use advanced techniques with ease
  – Enabling technology: reused across many apps
Develop abstractions in context of full-scale applications
Applications built on Parallel Objects, the Adaptive Runtime System, and Libraries and Tools: Protein Folding, Quantum Chemistry (QM/MM), Molecular Dynamics, Computational Cosmology, Crack Propagation, Rocket Simulation, Dendritic Growth, Space-time Meshes.
The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE.
Migratable Objects (aka Processor Virtualization)
• Programmer: [over]decomposition into virtual processors
• Runtime: assigns VPs to processors
  – Enables adaptive runtime strategies
  – Implementations: Charm++, AMPI
Benefits
• Software engineering
  – Number of virtual processors can be independently controlled
  – Separate VPs for different modules
System implementation
• Message-driven execution
  – Adaptive overlap of communication
  – Predictability: automatic out-of-core
  – Asynchronous reductions
• Dynamic mapping
  – Heterogeneous clusters: vacate, adjust to speed, share
  – Automatic checkpointing
  – Change set of processors used
  – Automatic dynamic load balancing
  – Communication optimization
Outline
• Adaptive MPI
• Load balancing
• Fault tolerance
• Projections: performance analysis
• Performance prediction: BigSim
AMPI: MPI with Virtualization
• Each virtual process implemented as a user-level thread
embedded in a Charm object
Figure: MPI "processes" are implemented as virtual processes (user-level migratable threads) mapped onto real processors.
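As a minimal illustration (a sketch; the launch line follows the +p/+vp syntax shown later on the checkpoint/restart slide), an ordinary MPI program is already an AMPI program whose ranks become user-level migratable threads:

/* Sketch: unmodified MPI code run under AMPI. Each rank is a user-level
   thread (virtual process), so there can be more ranks than processors. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of virtual processes */
    printf("Hello from virtual process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
/* Example launch (syntax as on the checkpoint slide): ./charmrun pgm +p4 +vp16
   runs 16 virtual processes on 4 physical processors. */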
Making AMPI work
• Multiple user-level threads per processor:
  – Problem: global variables
  – Solution 1: automatic: switch the GOT pointer at context switch
    • Available on most machines
  – Solution 2: manual: replace global variables (see the sketch after this list)
  – Solution 3: automatic: via compiler support (AMPIzer)
• Migrating stacks
  – Use the isomalloc technique (Mehaut et al.)
  – Use memory files and mmap()
• Heap data:
  – Isomalloc heaps
  – OR user-supplied pack/unpack functions for the heap data
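A minimal sketch of Solution 2, manual privatization (the module and variable names are hypothetical): the former global moves into a per-rank structure that travels with the AMPI thread when it migrates.

/* Before: a true global, shared by all AMPI threads on a processor.
       int iteration_count;                                              */

/* After (manual privatization sketch): each virtual process owns its copy. */
#include <stdlib.h>

typedef struct {
    int iteration_count;      /* formerly a global variable */
} RankState;                  /* hypothetical per-rank state */

static RankState *rank_state_new(void) {
    RankState *s = (RankState *)malloc(sizeof(RankState));
    s->iteration_count = 0;
    return s;
}

static void do_timestep(RankState *s) {
    s->iteration_count++;     /* was: iteration_count++ */
}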
ELF and global variables
• The Executable and Linking Format (ELF)
– The executable has a Global Offset Table (GOT) containing global data
– The GOT pointer is stored in the %ebx register
– Switch this pointer when switching between threads
– Supported on Linux, Solaris 2.x, and more
• Integrated in Charm++/AMPI
– Invoked via the compile-time option -swapglobal
Adaptive overlap and modules
SPMD and Message-Driven Modules
(From A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, Apr 1994;
and Modularity, Reuse, and Efficiency with Message-Driven Libraries, Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995.)
Benefit of Adaptive Overlap
Figure: Execution time (s) vs. number of processors (1 to 1000, log-log scale) for AMPI(1) and AMPI(8).
Problem setup: 3D stencil calculation of size 240^3 run on Lemieux. Shows AMPI with virtualization ratios of 1 and 8.
Comparison with Native MPI
• Performance
  – Slightly worse without optimization
  – Being improved
• Flexibility
  – Useful when only a small number of PEs is available
  – Or when the algorithm has special requirements
Figure: Execution time (s) vs. number of processors for native MPI and AMPI.
Problem setup: 3D stencil calculation of size 240^3 run on Lemieux. AMPI runs on any number of PEs (e.g., 19, 33, 105); native MPI needs a cube number.
AMPI Extensions
• Automatic load balancing
• Non-blocking collectives
• Checkpoint/restart mechanism
• Multi-module programming
Load Balancing in AMPI
• Automatic load balancing: MPI_Migrate()
– Collective call informing the load balancer that the thread is
ready to be migrated, if needed.
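A minimal sketch (the timestep loop, helper functions, and LB_PERIOD are hypothetical; MPI_Migrate() is the collective named on this slide) of how an application exposes migration points:

/* Sketch: call the collective MPI_Migrate() periodically; the runtime's load
   balancer migrates this thread only if it decides migration is beneficial. */
#include <mpi.h>

#define LB_PERIOD 10                  /* hypothetical load-balancing period */
void compute_timestep(void);          /* hypothetical application work      */
void exchange_boundaries(void);       /* hypothetical communication         */

void run_timesteps(int num_steps) {
    for (int step = 0; step < num_steps; step++) {
        compute_timestep();
        exchange_boundaries();
        if (step % LB_PERIOD == 0)
            MPI_Migrate();            /* collective migration point */
    }
}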
Load Balancing Steps
Figure: Timeline of load balancing steps: regular timesteps, instrumented timesteps, a detailed/aggressive load balancing step (object migration), followed by refinement load balancing.
Figure: Processor utilization against time on 128 and 1024 processors, showing the aggressive and refinement load balancing steps.
On 128 processors a single (aggressive) load balancing step suffices, but on 1024 processors a "refinement" step is also needed.
Shrink/Expand
• Problem: the availability of the computing platform may change at runtime
• Solution: refit the application onto the changed platform by object migration
Figure: Time per step for the million-row CG solver on a 16-node cluster; an additional 16 nodes become available at step 600.
Optimized All-to-all “Surprise”
Figures: (1) A 76-byte all-to-all on Lemieux: time (ms) vs. number of processors (16 to 2048) for MPI, mesh, hypercube, and 3D-grid strategies. (2) Completion time vs. computation overhead: time (ms) vs. message size (76 to 8076 bytes) for the mesh strategy; the CPU is free during most of the time taken by a collective operation. (3) Radix sort: AAPC completion time (ms) and sort completion time (s) vs. message size (100 B to 8 KB) for mesh and direct strategies.
This observation led to the development of asynchronous collectives, now supported in AMPI.
Asynchronous Collectives
• Our implementation is asynchronous
– Collective operation posted
– Test/wait for its completion
– Meanwhile useful computation can utilize CPU
MPI_Ialltoall( … , &req);
/* other computation */
MPI_Wait(&req, MPI_STATUS_IGNORE);
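A slightly fuller sketch (the buffers, counts, and the overlapped work are hypothetical) of how the posted collective overlaps with useful computation:

/* Sketch: post the asynchronous all-to-all, do independent work, then wait. */
#include <mpi.h>

void do_local_work(void);   /* hypothetical computation independent of recvbuf */

void exchange_and_compute(double *sendbuf, double *recvbuf, int n_per_rank,
                          MPI_Comm comm) {
    MPI_Request req;
    MPI_Ialltoall(sendbuf, n_per_rank, MPI_DOUBLE,
                  recvbuf, n_per_rank, MPI_DOUBLE, comm, &req);

    do_local_work();                    /* CPU stays busy while the collective
                                           progresses */

    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* recvbuf is now safe to read */
}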
Fault Tolerance
Motivation
• Applications need fast, low-cost, and scalable fault tolerance support
• As machines grow in size
  – MTBF decreases
  – Applications have to tolerate faults
• Our research
  – Disk-based checkpoint/restart
  – In-memory double checkpoint/restart
  – Sender-based message logging
  – Proactive response to fault prediction (impending-fault response)
Checkpoint/Restart Mechanism
• Automatic Checkpointing for AMPI and Charm++
– Migrate objects to disk!
– Automatic fault detection and restart
– Now available in distribution version of AMPI and Charm++
• Blocking Co-ordinated Checkpoint
– States of chares are checkpointed to disk
– Collective call MPI_Checkpoint(DIRNAME)
• The entire job is restarted
– Virtualization allows restarting on different # of processors
– Runtime option
• > ./charmrun pgm +p4 +vp16 +restart DIRNAME
• Simple but effective for common cases
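A minimal sketch (the loop, CHECKPOINT_PERIOD, and the directory name are hypothetical; MPI_Checkpoint and the restart command are from this slide):

/* Sketch: periodically checkpoint all AMPI/Charm++ object state to disk. */
#include <mpi.h>

#define CHECKPOINT_PERIOD 100         /* hypothetical */
void compute_timestep(void);          /* hypothetical application work */

void run_with_checkpoints(int num_steps) {
    for (int step = 0; step < num_steps; step++) {
        compute_timestep();
        if (step % CHECKPOINT_PERIOD == 0)
            MPI_Checkpoint("ckpt_dir");   /* collective; checkpoints to disk */
    }
}
/* After a failure, restart the whole job, possibly on a different number of
   processors:   ./charmrun pgm +p4 +vp16 +restart ckpt_dir                   */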
In-memory Double Checkpoint
• In-memory checkpoint
  – Faster than disk
• Coordinated checkpoint
  – Simple collective call: MPI_MemCheckpoint(void)
  – User can decide what makes up useful state
• Double checkpointing
  – Each object maintains 2 checkpoints: one on the local physical processor, one on a remote "buddy" processor
• For jobs with large memory, use local disks
Figure: Checkpoint overhead (s) vs. problem size (MB) for double in-memory (Myrinet), double in-disk (Myrinet), and NFS disk, on 32 processors with 1.5 GB of memory each.
Restart
• A “Dummy” process is created:
– Need not have application data or checkpoint
– Necessary for runtime
– Starts recovery on all other Processors
• Other processors:
– Remove all chares
– Restore checkpoints lost on the crashed PE
– Restore chares from local checkpoints
• Load balance after restart
Restart Performance
Figure: Restart performance with 10 crashes on 128 processors, checkpointing every 10 time steps.
Scalable Fault Tolerance
Motivation: when a processor out of 100,000 fails, all 99,999 others shouldn't have to run back to their checkpoints!
• How?
  – Sender-side message logging
  – Asynchronous checkpoints on buddy processors
  – Latency tolerance mitigates costs
  – Restart can be sped up by spreading out the failed processor's objects
• Only the failed processor's objects recover from checkpoints, playing back their messages, while the others "continue"
• Current progress
  – Basic scheme implemented and tested in simple programs
  – General-purpose implementation in progress
Recovery Performance
Figure: Execution time (s) with an increasing number of faults (0 to 7) on 8 processors; checkpoint period 30 s.
Projections:
Performance visualization
and analysis tool
An Introduction to Projections
• Performance analysis tool for Charm++-based applications.
• Automatic trace instrumentation.
• Post-mortem visualization.
• Multiple advanced features that support:
  – Data volume control
  – Generation of additional user data
Trace Generation
• Automatic instrumentation by runtime system
• Detailed
– In the log mode each event is recorded in full detail (including
timestamp) in an internal buffer.
• Summary
– Reduces the size of output files and memory overhead.
– Produces a few lines of output data per processor.
– This data is recorded in bins corresponding to intervals of size
1 ms by default.
• Flexible
– APIs and runtime options for instrumenting user events and
data generation control.
The Summary View
• Provides a view of the overall utilization of the application.
• Very quick to load.
Graph View
• Features:
  – Selectively view entry points.
  – Convenient means to switch between axis data types.
Timeline
• The most detailed view in Projections.
• Useful for understanding critical path issues or unusual
entry point behaviors at specific times.
Animations
Time Profile
• Identified a portion of CPAIMD (a quantum chemistry code) that ran too early via the Time Profile tool.
• Solved by prioritizing entry methods.
Overview: one line for each processor, time on X-axis
White: busy; black: idle; red: intermediate.
A boring but good-performance overview
An interesting but pathetic overview
Stretch Removal
Histogram views: number of function executions vs. their granularity (note: log scale on the Y-axis).
Before optimizations: over 16 large stretched calls.
After optimizations: about 5 large stretched calls, the largest of them much smaller, and almost all calls take less than 3.243 ms.
Miscellaneous Features: Color Selection
• Colors are automatically supplied by default.
• We allow users to select their own colors and save them.
• These colors can then be restored the next time Projections loads.
User APIs
• Controlling trace generation
  – void traceBegin()
  – void traceEnd()
• Tracing user events
  – int traceRegisterUserEvent(char *, int)
  – void traceUserEvent(char *)
  – void traceUserBracketEvent(int, double, double)
  – double CmiWallTimer()
• Runtime options:
  – +traceoff
  – +traceroot <directory>
  – Projections mode only:
    • +logsize <# entries>
    • +gz-trace
  – Summary mode only:
    • +bincount <# of intervals>
    • +binsize <interval time quanta (us)>
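A minimal sketch (the event name, the -1 auto-assign argument, and run_solver are assumptions; the prototypes are those listed above) of bracketing a user-level phase so it appears in the Projections timeline:

/* Prototypes as listed on this slide (normally supplied by Charm++/AMPI headers). */
int traceRegisterUserEvent(char *, int);
void traceUserBracketEvent(int, double, double);
double CmiWallTimer(void);

void run_solver(void);                 /* hypothetical application work */

void traced_solver_phase(void) {
    static int solver_event = -1;
    if (solver_event < 0)              /* register once; -1 asks the runtime to
                                          pick an event ID (an assumption) */
        solver_event = traceRegisterUserEvent("solver phase", -1);

    double t0 = CmiWallTimer();
    run_solver();
    double t1 = CmiWallTimer();
    traceUserBracketEvent(solver_event, t0, t1);   /* record the [t0, t1] span */
}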
Performance Prediction
Via
Parallel Simulation
BigSim: Performance Prediction
• Extremely large parallel machines are already here or about to become available:
  – ASCI Purple (12k, 100 TF)
  – BlueGene/L (64k, 360 TF)
  – BlueGene/C (1M, 1 PF)
• How do we write a petascale application?
• What will its performance be like?
• Would existing parallel applications scale?
  – The machines are not there yet
  – Parallel performance is hard to model without actually running the program
Objectives and Simulation Model
• Objectives:
  – Develop techniques to facilitate the development of efficient petascale applications
  – Based on performance prediction of applications on large simulated parallel machines
• Simulation-based performance prediction:
  – Focus on the Charm++ and AMPI programming models
  – Performance prediction based on PDES
  – Supports varying levels of fidelity: processor prediction, network prediction
  – Modes of execution: online and post-mortem
Blue Gene Emulator/Simulator
• Actually: BigSim, for simulation of any large
machine using smaller parallel machines
• Emulator:
– Allows development of programming environment and
algorithms before the machine is built
– Allowed us to port Charm++ to real BG/L in 1-2 days
• Simulator:
– Allows performance analysis of programs running on large
machines, without access to the large machines
– Uses Parallel Discrete Event Simulation
Architecture of BigNetSim
Simulation Details
• Emulate large parallel machines on smaller existing parallel machines: run a program with multi-million-way parallelism (implemented using user-level threads)
  – Consider memory and stack-size limitations
• Ensure time-stamp correction
• Emulator layer API is built on top of the machine layer
• Charm++/AMPI implemented on top of the emulator, like any other machine layer
• Emulator layer supports all Charm++ features:
  – Load balancing
  – Communication optimizations
Performance Prediction
• Usefulness of performance prediction:
– Application developer (making small modifications)
• Difficult to get runtimes on huge current machines
• For future machines, simulation is the only possibility
• Performance debugging cycle can be considerably reduced
• Even approximate predictions can identify performance
issues such as load imbalance, serial bottlenecks,
communication bottlenecks, etc
– Machine architecture designer
• Knowledge of how target applications behave on it, can help
identify problems with machine design early
• Record traces during parallel emulation
• Run trace-driven simulation (PDES)
Performance Prediction (contd.)
• Predicting time of sequential code:
– User supplied time for every code block
– Wall-clock measurements on simulating machine can be used via a
suitable multiplier
– Hardware performance counters to count floating point, integer, branch
instructions, etc
• Cache performance and memory footprint are approximated by
percentage of memory accesses and cache hit/miss ratio
– Instruction level simulation (not implemented)
• Predicting Network performance:
– No contention: time based on topology and other network parameters
– Back-patching: modifies communication time using the amount of communication activity
– Network simulation: models the network entirely
Performance Prediction Validation
• 7-point stencil program with 3D decomposition
  – Run on 32 real processors, simulating 64, 128, ... PEs
• NAMD benchmark: Apo-Lipoprotein A1 dataset with 92k atoms, running for 15 timesteps
  – For large processor counts, the predicted value seems to diverge from the actual value because of cache and memory effects
Performance on Large Machines
• Problem:
  – How to predict the performance of applications on future machines (e.g., BG/L)?
  – How to do performance tuning without continuous access to a large machine?
• Solution:
  – Leverage virtualization
  – Develop a machine emulator
  – Simulator: accurate time modeling
  – Run the program on "100,000 processors" using only hundreds of processors
• Analysis:
  – Use the performance visualization suite (Projections)
Figure: Molecular dynamics benchmark ER-GRE: 36,573 atoms, 1.6 million objects, 8-step simulation, 16k BG processors.
Projections: Performance visualization
Network Simulation
• Detailed implementation of interconnection networks
• Configurable network parameters
  – Topology / routing
  – Input / output VC selection
  – Bandwidth / latency
  – NIC parameters
  – Buffer / message size, etc.
• Support for hardware collectives in the network layer
Higher level programming
• Orchestration language
– Allows expressing global control flow in a Charm++ program
– HPF-like flavor, but with Charm++-like processor virtualization and explicit communication
• Multiphase Shared Arrays
– Provides a disciplined use of shared address space
– Each array can be accessed only in one of the following modes:
• ReadOnly, Write-by-One-Thread, Accumulate-only
– Access mode can change from phase to phase
– Phases delineated by per-array “sync”
Other projects
• Faucets
  – Flexible cluster scheduler
  – Resource management across clusters
  – Multi-cluster applications
• ParFUM: a parallel framework for unstructured meshes
• Load balancing strategies
• Communication optimization
• POSE: parallel discrete event simulation
• We invite collaborations on:
  – Virtualization of other languages and libraries
  – New load balancing strategies
  – Applications
Some Active Collaborations
• Biophysics: molecular dynamics (NIH, ...)
  – Long-standing (1991-): Klaus Schulten, Bob Skeel
  – Gordon Bell award in 2002
  – Production program used by biophysicists
• Quantum chemistry (NSF)
  – QM/MM via the Car-Parrinello method
  – Roberto Car, Mike Klein, Glenn Martyna, Mark Tuckerman, Nick Nystrom, Josep Torrelas, Laxmikant Kale
• Material simulation (NSF)
  – Dendritic growth, quenching, space-time meshes, QM/FEM
  – R. Haber, D. Johnson, J. Dantzig, +
• Rocket simulation (DOE)
  – DOE-funded ASCI center
  – Mike Heath, +30 faculty
• Computational cosmology (NSF, NASA)
  – Simulation
  – Scalable visualization
• Others
  – Simulation of plasma
  – Electromagnetics
Summary
• We are pursuing a broad agenda aimed at productivity and performance in parallel programming
  – Intelligent runtime system for adaptive strategies
  – Charm++/AMPI are production-level systems
  – Support for dynamic load balancing and communication optimizations
  – Performance prediction capabilities based on simulation
  – Basic fault tolerance and performance visualization tools are part of the suite
  – Application-oriented yet computer-science-centered research
• Workshop on Charm++ and Applications: Oct 18-20, UIUC
http://charm.cs.uiuc.edu