Runtime Optimizations


Processor Virtualization for Scalable Parallel Computing
Laxmikant Kale
[email protected]
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
Acknowledgements
• Graduate students, including:
 – Gengbin Zheng
 – Orion Lawlor
 – Milind Bhandarkar
 – Arun Singla
 – Josh Unger
 – Terry Wilmarth
 – Sameer Kumar
• Recent Funding:
– NSF (NGS: Frederica Darema)
– DOE (ASCI: Rocket Center)
– NIH (Molecular Dynamics)
Overview
• Processor virtualization
 – Motivation
 – Realization in AMPI and Charm++
• Part I: Benefits
 – Better software engineering
 – Message-driven execution
 – Flexible and dynamic mapping to processors
 – Principle of persistence
• Part II: PetaFLOPS machines
 – Emulator
  • Programming environments
 – Simulator
  • Performance prediction
• Part III: Programming models
• Application examples
Motivation
• Research group mission:
 – Improve performance and productivity in parallel programming
 – Via application-oriented but computer-science-centered research
• Parallel computing/programming is about:
 – Coordination between processes
 – Resource management
Coordination
• Processes, each with possibly local data
 – How do they interact with each other?
 – Data exchange and synchronization
• Solutions proposed:
 – Message passing
 – Shared variables and locks
 – Global Arrays / shmem
 – UPC
 – Asynchronous method invocation
 – Specifically shared variables: readonly, accumulators, tables
 – Others: Linda, ...
Each is probably suitable for different applications and the subjective tastes of programmers.
Parallel Computing Is About Resource Management
• Who needs resources:
 – Work units: threads, function calls, method invocations, loop iterations
 – Data units: array segments, cache lines, stack frames, messages, object variables
• Resources:
 – Processors, floating-point units, thread units
 – Memories: caches, SRAMs, DRAMs, ...
• The programmer should not have to manage resources explicitly, even within one program.
Processor Virtualization
• Basic idea:
 – Divide the computation into a large number of pieces (sketched below)
  • Independent of the number of processors
  • Typically larger than the number of processors
 – Let the system map these virtual processors to processors
• An old idea? G. Fox's book ('86?)
 – DRMS (IBM), Data Parallel C (Michael Quinn), MPVM/UPVM/MIST
• Our approach is "virtualization++"
 – Language and runtime support for virtualization
 – Exploitation of virtualization to the hilt
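As a minimal illustration of the idea (not Charm++ API; all names here are illustrative): the number of pieces N is chosen from the problem, and the runtime owns an N-to-P map that it is free to change later.

```cpp
// Over-decomposition: choose the number of pieces N from the problem,
// not from the machine; the runtime owns the N -> P mapping and can
// change it at will (illustrative sketch, not Charm++ code).
#include <vector>

std::vector<int> initialMap(int numPieces, int numProcs) {
  std::vector<int> procOf(numPieces);
  for (int i = 0; i < numPieces; ++i)
    procOf[i] = i % numProcs;   // trivial round-robin placement
  return procOf;                // later remapping = migration
}
```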
Virtualization: Object-Based Parallelization
The user is only concerned with the interaction between objects (VPs).
[Figure: the user's view of interacting objects vs. the system implementation, which maps those objects onto physical processors]
Technical Approach
• Seek an optimal division of labor between the "system" and the programmer:
 – Decomposition done by the programmer, everything else automated
[Figure: a spectrum from specialization to full automation, positioning MPI, Charm++, AMPI, and HPF against the levels of decomposition, mapping, scheduling, and expression]
Message From This Talk
• Virtualization, and the associated techniques we have been exploring for the past decade, are ready and powerful enough to meet the needs of high-end parallel computing and of complex, dynamic applications on tomorrow's machines.
• These techniques are embodied in:
 – Charm++
 – AMPI
 – Frameworks (structured grids, unstructured grids, particles)
 – Virtualization of other coordination languages (UPC, GA, ...)
Realizations: Charm++
• Charm++
 – Parallel C++ with data-driven objects (chares)
 – Asynchronous method invocation, with prioritized scheduling
 – Object arrays
 – Object groups
 – Information-sharing abstractions: readonly variables, tables, ...
 – Mature, robust, portable (http://charm.cs.uiuc.edu)
Object Arrays
• A collection of data-driven objects
 – With a single global name for the collection
 – Each member addressed by an index
  • [sparse] 1D, 2D, 3D, tree, string, ...
 – Mapping of element objects to processors handled by the system
User's view: a single array A[0], A[1], A[2], A[3], ..., A[..]
System view: each element (e.g., A[0], A[3]) lives on some physical processor, and may migrate; a declaration sketch follows.
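To make the model concrete, here is a minimal sketch of a 1D object (chare) array in Charm++, assuming the standard .ci interface-file syntax; the names (jacobi, Block, recvBoundary) are illustrative, not from the talk:

```cpp
// jacobi.ci (interface file, shown here as a comment block):
//   mainmodule jacobi {
//     array [1D] Block {
//       entry Block();
//       entry void recvBoundary(int n, double data[n]);
//     };
//   };

// jacobi.C -- one element of the object array
#include <vector>
#include "jacobi.decl.h"   // generated from jacobi.ci

class Block : public CBase_Block {
  std::vector<double> u;            // this element's chunk of the domain
public:
  Block() : u(1024, 0.0) {}
  Block(CkMigrateMessage*) {}       // migration constructor
  // Asynchronous method invocation: a sender calls
  //   thisProxy[thisIndex + 1].recvBoundary(n, ghost);
  // without blocking; the runtime delivers the call to whichever
  // processor currently hosts that element.
  void recvBoundary(int n, double* data) {
    // ... fold the neighbor's boundary into u, relax, send own boundary ...
  }
};
#include "jacobi.def.h"
```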
Adaptive MPI
• A migration path for legacy MPI codes
• AMPI = MPI + virtualization
• Uses Charm++ object arrays and migratable threads
• Minimal modifications needed to convert existing MPI programs
 – Automated via AMPizer, based on the Polaris compiler framework
• Bindings for C, C++, and Fortran 90
AMPI: 7 MPI "processes"
[Figure: seven MPI "processes," implemented as virtual processors (user-level migratable threads), mapped onto a smaller number of real processors; an example follows]
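For example, an ordinary MPI program runs unchanged under AMPI; each rank simply becomes a migratable user-level thread:

```cpp
// hello.c -- an ordinary MPI program, unchanged under AMPI; each
// "process" below is really a migratable user-level thread.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  printf("virtual processor %d of %d\n", rank, size);
  MPI_Finalize();
  return 0;
}
```

One would then launch more virtual processors than physical ones, e.g. 64 VPs on 8 processors (the charmrun launcher and +vp option are AMPI/Charm++ conventions; exact flags vary by version): ./charmrun +p8 ./hello +vp64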
Benefits of Virtualization
• Better software engineering
• Message-driven execution
• Flexible and dynamic mapping to processors
• Principle of persistence:
 – Enables runtime optimizations
 – Automatic dynamic load balancing
 – Communication optimizations
 – Other runtime optimizations
Modularization
• Logical units decoupled from the "number of processors"
 – E.g., oct-tree nodes for particle data
 – No artificial restriction on the number of processors (such as a cube of a power of 2)
• Modularity:
 – Software engineering: cohesion and coupling
 – MPI's "are on the same processor" is a bad coupling principle
 – Objects liberate you from that
  • E.g., solid and fluid modules in a rocket simulation
Rocket Simulation
• Large collaboration headed by Mike Heath
 – DOE-supported ASCI center
• Challenge:
 – Multi-component code, with modules from independent researchers
 – MPI was the common base
• AMPI: new wine in an old bottle
 – Easier to convert
 – Can still run the original codes on MPI, unchanged
Rocket Simulation via Virtual Processors
[Figure: many Rocflo (fluid), Rocface (interface), and Rocsolid (solid) virtual processors interleaved across the machine, rather than one module instance per physical processor]
AMPI and Roc*: Communication
[Figure: communication among Rocflo, Rocface, and Rocsolid virtual processors; messages are addressed to VPs, so the runtime routes them regardless of which physical processor currently hosts each VP]
Message-Driven Execution
Virtualization leads to message-driven execution:
[Figure: each processor runs a scheduler that dequeues messages from its message queue and invokes the targeted object's method; a sketch follows]
This in turn leads to automatic, adaptive overlap of computation and communication.
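The scheduler in the figure can be pictured as the following loop; this is a conceptual sketch only (the types are illustrative, not the actual Charm++ scheduler):

```cpp
// Conceptual sketch of a message-driven scheduler loop.
#include <functional>
#include <queue>

// A "message" is modeled as a closure: the object method invocation
// it triggers, already bound to its target object and arguments.
using Message = std::function<void()>;

void schedulerLoop(std::queue<Message>& q) {
  // Deliver whichever message is at the head of the queue; the
  // processor never blocks waiting for one *specific* message, so
  // any object whose data has arrived can run (adaptive overlap).
  while (!q.empty()) {
    Message m = std::move(q.front());
    q.pop();
    m();  // invoke the target object's entry method
  }
}
```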
Adaptive Overlap via Data-Driven Objects
• Problem:
 – Processors wait too long at "receive" statements
• Routine communication optimizations in MPI:
 – Move sends up and receives down
 – Sometimes: use irecvs, but be careful (see the sketch below)
• With data-driven objects:
 – Adaptive overlap of computation and communication
 – No object or thread holds up the processor
 – No need to guess which message is likely to arrive first
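A sketch of the MPI-side contrast: posting both receives and processing them in arrival order (MPI_Waitany) recovers some overlap, but the programmer must orchestrate it by hand, whereas data-driven objects get this behavior by default. The function, tags, and neighbor ranks below are illustrative; the matching sends are assumed to be posted elsewhere.

```cpp
// With blocking receives in a fixed order, the processor may idle on
// the first MPI_Recv even though the second message already arrived.
// Processing in arrival order avoids guessing which comes first.
#include <mpi.h>

void exchangeBoundaries(double* a, double* b, int n, int left, int right) {
  MPI_Request req[2];
  MPI_Irecv(a, n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
  MPI_Irecv(b, n, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
  for (int done = 0; done < 2; ++done) {
    int which;
    MPI_Waitany(2, req, &which, MPI_STATUS_IGNORE);
    // ... process whichever boundary arrived first ...
  }
}
```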
Adaptive Overlap and Modules
[Figure: SPMD and message-driven modules, from A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, Apr 1994]
Handling Random Load Variations via MDE
• MDE encourages asynchrony
 – Asynchronous reductions, for example
 – Only data dependence should force synchronization
• One benefit:
 – Consider an algorithm with N steps
  • Each step has a different load balance: T_ij
  • Loose dependence between steps (on neighbors, for example)
 – Sum-of-max (MPI) vs. max-of-sum (MDE); a worked example follows
• OS jitter:
 – Causes random processors to add delays in each step
 – Handled automatically by MDE
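A worked two-processor, two-step example (the numbers are illustrative, not from the talk) of why max-of-sum beats sum-of-max:

```latex
% Two processors, two steps, per-step times T_{ij}:
%   step 1: T_{11} = 10,\; T_{21} = 5
%   step 2: T_{12} = 5,\;  T_{22} = 10
\text{MPI (barrier per step):}\quad
  \sum_j \max_i T_{ij} = \max(10,5) + \max(5,10) = 20
\qquad
\text{MDE (loose dependence):}\quad
  \max_i \sum_j T_{ij} = \max(10+5,\; 5+10) = 15
```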
Example: Molecular Dynamics in NAMD
• Collection of [charged] atoms, with bonds
 – Newtonian mechanics
 – Thousands of atoms (1,000 to 500,000)
 – 1-femtosecond time step, and millions of steps needed!
• At each time step:
 – Calculate forces on each atom
  • Bonded forces
  • Non-bonded: electrostatic and van der Waals
 – Calculate velocities and advance positions
 – Multiple time stepping: PME (3D FFT) every 4 steps
Collaboration with K. Schulten, R. Skeel, and coworkers.
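A sketch of the time-step structure just described (illustrative only, not NAMD source code; the force routines are stubs):

```cpp
const int    PME_PERIOD = 4;    // full electrostatics every 4th step
const double DT_FS      = 1.0;  // 1-femtosecond time step

void computeBonded()       { /* bonds, angles, dihedrals */ }
void computeShortRange()   { /* cutoff electrostatics + van der Waals */ }
void computeLongRangePME() { /* 3D-FFT particle-mesh Ewald */ }
void integrate(double dt)  { /* advance velocities and positions */ }

void runSteps(int nsteps) {
  for (int step = 0; step < nsteps; ++step) {
    computeBonded();
    computeShortRange();
    if (step % PME_PERIOD == 0)
      computeLongRangePME();    // multiple time stepping
    integrate(DT_FS);
  }
}
```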
Parallel Molecular Dynamics
[Figure: NAMD's decomposition into virtual processors: 700 VPs, 192 + 144 VPs, and 30,000 VPs in different parts of the computation]
Performance: NAMD on Lemieux

                    Time (ms)            Speedup             GFLOPS
Procs  Per Node   Cut    PME    MTS    Cut   PME   MTS    Cut    PME    MTS
    1      1    24890  29490  28080      1     1     1   0.494  0.434  0.48
  128      4    207.4  249.3  234.6    119   118   119      59     51    57
  256      4    105.5  135.5  121.9    236   217   230     116     94   110
  512      4     55.4   72.9   63.8    448   404   440     221    175   211
  510      3     54.8   69.5   63      454   424   445     224    184   213
 1024      4     33.4   45.1   36.1    745   653   778     368    283   373
 1023      3     29.8   38.7   33.9    835   762   829     412    331   397
 1536      3     21.2   28.2   24.7   1175  1047  1137     580    454   545
 1800      3     18.6   25.8   22.3   1340  1141  1261     661    495   605
 2250      3     15.6   23.5   18.4   1599  1256  1527     789    545   733

(Cut = cutoff-only; PME = particle-mesh Ewald; MTS = multiple time stepping.)
ATPase benchmark: 320,000+ atoms including water.
Molecular Dynamics: Benefits of Avoiding Barriers
• In NAMD:
 – The energy reductions were made asynchronous
 – No other global barriers are used in cut-off simulations
• This came in handy when:
 – Running on Pittsburgh's Lemieux (3,000 processors)
 – The machine (plus our way of using the communication layer) produced unpredictable, random delays in communication
  • A send call would remain stuck for 20 ms, for example
• How did the system handle it?
 – See the timeline plots
[Figure: the timeline plots referenced above]
Asynchronous Reductions in Jacobi
[Figure: two processor timelines. With a synchronous reduction, a gap separates the compute phases around the reduction; with an asynchronous reduction, computation continues and that gap is avoided. A code sketch follows.]
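A sketch of an asynchronous reduction in Charm++, assuming the standard contribute/CkCallback API; the names Jacobi, Main, mainProxy, and relaxLocalBlock are illustrative, and the surrounding class and interface declarations are omitted:

```cpp
// Inside a Jacobi array element: contribute the local error to a
// sum-reduction and keep computing; no processor-wide barrier.
void Jacobi::doStep() {
  double localErr = relaxLocalBlock();   // one local relaxation sweep
  contribute(sizeof(double), &localErr, CkReduction::sum_double,
             CkCallback(CkReductionTarget(Main, reportError), mainProxy));
  // ... continue immediately with work for the next iteration; the
  // reduction result arrives later at Main::reportError() ...
}
```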
Virtualization/MDE Leads to Predictability
• Ability to predict:
 – Which data is going to be needed, and
 – Which code will execute
 – Based on the ready queue of object method invocations
• So, we can:
 – Prefetch data accurately
 – Prefetch code if needed
 – Do out-of-core execution
 – Weigh caches vs. controllable SRAM
Message: consider providing programmable SRAMs.
Programmable SRAMs
• Problems with caches:
 – Cache management is based on the principle of locality
  • A heuristic, not a perfect predictor
 – Cache-miss handling is on the critical path
• Our approach (message-driven execution):
 – Can exploit a programmable SRAM very effectively
 – Load the relevant data into the SRAM just in time
Example: Jacobi Relaxation
Each processor may have hundreds of such objects (a few tens of KB each, say). When all the boundary data for an object is available, the object is added to the "ready" queue.
[Figure: a prefetch/SRAM manager stages the data of objects on the ready queue from DRAM into SRAM ahead of the scheduler's queue; a conceptual sketch follows]
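A conceptual sketch of the just-in-time staging this enables; every name here is illustrative, not a real API:

```cpp
#include <deque>

struct Chunk { /* boundary + interior data for one Jacobi object */ };

void prefetchToSRAM(Chunk*) { /* stage object data DRAM -> SRAM */ }
void relax(Chunk*)          { /* compute on data already in SRAM */ }

void runReadyQueue(std::deque<Chunk*>& ready) {
  while (!ready.empty()) {
    Chunk* current = ready.front();
    ready.pop_front();
    // The ready queue tells us exactly what runs next, so its data
    // can be staged into SRAM while `current` computes: prediction
    // is exact, not a locality heuristic.
    if (!ready.empty()) prefetchToSRAM(ready.front());
    relax(current);
  }
}
```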
Flexible Dynamic Mapping to Processors
• The system can migrate objects between processors, to:
 – Vacate a processor used by a parallel program
 – Deal with extraneous loads on shared workstations
 – Adapt to speed differences between processors
  • E.g., a cluster with 500 MHz and 1 GHz processors
• Automatic checkpointing
 – Checkpointing = migrate to disk! (see the sketch below)
 – Restart on a different number of processors
• Shrink and expand the set of processors used by an app
 – Shrink from 1000 to 900 processors; later expand to 1200
 – Adaptive job scheduling for better system utilization
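Migration and checkpoint-to-disk both rest on objects being serializable. A minimal sketch using Charm++'s standard PUP ("pack/unpack") framework; the class and member names are illustrative, and the generated .decl.h header plus pup_stl.h are assumed:

```cpp
#include <vector>

class Block : public CBase_Block {
  int n;
  std::vector<double> u;
public:
  Block() : n(0) {}
  Block(CkMigrateMessage*) {}   // required migration constructor
  // One routine drives packing, unpacking, and sizing; the runtime
  // uses it both to migrate the object between processors and to
  // checkpoint it to disk ("checkpointing = migrate to disk").
  void pup(PUP::er& p) {
    p | n;
    p | u;                      // std::vector via pup_stl.h
  }
};
```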
Faucets: Optimizing Utilization Within/Across Clusters
[Figure: jobs are submitted to Faucets, which monitors and dispatches them across multiple clusters]
http://charm.cs.uiuc.edu/research/faucets
Inefficient Utilization Within a Cluster
[Figure: on a 16-processor system, job A is allocated 8 processors; job B then conflicts with A and is queued, leaving processors idle]
Current job schedulers can yield low system utilization; this is a competitive problem in the context of Faucets-like systems.
Two Adaptive Jobs
Adaptive jobs can shrink or expand the number of processors they use at runtime, by migrating virtual processors:
[Figure: on a 16-processor system, job A (min_pe = 8, max_pe = 16) is allocated all 16 processors; when job B is allocated, A shrinks; when B finishes, A expands again]
Job Monitoring: Appspector
[Screenshot of the Appspector job-monitoring interface]
AQS Features
• AQS: Adaptive Queuing System
• Multithreaded
• Reliable and robust
• Supports most features of standard queuing systems
• Can manage adaptive jobs, currently implemented in Charm++ and MPI
• Handles regular (non-adaptive) jobs
Cluster Utilization
[Charts: cluster utilization, experimental and simulated]
Experimental MRT
[Chart: experimental MRT results]
Principle of Persistence
• Once the application is expressed in terms of interacting objects:
 – Object communication patterns and computational loads tend to persist over time
 – In spite of dynamic behavior
  • Abrupt and large, but infrequent, changes (e.g., AMR)
  • Slow and small changes (e.g., particle migration)
• A parallel analog of the principle of locality
 – A heuristic that holds for most CSE applications
 – Learning / adaptive algorithms
 – Adaptive communication libraries
 – Measurement-based load balancing
Measurement-Based Load Balancing
• Based on the principle of persistence
• Runtime instrumentation
 – Measures communication volume and computation time
• Measurement-based load balancers
 – Use the instrumented database periodically to make new decisions (see the sketch below)
 – Many alternative strategies can use the database
  • Centralized vs. distributed
  • Greedy improvements vs. complete reassignments
  • Taking communication into account
  • Taking dependences into account (more complex)
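A minimal sketch of how an object opts into measurement-based load balancing in Charm++, assuming the standard AtSync API; the class and method names are illustrative, and the surrounding interface declarations are omitted:

```cpp
// The runtime instruments each object's load and communication and
// migrates objects when they reach an AtSync point.
class Worker : public CBase_Worker {
public:
  Worker() { usesAtSync = true; }   // enable instrumented balancing
  Worker(CkMigrateMessage*) {}
  void step() {
    // ... one iteration of instrumented work ...
    AtSync();                       // hand control to the balancer
  }
  void ResumeFromSync() {           // called after any migrations
    thisProxy[thisIndex].step();    // continue with the next iteration
  }
};
```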
Load Balancer in Action
[Chart: automatic load balancing in crack propagation. Iterations per second vs. iteration number: (1) elements are added and the rate drops; (2) the load balancer is invoked; (3) chunks are migrated and the rate recovers]
Optimizing for Communication Patterns
• The parallel-objects runtime system can observe, instrument, and measure communication patterns
 – Communication is from/to objects, not processors
 – Load balancers use this to optimize object placement
 – Communication libraries can optimize
  • By substituting the most suitable algorithm for each operation
  • Learning at runtime
 – E.g., each-to-all individualized sends (analysis below)
  • Performance depends on many runtime characteristics
  • The library switches between different algorithms
V. Krishnan, M.S. thesis, 1996.
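For small messages, per-message software overhead dominates, so sending fewer, combined messages wins. For an each-to-all among P processors, a direct implementation sends P - 1 messages per node, while a (virtual) 2D mesh algorithm combines messages along rows and then columns:

```latex
\text{direct: } P - 1 \ \text{messages per node}
\qquad
\text{2D mesh: } 2\left(\sqrt{P} - 1\right) \ \text{messages per node}
```

At P = 1024, that is 1023 versus 62 messages per node, which is why the mesh algorithm can win for the 76-byte case charted next. (These message counts are the standard analysis of such algorithms, not figures from the talk.)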
All-to-All on Lemieux for a 76-Byte Message
[Chart: completion time (ms) vs. number of processors (16 to 2048) for native MPI all-to-all and for mesh, hypercube, and 3D-grid virtual-topology algorithms]
Impact on Application Performance
[Chart: NAMD step time on Lemieux at 256, 512, and 1024 processors, with the PME transpose step implemented using different all-to-all algorithms: Mesh, Direct, and MPI]
"Overhead" of Virtualization
Isn't there significant overhead to virtualization? No! Not in most cases.
[Chart: time (seconds) per iteration vs. number of chunks per processor, from 1 to 2048]
Ongoing Research
• Fault tolerance
 – Much easier at the object level: TMR, efficient variations
 – However, checkpointing used to be such an efficient alternative (low forward-path cost)
 – Resurrecting past research
• Programming petaFLOPS machines
 – Programming environment: simulation and performance prediction
 – Communication optimizations: grids
 – Dealing with limited virtual memory space
Applications on the Current Emulator
• Using Charm++
• LeanMD:
 – Research-quality molecular dynamics
 – Version 0: only electrostatics + van der Waals
• Simple AMR kernel
 – Adaptive tree to generate millions of objects
  • Each holding a 3D array
 – Communication with "neighbors"
  • The tree makes it harder to find neighbors, but Charm makes it easy
Applications: Funded Collaborations
• Molecular dynamics for biophysics: NAMD
• QM/MM: Car-Parrinello
• Materials
 – Microstructure: dendritic growth
 – Bridging the gap between atomistic and FEM models
 – Space-time meshing
• Rocket simulation
 – DOE ASCI Center
• Computational astrophysics
Developing CS enabling technology in the context of real applications.
QM using the Car-Parrinello method: Glenn Martyna, Mark Tuckerman, et al.
[Figure]
Evolution of a Galaxy in Its Cosmological Context (Thomas Quinn et al.)
Needs: bridging the length-scale gap, multiple modules, communication optimizations, and dynamic load balancing.
[Figure]
Ongoing Research
• Load balancing
 – The Charm framework allows both distributed and centralized strategies
 – In recent years we have focused on centralized ones
  • Still OK for 3,000 processors for NAMD
 – Reverting to older work on distributed balancing
  • Need to handle locality of communication
   – Topology-sensitive placement
  • Need to work with global information
   – Approximate global info
   – Incomplete global info (only a "neighborhood")
  • Achieving global effects by local action...
Application Orchestration Support
[Diagram: a layered stack. Application components (A, B, C, D) sit atop framework components (Unmesh, Solvers, Data transfer, MBlock, Particles, AMR support) and parallel standard libraries, which in turn sit on Charm/AMPI, over MPI and lower layers]
Benefits of Virtualization: Summary
• Software engineering
 – Number of virtual processors can be independently controlled
 – Separate VPs for modules
• Message-driven execution
 – Adaptive overlap
 – Modularity
 – Predictability: automatic out-of-core execution, cache management
• Principle of persistence
 – Enables runtime optimizations
 – Automatic dynamic load balancing
 – Communication optimizations
 – Other runtime optimizations
• Dynamic mapping
 – Heterogeneous clusters: vacate, adjust to speed, share
 – Automatic checkpointing
 – Change the set of processors

More info: http://charm.cs.uiuc.edu