Runtime Optimizations


Advantages of Processor Virtualization
and AMPI
Laxmikant Kale
CS320
Spring 2003
[email protected]
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
7/21/2015
Virtualization for CS320:
1
Overview
• Processor Virtualization
  – Motivation
  – Realization in AMPI and Charm++
• Part I: Benefits
  – Better software engineering
  – Message-driven execution
  – Flexible and dynamic mapping to processors
  – Principle of persistence
Motivation
• We need to improve performance and productivity in parallel programming
• Parallel computing/programming is about:
  – Coordination between processes
    • Information exchange
    • Synchronization (knowing when the other process has done something)
  – Resource management
    • Allocating work and data to processors
Coordination
• Processes, each with possibly local data
  – How do they interact with each other?
  – Data exchange and synchronization
• Solutions proposed:
  – Message passing
  – Shared variables and locks
  – Global Arrays / shmem
  – UPC
  – Asynchronous method invocation
  – Specifically shared variables: readonly, accumulators, tables
  – Others: Linda, ...
• Each is probably suitable for different applications and the subjective tastes of programmers
Resource Management
• Coordination is one aspect
  – But parallel computing is also about resource management
• Who needs resources:
  – Work units
    • Threads, function calls, method invocations, loop iterations
  – Data units
    • Array segments, cache lines, stack frames, messages, object variables
• What are the resources:
  – Processors, floating-point units, thread units
  – Memories: caches, SRAMs, DRAMs, ...
• Idea:
  – The programmer should not have to manage resources explicitly, even within one program
Processor Virtualization
• Basic idea:
  – Divide the computation into a large number of pieces
    • Independent of the number of processors
    • Typically larger than the number of processors
  – Let the system map these virtual processors to processors
• Old idea? G. Fox's book ('86?)
  – DRMS (IBM), Data Parallel C (Michael Quinn), MPVM/UPVM/MIST
• Our approach is "virtualization++"
  – Language and runtime support for virtualization
  – Exploitation of virtualization to the hilt
Virtualization: Object-based Parallelization
• The user is only concerned with the interaction between objects (VPs)
[Figure: user's view of interacting objects vs. the system implementation mapping those objects onto processors]
Technical Approach
• Seek an optimal division of labor between "system" and programmer:
  – Decomposition done by the programmer, everything else automated
[Figure: spectrum from specialization to automation — MPI, Charm++/AMPI, and HPF placed by how much of decomposition, mapping, scheduling, and expression each automates]
Why Virtualization?
• Advertisement:
  – Virtualization is ready and powerful enough to meet the needs of tomorrow's applications and machines
• Specifically:
  – Virtualization and the associated techniques that we have been exploring for the past decade are ready and powerful enough to meet the needs of high-end parallel computing and complex, dynamic applications
• These techniques are embodied in:
  – Charm++
  – AMPI
  – Frameworks (structured grids, unstructured grids, particles)
  – Virtualization of other coordination languages (UPC, GA, ...)
Realizations: Charm++
• Charm++
  – Parallel C++ with data-driven objects (chares)
  – Asynchronous method invocation
    • Prioritized scheduling
  – Object arrays
  – Object groups
  – Information-sharing abstractions: readonly, tables, ...
  – Mature, robust, portable (http://charm.cs.uiuc.edu)
Object Arrays
• A collection of data-driven objects
  – With a single global name for the collection
  – Each member addressed by an index
    • [sparse] 1D, 2D, 3D, tree, string, ...
  – Mapping of element objects to processors handled by the system
• User's view: A[0] A[1] A[2] A[3] ... A[..]
• System view: individual elements (e.g., A[0], A[3]) placed on physical processors by the runtime, which may move them
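The user/system split above can be sketched as a tiny data structure. This is a hand-rolled Python illustration, not the Charm++ array API; the class and method names are invented for the example.

```python
# Toy sketch of an object array: the user addresses elements through one
# global name by index, while the "system" owns the index -> processor
# mapping and can change it without the user's code noticing.
class ObjectArray:
    def __init__(self, n, num_procs):
        self.elements = {i: f"A[{i}]" for i in range(n)}
        # Initial mapping chosen by the system (round-robin here).
        self.placement = {i: i % num_procs for i in range(n)}

    def __getitem__(self, i):
        # User's view: just A[i], regardless of where it lives.
        return self.elements[i]

    def migrate(self, i, proc):
        # System's view: remap one element to another processor.
        self.placement[i] = proc

A = ObjectArray(4, num_procs=2)
A.migrate(3, 0)  # the runtime moves A[3]; user code addressing A[3] is unaffected
```

The point of the separation: user code never mentions `placement`, so the runtime is free to rewrite it for load balance, vacating, or checkpointing.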
Adaptive MPI
• A migration path for legacy MPI codes
  – AMPI = MPI + virtualization
  – Uses Charm++ object arrays and migratable threads
• Existing MPI programs:
  – Minimal modifications needed to convert existing MPI programs
• Bindings for C, C++, and Fortran90
• We will focus on AMPI
  – Ignoring Charm++ for now...
AMPI:
[Figure: 7 MPI "processes" implemented as virtual processors (user-level migratable threads), mapped onto a smaller number of real processors]
Benefits of Virtualization
1. Modularity and better software engineering
2. Message-driven execution
3. Flexible and dynamic mapping to processors
4. Principle of persistence
   – Enables runtime optimizations
   – Automatic dynamic load balancing
   – Communication optimizations
   – Other runtime optimizations
1: Modularization
• Logical units decoupled from "number of processors"
  – E.g., oct-tree nodes for particle data
  – No artificial restriction on the number of processors
    • Such as a cube of a power of 2
• Modularity:
  – Software engineering: cohesion and coupling
  – MPI's "are on the same processor" is a bad coupling principle
  – Objects liberate you from that:
    • E.g., solid and fluid modules in a rocket simulation
Example: Rocket Simulation
• Large collaboration headed by Prof. M. Heath
  – DOE-supported ASCI center
• Challenge:
  – Multi-component code,
    • with modules from independent researchers
  – MPI was the common base
• AMPI: new wine in an old bottle
  – Easier to convert
  – Can still run original codes on MPI, unchanged
• Example of modularization:
  – Rocflo: fluids code
  – Rocsolid: structures code
  – Rocface: data transfer at the boundary
Rocket simulation via virtual processors
[Figure: many Rocflo, Rocface, and Rocsolid virtual processors distributed across the machine]
AMPI and Roc*: Communication
• Using separate sets of virtual processors for Rocflo and Rocsolid eliminates unnecessary coupling
[Figure: independent sets of Rocflo, Rocface, and Rocsolid virtual processors communicating only at the boundaries]
2: Benefits of Message-Driven Execution
• Virtualization leads to message-driven execution:
  – Since there are potentially multiple objects on each processor
[Figure: each processor runs a scheduler that picks messages off its message queue and delivers them to local objects]
• This leads to automatic adaptive overlap of computation and communication
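The scheduler/message-queue picture above can be sketched in a few lines. This is an illustrative Python toy, not the Charm++ scheduler; the class and method names are invented for the example.

```python
# Minimal sketch of message-driven execution: a scheduler loop delivers
# whatever message is at the head of the queue, so no object can block
# the processor by waiting for one specific message.
from collections import deque

class Chare:
    def __init__(self, name):
        self.name = name
        self.received = []

    def recv(self, payload):  # an "entry method" invoked by the scheduler
        self.received.append(payload)

def scheduler_loop(queue):
    """Dispatch messages strictly in arrival order; return the delivery log."""
    log = []
    while queue:
        target, payload = queue.popleft()
        target.recv(payload)
        log.append(target.name)
    return log

a, b = Chare("A"), Chare("B")
# B's message happened to arrive first; the scheduler adapts to that
# instead of stalling in a blocking receive posted for A.
q = deque([(b, "boundary data"), (a, "boundary data")])
order = scheduler_loop(q)
```

Because dispatch follows arrival order rather than program order, whichever object has work ready runs first; that is the mechanism behind the adaptive overlap claimed on the slide.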
Adaptive Overlap via Data-driven Objects
• Problem:
  – Processors wait too long at "receive" statements
• Routine communication optimizations in MPI:
  – Move sends up and receives down
  – Use irecvs, but be careful
• With data-driven objects:
  – Adaptive overlap of computation and communication
  – No object or thread holds up the processor
  – No need to guess which message is likely to arrive first
Adaptive overlap and modules
[Figure: SPMD and message-driven modules. From A. Gursoy, "Simplified expression of message-driven programs and quantification of their impact on performance," Ph.D. thesis, Apr 1994.]
Handling Random Load Variations via MDE
• MDE encourages asynchrony
  – Asynchronous reductions, for example
  – Only data dependence should force synchronization
• One benefit:
  – Consider an algorithm with N steps
    • Each step has a different load balance: Tij
    • Loose dependence between steps
      – (on neighbors, for example)
    • Sum-of-max (MPI) vs. max-of-sum (MDE)
• OS jitter:
  – Causes random processors to add delays in each step
  – Handled automatically by MDE
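The sum-of-max vs. max-of-sum contrast can be made concrete with a small numeric example (the load values below are made up for illustration):

```python
# With a barrier after every step (MPI style), each step costs its maximum
# processor load, so total time is the sum of per-step maxima. With loose,
# message-driven dependences, imbalances can cancel across steps, and the
# lower bound is the busiest processor's total work across all steps.
loads = [            # loads[step][processor]
    [4, 1, 2],
    [1, 4, 2],
    [2, 1, 4],
]
sum_of_max = sum(max(step) for step in loads)  # lock-step cost: 4 + 4 + 4
max_of_sum = max(map(sum, zip(*loads)))        # MDE bound: busiest column
```

Here the lock-step cost is 12 while the message-driven bound is 8, because each processor's peak falls in a different step and the slack can be reused.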
Asynchronous reductions in Jacobi
[Figure: processor timelines. With a synchronous reduction, a gap separates the reduction from the next compute phase; with an asynchronous reduction, the reduction overlaps the next compute phase and that gap is avoided.]
Virtualization/MDE leads to predictability
• Ability to predict:
  – Which data is going to be needed and
  – Which code will execute
  – Based on the ready queue of object method invocations
• So, we can:
  – Prefetch data accurately
  – Prefetch code if needed
  – Do out-of-core execution
  – Use caches vs. controllable SRAM
3: Flexible Dynamic Mapping to Processors
• The system can migrate objects between processors
  – Vacate a processor used by a parallel program
  – Deal with extraneous loads on shared workstations
  – Adapt to speed differences between processors
    • E.g., a cluster with 500 MHz and 1 GHz processors
• Automatic checkpointing
  – Checkpointing = migrate to disk!
  – Restart on a different number of processors
• Shrink and expand the set of processors used by an app
  – Shrink from 1000 to 900 processors; later expand to 1200
• Adaptive job scheduling for better system utilization
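Shrinking is cheap precisely because every unit of work is a migratable virtual processor. A sketch of the reassignment step, under assumptions: the `shrink` helper is invented for this example, and it reassigns round-robin, whereas a real runtime would consider load.

```python
# Sketch of shrinking the processor set: VPs living on vacated processors
# are reassigned over the remaining ones (round-robin here for simplicity).
def shrink(placement, vacated, remaining):
    new_placement = dict(placement)
    rr = 0
    for vp, proc in sorted(placement.items()):
        if proc in vacated:
            new_placement[vp] = remaining[rr % len(remaining)]
            rr += 1
    return new_placement

# 8 VPs spread over 4 processors; processor 3 must be vacated.
placement = {vp: vp % 4 for vp in range(8)}
after = shrink(placement, vacated={3}, remaining=[0, 1, 2])
```

Expanding is the symmetric operation: the same table is rewritten to spread VPs over a larger remaining set. The application itself never changes, which is what makes adaptive job scheduling possible.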
Inefficient Utilization Within a Cluster
[Figure: on a 16-processor system, job A is allocated 8 processors; job B arrives, conflicts, and is queued until A finishes.]
• Current job schedulers can yield low system utilization
  – A competitive problem in the context of Faucets-like systems
Two Adaptive Jobs
• Adaptive jobs can shrink or expand the number of processors they use, at runtime, by migrating virtual processors
[Figure: on a 16-processor system, job A (min_pe = 8, max_pe = 16) is allocated; when job B is allocated, A shrinks to make room; when B finishes, A expands again.]
AQS Features
• AQS: Adaptive Queuing System
• Multithreaded
• Reliable and robust
• Supports most features of standard queuing systems
• Can manage adaptive jobs, currently implemented in Charm++ and MPI
• Handles regular (non-adaptive) jobs
Cluster Utilization
[Figure: experimental and simulated cluster utilization]

Experimental Mean Response Time
[Figure: measured mean response time]
4: Principle of Persistence
• Once the application is expressed in terms of interacting objects:
  – Object communication patterns and computational loads tend to persist over time
  – In spite of dynamic behavior
    • Abrupt and large, but infrequent, changes (e.g., AMR)
    • Slow and small changes (e.g., particle migration)
• Parallel analog of the principle of locality
  – A heuristic that holds for most CSE applications
  – Enables learning / adaptive algorithms
  – Enables adaptive communication libraries
  – Enables measurement-based load balancing
Measurement-Based Load Balancing
• Based on the principle of persistence
• Runtime instrumentation
  – Measures communication volume and computation time
• Measurement-based load balancers
  – Use the instrumented database periodically to make new decisions
  – Many alternative strategies can use the database
    • Centralized vs. distributed
    • Greedy improvements vs. complete reassignments
    • Taking communication into account
    • Taking dependences into account (more complex)
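One of the simplest strategies in the family above — centralized and greedy — can be sketched as follows. This is an illustrative toy, not a Charm++ balancer: it uses only the measured computation times and ignores communication and migration cost.

```python
import heapq

# Greedy measurement-based rebalancing: sort objects by measured load
# (heaviest first) and place each on the currently least-loaded processor.
def greedy_rebalance(measured_loads, num_procs):
    heap = [(0.0, p) for p in range(num_procs)]  # (load so far, processor)
    heapq.heapify(heap)
    assignment = {}
    for obj, load in sorted(measured_loads.items(), key=lambda kv: -kv[1]):
        total, proc = heapq.heappop(heap)
        assignment[obj] = proc
        heapq.heappush(heap, (total + load, proc))
    return assignment

# Loads measured by runtime instrumentation (made-up values).
measured = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 3.0, "e": 1.0}
assignment = greedy_rebalance(measured, num_procs=2)
```

With these inputs the two processors end up with 8.0 units of work each. The principle of persistence is what justifies reusing the measurements: last period's loads predict next period's.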
Load balancer in action
[Figure: automatic load balancing in crack propagation — iterations per second vs. iteration number. Throughput drops when elements are added (1), then recovers after the load balancer is invoked (2) and chunks are migrated (3).]
Optimizing for Communication Patterns
• The parallel-objects runtime system can observe, instrument, and measure communication patterns
  – Communication is from/to objects, not processors
  – Load balancers use this to optimize object placement
  – Communication libraries can optimize
    • By substituting the most suitable algorithm for each operation
    • Learning at runtime
  – E.g., each-to-all individualized sends
    • Performance depends on many runtime characteristics
    • The library switches between different algorithms
• V. Krishnan, MS thesis, 1996
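Why would a library switch algorithms at runtime? A toy alpha-beta cost model makes the trade-off visible. All constants here are assumptions for illustration, not measured Lemieux numbers, and the function names are invented.

```python
import math

# Direct all-to-all sends p-1 small messages per processor; a 2D virtual
# mesh sends ~2(sqrt(p)-1) combined messages, each sqrt(p) times larger.
# Which wins depends on message size and processor count.
ALPHA = 10e-6   # assumed per-message latency, seconds
BETA = 1e-9     # assumed per-byte transfer cost, seconds

def direct_cost(p, msg_bytes):
    return (p - 1) * (ALPHA + BETA * msg_bytes)

def mesh_cost(p, msg_bytes):
    s = math.isqrt(p)  # assumes p is a perfect square
    return 2 * (s - 1) * (ALPHA + BETA * msg_bytes * s)

def pick_algorithm(p, msg_bytes):
    return "mesh" if mesh_cost(p, msg_bytes) < direct_cost(p, msg_bytes) else "direct"
```

Under this model, many small messages (e.g., 76 bytes on 1024 processors) favor the mesh, which amortizes latency by combining messages; few processors with large messages favor direct sends, which avoid forwarding the data twice. That is the shape of the decision a learning runtime library makes from measurements.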
All-to-all on Lemieux for a 76-byte message
[Figure: time (ms) vs. number of processors (16 to 2048) for MPI, mesh, hypercube, and 3D-grid all-to-all algorithms.]
Impact on Application Performance
[Figure: molecular dynamics (NAMD) step time on Lemieux at 256, 512, and 1024 processors, with the transpose step implemented using different all-to-all algorithms (mesh, direct, MPI).]
"Overhead" of Virtualization
• Isn't there significant overhead of virtualization?
• No! Not in most cases.
• Here, an application is run with an increasing degree of virtualization
[Figure: time per iteration (seconds) vs. number of chunks per processor (1 to 2048). Performance actually improves with virtualization because of better cache performance.]
How to decide the granularity
• How many virtual processors should you use?
  – This (typically) does not depend on the number of physical processors available
  – Granularity:
    • Simple definition: amount of computation per message
  – Guiding principle:
    • Make (the work for) each virtual processor as small as possible, while making sure it is sufficiently large compared with the scheduling/messaging overhead.
• In practice, today:
  – Average computation per message > 100 microseconds is enough
  – 0.5 ms to several ms is typically used
How to decide the granularity: contd.
• Exceptions:
  – Memory overhead
    • Virtualization may lead to a large area of memory devoted to "ghosts"
    • Reduce the number of virtual processors
    • OR: "fuse" chunks on individual processors to avoid ghost regions
  – Large messages
    • Modify the rule:
      – Calculate the message overhead
      – Ensure granularity is more than 10 times this overhead
Benefits of Virtualization: Summary
• Software engineering
  – Number of virtual processors can be independently controlled
  – Separate VPs for modules
• Message-driven execution
  – Adaptive overlap
  – Modularity
  – Predictability:
    • Automatic out-of-core execution
    • Cache management
• Principle of persistence:
  – Enables runtime optimizations
  – Automatic dynamic load balancing
  – Communication optimizations
  – Other runtime optimizations
• Dynamic mapping
  – Heterogeneous clusters:
    • Vacate, adjust to speed, share
  – Automatic checkpointing
  – Change the set of processors

More info: http://charm.cs.uiuc.edu