Generational Cache Management of Code Traces in Dynamic Optimization Systems
Original Authors
Kim Hazelwood
Michael D. Smith
Harvard University
Presented with modifications by
William M. Jones
For ECE 903 Seminar (Spring 2004)
Clemson University
Dynamic Optimization Systems
[Diagram: the executable (exe) passes through Profile and Transform stages; code runs on the CPU from the Code Cache]
Run-Time Overheads
• Observing execution
• Forming code regions
• Optimization
• Code caching
All of the above is overhead.
For good performance, the vast majority of execution should occur in the code cache. Cache management must be efficient and must not dominate run time.
Code Caches Store Superblocks
[Figure: the original CFG (basic blocks A through I, including a call and a return) shown beside the resulting superblocks, which are the potential caching entities]
Essential idea: a superblock is a common dynamic sequence of basic blocks.
Code Cache Management
Goals
• Maximize execution in code cache
• Minimize runtime overhead
Previous solutions
• Cache flush on program phase change
• Unbounded caches
• Research tools
All motivated by SPEC benchmark performance
Contributions of the Research
• Characterization of SPEC and interactive applications
• Investigation of superblock lifetimes
• Generational code cache algorithms
• Evaluation
  – Miss rates
  – Overheads
  – Performance improvements
The DynamoRIO Collaboration
Dynamo (from Hewlett-Packard Laboratories) + RIO (Runtime Introspection and Optimization, from MIT’s Laboratory for Computer Science)
• Targets Windows NT and Linux
• Addresses challenges of providing dynamic optimization infrastructure
System Layout
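[Diagram: benchmarks run under DynamoRIO, which emits a superblock trace (superblock details, insertions, accesses); the trace feeds the code cache simulator, which produces the results. Annotation: repeatability (Windows)]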
Initial Studies
• Determine maximum code cache size
• Calculate code expansion factors
• Obtain trace generation frequency
• Identify causes of fragmentation in cache
Do We Need Cache Management?
For SPEC2000: Probably not…
[Figure: code cache size (in KB, log scale from 1 to 100,000) per SPEC2000 benchmark: gcc, vortex, eon, perlbmk, gap, sixtrack, twolf, crafty, parser, fma3d, vpr, apsi, galgel, mesa, ammp, wupwise, lucas, applu, mgrid, equake, swim, bzip2, mcf, gzip, art, and hmean. Values called out on the chart: 4.3 MB and 0.7 MB]
Interactive Windows Applications
• 20-fold increase
• Unbounded caches become impractical
[Figure: code cache size (in KB, log scale from 1 to 100,000) per Windows application: solitaire, defrag, winzip, mpeg, pinball, acroread, excel, iexplore, outlook, access, powerpoint, word, and hmean. Values called out on the chart: 34 MB and 16 MB]
As a General Rule (in DynamoRIO)
Code Expansion = Final Code Cache Size / Application Footprint
[Figure: code expansion (0% to 700%) per Windows application: solitaire, defrag, winzip, mpeg, pinball, acroread, excel, iexplore, outlook, access, powerpoint, word, and hmean. Value called out on the chart: 516%]
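The code expansion ratio above is a single division, which can be sketched as follows. The function name and the byte counts are hypothetical and chosen only to illustrate the arithmetic; they are not measurements from the study.

```python
def code_expansion(final_cache_size_bytes: int, app_footprint_bytes: int) -> float:
    """Return code expansion as a percentage of the application footprint."""
    if app_footprint_bytes <= 0:
        raise ValueError("application footprint must be positive")
    return 100.0 * final_cache_size_bytes / app_footprint_bytes


# Hypothetical sizes: a 1 MB footprint whose final code cache holds 5.16 MB.
expansion = code_expansion(final_cache_size_bytes=5_160_000,
                           app_footprint_bytes=1_000_000)
print(f"code expansion: {expansion:.0f}%")
```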
Trace Generation Frequency: SPEC2000
Trace Generation Frequency: Windows Benchmark
Fragmentation
Superblocks Vary in Size
[Figure: histogram of superblock sizes for SPEC2000 and Windows; x-axis: superblock size in bytes (100 to 400), y-axis: frequency (0 to 1200)]
Unmapped Memory
Additional Fragmentation
Local vs. Global Cache Management
Two Granularities
• Local Cache Management: eviction policy for a single code cache (FIFO, LRU, etc.)
• Global Cache Management: policy of interaction between multiple code caches
  – Basic block vs. superblock cache
  – Generational code caches
Cache Management Challenges
• Low overhead
  – Impacts runtime performance
  – Complex calculations not feasible
• Emphasize temporal locality
  – Intuitively obvious (that’s the whole point)
• Minimize fragmentation
  – Insertion and deletion
  – Unmapping memory (dynamic loading)
• Circular buffer management
Circular Buffer Challenges
• Undeletable traces
  – Suspended execution
  – Exception handling
• Program-forced evictions
  – Unmapped memory
  – Fragmentation
• Complications warrant new buffer design
  – Pseudo-circular buffer (not strict FIFO)
  – Skip undeletable traces and ignore program-forced evictions
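A minimal sketch of the pseudo-circular idea: evict in FIFO order, but step over traces that are currently undeletable instead of stalling behind them. The class name, the `pinned` set, and the byte-count bookkeeping are assumptions made for illustration; the sketch tracks only total bytes and does not model address layout or fragmentation.

```python
from collections import OrderedDict


class PseudoCircularCache:
    """FIFO-like cache that skips entries currently marked undeletable."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()  # trace id -> size, oldest first
        self.pinned = set()           # ids that must not be evicted right now

    def insert(self, trace_id, size):
        """Insert a trace, evicting the oldest deletable traces as needed.

        Returns the ids that were evicted. Raises MemoryError if the
        remaining deletable traces cannot free enough room.
        """
        evicted = []
        for victim in list(self.entries):      # walk from oldest to newest
            if self.used + size <= self.capacity:
                break
            if victim in self.pinned:          # not strict FIFO: skip it
                continue
            self.used -= self.entries.pop(victim)
            evicted.append(victim)
        if self.used + size > self.capacity:
            raise MemoryError("not enough deletable space for the new trace")
        self.entries[trace_id] = size
        self.used += size
        return evicted


cache = PseudoCircularCache(capacity_bytes=300)
for name in ("A", "B", "C"):
    cache.insert(name, 100)
cache.pinned.add("A")            # e.g. a thread is suspended inside trace A
evicted = cache.insert("D", 100)
print(evicted, list(cache.entries))
```

A strict FIFO would have to wait for A to become deletable; here B is evicted in its place and A keeps its position at the head of the queue.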
Basic Block & Superblock Caches
• DynamoRIO generates traces by first copying all basic blocks into a code cache (the basic block cache)
• Once the basic blocks become “hot”, superblocks are formed and copied into the superblock cache
[Diagram: Basic Block Cache → (50 executions) → Superblock Formation → Superblock Cache]
• One weakness of a single FIFO (circular) cache is that all superblocks are treated equally
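The two-level flow can be sketched with a per-block execution counter and the 50-execution threshold shown in the diagram. This is a simplified illustration, not DynamoRIO's implementation: the class and attribute names are hypothetical, and a superblock is represented only by the id of its head block rather than by a stitched sequence of blocks.

```python
from collections import Counter

HOT_THRESHOLD = 50  # executions before a block counts as hot (from the diagram)


class TwoLevelCache:
    """Toy model of a basic block cache feeding a superblock cache."""

    def __init__(self):
        self.exec_counts = Counter()
        self.basic_block_cache = set()
        self.superblock_cache = set()

    def execute(self, block_id):
        """Record one execution; return True if the block just became hot."""
        if block_id in self.superblock_cache:
            return False  # already executing from the superblock cache
        self.basic_block_cache.add(block_id)
        self.exec_counts[block_id] += 1
        if self.exec_counts[block_id] >= HOT_THRESHOLD:
            # Stands in for superblock formation starting at this block.
            self.superblock_cache.add(block_id)
            return True
        return False


model = TwoLevelCache()
became_hot = [model.execute("bb1") for _ in range(HOT_THRESHOLD)]
print(became_hot.index(True) + 1)  # execution number at which bb1 turned hot
```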
Superblock Lifetimes -- SPEC2000
Lifetime = (LastExecutionTime - FirstExecutionTime) / TotalExecutionTime
Superblock Lifetimes -- Windows
Lifetime = (LastExecutionTime - FirstExecutionTime) / TotalExecutionTime
[Figure: percentage of superblocks (0% to 80%) in each lifetime range (0-20%, 20-40%, 40-60%, 60-80%, 80-100%) for word, photoshop, powerpoint, visualstudio, outlook, pinball, iexplore, winzip, and illustrator]
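The lifetime metric and the 20%-wide ranges used in the chart can be computed as in the sketch below. The function names and the sample timestamps are hypothetical; time can be in any unit as long as all three arguments use the same one.

```python
def superblock_lifetime(first_exec, last_exec, total_exec):
    """Fraction of total execution time between first and last execution."""
    if total_exec <= 0:
        raise ValueError("total execution time must be positive")
    if not 0 <= first_exec <= last_exec <= total_exec:
        raise ValueError("expected 0 <= first <= last <= total")
    return (last_exec - first_exec) / total_exec


def lifetime_bucket(lifetime):
    """Map a lifetime in [0, 1] to one of the five 20%-wide chart ranges."""
    labels = ["0-20%", "20-40%", "40-60%", "60-80%", "80-100%"]
    return labels[min(int(lifetime * 5), 4)]


# Hypothetical superblock: first run at t=10, last run at t=60, program ends at t=100.
life = superblock_lifetime(first_exec=10, last_exec=60, total_exec=100)
print(life, lifetime_bucket(life))
```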
Generational Code Caches
Initial Conceptual Design
[Diagram: a new superblock enters the nursery cache (circular buffer). On FIFO eviction from the nursery: if (live), PROMOTE to the persistent cache (circular buffer); if (dead), DELETE]
Generational Hypothesis
• Generational hypothesis from garbage collection: objects tend to die young
• Unfortunately, garbage collectors know when an object is dead; a code cache manager does not
• A superblock is dead when it will never be executed again (impossible to determine before the program ends)
• Guessing incorrectly doesn’t affect correctness; it is just expensive
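The Probation Cache
[Diagram: a new superblock enters the nursery cache (circular buffer). On FIFO eviction from the nursery it moves to the probation cache (circular buffer). On FIFO eviction from probation: if (threshold_met), PROMOTE to the persistent cache (circular buffer); if (threshold_not_met), DELETE]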
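The nursery, probation, and persistent flow can be sketched as a small trace-driven model. Several details are assumptions made for illustration: all class and method names are hypothetical, "threshold met" is taken to mean the number of executions observed while the superblock sits in probation, each superblock is assumed to fit in each cache, and superblocks evicted from the persistent cache are simply dropped.

```python
from collections import OrderedDict


class FifoCache:
    """Byte-bounded FIFO cache of superblocks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.items = OrderedDict()  # superblock id -> size, oldest first

    def __contains__(self, sb_id):
        return sb_id in self.items

    def insert(self, sb_id, size):
        """Insert and return the (id, size) pairs evicted to make room."""
        evicted = []
        while self.items and self.used + size > self.capacity:
            old_id, old_size = self.items.popitem(last=False)
            self.used -= old_size
            evicted.append((old_id, old_size))
        self.items[sb_id] = size
        self.used += size
        return evicted


class GenerationalCache:
    def __init__(self, nursery, probation, persistent, threshold):
        self.nursery = FifoCache(nursery)
        self.probation = FifoCache(probation)
        self.persistent = FifoCache(persistent)
        self.threshold = threshold
        self.probation_hits = {}  # executions seen while in probation
        self.accesses = 0
        self.misses = 0

    def access(self, sb_id, size):
        """Simulate one superblock execution; return True on a hit."""
        self.accesses += 1
        if sb_id in self.probation:
            self.probation_hits[sb_id] += 1
            return True
        if sb_id in self.nursery or sb_id in self.persistent:
            return True
        self.misses += 1  # the superblock must be (re)generated
        for old_id, old_size in self.nursery.insert(sb_id, size):
            self._to_probation(old_id, old_size)
        return False

    def _to_probation(self, sb_id, size):
        self.probation_hits[sb_id] = 0
        for old_id, old_size in self.probation.insert(sb_id, size):
            if self.probation_hits.pop(old_id) >= self.threshold:
                self.persistent.insert(old_id, old_size)  # promote
            # otherwise the superblock is deleted


sim = GenerationalCache(nursery=2, probation=2, persistent=2, threshold=1)
for sb in ["A", "B", "C", "A", "D", "E", "F"]:
    sim.access(sb, 1)
print(sim.misses, sim.accesses, list(sim.persistent.items))
```

In this toy trace, A is executed once while in probation and is promoted when it is evicted from there; B is never executed in probation and is deleted.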
Experimental Comparison
• Ensure pressure: cacheSize = (1/3)*maxCache
• Local policy fixed at FIFO for all caches
• Base Case: One unified FIFO cache
• Generational Case: Nursery, probation, persistent caches totaling cacheSize
• How big should each cache be?
• Probation threshold?
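The sizing arithmetic for the generational case can be sketched as below: the budget is one third of the unbounded maximum, split by a proportion scheme so the three caches total cacheSize. The function name, the use of integer percentage weights, and the 3,000,000-byte maximum are assumptions for illustration only.

```python
def partition_cache(max_cache_bytes, weights=(45, 10, 45)):
    """Split the pressured budget into nursery, probation, persistent sizes."""
    if len(weights) != 3 or min(weights) <= 0:
        raise ValueError("expected three positive weights")
    cache_size = max_cache_bytes // 3          # cacheSize = (1/3) * maxCache
    total = sum(weights)
    nursery = cache_size * weights[0] // total
    probation = cache_size * weights[1] // total
    persistent = cache_size - nursery - probation  # remainder keeps the total exact
    return {"nursery": nursery, "probation": probation, "persistent": persistent}


# Hypothetical maximum (unbounded) code cache size of 3,000,000 bytes.
sizes = partition_cache(3_000_000, weights=(45, 10, 45))
print(sizes)
```

Giving the rounding remainder to the persistent cache is an arbitrary choice; it only guarantees that the three sizes sum to cacheSize.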
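Windows Application Miss Rates
[Figure: miss rate reduction (-20% to 60%) per Windows application: word, powerpoint, outlook, iexplore, illustrator, photoshop, vstudio, pinball, winzip, and average. Three configurations are plotted, each given as cache size proportions followed by a probation threshold: 33%/33%/33% 10, 40%/20%/40% 5, and 45%/10%/45% 1]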
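SPEC2000 Miss Rates
[Figure: miss rate reduction (-20% to 60%) per SPEC2000 benchmark: gcc, vortex, eon, perlbmk, gap, sixtrack, twolf, crafty, parser, fma3d, vpr, apsi, mesa, ammp, wupwise, applu, mgrid, swim, bzip2, mcf, gzip, art, and average. Three configurations are plotted, each given as cache size proportions followed by a probation threshold: 33%/33%/33% 10, 40%/20%/40% 5, and 45%/10%/45% 1]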
Incorporating Overheads
Using Pentium-4 performance monitors and PAPI (counters), we collected overheads for:

Overhead                    Calculation               Size = 242 B
Superblock formation        865 * SBSize^0.8          69834
Cache eviction              2.75 * SBSize + 2650      3316
Cache promotion             22 * SBSize + 8030        13354
DynamoRIO context switch    25                        25

All overheads reported in instructions
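The overhead formulas can be checked against the 242-byte column with a few lines of code. The function names are hypothetical, and the superblock formation cost is read as 865 * SBSize^0.8, an interpretation of the slide's notation that reproduces the listed 69834 figure.

```python
def superblock_formation_cost(sb_size):
    """Instructions to form a superblock of sb_size bytes (power-law fit)."""
    return 865 * sb_size ** 0.8


def eviction_cost(sb_size):
    """Instructions to evict a superblock of sb_size bytes."""
    return 2.75 * sb_size + 2650


def promotion_cost(sb_size):
    """Instructions to promote a superblock of sb_size bytes."""
    return 22 * sb_size + 8030


CONTEXT_SWITCH_COST = 25  # instructions, independent of superblock size

SB_SIZE = 242  # bytes, the size used in the table
for label, cost in [
    ("Superblock formation", superblock_formation_cost(SB_SIZE)),
    ("Cache eviction", eviction_cost(SB_SIZE)),
    ("Cache promotion", promotion_cost(SB_SIZE)),
    ("DynamoRIO context switch", CONTEXT_SWITCH_COST),
]:
    print(f"{label:26s} {cost:10.1f}")
```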
Generating Overhead Estimates
Each overhead estimate was generated using least-squares linear regression over 30,000 samples
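A minimal sketch of an ordinary least-squares line fit, the kind of regression named above. The sample points are synthetic and noise-free, generated from the eviction formula 2.75 * SBSize + 2650 purely to show that the fit recovers a known slope and intercept; they are not the 30,000 measured samples, and the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic (superblock size, instruction count) pairs; not measured data.
sizes = list(range(100, 401, 30))
costs = [2.75 * s + 2650 for s in sizes]
slope, intercept = fit_line(sizes, costs)
print(f"cost ~= {slope:.2f} * SBSize + {intercept:.0f}")
```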
Reduction in Runtime Overhead
(45% -- 10% -- 45% proportion scheme)
[Figure: instruction overhead eliminated (-10% to 50%) per benchmark: word, powerpoint, outlook, iexplore, illustrator, photoshop, vstudio, pinball, winzip, gcc, vortex, eon, perlbmk, gap, sixtrack, twolf, crafty, parser, fma3d, vpr, apsi, mesa, ammp, wupwise, applu, mgrid, swim, bzip2, mcf, gzip, art, and geomean]
Actual Execution Time Improvement
• Results varied and were highly dependent on the number of misses eliminated
• Gzip: 2,288 misses eliminated, resulting in a 0.07% reduction in execution time
• Crafty: 292,486 misses eliminated, resulting in an 8.09% reduction in execution time
Conclusions
• Large, interactive applications impose limiting constraints on code caches
• Leverage observations of superblock lifetimes to improve management policies
• Based on trace-driven simulation, replacing a single code cache with multiple generational code caches results in
  – Reduced miss rates
  – Reduced runtime overhead
Ongoing Research
• Increasing cache pressure
  – Management overhead can dominate
• Code cache eviction granularities
  – Fine grain = lower miss rate
  – Coarse grain = less total management overhead
• Going from SIMULATION to IMPLEMENTATION
Questions?
Appendix A -- Metrics
HMEAN = Harmonic Mean = H
GMEAN = Geometric Mean = G
H <= G <= Arithmetic mean
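The ordering of the three means can be checked with Python's standard `statistics` module. The sample values are hypothetical and chosen only so the results are easy to verify by hand.

```python
from statistics import geometric_mean, harmonic_mean, mean

values = [1.0, 4.0, 16.0]  # hypothetical positive samples

h = harmonic_mean(values)    # 3 / (1/1 + 1/4 + 1/16)
g = geometric_mean(values)   # cube root of 1 * 4 * 16
a = mean(values)             # (1 + 4 + 16) / 3

print(f"H = {h:.3f}, G = {g:.3f}, A = {a:.3f}")
```

The three means coincide only when all values are equal; the more the values are spread out, the further the harmonic mean falls below the other two.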
Potential Questions
• FIFO versus LRU, WHY?
  – Published paper at Interact’02 BIB[12]
DynamoRIO Block Diagram