Transcript Slide 1
The Return of Synthetic Benchmarks
Ajay M. Joshi (UT Austin), Lieven Eeckhout (Ghent University), Lizy K. John (UT Austin)
January 28, 2008
Laboratory of Computer Architecture, Department of Electrical & Computer Engineering, The University of Texas at Austin
Outline
The Need for Synthetic Benchmarks
BenchMaker Framework for Benchmark Synthesis
Workload Characteristics Used in Synthesis
Synthetic Benchmark Construction
Evaluation of BenchMaker
Applications
Summary
Benchmark Spectrum
Complete Application Code
Application Suites, e.g. SPEC CPU
Kernel Codes, e.g. Livermore Loops
Synthetic Benchmarks, e.g. Dhrystone, Whetstone
Microbenchmarks, e.g. STREAM
Toy Benchmarks, e.g. Heap sort
Toward toy benchmarks: Less Development Effort, More Scalable, More Maintainable, Less Representative
Toward complete applications: More Development Effort, Less Scalable, Less Maintainable, More Representative
Focus on Simulation Time Reduction
• Benchmark Subsetting [Eeckhout et al., PACT’02] [Vandierendonck et al., CAECW’04] [Phansalkar et al., ISPASS’05] [Eeckhout et al., IISWC’05]
• Statistical Sampling [Conte et al., ICCD’96] [Wunderlich et al., ISCA’03]
• Representative Sampling [Sherwood et al., ASPLOS’02]
• Reduced Input Set [KleinOsowski, CAN’04]
• Statistical Simulation & Synthetic Workloads [Oskin et al., ISCA’00] [Eeckhout et al., ISPASS’00] [Nussbaum et al., PACT’01] [Bell et al., ICS’05]
• Analytical Modeling [Noonburg et al., MICRO’94] [Karkhanis et al., ISCA’04]
• Speedup Simulation [Schnarr et al., ASPLOS’98] [Loh et al., SIGMETRICS’01]
[Figure axes: Benchmark Run Length; Microprocessor Complexity]
Motivation : Benchmarking Challenges
• Using Real-World Applications as Benchmarks
• Proprietary Nature of Real-World Applications
• Single-Point Performance Characterization
• Application Benchmarks are Rigid
• Applications Evolve Faster than Benchmarks
• Benchmark Suites are Costly to Develop, Maintain, and Upgrade
• Studying Commercial Workload Performance
• Early Design Stage Power/Performance Studies
• Usefulness of Synthetic Benchmarks Beyond Simulation Time Reduction
Resurgence of Synthetic Benchmarks…
IEEE Computer, August 2003
Workload Synthesis: Central Idea
Just 40 workload characteristics
Application Behavior Space
‘Knobs’ for Changing Program Characteristics
Workload Synthesis Algorithm
Workload Synthesizer
Synthetic Benchmark Compile and Execute
Real Hardware or RTL
ADD R1, R2, R3
LD R4, R1, R6
MUL R3, R6, R7
ADD R3, R2, R5
DIV R10, R2, R1
SUB R3, R5, R6
STORE R3, R10, R20
ADD R1, R2, R3
LD R4, R1, R6
MUL R3, R6, R7
ADD R3, R2, R5
DIV R10, R2, R1
SUB R3, R5, R1
BEQ R3, R6, LOOP
SUB R3, R5, R6
STORE R3, R10, R20
DIV R10, R2, R1
…
Execution Driven Simulator
Modeling Real-World Applications
Microarchitecture-Independent Workload Profiling
Modeling Workload Attributes into Synthetic Workload
Experiment Environment
Real-World Proprietary Workload
Workload Profiler
Binary Instrumentation or Simulation
Real Hardware
Workload Profile = Workload Attributes + Distribution of Attribute Values
Workload Synthesizer
Synthetic Benchmark Clone
Execution Driven Simulator
Workload Characteristics as ‘Knobs’
Category | Num. | Characteristic
instruction mix | 10 | percentage of integer short latency, integer long latency, floating-point short latency, floating-point long latency, integer load, integer store, floating-point load, floating-point store, and branches
instruction-level parallelism | 8 | register-dependency-distance: 8 distributions for register dependencies, i.e. the percentage of dependencies that have a distance equal to 1 instruction, and of up to 2, 4, 6, 8, 16, 32, and greater than 32 instructions
data locality | 1 | data footprint
data locality | 10 | distribution of local stride values
instruction locality | 1 | instruction footprint
branch predictability | 10 | distribution of branch transition rate
Capturing The Essence of Workloads
Attributes to capture inherent workload behavior
– Data Locality: Dominant strides of static Load/Store
– Control Flow Predictability: Branch transition rate
Modeling Locality & Control Flow Predictability
– Data Locality of Integer, Scientific, and Embedded Workloads effectively modeled using circular streams
– Replicating transition-rate of static branches
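One possible reading of "circular streams" is a strided walk over a fixed number of elements that wraps back to its start, which keeps the data footprint bounded. The sketch below assumes that reading; `circular_stream`, `base`, `stride`, and `length` are hypothetical names chosen for illustration, not BenchMaker's own.

```python
from itertools import islice


def circular_stream(base, stride, length):
    """Yield base, base + stride, ... and wrap around after `length` elements.

    Assumption: a circular stream revisits the same `length` elements
    forever, so its footprint is `length` elements.
    """
    index = 0
    while True:
        yield base + index * stride
        index = (index + 1) % length


# First six addresses of a 4-element stream with a 4-byte stride.
first_six = list(islice(circular_stream(base=200, stride=4, length=4), 6))
print(first_six)  # [200, 204, 208, 212, 200, 204]
```

In this reading, `stride` would come from the profiled dominant stride and `length` from the data footprint.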
Modeling Data Access Pattern
• Identify streams of data references
• A Stream?
– Sequence of memory addresses in an arithmetic progression
– Elements of arrays A, B, and C form 3 streams
for (ii = 0; ii < N; ii++) A[ii] = B[ii] + C[ii];
Streams: 200, 204, 208, ... / 320, 324, 328, ... / 404, 408, 412, ...
Issuing Sequence: 320, 404, 200, 324, 408, 204, ...
• Streams are interleaved and may contain noise: 4, 8, 12, 16, 1, 3, 20, 24, 5, 7, 2, 9, 11, 28, …
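The interleaving in the array example can be reproduced with a few lines of code. This is a minimal sketch: the base addresses come from the slide, while the `interleave` helper and the assumption that the two loads issue before the store are illustrative only.

```python
def interleave(streams, rounds):
    """Issue one address from each (base, stride) stream per round."""
    sequence = []
    for step in range(rounds):
        for base, stride in streams:
            sequence.append(base + step * stride)
    return sequence


# Assumed order within an iteration: two loads, then the store.
streams = [(320, 4), (404, 4), (200, 4)]
issuing_sequence = interleave(streams, rounds=2)
print(issuing_sequence)  # [320, 404, 200, 324, 408, 204]
```

Each stream on its own is an arithmetic progression; only the merged sequence looks irregular.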
Extracting Streams
Reference pattern of static Load / Store Instructions
– PC-correlated spatial locality
  - Dependence on address referenced by nearby Ld / St
  - Programs with pointer chasing codes
– PC-correlated temporal locality
  - Dependence on previous address generated by same Ld / St
  - Programs with multidimensional arrays
Could static Load / Store instructions be natural sources of streams ?
Profile every static Load / Store instruction
– Number of different strides with which it accesses data
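Profiling every static load/store for its strides might look like the sketch below. It assumes a trace of `(pc, address)` pairs in execution order; the trace format, the PC values, and the `stride_profile` name are hypothetical and not taken from the BenchMaker profiler.

```python
from collections import Counter, defaultdict


def stride_profile(trace):
    """Map each static load/store PC to a Counter of the strides it shows.

    A stride is the difference between two consecutive addresses
    generated by the same static instruction.
    """
    last_address = {}
    strides = defaultdict(Counter)
    for pc, address in trace:
        if pc in last_address:
            strides[pc][address - last_address[pc]] += 1
        last_address[pc] = address
    return strides


# Hypothetical trace: three static instructions, one per array.
trace = []
for i in range(4):
    trace += [(0x10, 320 + 4 * i), (0x14, 404 + 4 * i), (0x18, 200 + 4 * i)]

profile = stride_profile(trace)
for pc, counts in sorted(profile.items()):
    dominant = counts.most_common(1)[0][0]
    print(hex(pc), "distinct strides:", len(counts), "dominant:", dominant)
```

Here every static instruction shows a single stride of 4, so each one is a natural source of one stream.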
Modeling Instruction Level Parallelism
Dependency Distance
ADD R1, R3, R4
MUL R5, R3, R2
ADD R5, R3, R6
LD R4, (R1)
SUB R8, R2, R1
Read After Write Dependency Distance = 3 (the first ADD writes R1; the LD, three instructions later, reads it)
Measure Distribution of Dependency Distances: Up to 1, Up to 2, Up to 4, Up to 8, Up to 16, Up to 32, >32
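The dependency-distance measurement can be sketched as follows. The instruction encoding `(dest, sources)`, the destination-first operand order, and both function names are assumptions made for illustration; the bucket limits follow the slide.

```python
def raw_distances(instructions):
    """Return read-after-write distances, counted in instructions.

    Each instruction is (dest, sources); dest may be None when nothing
    is written. Operand order (destination first) is an assumption.
    """
    last_writer = {}
    distances = []
    for index, (dest, sources) in enumerate(instructions):
        for reg in sources:
            if reg in last_writer:
                distances.append(index - last_writer[reg])
        if dest is not None:
            last_writer[dest] = index
    return distances


def cumulative_distribution(distances, limits=(1, 2, 4, 8, 16, 32)):
    """Fraction of dependencies with distance up to each limit, plus the tail."""
    total = len(distances)
    if total == 0:
        return {}
    result = {f"up to {m}": sum(d <= m for d in distances) / total for m in limits}
    result[f"> {limits[-1]}"] = sum(d > limits[-1] for d in distances) / total
    return result


# The five-instruction example from the slide.
program = [
    ("R1", ["R3", "R4"]),  # ADD R1, R3, R4
    ("R5", ["R3", "R2"]),  # MUL R5, R3, R2
    ("R5", ["R3", "R6"]),  # ADD R5, R3, R6
    ("R4", ["R1"]),        # LD  R4, (R1)
    ("R8", ["R2", "R1"]),  # SUB R8, R2, R1
]
distances = raw_distances(program)
print(distances)  # [3, 4]
print(cumulative_distribution(distances))
```

The first entry, 3, is the ADD-to-LD dependency through R1 shown on the slide; the second is the SUB reading the same R1.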
Modeling Control Flow Predictability
Capture behavior of easy and difficult to predict branches
Inherent program feature that captures branch behavior: Transition Rate [Haungs et al., HPCA’00]
Transition Rate = # of Taken-Not Taken transitions / # of times executed
• Branches with low transition-rate (easier to predict): TTTTTTTTTN, NNNNNNNNNT
• Branches with high transition-rate (easier to predict): TNTNTNTNTN
• Branches with moderate transition-rate (tougher to predict)
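The transition-rate definition above translates directly into code. This is a minimal sketch; representing a branch history as a string of `T` and `N` characters and the `transition_rate` name are choices made here for illustration.

```python
def transition_rate(outcomes):
    """Taken/not-taken transitions divided by the number of executions."""
    if not outcomes:
        return 0.0
    transitions = sum(a != b for a, b in zip(outcomes, outcomes[1:]))
    return transitions / len(outcomes)


for history in ("TTTTTTTTTN", "NNNNNNNNNT", "TNTNTNTNTN"):
    print(history, transition_rate(history))
# TTTTTTTTTN 0.1
# NNNNNNNNNT 0.1
# TNTNTNTNTN 0.9
```

Both extremes follow a simple pattern (almost always the same direction, or strictly alternating), which is why both are marked easier to predict.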
Workload Synthesis (1)
Workload Profile: Instruction Mix, Register Dependency Distance, Stride Pattern of Load/Store, Branch Transition Rate, Branch Transition Probabilities
[Figure: control flow graph of basic blocks A, B, C, D, each ending in a branch (BR), with edge probabilities 0.8, 0.2, 1.0, 1.0, 0.1, 0.9]
1 Big Loop: A B D A B D A C D A B D
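One way to picture how a block sequence such as the "1 Big Loop" could be drawn from branch transition probabilities is a random walk over the graph. The sketch below is illustrative only: the edge targets are assumed (only the probabilities survive in the transcript), the 0.1/0.9 split at D is not modeled and D simply returns to A, and `walk` is a hypothetical helper rather than the BenchMaker algorithm.

```python
import random

# Assumed edges: A goes to B with 0.8 and to C with 0.2; B and C go to D.
CFG = {
    "A": [("B", 0.8), ("C", 0.2)],
    "B": [("D", 1.0)],
    "C": [("D", 1.0)],
    "D": [("A", 1.0)],  # assumption: close the loop back to A
}


def walk(cfg, start, length, seed=0):
    """Random walk of `length` basic blocks following edge probabilities."""
    rng = random.Random(seed)
    block, sequence = start, []
    for _ in range(length):
        sequence.append(block)
        targets, weights = zip(*cfg[block])
        block = rng.choices(targets, weights=weights)[0]
    return sequence


loop_body = walk(CFG, "A", 12)
print(" ".join(loop_body))
```

With these assumed edges, every group of three blocks is either A B D or A C D, with A B D the more frequent, as in the loop shown on the slide.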
Workload Synthesis (2)
Workload Profile: Instruction Mix, Register Dependency Distance, Stride Pattern of Load/Store, Branch Transition Rate, Branch Transition Probabilities
[Figure: control flow graph of basic blocks A, B, C, D, each ending in a branch (BR), with edge probabilities 0.8, 0.2, 1.0, 1.0, 0.1, 0.9]
1 Big Loop: A B D A B D A C D A B D
Memory Access Model (Strides)
Workload Synthesis (3)
Workload Profile: Instruction Mix, Register Dependency Distance, Stride Pattern of Load/Store, Branch Transition Rate, Branch Transition Probabilities
[Figure: control flow graph of basic blocks A, B, C, D, each ending in a branch (BR), with edge probabilities 0.8, 0.2, 1.0, 1.0, 0.1, 0.9]
1 Big Loop: A B D A B D A C D A B D
Memory Access Model (Strides)
Branching Model – Based on Transition Rate
Workload Synthesis (4)
Workload Profile: Instruction Mix, Register Dependency Distance, Stride Pattern of Load/Store, Branch Transition Rate, Branch Transition Probabilities
[Figure: control flow graph of basic blocks A, B, C, D, each ending in a branch (BR), with edge probabilities 0.8, 0.2, 1.0, 1.0, 0.1, 0.9]
1 Big Loop: A B D A B D A C D A B D
Memory Access Model (Strides)
Branching Model – Based on Transition Rate
Register Assignment
C code with asm & volatile constructs
Evaluation of BenchMaker
SPEC CPU2000, SPECjbb2005, and DBT2 workloads
Validated Sim-Alpha Performance Model of Alpha 21264

Benchmark | Input | SimPoint(s)
SPEC CPU2000 Integer
bzip2 | graphic | 553
crafty | ref | 774
eon | rushmeier | 403
gcc | 166.i | 389
gzip | graphic | 389
mcf | ref | 553
perlbmk | perfect-ref | 5
twolf | ref | 1066
vortex | lendian1 | 271
vpr | route | 476
gcc | expr | 8, 24, 47, 51, 56, 73, 87, 99
SPEC CPU95 Integer
gcc | expr | 0, 3, 5, 6, 7, 8, 9, 10, 12
Performance Correlation
[Figure: Original Benchmark vs. Synthetic Benchmark; y-axis from 0 to 1.8]
Trade Accuracy for Flexibility – Average Error of 11%
Energy/Power Correlation
[Figure: Original Benchmark vs. Synthetic Benchmark; y-axis from 0 to 35]
Average Error of 13%
Altering Individual Program Characteristics
[Figure: y-axis from 0 to 1.4; x-axis: Percentage of References with Stride Value 0, from 0 to 100]
Interaction of Program Characteristics
[Figure: one curve each for Data Footprint 300K, 600K, and 900K; y-axis from 0 to 0.35; x-axis: Percentage of References with Stride Value 0, from 0 to 100]
Modeling Impact of Benchmark Drift
Increase in Code Footprint (hypothetical)
[Figure: y-axis from 0 to 1.2; x-axis: Factor by which code size is increased, from 1 to 8]
Increase in Data Footprint from SPEC CPU95 to SPEC CPU2000 for gcc (Model with 7% accuracy)
Summary
Synthetic Benchmarks to Address Benchmarking Challenges
Constructing Synthetic Benchmarks from Hardware-Independent Characteristics
Applications of Synthetic Benchmarks
- Altering Program Characteristics
- Studying Interaction of Program Characteristics
- Modeling Benchmark Drift
Questions?
Ajay’s email: [email protected]