Design and Evaluation of
Architectures for
Commercial Applications
Part II: tools & methods
Luiz André Barroso
Western Research Laboratory
Overview
Evaluation methods/tools
Introduction
Software instrumentation (ATOM)
Hardware measurement & profiling
– IPROBE
– DCPI
– ProfileMe
Tracing & trace-driven simulation
User-level simulators
Complete machine simulators (SimOS)
UPC, February 1999
Studying commercial applications: challenges
Size of the data sets and programs
Complex control flow
Complex interactions with Operating System
Difficult tuning process
Lack of access to source code (?)
Vendor restrictions on publications
It is important to have a rich set of tools
Tools are useful in many phases
Understanding behavior of workloads
Tuning
Performance measurements in existing systems
Performance estimation for future systems
Using ordinary system tools
Measuring CPU utilization and balance
Determining user/system breakdown
Detecting I/O bottlenecks
Disks
Networks
Monitoring memory utilization and swap activity
Gathering symbol table information
Most database programs are large statically linked
stripped binaries
Most tools will require symbol table information
However, distributions typically consist of object
files with symbolic data
Simple trick: replace the system linker with a wrapper that removes the
“strip” flag, then calls the real linker
ATOM: A Tool-Building System
Developed at WRL by Alan Eustace & Amitabh Srivastava
Easy to build new tools
Flexible enough to build interesting tools
Fast enough to run on real applications
Compiler independent: works on existing binaries
Code Instrumentation
[Diagram: the TOOL is inserted into the application binary like a Trojan
horse]
Application appears unchanged
ATOM adds code and data to the application
Information collected as a side effect of execution
ATOM Programming Interface
Given an application program:
Navigation: Move around
Interrogation: Ask questions
Definition: Define interface to analysis procedures
Instrumentation: Add calls to analysis procedures
Pass ANYTHING as arguments!
PC, effective addresses, constants, register
values, arrays, function arguments, line
numbers, procedure names, file names, etc.
Navigation Primitives
Get{First,Last,Next,Prev}Obj
Get{First,Last,Next,Prev}ObjProc
Get{First,Last,Next,Prev}Block
Get{First,Last,Next,Prev}Inst
GetInstBlock - Find enclosing block
GetBlockProc - Find enclosing procedure
GetProcObj - Find enclosing object
GetInstBranchTarget - Find branch target
ResolveTargetProc - Find subroutine destination
Interrogation
GetProgramInfo(PInfo)
  Number of procedures, blocks, and instructions
  Text and data addresses
GetProcInfo(Proc *, ProcInfo)
  Number of blocks or instructions
  Procedure frame size, integer and floating point save masks
GetBlockInfo(Block *, BlockInfo)
  Number of instructions
GetInstInfo(Inst *, InstInfo)
  Any piece of the instruction (opcode, ra, rb, displacement)
Interrogation(2)
ProcFileName
  Returns the file name for this procedure
InstLineNo
  Returns the line number of this instruction
GetInstRegEnum
  Returns a unique register specifier
GetInstRegUsage
  Computes source and destination register masks
Interrogation(3)
GetInstRegUsage
Computes instruction source and destination masks
GetInstRegUsage(instFirst, &usageFirst);
GetInstRegUsage(instSecond, &usageSecond);
if (usageFirst.dreg_bitvec[0] & usageSecond.ureg_bitvec[0]) {
    /* set followed by a use */
}

Exactly what you need to find static pipeline stalls!
Definition
AddCallProto(“function(argument list)”)
Constants
Character strings
Program counter
Register contents
Cycle counter
Constant arrays
Effective Addresses
Branch Condition Values
Instrumentation
AddCallProgram(Program{Before,After}, “name”,args)
AddCallProc(p, Proc{Before,After}, “name”,args)
AddCallBlock(b, Block{Before,After}, “name”,args)
AddCallInst(i, Inst{Before,After}, “name”,args)
ReplaceProc(p, “new”)
Example #1: Procedure Tracing
What procedures are executed by the following
mystery program?
#include <stdio.h>
main() {
    printf("Hello world!\n");
}
Hint: main => printf => ???
Procedure Tracing Example
> cc hello.c -non_shared -g1 -o hello
> atom hello ptrace.inst.c ptrace.anal.c -o hw.ptrace
> hw.ptrace
=> __start
=> main
=> printf
=> _doprnt
=> __getmbcurmax
<= __getmbcurmax
=> memcpy
<= memcpy
=> fwrite
Procedure Trace (2)
=> _wrtchk
=> _findbuf
=> __geterrno
<= __geterrno
=> __isatty
=> __ioctl
<= __ioctl
<= __isatty
=> __seterrno
<= __seterrno
<= _findbuf
<= _wrtchk
=> memcpy
<= memcpy
=> memchr
<= memchr
=> _xflsbuf
=> __write
Hello world!
<= __write
<= _xflsbuf
<= fwrite
<= _doprnt
<= printf
<= main
=> exit
=> __ldr_atexit
=> __ldr_context_atexit
<= __ldr_context_atexit
<= __ldr_atexit
=> _cleanup
=> fclose
=> fflush
<= fflush
=> __close
<= __close
<= fclose
=> fclose
=> fflush
<= fflush
=> __close
<= __close
<= fclose
=> fclose
=> __close
Example #2: Cache Simulator
Write a tool that computes the miss rate of the
application running in a 64KB, direct mapped data
cache with 32 byte lines.
> atom spice cache.inst.o cache.anal.o -o spice.cache
> spice.cache < ref.in > ref.out
> more cache.out
5,387,822,402 620,855,884 11.523%
Great use for 64 bit integers!
Cache Tool Implementation
Application:                     Instrumentation added:

main:   lw    v1,-32592(gp)      Reference(-32592(gp));
        move  v0,zero
        li    a0,20
loop:   addiu v1,v1,4
        addiu v0,v0,4
        sw    v1,-32592(gp)      Reference(-32592(gp));
        bne   v0,a0,loop
        jr    ra                 PrintResults();

Note: passes addresses exactly as if uninstrumented!
Cache Instrumentation File
#include <stdio.h>
#include <cmplrs/atom.inst.h>

unsigned InstrumentAll(int argc, char **argv) {
    Obj *o; Proc *p; Block *b; Inst *i;
    AddCallProto("Reference(VALUE)");
    AddCallProto("Print()");
    for (o = GetFirstObj(); o != NULL; o = GetNextObj(o)) {
        if (BuildObj(o)) return (1);
        if (o == GetFirstObj()) AddCallObj(o, ObjAfter, "Print");
        for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
            for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b))
                for (i = GetFirstInst(b); i != NULL; i = GetNextInst(i))
                    if (IsInstType(i, InstTypeLoad) || IsInstType(i, InstTypeStore))
                        AddCallInst(i, InstBefore, "Reference", EffAddrValue);
        WriteObj(o);
    }
    return (0);
}
Cache Analysis File
#include <stdio.h>
#define CACHE_SIZE 65536
#define BLOCK_SHIFT 5

long cache[CACHE_SIZE >> BLOCK_SHIFT], refs, misses;

void Reference(long address) {
    int index = (address & (CACHE_SIZE - 1)) >> BLOCK_SHIFT;
    long tag = address >> BLOCK_SHIFT;
    if (cache[index] != tag) { misses++; cache[index] = tag; }
    refs++;
}

void Print() {
    FILE *file = fopen("cache.out", "w");
    fprintf(file, "%ld %ld %.2f\n", refs, misses, 100.0 * misses / refs);
    fclose(file);
}
Example #3: TPC-B runtime information
Statistics per transaction:
  Instructions        180,398
  Loads (% shared)    47,643 (24%)
  Stores (% shared)   21,380 (22%)
  Lock/Unlock         118
  MBs                 241

Footprints/CPU:
  Instr.          300 KB (1.6 MB in pages)
  Private data    470 KB (4 MB in pages)
  Shared data     7 MB (26 MB in pages)

– 50% of the shared data footprint is touched by at least one other
  process
TPC-B (2)
Memory Footprint vs. Transactions
[Chart: footprint in bytes (0 to 8x10^6) vs. number of transactions,
plotted separately for instructions, private data, and shared data]
TPC-B (3)
Memory Footprint vs. Server processes
[Chart: footprint in bytes (0 to 8x10^6) vs. number of server processes
(1 to 6), plotted separately for shared data, private data, and
instructions]
Oracle SGA activity in TPC-B
ATOM wrap-up
Very flexible “hack-it-yourself” tool
Discover detailed information on dynamic behavior
of programs
Especially good when you don’t have source code
Shipped with Digital Unix
Can be used for tracing (later)
Hardware measurement tools
IPROBE: interface to CPU event counters
DCPI: hardware-assisted profiling
ProfileMe: hardware-assisted profiling for complex CPU cores
IPROBE
Developed by Digital’s Performance Group
Use event counters provided by Alphas
Operation:
set counter to monitor a particular event (e.g., icache_miss)
start counter
every counter overflow, interrupt wakes up handler and events
are accumulated
stop counter and read total
User can select:
which processes to count
user level, kernel level, both
IPROBE: 21164 event types
issues
single_issue_cycles
long_stalls
cycles
dual_issue_cycles
triple_issue_cycles
quad_issue_cycles
split_issue_cycles
pipe_dry
pipe_frozen
replay_trap
branches
cond_branches
jsr_ret
integer_ops
float_ops
loads
stores
icache_access
dcache_access
scache_access
scache_read
scache_write
bcache_hit
bcache_victim
sys_req
branch_mispr
pc_mispr
icache_miss
dcache_miss
dtb_miss
loads_merged
ldu_replays
cycles
scache_miss
scache_read_miss
scache_write
scache_sh_write
scache_write_miss
bcache_miss
sys_inv
itb_miss
wb_maf_full_replays
sys_read_req
external
mem_barrier_cycles
load_locked
scache_victim
IPROBE: what you can do
Directly measure relevant events (e.g. cache
performance)
Overall CPU cycle breakdown diagnosis:
microbenchmark machine to estimate latencies
combine latencies with event counts
Main source of inaccuracy: load/store overlap in the memory system
IPROBE example: 4-CPU SMP
[Pie charts:
Breakdown of CPU cycles (CPI = 7.4): issuing 10%, instruction stall 44%,
data stall 46%.
Estimated breakdown of stall cycles: Bcache miss 42%, Bcache hit 27%,
Scache hit 16%, TLB 6%, Mem. barrier 5%, Branch mispr. 2%, Replay trap
2%.]
Why did it run so badly?!?
Nominal memory latencies were good: 80 cycles
Micro-benchmarks determined that:
latency under load is over 120 cycles on 4
processors
base dirty miss latency was over 130 cycles
off-chip cache latency was high
IPROBE data uncovered significant sharing:
for P=2, 15% of bcache misses are to dirty blocks
for P=4, 20% of bcache misses are to dirty blocks
Dirty miss latency on RISC SMPs
SPEC benchmarks have no significant sharing
Current processors/systems optimize local cache
access
All RISC SMPs have high dirty miss penalties
Distribution of bus stall latencies for dirty misses:
[Histogram: % of dirty misses (0 to 60) vs. bus stall latency in bus
cycles (1 to 16)]
DCPI: continuous profiling infrastructure
Developed by SRC and WRL researchers
Based on periodic sampling
Hardware generates periodic interrupts
OS handles the interrupts and stores data
Program Counter (PC) and any extra info
Analysis Tools convert data
for users
for compilers
Other examples:
SGI Speedshop, Unix’s prof(), VTune
Sampling vs. Instrumentation
Much lower overhead than instrumentation
DCPI: program 1%-3% slower
Pixie: program 2-3 times slower
Applicable to large workloads
100,000 TPS on Alpha
AltaVista
Easier to apply to whole systems (kernel, device
drivers, shared libraries, ...)
Instrumenting kernels is very tricky
No source code needed
Information from Profiles
DCPI estimates:
  Where CPU cycles went, broken down by image, procedure, instruction
  How often code was executed: basic blocks and CFG edges
  Where peak performance was lost and why
Example: Getting the Big Picture
Total samples for event type cycles = 6095201

 cycles       %    cum%  load file
2257103  37.03%  37.03%  /usr/shlib/X11/lib_dec_ffb_ev5.so
1658462  27.21%  64.24%  /vmunix
 928318  15.23%  79.47%  /usr/shlib/X11/libmi.so
 650299  10.67%  90.14%  /usr/shlib/X11/libos.so

 cycles       %    cum%  procedure              load file
2064143  33.87%  33.87%  ffb8ZeroPolyArc        /usr/shlib/X11/lib_dec_ffb_ev5.so
 517464   8.49%  42.35%  ReadRequestFromClient  /usr/shlib/X11/libos.so
 305072   5.01%  47.36%  miCreateETandAET       /usr/shlib/X11/libmi.so
 271158   4.45%  51.81%  miZeroArcSetup         /usr/shlib/X11/libmi.so
 245450   4.03%  55.84%  bcopy                  /vmunix
 209835   3.44%  59.28%  Dispatch               /usr/shlib/X11/libdix.so
 186413   3.06%  62.34%  ffb8FillPolygon        /usr/shlib/X11/lib_dec_ffb_ev5.so
 170723   2.80%  65.14%  in_checksum            /vmunix
 161326   2.65%  67.78%  miInsertEdgeInET       /usr/shlib/X11/libmi.so
 133768   2.19%  69.98%  miX1Y1X2Y2InRegion     /usr/shlib/X11/libmi.so
Example: Using the Microscope
Address  Instruction      Samples  Culprits  CPI
9618     addq s0,t6,t6    643      b, D      3.5 cycles
961c     ldl  t4,0(t6)    2111     a, d, i   21.0 cycles
9620     xor  t4,t12,t5   14152              1.0 cycles
9624     beq  0x963c      0                  0.0 cycles

(a = data dep on 1st operand, b = data dep on 2nd operand,
 D = DTLB miss, d = d-cache miss, i = i-cache miss)

Where peak performance is lost and why
Example: Summarizing Stalls
I-cache (not ITB)    0.0% to  0.3%
ITB/I-cache miss     0.0% to  0.0%
D-cache miss        27.9% to 27.9%
DTB miss             9.2% to 18.3%
Write buffer         0.0% to  6.3%
Synchronization      0.0% to  0.0%
Branch mispredict    0.0% to  2.6%
IMUL busy            0.0% to  0.0%
FDIV busy            0.0% to  0.0%
Other                0.0% to  0.0%
Unexplained stall    2.3% to  2.3%
Unexplained gain    -4.3% to -4.3%
------------------------------------
Subtotal dynamic    44.1%

Slotting             1.8%
Ra dependency        2.0%
Rb dependency        1.0%
Rc dependency        0.0%
FU dependency        0.0%
------------------------------------
Subtotal static      4.8%
------------------------------------
Total stall         48.9%
Execution           51.2%
Net sampling error  -0.1%
------------------------------------
Total tallied      100.0%
(35171, 93.1% of all samples)
Example: Sorting Stalls
    %  cum%  cycles  cnt   cpi   blame   PC    file:line
10.0% 10.0%  109885  4998  22.0  dcache  957c  comp.c:484
 9.9% 19.8%  108776  5513  19.7  dcache  9530  comp.c:477
 7.8% 27.6%   85668  3836  22.3  dcache  959c  comp.c:488
Typical Hardware Support
Timers
Clock interrupt after N units of time
Performance Counters
Interrupt after N events: cycles, issues, loads, L1 Dcache misses,
branch mispredicts, uops retired, ...
Alpha 21064, 21164; PPro, PII;…
Easy to measure total cycles, issues, CPI, etc.
Only extra information is restart PC
Problem: Inaccurate Attribution
Experiment: count data loads in a loop with a single load plus hundreds
of nops; histogram the restart PCs.

In-order processor (Alpha 21164): constant skew; one large peak (782
samples) offset from the load.
Out-of-order processor (Intel Pentium Pro): skew and smear; samples
spread over many instructions.

[Histograms of restart PCs]
Ramification of Misattribution
No skew or smear
Instruction-level analysis is easy!
Skew is a constant number of cycles
Instruction-level analysis is possible
Adjust sampling period by amount of skew
Infer execution counts, CPI, stalls, and stall
explanations from cycles samples and program
Smear
Instruction-level analysis seems hopeless
Examples: PII, StrongARM
Desired Hardware Support
Sample fetched instructions
Save PC of sampled instruction
E.g., interrupt handler reads Internal Processor
Register
Makes skew and smear irrelevant
Gather more information
ProfileMe: Instruction-Centric Profiling

[Pipeline diagram: fetch → map → issue → exec → retire, with icache,
dcache, branch predictor, and arithmetic units. A fetch counter overflow
randomly selects an instruction to tag ("ProfileMe tag!"). As the tagged
instruction flows down the pipeline, internal processor registers
capture its PC, cache miss and branch mispredict flags, branch history,
per-stage latencies, effective address, and whether it retired; an
interrupt then lets software read them.]
Instruction-Level Statistics
PC + Retire Status      →  execution frequency
PC + Cache Miss Flag    →  cache miss rates
PC + Branch Mispredict  →  mispredict rates
PC + Event Flag         →  event rates
PC + Branch Direction   →  edge frequencies
PC + Branch History     →  path execution rates
PC + Latency            →  instruction stalls
  (“100-cycle dcache miss” vs. “dcache miss”)
Data Analysis
[Diagram: compiled code + samples → ANALYSIS → frequency, cycles per
instruction, stall explanations]

Cycle samples are proportional to total time at the head of the issue
queue (at least on in-order Alphas)
Frequency indicates frequent paths
CPI indicates stalls
Estimating Frequency from Samples
Problem:
  given cycle samples, compute frequency and CPI
  1,000,000 cycle samples could mean 1,000,000 executions at 1 CPI,
  or 10,000 executions at 100 CPI

Approach:
  Let F = Frequency / Sampling Period
  E(Cycle Samples) = F × CPI
  So … F = E(Cycle Samples) / CPI
Estimating Frequency (cont.)
F = E(Cycle Samples) / CPI
Idea
If no dynamic stall, then know CPI, so can estimate F
So… assume some instructions have no dynamic stalls
Consider a group of instructions with the same frequency
(e.g., basic block)
Identify instructions w/o dynamic stalls; then average their
sample counts for better accuracy
Key insight: instructions without stalls have smaller sample counts
Estimating Frequency (Example)
Address  Instruction            Samples  MinCPI  Samples/MinCPI
9600     subl   s6, a1, s6          792       1     792
9604     lda    a3, 16411(s6)       611       1     611
9608     cmovlt s6, a3, s6          649       1     649
960c     bis    zero, zero, s3        0       0     -
9610     sll    s6, 0x5, t6        1389       2     695
9614     addl   zero, t6, t6        616       1     616
9618     addq   s0, t6, t6          643       1     643
961c     ldl    t4, 0(t6)          2111       1     2111
9620     xor    t4, t12, t5       13152       2     6576
9624     beq    t5, 963c              0       0     -

Compute MinCPI from code
Compute Samples/MinCPI
Select data to average: Estimate 630 (Actual 615)

Does badly when:
  Few issue points
  All issue points stall
Frequency Estimate Accuracy
Compare frequency estimates for blocks to
measured values obtained with pixie-like tool
Explaining Stalls
Static stalls
Schedule instructions in each basic block
optimistically using a detailed pipeline model for the
processor
Dynamic stalls
Start with all possible explanations
– I-cache miss, D-cache miss, DTB miss, branch
mispredict, ...
Rule out unlikely explanations
List the remaining possibilities
Ruling Out D-cache Misses
Is the previous occurrence of an operand register
the destination of a load instruction?
ldq  t0,0(s1)                addq t3,t0,t4
subq t0,t1,t2        OR      subq t0,t1,t2
Search backward across basic block boundaries
Prune by block and edge execution frequencies
DCPI wrap-up
Very precise, non-intrusive profiling tool
Gathers both user-level and kernel profiles
Relates architectural events back to original code
Used for profile-based code optimizations
Simulation of commercial workloads
Requires scaling down
Options:
Trace-driven simulation
User-level execution-driven simulation
Complete machine simulation
Trace-driven simulation
Methodology:
  create an ATOM instrumentation tool that logs a complete trace per
  Oracle server process:
    – instruction path
    – data accesses
    – synchronization accesses
    – system calls
  run the “atomized” version to derive the trace
  feed the traces to the simulator
Trace-driven studies: limitations
No OS activity (in OLTP OS takes 10-15% of the
time)
Trace selected processes only (e.g. server
processes)
Time dilation alters system behavior
I/O looks faster
many places with hardwired timeout values have to
be patched
Capturing synchronization correctly is difficult
need to reproduce correct concurrency for shared
data structures
DB has complex synchronization structure, many
levels of procedures
Trace-driven studies: limitations(2)
Scheduling traces into simulated processors
need enough information in the trace to reproduce
OS scheduling
need to suspend processes for I/O & other blocking
operations
need to model activity of background processes that
are not traced (e.g. log writer)
Re-create OS virtual-physical mapping, page
coloring scheme
Very difficult to simulate wrong-path execution
User-level execution-driven simulator
Our approach was to modify AINT (MINT for the Alpha)
Problems:
no OS activity measured
Oracle/OS interactions are very complex
OS system call interface has to be virtualized
That’s a hard one to crack…
Our status:
Oracle/TPC-B ran with 1 server process only
we gave up...
Complete machine simulator
Bite the bullet: model the machine at the hardware
level
The good news is:
hardware interface is cleaner & better documented
than any software interface (including OS)
all software JUST RUNS!! Including OS
applications don’t have to be ported to simulator
We ported SimOS (from Stanford) to Alpha
SimOS
A complete machine simulator
Speed-detail tradeoff for maximum flexibility
Flexible data collection and classification
Originally developed at Stanford University (MIPS ISA)
SimOS-Alpha effort started at WRL in Fall 1996
62
Ed Bugnion, Luiz Barroso, Kourosh Gharachorloo, Ben
Verghese, Basem Nayfeh, and Jamey Hicks (CRL)
UPC, February 1999
SimOS - Complete Machine Simulation
[Diagram: workloads (Pmake, Oracle, VCS) run on the operating system of
the simulated machine; SimOS models the hardware (CPU/MMU, caches,
memory system, disks, TTY, Ethernet) on top of the host machine]

Models CPUs, caches, buses, memory, disks, network, …
Complete enough to run an OS and any applications
Multiple Levels of Detail
Tradeoff between speed of simulation and the
amount of detail that is simulated
Multiple modes of CPU simulation
Fast “on-the-fly compilation” mode: 10X slowdown!
  – Workload placement
Simple pipeline emulator, no caches: 50-100X slowdown
  – Rough characterization
Simple pipeline emulator, full cache simulation: 100-200X slowdown
  – More accurate characterization of workloads
Multiple Models for each Component
Multiple models for CPU, cache, memory, and disk:

CPU
  simple pipeline emulator: 100-200X slowdown (EV5)
  dynamically-scheduled processor: 1000-10000X slowdown (e.g. 21264)
Caches
  two-level set-associative caches
  shared caches
Memory
  Perfect (0-latency), Bus-based (Tlaser), NUMA (Wildfire)
Disk
  fixed latency or more complex HP disk model

Modular: add your own flavors
Checkpoint and Sampling
Checkpoint capability for entire machine state
CPU state, main memory, and disk changes
Important for positioning workload for detailed simulation
Switching detail level in a “sampling” study
Run in faster modes, sample in more detailed modes
Repeatability
  Change parameters for studies
    – Cache size
    – Memory type and latencies
    – Disk models and latencies
    – Many others
  Debugging race conditions
Data Collection and Classification
Exploits visibility and non-intrusiveness offered by simulation
Can observe low-level events such as cache misses,
references and TLB misses
Tcl-based configuration and control provides ease of use
Powerful annotation mechanism for triggering events
Hardware, OS, or Application
Apps and mechanisms to organize and classify data
  Some already provided (cache miss counts and classification)
  Mechanisms to do more (timing trees and detail tables)
Easy configuration
TCL based configuration of the machine parameters
Example:
set PARAM(CPU.Model)                 DELTA
set detailLevel                      1
set PARAM(CPU.Clock)                 1000
set PARAM(CPU.Count)                 4
set PARAM(CACHE.2Level.L2Size)       1024
set PARAM(CACHE.2Level.L2Line)       64
set PARAM(CACHE.2Level.L2HitTime)    15
set PARAM(MEMSYS.MemSize)            1024
set PARAM(MEMSYS.Numa.NumMemories)   $PARAM(CPU.Count)
set PARAM(MEMSYS.Model)              Numa
set PARAM(DISK.Fixed.Latency)        10
Annotations - The building block
Small procedures to be run on encountering certain events
PC, hardware events (cache miss, TLB, …), simulator events
annotation set pc vmunix::idle_thread:START {
set PROCESS($CPU) idle
annotation exec osEvent startIdle
}
annotation set osEvent switchIn {
log "$CYCLES ContextSwitch
$CPU,$PID($CPU),$PROCESS($CPU)\n"
}
annotation set pc 0x12004ba90 {
incr tpcbTOGO -1
console "TRANSACTION $CYCLES togo=$tpcbTOGO \n"
if {$tpcbTOGO == 0} {simosExit}
}
Example: Kernel Detail (TPCB)
[Pie chart: breakdown of kernel time in TPC-B across SYS_read,
SYS_write, SYS_pid_block, SYS_pid_unblock, lock, Int_clock, Int_IPI,
Int_IO, DTLB, ITLB, 2XTLB, MM_FOW, and Other; the three largest slices
are 30%, 21%, and 17%]
SimOS Methodology
Configure and tune the workload on existing machine
build the database schema, create indexes, load data, optimize
queries
more difficult if simulated system much different from existing
platform
Create file(s) with disk image (dd) of the database disk(s)
write-protect “dd” files to prevent permanent modification (i.e.
use copy-on-write)
optionally, umount disks and let SimOS use them as raw
devices
Configure SimOS to see the “dd” files as raw disks
“Boot” a SimOS configuration and mount the disks
SimOS Methodology (2)
Boot and start up the database engine in “fast mode”
Startup the workload
When in steady state: create a checkpoint and exit
Resume from checkpoint with complex (slower)
simulator
Sample NUMA TPC-B Profile:
[Profile figure]
Running from a Checkpoint
What can be changed:
processor model
disk model
cache sizes, hierarchy, organization, replacement
how long to run the simulation
What cannot be changed:
number of processors
size of physical memory
Tools wrap-up
No single tool will get the job done
Monitoring application execution in a real system is
invaluable
Complete machine simulation advantages:
see the whole thing
portability of software is non-issue
speed/detail trade-off essential for detailed studies