Design and Evaluation of
Architectures for
Commercial Applications
Part II: tools & methods
Luiz André Barroso
Western Research Laboratory
Overview
Evaluation methods/tools
Introduction
Software instrumentation (ATOM)
Hardware measurement & profiling
– IPROBE
– DCPI
– ProfileMe
Tracing & trace-driven simulation
User-level simulators
Complete machine simulators (SimOS)
UPC, February 1999
Studying commercial applications: challenges
Size of the data sets and programs
Complex control flow
Complex interactions with Operating System
Difficult tuning process
Lack of access to source code (?)
Vendor restrictions on publications
It is important to have a rich set of tools
Tools are useful in many phases
Understanding behavior of workloads
Tuning
Performance measurements in existing systems
Performance estimation for future systems
Using ordinary system tools
Measuring CPU utilization and balance
Determining user/system breakdown
Detecting I/O bottlenecks
Disks
Networks
Monitoring memory utilization and swap activity
Gathering symbol table information
Most database programs are large statically linked
stripped binaries
Most tools will require symbol table information
However, distributions typically consist of object
files with symbolic data
Simple trick: replace the system linker with a wrapper that removes the
“strip” flag, then calls the real linker
ATOM: A Tool-Building System
Developed at WRL by Alan Eustace & Amitabh Srivastava
Easy to build new tools
Flexible enough to build interesting tools
Fast enough to run on real applications
Compiler independent: works on existing binaries
Code Instrumentation
[Diagram: the TOOL is inserted into the application binary like a Trojan
horse]
Application appears unchanged
ATOM adds code and data to the application
Information collected as a side effect of execution
ATOM Programming Interface
Given an application program:
Navigation: Move around
Interrogation: Ask questions
Definition: Define interface to analysis procedures
Instrumentation: Add calls to analysis procedures
Pass ANYTHING as arguments!
PC, effective addresses, constants, register
values, arrays, function arguments, line
numbers, procedure names, file names, etc.
Navigation Primitives
Get{First,Last,Next,Prev}Obj
Get{First,Last,Next,Prev}ObjProc
Get{First,Last,Next,Prev}Block
Get{First,Last,Next,Prev}Inst
GetInstBlock - Find enclosing block
GetBlockProc - Find enclosing procedure
GetProcObj - Find enclosing object
GetInstBranchTarget - Find branch target
ResolveTargetProc - Find subroutine destination
Interrogation
GetProgramInfo(PInfo)
  Number of procedures, blocks, and instructions
  Text and data addresses
GetProcInfo(Proc *, ProcInfo)
  Number of blocks or instructions
  Procedure frame size, integer and floating point save masks
GetBlockInfo(Block *, BlockInfo)
  Number of instructions
GetInstInfo(Inst *, InstInfo)
  Any piece of the instruction (opcode, ra, rb, displacement)
Interrogation(2)
ProcFileName
  Returns the file name for this procedure
InstLineNo
  Returns the line number of this instruction
GetInstRegEnum
  Returns a unique register specifier
GetInstRegUsage
  Computes source and destination register masks
Interrogation(3)
GetInstRegUsage
Computes instruction source and destination masks
GetInstRegUsage(instFirst, &usageFirst);
GetInstRegUsage(instSecond, &usageSecond);
if (usageFirst.dreg_bitvec[0] & usageSecond.ureg_bitvec[0]) {
    /* set followed by a use */
}

Exactly what you need to find static pipeline stalls!
Definition
AddCallProto(“function(argument list)”)
Constants
Character strings
Program counter
Register contents
Cycle counter
Constant arrays
Effective Addresses
Branch Condition Values
Instrumentation
AddCallProgram(Program{Before,After}, “name”,args)
AddCallProc(p, Proc{Before,After}, “name”,args)
AddCallBlock(b, Block{Before,After}, “name”,args)
AddCallInst(i, Inst{Before,After}, “name”,args)
ReplaceProc(p, “new”)
Example #1: Procedure Tracing
What procedures are executed by the following
mystery program?
#include <stdio.h>
main() {
    printf("Hello world!\n");
}
Hint: main => printf => ???
Procedure Tracing Example
> cc hello.c -non_shared -g1 -o hello
> atom hello ptrace.inst.c ptrace.anal.c -o hw.ptrace
> hw.ptrace
=> __start
=> main
=> printf
=> _doprnt
=> __getmbcurmax
<= __getmbcurmax
=> memcpy
<= memcpy
=> fwrite
Procedure Trace (2)
=> _wrtchk
=> _findbuf
=> __geterrno
<= __geterrno
=> __isatty
=> __ioctl
<= __ioctl
<= __isatty
=> __seterrno
<= __seterrno
<= _findbuf
<= _wrtchk
=> memcpy
<= memcpy
=> memchr
<= memchr
=> _xflsbuf
=> __write
Hello world!
<= __write
<= _xflsbuf
<= fwrite
<= _doprnt
<= printf
<= main
=> exit
=> __ldr_atexit
=> __ldr_context_atexit
<= __ldr_context_atexit
<= __ldr_atexit
=> _cleanup
=> fclose
=> fflush
<= fflush
=> __close
<= __close
<= fclose
=> fclose
=> fflush
<= fflush
=> __close
<= __close
<= fclose
=> fclose
=> __close
Example #2: Cache Simulator
Write a tool that computes the miss rate of the
application running in a 64KB, direct mapped data
cache with 32 byte lines.
> atom spice cache.inst.o cache.anal.o -o spice.cache
> spice.cache < ref.in > ref.out
> more cache.out
5,387,822,402 620,855,884 11.523%
Great use for 64 bit integers!
Cache Tool Implementation
Application:                     Instrumentation added:

main:   lw    v1,-32592(gp)      Reference(-32592(gp));
        move  v0,zero
        li    a0,20
loop:   addiu v1,v1,4
        addiu v0,v0,4
        sw    v1,-32592(gp)      Reference(-32592(gp));
        bne   v0,a0,loop
        jr    ra                 PrintResults();

Note: passes addresses exactly as if uninstrumented!
Cache Instrumentation File
#include <stdio.h>
#include <cmplrs/atom.inst.h>

unsigned InstrumentAll(int argc, char **argv) {
    Obj *o; Proc *p; Block *b; Inst *i;
    AddCallProto("Reference(VALUE)");
    AddCallProto("Print()");
    for (o = GetFirstObj(); o != NULL; o = GetNextObj(o)) {
        if (BuildObj(o)) return (1);
        if (o == GetFirstObj()) AddCallObj(o, ObjAfter, "Print");
        for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
            for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b))
                for (i = GetFirstInst(b); i != NULL; i = GetNextInst(i))
                    if (IsInstType(i, InstTypeLoad) || IsInstType(i, InstTypeStore))
                        AddCallInst(i, InstBefore, "Reference", EffAddrValue);
        WriteObj(o);
    }
    return (0);
}
Cache Analysis File
#include <stdio.h>
#define CACHE_SIZE 65536
#define BLOCK_SHIFT 5

long cache[CACHE_SIZE >> BLOCK_SHIFT], refs, misses;

void Reference(long address) {
    int index = (address & (CACHE_SIZE - 1)) >> BLOCK_SHIFT;
    long tag = address >> BLOCK_SHIFT;
    if (cache[index] != tag) { misses++; cache[index] = tag; }
    refs++;
}

void Print() {
    FILE *file = fopen("cache.out", "w");
    fprintf(file, "%ld %ld %.2f\n", refs, misses, 100.0 * misses / refs);
    fclose(file);
}
Example #3: TPC-B runtime information
Statistics per transaction:
  Instructions        180,398
  Loads (% shared)    47,643 (24%)
  Stores (% shared)   21,380 (22%)
  Lock/Unlock         118
  MBs                 241

Footprints/CPU:
  Instr.          300 KB (1.6 MB in pages)
  Private data    470 KB (4 MB in pages)
  Shared data     7 MB (26 MB in pages)

– 50% of the shared data footprint is touched by at least one other
  process
TPC-B (2)
Memory Footprint vs. Transactions
[Chart: footprint in bytes (0 to 8x10^6) vs. number of transactions,
plotted separately for instructions, private data, and shared data]
TPC-B (3)
Memory Footprint vs. Server processes
[Chart: footprint in bytes (0 to 8x10^6) vs. number of server processes
(1 to 6), plotted separately for shared data, private data, and
instructions]
Oracle SGA activity in TPC-B
ATOM wrap-up
Very flexible “hack-it-yourself” tool
Discover detailed information on dynamic behavior
of programs
Especially good when you don’t have source code
Shipped with Digital Unix
Can be used for tracing (later)
Hardware measurement tools
IPROBE: interface to CPU event counters
DCPI: hardware-assisted profiling
ProfileMe: hardware-assisted profiling for complex CPU cores
IPROBE
Developed by Digital’s Performance Group
Use event counters provided by Alphas
Operation:
set counter to monitor a particular event (e.g., icache_miss)
start counter
every counter overflow, interrupt wakes up handler and events
are accumulated
stop counter and read total
User can select:
which processes to count
user level, kernel level, both
IPROBE: 21164 event types
issues
single_issue_cycles
long_stalls
cycles
dual_issue_cycles
triple_issue_cycles
quad_issue_cycles
split_issue_cycles
pipe_dry
pipe_frozen
replay_trap
branches
cond_branches
jsr_ret
integer_ops
float_ops
loads
stores
icache_access
dcache_access
scache_access
scache_read
scache_write
bcache_hit
bcache_victim
sys_req
branch_mispr
pc_mispr
icache_miss
dcache_miss
dtb_miss
loads_merged
ldu_replays
cycles
scache_miss
scache_read_miss
scache_write
scache_sh_write
scache_write_miss
bcache_miss
sys_inv
itb_miss
wb_maf_full_replays
sys_read_req
external
mem_barrier_cycles
load_locked
scache_victim
IPROBE: what you can do
Directly measure relevant events (e.g. cache
performance)
Overall CPU cycle breakdown diagnosis:
microbenchmark machine to estimate latencies
combine latencies with event counts
Main source of inaccuracy: load/store overlap in the memory system
IPROBE example: 4-CPU SMP
[Pie charts:
Breakdown of CPU cycles (CPI = 7.4): issuing 10%, instruction stall 44%,
data stall 46%.
Estimated breakdown of stall cycles: Bcache miss 42%, Bcache hit 27%,
Scache hit 16%, TLB 6%, Mem. barrier 5%, Branch mispr. 2%, Replay trap
2%.]
Why did it run so badly?!?
Nominal memory latencies were good: 80 cycles
Micro-benchmarks determined that:
latency under load is over 120 cycles on 4
processors
base dirty miss latency was over 130 cycles
off-chip cache latency was high
IPROBE data uncovered significant sharing:
for P=2, 15% of bcache misses are to dirty blocks
for P=4, 20% of bcache misses are to dirty blocks
Dirty miss latency on RISC SMPs
SPEC benchmarks have no significant sharing
Current processors/systems optimize local cache
access
All RISC SMPs have high dirty miss penalties
Distribution of bus stall latencies for dirty misses:
[Histogram: % of dirty misses (0 to 60) vs. bus stall latency in bus
cycles (1 to 16)]
DCPI: continuous profiling infrastructure
Developed by SRC and WRL researchers
Based on periodic sampling
Hardware generates periodic interrupts
OS handles the interrupts and stores data
Program Counter (PC) and any extra info
Analysis Tools convert data
for users
for compilers
Other examples:
SGI Speedshop, Unix’s prof(), VTune
Sampling vs. Instrumentation
Much lower overhead than instrumentation
DCPI: program 1%-3% slower
Pixie: program 2-3 times slower
Applicable to large workloads
100,000 TPS on Alpha
AltaVista
Easier to apply to whole systems (kernel, device
drivers, shared libraries, ...)
Instrumenting kernels is very tricky
No source code needed
Information from Profiles
DCPI estimates:
  Where CPU cycles went, broken down by image, procedure, instruction
  How often code was executed: basic blocks and CFG edges
  Where peak performance was lost and why
Example: Getting the Big Picture
Total samples for event type cycles = 6095201

 cycles       %    cum%  load file
2257103  37.03%  37.03%  /usr/shlib/X11/lib_dec_ffb_ev5.so
1658462  27.21%  64.24%  /vmunix
 928318  15.23%  79.47%  /usr/shlib/X11/libmi.so
 650299  10.67%  90.14%  /usr/shlib/X11/libos.so

 cycles       %    cum%  procedure              load file
2064143  33.87%  33.87%  ffb8ZeroPolyArc        /usr/shlib/X11/lib_dec_ffb_ev5.so
 517464   8.49%  42.35%  ReadRequestFromClient  /usr/shlib/X11/libos.so
 305072   5.01%  47.36%  miCreateETandAET       /usr/shlib/X11/libmi.so
 271158   4.45%  51.81%  miZeroArcSetup         /usr/shlib/X11/libmi.so
 245450   4.03%  55.84%  bcopy                  /vmunix
 209835   3.44%  59.28%  Dispatch               /usr/shlib/X11/libdix.so
 186413   3.06%  62.34%  ffb8FillPolygon        /usr/shlib/X11/lib_dec_ffb_ev5.so
 170723   2.80%  65.14%  in_checksum            /vmunix
 161326   2.65%  67.78%  miInsertEdgeInET       /usr/shlib/X11/libmi.so
 133768   2.19%  69.98%  miX1Y1X2Y2InRegion     /usr/shlib/X11/libmi.so
Example: Using the Microscope
Address  Instruction      Samples  Culprits  CPI
9618     addq s0,t6,t6    643      b, D      3.5 cycles
961c     ldl  t4,0(t6)    2111     a, d, i   21.0 cycles
9620     xor  t4,t12,t5   14152              1.0 cycles
9624     beq  0x963c      0                  0.0 cycles

(a = data dep on 1st operand, b = data dep on 2nd operand,
 D = DTLB miss, d = d-cache miss, i = i-cache miss)

Where peak performance is lost and why
Example: Summarizing Stalls
I-cache (not ITB)    0.0% to  0.3%
ITB/I-cache miss     0.0% to  0.0%
D-cache miss        27.9% to 27.9%
DTB miss             9.2% to 18.3%
Write buffer         0.0% to  6.3%
Synchronization      0.0% to  0.0%
Branch mispredict    0.0% to  2.6%
IMUL busy            0.0% to  0.0%
FDIV busy            0.0% to  0.0%
Other                0.0% to  0.0%
Unexplained stall    2.3% to  2.3%
Unexplained gain    -4.3% to -4.3%
------------------------------------
Subtotal dynamic    44.1%

Slotting             1.8%
Ra dependency        2.0%
Rb dependency        1.0%
Rc dependency        0.0%
FU dependency        0.0%
------------------------------------
Subtotal static      4.8%
------------------------------------
Total stall         48.9%
Execution           51.2%
Net sampling error  -0.1%
------------------------------------
Total tallied      100.0%
(35171, 93.1% of all samples)
Example: Sorting Stalls
    %  cum%  cycles  cnt   cpi   blame   PC    file:line
10.0% 10.0%  109885  4998  22.0  dcache  957c  comp.c:484
 9.9% 19.8%  108776  5513  19.7  dcache  9530  comp.c:477
 7.8% 27.6%   85668  3836  22.3  dcache  959c  comp.c:488
Typical Hardware Support
Timers
Clock interrupt after N units of time
Performance Counters
Interrupt after N events: cycles, issues, loads, L1 Dcache misses,
branch mispredicts, uops retired, ...
Alpha 21064, 21164; PPro, PII;…
Easy to measure total cycles, issues, CPI, etc.
Only extra information is restart PC
Problem: Inaccurate Attribution
Experiment: count data loads in a loop with a single load plus hundreds
of nops; histogram the restart PCs.

In-order processor (Alpha 21164): constant skew; one large peak (782
samples) offset from the load.
Out-of-order processor (Intel Pentium Pro): skew and smear; samples
spread over many instructions.

[Histograms of restart PCs]
Ramification of Misattribution
No skew or smear
Instruction-level analysis is easy!
Skew is a constant number of cycles
Instruction-level analysis is possible
Adjust sampling period by amount of skew
Infer execution counts, CPI, stalls, and stall
explanations from cycles samples and program
Smear
Instruction-level analysis seems hopeless
Examples: PII, StrongARM
Desired Hardware Support
Sample fetched instructions
Save PC of sampled instruction
E.g., interrupt handler reads Internal Processor
Register
Makes skew and smear irrelevant
Gather more information
ProfileMe: Instruction-Centric Profiling

[Pipeline diagram: fetch → map → issue → exec → retire, with icache,
dcache, branch predictor, and arithmetic units. A fetch counter overflow
randomly selects an instruction to tag ("ProfileMe tag!"). As the tagged
instruction flows down the pipeline, internal processor registers
capture its PC, cache miss and branch mispredict flags, branch history,
per-stage latencies, effective address, and whether it retired; an
interrupt then lets software read them.]
Instruction-Level Statistics
PC + Retire Status      →  execution frequency
PC + Cache Miss Flag    →  cache miss rates
PC + Branch Mispredict  →  mispredict rates
PC + Event Flag         →  event rates
PC + Branch Direction   →  edge frequencies
PC + Branch History     →  path execution rates
PC + Latency            →  instruction stalls
  (“100-cycle dcache miss” vs. “dcache miss”)
Data Analysis
[Diagram: compiled code + samples → ANALYSIS → frequency, cycles per
instruction, stall explanations]

Cycle samples are proportional to total time at the head of the issue
queue (at least on in-order Alphas)
Frequency indicates frequent paths
CPI indicates stalls
Estimating Frequency from Samples
Problem:
  given cycle samples, compute frequency and CPI
  1,000,000 cycle samples could mean 1,000,000 executions at 1 CPI,
  or 10,000 executions at 100 CPI

Approach:
  Let F = Frequency / Sampling Period
  E(Cycle Samples) = F × CPI
  So … F = E(Cycle Samples) / CPI
Estimating Frequency (cont.)
F = E(Cycle Samples) / CPI
Idea
If no dynamic stall, then know CPI, so can estimate F
So… assume some instructions have no dynamic stalls
Consider a group of instructions with the same frequency
(e.g., basic block)
Identify instructions w/o dynamic stalls; then average their
sample counts for better accuracy
Key insight: instructions without stalls have smaller sample counts
Estimating Frequency (Example)
Address  Instruction            Samples  MinCPI  Samples/MinCPI
9600     subl   s6, a1, s6          792       1     792
9604     lda    a3, 16411(s6)       611       1     611
9608     cmovlt s6, a3, s6          649       1     649
960c     bis    zero, zero, s3        0       0     -
9610     sll    s6, 0x5, t6        1389       2     695
9614     addl   zero, t6, t6        616       1     616
9618     addq   s0, t6, t6          643       1     643
961c     ldl    t4, 0(t6)          2111       1     2111
9620     xor    t4, t12, t5       13152       2     6576
9624     beq    t5, 963c              0       0     -

Compute MinCPI from code
Compute Samples/MinCPI
Select data to average: Estimate 630 (Actual 615)

Does badly when:
  Few issue points
  All issue points stall
Frequency Estimate Accuracy
Compare frequency estimates for blocks to
measured values obtained with pixie-like tool
Explaining Stalls
Static stalls
Schedule instructions in each basic block
optimistically using a detailed pipeline model for the
processor
Dynamic stalls
Start with all possible explanations
– I-cache miss, D-cache miss, DTB miss, branch
mispredict, ...
Rule out unlikely explanations
List the remaining possibilities
Ruling Out D-cache Misses
Is the previous occurrence of an operand register
the destination of a load instruction?
ldq  t0,0(s1)                addq t3,t0,t4
subq t0,t1,t2        OR      subq t0,t1,t2
Search backward across basic block boundaries
Prune by block and edge execution frequencies
DCPI wrap-up
Very precise, non-intrusive profiling tool
Gathers both user-level and kernel profiles
Relates architectural events back to original code
Used for profile-based code optimizations
Simulation of commercial workloads
Requires scaling down
Options:
Trace-driven simulation
User-level execution-driven simulation
Complete machine simulation
Trace-driven simulation
Methodology:
  create an ATOM instrumentation tool that logs a complete trace per
  Oracle server process:
    – instruction path
    – data accesses
    – synchronization accesses
    – system calls
  run the “atomized” version to derive the trace
  feed the traces to the simulator
Trace-driven studies: limitations
No OS activity (in OLTP OS takes 10-15% of the
time)
Trace selected processes only (e.g. server
processes)
Time dilation alters system behavior
I/O looks faster
many places with hardwired timeout values have to
be patched
Capturing synchronization correctly is difficult
need to reproduce correct concurrency for shared
data structures
DB has complex synchronization structure, many
levels of procedures
Trace-driven studies: limitations(2)
Scheduling traces into simulated processors
need enough information in the trace to reproduce
OS scheduling
need to suspend processes for I/O & other blocking
operations
need to model activity of background processes that
are not traced (e.g. log writer)
Re-create OS virtual-physical mapping, page
coloring scheme
Very difficult to simulate wrong-path execution
User-level execution-driven simulator
Our approach was to modify AINT (MINT for the Alpha)
Problems:
no OS activity measured
Oracle/OS interactions are very complex
OS system call interface has to be virtualized
That’s a hard one to crack…
Our status:
Oracle/TPC-B ran with 1 server process only
we gave up...
Complete machine simulator
Bite the bullet: model the machine at the hardware
level
The good news is:
hardware interface is cleaner & better documented
than any software interface (including OS)
all software JUST RUNS!! Including OS
applications don’t have to be ported to simulator
We ported SimOS (from Stanford) to Alpha
SimOS
A complete machine simulator
Speed-detail tradeoff for maximum flexibility
Flexible data collection and classification
Originally developed at Stanford University (MIPS ISA)
SimOS-Alpha effort started at WRL in Fall 1996
62
Ed Bugnion, Luiz Barroso, Kourosh Gharachorloo, Ben
Verghese, Basem Nayfeh, and Jamey Hicks (CRL)
UPC, February 1999
SimOS - Complete Machine Simulation
[Diagram: workloads (Pmake, Oracle, VCS) run on the operating system of
the simulated machine; SimOS models the hardware (CPU/MMU, caches,
memory system, disks, TTY, Ethernet) on top of the host machine]

Models CPUs, caches, buses, memory, disks, network, …
Complete enough to run an OS and any applications
Multiple Levels of Detail
Tradeoff between speed of simulation and the
amount of detail that is simulated
Multiple modes of CPU simulation
Fast “on-the-fly compilation” mode: 10X slowdown!
  – Workload placement
Simple pipeline emulator, no caches: 50-100X slowdown
  – Rough characterization
Simple pipeline emulator, full cache simulation: 100-200X slowdown
  – More accurate characterization of workloads
Multiple Models for each Component
Multiple models for CPU, cache, memory, and disk:

CPU
  simple pipeline emulator: 100-200X slowdown (EV5)
  dynamically-scheduled processor: 1000-10000X slowdown (e.g. 21264)
Caches
  two-level set-associative caches
  shared caches
Memory
  Perfect (0-latency), Bus-based (Tlaser), NUMA (Wildfire)
Disk
  fixed latency or more complex HP disk model

Modular: add your own flavors
Checkpoint and Sampling
Checkpoint capability for entire machine state
CPU state, main memory, and disk changes
Important for positioning workload for detailed simulation
Switching detail level in a “sampling” study
Run in faster modes, sample in more detailed modes
Repeatability
  Change parameters for studies
    – Cache size
    – Memory type and latencies
    – Disk models and latencies
    – Many others
  Debugging race conditions
Data Collection and Classification
Exploits visibility and non-intrusiveness offered by simulation
Can observe low-level events such as cache misses,
references and TLB misses
Tcl-based configuration and control provides ease of use
Powerful annotation mechanism for triggering events
Hardware, OS, or Application
Apps and mechanisms to organize and classify data
  Some already provided (cache miss counts and classification)
  Mechanisms to do more (timing trees and detail tables)
Easy configuration
TCL based configuration of the machine parameters
Example:
set PARAM(CPU.Model)                 DELTA
set detailLevel                      1
set PARAM(CPU.Clock)                 1000
set PARAM(CPU.Count)                 4
set PARAM(CACHE.2Level.L2Size)       1024
set PARAM(CACHE.2Level.L2Line)       64
set PARAM(CACHE.2Level.L2HitTime)    15
set PARAM(MEMSYS.MemSize)            1024
set PARAM(MEMSYS.Numa.NumMemories)   $PARAM(CPU.Count)
set PARAM(MEMSYS.Model)              Numa
set PARAM(DISK.Fixed.Latency)        10
Annotations - The building block
Small procedures to be run on encountering certain events
PC, hardware events (cache miss, TLB, …), simulator events
annotation set pc vmunix::idle_thread:START {
set PROCESS($CPU) idle
annotation exec osEvent startIdle
}
annotation set osEvent switchIn {
log "$CYCLES ContextSwitch
$CPU,$PID($CPU),$PROCESS($CPU)\n"
}
annotation set pc 0x12004ba90 {
incr tpcbTOGO -1
console "TRANSACTION $CYCLES togo=$tpcbTOGO \n"
if {$tpcbTOGO == 0} {simosExit}
}
Example: Kernel Detail (TPCB)
[Pie chart: breakdown of kernel time in TPC-B across SYS_read,
SYS_write, SYS_pid_block, SYS_pid_unblock, lock, Int_clock, Int_IPI,
Int_IO, DTLB, ITLB, 2XTLB, MM_FOW, and Other; the three largest slices
are 30%, 21%, and 17%]
SimOS Methodology
Configure and tune the workload on existing machine
build the database schema, create indexes, load data, optimize
queries
more difficult if simulated system much different from existing
platform
Create file(s) with disk image (dd) of the database disk(s)
write-protect “dd” files to prevent permanent modification (i.e.
use copy-on-write)
optionally, umount disks and let SimOS use them as raw
devices
Configure SimOS to see the “dd” files as raw disks
“Boot” a SimOS configuration and mount the disks
SimOS Methodology (2)
Boot and start up the database engine in “fast mode”
Startup the workload
When in steady state: create a checkpoint and exit
Resume from checkpoint with complex (slower)
simulator
Sample NUMA TPC-B Profile:
[Profile figure]
Running from a Checkpoint
What can be changed:
processor model
disk model
cache sizes, hierarchy, organization, replacement
how long to run the simulation
What cannot be changed:
number of processors
size of physical memory
Tools wrap-up
No single tool will get the job done
Monitoring application execution in a real system is
invaluable
Complete machine simulation advantages:
see the whole thing
portability of software is non-issue
speed/detail trade-off essential for detailed studies