
ENGS 116 Lecture 13
Caches, Cont’d
Vincent H. Berk
October 24, 2005
Reading for Wednesday: Sections 5.1 – 5.4 (Jouppi article)
Reading for Friday: Sections 5.5 – 5.8
Reading for Monday: Sections 5.8 – 5.12 and 5.16
5. Reducing Misses by Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks, in software
• Instructions
  – Reorder procedures in memory so as to reduce conflict misses
  – Profiling to look at conflicts (using tools they developed)
• Data
  – Merging arrays: improve spatial locality by using a single array of compound elements vs. 2 arrays
  – Loop interchange: change nesting of loops to access data in the order stored in memory
  – Loop fusion: combine 2 independent loops that have the same looping and some variables overlap
  – Blocking: improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rows
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];
Reduces conflicts between val & key and improves spatial locality
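A short usage sketch of why merging wins (the lookup function is hypothetical, not from the slide): touching key for entry i pulls val for the same entry into the cache in the same block.

/* Hypothetical lookup: after merging, key and val for entry i share a block. */
int lookup(struct merge *m, int n, int target)
{
    for (int i = 0; i < n; i++)
        if (m[i].key == target)  /* this load brings in val too...      */
            return m[i].val;     /* ...so this access is usually a hit */
    return -1;
}

With two separate arrays, the same scan would stream through two distinct regions of memory and could evict one with the other.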
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through memory every
100 words; improved spatial locality
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }
Two misses per access to a & c vs. one miss per access; improves temporal locality
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }
• Two inner loops:
  – Read all N × N elements of z[ ]
  – Read N elements of 1 row of y[ ] repeatedly
  – Write N elements of 1 row of x[ ]
• Capacity misses are a function of N & cache size:
  – 3 × N × N × 4 < cache size ⇒ no capacity misses; otherwise ...
• Idea: compute on a B × B submatrix that fits in the cache
Blocking Example
/* After (min(a,b) assumed available, e.g. #define min(a,b) ((a) < (b) ? (a) : (b))) */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B, N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }
• B called the blocking factor
• Capacity misses reduced from 2N³ + N² to 2N³/B + N²
• Conflict misses, too?
• Blocks don’t have to be square.
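A quick worked check of the capacity condition above (illustrative arithmetic, not from the slide): with 4-byte elements and roughly three active B × B blocks in play, we need about 3 × B² × 4 ≤ cache size. For an 8KB cache this gives B² ≤ 682, i.e., B ≤ 26, so a blocking factor such as the 24 studied by Lam et al. on the next slide fits comfortably.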
Reducing Conflict Misses by Blocking
[Figure: miss rate (0 to 0.15) vs. blocking factor (0 to 150) for a direct-mapped cache and a fully associative cache]
• Conflict misses in caches that are not fully associative vs. blocking factor
  – Lam et al. [1991]: a blocking factor of 24 had one-fifth the misses of a factor of 48, despite the fact that both fit in the cache
Summary of Compiler Optimizations to Reduce Cache
Misses
[Figure: performance improvement (1× to 3×) from merged arrays, loop interchange, loop fusion, and blocking, for vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress]
Reducing Miss Penalty or Miss Rate via Parallelism
• Nonblocking Caches
• Hardware prefetching of instructions or data
• Compiler-controlled prefetching
1. Parallelism:
Non-blocking Caches to Reduce Stalls on Misses
• A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
  – Requires an out-of-order execution CPU
• “Hit under miss” reduces the effective miss penalty by working during the miss vs. ignoring CPU requests
• “Hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses
  – Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  – Requires multiple memory banks (otherwise multiple misses cannot be serviced in parallel)
  – Pentium Pro allows 4 outstanding memory misses
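A minimal sketch, with hypothetical names and structure, of the bookkeeping behind “miss under miss”: each outstanding miss occupies a miss-status holding register (MSHR), and the controller must search them on every miss.

/* Hypothetical MSHR file: one entry per outstanding miss.  All names
   are illustrative, not from any real controller. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS 4          /* e.g., Pentium Pro allows 4 outstanding misses */

struct mshr {
    bool     valid;          /* entry tracks an in-flight miss  */
    uint64_t block_addr;     /* which block is being fetched    */
};

static struct mshr mshrs[NUM_MSHRS];

/* On a cache miss: return true if the miss can proceed in parallel,
   false if the CPU must stall (all MSHRs busy). */
bool handle_miss(uint64_t block_addr)
{
    for (int i = 0; i < NUM_MSHRS; i++)      /* merge with an in-flight miss */
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr)
            return true;                     /* block already being fetched */
    for (int i = 0; i < NUM_MSHRS; i++)      /* otherwise allocate a free entry */
        if (!mshrs[i].valid) {
            mshrs[i].valid = true;
            mshrs[i].block_addr = block_addr;
            return true;                     /* miss issued to memory */
        }
    return false;                            /* no free MSHR: stall */
}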
Value of Hit Under Miss for SPEC
[Figure: average memory access time (normalized, 0 to 2) per SPEC92 benchmark under “hit under n misses,” legend: base, 0→1, 1→2, 2→64. Integer benchmarks (eqntott, espresso, xlisp, compress) at left; floating-point benchmarks (mdljsp2, ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, ora) at right]
• FP programs on average: AMAT = 0.68 → 0.52 → 0.34 → 0.26
• Int programs on average: AMAT = 0.24 → 0.20 → 0.19 → 0.19
• 8 KB data cache, direct mapped, 32-byte blocks, 16-cycle miss penalty
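For reference (standard definition, not on the original slide), the quantity behind these numbers is

    AMAT = Hit time + Miss rate × Miss penalty

and hit under miss shrinks the effective miss-penalty term by overlapping useful work with the outstanding miss.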
2. Reducing Misses by Hardware
Prefetching of Instructions & Data
• E.g., instruction prefetching
  – Alpha 21064 fetches 2 blocks on a miss
  – Extra block placed in “stream buffer”
  – On a miss, check the stream buffer
• Works with data blocks, too:
  – Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4KB cache; 4 streams caught 43%
  – Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64KB, 4-way set-associative caches
• Prefetching relies on having extra memory bandwidth that can be used without penalty
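A minimal sketch of the stream-buffer idea (the data structure and names are hypothetical; a real buffer is a small FIFO of prefetched blocks with a comparator on the head entry):

/* Hypothetical one-stream sequential prefetch buffer, Jouppi-style. */
#include <stdbool.h>
#include <stdint.h>

#define DEPTH 4                          /* blocks held by the stream buffer */

struct stream_buffer {
    bool     valid[DEPTH];
    uint64_t block_addr[DEPTH];          /* addresses of prefetched blocks */
};

/* On a cache miss, probe the buffer head.  A hit supplies the block and
   prefetches the next sequential one; a miss restarts the stream. */
bool probe_stream_buffer(struct stream_buffer *sb, uint64_t miss_addr)
{
    if (sb->valid[0] && sb->block_addr[0] == miss_addr) {
        for (int i = 0; i < DEPTH - 1; i++) {   /* shift the FIFO up */
            sb->valid[i] = sb->valid[i + 1];
            sb->block_addr[i] = sb->block_addr[i + 1];
        }
        sb->valid[DEPTH - 1] = true;            /* prefetch past the tail */
        sb->block_addr[DEPTH - 1] = sb->block_addr[DEPTH - 2] + 1;
        return true;                            /* supplied from the buffer */
    }
    for (int i = 0; i < DEPTH; i++) {           /* restart at the miss address */
        sb->valid[i] = true;
        sb->block_addr[i] = miss_addr + 1 + i;
    }
    return false;                               /* block must come from memory */
}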
3. Reducing Misses by
Software Prefetching of Data
• Data prefetch
  – Load data into register (HP PA-RISC loads)
  – Cache prefetch: load into cache (MIPS IV, PowerPC, SPARC v9)
  – Special prefetching instructions cannot cause faults; a form of speculative execution
• Issuing prefetch instructions takes time
  – Is the cost of issuing prefetches < the savings in reduced misses?
  – Wider superscalar issue reduces the difficulty of finding issue bandwidth (see the sketch below)
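A minimal sketch of compiler-controlled prefetching in C, using GCC’s __builtin_prefetch intrinsic (a real GCC builtin; the prefetch distance of 16 elements is an illustrative guess, not a tuned value):

#include <stddef.h>

/* Scale an array, hinting the cache to fetch ahead of the loop. */
void scale(double *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Arguments: address, rw (0 = read), temporal locality (0-3).
           The prefetch instruction itself never faults, even when
           i + 16 runs past the end of the array. */
        __builtin_prefetch(&a[i + 16], 0, 3);
        a[i] = 2.0 * a[i];
    }
}

In practice the distance is tuned so the prefetched line arrives just before the loop reaches it, balancing the issue cost against the miss savings noted above.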
Reducing Hit Time
• Small and Simple Caches
• Avoiding Address Translation during Indexing of the Cache
• Pipelined Cache Access
• Trace Caches
1. Fast Hit Times
via Small and Simple Caches
• Alpha 21164 has 8KB instruction and 8KB data caches + a 96KB second-level cache
  – Small cache and fast clock rate
• Intel Pentium 4 moved from 2-way 16KB data + 16KB instruction to 4-way 8KB + 8KB
  – Now the cache runs at core speed
• Direct mapped, on chip
2. Fast Hits by Avoiding Address Translation
• Send the virtual address to the cache? Called a virtually addressed cache or virtual cache, vs. a physical cache
  – Every time a process is switched, the cache logically must be flushed; otherwise it gets false hits
    >> Cost is time to flush + “compulsory” misses from an empty cache
  – Must handle aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
• Solution to aliases
  – HW guarantees each block a unique physical address, OR page coloring is used to ensure virtual and physical addresses match in the last x bits
• Solution to cache flush
  – Add a process-identifier tag that identifies the process as well as the address within the process: cannot get a hit if the process is wrong
Virtually Addressed Caches
[Figure: three cache organizations. (1) Conventional organization: CPU issues VA → TB (translation buffer) → PA → cache (PA tags) → MEM. (2) Virtually addressed cache: CPU issues VA → cache (VA tags) → MEM, translating only on a miss; suffers the synonym problem. (3) Overlapped organization: cache access (PA tags, with an L2 cache) proceeds in parallel with VA translation; requires the cache index to remain invariant across translation.]
2. Fast Cache Hits by Avoiding Translation: Index
with Physical Portion of Address
• If the index is the physical part of the address, can start tag access in parallel with translation, then compare to the physical tag

[Figure: 32-bit address, bits 31–0. Bits 31–12 form the page address, used as the address tag; bits 11–0 form the page offset, which supplies the cache index and block offset.]

• Limits cache size to the page size: what if you want a bigger cache using the same trick?
  – Higher associativity moves the barrier to the right
  – Page coloring
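A small sketch of why this works (bit positions follow the figure above; the 4KB page and the cache geometry are assumptions for illustration): the index comes entirely from page-offset bits, which translation never changes, so indexing can begin before the TLB finishes.

#include <stdint.h>

#define PAGE_OFFSET_BITS  12   /* 4KB pages: bits 11-0 are untranslated      */
#define BLOCK_OFFSET_BITS 5    /* assumed 32-byte blocks                     */
#define INDEX_BITS        7    /* assumed 128 sets: 5 + 7 = 12 bits, so the
                                  whole index fits inside the page offset    */

/* Because all index bits lie below bit 12, cache_index(va) == cache_index(pa)
   for any translation of va to pa. */
uint32_t cache_index(uint32_t addr)
{
    return (addr >> BLOCK_OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}

With these assumed parameters the direct-mapped cache is 128 × 32 = 4KB, exactly one page, which is the size limit the bullet above describes.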
3. Fast Hit Times Via Pipelined Writes
• Pipeline tag check and cache update as separate stages; the current write’s tag check overlaps the previous write’s cache update
• Only STORES are in the pipeline; it empties during a miss
[Figure: pipelined-write datapath for the sequence Store r2,(r1); Add; Sub; Store r4,(r3). Stage annotations: “check r1,” then “M[r1]←r2 & check r3.” Components: CPU address, data in/out, tag compare (=?), delayed write buffer, MUX, write buffer, lower-level memory.]
• The shaded “delayed write buffer” must be checked on reads; either complete the write or read from the buffer
4. Trace Caches
• Combine branch prediction and instruction prefetching
• Ability to load instruction blocks, including taken branches, into a cache block
• Basis for the Pentium 4 NetBurst architecture
Cache Example
Suppose we have a processor with a base CPI of 1.0, assuming that all references
hit in the primary cache, and a clock rate of 500 MHz. Assume a main memory
access time of 200 ns, including all the miss handling. Suppose the miss rate per
instruction at the primary cache is 5%. How much faster will the machine be if
we add a secondary cache that has a 20-ns access time for either a hit or a miss
and is large enough to reduce the global miss rate to main memory to 2%?
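One way to work the example (all numbers from the problem statement): at 500 MHz a clock cycle is 2 ns, so the main-memory miss penalty is 200 ns / 2 ns = 100 cycles and the secondary-cache access is 20 ns / 2 ns = 10 cycles. Without the secondary cache, CPI = 1.0 + 5% × 100 = 6.0. With it, CPI = 1.0 + 5% × 10 + 2% × 100 = 1.0 + 0.5 + 2.0 = 3.5. Speedup = 6.0 / 3.5 ≈ 1.7.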
What is the Impact of What You’ve Learned
About Caches?
[Figure: relative performance, 1 to 1000 (log scale), of CPU vs. DRAM from 1980 to 2000; the CPU curve climbs steeply while DRAM improves slowly]
• 1960–1985: Speed = ƒ(no. operations)
• 1990:
  – Pipelined execution & fast clock rate
  – Out-of-order execution
  – Superscalar instruction issue
• 1998: Speed = ƒ(non-cached memory accesses)
• What does this mean for compilers, operating systems, algorithms, data structures?