CS510 Computer Architectures
Lecture 13: Cache Storage System
4 Questions for Cache Storage Designers
• Q1: Where can a block be placed in the cache? (Block placement)
• Q2: How is a block found if it is in the cache? (Block identification)
• Q3: Which block should be replaced on a cache miss? (Block replacement)
• Q4: What happens on a cache write? (Write strategy)
Example: Alpha 21064 Data Cache
• Direct mapped; Index = 8 bits: 256 blocks = 8192/(32 bytes x 1)
• 34-bit CPU address = Tag <21> + Index <8> + Block offset <5>
[Figure: direct-mapped cache datapath. The index selects one of 256 blocks, each holding a valid bit <1>, a tag <21>, and data <256>. If the valid bit is 1, the stored tag is compared (=?) with the tag from the CPU address, and a 4:1 MUX selects the word for Data Out. A hit sends the load signal to the CPU; a miss goes to the lower-level memory; writes pass through a write buffer. A sketch of the lookup follows below.]
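To make the lookup concrete, here is a minimal C sketch of the address decomposition and tag check for this direct-mapped organization. The types and names (struct line, cache_read) are illustrative, not the 21064's actual logic, and data movement on a miss is omitted.

/* Minimal sketch of a 21064-style direct-mapped lookup (tags + data). */
#include <stdint.h>
#include <stdbool.h>

#define NBLOCKS 256                         /* 8 KB / 32-byte blocks     */

struct line { bool valid; uint32_t tag; uint8_t data[32]; };
static struct line cache[NBLOCKS];

/* Returns true on a hit and copies the addressed byte into *out. */
bool cache_read(uint64_t addr, uint8_t *out)
{
    uint32_t offset = addr & 0x1F;              /* <5>  block offset     */
    uint32_t index  = (addr >> 5) & 0xFF;       /* <8>  1 of 256 blocks  */
    uint32_t tag    = (addr >> 13) & 0x1FFFFF;  /* <21> tag (34-bit addr)*/

    struct line *l = &cache[index];
    if (l->valid && l->tag == tag) {            /* compare only if valid */
        *out = l->data[offset];                 /* 4:1 MUX analogue      */
        return true;                            /* hit: signal the CPU   */
    }
    return false;                               /* miss: lower level     */
}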
Example: Alpha 21064 Data Cache
• READ: four steps
[Figure: the same direct-mapped cache datapath, annotated with the four read steps listed below.]
[1] CPU sends the 34-bit address to the cache for tag comparison
[2] The index selects the tag to be compared with the tag in the address from the CPU
[3] The tags are compared only if the information in the directory is valid
[4] If the tags match, signal the CPU to load the data
These 4 steps take 2 clock cycles; instructions issued during these 2 cycles must be stalled.
• WRITE:
– the first three steps + data write
Writes in Alpha 21064
• Write Through Cache
– A write does not complete in memory in 2 clock cycles; rather, the data is written into the Write Buffer within 2 clock cycles, and from there it must be stored to memory while the CPU continues working
• No-Write Merging vs. Write Merging in the write buffer (a sketch of the merging logic follows below)
[Figure: a 4-entry write buffer; each entry holds an address plus four words, each with a valid bit (up to 16 sequential writes fit in the buffer). No-Write Merging: 4 sequential writes to addresses 100, 104, 108, 112 consume 4 entries, one valid word each. Write Merging: the same 4 writes merge into a single entry at address 100 with all four valid bits set.]
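A minimal C sketch of the merging logic, assuming a 4-entry buffer with four 8-byte words per entry; the entry layout and names (wb_entry, wb_write) are hypothetical, not the 21064 hardware.

#include <stdint.h>
#include <stdbool.h>

#define ENTRIES 4
#define WORDS   4                      /* 4 x 8-byte words per entry      */

struct wb_entry {
    bool     used;
    uint64_t base;                     /* 32-byte-aligned block address   */
    bool     valid[WORDS];             /* per-word valid bits             */
    uint64_t data[WORDS];
};
static struct wb_entry wb[ENTRIES];

/* Returns true if the write was absorbed; false means the buffer is full
 * and the CPU must stall until an entry drains to memory. */
bool wb_write(uint64_t addr, uint64_t value)
{
    uint64_t base = addr & ~31ULL;
    int      word = (int)((addr >> 3) & (WORDS - 1));

    for (int i = 0; i < ENTRIES; i++)      /* merge into a matching entry */
        if (wb[i].used && wb[i].base == base) {
            wb[i].valid[word] = true;
            wb[i].data[word]  = value;
            return true;
        }
    for (int i = 0; i < ENTRIES; i++)      /* else allocate a fresh entry */
        if (!wb[i].used) {
            wb[i].used = true;
            wb[i].base = base;
            wb[i].valid[word] = true;
            wb[i].data[word]  = value;
            return true;
        }
    return false;
}

Without the first loop (no-write merging), the four writes to 100, 104, 108, and 112 would each consume a fresh entry and fill the buffer; with it, they share the single entry for their common 32-byte block.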
Structural Hazard: Split Cache or Unified Cache?
Miss rates for separate instruction and data caches vs. a unified cache
- DM, 32-byte blocks, SPEC92 average (instruction references are about 75%), DECstation 5000

Size      Instruction Cache   Data Cache   Unified Cache
1 KB      3.06%               24.61%       13.34%
2 KB      2.26%               20.57%        9.78%
4 KB      1.78%               15.94%        7.24%
8 KB      1.10%               10.19%        4.57%
16 KB     0.64%                6.47%        2.87%
32 KB     0.39%                4.82%        1.99%
64 KB     0.15%                3.77%        1.35%
128 KB    0.02%                2.88%        0.95%
Example: Miss Rate and Average Access Time
• Compare 16 KB I-cache and 16 KB D-cache vs. a 32 KB U-cache
• Assume
– a hit takes 1 clock cycle and a miss takes 50 clock cycles
– a load or store hit takes 1 extra clock cycle on a U-cache, since there is only one cache port to satisfy two simultaneous requests (1 instruction fetch and 1 load or store)
– Integer program: loads 26%, stores 9%
• Answer
– instruction accesses: 100%/(100%+26%+9%) = 75%
– data accesses: (26%+9%)/(100%+26%+9%) = 25%
– the overall miss rate for the split cache:
(75% x 0.64%) + (25% x 6.47%) = 2.10%
The U-cache has a slightly lower miss rate of 1.99%
(continued on the next slide)
Example: Miss Rate and Average Access Time
• Average access time = %instructions x (Hit time + Instruction miss rate x Miss penalty) + %data x (Hit time + Data miss rate x Miss penalty)
• 16 KB split cache:
Average access time_split = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 0.990 + 1.059 = 2.05
• 32 KB unified cache (the extra 1 cycle on data accesses is the stall due to the simultaneous instruction fetch):
Average access time_unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 1.496 + 0.749 = 2.24
• The split cache in this example has a better average access time than the single-ported unified cache, even though its effective miss rate is higher (a runnable check follows below)
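The same arithmetic as a small runnable C sketch; the numbers come straight from the slide, while the helper name amat is ours.

/* Sketch: reproduce the split vs. unified AMAT arithmetic above. */
#include <stdio.h>

static double amat(double f_instr, double hit_i, double mr_i,
                   double f_data,  double hit_d, double mr_d,
                   double penalty)
{
    return f_instr * (hit_i + mr_i * penalty)
         + f_data  * (hit_d + mr_d * penalty);
}

int main(void)
{
    /* split: 1-cycle hits; unified: loads/stores pay 1 extra hit cycle */
    double split   = amat(0.75, 1.0, 0.0064, 0.25, 1.0, 0.0647, 50.0);
    double unified = amat(0.75, 1.0, 0.0199, 0.25, 2.0, 0.0199, 50.0);
    printf("split   = %.3f cycles\n", split);    /* 2.049, i.e. 2.05 */
    printf("unified = %.3f cycles\n", unified);  /* 2.245, i.e. 2.24 */
    return 0;
}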
Cache Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
Memory stall clock cycles = Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty
Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
Cache Performance
CPU time = IC x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time
Misses per instruction = Memory accesses per instruction x Miss rate
CPU time = IC x (CPI_execution + Misses per instruction x Miss penalty) x Clock cycle time
Cache Performance
• What is the impact of two different cache organizations (DM and SA cache) on the performance of a CPU?
• Assume
– CPI with a perfect cache: 2.0
– clock cycle time (= cache access time): 2 ns
– 1.3 memory references per instruction
– 64 KB cache, block size 32 bytes
– one cache direct mapped, the other two-way SA
– the CPU clock cycle time must be stretched 1.10 times to accommodate the selection MUX of the SA cache (2 ns x 1.1)
– cache miss penalty: 70 ns for either cache
– miss rate: direct mapped 1.4%, SA 1.0%
2-way Set Associative, Address to Select Word
[Figure: two-way set-associative cache datapath. The block address is split into a 22-bit tag and a 7-bit index, with a 5-bit block offset. The index selects one entry from each of the two sets of address tags and data RAM (valid <1>, tag <22>, data <64>). Both tags are compared (=?) in parallel, address bits select the correct data RAM, and a 2:1 MUX steers the data from the matching set. Writes go through a write buffer to the lower-level memory.]
Cache Performance
Answer
– Average access time = Hit time + Miss rate x Miss penalty
• Average access time_direct = 2 + (0.014 x 70) = 2.98 ns
• Average access time_2-way = 2 x 1.10 + (0.010 x 70) = 2.90 ns
– CPU performance
CPU time = IC x (CPI_ex + Misses per instruction x Miss penalty) x Clock cycle time
• CPU time_direct = IC x (2.0 x 2 + (1.3 x 0.014 x 70)) = 5.27 x IC
• CPU time_2-way = IC x (2.0 x 2 x 1.10 + (1.3 x 0.010 x 70)) = 5.31 x IC
– the direct-mapped cache leads to slightly better average performance (a quick check of the arithmetic follows below)
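A quick C check of this arithmetic; the helper is illustrative, not part of the lecture.

/* Sketch: CPU time per instruction (ns, per unit IC) for DM vs. 2-way. */
#include <stdio.h>

static double cpu_time_per_instr(double cpi, double cct_ns,
                                 double refs_per_instr, double miss_rate,
                                 double penalty_ns)
{
    return cpi * cct_ns + refs_per_instr * miss_rate * penalty_ns;
}

int main(void)
{
    double direct = cpu_time_per_instr(2.0, 2.0,        1.3, 0.014, 70.0);
    double sa2    = cpu_time_per_instr(2.0, 2.0 * 1.10, 1.3, 0.010, 70.0);
    printf("direct: %.2f ns x IC\n", direct); /* 5.27 */
    printf("2-way:  %.2f ns x IC\n", sa2);    /* 5.31 */
    return 0;
}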
Improving Cache Performance
• Average memory access time = Hit time + Miss rate x Miss penalty (ns or clocks)
• Improve performance by:
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Cache Performance Improvement:
Reducing Miss Rate
Reducing Misses
Classifying Misses: the 3 Cs
Compulsory: The first access to a block cannot find it in the cache, so the block must be brought into the cache. These are also called cold-start misses or first-reference misses. (Misses even in an infinite cache)
Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of this size)
Conflict: If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called collision misses or interference misses. (Misses in an N-way set-associative cache of this size)
3Cs Absolute Miss Rate
[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way (DM), 2-way, 4-way, and 8-way caches. Conflict misses shrink with associativity, capacity misses shrink with cache size, and the compulsory component at the bottom is small and nearly constant.]
2:1 Cache Rule on Set Associativity
2:1 cache rule of thumb: a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.
[Figure: the same miss-rate-per-type plot (1-way through 8-way, 1 KB to 128 KB cache sizes, conflict/capacity/compulsory components) illustrating the rule.]
3Cs Relative Miss Rate (Scaled to the Direct-Mapped Miss Ratio)
[Figure: miss rate per type as a percentage of the 1-way miss rate (0% to 100%) vs. cache size (1 KB to 128 KB). As associativity grows (2-way, 4-way, 8-way), the conflict share shrinks; capacity and compulsory misses make up the remainder.]
How Can We Reduce Misses?
• Change block size? Which of the 3 Cs is affected?
• Change associativity? Which of the 3 Cs is affected?
• Change the compiler? Which of the 3 Cs is affected?
1. Reduce Misses via Larger Block Size
[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache capacities of 1 KB, 4 KB, 16 KB, 64 KB, and 256 KB.]
For larger caches, the miss rate falls as the block size increases. But the block size should not be large relative to the cache capacity; that will instead hurt the miss rate.
2. Reduce Misses via Higher Associativity
• 2:1 Cache Rule:
– Miss rate of a DM cache of size N is approximately the miss rate of a 2-way SA cache of size N/2
• Beware: execution time is the only final measure!
– Will the clock cycle time increase?
– Hill [1988] suggested hit time increases of +10% for TTL or ECL board-level external caches, and +2% for CMOS internal caches, for 2-way vs. 1-way
Example: AMAT vs. Miss Rate
Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way vs. the direct-mapped CCT; 32 bytes/block

Cache       Average Memory Access Time
Size (KB)   1-way   2-way   4-way   8-way
1           7.56    6.60    6.22    5.44
2           5.09    4.90    4.62    4.09
4           4.60    3.95    3.57    3.19
8           3.30    3.00    2.87    2.59
16          2.45    2.20    2.12    2.04
32          2.00    1.80    1.77    1.79
64          1.70    1.60    1.57    1.59
128         1.50    1.45    1.42    1.44

For small caches, AMAT is improved by more associativity; for large caches (32 KB and up), AMAT is not improved significantly by more associativity.
3. Reducing Misses via Victim Cache
• How to combine the fast hit time of Direct Mapped yet still avoid conflict misses?
• Add a small buffer (the victim cache) to hold data discarded from the cache
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache
[Figure: CPU datapath with a direct-mapped cache and a small fully associative victim cache beside it; both tag comparators (=?) can satisfy the access, with a write buffer to the lower-level memory. A tags-only sketch of the policy follows below.]
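A minimal tags-only C sketch of a direct-mapped cache backed by a small FIFO victim cache; the sizes, names, and FIFO replacement are assumptions, and data movement is omitted.

#include <stdint.h>
#include <stdbool.h>

#define DM_BLOCKS 128
#define VICTIMS   4

struct dline { bool valid; uint64_t tag;   };  /* DM cache: tags only    */
struct vline { bool valid; uint64_t block; };  /* victim: full block addr */
static struct dline dm[DM_BLOCKS];
static struct vline vc[VICTIMS];
static int vc_next;                            /* simple FIFO pointer     */

bool lookup(uint64_t block_addr)
{
    int      idx = (int)(block_addr % DM_BLOCKS);
    uint64_t tag = block_addr / DM_BLOCKS;

    if (dm[idx].valid && dm[idx].tag == tag)
        return true;                            /* fast DM hit            */

    for (int i = 0; i < VICTIMS; i++)           /* check the victim cache */
        if (vc[i].valid && vc[i].block == block_addr) {
            struct dline old = dm[idx];         /* swap victim <-> DM     */
            dm[idx].valid = true; dm[idx].tag = tag;
            vc[i].valid = old.valid;
            vc[i].block = old.valid ? old.tag * DM_BLOCKS + idx : 0;
            return true;                        /* slow hit, not a miss   */
        }

    if (dm[idx].valid) {                        /* real miss: displaced   */
        vc[vc_next].valid = true;               /* block becomes a victim */
        vc[vc_next].block = dm[idx].tag * DM_BLOCKS + idx;
        vc_next = (vc_next + 1) % VICTIMS;
    }
    dm[idx].valid = true; dm[idx].tag = tag;
    return false;
}

On a victim-cache hit the block is swapped back into the direct-mapped slot, so the next access to it is a fast hit.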
4. Reducing Misses via Pseudo-Associativity
• How to combine the fast hit time of Direct Mapped and the lower conflict misses of a 2-way SA cache?
• Divide the cache: on a DM miss, check the other half of the cache to see if the block is there; if so, we have a pseudo-hit (slow hit)
Hit Time < Pseudo-Hit Time < Miss Penalty
• Drawback: CPU pipeline design is hard if a hit takes 1 or 2 cycles
– Better for caches not tied directly to the processor
A sketch of the probe sequence follows below.
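A sketch of the probe sequence, assuming the common choice of flipping the most significant index bit to find the "other half"; the names, sizes, and tags-only state are illustrative.

#include <stdint.h>
#include <stdbool.h>

#define BLOCKS 256                     /* direct-mapped array             */

struct pline { bool valid; uint64_t block; };  /* full block address      */
static struct pline c[BLOCKS];

/* Result: 0 = fast hit, 1 = pseudo (slow) hit, -1 = miss. */
int pa_lookup(uint64_t block_addr)
{
    int idx = (int)(block_addr % BLOCKS);
    if (c[idx].valid && c[idx].block == block_addr)
        return 0;                              /* regular DM hit          */

    int alt = idx ^ (BLOCKS >> 1);             /* flip MSB of the index   */
    if (c[alt].valid && c[alt].block == block_addr)
        return 1;                              /* pseudo-hit: extra cycle */

    return -1;                                 /* true miss: lower level  */
}

Real designs typically also swap the two blocks on a pseudo-hit so that the most recently used block sits in the fast slot.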
5. Reducing Misses by HW Prefetching
• Prefetch into the cache or into an external buffer, e.g., instruction prefetching
– Alpha 21064 fetches 2 consecutive blocks on a miss
– The extra prefetched block is placed in an instruction stream buffer
– On a miss, check the instruction stream buffer
• Works with data blocks too:
– Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4 KB cache; 4 streams caught 43%
– Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64 KB 4-way set-associative caches
• Prefetching relies on extra memory bandwidth that can be used without penalty
6. Reducing Misses by Compiler Prefetching of Data
• Prefetch instructions request the data in advance
• Data prefetch
– Register Prefetch: load the value of the data into a register (HP PA-RISC loads)
– Cache Prefetch: load the data into the cache (MIPS IV, PowerPC, SPARC v.9)
– Special prefetching instructions cannot cause faults; a non-faulting or nonbinding prefetch doesn't change register or memory contents and cannot cause exceptions -- a form of speculative execution
– Makes sense only if a nonblocking (lockup-free) cache is used, i.e., the processor can proceed while waiting for the prefetched data
• Issuing prefetch instructions takes time
– Is the cost of prefetch issues < the savings in reduced misses?
Big Picture
The challenge in designing memory hierarchies is that every change that potentially improves the miss rate can also negatively affect overall performance. This combination of positive and negative effects is what makes the design of a memory hierarchy challenging.

Design change       Effect on miss rate               Possible negative performance effect
Increase size       Decreases capacity misses         May increase access time
Increase assoc.     Decreases miss rate due to        May increase access time
                    conflict misses
Increase blk size   Decreases miss rate for a wide    May increase miss penalty
                    range of block sizes
Compiler Prefetching of Data
a and b are 8 bytes long -> 2 data elements per block

for (i = 0; i < 3; i = i+1)
  for (j = 0; j < 100; j = j+1)
    a[i,j] = b[j,0] * b[j+1,0];

Analysis
– Spatial locality: exists in a but not in b
– Misses due to a: a miss on every other a[i,j] (2 data per block), over
a[0,0],a[0,1],...,a[0,99],a[1,0],a[1,1],...,a[1,99],a[2,0],...,a[2,99]
• Total: 3 x 100 / 2 = 150 misses
– Misses due to b (ignoring potential conflict misses):
• 100 misses on b[j+1,0] when i=0
• 1 miss on b[j,0] when i=0 and j=0
i=0: b[0,0] b[1,0] | b[1,0] b[2,0] | b[2,0] b[3,0] | ... | b[99,0] b[100,0] -- the first access to each new block of b misses
• every access to b hits when i=1 and i=2
• Total: 101 misses
– Total misses: 251
Compiler Prefetching of Data
Assume that the miss penalty is so large that we need to prefetch at least seven iterations in advance.
Split the iterations of the i-loop: i=0 (a j-loop that prefetches all of b), and the remaining iterations (i=1,2):

for (j = 0; j < 100; j = j+1) {
  prefetch(b[j+7,0]);    /* PF: b[7,0],b[8,0],...,b[100,0] */
  prefetch(a[0,j+7]);    /* PF: a[0,7],a[0,8],...,a[0,99] */
  a[0,j] = b[j,0] * b[j+1,0];
}
This loop still takes 7/2 (= 4) misses on a (a[0,0]..a[0,6]) and 7 misses on b (b[0,0]..b[6,0])

for (i = 1; i < 3; i = i+1)
  for (j = 0; j < 100; j = j+1) {
    prefetch(a[i,j+7]);  /* PF: a[1,7],...,a[1,99],...,a[2,99] */
    a[i,j] = b[j,0] * b[j+1,0];
  }
This loop takes 7/2 x 2 (= 8) misses on a

Total number of misses: 7/2 x 3 (= 12) misses on a and 7 misses on b = 19 misses
7. Reducing Misses by Compiler Optimizations
• Instructions
– Reorder procedures in memory so as to reduce misses
– Use profiling to look at conflicts
– McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks
• Data
– Merging Arrays: improve spatial locality with a single array of compound elements vs. 2 arrays
– Loop Interchange: change the nesting of loops to access data in the order stored in memory
– Loop Fusion: combine 2 independent loops that have the same looping and some variables in common
– Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
Merging Arrays Example

/* Before */
int val[SIZE];
int key[SIZE];

/* After */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key
Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];
Processing sequence: x[0,0],x[1,0],...,x[n,0], x[0,1],x[1,1],...,x[n,1], ..., x[0,m],...,x[n,m] -- no spatial locality

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];
Processing sequence: x[0,0],x[0,1],...,x[0,m], x[1,0],x[1,1],...,x[1,m], ..., x[n,0],...,x[n,m] -- spatial locality

Sequential accesses instead of striding through memory every 100 words
Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];
2 misses per access to a & c

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }
1 miss per access to a & c
Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  }
[Figure: access patterns of x (indexed by i,j), y (by i,k), and z (by k,j) in this matrix multiply.]
• Two inner loops:
– Read all NxN elements of z[]
– Read N elements of 1 row of y[] repeatedly
– Write N elements of 1 row of x[]
• Capacity misses are a function of N and the cache size:
– If the cache can store all 3 NxN matrices, there are no capacity misses, only possible conflict misses
• Idea: compute on a BxB sub-matrix
Blocking Example
Blocking is useful to avoid capacity misses when dealing with large arrays.

/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B-1,N); j = j+1) {
        r = 0;
        for (k = kk; k < min(kk+B-1,N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }
[Figure: the same x, y, z access patterns, now confined to BxB sub-blocks.]
• Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
• B is called the Blocking Factor (B = 3 in the figure)
• Conflict misses too?
Reducing Conflict Misses by Blocking
Blocking reduces capacity misses; conflict misses can also be reduced by increasing associativity, or by choosing a blocking factor well below the cache capacity.
[Figure: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a direct-mapped cache and a fully associative cache; the direct-mapped cache suffers conflict misses at blocking factors where the fully associative cache does not.]
Impact of conflict misses (in caches that are not fully associative) on the blocking factor:
– Lam et al. [1991]: a blocking factor of 24 had one fifth the misses of 48, despite both fitting in the cache
Summary: Compiler Optimization to Reduce Cache Misses
[Figure: performance improvement (1x to 3x) from merged arrays, loop interchange, loop fusion, and blocking on vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress.]
Summary
• 3 Cs: Compulsory, Capacity, Conflict misses
• Reducing Miss Rate
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Misses by Compiler Optimizations
• Remember the danger of concentrating on just one parameter when evaluating performance
Cache Performance Improvement:
Reducing Miss Penalty
Reducing Miss Penalty:
1. Read Priority over Write on Miss
• Write-through caches with write buffers create RAW conflicts between the write buffer and main memory reads on cache misses
– Simply waiting for the write buffer to empty might increase the read miss penalty by 50% (old MIPS 1000)
– Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue
• Write Back?
– A read miss may replace a dirty block
– Normal: write the dirty block to memory, and then do the read
– Instead: copy the dirty block to a write buffer, then do the read, and then do the write
– The CPU stalls less, since it restarts as soon as the read is done
Reducing Miss Penalty:
2. Sub-block Placement
• Don't have to load the full block on a miss
• Keep a valid bit per sub-block (a sketch follows below)
• (Originally invented to reduce tag storage)

Tag    Sub-block valid bits
100    1 1 1 1
300    1 1 0 0
200    0 1 0 1
204    0 0 0 0
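A small C sketch of per-sub-block valid bits; the layout and names are assumed, and on a miss only the requested sub-block is fetched.

#include <stdint.h>
#include <stdbool.h>

#define SUBBLOCKS 4                    /* 4 sub-blocks per block, as above */

struct sline {
    uint64_t tag;
    bool     valid[SUBBLOCKS];         /* one valid bit per sub-block      */
};

/* Hit only if the tag matches AND the addressed sub-block is valid. */
bool sub_hit(const struct sline *l, uint64_t tag, int sub)
{
    return l->tag == tag && l->valid[sub];
}

/* On a miss, fetch just the missing sub-block, so the miss penalty is
 * one sub-block transfer rather than a whole-block transfer. */
void sub_fill(struct sline *l, uint64_t tag, int sub)
{
    if (l->tag != tag) {               /* different block: invalidate all  */
        l->tag = tag;
        for (int i = 0; i < SUBBLOCKS; i++)
            l->valid[i] = false;
    }
    l->valid[sub] = true;              /* mark only this sub-block valid   */
}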
Reducing Miss Penalty:
3. Early Restart and Critical Word First
• Don't wait for the full block to be loaded before restarting the CPU
– Early Restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution while loading the remainder of the block
– Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
• Generally useful only with large blocks
• Spatial locality is a problem: the CPU tends to want the next sequential word next, so it is not clear how much early restart benefits
Reducing Miss Penalty:
4. Non-blocking Caches to Reduce Stalls on Misses
• A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss
• "hit under miss" reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the CPU
• "hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
– Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
Value of Hit under Miss for SPEC
[Figure: normalized average memory access time per SPEC92 benchmark under hit-under-0 (base, blocking cache), hit-under-1, hit-under-2, and hit-under-64 misses; integer benchmarks (espresso, eqntott, xlisp, compress, ...) on the left, floating point (ora, spice2g6, nasa7, alvinn, hydro2d, mdljdp2, wave5, su2cor, doduc, swm256, tomcatv, fpppp, ear, mdljsp2) on the right.]
• FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
• Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
• 8 KB Data Cache, Direct Mapped, 32B blocks, 16-cycle miss penalty
Reducing Miss Penalty:
5. Second Level Cache
[Figure: CPU -> L1 -> L2 -> Memory]
Reducing Miss Penalty:
5. Second Level Cache
• L2 Equations (see the sketch below)
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
• Definitions:
– Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
– Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
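The equations as a runnable C sketch, using the L1 miss rate and local L2 miss rate from the example on the next slide; the hit times and memory penalty are assumed for illustration.

/* Sketch: two-level AMAT from the equations above. */
#include <stdio.h>

static double amat2(double hit1, double mr1,
                    double hit2, double mr2_local, double pen2)
{
    double miss_penalty1 = hit2 + mr2_local * pen2;   /* L1 miss penalty */
    return hit1 + mr1 * miss_penalty1;
}

int main(void)
{
    /* Assumed: 1-cycle L1 hit, 10-cycle L2 hit, 100-cycle memory;
     * miss rates (4% L1, 50% local L2) from the example that follows. */
    printf("AMAT = %.2f cycles\n", amat2(1.0, 0.04, 10.0, 0.50, 100.0));
    /* = 1 + 0.04 x (10 + 0.50 x 100) = 3.40 cycles */
    return 0;
}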
Second Level Cache: An Example
• Example:
– 1000 memory references
– 40 misses in L1
– 20 misses in L2
• the miss rate (either local or global) for L1: 40/1000 = 4%
• the local miss rate for L2: 20/40 = 50%
• the global miss rate for L2: 20/1000 = 2%
Comparing Local and Global Miss Rates
• 32 KByte 1st-level cache; increasing 2nd-level cache size
• The global miss rate is close to the single-level cache miss rate, provided L2 >> L1
• Don't use the local miss rate: it is a function of the L1 miss rate; use the L2 global miss rate instead
• L2 is not tied to the CPU clock cycle, so it affects only the L1 miss penalty, not the CPU clock cycle
• In L2 design, "will it lower AMAT, and how much will it cost?" must be considered
• Generally we want fast hit times and fewer misses; since L2 hits are few, the target shifts to miss-rate reduction
[Figure: L1 and L2 miss rates vs. L2 cache size, shown on linear and log scales.]
Reducing Misses: Which Apply to L2 Cache?
Reducing Miss Rate
1. Reduce Misses via Larger Block Size
2. Reduce Conflict Misses via Higher Associativity
3. Reducing Conflict Misses via Victim Cache
4. Reducing Conflict Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Capacity/Conf. Misses by Compiler Optimizations
The Impact of L2 Cache Associativity on the Miss Penalty
• Does set associativity make more sense for L2 caches?
– two-way set associativity increases the hit time by 10% of a clock cycle (0.1 clock cycle)
– hit time_L2 for direct mapped = 10 cycles
– local miss rate_L2 for direct mapped = 25%
– local miss rate_L2 for 2-way SA = 20%
– miss penalty_L2 = 50 cycles
• the first-level cache miss penalty
(Miss penalty_L1 = Hit Time_L2 + Miss rate_L2 x Miss penalty_L2)
Miss penalty 1-way_L1 = 10 + 25% x 50 = 22.5 clock cycles
Miss penalty 2-way_L1 = 10.1 + 20% x 50 = 20.1 clock cycles
* worst case (the hit time must be an integral number of clocks, so round 10.1 up to 11):
Miss penalty 2-way_L1 = 11 + 20% x 50 = 21.0 clock cycles
L2 Cache Block Size and AMAT
[Figure: relative CPU execution time vs. L2 block size for a 512 KB L2 cache: 1.95 at 16 bytes, 1.54 at 32, 1.36 at 64, 1.28 at 128, 1.27 at 256, and 1.34 at 512; 64-256 bytes are the popular sizes. In a small L2 cache, conflict misses would increase as the block size grows, but this L2 capacity is large. Memory bus = 32 bits; the long memory access time is 1 clock to send the address plus 6 clocks to access the data, transferring 1 word per clock.]
L2 Cache Block Size
• Multilevel Inclusion Property
– All data in the first-level cache are always in the second-level cache
– Consistency between I/O and caches (or between caches in multiprocessors) can be determined by checking the L2 cache
– Drawback: different block sizes between caches at different levels
• Usually, the smaller cache has smaller cache blocks
• Thus, different block sizes for L1 and L2
• To still maintain the multilevel inclusion property, a miss requires complex invalidations and a non-blocking secondary cache
Reducing Miss Penalty Summary
• Five techniques
– Read priority over write on miss
– Sub-block placement
– Early Restart and Critical Word First on miss
– Non-blocking Caches (Hit under Miss)
– Second Level Cache
• Can be applied recursively to Multilevel Caches
– The danger is that the time to access DRAM will increase with multiple levels in between
Cache Performance Improvement:
Fast Hit Time
Fast Hit Times:
1. Small and Simple Caches
• Most of the cache hit time is spent reading the tag memory using the INDEX part of the address, and comparing
• Smaller and simpler hardware is faster
– Smaller cache
• On-chip cache
• On-chip tag, off-chip data
– Simple cache
• A direct-mapped cache allows the data to be transmitted while the tag is checked
• Alpha 21164 has an 8 KB instruction cache and an 8 KB data cache + a 96 KB second-level cache
Fast Hit Times:
2. Fast Hits by Avoiding Address Translation
• Send the virtual address to the cache? Called a Virtually Addressed Cache or just Virtual Cache, vs. a Physical Cache
– Make the common case fast
• The common case is a hit, so this eliminates the virtual address translation on hits
– Every time a process is switched, the cache must logically be flushed; otherwise we get false hits on the same virtual address
• Cost: the time to flush + compulsory misses from the empty cache
– Must deal with aliases (sometimes called synonyms): the OS as well as user programs can map two different virtual addresses to the same physical address
– I/O (which uses physical addresses) must interact with the virtually addressed cache
Solutions to Cache Flush and Aliases
• Solution to aliases
– HW anti-aliasing: guarantee that every cache block has a unique physical address
– SW: guarantee that the lower n bits of all aliases are the same, where n bits are enough to cover the index field. In a direct-mapped cache the aliases then map to a unique block; page coloring makes the least significant several bits of the physical and virtual addresses identical
• Solution to cache flush
– Add a process identifier (PID) to the address tag, so that it identifies the process as well as the address within the process: the wrong process cannot get a hit (a sketch follows below)
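A minimal C sketch of a PID-extended tag match; the field widths and names are assumptions.

#include <stdint.h>
#include <stdbool.h>

struct vline {
    bool     valid;
    uint8_t  pid;                     /* process identifier in the tag  */
    uint64_t vtag;                    /* virtual address tag            */
};

/* A hit now requires the PID to match too, so a context switch only
 * needs to change current_pid instead of flushing the whole cache. */
bool vhit(const struct vline *l, uint8_t current_pid, uint64_t vtag)
{
    return l->valid && l->pid == current_pid && l->vtag == vtag;
}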
Conventional (Physically Addressed) Cache
1. Translate the virtual address into a physical address
2. Then access the cache with the physical address
[Figure: CPU -> VA -> TB -> PA -> Cache; on a miss, PA -> MEM]
Hit time is slow: translation sits on the hit path
Alternative: Virtually Addressed Cache
1. Access the cache with the virtual address (the tags hold virtual addresses)
2. On a miss, translate the virtual address for the memory access
[Figure: CPU -> VA -> Cache (VA tags); on a miss, VA -> TB -> PA -> MEM]
• Synonym problem
• Higher miss penalty: translation sits on the miss path
Cache Access and VA Translation Overlapping
1. Access the cache and translate the virtual address simultaneously: the virtual page address goes through the TB while the page offset indexes the cache
2. On a miss, access the L2 cache with the physical address
[Figure: virtual address = virtual page address (VPA) + page offset; the TB translates the VPA into a physical page address (PPA) while the page offset indexes the cache; on a miss, PPA + page offset go to the L2 cache and memory.]
Requires the cache index to remain invariant across translation, i.e., the index must be part of the physical (page-offset) bits of the address
2. Avoiding Translation: Process ID Impact
[Figure: miss rate (0% to 20%) vs. cache size (2 KB to 1024 KB), direct-mapped cache, 16-byte blocks, Ultrix running on a VAX:
• Green (uniprocess): no process switches
• Red (multiprocess with PIDs): process switches, using PIDs instead of flushing the cache
• Blue (multiprocess without PIDs): process switches simply flush the cache]
2. Avoiding Translation:
Index with the Physical Portion of the Address
• If the INDEX is in the physical part of the address (the page offset), the tag access can start in parallel with address translation, and the comparison is then made against the physical tag
[Figure: bits 31..12 are the page address, translated into the physical address tag; bits 11..0 are the page offset, which holds the index and block offset; tag access proceeds in parallel with address translation.]
• This limits the cache to the page size (times the associativity): what if we want bigger caches using the same trick? (see the sketch below)
– Higher associativity: more cache per index, so the index still fits in the page offset
– Page coloring: the OS makes the last few bits of the virtual and physical page addresses identical, so only the remaining bits need to be compared
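A tiny C sketch of the constraint, with illustrative sizes: the untranslated page-offset bits can supply the index only if cache size / associativity <= page size.

#include <stdio.h>

/* Index + block offset fit in the page offset iff this holds. */
static int index_fits_in_page(int cache_size, int assoc, int page_size)
{
    return cache_size / assoc <= page_size;
}

int main(void)
{
    printf("8 KB DM, 4 KB pages:    %d\n",
           index_fits_in_page(8192, 1, 4096));  /* 0: index needs VA bits */
    printf("8 KB 2-way, 4 KB pages: %d\n",
           index_fits_in_page(8192, 2, 4096));  /* 1: associativity helps */
    return 0;
}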
Fast Hit Times:
3. Fast Hit Times via Pipelined Writes
• Pipeline the tag check and the cache update as separate stages: the current write's tag check overlaps the previous write's cache update
• Only writes are in the pipeline; it is empty during a miss
[Figure: the CPU address is checked against the tag (=?) while the previous write's data waits in a delayed write buffer; a MUX steers either the delayed write or the current data into the cache data array; misses go through the write buffer to the lower-level memory.]
The delayed write buffer must be checked on reads: either complete the pending write first or read from the buffer
Fast Hit Times:
4. Fast Writes on Misses via Small Sub-blocks
• If most writes are 1 word, the sub-block size is 1 word, and the cache is write through, then always write the sub-block and tag immediately:
– Tag match and valid bit already set: writing the block was proper, and nothing is lost by setting the valid bit again
– Tag match and valid bit not set: the tag match means this is the proper block; writing the data into the sub-block makes it appropriate to turn the valid bit on
– Tag mismatch: this is a miss and will modify the data portion of the block. As this is a write-through cache, however, no harm was done; memory still has an up-to-date copy of the old value. Only the tag (to the address of the write) and the valid bits of the other sub-blocks need to be changed, because the valid bit for this sub-block has already been set
• Doesn't work with write back, due to the last case
Cache Optimization Summary

Technique                                 MR  MP  HT  Complexity
Larger Block Size                         +   -       0
Higher Associativity                      +       -   1
Victim Caches                             +           2
Pseudo-Associative Caches                 +           2
HW Prefetching of Instr/Data              +           2
Compiler Controlled Prefetching           +           3
Compiler Reduce Misses                    +           0
Priority to Read Miss over Write Miss         +       1
Sub-block Placement                           +   +   1
Early Restart & Critical Word First           +       2
Non-Blocking Caches                           +       3
Second Level Caches                           +       2
Small & Simple Caches                     -       +   0
Avoiding Address Translation                      +   2
Pipelining Writes for Fast Write Hit              +   1

(MR = miss rate, MP = miss penalty, HT = hit time; + helps, - hurts)
What is the Impact of What You've Learned About Caches?
[Figure: relative performance, 1980-2000, log scale: CPU performance grows far faster than DRAM, opening a widening processor-memory gap.]
• 1960-1985: Speed = f(no. of operations)
• 1995: Speed = f(non-cached memory accesses), despite
– Pipelined execution & fast clock rates
– Out-of-order completion
– Superscalar instruction issue
• What does this mean for
– Compilers? Operating Systems? Algorithms? Data Structures?
Cross-Cutting Issues
• Parallel execution vs. cache locality
– Want wide separation to find independent operations vs. reuse of data accesses to avoid misses
• I/O and consistency of data between cache and memory
– Caches => multiple copies of data
– Consistency by HW or by SW?
– Where to connect I/O to the computer?
Alpha 21064
• Separate Instruction & Data TLBs & Caches
• TLBs fully associative
• TLB updates in SW ("Priv Arch Libr")
• Caches 8 KB direct mapped
• Critical 8 bytes first
• Prefetch instruction stream buffer
• 2 MB L2 cache, direct mapped (off-chip)
• 256-bit path to main memory, 4 x 64-bit modules
Alpha Memory Performance: Miss Rates
[Figure: miss rates (log scale, 0.01% to 100%) for the 8 KB instruction cache, 8 KB data cache, and 2 MB L2 cache across SPEC92 benchmarks (Su2cor, Nasa7, Mdljp2, Hydro2d, Spice, Alvinn, Tomcatv, Wave5, Doduc, Swm256, Ear, Fpppp, Ora, Compress, Mdljsp2, Gcc, Sc, Eqntott, Li, Espresso), AlphaSort, and TPC-B (db1, db2).]
Alpha CPI Components
Primary limitations of performance:
– Instruction stall: branch mispredict
– Other: compute + register conflicts, structural conflicts
[Figure: CPI (0.0 to 4.5) broken into L2, I$, D$, I-stall, and Other components across the same benchmarks; the commercial workloads (AlphaSort, TPC-B) show the largest memory-related CPI.]
Pitfall: Predicting Cache Performance from a Different Program (ISA, compiler, ...)
• The miss rate varies greatly depending on the program:
– 4 KB data cache miss rate: 8%, 12%, or 28%?
– 1 KB instruction cache miss rate: 0%, 3%, or 10%?
– Alpha vs. MIPS for 8 KB data caches: 17% vs. 10%
[Figure: miss rate (0% to 35%) vs. cache size (1 KB to 128 KB) for instruction (I) and data (D) caches of tomcatv, gcc, and espresso; D-cache miss rates vary far more across programs than I-cache miss rates.]
Pitfall: Simulating Too Small an Address Trace
[Figure: cumulative average memory access time (1 to 4.5) vs. instructions executed (0 to 12 billion) for SOR (FORTRAN), Tree (Scheme), Multi (multiprogrammed workload), and TV (Pascal); short traces badly mispredict long-run behavior.]