Multilevel Caches


Multilevel Memory Caches
Prof. Sirer
CS 316
Cornell University
Storage Hierarchy

    Technology        Capacity   Cost/GB     Latency
    Tape              1 TB       $0.17       100 s
    Disk              300 GB     $0.34       4 ms
    DRAM              4 GB       $520        20 ns
    SRAM (off chip)   512 KB     $123,000    5 ns
    SRAM (on chip)    16 KB      ???         2 ns

Capacity and latency are closely coupled; cost is inversely proportional.
How do we create the illusion of large and fast memory?
Memory Hierarchy
Principle: Hide latency using small, fast memories called caches.
Caches exploit locality (a small C sketch follows):
- Temporal locality: if a memory location is referenced, it is likely to be referenced again in the near future.
- Spatial locality: if a memory location is referenced, other locations near it will be referenced in the near future.
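To make the two kinds of locality concrete, here is a minimal C sketch (not from the slides; the array size is arbitrary): the running sum is reused on every iteration (temporal locality), while the array is walked one element at a time (spatial locality).

    #include <stdio.h>

    #define N 1024

    int main(void) {
        int a[N];
        long sum = 0;

        /* Spatial locality: consecutive elements of a[] sit next to each
           other in memory, so one fetched cache block serves several
           iterations. */
        for (int i = 0; i < N; ++i)
            a[i] = i;

        /* Temporal locality: sum and the loop counter are touched on every
           iteration, so they stay resident in the cache (or registers). */
        for (int i = 0; i < N; ++i)
            sum += a[i];

        printf("%ld\n", sum);
        return 0;
    }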
Cache Lookups (Read)
Look at the address issued by the processor, search the cache tags to see if that block is in the cache.
- Hit: the block is in the cache; return the requested data.
- Miss: the block is not in the cache; read the line from memory, evict an existing line from the cache, place the new line in the cache, and return the requested data.
Cache Organization
Cache has to be fast and small:
- Gain speed by performing lookups in parallel; this requires die real estate.
- Reduce the hardware required by limiting where in the cache a block might be placed.
Three common designs:
- Fully associative: a block can be anywhere in the cache.
- Direct mapped: a block can only be in one line in the cache.
- Set-associative: a block can be in a few (2 to 8) places in the cache.
Tags and Offsets
Cache block size determines cache
organization
[Figure: a 32-bit virtual address split into two fields: the tag in bits 31..5 and the block offset in bits 4..0.]
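A minimal sketch of the field extraction the figure implies, assuming the 32-bit address and 5-bit offset (32-byte blocks) shown; the constant and variable names are illustrative:

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 5   /* 32-byte blocks, matching bits 4..0 in the figure */

    int main(void) {
        uint32_t addr = 0x12345678;

        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);  /* bits 4..0  */
        uint32_t tag    = addr >> OFFSET_BITS;               /* bits 31..5 */

        printf("addr=0x%08x  tag=0x%07x  offset=%u\n", addr, tag, offset);
        return 0;
    }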
Fully Associative Cache
[Figure: fully associative lookup. Each line stores a valid bit, a tag, and a data block; the address tag is compared against every stored tag in parallel (line select), a match on a valid line is encoded as a hit, and the offset selects the word/byte within the block.]
Direct Mapped Cache
[Figure: direct mapped lookup. The address is split into tag, index, and offset; the index selects exactly one line, that line's valid bit and stored tag are checked against the address tag, and the offset selects the word within the block.]
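A rough C model of the direct-mapped lookup in the figure, assuming 32-byte blocks and 128 lines (both illustrative choices, not fixed by the slides):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 32    /* bytes per block -> 5 offset bits   */
    #define NUM_LINES  128   /* lines in the cache -> 7 index bits */

    struct line {
        bool     valid;
        uint32_t tag;
        uint8_t  data[BLOCK_SIZE];
    };

    static struct line cache[NUM_LINES];

    /* Returns true on a hit and stores the requested byte in *byte;
       on a miss the caller would fetch the block from memory, fill the
       line, and retry. */
    bool lookup(uint32_t addr, uint8_t *byte) {
        uint32_t offset = addr % BLOCK_SIZE;
        uint32_t index  = (addr / BLOCK_SIZE) % NUM_LINES;
        uint32_t tag    = addr / BLOCK_SIZE / NUM_LINES;

        struct line *l = &cache[index];
        if (l->valid && l->tag == tag) {   /* valid bit set and tags match */
            *byte = l->data[offset];
            return true;                   /* hit */
        }
        return false;                      /* miss */
    }

    int main(void) {
        uint8_t b;
        /* Cold cache: every valid bit is 0, so the first access misses. */
        printf("hit=%d\n", lookup(0x1000, &b));
        return 0;
    }

Because the index alone picks the line, only one tag comparator is needed; a fully associative cache would instead compare the tag against every line in parallel.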
2-Way Set-Associative Cache
[Figure: 2-way set-associative lookup. The index selects a set of two lines; both lines' valid bits and tags are compared against the address tag in parallel, the matching way supplies the block, and the offset selects the word.]
Valid Bits
Valid bits indicate whether a cache line contains an up-to-date copy of the values in memory:
- Must be 1 for a hit.
- Reset to 0 on power up.
An item can be removed from the cache by setting its valid bit to 0.
Eviction
Which cache line should be evicted from the cache to make room for a new line?
- Direct-mapped: no choice, must evict the line selected by the index.
- Associative caches:
  - random: select one of the lines at random
  - round-robin: similar to random
  - FIFO: replace the oldest line
  - LRU: replace the line that has not been used in the longest time (a sketch follows below)
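A sketch of LRU replacement in a 2-way set-associative cache; the set count and the timestamp bookkeeping are illustrative only (real hardware tracks recency more cheaply):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_SETS 64
    #define WAYS      2

    struct way {
        bool     valid;
        uint32_t tag;
        uint64_t last_used;   /* time of most recent access */
    };

    static struct way sets[NUM_SETS][WAYS];
    static uint64_t now;      /* global access counter */

    /* Look up (set, tag); on a miss, evict the least recently used way. */
    bool access_set(uint32_t set, uint32_t tag) {
        struct way *s = sets[set];
        ++now;

        for (int w = 0; w < WAYS; ++w) {
            if (s[w].valid && s[w].tag == tag) {
                s[w].last_used = now;          /* hit: refresh recency */
                return true;
            }
        }

        /* Miss: prefer an invalid way, otherwise the oldest one. */
        int victim = 0;
        for (int w = 0; w < WAYS; ++w) {
            if (!s[w].valid) { victim = w; break; }
            if (s[w].last_used < s[victim].last_used)
                victim = w;
        }
        s[victim].valid     = true;
        s[victim].tag       = tag;
        s[victim].last_used = now;
        return false;
    }

    int main(void) {
        printf("%d", access_set(0, 1));   /* miss */
        printf("%d", access_set(0, 2));   /* miss */
        printf("%d", access_set(0, 1));   /* hit: refreshes tag 1 */
        printf("%d\n", access_set(0, 3)); /* miss: evicts tag 2, the LRU way */
        return 0;
    }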
Cache Writes
[Figure: the CPU sends an address and data to the cache (SRAM), which sits in front of main memory (DRAM).]
- No-Write: writes invalidate the cache and go to memory.
- Write-Through: writes go to main memory and the cache.
- Write-Back: write the cache; write main memory only when the block is evicted.
Dirty Bits and Write-Back Buffers
[Figure: each cache line holds a valid bit, a dirty bit, a tag, and a data block of bytes 0 through N.]
Dirty bits indicate which lines have been written.
Dirty bits enable the cache to handle multiple writes to the same cache line without having to go to memory.
Write-back buffer (a sketch follows below):
- A queue where dirty lines are placed.
- Items are added to the end as dirty lines are evicted from the cache.
- Items are removed from the front as memory writes are completed.
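A minimal write-back sketch of the behavior described above; the buffer depth, sizes, and function names are assumptions made for illustration, and memory traffic itself is not modeled:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 32
    #define WB_DEPTH    8    /* illustrative write-back buffer depth */

    struct line {
        bool     valid, dirty;
        uint32_t tag;
        uint8_t  data[BLOCK_SIZE];
    };

    /* FIFO of evicted dirty lines waiting to be written to memory. */
    struct wb_entry { uint32_t tag; uint8_t data[BLOCK_SIZE]; };
    static struct wb_entry wb_buf[WB_DEPTH];
    static int wb_head, wb_tail, wb_count;

    /* A write changes only the cached copy; the dirty bit records that
       memory is now stale, so repeated writes cost no memory traffic. */
    void cache_write(struct line *l, uint32_t offset, uint8_t value) {
        l->data[offset] = value;
        l->dirty = true;
    }

    /* Eviction: clean lines are dropped; dirty lines are appended to the
       write-back buffer. */
    void evict(struct line *l) {
        if (l->valid && l->dirty && wb_count < WB_DEPTH) {
            wb_buf[wb_tail].tag = l->tag;
            memcpy(wb_buf[wb_tail].data, l->data, BLOCK_SIZE);
            wb_tail = (wb_tail + 1) % WB_DEPTH;
            ++wb_count;
        }
        l->valid = l->dirty = false;
    }

    /* Entries leave the front of the buffer as memory writes complete. */
    bool wb_drain(struct wb_entry *out) {
        if (wb_count == 0) return false;
        *out = wb_buf[wb_head];
        wb_head = (wb_head + 1) % WB_DEPTH;
        --wb_count;
        return true;
    }

    int main(void) {
        struct line l = { .valid = true, .tag = 0x42 };
        cache_write(&l, 0, 0xAB);    /* first write marks the line dirty  */
        cache_write(&l, 1, 0xCD);    /* later writes still stay in cache  */
        evict(&l);                   /* eviction queues the dirty line    */
        struct wb_entry e;
        return wb_drain(&e) ? 0 : 1; /* 0: one buffered write drained     */
    }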
Misses
Three types of misses:
- Cold: the line is being referenced for the first time.
- Capacity: the line was evicted because the cache was not large enough.
- Conflict: the line was evicted because of another access whose index conflicted (a worked example follows).
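The index arithmetic behind a conflict miss, using an illustrative 4 KB direct-mapped cache (128 lines of 32 bytes): two addresses exactly one cache size apart share an index but differ in tag, so alternating between them evicts on every access even though the cache is mostly empty.

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 32
    #define NUM_LINES  128   /* 128 * 32 bytes = 4 KB direct-mapped cache */

    int main(void) {
        uint32_t a = 0x00001040;
        uint32_t b = a + BLOCK_SIZE * NUM_LINES;   /* 4 KB away from a */

        uint32_t index_a = (a / BLOCK_SIZE) % NUM_LINES;
        uint32_t index_b = (b / BLOCK_SIZE) % NUM_LINES;

        /* Both map to the same line: prints "index_a=2 index_b=2". */
        printf("index_a=%u index_b=%u\n", index_a, index_b);
        return 0;
    }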
Cache Design
Need to determine parameters (a sizing sketch follows):
- Block size
- Number of ways
- Eviction policy
- Write policy
- Separate I-cache from D-cache
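These parameters fix the address breakdown. A small sketch, with arbitrary example values (32 KB, 4-way, 32-byte blocks, 32-bit addresses), of how capacity, block size, and way count determine the offset, index, and tag widths:

    #include <stdio.h>

    /* Integer log2 for power-of-two values. */
    static unsigned log2u(unsigned x) {
        unsigned n = 0;
        while (x > 1) { x >>= 1; ++n; }
        return n;
    }

    int main(void) {
        unsigned capacity   = 32 * 1024;   /* 32 KB cache           */
        unsigned block_size = 32;          /* bytes per block       */
        unsigned ways       = 4;           /* 4-way set-associative */
        unsigned addr_bits  = 32;

        unsigned sets        = capacity / (block_size * ways);
        unsigned offset_bits = log2u(block_size);
        unsigned index_bits  = log2u(sets);
        unsigned tag_bits    = addr_bits - index_bits - offset_bits;

        /* Prints: sets=256 offset=5 index=8 tag=19 */
        printf("sets=%u offset=%u index=%u tag=%u\n",
               sets, offset_bits, index_bits, tag_bits);
        return 0;
    }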

Virtual vs. Physical Caches
[Figure: physically addressed cache. The CPU's address first goes through the MMU, then to the cache (SRAM) and main memory (DRAM); the cache works on physical addresses.]
[Figure: virtually addressed cache. The CPU's address goes straight to the cache, and the MMU translates on the way to main memory; the cache works on virtual addresses.]
L1 (on-chip) caches are typically virtual.
L2 (off-chip) caches are typically physical.
Cache Conscious Programming
    int a[NCOL][NROW];
    int sum = 0;
    for (i = 0; i < NROW; ++i)
        for (j = 0; j < NCOL; ++j)
            sum += a[j][i];

Speed up this program.
Cache Conscious Programming

    int a[NCOL][NROW];
    int sum = 0;
    for (i = 0; i < NROW; ++i)
        for (j = 0; j < NCOL; ++j)
            sum += a[j][i];

[Figure: array cells labeled with the order in which this loop nest touches them (1, 11, 2, 12, 3, 13, ...); successive accesses jump between cache blocks.]
Every access is a cache miss!
Cache Conscious Programming

    int a[NCOL][NROW];
    int sum = 0;
    for (j = 0; j < NCOL; ++j)
        for (i = 0; i < NROW; ++i)
            sum += a[j][i];

[Figure: array cells labeled with the order in which the transformed loop nest touches them (1, 2, 3, 4, 5, ...); successive accesses walk consecutive memory locations, so each fetched block serves several accesses.]
Same program, trivial transformation: 3 out of 4 accesses hit in the cache.
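A self-contained harness (not part of the slides; the array size and timing method are arbitrary choices) that runs both traversal orders back to back. On most machines the unit-stride version is several times faster once the array is much larger than the cache:

    #include <stdio.h>
    #include <time.h>

    #define NCOL 4096
    #define NROW 4096

    static int a[NCOL][NROW];   /* ~64 MB: far larger than any cache */

    int main(void) {
        long sum = 0;
        clock_t t;

        /* Original order: the inner loop strides NROW ints through memory,
           touching a different cache block on every access. */
        t = clock();
        for (int i = 0; i < NROW; ++i)
            for (int j = 0; j < NCOL; ++j)
                sum += a[j][i];
        printf("strided order:     %.2f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        /* Transformed order: the inner loop walks consecutive ints, so each
           fetched block serves several iterations. */
        t = clock();
        for (int j = 0; j < NCOL; ++j)
            for (int i = 0; i < NROW; ++i)
                sum += a[j][i];
        printf("unit-stride order: %.2f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        return (int)sum;   /* use sum so the loops are not optimized away */
    }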