CS152: Computer Architecture and Engineering


EEM 486: Computer Architecture
Lecture 6: Memory Systems and Caches
The Big Picture: Where are We Now?
 The Five Classic Components of a Computer
• Processor (Control + Datapath)
• Memory
• Input
• Output
Lec 6.2
The Art of Memory System Design
 Workload or benchmark programs drive the processor's reference stream:
<op,addr>, <op,addr>, <op,addr>, <op,addr>, . . .   (op: i-fetch, read, write)
 The stream is served by a Cache (SRAM) backed by Main Memory (DRAM)
 Goal: optimize the memory system organization to minimize the average
memory access time for typical workloads
Lec 6.3
Technology Trends

Year | DRAM Size | Cycle Time
1980 | 64 Kb     | 250 ns
1983 | 256 Kb    | 220 ns
1986 | 1 Mb      | 190 ns
1989 | 4 Mb      | 165 ns
1992 | 16 Mb     | 145 ns
1995 | 64 Mb     | 120 ns

Capacity grew 1000:1 while cycle time improved only about 2:1!
Lec 6.4
Processor-DRAM Memory Gap
[Chart: performance (log scale, 1 to 1000) vs. year, 1980-2000]
 µProc ("Moore's Law"): 60%/yr. (2X/1.5 yr)
 DRAM ("Less' Law?"): 9%/yr. (2X/10 yrs)
 Processor-Memory performance gap grows 50% / year
Lec 6.5
The Goal: illusion of large, fast, cheap memory
 Facts
• Large memories are slow but cheap (DRAM)
• Fast memories are small yet expensive (SRAM)
 How do we create a memory that is large, fast and cheap?
• Memory hierarchy
• Parallelism
Lec 6.6
The Principle of Locality
The principle of locality: Programs access a relatively small
portion of their address space at any instant of time
 Temporal Locality (Locality in Time)
=> If an item is referenced, it will tend to be referenced again soon
=> Keep most recently accessed data items closer to the processor
 Spatial Locality (Locality in Space)
=> If an item is referenced, nearby items will tend to be referenced soon
=> Move blocks of contiguous words to the upper levels
Q: Why does code have locality?
Lec 6.7
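A small simulation can make the payoff of locality concrete. This is a hedged sketch, not from the slides: a hypothetical 8-block direct-mapped cache with 4-word blocks, comparing a sequential array sweep against a large-stride walk.

```python
def hit_rate(addresses, num_blocks=8, block_words=4):
    """Fraction of word accesses that hit in a tiny direct-mapped cache."""
    cache = [None] * num_blocks            # one tag stored per cache block
    hits = 0
    for addr in addresses:
        block_addr = addr // block_words   # which memory block this word is in
        index = block_addr % num_blocks    # direct-mapped placement
        tag = block_addr // num_blocks
        if cache[index] == tag:
            hits += 1
        else:
            cache[index] = tag             # miss: fetch the whole block
    return hits / len(addresses)

sequential = list(range(64))               # walk an array word by word
strided = [i * 32 for i in range(64)]      # jump 32 words between accesses
print(hit_rate(sequential))                # spatial locality: 0.75
print(hit_rate(strided))                   # every access misses: 0.0
```

With 4-word blocks, each miss on the sequential stream brings in the next three words for free, so 3 of every 4 accesses hit; the strided stream touches a new block every time.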
Memory Hierarchy
 Based on the principle of locality
 A way of providing large, cheap, and fast memory
Level                             Speed (ns)                 Size (bytes)
Registers (in datapath)           1s                         100s
On-chip cache / L2 cache (SRAM)   10s                        Ks
Main memory (DRAM)                100s                       Ms
Secondary storage (disk)          10,000,000s (10s ms)       Gs
Tertiary storage (tape)           10,000,000,000s (10s sec)  Ts

Moving up the hierarchy, speed and $ per Mbyte increase while size decreases
Lec 6.8
Cache Memory
[Figure: the CPU exchanges words with the Cache; the Cache exchanges
blocks with Memory. The cache holds C blocks (lines), each with a tag
and K words of data; main memory holds 2^n blocks.]
Lec 6.9
Elements of Cache Design
 Cache size
 Mapping function
• Direct
• Set Associative
• Fully Associative
 Replacement algorithm
• Least recently used (LRU)
• First in first out (FIFO)
• Random
 Write policy
• Write through
• Write back
 Line size
 Number of caches
• Single or two level
• Unified or split
Lec 6.10
Terminology
 Hit: data appears in some block in the upper level
• Hit Rate: the fraction of memory accesses found in the upper level
• Hit Time: time to access the upper level which consists of
RAM access time + Time to determine hit/miss
[Figure: on a read hit, the processor requests X1 (1) and the cache
(upper level) returns it (2); the lower level (memory) is not accessed.]
Lec 6.11
Terminology
 Miss: data needs to be retrieved from a block
in the lower level
• Miss Rate = 1 - (Hit Rate)
• Miss Penalty: Time to replace a block in the upper level +
Time to deliver the block to the processor
 Hit Time << Miss Penalty
[Figure: on a read miss, the processor requests Xn (1); the cache
fetches Xn from the lower level (2), stores it (3), and delivers it to
the processor (4).]
Lec 6.12
Direct Mapped Cache
Each memory location is mapped to exactly one location
in the cache:
Cache block # = (Block address) modulo (# of cache blocks)
= Low order log2 (# of cache blocks) bits of the address
[Figure: an 8-block cache (indices 000-111). Memory blocks 00001, 01001,
10001, 11001 map to cache block 001, and 00101, 01101, 10101, 11101 map
to cache block 101, i.e. the low-order 3 bits of the block address.]
Lec 6.13
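The mapping rule above can be sketched in a few lines; the helper name and the 8-block cache size are illustrative, matching the figure.

```python
def cache_index(block_address, num_blocks):
    """Direct-mapped placement: the low-order log2(num_blocks) bits."""
    assert num_blocks & (num_blocks - 1) == 0, "expect a power of two"
    return block_address % num_blocks   # same as block_address & (num_blocks - 1)

# The figure's memory blocks 00001, 01001, 10001, 11001 all land in cache block 001:
for addr in (0b00001, 0b01001, 0b10001, 0b11001):
    print(cache_index(addr, 8))         # prints 1 each time
```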
64 KByte Direct Mapped Cache
[Figure: 32-bit byte address split into a 16-bit tag (bits 31-16), a
14-bit index (bits 15-2), and a 2-bit byte offset (bits 1-0). The index
selects one of 16K entries; each entry holds a valid bit, a 16-bit tag,
and a 32-bit data word. Comparing the address tag against the stored
tag produces the Hit signal.]
• Why do we need a Tag field?
• Why do we need a Valid bit field?
• What kind of locality are we taking care of?
 Total number of bits in a cache: 2^n x (|valid| + |tag| + |block|)
• 2^n : # of cache blocks
• |valid| = 1 bit
• |tag| = 32 - (n + 2), for a 32-bit byte address and 1-word blocks
• |block| = 32 bits
Lec 6.14
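The bit-count formula can be checked numerically. The function below is an illustrative sketch using the slide's assumptions (32-bit byte addresses, 1-word blocks):

```python
def cache_total_bits(n, block_bits=32, address_bits=32, offset_bits=2):
    """Total storage bits for a direct-mapped cache with 2^n one-word blocks."""
    tag_bits = address_bits - (n + offset_bits)   # |tag| = 32 - (n + 2)
    return 2 ** n * (1 + tag_bits + block_bits)   # valid + tag + data, per entry

# The 64 KByte cache above: 16K entries -> n = 14, tag = 32 - 16 = 16 bits
print(cache_total_bits(14))    # 2^14 x (1 + 16 + 32) = 802816 bits, about 98 KB
```

Note the overhead: 64 KB of data costs roughly 98 KB of actual storage once tags and valid bits are counted.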
Reading from Cache
 Address the cache using the PC (instruction fetch) or the ALU output (data access)
 If the cache signals hit, we have a read hit
• The requested word will be on the data lines
 Otherwise, we have a read miss
• stall the CPU
• fetch the block from memory and write into cache
• restart the execution
Lec 6.15
Writing to Cache
 Address the cache using the PC (instruction fetch) or the ALU output (data access)
 If the cache signals hit, we have a write hit
• We have two options:
- write-through: write the data into both cache and memory
- write-back: write the data only into cache and
write it into memory only when it is replaced
 Otherwise, we have a write miss
• Handle write miss as if it were a write hit
Lec 6.16
64 KByte Direct Mapped Cache
 Taking advantage of spatial locality
[Figure: 32-bit address split into a 16-bit tag (bits 31-16), a 12-bit
index (bits 15-4), a 2-bit block offset (bits 3-2), and a 2-bit byte
offset (bits 1-0). The index selects one of 4K entries; each entry
holds a valid bit, a 16-bit tag, and a 128-bit (4-word) block. The
block offset drives a 4-to-1 mux that selects one 32-bit word.]
Lec 6.17
Writing to Cache
 Address the cache using the PC (instruction fetch) or the ALU output (data access)
 If the cache signals hit, we have a write hit
• Write-through cache: write the data into both cache and memory
 Otherwise, we have a write miss
• stall the CPU
• fetch the block from memory and write into cache
• restart the execution and rewrite the word
Lec 6.18
Associativity in Caches
[Figure: an 8-block cache organized four ways:
• One-way set associative (direct mapped): 8 sets of 1 (Tag, Data) entry
• Two-way set associative: 4 sets of 2 entries
• Four-way set associative: 2 sets of 4 entries
• Eight-way set associative (fully associative): 1 set of 8 entries]
 Compute the set number:
(Block number) modulo (Number of sets)
 Choose one of the blocks in the computed set
Lec 6.19
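The set-computation rule above can be sketched as follows; the helper is illustrative, using the figure's 8-block cache and the block number 12 that reappears in the placement example later.

```python
def set_number(block_number, num_blocks, ways):
    """Set selection: (block number) modulo (number of sets)."""
    num_sets = num_blocks // ways       # sets = blocks / associativity
    return block_number % num_sets

print(set_number(12, 8, 1))   # direct mapped: 12 mod 8 = 4
print(set_number(12, 8, 2))   # two-way (4 sets): 12 mod 4 = 0
print(set_number(12, 8, 8))   # fully associative: only one set, so 0
```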
Set Associative Cache
 N-way set associative
• N direct mapped caches operate in parallel
• N entries for each cache index
• N comparators and an N-to-1 mux
• Data comes AFTER the Hit/Miss decision and set selection
[Figure: 32-bit address with a 22-bit tag (bits 31-10), an 8-bit index
(bits 9-2), and a 2-bit byte offset (bits 1-0). The index selects one
of 256 sets (0-255); each of the 4 ways holds a valid bit, a 22-bit
tag, and 32-bit data. Four comparators check the stored tags in
parallel, and a 4-to-1 multiplexor selects the hitting way's data to
produce Hit and Data.]
A four-way set associative cache
Lec 6.20
Fully Associative Cache
 A block can be anywhere in the cache => No Cache Index
 Compare the Cache Tags of all cache entries in parallel
 Practical only for a small number of cache blocks (one comparator per block)
[Figure: address split into a 27-bit Cache Tag (bits 31-5) and a Byte
Select (bits 4-0, e.g. 0x01) for 32-byte blocks. Each entry holds a
valid bit, a cache tag, and 32 bytes of cache data (Byte 0 ... Byte 31);
one comparator per entry checks all tags in parallel.]
Lec 6.21
Four Questions for Caches
 Q1: Block placement?
Where can a block be placed in the upper level?
 Q2: Block identification?
How is a block found if it is in the upper level?
 Q3: Block replacement?
Which block should be replaced on a miss?
 Q4: Write strategy?
What happens on a write?
Lec 6.22
Q1: Block Placement?
 Block 12 to be placed in an 8-block cache:
• Fully associative: any block (0-7)
• Direct mapped: only block (12 mod 8) = 4
• 2-way set associative (4 sets): any block in set (12 mod 4) = 0
Direct mapped: One place - (Block address) mod (# of cache blocks)
Set associative: A few places - (Block address) mod (# of cache sets)
# of cache sets = # of cache blocks/degree of associativity
Fully associative: Any place
Lec 6.23
Q2: Block Identification?
Block address = Tag | Index | Block offset
• Index: selects the set (set select)
• Tag: compared against the tags in the set to identify the block
• Block offset: selects the desired data within the block (data select)
Direct mapped: Indexing – index, 1 comparison
N-way set associative: Limited search – index the set, N comparison
Fully associative: Full search – search all cache entries
Lec 6.24
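The tag/index/offset decomposition above can be sketched in code. The concrete field widths below (8 index bits, 4 offset bits) are illustrative assumptions, not from the slide.

```python
def split_address(addr, index_bits, offset_bits):
    """Split a byte address into (tag, index, block offset) fields."""
    offset = addr & ((1 << offset_bits) - 1)               # lowest bits
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)               # remaining high bits
    return tag, index, offset

# Hypothetical 256-set cache with 16-byte blocks: 8 index bits, 4 offset bits
tag, index, offset = split_address(0x12345678, index_bits=8, offset_bits=4)
print(hex(tag), index, offset)    # 0x12345 103 8
```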
Q3: Replacement Policy on a Miss?
 Easy for Direct Mapped
 Set Associative or Fully Associative:
• Random: Randomly select one of the blocks in the set
• LRU (Least Recently Used): Select the block in the set which has been
unused for the longest time
Miss rates for LRU vs. Random replacement:

         2-way           4-way           8-way
Size     LRU    Random   LRU    Random   LRU    Random
16 KB    5.2%   5.7%     4.7%   5.3%     4.4%   5.0%
64 KB    1.9%   2.0%     1.5%   1.7%     1.4%   1.5%
256 KB   1.15%  1.17%    1.13%  1.13%    1.12%  1.12%
Lec 6.25
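LRU bookkeeping for one set can be sketched with an ordered dictionary. This is an illustrative software model, not how hardware does it (real caches approximate LRU with a few status bits per set).

```python
from collections import OrderedDict

def lru_misses(accesses, capacity):
    """Count misses for one set of `capacity` blocks under LRU replacement."""
    set_ = OrderedDict()                  # insertion order tracks recency
    misses = 0
    for block in accesses:
        if block in set_:
            set_.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1
            if len(set_) == capacity:
                set_.popitem(last=False)  # evict the least recently used block
            set_[block] = True
    return misses

print(lru_misses([0, 1, 2, 0, 3, 0], capacity=3))   # 4 (block 3 evicts block 1)
```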
Q4: Write Policy?
 Write through— The information is written to both the
block in the cache and to the block in the lower-level
memory
 Write back— The information is written only to the block
in the cache. The modified cache block is written to main
memory only when it is replaced
• requires a dirty bit: is the block clean or dirty?
 Pros and Cons of each?
• WT: read misses cannot result in writes to the lower level
• WB: repeated writes to the same block require only one write to memory
 WT always combined with write buffers to avoid
waiting for lower level memory
Lec 6.26
Cache Performance
CPU time = (CPU execution clock cycles +
Memory stall clock cycles) x Cycle time
Note: memory hit time is included in execution cycles
Stalls due to cache misses:
Memory stall clock cycles = Read-stall clock cycles + Write-stall clock cycles
Read-stall clock cycles = Reads x Read miss rate x Read miss penalty
Write-stall clock cycles = Writes x Write miss rate x Write miss penalty
If read miss penalty = write miss penalty:
Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
Lec 6.27
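The combined formula can be sketched directly; the access count, miss rate, and penalty below are hypothetical example values, chosen only to illustrate the arithmetic.

```python
def memory_stall_cycles(accesses, miss_rate, miss_penalty):
    """Memory stall cycles when read and write miss penalties are equal."""
    return accesses * miss_rate * miss_penalty

# e.g. 1,000,000 memory accesses, 2% miss rate, 100-cycle miss penalty
print(memory_stall_cycles(1_000_000, 0.02, 100))
```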
Cache Performance
CPU time = Instruction count x CPI x Cycle time
         = Inst count x Cycle time x (ideal CPI + Memory stalls/Inst + Other stalls/Inst)

Memory stalls/Inst = Instruction miss rate x Instruction miss penalty
                   + Loads/Inst x Load miss rate x Load miss penalty
                   + Stores/Inst x Store miss rate x Store miss penalty

Average Memory Access Time (AMAT) = Hit time + (Miss rate x Miss penalty)
                                  = (Hit rate x Hit time) + (Miss rate x Miss time)
Lec 6.28
Example
 Suppose a processor executes at
• Clock Rate = 200 MHz (5 ns per cycle)
• Base CPI = 1.1
• 50% arith/logic, 30% ld/st, 20% control
 Suppose that 10% of memory operations get 50 cycle miss penalty
 Suppose that 1% of instructions get same miss penalty
 CPI = Base CPI + average stalls per instruction
= 1.1 (cycles/inst)
+ [0.30 (data Mops/inst) x 0.10 (miss/data Mop) x 50 (cycles/miss)]
+ [1 (inst Mop/inst) x 0.01 (miss/inst Mop) x 50 (cycles/miss)]
= (1.1 + 1.5 + 0.5) cycles/inst = 3.1
 AMAT = (1/1.3) x [1 + 0.01 x 50] + (0.3/1.3) x [1 + 0.1 x 50] = 2.54
Lec 6.29
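The example's arithmetic can be reproduced step by step (all values taken from the slide above):

```python
base_cpi = 1.1
ld_st_frac = 0.30        # fraction of instructions that are loads/stores
data_miss_rate = 0.10    # misses per data memory operation
inst_miss_rate = 0.01    # misses per instruction fetch
penalty = 50             # cycles per miss

cpi = base_cpi + ld_st_frac * data_miss_rate * penalty \
               + 1.0 * inst_miss_rate * penalty
print(round(cpi, 2))     # 1.1 + 1.5 + 0.5 -> 3.1

# AMAT weights the accesses: 1 instruction fetch + 0.3 data ops per instruction
accesses = 1 + ld_st_frac
amat = (1 / accesses) * (1 + inst_miss_rate * penalty) \
     + (ld_st_frac / accesses) * (1 + data_miss_rate * penalty)
print(round(amat, 2))    # -> 2.54
```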
Improving Cache Performance
CPU Time = IC x CT x (ideal CPI + memory stalls)
Average Memory Access time = Hit Time + (Miss Rate x Miss Penalty)
= (Hit Rate x Hit Time) + (Miss Rate x Miss Time)
Options to reduce AMAT:
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache
Lec 6.30
Reduce Misses: Larger Block Size
[Chart: miss rate vs. block size (16 to 256 bytes) for cache sizes of
1K, 4K, 16K, 64K, and 256K. Larger blocks reduce the miss rate up to a
point; for small caches, very large blocks raise it again because only
a few blocks fit.]
Increasing block size also increases miss penalty !
Lec 6.31
Reduce Misses: Higher Associativity
[Chart: miss rate vs. associativity (one-way to eight-way) for cache
sizes from 1 KB to 128 KB. Higher associativity reduces the miss rate,
with diminishing returns, especially for the larger caches.]
Increasing associativity also increases both time and hardware cost !
Lec 6.32
Reducing Penalty: Second-Level Cache
 L2 Equations
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
[Figure: Proc <-> L1 Cache <-> L2 Cache]
Lec 6.33
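The L2 equations can be sketched as a function; the latencies and miss rates below are hypothetical example values, not from the slide.

```python
def amat_two_level(hit_l1, miss_l1, hit_l2, miss_l2, mem_penalty):
    """AMAT with an L2 cache: Miss Penalty_L1 is the L2 + memory access time."""
    miss_penalty_l1 = hit_l2 + miss_l2 * mem_penalty
    return hit_l1 + miss_l1 * miss_penalty_l1

# 1-cycle L1, 5% L1 miss rate, 10-cycle L2, 20% local L2 miss rate, 100-cycle memory
print(amat_two_level(1, 0.05, 10, 0.20, 100))   # 1 + 0.05 x (10 + 20) = 2.5
```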
Designing the Memory System to Support Caches
 Simple:
• CPU, Cache, Bus, Memory all the same width (e.g. 32 bits)
 Wide:
• CPU/Mux 1 word; Mux/Cache, Bus, Memory N words
 Interleaved:
• CPU, Cache, Bus 1 word; N memory modules (banks)
Lec 6.34
Main Memory Performance
[Timing diagram: the access time is the initial portion of the longer
cycle time.]
 DRAM (Read/Write) Cycle Time >>
DRAM (Read/Write) Access Time
 DRAM (Read/Write) Cycle Time :
• How frequently can you initiate a new access?
 DRAM (Read/Write) Access Time:
• How quickly will you get what you want once you initiate an access?
 DRAM Bandwidth Limitation
Lec 6.35
Increasing Bandwidth - Interleaving
 Access pattern without interleaving (single memory bank):
• The CPU starts the access for D1, waits until D1 is available, and
only then starts the access for D2
 Access pattern with 4-way interleaving (memory banks 0-3):
• Accesses to banks 0, 1, 2, and 3 start on successive cycles and their
cycle times overlap; by the time bank 3 has been started, bank 0 can be
accessed again
Lec 6.36
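A rough timing model shows the bandwidth gain from interleaving. The 4-cycle DRAM cycle time and 1-cycle transfer below are hypothetical numbers chosen only for illustration.

```python
def cycles_to_read(words, banks, dram_cycle=4, transfer=1):
    """Cycles to read `words` consecutive words from interleaved memory."""
    if banks == 1:
        # one bank: every access is fully serialized
        return words * (dram_cycle + transfer)
    # interleaved: start one bank per cycle, so DRAM cycles overlap
    # and only the first access's latency is fully exposed
    return dram_cycle + words * transfer

print(cycles_to_read(4, banks=1))   # 4 x (4 + 1) = 20 cycles
print(cycles_to_read(4, banks=4))   # 4 + 4 x 1 = 8 cycles
```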
Summary #1/2
 The Principle of Locality:
• Program likely to access a relatively small portion of the address space
at any instant of time.
- Temporal Locality: Locality in Time
- Spatial Locality: Locality in Space
 Three (+1) Major Categories of Cache Misses:
• Compulsory Misses: sad facts of life. Example: cold start misses.
• Conflict Misses: reduce by increasing cache size and/or associativity.
Nightmare Scenario: ping pong effect!
• Capacity Misses: reduce by increasing cache size
• (the +1: Coherence Misses, from invalidations in multiprocessors)
 Cache Design Space
• total size, block size, associativity
• replacement policy
• write-hit policy (write-through, write-back)
• write-miss policy
Lec 6.37
Summary #2/2: The Cache Design Space
 Several interacting dimensions
• cache size
• block size
• associativity
• replacement policy
• write-through vs write-back
• write allocation
 The optimal choice is a compromise
• depends on access characteristics
- workload
- use (I-cache, D-cache, TLB)
• depends on technology / cost
 Simplicity often wins
[Chart: the design space sketched along the axes Cache Size,
Associativity, and Block Size; "Good" and "Bad" regions trade off
Factor A against Factor B.]
Lec 6.38