Lectures for 2nd Edition


Chapter Seven
Large and Fast: Exploiting Memory Hierarchy
Outline
• 7.1 Introduction
• 7.2 The Basics of Caches
• 7.3 Measuring and Improving Cache Performance
• 7.4 Virtual Memory
• 7.5 A Common Framework for Memory Hierarchies
• 7.6 Real Stuff: The Pentium P4 and the AMD Opteron Memory Hierarchies
• 7.7 Fallacies and Pitfalls
• 7.8 Concluding Remarks
• 7.9 Historical Perspective and Further Reading
7.1 Introduction
The principle of locality states that programs access a relatively small
portion of their address space at any instant of time, just as you accessed
a very small portion of the library’s collection.
Keywords
• Temporal locality The principle stating that if a data location is referenced then it will tend to be referenced again soon.
• Spatial locality The locality principle stating that if a data location is referenced, data locations with nearby addresses will tend to be referenced soon.
• Memory hierarchy A structure that uses multiple levels of memories; as the distance from the CPU increases, the size of the memories and the access time both increase.
• Block The minimum unit of information that can be either present or not present in the two-level hierarchy.
• Hit rate The fraction of memory accesses found in a cache.
• Miss rate The fraction of memory accesses not found in a level of the memory hierarchy.
2004 Morgan Kaufmann Publishers
5
Keywords
• Hit time The time required to access a level of the memory hierarchy, including the time needed to determine whether the access is a hit or a miss.
• Miss penalty The time required to fetch a block into a level of the memory hierarchy from the lower level, including the time to access the block, transmit it from one level to the other, and insert it in the level that experienced the miss.
2004 Morgan Kaufmann Publishers
6
Memories: Review
• SRAM:
  – value is stored on a pair of inverting gates
  – very fast but takes up more space than DRAM (4 to 6 transistors)
• DRAM:
  – value is stored as a charge on a capacitor (must be refreshed)
  – very small but slower than SRAM (factor of 5 to 10)
[Figure: memory cell schematics; the DRAM cell's word line gates a pass transistor connecting the bit line to the storage capacitor]
2004 Morgan Kaufmann Publishers
7
Memory technology | Typical access time        | $ per GB in 2004
SRAM              | 0.5–5 ns                   | $4,000–$10,000
DRAM              | 50–70 ns                   | $100–$200
Magnetic disk     | 5,000,000–20,000,000 ns    | $0.50–$2
2004 Morgan Kaufmann Publishers
8
Figure 7.1 The basic structure of a memory hierarchy.
2004 Morgan Kaufmann Publishers
9
Figure 7.2 Every pair of levels in the memory hierarchy can be
thought of as having an upper and lower level.
2004 Morgan Kaufmann Publishers
10
Exploiting Memory Hierarchy
• Users want large and fast memories!
  – SRAM access times are 0.5–5 ns at a cost of $4,000 to $10,000 per GB (2004).
  – DRAM access times are 50–70 ns at a cost of $100 to $200 per GB.
  – Disk access times are 5 to 20 million ns at a cost of $0.50 to $2 per GB.
• Try and give it to them anyway
  – build a memory hierarchy
[Figure: levels in the memory hierarchy from the CPU (Level 1) down through Level 2 to Level n; distance from the CPU and access time increase downward, as does the size of the memory at each level]
2004 Morgan Kaufmann Publishers
11
Locality
• A principle that makes having a memory hierarchy a good idea
• If an item is referenced,
  – temporal locality: it will tend to be referenced again soon
  – spatial locality: nearby items will tend to be referenced soon
  Why does code have locality? (a short example follows below)
• Our initial focus: two levels (upper, lower)
  – block: minimum unit of data
  – hit: data requested is in the upper level
  – miss: data requested is not in the upper level
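A small C sketch (ours, not from the slides) of why typical code exhibits both kinds of locality:

```c
/* Illustrative example of locality in ordinary code. */
#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];
    int sum = 0;

    /* Spatial locality: a[0], a[1], a[2], ... are adjacent in memory,
       so each cache block fetched supplies several upcoming elements. */
    for (int i = 0; i < N; i++) {
        /* Temporal locality: i, sum, and the loop's instructions
           themselves are reused on every iteration. */
        sum += a[i];
    }

    printf("%d\n", sum);
    return 0;
}
```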
2004 Morgan Kaufmann Publishers
12
7.2 The Basics of Caches
2004 Morgan Kaufmann Publishers
13
Keywords
• Direct-mapped cache A cache structure in which each memory location is mapped to exactly one location in the cache.
• Tag A field in a table used for a memory hierarchy that contains the address information required to identify whether the associated block in the hierarchy corresponds to a requested word.
• Valid bit A field in the tables of a memory hierarchy that indicates that the associated block in the hierarchy contains valid data.
• Cache miss A request for data from the cache that cannot be filled because the data is not present in the cache.
• Write-through A scheme in which writes always update both the cache and the memory, ensuring that data is always consistent between the two.
• Write buffer A queue that holds data while the data are waiting to be written to memory.
2004 Morgan Kaufmann Publishers
14
Keywords
• Write-back A scheme that handles writes by updating values only to the block in the cache, then writing the modified block to the lower level of the hierarchy when the block is replaced.
• Split cache A scheme in which a level of the memory hierarchy is composed of two independent caches that operate in parallel with each other, with one handling instructions and one handling data.
2004 Morgan Kaufmann Publishers
15
Cache
• Two issues:
  – How do we know if a data item is in the cache?
  – If it is, how do we find it?
• Our first example:
  – block size is one word of data
  – "direct mapped"
  For each item of data at the lower level, there is exactly one location in the cache where it might be.
  e.g., lots of items at the lower level share locations in the upper level
2004 Morgan Kaufmann Publishers
16
Figure 7.4 The cache just before and just after a reference to
a word Xn that is not initially in the cache.
2004 Morgan Kaufmann Publishers
17
Direct Mapped Cache
Mapping: address is modulo the number of blocks in the cache
[Figure: a direct-mapped cache with eight entries (000–111) and memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, and 11101; each address maps to the cache entry given by the address modulo 8]
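As a minimal sketch (illustrative, not from the slides), the direct-mapped index computation is just a modulo, which reduces to bit masking when the block count is a power of two:

```c
/* Direct-mapped index computation; the 8-block cache size matches
   the slide's example. */
#include <stdio.h>

#define CACHE_BLOCKS 8   /* must be a power of two */

unsigned cache_index(unsigned block_address) {
    return block_address % CACHE_BLOCKS;      /* same as & (CACHE_BLOCKS - 1) */
}

int main(void) {
    unsigned addrs[] = {22, 26, 16, 3, 18};   /* block addresses from the slides */
    for (int i = 0; i < 5; i++)
        printf("block %2u -> cache index %u\n", addrs[i], cache_index(addrs[i]));
    return 0;
}
```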
2004 Morgan Kaufmann Publishers
18
Decimal address of reference | Binary address of reference | Hit or miss in cache | Assigned cache block (where found or placed)
22 | 10110two | Miss (7.6b) | (10110two mod 8) = 110two
26 | 11010two | Miss (7.6c) | (11010two mod 8) = 010two
22 | 10110two | Hit         | (10110two mod 8) = 110two
26 | 11010two | Hit         | (11010two mod 8) = 010two
16 | 10000two | Miss (7.6d) | (10000two mod 8) = 000two
3  | 00011two | Miss (7.6e) | (00011two mod 8) = 011two
16 | 10000two | Hit         | (10000two mod 8) = 000two
18 | 10010two | Miss (7.6f) | (10010two mod 8) = 010two
2004 Morgan Kaufmann Publishers
19
Figure 7.6 The cache contents are shown after each reference request
that misses, with the index and tag fields shown in binary.
a. The initial state of the cache after power-on: all eight entries (index 000–111) have V = N.
b. After handling a miss of address 10110two: entry 110 becomes V = Y, Tag = 10two, Data = Memory(10110two); all other entries remain invalid.
c. After handling a miss of address 11010two: entry 010 becomes V = Y, Tag = 11two, Data = Memory(11010two); entry 110 still holds Memory(10110two).
d. After handling a miss of address 10000two: entry 000 becomes V = Y, Tag = 10two, Data = Memory(10000two); entries 010 and 110 are unchanged.
e. After handling a miss of address 00011two: entry 011 becomes V = Y, Tag = 00two, Data = Memory(00011two); entries 000, 010, and 110 are unchanged.
f. After handling a miss of address 10010two: entry 010 is overwritten with Tag = 10two, Data = Memory(10010two), evicting Memory(11010two); entries 000, 011, and 110 are unchanged.
2004 Morgan Kaufmann Publishers
22
Direct Mapped Cache
[Figure: a direct-mapped MIPS cache with 1024 one-word blocks; the 32-bit address splits into a 20-bit tag (bits 31–12), a 10-bit index (bits 11–2), and a 2-bit byte offset; the stored tag is compared against the address tag and, together with the valid bit, produces the hit signal, while the indexed entry supplies the 32-bit data word]
What kind of locality are we taking advantage of?
2004 Morgan Kaufmann Publishers
23
Bits in Cache
• How many total bits are required for a direct-mapped cache with 16 KB of data and 4-word blocks, assuming a 32-bit address?
• Answer
  We know that 16 KB is 4K words, which is 2^12 words, and, with a block size of 4 words (2^2), 2^10 blocks. Each block has 4 × 32 or 128 bits of data plus a tag, which is 32 − 10 − 2 − 2 bits, plus a valid bit. Thus, the total cache size is

    2^10 × (128 + (32 − 10 − 2 − 2) + 1) = 2^10 × 147 = 147 Kbits

  or 18.4 KB for a 16 KB cache. For this cache, the total number of bits in the cache is about 1.15 times as many as needed just for the storage of the data.
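A short C sketch of the same calculation, using the example's parameters; the variable names are ours, not the book's:

```c
/* Total bits for a direct-mapped cache, following the slide's formula.
   Parameters match the worked example (16 KB data, 4-word blocks,
   32-bit addresses). */
#include <stdio.h>

int main(void) {
    int address_bits = 32;
    int data_kb = 16;                       /* cache data capacity in KB */
    int words_per_block = 4;

    int total_words = data_kb * 1024 / 4;   /* 4 bytes per word -> 4096 words */
    int blocks = total_words / words_per_block;   /* 1024 = 2^10 */
    int index_bits = 10;                    /* log2(blocks) */
    int block_offset_bits = 2;              /* log2(words_per_block) */
    int byte_offset_bits = 2;               /* word-aligned addressing */

    int tag_bits = address_bits - index_bits - block_offset_bits - byte_offset_bits;
    int bits_per_entry = words_per_block * 32 + tag_bits + 1;   /* data + tag + valid */
    long total_bits = (long)blocks * bits_per_entry;

    printf("tag = %d bits, entry = %d bits, total = %ld Kbits\n",
           tag_bits, bits_per_entry, total_bits / 1024);        /* 147 Kbits */
    return 0;
}
```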
2004 Morgan Kaufmann Publishers
24
Mapping an Address to a Multiword Cache Block
• Consider a cache with 64 blocks and a block size of 16 bytes. What block number does byte address 1200 map to?
• Answer
  – The address of the block is (formula on page 474):

      ⌊Byte address / Bytes per block⌋

  – Notice that this block address is the block containing all addresses between

      ⌊Byte address / Bytes per block⌋ × Bytes per block

    and

      ⌊Byte address / Bytes per block⌋ × Bytes per block + (Bytes per block − 1)

  – Thus, with 16 bytes per block, byte address 1200 is block address

      ⌊1200 / 16⌋ = 75

    which maps to cache block number (75 modulo 64) = 11. In fact, this block maps all addresses between 1200 and 1215.
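The same mapping in C, using the example's 64-block, 16-byte-block cache:

```c
/* Mapping a byte address to a multiword cache block, per the slide's
   formula. Parameters match the example (64 blocks, 16-byte blocks). */
#include <stdio.h>

#define BYTES_PER_BLOCK 16
#define CACHE_BLOCKS    64

int main(void) {
    unsigned byte_address = 1200;

    unsigned block_address = byte_address / BYTES_PER_BLOCK;   /* 75 */
    unsigned cache_block   = block_address % CACHE_BLOCKS;     /* 75 mod 64 = 11 */
    unsigned first_byte    = block_address * BYTES_PER_BLOCK;  /* 1200 */
    unsigned last_byte     = first_byte + BYTES_PER_BLOCK - 1; /* 1215 */

    printf("byte address %u -> block address %u -> cache block %u (bytes %u..%u)\n",
           byte_address, block_address, cache_block, first_byte, last_byte);
    return 0;
}
```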
2004 Morgan Kaufmann Publishers
25
Figure 7.8 Miss rate versus block size.
2004 Morgan Kaufmann Publishers
26
Direct Mapped Cache
• Taking advantage of spatial locality:
[Figure: a 64 KB direct-mapped cache with four-word (16-byte) blocks; the 32-bit address splits into an 18-bit tag, an 8-bit index, a 2-bit block offset, and a 2-bit byte offset; each of the 256 entries holds 512 data bits, and a multiplexor selects one of the four 32-bit words on a hit]
2004 Morgan Kaufmann Publishers
27
Hits vs. Misses
• Read hits
  – this is what we want!
• Read misses
  – stall the CPU, fetch block from memory, deliver to cache, restart
• Write hits:
  – can replace data in cache and memory (write-through)
  – write the data only into the cache (write-back the cache later)
• Write misses:
  – read the entire block into the cache, then write the word
2004 Morgan Kaufmann Publishers
28
To understand the impact of different organizations of memory, let's define a set of hypothetical memory access times. Assume
• 1 memory bus clock cycle to send the address
• 15 memory bus clock cycles for each DRAM access initiated
• 1 memory bus clock cycle to send a word of data
If we have a cache block of four words and a one-word-wide bank of DRAMs, the miss penalty would be 1 + 4 × 15 + 4 × 1 = 65 memory bus clock cycles. Thus, the number of bytes transferred per bus clock cycle for a single miss would be

  (4 × 4) / 65 ≈ 0.25

Figure 7.11 shows three options for designing the memory system:
1. Memory is one word wide, and all accesses are made sequentially.
2. Increase the bandwidth to memory by widening the memory and the buses between the processor and memory.
3. The memory chips are organized in 4 banks.
2004 Morgan Kaufmann Publishers
29
Hardware Issues
• Make reading multiple words easier by using banks of memory
[Figure: three memory organizations: (a) one-word-wide memory, (b) wide memory with a multiplexor between the cache and CPU, and (c) interleaved memory with four banks (bank 0 to bank 3) on a one-word-wide bus]
• It can get a lot more complicated...
2004 Morgan Kaufmann Publishers
30
The miss penalty of each memory system
• A. 1 + 4 × 15 + 4 × 1 = 65
• B. With a main memory width of two words: 1 + 2 × 15 + 2 × 1 = 33; with four words: 17
• C. 1 cycle to transmit the address and read request to the banks, 15 cycles for all four banks to access memory, and 4 cycles to send the four words back to the cache: 1 + 1 × 15 + 4 × 1 = 20
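A small C sketch reproducing these three miss-penalty calculations under the hypothetical timings above:

```c
/* Miss penalties for the three memory organizations on the slides,
   using the hypothetical timings (1 cycle address, 15 cycles per DRAM
   access, 1 cycle per word transferred). */
#include <stdio.h>

int main(void) {
    int addr = 1, dram = 15, xfer = 1, block_words = 4;

    /* a. One-word-wide memory: one DRAM access and one transfer per word. */
    int narrow = addr + block_words * dram + block_words * xfer;        /* 65 */

    /* b. Two-word-wide memory: half as many accesses and transfers. */
    int wide2 = addr + (block_words / 2) * (dram + xfer);               /* 33 */

    /* c. Four interleaved banks: accesses overlap, transfers stay serial. */
    int interleaved = addr + dram + block_words * xfer;                 /* 20 */

    printf("one-word-wide: %d, two-word-wide: %d, interleaved: %d\n",
           narrow, wide2, interleaved);
    return 0;
}
```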
2004 Morgan Kaufmann Publishers
31
Figure 7.12 DRAM size increased by multiples of four approximately once every
three years until 1996, and thereafter doubling approximately every two years.
Year introduced | Chip size | $ per MB | Total access time to a new row/column | Column access time to existing row
1980 | 64 Kbit   | $1,500 | 250 ns | 150 ns
1983 | 256 Kbit  | $500   | 185 ns | 100 ns
1985 | 1 Mbit    | $200   | 135 ns | 40 ns
1989 | 4 Mbit    | $50    | 110 ns | 40 ns
1992 | 16 Mbit   | $15    | 90 ns  | 30 ns
1996 | 64 Mbit   | $10    | 60 ns  | 12 ns
1998 | 128 Mbit  | $4     | 60 ns  | 10 ns
2000 | 256 Mbit  | $1     | 55 ns  | 7 ns
2002 | 512 Mbit  | $0.25  | 50 ns  | 5 ns
2004 | 1024 Mbit | $0.10  | 45 ns  | 3 ns
2004 Morgan Kaufmann Publishers
32
7.3 Measuring and Improving Cache Performance
2004 Morgan Kaufmann Publishers
33
Keywords
• Fully associative cache A cache structure in which a block can be placed in any location in the cache.
• Set-associative cache A cache that has a fixed number of locations (at least two) where each block can be placed.
• Least recently used (LRU) A replacement scheme in which the block replaced is the one that has been unused for the longest time.
• Multilevel cache A memory hierarchy with multiple levels of caches, rather than just a cache and main memory.
• Global miss rate The fraction of references that miss in all levels of a multilevel cache.
• Local miss rate The fraction of references to one level of a cache that miss; used in multilevel hierarchies.
2004 Morgan Kaufmann Publishers
34
CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time

Memory-stall clock cycles = Read-stall cycles + Write-stall cycles

Read-stall cycles = (Reads / Program) × Read miss rate × Read miss penalty

Write-stall cycles = (Writes / Program) × Write miss rate × Write miss penalty + Write buffer stalls

Combining reads and writes:

Memory-stall clock cycles = (Memory accesses / Program) × Miss rate × Miss penalty

or, equivalently,

Memory-stall clock cycles = (Instructions / Program) × (Misses / Instruction) × Miss penalty
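A C sketch of the per-instruction form of the formula; the input numbers here are illustrative, not from the slides:

```c
/* Memory-stall cycles from the formulas above, expressed per instruction. */
#include <stdio.h>

int main(void) {
    double accesses_per_instr = 1.36;   /* 1 instruction fetch + 0.36 data accesses */
    double miss_rate = 0.04;            /* combined miss rate (assumed) */
    double miss_penalty = 100.0;        /* cycles (assumed) */
    double base_cpi = 2.0;

    /* Memory-stall cycles per instruction =
       (accesses / instruction) x miss rate x miss penalty */
    double stall_cpi = accesses_per_instr * miss_rate * miss_penalty;

    printf("stall CPI = %.2f, total CPI = %.2f\n", stall_cpi, base_cpi + stall_cpi);
    return 0;
}
```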
2004 Morgan Kaufmann Publishers
35
Calculating Cache Performance
• Assume an instruction cache miss rate for a program is 2% and a data cache miss rate is 4%. If a processor has a CPI of 2 without any memory stalls and the miss penalty is 100 cycles for all misses, determine how much faster a processor would run with a perfect cache that never missed. Use the instruction frequencies for SPECint2000 from Chapter 3, Figure 3.26, on page 228.
• Answer

  Instruction miss cycles = I × 2% × 100 = 2.00 I
  Data miss cycles = I × 36% × 4% × 100 = 1.44 I

  so CPIstall = 2 + 2.00 + 1.44 = 5.44. Then

  CPU time with stalls / CPU time with perfect cache
    = (I × CPIstall × Clock cycle) / (I × CPIperfect × Clock cycle)
    = CPIstall / CPIperfect
    = 5.44 / 2

  The performance with the perfect cache is better by 5.44 / 2 = 2.72.
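The same worked example in C (the 36% load/store frequency is the slide's SPECint2000 figure):

```c
/* CPI with instruction- and data-cache misses versus a perfect cache,
   following the worked example above. */
#include <stdio.h>

int main(void) {
    double base_cpi = 2.0;
    double miss_penalty = 100.0;
    double icache_miss = 0.02, dcache_miss = 0.04;
    double mem_instr_frac = 0.36;       /* loads + stores per instruction */

    double instr_miss_cpi = icache_miss * miss_penalty;                  /* 2.00 */
    double data_miss_cpi  = mem_instr_frac * dcache_miss * miss_penalty; /* 1.44 */
    double cpi_stall = base_cpi + instr_miss_cpi + data_miss_cpi;        /* 5.44 */

    printf("CPI with stalls = %.2f, speedup of perfect cache = %.2f\n",
           cpi_stall, cpi_stall / base_cpi);                             /* 2.72 */
    return 0;
}
```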
2004 Morgan Kaufmann Publishers
36
Cache Performance with Increased Clock Rate
• Suppose we increase the performance of the computer in the previous example by doubling its clock rate. Since the main memory speed is unlikely to change, assume that the absolute time to handle a cache miss does not change. How much faster will the computer be with the faster clock, assuming the same miss rate as the previous example?
• Answer

  Total miss cycles per instruction = (2% × 200) + 36% × (4% × 200) = 6.88

  Thus, the faster computer with cache misses will have a CPI of 2 + 6.88 = 8.88, compared to a CPI with cache misses of 5.44 for the slower computer. Using the formula for CPU time from the previous example, we can compute the relative performance as

  Performance with fast clock / Performance with slow clock
    = Execution time with slow clock / Execution time with fast clock
    = (IC × CPIslow clock × Clock cycle) / (IC × CPIfast clock × (Clock cycle / 2))
    = 5.44 / (8.88 / 2)
    = 1.23
Performance
• Increasing the block size tends to decrease miss rate:
[Figure: miss rate (0%–40%) versus block size (4 to 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB; miss rate falls as block size grows, until very large blocks in small caches drive it back up]
• Use split caches because there is more spatial locality in code:

Program | Block size in words | Instruction miss rate | Data miss rate | Effective combined miss rate
gcc   | 1 | 6.1% | 2.1% | 5.4%
gcc   | 4 | 2.0% | 1.7% | 1.9%
spice | 1 | 1.2% | 1.3% | 1.2%
spice | 4 | 0.3% | 0.6% | 0.4%
2004 Morgan Kaufmann Publishers
38
Performance
• Simplified model:
  execution time = (execution cycles + stall cycles) × cycle time
  stall cycles = # of instructions × miss ratio × miss penalty
• Two ways of improving performance:
  – decreasing the miss ratio
  – decreasing the miss penalty
  What happens if we increase block size?
2004 Morgan Kaufmann Publishers
39
Decreasing miss ratio with associativity
[Figure: cache organizations for eight blocks, from one-way set associative (direct mapped, eight sets) through two-way (four sets) and four-way (two sets) to eight-way set associative (fully associative, one set), each entry holding a tag and data]
Compared to direct mapped, give a series of references that:
– results in a lower miss ratio using a 2-way set associative cache
– results in a higher miss ratio using a 2-way set associative cache
assuming we use the "least recently used" replacement strategy
2004 Morgan Kaufmann Publishers
40
An implementation
[Figure: a four-way set-associative cache with 256 sets; the address supplies a 22-bit tag and an 8-bit index; four parallel tag comparators feed a 4-to-1 multiplexor that produces the hit signal and selects the data]
2004 Morgan Kaufmann Publishers
41
Performance
[Figure: miss rate (0%–15%) versus associativity (one-way to eight-way) for cache sizes from 1 KB to 128 KB; increasing associativity helps small caches most]
2004 Morgan Kaufmann Publishers
42
Decreasing miss penalty with multilevel caches
• Add a second-level cache:
  – often primary cache is on the same chip as the processor
  – use SRAMs to add another cache above primary memory (DRAM)
  – miss penalty goes down if data is in the 2nd level cache
• Example (a worked sketch follows below):
  – CPI of 1.0 on a 5 GHz machine with a 5% miss rate, 100 ns DRAM access
  – Adding a 2nd level cache with 5 ns access time decreases the miss rate to 0.5%
• Using multilevel caches:
  – try and optimize the hit time on the 1st level cache
  – try and optimize the miss rate on the 2nd level cache
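A C sketch of this example using the slide's numbers; the resulting CPIs and the speedup are computed here, not quoted from the slides:

```c
/* Effect of a second-level cache, using the slide's numbers
   (CPI 1.0, 5 GHz clock, 5% L1 miss rate, 100 ns DRAM, 5 ns L2). */
#include <stdio.h>

int main(void) {
    double cycle_ns = 1.0 / 5.0;                 /* 0.2 ns at 5 GHz */
    double base_cpi = 1.0;
    double l1_miss = 0.05, l2_miss = 0.005;      /* miss rates to next level / memory */

    double mem_penalty = 100.0 / cycle_ns;       /* 500 cycles to DRAM */
    double l2_penalty  = 5.0 / cycle_ns;         /* 25 cycles to L2 */

    double cpi_l1_only = base_cpi + l1_miss * mem_penalty;                      /* 26.0 */
    double cpi_with_l2 = base_cpi + l1_miss * l2_penalty + l2_miss * mem_penalty; /* 4.75 */

    printf("CPI without L2 = %.2f, with L2 = %.2f, speedup = %.2f\n",
           cpi_l1_only, cpi_with_l2, cpi_l1_only / cpi_with_l2);
    return 0;
}
```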
2004 Morgan Kaufmann Publishers
43
Cache Complexities
• Not always easy to understand implications of caches:
[Figure: two plots of sorting cost versus size (4 to 4096 K items to sort): the theoretical behavior of radix sort vs. quicksort (left) and the observed behavior (right); radix sort wins on paper, but quicksort wins in practice at large sizes]
2004 Morgan Kaufmann Publishers
44
Cache Complexities
• Here is why:
[Figure: cache misses per item versus size (4 to 4096 K items to sort); radix sort incurs far more cache misses per item than quicksort as the data grows]
• Memory system performance is often a critical factor
  – multilevel caches and pipelined processors make it harder to predict outcomes
  – compiler optimizations to increase locality sometimes hurt ILP
• Difficult to predict best algorithm: need experimental data
2004 Morgan Kaufmann Publishers
45
7.4 Virtual Memory
2004 Morgan Kaufmann Publishers
46
Keywords
• Virtual memory A technique that uses main memory as a "cache" for secondary storage.
• Physical address An address in main memory.
• Protection A set of mechanisms for ensuring that multiple processes sharing the processor, memory, or I/O devices cannot interfere, intentionally or unintentionally, with one another by reading or writing each other's data. These mechanisms also isolate the operating system from a user process.
• Page fault An event that occurs when an accessed page is not present in main memory.
• Virtual address An address that corresponds to a location in virtual space and is translated by address mapping to a physical address when memory is accessed.
2004 Morgan Kaufmann Publishers
47
Keywords
• Address translation Also called address mapping. The process by which a virtual address is mapped to an address used to access memory.
• Segmentation A variable-size address mapping scheme in which an address consists of two parts: a segment number, which is mapped to a physical address, and a segment offset.
• Page table The table containing the virtual to physical address translations in a virtual memory system. The table, which is stored in memory, is typically indexed by the virtual page number; each entry in the table contains the physical page number for that virtual page if the page is currently in memory.
• Swap space The space on the disk reserved for the full virtual memory space of a process.
• Reference bit Also called use bit. A field that is set whenever a page is accessed and that is used to implement LRU or other replacement schemes.
2004 Morgan Kaufmann Publishers
48
Keywords
• Translation-lookaside buffer (TLB) A cache that keeps track of recently used address mappings to avoid an access to the page table.
• Virtually addressed cache A cache that is accessed with a virtual address rather than a physical address.
• Aliasing A situation in which the same object is accessed by two addresses; can occur in virtual memory when there are two virtual addresses for the same physical page.
• Physically addressed cache A cache that is addressed by a physical address.
• Kernel mode Also called supervisor mode. A mode indicating that a running process is an operating system process.
• System call A special instruction that transfers control from user mode to a dedicated location in supervisor code space, invoking the exception mechanism in the process.
2004 Morgan Kaufmann Publishers
49
Keywords
• Context switch A changing of the internal state of the processor to allow a different process to use the processor that includes saving the state needed to return to the currently executing process.
• Exception enable Also called interrupt enable. A signal or action that controls whether the process responds to an exception or not; necessary for preventing the occurrence of exceptions during intervals before the processor has safely saved the state needed to restart.
• Restartable instruction An instruction that can resume execution after an exception is resolved without the exception's affecting the result of the instruction.
• Handler Name of a software routine invoked to "handle" an exception or interrupt.
• Unmapped A portion of the address space that cannot have page faults.
2004 Morgan Kaufmann Publishers
50
Virtual Memory
• Main memory can act as a cache for the secondary storage (disk)
[Figure: virtual addresses are mapped by address translation to physical addresses in main memory or to disk addresses]
• Advantages:
  – illusion of having more physical memory
  – program relocation
  – protection
2004 Morgan Kaufmann Publishers
51
Pages: virtual memory blocks
• Page faults: the data is not in memory, retrieve it from disk
  – huge miss penalty, thus pages should be fairly large (e.g., 4 KB)
  – reducing page faults is important (LRU is worth the price)
  – can handle the faults in software instead of hardware
  – using write-through is too expensive, so we use write-back
[Figure: a 32-bit virtual address split into a virtual page number (bits 31–12) and a page offset (bits 11–0); translation maps the virtual page number to a physical page number, which is concatenated with the unchanged page offset to form the physical address]
2004 Morgan Kaufmann Publishers
52
Page Tables
[Figure: a page table indexed by virtual page number; each entry holds a valid bit and either a physical page number pointing into physical memory (valid = 1) or a disk address (valid = 0)]
2004 Morgan Kaufmann Publishers
53
Page Tables
[Figure: the page table register points to the page table; the 20-bit virtual page number indexes the table, and if the valid bit is set, the 18-bit physical page number is concatenated with the 12-bit page offset to form the physical address; if the valid bit is 0, the page is not present in memory]
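A minimal C sketch of single-level page-table translation matching the figure's 20-bit virtual page number and 12-bit offset; the table layout here is illustrative:

```c
/* Virtual-to-physical translation with a single-level page table. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_OFFSET_BITS 12
#define NUM_VPAGES (1u << 20)

typedef struct {
    uint32_t valid;
    uint32_t ppn;      /* physical page number, meaningful only if valid */
} pte_t;

static pte_t page_table[NUM_VPAGES];

/* Returns 1 and fills *pa on success; returns 0 on a page fault. */
int translate(uint32_t va, uint32_t *pa) {
    uint32_t vpn = va >> PAGE_OFFSET_BITS;
    uint32_t offset = va & ((1u << PAGE_OFFSET_BITS) - 1);

    if (!page_table[vpn].valid)
        return 0;                          /* page fault: OS must fetch the page */

    *pa = (page_table[vpn].ppn << PAGE_OFFSET_BITS) | offset;
    return 1;
}

int main(void) {
    page_table[5].valid = 1;
    page_table[5].ppn = 42;                /* map virtual page 5 -> physical 42 */

    uint32_t pa;
    uint32_t va = (5u << PAGE_OFFSET_BITS) | 0x123;
    if (translate(va, &pa))
        printf("va 0x%08x -> pa 0x%08x\n", va, pa);
    return 0;
}
```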
2004 Morgan Kaufmann Publishers
54
Making Address Translation Fast
• A cache for address translations: translation lookaside buffer
[Figure: a TLB holding valid, dirty, and reference bits, a virtual page number tag, and a physical page address for recently used translations; on a TLB miss, the full page table in memory supplies the physical page number or disk address]
Typical values: 16–512 entries, miss rate 0.01%–1%, miss penalty 10–100 cycles
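A C sketch of a fully associative TLB lookup in front of the page table; the structures are illustrative, sized with the typical values above in mind (16 entries here):

```c
/* Fully associative TLB lookup; all tags are compared on each access. */
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 16
#define PAGE_OFFSET_BITS 12

typedef struct {
    int valid;
    uint32_t vpn;      /* tag: virtual page number */
    uint32_t ppn;      /* physical page number */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Returns 1 on a TLB hit and fills *ppn; 0 means fall back to the page table. */
int tlb_lookup(uint32_t va, uint32_t *ppn) {
    uint32_t vpn = va >> PAGE_OFFSET_BITS;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return 1;
        }
    }
    return 0;   /* TLB miss: walk the page table, then refill an entry */
}

int main(void) {
    tlb[0] = (tlb_entry_t){1, 5, 42};
    uint32_t ppn;
    if (tlb_lookup(5u << PAGE_OFFSET_BITS, &ppn))
        printf("TLB hit: physical page %u\n", ppn);
    return 0;
}
```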
2004 Morgan Kaufmann Publishers
55
TLBs and caches
[Figure: flowchart for a read or write access in a system with a TLB and a physically addressed cache. The TLB is accessed first; a TLB miss raises an exception. On a TLB hit, the physical address is formed. Reads then try the cache, stalling to fetch the block on a miss and delivering data to the CPU on a hit. Writes first check the write access bit, raising a write protection exception if it is off; otherwise the data is written into the cache, the dirty bit is updated, and the data and address are put into the write buffer]
[Figure: address translation and cache lookup combined: the 20-bit virtual page number is compared against the TLB tags; on a TLB hit, the physical page number joins the 12-bit page offset to form the physical address, which is then split into an 18-bit cache tag, an 8-bit cache index, a block offset, and a byte offset for the cache access that produces the cache hit signal and the 32-bit data]
Figure 7.26 The possible combinations of events in the TLB,
virtual memory system, and cache.
TLB  | Page table | Cache | Possible? If so, under what circumstance?
hit  | hit  | miss | Possible, although the page table is never really checked if TLB hits.
miss | hit  | hit  | TLB misses, but entry found in page table; after retry, data is found in cache.
miss | hit  | miss | TLB misses, but entry found in page table; after retry, data misses in cache.
miss | miss | miss | TLB misses and is followed by a page fault; after retry, data must miss in cache.
hit  | miss | miss | Impossible: cannot have a translation in TLB if page is not present in memory.
hit  | miss | hit  | Impossible: cannot have a translation in TLB if page is not present in memory.
miss | miss | hit  | Impossible: data cannot be allowed in cache if the page is not in memory.
2004 Morgan Kaufmann Publishers
58
Figure 7.27 MIPS control registers.
Register | CP0 register number | Description
EPC      | 14 | Where to restart after exception
Cause    | 13 | Cause of exception
BadVAddr | 8  | Address that caused exception
Index    | 0  | Location in TLB to be read or written
Random   | 1  | Pseudorandom location in TLB
EntryLo  | 2  | Physical page address and flags
EntryHi  | 10 | Virtual page address
Context  | 4  | Page table address and page number
2004 Morgan Kaufmann Publishers
59
7.5 A Common Framework for Memory Hierarchies
2004 Morgan Kaufmann Publishers
60
Keywords
• Three Cs model A cache model in which all cache misses are classified into one of three categories: compulsory misses, capacity misses, and conflict misses.
• Compulsory miss Also called cold-start miss. A cache miss caused by the first access to a block that has never been in the cache.
• Capacity miss A cache miss that occurs because the cache, even with full associativity, cannot contain all the blocks needed to satisfy the request.
• Conflict miss Also called collision miss. A cache miss that occurs in a set-associative or direct-mapped cache when multiple blocks compete for the same set and that are eliminated in a fully associative cache of the same size.
2004 Morgan Kaufmann Publishers
61
Question 1:Where can a block be placed?
• We have seen that block placement in the upper level of the hierarchy can use a range of schemes, from direct mapped to set associative to fully associative. As mentioned above, this entire range of schemes can be thought of as variations on a set-associative scheme where the number of sets and the number of blocks per set varies (a sketch of the set-index computation follows the table):

Scheme name       | Number of sets                              | Blocks per set
Direct mapped     | Number of blocks in cache                   | 1
Set associative   | (Number of blocks in cache) / Associativity | Associativity (typically 2–16)
Fully associative | 1                                           | Number of blocks in the cache
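A C sketch of how all three placement schemes fall out of one set-index computation; the parameters are illustrative:

```c
/* The three placement schemes as variations on set-associativity. */
#include <stdio.h>

unsigned set_index(unsigned block_address, unsigned num_blocks, unsigned assoc) {
    unsigned num_sets = num_blocks / assoc;   /* direct mapped: assoc = 1      */
    return block_address % num_sets;          /* fully associative: 1 set only */
}

int main(void) {
    unsigned block = 75, blocks = 64;
    printf("direct mapped: set %u\n", set_index(block, blocks, 1));      /* 75 mod 64 = 11 */
    printf("4-way:         set %u\n", set_index(block, blocks, 4));      /* 75 mod 16 = 11 */
    printf("fully assoc.:  set %u\n", set_index(block, blocks, blocks)); /* always set 0   */
    return 0;
}
```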
2004 Morgan Kaufmann Publishers
62
Figure 7.29 The key quantitative design parameters that characterize
the major elements of memory hierarchy in a computer.
Feature | Typical values for L1 caches | Typical values for L2 caches | Typical values for paged memory | Typical values for a TLB
Total size in blocks      | 250–2000 | 4000–250,000 | 16,000–250,000          | 16–512
Total size in kilobytes   | 16–64    | 500–8000     | 250,000–1,000,000,000   | 0.25–16
Block size in bytes       | 32–64    | 32–128       | 4000–64,000             | 4–32
Miss penalty in clocks    | 10–25    | 100–1000     | 10,000,000–100,000,000  | 10–1000
Miss rates (global for L2)| 2%–5%    | 0.1%–2%      | 0.00001%–0.0001%        | 0.01%–2%
2004 Morgan Kaufmann Publishers
63
Figure 7.30 The data cache miss rates for each of eight cache
sizes improve as the associativity increases.
2004 Morgan Kaufmann Publishers
64
Question 2:How is a block found?
• The choice of how we locate a block depends on the block placement scheme, since that dictates the number of possible locations. We can summarize the schemes as follows:

Associativity   | Location method                      | Comparisons required
Direct mapped   | Index                                | 1
Set associative | Index the set, search among elements | Degree of associativity
Full            | Search all cache entries             | Size of the cache
Full            | Separate lookup table                | 0
2004 Morgan Kaufmann Publishers
65
Question 3:Which block should be replaced on a cache miss?
• We have already mentioned the two primary strategies for replacement in set-associative or fully associative caches (a sketch of LRU follows below):
  – Random: Candidate blocks are randomly selected, possibly using some hardware assistance. For example, MIPS supports random replacement for TLB misses.
  – Least recently used (LRU): The block replaced is the one that has been unused for the longest time.
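A C sketch of LRU within a single set, tracked with access timestamps; real hardware approximates LRU more cheaply, so this is only to make the policy concrete:

```c
/* LRU replacement within one cache set, using timestamps. */
#include <stdio.h>

#define WAYS 4

typedef struct {
    int valid;
    unsigned tag;
    unsigned long last_used;   /* timestamp of the most recent access */
} way_t;

static way_t set[WAYS];
static unsigned long now = 0;

/* Access a tag in this set; returns 1 on hit, 0 on miss (after refill). */
int access_set(unsigned tag) {
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (set[i].valid && set[i].tag == tag) {
            set[i].last_used = ++now;          /* hit: refresh recency */
            return 1;
        }
        /* Track the least recently used (or an invalid) way as the victim. */
        if (!set[i].valid || set[i].last_used < set[victim].last_used)
            victim = i;
    }
    set[victim] = (way_t){1, tag, ++now};      /* miss: replace the LRU block */
    return 0;
}

int main(void) {
    unsigned refs[] = {1, 2, 3, 4, 1, 5};      /* tag 5 evicts tag 2, the LRU block */
    for (int i = 0; i < 6; i++)
        printf("tag %u -> %s\n", refs[i], access_set(refs[i]) ? "hit" : "miss");
    return 0;
}
```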
2004 Morgan Kaufmann Publishers
66
Question 4: What happens on a Write?
• A key characteristic of any memory hierarchy is how it deals with writes. We have already seen the two basic options:
  – Write-through: The information is written to both the block in the cache and to the block in the lower level of the memory hierarchy (main memory for a cache). The caches in Section 7.2 used this scheme.
  – Write-back (also called copy-back): The information is written only to the block in the cache. The modified block is written to the lower level of the hierarchy only when it is replaced. Virtual memory systems always use write-back, for the reasons discussed in Section 7.4.
2004 Morgan Kaufmann Publishers
67
The three Cs: An intuitive model for understanding the behavior of memory hierarchies
• Compulsory misses: These are cache misses caused by the first access to a block that has never been in the cache. These are also called cold-start misses.
• Capacity misses: These are cache misses caused when the cache cannot contain all the blocks needed during execution of a program. Capacity misses occur when blocks are replaced and then later retrieved.
• Conflict misses: These are cache misses that occur in set-associative or direct-mapped caches when multiple blocks compete for the same set. Conflict misses are those misses in a direct-mapped or set-associative cache that are eliminated in a fully associative cache of the same size. These cache misses are also called collision misses.
2004 Morgan Kaufmann Publishers
68
Figure 7.31 The miss rate can be broken into three sources of misses.
2004 Morgan Kaufmann Publishers
69
Figure 7.32 Memory hierarchy design challenges.
Design change          | Effect on miss rate                                                       | Possible negative performance effect
Increase cache size    | Decreases capacity misses                                                 | May increase access time
Increase associativity | Decreases miss rate due to conflict misses                                | May increase access time
Increase block size    | Decreases miss rate for a wide range of block sizes due to spatial locality | Increases miss penalty; very large blocks could increase miss rate
2004 Morgan Kaufmann Publishers
70
7.6 Real Stuff: The Pentium P4 and the AMD Opteron Memory Hierarchies
2004 Morgan Kaufmann Publishers
71
Keywords
• Nonblocking cache A cache that allows the processor to make references to the cache while the cache is handling an earlier miss.
2004 Morgan Kaufmann Publishers
72
Figure 7.33 An AMD Opteron die processor photo with the
components labeled.
2004 Morgan Kaufmann Publishers
73
Modern Systems
7.7 Fallacies and Pitfalls
2004 Morgan Kaufmann Publishers
76
• Pitfall: Forgetting to account for byte addressing or the cache block
size in simulating a cache.
• Pitfall: Ignoring memory system behavior when writing programs or
when generating code in a compiler.
• Pitfall: Using average memory access time to evaluate the memory
hierarchy of an out-of-order processor.
• Pitfall: Extending an address space by adding segments on top of an
unsegmented address space.
2004 Morgan Kaufmann Publishers
77
7.8 Concluding Remarks
2004 Morgan Kaufmann Publishers
78
Keywords
• Prefetching A technique in which data blocks needed in the future are brought into the cache early by the use of special instructions that specify the address of the block.
2004 Morgan Kaufmann Publishers
79
Modern Systems
• Things are getting complicated!
2004 Morgan Kaufmann Publishers
80
Some Issues
• Processor speeds continue to increase very fast, much faster than either DRAM or disk access times
[Figure: processor vs. memory performance by year on a log scale (1 to 100,000); the CPU curve rises much faster than the memory curve, opening a widening gap]
• Design challenge: dealing with this growing disparity
  – Prefetching? 3rd level caches and more? Memory design?
2004 Morgan Kaufmann Publishers
81