CA226: Advanced Computer Architectures (transcript)

CPE 631 Lecture 04: Review of the ABC of Caches
Electrical and Computer Engineering
University of Alabama in Huntsville
20/07/2015, UAH-CPE631

Outline
- Memory Hierarchy
- Four Questions for Memory Hierarchy
Processor-DRAM Latency Gap

[Figure: processor vs. DRAM performance, log scale, 1980-2000]
- Processor performance: 2x every 1.5 years
- Memory (DRAM) performance: 2x every 10 years
- The processor-memory performance gap grows about 50% per year
Solution: The Memory Hierarchy (MH)

- The user sees as much memory as is available in the cheapest
  technology and accesses it at the speed offered by the fastest
  technology
- Levels in the memory hierarchy, from the processor
  (control, datapath) downward:
  - upper levels: fastest, smallest capacity, highest cost/bit
  - lower levels: slowest, biggest capacity, lowest cost/bit
Why does the hierarchy work?

- Principle of locality. Rule of thumb: programs spend 90% of their
  execution time in only 10% of their code
  [Figure: probability of reference vs. address space]
- Temporal locality: recently accessed items are likely to be
  accessed again in the near future
  - keep them close to the processor
- Spatial locality: items whose addresses are near one another tend
  to be referenced close together in time
  - move blocks consisting of contiguous words to the upper level
Cache Measures

[Figure: processor exchanging block X with upper-level memory and
block Y with lower-level memory; hit time << miss penalty]

- Hit: data appears in some block in the upper level (block X)
  - hit rate: the fraction of memory accesses found in the upper level
  - hit time: time to access the upper level
    (RAM access time + time to determine hit/miss)
- Miss: data must be retrieved from the lower level (block Y)
  - miss rate: 1 - hit rate
  - miss penalty: time to replace a block in the upper level +
    time to retrieve the block from the lower level
- Average memory-access time (AMAT)
  = Hit time + Miss rate x Miss penalty (ns or clocks)
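The AMAT formula is easy to exercise numerically. A minimal sketch; the hit time, miss rate, and miss penalty values are chosen purely for illustration and are not from the slides:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory-access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Illustrative (assumed) values: 1-cycle hit, 5% miss rate, 100-cycle penalty.
print(amat(1, 0.05, 100))  # -> 6.0 cycles
```

Note how strongly the penalty term dominates: halving the miss rate to 2.5% here drops AMAT from 6.0 to 3.5 cycles.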
Levels of the Memory Hierarchy

Level (capacity, access time, cost):
- Registers: 100s bytes, <1 ns
- Cache: 10s-100s KB, 1-10 ns, $10/MByte
- Main memory: MBytes, 100-300 ns, $1/MByte
- Disk: 10s GB, 10 ms (10,000,000 ns), $0.0031/MByte
- Tape: infinite, sec-min, $0.0014/MByte

Staging / transfer unit between levels (and who manages it):
- Registers <-> Cache: instruction operands, 1-8 bytes (program/compiler)
- Cache <-> Memory: blocks, 8-128 bytes (cache controller)
- Memory <-> Disk: pages, 512-4K bytes (OS)
- Disk <-> Tape: files, MBytes (user/operator)

Upper levels are faster; lower levels are larger.
Four Questions for Memory Hierarchies

- Q#1: Where can a block be placed in the upper level?
  (Block placement: direct-mapped, fully associative, set-associative)
- Q#2: How is a block found if it is in the upper level?
  (Block identification)
- Q#3: Which block should be replaced on a miss?
  (Block replacement: random, LRU (least recently used))
- Q#4: What happens on a write?
  (Write strategy: write-through vs. write-back,
  write allocate vs. no-write allocate)
Direct-Mapped Cache

- In a direct-mapped cache, each memory address is associated with
  one possible block within the cache
- Therefore, we only need to look in a single location in the cache
  for the data, if it exists in the cache
- The block is the unit of transfer between cache and memory
Direct-Mapped Cache (cont'd)

[Figure: 16-byte memory (addresses 0-F) mapped onto a 4-byte cache
(indices 0-3); each memory address maps to exactly one cache index]
Direct-Mapped Cache (cont'd)

- Since multiple memory addresses map to the same cache index,
  how do we tell which one is in there?
- What if we have a block size > 1 byte?
- Result: divide the memory address into three fields:

  | tttttttttttttttttt | iiiiiiiiii | oooo |
    TAG                  INDEX        OFFSET
  (block address = TAG + INDEX)

  - TAG: to check that we have the correct block
  - INDEX: to select the block
  - OFFSET: to select the byte within the block
Direct-Mapped Cache Terminology

- INDEX: specifies the cache index
  (which "row" of the cache we should look in)
- OFFSET: once we have found the correct block,
  specifies which byte within the block we want
- TAG: the remaining bits after offset and index are determined;
  used to distinguish between all the memory addresses
  that map to the same location
- BLOCK ADDRESS: TAG + INDEX
Direct-Mapped Cache Example

- Conditions
  - 32-bit architecture (word = 32 bits), address unit is the byte
  - 8 KB direct-mapped cache with 4-word blocks
  - determine the size of the tag, index, and offset fields
- OFFSET (specifies the correct byte within a block):
  a cache block contains 4 words = 16 (2^4) bytes => 4 bits
- INDEX (specifies the correct row in the cache):
  cache size is 8 KB = 2^13 bytes, a cache block is 2^4 bytes;
  #rows in cache (1 block = 1 row): 2^13 / 2^4 = 2^9 => 9 bits
- TAG: memory address length - offset - index =
  32 - 4 - 9 = 19 => the tag is the leftmost 19 bits
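The field-size arithmetic above can be sketched as a small helper. The function name and its restriction to power-of-two sizes are my assumptions, not the slides':

```python
def cache_fields(addr_bits, cache_bytes, block_bytes):
    """Tag/index/offset widths for a direct-mapped cache (power-of-two sizes)."""
    offset = block_bytes.bit_length() - 1                  # log2(block size)
    index = (cache_bytes // block_bytes).bit_length() - 1  # log2(#rows)
    tag = addr_bits - index - offset
    return tag, index, offset

# The slide's example: 32-bit addresses, 8 KB cache, 4-word (16-byte) blocks.
print(cache_fields(32, 8 * 1024, 16))  # -> (19, 9, 4)
```

The same helper reproduces the next slide's 1 KB / 32 B configuration: `cache_fields(32, 1024, 32)` gives a 22-bit tag, 5-bit index, and 5-bit byte select.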
1 KB Direct-Mapped Cache, 32 B Blocks

- For a 2^N-byte cache with 2^M-byte blocks:
  - the uppermost (32 - N) bits are always the cache tag
  - the lowest M bits are the byte select
- Here N = 10, M = 5: bits 31-10 are the cache tag (example: 0x50),
  bits 9-5 the cache index (ex: 0x01), bits 4-0 the byte select
  (ex: 0x00)

[Figure: cache array of 32 rows (0-31); each row stores a valid bit,
a cache tag (stored as part of the cache "state"), and 32 bytes of
cache data, e.g. row 0 holds bytes 0-31, row 1 bytes 32-63, and
row 31 bytes 992-1023]
Two-Way Set-Associative Cache

- N-way set associative: N entries for each cache index
  - N direct-mapped caches operating in parallel (N typically 2 to 4)
- Example: two-way set-associative cache
  - the cache index selects a "set" from the cache
  - the two tags in the set are compared in parallel
  - data is selected based on the tag comparison result

[Figure: two banks of (valid, cache tag, cache data); the address
tag is compared against both stored tags, the compare results are
ORed into Hit, and a mux (Sel0/Sel1) picks the matching cache block]
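The probe-both-ways idea can be sketched in a few lines. The sizes, names, and simplistic victim choice below are illustrative assumptions; a real cache would pick the victim way with a replacement policy such as LRU or random:

```python
BLOCK = 16   # bytes per block (assumed)
SETS = 4     # number of sets (assumed)

# Each set holds two (valid, tag) entries, modelling a 2-way cache.
cache = [[(False, None), (False, None)] for _ in range(SETS)]

def split(addr):
    """Split a byte address into (set index, tag)."""
    block_addr = addr // BLOCK
    return block_addr % SETS, block_addr // SETS

def access(addr):
    """Probe both ways of the selected set in parallel (conceptually)."""
    index, tag = split(addr)
    ways = cache[index]
    for valid, t in ways:
        if valid and t == tag:
            return "hit"
    # Miss: fill an invalid way if one exists, else naively evict way 0.
    victim = 1 if ways[0][0] and not ways[1][0] else 0
    ways[victim] = (True, tag)
    return "miss"

print(access(0x40), access(0x40))  # -> miss hit
```

Two conflicting blocks (e.g. addresses 0x40 and 0x80, which share set 0) can now coexist, which a direct-mapped cache of the same size could not do.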
Disadvantage of Set-Associative Caches

- N-way set-associative cache vs. direct-mapped cache:
  - N comparators vs. 1
  - extra mux delay for the data
  - data arrives AFTER the hit/miss decision
- In a direct-mapped cache, the cache block is available BEFORE
  hit/miss is known:
  - possible to assume a hit and continue; recover later on a miss

[Figure: two-way set-associative lookup datapath, with per-way
valid/tag compare, OR into Hit, and a Sel0/Sel1 mux on the data]
Q1: Where can a block be placed in the upper level?

- Block 12 placed in an 8-block cache:
  - fully associative: anywhere in the cache
  - direct mapped: only into block (12 mod 8) = 4
  - 2-way set associative: anywhere in set (12 mod 4) = 0
- Set-associative mapping: block number modulo number of sets

[Figure: cache blocks 0-7 under each placement policy;
memory blocks 0-31]
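The mod arithmetic on this slide generalizes to any associativity. A sketch; the function and its convention that the ways of a set occupy contiguous block frames are my assumptions:

```python
def placement(block_number, num_blocks, assoc):
    """Candidate cache block frames for a memory block.

    Set index = block number mod number of sets; the block may go
    into any of the `assoc` frames of that set.
    """
    num_sets = num_blocks // assoc
    set_index = block_number % num_sets
    return list(range(set_index * assoc, set_index * assoc + assoc))

# The slide's example: block 12 in an 8-block cache.
print(placement(12, 8, 1))  # direct mapped: [4]
print(placement(12, 8, 2))  # 2-way: set 0 -> frames [0, 1]
print(placement(12, 8, 8))  # fully associative: any of the 8 frames
```

Direct mapped and fully associative fall out as the two extremes: associativity 1 and associativity equal to the number of blocks.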
Q2: How is a block found if it is in the upper level?

- Tag on each block
  - no need to check the index or block offset
- Increasing associativity shrinks the index and expands the tag

  Address fields: | Tag | Index | Block offset |
  (block address = tag + index)
Q3: Which block should be replaced on a miss?

- Easy for direct mapped (no choice)
- Set associative or fully associative:
  - random
  - LRU (least recently used)

Miss rates, LRU vs. random:

  Size     2-way LRU / Ran   4-way LRU / Ran   8-way LRU / Ran
  16 KB    5.2% / 5.7%       4.7% / 5.3%       4.4% / 5.0%
  64 KB    1.9% / 2.0%       1.5% / 1.7%       1.4% / 1.5%
  256 KB   1.15% / 1.17%     1.13% / 1.13%     1.12% / 1.12%
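LRU for one set can be sketched with an ordered map that tracks recency; the class name and the access trace are illustrative assumptions, not from the slides:

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with LRU replacement, sketched with an OrderedDict."""

    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()  # least recently used entry first

    def access(self, tag):
        if tag in self.tags:
            self.tags.move_to_end(tag)     # refresh recency on a hit
            return "hit"
        if len(self.tags) == self.ways:
            self.tags.popitem(last=False)  # evict the least recently used
        self.tags[tag] = True
        return "miss"

s = LRUSet(ways=2)
print([s.access(t) for t in ("A", "B", "A", "C", "B")])
# -> ['miss', 'miss', 'hit', 'miss', 'miss']
```

In the trace, the hit on "A" makes "B" the LRU entry, so "C" evicts "B" and the final access to "B" misses. Hardware approximates this bookkeeping with a few recency bits per set rather than a full ordered list.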
Q4: What happens on a write?

- Write through: the information is written to both the block in
  the cache and the block in the lower-level memory
- Write back: the information is written only to the block in the
  cache; the modified cache block is written to main memory only
  when it is replaced
  - requires tracking whether each block is clean or dirty
- Pros and cons of each?
  - WT: read misses cannot result in writes
  - WB: no repeated writes to the same location
- WT is always combined with write buffers so that the processor
  does not wait for the lower-level memory
Write Stalls in Write-Through Caches

- When the CPU must wait for writes to complete during write
  through, the CPU is said to write stall
- Common optimization: a write buffer, which allows the processor
  to continue as soon as the data is written to the buffer, thereby
  overlapping processor execution with memory updating
- However, write stalls can occur even with a write buffer
  (when the buffer is full)
Write Buffer for Write Through

[Figure: Processor -> Cache -> DRAM, with a write buffer between
the cache and DRAM]

- A write buffer is needed between the cache and memory
  - processor: writes data into the cache and the write buffer
  - memory controller: writes the contents of the buffer to memory
- The write buffer is just a FIFO
  - typical number of entries: 4
  - works fine if: store frequency (w.r.t. time)
    << 1 / DRAM write cycle
- The memory system designer's nightmare:
  - store frequency (w.r.t. time) -> 1 / DRAM write cycle
  - write buffer saturation
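The FIFO behavior, including saturation, can be sketched as follows; the class, method names, and 4-entry default are assumptions mirroring the slide's typical figure:

```python
from collections import deque

class WriteBuffer:
    """FIFO write buffer between a write-through cache and DRAM (sketch)."""

    def __init__(self, entries=4):  # 4 entries, as is typical per the slide
        self.fifo = deque()
        self.entries = entries

    def store(self, addr, data):
        """Processor-side store; returns False (a write stall) when full."""
        if len(self.fifo) == self.entries:
            return False  # buffer saturated: the processor must stall
        self.fifo.append((addr, data))
        return True

    def drain_one(self):
        """Memory-controller side: retire the oldest write to DRAM."""
        return self.fifo.popleft() if self.fifo else None

wb = WriteBuffer()
for i in range(5):
    print(wb.store(i, i * 10))  # the fifth store stalls: buffer holds 4
```

The two ends run at different rates: stores arrive at the processor's pace while `drain_one` proceeds at the DRAM write-cycle pace, which is exactly why the buffer saturates when store frequency approaches 1 / DRAM write cycle.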
What to do on a write miss?

- Write allocate (or fetch on write):
  the block is loaded on a write miss,
  followed by the write-hit actions
- No-write allocate (or write around):
  the block is modified in memory and
  not loaded into the cache
- Although either write-miss policy can be used with write through
  or write back, write-back caches generally use write allocate and
  write-through caches often use no-write allocate
Summary: Pipelining & Performance

- Just overlap tasks; easy if tasks are independent
- Speedup <= pipeline depth; if the ideal CPI is 1, then:

  Speedup = (Pipeline depth / (1 + Pipeline stall CPI))
            x (Cycle time unpipelined / Cycle time pipelined)

- Hazards limit performance on computers:
  - structural: need more HW resources
  - data (RAW, WAR, WAW): need forwarding, compiler scheduling
  - control: delayed branch, prediction
- Time is the measure of performance: latency or throughput
- CPI law: CPU time = Instruction count x CPI x Clock cycle time
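The speedup equation is easy to check numerically. The 5-stage depth and 0.25 stall cycles per instruction below are assumed for illustration only:

```python
def pipeline_speedup(depth, stall_cpi, cyc_unpiped, cyc_piped):
    """Speedup = depth / (1 + stall CPI) x (unpipelined/pipelined cycle time)."""
    return depth / (1 + stall_cpi) * (cyc_unpiped / cyc_piped)

# Assumed: 5 stages, 0.25 stall cycles per instruction, equal cycle times.
print(pipeline_speedup(5, 0.25, 1.0, 1.0))  # -> 4.0
```

With zero stall cycles the same call returns the ideal speedup of 5, illustrating that stalls are what keep real pipelines below their depth.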
Summary: Caches

- The principle of locality:
  - programs access a relatively small portion of the address space
    at any instant of time
  - temporal locality: locality in time
  - spatial locality: locality in space
- Three major categories of cache misses:
  - compulsory misses: sad facts of life (example: cold-start misses)
  - capacity misses: increase the cache size
  - conflict misses: increase the cache size and/or associativity
- Write policy:
  - write through: needs a write buffer
  - write back: control can be complex
- Today CPU time is a function of (ops, cache misses) vs. just
  f(ops): what does this mean for compilers, data structures,
  and algorithms?
Summary: The Cache Design Space

- Several interacting dimensions:
  - cache size
  - block size
  - associativity
  - replacement policy
  - write-through vs. write-back
- The optimal choice is a compromise
  - depends on access characteristics
    - workload
    - use (I-cache, D-cache, TLB)
  - depends on technology / cost
- Simplicity often wins

[Figure: design-space sketch plotting a quality metric (bad to good)
against factors A and B (less to more), annotated with the cache
size, block size, and associativity dimensions]