PowerPoint Presentation - German University in Cairo
Computer Architecture:
Memory Architecture
Dr. Eng. Amr T. Abdel-Hamid
Slides are adapted from:
• J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition, Morgan Kaufmann Publishers, Menlo Park, CA, 1996. Copyright 1998 UCB
• S. Tahar, Computer Architecture, Concordia University
Winter 2012
Computer Architecture
Elect707
Why Memory Organization?
[Figure: processor vs. memory performance, 1980-2000, log scale (1 to 1000). µProc (CPU) performance grows ~60%/yr ("Moore's Law"); DRAM performance grows ~7%/yr; the resulting processor-memory performance gap grows ~50%/yr.]
Dr. Amr Talaat
The Principle of Locality
The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
Two different types of locality:
Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
This is the basic principle used to overcome the memory/processor interaction problem.
What is a cache?
Small, fast storage used to improve the average access time to slow memory.
Exploits spatial and temporal locality.
In computer architecture, almost everything is a cache!
Registers: a cache on variables
First-level cache: a cache on the second-level cache
Second-level cache: a cache on memory
Memory: a cache on disk (virtual memory)
[Figure: the hierarchy Proc/Regs, L1-Cache, L2-Cache, Memory, Disk, etc.; levels get bigger going down the hierarchy and faster going up.]
Cache Definitions
Hit: data appears in some block in the upper level (example: Block X)
Hit Rate: the fraction of memory accesses found in the upper level
Hit Time: time to access the upper level, which consists of the access time + the time to determine hit/miss
Miss: data needs to be retrieved from a block in the lower level (Block Y)
Miss Rate = 1 - (Hit Rate)
Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
Hit Time << Miss Penalty
Review: Cache Performance
Average Memory Access Time (AMAT) = Hit Time + Miss Rate × Miss Penalty (ns or clocks)
Split by access type (each term weighted by its fraction of memory accesses):
AMAT = (Hit Time_Inst + Miss Rate_Inst × Miss Penalty_Inst)
     + (Hit Time_Data + Miss Rate_Data × Miss Penalty_Data)
Cache Performance
• Miss-oriented approach to memory access:
CPUtime = IC × (CPI_Execution + MemAccess/Inst × Miss Rate × Miss Penalty) × Cycle Time
CPUtime = IC × (CPI_Execution + MemMisses/Inst × Miss Penalty) × Cycle Time
– CPI_Execution includes ALU and memory instructions
• Separating out the memory component entirely:
– AMAT = Average Memory Access Time
– CPI_AluOps does not include memory instructions
CPUtime = IC × (AluOps/Inst × CPI_AluOps + MemAccess/Inst × AMAT) × Cycle Time
AMAT = Hit Time + Miss Rate × Miss Penalty
     = (Hit Time_Inst + Miss Rate_Inst × Miss Penalty_Inst)
     + (Hit Time_Data + Miss Rate_Data × Miss Penalty_Data)
Impact on Performance
Suppose a processor executes at Clock Rate = 200 MHz (5 ns per cycle), with an ideal (no misses) CPI of 1.1 and an instruction mix of 50% arith/logic, 30% ld/st, 20% control.
Suppose that 10% of memory operations incur a 50-cycle miss penalty, and 1% of instructions incur the same miss penalty.
CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/inst)
    + [0.30 (DataMops/inst) × 0.10 (miss/DataMop) × 50 (cycles/miss)]
    + [1 (InstMop/inst) × 0.01 (miss/InstMop) × 50 (cycles/miss)]
    = (1.1 + 1.5 + 0.5) cycles/inst = 3.1
About 65% of the time (2.0 of 3.1 cycles) the processor is stalled waiting for memory!
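The arithmetic above can be checked with a short helper; this is a sketch, and the function and parameter names are illustrative, not from the slides:

```c
#include <assert.h>

/* Effective CPI = ideal CPI + average memory stall cycles per instruction.
   With the slide's numbers: 30% of instructions are loads/stores, 10% of
   data accesses miss, 1% of instruction fetches miss, 50-cycle penalty. */
double effective_cpi(double ideal_cpi,
                     double data_ops_per_inst, double data_miss_rate,
                     double inst_fetches_per_inst, double inst_miss_rate,
                     double miss_penalty_cycles)
{
    /* stalls from data accesses: 0.30 * 0.10 * 50 = 1.5 cycles/inst */
    double data_stalls = data_ops_per_inst * data_miss_rate * miss_penalty_cycles;
    /* stalls from instruction fetches: 1.0 * 0.01 * 50 = 0.5 cycles/inst */
    double inst_stalls = inst_fetches_per_inst * inst_miss_rate * miss_penalty_cycles;
    return ideal_cpi + data_stalls + inst_stalls;
}
```

With the slide's numbers this gives 1.1 + 1.5 + 0.5 = 3.1 cycles per instruction, so stalls account for 2.0/3.1 of all cycles.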
Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level?
Q3: Which block should be replaced on a miss?
Q4: What happens on a write?
Block Placement
Q1: Where can a block be placed in the upper level?
Fully Associative,
Set Associative,
Direct Mapped
1 KB Direct Mapped Cache, 32B blocks
For a 2^N byte cache:
The uppermost (32 - N) bits are always the Cache Tag
The lowest M bits are the Byte Select (Block Size = 2^M)
[Figure: 32-bit address split for this cache. Cache Tag in bits [31:10] (example: 0x50), Cache Index in bits [9:5] (ex: 0x01), Byte Select in bits [4:0] (ex: 0x00). A Valid Bit and the Cache Tag are stored as part of the cache "state" alongside each 32-byte line of Cache Data (Byte 0 ... Byte 1023 across the 32 lines).]
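The address split for this 1 KB, 32-byte-block direct-mapped cache can be sketched in C. The bit widths follow the slide; the helper names are mine:

```c
#include <assert.h>
#include <stdint.h>

/* Address fields for a 1 KB direct-mapped cache with 32-byte blocks:
   5 byte-select bits [4:0], 5 index bits [9:5] (32 lines), and the
   remaining 22 bits [31:10] as the tag. */
enum { OFFSET_BITS = 5, INDEX_BITS = 5 };

uint32_t byte_select(uint32_t addr) {
    return addr & ((1u << OFFSET_BITS) - 1);
}
uint32_t cache_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}
uint32_t cache_tag(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);
}
```

For example, the slide's field values (tag 0x50, index 0x01, byte select 0x00) together correspond to address 0x14020.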
Set Associative Cache
N-way set associative: N entries for each Cache Index (N direct-mapped caches operating in parallel). How big is the tag?
Example: two-way set associative cache
The Cache Index selects a "set" from the cache; the two tags in the set are compared to the input tag in parallel, and the data is selected based on the tag comparison result.
[Figure: two-way set-associative lookup. The address's Adr Tag is compared against the Cache Tag of each way (Valid bits checked); the compare results drive Sel1/Sel0 into a mux that selects the Cache Block from the two Cache Data arrays, and their OR produces Hit.]
Block Identification
Q2: How is a block found if it is in the upper level?
The Index identifies the set of possibilities
A Tag on each block is compared; no need to check the index or block offset
Increasing associativity shrinks the index and expands the tag
Block Address = [ Tag | Index ], followed by the Block Offset
Cache size = Associativity × 2^(index size) × 2^(offset size)
Q3: Which block should be replaced on a miss?
Easy for Direct Mapped.
Set Associative or Fully Associative:
Random
LRU (Least Recently Used)

Miss rates, LRU vs. Random:
Size    | 2-way LRU  Ran   | 4-way LRU  Ran   | 8-way LRU  Ran
16 KB   | 5.2%       5.7%  | 4.7%       5.3%  | 4.4%       5.0%
64 KB   | 1.9%       2.0%  | 1.5%       1.7%  | 1.4%       1.5%
256 KB  | 1.15%      1.17% | 1.13%      1.13% | 1.12%      1.12%
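For 2-way associativity, exact LRU is cheap: one bit per set suffices, since after every access the other way is, by definition, least recently used. A minimal sketch (the struct and helper names are illustrative):

```c
#include <assert.h>

/* One LRU bit per 2-way set: it names the way to evict on the next miss. */
typedef struct { int lru_way; } set_state;

/* State after an access to `way`: the untouched way becomes LRU. */
set_state after_access(set_state s, int way) {
    s.lru_way = 1 - way;
    return s;
}

/* Way to replace on the next miss in this set. */
int victim(set_state s) {
    return s.lru_way;
}
```

Random replacement needs no such per-set state at all, which is why it is simpler in hardware; the table above shows LRU's miss-rate advantage shrinks as the cache grows.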
Q4: What happens on a write?
Write-through: all writes update the cache and the underlying memory/cache
Can always discard cached data - the most up-to-date data is in memory
Cache control bit: only a valid bit
Write-back: all writes simply update the cache
Can't just discard cached data - it may have to be written back to memory
Cache control bits: both valid and dirty bits
Other advantages:
Write-through:
memory (or other processors) always has the latest data
simpler cache management
Write-back:
much lower bandwidth, since data is often overwritten multiple times
better tolerance to long-latency memory
Write Buffer for Write Through
Processor
Cache
DRAM
Write Buffer
A Write Buffer is needed between the Cache and Memory
Processor: writes data into the cache and the write buffer
Memory controller: write contents of the buffer to memory
Write buffer is just a FIFO:
Dr. Amr Talaat
Typical number of entries: 4
Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cyc
le
17
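A minimal model of the 4-entry FIFO described above, assuming one address per entry; the names and layout are illustrative, not a real controller design:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 4-entry write buffer: the processor enqueues completed stores and the
   memory controller drains them to DRAM in FIFO order. */
enum { WB_ENTRIES = 4 };

typedef struct {
    uint32_t addr[WB_ENTRIES];   /* circular buffer of store addresses */
    int head, count;
} write_buffer;

/* Processor side: returns false when full, i.e. the processor must stall. */
bool wb_push(write_buffer *wb, uint32_t addr) {
    if (wb->count == WB_ENTRIES)
        return false;
    wb->addr[(wb->head + wb->count) % WB_ENTRIES] = addr;
    wb->count++;
    return true;
}

/* Memory-controller side: pop the oldest store's address for writing. */
uint32_t wb_drain(write_buffer *wb) {
    uint32_t a = wb->addr[wb->head];
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    return a;
}
```

The "works fine if" condition on the slide is exactly the condition under which wb_push rarely sees a full buffer: stores arrive more slowly than the controller can drain them.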
The Cache Design Space
Several interacting dimensions:
cache size
block size
associativity
replacement policy
write-through vs. write-back
The optimal choice is a compromise:
depends on access characteristics (workload; use: I-cache, D-cache, TLB)
depends on technology / cost
Simplicity often wins.
[Figure: design-space sketches of miss rate vs. Cache Size, Associativity, and Block Size, plus a generic Good/Bad trade-off curve over Factor A vs. Factor B (Less ... More).]
How to Improve Cache Performance?
CPUtime = IC × (CPI_Execution + Memory accesses/Instruction × Miss rate × Miss penalty) × Clock cycle time
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Reducing Misses
Classifying Misses: the 3 Cs
Compulsory: the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses.
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.
More recently, a 4th "C":
Coherence: misses caused by cache coherence.
Classify Cache Misses, How?
(1) Infinite cache, fully associative
Compulsory misses
(2) Finite cache, fully associative
Compulsory misses + Capacity misses
(3) Finite cache, limited associativity
Compulsory misses + Capacity misses + Conflict misses
3Cs Miss Rate
[Figure: miss rate per type vs. cache size (1-128 KB) for 1-way, 2-way, 4-way, and 8-way associativity. The total miss rate (about 0.14 at 1 KB, falling below 0.02 at 128 KB) splits into a Conflict component that shrinks with associativity, a Capacity component, and a Compulsory component at the bottom.]
Compulsory misses are extremely small.
Reduce the miss rate,
Larger cache
Reduce Misses via Larger Block Size
Reduce Misses via Higher Associativity
Reducing Misses via Victim Cache
Reducing Misses via Pseudo-Associativity
Reducing Misses by HW Prefetching Instr, Data
Reducing Misses by SW Prefetching Data
Reducing Misses by Compiler Optimizations
Reduce Misses via Larger Block Size
Reduce Misses via Higher Associativity
Rule of thumb: miss rate of a 1-way associative cache of size X
~= miss rate of a 2-way associative cache of size X/2
[Figure: the same 3Cs miss-rate-per-type plot vs. cache size (1-128 KB), showing the Conflict component shrinking from 1-way to 8-way, with Capacity and Compulsory beneath.]
Reducing Misses via a “Victim Cache”
Add a small buffer to hold data discarded from the cache.
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache.
[Figure: a fully associative victim cache of four entries, each a tag-and-comparator plus one cache line of data, sitting between the TAGS/DATA arrays of the main cache and the next lower level in the hierarchy.]
Reducing Misses via “Pseudo-Associativity”
How can we combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
Divide the cache: on a miss, check the other half of the cache to see if the data is there; if so, we have a pseudo-hit (slow hit).
Hit Time < Pseudo Hit Time < Miss Penalty (increasing time)
Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles
Better for caches not tied directly to the processor (L2)
Used in the MIPS R10000 L2 cache; similar in UltraSPARC
Reducing Misses by Hardware Prefetching of Instructions & Data
E.g., instruction prefetching:
The Alpha 21064 fetches 2 blocks on a miss
The extra block is placed in a "stream buffer"
On a miss, check the stream buffer
Works with data blocks too:
Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4 KB cache; 4 streams caught 43%
Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64 KB, 4-way set-associative caches
Prefetching relies on having extra memory bandwidth that can be used without penalty.
Reducing Misses by Software Prefetching of Data
Provide special prefetching instructions that cannot cause faults; a form of speculative execution.
Data prefetch:
Cache prefetch: load into cache (MIPS IV, PowerPC, SPARC v.9)
Issuing prefetch instructions takes time:
Is the cost of prefetch issues < the savings in reduced misses?
Wider superscalar machines reduce the difficulty of issue bandwidth
Relies on having extra memory bandwidth that can be used without penalty.
Reducing Misses by Compiler Optimizations
McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software. How?
For instructions:
look at conflicts (using tools they developed)
reorder procedures in memory so as to reduce conflict misses
For data:
Merging arrays: improve spatial locality by using a single array of compound elements vs. two arrays
Loop interchange: change the nesting of loops to access data in the order it is stored in memory
Loop fusion: combine two independent loops that have the same looping and some overlapping variables
Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves spatial locality.
Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.
Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

Two misses per access to a & c vs. one miss per access; improves temporal locality.
Reduce the miss penalty,
Read priority over write on miss
Early Restart and Critical Word First on miss
Non-blocking Caches (Hit under Miss, Miss under Miss)
Second Level Cache
Read Priority over Write on Miss
[Figure: CPU with in/out data paths; a write buffer sits between the cache and DRAM (or lower memory).]
Read Priority over Write on Miss
Write-through with write buffers => RAW conflicts with main memory reads on cache misses
If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000)
Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue
Write-back: we want a buffer to hold displaced blocks
Read miss replacing a dirty block:
Normal: write the dirty block to memory, and then do the read
Instead: copy the dirty block to a write buffer, then do the read, and then do the write
The CPU stalls less, since it restarts as soon as the read is done
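The conflict check described above can be sketched as a scan of the buffered store addresses; this is an illustrative helper, not taken from any real controller:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A read miss may bypass the write buffer only if no pending store
   targets the same block address; otherwise the write must be resolved
   first (by draining the buffer or forwarding the buffered data). */
enum { WB_MAX = 4 };

typedef struct {
    uint32_t block_addr[WB_MAX];   /* block addresses of buffered stores */
    int count;
} write_buffer;

bool read_may_bypass(const write_buffer *wb, uint32_t block_addr) {
    for (int i = 0; i < wb->count; i++)
        if (wb->block_addr[i] == block_addr)
            return false;          /* RAW hazard with a buffered store */
    return true;
}
```

With only 4 entries the scan is a handful of parallel comparators in hardware, which is why the check is cheap enough to do on every read miss.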
Early Restart and Critical Word First
Don't wait for the full block to be loaded before restarting the CPU:
Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first.
Generally useful only with large blocks.
Spatial locality is a problem: the CPU tends to want the next sequential word of the block, so it is not clear there is a benefit from early restart.
Non-blocking Caches to Reduce Stalls on Misses
A non-blocking cache (lockup-free cache) allows the data cache to continue to supply cache hits during a miss
requires F/E bits on registers or out-of-order execution
requires multi-bank memories
"Hit under miss" reduces the effective miss penalty by working during a miss instead of ignoring CPU requests
"Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
Requires multiple memory banks (otherwise it cannot be supported)
The Pentium Pro allows 4 outstanding memory misses
Add a Second-Level Cache
L2 Equations:
AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
Definitions:
Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2)
The global miss rate is what matters.
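The L2 equations above can be captured directly in code; the function names and the example numbers in the note below are mine, chosen only for illustration:

```c
#include <assert.h>

/* Two-level AMAT:
   AMAT = HitTime_L1 + MissRate_L1 * (HitTime_L2 + MissRate_L2 * MissPenalty_L2)
   where MissRate_L2 is the *local* L2 miss rate (misses per L2 access). */
double amat_two_level(double hit_l1, double miss_rate_l1,
                      double hit_l2, double local_miss_rate_l2,
                      double miss_penalty_l2)
{
    double miss_penalty_l1 = hit_l2 + local_miss_rate_l2 * miss_penalty_l2;
    return hit_l1 + miss_rate_l1 * miss_penalty_l1;
}

/* Global L2 miss rate: misses per memory access generated by the CPU. */
double global_miss_rate_l2(double miss_rate_l1, double local_miss_rate_l2)
{
    return miss_rate_l1 * local_miss_rate_l2;
}
```

For instance, assuming a 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 50% local L2 miss rate, and a 100-cycle L2 miss penalty: AMAT = 1 + 0.05 × (10 + 0.5 × 100) = 4 cycles, and only 0.05 × 0.5 = 2.5% of CPU accesses go all the way to memory, which is why the global rate is the one that matters.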
Cache Optimization Summary

Technique                          | MR | MP | HT | Complexity
Larger Block Size                  | +  | -  |    | 0
Higher Associativity               | +  |    | -  | 1
Victim Caches                      | +  |    |    | 2
Pseudo-Associative Caches          | +  |    |    | 2
HW Prefetching of Instr/Data       | +  |    |    | 2
Compiler Controlled Prefetching    | +  |    |    | 3
Compiler Reduce Misses             | +  |    |    | 0
Priority to Read Misses            |    | +  |    | 1
Early Restart & Critical Word 1st  |    | +  |    | 2
Non-Blocking Caches                |    | +  |    | 3
Second Level Caches                |    | +  |    | 2

(MR = miss rate, MP = miss penalty, HT = hit time; + improves, - hurts.)