PowerPoint Presentation - German University in Cairo
Computer Architecture:
Memory Architecture
Dr. Eng. Amr T. Abdel-Hamid
Slides are adapted from:
• J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition, Morgan Kaufmann Publishing Co., Menlo Park, CA, 1996. Copyright 1998 UCB
• S. Tahar, Computer Architecture, Concordia University
Winter 2012
Computer Architecture
Elect707
Why Memory Organization?
[Figure: single-processor performance, 1980-2000, log scale. CPU (µProc) performance, "Moore's Law", improves ~60%/yr while DRAM improves only ~7%/yr, so the processor-memory performance gap grows ~50%/yr.]
Dr. Amr Talaat
The Principle of Locality
 The Principle of Locality:
 Programs access a relatively small portion of the address space at any instant of time.
 Two Different Types of Locality:
 Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
 Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
 This is the basic principle used to overcome the memory/processor interaction problem
What is a cache?
 Small, fast storage used to improve average access time to slow memory.
 Exploits spatial and temporal locality
 In computer architecture, almost everything is a cache!
 Registers are a cache on variables
 First-level cache is a cache on the second-level cache
 Second-level cache is a cache on memory
 Memory is a cache on disk (virtual memory)
[Figure: memory hierarchy — Proc/Regs, L1-Cache, L2-Cache, Memory, Disk, etc.; levels get bigger going down and faster going up.]
Cache Definitions
 Hit: data appears in some block in the upper level (example: Block X)
 Hit Rate: the fraction of memory accesses found in the upper level
 Hit Time: time to access the upper level, which consists of access time + time to determine hit/miss
 Miss: data needs to be retrieved from a block in the lower level (Block Y)
 Miss Rate = 1 - (Hit Rate)
 Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
 Hit Time << Miss Penalty
Review: Cache performance
AMAT = HitTime + MissRate × MissPenalty
     = (HitTime_Inst + MissRate_Inst × MissPenalty_Inst)
     + (HitTime_Data + MissRate_Data × MissPenalty_Data)
Average Memory Access Time (AMAT) = Hit time + Miss rate x Miss penalty (ns or clocks)
Cache performance
• Miss-oriented approach to memory access:
CPUtime = IC × (CPI_Execution + MemAccess/Inst × MissRate × MissPenalty) × CycleTime
CPUtime = IC × (CPI_Execution + MemMisses/Inst × MissPenalty) × CycleTime
– CPI_Execution includes ALU and Memory instructions
• Separating out the Memory component entirely:
– AMAT = Average Memory Access Time
– CPI_AluOps does not include memory instructions
CPUtime = IC × (AluOps/Inst × CPI_AluOps + MemAccess/Inst × AMAT) × CycleTime
AMAT = HitTime + MissRate × MissPenalty
     = (HitTime_Inst + MissRate_Inst × MissPenalty_Inst)
     + (HitTime_Data + MissRate_Data × MissPenalty_Data)
Impact on Performance
Suppose a processor executes at
Clock Rate = 200 MHz (5 ns per cycle), Ideal (no misses) CPI = 1.1
50% arith/logic, 30% ld/st, 20% control
Suppose that 10% of memory operations get a 50-cycle miss penalty
Suppose that 1% of instructions get the same miss penalty
CPI = ideal CPI + average stalls per instruction
= 1.1 (cycles/ins) + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycle/miss)] + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycle/miss)] = (1.1 + 1.5 + .5) cycle/ins = 3.1
 About 65% of the time the proc is stalled waiting for memory!
Four Questions for Memory Hierarchy Designers
 Q1: Where can a block be placed in the upper level? (Block placement)
 Q2: How is a block found if it is in the upper level?
 Q3: Which block should be replaced on a miss?
 Q4: What happens on a write?
Block Placement
 Q1: Where can a block be placed in the upper level?
 Fully Associative,
 Set Associative,
 Direct Mapped
1 KB Direct Mapped Cache, 32B blocks
 For a 2^N byte cache:
 The uppermost (32 - N) bits are always the Cache Tag
 The lowest M bits are the Byte Select (Block Size = 2^M)
[Figure: address breakdown and cache array — bits 31:10 are the Cache Tag (example: 0x50), bits 9:5 the Cache Index (example: 0x01), bits 4:0 the Byte Select (example: 0x00). Each of the 32 cache lines stores a Valid Bit and Cache Tag as part of the cache "state", plus 32 bytes of Cache Data (Byte 0 ... Byte 1023 across the array).]
Set Associative Cache
 N-way set associative: N entries for each Cache Index
 N direct mapped caches operate in parallel
 How big is the tag?
 Example: Two-way set associative cache
 Cache Index selects a "set" from the cache
 The two tags in the set are compared to the input in parallel
 Data is selected based on the tag comparison result
[Figure: two-way set-associative lookup — the Cache Index selects one set; the two stored Cache Tags (with Valid bits) are compared against the address tag (Adr Tag) in parallel; a mux driven by the compare results (Sel1/Sel0) picks the Cache Block, and Hit is the OR of the two compares.]
Block Identification
Q2: How is a block found if it is in the upper level?
 Index identifies the set of possibilities
 Tag on each block
 No need to check index or block offset
 Increasing associativity shrinks index, expands tag
Block Address = [ Tag | Index | Block Offset ]
Cache size = Associativity × 2^(index size) × 2^(offset size)
Q3: Which block should be replaced on a miss?
 Easy for Direct Mapped
 Set Associative or Fully Associative:
 Random
 LRU (Least Recently Used)

Miss rates, LRU vs. Random replacement:

 Assoc:     2-way          4-way          8-way
 Size       LRU    Ran     LRU    Ran     LRU    Ran
 16 KB      5.2%   5.7%    4.7%   5.3%    4.4%   5.0%
 64 KB      1.9%   2.0%    1.5%   1.7%    1.4%   1.5%
 256 KB     1.15%  1.17%   1.13%  1.13%   1.12%  1.12%
Q4: What happens on a write?
 Write-through: all writes update cache and underlying memory/cache
 Can always discard cached data - most up-to-date data is in memory
 Cache control bit: only a valid bit
 Write-back: all writes simply update cache
 Can’t just discard cached data - may have to write it back to memory
 Cache control bits: both valid and dirty bits
 Other Advantages:
 Write-through:
memory (or other processors) always has the latest data
 Simpler management of cache
 Write-back:
 much lower bandwidth, since data often overwritten multiple times
 Better tolerance to long-latency memory?
Write Buffer for Write Through
[Figure: the Processor writes into the Cache and the Write Buffer; the Write Buffer drains to DRAM.]
 A Write Buffer is needed between the Cache and Memory
 Processor: writes data into the cache and the write buffer
 Memory controller: writes contents of the buffer to memory
 Write buffer is just a FIFO:
 Typical number of entries: 4
 Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle
The Cache Design Space
 Several interacting dimensions:
 cache size
 block size
 associativity
 replacement policy
 write-through vs write-back
 use (I-cache, D-cache, TLB)
 The optimal choice is a compromise
 depends on access characteristics (workload)
 depends on technology / cost
 Simplicity often wins
[Figure: design-space sketches — one axis each for Cache Size, Associativity, and Block Size, plus a generic Good/Bad trade-off curve of Factor A against Factor B (Less to More).]
How to improve Cache Performance?
CPUtime = IC × (CPI_Execution + MemAccesses/Instruction × MissRate × MissPenalty) × ClockCycleTime
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Reducing Misses
 Classifying Misses: 3 Cs
 Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses.
 Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
 Conflict—If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.
 More recent, 4th "C":
 Coherence—Misses caused by cache coherence.
Classify Cache Misses, How?
 (1) Infinite cache, fully associative
 Compulsory misses
 (2) Finite cache, fully associative
 Compulsory misses + Capacity misses
 (3) Finite cache, limited associativity
 Compulsory misses + Capacity misses + Conflict misses
3Cs Miss Rate
[Figure: miss rate per type vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity. Conflict misses shrink as associativity grows, capacity misses shrink as cache size grows, and compulsory misses are extremely small.]
Reduce the miss rate,
 Larger cache
 Reduce Misses via Larger Block Size
 Reduce Misses via Higher Associativity
 Reducing Misses via Victim Cache
 Reducing Misses via Pseudo-Associativity
 Reducing Misses by HW Prefetching Instr, Data
 Reducing Misses by SW Prefetching Data
 Reducing Misses by Compiler Optimizations
Reduce Misses via Larger Block Size
Reduce Misses via Higher Associativity
2:1 Cache Rule:
miss rate of a 1-way associative cache of size X
~= miss rate of a 2-way associative cache of size X/2
[Figure: same 3Cs plot as before — miss rate per type (conflict, capacity, compulsory) vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity.]
Reducing Misses via a “Victim Cache”
Add a buffer to hold data discarded from the cache
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache
[Figure: a small fully associative victim cache (four entries, each a tag-and-comparator plus one cache line of data) sits between the cache's TAGS/DATA arrays and the next lower level in the hierarchy.]
Reducing Misses via “Pseudo-Associativity”
How to combine the fast hit time of Direct Mapped with the lower conflict misses of a 2-way SA cache?
Divide cache: on a miss, check the other half of the cache to see if the data is there; if so, we have a pseudo-hit (slow hit)
[Figure: timeline — Hit Time < Pseudo Hit Time < Miss Penalty.]
Drawback: CPU pipeline design is hard if a hit can take 1 or 2 cycles
 Better for caches not tied directly to the processor (L2)
 Used in the MIPS R10000 L2 cache, similar in UltraSPARC
Reducing Misses by Hardware Pre-fetching of Instructions & Data
 E.g., Instruction Prefetching
 Alpha 21064 fetches 2 blocks on a miss
 Extra block placed in "stream buffer"
 On miss, check stream buffer
 Works with data blocks too:
 Jouppi [1990]: 1 data stream buffer got 25% of misses from a 4 KB cache; 4 streams got 43%
 Palacharla & Kessler [1994]: for scientific programs, 8 streams got 50% to 70% of misses from two 64 KB, 4-way set associative caches
 Prefetching relies on having extra memory bandwidth that can be used without penalty
Reducing Misses by Software Prefetching Data
 Build special prefetching instructions that cannot cause faults; a form of speculative execution
 Data Prefetch
 Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)
 Issuing Prefetch Instructions takes time
 Is the cost of prefetch issues < savings in reduced misses?
 Higher superscalar width reduces the difficulty of issue bandwidth
 Relies on having extra memory bandwidth that can be used without penalty
Reducing Misses by Compiler Optimizations
 McFarling [1989] reduced cache misses by 75% on an 8 KB direct mapped cache with 4 byte blocks, in software
 How?
 For instructions:
look at conflicts (using tools they developed)
Reorder procedures in memory so as to reduce conflict misses
 For Data
Merging Arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
Loop Interchange: change nesting of loops to access data in the order stored in memory
Loop Fusion: combine 2 independent loops that have the same looping and some variables overlap
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

 Reduces conflicts between val & key; improves spatial locality
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

 Sequential accesses instead of striding through memory every 100 words; improved spatial locality
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

 2 misses per access to a & c vs. one miss per access; improves temporal locality
Reduce the miss penalty,
 Read priority over write on miss
 Early Restart and Critical Word First on miss
 Non-blocking Caches (Hit under Miss, Miss under Miss)
 Second Level Cache
Read Priority over Write on Miss
[Figure: the CPU's write path goes through a write buffer to DRAM (or lower memory); reads can be serviced around buffered writes.]
 Write-through w/ write buffers => RAW conflicts with main memory reads on cache misses
 If we simply wait for the write buffer to empty, we might increase the read miss penalty (old MIPS 1000 by 50%)
 Check write buffer contents before a read; if no conflicts, let the memory access continue
 Write-back: want the buffer to hold displaced blocks
 Read miss replacing a dirty block
 Normal: write the dirty block to memory, and then do the read
 Instead: copy the dirty block to a write buffer, then do the read, and then do the write
 CPU stalls less since it restarts as soon as the read is done
Early Restart and Critical Word First
 Don't wait for the full block to be loaded before restarting the CPU
 Early restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
 Critical Word First—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
 Generally useful only with large blocks
 Spatial locality is a problem; we tend to want the next sequential word, so it is not clear there is a benefit from early restart
Non-blocking Caches to reduce stalls on misses
 Non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
 requires F/E bits on registers or out-of-order execution
 requires multi-bank memories
 "hit under miss" reduces the effective miss penalty by working during the miss vs. ignoring CPU requests
 "hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
 Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses
 Requires multiple memory banks (otherwise cannot support)
 Pentium Pro allows 4 outstanding memory misses
Add a second-level cache
 L2 Equations
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
 Definitions:
 Local miss rate—misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
 Global miss rate—misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
 Global Miss Rate is what matters
Cache Optimization Summary

 Technique                           MR   MP   HT   Complexity
 Larger Block Size                   +    –         0
 Higher Associativity                +         –    1
 Victim Caches                       +              2
 Pseudo-Associative Caches           +              2
 HW Prefetching of Instr/Data        +              2
 Compiler Controlled Prefetching     +              3
 Compiler Reduce Misses              +              0
 Priority to Read Misses                  +         1
 Early Restart & Critical Word 1st        +         2
 Non-Blocking Caches                      +         3
 Second Level Caches                      +         2

 (MR = miss rate, MP = miss penalty, HT = hit time)