Memory technology

Ch 7 Memory Hierarchy
° Most of the slides are from Prof. Dave Patterson of
the University of California at Berkeley
° Part of the material is from Sun Microsystems
° Part of the material is from AF
° "Copyright 1997 UCB." Permission is granted to
alter and distribute this material provided that the
following credit line is included: "Adapted from
(complete bibliographic citation). Copyright 1997
UCB."
EE30332 Ch7 DP .1
Ch 7 Memory Hierarchy
          Capacity         Speed (latency)
Logic:    2x in 3 years    2x in 3 years
DRAM:     4x in 3 years    2x in 10 years
Disk:     4x in 3 years    2x in 10 years
DRAM
Year    Size       Cycle Time
1980    64 Kb      250 ns
1983    256 Kb     220 ns
1986    1 Mb       190 ns
1989    4 Mb       165 ns
1992    16 Mb      145 ns
1995    64 Mb      120 ns
        1000:1!    2:1!
Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency)
[Figure: Performance (log scale, 1 to 1000) versus Time (1980 to 2000)]
• µProc (CPU): 60%/yr. “Moore’s Law” (2X/1.5yr)
• DRAM: 9%/yr. (2X/10 yrs)
• Processor-Memory Performance Gap: (grows 50% / year)
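The gap figure can be checked with a few lines of arithmetic. This is only a sketch using the two growth rates quoted on the slide; the function name and the 20-year span (1980 to 2000, the range of the plot's time axis) are choices made for the illustration.

```python
# Annual improvement factors taken from the slide.
CPU_GROWTH = 1.60   # µProc: 60% per year
DRAM_GROWTH = 1.09  # DRAM latency: 9% per year

def gap_after(years):
    """Processor/DRAM performance ratio after `years`, starting from a ratio of 1."""
    return (CPU_GROWTH / DRAM_GROWTH) ** years

# Yearly growth of the gap: roughly 0.47, which the slide rounds to 50% / year.
annual_gap_growth = CPU_GROWTH / DRAM_GROWTH - 1.0

# Accumulated gap over the 20 years shown on the plot's time axis.
gap_1980_to_2000 = gap_after(20)

print(f"gap grows {annual_gap_growth:.0%} per year")
print(f"gap after 20 years: about {gap_1980_to_2000:.0f}x")
```

Compounding is what makes the gap dramatic: a yearly difference of under 50% still accumulates to more than three orders of magnitude over two decades.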
Impact on Performance
° Suppose a processor executes at
• Clock Rate = 200 MHz (5 ns per cycle)
• CPI = 1.1
• 50% arith/logic, 30% ld/st, 20% control
° Suppose that 10% of memory
operations get 50 cycle
miss penalty
° CPI = ideal CPI + average stalls per instruction
= 1.1 (cyc) + (0.30 (datamops/ins)
x 0.10 (miss/datamop) x 50 (cycle/miss))
= 1.1 cycle + 1.5 cycle
= 2.6
° 58% of the time the processor
is stalled waiting for memory!
° a 1% instruction miss rate would add
an additional 0.5 cycles to the CPI!
[Pie chart: Ideal CPI (1.1) 35%, DataMiss (1.6) 49%, Inst Miss (0.5) 16%]
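The CPI calculation on this slide can be reproduced directly. The sketch below uses only the numbers given above; the helper name `stall_cycles` is made up for the example, and the instruction-miss case assumes one instruction fetch per instruction.

```python
IDEAL_CPI = 1.1
LOADS_STORES_PER_INSTR = 0.30   # 30% ld/st
DATA_MISS_RATE = 0.10           # 10% of memory operations miss
MISS_PENALTY_CYCLES = 50

def stall_cycles(accesses_per_instr, miss_rate, penalty_cycles):
    """Average stall cycles per instruction caused by one kind of memory access."""
    return accesses_per_instr * miss_rate * penalty_cycles

data_stalls = stall_cycles(LOADS_STORES_PER_INSTR, DATA_MISS_RATE, MISS_PENALTY_CYCLES)
cpi = IDEAL_CPI + data_stalls
stall_fraction = data_stalls / cpi

# Assumption: every instruction is fetched once, so 1.0 access per instruction.
inst_stalls = stall_cycles(1.0, 0.01, MISS_PENALTY_CYCLES)

print(f"data stalls = {data_stalls:.1f} cycles, CPI = {cpi:.1f}")
print(f"stalled {stall_fraction:.0%} of the time")
print(f"a 1% instruction miss rate adds {inst_stalls:.1f} cycles")
```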
The Goal: illusion of large, fast, cheap memory
° Fact: Large memories are slow, fast memories are
small
° How do we create a memory that is large, cheap and
fast (most of the time)?
• Hierarchy
• Parallelism
An Expanded View of the Memory System
[Figure: Processor (Control, Datapath) followed by a chain of Memory levels, from nearest to farthest]
Speed: Fastest -> Slowest
Size: Smallest -> Biggest
Cost: Highest -> Lowest
Why hierarchy works
° The Principle of Locality:
• Programs access a relatively small portion of the
address space at any instant of time.
[Figure: Probability of reference versus Address Space, from 0 to 2^n - 1]
Memory Hierarchy: How Does it Work?
° Temporal Locality (Locality in Time):
=> Keep most recently accessed data items closer to the
processor
° Spatial Locality (Locality in Space):
=> Move blocks consisting of contiguous words to the
upper levels
[Figure: Upper Level Memory holding Blk X and Lower Level Memory holding Blk Y; data flows To Processor and From Processor through the upper level]
Memory Hierarchy: Terminology
° Hit: data appears in some block in the upper level
(example: Block X)
• Hit Rate: the fraction of memory accesses found in the
upper level
• Hit Time: RAM access time + time to determine hit/miss
° Miss: data needs to be retrieved from lower storage
• Miss Rate = 1 - (Hit Rate)
• Miss Penalty: Time to replace a block in the upper level
+ Time to deliver the block to the processor
° Hit Time << Miss Penalty
[Figure: Upper Level Memory holding Blk X and Lower Level Memory holding Blk Y; data flows To Processor and From Processor through the upper level]
How is the hierarchy managed?
° Registers <-> Memory (cache is faster copy of mem)
• by compiler (programmer?)
° cache <-> memory
• by the hardware
° memory <-> disks
• by the hardware and operating system (virtual
memory)
• by the programmer (files)
The processor sees registers, cache, and memory, but not
disk, when executing an instruction
Disks are handled as I/O, through Page Faults to/from
memory
Memory Hierarchy Technology
° Random Access:
• “Random” is good: access time is the same for
all locations
• DRAM: Dynamic Random Access Memory
- High density, low power, cheap, slow
- Dynamic: needs to be “refreshed” regularly
• SRAM: Static Random Access Memory
- Low density, high power, expensive, fast
- Static: content will last “forever” (until power is lost)
° “Not-so-random” Access Technology:
• Access time varies from location to location and
from time to time
• Examples: Disk, CDROM
° Sequential Access Technology: access time linear in location
(e.g.,Tape)
• The Main Memory: DRAMs + Caches: SRAMs
Main Memory Background
° Performance of Main Memory:
• Latency: Cache Miss Penalty
- Access Time: time between request and word arrives
- Cycle Time: time between requests
• Bandwidth: I/O & Large Block Miss Penalty (L2)
° Main Memory is DRAM: Dynamic Random Access Memory
(except supercomputers)
• Dynamic since needs to be refreshed periodically (8 ms)
• Addresses divided into 2 halves (Memory as a 2D matrix):
- RAS or Row Access Strobe
- CAS or Column Access Strobe
° Cache uses SRAM: Static Random Access Memory
• No refresh (6 transistors/bit vs. 1 transistor /bit)
• Address not divided
° Size: DRAM/SRAM 4-8
° Cost/Cycle time: SRAM/DRAM 8-16
Random Access Memory (RAM) Technology
° Why do computer designers need to know about RAM
technology?
• Processor performance is usually limited by memory
bandwidth
• As IC densities increase, lots of memory will fit on
processor chip
- Tailor on-chip memory to specific needs:
- Instruction cache
- Data cache
- Write buffer
° What makes RAM different from a bunch of flip-flops?
• Density: RAM is much denser
Classical DRAM Organization (square)
[Figure: square RAM Cell Array; each intersection of a word (row) select line and a bit (data) line represents a 1-T DRAM Cell. The row address feeds the row decoder; the Column Address feeds the Column Selector & I/O Circuits, which deliver the data]
° Row and Column Address
together:
• Select 1 bit at a time
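To make the row/column selection concrete, here is a small sketch of how one cell address could be split into the two halves sent with RAS and CAS. The function name is hypothetical, and sending the upper half as the row is an assumption of this example rather than something the slide states.

```python
def split_dram_address(addr, addr_bits):
    """Split a cell address into (row, column) halves for a square cell array.

    Assumption for this sketch: the upper half of the address is the row
    (sent with RAS) and the lower half is the column (sent with CAS).
    """
    if addr_bits % 2 != 0:
        raise ValueError("a square array needs an even number of address bits")
    if not 0 <= addr < (1 << addr_bits):
        raise ValueError("address out of range")
    half = addr_bits // 2
    row = addr >> half                 # selects one word (row) line
    col = addr & ((1 << half) - 1)     # selects one bit (data) line
    return row, col

# A 64 Kb DRAM has 2**16 cells, organized here as 256 rows x 256 columns.
row, col = split_dram_address(0xABCD, 16)
print(hex(row), hex(col))
```

Because the two halves are sent one after the other, the chip needs only half as many address pins as address bits.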
Increasing Bandwidth - Interleaving
Access Pattern without Interleaving:
[Figure: CPU <-> Memory. Timeline: Start Access for D1, D1 available, Start Access for D2]
Access Pattern with 4-way Interleaving:
[Figure: CPU <-> Memory Bank 0, Memory Bank 1, Memory Bank 2, Memory Bank 3. Timeline: Access Bank 0, Access Bank 1, Access Bank 2, Access Bank 3, then We can Access Bank 0 again]
Main Memory Performance
° Timing model
• 1 to send address,
• 6 access time, 1 to send data
• Cache Block is 4 words
° Simple M.P. = 4 x (1 + 6 + 1) = 32
° Wide M.P. = 1 + 6 + 1 = 8
° Interleaved M.P. = 1 + 6 + 4 x 1 = 11
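The three results follow from the timing model above. The sketch below reads "M.P." as miss penalty in clock cycles (an interpretation, since the slide does not expand the abbreviation); the function names are invented for the example.

```python
def simple_penalty(words, send_addr, access, send_data):
    """One-word-wide memory: every word pays address + access + transfer."""
    return words * (send_addr + access + send_data)

def wide_penalty(send_addr, access, send_data):
    """Memory and bus as wide as the block: the whole block moves in one access."""
    return send_addr + access + send_data

def interleaved_penalty(words, send_addr, access, send_data):
    """Banks accessed in parallel: one address, one overlapped access, then one transfer per word."""
    return send_addr + access + words * send_data

WORDS, SEND_ADDR, ACCESS, SEND_DATA = 4, 1, 6, 1

print(simple_penalty(WORDS, SEND_ADDR, ACCESS, SEND_DATA))       # 32
print(wide_penalty(SEND_ADDR, ACCESS, SEND_DATA))                # 8
print(interleaved_penalty(WORDS, SEND_ADDR, ACCESS, SEND_DATA))  # 11
```

Interleaving gets close to the wide organization while keeping a one-word bus, because only the data transfers are serialized.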
Independent Memory Banks
° How many banks?
number banks >= number clocks to access word in bank
• For sequential accesses, otherwise will return to
original bank before it has next word ready
° Increasing DRAM capacity => fewer chips => harder to have
banks
• Growth bits/chip DRAM : 50%-60%/yr
• Nathan Myhrvold M/S: mature software growth
(33%/yr for NT) ≈ growth MB/$ of DRAM (25%-30%/yr)
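The rule for the number of banks can be illustrated with a toy model. This is a sketch under simplifying assumptions that are not from the slides: one new access can be issued per clock, consecutive words go to consecutive banks, and a bank stays busy for a fixed number of clocks after an access starts.

```python
def sequential_access_stalls(num_words, num_banks, bank_busy_clocks):
    """Clocks lost waiting for a busy bank during `num_words` sequential accesses."""
    bank_free_at = [0] * num_banks   # clock at which each bank can start a new access
    clock = 0
    stalls = 0
    for word in range(num_words):
        bank = word % num_banks      # consecutive words map to consecutive banks
        if bank_free_at[bank] > clock:
            # We came back to this bank before it finished its previous access.
            stalls += bank_free_at[bank] - clock
            clock = bank_free_at[bank]
        bank_free_at[bank] = clock + bank_busy_clocks
        clock += 1                   # the next access can be issued one clock later
    return stalls

for banks in (4, 6, 8):
    print(banks, "banks:", sequential_access_stalls(32, banks, 6), "stall clocks")
```

With a 6-clock bank access, six or more banks never stall in this model, while four banks stall every time the access pattern wraps around to a bank that is still busy.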
Fewer DRAMs/System over Time
Minimum PC Memory Size versus DRAM Generation
(number of DRAM chips per system; from Pete MacWilliams, Intel)
Memory     ‘86     ‘89     ‘92      ‘96      ‘99       ‘02
size       1 Mb    4 Mb    16 Mb    64 Mb    256 Mb    1 Gb
4 MB       32      8
8 MB               16      4
16 MB                      8        2
32 MB                               4        1
64 MB                               8        2
128 MB                                       4         1
256 MB                                       8         2
Memory per DRAM growth @ 60% / year
Memory per System growth @ 25%-30% / year
Today’s Situation: DRAM
° Commodity, second source industry
=> high volume, low profit, conservative
• Little organization innovation (vs. processors)
in 20 years: page mode, EDO, Synch DRAM
° DRAM industry at a crossroads:
• Fewer DRAMs per computer over time
- Growth bits/chip DRAM : 50%-60%/yr
- Nathan Myhrvold M/S: mature software growth
(33%/yr for NT) ≈ growth MB/$ of DRAM (25%-30%/yr)
° DRAM is often chosen as the first major product for a new
semiconductor technology, because the cells are simple,
the arrangement is regular, and the large volume gives an
earlier return on investment
Example: 1 KB Direct Mapped Cache with 32 B Blocks
° For a 2 ** N byte cache:
• The uppermost (32 - N) bits are always the Cache Tag
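• The lowest M bits are the Byte Select (Block Size = 2 ** M)
[Figure: 32-bit address split into Cache Tag (bits 31-10, Example: 0x50), Cache Index (bits 9-5, Ex: 0x01), and Byte Select (bits 4-0, Ex: 0x00)]
[Figure: cache array with 32 entries (index 0 to 31). Each entry holds a Valid Bit, a Cache Tag (stored as part of the cache “state”), and 32 bytes of Cache Data: entry 0 holds Byte 0 to Byte 31, entry 1 holds Byte 32 to Byte 63 with tag 0x50, ..., entry 31 holds Byte 992 to Byte 1023]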
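The address split for this example can be written out as bit operations. The sketch below assumes 32-bit addresses as on the slide; the constant and function names are chosen for the illustration.

```python
ADDRESS_BITS = 32
CACHE_BYTES = 1024   # 2 ** 10, so N = 10
BLOCK_BYTES = 32     # 2 ** 5,  so M = 5

OFFSET_BITS = BLOCK_BYTES.bit_length() - 1                   # M = 5
INDEX_BITS = (CACHE_BYTES // BLOCK_BYTES).bit_length() - 1   # N - M = 5
TAG_BITS = ADDRESS_BITS - OFFSET_BITS - INDEX_BITS           # 32 - N = 22

def split_address(addr):
    """Return (cache tag, cache index, byte select) for a direct mapped cache."""
    byte_select = addr & (BLOCK_BYTES - 1)                     # lowest M bits
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)    # next N - M bits
    tag = addr >> (OFFSET_BITS + INDEX_BITS)                   # uppermost 32 - N bits
    return tag, index, byte_select

# Build the address used in the slide's example: tag 0x50, index 0x01, byte select 0x00.
addr = (0x50 << 10) | (0x01 << 5) | 0x00
tag, index, byte_select = split_address(addr)
print(hex(addr), hex(tag), hex(index), hex(byte_select))
```

The index picks the cache entry, and the stored tag of that entry is then compared with the tag bits of the address to decide hit or miss.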
Block Size Tradeoff
° In general, a larger block size takes advantage of spatial
locality BUT:
• Larger block size means larger miss penalty
- Takes longer time to fill up the block
• If block size is too big relative to cache size, miss rate will go
up
- Too few cache blocks
° In general, Average Access Time:
• = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
[Figure: three plots against Block Size]
• Miss Penalty: grows with Block Size
• Miss Rate: falls at first (Exploits Spatial Locality), then rises (Fewer blocks: compromises temporal locality)
• Average Access Time: falls at first, then rises (Increased Miss Penalty & Miss Rate)
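The tradeoff can be tried out numerically with the Average Access Time formula from this slide. In this form the Miss Penalty term is the whole time spent on an access that misses. The miss rates and penalties in the table below are invented purely to show the U-shaped curve; they are not measurements.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate."""
    return hit_time * (1.0 - miss_rate) + miss_penalty * miss_rate

HIT_TIME = 1.0  # cycles, hypothetical

# Hypothetical (block size in bytes, miss rate, miss penalty in cycles).
# The miss rate first drops (spatial locality), then rises (too few blocks),
# while the miss penalty keeps growing with the block size.
CANDIDATES = [
    (16, 0.08, 20.0),
    (64, 0.04, 32.0),
    (256, 0.07, 80.0),
]

times = {
    block: average_access_time(HIT_TIME, miss_rate, penalty)
    for block, miss_rate, penalty in CANDIDATES
}
best_block = min(times, key=times.get)

for block, t in times.items():
    print(f"{block:4d} B blocks: {t:.2f} cycles")
print("best block size in this made-up table:", best_block)
```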
Extreme Example: single big line
° Cache Size = 4 bytes
Block Size = 4 bytes
• Only ONE entry in the cache
[Figure: a single cache entry (index 0) with Valid Bit, Cache Tag, and Cache Data holding Byte 3, Byte 2, Byte 1, Byte 0]
° If an item is accessed, it is likely that it will be accessed again soon
• But it is unlikely that it will be accessed again
immediately!!!
• The next access will likely be a miss again
- Continually loading data into the cache but
discarding it (forcing it out) before it is used again
- Worst nightmare of a cache designer: Ping Pong Effect
° Conflict Misses are misses caused by:
• Different memory locations mapped to the
same cache index
- Solution 1: make the cache size bigger
- Solution 2: Multiple entries for the same Cache Index
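The Ping Pong Effect is easy to reproduce with a toy direct mapped cache. This is a minimal sketch that tracks tags only (no data, no valid bits beyond an empty marker); the function name and the address traces are made up for the example.

```python
def direct_mapped_misses(addresses, cache_bytes, block_bytes):
    """Count misses for a byte-address trace in a direct mapped cache."""
    num_blocks = cache_bytes // block_bytes
    stored_tags = [None] * num_blocks    # None marks an empty (invalid) entry
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        index = block % num_blocks       # cache index
        tag = block // num_blocks        # cache tag
        if stored_tags[index] != tag:
            misses += 1
            stored_tags[index] = tag     # force out whatever was there
    return misses

# Extreme example from the slide: a 4-byte cache with one 4-byte block.
one_line = direct_mapped_misses([0, 4] * 4, cache_bytes=4, block_bytes=4)

# Two addresses 1 KB apart collide in a 1 KB cache with 32 B blocks ...
ping_pong = direct_mapped_misses([0x0000, 0x0400] * 4, 1024, 32)
# ... but not in a cache twice as large (Solution 1).
bigger = direct_mapped_misses([0x0000, 0x0400] * 4, 2048, 32)

print(one_line, ping_pong, bigger)
```

In the colliding cases every access misses, even though only two blocks are ever touched and the rest of the cache sits empty.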
Another Extreme Example: Fully Associative
° Fully Associative Cache
• Forget about the Cache Index
• Compare the Cache Tags of all entries in parallel
° By definition: there is no Conflict Miss for a fully
associative cache
[Figure: 32-bit address split into Cache Tag (bits 31-5, 27 bits long) and Byte Select (bits 4-0, Ex: 0x01)]
[Figure: every entry holds a Valid Bit, a Cache Tag with its own comparator (X), and 32 bytes of Cache Data: Byte 0 to Byte 31, Byte 32 to Byte 63, ...]
A Two-way Set Associative Cache
° N-way set associative: N entries for each Cache Index
• N direct mapped caches operate in parallel
° Example: Two-way set associative cache
• Cache Index selects a “set” from the cache
• The two tags in the set are compared in parallel
• Data is selected based on the tag result
[Figure: the Cache Index selects one set from two arrays of Valid, Cache Tag, and Cache Data (Cache Block 0, ...). Each stored Cache Tag is compared (Compare) with the Adr Tag; the compare results drive Sel1 and Sel0 of a Mux that outputs the Cache Block, and their OR gives Hit]
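A small model shows how two entries per Cache Index remove the conflict between two colliding blocks. This is a sketch, not the slide's hardware: it tracks tags only, and it assumes LRU replacement within a set, which the slides do not specify.

```python
def set_associative_misses(addresses, cache_bytes, block_bytes, ways):
    """Count misses for a byte-address trace in an N-way set associative cache."""
    num_sets = cache_bytes // (block_bytes * ways)
    # Each set is a list of tags, least recently used first (LRU is an assumption).
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        index = block % num_sets         # cache index selects a set
        tag = block // num_sets
        tags = sets[index]
        if tag in tags:                  # hardware compares all tags of the set in parallel
            tags.remove(tag)
        else:
            misses += 1
            if len(tags) == ways:
                tags.pop(0)              # evict the least recently used entry
        tags.append(tag)                 # mark as most recently used
    return misses

trace = [0x0000, 0x0400] * 4             # two blocks that share a cache index
direct = set_associative_misses(trace, 1024, 32, ways=1)
two_way = set_associative_misses(trace, 1024, 32, ways=2)
print(direct, two_way)
```

With `ways=1` the model behaves as a direct mapped cache and misses on every access; with `ways=2` only the two first-reference misses remain.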
Disadvantage of Set Associative Cache
° N-way Set Associative Cache versus Direct Mapped
Cache:
• N comparators vs. 1
• Extra MUX delay for the data
• Data comes AFTER Hit/Miss decision and set selection
° In a direct mapped cache, Cache Block is available
BEFORE Hit/Miss:
• Possible to assume a hit and continue. Recover later if
miss.
[Figure: two-way set associative cache: the Cache Index selects a set; two Compare units check the stored Cache Tags against the Adr Tag; Sel1 and Sel0 steer the Mux that outputs the Cache Block; the OR of the compare results gives Hit]
A Summary on Sources of Cache Misses
° Compulsory (cold start or process migration, first
reference): first access to a block
• “Cold” fact of life: not a whole lot you can do
about it
• Note: If you are going to run “billions” of
instructions, Compulsory Misses are insignificant
° Conflict (collision):
• Multiple memory locations mapped
to the same cache location
• Solution 1: increase cache size
• Solution 2: increase associativity
° Capacity:
• Cache cannot contain all blocks accessed by the
program
• Solution: increase cache size
° Invalidation: other process (e.g., I/O) updates
memory
Source of Cache Misses Quiz
                         Direct Mapped    N-way Set Associative    Fully Associative
Cache Size:
(Small, Medium, Big?)
Compulsory Miss:
Conflict Miss:
Capacity Miss:
Invalidation Miss:
Choices: Zero, Low, Medium, High, Same
Impact on Cycle Time
Cache Hit Time:
• directly tied to clock rate
• increases with cache size
• increases with associativity
Average Memory Access time =
Hit Time + Miss Rate x Miss Penalty
Time = IC x CT x (ideal CPI + memory stalls)
Example: direct map allows miss signal after data
[Figure: pipeline with PC, I-Cache, IR, IRex, A, B, IRm, R, D Cache, IRwb, T, with miss and invalid signals from the caches]
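Both formulas on this slide can be evaluated directly. The sketch below reuses the 5 ns cycle time, ideal CPI of 1.1, and 1.5 stall cycles per instruction from the earlier "Impact on Performance" slide; the instruction count, the 1-cycle hit time, and the function names are hypothetical.

```python
def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """Average Memory Access time = Hit Time + Miss Rate x Miss Penalty."""
    return hit_time + miss_rate * miss_penalty

def execution_time(instr_count, cycle_time, ideal_cpi, memory_stalls_per_instr):
    """Time = IC x CT x (ideal CPI + memory stalls)."""
    return instr_count * cycle_time * (ideal_cpi + memory_stalls_per_instr)

# Hypothetical 1-cycle hit time with a 10% miss rate and 50-cycle miss penalty.
amat_cycles = average_memory_access_time(1.0, 0.10, 50.0)

# Hypothetical program of one million instructions on the 200 MHz machine.
time_ns = execution_time(1_000_000, 5.0, 1.1, 1.5)
ideal_ns = execution_time(1_000_000, 5.0, 1.1, 0.0)

print(f"AMAT = {amat_cycles:.1f} cycles")
print(f"time = {time_ns / 1e6:.1f} ms (ideal: {ideal_ns / 1e6:.1f} ms)")
```

The second formula also shows why hit time matters: a slower cache hit stretches CT, which multiplies every instruction, not only the ones that miss.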
Improving Cache Performance: 3 general options
1. Reduce the miss rate,
larger cache, higher associativity
2. Reduce the miss penalty,
faster memory system
3. Reduce the time to hit in the cache.
faster parts for cache
Basic Cache Types (for write)
° Write through: The information is written to both
the block in the cache and to the block in the lower-level memory.
° Write back: The information is written only to the
block in the cache. The modified cache block is
written to main memory only when it is replaced.
• is block clean or dirty?
° Pros and Cons of each?
• WT: read misses cannot result in writes
• WB: repeated writes to a block do not each go to memory
° WT is always combined with write buffers so that
the processor doesn't wait for lower-level memory
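The difference in memory traffic between the two policies can be counted with a toy model. This sketch makes several assumptions not stated on the slide: a direct mapped cache indexed by block number, write allocate on a store miss, stores only (no loads), and a final flush of dirty blocks so that both policies leave memory up to date.

```python
def memory_writes(store_blocks, num_lines, policy):
    """Count writes to lower-level memory for a sequence of stores to block numbers."""
    if policy not in ("write-through", "write-back"):
        raise ValueError("unknown policy: " + policy)
    lines = [None] * num_lines    # block number held in each cache line
    dirty = [False] * num_lines   # only used by write-back
    writes = 0
    for block in store_blocks:
        index = block % num_lines
        if policy == "write-through":
            lines[index] = block  # update the cache ...
            writes += 1           # ... and always write memory as well
            continue
        if lines[index] != block:
            if dirty[index]:
                writes += 1       # replaced dirty block goes back to memory
            lines[index] = block
        dirty[index] = True       # modified only in the cache for now
    writes += sum(dirty)          # final flush of blocks that are still dirty
    return writes

stores = [0, 0, 0, 0, 1, 1, 4, 0]   # block 4 maps to the same line as block 0
wt = memory_writes(stores, 4, "write-through")
wb = memory_writes(stores, 4, "write-back")
print("write through:", wt, "memory writes; write back:", wb)
```

Write back saves traffic only when a block is written more than once before it is replaced, which is the case for blocks 0 and 1 in this trace.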
Write Buffer for Write Through
[Figure: Processor -> Cache; Processor -> Write Buffer -> DRAM]
° A Write Buffer is needed between the Cache and Memory
• Processor: writes data into the cache and the write buffer
• Memory controller: write contents of the buffer to memory
° Write buffer can be a FIFO or a small associative cache:
• Typical number of entries: 4
• Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write
cycle
° Memory system designer’s nightmare:
• Store frequency (w.r.t. time) -> 1 / DRAM write cycle
• Write buffer saturation
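Write buffer saturation can be seen in a small queue model. This is a sketch under assumptions of its own: stores arrive at a fixed interval, DRAM retires one buffered write at a time in FIFO order, and the processor stalls only when all entries are occupied. The function and parameter names are invented for the example.

```python
from collections import deque

def write_buffer_stalls(num_stores, store_interval, dram_write_cycles, entries=4):
    """Cycles the processor spends waiting because the write buffer is full."""
    buffer = deque()      # completion times of buffered writes, oldest first
    dram_free_at = 0      # clock at which DRAM can start the next write
    clock = 0
    stalls = 0
    for _ in range(num_stores):
        # Drop entries whose DRAM write has already completed.
        while buffer and buffer[0] <= clock:
            buffer.popleft()
        if len(buffer) == entries:
            # Buffer full: wait until the oldest write completes.
            stalls += buffer[0] - clock
            clock = buffer.popleft()
        start = max(clock, dram_free_at)          # DRAM handles one write at a time
        dram_free_at = start + dram_write_cycles
        buffer.append(dram_free_at)
        clock += store_interval                   # time until the next store
    return stalls

# Store frequency << 1 / DRAM write cycle: the buffer never fills.
relaxed = write_buffer_stalls(100, store_interval=20, dram_write_cycles=10)
# Stores arrive faster than DRAM can retire them: the buffer saturates.
saturated = write_buffer_stalls(100, store_interval=5, dram_write_cycles=10)
print(relaxed, saturated)
```

A deeper buffer only delays the first stall in the saturated case; once stores arrive faster than DRAM retires them, the processor ends up running at DRAM write speed.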
Write Buffer Saturation
[Figure: Processor -> Cache; Processor -> Write Buffer -> DRAM]
° Solution for write buffer saturation:
• Use a write back cache
• Install a second level (L2) cache:
[Figure: Processor -> Cache; Processor -> Write Buffer -> L2 Cache -> DRAM]
Write-miss Policy: Write Allocate versus Not Allocate
° Assume a 16-bit write to memory location 0x0
causes a miss
• Do we read in the block?
- Yes: Write Allocate
- No: Write Not Allocate
[Figure: 32-bit address split into Cache Tag (bits 31-10, Example: 0x00), Cache Index (bits 9-5, Ex: 0x00), and Byte Select (bits 4-0, Ex: 0x00)]
[Figure: 1 KB direct mapped cache with 32 entries (index 0 to 31), each holding a Valid Bit, a Cache Tag, and 32 bytes of Cache Data, from Byte 0 to Byte 31 in entry 0 up to Byte 992 to Byte 1023 in entry 31; a stored tag of 0x00 is shown]
Recall: Levels of the Memory Hierarchy
Level            Capacity      Access Time    Cost
CPU Registers    100s Bytes    <10s ns
Cache            K Bytes       10-100 ns      $.01-.001/bit
Main Memory      M Bytes       100ns-1us      $.01-.001
Disk             G Bytes       ms             10^-4 - 10^-3 cents
Tape             infinite      sec-min        10^-6
Staging between levels    Xfer Unit          Managed by        Size
Registers <-> Cache       Instr. Operands    prog./compiler    1-8 bytes
Cache <-> Memory          Blocks             cache cntl        8-128 bytes
Memory <-> Disk           Pages              OS                512-4K bytes
Disk <-> Tape             Files              user/operator     Mbytes
Upper Level: faster. Lower Level: Larger.
Basic Issues in Virtual Memory System Design
° size of information blocks that are transferred from
secondary to main storage (M)
° when a block of information is brought into M and M is full, some region
of M must be released to make room for the new block -->
replacement policy
° which region of M is to hold the new block --> placement policy
° missing item fetched from secondary memory only on the occurrence
of a fault --> demand load policy
[Figure: reg <-> cache <-> mem <-> disk; mem is divided into frames, disk holds pages]
Paging Organization
° virtual and physical address space partitioned into blocks of equal size
• page frames (physical address space)
• pages (virtual address space)
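Because pages and page frames have the same size, translating an address only replaces the page number and keeps the offset within the page. The sketch below illustrates this with a dictionary standing in for a page table; the 4K page size, the mapping, and all names are hypothetical, and a real system performs this lookup in hardware with operating system support.

```python
PAGE_SIZE = 4096  # hypothetical page size in bytes

class PageFault(Exception):
    """Raised when the page is not in main memory (demand load would fetch it)."""

def translate(virtual_addr, page_table):
    """Map a virtual address to a physical address using page -> frame entries."""
    page_number, offset = divmod(virtual_addr, PAGE_SIZE)
    frame = page_table.get(page_number)
    if frame is None:
        raise PageFault(page_number)
    # Same offset, different block: pages and frames are equal in size.
    return frame * PAGE_SIZE + offset

# Hypothetical mapping: virtual page 0 -> frame 5, virtual page 1 -> frame 2.
page_table = {0: 5, 1: 2}

physical = translate(0x1234, page_table)
print(hex(physical))

try:
    translate(0x3000, page_table)   # page 3 is not resident
except PageFault as fault:
    print("page fault on page", fault.args[0])
```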