CPE 232. Computer Organization


CPE 232
Computer Organization
Spring 2006
Memory Hierarchy
Dr. Gheith Abandah
[Adapted from the slides of Professor Mary Irwin (www.cse.psu.edu/~mji),
which were in turn adapted from Computer Organization and Design,
Patterson & Hennessy, © 2005, UCB]
Review: Major Components of a Computer
[Diagram: Processor (Control, Datapath), Memory, and Devices (Input, Output)]
Processor-Memory Performance Gap
[Log-scale plot of performance vs. year, 1980-2004: processor performance ("Moore's Law") grows at ~55%/year (2X/1.5yr), DRAM performance at ~7%/year (2X/10yrs), so the Processor-Memory Performance Gap grows ~50%/year]
The “Memory Wall”

Logic vs DRAM speed gap continues to grow

[Plot of clocks per DRAM access and clocks per instruction (log scale, 0.01 to 1000) for the core vs. memory, from the VAX/1980 through the PPro/1996 and projected past 2010]
Impact of Memory Performance on Processor Performance

Suppose a processor executes at
  ideal CPI = 1.1
  50% arith/logic, 30% ld/st, 20% control
and that 10% of data memory operations miss with a 50 cycle miss penalty

CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycle) + ( 0.30 (datamemops/instr) x 0.10 (miss/datamemop) x 50 (cycle/miss) )
    = 1.1 cycle + 1.5 cycle = 2.6

so 1.5/2.6 = 58% of the time the processor is stalled waiting for memory!

A 1% instruction miss rate would add an additional 1.00 x 0.01 x 50 = 0.5 to the CPI!

[Pie chart of the resulting CPI breakdown: Ideal CPI 1.1, DataMiss stalls 1.5, InstrMiss stalls 0.5]
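The arithmetic on this slide can be checked with a short Python sketch; the numbers are the slide's own, the script is just an illustration:

```python
# Worked check of the CPI example above (values from this slide).
ideal_cpi    = 1.1
ld_st_frac   = 0.30   # fraction of instructions that are loads/stores
data_miss    = 0.10   # fraction of data memory ops that miss
miss_penalty = 50     # cycles per miss

data_stalls = ld_st_frac * data_miss * miss_penalty   # 0.30 * 0.10 * 50 = 1.5
cpi = ideal_cpi + data_stalls                          # 1.1 + 1.5 = 2.6
print(f"CPI = {cpi:.1f}")                              # 2.6
print(f"fraction of time stalled = {data_stalls / cpi:.0%}")   # 58%

# A 1% instruction miss rate adds 1.00 * 0.01 * 50 = 0.5 more to the CPI.
instr_stalls = 1.00 * 0.01 * miss_penalty
print(f"extra CPI from 1% instruction misses = {instr_stalls}")  # 0.5
```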
The Memory Hierarchy Goal

Fact: Large memories are slow and fast memories are small

How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)?
  With hierarchy
  With parallelism
A Typical Memory Hierarchy

By taking advantage of the principle of locality
  Can present the user with as much memory as is available in the cheapest technology
  at the speed offered by the fastest technology

[Hierarchy diagram. On-chip components: Control, Datapath, RegFile, ITLB, DTLB, Instr Cache and Data Cache (eDRAM); off chip: a Second Level Cache (SRAM), Main Memory (DRAM), and Secondary Memory (Disk). Speed (cycles): ½'s at the register file, then 1's, 10's, 100's, and 1,000's at the disk. Size (bytes): 100's at the register file, then K's, 10K's, M's, and G's to T's at the disk. Cost per byte: highest on chip, lowest at the disk.]
Characteristics of the Memory Hierarchy
[Diagram showing the (relative) size of the memory at each level and the transfer units between levels: the Processor exchanges 4-8 bytes (a word) with the L1$; the L1$ exchanges 8-32 bytes (a block) with the L2$; the L2$ exchanges 1 to 4 blocks with Main Memory; Main Memory exchanges 1,024+ bytes (disk sector = page) with Secondary Memory. Access time increases with distance from the processor.]

Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which in turn is a subset of what is in SM
Memory Hierarchy Technologies

Caches use SRAM for speed and technology compatibility
  Low density (6 transistor cells), high power, expensive, fast
  Static: content will last "forever" (until power turned off)

[Diagram: a 2M x 16 SRAM with a 21-bit Address input, Chip select, Output enable, and Write enable controls, and 16-bit Din[15-0] and Dout[15-0] data ports]

Main Memory uses DRAM for size (density)
  High density (1 transistor cells), low power, cheap, slow
  Dynamic: needs to be "refreshed" regularly (~ every 8 ms)
    - 1% to 2% of the active cycles of the DRAM
  Addresses divided into 2 halves (row and column)
    - RAS or Row Access Strobe triggering the row decoder
    - CAS or Column Access Strobe triggering the column selector
Memory Performance Metrics


Latency: Time to access one word
  Access time: time between the request and when the data is available (or written)
  Cycle time: time between requests
  Usually cycle time > access time
  Typical read access times for SRAMs in 2004 range from 2 to 4 ns for the fastest parts to 8 to 20 ns for the typical largest parts

Bandwidth: How much data from the memory can be supplied to the processor per unit time
  width of the data channel x the rate at which it can be used

Size: DRAM/SRAM ratio of 4 to 8

Cost/Cycle time: SRAM/DRAM ratio of 8 to 16
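As an illustration of the bandwidth definition above (channel width times the rate at which it can be used), here is a small sketch with assumed numbers; the 64-bit channel, 200 MHz clock, and two transfers per clock are not from the slide:

```python
# Peak bandwidth = width of the data channel * rate at which it can be used.
# Illustrative (assumed) numbers: 64-bit channel, 200 MHz clock,
# 2 transfers per clock (double data rate).
width_bytes       = 8        # 64-bit data channel
clock_hz          = 200e6    # 200 MHz
transfers_per_clk = 2        # transfer on both clock edges

peak_bw = width_bytes * clock_hz * transfers_per_clk
print(f"peak bandwidth = {peak_bw / 1e9:.1f} GB/s")   # 3.2 GB/s
```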
Classical RAM Organization (~Square)
[Diagram: a roughly square RAM cell array. The row address feeds a row decoder that drives the word (row) lines; the bit (data) lines feed a Column Selector & I/O Circuits block, which the column address controls to produce the output data bit or word. Each intersection represents a 6-T SRAM cell or a 1-T DRAM cell.]

One memory row holds a block of data, so the column address selects the requested bit or word from that block
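A minimal sketch of that row/column split, assuming an illustrative 1024 x 1024 (1 Mbit) square array and a flat 20-bit cell address; the geometry and the helper function are assumptions for illustration, not from the slide:

```python
# Splitting a flat cell address into row and column halves for a ~square array.
# Assumed geometry (illustrative): 1024 rows x 1024 columns = 1 Mbit.
ROW_BITS = 10
COL_BITS = 10

def split_address(addr):
    """Return (row, col) for a flat cell address in a 1024x1024 array."""
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)  # upper half drives the row decoder
    col = addr & ((1 << COL_BITS) - 1)                # lower half drives the column selector
    return row, col

print(split_address(703710))   # -> (687, 222)
```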
Classical DRAM Organization (~Square Planes)
[Diagram: M square planes of RAM cells, each intersection a 1-T DRAM cell. The row address feeds a row decoder driving the word (row) lines; the bit (data) lines in each plane feed Column Selector & I/O Circuits; the column address selects the requested bit from the open row in each plane, so the M planes together produce M data bits.]
Classical DRAM Operation
DRAM Organization:
  N rows x N columns x M-bit
  Read or Write M bits at a time
  Each M-bit access requires a RAS / CAS cycle

[Timing diagram: each M-bit access takes a full cycle time; for both the 1st and 2nd M-bit access, RAS latches the Row Address and then CAS latches the Col Address, after which the M bit planes drive the M-bit output.]
Page Mode DRAM Operation
Page Mode DRAM
  N x M SRAM to save a row

After a row is read into the DRAM "register"
  Only CAS is needed to access other M-bit words on that row
  RAS remains asserted while CAS is toggled

[Timing diagram: the 1st M-bit access latches the Row Address with RAS and the Col Address with CAS; the 2nd, 3rd, and 4th M-bit accesses reuse the open row and latch only new Col Addresses with CAS.]
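The effect of page mode on back-to-back same-row reads can be sketched with illustrative cycle counts; the 25-cycle full (RAS + CAS) access and 8-cycle CAS-only access below are the assumed numbers reused in the bus examples later in this deck, not datasheet values:

```python
# Cycles to read k consecutive M-bit words from one DRAM row.
# Illustrative cycle counts (same as the later bus examples in this deck):
FULL_ACCESS = 25   # RAS + CAS: open the row, then select a column
CAS_ONLY    = 8    # page mode: row already latched, toggle CAS only

def cycles(k, page_mode=True):
    if page_mode:
        return FULL_ACCESS + (k - 1) * CAS_ONLY   # open the row once, then k-1 column selects
    return k * FULL_ACCESS                        # re-open the row on every access

print(cycles(4, page_mode=False))  # 100 cycles
print(cycles(4, page_mode=True))   # 49 cycles
```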
Synchronous DRAM (SDRAM) Operation
After a row is read into the SRAM register
  Inputs CAS as the starting "burst" address along with a burst length
  Transfers a burst of data from a series of sequential addresses within that row (an internal +1 counter steps the column address)
    - A clock controls transfer of successive words in the burst: 300 MHz in 2004

[Timing diagram: RAS latches the Row Address and CAS latches the starting Col Address; the 1st through 4th M-bit words of the burst then transfer on successive clocks before the next Row Address is sent.]
Other DRAM Architectures


Double Data Rate SDRAMs – DDR-SDRAMs (and DDR-SRAMs)
  Double data rate because they transfer data on both the rising and falling edge of the clock
  Are the most widely used form of SDRAMs

DDR2-SDRAMs

http://www.corsairmemory.com/corsair/products/tech/memory_basics/153707/main.swf
DRAM Memory Latency & Bandwidth Milestones
                 DRAM    Page    FastPage  FastPage  Synch   DDR
                         DRAM    DRAM      DRAM      DRAM    SDRAM
Module Width     16b     16b     32b       64b       64b     64b
Year             1980    1983    1986      1993      1997    2000
Mb/chip          0.06    0.25    1         16        64      256
Die size (mm2)   35      45      70        130       170     204
Pins/chip        16      16      18        20        54      66
BWidth (MB/s)    13      40      160       267       640     1600
Latency (nsec)   225     170     125       75        62      52

Patterson, CACM Vol 47, #10, 2004

In the time that the memory-to-processor bandwidth doubles, the memory latency improves by a factor of only about 1.2 to 1.4 (see the sketch below)

To deliver such high bandwidth, the internal DRAM has to be organized as interleaved memory banks
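A quick check of that claim against the milestone table above; the data are taken from the table, the script itself is just an illustration:

```python
# Generation-to-generation ratios from the milestone table above.
#             (year, bandwidth MB/s, latency ns)
milestones = [(1980,   13, 225),
              (1983,   40, 170),
              (1986,  160, 125),
              (1993,  267,  75),
              (1997,  640,  62),
              (2000, 1600,  52)]

for (y0, bw0, lat0), (y1, bw1, lat1) in zip(milestones, milestones[1:]):
    print(f"{y0}->{y1}: bandwidth x{bw1 / bw0:.1f}, latency improves x{lat0 / lat1:.2f}")
# e.g. 1993->1997: bandwidth x2.4, latency improves x1.21
#      1997->2000: bandwidth x2.5, latency improves x1.19
```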
Memory Systems that Support Caches

The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways

[Diagram: an on-chip CPU and Cache connected over a bus (32-bit data & 32-bit addr per cycle) to Memory. One word wide organization: one word wide bus and one word wide memory.]

Assume
  1. 1 clock cycle to send the address
  2. 25 clock cycles for DRAM cycle time, 8 clock cycles access time
  3. 1 clock cycle to return a word of data

Memory-Bus to Cache bandwidth = number of bytes accessed from memory and transferred to cache/CPU per clock cycle
One Word Wide Memory Organization

If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall the number of cycles required to return one data word from memory:

   1   cycle to send address
  25   cycles to read DRAM
   1   cycle to return data
  27   total clock cycles miss penalty

Number of bytes transferred per clock cycle (bandwidth) for a single miss is
  4/27 = 0.148 bytes per clock
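The same arithmetic as a tiny Python script, using the slide's assumptions (1 cycle to send the address, 25-cycle DRAM cycle time, 1 cycle to return a 4-byte word):

```python
# Miss penalty and bus bandwidth for the one-word-wide organization above.
ADDR_CYCLES = 1    # send the address
DRAM_CYCLES = 25   # DRAM cycle time for one access
DATA_CYCLES = 1    # return one word on the bus
WORD_BYTES  = 4

miss_penalty = ADDR_CYCLES + DRAM_CYCLES + DATA_CYCLES   # 27 cycles
bandwidth    = WORD_BYTES / miss_penalty                  # bytes per clock
print(miss_penalty, round(bandwidth, 3))                  # 27 0.148
```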
One Word Wide Memory Organization, con’t

What if the block size is four words?

   1              cycle to send 1st address
   4 x 25 = 100   cycles to read DRAM
   1              cycle to return last data word
 102              total clock cycles miss penalty

[Timing: the four 25-cycle DRAM reads happen one after the other]

Number of bytes transferred per clock cycle (bandwidth) for a single miss is
  (4 x 4)/102 = 0.157 bytes per clock
One Word Wide Memory Organization, con’t

What if the block size is four words and if a fast page mode DRAM is used?

   1               cycle to send 1st address
  25 + 3*8 = 49    cycles to read DRAM
   1               cycle to return last data word
  51               total clock cycles miss penalty

[Timing: one 25-cycle access opens the row, then three 8-cycle page-mode accesses follow]

Number of bytes transferred per clock cycle (bandwidth) for a single miss is
  (4 x 4)/51 = 0.314 bytes per clock
Wide Memory Organization

What if the memory is two words wide?

   1               cycle to send 1st address
  25 + 25 = 50     cycles to read DRAM
   1               cycle to return last data word
  52               total clock cycles miss penalty

Number of bytes transferred per clock cycle (bandwidth) for a single miss is
  (4 x 4)/52 = 0.308 bytes per clock
Interleaved Memory Organization

For a block size of four words

   1               cycle to send 1st address
  25 + 3 = 28      cycles to read DRAM
   1               cycle to return last data word
  30               total clock cycles miss penalty

[Diagram: four memory banks (bank 0 through bank 3); each access still takes 25 cycles, but the banks overlap their accesses, staggered one cycle apart]

Number of bytes transferred per clock cycle (bandwidth) for a single miss is
  (4 x 4)/30 = 0.533 bytes per clock
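The four organizations above can be compared in one sketch; the cycle counts are the ones assumed on the preceding slides, and the accounting matches theirs (1 cycle for the address, the DRAM read cycles, and 1 cycle to return the last word):

```python
# Miss penalty and bandwidth for a 4-word (16-byte) block under each
# memory organization above, using the slides' cycle counts.
BLOCK_BYTES = 16

dram_cycles = {
    "one word wide"                 : 4 * 25,      # four full DRAM cycles back to back
    "one word wide, fast page mode" : 25 + 3 * 8,  # one full access + three CAS-only
    "two words wide"                : 2 * 25,      # two double-word accesses
    "four-way interleaved"          : 25 + 3,      # banks overlap, staggered one cycle apart
}

for name, cycles in dram_cycles.items():
    penalty = 1 + cycles + 1   # send address + read DRAM + return last word
    print(f"{name:32s} {penalty:3d} cycles  "
          f"{BLOCK_BYTES / penalty:.3f} bytes/clock")
# one word wide                    102 cycles  0.157 bytes/clock
# one word wide, fast page mode     51 cycles  0.314 bytes/clock
# two words wide                    52 cycles  0.308 bytes/clock
# four-way interleaved              30 cycles  0.533 bytes/clock
```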
DRAM Memory System Summary

It's important to match the cache characteristics
  caches access one block at a time (usually more than one word)

with the DRAM characteristics
  use DRAMs that support fast multiple word accesses, preferably ones that match the block size of the cache

with the memory-bus characteristics
  make sure the memory-bus can support the DRAM access rates and patterns

with the goal of increasing the Memory-Bus to Cache bandwidth