
William Stallings
Computer Organization
and Architecture
Chapter 4
Internal Memory
The four-level memory hierarchy
♦Computer memory is organized into a hierarchy.
♦Going down the hierarchy: decreasing cost per bit, increasing capacity, slower access time, and decreasing frequency of access of the memory by the processor.
♦The cache automatically retains a copy of some of the recently used words from the DRAM.
Memory Hierarchy
Registers
In CPU
Internal or Main memory
May include one or more levels of cache
“RAM”
External memory
Backing store
4.1 COMPUTER MEMORY SYSTEM
OVERVIEW
Characteristics of Memory Systems
Location
Capacity
Unit of transfer
Access method
Performance
Physical type
Physical characteristics
Organisation
Location
The term location refers to whether memory is
internal or external to the computer.
CPU
The processor requires its own local memory, in the form of registers.
Internal
Main memory, cache
External
Peripheral storage devices, such as disk and tape
Capacity
Internal memory capacity typically expressed in terms of bytes (1 byte = 8 bits) or words.
External memory capacity expressed in bytes.
Word
The natural unit of organisation
Word length is usually 8, 16, or 32 bits
The size of the word is typically equal to the number
of bits used to represent a number and to the instruction
length. Unfortunately, there are many exceptions.
Unit of Transfer
Internal
Usually governed by data bus width
External
Usually a block which is much larger than a word
Addressable unit
Smallest location which can be uniquely addressed
At the word level or byte level
In any case,
2^A = N, where A is the length in bits of an address and N is the number of addressable units
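As a quick sketch of this relation, the required address width can be computed from the number of addressable units (the function name is illustrative):

```python
import math

def address_bits(n_units: int) -> int:
    """Number of address bits A needed for N addressable units: 2^A = N."""
    return math.ceil(math.log2(n_units))

# 16M addressable units need a 24-bit address; 4G need 32 bits.
print(address_bits(16 * 2**20), address_bits(2**32))  # 24 32
```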
Access Methods (1)
Sequential access
Start at the beginning and read through in order
Access time depends on location of data and previous
location
variable
e.g. tape
Direct access
Individual blocks have unique address
Access is by jumping to vicinity plus sequential search
Access time depends on location and previous location
variable
e.g. disk
Access Methods (2)
Random
Individual addresses identify locations exactly
Access time is independent of location or previous
access and is constant
e.g. RAM
Associative
Data is located by a comparison with contents of a
portion of the store
Access time is independent of location or previous
access and is constant
e.g. cache
Performance Parameters
 Access time
For random-access memory
 the time it takes to perform a read or write operation.
Time between presenting the address to the memory and getting the
valid data
For non-random-access memory
The time it takes to position the read-write mechanism at the desired
location.
 Memory Cycle time
Cycle time is access time plus additional time
Time may be required for the memory to “recover” before next
access
 Transfer Rate
Rate at which data can be moved
Physical Types
Semiconductor
RAM
Magnetic
Disk & Tape
Optical
CD (Compact Disk) & DVD (Digital Video Disk)
Others
Bubble
Hologram
Physical Characteristics
Decay
Volatility
In a volatile memory, information decays naturally or is
lost when electrical power is switched off.
In a nonvolatile memory, no electrical power is needed
to retain information, e.g. magnetic-surface memory.
Erasable
Power consumption
Organisation
Organisation means physical arrangement of
bits into words
Obvious arrangement not always used
The Bottom Line
The design constraints on a computer’s memory:
How much?
Capacity
How fast?
Time is money
How expensive?
A trade-off among the three key characteristics of
memory: cost, capacity, and access time.
Hierarchy List
Registers
L1 Cache
L2 Cache
Main memory
Disk cache
Disk
Optical
Tape
Hierarchy List
Across this spectrum of technologies:
 Faster access time, greater cost per bit
 Greater capacity, smaller cost per bit
 Greater capacity, slower access time
From top to down:
 Decreasing cost per bit
 Increasing capacity
 Increasing access time
 Decreasing frequency of access of the
memory by the processor
So you want fast?
It is possible to build a computer which uses
only static RAM (see later)
This would be very fast
This would need no cache
How can you cache cache?
This would cost a very large amount
Locality of Reference
 During the course of the execution of a program, memory
references tend to cluster
e.g. loops and subroutines
 Main memory is usually extended with a higher-speed,
smaller cache. It is a device for staging the movement of
data between main memory and processor registers to
improve performance.
 External memory, called secondary or auxiliary
memory, is used to store program and data files and is
visible to the programmer only in terms of files and records.
4.2 Semiconductor Main Memory
Table 4.2 Semiconductor Memory Types
Types of Random-Access
Semiconductor Memory
RAM
The name is misleading, since all of the semiconductor
memory types listed in the table are random access.
Read/Write
Volatile
 A RAM must be provided with a constant power supply.
Temporary storage
Static or dynamic
Dynamic RAM (DRAM)
Bits stored as charge in capacitors
Charges leak
Need refreshing even when powered
Simpler construction
Smaller per bit
Less expensive
Need refresh circuits
Slower
Main memory
Static RAM (SRAM)
Bits stored as on/off switches
No charges to leak
No refreshing needed when powered
More complex construction
Larger per bit
More expensive
Does not need refresh circuits
Faster
Cache
Read Only Memory (ROM)
Permanent storage
Applications
Microprogramming (see later)
Library subroutines
Systems programs (BIOS)
Function tables
Types of ROM
Written during manufacture
Very expensive for small runs
Programmable (once)
PROM
Needs special equipment to program
Read “mostly”
Erasable Programmable (EPROM)
Erased by UV
Electrically Erasable (EEPROM)
Takes much longer to write than read
Flash memory
It is intermediate between EPROM and EEPROM in both cost and
functionality.
Erase whole memory electrically or erase blocks of memory
Organisation in detail
 Memory cell
The basic element of a semiconductor memory
Two stable states
being written into to set the state, or being read to sense the state
Chip Logic
 One extreme organization : the physical arrangement of cells
in the array is the same as the logical arrangement.
The array is organized into W words of B bits each.
e.g. A 16Mbit chip can be organised as 1M 16-bit words
 One-bit-per-chip organization, in which data is read/written one bit at a
time
A one-bit-per-chip system uses 16 chips of 1 Mbit each, with bit 1 of each word
in chip 1 and so on
Chip Logic
Typical organization of a 16-Mbit DRAM
A 16Mbit chip can be organised as a 2048 x 2048 x 4bit
array
Reduces number of address pins
Multiplex row address and column address
11 pins to address (2^11 = 2048)
An additional 11 address lines select one of 2048 columns of
4 bits per column. Four data lines are used for the input and output
of 4 bits to and from a data buffer. On write, the bit driver of
each bit line is activated for a 1 or 0 according to the value of
the corresponding data line. On read, the value of each bit line
is passed through a sense amplifier and presented to the data
lines; the row line selects which row of cells is used for reading
or writing.
Adding one more pin devoted to addressing doubles the
number of rows and columns, so the size of the chip
memory grows by a factor of 4.
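The row/column multiplexing described above can be sketched as follows; the function and constant names are illustrative, not part of any real chip interface:

```python
# A 2048 x 2048 x 4-bit array: the 22-bit cell address (11 row bits +
# 11 column bits) is sent over the same 11 address pins in two steps,
# row first (with RAS), then column (with CAS).
ROW_BITS = COL_BITS = 11  # 2^11 = 2048 rows and 2048 columns

def split_dram_address(addr: int):
    """Split a 22-bit cell address into (row, column), each 11 bits."""
    row = (addr >> COL_BITS) & (2**ROW_BITS - 1)
    col = addr & (2**COL_BITS - 1)
    return row, col

print(split_dram_address((1234 << 11) | 567))  # (1234, 567)
```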
Typical 16 Mb DRAM (4M x 4)
Refreshing
Refresh circuit included on chip
Disable chip
Count through rows
Read & Write back
Takes time
Slows down apparent performance
Chip Packaging
EPROM package, which is a one-word-per-chip,
8-Mbit chip organized as 1M×8
•The address of the word being accessed: for 1M
words, a total of 20 pins (2^20 = 1M) are needed.
•D0~D7
•The power supply to the chip (VCC)
•A ground pin (Vss)
•A chip enable (CE) pin: the CE pin is used to
indicate whether or not the address is valid for this
chip.
•A program voltage (Vpp)
DRAM package, 16-Mbit chip
organized as 4M×4
Because a RAM chip can be updated, its data
pins are input/output, unlike those of a
ROM chip
•Write Enable pin (WE)
•Output Enable pin (OE)
•Row Address Select (RAS)
•Column Address Select (CAS)
Module
Organisation
·If a RAM chip contains only
1 bit per word, clearly a
number of chips equal to the
number of bits per word is
needed.
e.g. How could a memory module
consisting of 256K 8-bit
words be organized?
256K = 2^18, so an 18-bit address
is needed;
The address is presented to
eight 256K×1-bit chips, each of
which provides the
input/output of 1 bit.
Figure 4.6 256kbyte memory Organization
Module Organisation (2)
Figure 4.7 1-Mbyte Memory Organization
 (1M×8 bit)/(256K×8 bit) = 4 = 2^2
As shown in Figure 4.7, 1M words by 8 bits per word is
organized as four columns of chips, each column
containing 256K words arranged as in Figure 4.6.
1M = 2^20
For 1M words, 20 address lines are needed.
The 18 least significant bits are routed to all 32
modules.
The high-order 2 bits are input to a group select logic
module that sends a chip enable signal to one of
the four columns of modules.
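A minimal sketch of the group-select decoding just described, with illustrative names:

```python
def decode_module_address(addr: int):
    """Split a 20-bit word address into (column_select, chip_address)."""
    column = (addr >> 18) & 0b11   # 2 high-order bits: chip enable for 1 of 4 columns
    offset = addr & (2**18 - 1)    # 18 low-order bits: routed to every chip
    return column, offset

print(decode_module_address(3 * 2**18 + 5))  # (3, 5)
```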
Error Correction
Hard Failure
Permanent defect
Soft Error
Random, non-destructive
No permanent damage to memory
Detected using Hamming error correcting code
Error Correcting Code Function
•A function f is
performed on the data to
produce a code.
•When the previously
stored word is read out,
the code is used to
detect and possibly
correct errors.
•A new set of K code
bits is generated from
the M data bits and
compared with the
fetched code bits.
Even Parity bits
Figure 4.9 Hamming Error-Correcting Code
Figure 4.9 uses Venn diagrams to illustrate the use of Hamming code on 4-bit words
(M=4). With three intersection circles, there are seven compartments. We assign the 4
data bits to the inner compartments. The remaining compartments are filled with parity
bits. Each parity bit is chosen so that the total number of 1s in its circle is even.
Figure 4.8 ErrorCorrecting Code
 The comparison logic receives as input two k-bit values. A bit-by-bit
comparison is done by taking the exclusive-or of the two inputs. The
result is called the syndrome word.
The syndrome word is therefore K bits wide and has a range
between 0 and 2^K − 1. The value 0 indicates that no error was
detected, leaving 2^K − 1 values to indicate, if there is an error,
which bit was in error (the numerical value of the syndrome
indicates the position of the bit in error).
An error could occur on any of the M data bits or K check bits,
so
2^K − 1 ≥ M + K
(This inequality gives the number of check bits needed to correct a single-bit
error in a word containing M data bits.)
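The inequality can be solved numerically; this small sketch finds the smallest K for a given M:

```python
def check_bits(m: int) -> int:
    """Smallest K satisfying 2^K - 1 >= M + K: the number of check
    bits needed for single-error correction on M data bits."""
    k = 1
    while 2**k - 1 < m + k:
        k += 1
    return k

# Matches Table 4.3: 8 -> 4, 16 -> 5, 32 -> 6, 64 -> 7 check bits.
print([check_bits(m) for m in (8, 16, 32, 64)])  # [4, 5, 6, 7]
```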
Those bit positions whose
position numbers are powers of 2
are designated as check bits.
Each check bit operates on every
data bit position whose position
number contains a 1 in the
corresponding column position.
Bit position n is checked by those
check bits Ci such that the sum of the
subscripts i equals n (e.g. position 7 is
checked by C1, C2, and C4, since 1 + 2 + 4 = 7).
C8 C4 C2 C1
Figure 4.10 Layout of Data bits and Check bits
The check bits are calculated as follows, where the symbol ⊕
designates the exclusive-or operation:
C1 = M1 ⊕ M2 ⊕ M4 ⊕ M5 ⊕ M7
C2 = M1 ⊕ M3 ⊕ M4 ⊕ M6 ⊕ M7
C4 = M2 ⊕ M3 ⊕ M4 ⊕ M8
C8 = M5 ⊕ M6 ⊕ M7 ⊕ M8
Assume that the 8-bit input word is 00111001, with data bit M1 in
the right-most position (M1 = 1, M2 = 0, M3 = 0, M4 = 1, M5 = 1,
M6 = 1, M7 = 0, M8 = 0). The calculations are as follows:
C1 = 1 ⊕ 0 ⊕ 1 ⊕ 1 ⊕ 0 = 1
C2 = 1 ⊕ 0 ⊕ 1 ⊕ 1 ⊕ 0 = 1
C4 = 0 ⊕ 0 ⊕ 1 ⊕ 0 = 1
C8 = 1 ⊕ 1 ⊕ 0 ⊕ 0 = 0
Suppose data bit 3 sustains an error and is changed from 0 to 1.
When the new check bits are compared with the old check bits,
the syndrome word is formed:
0111 ⊕ 0001 = 0110
The result is 0110, indicating that bit position 6, which contains
data bit 3, is in error.
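The worked example can be checked with a short sketch of the Hamming(12,8) layout described above (check bits at the power-of-2 positions, data bits in the rest):

```python
# Hamming(12,8): check bits sit at positions 1, 2, 4, 8; data bits
# M1..M8 fill the remaining positions 3, 5, 6, 7, 9, 10, 11, 12.
DATA_POSITIONS = [3, 5, 6, 7, 9, 10, 11, 12]

def hamming_check_bits(data):
    """data maps M-index (1..8) to a bit; returns {1: C1, 2: C2, 4: C4, 8: C8}."""
    checks = {c: 0 for c in (1, 2, 4, 8)}
    for m, pos in enumerate(DATA_POSITIONS, start=1):
        for c in checks:
            if pos & c:           # position number has a 1 in C_i's column
                checks[c] ^= data[m]
    return checks

# Input word 00111001, with M1 in the right-most position:
word = {i + 1: int(b) for i, b in enumerate("00111001"[::-1])}
old = hamming_check_bits(word)    # C8 C4 C2 C1 = 0111

word[3] ^= 1                      # data bit 3 sustains an error
new = hamming_check_bits(word)

syndrome = sum(c for c in (1, 2, 4, 8) if old[c] != new[c])
print(syndrome)  # 6: bit position 6, which holds data bit 3
```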
Figure 4.11 Check Bit Degeneration
a single-error-correction (SEC) code
More commonly, semiconductor memory is equipped with a
single-error-correcting double-error-detecting (SEC-DED) code.
An error-correction code enhances the reliability of the memory
at the cost of added complexity.
Table 4.3 Increase in Word Length with Error Correction
Figure 4.12 Hamming SEC-DED Code
The sequence shows that if two errors occur (Figure 4.12c), the checking procedure
goes astray (d) and worsens the problem by creating a third error (e). To overcome
the problem, an eighth bit is added, set so that the total number of 1s in the
diagram is even.
4.3 CACHE MEMORY
Small amount of fast memory
Sits between normal main memory and CPU
May be located on CPU chip or module
Cache operation - overview
Figure 4.14 Cache/Main-Memory Structure (P118)
 Cache includes tags to identify which block of main memory is in each
cache slot. The tag is usually a portion of the main memory address.
[Figure: (a) the cache consists of C lines (numbered 0 to C−1), each holding a tag and a block of K words; (b) main memory consists of 2^n addressable words, viewed as blocks of K words each.]
Figure 4.15 Cache Read Operation (P119)
• CPU requests contents of memory location
• Check cache for this data
• If present, get from cache (fast)
• If not present, read required block from main
memory to cache
• Then deliver from cache to CPU
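A minimal sketch of this read flow, assuming a simple block-indexed cache modelled as a Python dict:

```python
def cache_read(cache: dict, memory: dict, block_id: int):
    """Return (data, 'hit'|'miss'), filling the cache on a miss."""
    if block_id in cache:                 # present: serve from cache (fast)
        return cache[block_id], "hit"
    cache[block_id] = memory[block_id]    # absent: read block into cache
    return cache[block_id], "miss"        # then deliver from cache

memory = {7: "block-7"}
cache = {}
print(cache_read(cache, memory, 7)[1])  # miss
print(cache_read(cache, memory, 7)[1])  # hit
```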
Typical Cache Organization
•In this organization, the cache
connects to the processor via
data, control, and address lines.
•The data and address lines
attach to data and address
buffers, which attach to a
system bus from which main
memory is reached.
•When a cache hit occurs, the
data and address buffers are
disabled and communication is
only between processor and
cache, with no system bus
traffic
Figure 4.16 Typical Cache Organization
•When a cache miss occurs, the
desired address is loaded onto
the system bus and the data are
returned through a data buffer
to both the cache and main
memory.
Elements of Cache Design
 Size
 Mapping Function
 Direct
 Associative
 Set Associative
 Replacement Algorithm
 Least recently used (LRU)
 First in first out (FIFO)
 Least frequently used (LFU)
 Random
 Write Policy
 Write through
 Write back
 Write once
 Block Size
 Number of Caches
 Single or two level
 Unified or split
Cache Size
A trade-off between cost per bit and access time
Cost
More cache is expensive
Speed
More cache is faster (up to a point)
Checking cache for data takes time
Suggested “optimum” cache sizes: between 1K
and 512K words.
Mapping Function
Three techniques
direct, associative, and set associative
Elements of the example
Cache of 64kByte
Cache block of 4 bytes
Data is transferred between memory and the cache in
blocks of 4 bytes each.
i.e. cache is 16K (2^14) lines of 4 bytes
16 MBytes main memory
24-bit address (2^24 = 16M)
Main memory (4M blocks of 4 bytes each)
Direct Mapping
Each block of main memory maps to only
one cache line
i.e. if a block is in cache, it must be in one specific place
Address is in two parts
Least Significant w bits identify unique word or
byte within a block of main memory.
Most Significant s bits specify one memory block
The MSBs are split into a cache line field r and a
tag of s−r bits (most significant)
The line field of r bits identifies one of the m = 2^r lines
of the cache
Direct Mapping
Cache Line Table
Cache line | Main memory blocks assigned
0          | 0, m, 2m, …, 2^s − m
1          | 1, m+1, 2m+1, …, 2^s − m + 1
…          | …
m−1        | m−1, 2m−1, 3m−1, …, 2^s − 1
Every row has the same cache line number; every column has the same tag number.
The mapping is expressed as:
i= j modulo m
where i =cache line number
j = main memory block number
m = number of lines in the cache
No two blocks in the same line have the same Tag field!
Direct Mapping Cache Organization
The r-bit line number
is used as an index
into the cache to
access a particular
line.
If the (s-r) bit tag
number matches the
tag number currently
stored in that line,
then the w-bit word
number is used to
select one of the 2^w
bytes in that line.
Otherwise, the s bits
tag-plus-line field is
used to fetch a block
from main memory.
Direct Mapping
Address Structure
Tag s−r (8 bits) | Line or slot r (14 bits) | Word w (2 bits)
 24-bit address
 w = 2-bit word identifier (4-byte block)
 s = 22-bit block identifier
8-bit tag (= 22 − 14)
14-bit slot or line
 No two blocks in the same line have the same Tag field
 Check contents of cache by finding line and checking Tag
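A sketch of this address split, using the 8/14/2 field widths of the example (names are illustrative):

```python
TAG_BITS, LINE_BITS, WORD_BITS = 8, 14, 2

def split_direct(addr: int):
    """Split a 24-bit address into (tag, line, word) fields."""
    word = addr & (2**WORD_BITS - 1)
    line = (addr >> WORD_BITS) & (2**LINE_BITS - 1)
    tag = (addr >> (WORD_BITS + LINE_BITS)) & (2**TAG_BITS - 1)
    return tag, line, word

# Two addresses 2^16 apart map to the same line with different tags,
# so under direct mapping they compete for that one line.
a, b = 0x16339C, 0x16339C + 2**16
print(split_direct(a), split_direct(b))  # same line field, different tags
```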
Direct Mapping Example
• The cache is organized as
16K = 2^14 lines of 4 bytes each.
• The main memory consists of
16 Mbytes, organized as 4M
blocks of 4 bytes each.
• i= j modulo m
i = cache line number
j = main memory block number
m = number of lines in the cache
• Note that no two blocks that
map into the same line number
have the same tag number.
Main Memory Address
Direct Mapping pros & cons
Advantages
Simple
Inexpensive
Disadvantages
Fixed location for given block
If a program accesses 2 blocks that map to the same line
repeatedly, cache misses are very high
Associative Mapping
A main memory block can load into any line of
cache
Memory address is interpreted as a tag and a word
field.
Tag uniquely identifies block of memory
Every line’s tag is examined for a match
Disadvantages of associative mapping
Cache searching gets expensive
Complex circuitry is required to examine the tags of all
cache lines in parallel.
Fully Associative Cache
Organization
Associative Mapping
Address Structure
Tag (22 bits) | Word (2 bits)
22-bit tag stored with each 32-bit (4-byte) block of data
Compare tag field with tag entry in cache to check for
hit
Least significant 2 bits of address identify which byte
is required from the 4-byte data block
e.g.
Address: 16339C   Tag: 058CE7   Data: FEDCBA98   Cache line: 0001
Associative Mapping Example
Main Memory Address
Set Associative Mapping
Cache is divided into a number of sets
Each set contains a number of lines
A given block maps to any line in a given set
e.g. Block B can be in any line of set i
e.g. 2 lines per set
2 way associative mapping
A given block can be in one of 2 lines in only one set
Set Associative Mapping
In this case , the cache is divided into v sets,
each of which consists of k lines.
The relationships are
m=v×k
i = j modulo v
where
i=cache set number
j=main memory block number
m=number of lines in the cache
This is referred to as k-way set associative mapping.
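A minimal sketch of these relationships:

```python
def set_for_block(j: int, m: int, k: int) -> int:
    """Set index for block j, with m cache lines and k-way associativity."""
    v = m // k            # m = v x k, so v = m / k sets
    return j % v          # i = j modulo v

# 16K lines, 2-way: v = 8192 sets; blocks 0 and 8192 share set 0
# and may occupy either of its two lines.
print(set_for_block(0, 16 * 1024, 2), set_for_block(8192, 16 * 1024, 2))  # 0 0
```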
Two Way Set Associative Cache
Organization
The d set bits specify one
of v = 2^d sets. The s bits of
the tag and set fields
specify one of the 2^s blocks
of main memory.
With K-way set associative
mapping, the tag in a
memory address is much
smaller and is only
compared to the k tags
within a single set.
Set Associative Mapping
Example
13-bit set number
Block number in main memory is modulo 2^13
000000, 00A000, 00B000, 00C000 … map to
same set
Set Associative Mapping
Address Structure
Tag (9 bits) | Set (13 bits) | Word (2 bits)
Use set field to determine cache set to look in
Tag + set field specifies one of the blocks in
the main memory.
Compare tag field to see if we have a hit
e.g.
Address: 1FF 7FFC   Tag: 1FF   Data: 24682468   Set number: 1FFF
Two Way Set Associative
Mapping Example
e.g.
Address: 1FF 7FFC   Tag: 1FF   Data: 24682468   Set number: 1FFF
Address: 02C 0004   Tag: 02C   Data: 11235813   Set number: 0001
Main Memory Address
Replacement Algorithms (1)
Direct mapping
When a new block is brought into the cache, one
of the existing blocks must be replaced.
With direct mapping there is no choice:
Each block only maps to one line
Replace that line
Replacement Algorithms (2)
Associative & Set Associative
Hardware implemented algorithm (speed)
Least Recently used (LRU)
Replace that block in the set which has been in the cache
longest with no reference to it. (hit ratio + time)
e.g. in 2 way set associative
Which of the 2 blocks is LRU?
First in first out (FIFO)
replace block in the set that has been in cache longest.
(time)
Least frequently used
replace block in the set which has had fewest hits.
Random
(hit ratio)
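A sketch of LRU for a single set, using an OrderedDict to track recency (an illustration only; cache hardware implements LRU with status bits, not a software dictionary):

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with k lines and least-recently-used replacement."""
    def __init__(self, k: int):
        self.k, self.lines = k, OrderedDict()

    def access(self, block: int) -> str:
        if block in self.lines:
            self.lines.move_to_end(block)   # refresh recency on a hit
            return "hit"
        if len(self.lines) == self.k:
            self.lines.popitem(last=False)  # evict least recently used
        self.lines[block] = True
        return "miss"

s = LRUSet(2)  # 2-way set associative
print([s.access(b) for b in (1, 2, 1, 3, 2)])  # ['miss', 'miss', 'hit', 'miss', 'miss']
```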
Write Policy
Must not overwrite a cache block unless main
memory is up to date
Problems to contend with
More than one device may have access to main memory.
Data inconsistent between memory and cache
Multiple CPUs may have individual caches
Data inconsistent among caches
Write Policy
Write through
Write back
Write once
Write through
All writes go to main memory as well as cache
Any other processor-cache can monitor main
memory traffic to keep local (to CPU) cache
updated.
Disadvantages
Lots of traffic
Slows down writes
Write back
Updates initially made in cache only
Update bit for cache slot is set when update occurs
If block in cache is to be replaced, write to
main memory only if update bit is set
Other caches get out of sync
I/O must access main memory through cache
Because portions of main memory are invalid
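A minimal sketch of the write-back policy with a per-line update (dirty) bit; the data structures and function names are illustrative:

```python
def write(cache, block, value):
    """Write goes to the cache only; the update (dirty) bit is set."""
    cache[block] = {"data": value, "dirty": True}

def replace(cache, memory, old_block, new_block, new_value):
    """On replacement, write the old line back only if its dirty bit is set."""
    line = cache.pop(old_block, None)
    if line and line["dirty"]:
        memory[old_block] = line["data"]   # flush the stale copy to memory
    cache[new_block] = {"data": new_value, "dirty": False}

cache, memory = {}, {}
write(cache, 5, "v1")              # update stays in cache; memory untouched
replace(cache, memory, 5, 9, "v2")
print(memory)                      # {5: 'v1'}: written back on replacement
```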
Approaches to cache coherency
 Bus watching with write through
Each cache controller monitors the address lines to detect write
operations to memory by other bus masters.
 This strategy depends on the use of a write-through policy by all
cache controllers.
 Hardware transparency
Additional hardware is used to ensure that all the updates to
main memory via cache are reflected in all caches.
 Noncachable memory
Only a portion of main memory is shared by more than one
processor.
In such a system, all accesses to shared memory are cache
misses, because the shared memory is never copied to the cache.
The noncachable memory can be identified using chip-select logic
or high-address bits.
Line Size
The principle of locality
Data in the vicinity of a referenced word is likely to be
referenced in the near future.
The relationship between block size and hit ratio is
complex, depending on the locality characteristics
of a particular program, and no definitive optimum
value has been found.
A size of from two to eight words seems
reasonably close to optimum.
Number of caches
A single cache
Multiple caches
The number of levels of caches
The use of unified versus split caches
Split caches: one dedicated to instructions and one dedicated
to data
• Key advantage of split caches: eliminate contention for cache
between the instruction processor and the execution unit.
Unified cache: a single cache used to store references to both
data and instructions
For a given cache size, a unified cache has a higher hit rate than
split caches because it balances the load between instruction and
data fetches automatically.
Number of caches
 The on-chip cache: cache and processor on the same chip
When the requested instruction or data is found in the on-chip cache,
the bus access is eliminated. Because of the short data paths internal
to the processor, on-chip cache accesses will complete appreciably
faster than would even zero-wait state bus cycles.
Advantages
Reduce the processor’s external bus activity
Speed up execution times
Increase overall system performance
 A two-level cache
The internal cache designated as level 1 (L1)
The external cache designated as level 2 (L2)
4.4 Pentium Cache
Foreground reading
Find out detail of Pentium II cache systems
NOT just from Stallings!
4.5 Newer RAM Technology (1)
Basic DRAM same since first RAM chips
 Constraints of the traditional DRAM chip:
its internal architecture and its interface to the
processor’s memory bus.
Enhanced DRAM
Contains small SRAM as well
SRAM holds last line read
 A comparator stores the 11-bit value of the most recent
row address selection.
Cache DRAM (CDRAM)
Larger SRAM component
Use as cache or serial buffer
Newer RAM Technology (2)
Synchronous DRAM (SDRAM)
Access is synchronized with an external clock, unlike
conventional DRAM, which is asynchronous.
Address is presented to RAM
Since SDRAM moves data in time with system clock, CPU
knows when data will be ready
CPU does not have to wait, it can do something else
Burst mode allows SDRAM to set up stream of data and
fire it out in block
Internal logic of
the SDRAM
• In burst mode, a series of data
bits can be clocked out rapidly
after the first bit has been
accessed.
Burst mode is useful when all
the bits to be accessed are in
sequence and in the same row
of the array as the initial access
•A dual-bank internal
architecture that improves
opportunities for on-chip
parallelism.
• The mode register and
associated control logic
provide a mechanism to
customize the SDRAM to suit
specific system needs.
Newer RAM Technology (3)
Foreground reading
Check out any other RAM you can find
See Web site:
The RAM Guide
Exercises
P143 4.4, 4.6, 4.7, 4.8
P145 4.20
Deadline