Cache Memory - Personal Web Pages


Processor - Memory Interface

Memory must be random access memory: memory in which individual memory locations can be accessed in any order at the same high speed.

The memory that connects to the processor should operate at a very high speed, preferably at a speed that matches the processor, so as not to slow the system down.

[Figure: Processor connected to memory, with instructions and data transferred between them.]

Large dynamic semiconductor RAM used for main memory cannot operate at that speed (much slower). Relatively small static semiconductor memory can be designed to operate faster.

ITCS 3181 Logic and Computer Systems 2014 B. Wilkinson Slides13.ppt Modification date: Nov 18, 2014

Solution: Cache Memory

Processor operates much faster than the main memory can.

To ameliorate the situation, a high speed memory called a cache memory is placed between the processor and main memory.

Information must be in cache memory for processor to access it.

[Figure: Main memory, high speed cache memory, and processor. Location X is transferred from main memory to the cache, and from the cache to the processor.]

The first paper on cache memories: M. Wilkes, “Slave Memories and Dynamic Storage Allocation,” IEEE Trans. on Electronic Computers, 1965. What else did he invent/publish first?

Time to access contents of memory

If same instructions never re-executed, caches would cause an additional overhead as information would first have to be transferred from the main memory to the cache and then to the processor and vice versa, i.e. the access time, $t_a$, would be:

$t_a = t_m + t_c$

where $t_c$ = cache access time and $t_m$ = main memory access time.

Fortunately, virtually all programs repeat sections of code and repeatedly access the same or nearby data. This characteristic is embodied in the Principle of Locality.

Principle of Locality

Found empirically to be obeyed by most programs. Applies to both instruction and data references, though more likely in instruction refs. Two main aspects:

1. Temporal locality (locality in time): individual locations, once referenced, are likely to be referenced again in the near future. Seen in instruction loops, stacks, variable accesses…

Temporal locality is essential for an effective cache.

2. Spatial locality (locality in space): references are likely to be near last reference. Seen in data accesses as data often stored in consecutive locations. References to next location sometimes separated into a third aspect, known as sequential locality.

Spatial locality helpful in the design of a cache but not essential.

Taking Advantage of Temporal Locality

Suppose a reference is repeated $n$ times in all during a program loop and, after the first reference, the location is always found in the cache, then the average access time would be:

Average access time $= (n t_c + t_m)/n = t_c + t_m/n$

where $n$ = number of references.

Example

If $t_c$ = 5 ns, $t_m$ = 60 ns and $n$ = 10, average access time would be 11 ns, as opposed to 60 ns without cache.
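The arithmetic of this example can be checked with a minimal sketch. The function name `average_access_time_repeated` is an illustrative choice, not from the slides, and the model assumes (as the text does) that only the first of the n references misses.

```python
def average_access_time_repeated(t_c, t_m, n):
    """Average access time when one location is referenced n times and only
    the first reference misses: (n*t_c + t_m)/n = t_c + t_m/n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return t_c + t_m / n


# Values from the example: t_c = 5 ns, t_m = 60 ns, n = 10.
average_ns = average_access_time_repeated(t_c=5, t_m=60, n=10)
print(average_ns)  # 11.0
```

With n = 1 the function returns t_c + t_m, the no-reuse case in which the cache only adds overhead.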

THROUGHOUT, $t_c$ is the time to access the cache, read (or write) the data if a hit or recognize a miss. In practice, these times could be different. $t_m$ is the extra time to access the main memory. Sometimes machine cycles are used in the equations rather than absolute time.

Hit Ratio

Hit ratio: the probability that the required word is already in the cache.

A hit occurs when a location in the cache is found immediately, otherwise a miss occurs and a reference to the main memory is necessary.

[Figure: Processor, high speed cache memory, and main memory. On a cache miss, the address goes to main memory and location X is transferred to the cache and the processor.]

The cache hit ratio, h, (or hit rate) is defined as:

$h = \dfrac{\text{Number of times required word found in cache}}{\text{Total number of references}}$

The miss ratio (or miss rate) is given by 1 - h.

Average access time using Hit Ratio

The average access time, $t_a$, is given by:

$t_a = t_c + (1 - h) t_m$

assuming again that the access must be to the cache on a hit or miss before an access is made to the main memory on a miss.*

Example

If hit ratio is 0.85 (a typical value), main memory access time is 50 ns and cache access time is 5 ns, average access time is 5 + 0.15 × 50 = 12.5 ns.
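A minimal sketch of the same calculation, under the stated assumption that every access goes to the cache first and a miss adds the main memory time. The function names are illustrative, not from the slides.

```python
def hit_ratio(hits, total_references):
    """h = number of times required word found in cache / total references."""
    return hits / total_references


def average_access_time(h, t_c, t_m):
    """t_a = t_c + (1 - h) * t_m, with the cache accessed on every reference."""
    if not 0.0 <= h <= 1.0:
        raise ValueError("hit ratio must be between 0 and 1")
    return t_c + (1.0 - h) * t_m


# Values from the example: h = 0.85, t_c = 5 ns, t_m = 50 ns.
t_a = average_access_time(h=hit_ratio(85, 100), t_c=5, t_m=50)
print(round(t_a, 3))  # 12.5
```

The same function works with machine cycles in place of nanoseconds, since only the units of t_c and t_m change.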

Machine cycles

In a practical system, each access time given as an integer number of machine cycles. Typically hit time will be 1–2 cycles. Cache miss penalty (extra time to access main memory) in order of 5–20 cycles.

*Only read requests are considered. Write requests considered later.

Taking advantage of Spatial Locality

To take advantage of spatial locality, transfer not just one byte or word to/from main memory to cache but a series of sequential locations called a line or a block.

For best performance, line should be transferred simultaneously across a wide data bus to the cache. This also enables access time of main memory to be matched to the cache.

[Figure: Cache memory with multiple memory modules (wide word length memory). The memory modules hold byte addresses 0-7 and 8-15, and a whole line is transferred across the bus between the memory modules and the cache.]

Cache Memory Organizations

Need a way to select the location within the cache. The memory address of its location in main memory is used.

Three ways of selecting cache location:

1. Fully associative
2. Direct mapped
3. Set associative

[Figure: Processor, cache, and memory, with the memory address and data paths between them.]

1. Fully Associative Mapping

Both memory address and data stored together in the cache. Incoming memory address is simultaneously compared with all stored addresses using the internal logic of the cache memory.

[Figure: Fully associative cache. The memory address from the processor is compared with all stored addresses simultaneously. If the address is found, the location is accessed; if the address is not in the cache, main memory is accessed. Requires one address comparator with each stored address (content-addressable memory).]

Example

Suppose each line has 16 bytes. With 32-bit processors, a word consists of 4 bytes.

“Word” field specifies word within line. In this example, with 4 words in line, need 2 bits.

“Byte” field specifies byte within word. In this example with 4 bytes in word, need 2 bits.

[Figure: Memory address from processor with a 2-bit word field and a 2-bit byte field. When the address is found in the cache, the word in the line (Word 0 to Word 3) is accessed and the byte in the word is selected if necessary.]

Selection/Replacement Algorithms

Fully associative cache needs an algorithm to select where to store information in cache, generally over some existing line (which would have to be copied back to the main memory if altered).* Must be implemented in hardware. (No software) Ideally, algorithm should choose a line which is not likely to be needed again in the near future, from all lines that could be selected.

Common Algorithms

1. Random selection
2. The least recently used algorithm (or an approximation to it).

* Note in caches the selection and replacement location usually refers to the same location whereas in virtual memory (OS course) they usually refer to different locations.

Least Recently Used (LRU) Algorithm

Line which has not been referenced for longest time removed from cache. The word “recently” comes about because the line is not the least used, as this is likely to be back in memory. It is the least used of those lines in the cache, and all of these are likely to have been recently used otherwise they would not be in the cache.

Can only be implemented in hardware fully when the number of lines that need to be considered is small.
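The LRU choice can be illustrated in software, although a real cache implements it (or an approximation) in hardware. This is a minimal sketch that tracks line addresses only; the class name `FullyAssociativeLRU` and the reference trace are hypothetical.

```python
from collections import OrderedDict


class FullyAssociativeLRU:
    """Toy fully associative cache that tracks line addresses only."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # least recently used line comes first
        self.hits = 0
        self.misses = 0

    def access(self, line_address):
        """Return True on a hit, False on a miss (replacing the LRU line)."""
        if line_address in self.lines:
            self.hits += 1
            self.lines.move_to_end(line_address)  # now most recently used
            return True
        self.misses += 1
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # remove least recently used line
        self.lines[line_address] = None
        return False


cache = FullyAssociativeLRU(num_lines=2)
results = [cache.access(address) for address in [1, 2, 1, 3, 2]]
print(results)  # [False, False, True, False, False]
```

In the trace, the reference to line 3 replaces line 2 rather than line 1, because line 1 was referenced more recently.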

Direct Mapping

Line held in cache at a location given by “index” bits of main memory address. Line selected by index bits of main memory address. Most significant bits of address stored in cache compared with most significant bits of main memory address (tags).

[Figure: Memory address from processor divided into tag, index, word, and byte fields. The index selects a line in the high speed RAM cache, which holds a tag and Word 0 to Word n-1. The stored tag is read and compared with the address tag using one external address comparator. If same, word/byte in line accessed; if different, main memory accessed.]

Sample Direct-Mapped Cache Design

8192-byte direct mapped cache with 32-byte line organized as eight 4-byte words. 32-bit memory address.

[Figure: 32-bit memory address from processor divided into tag (19 bits), index (8 bits), word (3 bits), and byte (2 bits). The index selects one of 8192/32 = 256 (2^8) lines, each holding a tag and Word 0 to Word 7. The stored tag is read and compared with the address tag; if same, word/byte in line accessed.]

With 4 bytes in word, need 2 bits in byte field.

With 8 words in line, need 3 bits in word field.

With 8192 bytes in total and 32 bytes in each line, 8192/32 entries in cache (= 256 = 2^8). So index = 8 bits.
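As a sketch of how the fields of this sample design could be extracted from an address, the following uses shifts and masks. The function name `split_address` and the sample address are hypothetical; the field widths are those of the design above.

```python
BYTE_BITS = 2    # 4 bytes per word
WORD_BITS = 3    # 8 words per line
INDEX_BITS = 8   # 8192 / 32 = 256 lines
TAG_BITS = 32 - INDEX_BITS - WORD_BITS - BYTE_BITS  # remaining 19 bits


def split_address(address):
    """Split a 32-bit address into (tag, index, word, byte) fields."""
    byte = address & ((1 << BYTE_BITS) - 1)
    word = (address >> BYTE_BITS) & ((1 << WORD_BITS) - 1)
    index = (address >> (BYTE_BITS + WORD_BITS)) & ((1 << INDEX_BITS) - 1)
    tag = address >> (BYTE_BITS + WORD_BITS + INDEX_BITS)
    return tag, index, word, byte


print(TAG_BITS)                   # 19
print(split_address(0x12345678))  # (37282, 179, 6, 0)
```

Adding the cache size (8192) to an address leaves the index unchanged and increases the tag by one, so the two addresses compete for the same line.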

Advantages of Direct Mapped Caches

1. No replacement algorithm necessary - because there is no choice in the selection of the location for the incoming line. It is given by the index of the address of incoming line.

2. Simple hardware and low cost.

3. High speed of operation.

Major Disadvantage of Direct Mapped Caches

Performance drops significantly if accesses are made to different locations with the same index.

However, as the size of cache increases, the difference in the hit ratios of the direct and associative caches reduces and becomes insignificant.

Elements of an Array Stored in Memory

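Every nth location in memory maps into the same location in the cache, where there are n locations in the cache.

A 2-dimensional array, a[ ][ ], with n elements in each row: the elements in the first position of each row (a[0][0], a[1][0], a[2][0], ...) would all map into one location (if row-major order as in C).

[Figure: Array elements a[0][0], a[0][1], ..., a[0][n-1], a[1][0], a[1][1], ..., a[1][n-1], a[2][0], a[2][1], ... stored in consecutive memory locations, with each group of n locations mapping onto the n locations of the cache.]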
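The conflict can be sketched numerically. This assumes, for illustration only, that each array element occupies one cache location, that the array starts at memory location 0, and that each row holds exactly n elements; n = 8 and the 4 rows are hypothetical values.

```python
def cache_location(memory_location, n):
    """Direct mapping: memory location modulo the n locations in the cache."""
    return memory_location % n


n = 8      # hypothetical number of cache locations and row length
rows = 4   # hypothetical number of rows

# Row-major order (as in C): a[i][j] is stored at memory location i*n + j.
first_column = [cache_location(i * n + 0, n) for i in range(rows)]
first_row = [cache_location(0 * n + j, n) for j in range(n)]

print(first_column)  # [0, 0, 0, 0]
print(first_row)     # [0, 1, 2, 3, 4, 5, 6, 7]
```

Walking down a column keeps replacing the same cache location, while walking along a row uses every location once.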

Set-Associative Mapping

Allows a limited number of lines, with the same index and different tags, in the cache. A compromise between a fully associative cache and a direct mapped cache.

Cache divided into “sets” of lines. A four-way set associative cache would have four lines in each set.

The number of lines in a set is known as the associativity or set size. Each line in each set has a stored tag which, together with the index (set number), completes the identification of the line.

4-way Set-Associative Cache

[Figure: Memory address from processor divided into tag, index, word, and byte fields. The index selects a set of four lines, each holding a tag and data. The four stored tags are compared with the address tag. If same, word/byte accessed; main memory accessed if tags do not match.]

First, index of address from processor used to access set. Then, all tags of selected set compared with incoming tag. If match found, corresponding location accessed, otherwise access main memory.

Sample 4-way Set-Associative Cache Design

4096-byte 4-way set-associative cache with 8-byte line organized as two 4-byte words. 32-bit memory address.

[Figure: 32-bit memory address from processor divided into tag (22 bits), index (7 bits), word (1 bit), and byte (2 bits). The index selects one of 4096/(4 x 8) = 128 sets, and the four tags in the set are compared with the address tag. If same, word/byte accessed; main memory accessed if tags do not match.]

With 4 bytes in word, need 2 bits in byte field.

With 2 words in line, need 1 bit in word field.

With 4096 bytes in total and 8 bytes in each line and 4 lines in set (4-way set assoc.), 4096/(4 x 8) entries in cache (= 128 = 2^7). So index = 7 bits.
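The field widths of this design can be derived mechanically. The sketch below assumes all sizes are powers of two and uses a hypothetical helper name, `field_widths`; it returns the widths in the order tag, index, word, byte.

```python
def field_widths(cache_bytes, line_bytes, ways, word_bytes=4, address_bits=32):
    """Return (tag, index, word, byte) field widths in bits.

    Assumes every size is a power of two, so (x - 1).bit_length() == log2(x).
    """
    byte_bits = (word_bytes - 1).bit_length()
    word_bits = (line_bytes // word_bytes - 1).bit_length()
    sets = cache_bytes // (line_bytes * ways)
    index_bits = (sets - 1).bit_length()
    tag_bits = address_bits - index_bits - word_bits - byte_bits
    return tag_bits, index_bits, word_bits, byte_bits


# 4096-byte 4-way set-associative cache with 8-byte lines.
print(field_widths(cache_bytes=4096, line_bytes=8, ways=4))  # (22, 7, 1, 2)
```

Setting `ways=1` gives the direct mapped case, and setting `ways` to the total number of lines leaves a single set with a 0-bit index, the fully associative case.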

Set-Associative Cache Replacement Algorithm

Need only consider the lines in one set, as the choice of set is predetermined by the index (set number) in the address.

Hence, with two lines in each set, for example, only one additional bit is necessary in each set to identify the line to replace.


Set size

Typically, set size is 2, 4, 8, or 16. A set size of one line reduces organization to that of direct mapping.

An organization with one set becomes fully associative mapping.

Set-associative cache popular for internal caches of microprocessors.

Valid Bits

In all caches, one valid bit provided with each line.* Will assume one valid bit per line.

Valid bits set to a 0 initially. Then set to a 1 when contents of line are valid. Checked before accessing line.

Needed to handle start-up situation when cache holds random patterns of bits and also before cache is full.

* Or parts of a line if only parts transferred in separate transactions.

Sample Cache Design showing valid bits

(assuming a line can be transferred in one transaction)

4096-byte 2-way set-associative cache with 16-byte lines organized as four 4-byte words. 32-bit memory address.

[Figure: 32-bit memory address from processor divided into tag (21 bits), index (7 bits), word (2 bits), and byte (2 bits). The index selects one of 128 (2^7) sets; each of the two lines in the set holds a valid bit, a tag, and Word 0 to Word 3. Valid bit set when line transferred into cache. Tags compared; if same, word/byte in line accessed.]

With 4 bytes in word, need 2 bits in byte field.

With 4 words in line, need 2 bits in word field.

With 4096 bytes in total and 16 bytes in each line and 2 lines in set (2-way set assoc.), 4096/(16 x 2) entries in cache (= 128 = 2^7). So index = 7 bits.
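A toy software model of this 2-way design can show why the valid bits matter and how one extra bit per set is enough to pick the line to replace. The class name `TwoWayCache`, the `lru` list, and the trace addresses are hypothetical; only tags are stored, not data.

```python
class TwoWayCache:
    """Toy 2-way set-associative cache: 128 sets, 16-byte lines, tags only."""

    OFFSET_BITS = 4   # 2 word bits + 2 byte bits
    INDEX_BITS = 7    # 128 sets

    def __init__(self):
        num_sets = 1 << self.INDEX_BITS
        self.valid = [[False, False] for _ in range(num_sets)]  # all 0 at start
        self.tags = [[0, 0] for _ in range(num_sets)]
        self.lru = [0] * num_sets  # one bit per set: way to replace next

    def access(self, address):
        """Return True on a hit, False on a miss (line is then brought in)."""
        index = (address >> self.OFFSET_BITS) & ((1 << self.INDEX_BITS) - 1)
        tag = address >> (self.OFFSET_BITS + self.INDEX_BITS)
        for way in (0, 1):
            # Valid bit checked before the stored tag is trusted.
            if self.valid[index][way] and self.tags[index][way] == tag:
                self.lru[index] = 1 - way  # other way is now least recent
                return True
        way = self.lru[index]
        self.tags[index][way] = tag
        self.valid[index][way] = True  # set when line transferred into cache
        self.lru[index] = 1 - way
        return False


cache = TwoWayCache()
A, B, C = 0x0000, 0x0800, 0x1000  # same index (0), tags 0, 1 and 2
results = [cache.access(a) for a in [A, A, B, A, C, A, B]]
print(results)  # [False, True, False, True, False, True, False]
```

The first access to A is a miss even though the stored tag happens to equal A's tag (0), because the valid bit is still 0. Later, C replaces B rather than A because A was used more recently.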

Fetch policy

Three strategies for fetching lines from main memory to cache:

Demand fetch - fetching a line when it is needed on a miss.

Prefetch - fetching lines before they are requested. Simple prefetch strategy: prefetch (i + 1)th line when ith line is initially referenced (assuming that the (i + 1)th line is not already in the cache) on the expectation that it is likely to be needed if the ith line is needed.

Selective fetch - policy of not always fetching lines, dependent upon some defined criterion. Then, main memory used rather than cache to hold the information. Individual locations could be tagged as non-cacheable. May be advantageous to lock certain cache lines so that these are not replaced. Hardware could be provided within cache to implement such locking.

Write Policies

Reading a word in cache does not affect it and no discrepancy between the cache word and copy held in main memory.

Writing can occur to cache words and then copy held in main memory different. Important to maintain copies same if other devices such as disks access the main memory directly.

Two principal alternative mechanisms to update the main memory:

1. Write through
2. Write back

1. Write-Through

In the write-through mechanism, every write operation to the cache is repeated to the main memory, normally at the same time. Then main memory always the same as the cache.

[Figure: Processor, cache, and main memory. Location X is written to both the cache and the main memory on every write reference (but see later).]

Cache with write buffer

Write-through scheme can be enhanced by incorporating buffers.

[Figure: Processor, cache, and main memory, with buffers on the write path from the cache to main memory and a separate read path.]

Allows the cache to be accessed while multiple previous memory write operations proceed. “Non-blocking” store.

Two ways to handle write misses

1. Fetch-on-write (miss)

Describes a policy of bringing a line from the main memory into the cache for a write operation on a write miss (when the line is not already in the cache). Also called allocate on write because a line is allocated for an incoming line on cache miss.

2. No-Fetch-on-write (miss)

Describes a policy of not bringing a line from the main memory into the cache for a write operation. Also called non-allocate on write.

No fetch on write often practiced with a write-through cache. Why?

2. Write-Back (or copy back)

Write operation to main memory only done at line replacement time. At this time, line displaced by incoming line written back to main memory.

[Figure: Write-back. Step 1: reference to Y, a miss. Step 2: X written back to main memory when its location is used by incoming line (Y); only necessary if X altered in cache, which requires an altered (“dirty”) bit with the line. Step 3: bring in Y. Here X and Y have same index if direct mapped/set associative.]
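The role of the dirty bit can be sketched with a toy direct mapped model that counts how many lines are copied back. This is an illustration under stated assumptions: fetch-on-write is assumed on a write miss, only line addresses are tracked, and the class name `WriteBackCache` and the trace are hypothetical.

```python
class WriteBackCache:
    """Toy direct mapped write-back cache tracking one tag per line."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.valid = [False] * num_lines
        self.dirty = [False] * num_lines  # altered ("dirty") bit per line
        self.tags = [0] * num_lines
        self.write_backs = 0              # lines copied back to main memory

    def access(self, line_address, is_write):
        """Return True on a hit; on a miss, replace the line at that index."""
        index = line_address % self.num_lines
        tag = line_address // self.num_lines
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            # Displaced line written back only if it was altered in the cache.
            if self.valid[index] and self.dirty[index]:
                self.write_backs += 1
            # Bring in the incoming line (fetch-on-write assumed for writes).
            self.valid[index] = True
            self.tags[index] = tag
            self.dirty[index] = False
        if is_write:
            self.dirty[index] = True
        return hit


cache = WriteBackCache(num_lines=4)
X, Y = 1, 5  # line addresses with the same index
results = [
    cache.access(X, is_write=True),   # miss, X brought in and altered
    cache.access(X, is_write=True),   # hit, X altered again
    cache.access(Y, is_write=False),  # miss, altered X written back
    cache.access(X, is_write=False),  # miss, unaltered Y simply discarded
]
print(results, cache.write_backs)  # [False, True, False, False] 1
```

X is written twice but copied back to main memory once, whereas write-through would have repeated both writes to main memory.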

Instruction and Data Caches

Several advantages if separate cache into two parts, one holding the data (a data cache) and one holding program instructions (an instruction cache or code cache):

• Separate paths could be provided from the processor to each cache, allowing simultaneous transfers to both the instruction cache and the data cache.

• Write policy would only have to be applied to the data cache assuming instructions are not modified.

• Designer may choose to have different sizes for the instruction cache and data cache, and have different internal organizations and line sizes for each cache.

Particularly convenient in a pipeline processor, as different stages of the pipeline access each cache (instruction fetch unit accesses instruction cache and memory access unit accesses data cache).

[Figure: Instruction pipeline (IF, OF, EX, MEM) inside the processor. The instruction fetch unit reads instructions from the instruction cache and the memory access unit accesses the data cache. Both caches, commonly inside the processor, connect to main memory by separate data paths.]

General Cache Performance Characteristics

Miss Ratio against Cache Size

[Figure: Miss ratio (axis from 1.0 down to 0.01) against cache size (2K to 32K) for Program A, Program B, and Program C.]

Miss Ratio against Line Size

[Figure: Miss ratio (axis from 1.0 down to 0.01) against line size (4 to 128 bytes) for fixed cache sizes from 32 to 32768, for an instruction/data cache and an instruction cache. Has a minimum (Why?)]

Second Level Caches

Most present-day systems use two levels of cache (or three levels).

[Figure: Processor, first-level cache(s) (usually separate data and instruction caches), second-level cache (unified cache holding code and data), and main memory.]

First-level cache access time matches processor. Second-level cache access time between main memory access time and first level cache access time.

Strictly inclusive caches: all the data in the L1 cache is also in the L2 cache.

Exclusive caches: data is guaranteed only to be in one cache (L1 or L2) at most, never in both.

Alternative: data could be in only L1 or L2 or both.

Caches Example

Intel i3-2120 (Sandy Bridge), 3.3 GHz, 32 nm (Launched 2011)

• L1 Data cache = 32 Kbyte, 8-way (Write-Allocate?), line = 64 bytes
• L1 Instruction cache = 32 Kbyte, 8-way, line = 64 bytes
• L2 Cache = 256 KB, 8-way, line = 64 bytes
• L3 Cache = 3 MB, Direct?, line = 64 bytes

L1 Data Cache Latency = 4 cycles or 5 cycles
L2 Cache Latency = 12 cycles
L3 Cache Latency = 27.85 cycles
RAM Latency = 28 cycles + 49 ns or 56 ns

http://www.7-cpu.com/cpu/SandyBridge.html
