Chapter 6. Memory Organization


Transfer between P (processor) and M (memory) should be such that P can operate at its maximum speed → not feasible with a single memory built from one technology.
– CPU registers : a small set of high-speed registers inside P that serve as working memory for temporary storage of instructions and data. Single-clock-cycle access.
– Main (primary) memory : can be accessed directly and rapidly by the CPU. Although its IC technology is similar to that of the CPU registers, access is slower because of its large capacity and its physical separation from the CPU.
– Secondary (backup) memory : much larger in capacity, much slower, and much cheaper than main memory.
– Cache : an intermediate temporary storage unit between the processor registers and main memory. One- to three-clock-cycle access.

The objective of memory design is to provide adequate storage capacity with an acceptable level of performance and cost ⇒ memory hierarchy, automatic storage allocation, virtual memory concepts, and the design of the communication links.

Memory Device Characteristics
1. Cost : C = P/S (dollars/bit), where P is the price of the device and S is its storage capacity in bits.
2. Access time (t_A) : the average time required to read one word from the memory, measured from the time a read request is received by the memory to the time when all the requested information has been made available at the memory output.

t_A depends on the physical nature of the storage medium and on the access mechanism used. Memory units with fast access are expensive.

3. Access mode : RAM (Random Access Memory) can be accessed in any order, and the access time is independent of the location being accessed.

Serial-access memory (e.g., tape) : locations must be accessed in a fixed sequence, so the access time depends on the location.

4. Alterability : ROM (Read-Only Memory), PROM (Programmable ROM), EPROM (Erasable Programmable ROM).

5. Permanence of storage : destructive readout, dynamic storage, and volatility.
ex) Dynamic memory (DRAM) requires periodic refreshing; static random-access memory (SRAM) requires no periodic refreshing. DRAM is much cheaper than SRAM.
"Volatile" : the stored information is destroyed by a power failure.
6. Cycle time (t_M) : the mean time that must elapse between the initiation of two consecutive access operations. t_M can be greater than t_A.

(Dynamic memory cannot initiate a new access until a pending refresh operation completes.)
7. Physical characteristics : storage density and reliability, the latter measured by the MTBF (mean time between failures).

RAM : the access and cycle times are constant and independent of the location accessed.

Array organization : The memory address is partitioned into d components, so that the address A_i of cell C_i becomes a d-dimensional vector A_i = (A_i1, A_i2, ···, A_id). Each of the d parts goes to a different address decoder → a d-dimensional array. Usually, a 2-dimensional array organization is used.

[Figure: 1-dimensional organization (a single decoder with N outputs, one per cell) versus 2-dimensional organization (an X decoder and a Y decoder selecting the rows and columns of a √N × √N cell array).]

The 2-D organization requires less access circuitry (two decoders of about √N outputs each instead of one N-output decoder) and less access time, and it matches well the circuit structure produced by IC technology.
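As a concrete illustration (a minimal sketch, not from the source: the memory size, bit widths, and the helper name split_address are assumptions), the following Python fragment splits an address into X and Y components for a 2-D array and compares the decoder output counts of the two organizations:

```python
import math

def split_address(addr: int, n_bits: int):
    """Split an n_bits-wide address into X (row) and Y (column) parts
    for a square 2-D cell array."""
    y_bits = n_bits // 2
    x = addr >> y_bits               # high-order bits drive the X decoder
    y = addr & ((1 << y_bits) - 1)   # low-order bits drive the Y decoder
    return x, y

n_bits = 20                          # assumed: 1M-cell memory, N = 2**20
N = 1 << n_bits
x, y = split_address(0xABCDE, n_bits)
print(f"X = {x}, Y = {y}")

# Access circuitry: one N-output decoder (1-D) versus two decoders of
# sqrt(N) outputs each (2-D).
print("1-D decoder outputs:", N)                 # 1048576
print("2-D decoder outputs:", 2 * math.isqrt(N)) # 2048
```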

Key issues : how to reduce access time; fault-tolerance techniques.

6.2 Memory Systems : a hierarchical storage system managed by the operating system.
1. To free programmers from the need to carry out storage allocation and to permit efficient sharing of memory space among different users.
2. To make programs independent of the configuration and capacity of the memory systems used during their execution.
3. To achieve the high access rates and low cost per bit that are possible with a memory hierarchy, implemented by an automatic address-mapping mechanism.
A typical hierarchy of memories: (M_1, M_2, ···, M_k).

Let C_i be the cost per bit, t_Ai the access time, and S_i the storage capacity of level M_i. Then C_i > C_i+1, t_Ai < t_Ai+1, and S_i < S_i+1. Generally, all information in M_i-1 at any time is also stored in M_i, but not vice versa.

If an address generated by the CPU is currently assigned only to M_i for i ≠ 1, the execution of the program must be suspended until the information is reassigned from M_i to M_1 → very slow → to work efficiently, the addresses generated by the CPU should be found in M_1 as often as possible.

The memory hierarchy works because of a common characteristic of programs: locality of reference.

Locality of reference : the addresses generated by a typical program tend to be confined to small regions of its logical address space over the short term.

Spatial locality : consecutive memory references are to addresses that are close to one another in the memory-address space. ⇒ Instead of transferring one instruction I to M_1, transfer a page of consecutive words containing I.

Temporal locality : instructions in a loop are executed repeatedly, resulting in a high frequency of reference to their addresses.

The design objective is to achieve a performance close to that of M_1 and a cost per bit close to that of M_k.

Factors:
1. The address-reference statistics.
2. The access time of each level M_i relative to the CPU.
3. The storage capacity of each level.
4. The size of the transferred block of information (an optimal block size is needed).
5. The allocation algorithm.

These designs are evaluated by simulation; simulation is the major tool.

Consider a two-level hierarchy (M_1 & M_2), where S_i is the storage capacity of M_i and C_i is the cost per bit of M_i. The average cost per bit is

C = (C_1·S_1 + C_2·S_2) / (S_1 + S_2)

For S_1 << S_2, C ≈ C_2.

Hit ratio H : the probability that a logical address generated by the CPU refers to information in M_1 → we want H to be close to 1. It is measured by executing a set of representative programs, where
N_1 : number of address references satisfied by M_1
N_2 : number of address references satisfied by M_2

Miss ratio : 1 − H, where H = N_1 / (N_1 + N_2)

Let t_A1 and t_A2 be the access times of M_1 and M_2, respectively:

t_A (average access time) = H·t_A1 + (1 − H)·t_A2

On a miss, a block of information has to be transferred from M_2 to M_1. Let t_B be the block transfer time, so that t_A2 = t_B + t_A1:

t_A = H·t_A1 + (1 − H)·(t_B + t_A1) = t_A1 + (1 − H)·t_B

Since t_B >> t_A1, t_A2 ≈ t_B.
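As a quick numerical check (a minimal sketch; the example timing values are assumptions, not from the source), the average access time follows directly from the last formula:

```python
def avg_access_time(h: float, t_a1: float, t_b: float) -> float:
    """t_A = t_A1 + (1 - H) * t_B, the two-level average access time."""
    return t_a1 + (1.0 - h) * t_b

# Assumed example values: 10 ns M1 access, 1000 ns block transfer.
for h in (0.90, 0.99, 0.999):
    print(f"H = {h:.3f}: t_A = {avg_access_time(h, 10.0, 1000.0):.1f} ns")
```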

Access efficiency:

e = t_A1 / t_A = 1 / (r + (1 − r)·H), where r = t_A2 / t_A1

For r = 100, to make e > 90% requires H > 0.998.

6.2.2 Address Translation : maps the logical addresses into the physical address space P of main memory → performed by the OS while the program is being executed.
Static translation : assigns fixed values to the base address of each block when the program is first loaded.
Dynamic translation : allocates storage during execution.
Base addressing : A_eff = B + D (or the concatenation A_eff = B.D), where B is a base address and D a displacement.

Translation look-aside buffer (TLB) : a small cache of recently used address translations.

Segments: A segment is a set of logically related, contiguous words such as programs or data sets.

The physical addresses assigned to the segments are kept in a segment table.

Each segment-table entry (descriptor) contains:
• A presence bit P that indicates whether the segment is currently assigned to M_1.
• A copy bit C that specifies whether this is the original (master) copy of the descriptor.
• A 20-bit size field Z that specifies the number of words in the segment.
• A 20-bit address field S that is the segment's real address in M_1 (when P = 1) or M_2 (when P = 0).
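A minimal sketch of translation through such a table (the class layout, fault handling, and example values are illustrative assumptions; only the P, C, Z, and S fields come from the text):

```python
from dataclasses import dataclass

@dataclass
class SegmentDescriptor:
    p: bool   # presence bit: segment currently in M1?
    c: bool   # copy bit: master copy of the descriptor?
    z: int    # size field: number of words in the segment
    s: int    # address field: real address in M1 (p=1) or M2 (p=0)

def translate(table: dict[int, SegmentDescriptor], seg: int, disp: int) -> int:
    """Base addressing: A_eff = B + D, with bounds and presence checks."""
    d = table[seg]
    if disp >= d.z:
        raise IndexError("displacement exceeds segment size Z")
    if not d.p:
        raise LookupError("segment fault: segment resides in M2")
    return d.s + disp

table = {3: SegmentDescriptor(p=True, c=True, z=4096, s=0x2000)}
print(hex(translate(table, seg=3, disp=0x10)))  # 0x2010
```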

Pages : fixed-length blocks.
adv. : very simple memory allocation.
Logical address : a page address + a displacement within the page.

Page table : maps each logical page address to the corresponding physical page address.

disadv. : page boundaries have no logical significance, so neighboring pages need not be logically related.

Paged segments : divide each segment into pages.
Logical address : a segment address + a page address + a displacement.
adv. : a segment need not be stored in a contiguous region of main memory (more flexible memory management).

Optimal page size for paged segments. The page size S_p affects storage utilization and the memory access rate:
too small S_p → large page tables → reduced utilization.
too big S_p → excessive internal fragmentation.
Let S_s be the average segment size and S the memory space overhead due to paged segmentation. On average, half of a segment's last page is wasted (internal fragmentation), and the segment's page table needs one entry per page, so

S = S_p/2 + S_s/S_p

Minimizing S with respect to S_p:

dS/dS_p = 1/2 − S_s/S_p² = 0 → S_p^opt = √(2·S_s)

at which point the overhead is S_opt = √(2·S_s). The space utilization factor is

u = S_s / (S_s + S), so u_opt = S_s / (S_s + √(2·S_s))

A special processor, the MMU (Memory Management Unit), handles address translation.

Main memory allocation : main memory is divided into regions, each of which has a base address to which a particular block is to be assigned.

Main memory allocation : the process of determining which region a block is assigned to. The following data structures are maintained:

1. an occupied-space list : block name, address, size.
2. an available-space list : empty regions.
3. a secondary-memory directory.

Deallocation : when a block is no longer required in main memory, its region is transferred from the occupied-space list to the available-space list.

Suppose that a block K_i of n_i words is transferred from secondary to main memory.

• preemptive : an incoming block may be assigned to a region occupied by another block, by either moving or expelling that block.

• non-preemptive : an incoming block can be placed only in an unoccupied region that is large enough to accommodate it.

① Non-preemptive allocation : if no block is to be preempted by an incoming block K_i of n_i words, then → find an unoccupied ("available") region of n_i or more words → first-fit method or best-fit method.
first-fit method : scans the memory map sequentially until an available region of n_i or more words is found, then allocates K_i to it.

best-fit method : scans the entire memory map and allocates K_i to a region of n_j ≥ n_i words such that (n_j − n_i) is minimized.

Example) A memory map contains occupied blocks K_1 (addresses 50–300), K_2 (700–800), and K_3 (from address 1000), leaving the available regions:

Available region address | Size
0 | 50
300 | 400
800 | 200

Two additional blocks arrive: K_4 (100 words) and K_5 (250 words).
First fit : K_4 → 300–400, K_5 → 400–650.
Best fit : K_4 → 800–900 (the 200-word region is the tightest fit for 100 words), K_5 → 300–550.
Another case : K_4 (100 words) and K_5 (400 words). Here first fit places K_4 at 300, shrinking the 400-word region to 300 words, and then cannot place K_5; best fit places K_4 in the 200-word region and K_5 in the 400-word region, accommodating both.
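A minimal sketch of the two policies over an available-space list (the (address, size) list representation and function names are illustrative assumptions):

```python
def first_fit(avail: list[tuple[int, int]], n: int) -> int | None:
    """Return the base address of the first region with size >= n."""
    for addr, size in avail:
        if size >= n:
            return addr
    return None

def best_fit(avail: list[tuple[int, int]], n: int) -> int | None:
    """Return the base address of the region minimizing (size - n), size >= n."""
    fits = [(size - n, addr) for addr, size in avail if size >= n]
    return min(fits)[1] if fits else None

# Available-space list from the example: (address, size) pairs.
avail = [(0, 50), (300, 400), (800, 200)]
print(first_fit(avail, 100))  # 300
print(best_fit(avail, 100))   # 800 (200-word region is the tightest fit)
```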

② Preemptive allocation : in non-preemptive allocation, overflow can occur; reallocation allows more efficient use of memory.
1. The blocks already in M_1 can be relocated within M_1 to make a gap large enough for the incoming block.
2. More regions can be made available by deallocating blocks → the question is how to select the blocks to be replaced.
Dirty blocks (modified blocks) : before being overwritten, they must be copied back to secondary memory → an I/O operation.
Clean blocks (unmodified blocks) : can simply be overwritten.
Compaction technique : move the occupied blocks together so that the available regions combine into a single gap.

[Figure: compaction moves K_1 and K_2 together, merging the scattered gaps into one available region.]
Adv. : eliminates the problem of selecting an available region.
Disadv. : compaction takes time.

Replacement policies (to maximize the hit ratio) : FIFO and LRU.
Optimal replacement strategy : at time t_i, determine the time t_j > t_i at which the next reference to block K will occur, then replace the block K for which (t_j − t_i) is maximum. → requires two passes through the program.

The first is a simulation run to determine the sequence S_B of virtual block addresses.

The second is the execution run, which uses the optimal sequence S_B^OPT to specify the blocks to be replaced → not practical.
FIFO : select for replacement the block least recently loaded into main memory.
LRU (Least Recently Used) : select for replacement the least recently accessed block, assuming that the least recently used block is the one least likely to be referenced in the future.

Implementation : FIFO is much simpler.
Disadvantage of FIFO : a frequently used block, such as one containing a program loop, may be replaced simply because it is the oldest block; LRU avoids replacing frequently used blocks.

Factors affecting H:
1. The type of address streams encountered.
2. The average block size.
3. The capacity of main memory.
4. The replacement policy.
These are evaluated by simulation.

Page address stream: 2 3 2 1 5 2 4 5 3 2 5 2
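The two policies can be compared on this stream with a short simulation (a minimal sketch; the choice of three page frames is an assumption, not from the source):

```python
from collections import OrderedDict

def simulate(stream: list[int], frames: int, policy: str) -> int:
    """Count misses for FIFO or LRU with the given number of page frames."""
    mem: OrderedDict[int, None] = OrderedDict()  # insertion order = age
    misses = 0
    for page in stream:
        if page in mem:
            if policy == "LRU":
                mem.move_to_end(page)    # refresh recency on a hit
        else:
            misses += 1
            if len(mem) == frames:
                mem.popitem(last=False)  # evict oldest loaded/used page
            mem[page] = None
    return misses

stream = [2, 3, 2, 1, 5, 2, 4, 5, 3, 2, 5, 2]
print("FIFO misses:", simulate(stream, 3, "FIFO"))  # 9
print("LRU misses:",  simulate(stream, 3, "LRU"))   # 7
```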

6.3 Caches
• High-speed memory. Several approaches can increase the effective P–M interface bandwidth:
1. Decrease the memory access time by using a faster technology (limited by cost).
2. Access more than one word during each memory cycle.
3. Insert a cache memory between P and M.
4. Use associative addressing in place of the random-access method.

• Cache : a small, fast memory placed between P and M. Many of the techniques developed for virtual memory management have been applied to cache systems.

In a multiprocessor system, each processor has its own cache to reduce the effective time it takes the processor to access instructions and data.

The cache stores a set of main-memory addresses A_i and the corresponding words M(A_i). A physical address A is sent from the CPU to the cache at the start of each read or write memory-access cycle. The cache compares the address tag of A to all the address tags it currently stores. If there is a match (cache hit), the cache selects M(A). If a cache miss occurs, the main-memory block P(A) containing the desired item M(A) is copied into the cache.

look-aside : the cache and the main memory are directly connected to the system bus.

look-through : faster, but more expensive. The CPU communicates with the cache via a separate bus, so the system bus is available for other units to communicate with main memory ⇒ cache accesses and main-memory accesses not involving the CPU can proceed concurrently. Only after a cache miss does the CPU send memory requests to main memory.

Two important issues in cache design:
1. How to map main-memory addresses into cache addresses.
2. How to update main memory when a write operation changes the contents of the cache.

• Updating main memory:
• write-back : a cache block into which any write operation has occurred is copied back into main memory when it is removed from the cache.

Single-processor case : M_1 is not changed until a modified cache block is removed, at which point it is copied back into main memory.
Multiprocessor case : inconsistency. [Figure: processors P_1 ··· P_k with private caches Mc_1 ··· Mc_k sharing M_1; P_1 writes into Mc_1.] Problem : several processors with independent caches may see different values for the same address.

• write-through : transfer the data word to both the cache and main memory during each write cycle, even when the target address is already assigned to the cache → more writes to main memory than write-back.

6.3.2 Address Mapping
When an address tag is presented to the cache, it must be quickly compared to the stored tags.

Scanning all tags in sequence : unacceptably slow.
The fastest technique : associative (or content) addressing, which compares all tags simultaneously.

Associative addressing : any stored item can be accessed by using the contents of the item in question as an address. Associative memory = content-addressable memory (CAM). Items in an associative memory have a two-field format:
Key (stored address) | Data (information to be accessed)
An associative cache uses the tag as the key.

the incoming tag is compared simultaneously to all tags stored in the cache’s tag memory.

Associative memory : any subfield of the word can be the key; the subfield is specified by a mask register. Since all words in the memory must compare their keys with the input key simultaneously, each word needs its own match circuit ⇒ much more complex and expensive than conventional memories, but VLSI techniques have made CAM economically feasible. All words share a common set of data and mask lines for each bit position, enabling the simultaneous comparisons.
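A bit-level sketch of a masked associative match (purely illustrative; the word width, stored values, and function name are assumptions):

```python
def cam_match(words: list[int], key: int, mask: int) -> list[int]:
    """Return indices of all words whose masked bits equal the masked key.
    Every word is compared in the same pass, mimicking the parallel match
    circuits of a CAM; mask bit 1 = compare, 0 = don't care."""
    return [i for i, w in enumerate(words) if (w ^ key) & mask == 0]

words = [0b1011_0001, 0b1011_1111, 0b0001_0001]
# Use the high nibble as the key field, as a mask register would specify.
print(cam_match(words, key=0b1011_0000, mask=0b1111_0000))  # [0, 1]
```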

Direct mapping : a simpler address mapping for caches.
Simple implementation : the low-order s bits of each block address form a set address that selects a unique cache location.
Main drawback : if two or more frequently used blocks happen to map onto the same region of the cache, the hit ratio drops sharply.

Set-associative mapping : a combination of associative and direct mapping.
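A sketch of how an address decomposes under these mappings (the field widths and cache geometry are illustrative assumptions, not from the source):

```python
def split_cache_address(addr: int, set_bits: int, offset_bits: int):
    """Decompose a physical address into (tag, set index, byte offset).
    Direct mapping is the special case of one block per set."""
    offset = addr & ((1 << offset_bits) - 1)
    set_index = (addr >> offset_bits) & ((1 << set_bits) - 1)
    tag = addr >> (offset_bits + set_bits)
    return tag, set_index, offset

# Assumed geometry: 64 sets (6 bits), 16-byte blocks (4 bits).
tag, s, off = split_cache_address(0x1234_5678, set_bits=6, offset_bits=4)
print(hex(tag), s, off)  # tag is compared associatively within the selected set
```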

6.3.3 Structure vs. Performance
Cache types : I-cache and D-cache, reflecting their different access patterns. Programs involve few write accesses and have more temporal and spatial locality than the data they process.

Two or more cache levels are used in high-performance systems, made practical by the feasibility of including part of the real memory space on a microprocessor chip and by the growth in the size of main memory.

L1 cache : on-chip memory. L2 cache : off-chip memory. The desirability of an L2 cache increases with the size of main memory, assuming the L1 cache has a fixed size.

Performance

t_A = t_A1 + (1 − H)·t_B

where t_A : average access time, t_A1 : cache access time, t_A2 : M_2 access time, t_B : block transfer time from M_2 to M_1. With a sufficiently wide M_2-to-M_1 data bus, a block can be loaded into the cache in a single M_2 read operation, so t_B = t_A2 and

t_A = t_A1 + (1 − H)·t_A2

Suppose that M_2 is six times slower than M_1 (t_A2 = 6·t_A1).
For H = 99% : t_A = 1.06·t_A1. For H = 95% : t_A = 1.30·t_A1.
A small decrease in the cache's hit ratio H has a disproportionately large impact on performance.
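A one-line check of these figures (sketch only; the function name is illustrative):

```python
def cache_slowdown(h: float, ratio: float) -> float:
    """t_A / t_A1 = 1 + (1 - H) * (t_A2 / t_A1), from t_A = t_A1 + (1-H)*t_A2."""
    return 1.0 + (1.0 - h) * ratio

print(cache_slowdown(0.99, 6))  # 1.06
print(cache_slowdown(0.95, 6))  # 1.30 -- a small hit-ratio loss is costly
```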

A general approach to the design of the cache's main size parameters, S_1 (number of sets), K (number of blocks per set), and P_1 (number of bytes per block):
1. Select a block (line) size P_1. This value is typically the same as the width w of the data path between the CPU and main memory, or a small multiple of w.

2. Select programs representative of the expected workloads and estimate the number of address references to be simulated. Particular care should be taken to ensure that the cache is initially filled before H is measured.

3. Simulate the possible designs for each set size S_1 and associativity degree K of acceptable cost. Methods similar to stack processing (Section 6.2.3) can be used to simulate several cache configurations in a single pass.

4. Plot the resulting data and determine a satisfactory trade-off between performance and cost.

In many cases, doubling the cache size from S_1 to 2S_1 reduces the miss ratio (1 − H) by about 30%.