18-447: Computer Architecture Lecture 20: Memory Scheduling and Virtual Memory
Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/4/2012
Reminder: Homeworks
Homework 6 Out today Due April 16 Topics: Main memory, caches, virtual memory 2
Reminder: Lab Assignments
Lab Assignment 5 Implementing caches and branch prediction in a high-level timing simulator of a pipelined processor Due April 6 (this Friday!) Extra credit: Cache exploration and high performance with optimized caches 3
Reminder: Midterm II
Next week April 11 Everything covered in the course can be on the exam You can bring in two cheat sheets (8.5x11’’) 4
Review of Last Lecture
DRAM subsystem Page mode Cell, Row/column, Bank, Chip, Rank, Module/DIMM, Channel Address mapping Refresh DRAM controller Scheduling, row buffer management, power management Power management basics Static vs. dynamic power Principles Bloom filters Compact and approximate way of representing set membership Allows easy way of testing set membership 5
Bloom Filters
Bloom, "Space/time trade-offs in hash coding with allowable errors," CACM 1970.
Seshadri et al., "The Evicted-Address Filter: A Unified Mechanism to Address Both Cache Pollution and Thrashing," CMU Technical Report, 2012.
Hardware Implementation of Bloom Filters
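To make the set-membership idea above concrete, here is a minimal software sketch of a Bloom filter, assuming an arbitrary 256-bit filter and two simple hash functions chosen only for illustration (a hardware version would instead use bit cells and parallel hash logic):

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define FILTER_BITS 256   /* arbitrary example size */

typedef struct { uint8_t bits[FILTER_BITS / 8]; } bloom_t;

/* two simple, arbitrary hash functions for illustration */
static unsigned hash1(uint64_t key) { return (unsigned)((key * 2654435761u) % FILTER_BITS); }
static unsigned hash2(uint64_t key) { return (unsigned)(((key >> 7) ^ (key * 40503u)) % FILTER_BITS); }

static void set_bit(bloom_t *f, unsigned i) { f->bits[i / 8] |= (uint8_t)(1u << (i % 8)); }
static bool get_bit(const bloom_t *f, unsigned i) { return (f->bits[i / 8] >> (i % 8)) & 1u; }

void bloom_clear(bloom_t *f) { memset(f->bits, 0, sizeof f->bits); }   /* only way to remove elements */

void bloom_insert(bloom_t *f, uint64_t key) { set_bit(f, hash1(key)); set_bit(f, hash2(key)); }

/* false means "definitely not in the set"; true means "possibly in the set" (false positives allowed) */
bool bloom_maybe_contains(const bloom_t *f, uint64_t key) {
    return get_bit(f, hash1(key)) && get_bit(f, hash2(key));
}

A query that returns false means the element is definitely not in the set; true means it may be, with some false-positive probability. Elements can only be removed by clearing the whole filter.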
The DRAM Subsystem: A Quick Review
DRAM Subsystem Organization
Channel → DIMM → Rank → Chip → Bank → Row/Column
Interaction with Virtual → Physical Mapping
The Operating System influences where an address maps to in DRAM.
VA = Virtual page number (52 bits) | Page offset (12 bits)
PA = Physical frame number (19 bits) | Page offset (12 bits)
PA, as interpreted by the memory controller = Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
The operating system can therefore control which bank a virtual page is mapped to; it can even randomize the page-to-bank mapping.
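As a sketch of how a controller could pull the DRAM fields out of a physical address with the bit widths shown above (14-bit row, 3-bit bank, 11-bit column, 3-bit byte-in-bus), assuming this particular field order; real controllers use many different, often XOR-based, mappings:

#include <stdint.h>
#include <stdio.h>

/* Example mapping matching the slide's bit widths:
 * PA[30:17] = row (14 bits), PA[16:14] = bank (3 bits),
 * PA[13:3]  = column (11 bits), PA[2:0] = byte in bus (3 bits). */
typedef struct { unsigned row, bank, col, byte_in_bus; } dram_addr_t;

dram_addr_t decode_pa(uint32_t pa) {
    dram_addr_t d;
    d.byte_in_bus = pa & 0x7;             /* 3 bits  */
    d.col         = (pa >> 3)  & 0x7FF;   /* 11 bits */
    d.bank        = (pa >> 14) & 0x7;     /* 3 bits  */
    d.row         = (pa >> 17) & 0x3FFF;  /* 14 bits */
    return d;
}

int main(void) {
    dram_addr_t d = decode_pa(0x12345678u & 0x7FFFFFFFu);   /* 31-bit physical address */
    printf("row=%u bank=%u col=%u byte=%u\n", d.row, d.bank, d.col, d.byte_in_bus);
    return 0;
}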
DRAM Controller
Purpose and functions Ensure correct operation of DRAM (refresh and timing) Service DRAM requests while obeying timing constraints of DRAM chips Constraints: resource conflicts (bank, bus, channel), minimum write-to-read delays Translate requests to DRAM command sequences Buffer and schedule requests to improve performance Reordering and row-buffer management Manage power consumption and thermals in DRAM Turn on/off DRAM chips, manage power modes 11
DRAM Scheduling Policies (I)
FCFS (first come first served): oldest request first.
FR-FCFS (first ready, first come first served): 1. Row-hit first, 2. Oldest first.
Goal: Maximize row buffer hit rate → maximize DRAM throughput.
Actually, scheduling is done at the command level: column commands (read/write) are prioritized over row commands (activate/precharge); within each group, older commands are prioritized over younger ones.
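A minimal sketch of the FR-FCFS priority rule just described (row-hit first, then oldest first); the request structure and the open-row bookkeeping are invented here purely for illustration:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t arrival_time;   /* lower = older */
    int      bank;
    int      row;
} request_t;

/* true if request a should be scheduled before request b under FR-FCFS */
bool frfcfs_higher_priority(const request_t *a, const request_t *b,
                            const int open_row_of_bank[]) {
    bool a_hit = (open_row_of_bank[a->bank] == a->row);
    bool b_hit = (open_row_of_bank[b->bank] == b->row);
    if (a_hit != b_hit)
        return a_hit;                              /* rule 1: row-hit first */
    return a->arrival_time < b->arrival_time;      /* rule 2: oldest first  */
}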
Row Buffer Management Policies
Open row: keep the row open after an access.
+ Next access might need the same row → row hit
-- Next access might need a different row → row conflict, wasted energy
Closed row: close the row after an access (if no other requests already in the request buffer need the same row).
+ Next access might need a different row → avoid a row conflict
-- Next access might need the same row → extra activate latency
Adaptive policies: predict whether or not the next access to the bank will be to the same row.
Why are DRAM Controllers Difficult to Design?
Need to obey DRAM timing constraints for correctness There are many (50+) timing constraints in DRAM tWTR: Minimum number of cycles to wait before issuing a read command after a write command is issued tRC: Minimum number of cycles between the issuing of two consecutive activate commands to the same bank … Need to keep track of many resources to prevent conflicts Channels, banks, ranks, data bus, address bus, row buffers Need to handle DRAM refresh Need to optimize for performance (in the presence of constraints) Reordering is not simple Predicting the future?
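To make the timing-constraint bookkeeping concrete, here is a hedged sketch of how a controller model might gate command issue on tWTR and tRC; the cycle values and the state tracked are illustrative assumptions, not taken from any specific DRAM datasheet:

#include <stdbool.h>
#include <stdint.h>

#define NUM_BANKS 8
#define T_WTR 6    /* example write-to-read delay, in controller cycles */
#define T_RC  39   /* example ACT-to-ACT delay for the same bank        */

typedef struct {
    uint64_t last_write_end;             /* cycle the last write's data burst ended */
    uint64_t last_activate[NUM_BANKS];   /* cycle of the last ACTIVATE per bank     */
} timing_state_t;

bool can_issue_read(const timing_state_t *s, uint64_t now) {
    return now >= s->last_write_end + T_WTR;        /* respect the write-to-read delay */
}

bool can_issue_activate(const timing_state_t *s, int bank, uint64_t now) {
    return now >= s->last_activate[bank] + T_RC;    /* respect ACT-to-ACT delay in the same bank */
}

A real controller tracks dozens of such constraints (per bank, per rank, per channel) and may only issue a command in a cycle when every applicable check passes.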
DRAM Request Scheduling (and Interference) in Multi-Core Systems
Scheduling Policy for Single-Core Systems
FR-FCFS (first ready, first come first served): 1. Row-hit first, 2. Oldest first. Goal: Maximize row buffer hit rate → maximize DRAM throughput. Is this a good policy in a multi-core system?
Uncontrolled Interference: An Example
[Figure: a multi-core chip in which each core has a private L2 cache; the cores share, via an interconnect, a DRAM memory controller and the shared DRAM memory system (DRAM Banks 0-3). Interference in this shared DRAM system causes unfairness.]
A Memory Performance Hog
STREAM:
// initialize large arrays A, B
for (j = 0; j < N; j++) {
    index = j * linesize;   // streaming (sequential) access
    A[index] = B[index];
    …
}
RANDOM:
// initialize large arrays A, B
for (j = 0; j < N; j++) {
    index = rand();         // random access
    A[index] = B[index];
    …
}
STREAM: sequential memory access, very high row buffer locality (96% hit rate), memory intensive.
RANDOM: random memory access, very low row buffer locality (3% hit rate), similarly memory intensive.
Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007.
What Does the Memory Hog Do?
[Figure: memory request buffer and row buffer of one bank. T0 (STREAM) keeps accessing the same row, so its requests are all row hits; T1's (RANDOM) request to a different row is a row conflict.]
Because of row-hit-first scheduling, up to 128 (= 8KB row / 64B cache line) requests of T0 are serviced before T1's.
Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007.
Effect of the Memory Performance Hog
STREAM: 1.18X slowdown. RANDOM: 2.82X slowdown.
Results on an Intel Pentium D running Windows XP (similar results for Intel Core Duo and AMD Turion, and on Fedora Linux).
Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007.
Problems due to Uncontrolled Interference
Main memory is the only shared resource. [Figure: the memory hog's requests are effectively treated as high priority, while the other cores make very slow progress.]
Unfair slowdown of different threads
Low system performance
Vulnerability to denial of service
Priority inversion: unable to enforce priorities/SLAs
Poor performance predictability (no performance isolation)
Inter-Thread Interference in DRAM
Memory controllers, pins, and memory banks are shared.
Pin bandwidth is not increasing as fast as the number of cores → bandwidth per core is decreasing.
Different threads executing on different cores interfere with each other in the main memory system.
Threads delay each other by causing resource contention: bank, bus, row-buffer conflicts → reduced DRAM throughput.
Threads can also destroy each other's DRAM bank parallelism: otherwise-parallel requests can become serialized.
Effects of Inter-Thread Interference in DRAM
Queueing/contention delays: bank conflict, bus conflict, channel conflict, …
Additional delays due to DRAM constraints, called "protocol overhead". Examples: row conflicts, read-to-write and write-to-read delays.
Loss of intra-thread parallelism.
Inter-Thread Interference in DRAM
Existing DRAM controllers are unaware of inter-thread interference in the DRAM system. They simply aim to maximize DRAM throughput: thread-unaware and thread-unfair, with no intent to service each thread's requests in parallel. The FR-FCFS policy (1) row-hit first, 2) oldest first) unfairly prioritizes threads with high row-buffer locality.
QoS-Aware Memory Request Scheduling
[Figure: multiple cores send requests to a shared memory controller, which resolves memory contention by scheduling requests to memory.]
How should requests be scheduled to provide high system performance, high fairness to applications, and configurability to system software? The memory controller needs to be aware of threads.
How Do We Solve the Problem?
Stall-time fair memory scheduling [Mutlu+ MICRO'07]
Goal: Threads sharing main memory should experience similar slowdowns compared to when they are run alone → fair scheduling. This also improves overall system performance by ensuring cores make "proportional" progress.
Idea: The memory controller estimates each thread's slowdown due to interference and schedules requests in a way that balances the slowdowns.
Mutlu and Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors," MICRO 2007.
Stall-Time Fairness in Shared DRAM Systems
A DRAM system is fair if it equalizes the slowdown of equal-priority threads relative to when each thread is run alone on the same system.
DRAM-related stall-time: the time a thread spends waiting for DRAM memory.
ST_shared: DRAM-related stall-time when the thread runs with other threads.
ST_alone: DRAM-related stall-time when the thread runs alone.
Memory-slowdown = ST_shared / ST_alone (the relative increase in stall-time).
The Stall-Time Fair Memory scheduler (STFM) aims to equalize Memory-slowdown for interfering threads without sacrificing performance. It considers the inherent DRAM performance of each thread and aims to allow proportional progress of threads.
STFM Scheduling Algorithm
[MICRO'07] For each thread, the DRAM controller:
Tracks ST_shared
Estimates ST_alone
Each cycle, the DRAM controller:
Computes Slowdown = ST_shared / ST_alone for threads with legal requests
Computes unfairness = MAX Slowdown / MIN Slowdown
If unfairness < α: use the DRAM-throughput-oriented scheduling policy
If unfairness ≥ α: use the fairness-oriented scheduling policy: (1) requests from the thread with MAX Slowdown first, (2) row-hit first, (3) oldest first
(α is the unfairness threshold.)
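A rough sketch of the per-cycle STFM decision described above; the slowdown-estimation machinery from the MICRO 2007 paper is abstracted into placeholder fields, and alpha is the (tunable) unfairness threshold:

typedef struct {
    double st_shared;          /* tracked DRAM stall time with interference */
    double st_alone;           /* estimated DRAM stall time if run alone    */
    int    has_legal_request;
} thread_state_t;

/* Returns the thread whose requests should be prioritized this cycle,
 * or -1 to fall back to the throughput-oriented (row-hit first) policy. */
int stfm_pick_prioritized_thread(const thread_state_t *t, int n, double alpha) {
    double max_sd = 0.0, min_sd = 1e30;
    int max_thread = -1;
    for (int i = 0; i < n; i++) {
        if (!t[i].has_legal_request) continue;
        double sd = (t[i].st_alone > 0.0) ? t[i].st_shared / t[i].st_alone : 1.0;
        if (sd > max_sd) { max_sd = sd; max_thread = i; }
        if (sd < min_sd) { min_sd = sd; }
    }
    if (max_thread < 0) return -1;        /* no legal requests this cycle */
    double unfairness = max_sd / min_sd;
    return (unfairness >= alpha) ? max_thread : -1;
}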
How Does STFM Prevent Unfairness?
[Figure: an example with requests T0: Row 0, T1: Row 5, T1: Row 111 in the request buffer; STFM tracks each thread's slowdown and the resulting unfairness (e.g., 1.05) as it schedules requests.]
Another Problem due to Interference
Processors try to tolerate the latency of DRAM requests by generating multiple outstanding requests: Memory-Level Parallelism (MLP), via out-of-order execution, non-blocking caches, runahead execution.
This is effective only if the DRAM controller actually services the multiple requests in parallel in DRAM banks.
Multiple threads share the DRAM controller. DRAM controllers are not aware of a thread's MLP: they can service each thread's outstanding requests serially, not in parallel.
Bank Parallelism of a Thread
Single thread: Thread A issues 2 DRAM requests, to Bank 0 (Row 1) and Bank 1 (Row 1), then stalls until the data returns. The bank access latencies of the two requests are overlapped, so the thread stalls for ~ONE bank access latency.
Bank Parallelism Interference in DRAM
Baseline scheduler: Thread A issues 2 DRAM requests (Bank 0, Row 1 and Bank 1, Row 1); Thread B issues 2 DRAM requests (Bank 1, Row 99 and Bank 0, Row 99). A thread-unaware scheduler interleaves the threads' requests, so the bank access latencies of each thread are serialized and each thread stalls for ~TWO bank access latencies.
Parallelism-Aware Scheduler
Baseline scheduler: each thread's 2 DRAM requests are serviced in an order that serializes that thread's bank accesses, so both threads stall for ~TWO bank access latencies.
Parallelism-aware scheduler: Thread A's 2 requests (Bank 0 and Bank 1) are serviced back to back, then Thread B's. Thread A's stall time shrinks to ~one bank access latency while Thread B's stays about the same, saving cycles overall.
Average stall-time: ~1.5 bank access latencies.
Parallelism-Aware Batch Scheduling (PAR-BS)
Principle 1: Parallelism-awareness. Schedule requests from a thread (to different banks) back to back. This preserves each thread's bank parallelism. But it can cause starvation…
Principle 2: Request batching. Group a fixed number of oldest requests from each thread into a "batch". Service the batch before all other requests; form a new batch when the current one is done. This eliminates starvation and provides fairness, while allowing parallelism-awareness within a batch.
Mutlu and Moscibroda, "Parallelism-Aware Batch Scheduling," ISCA 2008.
[Figure: requests from threads T0-T3 queued at Bank 0 and Bank 1, with the oldest requests per thread grouped into a batch.]
Request Batching
Each memory request has a bit (marked) associated with it.
Batch formation: mark up to Marking-Cap oldest requests per bank for each thread; the marked requests constitute the batch; form a new batch when no marked requests are left.
Marked requests are prioritized over unmarked ones. No reordering of requests across batches: no starvation, high fairness.
How to prioritize requests within a batch?
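Before looking at within-batch prioritization, here is a small sketch of the batch-formation (marking) step just described; the request-buffer representation is an assumption made for illustration:

#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 16
#define MAX_BANKS   8

typedef struct {
    uint64_t arrival_time;
    int      thread_id;
    int      bank;
    bool     marked;     /* set when the request joins the current batch */
} request_t;

/* Mark up to `marking_cap` oldest requests per (thread, bank) pair.
 * Assumes the request buffer is already sorted oldest-first. */
void form_batch(request_t *buf, int n, int marking_cap) {
    int count[MAX_THREADS][MAX_BANKS] = {0};
    for (int i = 0; i < n; i++) {
        request_t *r = &buf[i];
        if (count[r->thread_id][r->bank] < marking_cap) {
            r->marked = true;
            count[r->thread_id][r->bank]++;
        } else {
            r->marked = false;   /* will wait for a later batch */
        }
    }
}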
Within-Batch Scheduling
Can use any existing DRAM scheduling policy. FR-FCFS (row-hit first, then oldest-first) exploits row-buffer locality. But we also want to preserve intra-thread bank parallelism: service each thread's requests back to back.
HOW?
Scheduler computes a ranking of threads when the batch is formed Higher-ranked threads are prioritized over lower-ranked ones Improves the likelihood that requests from a thread are serviced in parallel by different banks Different threads prioritized in the same order across ALL banks 37
How to Rank Threads within a Batch
Ranking scheme affects system throughput and fairness.
Maximize system throughput: minimize the average stall-time of threads within the batch.
Minimize unfairness (equalize the slowdown of threads): service threads with inherently low stall-time early in the batch. Insight: delaying memory non-intensive threads results in high slowdown.
Shortest stall-time first (shortest job first) ranking: provides optimal system throughput [Smith, 1956]*. The controller estimates each thread's stall-time within the batch and ranks threads with shorter stall-time higher.
* W.E. Smith, "Various optimizers for single stage production," Naval Research Logistics Quarterly, 1956.
Shortest Stall-Time First Ranking
Maximum number of marked requests to any bank (max-bank-load): rank the thread with lower max-bank-load higher (~ low stall-time).
Total number of marked requests (total-load): breaks ties; rank the thread with lower total-load higher.
Example (marked requests queued at Banks 0-3):
T0: max-bank-load 1, total-load 3
T1: max-bank-load 2, total-load 4
T2: max-bank-load 2, total-load 6
T3: max-bank-load 5, total-load 9
Ranking: T0 > T1 > T2 > T3
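A sketch of how this ranking could be computed from the marked requests, with max-bank-load as the primary key and total-load as the tie-breaker; the data layout is invented for illustration. Applied to the example loads above, it reproduces the ranking T0 > T1 > T2 > T3:

#include <stdlib.h>

#define MAX_THREADS 16
#define MAX_BANKS   8

typedef struct {
    int id;
    int max_bank_load;   /* max # of marked requests to any single bank */
    int total_load;      /* total # of marked requests                  */
} thread_load_t;

/* lower max-bank-load first; ties broken by lower total-load */
static int cmp_rank(const void *a, const void *b) {
    const thread_load_t *x = a, *y = b;
    if (x->max_bank_load != y->max_bank_load)
        return x->max_bank_load - y->max_bank_load;
    return x->total_load - y->total_load;
}

/* marked[t][b] = number of marked requests thread t has to bank b */
void rank_threads(const int marked[MAX_THREADS][MAX_BANKS], int n_threads,
                  thread_load_t out[]) {
    for (int t = 0; t < n_threads; t++) {
        out[t].id = t;
        out[t].max_bank_load = 0;
        out[t].total_load = 0;
        for (int b = 0; b < MAX_BANKS; b++) {
            out[t].total_load += marked[t][b];
            if (marked[t][b] > out[t].max_bank_load)
                out[t].max_bank_load = marked[t][b];
        }
    }
    qsort(out, (size_t)n_threads, sizeof out[0], cmp_rank);   /* out[0] = highest rank */
}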
Example Within-Batch Scheduling Order
Baseline scheduling order (arrival order): requests are serviced across Banks 0-3 in the order they arrived, interleaving the threads.
Stall times: T0 = 4, T1 = 4, T2 = 5, T3 = 7 → AVG: 5 bank access latencies.
PAR-BS scheduling order (ranking: T0 > T1 > T2 > T3): within the batch, each higher-ranked thread's requests are serviced first across all banks.
Stall times: T0 = 1, T1 = 2, T2 = 4, T3 = 7 → AVG: 3.5 bank access latencies.
Putting It Together: PAR-BS Scheduling Policy
PAR-BS scheduling policy: (1) marked requests first, (2) row-hit requests first, (3) higher-rank thread first (shortest stall-time first), (4) oldest first. Rule (1) comes from batching; rules (2)-(4) implement parallelism-aware within-batch scheduling.
Three properties: exploits row-buffer locality and intra-thread bank parallelism; work-conserving (services unmarked requests to banks without marked requests); Marking-Cap is important (too small a cap destroys row-buffer locality, too large a cap penalizes memory non-intensive threads).
Mutlu and Moscibroda, "Parallelism-Aware Batch Scheduling," ISCA 2008.
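The four rules fold naturally into a single request comparator, sketched below; the request fields (marked bit, row-hit status, thread rank, arrival time) stand in for controller state and are assumptions of this sketch:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     marked;         /* belongs to the current batch                      */
    bool     row_hit;        /* targets the bank's currently open row             */
    int      thread_rank;    /* smaller = higher rank (shortest stall-time first) */
    uint64_t arrival_time;   /* smaller = older                                   */
} parbs_req_t;

/* true if a should be serviced before b under PAR-BS */
bool parbs_higher_priority(const parbs_req_t *a, const parbs_req_t *b) {
    if (a->marked != b->marked)           return a->marked;                       /* (1) marked first  */
    if (a->row_hit != b->row_hit)         return a->row_hit;                      /* (2) row-hit first */
    if (a->thread_rank != b->thread_rank) return a->thread_rank < b->thread_rank; /* (3) rank          */
    return a->arrival_time < b->arrival_time;                                     /* (4) oldest first  */
}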
Unfairness on 4-, 8-, 16-core Systems
Unfairness = MAX memory slowdown / MIN memory slowdown [MICRO 2007].
[Bar chart: unfairness (y-axis, 1 to 5) of FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-core, 8-core, and 16-core systems.]
System Performance
[Bar chart: normalized system performance of FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-core, 8-core, and 16-core systems.]
Another Way of Reducing Interference
Idea: Map badly-interfering applications' pages to different channels [Muralidhara+, MICRO'11].
[Figure: conventional page mapping spreads Core 0 (App A) and Core 1 (App B) data across the banks of both Channel 0 and Channel 1; channel partitioning instead places App A's pages in Channel 0 and App B's pages in Channel 1.]
Separate the data of low/high intensity and low/high row-locality applications. Especially effective in reducing interference between threads with "medium" and "heavy" memory intensity.
Yet Another Way: Core/Request Throttling
Idea: Estimate the slowdown due to (DRAM) interference and throttle down threads that slow down others.
Ebrahimi et al., "Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems," ASPLOS 2010.
Advantages + Core/request throttling is easy to implement: no need to change scheduling algorithm + Can be a general way of handling shared resource contention Disadvantages - Requires interference/slowdown estimations - Thresholds can become difficult to optimize 45
Handling Interference in Parallel Applications
Threads in a multithreaded application are inter-dependent Some threads can be on the critical path of execution due to synchronization; some threads are not How do we schedule requests of inter-dependent threads to maximize multithreaded application performance?
Idea: Estimate limiter threads likely to be on the critical path and prioritize their requests; shuffle priorities of non-limiter threads to reduce memory interference among them [Ebrahimi+, MICRO’11] Hardware/software cooperative limiter thread estimation: Thread executing the most contended critical section Thread that is falling behind the most in a parallel for loop 46
Memory System is the Major Shared Resource
All threads' requests interfere in the shared main memory system.
Inter-Thread/Application Interference
Problem: Threads share the memory system, but memory system does not distinguish between threads’ requests Existing memory systems Free-for-all, shared based on demand Control algorithms thread-unaware and thread-unfair Aggressive threads can deny service to others Do not try to reduce or control inter-thread interference 48
How Do We Solve The Problem?
Inter-thread interference is uncontrolled in all memory resources Memory controller Interconnect Caches We need to control it i.e., design an interference-aware (QoS-aware) memory system 49
Virtual Memory
Roadmap
Virtual Memory Purpose: illusion of a large memory and protection Simplified memory management for multiple processes Demand paging, page faults Address Translation TLB Integrating Caches and Virtual Memory Physically indexed caches Virtually indexed caches Virtually indexed, physically tagged caches Synonym/aliasing problem and solutions 51
Readings
Section 5.4 in P&H Optional: Section 8.8 in Hamacher et al.
Ideal Memory
Zero access time (latency) Infinite capacity Zero cost Infinite bandwidth (to support multiple accesses in parallel) 53
A Modern Memory Hierarchy
Register file: 32 words, sub-nsec (manual/compiler register spilling)
L1 cache: ~32 KB, ~nsec
L2 cache: 512 KB to 1 MB, many nsec
L3 cache, … (automatic HW cache management)
Main memory (DRAM): GBs, ~100 nsec
Swap disk: ~100 GB, ~10 msec (automatic demand paging)
The Problem
Physical memory is of limited size (cost) What if you need more?
Should the programmer be concerned about the size of code/data blocks fitting physical memory? (as in overlay programming and in some embedded systems) Should the programmer manage data movement from disk to physical memory?
Also, ISA can have an address space greater than the physical memory size E.g., a 64-bit address space with byte addressability What if you do not have enough physical memory?
Virtual Memory
Idea: Give the programmer the illusion of a large address space So that he/she does not worry about running out of memory Programmer can assume he/she has “infinite” amount of physical memory Really, it is the amount specified by the address space for a program Hardware and software cooperatively provide the illusion even though physical memory is not infinite Illusion is maintained for each independent process 56
Basic Mechanism
Indirection: the address generated by each instruction in a program is a "virtual address", i.e., it is not the physical address used to address main memory (it is called a "linear address" in x86). An "address translation" mechanism maps this address to a "physical address" (called a "real address" in x86). The address translation mechanism is implemented in hardware and software together.
Virtual Pages, Physical Frames
The virtual address space is divided into pages; the physical address space is divided into frames. A virtual page is mapped to a physical frame, assuming the page is in memory. If an accessed virtual page is not in memory but on disk, the virtual memory system brings the page into a physical frame and adjusts the mapping → demand paging. The page table stores the mapping of virtual pages to physical frames.
A System with Physical Memory Only
Examples: most Cray machines, early PCs, nearly all embedded systems.
[Figure: the CPU issues physical addresses 0 to N-1 directly to memory.]
The CPU's load or store addresses are used directly to access memory.
A System with Virtual Memory (page-based)
Examples: laptops, servers, modern PCs.
[Figure: the CPU issues virtual addresses 0 to N-1; a page table maps them to physical addresses 0 to P-1 in memory, or to locations on disk.]
Address translation: the hardware converts virtual addresses into physical addresses via an OS-managed lookup table (the page table).
Page Fault (“A miss in physical memory”)
What if object is on disk rather than in memory?
The page table entry indicates the virtual page is not in memory → page fault exception → the OS trap handler is invoked to move data from disk into memory. The current process suspends; others can resume. The OS has full control over placement.
[Figure: before the fault, the page table maps the virtual address to disk; after the fault, the page is in memory and the page table maps the virtual address to a physical address.]
Servicing a Page Fault
(1) The processor signals the I/O controller: read a block of length P starting at disk address X and store it starting at memory address Y.
(2) The read occurs via Direct Memory Access (DMA), under the control of the I/O controller.
(3) The controller signals completion: it interrupts the processor, and the OS resumes the suspended process.
Page Table is Per Process
Each process has its own virtual address space Full address space for each program Simplifies memory allocation, sharing, linking and loading.
[Figure: Process 1 and Process 2 each have their own virtual address space (VP 1, VP 2, …, addresses 0 to N-1); address translation maps their virtual pages to distinct physical pages (e.g., PP 2, PP 7, PP 10) in DRAM, and a page such as read-only library code can be shared by mapping it into both address spaces.]
Address Translation
Page size specified by the ISA VAX: 512 bytes Today: 4KB, 8KB, 2GB, … (small and large pages mixed together) Page Table contains an entry for each virtual page Called Page Table Entry (PTE) What is in a PTE?
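As one possible answer to "what is in a PTE?", here is an illustrative layout for 4KB pages and a 32-bit physical address space; the exact fields and widths differ across ISAs, so nothing below is any particular architecture's format:

#include <stdint.h>

/* One illustrative page table entry layout: 4KB pages,
 * 32-bit physical addresses (20-bit physical frame number). */
typedef struct {
    uint32_t valid      : 1;   /* page is present in physical memory            */
    uint32_t dirty      : 1;   /* page has been written since it was loaded     */
    uint32_t referenced : 1;   /* page was recently accessed (for replacement)  */
    uint32_t read       : 1;   /* access-control / protection bits              */
    uint32_t write      : 1;
    uint32_t execute    : 1;
    uint32_t user       : 1;   /* accessible without supervisor privilege       */
    uint32_t pfn        : 20;  /* physical frame number                         */
    uint32_t reserved   : 5;
} pte_t;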
Address Translation
Page Table Entry
We did not cover the following slides in lecture. These are for your preparation for the next lecture.
VM Address Translation
Parameters: P = 2^p = page size (bytes); N = 2^n = virtual-address limit; M = 2^m = physical-address limit.
Virtual address (n bits) = virtual page number (bits n-1 … p) | page offset (bits p-1 … 0).
Address translation maps it to the physical address (m bits) = physical page number (bits m-1 … p) | page offset (bits p-1 … 0).
The page offset bits don't change as a result of translation.
VM Address Translation
Separate (set of) page table(s) per process.
The VPN forms an index into the page table (it points to a page table entry).
The Page Table Entry (PTE) provides information about the page: a valid bit, access-control bits, and the physical page number (PPN). If valid = 0, the page is not in memory (page fault).
[Figure: the page table base register plus the VPN select a PTE; the PPN from the PTE is concatenated with the unchanged page offset to form the physical address.]
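A sketch of this single-level translation, assuming p = 12 (4KB pages) and a flat page table indexed by VPN; page faults are reduced to returning a sentinel that the OS handler would deal with:

#include <stdint.h>

#define PAGE_SHIFT 12                          /* p = 12 -> 4KB pages */
#define PAGE_OFFSET_MASK ((1u << PAGE_SHIFT) - 1)
#define PA_FAULT ((uint64_t)-1)                /* sentinel: page fault, OS must handle */

typedef struct {
    uint32_t valid;
    uint32_t ppn;   /* physical page number */
} pte_t;

/* page_table plays the role of the per-process table pointed to by the PTBR */
uint64_t translate(const pte_t *page_table, uint64_t va) {
    uint64_t vpn    = va >> PAGE_SHIFT;        /* VPN indexes the page table    */
    uint64_t offset = va & PAGE_OFFSET_MASK;   /* offset is copied untranslated */
    const pte_t *pte = &page_table[vpn];
    if (!pte->valid)
        return PA_FAULT;                       /* page not in memory -> page fault */
    return ((uint64_t)pte->ppn << PAGE_SHIFT) | offset;
}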
VM Address Translation: Page Hit
VM Address Translation: Page Fault
Page-Level Access Control (Protection)
Not every process is allowed to access every page E.g., may need supervisor level privilege to access system pages Idea: Store access control information on a page basis in the process’s page table Enforce access control at the same time as translation Virtual memory system serves two functions today Address translation (for illusion of large physical memory) Access control (protection) 72
Issues (I)
How large is the page table?
Where do we store it? In hardware?
In physical memory? (Where is the PTBR?) In virtual memory? (Where is the PTBR?) How can we store it efficiently without requiring physical memory that can store all page tables?
Idea: multi-level page tables Only the first-level page table has to be in physical memory Remaining levels are in virtual memory (but get cached in physical memory when accessed) 73
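A sketch of the multi-level idea for a two-level table with 4KB pages and a 32-bit virtual address (10-bit first-level index, 10-bit second-level index, 12-bit offset); real page-table formats differ, and fault handling is again just a sentinel:

#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define OFFSET_MASK ((1u << PAGE_SHIFT) - 1)
#define L2_BITS 10
#define PA_FAULT ((uint32_t)-1)

typedef struct { uint32_t valid; uint32_t ppn; } pte_t;
typedef struct { uint32_t valid; pte_t *second_level; } pde_t;   /* first-level entry */

/* Only the first-level table (the pde array) must stay resident;
 * second-level tables can themselves be paged in on demand. */
uint32_t translate2(const pde_t *first_level, uint32_t va) {
    uint32_t l1  = va >> (PAGE_SHIFT + L2_BITS);                /* top 10 bits  */
    uint32_t l2  = (va >> PAGE_SHIFT) & ((1u << L2_BITS) - 1);  /* next 10 bits */
    uint32_t off = va & OFFSET_MASK;                            /* low 12 bits  */

    if (!first_level[l1].valid || first_level[l1].second_level == NULL)
        return PA_FAULT;                         /* second-level table not present */
    const pte_t *pte = &first_level[l1].second_level[l2];
    if (!pte->valid)
        return PA_FAULT;                         /* page fault */
    return (pte->ppn << PAGE_SHIFT) | off;
}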
Page Table Access
How do we access the Page Table?
Page Table Base Register, Page Table Limit Register. If the VPN is out of bounds (exceeds the PTLR), then the process did not allocate the virtual page → access control exception.
Issues (II)
How fast is the address translation?
How can we make it fast?
Idea: Use a hardware structure that caches PTEs → the Translation Lookaside Buffer (TLB). What should be done on a TLB miss?
What TLB entry to replace?
Who handles the TLB miss? HW vs. SW?
What should be done on a page fault?
What virtual page to replace from physical memory?
Who handles the page fault? HW vs. SW?
Issues (III)
When do we do the address translation?
Before or after accessing the L1 cache?
In other words, is the cache virtually addressed or physically addressed?
Virtual versus physical cache What are the issues with a virtually addressed cache?
Synonym problem: two different virtual addresses can map to the same physical address → the same physical address can be present in multiple locations in the cache → can lead to inconsistency in data.
Physical Cache
Virtual Cache
Virtual-Physical Cache