Transcript: Paging
Paging
Virtual Memory
• Segmentation
– Basic early approach
• Paging
– Modern approach
– Advantages
• Easy to allocate physical memory
• Easy to “page out”/swap chunks of memory
– Disadvantages
• Overhead added to each memory reference
• Additional memory is required to store page tables
Hardware and OS structures for paging
• Hardware
– Page table base register
• X86: CR3
– TLB
• Software
– Page table
• Virtual->physical or virtual->disk mapping
– Page frame database
• One entry for each physical page
• Information about the page
– e.g. owning process, r/w permissions
– Swap file / Section list
Page Frame Database
/*
 * Each physical page in the system has a struct page associated with
 * it to keep track of whatever it is we are using the page for at the
 * moment. Note that we have no way to track which tasks are using
 * a page.
 */
struct page {
    unsigned long flags;      // Atomic flags: locked, referenced, dirty, slab, disk
    atomic_t _count;          // Usage count
    atomic_t _mapcount;       // Count of ptes mapping in this page
    struct {
        unsigned long private;         // Used for managing page used in file I/O
        struct address_space *mapping; // Used to define the data this page is holding
    };
    pgoff_t index;            // Our offset within mapping
    struct list_head lru;     // Linked list node containing LRU ordering of pages
    void *virtual;            // Kernel virtual address
};
Multilevel page tables
• Page table with N levels
– Virtual addresses split into N+1 parts
• N indexes into the different levels
• 1 offset into the page
• Example: 32 bit paging on x86
– 4KB pages, 4 bytes/PTE
• 12 bits in offset: 2^12 = 4096
– Want to fit page table entries into 1 page
• 4KB / 4 bytes = 1024 PTEs per page
– So level indexes = 10 bits each
• 2^10 = 1024
2 level page tables
[Diagram: a virtual address is split into Master Page # (10 bits), Secondary Page # (10 bits), and Offset (12 bits). The master page table entry selects a secondary page table; the secondary page table entry gives the Page Frame Number; Page Frame # + Offset form the physical address into physical memory (Page Frames 0-4 shown).]
Inverted Page Table
• Previous examples: “forward page tables”
– Page table size is relative to the size of virtual memory
– Physical memory could be much less
• Lots of wasted space
• Separate approach: use a hash table
– Inverted page table
– Size is independent of virtual address space
– Directly related to size of physical memory
– [Diagram: Virtual Page # + Offset -> Hash Table -> Physical Page # + Offset]
• Cons:
– Have to manage a hash table (collisions, rebalancing, etc.)
Addressing Page Tables
• Where are page tables stored?
– And in which address space?
• Possibility #1: Physical memory
– Easy to address, no translation required
– But page tables must stay resident in memory
• Possibility #2: Virtual memory (OS VA space)
– Cold (unused) page table pages can be swapped out
– But page table addresses must be translated through page tables
• Don’t page the outer page table page (called wiring)
• Question: Can the kernel be paged?
Generic PTE
• PTE maps a virtual page to a physical page
• Includes page properties (shared with HW)
– Valid? Writable? Dirty? Cacheable?
• PTE layout: property bits + Physical Page Number
– Where is the virtual page number? (It isn’t stored: the VPN is the index used to find the PTE)
• Acronyms
– PTE = Page Table Entry
– PDE = Page Directory Entry
– VA = Virtual Address
– PA = Physical Address
– VPN = Virtual Page Number
– PPN = Physical Page Number
– PFN = Page Frame Number (same as PPN)
X86 address translation (32 bit)
• Page tables organized as a 2-level tree
– Efficiently handle sparse address space
• One set of page tables per process
– Current page tables pointed to by CR3
• CPU “walks” page tables to find translations
– Accessed and Dirty bits updated by CPU
• 32 bit: 4KB or 4MB pages
• 64 bit: 4 levels; 4KB or 2MB pages
X86 32 bit PDE/PTE details
• Flag key:
– P: Present
– R/W: Read/Write
– U/S: User/System
– PWT: Write-through
– PCD: Cache Disable
– A: Accessed
– D: Dirty
– PAT: Cache behavior definition
– G: Global
– AVL: Available for OS use
• If page is not present (P=0), then the other bits are available for the OS to use
• Remember: useful for swapping
Paging Translation
Making it efficient
• Original page table scheme doubled the cost of memory accesses
– 1 page table access + 1 data access
• 2-level page tables triple the cost
– 2 page table accesses + 1 data access
• 4-level page tables quintuple the cost
– 4 page table accesses + 1 data access
• How to achieve efficiency
– Goal: Make virtual memory accesses as fast as physical memory accesses
– Solution: Use a hardware cache
• Cache virtual-to-physical translations in hardware
• Translation Lookaside Buffer (TLB)
• X86:
– TLB is managed by the CPU’s MMU
– 1 per CPU/core
TLBs
• Translation Lookaside Buffers
– Translate virtual page #s into PTEs (NOT physical addresses)
• Why?
– Can be done in a single machine cycle
• Implemented in hardware
– Associative cache (many entries searched in parallel)
– Cache tags are virtual page numbers
– Cache values are PTEs
– With PTE + offset, the MMU directly calculates the PA
• TLBs rely on locality
– Processes only use a handful of pages at a time
• 16-48 entries in TLB is typical (64-192KB of coverage for 4KB pages)
• Targets the “hot set” or “working set” of the process
– TLB hit rates are critical for performance
Managing TLBs
• Address translations are mostly handled by the TLB
– (>99%) hit rate, but there are occasional TLB misses
– On a miss, who places translations into the TLB?
• Hardware-managed TLB (MMU)
– MMU knows where the page tables are in memory (CR3)
• OS maintains them, HW accesses them
– Tables set up in a HW-defined format
– X86
• Software-loaded TLB (OS)
– TLB miss faults to the OS; the OS finds the right PTE and loads it into the TLB
– Must be fast
• CPU ISA has special TLB access instructions
• OS uses its own page table format
– SPARC and IBM Power
Managing TLBs (2)
• OS must ensure the TLB and page tables are consistent
– If the OS changes a PTE, it must invalidate the cached PTE in the TLB
– Explicit instruction to invalidate a PTE
• X86: invlpg
• What happens on a context switch?
– Each process has its own page table
– Entire TLB must be invalidated (TLB flush)
– X86: Certain instructions automatically flush the entire TLB
• Reloading CR3: asm ("mov %0, %%cr3" :: "r"(page_table_addr));
• When the TLB misses, a new PTE is loaded and a cached PTE is evicted
– Which PTE should be evicted?
• TLB Replacement Policy
• Defined and implemented in hardware (usually LRU)
x86 TLB
• TLB management is shared by the CPU and the OS
• CPU:
– Fills TLB on demand from page tables
• OS is unaware of TLB misses
– Evicts entries as needed
• OS:
– Ensures TLB and page tables are consistent
• Flushes the entire TLB when page tables are switched (e.g. on a context switch)
– asm ("mov %0, %%cr3" :: "r"(page_table_addr));
• Modifications to a single PTE are flushed explicitly
– asm volatile ("invlpg (%0)" :: "r"(virtual_addr) : "memory");
Cool Paging Tricks
• Exploit the level of indirection between VA and PA
– Shared memory
• Regions of two separate processes’ address spaces map to the same physical memory
– Read/write: access to shared data
– Execute: shared libraries
• Each process can have a separate PTE pointing to the same physical memory
– Different access privileges for different processes
• Does the shared region need to map the same VA in each process?
– Copy-On-Write (COW)
• Instead of copying physical memory on fork()
• Just create a new set of identical page tables with writes disabled
– When the child writes to a page, the OS gets a page fault
– The OS copies the physical page and maps the new page into the child process
Saving Memory to Disk
• On memory shortage:
– OS writes memory contents to disk and reuses the memory
• Copying out a whole process is called “swapping”
• Copying out a single page is called “paging”
• Where does the data go?
– If it came from a file and was not modified: just deleted from memory
• E.g. executable code
– Unix: swap partition
• A partition (file, disk segment, or entire disk) reserved as a backing store
– Windows: swap file
• Designated file stored in the regular file system
• When does data move?
– Swapping: in advance of running a process
– Paging: when a page of memory is accessed
Demand paging
• Moving pages between memory and disk
– OS uses main memory as a cache
• Most of memory is used to store file data
– Programs, libraries, data
– File contents cached in memory
• Anonymous memory
– Memory not used for file data
» Heap, stack, globals, …
– Backed by swap file/partition
– OS manages movement of pages to/from disk
• Transparent to the application
Why is this “demand” paging?
• When a process first starts: fork()/exec()
– Brand new page tables with no valid PTEs
– No pages mapped to physical memory
– As the process executes, memory is accessed
• Instructions immediately fault on code and data pages
• Faults stop once all necessary code/data is in memory
– Only code/data that is needed
• Memory that is needed changes over time
– Pages shift between disk and memory
Page faults
• What happens when a process references an evicted page?
– When a page is evicted, the OS sets its PTE as invalid (present = 0)
• Sets the rest of the PTE bits to indicate the location in the swap file
– When the process accesses the page, the invalid PTE triggers a CPU exception (page fault)
– OS invokes the page fault handler
• Checks the PTE, and uses the high 31 bits to find the page on the swap disk
• Handler reads the page from disk into an available frame
– Possibly has to evict another page…
• Handler restarts the process
• What if memory is full?
– Another page must be evicted (page replacement algorithm)
Steps in Handling a page fault
Evicting the best page
• OS must choose a victim page to be evicted
– Goal: Reduce the page fault rate
– The best page to evict is one that will never be accessed again
• Not really possible…
• Belady’s proof: Evicting the page that won’t be used for the longest period of time minimizes the page fault rate
Belady’s Algorithm
• Find the page that won’t be used for the longest amount of time
– Not possible
• So why is it here?
– Provably optimal solution – Comparison for other practical algorithms – Upper bound on possible performance
• Lower bound?
– Depends on workload… • Random replacement is generally a bad idea
FIFO
• Obvious and simple
– When a page is brought in, it goes to the tail of the list
– On eviction, take the head of the list
• Advantages
– If it was brought in a while ago, then it might not be used...
• Disadvantages
– Or it’s being used by everybody (glibc)
– Does not measure access behavior at all
• FIFO suffers from Belady’s Anomaly
– Fault rate might increase when given more physical memory
• Very bad property…
• Exercise: Develop a workload where this is true
Least Recently Used (LRU)
• Use access behavior during selection
– Idea: Use past behavior to predict future behavior
– On replacement, evict the page that hasn’t been used for the longest amount of time
• LRU looks at the past, Belady’s looks at the future
• Implementation
– To be perfect, every access must be detected and timestamped (way too expensive)
• So it must be approximated
Approximating LRU
• Many approximations, all using a PTE flag
– x86: Accessed bit, set by HW on every access
– Each page has a counter (where?)
– Periodically, scan the entire list of pages
• If accessed = 0, increment the counter (not used)
• If accessed = 1, clear the counter (used)
• Clear the accessed flag in the PTE
– The counter will contain the # of scan iterations since the last reference
• The page with the largest counter is the least recently used
• Some CPUs don’t have PTE flags
– Can simulate it by forcing page faults with an invalid PTE
LRU Clock
• Not Recently Used (NRU) or Second Chance
– Replace a page that is “old enough”
• Arrange pages in a circular list (like a clock)
– The clock hand (pointer) sweeps through the list
• If accessed = 0, not used recently, so evict it
• If accessed = 1, recently used
– Set accessed to 0, go to the next PTE
• Recommended for Project 3
• Problem:
– If memory is large, the “accuracy” of the information degrades
N-Chance clock
• Basic Clock only has two states: used or not used
• Can we add more?
– Answer: Embed a counter into the PTE
• The clock hand (pointer) sweeps through the list
– If accessed = 0, not used recently
• If counter = 0, evict
• Else, decrement the counter and go to the next PTE
– If accessed = 1, recently used
• Set accessed to 0
• Increment the counter
• Go to the next PTE