Transcript [talk]
Efficient Virtual Memory for Big Memory Servers
Arkaprava Basu, Jayneel Gandhi, Jichuan Chang*, Mark D. Hill, Michael M. Swift
* HP Labs

“Virtual Memory was invented in a time of scarcity. Is it still a good idea?”
--- Charles Thacker, 2010 Turing Award Lecture

Executive Summary
• Big memory workloads are important
  – graph analysis, memcached, databases
• Our analysis:
  – TLB misses burn up to 51% of execution cycles
  – Paging is not needed for almost all of their memory
• Our proposal: Direct Segments
  – Paged virtual memory where needed
  – Segmentation (no TLB miss) where possible
• Direct Segment often eliminates 99% of DTLB misses

7/26/2016 ISCA 2013

Virtual Memory Refresher
[Figure: Process 1 and Process 2 each have their own virtual address space, mapped through the page table to physical memory; the core contains a cache and a TLB (Translation Lookaside Buffer).]
Challenge: TLB misses waste execution time

Memory Usage Trend
• Memory size: MB to GB to TB
  – Windows Server: 64GB to 4TB in a decade
• TLB size remained almost constant

Year | L1-DTLB entries
1999 | 72 (Pent. III)
2001 | 64 (Pent. 4)
2008 | 96 (Nehalem)
2012 | 100 (Ivy Bridge)

• Low access locality of server workloads [Ramcloud’10]
  – TLB is less effective
Growing memory size + near-constant TLB size => higher TLB miss overhead

Experimental Setup
• Experiments on Intel Xeon (Sandy Bridge) x86-64
  – Page sizes: 4KB (default), 2MB, 1GB

        | 4 KB             | 2 MB            | 1 GB
L1 DTLB | 64 entry, 4-way  | 32 entry, 4-way | 4 entry, fully assoc.
L2 DTLB | 512 entry, 4-way |                 |

• 96GB installed physical memory
• Methodology: Use hardware performance counters

Big Memory Workloads
[Figure: bar chart. Y-axis: percentage of execution cycles spent on servicing DTLB misses (0 to 35). X-axis: graph500, memcached, MySQL, NPB:BT, NPB:CG, GUPS. Series: 4KB, 2MB, 1GB, Direct Segment. Off-scale bars are labeled 51.1 and 83.1.]

Execution Time Overhead: TLB Misses
[Figure: the same chart, built up over several slides. Y-axis: percentage of execution cycles wasted (0 to 35). Off-scale bars are labeled 51.1, 83.1 and 51.3. The six Direct Segment bars are labeled ~0 (twice), 0.01 (twice), 0.48 and 0.49.]
• Significant overhead of paged virtual memory
• Worse with TBs of memory now or in future?

Roadmap
• Introduction and Motivation
• Analysis: Big memory workloads
• Design: Direct Segment
• Evaluation
• Summary

How is Paged Virtual Memory used?
An example: memcached servers
[Figure: clients send Key X to memcached servers (server #n) and receive Value Y; each server holds an in-memory hash table and network state.]

Big Memory Workloads’ Use of Paging

Paged VM Feature        | Our Analysis                                 | Implication
Swapping                | ~0 swapping                                  | Not essential
Per-page protection     | ~99% pages read-write                        | Overkill
Fragmentation reduction | Little OS-visible fragmentation (next slide) | Per-page (re)allocation less important

Memory Allocation Over Time
[Figure: line chart. Y-axis: allocated memory (in GB), 0 to 90. X-axis: time (in seconds), 0 to 1500. One line each for graph500, memcached, MySQL, NPB:BT, NPB:CG and GUPS; the initial period is marked “Warm-up”.]
Most of the memory is allocated early

Where is Paged Virtual Memory Needed?
[Figure: layout of the virtual address (VA) space, not to scale. Paging valuable: code, constants, shared memory, mapped files, stack, guard pages. Paging not needed: dynamically allocated heap region.]
Paged VM not needed for MOST memory

Roadmap
• Introduction and Motivation
• Analysis: Big Memory Workloads
• Design: Direct Segment
  – Idea
  – Hardware
  – Software
• Evaluation
• Summary

Idea: Two Types of Address Translation
A. Conventional paging
• All features of paging
• All cost of address translation
B. Simple address translation
• NO paging features
• NO TLB miss
• OS/Application decides where to use which [=> Paging features where needed]

Hardware: Direct Segment
[Figure: the virtual address (VA) space is split into (1) conventional paging and (2) a direct segment. BASE and LIMIT bound the direct segment in VA; OFFSET maps it to a contiguous range of physical addresses (PA).]
Why Direct Segment?
• Matches big memory workload needs
• NO TLB lookups => NO TLB Misses

H/W: Translation with Direct Segment
[Figure: the virtual page number [V47 … V12] is compared against BASE (≥?) and LIMIT (<?) alongside the DTLB lookup; the page offset [V11 … V0] passes through unchanged as [P11 … P0]. When both comparisons succeed (Y), paging is ignored: the DTLB hit/miss result is discarded and OFFSET produces the physical page number [P40 … P12]. Otherwise a DTLB miss goes to the page-table walker.]
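The translation path on these hardware slides can be sketched in software. This is only an illustrative model, not the authors' hardware: the class and field names are hypothetical, the DTLB is an unbounded dictionary, LIMIT is treated as exclusive, and the checks run sequentially rather than alongside the DTLB lookup. It assumes the sign convention given on the software slide, OFFSET = BASE - start PA, so a direct-segment address translates as VA - OFFSET.

```python
from dataclasses import dataclass, field

PAGE_SHIFT = 12  # 4KB base pages, matching the [V11 .. V0] page offset on the slide


@dataclass
class DirectSegmentMMU:
    """Toy model of translation with one direct segment (all names hypothetical)."""
    base: int    # start VA of the direct segment
    limit: int   # end VA of the direct segment (exclusive in this sketch)
    offset: int  # assumed convention: offset = base - start PA
    page_table: dict = field(default_factory=dict)  # VPN -> PFN, stands in for the page-table walker
    dtlb: dict = field(default_factory=dict)        # VPN -> PFN, unbounded toy DTLB
    dtlb_misses: int = 0

    def translate(self, va: int) -> int:
        # Direct-segment path: BASE <= VA < LIMIT, so paging is ignored.
        if self.base <= va < self.limit:
            return va - self.offset
        # Paging path: DTLB lookup, then a page-table walk on a miss.
        vpn = va >> PAGE_SHIFT
        page_off = va & ((1 << PAGE_SHIFT) - 1)
        if vpn not in self.dtlb:
            self.dtlb_misses += 1
            self.dtlb[vpn] = self.page_table[vpn]  # KeyError here models a page fault
        return (self.dtlb[vpn] << PAGE_SHIFT) | page_off
```

Addresses inside the segment never touch the DTLB in this model, which is the property the slides summarize as "NO TLB lookups => NO TLB Misses".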
[Figure: the same datapath for a virtual address outside the direct segment (N). The direct segment is ignored: a DTLB hit supplies the physical page number [P40 … P12], and a DTLB miss invokes the page-table walker. The page offset [V11 … V0] again passes through as [P11 … P0].]

S/W: 1 Setup Direct Segment Registers
• Calculate register values for processes
  – BASE = Start VA of Direct Segment
  – LIMIT = End VA of Direct Segment
  – OFFSET = BASE - Start PA of Direct Segment
• Save and restore register values
[Figure: BASE, LIMIT and OFFSET shown for two virtual address spaces (VA1, VA2) mapping to PA.]

S/W: 2 Provision Physical Memory
• Create contiguous physical memory
  – Reserve at startup
    • Big memory workloads are cognizant of memory needs
    • e.g., memcached’s object cache size
  – Memory compaction
    • Latency insignificant for long running jobs
      – 10GB of contiguous memory in < 3 sec
      – 1% speedup => 25 mins break even for 50GB compaction (50GB at 10GB per 3 sec takes about 15 sec; 15 sec / 1% = 1500 sec = 25 mins)

S/W: 3 Abstraction for Direct Segment
• Primary Region
  – Contiguous VIRTUAL address range not needing paging
  – Hopefully backed by Direct Segment
  – But all/part can use base/large/huge pages
[Figure: primary region in VA mapped to PA.]
• What is allocated in the primary region?
  – All anonymous read-write memory allocations
  – Or only on explicit request (e.g., mmap flag)

Roadmap
• Introduction and Motivation
• Analysis: Big Memory Workloads
• Design: Direct Segment
• Evaluation
  – Methodology
  – Results
• Summary

Methodology
• Primary region implemented in Linux 2.6.32
• Estimate performance of non-existent direct-segment hardware
  – Get fraction of TLB misses to direct-segment memory
  – Estimate performance gain with linear model
• Prototype simplifications (design more general)
  – One process uses direct segment
  – Reserve physical memory at start up
  – Allocate r/w anonymous memory to primary region
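The "linear model" named on the Methodology slide can be read as follows: if a fraction of the DTLB misses would fall inside the direct segment, the overhead that remains is the measured paging overhead scaled by the fraction of misses that still occur. The sketch below is one plausible reading of that bullet, not the authors' exact model; the function name and the sample inputs are hypothetical.

```python
def estimated_overhead(paged_overhead_pct: float, fraction_in_segment: float) -> float:
    """Estimate the remaining TLB-miss overhead (in percent of execution cycles).

    paged_overhead_pct: overhead measured with paging only (hardware counters).
    fraction_in_segment: share of DTLB misses that a direct segment would remove.
    """
    if not 0.0 <= fraction_in_segment <= 1.0:
        raise ValueError("fraction_in_segment must be between 0 and 1")
    # Linear assumption: overhead shrinks in proportion to the misses eliminated.
    return paged_overhead_pct * (1.0 - fraction_in_segment)


# Illustrative inputs only (not taken from the talk): 10% overhead, 90% of misses removed.
remaining = estimated_overhead(10.0, 0.9)
```

With these made-up inputs the estimate comes out at roughly 1% of execution cycles.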
Execution Time Overhead: TLB Misses
[Figure: bar chart, lower is better. Y-axis: percentage of execution cycles wasted (0 to 35). X-axis: graph500, memcached, MySQL, NPB:BT, NPB:CG, GUPS. Series: 4KB, 2MB, 1GB, Direct Segment. Off-scale bars are labeled 51.1, 83.1 and 51.3. The six Direct Segment bars are labeled ~0 (twice), 0.01 (twice), 0.48 and 0.49. “Misses” in Direct Segment: 99.9% for five workloads and 92.4% for one.]

(Some) Limitations
• Does not (yet) work with Virtual Machines
  – Can be extended, but memory overcommit is challenging
• Less suitable for sparse virtual address space
• One direct segment
  – Our workloads did not justify more

Summary
• Big memory workloads
  – Incur high TLB miss cost
  – Paging not needed for almost all memory
• Our proposal: Direct Segment
  – Paged virtual memory where needed
  – Segmentation (NO TLB miss) where possible

Thank You & Questions?

BACKUP

Address Translation in Different ISA/machines

ISA/Machine     | Address Translation
Multics         | Segmentation on top of Paging
Burroughs B5000 | Segmentation
UltraSPARC      | Paging
X86 (32 bit)    | Segmentation on top of Paging
ARM             | Paging
PowerPC         | Segmentation on top of Paging
Alpha           | Paging
X86-64          | Paging only (mostly)

Direct Segment: (1) NOT on top of paging. (2) NOT to replace paging. (3) NO two-dimensional address space. Keeps linear address space.

Why not Huge Pages?
• Huge pages do not automatically scale
  – New page size and/or more TLB entries
• TLBs dependent on access locality
• Fixed ISA-defined sparse page sizes
  – e.g., 4KB, 2MB, 1GB
  – Need to be aligned at page size boundaries
• Multiple page sizes introduce TLB tradeoffs
  – Fully associative vs. set-associative designs

Direct Segment in Cloud?
• In its current incarnation, DS is most suitable for enterprise workloads
  – Less suitable when many short jobs come and go
• Memory usage needs to be predictable to enable performance guarantees
  – Same memory usage predictions can be used to create DS

How to handle faulty pages?
• Direct segment cannot remap faulty pages
  – No ability to remap at small granularities
• Revert part or all of direct segment memory
• Memory controller remaps faulty pages
  – Only small number of faulty pages
  – List of faulty re-mapped pages in MC

Methodology
• S/W TLB miss tracker
  – Make PTEs invalid in memory, valid in TLB
  – Trap to OS on each TLB miss
  – Range checking against direct segment’s VA
• Assumption
  – TLB miss overhead reduces proportionally with the number of DTLB misses
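The range-checking step of the software TLB miss tracker can be sketched as a small classifier over trapped miss addresses. This is a minimal illustration under stated assumptions: the function name is hypothetical, the trapped addresses are assumed to be available as a plain sequence of virtual addresses, and the end of the segment is treated as exclusive.

```python
from typing import Iterable


def fraction_in_direct_segment(miss_vas: Iterable[int], base: int, limit: int) -> float:
    """Return the fraction of trapped DTLB-miss addresses that fall in [base, limit)."""
    total = 0
    inside = 0
    for va in miss_vas:
        total += 1
        if base <= va < limit:  # range check against the direct segment's VA
            inside += 1
    return inside / total if total else 0.0


# Illustrative addresses only: two of the four misses land inside the segment.
share = fraction_in_direct_segment([0x1000, 0x5000, 0x9000, 0x6000], base=0x4000, limit=0x8000)
```

Under the proportionality assumption stated on the slide, this fraction is the share of the measured TLB miss overhead that a direct segment would be expected to remove.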