
Efficient Virtual Memory
for Big Memory Servers
Arkaprava Basu, Jayneel Gandhi, Jichuan Chang*,
Mark D. Hill, Michael M. Swift
* HP Labs
“Virtual Memory was invented in a time of scarcity. Is it still a good idea?”
--- Charles Thacker, 2010 Turing Award Lecture
Executive Summary
• Big memory workloads important
– graph analysis, memcached, databases
• Our analysis:
– TLB misses burn up to 51% of execution cycles
– Paging not needed for almost all of their memory
• Our proposal: Direct Segments
– Paged virtual memory where needed
– Segmentation (No TLB miss) where possible
• Direct Segment often eliminates 99% of DTLB misses
Virtual Memory Refresher
[Diagram: each process's virtual address space is translated to physical memory by the core's TLB (Translation Lookaside Buffer), with misses serviced by a page-table walk]
Challenge: TLB misses waste execution time
Memory capacity for $10,000*
[Log-scale chart, 1980–2010: memory purchasable for $10,000 grows from megabytes to terabytes; commercial servers now ship with 4TB memory]
Big data needs to access terabytes of data at low latency
*Inflation-adjusted 2011 USD, from: jcmit.com
Memory Usage Trend
• Memory size: MB → GB → TB
– Windows Server: 64GB → 4TB in a decade
• TLB size remained almost constant

Year | 1999 (Pent. III) | 2001 (Pent. 4) | 2008 (Nehalem) | 2012 (Ivy Bridge)
L1-DTLB entries | 72 | 64 | 96 | 100

• Low access locality of server workloads [RAMCloud'10]
– TLB is less effective
Growing memory size + near-constant TLB size => growing TLB miss overhead
Experimental Setup
• Experiments on Intel Xeon (Sandy Bridge) x86-64
– Page sizes: 4KB (Default), 2MB, 1GB
Page size | L1 DTLB | L2 DTLB
4 KB | 64-entry, 4-way | 512-entry, 4-way
2 MB | 32-entry, 4-way | —
1 GB | 4-entry, fully assoc. | —
• 96GB installed physical memory
• Methodology: use hardware performance counters
Big Memory Workloads
[Bar chart: percentage of execution cycles spent servicing DTLB misses for graph500, memcached, MySQL, NPB:BT, NPB:CG, and GUPS with 4KB, 2MB, and 1GB pages and Direct Segment; off-scale bars labeled 51.1% and 83.1%]
Execution Time Overhead: TLB Misses
[Bar chart: percentage of execution cycles wasted on DTLB misses for graph500, memcached, MySQL, NPB:BT, NPB:CG, and GUPS with 4KB, 2MB, and 1GB pages; off-scale bars labeled 51.1%, 83.1%, and 51.3%]
Significant overhead of paged virtual memory
Worse with TBs of memory now or in the future?
Execution Time Overhead: TLB Misses
[Same chart with Direct Segment results added: DTLB miss overhead drops to between ~0% and 0.49% across the workloads]
Roadmap
• Introduction and Motivation
• Analysis: Big Memory Workloads
• Design: Direct Segment
• Evaluation
• Summary
How is Paged Virtual Memory Used?
An example: memcached servers
[Diagram: a client sends Key X over the network to memcached server #n, whose memory holds the in-memory hash table (mapping Key X to Value Y) plus network state]
Big Memory Workloads’ Use of Paging

Paged VM Feature | Our Analysis | Implication
Swapping | ~0 swapping | Not essential
Per-page protection | ~99% of pages read-write | Overkill
Fragmentation reduction | Little OS-visible fragmentation (next slide) | Per-page (re)allocation less important
Memory Allocation Over Time
[Line chart: allocated memory (GB, 0–90) vs. time (0–1500 seconds) for graph500, memcached, MySQL, NPB:BT, NPB:CG, and GUPS, including warm-up]
Most of the memory is allocated early
Where is Paged Virtual Memory Needed?
[Virtual address space layout (not to scale): paging is valuable for code, constants, shared memory, mapped files, the stack, and guard pages; paging is not needed for the dynamically allocated heap region]
Paged VM not needed for MOST memory
Roadmap
• Introduction and Motivation
• Analysis: Big Memory Workloads
• Design: Direct Segment
– Idea
– Hardware
– Software
• Evaluation
• Summary
Idea: Two Types of Address Translation
A. Conventional paging
• All features of paging
• All cost of address translation
B. Simple address translation
• NO paging features
• NO TLB miss
• OS/Application decides where to use which
[=> Paging features where needed]
Hardware: Direct Segment
[Diagram: the virtual address space is split into (1) a conventionally paged part and (2) a Direct Segment bounded by BASE and LIMIT that maps to contiguous physical memory via OFFSET]
Why Direct Segment?
• Matches big memory workload needs
• NO TLB lookups => NO TLB Misses
H/W: Translation with Direct Segment
[Diagram: the virtual page number bits [V47..V12] are compared against BASE and LIMIT in parallel with the DTLB lookup; the page offset bits [V11..V0] pass through unchanged]
• If BASE ≤ VA < LIMIT: the Direct Segment produces the physical frame bits [P40..P12] by adding OFFSET; the DTLB result and the page-table walker are ignored
• Otherwise: the Direct Segment is ignored and translation uses the DTLB, with a page-table walk on a miss
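A minimal sketch of the translation check described above, written in C with hypothetical register and helper names (not the authors' hardware description):

```c
#include <stdint.h>

/* Per-context direct-segment registers (hypothetical struct). */
typedef struct {
    uint64_t base;    /* BASE:   start VA of the direct segment           */
    uint64_t limit;   /* LIMIT:  end VA (exclusive) of the direct segment */
    int64_t  offset;  /* OFFSET: start PA of backing memory minus BASE    */
} ds_regs_t;

/* Translate one virtual address. tlb_or_walk() stands in for the
 * conventional DTLB lookup / page-table-walk path. */
uint64_t translate(uint64_t va, const ds_regs_t *ds,
                   uint64_t (*tlb_or_walk)(uint64_t))
{
    if (va >= ds->base && va < ds->limit) {
        /* Direct-segment hit: no TLB lookup, so no TLB miss is possible. */
        return (uint64_t)((int64_t)va + ds->offset);
    }
    /* Outside the segment: fall back to paged translation. */
    return tlb_or_walk(va);
}
```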
S/W: 1 Setup Direct Segment Registers
• Calculate register values for each process
– BASE = start VA of the direct segment
– LIMIT = end VA of the direct segment
– OFFSET = start PA of the direct segment – BASE
• Save and restore register values on context switches
[Diagram: virtual addresses VA1 and VA2 between BASE and LIMIT map to physical addresses by adding OFFSET]
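A sketch of how the OS side of this step might look (illustrative only, not the actual Linux prototype; the struct repeats the hypothetical register layout from the earlier sketch):

```c
#include <stdint.h>

/* Same hypothetical register layout as the earlier sketch. */
typedef struct { uint64_t base, limit; int64_t offset; } ds_regs_t;

/* Compute a process's direct-segment register values from its
 * primary-region VA range and the contiguous PA backing it. */
void ds_setup(ds_regs_t *ds, uint64_t va_start, uint64_t va_end,
              uint64_t pa_start)
{
    ds->base   = va_start;                               /* BASE   */
    ds->limit  = va_end;                                 /* LIMIT  */
    ds->offset = (int64_t)pa_start - (int64_t)va_start;  /* OFFSET */
}

/* On a context switch, save the outgoing process's values and load
 * the incoming process's values, like other per-process state. */
void ds_context_switch(ds_regs_t *hw_regs, ds_regs_t *prev,
                       const ds_regs_t *next)
{
    *prev    = *hw_regs;   /* save    */
    *hw_regs = *next;      /* restore */
}
```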
S/W: 2 Provision Physical Memory
• Create contiguous physical memory
– Reserve at startup
• Big memory workloads cognizant of memory needs
• e.g., memcached’s object cache size
– Memory compaction
• Latency insignificant for long-running jobs
– 10GB of contiguous memory in < 3 sec
– 1% speedup => 25-min break-even for a 50GB compaction
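A rough back-of-the-envelope check of that break-even figure, assuming the ~10GB-in-3s compaction rate quoted above:

\[
t_{\text{compact}} \approx 50\,\text{GB} \times \frac{3\,\text{s}}{10\,\text{GB}} = 15\,\text{s},
\qquad
t_{\text{break-even}} \approx \frac{15\,\text{s}}{1\%} = 1500\,\text{s} = 25\,\text{min}.
\]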
S/W: 3 Abstraction for Direct Segment
• Primary Region
– Contiguous VIRTUAL addresses not needing paging
– Hopefully backed by a Direct Segment
– But all/part can use base/large/huge pages
[Diagram: the primary region's VA range mapped onto contiguous PA]
• What is allocated in the primary region?
– All anonymous read-write memory allocations
– Or only on explicit request (e.g., an mmap flag)
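A hypothetical user-level sketch of the explicit-request path; the MAP_PRIMARY flag name is invented for illustration and is not an existing Linux mmap flag:

```c
#include <sys/mman.h>
#include <stddef.h>

/* Hypothetical flag asking the OS to place this mapping in the
 * process's primary region (and so, ideally, in the direct segment). */
#define MAP_PRIMARY 0x400000

/* Allocate anonymous read-write memory in the primary region. */
static void *alloc_primary(size_t bytes)
{
    return mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_PRIMARY, -1, 0);
}
```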
Roadmap
• Introduction and Motivation
• Analysis: Big Memory Workloads
• Design: Direct Segment
• Evaluation
– Methodology
– Results
• Summary
Methodology
• Primary region implemented in Linux 2.6.32
• Estimate performance of the (not-yet-built) direct-segment hardware
– Measure the fraction of TLB misses that fall in direct-segment memory
– Estimate the performance gain with a linear model
• Prototype simplifications (the design is more general)
– One process uses the direct segment
– Reserve physical memory at startup
– Allocate r/w anonymous memory to the primary region
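A minimal sketch of the linear model implied here, as I read it (not necessarily the authors' exact calculation):

```c
/* Estimated DTLB-miss overhead with a direct segment, assuming the
 * measured overhead shrinks in proportion to the fraction of DTLB
 * misses that fall inside direct-segment memory. */
double estimate_ds_overhead(double measured_overhead_pct,
                            double fraction_of_misses_in_ds)
{
    return measured_overhead_pct * (1.0 - fraction_of_misses_in_ds);
}

/* Example: 51.1% overhead with 99.9% of misses in the direct segment
 * gives an estimated overhead of about 0.05%. */
```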
Execution Time Overhead: TLB Misses (lower is better)
[Bar chart: percentage of execution cycles wasted on DTLB misses for graph500, memcached, MySQL, NPB:BT, NPB:CG, and GUPS with 4KB, 2MB, and 1GB pages and Direct Segment; off-scale bars labeled 51.1%, 83.1%, and 51.3%; Direct Segment overhead ranges from ~0% to 0.49%]
“Misses” in the Direct Segment: 92.4%–99.9% of DTLB misses fall within direct-segment memory
(Some) Limitations
• Does not (yet) work with virtual machines
– Can be extended, but memory overcommit is challenging
• Less suitable for sparse virtual address spaces
• One direct segment
– Our workloads did not justify more
Summary
• Big memory workloads
– Incur high TLB miss costs
– Paging not needed for almost all of their memory
• Our proposal: Direct Segments
– Paged virtual memory where needed
– Segmentation (NO TLB miss) where possible
Thank You
&
Questions?
BACKUP
Address Translation in Different ISAs/Machines

ISA/Machine | Address Translation
Multics | Segmentation on top of paging
Burroughs B5000 | Segmentation
UltraSPARC | Paging
x86 (32-bit) | Segmentation on top of paging
ARM | Paging
PowerPC | Segmentation on top of paging
Alpha | Paging
x86-64 | Paging only (mostly)

Direct Segment:
(1) NOT on top of paging
(2) NOT a replacement for paging
(3) NO two-dimensional address space; keeps a linear address space
Why not Huge Pages?
• Huge pages do not automatically scale
– Need a new page size and/or more TLB entries
• TLBs depend on access locality
• Fixed, ISA-defined, sparse page sizes
– e.g., 4KB, 2MB, 1GB
– Must be aligned at page-size boundaries
• Multiple page sizes introduce TLB trade-offs
– Fully associative vs. set-associative designs
Direct Segment in Cloud?
• In its current incarnation, DS is most suitable for enterprise workloads
– Less suitable when many short jobs come and go
• Memory usage needs to be predictable to enable performance guarantees
– The same memory-usage predictions can be used to create a DS
How to handle faulty pages?
• A direct segment cannot remap faulty pages
– No ability to remap at small granularities
• Option: revert part or all of the direct segment's memory to paging
• Option: the memory controller remaps faulty pages
– Only a small number of faulty pages expected
– The MC keeps a list of remapped faulty pages
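A toy sketch of the kind of small remap table the memory controller could keep for this (illustrative only; the names and sizes here are invented, not a design from the talk):

```c
#include <stdint.h>
#include <stddef.h>

/* A handful of faulty physical frames remapped to spare frames. */
struct mc_remap_entry { uint64_t faulty_pfn, spare_pfn; };

#define MC_REMAP_SLOTS 8
static struct mc_remap_entry mc_remap_table[MC_REMAP_SLOTS];
static size_t mc_remap_count;

/* Redirect an access that targets a known-faulty physical frame. */
static uint64_t mc_remap(uint64_t pfn)
{
    for (size_t i = 0; i < mc_remap_count; i++)
        if (mc_remap_table[i].faulty_pfn == pfn)
            return mc_remap_table[i].spare_pfn;
    return pfn;   /* not faulty: use the frame as-is */
}
```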
Methodology
• S/W TLB miss tracker (see the sketch below)
– Mark PTEs invalid in memory while keeping them valid in the TLB
– Trap to the OS on each TLB miss
– Range-check the faulting VA against the direct segment's VA range
• Assumption
– TLB miss overhead shrinks proportionally with the number of DTLB misses eliminated
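A rough sketch of the per-miss bookkeeping such a tracker might do (a hypothetical handler, not the actual Linux 2.6.32 patch; ds_regs_t repeats the hypothetical register layout from the earlier sketches):

```c
#include <stdint.h>

/* Same hypothetical register layout as the earlier sketches. */
typedef struct { uint64_t base, limit; int64_t offset; } ds_regs_t;

/* Counters updated on every software-trapped DTLB miss. */
struct tlb_miss_stats {
    unsigned long total_misses;
    unsigned long misses_in_primary_region;
};

/* Called from the trap handler with the faulting virtual address. */
void on_dtlb_miss(struct tlb_miss_stats *s, const ds_regs_t *ds,
                  uint64_t faulting_va)
{
    s->total_misses++;
    /* Would a direct segment have absorbed this miss? */
    if (faulting_va >= ds->base && faulting_va < ds->limit)
        s->misses_in_primary_region++;
    /* ...then make the PTE usable again and resume the faulting access. */
}
```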