Accelerating Two-Dimensional Page
Walks for Virtualized Systems
Jun Ma
Introduction
• Native non-virtualized system
An OS runs directly on a physical system and
communicates with the hardware directly.
Address mapping:
Virtual Address (VA): the address used by application
software running on the OS.
Physical Address (PA): the address in the physical machine.
For a native system, translation is VA -> PA.
Introduction
• Virtualization:
Multiple OSes can run simultaneously but separately
on one physical system.
Hypervisor: the underlying software that inserts an
abstraction layer into the virtualized system and
mediates the communication between each OS and
the physical system.
Introduction
• Virtualization:
Address mapping for a virtual machine.
Guest OS: Guest Virtual Address (GVA), Guest
Physical Address (GPA).
Physical system: System Physical Address (SPA).
Address translation:
GVA -> GPA -> SPA
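The two-stage mapping above can be sketched as a composition of two lookups. The dicts below are hypothetical stand-ins for real hierarchical page tables; only the GVA -> GPA -> SPA flow is being illustrated:

```python
# Guest page tables map GVA -> GPA; nested page tables map GPA -> SPA.
# The full translation is their composition.
guest_tables = {0x4000: 0x8000}    # GVA -> GPA (maintained by the guest OS)
nested_tables = {0x8000: 0xC000}   # GPA -> SPA (maintained by the hypervisor)

def translate(gva):
    gpa = guest_tables[gva]        # stage 1: guest translation
    return nested_tables[gpa]      # stage 2: nested translation

assert translate(0x4000) == 0xC000
```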
Introduction
• Virtualization:
Traditional approach to memory translation: handled by
the hypervisor.
Drawback: the hypervisor intercepts the operation, exits
the guest, emulates the operation, performs the memory
translation, and then returns to the guest -> high overhead.
Alternative approach:
Use hardware to perform the translation.
No hypervisor intervention is needed, saving this overhead.
Background
• X86 Native Page Translation
Page table:
hierarchical address-translation tables that map VA
to PA.
Page walk:
an iterative process.
To obtain the final PA from a VA, the hardware performs
a page walk, traversing every level of the page table
hierarchy.
Background
• X86 Native Page Translation
From level 4 down to level 1,
the physical address from the
level above is used as the base
address and a 9-bit slice of the
VA is used as the index into
that level's table.
The TLB (translation look-aside
buffer) caches the final
physical address to reduce the
frequency of page walks.
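The indexing scheme above can be sketched as follows, assuming the standard x86-64 4-level layout of four 9-bit table indices plus a 12-bit page offset:

```python
def split_virtual_address(va):
    """Split a 48-bit VA into four 9-bit table indices (L4..L1)
    and a 12-bit page offset, as in a 4-level x86-64 page walk."""
    offset = va & 0xFFF                        # low 12 bits: offset within the 4 KB page
    indices = [(va >> shift) & 0x1FF           # 9 bits index each table level
               for shift in (39, 30, 21, 12)]  # L4, L3, L2, L1
    return indices, offset

# Each index selects one of 512 entries; the entry read at each level
# supplies the base address of the next level's table.
idx, off = split_virtual_address(0x7F123456789A)
assert off == 0x89A and all(0 <= i < 512 for i in idx)
```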
Background
• Memory Management for Virtualization
Without hardware support, the hypervisor must
manage this translation (using a shadow page table
to map GVA to SPA). This is a major source of
hypervisor overhead.
Hardware mechanism:
the same idea as x86 page walking, extended to two
dimensions (2D page walking).
Nested paging: maps GPA to SPA.
Background
• Memory Management for Virtualization
The guest page tables are
traversed to translate GVA to GPA.
At each level, the GPA of the
guest page table (gL) entry must
first be translated to an SPA by
walking the nested page tables
before that entry can be read.
The TLB caches the final SPA to
reduce page walk overhead.
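A minimal counting sketch of this 2D walk, assuming 4-level guest and nested tables: each of the four guest table reads, plus the final guest data access, needs its GPA translated by a full nested walk first.

```python
GUEST_LEVELS = 4    # gL4..gL1
NESTED_LEVELS = 4   # nL4..nL1

def two_d_walk_references():
    """Count page entry references in one uncached 2D page walk."""
    refs = 0
    # The GPA of each guest table entry (gL4..gL1) and of the final
    # guest data page must each be translated by a full nested walk.
    refs += (GUEST_LEVELS + 1) * NESTED_LEVELS   # 5 nested walks = 20 refs
    refs += GUEST_LEVELS                         # plus the 4 guest entry reads
    return refs

assert two_d_walk_references() == 24   # vs. 4 references for a native walk
```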
Background
• Large page size advantages:
* Memory saving:
With 4 KB pages, the OS needs an entire L1 table, which is itself 4 KB. If
all 512 4 KB pages are combined into one contiguous 2 MB block, the L1
level can be skipped, saving the 4 KB occupied by that table.
* Reduction in TLB pressure:
Each large page table entry can be stored in a single TLB entry, while the
corresponding regular page entries require 512 TLB entries to map the
same 2 MB range of virtual addresses.
* Shorter page walk:
Skipping the entire L1 level makes the page walk shorter and therefore
saves some overhead.
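The arithmetic behind these three advantages is straightforward (assuming 8-byte page entries, as on x86-64):

```python
SMALL_PAGE = 4 * 1024          # 4 KB
ENTRIES_PER_TABLE = 512        # one 9-bit index per level
ENTRY_SIZE = 8                 # bytes per page entry on x86-64

# One 2 MB large page replaces 512 contiguous 4 KB pages...
LARGE_PAGE = ENTRIES_PER_TABLE * SMALL_PAGE
assert LARGE_PAGE == 2 * 1024 * 1024

# ...so the skipped L1 table (512 entries x 8 bytes) saves 4 KB of memory,
assert ENTRIES_PER_TABLE * ENTRY_SIZE == 4 * 1024

# and one large-page TLB entry covers what 512 small-page entries would.
assert LARGE_PAGE // SMALL_PAGE == 512
```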
Page walk characterization
• Page walk cost
Perfect TLB opportunity is the performance improvement that could be
achieved with a perfect TLB, which eliminates cold misses as well as
conflict and capacity misses.
Page walk characterization
• Page entry reuses
Nested page tables have much higher reuse than
guest page tables, in part due to the inherent
redundancy of the nested page walk.
There are many more nested accesses than guest
accesses in a 2D page walk. Each level of the nested
page table hierarchy must be accessed for each
guest level. In many cases the same nested page
entries are accessed multiple times in a 2D page
walk (high reuse rate).
Page walk characterization
• Page entry reuses
{G,gL1} and {nL1,gPA}
both have many unique
page entries because
each maps guest data
into its respective
address space.
{G,gL1} maps GVA ->
GPA.
{nL1,gPA} maps GPA ->
SPA.
These two are therefore
the most difficult to cache.
Page Walk Acceleration
• AMD Opteron Translation Caching:
Page walk cache (PWC):
stores page entries from all page table levels
except L1, whose entries are stored in the TLB.
All page entries are initially brought into the L2 cache.
On a PWC miss, the page entry data may still reside in
the L2 cache or the L3 cache (if present).
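The hit/miss flow can be sketched with a toy PWC. An unbounded fully-associative dict is an assumption here; a real PWC is a small fixed-size structure with a replacement policy.

```python
class PageWalkCache:
    """Toy PWC: caches page entry data keyed by the entry's SPA."""
    def __init__(self):
        self.entries = {}

    def lookup(self, spa, read_from_memory_hierarchy):
        if spa in self.entries:                 # PWC hit: low latency
            return self.entries[spa]
        data = read_from_memory_hierarchy(spa)  # miss: go to L2/L3/DRAM
        self.entries[spa] = data                # fill the PWC
        return data

pwc = PageWalkCache()
fetches = []
def memory(spa):                 # stand-in for the cache hierarchy
    fetches.append(spa)
    return spa ^ 0xABC           # hypothetical page entry data

pwc.lookup(0x1000, memory)
pwc.lookup(0x1000, memory)       # second access hits in the PWC
assert len(fetches) == 1         # only one memory hierarchy reference
```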
Page Walk Acceleration
• Translation caching for 2D page walks
One-Dimensional PWC (1D PWC):
Only page entry data from the guest dimension are stored in the PWC,
and the entries are tagged based on their system physical address.
The lowest-level guest page table entry {G,gL1} is not cached in the PWC
because of its low reuse rate.
Two-Dimensional PWC (2D PWC):
Extends 1D PWC into the nested dimension of the 2D page walk. This turns
the 20 unconditional cache hierarchy accesses into 16 likely PWC hits
(dark-filled references in Figure 5(b)) and four possible PWC hits
(checkered references). Like 1D PWC, all page entries are tagged with
their system physical address and {G,gL1} is not cached.
Page Walk Acceleration
• Translation caching for 2D page walks
Two-Dimensional PWC with Nested Translations (2D PWC+NT):
Augments 2D PWC with a dedicated GPA-to-SPA translation buffer, the
Nested TLB (NTLB), which is used to reduce the average number of
page entry references that take place during a 2D page walk.
The NTLB uses the guest physical address of the guest page entry to
cache the corresponding nL1 entry.
The page walk begins by accessing the NTLB with the guest physical
address of {G,gL4}; on a hit this produces the data of {nL1,gL4},
allowing nested references 1-4 to be skipped. On an NTLB hit, the
system physical address of {G,gL4} needed for the PWC access is
then calculated.
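The saving can be sketched by counting references, under the simplifying assumption that each nested walk (for gL4..gL1 and the final guest data GPA) can independently hit or miss in the NTLB:

```python
NESTED_LEVELS = 4   # nL4..nL1
GUEST_LEVELS = 4    # gL4..gL1

def walk_references(ntlb_hits):
    """ntlb_hits: five booleans, one per nested walk
    (for gL4..gL1 and the final guest data GPA)."""
    refs = 0
    for hit in ntlb_hits:
        if not hit:
            refs += NESTED_LEVELS   # full nested walk on an NTLB miss
    refs += GUEST_LEVELS            # the guest entry reads always occur
    return refs

assert walk_references([False] * 5) == 24   # all misses: full 2D walk
assert walk_references([True] * 5) == 4     # all hits: only guest entry reads
```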
Result
Benchmarks used in the following slides:
Result
The three hardware-only page walk
caching schemes improve performance
by turning page entry memory
hierarchy references into lower-latency
PWC accesses and, in the case of 2D
PWC+NT, by skipping some page entry
references entirely.
Result
Left side:
The G column is not skipped, so it
does not change; the same holds
for the gPA row.
gL1 is skipped in 2D_PWC+NT even
though it has a low reuse rate, so
it occupies a shorter span in
2D_PWC+NT than in 2D_PWC.
Right side:
The NTLB eliminates many of the
PWC accesses, but it does not
eliminate a significant portion of
the accesses that carry the
highest penalty.
Result
The first data column shows that L2 accesses incurred during a 2D page walk using the 2D
PWC+NT configuration generate 2.7-5.5 times more L2 misses than the native page walk.
This increase occurs primarily because the native page walk has fewer entries that are
difficult to cache (L1 and sometimes L2) compared to the 2D page walk ({G,gL1}, {nL1,gPA}
and sometimes {G,gL2}, {nL2,gPA}, {nL1,gL1}, and {nL2,gL1}).
The second data column shows the L2 cache miss percentage due only to page entries from
the 2D page walk. The miss percentages are relatively high because the PWC and NTLB have
already filtered out the easy-to-cache accesses, and the remaining accesses are difficult
to cache.
Result
The 8096 w/ {G,gL1}
configuration is unique
in that it also writes the gL1
guest page entry to the
PWC.
Result
Large pages allow the TLB to cover a larger data region with fewer
translations, which leads to fewer TLB misses (the nL1 references for the
gPA, gL1, gL2, gL3, and gL4 levels are all eliminated).
The ability to eliminate poor-locality references, like {nL1,gL1} and {nL1,gPA},
reduces the number of L2 cache misses by 60%-64%.
Conclusion
Nested paging is a hardware technique that reduces the complexity of
software memory management during system virtualization. Nested
page tables, which map GPA to SPA, combine with the guest page tables,
resulting in a two-dimensional (2D) page walk that can be accelerated by
caching schemes such as 2D PWC and 2D PWC+NT.
A hypervisor is no longer required to trap on all guest page table
updates, so significant virtualization overhead is eliminated. However,
nested paging can introduce new overhead due to the increase in page
entry references.
Therefore, nested paging improves the overall performance of a
virtualized system when the eliminated hypervisor memory management
overhead is greater than the new 2D page walk overhead.