
The Memory Manager in Windows Server
2003 and Windows Vista
Landy Wang
Software Design Engineer
Windows Kernel Team
Microsoft Corporation
Outline
Memory Manager (MM) improvements in
Windows Server 2003 SP1
64-bit Windows features and enhancements
NUMA and large page support added
Performance enhancements
Support for No Execute (NX) capability
© 2005 Microsoft Corporation
Outline (cont’d)
Memory Manager Improvements Planned
for Windows Vista
Dynamic system address space
Kernel page table pages allocated on demand
Support for very large registries
NUMA and large page support enhancements
Advanced video model support
I/O and Section Access Improvements
Performance Improvements
Terminal Server improvements
Robustness and Diagnosability Improvements
Server 2003 SP1 – 64-bit Windows
Windows 64-bit memory
8TB user address space
8TB kernel address space
128GB pools
128GB system page table entries (PTEs)
1TB system cache
Support for x64 platform added
4GB virtual address space added for 32-bit large
address space aware applications
Further increases performance of the WOW64 layer on both
Itanium and x64 systems
Server 2003 SP1 – NUMA & Large Page Support
Large page support added for user images and
pagefile-backed sections
Large pages now also used in 32-bit, even when
booted with /3GB switch, for
Page Frame Number (PFN) database
Initial non-paged pool
Prior large page support (added in Server 2003)
was for the following
User private memory
Device driver image mappings
Kernel, when not booted with /3GB switch
Server 2003 SP1 – NUMA & Large Page Support
Pages zeroed in parallel and in a node aware
fashion during boot up
Reduces boot time on large NUMA systems
Physical pages initially consumed in top-down
order, instead of bottom-up
Keeps more pages below 4GB available for drivers that require it
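The top-down ordering can be sketched as a toy page allocator. Everything here is invented for illustration (the real PFN database logic is far more involved); the point is only that general allocations drain high frames first, leaving low frames for drivers with 32-bit DMA limits:

```python
# Toy model of top-down physical page allocation (illustrative only).
PAGE_SIZE = 4096
BOUNDARY_4GB_PFN = (4 << 30) // PAGE_SIZE  # first page frame at/above 4GB

class PfnAllocator:
    def __init__(self, total_pages):
        self.free = list(range(total_pages))  # free PFNs, ascending

    def alloc_any(self):
        """General allocations take the highest free frame (top-down)."""
        return self.free.pop() if self.free else None

    def alloc_low(self):
        """Driver allocations that require a frame below the 4GB line."""
        for i, pfn in enumerate(self.free):
            if pfn < BOUNDARY_4GB_PFN:
                return self.free.pop(i)
        return None
```

Because `alloc_any` consumes from the top, a machine with RAM above 4GB keeps its low frames in reserve far longer than a bottom-up allocator would.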
Server 2003 SP1 – Performance Enhancements
Working set management performance
increases, especially in
Areas of large shared memory and when booted with
/3GB switch
Premature self-trimming and linear hash table
walks eliminated
Major perf increases for apps like Exchange & SAP
Server 2003 SP1 – Performance Enhancements
Pool tagging paths parallelized
Introduced shared acquire mode for spinlocks
Employing for tag table updates
Expand hash table for tagging large pages
When we detect searches are occurring, instead of
waiting for the table to be entirely filled
Overlapped asynchronous flushing for user
requests to maximize I/O throughput
Pagefiles zeroed in parallel instead of serially
Faster shutdown when “zero my pagefile” is set
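A minimal sketch of the parallel-zeroing idea, with pagefiles modeled as in-memory buffers (names and structure are assumptions, not the actual kernel code):

```python
from concurrent.futures import ThreadPoolExecutor

def zero_pagefiles_parallel(pagefiles):
    """Zero each pagefile (modeled here as a bytearray) on its own
    worker concurrently, instead of serially one after another."""
    def zero(buf):
        buf[:] = bytes(len(buf))  # overwrite the whole buffer with zeros
    with ThreadPoolExecutor(max_workers=len(pagefiles)) as pool:
        list(pool.map(zero, pagefiles))  # drain the iterator to wait for all
```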
Server 2003 SP1 – Performance Enhancements
Per-process working set lock used to synchronize
PTE updates and working set list changes to an
address space
System, session or process
This lock converted from a mutex to a pushlock
Pushlocks support both shared and exclusive acquire modes
Mutexes support only exclusive acquisitions
In conjunction with 2-byte interlocked operations this
allows parallelization of many operations
Completely removed the PFN lock acquisition from this very hot routine
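The shared-versus-exclusive distinction that makes the pushlock conversion pay off can be modeled with a small reader/writer lock. This is an illustrative analogy only: the real pushlock is a single machine word with stack-allocated wait blocks, not a condition variable.

```python
import threading

class ShareableLock:
    """Toy shared/exclusive lock: many shared holders OR one exclusive
    holder. A mutex allows only the exclusive mode, so converting the
    working set lock to this shape lets read-mostly paths run in parallel."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_shared(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```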
Server 2003 SP1 – Performance Enhancements
Major PFN lock reduction to improve scalability
Reducing time held
Replacing acquisitions with lock-free or alternative lock
sequences in many places & APIs
Translation look-aside buffer (TLB) optimizations
Server 2003 SP1 – Other MM Enhancements
Support for no execute (NX) capability
New Win32 SetThreadStackGuarantee API
Allows user applications & the CLR to specify
guaranteed stack space requirements
Requirements honored even in low resource scenarios
Support for hot-patching a running system
Patch system without reboot to reduce down time
Backported to Windows XP SP2
Windows Vista – Dynamic Address Space
System virtual address (VA) space allocated on-demand
Instead of at boot time, based on registry & configuration information
Region sizes bounded only by VA limitations
Applies to non-paged, paged, session space, mapped views, etc.
Kernel page tables allocated on demand
No longer preallocated at system boot, saves
1.5MB on x86 systems
3MB on PAE systems
16MB to 2.5GB on 64-bit machines
Boot with very large registries on 32-bit machines
With and without /3GB switch
Important for large multipath LUN machines
MM locates registry VA space used by boot loader & reuses it as
dynamic kernel virtual address space
Key Benefits of Dynamic Address Space
No registry editing & reboots to reconfigure systems due
to resource imbalances
Maximum resources available in wide range of scenarios,
w/ no human intervention
Desktop heap exhaustion
Terminal Server maximum scaling
Large video clients
/3GB SQL and Exchange machines
HTTP servers, NFS servers, etc.
Features enabled w/o reboot, yet have no cost if not used
64-bit systems grow to maximum limit regardless of
underlying physical configuration
128GB paged pool, nonpaged pool
1TB system cache/system PTEs/special pool
128GB session pool
128GB session views (desktop heaps), etc
Windows Vista – Planned Enhancements for
NUMA, Large System, Large Page Support
Initial nonpaged pool now NUMA aware, with separate VA
ranges for each node
Per-node look-asides for full pages
Page table allocation for system PTEs, the system cache, etc.
distributed across nodes
More even locality
Avoids exhausting free pages from the boot node
NUMA-related APIs for device drivers
Default if no node is specified has been changed
From current processor to the thread’s ideal processor
Zeroing of pages for these APIs now bounds the number of zeroing threads more tightly
Windows Vista – Planned Enhancements for
NUMA, Large System, Large Page Support
Win32® APIs that specify nodes for allocations & mapped
views on per VAD & per section basis
Scalable query
Higher perf for very physically sparse machines
Example: Hewlett-Packard Superdome
1TB gaps between chunks of physical memory
PFN database & initial nonpaged pool always mapped with
large pages regardless of physical memory sparseness
Windows Vista – Planned Enhancements for
NUMA, Large System, Large Page Support
Booting in /3GB mode on 32-bit systems now supports
up to 64GB of RAM instead of just 16GB
Booting without /3GB on 32-bit systems continues to
support up to 128GB of RAM
Windows Vista – Planned Enhancements for
NUMA, Large System, Large Page Support
Much faster large page allocations in
kernel & user
Support for cache-aligned pool
allocation directives
Data structures describing non-paged pool free
list converted from linked list to bitmap
Reduced lock contention by over 50%
Bitmaps can be searched opportunistically lock-free
Costly combining of adjacent allocations on free no
longer necessary
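A toy model of the bitmap conversion (invented names; the real nonpaged pool allocator is far more elaborate): free pages are tracked as clear bits, allocation is a scan for a run of clear bits, and freeing is just clearing bits, which is why coalescing adjacent free blocks disappears as a step.

```python
class PoolBitmap:
    """Toy bitmap of pool pages: bit set = page in use, clear = free."""
    def __init__(self, pages):
        self.bits = [0] * pages

    def alloc(self, n):
        """Claim a run of n free pages; return start index, or None."""
        run = 0
        for i, b in enumerate(self.bits):
            run = run + 1 if b == 0 else 0
            if run == n:
                start = i - n + 1
                for j in range(start, i + 1):
                    self.bits[j] = 1
                return start
        return None

    def free(self, start, n):
        """Freeing just clears bits; adjacent free runs merge implicitly."""
        for j in range(start, start + n):
            self.bits[j] = 0
```

The scan itself touches no links, so (unlike popping from a shared linked list) it can be attempted opportunistically without holding the pool lock, taking the lock only to claim the run it found.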
Windows Vista – New Video Model Support
Dramatically different video architecture
in Windows Vista
More fully exploits modern GPUs & virtual memory
MM provides new mapping type
Rotate virtual address descriptors (VADs)
Allow video drivers to quickly switch user views from regular
application memory into cached, non-cached, or write-combined
AGP or video RAM mappings
Allows video architecture to use GPU to rotate unneeded clients in
and out on demand
First time Windows-based OS has supported fully
pageable mappings w/ arbitrary cache attributes
Windows Vista – I/O Section Access Improvements
Pervasive prefetch-style clustering for all types of page
faults and system cache read ahead
Major benefits over previous clustering
Infinite size read ahead instead of 64KB max
Dummy page usage
So a single large I/O is always issued regardless of valid pages
encountered in the cluster
Pages for the I/O are put in transition (not valid)
No VA space is required
If the pages are not subsequently referenced, no working set trim and
TLB flush is needed either
Further emphasizes that driver writers must be aware
that MDL pages can have their contents change!
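The dummy-page trick can be sketched as follows (a toy model under assumed names; the real code builds an MDL of physical frames): pages that are already valid do not break the cluster into multiple I/Os, because their slots in the transfer are covered by a single throwaway dummy page.

```python
DUMMY = object()  # stand-in for the single systemwide dummy page

def build_cluster(valid_pages, start, count):
    """Build the page list for ONE contiguous read of `count` pages at
    `start`. Invalid pages get a real (transition) frame; already-valid
    pages are covered by the dummy page, so the device still sees a
    single contiguous transfer. `valid_pages` is a set of valid indices."""
    return [DUMMY if p in valid_pages else ("frame", p)
            for p in range(start, start + count)]
```

Because many I/Os can target the dummy page at once, its contents are garbage that may change at any moment, which is exactly why drivers must never assume MDL page contents are stable.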
Windows Vista – I/O Section Access Improvements
Significant changes in pagefile writing
Larger clusters up to 4GB
Align near neighbors
Sort by virtual address (VA)
Reduced fragmentation
Improved reads
Cache manager read ahead size limitations in thread
structure removed
Improved synchronization between cache manager
and memory manager data flushing to maximize
filesystem/disk throughput and efficiency
Windows Vista – I/O Section Access Improvements
Mapped file writing and file flushing
performance increases
Support for writes of any size up to 4GB instead of
previous 64KB limit per write
Multiple asynchronous flushes can be issued, both
internally and by the caller, to satisfy a single call
Pagefile fragmentation improvements
On dirty bit faults, we use interlocked queuing
operation to free the pagefile space of the
corresponding page
Avoids PFN lock acquisitions
Reduces needless pagefile fragmentation
Windows Vista – I/O Section Access Improvements
Elimination of pagefile writes and potential
subsequent re-reads of completely zero pages
Check pages at trim time to see if they are all zero
Optimization used to make this nearly free
User virtual address used to check for the first and last
ULONG_PTR being zero; if they both are, then
After the page is trimmed, and TLB invalidated, a kernel
mapping used to make the final check of the entire page
Avoids needless scans & TLB flushes
We’ve measured over 90% success rate with this algorithm
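The two-stage check can be sketched directly (a toy model with a page as a list of machine words; the real code works on the user VA and a kernel mapping):

```python
WORDS_PER_PAGE = 512  # a 4KB page viewed as 64-bit words

def probably_zero(page):
    """Cheap screen via the user VA before trimming: just the first
    and last word. Costs two reads, filters out most dirty pages."""
    return page[0] == 0 and page[-1] == 0

def is_zero_page(page):
    """Full check of the entire page, done via a kernel mapping only
    after the page is trimmed and the TLB invalidated."""
    return probably_zero(page) and all(w == 0 for w in page)
```

Only pages that pass the cheap screen pay for the full scan, which is what makes the optimization nearly free while still catching every all-zero page.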
Windows Vista – I/O Section Access Improvements
Access to large section performance increases
A subsection is the name of the data structure used
to describe on-disk file spans for sections
The subsection structure was converted
From a singly linked list (i.e., linear walk required)
To a balanced AVL tree
Enables huge performance gain for sections mapping
large files
User mappings & flushes, system cache mappings, flushes &
purges, section-based backups, etc
Mapped page writer does flushing based on a
sweep hand
Data is written out much sooner than the prior 5
minute “flush everything” model
Windows Vista – I/O Section Access Improvements
Dependencies between modified writer &
mapped writer removed to
Increase parallelism
Reduce filesystem deadlock rules
Provide the cache manager with a way to influence
which portions of files get written first
To optimize disk seek as well as avoiding valid data length
extension costs
Windows Vista – I/O Section Access Improvements
Core support for “Superfetch”
Enables significantly faster app launch by deciding which pages
should be prioritized
Provides mechanisms to pre-fetch pages and prevent
them from being trimmed prematurely
Per page priorities
Access bit tracing
Private page pre-fetching
Section (including pagefile-backed) pre-fetching
Windows Vista – Fast S4 Support
Hibernation converted to use memory management
mirroring facilities
Hibernation time reduced by 2x, with a 50% smaller hibernation file
Resume time reductions
Windows Vista – Internal Data Structure and
Algorithmic Performance Enhancements
Constant PFN lock time reduction always ongoing, has
included areas like
User address space trimming and deletion
Page allocations
The PFN sharecount now uses interlocked updates instead of requiring
the PFN lock, etc.
Page faults
Modified writes
Page color generation
MDL construction for fault I/Os, and so on
Translation look-aside buffer (TLB) optimizations
Windows Vista – Internal Data Structure and
Algorithmic Performance Enhancements
The per-process address space lock used to synchronize
creation/deletion/changes to user address spaces
This lock was converted from a mutex to a pushlock
Pushlocks support both shared and exclusive acquire modes
Mutexes support only exclusive acquisitions
Allowed parallelization of many operations like VirtualAlloc, etc
VirtualAlloc support has been revamped to reduce
Conventional (non-AWE) allocations by over 30%
AWE allocations by over 2500% (not a typo)
Address Windowing Extension (AWE) non-zeroed
allocations are >10x faster than in SP1
Can now therefore be used for http responses, for example
Windows Vista – Internal Data Structure and
Algorithmic Performance Enhancements
PFN database contains information about all physical
memory in the machine
In the past, whenever a new page was needed:
The PFN spinlock was acquired
New page removed from appropriate list chained through
PFN database
This has been improved by adding a zero and free page
SLIST for every NUMA node and page color
Now obtain the page without needing the PFN lock in
many instances where we need a single page
Demand zero faults, copy on write faults, etc
For example, the fault processing path length is cut in half
Alleviates pressure on both the working set pushlock & PFN lock
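A toy model of the per-(node, color) free lists (invented names; the real lists are interlocked SLISTs so a push or pop is a single lock-free operation):

```python
class FreePageLists:
    """One free-page stack per (NUMA node, page color). A single-page
    request pops from exactly the right stack with no global PFN lock;
    an empty stack falls back to the locked general path (not modeled)."""
    def __init__(self, nodes, colors):
        self.colors = colors
        self.stacks = {(n, c): [] for n in range(nodes)
                                  for c in range(colors)}

    def free_page(self, pfn, node):
        """Return a page to its node's stack for its color class."""
        self.stacks[(node, pfn % self.colors)].append(pfn)

    def alloc_page(self, node, color):
        """Pop a single page lock-free, or None to signal the slow path."""
        s = self.stacks[(node, color)]
        return s.pop() if s else None
```

Demand-zero and copy-on-write faults need exactly one page, so this fast path is hit constantly, which is why it roughly halves the fault processing path length.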
Windows Vista – Terminal Services Improvements
Added Terminal Server session objects
Enables various components to have secure session IDs and
implement compartment IDs, for example
Major overhaul of Terminal Server global-per-session
image support
Eliminated multiple image control areas
To provide single image cache & fix flush/purge/truncate races
Only the shared subsections themselves are now per-session, instead
of the entire image
Shared subsections use an AVL tree instead of a linked list, for faster lookup
Support for hot-patching of session-space drivers
64-bit Windows uses demand zero pages instead of pool
for WOW64 page table bitmaps
Windows Vista – Additional Robustness
and Diagnosability
Capability to mark system cache views as read only
Used by the Registry to protect views from inadvertent driver writes
Reduced data loss in the face of crashes
Flush all modified data to its backing store (local & remote) if we
are going to bugcheck due to a failed inpage
Only failed inpages of kernel and/or drivers are fatal
Failed inpages of user process code/data merely result in an inpage
exception being handed to the application
Commit thresholds now reflected in global named events
Apps can use this to monitor the system
Windows Vista – Additional Robustness
and Diagnosability
.pagein debugger support for kernel/driver
addresses added
Allows for viewing memory addresses which have
been paged out to disk when debugging crashes
Call to Action
Consider these significant Memory Manager
enhancements as you develop drivers for
Windows Server 2003 and Windows Vista
Use new APIs when available in Windows Vista
© 2005 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.