
The Memory Manager in Windows Server
2003 and Windows Vista
Landy Wang
Software Design Engineer
Windows Kernel Team
Microsoft Corporation
Outline
Memory Manager (MM) improvements in
Windows Server 2003 SP1
64-bit Windows features and enhancements
NUMA and large page support added
Performance enhancements
Support for No Execute (NX) capability
© 2005 Microsoft Corporation
Outline (cont’d)
Memory Manager Improvements Planned
for Windows Vista
Dynamic system address space
Kernel page table pages allocated on demand
Support for very large registries
NUMA and large page support enhancements
Advanced video model support
I/O and Section Access Improvements
Performance Improvements
Terminal Server improvements
Robustness and Diagnosability Improvements
Server 2003 SP1 – 64-bit Windows
Windows 64-bit memory
8TB user address space
8TB kernel address space
128GB pools
128GB system page table entries (PTEs)
1TB system cache
Support for x64 platform added
4GB virtual address space added for 32-bit large
address space aware applications
Further increases performance of the WOW64 layer on both
Itanium and x64 systems
Server 2003 SP1 – NUMA & Large Page Support
Large page support added for user images and
pagefile-backed sections
Large pages now also used in 32-bit, even when
booted with /3GB switch, for
Page Frame Number (PFN) database
Initial non-paged pool
Prior large page support (added in Server 2003)
was for the following
User private memory
Device driver image mappings
Kernel, when not booted with /3GB switch
Server 2003 SP1 – NUMA & Large Page Support
Pages zeroed in parallel and in a node aware
fashion during boot up
Reduces boot time on large NUMA systems
Physical pages initially consumed in top-down
order, instead of bottom-up
Keeps more pages below 4GB available for drivers that require it
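The top-down ordering can be sketched as a toy page allocator. Everything here is invented for illustration (the real PFN database logic is far more involved); the point is only that general allocations drain high frames first, leaving low frames for drivers with 32-bit DMA limits:

```python
# Toy model of top-down physical page allocation (illustrative only).
PAGE_SIZE = 4096
BOUNDARY_4GB_PFN = (4 << 30) // PAGE_SIZE  # first page frame at/above 4GB

class PfnAllocator:
    def __init__(self, total_pages):
        self.free = list(range(total_pages))  # free PFNs, ascending

    def alloc_any(self):
        """General allocations take the highest free frame (top-down)."""
        return self.free.pop() if self.free else None

    def alloc_low(self):
        """Driver allocations that require a frame below the 4GB line."""
        for i, pfn in enumerate(self.free):
            if pfn < BOUNDARY_4GB_PFN:
                return self.free.pop(i)
        return None
```

Because `alloc_any` consumes from the top, a machine with RAM above 4GB keeps its low frames in reserve far longer than a bottom-up allocator would.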
Server 2003 SP1 – Performance Enhancements
Working set management performance
increases, especially in
Areas of large shared memory and when booted with
/3GB switch
Premature self-trimming and linear hash table
walks eliminated
Major perf increases for apps like Exchange & SAP
Server 2003 SP1 – Performance Enhancements
Pool tagging paths parallelized
Introduced shared acquire mode for spinlocks
Employing for tag table updates
Expand hash table for tagging large pages
When we detect searches are occurring, instead of
waiting for the table to be entirely filled
Overlapped asynchronous flushing for user
requests to maximize I/O throughput
Pagefiles zeroed in parallel instead of serially
Faster shutdown when “zero my pagefile” is set
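A minimal sketch of the parallel-zeroing idea, with pagefiles modeled as in-memory buffers (names and structure are assumptions, not the actual kernel code):

```python
from concurrent.futures import ThreadPoolExecutor

def zero_pagefiles_parallel(pagefiles):
    """Zero each pagefile (modeled here as a bytearray) on its own
    worker concurrently, instead of serially one after another."""
    def zero(buf):
        buf[:] = bytes(len(buf))  # overwrite the whole buffer with zeros
    with ThreadPoolExecutor(max_workers=len(pagefiles)) as pool:
        list(pool.map(zero, pagefiles))  # drain the iterator to wait for all
```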
Server 2003 SP1 – Performance Enhancements
Per-process working set lock used to synchronize
PTE updates and working set list changes to an
address space
System, session or process
This lock converted from a mutex to a pushlock
Pushlocks support both shared and exclusive acquire modes
Mutexes support only exclusive acquisitions
In conjunction with 2-byte interlocked operations this
allows parallelization of many operations
Completely removed the PFN lock acquisition from this very hot routine
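The shared-versus-exclusive distinction that makes the pushlock conversion pay off can be modeled with a small reader/writer lock. This is an illustrative analogy only: the real pushlock is a single machine word with stack-allocated wait blocks, not a condition variable.

```python
import threading

class ShareableLock:
    """Toy shared/exclusive lock: many shared holders OR one exclusive
    holder. A mutex allows only the exclusive mode, so converting the
    working set lock to this shape lets read-mostly paths run in parallel."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_shared(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```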
Server 2003 SP1 – Performance Enhancements
Major PFN lock reduction to improve scalability
Reducing time held
Replacing acquisitions with lock-free or alternative lock
sequences in many places & APIs
Translation look-aside buffer (TLB) optimizations
Server 2003 SP1 – Other MM Enhancements
Support for no execute (NX) capability
New Win32 SetThreadStackGuarantee API
Allows user applications & the CLR to specify
guaranteed stack space requirements
Requirements honored even in low resource scenarios
Support for hot-patching a running system
Patch system without reboot to reduce down time
Backported to Windows XP SP2
Windows Vista – Dynamic Address Space
System virtual address (VA) space allocated on-demand
Instead of at boot time, based on registry & configuration information
Region sizes bounded only by VA limitations
Applies to non-paged, paged, session space, mapped views, etc.
Kernel page tables allocated on demand
No longer preallocated at system boot, saves
1.5MB on x86 systems
3MB on PAE systems
16MB to 2.5GB on 64-bit machines
Boot with very large registries on 32-bit machines
With and without /3GB switch
Important for large multipath LUN machines
MM locates registry VA space used by boot loader & reuses it as
dynamic kernel virtual address space
Key Benefits of Dynamic Address Space
No registry editing & reboots to reconfigure systems due
to resource imbalances
Maximum resources available in wide range of scenarios,
w/ no human intervention
Desktop heap exhaustion
Terminal Server maximum scaling
Large video clients
/3GB SQL and Exchange machines
HTTP servers, NFS servers, etc.
Features enabled w/o reboot, yet have no cost if not used
64-bit systems grow to maximum limit regardless of
underlying physical configuration
128GB paged pool, nonpaged pool
1TB system cache/system PTEs/special pool
128GB session pool
128GB session views (desktop heaps), etc
Windows Vista – Planned Enhancements for
NUMA, Large System, Large Page Support
Initial nonpaged pool now NUMA aware, with separate VA
ranges for each node
Per-node look-asides for full pages
Page table allocation for system PTEs, the system cache, etc.
distributed across nodes
More even locality
Avoids exhausting free pages from the boot node
NUMA-related APIs for device drivers
Default if no node is specified has been changed
From current processor to the thread’s ideal processor
Zeroing of pages for these APIs now bounds the number of zeroing threads more tightly
Windows Vista – Planned Enhancements for
NUMA, Large System, Large Page Support
Win32® APIs that specify nodes for allocations & mapped
views on per VAD & per section basis
Scalable query
Higher perf for very physically sparse machines
Example: Hewlett-Packard Superdome
1TB gaps between chunks of physical memory
PFN database & initial nonpaged pool always mapped with
large pages regardless of physical memory sparseness
Windows Vista – Planned Enhancements for
NUMA, Large System, Large Page Support
Booting in /3GB mode on 32-bit systems now supports
up to 64GB of RAM instead of just 16GB
Booting without /3GB on 32-bit systems continues to
support up to 128GB of RAM
Windows Vista – Planned Enhancements for
NUMA, Large System, Large Page Support
Much faster large page allocations in
kernel & user
Support for cache-aligned pool
allocation directives
Data structures describing non-paged pool free
list converted from linked list to bitmap
Reduced lock contention by over 50%
Bitmaps can be searched opportunistically lock-free
Costly combining of adjacent allocations on free no
longer necessary
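A toy model of the bitmap conversion (invented names; the real nonpaged pool allocator is far more elaborate): free pages are tracked as clear bits, allocation is a scan for a run of clear bits, and freeing is just clearing bits, which is why coalescing adjacent free blocks disappears as a step.

```python
class PoolBitmap:
    """Toy bitmap of pool pages: bit set = page in use, clear = free."""
    def __init__(self, pages):
        self.bits = [0] * pages

    def alloc(self, n):
        """Claim a run of n free pages; return start index, or None."""
        run = 0
        for i, b in enumerate(self.bits):
            run = run + 1 if b == 0 else 0
            if run == n:
                start = i - n + 1
                for j in range(start, i + 1):
                    self.bits[j] = 1
                return start
        return None

    def free(self, start, n):
        """Freeing just clears bits; adjacent free runs merge implicitly."""
        for j in range(start, start + n):
            self.bits[j] = 0
```

The scan itself touches no links, so (unlike popping from a shared linked list) it can be attempted opportunistically without holding the pool lock, taking the lock only to claim the run it found.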
Windows Vista – New Video Model Support
Dramatically different video architecture
in Windows Vista
More fully exploits modern GPUs & virtual memory
MM provides new mapping type
Rotate virtual address descriptors (VADs)
Allow video drivers to quickly switch user views from regular
application memory into cached, non-cached, or write-combined
AGP or video RAM mappings
Allows video architecture to use GPU to rotate unneeded clients in
and out on demand
First time Windows-based OS has supported fully
pageable mappings w/ arbitrary cache attributes
Windows Vista – I/O Section Access Improvements
Pervasive prefetch-style clustering for all types of page
faults and system cache read ahead
Major benefits over previous clustering
Infinite size read ahead instead of 64KB max
Dummy page usage
So a single large I/O is always issued regardless of valid pages
encountered in the cluster
Pages for the I/O are put in transition (not valid)
No VA space is required
If the pages are not subsequently referenced, no working set trim and
TLB flush is needed either
Further emphasizes that driver writers must be aware
that MDL pages can have their contents change!
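The dummy-page trick can be sketched as follows (a toy model under assumed names; the real code builds an MDL of physical frames): pages that are already valid do not break the cluster into multiple I/Os, because their slots in the transfer are covered by a single throwaway dummy page.

```python
DUMMY = object()  # stand-in for the single systemwide dummy page

def build_cluster(valid_pages, start, count):
    """Build the page list for ONE contiguous read of `count` pages at
    `start`. Invalid pages get a real (transition) frame; already-valid
    pages are covered by the dummy page, so the device still sees a
    single contiguous transfer. `valid_pages` is a set of valid indices."""
    return [DUMMY if p in valid_pages else ("frame", p)
            for p in range(start, start + count)]
```

Because many I/Os can target the dummy page at once, its contents are garbage that may change at any moment, which is exactly why drivers must never assume MDL page contents are stable.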
Windows Vista – I/O Section Access Improvements
Significant changes in pagefile writing
Larger clusters up to 4GB
Align near neighbors
Sort by virtual address (VA)
Reduced fragmentation
Improved reads
Cache manager read ahead size limitations in thread
structure removed
Improved synchronization between cache manager
and memory manager data flushing to maximize
filesystem/disk throughput and efficiency
Windows Vista – I/O Section Access Improvements
Mapped file writing and file flushing
performance increases
Support for writes of any size up to 4GB instead of
previous 64KB limit per write
Multiple asynchronous flushes can be issued, both
internally and by the caller, to satisfy a single call
Pagefile fragmentation improvements
On dirty bit faults, we use interlocked queuing
operation to free the pagefile space of the
corresponding page
Avoids PFN lock acquisitions
Reduces needless pagefile fragmentation
Windows Vista – I/O Section Access Improvements
Elimination of pagefile writes and potential
subsequent re-reads of completely zero pages
Check pages at trim time to see if they are all zero
Optimization used to make this nearly free
User virtual address used to check for the first and last
ULONG_PTR being zero; if they both are, then
After the page is trimmed, and TLB invalidated, a kernel
mapping used to make the final check of the entire page
Avoids needless scans & TLB flushes
We’ve measured over 90% success rate with this algorithm
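The two-stage check can be sketched directly (a toy model with a page as a list of machine words; the real code works on the user VA and a kernel mapping):

```python
WORDS_PER_PAGE = 512  # a 4KB page viewed as 64-bit words

def probably_zero(page):
    """Cheap screen via the user VA before trimming: just the first
    and last word. Costs two reads, filters out most dirty pages."""
    return page[0] == 0 and page[-1] == 0

def is_zero_page(page):
    """Full check of the entire page, done via a kernel mapping only
    after the page is trimmed and the TLB invalidated."""
    return probably_zero(page) and all(w == 0 for w in page)
```

Only pages that pass the cheap screen pay for the full scan, which is what makes the optimization nearly free while still catching every all-zero page.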
Windows Vista – I/O Section Access Improvements
Access to large section performance increases
A subsection is the name of the data structure used
to describe on-disk file spans for sections
The subsection structure was converted
From a singly linked list (i.e., linear walk required)
To a balanced AVL tree
Enables huge performance gain for sections mapping
large files
User mappings & flushes, system cache mappings, flushes &
purges, section-based backups, etc
Mapped page writer does flushing based on a
sweep hand
Data is written out much sooner than the prior 5
minute “flush everything” model
Windows Vista – I/O Section Access Improvements
Dependencies between modified writer &
mapped writer removed to
Increase parallelism
Reduce filesystem deadlock rules
Provide the cache manager with a way to influence
which portions of files get written first
To optimize disk seek as well as avoiding valid data length
extension costs
Windows Vista – I/O Section Access Improvements
Core support for “Superfetch”
Enables significantly faster app launch by deciding which pages
should be prioritized
Provides mechanisms to pre-fetch pages and prevent
them from being trimmed prematurely
Per page priorities
Access bit tracing
Private page pre-fetching
Section (including pagefile-backed) pre-fetching
Windows Vista – Fast S4 Support
Hibernation converted to use memory management
mirroring facilities
Hibernation time reduced by 2x, with a 50% smaller hibernation file
Resume time reductions
Windows Vista – Internal Data Structure and
Algorithmic Performance Enhancements
Constant PFN lock time reduction always ongoing, has
included areas like
User address space trimming and deletion
Page allocations
The PFN sharecount now uses interlocked updates instead of requiring
the PFN lock, etc.
Page faults
Modified writes
Page color generation
MDL construction for fault I/Os, and so on
Translation look-aside buffer (TLB) optimizations
Windows Vista – Internal Data Structure and
Algorithmic Performance Enhancements
The per-process address space lock used to synchronize
creation/deletion/changes to user address spaces
This lock was converted from a mutex to a pushlock
Pushlocks support both shared and exclusive acquire modes
Mutexes support only exclusive acquisitions
Allowed parallelization of many operations like VirtualAlloc, etc
VirtualAlloc support has been revamped to reduce
Conventional (non-AWE) allocations by over 30%
AWE allocations by over 2500% (not a typo)
Address Windowing Extension (AWE) non-zeroed
allocations are >10x faster than in SP1
Can now therefore be used for http responses, for example
Windows Vista – Internal Data Structure and
Algorithmic Performance Enhancements
PFN database contains information about all physical
memory in the machine
In the past, whenever a new page was needed:
The PFN spinlock was acquired
New page removed from appropriate list chained through
PFN database
This has been improved by adding a zero and free page
SLIST for every NUMA node and page color
Now obtain the page without needing the PFN lock in
many instances where we need a single page
Demand zero faults, copy on write faults, etc
For example, the fault processing path length is cut in half
Alleviates pressure on both the working set pushlock & PFN lock
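A toy model of the per-(node, color) free lists (invented names; the real lists are interlocked SLISTs so a push or pop is a single lock-free operation):

```python
class FreePageLists:
    """One free-page stack per (NUMA node, page color). A single-page
    request pops from exactly the right stack with no global PFN lock;
    an empty stack falls back to the locked general path (not modeled)."""
    def __init__(self, nodes, colors):
        self.colors = colors
        self.stacks = {(n, c): [] for n in range(nodes)
                                  for c in range(colors)}

    def free_page(self, pfn, node):
        """Return a page to its node's stack for its color class."""
        self.stacks[(node, pfn % self.colors)].append(pfn)

    def alloc_page(self, node, color):
        """Pop a single page lock-free, or None to signal the slow path."""
        s = self.stacks[(node, color)]
        return s.pop() if s else None
```

Demand-zero and copy-on-write faults need exactly one page, so this fast path is hit constantly, which is why it roughly halves the fault processing path length.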
Windows Vista – Terminal Services Improvements
Added Terminal Server session objects
Enables various components to have secure session IDs and
implement compartment IDs, for example
Major overhaul of Terminal Server global-per-session
image support
Eliminated multiple image control areas
To provide single image cache & fix flush/purge/truncate races
Only the shared subsections themselves are now per-session, instead
of the entire image
Shared subsections use an AVL tree instead of a linked list, for faster lookup
Support for hot-patching of session-space drivers
64-bit Windows uses demand zero pages instead of pool
for WOW64 page table bitmaps
Windows Vista – Additional Robustness
and Diagnosability
Capability to mark system cache views as read only
Used by the Registry to protect views from inadvertent driver writes
Reduced data loss in the face of crashes
Flush all modified data to its backing store (local & remote) if we
are going to bugcheck due to a failed inpage
Only failed inpages of kernel and/or drivers are fatal
Failed inpages of user process code/data merely result in an inpage
exception being handed to the application
Commit thresholds now reflected in global named events
Apps can use this to monitor the system
Windows Vista – Additional Robustness
and Diagnosability
.pagein debugger support for kernel/driver
addresses added
Allows for viewing memory addresses which have
been paged out to disk when debugging crashes
Call to Action
Consider these significant Memory Manager
enhancements as you develop drivers for
Windows Server 2003 and Windows Vista
Use new APIs when available in Windows Vista
© 2005 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.