Phil Pennington, [email protected], Microsoft, WSV317

What Will You Look For?
- Overall solution scalability
- Your application: speed-up vs. cores
- [Chart: measured speedup vs. ideal linear speedup over 1-32 cores (Speedup vs. Number of Cores); the measured curve climbs through 1.00, 1.47, 2.57, 4.87, 7.44, 8.59 and then flattens near 8.3x while the ideal line keeps rising.]

Agenda
- Windows Server 2008 R2
- New NUMA APIs
- New User-Mode Scheduling APIs
- New C++ Concurrency Runtime

Example NUMA Hardware Today
- A 256 logical processor system, HP Superdome: 64 dual-core, hyper-threaded "Montvale" 1.6 GHz Itanium2
- A 64 logical processor system, Unisys ES7000: 32 dual-core, hyper-threaded "Tulsa" 3.4 GHz Xeon

NUMA Hardware Tomorrow
- 2, 4, 8 cores per socket on "commodity" CPU architectures
- [Diagram: four interconnected Nehalem sockets with I/O hubs and PCI Express*]
- Expect systems with 128-256 logical processors

NUMA Node Groups
- New with Windows 7 and Windows Server 2008 R2
- Hierarchy: group → NUMA node → socket → core → logical processor (LP)

NUMA Node Groups Example
- 2 groups, 4 nodes, 8 sockets, 32 cores, 4 LPs/core = 128 LPs
- [Diagram: each group contains 2 NUMA nodes; each node contains 2 sockets of 4 cores; each core exposes 4 LPs]

Sample SQL Server Scaling, 64P to 128P
- [Chart: moving from 64 to 128 processors yields roughly 1.3x-1.7x gains]

Bad Case: Disk Write, Software and Hardware Locality NOT Optimal
- [Diagram: the I/O initiator and I/O buffer home sit on one NUMA node while the ISR and DPC run on processors of the other node; both nodes' processors are locked out during I/O initiation, and the transfer between MemA/DiskA and MemB/DiskB crosses the node interconnect.]

Windows Server 2008 R2 Optimization for NUMA Topology
- [Diagram: the ISR and DPC for each disk are delivered on the node that initiated the I/O, so buffers, caches, memory, and disk traffic stay node-local and the interconnect is avoided.]

NUMA-Aware Applications
- Non-Uniform Memory Architecture: minimize contention, maximize locality
- Apps scaling beyond even 8-16 logical processors should be NUMA-aware
- A process or thread can set a preferred NUMA node (illustrated in the sketch below)
- Use the node-group scheme for task or process partitioning
- Performance-optimize within node groups

NUMA APIs: "Minimize Contention and Maximize Locality"
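To ground the guidance above, here is a minimal sketch, not taken from the talk, of the Windows 7 / Windows Server 2008 R2 processor-group and NUMA APIs: it enumerates groups and nodes, binds the calling thread to one node, and allocates memory that prefers that node. The choice of node 0, the 1 MB buffer, and the thin error handling are illustrative assumptions.

```cpp
// Minimal sketch (illustrative, not from the talk): enumerate processor
// groups and NUMA nodes, bind the calling thread to one node, and allocate
// memory that prefers that node's physical pages.
#define _WIN32_WINNT 0x0601   // Windows 7 / Windows Server 2008 R2 APIs
#include <windows.h>
#include <stdio.h>

int main()
{
    WORD groups = GetActiveProcessorGroupCount();
    ULONG highestNode = 0;
    GetNumaHighestNodeNumber(&highestNode);
    printf("%u processor group(s), %lu NUMA node(s)\n",
           (unsigned)groups, highestNode + 1);

    for (WORD g = 0; g < groups; ++g)
        printf("  group %u: %lu active logical processors\n",
               (unsigned)g, GetActiveProcessorCount(g));

    // Bind this thread to the processors of one node (node 0 is an
    // arbitrary choice for the sketch) so its work stays node-local.
    USHORT node = 0;
    GROUP_AFFINITY nodeAffinity = {};
    if (GetNumaNodeProcessorMaskEx(node, &nodeAffinity))
    {
        GROUP_AFFINITY previous = {};
        SetThreadGroupAffinity(GetCurrentThread(), &nodeAffinity, &previous);
    }

    // Commit a 1 MB buffer whose pages should come from the same node,
    // so the thread we just bound touches node-local memory.
    void* buffer = VirtualAllocExNuma(GetCurrentProcess(), NULL, 1 << 20,
                                      MEM_RESERVE | MEM_COMMIT,
                                      PAGE_READWRITE, node);
    if (buffer != NULL)
    {
        // ... node-local work would go here ...
        VirtualFree(buffer, 0, MEM_RELEASE);
    }
    return 0;
}
```

The same pattern extends to the node-group partitioning the slides recommend: give each partition its own threads bound with SetThreadGroupAffinity and its own VirtualAllocExNuma buffers, so contention stays inside a node.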
Agenda
- Windows Server 2008 R2
- New NUMA APIs
- New User-Mode Scheduling APIs
- New C++ Concurrency Runtime

User Mode Scheduling (UMS): System Call Servicing
- [Diagram: user-mode threads UT(1)-UT(4) are multiplexed onto primary threads UT(P1)/UT(P2) on cores 1 and 2, each with a kernel backing thread KT(1)-KT(4) plus KT(P1)/KT(P2) for the primaries. On a SYSCALL the request migrates to the appropriate backing KT; if it blocks, the kernel wakes the primary to regain the core, the user-mode scheduler (USched) runs the next thread from its ready list, and unblocked threads return via the UMS completion list.]

User-Mode Context Switch
- Benefit: lower context-switch time means scheduling finer-grained items (UMS-based yield: 370 cycles; signal-and-wait: 2600 cycles)
- Direct impact: synchronization-heavy, fine-grained work speeds up
- Indirect impact: finer grains mean more workloads become candidates for parallelization

Getting the Processor Back
- Benefit: the scheduler keeps control of the processor when work blocks in the kernel
- Direct impact: more deterministic scheduling and better use of a thread's quantum
- Indirect impact: better cache locality when algorithmic libraries take advantage of the determinism to manage available resources

Agenda
- Windows Server 2008 R2
- New NUMA APIs
- New User-Mode Scheduling APIs
- New C++ Concurrency Runtime

Visual Studio 2010: Tools, Programming Models, Runtimes
- Tools: profiler and concurrency analyzer; parallel debugger
- Programming models: PLINQ, Task Parallel Library, and data structures (managed); Parallel Pattern Library, Agents Library, and data structures (native)
- Concurrency runtime: thread pool and task scheduler over a resource manager (managed); task scheduler over a resource manager (native)
- Operating system: threads / UMS

Task Scheduling
- Tasks are run by worker threads (WT0-WT3), which the scheduler controls
- [Diagram: without UMS (signal-and-wait), blocked workers leave a "dead zone" on the cores; with UMS (UMS yield), the scheduler immediately runs other work on those cores]

User-Mode Scheduling APIs and the C++ Concurrency Runtime: "Cooperative Thread-Scheduling"
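To make the Concurrency Runtime material concrete, here is a small Parallel Patterns Library sketch, not taken from the talk: a task_group for coarse-grained tasks and parallel_for with combinable<> for a data-parallel reduction. The data set and the summation are placeholder work chosen for illustration.

```cpp
// Minimal PPL sketch (illustrative, not from the talk): task_group for
// coarse-grained tasks, parallel_for + combinable<> for a data-parallel
// reduction with per-worker partial sums.
#include <ppl.h>
#include <iostream>
#include <vector>

using namespace Concurrency;

int main()
{
    // Coarse-grained tasks: run two independent pieces of work, then wait.
    task_group tasks;
    tasks.run([] { /* e.g., prepare one partition of the data */ });
    tasks.run([] { /* e.g., prepare another partition */ });
    tasks.wait();

    // Data parallelism: each worker accumulates into its own combinable
    // slot, so the hot counter stays in that worker's cache and there is
    // no shared-variable contention until the final combine.
    std::vector<int> data(1000000, 1);   // placeholder data
    combinable<long long> partial;
    parallel_for(size_t(0), data.size(), [&](size_t i) {
        partial.local() += data[i];
    });
    long long total = partial.combine(
        [](long long a, long long b) { return a + b; });

    std::cout << "sum = " << total << std::endl;
    return 0;
}
```

The runtime's task scheduler and resource manager (the stack shown above) decide how many workers to use and where they run, which is how the NUMA and UMS benefits arrive implicitly.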
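Most applications are not expected to call the raw UMS APIs (CreateUmsCompletionList, EnterUmsSchedulingMode, UmsThreadYield, ExecuteUmsThread, and related functions) directly; the talk's guidance is to let the Concurrency Runtime do that. The sketch below, also not from the talk, shows the Visual Studio 2010 SchedulerPolicy route for requesting a UMS-backed scheduler. UMS workers require x64 Windows 7 / Windows Server 2008 R2; elsewhere the runtime should fall back to ordinary Win32 threads.

```cpp
// Sketch (illustrative, not from the talk): asking the Visual Studio 2010
// Concurrency Runtime for a UMS-backed scheduler via SchedulerPolicy.
#include <concrt.h>
#include <ppl.h>
#include <iostream>

using namespace Concurrency;

int main()
{
    // Policy: prefer a UMS-based scheduler and let it use all resources.
    SchedulerPolicy policy(3,
                           SchedulerKind,  UmsThreadDefault,
                           MinConcurrency, 1,
                           MaxConcurrency, MaxExecutionResources);

    // Create a scheduler with that policy and attach it to this context;
    // subsequent PPL work (parallel_for, task_group, ...) runs on it.
    Scheduler* scheduler = Scheduler::Create(policy);
    scheduler->Attach();

    parallel_for(0, 16, [](int i) {
        // Work items are scheduled cooperatively; if one blocks in the
        // kernel, the UMS-aware scheduler can run another item on the core.
        std::cout << '.';
    });
    std::cout << std::endl;

    CurrentScheduler::Detach();
    scheduler->Release();
    return 0;
}
```

Whether a UMS-backed scheduler pays off is workload-dependent; the gap quoted earlier in the deck (a 370-cycle UMS yield vs. 2600 cycles for signal-and-wait) is the kind of difference it exploits for fine-grained, synchronization-heavy work.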
Summary / Call to Action
- Consider how your solution will scale on NUMA systems
- Use the NUMA APIs to maximize node locality
- Leverage UMS for custom user-mode thread scheduling
- Use the C++ Concurrency Runtime for most native parallel computing scenarios and gain the benefits of NUMA/UMS implicitly

Resources
- MSDN Concurrency Dev Center: http://msdn.microsoft.com/concurrency
- MSDN Channel9: http://channel9.msdn.com/tags/w2k8r2
- MSDN Code Gallery: http://code.msdn.microsoft.com/w2k8r2
- MSDN Server Dev Center: http://msdn.microsoft.com/en-us/windowsserver
- 64+ LP and NUMA API support: http://code.msdn.microsoft.com/64plusLP and http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx
- Dev-team blogs: http://blogs.msdn.com/pfxteam and http://blogs.technet.com/winserverperformance

TechEd Resources
- Sessions on-demand and community: www.microsoft.com/teched
- Microsoft certification and training resources: www.microsoft.com/learning
- Resources for IT professionals: http://microsoft.com/technet
- Resources for developers: http://microsoft.com/msdn

Related Content
- DTL203 "The Manycore Shift: Making Parallel Computing Mainstream", Monday 5/11, 2:45-4:00, Room 404, Stephen Toub
- DTL06-INT "Task-Based Parallel Programming with the Microsoft .NET Framework 4", Thursday 5/14, 1:00-2:15, Blue Thr 2, Stephen Toub
- DTL403 "Microsoft Visual C++ Library, Language, and IDE: Now and Next", Thursday 5/14, 4:30-5:45, Room 408A, Kate Gregory
- DTL310 "Parallel Computing with Native C++ in Microsoft Visual Studio 2010", Friday 5/15, 2:45-4:00, Room 515A, Josh Phillips

Windows Server Resources
- Make sure you pick up your copy of Windows Server 2008 R2 RC from the Materials Distribution Counter
- Learn more about Windows Server 2008 R2: www.microsoft.com/WindowsServer2008R2
- Technical Learning Center (Orange Section): highlighting Windows Server 2008 and R2 technologies, with over 15 booths and experts from Microsoft and our partners
- Complete an evaluation on CommNet and enter to win!

© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.