Phil Pennington
[email protected]
Microsoft
WSV317
What will you look for?
Overall Solution Scalability
Your Application : SPEED-UP vs. CORES
[Chart: application speedup vs. number of cores, 1-32. The measured speedup values (1.00, 1.47, 2.57, 4.87, 7.44, 8.29, 8.59) fall progressively further below the ideal linear speedup line as the core count grows toward 32.]
Agenda
Windows Server 2008 R2
New NUMA APIs
New User-Mode Scheduling APIs
New C++ Concurrency Runtime
Example NUMA Hardware Today
A 256 Logical Processor System – HP SuperDome: 64 dual-core, hyper-threaded “Montvale” 1.6 GHz Itanium2
A 64 Logical Processor System – Unisys ES7000: 32 dual-core, hyper-threaded “Tulsa” 3.4 GHz Xeon
NUMA Hardware Tomorrow
2, 4, 8 Cores-per-Socket "Commodity" CPU Architectures
[Diagram: four Nehalem sockets interconnected, with two I/O Hubs providing PCI Express* connectivity.]
Expect systems with 128-256 logical processors
NUMA Node Groups
New with Win7 and R2
[Diagram: a processor GROUP contains NUMA NODEs; each node contains sockets, each socket contains cores, and each core contains logical processors (LPs).]
NUMA Node Groups
Example: 2 Groups, 4 Nodes, 8 Sockets, 32 Cores, 4 LPs/Core = 128 LPs
[Diagram: two Groups, each containing two NUMA Nodes; each node holds two sockets, each socket four cores, and each core four LPs, for 128 logical processors in total.]
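The group/node/socket/core hierarchy above can be discovered at run time with the group-aware Win32 APIs added in Windows 7 and Windows Server 2008 R2. A minimal sketch (not from the deck) that enumerates the processor groups and the NUMA nodes behind them:

```cpp
// Sketch: enumerate processor groups and NUMA nodes on Windows 7 / Server 2008 R2.
// Requires _WIN32_WINNT >= 0x0601 for the group-aware APIs.
#define _WIN32_WINNT 0x0601
#include <windows.h>
#include <cstdio>

int main()
{
    // Processor groups: each group holds at most 64 logical processors.
    WORD groupCount = GetActiveProcessorGroupCount();
    printf("Processor groups: %u\n", (unsigned)groupCount);
    for (WORD g = 0; g < groupCount; ++g)
        printf("  Group %u: %lu active LPs\n", (unsigned)g, GetActiveProcessorCount(g));

    // NUMA nodes and the group-relative affinity mask of each node.
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode))
        return 1;

    for (USHORT node = 0; node <= (USHORT)highestNode; ++node)
    {
        GROUP_AFFINITY affinity = {};
        if (GetNumaNodeProcessorMaskEx(node, &affinity))
            printf("  Node %u: group %u, mask 0x%llx\n",
                   (unsigned)node, (unsigned)affinity.Group,
                   (unsigned long long)affinity.Mask);
    }
    return 0;
}
```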
Sample SQL Server Scaling: 64P to 128P
[Chart: measured scaling factors of 1.7X and 1.3X when going from 64 to 128 processors.]
Bad Case Disk Write
Software and Hardware Locality NOT Optimal
[Diagram: a disk write in which the I/O initiator, the I/O buffer's home memory, the ISR, and the DPC sit on different NUMA nodes; other processors are locked out during I/O initiation, and steps (0)-(7) repeatedly cross the node interconnect between MemA/DiskA and MemB/DiskB.]
Windows Server 2008 R2
Optimization for NUMA Topology
[Diagram: the same disk write with Windows Server 2008 R2's NUMA optimizations; the ISR and DPC run on the node that initiated the I/O and owns the buffer, so interrupt servicing and memory traffic stay local instead of crossing the node interconnect.]
NUMA Aware Applications
Non-Uniform Memory Architecture
Minimize Contention, Maximize Locality
Apps scaling beyond even 8-16 logical processors should be NUMA aware
A process or thread can set a preferred NUMA node
Use the Node Group scheme for task or process partitioning
Performance-optimize within Node Groups
NUMA APIs (see the sketch below)
“Minimize Contention and Maximize Locality”
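A minimal sketch (not from the deck) of how a worker might apply these APIs: bind the current thread to a NUMA node and allocate memory whose preferred home is that node. The node number and buffer size are caller-supplied illustration values.

```cpp
// Sketch: pin the current thread to a NUMA node and allocate node-local memory.
// Assumes Windows 7 / Server 2008 R2 or later (_WIN32_WINNT >= 0x0601).
#define _WIN32_WINNT 0x0601
#include <windows.h>

bool BindToNodeAndAllocate(USHORT node, SIZE_T bytes, void** buffer)
{
    // Translate the NUMA node into a group-relative processor mask...
    GROUP_AFFINITY nodeAffinity = {};
    if (!GetNumaNodeProcessorMaskEx(node, &nodeAffinity))
        return false;

    // ...and restrict the current thread to that node's processors.
    if (!SetThreadGroupAffinity(GetCurrentThread(), &nodeAffinity, nullptr))
        return false;

    // Allocate memory with the same node as its preferred home,
    // so the thread's accesses stay local.
    *buffer = VirtualAllocExNuma(GetCurrentProcess(), nullptr, bytes,
                                 MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                 node);
    return *buffer != nullptr;
}
```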
Agenda
Windows Server 2008 R2
New NUMA APIs
New User-Mode Scheduling APIs
New C++ Concurrency Runtime
User Mode Scheduling (UMS)
System Call Servicing
[Diagram: UMS system call servicing. User-mode threads (UTs) run on primary threads, one per core, each with a backing kernel thread (KT); when a UT issues a SYSCALL, the request migrates to the appropriate backing KT and the primary is woken to regain the core. Backing KTs are shown running, blocked, or parked, and UTs whose kernel work has finished flow through the UMS completion list onto the user scheduler's (USched) ready list.]
User Mode Context Switch
Benefit
Lower context-switch time means finer-grained items can be scheduled
UMS-based yield: 370 cycles
Signal-and-wait: 2,600 cycles
Direct impact: synchronization-heavy, fine-grained work speeds up
Indirect impact: finer grains mean more workloads are candidates for parallelization
Getting the Processor Back
Benefit
The scheduler keeps control of the processor when work blocks in the kernel
Direct impact: more deterministic scheduling and better use of a thread's quantum
Indirect impact: better cache locality when algorithmic libraries take advantage of the determinism to manage available resources
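The deck shows no code, but a minimal sketch of the UMS API shape (Windows 7 / Server 2008 R2, 64-bit only) makes the model concrete: the application converts a thread into a primary with EnterUmsSchedulingMode, and the system calls the scheduler procedure back whenever a user-mode thread starts, blocks, or yields. Worker-thread creation (CreateRemoteThreadEx with a UMS thread attribute) and error handling are omitted here; a real scheduler would also maintain its own ready queue.

```cpp
// Sketch: skeleton of a UMS primary-thread scheduler loop.
#define _WIN32_WINNT 0x0601
#include <windows.h>

static PUMS_COMPLETION_LIST g_completionList;

// Invoked whenever the primary thread regains the core:
// at startup, when a UT blocks in the kernel, or when a UT yields.
VOID CALLBACK UmsSchedulerProc(UMS_SCHEDULER_REASON reason,
                               ULONG_PTR activationPayload,
                               PVOID schedulerParam)
{
    for (;;)
    {
        // Pull user-mode threads whose kernel work has completed.
        PUMS_CONTEXT readyList = nullptr;
        if (DequeueUmsCompletionListItems(g_completionList, INFINITE, &readyList) &&
            readyList != nullptr)
        {
            // A fuller scheduler would walk the list with GetNextUmsListItem()
            // and queue each context; here we simply run the first one.
            // ExecuteUmsThread does not return on success.
            ExecuteUmsThread(readyList);
        }
    }
}

bool EnterScheduler()
{
    if (!CreateUmsCompletionList(&g_completionList))
        return false;

    UMS_SCHEDULER_STARTUP_INFO info = {};
    info.UmsVersion     = UMS_VERSION;
    info.CompletionList = g_completionList;
    info.SchedulerProc  = UmsSchedulerProc;
    info.SchedulerParam = nullptr;

    // Converts the calling thread into a UMS primary thread and invokes
    // UmsSchedulerProc; returns only when the scheduler leaves UMS mode.
    return EnterUmsSchedulingMode(&info) != FALSE;
}
```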
Agenda
Windows Server 2008 R2
New NUMA APIs
New User-Mode Scheduling APIs
New C++ Concurrency Runtime
Visual Studio 2010
Tools, Programming Models, Runtimes
[Diagram: the Visual Studio 2010 parallel stack. Tools: Parallel Debugger, profiler and concurrency analyzer. Programming models: managed libraries (PLINQ, Task Parallel Library, data structures) over the thread pool's task scheduler and resource manager; native libraries (Parallel Pattern Library, Agents Library, data structures) over the Concurrency Runtime's task scheduler and resource manager. Both sides sit on the operating system's threads/UMS.]
Task Scheduling
Tasks are run by worker threads, which the scheduler controls
[Diagram: timelines for four worker threads (WT0-WT3), without UMS (signal-and-wait) and with UMS (UMS yield); the signal-and-wait case shows a "dead zone" of lost processor time that the UMS yield case avoids.]
User-Mode Scheduling APIs and the C++ Concurrency Runtime
“Cooperative Thread-Scheduling”
Summary
Call-to-action
Consider how your solution will scale on NUMA systems
Utilize the NUMA APIs to maximize node locality
Leverage UMS for custom user-mode thread scheduling
Use the C++ Concurrency Runtime for most native parallel computing scenarios and gain the benefits of NUMA/UMS implicitly (see the sketch below)
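As an illustration of that last point, a minimal Parallel Pattern Library sketch (not from the deck): the Concurrency Runtime's scheduler and resource manager decide how iterations map onto cores, groups, and UMS threads, so the code itself stays topology-agnostic.

```cpp
// Sketch: data-parallel loop with the Visual Studio 2010 Parallel Pattern Library.
// The Concurrency Runtime schedules the work; no explicit NUMA or UMS code is needed.
#include <ppl.h>
#include <vector>

void ScaleInPlace(std::vector<double>& data, double factor)
{
    // parallel_for partitions the index range across the runtime's worker threads.
    Concurrency::parallel_for(size_t(0), data.size(), [&](size_t i)
    {
        data[i] *= factor;
    });
}
```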
Resources
MSDN Concurrency Dev-Center
http://msdn.microsoft.com/concurrency
MSDN Channel9
http://channel9.msdn.com/tags/w2k8r2
MSDN Code Gallery
http://code.msdn.microsoft.com/w2k8r2
MSDN Server Dev Center
http://msdn.microsoft.com/en-us/windowsserver
64+ LP and NUMA API Support
http://code.msdn.microsoft.com/64plusLP
http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx
Dev-Team Blogs
http://blogs.msdn.com/pfxteam
http://blogs.technet.com/winserverperformance
Resources
www.microsoft.com/teched
www.microsoft.com/learning
Sessions On-Demand & Community
Microsoft Certification & Training Resources
http://microsoft.com/technet
http://microsoft.com/msdn
Resources for IT Professionals
Resources for Developers
Related Content
DTL203 "The Manycore Shift: Making Parallel Computing Mainstream"
Monday 5/11, 2:45-4:00, Room 404, Stephen Toub
DTL06-INT "Task-Based Parallel Programming with the Microsoft .NET Framework 4"
Thursday 5/14, 1:00-2:15, Blue Thr 2, Stephen Toub
DTL403 "Microsoft Visual C++ Library, Language, and IDE : Now and Next"
Thursday 5/14, 4:30-5:45, Room 408A, Kate Gregory
DTL310 "Parallel Computing with Native C++ in Microsoft Visual Studio 2010"
Friday 5/15, 2:45-4:00, Room 515A, Josh Phillips
Windows Server Resources
Make sure you pick up your copy of Windows Server 2008 R2 RC from the Materials Distribution Counter
Learn More about Windows Server 2008 R2:
www.microsoft.com/WindowsServer2008R2
Technical Learning Center (Orange Section):
Highlighting Windows Server 2008 and R2 technologies
• Over 15 booths and experts from Microsoft and our partners
Complete an evaluation on CommNet and enter to win!
© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should
not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS,
IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.