ES13: Sean Mortazavi (Architect, Microsoft Corporation) and Jeff Baxter (Software Developer, Microsoft Corporation)

[Chart: productivity. Years: 1991, 1998, 2008 (X64 Server). Prices: $40,000,000, $1,000,000, $1,000]
Windows Server Operating System
• Secure, Reliable, Tested
• Support for high performance hardware (x64, high-speed interconnects)

HPC Pack
• Job Scheduler
• Resource Manager
• Cluster Management
• Message Passing Interface
• SDK

Microsoft Windows HPC Server 2008
• Integrated Solution out-of-the-box
• Leverages investment in Windows administration and tools
• Makes cluster operation easy and secure as a single system
Systems Management
• Rapid large scale deployment and built-in diagnostics suite
• Integrated monitoring, management and reporting
• Familiar UI and rich scripting interface

Storage
• Access to SQL, Windows and Unix file servers
• Key parallel file server vendor support (GPFS, Lustre, Panasas)
• In-memory caching options

Job & Resource Scheduling
• Integrated security via Active Directory
• Support for batch, interactive and service-oriented applications
• High availability scheduling
• Interoperability via OGF’s HPC Basic Profile

HPC Application Models
• MS-MPI stack based on MPICH2 reference implementation
• Performance improvements for RDMA networking and multi-core shared memory
• MS-MPI integrated with Windows Event Tracing
[Diagram: compute cluster topology. Corporate IT infrastructure (Windows Update, AD, DNS, DHCP, monitoring, systems management) sits on the public network. The head node (job scheduler, management, WDS, NAT, Admin / User Cons) fronts 10s to 1000s of compute nodes, each running a node manager, MPI and management components, connected over the private network and the MPI network.]
• List or Heat Map view: cluster at a glance
• Group compute nodes based on hardware, software and custom attributes; act on groupings
• Receive alerts for failures
• Track long running operations and access operation history
• Pivoting enables correlating nodes and jobs together
• Service-oriented HPC apps
• Expanded Job Policies
• Support for Job Templates
• Improve interoperability with mixed IT infrastructure
[Diagram: jobs J1, J2 and J3 placed onto the sockets (S0 to S3) and cores (P0 to P3) of Node 1 and Node 2]
J1: /numsockets:3 /exclusive: false
J2: /numnodes:1
J3: /numcores:4 /exclusive: false
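The three requests on this slide ask for resources at different granularities: cores, sockets and whole nodes. As a rough illustration only, the sketch below converts each request into a core count. The topology constants (2 sockets per node, 4 cores per socket) and the function name `cores_requested` are assumptions made for the example; they are not taken from the slide or from the scheduler's API.

```python
# Hypothetical topology, chosen only for illustration (not read from the slide).
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 4


def cores_requested(unit, count):
    """Translate a request expressed in cores, sockets or nodes into cores."""
    cores_per_unit = {
        "core": 1,
        "socket": CORES_PER_SOCKET,
        "node": SOCKETS_PER_NODE * CORES_PER_SOCKET,
    }
    if unit not in cores_per_unit:
        raise ValueError(f"unknown resource unit: {unit}")
    return count * cores_per_unit[unit]


# The three requests from the slide: /numsockets:3, /numnodes:1, /numcores:4.
requests = {"J1": ("socket", 3), "J2": ("node", 1), "J3": ("core", 4)}
footprint = {job: cores_requested(unit, count)
             for job, (unit, count) in requests.items()}
```

Under these assumed constants a socket-level request is simply a coarser way of asking for cores, which is why non-exclusive jobs such as J1 and J3 can share a node.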










Powershell
using Microsoft.Hpc.Scheduler;

class Program
{
    static void Main()
    {
        // Connect to the scheduler on the head node.
        IScheduler store = new Scheduler();
        store.Connect("localhost");

        // Let the scheduler size the job from its tasks.
        ISchedulerJob job = store.CreateJob();
        job.AutoCalculateMax = true;
        job.AutoCalculateMin = true;

        // Parametric sweep: "*" is replaced by each value from 1 to 10000.
        ISchedulerTask task = job.CreateTask();
        task.CommandLine = "ping 127.0.0.1 -n *";
        task.IsParametric = true;
        task.StartValue = 1;
        task.EndValue = 10000;
        task.IncrementValue = 1;
        task.MinimumNumberOfCores = 1;
        task.MaximumNumberOfCores = 1;
        job.AddTask(task);

        store.SubmitJob(job, @"hpc\user", "p@ssw0rd");
    }
}
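The parametric task above turns one command line into many sub-tasks by substituting the sweep value for the asterisk. A minimal sketch of that expansion follows, assuming the end value is inclusive; `expand_parametric` is a hypothetical helper written for this illustration, not a scheduler API.

```python
def expand_parametric(command_line, start, end, increment=1):
    """Return one command line per sweep value, replacing '*' with the value.

    Assumes the end value is inclusive, as in a 1..10000 sweep.
    """
    if increment <= 0:
        raise ValueError("increment must be positive")
    return [command_line.replace("*", str(value))
            for value in range(start, end + 1, increment)]


first_three = expand_parametric("ping 127.0.0.1 -n *", 1, 3)
full_sweep = expand_parametric("ping 127.0.0.1 -n *", 1, 10000)
```

With the values from the slide, the sweep yields 10000 independent single-core commands that the scheduler can spread across whatever cores are free.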
A new RDMA networking interface built for speed and stability

[Diagram: NetworkDirect stack. A socket-based app uses Windows Sockets (Winsock + WSD); an MPI app uses MS-MPI. The TCP/Ethernet path runs in kernel mode through TCP, IP, NDIS and the mini-port driver to the networking hardware. The RDMA path stays in user mode, going through the WinSock Direct provider or the NetworkDirect provider and the user mode access layer, then bypasses the kernel to reach the networking hardware through the hardware driver. Legend: (ISV) App, CCP Component, OS Component, IHV Component.]
Programming Model overviews & demo
[Chart: programming models placed along two axes, from Task Concurrency to Data Parallelism and from Local Computing to Distributed/Cloud Computing]

Technologies: IFx / CCR, Maestro, TPL / PPL, WCF, WF, Cluster-TPL, PLINQ, OpenMP, CDS, MPI / MPI.Net, Cluster SOA, Cluster-PLINQ

Example workloads:
• Robotics-based manufacturing assembly line
• Silverlight Olympics viewer
• Automotive control system
• Internet-based photo services
• Ultrasound imaging equipment
• Media encode/decode
• Image processing/enhancement
• Data visualization
• Enterprise search, OLTP, collab
• Animation / CGI rendering
• Weather forecasting
• Seismic monitoring
• Oil exploration











MPI_Init
MPI_Comm_size
MPI_Comm_rank
MPI_Send
MPI_Recv
MPI_Finalize
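These six calls cover the basic life cycle of an MPI program: start up, find out how many ranks exist and which one you are, exchange messages, shut down. The sketch below imitates that shape in plain Python with threads and queues so that it runs without an MPI installation. `ToyComm`, `run_ranks` and `sum_of_ranks` are hypothetical names invented here; they are not part of MS-MPI, MPI.NET or any real MPI binding.

```python
import queue
import threading


class ToyComm:
    """In-process stand-in for a communicator (illustrative only)."""

    def __init__(self, size):
        self.size = size  # what MPI_Comm_size would report
        # One mailbox per (source, destination) pair keeps per-sender order.
        self._boxes = {(s, d): queue.Queue()
                       for s in range(size) for d in range(size)}

    def send(self, value, source, dest):  # plays the role of MPI_Send
        self._boxes[(source, dest)].put(value)

    def recv(self, source, dest):  # plays the role of MPI_Recv (blocking)
        return self._boxes[(source, dest)].get()


def run_ranks(size, body):
    """Start one thread per rank (the 'init'), join them all (the 'finalize')."""
    comm = ToyComm(size)
    results = [None] * size

    def worker(rank):  # rank is what MPI_Comm_rank would report
        results[rank] = body(comm, rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(size)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def sum_of_ranks(comm, rank):
    """Every non-zero rank sends its rank number to rank 0, which adds them up."""
    if rank == 0:
        return sum(comm.recv(source=src, dest=0) for src in range(1, comm.size))
    comm.send(rank, source=rank, dest=0)
    return None


totals = run_ranks(4, sum_of_ranks)
```

Only rank 0 returns a value; the other ranks return nothing, mirroring a program in which the root collects the result.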
The bread & butter of MPI programs

Broadcast: the root's value A ends up on every rank (P0 to P3).
public void Broadcast<T>(ref T value, int root)

Reduce: the values A, B, C, D held by P0 to P3 are combined into a single result R at the root.
int MPI_Reduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype datatype,
               MPI_Op op, int root, MPI_Comm comm)
public T Reduce<T>(T value, ReductionOperation<T> op, int root)
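To make the data movement of Broadcast and Reduce concrete, the sketch below models each collective as a pure function over a list with one entry per rank. It illustrates the semantics only, not how MS-MPI or MPI.NET implement them; the function names `broadcast` and `reduce_to_root` are made up for this example.

```python
from functools import reduce as fold
import operator


def broadcast(values, root):
    """Every rank ends up with the value held by the root."""
    return [values[root] for _ in values]


def reduce_to_root(values, op, root):
    """Combine all per-rank values with op; only the root sees the result."""
    combined = fold(op, values)
    return [combined if rank == root else None for rank in range(len(values))]


after_broadcast = broadcast(["A", None, None, None], root=0)
after_reduce = reduce_to_root([1, 2, 3, 4], operator.add, root=0)
```

The reduction operator is passed in as a function, in the same spirit as the `ReductionOperation<T>` parameter in the MPI.NET signature.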
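[Diagrams: the remaining collective operations on four ranks, P0 to P3]
Scatter: P0 holds A B C D; afterwards P0 has A, P1 has B, P2 has C, P3 has D.
Gather: the inverse of Scatter; A, B, C, D from P0 to P3 are collected on P0.
All Gather: each rank contributes one of A, B, C, D; afterwards every rank holds A B C D.
All to All:
    before            after
P0  A0 A1 A2 A3       A0 B0 C0 D0
P1  B0 B1 B2 B3       A1 B1 C1 D1
P2  C0 C1 C2 C3       A2 B2 C2 D2
P3  D0 D1 D2 D3       A3 B3 C3 D3
All Reduce: A, B, C, D are combined into R, and every rank receives R.
Scan: P0 ends with A, P1 with AB, P2 with ABC, P3 with ABCD.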
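The collectives on this slide differ only in who contributes data and who receives it. The sketch below models each one as a pure function over a list with one entry per rank, which is enough to reproduce the before and after pictures. It is an illustration of the semantics, not of any MPI implementation, and all function names are invented for the example.

```python
from functools import reduce as fold
from itertools import accumulate
import operator


def scatter(root_items):
    """Rank i receives the i-th item held by the root."""
    return list(root_items)


def gather(values, root):
    """The root receives every rank's value; other ranks receive nothing."""
    return [list(values) if rank == root else None
            for rank in range(len(values))]


def all_gather(values):
    """Every rank receives every rank's value."""
    return [list(values) for _ in values]


def all_to_all(rows):
    """Rank i sends its j-th item to rank j, i.e. a transpose."""
    return [list(column) for column in zip(*rows)]


def all_reduce(values, op):
    """Combine all values with op; every rank receives the result."""
    combined = fold(op, values)
    return [combined for _ in values]


def scan(values, op):
    """Rank i receives the combination of the values from ranks 0..i."""
    return list(accumulate(values, op))


letters = ["A", "B", "C", "D"]
prefixes = scan(letters, operator.add)
exchanged = all_to_all([["A0", "A1", "A2", "A3"],
                        ["B0", "B1", "B2", "B3"],
                        ["C0", "C1", "C2", "C3"],
                        ["D0", "D1", "D2", "D3"]])
```

String concatenation stands in for the reduction operator here so that the Scan output lines up with the A, AB, ABC, ABCD labels on the slide.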
Gathering Hostnames in MPI.Net
[Code listing: only the keywords string, if, foreach and in survive]
Naïve Point-to-Point Send
[Code listing: only keywords survive; the slide annotates two steps, Serialize and Pin Memory]
An optimal Send?
[Code listing: only keywords survive. The slide marks the unsafe/fixed branch "Not Valid C#!" and the else branch "// Serialize and transmit"]

C#       MPI
short    MPI_SHORT
int      MPI_INT
float    MPI_FLOAT
double   MPI_DOUBLE
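The idea behind the type table is a fast path: values of a primitive type can be sent as raw fixed-size bytes, while anything else falls back to serialization. The sketch below shows that split using Python's `struct` and `pickle` modules as stand-ins. The one-byte tag and the `encode`/`decode` helpers are assumptions made for this illustration and are not MPI.NET's wire format.

```python
import pickle
import struct

# Primitive type names from the slide mapped to struct format characters.
PRIMITIVE_FORMATS = {"short": "h", "int": "i", "float": "f", "double": "d"}


def encode(value, type_name=None):
    """Pack primitives as raw little-endian bytes; serialize everything else."""
    fmt = PRIMITIVE_FORMATS.get(type_name)
    if fmt is not None:
        # Fast path: a one-byte tag followed by the fixed-size value.
        return fmt.encode("ascii") + struct.pack("<" + fmt, value)
    # Slow path: tag 'S' followed by a general-purpose serialized form.
    return b"S" + pickle.dumps(value)


def decode(payload):
    """Invert encode() by dispatching on the one-byte tag."""
    tag, body = payload[:1], payload[1:]
    if tag == b"S":
        return pickle.loads(body)
    return struct.unpack("<" + tag.decode("ascii"), body)[0]


packed_int = encode(42, "int")
packed_dict = encode({"host": "node1"})
```

The primitive payload is 5 bytes (tag plus a 4-byte integer), whereas the serialized payload carries type metadata as well as data, which is the overhead the fast path avoids.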
[Plot: NetPIPE Performance. Throughput (Mbps) against Message Size (Bytes), both on log scales. Series: C (Native), C# (Serialized)]
[Plot: NetPIPE Performance, same axes. Series: C (Native), C# (Primitive), C# (Serialized)]
V1 (focusing on batch applications) and V2 (focusing on Interactive applications)

Application areas:
• Engineering Applications: Structural Analysis, Crash Simulation
• Oil & Gas Applications: Reservoir simulation, Seismic Processing
• Life Science Applications: Structural Analysis, Crash Simulation
• Financial Services: Portfolio analysis, Risk analysis, Compliance
• Excel: Actual, Pricing, Modeling
• Interactive Cluster Applications: your applications here

Job Scheduler
• Resource allocation
• Process Launching
• Resource usage tracking
• Integrated MPI execution
• Integrated Security

WCF Service Broker
• WS Virtual Endpoint Reference
• Request load balancing
• Integrated Service activation
• Service life time management
• Integrated WCF Tracing

[Diagram: App.exe instances on one side; App.exe + Service (DLL) instances on the other]
"Cluster SOA"
Sequential
for (i = 0; i < 100000000; i++)
{
    r[i] = worker.DoWork(dataSet[i]);
}
reduce ( r );

Parallel
Session session = new Session(startInfo);
PricingClient client = new PricingClient(
    binding, session.EndpointAddress);
for (i = 0; i < 100000000; i++)
{
    // Fire the request; i travels with it as the async state.
    client.BeginDoWork(dataSet[i],
        new AsyncCallback(callback), i);
}
void callback(IAsyncResult handle)
{
    // Store each result at the index it was submitted with.
    r[(int)handle.AsyncState] = client.EndDoWork(handle);
}
// aggregate results
reduce ( r );
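The sequential and parallel versions above keep the same loop and the same method call; only the call becomes asynchronous and the results arrive through a callback. The sketch below reproduces that pattern with a local thread pool standing in for the cluster. `do_work` is a hypothetical placeholder for the remote service method, and nothing here uses the HPC or WCF APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def do_work(item):
    """Hypothetical stand-in for the remote DoWork service call."""
    return item * item


def reduce_results(results):
    """Aggregate the per-item results."""
    return sum(results)


def run_sequential(data):
    results = [do_work(item) for item in data]
    return reduce_results(results)


def run_parallel(data, workers=4):
    results = [None] * len(data)

    def make_callback(index):
        # Each callback remembers the index its request was submitted with,
        # so out-of-order completion cannot scramble the results.
        def callback(future):
            results[index] = future.result()
        return callback

    # Leaving the with-block waits for every submitted call to finish.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for index, item in enumerate(data):
            pool.submit(do_work, item).add_done_callback(make_callback(index))
    return reduce_results(results)


sequential_total = run_sequential(range(10))
parallel_total = run_parallel(range(10))
```

Both versions produce the same total; the parallel one differs only in that requests are in flight concurrently and results are slotted in as they complete.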
[Diagram: Cluster SOA architecture. A client on the LAN talks over Http (SSL) or Net.Tcp (Transport) to the head node (fail-over), the WCF broker nodes and the compute nodes; each compute node runs a node manager and service instances.]

Control path: 1. Create a Session; 2. Start Broker; 2. Start Service Instances
Data path: 3. Send / receive Messages

What user runs: Client

What admin sees
• Job status
• Service usage report
• Tracing logs
• Node heatmap, Perfmon & Event logs

What Backend does
• Balance the requests
• Grow & shrink service pool
• Provide WS Interoperability
• Track service resource usage
• Run service as the user
• Restart upon failure
• Support application tracing
Using VSTO and the HPC WCF API, developers can extend Excel's calculation power by invoking distributed services.

[Diagram: VSTO calls the SOA API, which reaches a WCF Broker; the broker dispatches to HPC Service Hosts, each running service code on the CLR.]

Developers use VSTO and the SOA API to invoke existing services that are already deployed to the cluster, without having to write any calculation logic or XLLs. As such, the analytics library can be centrally managed, meeting the regulatory requirements…
[Plot: SOA Pingpong, Small Message Latency. Round Trip Latency (ms), 0 to 1.6, against Message Size (bytes). Series: WSD, IPoIB, GigE]
High Throughput
Build
  User needs
  • Keep the loop & method-call
  • Develop, debug and deploy with VS 2008
  Solution Benefits
  • WCF based SOA effectively hides the details for data serialization and distributed computing
  • VS 2008 debugs services and clients

Run
  User needs
  • Handle sub-second requests efficiently
  • Securely run user applications
  Solution Benefits
  • Low round-trip latency
  • End-to-end Kerberos with WCF
  • Smart, efficient WCF Broker node
  • Intelligent, dynamic load balancing

Manage
  User needs
  • Monitor performance
  • Configure infrastructure at a glance
  • Perform Diagnostics
  • Monitor and report service usage
  Solution Benefits
  • Runtime monitoring of performance counters
  • Configuration of SOA infrastructure from UI
  • Diagnostics report for configuration, connectivity and performance
  • Service resource usage reports
[Table: 14 computational patterns ("dwarfs"), numbered 1 to 14, set against six application areas: HPC, Embed, SPEC, ML, Games, DB]
• Dense Matrix
• Sparse Matrix
• Spectral (FFT)
• N-Body
• Structured Grid
• Unstructured
• MapReduce
• Combinational
• Nearest Neighbor
• Graph Traversal
• Dynamic Prog
• Backtrack / B&B
• Graphical Models
• FSM
Source: “Future of Computer Architecture” by David A. Patterson




• Run, Trace
• Plot Excel, Xperf
• Plot Vampir, JumpShot









PS1> DwarfBench -Names SpectralMethod -Size medium -Platform managed -Parallel serial,tpl,mpi -PlotExcel
PS1> DwarfBench -Names DenseAlgebra -Size medium -Platform unmanaged,managed -Parallel mpi -PlotVampir
PS1> DwarfBench -Names *grid* -Size Large -Platform unmanaged -Parallel hybrid -PlotVampir
StructuredGrid code fragment using MPI.NET










world championships at ICGA





Surface Game UI: Vectorform
Game Engines: SmartGames, MOGO /Irina
Advanced performance tuning & analysis
 Jeff Baxter
Principal SDE
Microsoft Corporation
Sieve of Eratosthenes
2 3 4 5 6 7 8 9 10 11 12 13 14
Mask off multiples of 2
2 3 5 7 9 11 13
Mask off multiples of 3
2 3 5 7 11 13
6 primes less than 15
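The masking steps on this slide translate directly into code. The sketch below is a plain serial sieve, written for illustration; it is not the MPI program analysed in the talk, and the function name `primes_below` is made up here.

```python
def primes_below(limit):
    """Return all primes strictly less than limit using the sieve."""
    if limit < 2:
        return []
    is_candidate = [True] * limit
    is_candidate[0] = is_candidate[1] = False
    p = 2
    while p * p < limit:
        if is_candidate[p]:
            # Mask off multiples of p; smaller multiples were already masked
            # by smaller primes, so start at p * p.
            for multiple in range(p * p, limit, p):
                is_candidate[multiple] = False
        p += 1
    return [n for n, keep in enumerate(is_candidate) if keep]


small_primes = primes_below(15)
```

For a limit of 15 only the passes for 2 and 3 do any masking, matching the two steps shown on the slide.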
Under very reasonable assumptions









Perfmon doesn’t really help
• All nodes ~100% CPU
• No Disk, Kernel, DPC or ISR time
• No Paging
Traffic capture doesn’t really help
• Data is interesting but hard to action
Vampir Shows The Way
• Communicating Ranks
• MPI / Application Breakdown
• Application / MPI timeline
• MPI Message Detail
• Per-rank timeline
.Net Framework
• LINQ: language integrated query
• Dynamic Language Runtime
• Fx/JIT/GC improvements
• Native support for Web Services

Languages/Runtimes
• C++, C#, VB
• F#, Python, Ruby, Jscript
• Fortran (Intel, PGI)
• OpenMP
• MPI, MPI.Net

Team Development
• Team portal: version control, scheduled build, bug tracking
• Test and stress generation
• Code analysis, Code coverage
• Performance analysis

IDE
• Rapid application development
• Parallel debugging
• Multiprocessor builds
• Work flow design

Libs / Tools / Partners
• Debug: Allinea, Vampir, …
• Mathlibs: VNI, NAG, Intel, AMD, …
• Eng RAD Tools: Matlab, Mathematica, Maple, ISC, …
• OSS: Tons in every category



www.microsoft.com/hpc


www.vectorform.com
www.smart-games.com

www.Allinea.com


http://tu-dresden.de/

www.visualnumerics.com

www.top500.org






http://www.osl.iu.edu/research/mpi.net


www.codeplex.com




www.microsoftpdc.com
© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market
conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.