High Productivity Computing:
Taking HPC Mainstream
Lee Grant
Technical Solutions Professional
High Performance Computing
[email protected]
Challenge: High Productivity Computing
“Make high-end computing easier and more productive to use.
Emphasis should be placed on time to solution, the major
metric of value to high-end computing users…
A common software environment for scientific computation encompassing
desktop to high-end systems will enhance productivity gains by promoting
ease of use and manageability of systems.”
2004 High-End Computing Revitalization Task Force
Office of Science and Technology Policy, Executive Office of the President
The Data Pipeline
• Data Gathering: "Raw" data includes sensor output, data downloaded from agency or collaboration web sites, and papers (especially for ancillary data).
• Discovery and Browsing: "Raw" data browsing for discovery (do I have enough data in the right places?), cleaning (does the data look obviously wrong?), and lightweight science via browsing.
• Science Exploration: "Science variables" and data summaries for early science exploration and hypothesis testing. Similar to discovery and browsing, but with science variables computed via gap filling, units conversions, or simple equations.
• Domain-specific analyses: "Science variables" combined with models, other specialized code, or statistics for deep science understanding.
• Scientific Output: Scientific results via packages such as MATLAB or R, special rendering packages such as ArcGIS, and paper preparation.
Free Lunch Is Over For Traditional Software
[Chart: operations per second for serial code. The "free lunch" scaling path traditional software once rode (3 GHz, 1 core → 6 GHz → 12 GHz → 24 GHz, all single-core) versus the actual multi-core path (3 GHz with 2, 4, then 8 cores), where the additional operations per second are available only if code can take advantage of concurrency.]
No Free Lunch for traditional software (without highly concurrent software it won't get any faster!)
Microsoft’s Vision for HPC
“Provide the platform, tools and broad ecosystem to reduce the complexity of HPC by
making parallelism more accessible to address future computational needs.”
Reduced Complexity
• Ease deployment for larger scale clusters
• Simplify management for clusters of all scale
• Integrate with existing infrastructure
Mainstream HPC
• Address needs of traditional supercomputing
• Address emerging cross-industry computation trends
• Enable non-technical users to harness the power of HPC
Developer Ecosystem
• Increase number of parallel applications and codes
• Offer choice of parallel development tools, languages and libraries
• Drive larger universe of developers and ISVs
Microsoft HPC++ Solution
• Application Benefits: the most productive distributed application development environment
• Cluster Benefits: a complete HPC cluster platform integrated with the enterprise infrastructure
• System Benefits: a cost-effective, reliable and high performance server operating system
Windows HPC Server 2008
• Integrated security via Active Directory
• Support for batch, interactive and service-oriented applications
• High availability scheduling
• Interoperability via OGF's HPC Basic Profile
• Rapid large scale deployment and built-in diagnostics suite
• Integrated monitoring, management and reporting
• Familiar UI and rich scripting interface
Systems Management
• List or Heat Map view of the cluster at a glance
• Group compute nodes based on hardware, software and custom attributes; act on groupings
• Receive alerts for failures
• Track long-running operations and access operation history
• Pivoting enables correlating nodes and jobs together
Storage
• Access to SQL, Windows and Unix file servers
• Key parallel file server vendor support (GPFS, Lustre, Panasas)
• In-memory caching options
Job Scheduling
• Integrated job scheduling
• Service-oriented HPC apps
• Expanded job policies
• Support for job templates
• Improved interoperability with mixed IT infrastructure
MPI
• MS-MPI stack based on the MPICH2 reference implementation
• Performance improvements for RDMA networking and multi-core shared memory
• MS-MPI integrated with Windows Event Tracing
Node/Socket/Core Allocation
Windows HPC Server can help your application make the
best use of multi-core systems
[Diagram: two multi-socket nodes showing how the scheduler packs jobs J1–J3 onto sockets (S0–S3) and cores (P0–P3) according to their allocation requests:
J1: /numsockets:3 /exclusive:false
J2: /numnodes:1
J3: /numcores:4 /exclusive:false]
Job submission: 3 methods
• Command line
  – Job submit /headnode:Clus1 /numprocessors:124 /nodegroup:Matlab
  – Job submit /corespernode:8 /numnodes:24
  – Job submit /failontaskfailure:true /requestednodes:N1,N2,N3,N4
  – Job submit /numprocessors:256 mpiexec \\share\mpiapp.exe
  – (Complete PowerShell system management commands are available as well)
• Programmatic
  – Support for C++ and .NET languages
• Web interface
  – Open Grid Forum: "HPC Basic Profile"
using Microsoft.Hpc.Scheduler;

class Program
{
    static void Main()
    {
        IScheduler store = new Scheduler();
        store.Connect("localhost");

        ISchedulerJob job = store.CreateJob();
        job.AutoCalculateMax = true;
        job.AutoCalculateMin = true;

        // Parametric sweep: the scheduler expands * from StartValue
        // to EndValue, creating one task instance per value.
        ISchedulerTask task = job.CreateTask();
        task.CommandLine = "ping 127.0.0.1 -n *";
        task.IsParametric = true;
        task.StartValue = 1;
        task.EndValue = 10000;
        task.IncrementValue = 1;
        task.MinimumNumberOfCores = 1;
        task.MaximumNumberOfCores = 1;
        job.AddTask(task);

        store.SubmitJob(job, @"hpc\user", "p@ssw0rd");
    }
}
Scheduling MPI jobs
• Job submit /numprocessors:7800 mpiexec hostname
• Start time: 1 second; completion time: 27 seconds
NetworkDirect
A new RDMA networking interface built for speed and stability
• 2 µsec latency, 2 GB/sec bandwidth on ConnectX
• OpenFabrics driver for Windows includes support for NetworkDirect, Winsock Direct and IPoIB protocols
• Verbs-based design for a close fit with native, high-performance networking interfaces
• Equal to hardware-optimized stacks for MPI micro-benchmarks
[Diagram: networking stacks side by side. A socket-based app goes through Windows Sockets, TCP/IP and the NDIS mini-port driver in kernel mode; an MPI app goes through MS-MPI and either the Winsock Direct or the NetworkDirect provider, bypassing the kernel for user-mode access to the RDMA networking hardware. Components are labeled as CCP, OS, or IHV supplied.]
November 2008 Top500
Windows HPC Server 2008:
• Spring 2008, NCSA, #23: 9472 cores, 68.5 TF, 77.7% efficiency
• Spring 2008, Umea, #40: 5376 cores, 46 TF, 85.5% efficiency
• Spring 2008, Aachen, #100: 2096 cores, 18.8 TF, 76.5% efficiency
• Fall 2007, Microsoft, #116: 2048 cores, 11.8 TF, 77.1% efficiency
Windows Compute Cluster 2003:
• Spring 2007, Microsoft, #106: 2048 cores, 9 TF, 58.8% efficiency
• Spring 2006, NCSA, #130: 896 cores, 4.1 TF
A 30% efficiency improvement from Windows Compute Cluster 2003 to Windows HPC Server 2008.
Customers
“It is important that our IT environment is easy to use and support.
Windows HPC is improving our performance and manageability.”
-- Dr. J.S. Hurley, Senior Manager, Head Distributed Computing, Networked Systems
Technology, The Boeing Company
“Ferrari is always looking for the most advanced technological solutions and, of course, the same
applies for software and engineering. To achieve industry leading power-to-weight ratios,
reduction in gear change times, and revolutionary aerodynamics, we can rely on Windows HPC
Server 2008. It provides a fast, familiar, high performance computing platform for our users,
engineers and administrators.”
-- Antonio Calabrese, Responsabile Sistemi Informativi (Head of Information Systems), Ferrari
“Our goal is to broaden HPC availability to a wider audience than just power users. We believe that
Windows HPC will make HPC accessible to more people, including engineers, scientists, financial
analysts, and others, which will help us design and test products faster and reduce costs.”
-- Kevin Wilson, HPC Architect, Procter & Gamble
“We are very excited about utilizing the Cray CX1 to support our research activities,” said Rico
Magsipoc, Chief Technology Officer for the Laboratory of Neuro Imaging. “The work that we do in
brain research is computationally intensive but will ultimately have a huge impact on our
understanding of the relationship between brain structure and function, in both health and disease.
Having the power of a Cray supercomputer that is simple and compact is very attractive and
necessary, considering the physical constraints we face in our data centers today.”
Porting Unix Applications
• Windows Subsystem for Unix applications
– Complete SVR-5 and BSD UNIX environment with 300 commands, utilities, shell scripts, compilers
– Visual Studio extensions for debugging POSIX applications
– Support for 32 and 64-bit applications
• Recent port of WRF weather model
– 350K lines, Fortran 90 and C using MPI, OpenMP
– Traditionally developed for Unix HPC systems
– Two dynamical cores, full range of physics options
• Porting experience
– Fewer than 750 lines of code changed in makefiles/scripts
– Level of effort similar to a port to any new version of UNIX
– Performance on par with the Linux systems
• India Interoperability Lab, MTC Bangalore
– Industry Solutions for Interop jointly with partners
– HPC Utility Computing Architecture
– Open Source Applications on HPC Server 2008
(NAMD, DL_POLY, GROMACS)
High Productivity Modeling
Languages/Runtimes
• C++, C#, VB
• F#, Python, Ruby, JScript
• Fortran (Intel, PGI)
• OpenMP, MPI
.NET Framework
• LINQ: language integrated query
• Dynamic Language Runtime
• Fx/JIT/GC improvements
• Native support for Web Services
Team Development
• Team portal: version control, scheduled build, bug tracking
• Test and stress generation
• Code analysis, code coverage
• Performance analysis
IDE
• Rapid application development
• Parallel debugging
• Multiprocessor builds
• Workflow design
MSFT || Computing Technologies
[Diagram: technologies arranged along two axes, task concurrency vs. data parallelism and local vs. distributed/cloud computing, with example workloads:
• Task concurrency: IFx/CCR, Maestro, TPL/PPL, WCF, WF, Cluster-TPL (robotics-based manufacturing assembly line, Silverlight Olympics viewer, automotive control system, internet-based photo services, ultrasound imaging equipment)
• Data parallelism: PLINQ, OpenMP, CDS, TPL/PPL, MPI/MPI.Net, Cluster SOA, Cluster-PLINQ (media encode/decode, image processing/enhancement, data visualization, enterprise search, OLTP, collaboration, animation/CGI rendering, weather forecasting, seismic monitoring, oil exploration)]
[Diagram: SOA job flow in the cluster. Head nodes support SOA functionality via WCF brokers; each compute node performs UDF tasks as called from the WCF broker.]
SOA Broker Performance
[Charts: low latency, shown as round-trip latency (ms, roughly 0 to 1.6) vs. message size (bytes) for WSD, IPoIB and GigE with 0k/1k/4k/16k ping-pong messages; and high throughput, shown as messages/sec with 25 ms compute time (scaling into the thousands) vs. number of clients (0 to 200).]
MPI.NET
• Supports all .NET languages (C#, C++, F#, ..., even Visual Basic!)
• Natural expression of MPI in C#:

      if (world.Rank == 0)
          world.Send("Hello, World!", 1, 0);
      else {
          string msg = world.Receive<string>(0, 0);
      }

      string[] hostnames =
          comm.Gather(MPI.Environment.ProcessorName, 0);

      double pi = 4.0 * comm.Reduce(dartsInCircle, (x, y) => x + y, 0)
                  / totalDartsThrown;

• Negligible overhead (relative to C) over TCP
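For context, here is a minimal, self-contained program that combines the fragments above. The MPI.Environment, Communicator.world, Send/Receive and Gather calls follow the public MPI.NET API; the class name and surrounding program structure are our own sketch, not from the slides.

      using System;
      using MPI;

      class MpiHello
      {
          static void Main(string[] args)
          {
              // MPI.Environment initializes MPI on construction and
              // finalizes it when disposed.
              using (new MPI.Environment(ref args))
              {
                  Intracommunicator world = Communicator.world;

                  // Point-to-point, as on the slide: rank 0 greets rank 1.
                  if (world.Rank == 0)
                      world.Send("Hello, World!", 1, 0);
                  else if (world.Rank == 1)
                      Console.WriteLine(world.Receive<string>(0, 0));

                  // Collective: gather every rank's machine name at rank 0.
                  string[] hostnames =
                      world.Gather(MPI.Environment.ProcessorName, 0);
                  if (world.Rank == 0)
                      foreach (string name in hostnames)
                          Console.WriteLine(name);
              }
          }
      }

On the cluster this would be launched through the scheduler, e.g. Job submit /numprocessors:8 mpiexec MpiHello.exe (the binary name here is hypothetical).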
Allinea DDT VS Debugger Add-in
NetPIPE Performance
[Chart: throughput (Mbps, log scale from 0.01 to over 100) vs. message size (1 byte to 10^7 bytes) for C (native), C# (primitive), and C# (serialized).]
Parallel Extensions to .NET
• Declarative data parallelism (PLINQ)

      var q = from n in names.AsParallel()
              where n.Name == queryInfo.Name &&
                    n.State == queryInfo.State &&
                    n.Year >= yearStart && n.Year <= yearEnd
              orderby n.Year ascending
              select n;

• Imperative data and task parallelism (TPL)

      Parallel.For(0, n, i => {
          result[i] = compute(i);
      });
• Data structures and coordination constructs
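As a concrete illustration of those coordination constructs, here is a small producer/consumer sketch. It uses the names that eventually shipped in .NET 4 (BlockingCollection<T> in System.Collections.Concurrent, plus Task); the 2008 CTP namespaces differed, so treat this as an assumption-laden sketch rather than the exact API of the preview bits.

      using System;
      using System.Collections.Concurrent;
      using System.Threading.Tasks;

      class ProducerConsumer
      {
          static void Main()
          {
              // Bounded queue: Add blocks when 100 items are
              // outstanding, throttling the producer.
              var queue = new BlockingCollection<int>(100);

              Task producer = Task.Factory.StartNew(() =>
              {
                  for (int i = 0; i < 1000; i++)
                      queue.Add(i);
                  queue.CompleteAdding();  // signal "no more items"
              });

              Task consumer = Task.Factory.StartNew(() =>
              {
                  // Blocks waiting for items; the loop ends once the
                  // producer calls CompleteAdding and the queue drains.
                  foreach (int item in queue.GetConsumingEnumerable())
                      Console.WriteLine(item);
              });

              Task.WaitAll(producer, consumer);
          }
      }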
Example: Tree Walk
Sequential

static void ProcessNode<T>(Tree<T> tree, Action<T> action) {
    if (tree == null) return;
    ProcessNode(tree.Left, action);
    ProcessNode(tree.Right, action);
    action(tree.Data);
}

Thread Pool

static void ProcessNode<T>(Tree<T> tree, Action<T> action) {
    if (tree == null) return;
    Stack<Tree<T>> nodes = new Stack<Tree<T>>();
    Queue<T> data = new Queue<T>();
    nodes.Push(tree);
    while (nodes.Count > 0) {
        Tree<T> node = nodes.Pop();
        data.Enqueue(node.Data);
        if (node.Left != null) nodes.Push(node.Left);
        if (node.Right != null) nodes.Push(node.Right);
    }
    using (ManualResetEvent mre = new ManualResetEvent(false)) {
        int waitCount = Environment.ProcessorCount;
        WaitCallback wc = delegate {
            bool gotItem;
            do {
                T item = default(T);
                lock (data) {
                    if (data.Count > 0) {
                        item = data.Dequeue();
                        gotItem = true;
                    }
                    else gotItem = false;
                }
                if (gotItem) action(item);
            } while (gotItem);
            if (Interlocked.Decrement(ref waitCount) == 0) mre.Set();
        };
        for (int i = 0; i < Environment.ProcessorCount - 1; i++) {
            ThreadPool.QueueUserWorkItem(wc);
        }
        wc(null);
        mre.WaitOne();
    }
}
Example: Tree Walk
Parallel Extensions (with Task)
static void ProcessNode<T>(Tree<T> tree, Action<T> action) {
if (tree == null) return;
Task t = Task.Create(delegate { ProcessNode(tree.Left, action); });
ProcessNode(tree.Right, action);
action(tree.Data);
t.Wait();
}
Parallel Extensions (with Parallel)
static void ProcessNode<T>(Tree<T> tree, Action<T> action) {
if (tree == null) return;
Parallel.Do(
() => ProcessNode(tree.Left, action),
() => ProcessNode(tree.Right, action),
() => action(tree.Data) );
}
Parallel Extensions (with PLINQ)
static void ProcessNode<T>(Tree<T> tree, Action<T> action) {
tree.AsParallel().ForAll(action);
}
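The slides never show the Tree<T> type itself. For these examples to compile, a minimal, hypothetical definition like the following suffices (the PLINQ variant additionally assumes the tree is enumerable as a flat sequence of T):

      using System;

      // Hypothetical Tree<T>: the slides assume a binary tree
      // exposing Data, Left and Right members.
      class Tree<T>
      {
          public T Data;
          public Tree<T> Left;
          public Tree<T> Right;
      }

      class TreeWalkDemo
      {
          // Sequential variant from the slide, repeated here so the
          // sample compiles on its own.
          static void ProcessNode<T>(Tree<T> tree, Action<T> action)
          {
              if (tree == null) return;
              ProcessNode(tree.Left, action);
              ProcessNode(tree.Right, action);
              action(tree.Data);
          }

          static void Main()
          {
              var tree = new Tree<int>
              {
                  Data = 1,
                  Left = new Tree<int>
                  {
                      Data = 2,
                      Left = new Tree<int> { Data = 4 },
                      Right = new Tree<int> { Data = 5 }
                  },
                  Right = new Tree<int> { Data = 3 }
              };
              ProcessNode(tree, x => Console.WriteLine(x));  // prints 4 5 2 3 1
          }
      }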
F# is...
...a functional, object-oriented, imperative and explorative programming language for .NET
[Word cloud: succinct, strongly typed, scalable, interoperable, efficient, explorative, libraries.]
Interactive F# Shell

C:\fsharpv2>bin\fsi
MSR F# Interactive, (c) Microsoft Corporation, All Rights Reserved
F# Version 1.9.2.9, compiling for .NET Framework Version v2.0.50727

NOTE: See 'fsi --help' for flags
NOTE: Commands: #r <string>;;    reference (dynamically load) the given DLL.
NOTE:           #I <string>;;    add the given search path for referenced DLLs.
NOTE:           #use <string>;;  accept input from the given file.
NOTE:           #load <string> ...<string>;;
NOTE:                            load the given file(s) as a compilation unit.
NOTE:           #time;;          toggle timing on/off.
NOTE:           #types;;         toggle display of types on/off.
NOTE:           #quit;;          exit.
NOTE: Visit the F# website at http://research.microsoft.com/fsharp.
NOTE: Bug reports to [email protected]. Enjoy!

> let rec f x = (if x < 2 then x else f (x-1) + f (x-2));;
val f : int -> int
> f 6;;
val it : int = 8
Example: Taming Asynchronous I/O
Processing 200 images in parallel:

using System;
using System.IO;
using System.Threading;

public class BulkImageProcAsync
{
    public const String ImageBaseName = "tmpImage-";
    public const int numImages = 200;
    public const int numPixels = 512 * 512;

    // ProcessImage has a simple O(N) loop, and you can vary the number
    // of times you repeat that loop to make the application more
    // CPU-bound or more IO-bound.
    public static int processImageRepeats = 20;

    // Threads must decrement NumImagesToFinish, and protect
    // their access to it through a mutex.
    public static int NumImagesToFinish = numImages;
    public static Object[] NumImagesMutex = new Object[0];
    // WaitObject is signalled when all image processing is done.
    public static Object[] WaitObject = new Object[0];

    public class ImageStateObject
    {
        public byte[] pixels;
        public int imageNum;
        public FileStream fs;
    }

    public static void ReadInImageCallback(IAsyncResult asyncResult)
    {
        ImageStateObject state = (ImageStateObject)asyncResult.AsyncState;
        Stream stream = state.fs;
        int bytesRead = stream.EndRead(asyncResult);
        if (bytesRead != numPixels)
            throw new Exception(String.Format(
                "In ReadInImageCallback, got the wrong number of " +
                "bytes from the image: {0}.", bytesRead));
        ProcessImage(state.pixels, state.imageNum);
        stream.Close();

        // Now write out the image.
        // Using asynchronous I/O here appears not to be best practice.
        // It ends up swamping the threadpool, because the threadpool
        // threads are blocked on I/O requests that were just queued to
        // the threadpool.
        FileStream fs = new FileStream(ImageBaseName + state.imageNum +
            ".done", FileMode.Create, FileAccess.Write, FileShare.None,
            4096, false);
        fs.Write(state.pixels, 0, numPixels);
        fs.Close();

        // This application model uses too much memory.
        // Releasing memory as soon as possible is a good idea,
        // especially global state.
        state.pixels = null;
        fs = null;

        // Record that an image is finished now.
        lock (NumImagesMutex)
        {
            NumImagesToFinish--;
            if (NumImagesToFinish == 0)
            {
                Monitor.Enter(WaitObject);
                Monitor.Pulse(WaitObject);
                Monitor.Exit(WaitObject);
            }
        }
    }

    public static void ProcessImagesInBulk()
    {
        Console.WriteLine("Processing images... ");
        long t0 = Environment.TickCount;
        NumImagesToFinish = numImages;
        AsyncCallback readImageCallback = new
            AsyncCallback(ReadInImageCallback);
        for (int i = 0; i < numImages; i++)
        {
            ImageStateObject state = new ImageStateObject();
            state.pixels = new byte[numPixels];
            state.imageNum = i;
            // Very large items are read only once, so you can make the
            // buffer on the FileStream very small to save memory.
            FileStream fs = new FileStream(ImageBaseName + i + ".tmp",
                FileMode.Open, FileAccess.Read, FileShare.Read, 1, true);
            state.fs = fs;
            fs.BeginRead(state.pixels, 0, numPixels, readImageCallback,
                state);
        }

        // Determine whether all images are done being processed.
        // If not, block until all are finished.
        bool mustBlock = false;
        lock (NumImagesMutex)
        {
            if (NumImagesToFinish > 0)
                mustBlock = true;
        }
        if (mustBlock)
        {
            Console.WriteLine("All worker threads are queued. " +
                " Blocking until they complete. numLeft: {0}",
                NumImagesToFinish);
            Monitor.Enter(WaitObject);
            Monitor.Wait(WaitObject);
            Monitor.Exit(WaitObject);
        }
        long t1 = Environment.TickCount;
        Console.WriteLine("Total time processing images: {0}ms",
            (t1 - t0));
    }
}
Example: Taming Asynchronous I/O
Equivalent F# code (same perf):

let ProcessImageAsync(i) =
    async { // Open the file synchronously
            let inStream  = File.OpenRead(sprintf "source%d.jpg" i)
            // Read from the file, asynchronously
            let! pixels   = inStream.ReadAsync(numPixels)
            let pixels'   = TransformImage(pixels, i)
            let outStream = File.OpenWrite(sprintf "result%d.jpg" i)
            // Write the result asynchronously
            do! outStream.WriteAsync(pixels')
            do  Console.WriteLine "done!" }

let ProcessImagesAsync() =
    // Generate the tasks and queue them in parallel
    Async.Run (Async.Parallel
        [ for i in 1 .. numImages -> ProcessImageAsync(i) ])
The Coming of Accelerators
Current Offerings

               Microsoft        AMD              nVidia           Intel        Apple
High-level     Accelerator      Brook+           RapidMind        Ct           Grand Central
Libraries      D3DX, DaVinci,   ACML-GPU         cuFFT, cuBLAS,   MKL++        CoreImage,
               FFT, Scan                         cuPP                          CoreAnim
Low-level      Compute Shader   CAL              CUDA             LRB Native   OpenCL
Target         Any processor    AMD CPU or GPU   nVidia GPU       Intel CPU,   Any processor
                                                                  Larrabee
DirectX11 Compute Shader
• A new processing model for GPUs
  – Integrated with Direct3D
  – Supports more general constructs
  – Enables more general data structures
  – Enables more general algorithms
• Image/post processing:
  – Image reduction, histogram, convolution, FFT
  – Video transcode, super-resolution, etc.
• Effect physics
  – Particles, smoke, water, cloth, etc.
• Ray-tracing, radiosity, etc.
• Gameplay physics, AI
FFT Performance Example
• Complex 1024x1024 2-D FFT:
  – Software:         42 ms     6 GFlops
  – Direct3D9:        15 ms    17 GFlops (3x)
  – CUFFT:             8 ms    32 GFlops (5x)
  – Prototype DX11:    6 ms    42 GFlops (6x)
  – Latest chips:      3 ms   100 GFlops
• Shared register space and random-access writes enable ~2x speedups
IMSL .NET Numerical Library
• Linear Algebra
• Eigensystems
• Interpolation and Approximation
• Quadrature
• Differential Equations
• Transforms
• Nonlinear Equations
• Optimization
• Basic Statistics
• Nonparametric Tests
• Goodness of Fit
• Regression
• Variances, Covariances and Correlations
• Multivariate Analysis
• Analysis of Variance
• Time Series and Forecasting
• Distribution Functions
• Random Number Generation
Research
• Integrate: data acquisition from source systems and integration; data transformation and synthesis
• Analyze: data enrichment with business logic and hierarchical views; data discovery via data mining
• Report: data presentation and distribution; data access for the masses
Data Browsing with Excel
[Screenshot: annual, monthly, and weekly mean time series browsed in Excel. Courtesy Catherine van Ingen, MSR.]
Datamining with Excel
Integrated algorithms
• Text Mining
• Neural Nets
• Naïve Bayes
• Time Series
• Sequence Clustering
• Decision Trees
• Association Rules
Workflow Design for SharePoint
Microsoft HPC++ Labs:
Academic Computational Finance Service
Taking HPC Mainstream
© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or
trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation
as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of
Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS,
IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.