Bringing Co-processor Performance to Every Programmer
David Tarditi, Sidd Puri, Jose Oglesby
Microsoft Research
presented by
Turner Whitted
Outline
• Basics – why, what, how
• Programming model, operations, capabilities
– Examples
• Implementation
• Performance
• Directions
Our goal
• Make parallel processing accessible to everyday programmers
• And available for everyday applications
Approach
• Extend existing high-level languages with new data-parallel array types
– Ease of programming
– Implemented as a library so programmers can use it now
– Eventually fold into base languages
• Build implementations with compelling performance
– Target GPUs and multi-core CPUs
• Create examples and applications
– Educate programmers, provide sample code
Why data parallel?
• It’s the easiest parallel programming model.
• It’s easy to debug.
• It’s easy to adapt to massive parallelism.
– Scaling to hundreds or thousands of parallel units requires no change in mindset, design, or code.
• There’s widespread application experience in the scientific, financial, media, and graphics communities.
– APL/Parallel Fortran/Connection Machines/Stream programming
• In developing parallel software, the data organization is much more important than parallelism in the code.
Programming Model
Data-parallel array types
[Diagram: CPU and GPU separated by an API/Driver/Hardware layer. Labels: Array1[ … ] … ArrayN[ … ]; DPArray1[ … ] … DPArrayN[ … ]; library_calls(); txtr1[ … ] … txtrN[ … ]; pix_shdrs().]
Explicit coercion
• Explicit coercions between data-parallel arrays and normal arrays trigger GPU execution.
Functional style
• Each operation produces a new data-parallel array.
Types of operations
• Restrict operations to allow data-parallel programming.
• No aliasing, pointer arithmetic, or individual element access.
Operations
• Array creation
• Element-wise arithmetic operations: +, *, -, etc.
• Element-wise boolean operations: and, or, >, <, etc.
• Type coercions: integer to float, etc.
• Reductions/scans: sum, product, max, etc.
• Transformations: expand, pad, shift, gather, scatter, etc.
• Basic linear algebra: inner product, outer product.
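As a rough illustration of these categories, the following sketch implements a few of them on ordinary C# arrays. It runs on the CPU and does not use the Accelerator library; the class and method names (DataParallelOpsSketch, Add, Greater, Sum, PrefixSum, Gather) are hypothetical and chosen only for this example. Each method returns a new value rather than modifying its inputs, mirroring the functional style described earlier.

```csharp
using System;
using System.Linq;

// Plain-CPU stand-ins for some of the operation categories listed above.
// All names here are hypothetical; this is not the Accelerator API.
public static class DataParallelOpsSketch
{
    // Element-wise arithmetic: returns a new array, inputs are left untouched.
    public static float[] Add(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("shape mismatch");
        return a.Zip(b, (x, y) => x + y).ToArray();
    }

    // Element-wise comparison: one boolean per element.
    public static bool[] Greater(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("shape mismatch");
        return a.Zip(b, (x, y) => x > y).ToArray();
    }

    // Reduction: collapses the whole array to a single value.
    public static float Sum(float[] a)
    {
        return a.Sum();
    }

    // Scan: inclusive running sum.
    public static float[] PrefixSum(float[] a)
    {
        var result = new float[a.Length];
        float running = 0f;
        for (int i = 0; i < a.Length; i++)
        {
            running += a[i];
            result[i] = running;
        }
        return result;
    }

    // Gather: result[i] = source[indices[i]].
    public static float[] Gather(float[] source, int[] indices)
    {
        return indices.Select(i => source[i]).ToArray();
    }

    public static void Main()
    {
        float[] a = { 1f, 2f, 3f, 4f };
        float[] b = { 4f, 3f, 2f, 1f };
        Console.WriteLine(string.Join(", ", Add(a, b)));
        Console.WriteLine(string.Join(", ", Greater(a, b)));
        Console.WriteLine(Sum(a));
        Console.WriteLine(string.Join(", ", PrefixSum(a)));
        Console.WriteLine(string.Join(", ", Gather(a, new[] { 3, 0, 0 })));
    }
}
```

None of these helpers exposes pointer arithmetic or in-place element writes, which is the restriction that keeps every operation trivially parallel.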
Example: 2-D convolution
float[,] Blur(float[,] array, float[] kernel) {
    // Wrap the input in a data-parallel array.
    using (DFPA parallelArray = new DFPA(array)) {
        FPA resultX = new FPA(0.0f, parallelArray.Shape);
        for (int i = 0; i < kernel.Length; i++) { // Convolve in X direction.
            resultX += parallelArray.Shift(0,i) * kernel[i];
        }
        FPA resultY = new FPA(0.0f, parallelArray.Shape);
        for (int i = 0; i < kernel.Length; i++) { // Convolve in Y direction.
            resultY += resultX.Shift(i,0) * kernel[i];
        }
        // Evaluate the expression and copy the result into a normal array.
        using (DFPA result = resultY.Eval()) {
            float[,] resultArray;
            result.ToArray(out resultArray);
            return resultArray;
        }
    }
}
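To make clear what the example computes, here is a plain-CPU sketch of the same two-pass structure on ordinary arrays. It assumes that Shift(dr, dc) yields an array whose element (r, c) is read from (r + dr, c + dc) in the source, clamped at the borders; the slides do not state the library's actual shift direction or boundary behavior, so treat that as an assumption. BlurReferenceSketch and MultiplyAdd are hypothetical names used only here.

```csharp
using System;

public static class BlurReferenceSketch
{
    // Assumed stand-in for Shift: element (r, c) of the result is read from
    // (r + dr, c + dc) in the source, clamped to the array bounds.
    public static float[,] Shift(float[,] src, int dr, int dc)
    {
        int rows = src.GetLength(0), cols = src.GetLength(1);
        var dst = new float[rows, cols];
        for (int r = 0; r < rows; r++)
        {
            for (int c = 0; c < cols; c++)
            {
                int sr = Math.Min(Math.Max(r + dr, 0), rows - 1);
                int sc = Math.Min(Math.Max(c + dc, 0), cols - 1);
                dst[r, c] = src[sr, sc];
            }
        }
        return dst;
    }

    // Element-wise acc + shifted * weight, returned as a new array.
    public static float[,] MultiplyAdd(float[,] acc, float[,] shifted, float weight)
    {
        int rows = acc.GetLength(0), cols = acc.GetLength(1);
        var dst = new float[rows, cols];
        for (int r = 0; r < rows; r++)
        {
            for (int c = 0; c < cols; c++)
            {
                dst[r, c] = acc[r, c] + shifted[r, c] * weight;
            }
        }
        return dst;
    }

    public static float[,] Blur(float[,] array, float[] kernel)
    {
        int rows = array.GetLength(0), cols = array.GetLength(1);

        // Pass 1: weighted sum of column-shifted copies (X direction).
        var resultX = new float[rows, cols];
        for (int i = 0; i < kernel.Length; i++)
        {
            resultX = MultiplyAdd(resultX, Shift(array, 0, i), kernel[i]);
        }

        // Pass 2: weighted sum of row-shifted copies of pass 1 (Y direction).
        var resultY = new float[rows, cols];
        for (int i = 0; i < kernel.Length; i++)
        {
            resultY = MultiplyAdd(resultY, Shift(resultX, i, 0), kernel[i]);
        }
        return resultY;
    }

    public static void Main()
    {
        var image = new float[,] { { 2f, 2f, 2f }, { 2f, 2f, 2f }, { 2f, 2f, 2f } };
        float[,] blurred = Blur(image, new[] { 0.25f, 0.5f, 0.25f });
        Console.WriteLine(blurred[1, 1]);
    }
}
```

Each pass is a sum of whole-array shifts scaled by one kernel weight, so no loop ever indexes an individual pixel in the data-parallel version; the per-element loops above exist only because this sketch runs on ordinary arrays.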
Implementation
What’s built
• A data-parallel library for .NET
– Simple, high-level set of operations
• A just-in-time compiler that compiles on-the-fly to GPU pixel shader code
– Runs on top of the product CLR
• Examples and applications
– Versions using the library, C, and hand-written pixel shaders.
Just-in-time compiler
[Flow diagram with three columns: Programmer, Accelerator, DirectX. Boxes shown, in the order extracted (not necessarily execution order):]
• C# code building up an expression using the Accelerator API
• Build Expression DAG
• Build Canonical Shader DAG
• Coercion to normal C# array
• Transfer Data
• Initialize Pipeline
• Triangle Setup
• Compile Pixel Shader
• Optimize Shader DAG
• Render
• Run Shader DAG
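The slides describe operations that only build up an expression, with real work deferred until the result is coerced to a normal array. A minimal CPU-only sketch of that idea follows. Expr, SourceExpr, BinaryExpr, and ExprDemo are hypothetical names, and evaluation here is a simple per-element walk of the graph rather than shader generation, so this shows the programming model and not the actual compiler.

```csharp
using System;

// Hypothetical miniature of lazy, functional data-parallel arrays.
public abstract class Expr
{
    public abstract int Length { get; }

    // Computes element i of this node; used only during evaluation.
    internal abstract float ElementAt(int i);

    // Operators do no arithmetic: each one only adds a node to the expression DAG.
    public static Expr operator +(Expr a, Expr b)
    {
        return new BinaryExpr(a, b, (x, y) => x + y);
    }

    public static Expr operator *(Expr a, Expr b)
    {
        return new BinaryExpr(a, b, (x, y) => x * y);
    }

    // Coercion to a normal array: the only point where the graph is evaluated.
    public float[] ToArray()
    {
        var result = new float[Length];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = ElementAt(i);
        }
        return result;
    }
}

public sealed class SourceExpr : Expr
{
    private readonly float[] data;

    // Copy the input so later writes to the caller's array cannot alias it.
    public SourceExpr(float[] values)
    {
        data = (float[])values.Clone();
    }

    public override int Length { get { return data.Length; } }

    internal override float ElementAt(int i)
    {
        return data[i];
    }
}

public sealed class BinaryExpr : Expr
{
    private readonly Expr left;
    private readonly Expr right;
    private readonly Func<float, float, float> op;

    public BinaryExpr(Expr left, Expr right, Func<float, float, float> op)
    {
        if (left.Length != right.Length) throw new ArgumentException("shape mismatch");
        this.left = left;
        this.right = right;
        this.op = op;
    }

    public override int Length { get { return left.Length; } }

    // Shared sub-expressions are recomputed here; a real compiler would avoid that.
    internal override float ElementAt(int i)
    {
        return op(left.ElementAt(i), right.ElementAt(i));
    }
}

public static class ExprDemo
{
    public static void Main()
    {
        var input = new float[] { 1f, 2f, 3f };
        Expr a = new SourceExpr(input);
        input[0] = 100f;     // does not affect 'a', which holds its own copy
        Expr e = a + a * a;  // builds a DAG; 'a' is shared by two nodes
        Console.WriteLine(string.Join(", ", e.ToArray()));
    }
}
```

Because nothing runs until ToArray is called, an implementation along these lines sees the whole expression at once, which is what gives a just-in-time compiler room to optimize before generating code.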
Implementation details
• See David Tarditi, Sidd Puri, Jose Oglesby,
“Accelerator: using data-parallelism to
program GPUs for general purpose uses,”
to appear in Proceedings of ASPLOS XII,
Oct. 2006.
Performance
Benchmarks
Benchmark | Description | Type
Sum | Sum absolute values of 1000x1000 matrix | Primitive
Matrix-vector | Multiply 1000x1000 matrix by vector | Primitive
Matrix multiply | Multiply two 1000x1000 matrices | Primitive
Life | Game of Life on 1000x1000 grid (floating point, 1 iteration) | Module
Demosaic | Convert image in Bayer pattern to RGB (1000x1000 image) | Module
Convolve | Convolve image with 5x5 Gaussian filter (1000x1000 image) | Module
Benchmarks (cont.)
Benchmark | Description | Type
Rotate | Rotate 1000x1000 image w/ inter-pixel interpolation, cropping | Module
Corner detection | Find corner-like features on a 1000x1000 image using KLT algorithm | App
Motion estimation | MPEG style macro-block motion vectors on two 512x512 monochromatic images | App
Neural net | Train convolutional neural network for handwriting recognition | App
Stereo matching | Compute distance of object at each pixel, given 1000x1000 images from 2 cameras separated by a small horizontal distance | App
Versions
Three implementations
• Accelerator, written in C#
• Hand-written pixel shader 3.0 code
• C (running on CPU)
– Uses Intel’s Math Kernel Library for sum, matrix-multiply, matrix-vector, and part of neural net training
All three produce verifiably equivalent output (within epsilon)
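One way such an equivalence check could look is sketched below. The slides do not say how epsilon was defined or applied, so this assumes a simple absolute per-element tolerance chosen by the caller; OutputCheckSketch and ApproximatelyEqual are hypothetical names.

```csharp
using System;

public static class OutputCheckSketch
{
    // True when both arrays have the same length and every pair of elements
    // differs by at most epsilon (absolute tolerance, an assumption of this sketch).
    public static bool ApproximatelyEqual(float[] expected, float[] actual, float epsilon)
    {
        if (expected.Length != actual.Length) return false;
        for (int i = 0; i < expected.Length; i++)
        {
            // Written as a negated <= so that a NaN on either side counts as a mismatch.
            if (!(Math.Abs(expected[i] - actual[i]) <= epsilon)) return false;
        }
        return true;
    }

    public static void Main()
    {
        float[] cpu = { 1.0f, 2.0f, 3.0f };
        float[] gpu = { 1.0f, 2.00001f, 3.0f };
        Console.WriteLine(ApproximatelyEqual(cpu, gpu, 1e-3f));
    }
}
```

A tolerance is needed because CPU and GPU floating-point arithmetic may round or order operations differently, so bit-exact equality is too strict a test.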
Hardware configuration
• CPU: 3.2 GHz P4, with 16K L1 cache, 1MB L2 cache
• Machine(s): Dell Optiplex GX280, 1 GB memory, 400ns, PCI Express bus
• GPUs:
– Nvidia GeForce 6800 Ultra with 256MB, Brand: eVGA
– Nvidia GeForce 7800 GTX with 256MB, Brand: eVGA
– ATI x850 with 256 MB
– ATI x1800 XT
Software configuration
• C++
– Intel Math Kernel Library 7.0
– Intel C++ Compiler 9.0 for Windows
– Visual Studio 2005 (“Whidbey”) Beta 2
– DirectX 9.0 (June 2005 update)
• C#
– Framework 2.0.50215
– DirectX for Managed Code 1.0.2902.0/1.0.2906.0
• Compiler flags used:
– Intel C++: /Ox
– Microsoft C++: /Ox /fp:fast
– C#: /optimize+
API 1.0 vs Hand-coded PS 3.0 (x1800 XT)
[Chart: speedup vs C++ (log scale, 0.10 to 100.00) for two series, Accelerator and Pixel Shader. Benchmarks on the x-axis include Reduce, MatrixMultiply, MatrixVector, Life, Convolve, Demosaic, Rotate, DetectCorners, StereoMatch, and MotionEst.]
Speedup on various GPUs
[Chart: speedup vs C++ (log scale, 0.10 to 100.00) for four GPUs: 6800 Ultra, 7800, x850 XT PE, and x1800 XT. Benchmarks on the x-axis include Reduce, MatrixMultiply, MatrixVector, Life, Convolve, Demosaic, Rotate, DetectCorners, StereoMatch, and MotionEst.]
Directions
Lessons learned/next steps
• Need a non-graphics interface
– For more flexibility
– Less execution overhead
• Need native GPU support
– Replace library with language built-ins
• Need to learn from users
• Retarget for multi-core
Additional information
• Tech Report, “Accelerator: simplified
programming of graphics processing units
for general-purpose uses via data
parallelism,” MSR-TR-2005-184
– Available at http://research.microsoft.com
• Download available from
– http://research.microsoft.com/downloads
• For questions contact
– [email protected]
Acknowledgement
• Jim Kajiya, Rick Szeliski, Raymond
Endres, David Williams