General Purpose Computations on GPU

Download Report

Transcript General Purpose Computations on GPU

Enhancing GPU
for Scientific Computing
Some thoughts
Outline






Motivation
Related work
BLAS Library
Execution Model
Benchmarks
Recommendations
Motivation

GPU Computing
•
•
•


Vector and Fragment Processor
streaming (super)-computers
enormous performance!
ATI 9700, NV30
They have become programmable
Emerging application areas
•
•
Numerical Sim.[Schroder’03], Sorting, Genomics, etc.
Goal: Scientific Computing
Motivation

Most software built from small-efficient parts

Scientific apps built on top of s/w library routines

Harnessing GPU resources

•
•
Arithmetic Intensive
Data parallel
BLAS Library
Related work





Using non-programmable GPUs
[Erik’01] prog. vertex engine for lighting/morphing
[Oskin’02] vector processing using VP
[Ian’03] stream processing using FP
Problems :
•
•
•
•
Monolithic Big Programs
One of VP or FP
CPU – Passive Mode
No Cascading Loop-backs (Parallelism, Setup Times)
BLAS Library


BLAS (Basic Linear Algebra Subprograms)
•
•
Building blocks for vector and matrix operations
development of highly efficient linear algebra software
• LINPACK and LAPACK
Operations
•
•
•
•
Scalar –
Vector –
Vector –
Matrix –
Vector
Vector
Matrix
Matrix
Mapping

Operation

CPU/FP - All ops

VP - no memory access

Restricted data-flows
•
•
CPU
VP
processor
FP
CPU
All operations
CPU
Non-matrix ops
VP
All operations
FP
CPU
VP
FP
CPU
(Vectors, Vectors)
Execution graph
vAdd
CPU
vAdd.cg
Vector Scalar Add Operation
1.
In this example, a Vector of length n is
segmented into m other vectors of length 4 in
the CPU function vsAdd.
2.
The vertex program vsAdd.cg is loaded onto
the vertex processor and the scalar value is
passed as a parameter.
3.
Subsequently, CPU function vsAdd will
stream the set of m vectors onto the CPU as
openGL primitive points. Our vertex program,
vsAdd.cg will add the scalar value to all fields
in the m vertices.
4.
Consequently, these vertices will proceed to
the fragment processor and written onto the
framebuffer memory.
5.
The CPU function vsADD continues to read
the color values off each pixel representation
of the vertices. These color values contain
result of a Vector Scalar add.
6.
Lastly the CPU function concatenates the
sequence of color values into a vector of
length n as result.
[Vertex]m (GL_POINTS)
[vAdd.cg]
Vertex Processor
[Vertex]m
G
[None]
Fragment Processor
P
U
Texture Mem
PBuffer
TextureDatam
[Texture Color values]m
vAdd
CPU
(Vectors)
GL_QUAD [Vector4]m
Execution graph
vAdd
CPU
Vector Vector Add Operation
1.
In this example, 2 vectors of length s are
transformed into texture data in the CPU
function vAdd.
2.
The vertex program vAdd.cg, and texture data
are loaded onto the fragment processor GPU
memory respectively.
3.
Subsequently, CPU function vAdd will draw a
quadrilateral primitive having s pixels.
4.
The vertex processor does nothing and passes
on the vertices to the rasterizer to process into
pixel representation.
5.
The rasterizer creates the s pixels for fragment
processing.
6.
For each pixel, our fragment processor will
lookup the values from both textures and
determine the color value of each pixel. These
pixels are written onto the Pbuffer memory.
7.
The CPU function vADD continues to read
the color values off each pixel representation
of the vertices. These color values contain
result of a Vector Vector add.
8.
The output in Pbuffer is then converted into a
texture entry.
9.
Lastly the CPU function reads the texture
entry and concatenates the sequence of color
values into a vector of length s as result.
[Vertex4]m GL_QUAD
vAdd.cg
[None]
Vertex Processor
[Vertex4]m
G
[vAdd.cg]
Fragment Processor
P
U
TextureData1m
TextureData2m
Texture Mem
PBuffer
TextureData3m
[Texture Color values]m
vAdd
CPU
(Vectors)
GL_QUAD [Vectex4]m
Execution graph
vAdd
CPU
2 Vector Vector Add Operations
1.
In this example, we perform 2 separate vector
vector add operations.
2.
The 1st operation proceeds as described earlier
in our vector vector add operation.
3.
The output of the 1st operation is used as input
for the 2nd operation.
4.
Since it’s the same operation, we do not load a
new Vertex or Fragment program. However
we proceed to load a new texture data.
5.
The 2nd operation proceeds as normal.
6.
Lastly the CPU function concatenates the
sequence of color values into a vector of
length s as result.
[Vertex4]m
[None]
Vertex Processor
TextureData4m
[Vertex4]m
G
[vAdd.cg]
Fragment Processor
P
U
Texture Mem
PBuffer
TextureData3m
[Texture Color values]m
vAdd
CPU
(Vectors)
Performance Issues


Representation inefficiency
•
Memory
•
Communication costs
• Data stored both in CPU and GPU
• Loading data onto GPU
• Reading data from GPU
Execution inefficiencies
•
Computation setup overhead
•
Problem execution time
• Remodeling CPU data for GPU
• Rendering
• Texture lookups
Operation breakdown
Fixed-Point processing
100%
80%
Readback
60%
Compute
Setup
40%
Load
20%
0%
1
2
4
8
16
32
Fixed-Point (Vector-Vector Add)
16000
14000
Time in Milliseconds
12000
10000
8000
6000
4000
2000
0
1
2
4
8
Number of Operations
Resident
Non-Resident
16
32
Operation Breakdown
FP32 processing
100%
90%
80%
70%
60%
Readback
50%
Compute
Setup
40%
Load
30%
20%
10%
0%
1
2
4
8
16
32
FP32 Performance (Vector-Vector Add)
40000
35000
Time in Milliseconds
30000
25000
20000
15000
10000
5000
0
1
2
4
8
Number of cascading operations
Resident
Non-Resident
16
32
Observations



Fixed-point operations are much faster
than FP16/FP32 operations
FP16/FP32 operations have similar
performance
VP is slower than FP
• Operation mappings involving both VP and
FP result in inefficient pipeline
Observations


Simple operations perform better on CPU
Best to design whole algorithm as single
VP/FP program
• Memory cost for storing intermediate results
• Execution cost ?
• More textures result in decreased performance
Bug Reports Filed!


Incorrect dump of floating point values
after render to texture [NVIDIA confirmed]
cgSetcolor parameter does not update
alpha values [Awaiting reply]
Recommendations
(3D Graphics Hackers)

Load important data into Video memory

Maximum use of Fixed-point Pipeline

Code optimization important (Instr., Memory)

Upgrade your video card drivers (must!)
•
Hacking graphics hardware is a *real* pain!
Recommendations (Cg)

Pointer meaningful for numerical computing

Texture fetch instructions (add. Offsets)

Accumulation registers (sum)

Preserving State across multiple calls

Introduce stack mechanisms

Introduce bit wise operators
Recommendations (Hardware)





Allow GPU to read/write from CPU memory
VP and FP as 1st class processors on GPU
•
•
Similar cores and instruction sets
Allow full parallelism
Allow CPU to read/write all registers in GPU
processors
Introduce a stack
Introduce bit wise operators
Deliverables!

A draft subset of the BLAS library

Architecture Insights (issues/constraints)

NV30 Improvements (Bug reports)

Technical Write-up
The End