Rapid Evaluation of Catmull-Clark Subdivision Surfaces

Sparse Matrix Solvers on the GPU:
Conjugate Gradients and Multigrid
Jeffrey Bolz, Ian Farmer, Eitan Grinspun,
Peter Schröder
Caltech ASCI Center
• Semiconductor trends
– cost
– wires vs. compute
– Stanford streaming supercomputer
• Parallelism
[Chart: processor performance in ps/instruction (log scale, 1e-4 to 1e+7), actual vs. possible trends, 1980–2020]
– many functional units
– graphics is prime example
• Harvesting this power
– which applications are suitable?
– which abstractions are useful?
• History
– massively parallel SIMD machines
– media processing
Imagine stream processor; Bill Dally, Stanford
Connection Machine CM2; Thinking Machines
Chart courtesy Bill Dally
Why Use the GPU?
Contributions and Related Work
• Contributions
– numerical algorithms on GPU
• unstructured grids: conjugate gradients
• regular grids: multigrid
– what abstractions are needed?
• Numerical algorithms
– Goodnight et al. 2003 (MG)
– Hall et al. 2003 (cache)
– Harris et al. 2002 (FD sim.)
– Hillesland et al. 2003 (optimization)
– Krueger & Westermann 2003 (NLA)
– Strzodka (PDEs)
Streaming Model
• Abstract model
– Purcell, et al. 2002
– data structures: streams
– algorithms: kernels
[Diagram: input record stream → Kernel → Kernel → …, each kernel reading global data]
• Concrete model
– render a rectangle
– data structures: textures
– algorithms: fragment programs
[Diagram: bind buffer to texture → texture as read-only memory → rasterizer (sets up texture indices and all associated data) → fragment program (runs for all pixels in parallel) → output goes to texture → output record stream]
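The concrete model can be mimicked on the CPU: a fragment program is a pure function applied independently at every pixel of the rendered rectangle, with extra textures available as read-only globals. A minimal sketch (the function names here are illustrative, not from the talk):

```python
import numpy as np

def run_kernel(kernel, stream, *globals_):
    """Mimic a fragment program: run `kernel` independently at every
    'pixel' of the input stream, with read-only access to globals."""
    out = np.empty_like(stream)
    h, w = stream.shape
    for i in range(h):              # on the GPU, these two loops are
        for j in range(w):          # the rasterizer, run in parallel
            out[i, j] = kernel(stream[i, j], (i, j), *globals_)
    return out

# A toy kernel: scale each element by a per-pixel coefficient texture.
coeff = np.full((4, 4), 2.0)        # a read-only "texture"
x = np.arange(16, dtype=float).reshape(4, 4)
y = run_kernel(lambda v, ij, c: c[ij] * v, x, coeff)
```

The point of the abstraction is that the kernel sees only its own record plus read-only globals, which is what makes the parallel execution safe.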
Sparse Matrices: Geometric Flow
• Ubiquitous in numerical computing
– discretization of PDEs: animation
• finite elements, difference, volumes
– optimization, editing, etc.
• Example here:
– processing of surfaces
Velocity opposite the mean curvature normal:
∂_t x_i(t) = −H_i(t) n_i(t)
• Canonical non-linear problem
– mean curvature flow
– implicit time discretization
Axi 1  x i 
• solve sequence of SPD systems
aij  t (cot(ij )  cot(ij ))
aii  4 Ai   jN ( i ) aij
Conjugate Gradients
• High level code
– inner loop
– matrix-vector
multiply
– sum-reduction
– scalar-vector MAD
• Inner product
– fragment-wise multiply
– followed by sum-reduction
– odd dimensions can be handled
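The inner loop decomposes exactly into the three primitives listed above: a matrix-vector multiply, sum-reductions for the inner products, and scalar-vector MADs for the updates. A textbook CG sketch with those roles marked:

```python
import numpy as np

def conjugate_gradients(A_mul, b, x, iters=100, tol=1e-12):
    """Textbook CG inner loop.  On the GPU, A_mul is the sparse
    matrix-vector multiply, each dot product is a sum-reduction, and
    each vector update is a scalar-vector MAD."""
    r = b - A_mul(x)
    p = r.copy()
    rr = r @ r                      # sum-reduction
    for _ in range(iters):
        if rr < tol:
            break
        Ap = A_mul(p)               # matrix-vector multiply
        alpha = rr / (p @ Ap)       # another sum-reduction
        x = x + alpha * p           # scalar-vector MAD
        r = r - alpha * Ap
        rr, rr_old = r @ r, rr
        p = r + (rr / rr_old) * p
    return x

# Small SPD test system.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradients(lambda v: A @ v, b, np.zeros(2))
```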
Row-Vector Product
y = Ax
[Diagram: the fragment program gathers from R (pointers to segments), J (pointers to x_j), X (vector elements), Ai (diagonal matrix elements), Aj (off-diagonal matrix elements)]
Apply to All Pixels
• Two extremes
– one row at a time: setup overhead
– all rows at once: limited by worst row
• Middle ground
– organize “batches” of work
• How to arrange batches?
– order rows by non-zero entries
• optimal packing NP hard
• We choose fixed size rectangles
– fragment pipe is quantized
– simple experiments reveal best size
• 26 x 18 – 91% efficient
• wasted fragments on diagonal
[Chart: time vs. area (pixels)]
Packing (Greedy)
non-zero entries per row (sorted):
15 13 13 12 12 11 10 9 9 9 9 8 8 8 8 8 7 7 7 7 7 7 7 7 7 7 6 5 5 4
batches, each bound to an appropriate fragment program:
15 13 13  9  9  8  7  7  7
12 12 11  8  8  8  7  7  7
10  9  9  8  7  7  7  7  6
All this setup is done once only, up front; it depends only on the mesh connectivity.
Recomputing Matrix
• Matrix entries depend on surface
– must “render” into matrix
– two additional indirection textures
• previous and next
Results (NV30@500MHz)
• 37k elements
– matrix multiply
• 33 instructions; 120 multiplies per second
• only 13 flops
• latency limited
– reduction
• 7 inst/frag/pass; 3400 reductions per second
– CG solve: 20 per second
Regular Grids
• Poisson solver as example
– multigrid approach
– this time variables on “pixel grid”
• e.g.: Navier-Stokes
∇·u = 0
∂u/∂t + (u·∇)u = ν∇²u + b
after discretization: solve a Poisson equation at each time step:
∇²p = ∇·u
Poisson Equation
• Appears all over the place
– easy to discretize on regular grid
– matrix multiply is stencil application
– FD Laplace stencil:
• Use iterative matrix solver
– just need application of stencil
• easy: just like filtering
• incorporate geometry (Jacobian)
• variable coefficients
∇²X_{i,j} ≈ X_{i−1,j} + X_{i+1,j} + X_{i,j−1} + X_{i,j+1} − 4X_{i,j}
stencil centered at (i,j):
 0  1  0
 1 −4  1
 0  1  0
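The stencil application really is just filtering: each pixel gathers its four neighbors. A NumPy sketch of the matrix multiply as a gather, plus a basic Jacobi relaxation built on it:

```python
import numpy as np

def apply_laplace(x):
    """The 5-point stencil as a gather: on the GPU every pixel reads
    its four neighbours; here it is one vectorized sweep over the
    interior (boundary held fixed)."""
    y = np.zeros_like(x)
    y[1:-1, 1:-1] = (x[:-2, 1:-1] + x[2:, 1:-1] + x[1:-1, :-2]
                     + x[1:-1, 2:] - 4.0 * x[1:-1, 1:-1])
    return y

def jacobi(x, b, steps):
    """Jacobi relaxation for  Laplace(x) = b  with Dirichlet boundary
    given by the border of x (held fixed)."""
    x = x.copy()
    for _ in range(steps):
        x[1:-1, 1:-1] = 0.25 * (x[:-2, 1:-1] + x[2:, 1:-1] + x[1:-1, :-2]
                                + x[1:-1, 2:] - b[1:-1, 1:-1])
    return x

# Laplace's equation with boundary value 1 relaxes to the constant 1.
x0 = np.zeros((10, 10))
x0[0, :] = x0[-1, :] = x0[:, 0] = x0[:, -1] = 1.0
x = jacobi(x0, np.zeros_like(x0), steps=2000)
```

The 2000 plain Jacobi sweeps needed here are exactly the slowness that motivates multigrid in the next slides.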
Multigrid
• Fine to coarse to fine cycle
– high freq. error removed quickly
– lower frequency error takes longer
[Diagram: V-cycle — relax, project; relax, project; … ; interpolate, relax; interpolate, relax]
Computations and Storage Layout
• Lots of stencil applications
– matrix multiply: 3x3 stencil
– projection: 3x3 stencil
– interpolation: 2x2(!)
• floor op in indexing
projection stencil:
1/16 ×  1 2 1
        2 4 2
        1 2 1
interpolation (with the floor in the indexing):
v_h[i] = 1/4 Σ_{d∈{0,1}²} v_2h[⌊(i+d)/2⌋]
• Storage for matrices and DOFs
– variables in one texture
– matrices in 9(=3x3) textures
– all textures packed
• exploit 4 channels
• domain decomp.
• padded boundary
[Diagram: packed layout using the x, y, z, w channels]
Coarser Matrices
• Operator at coarser level
– needed for relaxation at all levels
A_c = P A_f Pᵀ
• Triple matrix product…
– work out terms and map to stencils
• exploit local support of stencils
• straightforward but t-e-d-i-o-u-s
A^{2h}_d[i] = 1/4 Σ_{e,g∈{−1,0,1}²} S′[e] S′[e+g−2d] A^h_g[2i+e]
(S′: interpolation stencil; A^h_g: fine-level stencil coefficient in direction g)
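The stencil algebra is tedious, but easy to sanity-check numerically: form the triple matrix product explicitly on a tiny grid and read the coarse stencil off an interior row. The scalings below (unscaled fine stencil, bilinear interpolation taken as 4 Pᵀ of the full-weighting projection) are illustrative assumptions for the check:

```python
import numpy as np

def idx(i, j, n):
    return i * n + j

def fine_laplacian(n):
    """Unscaled 5-point (negative) Laplacian; boundary rows identity."""
    A = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            k = idx(i, j, n)
            if 0 < i < n - 1 and 0 < j < n - 1:
                A[k, k] = 4.0
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    A[k, idx(i + di, j + dj, n)] = -1.0
            else:
                A[k, k] = 1.0
    return A

def projection(nf):
    """Full-weighting projection P (coarse x fine): the stencil
    1/16 [1 2 1; 2 4 2; 1 2 1] centred at every other fine node."""
    nc = (nf + 1) // 2
    w = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16.0
    P = np.zeros((nc * nc, nf * nf))
    for I in range(1, nc - 1):
        for J in range(1, nc - 1):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    P[idx(I, J, nc),
                      idx(2 * I + di, 2 * J + dj, nf)] = w[di + 1, dj + 1]
    return P

nf, nc = 9, 5
P = projection(nf)
Ac = P @ fine_laplacian(nf) @ (4.0 * P.T)   # bilinear interpolation = 4 P^T
# Read the coarse stencil off an interior row of the triple product.
stencil = Ac[idx(2, 2, nc)].reshape(nc, nc)[1:4, 1:4]
```

For the 5-point Laplacian this produces a symmetric 9-point interior coarse stencil with zero row sum, matching what the summation formula above yields term by term.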
Results (NV30@500MHz)
• 257x257 grid
– matrix multiply: 27 instructions
• 1370 per second
– interpolation: 10 instructions
– projection: 19 instructions
• Overall performance
– 257x257 at 80 fps!
Conclusions
• Enhancements
– global registers for reductions
– texture fetch with offset
– rectangular texture border
– scalar versus vector problems
• Where are we now?
– good streaming processor
– twice as fast as a CPU implementation
– lots of room for improvement
• Scientific computing compiler
– better languages! Brook? C*?
– manage layout in a buffer