Transcript Document

ME964
High Performance Computing
for Engineering Applications
Outlining Midterm Projects
Topic 3: GPU-based FEA
Topic 4: GPU Direct Solver for Sparse Linear Algebra
March 01, 2011
© Dan Negrut, 2011
ME964 UW-Madison
“The real problem is not whether machines think but whether men do.”
B. F. Skinner
Before We Get Started…

Last time

Midterm Project topics 1 and 2



Today

Midterm Project topics 3 and 4



Finite Element Method on the GPU. Area coordinators: Prof. Suresh and Naresh Khude
Sparse direct solver on the GPU (Cholesky). Area coordinator: Dan Negrut
Midterm Project Related Issues



Midterm Project is due on 04/13 at 11:59 PM (use Learn@UW drop-box)
Intermediate report due on 03/22 at 11:59 PM (use the same Learn@UW drop-box)
Each area coordinator



Will provide a test problem for you to test your GPU implementation
Will also assist you with questions related to the non-programming aspects (the “theory”) behind the topic you chose
You can continue your Midterm Project (MP) and have it become your Final Project (FP)


Discrete Element Method on the GPU. Area coordinator: Toby Heyn
Collision Detection on the GPU. Area coordinator: Arman Pazouki
In this case you will be expected to show how the FP implementation is superior to your MP implementation
Other issues

HW5 due tonight at 11:59 PM

Use Learn@UW drop-box to submit homework
2
Finite Element Analysis
on the GPU?
Krishnan Suresh
[email protected]
Associate Professor
Finite Element Analysis
 Computer simulation of engineering models
 Physics:
– Structural, thermal, fluid, …
 Mode:
– Static, modal, transient
– Linear, non-linear, multi-physics
Why GPU?
[Gordon; JPL]
Hours or even days of CPU time.
Question
Can one exploit graphics programmable units (GPU) to speedup Finite Element analysis?
+
Structural Static FEA
Ke
fe
Model
Discretize
Element
Stiffness
K   Ke
f   fe
Ku  f
Assemble/
Solve
Postprocess
FEA: Variations
K   Ke
f   fe
Ke
fe
Tet/Hex/…
Model
Discretize
Ku  f
Order/Hybrid
Element
Stiffness
Direct/Iterative
Assemble/
Solve
Nonlinear
Optimization
Postprocess
FEA: Challenges
K   Ke
Ke
f   fe
fe
Tet/Hex/…
Model
Discretize
Ku  f
Order/Hybrid
Element
Stiffness
Direct/Iterative
Assemble/
Solve
Nonlinear
Optimization
Postprocess
1. Accuracy
2. Automation
3. Speed
Typical Bottleneck
Ke
Model
Discretize
K   Ke
fe
f   fe
Element
Stiffness
Assemble/
Solve
Ku  f
Postprocess
GPU & Engineering Analysis
Model
Discretize
CPU
GPU?
Not a good candidate for GPU!?
Discretization

Data: Small b-rep (+)

Logic: Complex (-)

Threads: Few (-)
Element Stiffness
Ke
fe
Model
Discretize
Element
Stiffness
CPU
GPU?
CPU
Element Stiffness

Data: O(N) (+/-)

Logic: Simple (+)

Threads: N (+)
Hex 2nd Order
Hex Hybrid
Stiffness: Hex 2nd Order
Ke  
(8 Corners)
(27 Nodes)



8 Corners~100 Bytes Data (x y z)
27 Nodes~ M = 81 DOF (u v w)
kij ~ Gaussian integration
–
30 flops
Flops  N (15M 2 )
N  200000, M  81
TCPU  4sec
( M ,M )
Typical Bottleneck
Ke
Model
Discretize
K   Ke
fe
f   fe
Element
Stiffness
Assemble/
Solve
Ku  f
Direct vs. Iterative
Ku  f
Direct
K is sparse & usually symmetric P.D
Iterative
K  LDL
T
1
1 T
uL D L f
i 1
u  u  B( f  Ku )
B : Preconditioner of K
i
i
(GPU Variation: Assembly-free)
Note: Nvidia offers CuBLAS-3 dense matrix library
Direct Sparse on GPU (1)
(2006)
Direct Sparse on GPU (1)
Ku  f
Direct Sparse on GPU (1)
Ku  f
Direct Sparse on GPU (2)
Ku  f
(2008)
Direct Sparse on GPU (2)
Ku  f
Iterative Sparse on GPU (1)
(2008)
 Jacobi preconditioned conjugate gradient
 ATI GPU
 Speed-up 3.5.
Iterative Sparse on GPU (2)
 Double precision real world SpMv
– CPU (2.3 GHz Dual Xeon): 1 GFLOPS
– GPU (GTX 280): 16 GFLOPS
– Speedup ~ 16
FEA/GPU Class Projects?
1. Complete < 6 weeks
2. Important (publishable)
3. Pilot code
FEA/GPU Class Projects?
1. GPU Friendly Preconditioners for Thin Structures
– Research papers
– OpenCL and ViennaCL Pilot Code
2. Topology Optimization
– Research papers
– CUDA code
3. Others
– Can discuss …
Thin Structure?
Thin Structure?
Large K
Preconditioners?
Ku  f

u i 1  u i  B( f  Ku i )
B : Preconditioner of K
Iterative Methods:
– GPU methods available for K*u
– Typical preconditioners: simple Jacobi, …


Poor preconditioner … slow convergence
Objective:
– GPU friendly preconditioner for thin structures
Research Publication
Basic Idea
Restriction
(via dual-representation)
Prolongation
(via dual-representation)
1-D Coarse Mesh
(via dual-representation)
Algorithm
Why Preconditioner?
Why Double Precision?
How Expensive is Preconditioner?
GPU Friendly
Speed-up without Preconditioner
Speed-up with Preconditioner
FEA/GPU Class Projects?
1. GPU Friendly Preconditioners for Thin Structures
– Research papers
– OpenCL and ViennaCL Pilot Code
2. Topology Optimization
– Research papers
– CUDA code
3. Others
– Can discuss …
Topology Optimization
D
Stiffest topology for a given volume?
Where to remove material?
Min J
V = 50%
[Sigmund 2001]
WÌ D
W = V0
Multi Objective + Topology Optimization = MOTO
Min {J ,V 0 }
WÌ D
Demo
Matlab code www.ersl.wisc.edu
Pareto Optimal Designs

Purely pareto optimal
Comparison
D
3-D
Pareto-Method
SIMP
3-D GPU Implementation
Multi-grid Topology Optimization on the GPU
(IDETC conf. 2011)
Motivation for Topic 4:
Sparse Direct Solver
42
Nomenclature
&
Simplifying Assumptions
43
The Schur Complement Problem in
Multi-Body Dynamics Applications
44
Formulation Framework

T
r

[
x
,
y
,
z
]
Position: i
i
i
i

Orientation: Euler parameters, pi  [ei0 , ei1, ei2 , ei3 ]T

T
r

[
x
,
y
,
z
]
Translational Velocity: i
i
i
i

Angular velocities
i  ix , iy , iy ]T
45
Constrained Equations of Motion
 (r, p, t )  0
 (r, p, t )r   (r, p, t )  t (r, p, t )
  (r, p, t )r    (r, p, t )  (r, , r, p, t )
T
M 0   r    (r, p, t ) 
F(r, , r, p, t ) 
 0 J    T (r, p, t )     nˆ (r, , r, p, t ) 

   



46
Numerical Solution of the Newton-Euler
Constrained Equations of Motion

One has to solve a set of Differential Algebraic Equations
(DAEs) to find the time evolution of a mechanical system

Most often the numerical solution of the DAEs requires the
solution of a linear system of the form:
M

 0
 
0
J

   r  F 
   
    nˆ
   
0       
T

T

47
Approach Followed

First solve the “Reduced System” for  :
 

M 1 0  T 
   
 b
1   T 
J    
 0
Then recover accelerations
r  M 1 (F   T  )
  J 1 (nˆ   T  )
48
Iterative Solution of the
Reduced System

Define positive definite Reduced Matrix E
E   

M 1 0  T 
   
1   T 
J    
 0
Preconditioned Conjugate Gradient

requires computation at time t n of En  ( k )

requires preconditioning:
Eold   b
49
Computing En 
(k )
Time step n, iteration (k):
e(nk )  En (nk )
 e1 
 2
e 


 Rm
 
 J
e 

A thread is associated with each body

We’ll look at how thread 9 does its
share of work to compute e 3
50
How Thread-9 Does its Work
S1. Compute reaction forces acting on me:
F9C  (93 )T 3  (95 )T 5  (96 )T  6
S2. Compute my constraint acceleration
1
9
a  M F
C
9
C
9
S3. Project my constraint acceleration
   a
3
9
Finally,
3
9
C
9
   a
5
9
e3  39  123
5
9
C
9
   a
6
9
6
9
C
9
51
Iteration Operation Count
for Body 9 (Thread-9)
Step
Multiplications
Additions
S1
6  C9
6  (C9  1)
S2
6
5
S3
6  C9
5  C9
52
Computing En  (k )
[Concluding Remarks]

The algorithm scales very well: one
thread for each body

Each thread only interacts with
adjacent joints

Load balance is obtained when the
bodies have similar topology index
53
Direct Solution of the
Reduced System
54
The Sparse Direct Solver
55
The Direct Solver: How Things Get Done

In the reduced linear system E  b
each constraint induces an equation

Example: constraint 3 induced equation:
E32 2  E33 3  E35 5  E36 6  b3


Since E is positive definite, E33 is also
positive definite
Fundamental Idea:

Solve for ¸3 and substitute it in all the equations where it shows up
56
First Example:
Seven-Body Mechanism
57
58
The Elimination Sequence

The fundamental question is this: what should be the sequence in
which the unknowns (the edges of the graph) are eliminated?


The question becomes more complicated since you are interested in
a parallel elimination sequence


You would like to limit the amount of synchronization barriers that you
impose in the implementation
In the end, although it’s formulated like solving
a system, the problem becomes that starting
with a graph and eliminating its edges in
parallel

59
Different elimination sequences result in different levels of effort
Similar to a Mikado, or “pick-up sticks”, game that you
want to play in parallel
Second Example:
HMMWV Model
Elim. Sequence
Bad
Good
Index Reduction
A
1240
459
220
M
1336
469
233
I
195
109
90
F
96
10
13
NNZ
99
99
77
60