Transcript Document
ME964
High Performance Computing
for Engineering Applications
Outlining Midterm Projects
Topic 3: GPU-based FEA
Topic 4: GPU Direct Solver for Sparse Linear Algebra
March 01, 2011
© Dan Negrut, 2011
ME964 UW-Madison
“The real problem is not whether machines think but whether men do.”
B. F. Skinner
Before We Get Started…
Last time
Midterm Project topics 1 and 2
Discrete Element Method on the GPU. Area coordinator: Toby Heyn
Collision Detection on the GPU. Area coordinator: Arman Pazouki
Today
Midterm Project topics 3 and 4
Finite Element Method on the GPU. Area coordinators: Prof. Suresh and Naresh Khude
Sparse direct solver on the GPU (Cholesky). Area coordinator: Dan Negrut
Midterm Project Related Issues
Midterm Project is due on 04/13 at 11:59 PM (use Learn@UW drop-box)
Intermediate report due on 03/22 at 11:59 PM (use the same Learn@UW drop-box)
Each area coordinator
Will provide a test problem for you to test your GPU implementation
Will also assist you with questions related to the non-programming aspects (the “theory”) behind the topic you chose
You can continue your Midterm Project (MP) and have it become your Final Project (FP)
In this case you will be expected to show how the FP implementation is superior to your MP implementation
Other issues
HW5 due tonight at 11:59 PM
Use Learn@UW drop-box to submit homework
Finite Element Analysis
on the GPU?
Krishnan Suresh
[email protected]
Associate Professor
Finite Element Analysis
Computer simulation of engineering models
Physics:
– Structural, thermal, fluid, …
Mode:
– Static, modal, transient
– Linear, non-linear, multi-physics
Why GPU?
[Gordon; JPL]
Large engineering analyses can take hours or even days of CPU time.
Question
Can one exploit programmable graphics processing units (GPUs) to speed up finite element analysis?
Structural Static FEA
[Pipeline: Model → Discretize → Element Stiffness (compute $K_e$, $f_e$ per element) → Assemble/Solve ($K = \sum K_e$, $f = \sum f_e$, solve $Ku = f$) → Postprocess]
FEA: Variations
[Same pipeline, with variations at each stage: Discretize (Tet/Hex/…), Element Stiffness (Order/Hybrid), Assemble/Solve (Direct/Iterative), plus Nonlinear and Optimization loops]
FEA: Challenges
[Same pipeline diagram]
1. Accuracy
2. Automation
3. Speed
Typical Bottleneck
[Pipeline diagram again, with the compute-heavy stages, Element Stiffness ($K_e$, $f_e$) and Assemble/Solve ($K = \sum K_e$, $f = \sum f_e$, $Ku = f$), called out as the typical bottleneck]
GPU & Engineering Analysis
Discretization (Model → Discretize) stays on the CPU. Not a good candidate for GPU!?
Data: Small b-rep (+)
Logic: Complex (-)
Threads: Few (-)
Element Stiffness
Element stiffness computation ($K_e$, $f_e$), in contrast, maps well to the GPU:
Data: O(N) (+/-)
Logic: Simple (+)
Threads: N (+)
Element types considered: Hex 2nd Order, Hex Hybrid
Stiffness: Hex 2nd Order
A 2nd-order hex element has 8 corners (~100 bytes of (x, y, z) data) and 27 nodes, i.e. M = 81 DOF (u, v, w), so $K_e$ is an $M \times M$ matrix.
Each entry $k_{ij}$ is evaluated by Gaussian integration (~30 flops).
Flops $\approx N\,(15M^2)$; with $N \approx 200{,}000$ elements and $M = 81$, $T_{CPU} \approx 4$ sec.
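To make the one-thread-per-element idea concrete, here is a minimal CUDA sketch that computes element stiffness matrices by Gauss quadrature. To stay short it uses 2-node 1-D bar elements (a 2×2 $K_e$, exact with 2-point quadrature) instead of 27-node hexes; the kernel name, data layout, and mesh are illustrative assumptions, not the code referenced in the lecture.

```cuda
// Minimal sketch: one GPU thread per element, each thread integrating its
// own element stiffness by Gauss quadrature (1-D bar elements for brevity).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void elementStiffnessKernel(const float* x,  // nodal coordinates (chain mesh)
                                        float* Ke,       // 2x2 stiffness per element, row-major
                                        float EA,        // axial rigidity, assumed constant
                                        int numElements)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numElements) return;

    float L = x[e + 1] - x[e];          // element e connects nodes e and e+1
    const float gw[2] = {1.0f, 1.0f};   // 2-point Gauss weights on [-1, 1]

    // For a linear bar dN/dx is constant, so the Gauss points themselves
    // drop out; the quadrature loop is kept to show the structure.
    float k00 = 0.f, k01 = 0.f, k11 = 0.f;
    for (int q = 0; q < 2; ++q) {
        float dN0 = -1.0f / L, dN1 = 1.0f / L;
        float w   = gw[q] * 0.5f * L;   // quadrature weight times mapping Jacobian
        k00 += EA * dN0 * dN0 * w;
        k01 += EA * dN0 * dN1 * w;
        k11 += EA * dN1 * dN1 * w;
    }
    Ke[4 * e + 0] = k00;  Ke[4 * e + 1] = k01;
    Ke[4 * e + 2] = k01;  Ke[4 * e + 3] = k11;
}

int main()
{
    const int nElem = 4, nNode = nElem + 1;
    float xHost[nNode] = {0.f, 1.f, 2.f, 3.f, 4.f};

    float *xDev, *KeDev;
    cudaMalloc(&xDev, nNode * sizeof(float));
    cudaMalloc(&KeDev, 4 * nElem * sizeof(float));
    cudaMemcpy(xDev, xHost, nNode * sizeof(float), cudaMemcpyHostToDevice);

    elementStiffnessKernel<<<1, 64>>>(xDev, KeDev, 1.0f, nElem);

    float KeHost[4 * nElem];
    cudaMemcpy(KeHost, KeDev, sizeof(KeHost), cudaMemcpyDeviceToHost);
    printf("Ke[0] = [%g %g; %g %g]\n", KeHost[0], KeHost[1], KeHost[2], KeHost[3]);

    cudaFree(xDev);  cudaFree(KeDev);
    return 0;
}
```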
Typical Bottleneck
[Pipeline diagram repeated: with element stiffness accelerated, the Assemble/Solve stage ($K = \sum K_e$, $f = \sum f_e$, solve $Ku = f$) remains the bottleneck]
Direct vs. Iterative
Solve $Ku = f$; $K$ is sparse and usually symmetric positive definite.
Direct: factor $K = LDL^T$, then $u = L^{-T} D^{-1} L^{-1} f$.
Iterative: $u^{i+1} = u^i + B\,(f - Ku^i)$, where $B$ is a preconditioner of $K$. (GPU variation: assembly-free.)
Note: Nvidia offers the CuBLAS-3 dense matrix library.
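A minimal sketch of the assembly-free idea mentioned above: the global $K$ is never formed; instead each element thread applies its own $K_e$ to the element's local DOFs and scatter-adds the result into $y = Ku$. It assumes the 1-D bar-chain layout of the stiffness sketch earlier (so it can be appended to that file); the host driver, which zeroes $y$ and launches the kernel, is omitted for brevity.

```cuda
// Sketch of an assembly-free product y = K*u: K is never assembled; each
// element thread applies its local Ke and scatter-adds into the global result.
__global__ void assemblyFreeMatVec(const float* Ke,  // 2x2 per element
                                   const float* u,   // global vector, size nNode
                                   float* y,         // output, size nNode, pre-zeroed
                                   int numElements)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numElements) return;

    int n0 = e, n1 = e + 1;                                // chain-mesh connectivity
    float u0 = u[n0], u1 = u[n1];

    float y0 = Ke[4 * e + 0] * u0 + Ke[4 * e + 1] * u1;    // local Ke * ue
    float y1 = Ke[4 * e + 2] * u0 + Ke[4 * e + 3] * u1;

    atomicAdd(&y[n0], y0);                                 // scatter-add into y = K*u
    atomicAdd(&y[n1], y1);
}
```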
Direct Sparse on GPU (1)
[2006 work on solving the sparse system $Ku = f$ directly on the GPU]
Direct Sparse on GPU (2)
[2008 work on solving the sparse system $Ku = f$ directly on the GPU]
Iterative Sparse on GPU (1)
(2008)
Jacobi-preconditioned conjugate gradient
ATI GPU
Speed-up: ~3.5×
Iterative Sparse on GPU (2)
Double-precision, real-world SpMV
– CPU (2.3 GHz dual Xeon): 1 GFLOPS
– GPU (GTX 280): 16 GFLOPS
– Speedup ~ 16×
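For orientation only, the simplest CSR SpMV kernel (one thread per row) is sketched below; the published kernels behind the numbers above use more sophisticated formats and memory-access patterns. A launch would look like spmvCsrScalar<<<(numRows + 255) / 256, 256>>>(…).

```cuda
// Scalar CSR SpMV: one thread per row. A baseline sketch only; tuned GPU
// SpMV kernels use other formats (ELL, HYB, blocked CSR) and better coalescing.
__global__ void spmvCsrScalar(int numRows,
                              const int*    rowPtr,   // size numRows + 1
                              const int*    colIdx,   // size nnz
                              const double* val,      // size nnz
                              const double* x,        // input vector
                              double*       y)        // output vector
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= numRows) return;

    double sum = 0.0;
    for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
        sum += val[j] * x[colIdx[j]];
    y[row] = sum;
}
```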
FEA/GPU Class Projects?
1. Can be completed in under 6 weeks
2. Important (publishable)
3. Pilot code available
FEA/GPU Class Projects?
1. GPU Friendly Preconditioners for Thin Structures
– Research papers
– OpenCL and ViennaCL Pilot Code
2. Topology Optimization
– Research papers
– CUDA code
3. Others
– Can discuss …
Thin Structure?
[Figures: examples of thin structures, which lead to a large K]
Preconditioners?
$Ku = f$, with iteration $u^{i+1} = u^i + B\,(f - Ku^i)$, $B$: preconditioner of $K$
Iterative Methods:
– GPU methods available for the $K u$ product
– Typical preconditioners: simple Jacobi, …
Poor preconditioner … slow convergence
Objective:
– GPU-friendly preconditioner for thin structures
Research Publication
Basic Idea
[Diagram: restriction from the fine mesh to a 1-D coarse mesh, and prolongation back, both via a dual representation]
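The slide only names the ingredients: restriction to a 1-D coarse mesh and prolongation back, both via a dual representation. For orientation, a generic two-level preconditioner built from such a restriction $R$ and prolongation $P$ acts as shown below; this is a sketch of the general pattern only, not the specific dual-representation construction.

\[
K_c = R\,K\,P, \qquad B\,r = P\,K_c^{-1}\,R\,r, \qquad u^{i+1} = u^i + B\,(f - K u^i).
\]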
Algorithm
Why Preconditioner?
Why Double Precision?
How Expensive is the Preconditioner?
GPU Friendly
Speed-up without Preconditioner
Speed-up with Preconditioner
FEA/GPU Class Projects?
1. GPU Friendly Preconditioners for Thin Structures
– Research papers
– OpenCL and ViennaCL Pilot Code
2. Topology Optimization
– Research papers
– CUDA code
3. Others
– Can discuss …
Topology Optimization
Stiffest topology for a given volume? Where to remove material? (V = 50%) [Sigmund 2001]
$\min_{\Omega \subset D} J$ subject to $|\Omega| = V_0$
Multi-Objective + Topology Optimization = MOTO:
$\min_{\Omega \subset D} \{J, |\Omega|\}$
Demo
Matlab code www.ersl.wisc.edu
Pareto Optimal Designs
[Figures: purely Pareto-optimal designs]
Comparison
[Figures: 3-D results on design domain D, Pareto-Method vs. SIMP]
3-D GPU Implementation
Multi-grid Topology Optimization on the GPU
(IDETC conf. 2011)
Motivation for Topic 4:
Sparse Direct Solver
Nomenclature & Simplifying Assumptions
The Schur Complement Problem in
Multi-Body Dynamics Applications
Formulation Framework
Position: $r_i = [x_i, y_i, z_i]^T$
Orientation: Euler parameters, $p_i = [e_{i0}, e_{i1}, e_{i2}, e_{i3}]^T$
Translational velocity: $\dot r_i = [\dot x_i, \dot y_i, \dot z_i]^T$
Angular velocity: $\bar\omega_i = [\omega_{ix}, \omega_{iy}, \omega_{iz}]^T$
Constrained Equations of Motion
$\Phi(r, p, t) = 0$
$\Phi_r(r, p, t)\,\dot r + \Pi(r, p, t)\,\bar\omega = -\Phi_t(r, p, t)$
$\Phi_r(r, p, t)\,\ddot r + \Pi(r, p, t)\,\dot{\bar\omega} = \tau(r, \bar\omega, \dot r, p, t)$
$\begin{bmatrix} M & 0 \\ 0 & \bar J \end{bmatrix} \begin{bmatrix} \ddot r \\ \dot{\bar\omega} \end{bmatrix} + \begin{bmatrix} \Phi_r^T(r, p, t) \\ \Pi^T(r, p, t) \end{bmatrix} \lambda = \begin{bmatrix} F(r, \bar\omega, \dot r, p, t) \\ \hat n(r, \bar\omega, \dot r, p, t) \end{bmatrix}$
Here $\Phi_r$ and $\Pi$ denote the constraint Jacobians with respect to the translational and rotational coordinates, and $\tau$ collects the acceleration-level right-hand-side terms.
Numerical Solution of the Newton-Euler Constrained Equations of Motion
One has to solve a set of Differential Algebraic Equations (DAEs) to find the time evolution of a mechanical system.
Most often the numerical solution of the DAEs requires the solution of a linear system of the form:
$\begin{bmatrix} M & 0 & \Phi_r^T \\ 0 & \bar J & \Pi^T \\ \Phi_r & \Pi & 0 \end{bmatrix} \begin{bmatrix} \ddot r \\ \dot{\bar\omega} \\ \lambda \end{bmatrix} = \begin{bmatrix} F \\ \hat n \\ \tau \end{bmatrix}$
Approach Followed
First solve the “Reduced System” for $\lambda$:
$\big(\Phi_r M^{-1} \Phi_r^T + \Pi\,\bar J^{-1} \Pi^T\big)\,\lambda = b$
Then recover the accelerations:
$\ddot r = M^{-1}\big(F - \Phi_r^T \lambda\big)$
$\dot{\bar\omega} = \bar J^{-1}\big(\hat n - \Pi^T \lambda\big)$
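Where the reduced system comes from, in the notation reconstructed above: solve the first two block rows of the saddle-point system for the accelerations, then substitute them into the acceleration-level constraint equation.

\[
\ddot r = M^{-1}\big(F - \Phi_r^T\lambda\big), \qquad
\dot{\bar\omega} = \bar J^{-1}\big(\hat n - \Pi^T\lambda\big)
\]
\[
\underbrace{\big(\Phi_r M^{-1}\Phi_r^T + \Pi\,\bar J^{-1}\Pi^T\big)}_{E}\;\lambda
= \underbrace{\Phi_r M^{-1} F + \Pi\,\bar J^{-1}\hat n - \tau}_{b}
\]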
Iterative Solution of the Reduced System
Define the positive definite Reduced Matrix $E$:
$E = \Phi_r M^{-1} \Phi_r^T + \Pi\,\bar J^{-1} \Pi^T$
Preconditioned Conjugate Gradient:
– requires, at time $t_n$, computation of the product $E_n \lambda^{(k)}$
– requires preconditioning, e.g. a solve with $E_{old}$: $E_{old}\,\lambda = b$
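For reference, a serial sketch of the PCG iteration applied to $E\lambda = b$ (not the course's code; host code that also compiles in a .cu file). The two hooks, applyE and applyPrecond, are hypothetical stand-ins: on the GPU, applyE is the one-thread-per-body product described on the next slides, and applyPrecond could be, e.g., a solve with $E_{old}$.

```cuda
// Serial reference sketch of preconditioned conjugate gradient for E*lambda = b.
#include <vector>
#include <cmath>
#include <functional>

using Vec = std::vector<double>;

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

Vec pcg(const std::function<Vec(const Vec&)>& applyE,        // returns E*x
        const std::function<Vec(const Vec&)>& applyPrecond,  // returns B*r
        const Vec& b, int maxIter, double tol)
{
    Vec lambda(b.size(), 0.0);
    Vec r = b;                       // residual for the zero initial guess
    Vec z = applyPrecond(r);
    Vec p = z;
    double rz = dot(r, z);

    for (int k = 0; k < maxIter && std::sqrt(dot(r, r)) > tol; ++k) {
        Vec Ep = applyE(p);
        double alpha = rz / dot(p, Ep);
        for (std::size_t i = 0; i < lambda.size(); ++i) {
            lambda[i] += alpha * p[i];
            r[i]      -= alpha * Ep[i];
        }
        z = applyPrecond(r);
        double rzNew = dot(r, z);
        double beta  = rzNew / rz;
        rz = rzNew;
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = z[i] + beta * p[i];
    }
    return lambda;
}
```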
Computing $E_n \lambda^{(k)}$
Time step $n$, iteration $(k)$: compute $e_n^{(k)} = E_n \lambda_n^{(k)}$, where $e = [e^1, e^2, \ldots]^T \in \mathbb{R}^m$ is partitioned by constraint.
A thread is associated with each body.
We'll look at how thread 9 does its share of work to compute $e^3$.
How Thread-9 Does its Work
S1. Compute reaction forces acting on me:
$F_9^C = (\Phi_9^3)^T \lambda^3 + (\Phi_9^5)^T \lambda^5 + (\Phi_9^6)^T \lambda^6$
S2. Compute my constraint acceleration:
$a_9^C = M_9^{-1} F_9^C$
S3. Project my constraint acceleration:
$\Phi_9^3\, a_9^C$, $\Phi_9^5\, a_9^C$, $\Phi_9^6\, a_9^C$
Finally, $e^3$ receives thread 9's contribution $\Phi_9^3\, a_9^C$ plus the analogous contribution from the other body connected by constraint 3.
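A hypothetical CUDA sketch of steps S1-S3 for a generic body i, assuming scalar constraints, a per-body list of attached constraints, and a diagonal inverse mass. All names and data layouts are illustrative assumptions made just to show the one-thread-per-body pattern, not the course's implementation.

```cuda
// Hypothetical sketch of steps S1-S3, one thread per body i.
__global__ void applyE_perBody(int numBodies,
                               const int*    cStart,  // constraints of body i: cIdx[cStart[i]..cStart[i+1])
                               const int*    cIdx,    // global constraint indices
                               const double* Phi,     // 1x6 Jacobian block per (body, constraint) incidence
                               const double* Minv,    // 6 diagonal entries per body
                               const double* lambda,  // one entry per constraint
                               double*       e)       // output e = E*lambda, pre-zeroed
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numBodies) return;

    // S1: reaction force on body i, F = sum_j (Phi_i^j)^T * lambda_j
    double F[6] = {0, 0, 0, 0, 0, 0};
    for (int k = cStart[i]; k < cStart[i + 1]; ++k) {
        double lj = lambda[cIdx[k]];
        for (int d = 0; d < 6; ++d) F[d] += Phi[6 * k + d] * lj;
    }

    // S2: constraint acceleration a = M_i^{-1} * F (diagonal inverse mass here)
    double a[6];
    for (int d = 0; d < 6; ++d) a[d] = Minv[6 * i + d] * F[d];

    // S3: project and scatter, e_j += Phi_i^j * a; the other body attached to
    // constraint j adds its own contribution from its own thread.
    // (Double-precision atomicAdd requires compute capability >= 6.0; on older
    //  hardware, store per-incidence contributions and sum them in a second pass.)
    for (int k = cStart[i]; k < cStart[i + 1]; ++k) {
        double proj = 0.0;
        for (int d = 0; d < 6; ++d) proj += Phi[6 * k + d] * a[d];
        atomicAdd(&e[cIdx[k]], proj);
    }
}
```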
Iteration Operation Count for Body 9 (Thread-9)
Here C9 denotes the number of constraints body 9 is connected to.

Step   Multiplications   Additions
S1     6·C9              6·(C9 - 1)
S2     6                 5
S3     6·C9              5·C9
Computing $E_n \lambda^{(k)}$ [Concluding Remarks]
The algorithm scales very well: one thread for each body.
Each thread only interacts with adjacent joints.
Load balance is obtained when the bodies have a similar topology index.
Direct Solution of the
Reduced System
The Sparse Direct Solver
The Direct Solver: How Things Get Done
In the reduced linear system $E\lambda = b$, each constraint induces an equation.
Example: the equation induced by constraint 3:
$E_{32}\lambda_2 + E_{33}\lambda_3 + E_{35}\lambda_5 + E_{36}\lambda_6 = b_3$
Since $E$ is positive definite, $E_{33}$ is also positive definite.
Fundamental Idea:
Solve for $\lambda_3$ and substitute it in all the equations where it shows up.
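Written out, one elimination step on the constraint-3 equation above: solve for $\lambda_3$, then update every other equation $i$ in which $\lambda_3$ appears. This is the Schur-complement update that underlies the sparse Cholesky factorization of Topic 4.

\[
\lambda_3 = E_{33}^{-1}\big(b_3 - E_{32}\lambda_2 - E_{35}\lambda_5 - E_{36}\lambda_6\big)
\]
\[
E_{ij} \leftarrow E_{ij} - E_{i3}\,E_{33}^{-1}E_{3j}, \qquad
b_i \leftarrow b_i - E_{i3}\,E_{33}^{-1}\,b_3
\]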
First Example:
Seven-Body Mechanism
The Elimination Sequence
The fundamental question is this: in what sequence should the unknowns (the edges of the graph) be eliminated?
The question becomes more complicated because you are interested in a parallel elimination sequence.
You would like to limit the number of synchronization barriers you impose in the implementation.
In the end, although it is formulated as solving a linear system, the problem becomes one of starting with a graph and eliminating its edges in parallel.
Different elimination sequences result in different levels of effort.
It is similar to a Mikado, or “pick-up sticks”, game that you want to play in parallel.
Second Example:
HMMWV Model
Elim. Sequence    Bad     Good    Index Reduction
A                 1240    459     220
M                 1336    469     233
I                  195    109      90
F                   96     10      13
NNZ                 99     99      77