並列前処理手法と領域分割，マルチコア時代の戦略

Transcript 並列前処理手法と領域分割，マルチコア時代の戦略

並列前処理手法と領域分割，
マルチコア時代の戦略
中島研吾
東京大学情報基盤センター
海洋研究開発機構地球シミュレータセンター
科学技術振興機構戦略的創造研究推進事業（CREST）
分野横断型研究会「アルゴリズムによる計算科学の融合と発展」
2009年4月22・23日筑波大学計算科学研究センター
2
Algorithm09
科学技術計算の真髄：SMASH
Science
Modeling
Algorithm
Software
Hardware
•
•
•
•
幅広い分野
バランス
分野間協力
少しは他のカテゴリーについても
知らないと協力は進まない。
• 応用
– 自分の問題が解ければOK
– 巷に役に立つライブラリが無い（特
に疎行列）
• アルゴリズム
– 万能を目指したい
3
Algorithm09
並列有限要素法による
大規模シミュレーション
•
•
•
•
非構造格子
要素単位のローカルな処理⇒大規模疎行列
悪条件問題（ill-conditioned problems）
前処理付並列反復法
Complicated Plate Model around Japan Islands
Magnetic Field of the Earth : MHD code
h=5.00
h=1.25
Simulation of Earthquake Generation Cycle
in Southwestern Japan
T=100
T=200
T=300
T=400
T=500
Transportation by Groundwater Flow
through Heterogeneous Porous Media
TSUNAMI !!
Algorithm09
4
講演の概要
• 並列反復法と領域分割：SMASH
– 選択的オーバーラップ〔KN 2007〕
– 階層型領域間境界分割（Hierarchical Interface
Decomposition）〔Henon & Saad 2007〕
• 拡張型HID法の提案：SMASH
– 悪条件向け領域分割手法
Science
Modeling
• Hybrid並列プログラミングモデル：SMASH
• HIDと並列多重格子法（時間があれば）：
SMASH
Algorithm
Software
Hardware
5
Algorithm09
「並列」前処理手法の技術的課題
（の一部）
• 領域分割
• 簡単な問題は簡単に解ける，
効率は出やすい
• 難しい問題はやっぱり難しい
– Block Jacobi型局所前処理 ⇒
領域数増加による反復回数増加
• 領域外の影響を（基本的に）無視
• 悪条件問題で顕著
6
Algorithm09
本講演で扱う悪条件問題
• 様々な悪条件問題がある
– 普通の工学的な問題は大抵「悪条
件問題」
– 係数行列の固有値分布，条件数
• ここでは，特に三次元固体力学に
おける問題を扱う
–
–
–
–
接触条件
不均質性
捩れ等
Block ILU型の前処理手法
• 各節点に3自由度（変位3成分）：効率，
安定性
0
1
2
2 ux 0 =  ux 1 +  ux 2
2 uy 0 =  uy 1 +  uy 2
2 uz 0 =  uz 1 +  uz 2
 ux0 =  ux 1
 uy0 =  uy 1
 uz0 =  uz 1
3 nodes form
1 selective block.
0
1
2 nodes form
1 selective block.
7
Algorithm09
悪条件問題の例
Heterogeneous Fields, Distorted Meshes
Algorithm09
8
地震発生サイクルシミュレーション
における接触問題
• プレート境界における準静的応力蓄積過程
• 非線形接触問題をNewton-Raphson法によって解く
• ALM法（Augmented Lagrangean,拡大ラグランジェ法）による
拘束条件：ペナルティ数
• 領域分割による並列有限要素法
9
Algorithm09
領域間境界＝接触面
収束は最悪
0.10
Y
1.00
X
1.00
Algorithm09
10
「硬い」要素群上に領域境界が来ると
収束は悪化する
E=103
E=100
3D Solid Mechanics
E: Young’s Modulus
Algorithm09
11
不均質弾性問題：203要素
■：4×4×20
Algorithm09
12
不均質弾性問題：203要素
BILU(0)-GPBiCG，反復回数
• ■■：n= 0.25
• ■ ：E=1.00
• 1-プロセッサ
– ■：E=10-3， 34回
– ■：E=100 ， 31回
– ■：E=10+3, 84回
• 8-プロセッサ（オーバーラップ無し）
– ■：E=10-3， 53回（×1.56）
– ■：E=100 ， 52回（×1.68）
– ■：E=10+3，158回（×1.88）
z
Uniform Distributed Force in
z-dirrection @ z=Zmin
Uy=0 @ y=Ymin
Ux=0 @ x=Xmin
Nz-1
Ny-1
Uz=0 @ z=Zmin
x
y
Nx-1
Algorithm09
13
悪条件問題への対処
• 悪条件問題を前処理付き反復法で解く場合，並列計算時に
は収束性が著しく悪化する場合があり，安定した収束を与え
る前処理，領域分割の研究は重要
– 領域分割：いわゆる数値計算ライブラリではカバーされていない分
野
• 対処法
– マルチレベル解析，Coarsegrid法
– 深い領域間オーバーラップ
• 計算・通信コスト
14
Algorithm09
領域間オーバーラップの拡張
PE#1
21
22
17
16
11
PE#0
23
24
18
19
13
14
8
9
7
1
2
PE#3
5
12
3
7
8
9
11
10
12
5
2
PE#3
11
10
8
4
1
15
6
7
14
13
4
5
10
1
2
3
8
9
11
12
10
9
11
12
9
6
3
5
PE#0
6
2
10
PE#2
PE#1
1
15
4
3
Cost for computation and
communication may increase
20
12
6
4
25
7
8
4
7
1
6
5
2
PE#2
3
●：Internal Nodes，●：External Nodes
■：Overlapped Elements
Algorithm09
15
不均質弾性問題：203要素
ILU(0)-GPBiCG，8領域，反復回数
オーバーラップ領域拡張の影響
Overlap
深さ
E=10-3
E=100
E=10+3
0
53
52
158
1
34
33
103
2
32
32
100
3
30
32
97
4
31
31
82
1領域
34
31
84
16
Algorithm09
接触問題：オーバーラップ拡張の効果
• [KN 2005]
– BILU(0,1,2)
– for “consistent” node number cases
– IBM SP3 in NERSC/LBNL
Preconditi
oning
SB-BILU
(0)
BILU(1)
BILU(1)
BILU(1)
SPAI
partitioning
(overlap #)
special [3]
1-layer
special [3]
1-layer
regular
1-layers
regular
2-layers
regular
2-layers
PE
#
16
128
16
128
16
128
16
128
16
128
iter’s
386
410
225
247
444
529
405
430
891
888
set-up+
solve(sec.)
506.2
63.9
563.2
95.0
1033.2
191.0
1063.3
204.6
626.3
105.1
parallel
speed-up
16.0
126.7
16.0
94.8
16.0
86.6
16.0
83.2
16.0
95.4
17
Algorithm09
最近やっていること
• 安定で効率的な並列前処理手法，領域分割手法開発
– アプリケーションの特徴を最大限使用
• 安定化，効率：ブロック化
– アプリケーション⇒特殊前処理⇒一般化
• 問題に応じて，最適な前処理手法，領域分割，パラメータの
組み合わせを自動的に選択するための手法の確立
• 大域的・局所的情報の利用
– 大域的情報：係数行列の条件数等
– 局所的情報：各要素のローカルな情報
– アプリケーションの性質
• 実問題の特性を反映させたベンチマーク
18
Algorithm09
実問題の特性を反映させたベンチマーク
• 疎行列解法，前処理手法の検証
– Matrix Market，実アプリマトリクス
– 制約が多い⇒結局ごく限られた条件を代表
– 取得が困難な場合がある
• 長時間のシミュレーション，大規模マトリクスデータ
• 非線形問題：違うフェーズ⇒性質違うマトリクス
• 実問題の特性を反映させたベンチマーク
– 元の問題と類似した性質の係数マトリクスを生成
• 元の問題が非線形でも，何らかの形で線形化が行われているはず
– 「ベンチマーク」は線形問題でも良い
– パラメータ，形状，問題規模を自由に変えられる
– 係数マトリクスが導出された過程くらいはわかっていなくてはな
らない
Algorithm09
19
領域分割手法
• 選択的オーバーラップ〔KN 2007〕
– 様々な条件に対応できるよう，有限要素法によるアプリケーション
の特性（要素の属性，物性）を利用した前処理手法，領域分割手法
• 適応的にオーバーラップ深さを調整
• 選択的フィルイン：前処理手法
• 階層型領域間境界分割（HID）〔Henon & Saad 2007〕
– Hierarchical Interface Decomposition
– 領域数増加による反復回数増加の効率的な抑止
• PHIDAL（Parallel Hierarchical Interface Decomposition Algorithm ）アル
ゴリズム
Algorithm09
20
選択的オーバーラップ，選択的フィルイン
• アセンブリ構造物における接触問題
• アプリケーションの特性を利用
– 選択的フィルイン：前処理
– 選択的オーバーラップ：領域分割
アセンブリ構造の例：ジェットエンジン
Algorithm09
21
22
Algorithm09
接触面における節点整合，不整合
接触面
整合
不整合
アセンブリ構造：部位ごとに別々に
メッシュを作るのでこのようなことが
おこりうる
地震：大すべり
例題：接触面節点不整合問題
アセンブリ構造を模擬
0.10
Z
Y
Y
X
1.00
X
1.00
• 各ブロックは1辺長さ1.0の立方体要素（ヤング率=1.00，ポ
アソン比=0.30の弾性体）に分割。
• 各ブロックは0.10ずつ離れており，その間が交差するトラス
要素（弾性体）で結合されている。
Algorithm09
23
例題：接触面節点不整合問題
アセンブリ構造を模擬
0.10
Z
Y
Y
X
1.00
X
1.00
• トラス要素のヤング率をブロック部分の103とすることによっ
て接触面における拘束条件を模擬。
• z=0でz方向の変位を固定，x=0及びy=0で対称とし，z=zmax
の面にz方向に一様分布荷重を与えている。
Algorithm09
24
25
Algorithm09
Selective fill-ins
• トラス要素に接続している節点のみに1レベル高いfill-inを適
用する
• 本問題の場合： BILU(1+)
– Block ILU（三次元弾性問題，1節点3自由度）
– BILU(2)：トラス要素に接続する節点
– BILU(1)：それ以外の節点
• 計算量はBILU(1)並みであるが，前処理性能としては
BILU(2)相当が期待される。
26
Algorithm09
Idea of “Selective fill-ins”: ILU(1+)
● 2nd order fill-in’s are
considered for these nodes
● 2nd order fill-in’s are NOT
considered for these nodes
● 2nd order fill-in’s are NOT
considered for these nodes
27
Algorithm09
Selective Overlapping
• 「Selective fill-ins」の考え方を領域間オーバーラップに拡張
する
• 一般の節点に関しては領域間オーバーラップの拡張を「遅ら
せる」
• 領域間オーバーラップ拡張による計算量，通信量の増加を
抑制できる。
28
Algorithm09
Internal Nodes for Partitioning
● Internal Nodes
Domain Boundary
29
Algorithm09
One-Layer Overlapping
(d=0/1)
● Internal Nodes
● External Nodes
■ Overlapped Elements
This is the general configuration
of local data set for parallel
FEM (one-layer of overlapping).
Algorithm09
Extension of Overlapped Zones
● Internal Nodes
(2-layers:
d=2)
● External Nodes
■ Overlapped Elements
30
Algorithm09
Extension of Overlapped Zones
● Internal Nodes
(d=2
and
d=1+)
● External Nodes
■ Overlapped Elements
31
32
Algorithm09
Extension of Overlapped Zones
● Internal Nodes
(d=2
and
d=1+)
● External Nodes
■ Overlapped Elements
Selective Overlapping (d=1+)
“Delayed” extension for elements
which do not include nodes connected
to truss-type elements
33
Algorithm09
Extension of Overlapped Zones
● Internal Nodes
(d=3
and
d=2+)
● External Nodes
■ Overlapped Elements
Selective Overlapping (d=2+)
Reduced cost for computations
and communications
delayed delayed
34
Algorithm09
BILU with selective fill-in/overlapping
• BILU (p)-(d)
– p
– d
level of fill-ins (0, 1, 1+, 2, 2+ …)
depth of overlapping (0, 1, 1+, 2, 2+ …)
35
Algorithm09
1500
Results: 64 cores
3,090,903 DOF, =103, e=10-8
Effect of Overlapping
ITERATIONS
ITERATIONS
● BILU(1)-(d)
■ BILU(1+)-(d)
▲ BILU(2)-(d)
1000
500
0
0
1+
2
2+
3
400
Elapsed Time
2.0E+08
300
1.5E+08
sec.
Off-Diag. Component #
2.5E+08
1
200
1.0E+08
100
5.0E+07
#-non-zero’s of [M]
0.0E+00
0
0
1
1+
2
2+
3
0
1
1+
2
2+
3
36
Algorithm09
Results: 64 cores
3,090,903 DOF, =103, e=10-8
Effect of Overlapping
100
75
200
sec.
ITERATIONS
300
● BILU(1)-(d)
■ BILU(1+)-(d)
▲ BILU(2)-(d)
50
100
25
ITERATIONS
0
Elapsed Time
0
1+
2
2+
3
1+
2
2+
3
Algorithm09
37
HID：Hierarchical Interface
Decomposition 〔Henon & Saad〕
• 階層型領域間境界分割
• 二言くらいで言うと：
– 階層的な領域分割
• Nested Dissection
– 各レベルでは各領域は直接結合しない⇒レベル内並列性
• 計算コスト的には（d=0）と（d=1）の中間くらい：低コスト
– d：オーバーラップ
Algorithm09
38
Parallel ILU for each Connector
at each LEVEL
2
2
2
2,3
3
3
3
0
2
2
2
2,3
3
3
3
1,3
1,3
1,3
2
2
2
0,1
2,3
0,2
0,2
0,2
0,1
2,3
3
3
3
Level-1
1
2
3
0,1
1,3
0
0
0
0,1
2,3
1
1
1
0
0
0
0,1
1
1
1
Level-2
0,2
Level-4
0
0
0
0,1
1
1
• sub-domainの番号を0,1,2,3とする
• 数字は隣接するsub-domainの番号
2,3
0,1,
2,3
1
• レベルの若い順に番号を振りなおす
• 各レベル内で不完全LU（0）分解を並列に実
施可能
Algorithm09
39
各レベルにおける通信の発生
• 「高い」レベルのコネクタ
に属する節点群におけ
る計算
0
Level-1
– 「低い」レベルのコネクタ
の計算結果を利用
• 通信が必要となる
2
3
0,1
• 計算済
– 隣接する「低い」レベル
のコネクタに属する節点
群が他の領域に属して
いる場合
1
1,3
Level-2
2,3
0,2
Level-4
0,1,
2,3
Algorithm09
40
Forward Substitution
do lev= 1, LEVELtot
do i= LEVindex(lev-1)+1, LEVindex(lev)
SW1= WW(3*i-2,R); SW2= WW(3*i-1,R); SW3= WW(3*i ,R)
isL= INL(i-1)+1; ieL= INL(i)
do j= isL, ieL
k= IAL(j)
X1= WW(3*k-2,R); X2= WW(3*k-1,R); X3= WW(3*k ,R)
SW1= SW1 - AL(9*j-8)*X1 - AL(9*j-7)*X2 - AL(9*j-6)*X3
SW2= SW2 - AL(9*j-5)*X1 - AL(9*j-4)*X2 - AL(9*j-3)*X3
SW3= SW3 - AL(9*j-2)*X1 - AL(9*j-1)*X2 - AL(9*j )*X3
enddo
X1= SW1; X2= SW2; X3= SW3
X2= X2 - ALU(9*i-5)*X1
X3= X3 - ALU(9*i-2)*X1 - ALU(9*i-1)*X2
X3= ALU(9*i )* X3
X2= ALU(9*i-4)*( X2 - ALU(9*i-3)*X3 )
X1= ALU(9*i-8)*( X1 - ALU(9*i-6)*X3 - ALU(9*i-7)*X2)
WW(3*i-2,R)= X1; WW(3*i-1,R)= X2; WW(3*i ,R)= X3
余計な通信
enddo
が発生
call SOLVER_SEND_RECV_3_LEV(lev,…):
Communications using
enddo
Hierarchical Comm. Tables.
Algorithm09
41
計算結果（64 cores）
接触問題
■BILU(p)-(0): Block Jacobi
■BILU(p)-(1)
■BILU(p)-(1+)
■BILU(p)-HID
GPBiCG
3,090,903 DOF
1500
350
250
1000
sec.
ITERATIONS
300
500
200
150
100
50
0
0
BILU(1)
BILU(1+)
BILU(2)
BILU(1)
BILU(1+)
BILU(2)
42
Algorithm09
HID vs. Selective Overlapping
• HIDと選択的オーバーラップはほぼ同じ性能だが後者が若
干良い
– 特に条件の悪い問題
– BILUで高いorderのfill-inが必要となるような問題
• オリジナルのHIDはfill-inの
orderが高くなると，領域外
の（同じレベルにある節点
の）fill-inの影響を考慮でき
ない
– ILU(0)としては完全だが・・・
0
Level-1
1
2
3
0,1
0,2
Level-2
2,3
1,3
Level-4
0,1,
2,3
43
Algorithm09
HID vs. Selective Overlapping
• HIDと選択的オーバーラップはほぼ同じ性能だが後者が若
干良い
– 特に条件の悪い問題
– BILUで高いorderのfill-inが必要となるような問題
• オリジナルのHIDはfill-inの
orderが高くなると，領域外
の（同じレベルにある節点
の）fill-inの影響を考慮でき
ない
– ILU(0)としては完全だが・・・
0
Level-1
1
2
3
0,1
0,2
Level-2
2,3
1,3
Level-4
0,1,
2,3
Algorithm09
44
• 並列反復法と領域分割
– 選択的オーバーラップ〔KN 2007〕
– 階層型領域間境界分割（Hierarchical Interface Decomposition）
〔Henon & Saad 2007〕
• 拡張型HID法の提案
– 悪条件向け領域分割手法
• Hybrid並列プログラミングモデル
• HIDと並列多重格子法（時間があれば）
Algorithm09
45
要素がねじれた問題（1/3）
• 3D linear elastic problem with locally distorted elements
z
Uniform Distributed Force in
z-direction @ z=Zmax
Uy=0 @ y=Ymin
Ux=0 @ x=Xmin
(Nz-1) elements
Nz nodes
(Ny-1) elements
Ny nodes
y
Uz=0 @ z=Zmin
x
(Nx-1) elements
Nx nodes
Algorithm09
46
要素がねじれた問題（2/3）
• 3D linear elastic problem with locally distorted elements
• 立方体メッシュ
– Z軸周りに回転
• 局所的な不均質性
– 局所的なねじれの程度を表す
– sequential Gauss algorithm [Deutsch & Journel 1988]

Algorithm09
47
要素がねじれた問題（3/3）
• 3D linear elastic problem with locally distorted elements
Algorithm09
要素がねじれた問題
（不均質分布）
BILU(p,)-(d,a)
48
■BILU(1)-(d,a) GPBiCG
■BILU(1+,120°)-(d,a)
■BILU(1+, 60°)-(d,a)
■BILU(1+, 30°)-(d,a)
■BILU(2)-(d,a)
,a
3,090,903 DOF，64コア
MAX distortion: 150-deg.
700
2000
600
1500
ITERATIONS
sec.
500
400
300
200
1000
500
100
0
0
(0)
(1)
(1+, 120) (1+, 90)
(1+, 60)
Depth of Overlapping
(2)
(0)
(1)
(1+, 120) (1+, 90)
(1+, 60)
Depth of Overlapping
(2)
Algorithm09
要素がねじれた問題
（不均質分布）
BILU(p,)-(d,a)
■BILU(1)-(d,a) GPBiCG
■BILU(1+,120°)-(d,a)
■BILU(1+, 60°)-(d,a)
■BILU(1+, 30°)-(d,a)
■BILU(2)-(d,a)
49
,a
3,090,903 DOF，64コア
MAX distortion: 225-deg.
700
2000
600
1500
ITERATIONS
sec.
500
400
300
200
1000
500
100
0
0
(1+, 90)
(1+, 45)
(2)
(2+, 135) (2+, 90)
Depth of Overlapping
(3)
(1+, 90)
(1+, 45)
(2)
(2+, 135) (2+, 90)
Depth of Overlapping
(3)
50
Algorithm09
HIDの改良を試みる
Extended Version of HID
• オーバーラップ領域の拡張
• レベル間セパレータを「厚く」する
– Thicker Separators
51
Algorithm09
Original Local Data Set
• Original HID
– オーバーラップ深さ＝0また
は1の場合
– 同じレベルにある領域外節
点からのfill-inの寄与を考慮
できない
• BILU(2)で節点Bの効果はAに
おいて考慮できない
2
2
3
3
2
2
3
3
2
B
3
3
A
level-1 ●
level-2 ●
52
Algorithm09
対策1: オーバーラップ領域拡張
• オーバーラップ領域拡張
– 2層のオーバーラップ
– 同じレベルにある領域外節
点からのfill-inの寄与を考慮
できるようになる
2
2
3
3
2
2
3
3
2
B
3
3
A
level-1 ●
level-2 ●
• BILU(2)で節点Bの効果はAに
おいて考慮可能である
– しかし，局所化，ブロック
Jacobiであることには変わ
りない
• Bにおける値は最新では無い
2
2
3
3
2
2
3
3
2
B
3
3
A
53
Algorithm09
対策2: Thicker Separator
• Thicker Separator
– HIDnew
– 同じレベルにある領域外節点
からのfill-inの寄与を考慮でき
るようになる
2
2
3
3
2
2
3
3
2
B
3
3
A
level-1 ●
level-2 ●
• BILU(2)で節点Bの効果はAにお
いて考慮可能である
– 大域的な手法
– 対策1より有効そうに見える
– 負荷分散困難
– 前処理のFill-in深さに応じて
セパレータの厚さが決まる
2
2
3
3
2
2
3
3
2
B
3
3
A
54
Algorithm09
対策2: Thicker Separator
• 関連研究
– Takeshi Iwashita and Masaaki Shimasaki; "Block Red-Black
Ordering: A New Ordering Strategy for Parallelization of ICCG
Method", International Journal of Parallel Programming, Vol. 31,
No. 1, (2003), pp.55-75
• 差分格子
• Red-Blackの2レベルを交互に解く
Algorithm09
55
環境，プログラム概要
• T2Kオープンスパコン（東大）
• MPI + FORTRAN90 （日立コンパイラ）
– Flat MPI
• 要素がねじれた問題
– FEM，Tri-Linear六面体要素
• GPBiCG [Zhang 1997]
• 前処理手法
– Block ILU(2): 2nd order of fill-ins
Algorithm09
56
領域分割について
• BILU (2,d)
– 局所化ブロックJacobi＋オーバーラップ拡張
• BILU(2,2), BILU(2,3)
• BILU (2, HID-d)
– HID＋オーバーラップ拡張
• BILU(2,HID-1), BILU(2,HID-2)
– BILU(2,HID-1) : BILU(2) with original HID
• BILU (2, HIDnew-d)
– HID＋オーバーラップ拡張＋Thicker Separators
• BILU(2,HIDnew-1), BILU(2,HIDnew-2)
• レベル2セパレータを3層
• 負荷分散の工夫は無し
Algorithm09
57
Strategies for Domain Decomposition
HID
HIDnew
2
2
2
2,3
3
3
3
2
2
2
2
2
2,3
3
3
3
2
2
2
2
2
3
3
3
0,2
0,2
0,2
0,2
0,2
1,3
1,3
1,3
0,2
0,2
0
0
0
1
1
1
0,2
0,2
0
0
0
0,1
1
1
1
0
0
0
0
0
0,1
1
1
1
0
0
level-1 ●
level-2 ●
level-4 ○
2,3
2,3
2,3
2,3
0,2
1,3
0,1
0,1
0,1
0,1
level-1 ●
level-2 ●
level-3 ● ● ● ●
level-4 ○
3
3
3
3
1,3
1,3
1,3
1,3
1,3
1,3
1
1
1
1
Algorithm09
58
Type-I: 64 cores, 1003 elements
• HIDはすでに局所化ブロックJacobi（＋オーバーラップ拡
張）より良い
• HIDnewの効果はいま一つ不明
600
1000
BILU(2,2)
BILU(2,3)
BILU(2,HID-1)
BILU(2,HID-2)
BILU(2,HIDnew-1)
BILU(2,HIDnew-2)
400
sec.
Iterations
1500
500
200
0
0
150-deg.
200-deg.
250-deg.
MAX Distortion Angle
BILU(2,2)
BILU(2,3)
BILU(2,HID-1)
BILU(2,HID-2)
BILU(2,HIDnew-1)
BILU(2,HIDnew-2)
150-deg.
200-deg.
MAX Distortion Angle
250-deg.
Algorithm09
59
Type-II: Strong Scaling, 1283 elements
MAX: 200 deg.
BILU(2,HID-1) はコア数が増加すると性能劣化
BILU(2,HID-2) は比較的安定
BILU(2,HIDnew-d)は安定
2000
2000
BILU(2,HID-1)
BILU(2,HID-2)
BILU(2,HIDnew-1)
BILU(2,HIDnew-2)
BILU(2,2)
BILU(2,3)
1500
Iterations
Iterations
1500
1000
1000
500
500
0
0
32
64
128
192
core#
384
512
32
64
128
192
core#
256
384
512
Algorithm09
60
Type-II: Strong Scaling, 1283 elements
MAX: 200 deg., Scalability
400
BILU(2,2)
BILU(2,3)
BILU(2,HID-1)
BILU(2,HID-2)
BILU(2,HIDnew-1)
BILU(2,HIDnew-2)
Speed-Up
300
200
100
0
0
100
200
300
core#
400
500
600
Algorithm09
61
Type-II: Strong Scaling, 1283 elements
MAX: 200 deg, Relative Performance
Normalized by BILU(2,HID-1) at each number of core
3.00
BILU(2,2)
BILU(2,3)
2.00
1.00
0.00
Relative Performance
Relative Performance
3.00
BILU(2,HID-1)
BILU(2,HID-2)
BILU(2,HIDnew-1)
BILU(2,HIDnew-2)
2.00
1.00
0.00
32
64
128
192
core#
384
512
32
64
128
192
core#
256
384
512
Algorithm09
62
並列化の阻害要因
• HID/HIDnewにおける通信
• 512コアにおける負荷分散
– 標準偏差 (s)
• BILU(2,d)
• BILU(2,HID-d)
• BILU(2,HIDnew-d)
40
85
155
289
6000
BILU(2,HID-1)
BILU(2,HID-2)
BILU(2,HIDnew-1)
BILU(2,HIDnew-2)
30
min.
Internal Node #
Additional Communication (%)
– 線形ソルバーの計算時間に対
する割合
20
10
max.
4000
2000
0
0
32
64
128
192
core#
256
384
512
BILU(2,d)
BILU(2,HID-d)
BILU(2,HIDnew-d)
Partitioning Method
Algorithm09
63
まとめ，今後の研究
• 改良型HID
– 領域間オーバーラップ拡張
– Thicker Separator：これは特に有効であった
• 領域外のFill-inの効果を取り入れるための工夫
• 問題点
– 複雑形状への適用
• 特にThicker Separator
• レベル2以外の節点
– 負荷分散
• Intelligent Partitioner
– HIDを複雑形状へ適用する場合には必要
• 様々な問題への適用
– Selective Overlapping vs.(or +) HIDnew
Algorithm09
64
• 並列反復法と領域分割
– 選択的オーバーラップ〔KN 2007〕
– 階層型領域間境界分割（Hierarchical Interface Decomposition）
〔Henon & Saad 2007〕
• 拡張型HID法の提案
– 悪条件向け領域分割手法
• Hybrid並列プログラミングモデル
• HIDと並列多重格子法（時間があれば）
Algorithm09
65
Topics
• 悪条件問題の並列前処理手法
– 有限要素法，疎行列
• T2Kオープンスパコン（東大）
• Flat MPI vs. Hybrid
– 最適化
Algorithm09
66
T2K（東大）（1/2）
• T2Kオープンスパコン仕様
– http://www.open-supercomputer.org/
– 筑波大，東大，京大
• T2Kオープンスパコン（東大）
– Hitachi HA8000クラスタシステム
– 2008年6月～
– 952ノード（15,232コア），
141 TFLOPS peak
• Quad-core Opteron (Barcelona)
– TOP500 27位（NOV 2008）
• （その時点で日本では1位だった）
Algorithm09
67
T2K（東大）（2/2）
• AMD Quad-core Opteron
(Barcelona) 2.3GHz
– 4 “sockets” per node
– 16 cores/node
• マルチコア，マルチソケット
• cc-NUMA（cache coherent
Non-Uniform Memory
Access）
– ローカルメモリ上のデータをで
きるだけ使用する
– 陽的なコマンドラインスイッチ
– NUMA control
L2
L1
Memory
Memory
L3
L3
L2
L1
L2
L1
L2
L1
L2
L1
L2
L1
L2
L1
L2
L1
Core Core Core Core
Core Core Core Core
Core Core Core Core
Core Core Core Core
L1
L2
L1
L2
L1
L2
L1
L2
L1
L2
L1
L2
L1
L2
L3
L3
Memory
Memory
L1
L2
Algorithm09
68
Flat MPI vs. Hybrid
Flat-MPI：Each PE -> Independent
core
core
core
core
core
memory
core
core
memory
memory
core
core
core
core
core
core
core
core
core
memory
core
core
core
core
memory
memory
Hybrid：Hierarchal Structure
core
core
core
core
Algorithm09
69
Flat MPI vs. Hybrid
• 性能は様々なパラメータの組み合わせによって決まる
• ハードウェア
–
–
–
–
–
コア，CPUのアーキテクチュア
ピーク性能
メモリ性能（バンド幅，レイテンシ）
通信性能（バンド幅，レイテンシ）
それらのバランス
• アプリケーション
– 特性：memory bound，communication bound
– 問題サイズ
Algorithm09
70
疎行列ソルバー：FEM，FDM
• Memory-Bound（メモリに負担）
for (i=0; i<N; i++) {
for (k=Index(i-1); k<Index(i); k++{
Y[i]= Y[i] + A [k]*X[Item[k]];
}
}
– 間接参照
– Hybrid (OpenMP) は更に memory-bound
• Latency-Bound（並列計算時）
– 通信は領域境界のみで発生
– メッセージサイズも小さい
• エクサスケールシステム
– コア数：O(108)
– 108-way MPI： MPI Latencyによるオーバー
ヘッドは？
– Hybridへの期待
• とりあえず，MPIプロセスの数を減らせる（T2Kの場
合で1/16）
Algorithm09
71
Flat MPI vs. Hybrid for 疎行列ソルバー
GeoFEM Benchmarks [KN 2003]
• MPI Latencyが効く
• 「地球シミュレータ」：特に顕著
TFLOPS
– 実効性能，通信バンド幅高い，でもlatencyは月並み（6～8msec.）
– ノード数が増えると一般的にHybridが有利
– 特にノードあたり問題規模が小さい場合
4.00
Large
Flat MPI: Large
Flat MPI: Small
● Flat MPI
3.00
Hybrid: Large
● Hybrid
Hybrid: Small
2.00
Small
▲ Flat MPI
▲ Hybrid
1.00
0.00
0
256
512
768
PE#
1024
1280
Algorithm09
72
対象とする問題
• 三次元弾性静解析（固体力学），不均質物性
– Emax=103, Emin=10-3, n=0.25
• 地質統計学的手法によって発生 [Deutsch & Journel, 1998]
– 1283 個の六面体要素, 6,291,456 DOF
• Strong Scaling
• 反復解法：(SGS+CG)
– Symmetric Gauss-Seidel
– HID
• T2K（東大）
– 512 cores (32 nodes)
• FORTARN90 (Hitachi) + MPI
– Flat MPI, Hybrid (4x4, 8x2, 16x1)
Algorithm09
73
前処理付き反復法のSMP/Multicoreでの
OpenMPによる並列化
• DAXPY, SMVP, Dot Products
– 簡単
• 前処理：ILU系分解，前進後退代入
– 大域的な依存性（Global dependency）
– 並び替え（Reordering）による並列性の抽出
• Multicolor Ordering (MC), Reverse-Cuthill-Mckee (RCM)
• 同じ色内の要素は独立⇒並列化可能
– 「地球シミュレータ」向け最適化 [KN 2002,2003]
• 並列及びベクトル性能
– 並列性高く安定なCM-RCMを採用
Algorithm09
74
Ordering Methods
29 61
45
22 46
16 62
11 47
7 63
4 48
2 64
1
29 22 16 11
7
4
2
1
53 36 19
13
37 29
49
30 14
23 30
51
17 15
12 31
53
8 16
55
5 32
3
37 30 23 17 12
8
5
3
7
54 37 20
44 57
41
38 42
31 58
24 43
18 59
13 44
9 60
6
44 38 31 24 18 13
9
6
25
8
9 25
33
50
45 10
39 26
35
32 11
25 27
37
19 12
14 28
39
10
50 45 39 32 25 19 14 10
43 26
5
52
55 53
37
51 38
46 54
40 39
33 55
26 40
20 56
15
55 51 46 40 33 26 20 15
61 44 27 10 57 40 23
6
5 21
6 22
7 23
8 24
17
59
56 19
52
47 21
41
34 23
27
21
59 56 52 47 41 34 27 21
14 62 45 28 11 58 41 24
60 34
57 50
53 35
48 51
42 36
35 52
28
62 49
33
62 60 57 53 48 42 35 28
31 15 63 46 29 12 59 42
1 17
2
3
4
64
63 61
3 18
58 54
5 19
49 43
7 20
36
64 63 61 58 54 49 43 36
48 32 16 64 47 30 13 60
MC (Color#=4)
Multicoloring
RCM
Reverse Cuthill-Mckee
CM-RCM (Color#=4)
Cyclic MC + RCM
2
49 33 17
3
55 38 21
9
1
50 34 18
4
56 39 22
51 35
Algorithm09
75
Effect of Ordering Methods on
Convergence
90
▲ MC
● CM-RCM
Iterations
80
70
60
50
1
10
100
color #
1000
Algorithm09
76
CM-RCMによる並べ替え（reordering）
5 colors, 8 threads
Initial Vector
Coloring
(5 colors)
+Ordering
color=1
color=1
color=2
color=2
color=3
color=4
color=3
color=4
color=5
color=5
12345678 12345678 12345678 12345678 12345678
同じ「色」に属する要素は独立，したがって並列計算可能
⇒ スレッド並列化（OpenMP等）が可能
Algorithm09
77
Flat MPI, Hybrid (4x4, 8x2, 16x1)
Flat MPI
0
1
2
3
Hybrid
4x4
0
1
2
3
Hybrid
8x2
0
1
2
3
Hybrid
16x1
0
1
2
3
Algorithm09
78
実施ケース
• CASE-1
– Initial Case (CM-RCM)
– 主としてNUMA controlの影響を見る
• CASE-2 (Hybrid only)
– First-Touch
• CASE-3 (Hybrid only)
– Further Data Reordering + First-Touch
• 各ケースにおいてNUMA Control （Policy 0~5）を適用
Algorithm09
79
結果：CASE-1, 32 nodes/512cores
線形ソルバーの性能
1.50
Command line switches
0
no command line switches
1
--cpunodebind=$SOCKET
--interleave=all
2
--cpunodebind=$SOCKET
--interleave=$SOCKET
3
--cpunodebind=$SOCKET
--membind=$SOCKET
policy 0
Relative Performance
Policy
ID
Normalized by
Flat MPI (Policy 0)
best (policy 2)
1.00
0.50
0.00
4
--cpunodebind=$SOCKET
--localalloc
5
--localalloc
e.g. mpirun –np 64 –cpunodebind 0,1,2,3 a.out
Flat MPI
HB 4x4
HB 8x2
HB 16x1
Parallel Programming Models
Method
Iterations
Best Policy
CASE-1
Flat MPI
1264
2
HB 4x4
1261
2
HB 8x2
1216
2
HB 16x1
1244
2
Algorithm09
80
First Touch Data Placement
配列のメモリ・ページ：
最初にtouchしたコアのローカルメモリ上に確保
計算と同じ順番で初期化
do lev= 1, LEVELtot
do ic= 1, COLORtot(lev)
!$omp parallel do private(ip,i,j,isL,ieL,isU,ieU)
do ip= 1, PEsmpTOT
do i = STACKmc(ip,ic-1,lev)+1, STACKmc(ip,ic,lev)
RHS(i)= 0.d0; X(i)= 0.d0; D(i)= 0.d0
isL= indexL(i-1)+1
ieL= indexL(i)
do j= isL, ieL
itemL(j)= 0; AL(j)= 0.d0
enddo
isU= indexU(i-1)+1
ieU= indexU(i)
do j= isU, ieU
itemU(j)= 0; AU(j)= 0.d0
enddo
enddo
enddo
!$omp omp end parallel do
enddo
enddo
Algorithm09
81
各スレッド上でメモリアクセスが連続となる
ように更なる並び替え
5 colors, 8 threads
Initial Vector
Coloring
(5 colors)
+Ordering
color=1
color=2
color=3
color=4
color=5
各スレッド上で不連
color=1
color=2
color=3
color=4
color=5
続なメモリアクセス
（色の順に番号付け）1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
11111 22222 33333 44444 55555 66666 77777 88888
スレッド内で連続に番号付け
Algorithm09
82
性能の改善： CASE-1 ⇒ CASE-3
Flat MPIのベストケースで無次元化
32nodes, 512cores
196,608 DOF/node
Relative Performance
1.50
Initial
CASE-1
CASE-2
CASE-3
1.00
0.50
0.00
Flat MPI
HB 4x4
HB 8x2
HB 16x1
Parallel Programming Models
CASE-1: NUMA control
CASE-2: + F.T.
CASE-3: + Further Reordering
Algorithm09
83
性能の改善： CASE-1 ⇒ CASE-3
Flat MPIのベストケースで無次元化
32nodes, 512cores
196,608 DOF/node
8nodes, 128cores
786,432 DOF/node
1.50
Initial
CASE-1
CASE-2
CASE-3
1.00
0.50
0.00
Relative Performance
Relative Performance
1.50
Initial
CASE-1
CASE-2
CASE-3
1.00
0.50
0.00
Flat MPI
HB 4x4
HB 8x2
HB 16x1
Parallel Programming Models
Flat MPI
HB 4x4
HB 8x2
HB 16x1
Parallel Programming Models
Algorithm09
84
各コア数における相対性能
32~512 cores
各コア数におけるFlat MPIのベストケースの性能で無次元化
Hybrid は「コア数が多く」，「コアあたりの問題規模が小さい」
場合に有効
Relative Performance
1.25
1.00
0.75
0.50
HB 4x4
HB 8x2
HB 16x1
0.25
0.00
32
64
128
192
core#
256
384
512
Algorithm09
85
まとめ
• HIDによる並列前処理手法のT2K（東大）での実装
– Hybrid/Flat MPI, CM-RCM reordering
• Hybid 4x4 はFlat MPIと同じ，あるいは少し良い
• （並び替え＋F.T.）によるデータ局所化，メモリアクセス連続
性確保によりHybrid 8x2，16x1の性能が大幅に改善
• Hybrid は「コア数が多く」，「コアあたりの問題規模が小さい」
場合に特に有効
– T2Kに適した並列プログラミングモデルである
– ノード数がもっと増えるとHB 16x1の優位性が高まるかも知れない
• 今後の仕事
– 高レベルのfill-inの考慮：BILU(p)
– ノード内オーダリング手法
– 性能評価モデル：Hybridについては無い
Algorithm09
86
• 並列反復法と領域分割
– 選択的オーバーラップ〔KN 2007〕
– 階層型領域間境界分割（Hierarchical Interface Decomposition）
〔Henon & Saad 2007〕
• 拡張型HID法の提案
– 悪条件向け領域分割手法
• Hybrid並列プログラミングモデル
• HIDと並列多重格子法（時間があれば）
Algorithm09
87
動機
• 階層型領域間境界分割（Hierarchical Interface
Decomposition, HID）〔Henon & Saad, 2007〕の並列多
重格子法への適用
– SmootherがGauss-Seidel，IC/ILUの場合でも反復回数の増大
を抑制できるのではないか？
88
解析対象
• 透水係数が空間的に分布する三次元地下水流れ
– ポアソン方程式
– 透水係数は地質統計学的手法によって決定〔Deutsch & Journel,
1998〕
• 規則正しい立方体ボクセルメッシュを使用した有限体積法
Algorithm09
89
支配方程式：ダルシー流れ



u   , v   , w  
x
y
z
：透水係数
u v w            
 
             q
x y z x  x  y  y  z  z 
  0 @ z  zmax
Algorithm09
90
不均質場における地下水流れ
Homogeneous
Uniform
Flow Field
Heterogeneous
Random
Flow Field
Algorithm09
91
速度：
モード：
位置：
背景色：
透明度：
Algorithm09
92
計算手法
• 前処理付きCG法
• Gauss-Seidelに基づく多重格子法
• 多重格子法
– 8個のfine meshからcoarse meshを生成
– V-cycle
• 各領域のメッシュ数が「1」となるまで計算⇒最後は1プロセスで計算
Algorithm09
93
並列MG：局所データ構造
Fine Level
Coarse Level
Internal Meshes
External Meshes
Algorithm09
94
HID for 並列MG：局所データ構造
領域分割例
16×16メッシュ⇒4×4領域，数字はHIDのレベル
2
2
2
4
2
2
2
4
2
2
2
4
2
2
2
4
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
2
2
2
4
2
2
2
4
2
2
2
4
2
2
2
4
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
2
2
2
4
2
2
2
4
2
2
2
4
2
2
2
4
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
2
2
2
4
2
2
2
4
2
2
2
4
2
2
2
4
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
4
2
2
2
2
1
1
1
2
1
1
1
2
1
1
1
4
2
2
2
2
1
1
1
2
1
1
1
2
1
1
1
4
2
2
2
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
4
2
2
2
2
1
1
1
2
1
1
1
2
1
1
1
4
2
2
2
2
1
1
1
2
1
1
1
2
1
1
1
4
2
2
2
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
4
2
2
2
2
1
1
1
2
1
1
1
2
1
1
1
4
2
2
2
2
1
1
1
2
1
1
1
2
1
1
1
4
2
2
2
2
1
1
1
2
1
1
1
2
1
1
1
2
1
1
1
Algorithm09
95
HID for 並列MG：局所データ構造
Fine Level
4
2
2
2
1
2
1
1
1
2
Coarse Level
1
2
1
1
1
2
1
2
1
1
1
2
2
4
2
2
2
4
2
1
1
1
1
2
4
2
4
1
2
1
2
1
2
4
Algorithm09
96
領域間通信：Gauss-Seidel
Original
HID
並替あり
OLD
NEW
MID
do iterC= 1, ITERtotC
do icel= levMGcelINDEX(lev-1)+1,levMGcelINDEX(lev)
RF= Bmg(icel)
do j= 1, INLmg(icel)
RF= RF - ALmg(j,icel)*Xmg(IALmg(j,icel))
enddo
Xmg(icel)= RF*DDmg(icel)
enddo
call SOLVER-SEND-RECV : MPI-ISEND/IRECV
enddo
do iterC= 1, ITERtotC
do levh= 1, levHIDtot
do i0= hLEVEL_index(levh-1,lev)+1,
hLEVEL_index(levh,lev)
icel= MIDtoNEW(i0)
RF= Bmg(icel)
do j= 1, INLmg(icel)
RF= RF - ALmg(j,icel)*Xmg(IALmg(j,icel))
enddo
Xmg(icel)= RF*DDmg(icel)
enddo
call SOLVER-SEND-RECV-HID-LEV : MPI-ISEND/IRECV
enddo
enddo
Algorithm09
97
計算結果：IBM SP3/SR11k
256コアまで，MGCG
IBM SP3
Hitachi SR11000/J2
40
300
30
sec.
sec.
200
20
100
10
0
0
0.E+00
2.E+07
4.E+07
6.E+07
8.E+07
0.E+00
2.E+07
4.E+07
6.E+07
8.E+07
DOF
DOF
（max/min=10+6）
▲ MGCG/ORG
△ MGCG/HID
（max/min=10+8）
■ MGCG/ORG
□ MGCG/HID
Algorithm09
98
計算結果：IBM SP3
256コアまで，MGCG，SmootherとしてIC(0)の方が良い
しかも，ORGとHIDの差が無い
Gauss Seidel 4/4
IC(0) 2/2
300
200
200
sec.
sec.
300
100
100
0
0
0.E+00
2.E+07
4.E+07
6.E+07
8.E+07
0.E+00
2.E+07
DOF
（max/min=10+6）
▲ MGCG/ORG
△ MGCG/HID
4.E+07
6.E+07
8.E+07
DOF
（max/min=10+8）
■ MGCG/ORG
□ MGCG/HID
Algorithm09
Algorithm09
99
まとめ
• Smootherの選択
– 現状ではIC(0)/ILU(0)が良い
– ILUT，ILU(p)
• FraunhoferのAMGライブラリ
• HIDの効果
– GSでは顕著：特に悪条件問題
– IC(0)ではほとんど効果無し
• 反復回数は減っているが通信オーバーヘッドのため，計算時間はほとんど
変わらない
• より悪条件の問題に使えるかも知れない
• Multigrid
– Coarse Grid：グローバルな情報をある程度考慮できる（？）
100
Algorithm09
おわりに
Science
Modeling
Algorithm
Software
Hardware
• 問題の特性を利用した前処理，
領域分割
• アプリケーションの特徴を反映し
たベンチマーク
– マトリクスより更にSMASH寄り
• 様々な技術的課題
– HID（Hierarchical Interface
Decomposition）
• 連成した非圧縮性NS（拘束条件付き，
非正定）
– ノード内並列化手法
– コア単位の最適化

並列前処理手法と領域分割， マルチコア時代の戦略

Transcript 並列前処理手法と領域分割， マルチコア時代の戦略

Directory

並列前処理手法と領域分割，マルチコア時代の戦略

Transcript 並列前処理手法と領域分割，マルチコア時代の戦略