Parallel Computation for SDPs Focusing on the Sparsity of


This talk is supported by Ewha University
High Performance Solvers for Semidefinite Programs
Makoto Yamashita @ Tokyo Tech
Katsuki Fujisawa @ Chuo Univ
Mituhiro Fukuda @ Tokyo Tech
Kazuhiro Kobayashi @ NMRI
Kazuhide Nakata @ Tokyo Tech
Maho Nakata @ RIKEN
KSIAM Annual Meeting @ Jeju 2011/11/25
(2011/11/25-2011/11/26)
Our interests & SDPA Family
- How fast can we solve SDPs?
- How large an SDP can we solve?
- How accurately can we solve SDPs?
SDPA family:
- SDPA: base solver
- SDPARA: parallel
- SDPA-M: Matlab
- SDPA-C: structural sparsity
- SDPA-GMP: multiple precision
- SDPARA-C: parallel, structural sparsity
SDPA Homepage http://sdpa.sf.net/
KSIAM 2011 @ Jeju
SDPA Online Solver
http://sdpa.sf.net/ ⇒ Online Solver
1. Log in to the online solver
2. Upload your problem
3. Push the 'Execute' button
4. Receive the result via Web/Mail
Outline
1. SDP Applications
2. Primal-Dual Interior-Point Methods
3. Inside of SDPARA (Large & Fast)
4. Inside of SDPA-GMP (Accurate)
5. Conclusion
SDP Applications




Control Theory
Quantum Chemistry
Sensor Network Localization Problem
Polynomial Optimization
SDP Applications
1. Control theory
- We want to keep the system stable against swing.
- Stability condition ⇒ Lyapunov condition ⇒ SDP
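To make the chain "stability condition ⇒ Lyapunov condition ⇒ SDP" concrete, here is a sketch for the simplest case. It assumes a linear system $\dot{x} = Ax$; the symbols $A$ and $P$ do not appear in the slides and are introduced only for illustration.

```latex
\dot{x}(t) = A x(t) \ \text{is asymptotically stable}
\iff
\exists\, P \in S^n : \quad P \succ O, \quad A^T P + P A \prec O .
```

Both conditions are linear matrix inequalities in the unknown $P$, so checking them is an SDP feasibility problem.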
INFORMS 2011 @ Charlotte
SDP Applications
2. Quantum Chemistry
- Ground state energy
- Locate electrons
- Schrödinger equation ⇒ Reduced density matrix ⇒ SDP
SDP Applications
3. Sensor Network Localization
- Distance information ⇒ Sensor locations
- Protein structure
SDP Applications
4. Polynomial Optimization
- For example,
  min: polynomial s.t. polynomial constraints
  $\min\ f(x) = \sum_{i=1}^{n-1} \left[ (1 - x_i)^2 + 100\,(x_{i+1} - x_i^2)^2 \right], \quad x \in R^n$
- NP-hard in general
- Very good lower bound by SDP relaxation method
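The example objective is the generalized Rosenbrock function. A minimal sketch of evaluating it, with the function name `rosenbrock` and the test points chosen here purely for illustration:

```python
def rosenbrock(x):
    """Generalized Rosenbrock function:
    sum over i = 1..n-1 of (1 - x_i)^2 + 100 * (x_{i+1} - x_i^2)^2."""
    return sum(
        (1 - x[i]) ** 2 + 100 * (x[i + 1] - x[i] ** 2) ** 2
        for i in range(len(x) - 1)
    )

# Every term is a square, so f(x) >= 0; the all-ones point attains 0.
value_at_ones = rosenbrock([1.0] * 5)
value_at_zeros = rosenbrock([0.0] * 5)
print(value_at_ones, value_at_zeros)
```

Because every term is a square, 0 is a valid lower bound, and it is attained at x = (1, ..., 1). An SDP relaxation aims to certify lower bounds of this kind for polynomials where they are not visible by inspection.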
SDP Applications




Control Theory
Quantum Chemistry
Polynomial Optimization
Sensor Network Localization Problem
Many applications ⇒ How large & how fast & how accurate
Standard form
(P)  $\min\ C \bullet X$  s.t.  $A_k \bullet X = b_k\ (k = 1, \ldots, m),\ X \succeq O$
(D)  $\max\ \sum_{k=1}^{m} b_k z_k$  s.t.  $\sum_{k=1}^{m} A_k z_k + Y = C,\ Y \succeq O$
- The variables are $(X, Y, z) \in S^n \times S^n \times R^m$
- Inner product is $X \bullet Y = \sum_{i,j=1}^{n} X_{ij} Y_{ij}$
- The size is roughly determined by $m$ (the number of equality constraints in (P)) and $n$ (the size of $X$ and $Y$)
- Ordinary solvers: $m \le 10,000$; our target: $m \ge 30,000$
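As a small illustration of the standard form, the sketch below builds a toy primal-dual pair with m = 1 and n = 2 and checks feasibility and equal objective values. The data `C`, `A1`, `b1` are made up for this example, and the solution is obtained in closed form from an eigenvalue decomposition (valid for this particular constraint $I \bullet X = 1$), not by an interior-point solver.

```python
import numpy as np

def inner(U, V):
    """Inner product U . V = sum_ij U_ij V_ij."""
    return float(np.sum(U * V))

# Toy data (assumed for illustration): min C.X  s.t.  I.X = 1, X >= O
C = np.array([[2.0, 1.0], [1.0, 2.0]])
A1 = np.eye(2)
b1 = 1.0

# For this constraint the optimum is the smallest eigenvalue of C.
eigenvalues, eigenvectors = np.linalg.eigh(C)  # ascending order
lam_min = eigenvalues[0]
v = eigenvectors[:, 0]

X = np.outer(v, v)   # primal solution: rank-one, trace 1
z1 = lam_min         # dual solution
Y = C - z1 * A1      # dual slack, must be positive semidefinite

primal_value = inner(C, X)
dual_value = b1 * z1
complementarity = inner(X, Y)
print(primal_value, dual_value, complementarity)
```

The primal and dual values coincide and $X \bullet Y = 0$, which is the optimality condition the interior-point method approaches in the limit.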
Primal-Dual Interior-Point Methods
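[Figure: the central path inside the feasible region. Starting from $(X^0, Y^0, z^0)$, each iteration moves along a search direction $(dX, dY, dz)$ toward a target point on the central path, producing $(X^1, Y^1, z^1)$, $(X^2, Y^2, z^2)$, ... and approaching the optimal solution $(X^*, Y^*, z^*)$. Here $(X, Y, z) \in S^n \times S^n \times R^m$.]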
Schur Complement Matrix

Schur Complement Equation
$B\,dz = r$
$dY = D - \sum_{j=1}^{m} A_j\,dz_j$
$\widehat{dX} = R - X\,dY\,Y^{-1}, \quad dX = (\widehat{dX} + \widehat{dX}^T)/2$
where $B_{ij} = (X A_i Y^{-1}) \bullet A_j$ is the Schur Complement Matrix (SCM)
1. ELEMENTS (Evaluation of SCM)
2. CHOLESKY (Cholesky factorization of SCM)
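The ELEMENTS step can be sketched directly from the formula $B_{ij} = (X A_i Y^{-1}) \bullet A_j$. This is a dense, unoptimized illustration with random test data; the function name `schur_complement_matrix` and the helpers are hypothetical, and it does not reflect how SDPA exploits sparsity.

```python
import numpy as np

def schur_complement_matrix(X, Y, A_list):
    """Dense evaluation of B_ij = (X A_i Y^{-1}) . A_j."""
    Y_inv = np.linalg.inv(Y)
    m = len(A_list)
    B = np.empty((m, m))
    for i, A_i in enumerate(A_list):
        G = X @ A_i @ Y_inv            # shared by every element of row i
        for j, A_j in enumerate(A_list):
            B[i, j] = np.sum(G * A_j)  # inner product G . A_j
    return B

rng = np.random.default_rng(0)
n, m = 4, 3

def random_spd(size):
    M = rng.standard_normal((size, size))
    return M @ M.T + size * np.eye(size)

def random_sym(size):
    M = rng.standard_normal((size, size))
    return (M + M.T) / 2

X, Y = random_spd(n), random_spd(n)
A_list = [random_sym(n) for _ in range(m)]
B = schur_complement_matrix(X, Y, A_list)
print(B)
```

Row i depends only on `X`, `Y_inv`, and the input matrices, so different rows can be computed without exchanging data. For positive definite `X`, `Y` and linearly independent `A_i`, the result is symmetric positive definite, which is why a Cholesky factorization can be applied to it.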
Computation time on single processor
Time unit is second, SDPA 7, Xeon X5460 (3.16GHz)
              Control     POP
ELEMENTS        22228     668    (parallelized by row-wise distribution)
CHOLESKY         1593    1992    (parallelized by two-dimensional block-cyclic distribution)
Total           23986    2713
- ELEMENTS and CHOLESKY together take more than 95% of the total time
- SDPARA replaces these bottlenecks by parallel computation
Row-wise distribution
$B_{ij} = (X A_i Y^{-1}) \bullet A_j$
- All rows are independent
- Assign processors in a cyclic manner
- Simple idea ⇒ Very EFFICIENT
- High scalability
[Figure: example with $B \in S^{8 \times 8}$. Rows 1 to 8 of $B$ are assigned to Processor1, Processor2, Processor3, Processor4, Processor1, Processor2, Processor3, Processor4.]
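The cyclic assignment in the 8-row, 4-processor example can be sketched as a round-robin mapping. The function name `assign_rows_cyclic` is hypothetical and used only for this illustration.

```python
def assign_rows_cyclic(num_rows, num_processors):
    """Map each row (1-based) to a processor (1-based) in round-robin order."""
    return {
        row: (row - 1) % num_processors + 1
        for row in range(1, num_rows + 1)
    }

assignment = assign_rows_cyclic(num_rows=8, num_processors=4)
rows_of_processor_1 = [row for row, proc in assignment.items() if proc == 1]

for row, proc in assignment.items():
    print(f"row {row} -> Processor{proc}")
```

Each processor evaluates only its own rows, so no communication is needed during the evaluation itself.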
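Block Algorithm for Cholesky factorization
- Triangular factorization $B = U^T U$ ($U$: upper triangular matrix)
$$\begin{pmatrix} B_{11} & B_{12} \\ B_{12}^T & B_{22} \end{pmatrix} = \begin{pmatrix} U_{11} & U_{12} \\ O & U_{22} \end{pmatrix}^T \begin{pmatrix} U_{11} & U_{12} \\ O & U_{22} \end{pmatrix} = \begin{pmatrix} U_{11}^T U_{11} & U_{11}^T U_{12} \\ U_{12}^T U_{11} & U_{12}^T U_{12} + U_{22}^T U_{22} \end{pmatrix}$$
$B_{11} \in S^p,\ B_{22} \in S^{m-p}$ (e.g. $p = 4$)
1. $B_{11} = U_{11}^T U_{11}$ (small Cholesky factorization)
2. $U_{12} = (U_{11}^T)^{-1} B_{12}$ (block update)
3. $B_{22} \leftarrow B_{22} - U_{12}^T U_{12}$ (block update)
The block updates are the part handled by parallel computing.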
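The three steps can be sketched as a serial, right-looking blocked factorization. This is an illustration under assumptions: the function name `block_cholesky_upper` is hypothetical, the test matrix is random, and `np.linalg.solve` (a general solver) stands in for a dedicated triangular solve to keep the code short.

```python
import numpy as np

def block_cholesky_upper(B, p=4):
    """Return upper triangular U with B = U^T U, processing p columns at a time."""
    B = np.array(B, dtype=float)  # work on a copy
    m = B.shape[0]
    U = np.zeros_like(B)
    for s in range(0, m, p):
        e = min(s + p, m)
        # Step 1: small Cholesky factorization B11 = U11^T U11
        U11 = np.linalg.cholesky(B[s:e, s:e]).T
        U[s:e, s:e] = U11
        if e < m:
            # Step 2: U12 = (U11^T)^{-1} B12
            U12 = np.linalg.solve(U11.T, B[s:e, e:])
            U[s:e, e:] = U12
            # Step 3: B22 <- B22 - U12^T U12
            B[e:, e:] -= U12.T @ U12
    return U

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
B = M @ M.T + 8 * np.eye(8)   # symmetric positive definite test matrix
U = block_cholesky_upper(B, p=4)
print(np.allclose(U.T @ U, B))
```

Steps 2 and 3 are matrix-matrix operations on large blocks, which is where most of the work lies and what a distributed implementation spreads over processors.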
Two-dimensional block-cyclic distribution
- ScaLAPACK library
- Moving from the row-wise distribution to the two-dimensional block-cyclic distribution (TDBCD) requires network communication
- Cholesky on TDBCD is much faster than Cholesky on the row-wise distribution
Example ($B \in S^{8 \times 8}$, 4 processors; each entry is the number of the processor that stores that element):
1 1 2 2 1 1 2 2
1 1 2 2 1 1 2 2
3 3 4 4 3 3 4 4
3 3 4 4 3 3 4 4
1 1 2 2 1 1 2 2
1 1 2 2 1 1 2 2
3 3 4 4 3 3 4 4
3 3 4 4 3 3 4 4
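The ownership pattern of the 8 x 8 example follows from a simple index rule. The sketch below assumes 2 x 2 blocks on a 2 x 2 processor grid, which is what the example layout suggests; the function name `owner` and its parameters are hypothetical, not ScaLAPACK calls.

```python
def owner(i, j, block=2, grid_rows=2, grid_cols=2):
    """Processor (1-based) that stores element (i, j), with 0-based indices."""
    proc_row = (i // block) % grid_rows
    proc_col = (j // block) % grid_cols
    return proc_row * grid_cols + proc_col + 1

layout = [[owner(i, j) for j in range(8)] for i in range(8)]
for row in layout:
    print(" ".join(str(p) for p in row))
```

Every processor ends up with the same number of elements spread over the whole matrix, which keeps the load balanced as the factorization moves toward the bottom-right corner.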
Numerical Results of SDPARA
Quantum Chemistry (m=7230, SCM=100%), middle size
SDPARA 7.3.1, Xeon X5460, 3.16GHz x2, 48GB memory
[Figure: computation time in seconds (log scale) against the number of servers]
Servers          1       4      16
ELEMENTS     28678    7192    1826
CHOLESKY       548     131      47
Total        29700    7764    2294
- ELEMENTS 15x speedup
- CHOLESKY 12x speedup
- Total 13x speedup
Very FAST!!
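The quoted speedups can be checked with a few lines of arithmetic. The timings below are the chart labels from this slide, assuming they pair up with the series and server counts as listed; the efficiency column (speedup divided by the number of servers) is an extra quantity not shown on the slide.

```python
# Seconds on 1, 4 and 16 servers (assumed pairing of the chart labels).
timings = {
    "ELEMENTS": {1: 28678, 4: 7192, 16: 1826},
    "CHOLESKY": {1: 548, 4: 131, 16: 47},
    "Total": {1: 29700, 4: 7764, 16: 2294},
}

def speedup(times, servers):
    """Time on one server divided by time on `servers` servers."""
    return times[1] / times[servers]

report = {
    name: (speedup(times, 16), speedup(times, 16) / 16)
    for name, times in timings.items()
}
for name, (s, efficiency) in report.items():
    print(f"{name}: speedup {s:.1f}x, efficiency {efficiency:.0%}")
```

Under this pairing the ratios land close to the 15x, 12x and 13x quoted on the slide.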
Acceleration by Multiple Threading
- Modern processors have multiple cores
- Multiple threading is becoming common
- Two-level parallel computing: 2 processors x 2 threads on each processor
[Figure: rows of $B$ assigned in turn to Processor1:Thread1, Processor2:Thread1, Processor1:Thread2, Processor2:Thread2, Processor1:Thread1, Processor2:Thread1, Processor1:Thread2, Processor2:Thread2.]
Comparison with PCSDP
- PCSDP: developed by Ivanov & de Klerk
SDP: B.2P Quantum Chemistry (m = 7230, SCM = 100%)
Xeon X5460, 3.16GHz x2 (8 cores), 48GB memory
Time unit is second
Servers         1        2        4        8       16
PCSDP      53,768   27,854   14,273    7,995    4,050
SDPARA      5,983    3,002    1,680      901      565
SDPARA is 8x faster by MPI & Multi-Threading (two-level parallelization)
Extremely Large-Scale SDPs
- Other solvers can handle only $m \le 30,000$
Problem          m         SCM     Time
Esc32_b (QAP)    198,432   100%    129,186 seconds (1.5 days)
- 16 servers [Xeon X5670 (2.93GHz), 128GB memory]
The LARGEST solved SDP in the world
Numerical Accuracy
- One weak point of PDIPM
- $X^* Y^* = O$ for an optimal solution $(X^*, Y^*, z^*)$, and $\lim_{k \to \infty} (X^k, Y^k) = (X^*, Y^*)$
- PDIPM requires $(X^k)^{-1}$ and $(Y^k)^{-1}$, for example in $B_{ij} = (X A_i Y^{-1}) \bullet A_j$
- Eventually, numerical trouble (often, Cholesky fails)
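A tiny diagonal example shows why the inverses become troublesome. It assumes 2 x 2 diagonal iterates with $XY = \mu I$ (a point on the central path); the matrices and the parameter `mu` are made up for illustration and only the diagonals are stored.

```python
def central_path_point(mu):
    """Diagonals of X and Y for a 2x2 example satisfying X Y = mu * I."""
    X = (1.0, mu)
    Y = (mu, 1.0)
    return X, Y

for mu in (1e-2, 1e-8, 1e-16):
    X, Y = central_path_point(mu)
    Y_inv = tuple(1.0 / y for y in Y)
    cond_Y = max(Y) / min(Y)  # condition number of a positive diagonal matrix
    print(f"mu={mu:.0e}  largest entry of Y^-1 = {max(Y_inv):.1e}  cond(Y) = {cond_Y:.1e}")

X_last, Y_last = central_path_point(1e-16)
largest_inverse_entry = max(1.0 / y for y in Y_last)
```

As `mu` goes to 0 the product $XY$ tends to $O$, both matrices approach singular limits, and the entries of the inverse grow like 1/mu. At `mu = 1e-16` the condition number already matches the roughly 16 digits that double precision carries.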
Numerical Precision
- Ordinary double precision in C or C++
  64 bits = 1 bit (sign, a) + 11 bits (exponent, b) + 52 bits (fraction, c)
  value $= (-1)^a \times 2^b \times (1 + c)$, accuracy $\approx 10^{-16}$
- Arbitrary precision in GMP library
  We can set the number of bits of the fraction part arbitrarily (for example, 200 bits $\approx 10^{-60}$)
- SDPA-GMP: replace BLAS (Basic Linear Algebra Subprograms) by MPLAPACK (Multiple precision LAPACK)
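The difference between double and higher precision can be seen in a few lines. This sketch uses Python's standard `decimal` module as a stand-in for GMP, which is an assumption made for convenience; SDPA-GMP itself relies on GMP and a multiple-precision linear algebra library.

```python
import sys
from decimal import Decimal, getcontext

# Double precision: about 16 significant decimal digits.
double_epsilon = sys.float_info.epsilon
double_result = (1.0 + 1e-30) - 1.0   # the small term is lost

# 60 significant decimal digits, roughly comparable to a 200-bit fraction.
getcontext().prec = 60
decimal_result = (Decimal(1) + Decimal("1e-30")) - Decimal(1)

# Size of the last bit of a 200-bit fraction.
bits_200_accuracy = Decimal(2) ** -200

print(double_epsilon, double_result)
print(decimal_result, bits_200_accuracy)
```

The double-precision result is exactly 0, while the 60-digit computation keeps the small term. The price is software arithmetic, which is far slower than hardware floating point.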
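Numerically Hard problem
- Test problem: $\min\ C \bullet X$ s.t. $ee^T \bullet X = \alpha,\ e_i e_i^T \bullet X = 1\ (i = 1, \ldots, n),\ X \succeq O$
- PDIPM is stable if Slater's condition holds: $\exists X \succ O$ s.t. $A_k \bullet X = b_k\ (k = 1, \ldots, m)$
- Graph Partition Problem ($\alpha = 0$) has no interior: $\{X : ee^T \bullet X = 0,\ e_i e_i^T \bullet X = 1,\ X \succ O\} = \emptyset$, because $ee^T \bullet X = e^T X e > 0$ whenever $X \succ O$
- Small $\alpha$ ⇒ Numerically Hard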
Numerical Results of SDPA-GMP
- Small $\alpha$ ⇒ Numerically Hard
$\alpha$     Solver      Accuracy    Time (second)
1.0e-1       SDPA        1.08e-8          2.03
1.0e-1       SDPA-GMP    4.80e-48     77760.19
1.0e-15      SDPA        1.63e-7          2.26
1.0e-15      SDPA-GMP    2.97e-48     82115.52
0            SDPA        5.26e-9          2.36
0            SDPA-GMP    7.29e-24    105325.74
- 24 digits even for the no-interior case
- SDPA-GMP uses 300 digits
Conclusion
- SDPARA ⇒ How Fast & How Large: 100 times & $m \approx 200,000$
- SDPA-GMP ⇒ How Accurate: $10^{-48}$
- http://sdpa.sf.net/ & Online solver
Thank you very much for your attention.