Designing Polylibraries to Speed Up Linear Algebra


Advances in the Optimization of Parallel Routines (II)
Domingo Giménez
Departamento de Informática y Sistemas
Universidad de Murcia, Spain
dis.um.es/~domingo
18 July 2015, Universidad Politécnica de Valencia
Outline

- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries’ hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Hybrid programming
- Peer to peer computing
Polylibraries

Different basic libraries can be available:
- Reference BLAS, machine-specific BLAS, ATLAS, …
- MPICH, machine-specific MPI, PVM, …
- Reference LAPACK, machine-specific LAPACK, …
- ScaLAPACK, PLAPACK, …
The idea is to use a number of these different basic libraries to develop a polylibrary.
Polylibraries

Typical parallel linear algebra libraries hierarchy:

[Diagram: ScaLAPACK at the top, over LAPACK and PBLAS; LAPACK over BLAS; PBLAS over BLAS and BLACS; BLACS over MPI, PVM, …]
Polylibraries

A possible parallel linear algebra polylibrary hierarchy:

[Diagram: the same hierarchy, with the BLAS level replaced by alternatives: reference BLAS, machine-specific BLAS, ATLAS.]
Polylibraries

A possible parallel linear algebra polylibrary hierarchy:

[Diagram: as before, with the message-passing level also replaced by alternatives: machine-specific MPI, MPICH, LAM, PVM.]
Polylibraries

A possible parallel linear algebra polylibrary hierarchy:

[Diagram: as before, with the LAPACK level also replaced by alternatives: reference LAPACK, machine-specific LAPACK, ESSL.]
Polylibraries

[Diagram: the complete polylibrary hierarchy, with alternatives at every level: reference ScaLAPACK or machine-specific ScaLAPACK; reference LAPACK, machine-specific LAPACK or ESSL; reference BLAS, machine-specific BLAS or ATLAS; machine-specific MPI, MPICH, LAM or PVM; plus PBLAS and BLACS.]
Polylibraries

The advantages of polylibraries:
- A library optimised for the system might not be available.
- Which library is the best may vary according to the routines and the systems.
- Even for different problem sizes or different data access schemes the preferred library can change.
- The characteristics of the system can change (for instance, in parallel systems with the file system shared by processors of different types).
Architecture of a Polylibrary

[Diagram: each basic library (Library_1) goes through an installation process that produces a Library Installation File (LIF_1).]
Architecture of a Polylibrary

The LIF stores the measured performance of each routine of the library. Example for Library_1, routine DGEMM (Mflops for each matrix size):

           m=20      m=40      m=80
  n=20   X Mflops  X Mflops  X Mflops
  n=40   X Mflops  X Mflops  X Mflops
  n=80   X Mflops  X Mflops  X Mflops
Architecture of a Polylibrary

Example for Library_1, routine DROT (Mflops for each size and leading dimension):

           ld=1      ld=100    ld=200
  n=100  X Mflops  X Mflops  X Mflops
  n=200  X Mflops  X Mflops  X Mflops
  n=400  X Mflops  X Mflops  X Mflops
Architecture of a Polylibrary

[Diagram: the installation process is repeated for each available basic library: Library_2 produces LIF_2, Library_3 produces LIF_3, and so on.]
Architecture of a Polylibrary

[Diagram: the PolyLibrary sits on top of the basic libraries (Library_1, Library_2, Library_3) and their installation files (LIF_1, LIF_2, LIF_3).]

PolyLibrary:
  interface routine_1
  interface routine_2
  ...

Each interface routine uses the LIF information to decide which basic library to call:

  interface routine_1
    if n < value
      call routine_1 from Library_1
    else
      depending on data storage
        call routine_1 from Library_1
      or
        call routine_1 from Library_2
    ...
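As an illustration, a minimal sketch in C of such an interface routine; the LIF structure, the library handles and the threshold are hypothetical, and only the dispatch pattern follows the slide:

    /* Hypothetical sketch of a polylibrary interface routine.
       The LIF structure and the library handles are illustrative;
       only the selection pattern comes from the slide. */
    #include <stddef.h>

    typedef void (*gemm_fn)(int n, const double *A, const double *B, double *C);

    /* decision data taken from the Library Installation Files (LIFs) */
    typedef struct {
        int     n_threshold;   /* size below which Library_1 is preferred */
        gemm_fn lib1_gemm;     /* e.g. reference BLAS                     */
        gemm_fn lib2_gemm;     /* e.g. ATLAS                              */
    } lif_info;

    enum storage { ROW_MAJOR, COL_MAJOR };

    void poly_gemm(const lif_info *lif, enum storage st,
                   int n, const double *A, const double *B, double *C)
    {
        if (n < lif->n_threshold)
            lif->lib1_gemm(n, A, B, C);   /* small problems              */
        else if (st == COL_MAJOR)
            lif->lib1_gemm(n, A, B, C);   /* data storage favours lib 1  */
        else
            lif->lib2_gemm(n, A, B, C);   /* otherwise Library_2         */
    }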
Polylibraries

Combining polylibraries with other optimisation techniques:
- Polyalgorithms
- Algorithmic parameters:
  - block size
  - number of processors
  - logical topology of processors
Experimental Results

Routines of different levels in the hierarchy:
- Lowest level: GEMM (matrix-matrix multiplication)
- Medium level: LU and QR factorisations
- Highest level:
  - a Lift-and-Project algorithm to solve the inverse additive eigenvalue problem
  - an algorithm to solve the Toeplitz least squares problem
Experimental Results

The platforms:
- SGI Origin 2000
- IBM SP2
- Different networks of processors:
  - SUN workstations + Ethernet
  - PCs + Fast-Ethernet
  - PCs + Myrinet
Experimental Results: GEMM

Routine: GEMM (matrix-matrix multiplication)
Platform: five SUN Ultra 1 / one SUN Ultra 5
Libraries: refBLAS, macBLAS, ATLAS1, ATLAS2, ATLAS5
Algorithms and parameters:
- Strassen (base size)
- by blocks (block size)
- direct method
Experimental Results: GEMM

Matrix-matrix multiplication interface:

  if processor is SUN Ultra 5
    if problem-size < 600
      solve using ATLAS5 and Strassen method with base size half of problem size
    else if problem-size < 1000
      solve using ATLAS5 and block method with block size 400
    else
      solve using ATLAS5 and Strassen method with base size half of problem size
    endif
  else if processor is SUN Ultra 1
    if problem-size < 600
      solve using ATLAS5 and direct method
    else if problem-size < 1000
      solve using ATLAS5 and Strassen method with base size half of problem size
    else
      solve using ATLAS5 and direct method
    endif
  endif
Experimental Results: GEMM

Execution times (Low: lowest; Mod: selected by the model), with the library, method and parameter (Strassen base size or block size) used, compared with ATLAS5 and the direct method:

  n                   200       600       1000      1400      1600
  Low  Time           0.04      1.06      4.68      12.53     20.03
       Library        ATL5      ATL5      ATL5      ATL2      ATL5
       Method         direct    direct    Strassen  Strassen  blocks
       Parameter      -         -         2         2         400
  Mod  Time           0.04      1.11      4.68      12.58     26.57
       Library        ATL5      ATL5      ATL5      ATL5      ATL5
       Method         Strassen  blocks    Strassen  Strassen  Strassen
       Parameter      2         400       2         2         2
  ATLAS5 direct Time  0.04      1.06      4.83      13.50     31.02
Experimental Results: LU

Routine: LU factorisation
Platform: 4 Pentium III + Myrinet
Libraries: ATLAS, BLAS for Pentium II, BLAS for Pentium III
Experimental Results: LU

The cost of parallel block LU factorisation:

  T_{ARI} = \frac{2n^3}{3p}\,k_{3,gemm} + \frac{(r+c)\,n^2 b}{p}\,k_{3,trsm} + \frac{2 b^2 n}{3}\,k_{2,getf2}

  T_{COM} = \frac{2 n d}{b}\,t_s + \frac{2 n^2 d}{p}\,t_w

Tuning algorithmic parameters:
- block size: b
- 2D mesh of p processors: p = r × c, d = max(r, c)
System parameters:
- cost of arithmetic operations: k_{2,getf2}, k_{3,trsm}, k_{3,gemm}
- communication parameters: t_s, t_w
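A short sketch in C of how an installation routine might use this model to select the algorithmic parameters; the model follows the reconstructed formula above, and the SP values plugged in are placeholders standing in for the installation measurements:

    #include <stdio.h>

    /* Evaluate the LU time model for one (b, r, c) choice.
       k3gemm, k3trsm, k2getf2, ts, tw are System Parameters that would
       be measured at installation time (values below are illustrative). */
    static double lu_model(int n, int b, int r, int c,
                           double k3gemm, double k3trsm, double k2getf2,
                           double ts, double tw)
    {
        int p = r * c;
        int d = r > c ? r : c;
        double N = (double)n;
        double t_ari = 2.0 * N * N * N / (3.0 * p) * k3gemm
                     + (double)(r + c) * N * N * b / p * k3trsm
                     + 2.0 * (double)b * b * N / 3.0 * k2getf2;
        double t_com = 2.0 * N * d / b * ts + 2.0 * N * N * d / p * tw;
        return t_ari + t_com;
    }

    int main(void)
    {
        /* placeholder SP values (seconds per flop / message / word) */
        double k3gemm = 2e-9, k3trsm = 3e-9, k2getf2 = 5e-9;
        double ts = 30e-6, tw = 0.1e-6;
        int n = 1024, bs[] = {16, 32, 64, 128};
        int grids[][2] = {{1, 4}, {2, 2}, {4, 1}};
        double best = 1e30; int bb = 0, br = 0, bc = 0;
        for (int i = 0; i < 4; i++)
            for (int g = 0; g < 3; g++) {
                double t = lu_model(n, bs[i], grids[g][0], grids[g][1],
                                    k3gemm, k3trsm, k2getf2, ts, tw);
                if (t < best) { best = t; bb = bs[i]; br = grids[g][0]; bc = grids[g][1]; }
            }
        printf("selected b=%d, mesh %dx%d, modelled time %g s\n", bb, br, bc, best);
        return 0;
    }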
Experimental Results: LU

Times (the.: theoretical; low.: lowest experimental; mod.: with the model-selected parameters) and the corresponding block sizes b:

  Library   n      the. time  b    low. time  b    mod. time  b
  ATLAS     512    0.13       32   0.12       32   0.12       32
  ATLAS     1024   0.79       32   0.74       32   0.74       32
  ATLAS     1536   2.36       32   2.21       64   2.27       32
  BLAS-II   512    0.13       32   0.11       32   0.11       32
  BLAS-II   1024   0.77       32   0.71       32   0.71       32
  BLAS-II   1536   2.30       32   2.13       32   2.13       32
  BLAS-III  512    0.13       32   0.11       32   0.11       32
  BLAS-III  1024   0.77       32   0.70       32   0.70       32
  BLAS-III  1536   2.30       32   2.13       32   2.13       32
Experimental Results: QR

Routine: QR factorisation
Platform: 8 Pentium III + Fast-Ethernet
Libraries: ATLAS, BLAS for Pentium II, BLAS for Pentium III
Experimental Results: QR

The cost of parallel block QR factorisation:

  T_{ARI} = \frac{4n^3}{3p}\,k_{3,gemm} + \frac{n^2 b}{2c}\,k_{3,trmm} + \frac{n^2 b}{2r}\,k_{2,geqr2} + \frac{3 n b^2}{4r}\,k_{2,larft}

  T_{COM} = \frac{n}{b}\left(\log r + 2\log c\right) t_s + \left(\frac{n^2 (2r+1)\log r}{2rc} + \frac{n\,(2+3b)(\log r + 2\log c)}{2} + \frac{n b \log p}{2}\right) t_w

Tuning algorithmic parameters:
- block size: b
- 2D mesh of p processors: p = r × c
System parameters:
- cost of arithmetic operations: k_{2,geqr2}, k_{2,larft}, k_{3,gemm}, k_{3,trmm}
- communication parameters: t_s, t_w
Experimental Results: QR

  n             128      256      384      1024     2048     3072
  mod.  Time    0.025    0.087    0.178    1.92     9.90     28.00
        r×c     1×2      1×8      1×8      1×8      1×8      1×8
        b       8/16     8/16     8        8        16       32
        Library BLASII   BLASII   BLASII   BLASII   BLASII   BLASII
  low.  Time    0.025    0.086    0.176    1.92     9.40     25.11
        r×c     1×2      1×8      1×8      1×8      2×4      2×4
        b       16       16       8        8        32       32
        Library BLASII   BLASIII  BLASIII  BLASII   BLASII   BLASII
Experimental Results: L&P
Routine: Lift-and-Project method for the Inverse Additive Eigenvalue Prob
Platform: dual Pentium III
Libraries combinations:
La_In+B_In
LAPACK and BLAS installed in the system and supposedly
optimized for the machine
La_Re+B_III
reference LAPACK and a freely available BLAS for PentiumIII
La_Re+B_II
reference LAPACK and a freely available BLAS for Pentium II
La_Re+B_In
reference LAPACK and the installed BLAS
La_In_Th+B_In_Th
LAPACK and BLAS installed for the use of threads
La_Re+B_II_Th
reference LAPACK and a freely available BLAS for Pentium II
using threads
La_Re+B_In_Th
18 July 2015
reference Universidad
LAPACK and
the BLAS installed which uses threads
Politécnica de Valencia
32
Experimental Results: L&P
The theoretical model of the sequential algorithm cost:
 22
 3
iter k syev  2k 3, gemm  k 3,diaggemm n 
 3

iter2k1,dot  k1,scal  k1,axpy n 2 L  2k1,dot n 2 L2  k sum nL2
System Parameters:
ksyev
k3, gemm k3, diaggemm
k1,dot k1,scal k1,axpy
18 July 2015
 LAPACK
 BLAS-3
 BLAS-1
Universidad Politécnica de Valencia
33
Experimental Results: L&P
18 July 2015
Universidad Politécnica de Valencia
34
Experimental Results: L&P
TRACE
ADK
EIGEN
MATEIG
MATMAT
ZKAOA
TOTAL
La_In
B_In
1.69
12.86
165.81
0.94
98.79
14.22
294.32
La_Re
B_III
1.16
14.87
210.85
0.83
26.70
10.46
264.89
La_Re
B_II
1.16
15.65
255.20
0.86
10.52
10.44
293.85
La_Re
B_In
1.69
16.41
336.49
1.21
123.73
18.03
497.59
Lowest no threads
1.16
12.86
165.81
0.83
10.52
10.44
201.64
La_In_Th
B_In_Th
1.10
13.92
266.63
0.66
14.13
12.34
308.80
La_Re
B_II_Th
1.16
15.68
254.34
0.79
6.66
9.99
288.66
La_Re
B_In_Th
1.10
13.71
249.59
0.62
13.74
11.90
290.68
Lowest with threads
1.10
13.71
249.59
0.62
6.66
9.99
281.70
Lowest
1.10
12.86
165.81
0.62
6.66
9.99
197.06
18 July 2015
Universidad Politécnica de Valencia
35
Polylibraries

- The method can be applied to sequential and parallel algorithms.
- It can be combined with other methods of computation speed-up.
- The LIF contains the cost of an operation for each one of the routines. These costs may be different for different data sizes or access schemes.
- It could be applied to help in the development of efficient parallel libraries in other fields.
Outline

- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries’ hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Hybrid programming
- Peer to peer computing
Algorithmic schemes

The aim is to study ALGORITHMIC SCHEMES, not individual routines. The study could be useful to:
- Design libraries to solve problems in different fields:
  - Divide and Conquer, Dynamic Programming, Branch and Bound (La Laguna)
- Develop SKELETONS which could be used in parallel programming languages:
  - Skil, Skipper, CARAML, P3L, …
Dynamic Programming

There are different parallel dynamic programming schemes. The simple scheme of the “coins problem” is used:
- Given a quantity C and n coins of values v = (v1, v2, …, vn), with a quantity q = (q1, q2, …, qn) of each type, minimize the quantity of coins to be used to give C.
- The granularity of the computation has been varied to study the scheme, not the problem.
Dynamic Programming

Sequential scheme:

  for i = 1 to number_of_decisions
    for j = 1 to problem_size
      obtain the optimum solution with i decisions and problem size j
    endfor
    complete the table with the formula:
      Change[i,j] = \min_{k=0,1,\dots,\lfloor j/v_i\rfloor}\left\{\,k + Change[i-1,\; j - k\,v_i]\,\right\}
  endfor

[Figure: the n × N table, filled row by row; entry (i, j) depends on row i−1 at columns j, j−v_i, j−2v_i, …]
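A compact C version of this sequential scheme, useful as a reference point; the bounded coin quantities q_i are taken into account, and the function and variable names are my own:

    #include <stdio.h>
    #include <limits.h>

    #define INF (INT_MAX / 2)   /* unreachable-amount sentinel, no overflow */

    /* Change[i][j] = minimum coins to give j using the first i coin types,
       with at most q[i-1] coins of type i (the recurrence of the slide).
       For large C, allocate the table on the heap instead. */
    int min_coins(int C, int n, const int v[], const int q[])
    {
        int Change[n + 1][C + 1];
        for (int j = 1; j <= C; j++) Change[0][j] = INF;
        Change[0][0] = 0;
        for (int i = 1; i <= n; i++)
            for (int j = 0; j <= C; j++) {
                int best = INF;
                int kmax = j / v[i - 1];
                if (kmax > q[i - 1]) kmax = q[i - 1];
                for (int k = 0; k <= kmax; k++) {
                    int prev = Change[i - 1][j - k * v[i - 1]];
                    if (prev != INF && k + prev < best) best = k + prev;
                }
                Change[i][j] = best;
            }
        return Change[n][C];
    }

    int main(void)
    {
        int v[] = {1, 2, 5}, q[] = {100, 100, 100};
        printf("coins for C=13: %d\n", min_coins(13, 3, v, q)); /* 4: 5+5+2+1 */
        return 0;
    }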
Dynamic Programming

Parallel scheme:

  for i = 1 to number_of_decisions
    In Parallel:
      for j = 1 to problem_size
        obtain the optimum solution with i decisions and problem size j
      endfor
    endInParallel
  endfor

[Figure: the columns j of each row i are computed in parallel by processors P0, P1, …, PK; entry (i, j) uses entries of row i−1 at columns j, j−v_i, j−2v_i, j−3v_i, …]
Dynamic Programming

Message-passing scheme:

  In each processor Pj:
    for i = 1 to number_of_decisions
      communication step
      obtain the optimum solution with i decisions and the problem sizes Pj has assigned
    endfor
  endInEachProcessor

[Figure: the columns 1 … N of the table are block-distributed among processors P0 … PK; each row step requires communication of parts of the previous row.]
Dynamic Programming

Theoretical model:

  one step: t_{parallel} = t_{arith,1} + t_{comm,1} + t_{arith,2} + t_{comm,2} + …

  Sequential cost per step: O\!\left(\frac{C^2}{2 v_i}\right)

  Computational parallel cost (q_i large):
    \sum_{j=C\left(1-\frac{1}{p}\right)+1}^{C}\left(1+\left\lfloor\frac{j}{v_i}\right\rfloor\right) = O\!\left(\frac{C^2}{v_i\,p}\right)

  Communication cost (process P_p): \frac{p(p-1)}{2}\,t_s + \frac{C(p-1)}{2}\,t_w

- The only AP is p.
- The SPs are t_c, t_s and t_w.
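A small sketch in C of how the only algorithmic parameter p could be selected from this model; t_c multiplies the arithmetic term, and the SP values used here are placeholders:

    #include <stdio.h>

    /* Modelled time of one step of the message-passing DP scheme:
       arithmetic O(C^2/(v_i p)) plus the communication terms above. */
    static double dp_model(double C, double vi, int p,
                           double tc, double ts, double tw)
    {
        double arith = tc * C * C / (vi * p);
        double comm  = ts * p * (p - 1) / 2.0 + tw * C * (p - 1) / 2.0;
        return arith + comm;
    }

    int main(void)
    {
        double C = 100000, vi = 2;                 /* problem data          */
        double tc = 1e-8, ts = 3e-5, tw = 1e-7;    /* placeholder SP values */
        int best_p = 1; double best_t = dp_model(C, vi, 1, tc, ts, tw);
        for (int p = 2; p <= 8; p++) {
            double t = dp_model(C, vi, p, tc, ts, tw);
            if (t < best_t) { best_t = t; best_p = p; }
        }
        printf("selected p = %d (modelled time %g s)\n", best_p, best_t);
        return 0;
    }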
Dynamic Programming

How to estimate arithmetic SPs:
- solving a small problem

How to estimate communication SPs:
- using a ping-pong (CP1)
- solving a small problem varying the number of processors (CP2)
- solving problems of selected sizes in systems of selected sizes (CP3)
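A minimal MPI ping-pong in C of the kind CP1 uses: timing two message sizes gives a two-point fit for t_s and t_w (the sizes and the repetition count are arbitrary choices of mine):

    #include <stdio.h>
    #include <mpi.h>

    /* One-way time for a message of n doubles between ranks 0 and 1,
       averaged over reps round trips. Run with at least two processes. */
    static double pingpong(int n, int reps, double *buf)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
        }
        return (MPI_Wtime() - t0) / (2.0 * reps);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        static double buf[20000];
        int n1 = 1000, n2 = 20000, reps = 100;
        double t1 = pingpong(n1, reps, buf);
        double t2 = pingpong(n2, reps, buf);
        if (rank == 0) {
            double tw = (t2 - t1) / (n2 - n1);   /* per-word time  */
            double ts = t1 - n1 * tw;            /* start-up time  */
            printf("ts = %g s, tw = %g s/word\n", ts, tw);
        }
        MPI_Finalize();
        return 0;
    }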
Dynamic Programming

Experimental results:
- Systems:
  - SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster) + Ethernet
  - PenFE: seven Pentium III + Fast-Ethernet
- Varying:
  - the problem size C = 10000, 50000, 100000, 500000
  - a large value of q_i
  - the granularity of the computation (the cost of a computational step)
Dynamic Programming

Experimental results:
- CP1: ping-pong (point-to-point communication). Does not reflect the characteristics of the system.
- CP2: executions with the smallest problem (C = 10000), varying the number of processors. Reflects the characteristics of the system, but the time also changes with C. Larger installation time (6 and 9 seconds).
- CP3: executions with selected problem sizes (C = 10000, 100000) and system sizes (p = 2, 4, 6), and linear interpolation for other sizes. Larger installation time (76 and 35 seconds).
Dynamic Programming

Parameter (number of processors) selected by each method (LT: selection with the lowest time; CP1, CP2, CP3), for each granularity (gra) and problem size C:

SUNEt:
  gra |    C=10000     |    C=50000     |    C=100000    |    C=500000
      | LT CP1 CP2 CP3 | LT CP1 CP2 CP3 | LT CP1 CP2 CP3 | LT CP1 CP2 CP3
  10  |  1   1   1   1 |  1   1   1   1 |  1   1   1   1 |  1   1   1   1
  50  |  6   6   1   6 |  1   6   1   4 |  1   6   1   4 |  1   6   1   4
  100 |  6   6   1   6 |  6   6   1   5 |  5   6   1   5 |  1   6   1   5

PenFE:
  gra |    C=10000     |    C=50000     |    C=100000    |    C=500000
      | LT CP1 CP2 CP3 | LT CP1 CP2 CP3 | LT CP1 CP2 CP3 | LT CP1 CP2 CP3
  10  |  1   6   1   1 |  1   6   1   1 |  1   6   1   1 |  1   6   1   1
  50  |  5   7   1   7 |  7   7   1   6 |  4   7   1   6 |  7   7   1   6
  100 |  6   7   5   7 |  7   7   7   7 |  6   7   7   7 |  6   7   7   7
Dynamic Programming

Quotient between the execution time with the parameter selected by each one of the selection methods (CP1, CP2, CP3) and the lowest execution time, in SUNEt:

[Bar chart over the combinations of C (10000–500000) and granularity (10, 50, 100), plus the average; quotients between 1 and about 1.8.]
Dynamic Programming

Quotient between the execution time with the parameter selected by each one of the selection methods (CP1, CP2, CP3) and the lowest execution time, in PenFE:

[Bar chart over the combinations of C (10000–500000) and granularity (10, 50, 100), plus the average; quotients between 1 and about 2.5.]
Dynamic Programming

Three types of users are considered:
- GU (greedy user): uses all the available processors.
- CU (conservative user): uses half of the available processors.
- EU (expert user): uses a different number of processors depending on the granularity:
  - 1 for low granularity
  - half of the available processors for middle granularity
  - all the processors for high granularity
Dynamic Programming

Quotient between the execution time with the parameter selected by each type of user (GU, CU, EU) and by CP3, and the lowest execution time, in SUNEt:

[Bar chart over the combinations of C and granularity, plus the average; quotients up to about 4.]
Dynamic Programming

Quotient between the execution time with the parameter selected by each type of user (GU, CU, EU) and by CP3, and the lowest execution time, in PenFE:

[Bar chart over the combinations of C and granularity, plus the average; quotients up to about 2.5.]
Outline

- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries’ hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Hybrid programming
- Peer to peer computing
Heterogeneous algorithms

New algorithms with unbalanced distribution of data are necessary:
- different SPs for different processors
- the APs include a vector of selected processors and a vector of block sizes
Example: Gauss elimination.

[Figure: the columns are distributed cyclically in blocks of sizes b0, b1, b2 matched to the speeds of the processors.]
Heterogeneous algorithms

Parameter selection:
- RI-THE: obtains p and b from the formula (homogeneous distribution).
- RI-HOM: obtains p and b through a reduced number of executions (homogeneous distribution).
- RI-HET: obtains p and b through a reduced number of executions, and each processor i receives a block size proportional to its relative speed s_i:

  b_i = \frac{s_i}{\sum_{j=1}^{p} s_j}\, p\, b
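In C the proportional distribution is a short loop; handing the rounding remainder to the fastest processor is a choice of mine, not from the slides:

    #include <stdio.h>

    /* Distribute p*b columns among p processors proportionally to speeds s[i]:
       b_i = s_i / sum(s_j) * p * b, as in the RI-HET formula above. */
    static void het_blocks(int p, int b, const double s[], int bi[])
    {
        double total = 0.0;
        int assigned = 0, fastest = 0;
        for (int i = 0; i < p; i++) total += s[i];
        for (int i = 0; i < p; i++) {
            bi[i] = (int)(s[i] / total * p * b);
            assigned += bi[i];
            if (s[i] > s[fastest]) fastest = i;
        }
        bi[fastest] += p * b - assigned;   /* rounding remainder */
    }

    int main(void)
    {
        double s[] = {1.0, 1.0, 2.5};      /* e.g. two Ultra 1, one Ultra 5 */
        int bi[3];
        het_blocks(3, 32, s, bi);
        printf("block sizes: %d %d %d\n", bi[0], bi[1], bi[2]);
        return 0;
    }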
Heterogeneous algorithms

Quotient with respect to the lowest experimental execution time, for RI-THEO, RI-HOMO and RI-HETE:

[Bar charts (quotients between 0 and 2) for three configurations:]
- Homogeneous system: five SUN Ultra 1
- Hybrid system: five SUN Ultra 1, one SUN Ultra 5
- Heterogeneous system: two SUN Ultra 1 (one manages the file system), one SUN Ultra 5
Parameter selection at running time

[Diagram of the autotuning architecture. DESIGN: Modelling the LAR produces the MODEL and the Implementation of SP-Estimators. INSTALLATION: the SP-Estimators and the Basic Libraries are used in the Estimation of Static-SP, producing the Installation-File and the Static-SP-File. A RUN-TIME stage follows.]
Parameter selection at running time

[Diagram: at run time a Call to NWS is added, providing NWS Information to the run-time stage.]
Parameter selection at running time

The NWS is called and it reports:
- the fraction of available CPU (f_CPU)
- the current word-sending time (tw_current) for specific n and AP values (n0, AP0)
Then the fraction of available network is calculated, from the ratio between the word-sending time stored at installation for (n0, AP0) and tw_current.

[Diagram: as before, with the NWS information feeding the run-time stage.]
Parameter selection at running time

Load situations considered (CPU availability and current word-sending time per group of nodes):

  Situation   nodes 1-4       nodes 5-6       nodes 7-8
  A           100%, 0.7 s     100%, 0.7 s     100%, 0.7 s
  B            80%, 0.8 s     100%, 0.7 s     100%, 0.7 s
  C            60%, 1.8 s     100%, 0.7 s     100%, 0.7 s
  D            60%, 1.8 s     100%, 0.7 s      80%, 0.8 s
  E            60%, 1.8 s     100%, 0.7 s      50%, 4.0 s
Parameter selection at running time

[Diagram: a Dynamic Adjustment of SP step is added at run time; it takes the Static-SP-File and the NWS Information and produces the Current-SP.]
Parameter selection at running time

The values of the SPs are tuned according to the current situation: the Dynamic Adjustment of SP combines the Static-SP-File with the NWS Information to produce the Current-SP.
Parameter selection at running time

[Diagram: the Selection of Optimum AP step uses the MODEL, the Installation-File and the Current-SP to produce the Optimum-AP.]
Parameter selection at running time

Block size selected for each load situation:

  n       A    B    C    D    E
  1024   64   32   32   64   64
  2048   64   64   64  128  128
  3072   64   64  128  128  128

Number of nodes to use, p = r × c, for each load situation:

  n       A     B     C     D     E
  1024   4×2   4×2   2×2   2×2   2×1
  2048   4×2   4×2   2×2   2×2   2×1
  3072   4×2   4×2   2×2   2×2   2×1
Parameter selection at running time

[Diagram: finally, the LAR is executed with the Optimum-AP (Execution of LAR).]
Parameter selection at running time

[Bar charts comparing the Static Model and the Dynamic Model for n = 1024, 2048 and 3072, under the load situations A–E of the platform.]
Work distribution

There are different possibilities in heterogeneous systems:
- Heterogeneous algorithms (Gauss elimination).
- Homogeneous algorithms and assignation of:
  - one process to each processor (LU factorisation)
  - a variable number of processes to each processor, depending on the relative speed
The general assignation problem is NP-hard → use of heuristic approximations.
Work distribution

Dynamic Programming (the coins problem scheme):

[Figure: two alternatives. Left: homogeneous algorithm + distribution, where the homogeneous processes p0, p1, …, pr are mapped to the physical processors P0 … PK according to their speeds (e.g. p0, p1 → P0; p2 → P1; p3, p4, p5 → P3; …). Right: heterogeneous algorithm, where the table columns are distributed in blocks of different sizes directly to the processors.]
Work distribution

The model:

  t(n, C, v, q, tc(n,C,v,q,p,b,d), ts(n,C,v,q,p,b,d), tw(n,C,v,q,p,b,d))

- Problem size: n number of types of coins; C value to give; v array of values of the coins; q quantity of coins of each type
- Algorithmic parameters: p number of processes; b block size (here n/p); d processes-to-processors assignment
- System parameters: tc cost of basic arithmetic operations; ts start-up time; tw word-sending time
Work distribution

Theoretical model: the same as for the homogeneous case, because the same homogeneous algorithm is used.

  Sequential cost per step: O\!\left(\frac{C^2}{2 v_i}\right)

  Computational parallel cost (q_i large):
    \sum_{j=C\left(1-\frac{1}{p}\right)+1}^{C}\left(1+\left\lfloor\frac{j}{v_i}\right\rfloor\right) = O\!\left(\frac{C^2}{v_i\,p}\right)

  Communication cost (process P_p): \frac{p(p-1)}{2}\,t_s + \frac{C(p-1)}{2}\,t_w

- There is a new AP: d.
- The SPs are now unidimensional (t_c) or bidimensional (t_s, t_w) tables.
Work distribution

Assignment tree (P types of processors and p processes):

[Tree: each level chooses the type of one more process, with branches 1, 2, …, P constrained to be non-decreasing, so each node represents a multiset of processor types.]

Some limit in the height of the tree (the number of processes) is necessary.
Work distribution

Assignment tree (P types of processors and p processes):
- For P = 2 and p = 3: 10 nodes.
- In general, the number of nodes is:

  \binom{P-1}{0} + \binom{P}{1} + \binom{P+1}{2} + \cdots + \binom{P+p-1}{p}

  (for P = 2 and p = 3: 1 + 2 + 3 + 4 = 10)
Work distribution

Assignment tree for SUNEt, P = 2 types of processors (five SUN Ultra 1 + one SUN Ultra 5):

  number of nodes: \frac{(p+2)(p+1)}{2}

- One process is assigned to each processor.
- When more processes than available processors are assigned to a type of processor, the costs of operations (SPs) change.

[Tree: at each level either the U5 or a U1 is added; since there is a single U5, once it is used only U1 branches remain.]
Work distribution

Assignment tree for TORC, using P = 4 types of processors:
- one 1.7 GHz Pentium 4 (only one process can be assigned): type 1
- one 1.2 GHz AMD Athlon: type 2
- one 600 MHz single Pentium III: type 3
- eight 550 MHz dual Pentium III: type 4

[Tree: a second process of type 1 is not in the tree; when a branch repeats type 2 or 3, the values of the SPs change; for type 4, two consecutive processes are assigned to a same node.]
Work distribution

Use Branch and Bound or Backtracking (with node elimination) to search through the tree:
- Use the theoretical execution model to estimate the cost at each node, with the highest values of the SPs among those of the types of processors considered, multiplying the arithmetic cost by the number of processes assigned to the processor of this type with most charge:

  t_c = \max_{i=1,\dots,P;\; d_i \neq 0}\left\{\left\lceil\frac{d_i}{np_i}\right\rceil t_{c_i}\right\}
  t_s = \max_{i,j=1,\dots,P;\; d_i, d_j \neq 0}\left\{t_{s_{i,j}}\right\}
  t_w = \max_{i,j=1,\dots,P;\; d_i, d_j \neq 0}\left\{t_{w_{i,j}}\right\}

  with d_i the number of processes assigned to type i and np_i the number of processors of that type.
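A backtracking sketch in C over this tree (non-decreasing type sequences), pruning with a lower bound; the model and bound functions are toy stubs standing in for the formulas above and below:

    #include <stdio.h>

    #define P 2          /* types of processors      */
    #define MAXP 4       /* limit on the tree height */

    static int d_best[MAXP + 1], best_len;
    static double t_best = 1e30;

    static double model_time(const int d[], int len);
    static double lower_bound(const int d[], int len);

    static void search(int d[], int len, int min_type)
    {
        if (len > 0) {
            double t = model_time(d, len);
            if (t < t_best) {
                t_best = t; best_len = len;
                for (int i = 0; i < len; i++) d_best[i] = d[i];
            }
        }
        if (len == MAXP) return;
        for (int type = min_type; type <= P; type++) {   /* non-decreasing */
            d[len] = type;
            if (lower_bound(d, len + 1) < t_best)        /* prune the node */
                search(d, len + 1, type);
        }
    }

    /* Toy model: type 1 is 2.5 times faster than type 2; adding processes
       reduces arithmetic cost but adds communication cost. */
    static double model_time(const int d[], int len)
    {
        double speed = 0.0;
        for (int i = 0; i < len; i++) speed += (d[i] == 1) ? 2.5 : 1.0;
        return 100.0 / speed + 0.5 * len * (len - 1);
    }
    /* Stub: the real tuner derives an optimistic bound from the maximum
       achievable speed and the lowest communication costs. */
    static double lower_bound(const int d[], int len)
    {
        return model_time(d, len) * 0.9;
    }

    int main(void)
    {
        int d[MAXP + 1];
        search(d, 0, 1);
        printf("best assignment (%d processes), modelled time %g:", best_len, t_best);
        for (int i = 0; i < best_len; i++) printf(" type %d", d_best[i]);
        printf("\n");
        return 0;
    }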
Work distribution

Use Branch and Bound or Backtracking (with node elimination) to search through the tree:
- Use the theoretical execution model to obtain a lower bound for each node. For example, with an array of types of processors (1,1,1,2,2,2,3,3,3,4,4,4) with relative speeds s_i, and array of assignations a = (2,2,3), the array of possible assignations is pa = (0,0,0,1,1,0,1,1,1,1,1,1), and the maximum achievable speed is:

  s_T = \sum_{i=1}^{p} pa_i\, s_i

  The minimum arithmetic cost is obtained from this speed, and the lowest communication costs are obtained from those between processors in the array of assignations:

  t_s = \min_{i,j=1,\dots,p;\; a_i, a_j \neq 0}\left\{t_{s_{i,j}}\right\}
  t_w = \min_{i,j=1,\dots,p;\; a_i, a_j \neq 0}\left\{t_{w_{i,j}}\right\}
79
Work distribution

Theoretical model:
one step
t parallel  tarith,1  tcomm,1  tarith,2  tcomm,2  ...
Sequential
2


C
cost: o t

 c 2v 
i

Computational parallel cost (qi large):
Communication cost:
2
  j 


C



t c C 1      o t c p 
vi 
 vi  

j  C  1 
C
p
p( p  1)
C ( p  1)

ts
tw
2
2
Maximum values


The APs are p and the assignation array d
The SPs are the unidimensional array tc , and the
bidimensional arrays ts and tw
18 July 2015
Universidad Politécnica de Valencia
80
Work distribution

How to estimate arithmetic SPs:
- solving a small problem on each type of processor

How to estimate communication SPs:
- using a ping-pong between each pair of processors, and between processes in the same processor (CP1); does not reflect the characteristics of the system
- solving a small problem varying the number of processors, with linear interpolation (CP2); larger installation time
Work distribution

Three types of users are considered:
- GU (greedy user): uses all the available processors, with one process per processor.
- CU (conservative user): uses half of the available processors (the fastest), with one process per processor.
- EU (user expert in the problem, the system and heterogeneous computing): uses a different number of processes and processors depending on the granularity:
  - 1 process on the fastest processor, for low granularity
  - a number of processes equal to half of the available processors, on the appropriate processors, for middle granularity
  - a number of processes equal to the number of processors, on the appropriate processors, for large granularity
Work distribution

Quotient between the execution time with the parameters selected by each one of the selection methods (CP1, CP2) and the modelled users (GU, CU, EU) and the lowest execution time, in SUNEt:

[Bar chart over the combinations of C and granularity, plus the average; quotients up to about 4.5.]
Work distribution

Parameter selection in TORC with CP2 (LT: selection with the lowest time):

  C        gra    LT       CP2
  50000    10     (1,2)    (1,2)
  50000    50     (1,2)    (1,2,4,4)
  50000    100    (1,2)    (1,2,4,4)
  100000   10     (1,2)    (1,2)
  100000   50     (1,2)    (1,2,4,4)
  100000   100    (1,2)    (1,2,4,4)
  500000   10     (1,2)    (1,2)
  500000   50     (1,2)    (1,2,3,4)
  500000   100    (1,2)    (1,2,3,4)
Work distribution

Parameter selection in TORC (without the 1.7 GHz Pentium 4) with CP2:
- one 1.2 GHz AMD Athlon: type 1
- one 600 MHz single Pentium III: type 2
- eight 550 MHz dual Pentium III: type 3

  C        gra    LT           CP2
  50000    10     (1,1,2)      (1,1,2,3,3,3,3,3,3)
  50000    50     (1,1,2)      (1,1,2,3,3,3,3,3,3,3,3)
  50000    100    (1,1,3,3)    (1,1,2,3,3,3,3,3,3,3,3)
  100000   10     (1,1,2)      (1,1,2)
  100000   50     (1,1,3)      (1,1,2,3,3,3,3,3,3,3,3)
  100000   100    (1,1,3)      (1,1,2,3,3,3,3,3,3,3,3)
  500000   10     (1,1,2)      (1,1,2)
  500000   50     (1,1,2)      (1,1,2,3)
  500000   100    (1,1,2)      (1,1,2)
Work distribution

Quotient between the execution time with the parameters selected by each one of the selection methods (CP1, CP2) and the modelled users (GU, CU, EU) and the lowest execution time, in TORC:

[Bar chart over the combinations of C and granularity, plus the average and the total; quotients up to about 2.5.]
Work distribution

Quotient between the execution time with the parameters selected by each one of the selection methods (CP1, CP2) and the modelled users (GU, CU, EU) and the lowest execution time, in TORC (without the 1.7 GHz Pentium 4):

[Bar chart over the combinations of C and granularity, plus the average and the total; quotients up to about 2.5.]
Outline

- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries’ hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Hybrid programming
- Peer to peer computing
Hybrid programming

OpenMP:
- fine-grain parallelism
- efficient in SMP
- sequential and parallel codes are similar
- tools for development and parallelisation
- allows run-time scheduling
- memory allocation can reduce performance

MPI:
- coarse-grain parallelism
- more portable
- parallel code very different from sequential
- development and debugging more complex
- static assignment of processes
- local memories, which facilitates efficient use
Hybrid programming

Advantages of hybrid programming:
- to improve scalability
- when too many tasks produce load imbalance
- applications with fine and coarse-grain parallelism
- reduction of the code development time
- when the number of MPI processes is fixed
- in case of a mixture of functional and data parallelism
Hybrid programming

Hybrid programming in the literature:
- most of the papers are about particular applications
- some papers present hybrid models
- no theoretical models of the execution time are available
Hybrid programming

Systems:
- networks of dual Pentiums
- HPC160 (each node four processors)
- IBM SP
- Blue Horizon (144 nodes, each 8 processors)
- Earth Simulator (640×8 vector processors)
- …
Hybrid programming

Models:
- MPI+OpenMP: OpenMP used for loop parallelisation
- OpenMP+MPI: unsafe threads
- MPI and OpenMP processes in SPMD model: reduces the cost of communications
Hybrid programming

      program main
      include 'mpif.h'
      double precision mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, ierr
c     statement function with the integrand of pi = integral of 4/(1+x^2)
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
c     n (the number of intervals) must be set in process 0 before this point
      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
      h = 1.0d0/n
      sum = 0.0d0
c     MPI distributes the intervals cyclically among processes;
c     OpenMP splits the local loop among the threads of each node
!$OMP PARALLEL DO REDUCTION (+:sum) PRIVATE (x)
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   enddo
!$OMP END PARALLEL DO
      mypi = h * sum
      call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION,
     &                MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      call MPI_FINALIZE(ierr)
      stop
      end
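A program like this is typically built with the MPI compiler wrapper plus the OpenMP flag of the underlying compiler (for example, something like mpif77 -fopenmp, depending on the installation), and launched with one MPI process per node, each process running several OpenMP threads.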
Hybrid programming

It is not clear whether with hybrid programming the execution time would be lower. Results in the literature:
- Lanucara, Rovida: Conjugate-Gradient
- Djomehri, Jin: CFD solver
- Viet, Yoshinaga, Abderazek, Sowa: linear system
Hybrid programming

Matrix-matrix multiplication:

[Figure: block distribution of the matrices (blocks N0, N1, N2) among MPI processes p0, p1, compared with the SPMD MPI+OpenMP distribution.]

It must be decided which is preferable. MPI+OpenMP:
- less memory
- fewer communications
- may have worse memory use
Hybrid programming

In the theoretical time model more algorithmic parameters appear:
- 8 processors: MPI: p = r×s: 1×8, 2×4, 4×2, 8×1. Hybrid: p = r×s: 1×4, 2×2, 4×1, with threads q = u×v: 1×2, 2×1; 6 configurations in total.
- 16 processors: MPI: p = r×s: 1×16, 2×8, 4×4, 8×2, 16×1. Hybrid: p = r×s: 1×4, 2×2, 4×1, with q = u×v: 1×4, 2×2, 4×1; 9 configurations in total.
Hybrid programming

And more system parameters:
- the cost of communications is different inside and outside a node (similar to the heterogeneous case with more than one process per processor)
- the cost of arithmetic operations can vary when the number of threads in the node varies
Consequently, the algorithms must be recoded and new models of the execution time must be obtained.
Hybrid programming

… and the formulas change:

[Figure: processes P0–P6 placed on nodes 1–6, with communications between nodes and synchronizations inside a node.]

The formula changes: for some systems 6×1 nodes and 1×6 threads could be better, and for others 1×6 nodes and 6×1 threads.
Hybrid programming

Open problems:
- Is it possible to automatically generate MPI+OpenMP programs from MPI programs? Maybe for the SPMD model, or at least for some types of programs, such as matricial problems in meshes of processors.
- And is it possible to obtain the execution time of the MPI+OpenMP program from that of the MPI program and some description of how the time model has been obtained?
Outline

- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries’ hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Hybrid programming
- Peer to peer computing
Peer to peer computing

Distributed systems:
- They are inherently heterogeneous and dynamic.
- But there are other problems:
  - higher communication cost
  - special middleware is necessary
- The typical paradigms are master/slave and client/server, where different types of processors (users) are considered.
Peer to peer computing

Peer-to-Peer Computing. Dejan S. Milojicic, Vana Kalogeraki, Rajan Lukose, Kiran Nagaraja, Jim Pruyne, Bruno Richard, Sami Rollins, Zhichen Xu. HP Laboratories Palo Alto, 2002.
Peer to peer computing

Peer to peer:
- All the processors (users) are at the same level (at least initially).
- The community selects, in a democratic and continuous way, the topology of the global network.
- Would it be interesting to have a P2P system for computing?
- Is some system of this type available?
Peer to peer computing

Would it be interesting to have a P2P system for computing?
- I think it would be interesting to develop a system of this type,
- and to let the community decide, in a democratic and continuous way, if it is worthwhile.
Is some system of this type available?
- I think there is no pure P2P system dedicated to computation.
Peer to peer computing

… and other people seem to think the same:
- Lichun Ji (2003): “… P2P networks seem to outperform other approaches largely due to the anonymity of the participants in the peer-network, low network costs and the inexpensive disk-space. Trying to apply P2P principles in the area of distributed computation was significantly less successful”
- Arjav J. Chakravarti, Gerald Baumgartner, Mario Lauria (2004): “… current approaches to utilizing desktop resources require either centralized servers or extensive knowledge of the underlying system, limiting their scalability”
Peer to peer computing

There are a lot of tools for Grid computing:
- Globus (of course); but does Globus provide computational P2P capacity, or is it a tool with which P2P computational systems can be developed?
- NetSolve/GridSolve: uses a client/server structure.
- PlanetLab (at present 387 nodes and 162 sites): in each site one Principal Researcher and one System Administrator.
Peer to peer computing

For computation on P2P the shared resources are:
- Information: books, papers, …, in a typical way.
- Libraries: one peer takes a library from another peer. A description of the library and the system is necessary to know if the library fulfils our requests.
- Computation: one peer collaborates to solve a problem proposed by another peer. This is the central idea of computation on P2P…
Peer to peer computing

Two peers collaborate in the solution of a computational problem using the hierarchy of parallel linear algebra libraries:

[Figure: Peer 1 uses PLAPACK with reference LAPACK, ATLAS and reference MPI; Peer 2 uses ScaLAPACK with machine LAPACK, BLAS, PBLAS, BLACS and machine MPI.]
Peer to peer computing

There are:
- different global hierarchies
- different libraries

[Figure: the same two peers, each with its own library hierarchy.]
Peer to peer computing

And the installation information varies, which makes the efficient use of the theoretical model more difficult than in the heterogeneous case.

[Figure: each library in each peer carries its own installation information (Inst. Inform.).]
Peer to peer computing

Trust problems appear:
- Does the library solve the problems we require to be solved?
- Is the library optimized for the system it claims to be optimized for?
- Is the installation information correct?
- Is the system stable?
There are trust-algorithms for P2P systems; are they (or some modification) applicable to these trust problems?
Peer to peer computing

Each peer would have the possibility of establishing a policy of use:
- the use of the resources could be payable
- the percentage of CPU dedicated to computations for the community
- the type of problems it is interested in
And the MAIN PROBLEM: is it interesting to develop a P2P system for the management and optimization of computational codes?