Matrix multiplication implemented in data flow technology


Matrix multiplication
implemented in data flow
technology
Aleksandar Milinković
Belgrade University, School of Electrical Engineering
[email protected]
Introduction
The problem of big data
A need to change the computing paradigm
Data flow instead of control flow
Achieved by constructing a graph
Graph nodes (vertices) perform the computations
Each node is one deep pipeline
Dataflow computation
Dependencies are resolved at compile time
No new dependencies arise at run time
The whole mechanism is one deep pipeline
Pipeline stages perform computations in parallel
The data flow produces one result per cycle
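As a rough illustration (a toy software model, not Maxeler code), a deep pipeline of depth D needs D + n − 1 cycles for n inputs: the fill latency is paid once, after which one result leaves every cycle:

```python
def pipeline_cycles(n_items, depth):
    """Toy model of a deep pipeline: every cycle one new item enters the
    first stage and all items advance one stage; an item is finished when
    it reaches the last stage."""
    stages = [None] * depth        # contents of the pipeline registers
    fed = done = cycles = 0
    while done < n_items:
        # shift the pipeline by one stage and issue the next item (or a bubble)
        stages = [fed if fed < n_items else None] + stages[:-1]
        if fed < n_items:
            fed += 1
        cycles += 1
        if stages[-1] is not None:
            done += 1              # one result per cycle once the pipeline is full
    return cycles

# depth + n - 1 cycles: for n >> depth, throughput approaches 1 result/cycle
assert pipeline_cycles(1000, 100) == 100 + 1000 - 1
```

This is why the deep pipeline matters: for large n the fill cost becomes negligible and the hardware sustains one finished result per clock.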
Matrix multiplication
Data flow does not suit all situations
However, it is applicable in many cases:
Partial differential equations
3D finite differences
The finite element method
Problems in bioinformatics, etc.
Most of them contain matrix multiplications
Goal: realization on an FPGA, using data flow
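For reference, this is the operation to be accelerated: the standard O(n³) matrix product. A minimal Python version (illustrative only, not the FPGA kernel):

```python
def matmul(a, b):
    """Naive O(n^3) matrix product: y[i][j] = sum_k a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    assert all(len(row) == m for row in a), "inner dimensions must match"
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]
```

Every element of the result is an independent sum of products, which is what makes the operation a natural fit for parallel pipelines.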
Project realizations
Two solutions:
Maximal utilization of the matrix part held on chip
• Matrices with small dimensions
• Matrices with large dimensions
Multiplication using parallel pipelines
Good chip utilization A
[Figure: the n x n matrices X and Y, with Pipe 0 and Pipe 1 each assigned a block of columns; the block held on chip is bounded by the FMem capacity]
A set of columns stays on chip until it is fully used
Each pipe computes 48 sums at a time
Equivalent to 2 processors with 48 cores each
Additional parallelization is possible
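The blocking scheme can be sketched in software (a sketch only: the column-block idea is from the slide, while the Python structure stands in for the MaxJ kernel):

```python
def matmul_column_blocked(a, b, block=48):
    """Sketch of solution A: keep `block` columns of B on chip (FMem),
    stream all rows of A past them, then move to the next column block."""
    n, m, p = len(a), len(b), len(b[0])
    y = [[0] * p for _ in range(n)]
    for j0 in range(0, p, block):
        hi = min(j0 + block, p)
        # the "on-chip" columns of B for this pass
        cols = [row[j0:hi] for row in b]
        for i in range(n):                       # stream the rows of A
            for jj in range(hi - j0):
                y[i][jj + j0] = sum(a[i][k] * cols[k][jj] for k in range(m))
    return y
```

Each column block is loaded once and reused for every row of A, which is the sense in which the on-chip part is "fully used" before being replaced.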
Good chip utilization A
Chip utilization and acceleration
LUTs: 195345/297600 (65.64%)
FFs: 290689/595200 (48.83%)
BRAMs: 778/1064 (73.12%)
DSPs: 996/2016 (49.40%)
Matrix: 2304 x 2304
Intel: 42.5 s
MAX3: 2.38 s
Acceleration at a kernel clock of 75 MHz: ≈18x
Good chip utilization B
[Figure: the n x n matrices X and Y, with Pipe 0 and Pipe 1 each assigned a block of columns; the on-chip block is bounded by the FMem capacity]
Part of matrix Y stays on chip during the computation
Each pipe computes 48 sums at a time
Equivalent to 2 processors with 48 cores each
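The variant with part of Y resident on chip can also be sketched (again only a software sketch; the `rows` parameter is an illustrative stand-in for the block of sums each pipe accumulates):

```python
def matmul_y_block_on_chip(a, b, rows=48):
    """Sketch of solution B: a block of result rows of Y is held on chip
    and accumulated while the input data stream past it."""
    n, m, p = len(a), len(b), len(b[0])
    y = []
    for i0 in range(0, n, rows):
        tile = [[0] * p for _ in range(min(rows, n - i0))]  # on-chip Y block
        for k in range(m):                  # stream over the common dimension
            for ii in range(len(tile)):
                aik = a[i0 + ii][k]
                for j in range(p):
                    tile[ii][j] += aik * b[k][j]
        y.extend(tile)                      # write the finished block back
    return y
```

Keeping the partial sums of Y on chip means each result element is written to the large memory only once, after it is complete.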
Good chip utilization B
Chip utilization and acceleration
LUTs: 201237/297600 (67.62%)
FFs: 302742/595200 (50.86%)
BRAMs: 782/1064 (73.50%)
DSPs: 1021/2016 (50.64%)
Matrix: 2304 x 2304
Intel: 42.5 s
MAX3: 2.38 s
Matrix: 4608 x 4608
Intel: 1034 s
MAX3: 58.41 s
Acceleration at a kernel clock of 75 MHz: ≈18x
Multiple parallel pipelines
[Figure: the n x n matrices X and Y, with the result elements distributed across 48 pipes (Pipe 0 through Pipe 47)]
Matrices reside exclusively in the large off-chip memory
Each pipe computes one sum at a time
Equivalent to 48 single-core processors
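A sketch of this second design (illustrative Python; the round-robin assignment of result elements to pipes is an assumption for the sketch, and in hardware the 48 pipes run concurrently):

```python
def matmul_parallel_pipes(a, b, pipes=48):
    """Sketch of the parallel-pipelines design: result elements are dealt
    out across `pipes` lanes; each lane computes one dot product (one sum)
    at a time, with both matrices streamed from the large memory."""
    n, m, p = len(a), len(b), len(b[0])
    y = [[0] * p for _ in range(n)]
    jobs = [(i, j) for i in range(n) for j in range(p)]
    for lane in range(pipes):              # in hardware these lanes run in parallel
        for i, j in jobs[lane::pipes]:     # work dealt round-robin to this pipe
            y[i][j] = sum(a[i][k] * b[k][j] for k in range(m))
    return y
```

Since no operand is kept on chip, capacity is limited only by the large memory, at the cost of re-reading the inputs for every dot product.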
Multiple parallel pipelines
Chip utilization and acceleration
LUTs: 166328/297600 (55.89%)
FFs: 248047/595200 (41.67%)
BRAMs: 430/1064 (40.41%)
DSPs: 489/2016 (24.26%)
Matrix: 2304 x 2304
Intel: 42.5 s
MAX3: 4.08 s
Matrix: 4608 x 4608
Intel: 1034 s
MAX3: 98.48 s
Acceleration at a kernel clock of 150 MHz: >10x
Comparison of solutions
First solution:
Good chip utilization
Shorter execution time
Drawback: matrices limited to 8 GB
Second solution: matrices up to 12 GB
Drawback: longer execution time
Conclusions
Matrix multiplication is an operation with complexity O(n³)
Part of the complexity is moved from time to space
That yields acceleration (shorter execution time)
Achieved by applying data flow technology
Developed using the tool chain from Maxeler Technologies
Calculations are an order of magnitude faster than on an Intel Xeon
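The reported accelerations are consistent with the measured times; a quick arithmetic check:

```python
def speedup(cpu_seconds, dfe_seconds):
    """Speedup is the ratio of the two measured execution times."""
    return cpu_seconds / dfe_seconds

# figures from the slides
print(round(speedup(42.5, 2.38), 1))    # good-utilization design, 2304 x 2304: ≈18x
print(round(speedup(1034, 58.41), 1))   # good-utilization design B, 4608 x 4608: ≈18x
print(round(speedup(1034, 98.48), 1))   # parallel-pipelines design, 4608 x 4608: >10x
```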