The Riverside Optimizing Compiler for Configurable

Compiled code acceleration on FPGAs
W. Najjar, B. Buyukkurt, Z. Guo, J. Villareal, J. Cortes, A. Mitra
Computer Science & Engineering
University of California Riverside
Why?
Are FPGAs a New HPC Platform?
Comparison of a dual core Opteron (2.5 GHz) to Virtex 4 & 5 FPGAs on double-precision (dp) floating point (fp)

(dp) Gflop/s    Opt     V-4     V-5
MAc             10      15.9    28.0
Mult            5       12.0    19.9
Add             5       23.9    55.3

Watts           95      ~35     25

• Balanced allocation of adders, multipliers and registers
• Use both DSP and logic for multipliers, run at lower speed
• Logic & wires for I/O interfaces

David Strenski, FPGAs Floating-Point Performance -- a pencil and paper evaluation, in HPCwire.com
28 September 2007
Future of Computing - W. Najjar
ROCCC
Riverside Optimizing Compiler for Configurable Computing
Code acceleration
• By mapping circuits to FPGAs
• Achieves the same speed as hand-written VHDL code
Improved productivity
• Allows design and algorithm space exploration
Keeps the user fully in control
• We automate only what is very well understood
Challenges
FPGA is an amorphous mass of logic
• Structure provided by the code being accelerated
• Repeatedly applied to a large data set: streams
Languages reflect the von Neumann execution model:
• Highly structured and sequential (control driven)
• Vast randomly accessible uniform memory

CPUs (& GPUs)          FPGAs
Temporal computing     Spatial computing
Sequential             Parallel
Centralized storage    Distributed storage
Control flow driven    Data flow driven

ROCCC Overview
[Figure: ROCCC compilation flow]
Input languages: C/C++, Java
High level transformations (procedure, loop and array optimizations), producing Hi-CIRRF
Low level transformations (instruction scheduling, pipelining and storage optimizations), producing Lo-CIRRF
Code generation outputs: VHDL, SystemC, binary
Targets: FPGA, DSP, CPU, GPU, custom unit
CIRRF: Compiler Intermediate Representation for Reconfigurable Fabrics
Limitations on the code:
• No recursion
• No pointers
A Decoupled Execution Model
• Decoupled memory access from datapath
• Parallel loop iterations
• Pipelined datapath
• Smart buffer (input) does data reuse
• Memory fetch and store units, data path configured by compiler
• Off chip accesses platform specific
[Figure: Input memory (on or off chip) -> Mem Fetch Unit -> Input Buffer -> Multiple loop bodies, unrolled and pipelined -> Output Buffer -> Mem Store Unit -> Output memory (on or off chip)]
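The data reuse attributed to the smart buffer can be illustrated in software. The following is a minimal sketch and not ROCCC's implementation: it assumes a 3-tap sliding window over a one-dimensional array, and the names `mem_fetch`, `fir_naive` and `fir_buffered` are hypothetical. Each function returns the number of words it read from input memory, so the two access patterns can be compared.

```c
#include <assert.h>

#define TAPS 3

/* Hypothetical model of one read from input memory; counts every access. */
static int mem_fetch(const int *mem, int idx, long *fetches)
{
    (*fetches)++;
    return mem[idx];
}

/* No reuse: every iteration fetches all TAPS operands from memory. */
long fir_naive(const int *A, int *B, const int *T, int n)
{
    long fetches = 0;
    assert(n >= TAPS);
    for (int i = 0; i <= n - TAPS; i++) {
        int acc = 0;
        for (int k = 0; k < TAPS; k++)
            acc += T[k] * mem_fetch(A, i + k, &fetches);
        B[i] = acc;
    }
    return fetches;
}

/* Reuse: a TAPS-wide window keeps the overlapping elements on chip,
 * so each iteration fetches only the one new element it needs. */
long fir_buffered(const int *A, int *B, const int *T, int n)
{
    long fetches = 0;
    int w[TAPS];
    assert(n >= TAPS);

    /* Preload the first TAPS-1 elements, shifted one slot to the right. */
    for (int k = 0; k < TAPS - 1; k++)
        w[k + 1] = mem_fetch(A, k, &fetches);

    for (int i = 0; i <= n - TAPS; i++) {
        /* Slide the window left by one and fetch the newest element. */
        for (int k = 0; k < TAPS - 1; k++)
            w[k] = w[k + 1];
        w[TAPS - 1] = mem_fetch(A, i + TAPS - 1, &fetches);

        int acc = 0;
        for (int k = 0; k < TAPS; k++)
            acc += T[k] * w[k];
        B[i] = acc;
    }
    return fetches;
}
```

For an array of n elements the naive loop performs 3(n - 2) fetches while the buffered loop performs n, one per input element. With n = 516 that is 1542 versus 516 fetches for identical results.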
So far, working compiler with …
Extensive optimizations and transformations
• Traditional and FPGA specific
• Systolic array, pipelined unrolling, look-up tables
Compiler + hardware support for data reuse
• > 98% reduction in memory fetches on image codes
Efficient code generation and pipelining
• Within 10% of hand-optimized HDL codes
Import of existing IP cores
• Leverages the huge wealth of existing IP, integrated with C source code
Support for dynamic partial reconfiguration
Example: 3-tap FIR
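    #define N 516
    void begin_hw();
    void end_hw();
    int main()
    {
        int i;
        const int T[3] = {3, 5, 7};   /* coefficients */
        int A[N], B[N];
        begin_hw();                   /* marks the start of the code to accelerate */
    L1: for (i = 0; i <= (N-3); i = i+1)
        {
            /* i, i+1, i+2: indices of A[] read by each iteration */
            B[i] = T[0]*A[i] +
                   T[1]*A[i+1] + T[2]*A[i+2];
        }
        end_hw();                     /* marks the end of the code to accelerate */
        return 0;
    }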
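To make the idea of multiple parallel loop bodies concrete, the FIR loop can be unrolled by hand. This is only a sketch under stated assumptions: the unroll factor of 2 and the function names `fir_rolled` and `fir_unrolled2` are hypothetical choices for illustration, and the code runs as ordinary sequential C rather than showing what the compiler emits. It shows that two adjacent iterations are independent and share two of their inputs.

```c
#include <assert.h>

#define TAPS 3

/* Reference: the loop as written in the example, one output per iteration. */
void fir_rolled(const int *A, int *B, const int *T, int n)
{
    assert(n >= TAPS);
    for (int i = 0; i <= n - TAPS; i++)
        B[i] = T[0]*A[i] + T[1]*A[i+1] + T[2]*A[i+2];
}

/* Unrolled by 2: each trip computes two independent outputs. */
void fir_unrolled2(const int *A, int *B, const int *T, int n)
{
    assert(n >= TAPS);
    int last = n - TAPS;              /* index of the final output */
    int i = 0;

    for (; i + 1 <= last; i += 2) {
        /* Four distinct inputs feed two loop bodies; a1 and a2 are shared. */
        int a0 = A[i], a1 = A[i+1], a2 = A[i+2], a3 = A[i+3];
        B[i]     = T[0]*a0 + T[1]*a1 + T[2]*a2;
        B[i + 1] = T[0]*a1 + T[1]*a2 + T[2]*a3;
    }

    /* Leftover iteration when the number of outputs is odd. */
    if (i <= last)
        B[i] = T[0]*A[i] + T[1]*A[i+1] + T[2]*A[i+2];
}
```

Because neither output depends on the other, the two bodies could be evaluated at the same time when laid out spatially. The shared inputs are the kind of overlap an input buffer can exploit.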
RC Platform Models
[Figure: three reconfigurable computing platform models, numbered 1 to 3, each combining a CPU, an FPGA and memory. Diagram labels: memory interface, CPU memory, fast network.]
What we have learned so far
Big speedups are possible
• 10x to 1,000x on application codes, over Xeon and Itanium, molecular dynamics, bio-informatics, etc.
• Works best with streaming data
New paradigms and tools
• For spatio-temporal concurrency
• Algorithms, languages, compilers, run-time systems, etc.
Future? Very wide use of FPGAs
Why?
• High throughput (> 10x) AND low power (< 25%)
How?
• Mostly in Models 2 and 3, initially
• Model 2: See Intel QuickAssist, Xtremedata & DRC
• Model 3: SGI, SRC & Cray
Contingency
• Market brings price of FPGAs down
• Availability of some software stack
• For savvy programmers, initially
Potential
• Multiple “killer apps” (to be discovered)
Conclusion
We as a research community should be ready
Stamatis was
Thank you