Transcript Document

Implementing Algorithms in FPGA-Based Reconfigurable Computers Using C-Based Synthesis
Doug Johnson, Technical Marketing Manager
NCSA/OSC Reconfigurable Systems Summer Institute
Urbana, Illinois, July 11-13 2005

Celoxica

UK-based system design company

Provider of design tools, IP & services for Digital Imaging & Signal Processing
  - Image Processing
  - Video Processing
  - Sonar/Radar signal processing
  - Biometrics
  - Massively parallel data mining and matching
Complete solutions for Electronic System Level (ESL) Design
  - System/algorithm acceleration
  - Co-design partitioning
  - Co-simulation & co-verification (C/C++/SystemC/Handel-C/Matlab/VHDL/Verilog)
  - Hardware compilation & C synthesis to reconfigurable architectures
Consulting and professional services
  - Systems analysis and design strategy
  - System implementation capability

Presentation Objectives

Prerequisites
  - Motivations for using FPGAs in RC and HPC
  - HPC and RC FPGA systems hardware and infrastructure
  - HPC algorithms and considerations for Reconfigurable Computing (RC)
Objectives
  - Share a perspective on the state of the art for C-based HW design
  - Describe the C to FPGA flow
  - Illustrate with code examples …
  - Look forward to some critical debate…

Agenda

Reconfigurable Computing
  - Considerations, core algorithm relationships, commercial applications
C-based design
  - The solution space (its place in EDA)
  - Nature of C for HW design
  - The Design Flow
Summary
JPEG2000 Design Example

Reconfigurable Computing (RC)

“RC = Using FPGAs for (algorithmic) computation”
1. Embedded: Well established – body of knowledge/experience
2. Enterprise: Some
3. HPC: Starting Out

Reconfigurable Computing

[Figure: timeline with axis marks 1980, 1990, 2000 and 20X0?. Milestones shown: FPGAs, First RC Successes, Closely Coupled Systems, Commercial C-to-FPGA tools, Partitioning Frameworks, Intimately Coupled Systems, Advanced Compilers]

Promised Opportunities
  - Algorithm Acceleration: exploit parallelism to increase performance with custom HW implementation
  - Algorithm Offload: free CPU resource by offloading bottleneck processes
BIG Challenges
  - Development complexity: design framework and methods, deployment and integration/middleware
  - Coupling to coprocessor/data bandwidth
  - Price/Performance/Power!
  - Choosing the right applications!

FPGA Computing and Methodology

High Performance Embedded and Reconfigurable Computing
  Why FPGA Computing?
    - Moore’s Law showing signs of strain
    - Ability to parallelize in HW
    - Price/GOPS coming down rapidly
    - Hard IP blocks – excellent density
  Example: Floating Point Performance
    - Maximum for Virtex-4 – 50 GFLOPS (Courtesy of Dave Bennett, Xilinx Labs)
    - Maximum for Virtex-2 – 17.5 GFLOPS (Courtesy of Dave Bennett, Xilinx Labs)
    - “Can fit 10’s of FPUs on 2 Xilinx Virtex-4’s” (Courtesy of Justin Tripp, LANL)
    - Use of hard macros for functions is mandatory (example DSP48 on Virtex-4)
C-based design for FPGAs
  Several offerings on commercial marketplace or in research
    - Commercial – Celoxica, Mentor Graphics, Impulse Technologies, Mitrion…
    - Research – Sandia, UC Riverside, LANL
  RTL/HDL is the most widely used way to get to FPGAs but is not usable by SW engineers

Conventional Wisdom for RC
(each rule is followed by the entry from the slide's "2005" column)

1. Small data objects
  - Data transfer overhead to coprocessor, high operation to byte ratio
  - 2005: Closely coupled systems
2. Modest arithmetic
  - Difficult to design and implement complex algorithms in HW
  - Integer/fixed precision calculations
  - Floating point too resource expensive
  - 2005: C-based design, High Density Devices
3. Data-parallelism
  - Parallelism essential - FPGA clocks are an order of magnitude slower than CPUs
  - Fine grain - wide data widths
  - Medium grain - operation/function routine
  - Coarse grain - multiple instantiations of application processes
  - 2005: Essential
4. Pipeline-ability
  - Streaming applications – most successful
  - 2005: Fewer issues with latency in HPC
5. Simple Control
  - Difficult to design complex scheduling schemes in parallel HW
  - 2005: Soft Cores/C-based design

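Rule 2 above favours integer and fixed-precision arithmetic over floating point. As a rough illustration of what that means in C, the sketch below multiplies two fixed-point numbers. The Q8.8 format and the names `q8_8`, `q_from_int` and `q_mul` are assumptions chosen for this example; they do not come from the slides or from any Celoxica library.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 8            /* assumed Q8.8: 8 integer bits, 8 fractional bits */

typedef int16_t q8_8;          /* hypothetical type name for a Q8.8 value */

/* Convert a small integer to Q8.8 (the value must fit in 8 signed bits). */
q8_8 q_from_int(int v)
{
    assert(v >= -128 && v <= 127);
    return (q8_8)(v * (1 << FRAC_BITS));
}

/* Multiply two Q8.8 values: widen, multiply, then rescale back to Q8.8.
 * The result is truncated toward zero and is not saturated on overflow. */
q8_8 q_mul(q8_8 x, q8_8 y)
{
    int32_t wide = (int32_t)x * (int32_t)y;   /* Q16.16 intermediate */
    return (q8_8)(wide / (1 << FRAC_BITS));
}
```

For example, 1.5 is stored as 384 and 2.25 as 576; `q_mul(384, 576)` gives 864, which is 3.375 in Q8.8. In hardware the same operation is an integer multiplier followed by a fixed bit selection, which is why it costs far less than a floating-point unit.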
Further Considerations

6. Exploiting “Soft” programmable HW
  - Configurable Applications: schedule and load HW content prior to HW execution
  - Reconfigurable Applications: dynamically change HW content during HW execution
  - Few compelling examples in HPC

Commercial RC Applications
…using C-based design

Well established in embedded systems:
“PROCESSING AT THE SENSOR” versus local and/or remote processing

Application areas
  - Digital Video Technology and Image Processing
  - Defense & Security
  - Digital Signal Processing
  - Communications and Networking
  - Consumer
  - Automotive & Industrial
Examples
  - 3D LCD display development and test
  - Real-time verification of HDTV image processing algorithms
  - Robust image matching - product tracking and production line control
  - Internet reconfigurable multimedia terminal, MP3, VoIP etc.
  - Ground traffic simulation testbed for broadband satellite network communications
  - Satellite based Internet data tracking system
  - Rapid Systems Prototyping
  - Engine control unit for 3-phase motors
  - Radar and sonar beamforming and spatial filtering
  - Computer aided tomography security system
  - Automotive safety system incorporating sensor fusion
  - Robotic vision system for object detection and robot guidance

Commercial RC Applications (continued)
…using C-based design

Enterprise Computing
  Content processing solutions
    - XML parsing, virus checking
    - Packet/Pattern Matching/Filtering
    - Compression/decompression
    - Security/Encryption – DES/3-DES, SHA, MD5, AES/Rijndael
High Performance Computing
  - Image processing: CT scan analysis, 3D modeling, Ray Tracing
  - Finite element analysis and simulation
  - Custom Vector Engines
  - Genome calculations
  - Seismic data processing

Core Algorithm Relationships in HPC

[Figure: a map with "Basic Algorithms & Numerical Methods" at the center, surrounded by algorithm classes and the application areas that draw on them. Labels recoverable from the figure:]
Discrete Events, Monte Carlo, Pattern Matching, Graph Theoretic, n-body, Transport, Signal Processing, Raster Graphics, Fourier Methods, PDE, ODE, Fields, Symbolic Processing,
Rational Drug Design, Nanotechnology, Tomographic Reconstruction, Fracture Mechanics, Diffraction & Inversion Problems, Atomic Scattering, Condensed Matter Electronic Structure, Electronic Structure, Actinide Chemistry, Astrophysics, Cosmology, Military Logistics, Transportation Systems, Data Assimilation, Population Genetics, Economics, Air Traffic Control, VLSI Design, Pipeline Flows, Flow in Porous Media, Chemical Reactors, Plasma Processing, CFD, Computer Vision, Multimedia Collaboration Tools, Radiation, Genome Processing, Virtual Reality, Computational Steering, Scientific Visualization, Neutron Transport, Virtual Prototypes, Electrical Grids, Nuclear Structure, QCD, Distribution Networks, Reservoir Modelling, Biosphere/Geosphere, Cloud Physics, Combustion, Quantum Chemistry, Manufacturing Systems, Neural Networks, MRI Imaging, Molecular Modeling, Chemical Dynamics, Boilers, CVD, Multiphase Flow, Weather and Climate, Seismic Processing, Multibody Dynamics, Geophysical Fluids, Ecosystems, Economics Models, Cryptography, Electromagnetics, Aerodynamics, Orbital Mechanics, Intelligent Search, Databases, Intelligent Agents, Reaction-Diffusion, Structural Mechanics, Computer Algebra, Data Mining, CAD, Phylogenetic Trees, Biomolecular Dynamics, Crystallography, Automated Deduction, Magnet Design, Number Theory

Source: Rick Stevens - ANL

Core Algorithm Relationships in HPC

[Figure: the Rick Stevens map of basic algorithms, numerical methods and HPC application areas, overlaid with the question below]

How do we map out the right Apps?

Source: Rick Stevens - ANL

Exploiting FPGA in HPC

Hardware:
  - “Enterprise Quality” co-processor system products (Cray XD1, SGI RASC)
  - Robust PCI/PCI-X/VME-based FPGA card solutions for development
  - How do we select and benchmark?
A software design methodology is essential:
  SW dominated application sector
    - Target developers have a SW background
    - Register Transfer Level (RTL), Hardware Description Languages (HDL) are foreign
  Complete designs can be specified in a C environment
    - Simplified Specification, Development, Deployment
    - Porting to HW implementations simplified
    - Platform abstractions through APIs and Libraries

C-based design
The solution space (its place in EDA – Electronic Design Automation)

Embedded Hardware (HW) Design

[Figure: the embedded HW design flow from Specification down to Physical Design, with the region covered by "C to FPGA/SoPC" marked. Labels shown, in slide order:]
Function
  - Algorithm Design, Block Design, Fixed Point extraction, DSP IP
  - TLM, APIs/Libraries, Frameworks, Implementation IP Models
Architecture
  - Fast Mixed Simulation, Architecture Exploration, Design Analysis
  - HW Accelerated Simulation, Custom Processors
  - C-Based HLL Synthesis, Interface Synthesis
Implementation
  - Reconfigurable FPGA/SoPC Prototypes, Implementation IP, Emulation Platforms
  - RTL Verification, RTL
Physical Design

C to FPGA Accelerated System

[Figure: design flow from a C/C++ specification to an FPGA plus processor system. Labels shown, in slide order:]
Function & Architecture
  - Levels: AL (C/C++), CA (C for HW)
  - Specification Model: Design, Testbench, Software Model (stage: Algorithm Design)
  - System Model: Partitioning into HW, COMMS and SW; APIs/Libraries; Mixed Simulation
  - Architecture Exploration, Design Analysis, Optimization
  - C-Based Synthesis with BSP, producing RTL, EDIF or OBJ
Implementation
  - Synthesis, P&R
  - Targets: FPGA, Processor

Challenges for C-based synthesis

Concurrency (Parallelism)
  - Compiler-determined (behavioral synthesis)
  - Explicit
Timing
  - Constraints
  - Explicit
  - Rules-based
Data Types
  - Annotations, additional or C++
Communication
  - Additional or C-like

Two Approaches to C-based Design

Handel-C: C Algorithm to FPGA
  - Core Libraries: TLM (PAL/DSM), Fixed/Floating point …
  - Core Language: par{…}, seq{…}, Interfaces, Channels, Bit Manipulation, RAM & ROM, Single cycle assignment
  - Data Types: Bits and bit-vectors, Arbitrary width integers, Signals
  - Built on the ANSI/ISO C Language Standard

SystemC: SoC (System-on-a-Chip) Prototyping/Verification
  - Core Libraries: SCV, TLM, Master/Slave …
  - Standard Channels for Various MOC: Kahn Process Networks, Static Dataflow…
  - Primitive Channels: Signal, Timer, Mutex, Semaphore, FIFO, etc
  - Core Language: Modules, Ports, Processes, Events, Interfaces, Channels, Event Driven Sim Kernel
  - Data Types: 4-valued logic/vectors, Bits and bit-vectors, Arbitrary width integers, Fixed-point, C++ user-defined types
  - Built on the ANSI/ISO C++ Language Standard

System Design Refinement

[Figure: four processes A, B, C, D refined in three steps down to EDIF/RTL. Level labels in the figure: AL C/C++, CP Handel-C, CA Handel-C, CA Handel-C]

Function
  • System Function
  • Coarse grain parallelism

    par{ processA(…);
         processB(…);
         processC(…);
         processD(…); }

Algorithm
  • Parallel algorithm design
  • Fine-grain parallelism
  • Bit/cycle true processes
  • Algorithm Testbench

    void processD(…){
        unsigned 9 a,b,c;
        par{ a=1; b=2; }
        c=3;
    }

Architecture
  • Add interfaces
  • Signal/cycle accurate test

    void main(){
        interface port_in…
        interface port_out…
        …
    }

Output: EDIF/RTL

Systems Integration

Implementation
  • Complete system design
  • Interface to pins
  • Multi-Clock domain
  • IP Integration

[Figure: processes A, B, C, D with CLK, RST and Data pins; blocks B and D are shown alongside EDIF (Electronic Design Interchange Format) and RTL from HDL IP]

    set clock = external "CLK";
    set reset = external "RST";
    interface Data(…)…
    void main() {
        par{ processA(…);
             processB(…);     { interface processB(…)… };
             processC(…);
             processD(…); }   { interface processD(…)… };
    }

Output: EDIF/RTL

Parallel Debug in C environment
  - Flow stage: Algorithm Design

Resource Usage/Speed Estimations
  - Flow stage: Architecture Exploration

FPGA Support
  - Technology mapping
  - Optimizations

Handel-C Template Multiplier (pipelined)

    set clock = external "clk";

    void main()
    {
        …
        while(1) par
        {
            …
            process();
        }
    }

    void process()
    {
        unsigned W A, B, C;
        while(1) par
        {
            …
            Multiply(A, B, &C);
            …
        }
    }

    void Multiply(unsigned W A, unsigned W B, unsigned W *C)
    {
        static unsigned W a[W], b[W], c[W];
        par{
            a[0] = A;
            b[0] = B;
            c[0] = a[0][0] == 0 ? 0 : b[0];
            par (i = 1; i < W; i++)
            {
                a[i] = a[i-1] >> 1;
                b[i] = b[i-1] << 1;
                c[i] = c[i-1] + (a[i][0] == 0 ? 0 : b[i]);
            }
            *C = c[W-1];
        }
    }

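To see how the pipelined multiplier behaves cycle by cycle, it can help to model it in ordinary C. The sketch below is only a software model under stated assumptions: W is fixed at 8 (the slide leaves it symbolic), every statement inside the `par` block is treated as a register update that reads the values from before the clock edge, and the names `MulPipe`, `mulpipe_reset`, `mulpipe_clock` and `mulpipe_multiply` are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define W 8   /* assumed word width; products are truncated to W bits */

/* One register set per pipeline stage, plus the output register (*C). */
typedef struct {
    uint8_t a[W], b[W], c[W];
    uint8_t out;
} MulPipe;

void mulpipe_reset(MulPipe *p)
{
    assert(p != NULL);
    memset(p, 0, sizeof *p);
}

/* Advance the model by one clock. Every right-hand side reads the state
 * from before the clock edge (p); results go to the next state (n). */
uint8_t mulpipe_clock(MulPipe *p, uint8_t A, uint8_t B)
{
    MulPipe n;
    n.a[0] = A;
    n.b[0] = B;
    n.c[0] = (p->a[0] & 1u) ? p->b[0] : 0;
    for (int i = 1; i < W; i++) {
        n.a[i] = (uint8_t)(p->a[i - 1] >> 1);
        n.b[i] = (uint8_t)(p->b[i - 1] << 1);
        n.c[i] = (uint8_t)(p->c[i - 1] + ((p->a[i] & 1u) ? p->b[i] : 0));
    }
    n.out = p->c[W - 1];
    *p = n;
    return p->out;
}

/* Hold one operand pair on the inputs until its product reaches the output. */
uint8_t mulpipe_multiply(uint8_t A, uint8_t B)
{
    MulPipe p;
    uint8_t result = 0;
    mulpipe_reset(&p);
    for (int cycle = 0; cycle < W + 2; cycle++)
        result = mulpipe_clock(&p, A, B);
    return result;
}
```

In this model an operand pair presented on one call to `mulpipe_clock` produces its product W+1 calls later, and a new pair can be presented on every call, which is the point of pipelining: latency grows with W but throughput stays at one product per clock.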
Summary

Commercial C-based design is a reality
For the HPC and RC communities it offers:
  Fastest route to accelerating SW designs in FPGA
    - Lower barrier to adoption than RTL technologies
    - Greater customization and productivity than block based approaches
    - Complete integration with RTL/block based approaches for “Power users”
    - Simple migration, development to deployment with full library support
  Deterministic and quality results
    - State of the art tools used by embedded systems designers
    - RC platforms for rapid prototyping

Design Example
JPEG2000 Image Compression Algorithm

Example Design
JPEG 2000 Compressor

[Figure: Original Image -> Pre processing -> RGB to YUV conversion -> DWT -> Quantization -> Tier-1 Encoder -> Tier-2 Encoder -> Coded Image, with a Rate Control block]

Five Steps to HW Platform:
1. Specification Model
  - Algorithm Profiling
2. Functional System Model
  - System Estimations
3. Architecture and Communication Model
  - Optimization
4. Implementation Model
  - Direct Synthesis C to EDIF
5. HW Platform
  - Board level integration

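The "RGB to YUV conversion" stage is a good fit for integer hardware. The slides do not say which transform the design used; one candidate is the reversible color transform (RCT) from JPEG 2000 Part 1, which needs only adds, subtracts and shifts. The sketch below assumes that transform, and the names `Rgb`, `Yuv`, `floor_div4`, `rct_forward`, `rct_inverse` and `rct_selfcheck` are made up for this example.

```c
#include <assert.h>

typedef struct { int r, g, b; } Rgb;
typedef struct { int y, u, v; } Yuv;

/* floor(x / 4) for either sign (C's "/" truncates toward zero). */
int floor_div4(int x)
{
    return (x >= 0) ? x / 4 : -((-x + 3) / 4);
}

/* Forward RCT: Y = floor((R + 2G + B) / 4), U = B - G, V = R - G. */
Yuv rct_forward(Rgb p)
{
    Yuv q;
    q.y = floor_div4(p.r + 2 * p.g + p.b);
    q.u = p.b - p.g;
    q.v = p.r - p.g;
    return q;
}

/* Inverse RCT: recovers the original integers exactly. */
Rgb rct_inverse(Yuv q)
{
    Rgb p;
    p.g = q.y - floor_div4(q.u + q.v);
    p.r = q.v + p.g;
    p.b = q.u + p.g;
    return p;
}

/* Round-trip check over a coarse grid of 8-bit pixel values. */
void rct_selfcheck(void)
{
    for (int r = 0; r <= 255; r += 51)
        for (int g = 0; g <= 255; g += 51)
            for (int b = 0; b <= 255; b += 51) {
                Rgb in = { r, g, b };
                Rgb out = rct_inverse(rct_forward(in));
                assert(out.r == r && out.g == g && out.b == b);
            }
}
```

Because the transform is lossless on integers, a round trip through `rct_forward` and `rct_inverse` returns the original pixel, which makes it easy to check a hardware version bit for bit against the C model.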
1. Specification Model

Function & Architecture, level AL (C/C++)
  - Specification Model: Design, Testbench, Software Model
  - 22 *.c and *.h files
  - 1468 lines of code

[Figure: Original Image -> Pre processing -> RGB to YUV conversion -> DWT -> Quantization -> Tier-1 Encoder -> Tier-2 Encoder -> Coded Image, with a Rate Control block]

Algorithm Profiling
  - Memory
  - Processing Time
  - Data Flow
DWT/Tier1 are the compute intensive blocks

[Chart: Memory Usage (x86) in MB, axis 0 to 6, series "Current" and "Sum"]

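Profiling singles out the DWT as one of the two compute-intensive blocks. The slides do not show the filter that was implemented; as an illustration of why a DWT maps well onto integer hardware, the sketch below gives a one-dimensional pass of the reversible 5/3 lifting transform from JPEG 2000 Part 1. It assumes an even-length signal and symmetric extension at the edges, and the function names are invented for this example.

```c
#include <assert.h>

/* floor(x / d) for d > 0 and either sign of x. */
int floor_div(int x, int d)
{
    int q = x / d;
    if ((x % d != 0) && (x < 0))
        q--;
    return q;
}

/* Forward 5/3 lifting: x[0..n-1] -> low-pass s[0..n/2-1], high-pass d[0..n/2-1]. */
void dwt53_forward(const int *x, int n, int *s, int *d)
{
    assert(n >= 2 && n % 2 == 0);
    int half = n / 2;
    for (int i = 0; i < half; i++) {           /* predict step */
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i];
        d[i] = x[2 * i + 1] - floor_div(x[2 * i] + right, 2);
    }
    for (int i = 0; i < half; i++) {           /* update step */
        int left = (i > 0) ? d[i - 1] : d[0];
        s[i] = x[2 * i] + floor_div(left + d[i] + 2, 4);
    }
}

/* Inverse 5/3 lifting: undo the update step, then the predict step. */
void dwt53_inverse(const int *s, const int *d, int n, int *x)
{
    assert(n >= 2 && n % 2 == 0);
    int half = n / 2;
    for (int i = 0; i < half; i++) {
        int left = (i > 0) ? d[i - 1] : d[0];
        x[2 * i] = s[i] - floor_div(left + d[i] + 2, 4);
    }
    for (int i = 0; i < half; i++) {
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i];
        x[2 * i + 1] = d[i] + floor_div(x[2 * i] + right, 2);
    }
}
```

Each output sample depends only on a few neighbours and uses adds and shifts, so the two loops can be unrolled or pipelined in hardware. A full 2-D DWT applies the same pass to rows and then columns.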
2. Functional System Model

Function & Architecture, levels AL (C/C++) and CA (Handel-C)
  - Specification Model: Design, Testbench, Software Model
  - System Model: Partitioning into HW and SW
  - Estimates: Cycles/speed/area…

[Figure: Original Image -> Pre processing -> RGB to YUV conversion -> DWT -> quantization -> Tier-1 Encoder -> Tier-2 Encoder -> Coded Image, with a Rate Control block]

    /* Handel-C */
    extern "C" sw_block(…);

    void main(void){
        while(1) par{
            sw_block(…);
            hw_block(…);
        }
    }

    void hw_block(…)
    { … }

    /* C */
    void sw_block(…)
    {
        …
    }

3. Architecture and Communication Model

Function & Architecture, levels AL (C/C++) and CA (Handel-C)

[Figure: Original Image -> Pre processing -> RGB to YUV conversion -> DWT -> quantization -> Tier-1 Encoder -> Tier-2 Encoder -> Coded Image, with a Rate Control block; two FIFOs and a DsmPortH2S port sit on the links between blocks]

DSM calls shown on the slide:
  - DsmRead(…)
  - DsmWrite(…)
  - DsmFlush(…)
Estimates: Dataflow/Cycles/speed/area…

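The FIFOs in this model decouple a producer block from a consumer block. The slide elides the arguments of the DSM calls, so they are not reproduced here; instead, the sketch below is a generic bounded FIFO in plain C that can stand in for such a link in a software model. The depth of 4 and the names `Fifo`, `fifo_init`, `fifo_push` and `fifo_pop` are assumptions for this example, not part of the DSM API.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_DEPTH 4   /* assumed depth, chosen only for illustration */

typedef struct {
    int    data[FIFO_DEPTH];
    size_t head;    /* index of the oldest element */
    size_t tail;    /* index of the next free slot */
    size_t count;   /* number of stored elements */
} Fifo;

void fifo_init(Fifo *f)
{
    assert(f != NULL);
    f->head = f->tail = f->count = 0;
}

/* Returns 1 on success, 0 if the FIFO is full (the producer must wait). */
int fifo_push(Fifo *f, int value)
{
    if (f->count == FIFO_DEPTH)
        return 0;
    f->data[f->tail] = value;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return 1;
}

/* Returns 1 on success, 0 if the FIFO is empty (the consumer must wait). */
int fifo_pop(Fifo *f, int *value)
{
    if (f->count == 0)
        return 0;
    *value = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 1;
}
```

The full and empty return codes correspond to the back-pressure a hardware FIFO applies: a producer stalls when the buffer is full and a consumer stalls when it is empty, and the chosen depth trades memory against how often either side stalls.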
4. Implementation Model

[Figure: processes A, B, C, D synthesized for a chosen Device Family; outputs EDIF and RTL; flow stage: Implementation]

    void main(){
        interface port_in…
        interface port_out…
        …
    }

Estimations from Synthesis
  - DWT ~ 6% VII1000

5. Hardware Platform

[Figure: blocks A, B, C, D mapped onto uP, HW (DWT) and RAM resources; flow: EDIF -> P&R -> FPGA]

Implementation Model Estimations
  - DWT ~6%
From P&R Report for VII1000-4
  - Slices: 758
  - Device utilization: 7%
  - Speed (MHz): 151
  - Lines of code: 395
Board Level Integration
  - Specific I/O Implementations
  - Pin Location constraints
Implementation
  • Microblaze + Xilinx FPGA
  • Nios + Altera FPGA
  • Xilinx V2Pro
  • Toshiba MeP + FPGA
  • PowerPC + PLB + FPGA
  • PC + FPGA PCI Card
  • …etc

JPEG2000 DWT Implementation

Example taken from a “Xilinx Design Challenge”
  - Comparison made with HDL approach
  - See Article in Xcell Volume 46
    http://www.xilinx.com/publications/xcellonline/xcell_46/xc_celoxica46.htm

                        C-Based Design
                        1st pass    2nd pass    Final       HDL
  Slices                646         546         758         800
  Device utilization    6%          5%          7%          7%
  Speed (MHz)           110         130         151         128
  Lines of code         386         386         395         435
  Design time (days)    6           7 (6+1)     7 (6+1)     20*
  Simulation time       5 mins      5 mins      20 mins     +6 hours

* Lena used as testbench throughout, input bit width 12, max 1K image width
* Doesn’t include partitioning spec. development

Observations
  - Comparable
  - Using C faster
  - Using C quicker
  - Expert vs Novice

JPEG2000 MQ coder Implementation

                                  Celoxica 1st Pass    Celoxica Final    HDL
  Slices                          1,347                1,999             620
  Device utilization              12%                  18%               6%
  Speed (MHz)                     89.5                 115.5             76
  Lines of code                   310                  330               800
  Design time (days)              10                   12 (10+2)         30*
  Simulation time for Lena jpeg   5 mins               5 mins            Hours

* Doesn’t include partitioning spec. development

> A common language base eased porting of the MQ coder source to hardware, and DSM allowed partitioning, co-verification and moving data between hardware & software
> Optimizations included adding parallelism, replacing for() loops with while() loops, & simplifying loop control
> Design developed in a unified design environment

Observations
  - HDL Smaller
  - HC Faster
  - HC Quicker
  - Expert vs Novice