High Productivity Computing System Program
Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers
Esam El-Araby1, Mohamed Taher1, Tarek El-Ghazawi1, and Kris Gaj2
1The George Washington University, 2George Mason University
{esam, mtaher, tarek}@gwu.edu, [email protected]
Introduction
What are Reconfigurable Computers (RCs)?
RCs are computing systems based on the close system-level integration of one or more general-purpose processors and one or more Field Programmable Gate Array (FPGA) chips
Benefits of RCs
A trade-off between traditional hardware and software
Hardware-like performance with software-like flexibility
Hardware can be modified on-the-fly
The programming model aims to shield programmers from the details of the hardware description
Orders of magnitude performance improvements over traditional systems
El-Araby
2
1008 / MAPLD2005
Introduction (cont'd)
Status of RCs
An important research subject due to the recent fast growth of FPGA technology
Evolved from:
“glue logic” between components to
Accelerator boards to
Stand-alone general-purpose RCs to
Parallel reconfigurable computers
However, multiple challenges remain to be resolved
Challenges
Performance
I/O Bandwidth
Significant Configuration Latency
Some systems spend 25% to 98% of their execution time performing reconfiguration
Need for Efficient OS and Run-Time Reconfiguration Management
Reconfiguration methods in current systems are not fully dynamic
Ease of Use
Compilers/Languages
HDLs (VHDL and Verilog) are hard for application scientists to use
HLLs and simple interfaces
Debuggers
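To put the configuration-latency figures in perspective, here is a rough sketch of an Amdahl-style bound (an illustration, not a measurement from the slides): if a fraction f of execution time goes to reconfiguration, eliminating that overhead entirely speeds the run up by at most 1/(1 - f). The function name is hypothetical.

```python
def max_speedup_without_reconfig(reconfig_fraction: float) -> float:
    """Upper bound on speedup if all reconfiguration time were eliminated."""
    if not 0.0 <= reconfig_fraction < 1.0:
        raise ValueError("reconfig_fraction must be in [0, 1)")
    return 1.0 / (1.0 - reconfig_fraction)


# The two endpoints quoted above: 25% and 98% of execution time.
for fraction in (0.25, 0.98):
    bound = max_speedup_without_reconfig(fraction)
    print(f"{fraction:.0%} reconfiguration -> at most {bound:.2f}x faster")
```

At the low end the possible gain is modest; at the high end reconfiguration dominates the run, which is why run-time reconfiguration management appears among the challenges.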
SRC Architecture
(Hi-Bar™ Based Systems)
Hi-Bar sustains 1.4 GB/s per port with 180 ns latency per tier
Up to 256 input and 256 output ports with two tiers of switch
Common Memory (CM) has controller with DMA capability
Controller can perform other functions such as scatter/gather
Up to 8 GB DDR SDRAM supported per CM node
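As a back-of-the-envelope sketch, the Hi-Bar figures above can be combined into a simple transfer-time estimate: per-tier latency plus payload size divided by the sustained port bandwidth. This model ignores DMA setup and contention, and it assumes 1 GB = 10^9 bytes and one byte per pixel; the function and constant names are illustrative only.

```python
HIBAR_BANDWIDTH_BYTES_PER_S = 1.4e9  # 1.4 GB/s sustained per port (from the slide)
HIBAR_LATENCY_S_PER_TIER = 180e-9    # 180 ns latency per switch tier (from the slide)


def hibar_transfer_time(num_bytes: int, tiers: int = 1) -> float:
    """Estimate the seconds needed to move num_bytes across `tiers` switch tiers."""
    if tiers not in (1, 2):
        raise ValueError("the slide describes at most two tiers of switch")
    latency = tiers * HIBAR_LATENCY_S_PER_TIER
    return latency + num_bytes / HIBAR_BANDWIDTH_BYTES_PER_S


image_bytes = 512 * 512  # assumption: one byte per pixel
print(f"one-tier transfer: {hibar_transfer_time(image_bytes) * 1e6:.1f} us")
print(f"two-tier transfer: {hibar_transfer_time(image_bytes, tiers=2) * 1e6:.1f} us")
```

For payloads of this size the bandwidth term dominates and the switch latency is almost negligible.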
[Figure: SRC-6 system diagram. Labels: SRC Hi-Bar Switch, SNAP™, MAP®, P with Memory, PCI-X, Common Memory, Chaining GPIO, Disk, Gig Ethernet etc., Storage Area Network, Local Area Network, Wide Area Network, Customers’ Existing Networks.]
Source: [SRC]
SRC Programming Environment
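[Figure: SRC compilation flow.]
Application sources (.c or .f files) go to the P Compiler and the MAP Compiler
The compilers produce object files (.o files); the MAP Compiler also produces .v files
User macro HDL sources (.vhd or .v files) and the .v files go through logic synthesis, producing .edf files
Place & Route turns the .edf files into configuration bitstreams (.bin files)
The Linker combines the .o files and .bin files into the application executable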
SRC Application Simulation Process
[Figure: SRC application simulation flow. Labels: HLL source, Compiler “Front-end”, CFGDFG, Verilog Generator, Verilog, DFG Behavioral Simulation, Synthesis, User Chip Level Simulation, Place and Route, logic.bin, Macro Definition, Macro Verilog Code, Macro Emulation C Code (Info File), Optional.]
Source: [SRC]
Steps to Final Logic
DFG Simulation
Verifies memory allocation
Verifies data movement
Uses real run time environment
Emulates the CM & OBM relationships
Simulates User Logic
[Figure: flow stages HLL Source, DFG Simulation.]
Source: [SRC]
Steps to Final Logic
DFG Simulation
User Logic Simulation
Test user developed macros
The application becomes the “test bench”
Push the generated logic one step closer to the actual hardware implementation
Requires “logic designer mentality” for debugging
Gives full visibility into the logic
[Figure: flow stages HLL Source, DFG Simulation, UL Simulation.]
Source: [SRC]
Steps to Final Logic
DFG Emulation
User Logic Simulation
MAP Hardware Execution
Full execution using ComList and User Logic on MAP
[Figure: flow stages HLL Source, DFG Simulation, UL Simulation, MAP Execution.]
Source: [SRC]
SGI Systems
(System Architecture)
[Figure: SGI system architecture diagram showing nodes of the types below connected through the system interconnect.]
Legend:
R: NUMAlink system interconnect
C: General-purpose compute nodes
RASC: Reconfigurable Application Specific Computing
V: Integrated graphics/visualization
IO: Peer-attached general purpose I/O
RASC Architecture
RASC Architecture (cont'd)
Design Flow
(HDLs)
[Figure: RASC design flow for HDLs, with design iterations between the stages.]
On the IA-32 Linux machine:
Design Entry (Verilog, VHDL): .v, .vhd
Design Verification: Behavioral Simulation (VCS, Modelsim)
Design Synthesis (Synplify Pro, Amplify): .edf
Design Implementation (ISE): .ncd, .pcf
Static Timing Analysis (ISE Timing Analyzer)
Metadata Processing (Python): .cfg
Generated bitstream: .bin
On the Altix:
Device Programming (.c; RASC Abstraction Layer, Device Manager, Device Driver)
Real-time Verification (gdb)
Design Flow
(HLLs)
[Figure: RASC design flow for HLLs.]
On the IA-32 Linux machine:
HLL Design Entry (Handel-C, Impulse C, Mitrion C, Viva)
RTL Generation and Integration with Core Services: .v, .vhd
Design Verification: Behavioral Simulation (VCS, Modelsim)
Design Synthesis (Synplify Pro, Amplify): .edf
Design Implementation (ISE): .ncd, .pcf
Static Timing Analysis (ISE Timing Analyzer)
Metadata Processing (Python): .cfg
Generated bitstream: .bin
On the Altix:
Device Programming (.c; RASC Abstraction Layer, Device Manager, Device Driver)
Real-time Verification (gdb)
Application Programming Interface
Rasclib:
Resource allocation in conjunction with the RASC Device Manager
Data movement to/from the COP via DMA engines
Algorithm control (start, stop, single step, stepN)
Automatic scaling across multiple devices
Interfaces necessary for debugging
Abstraction Layer: Algorithm API
The Abstraction Layer’s algorithm API mirrors the COP API with a few additions that enable wide scaling and deep scaling.
[Figure: two diagrams in which an Application passes Input Data through an Algorithm to Output Data, one drawn with three COPs and one with two COPs.]
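The two scaling modes can be illustrated with a small software-only sketch. It assumes that wide scaling means running the same algorithm on several COPs, each given a slice of the input, and that deep scaling means chaining COPs so that one stage's output feeds the next. The functions below are hypothetical stand-ins and are not part of rasclib.

```python
from typing import Callable, List, Sequence

# A "COP algorithm" is modelled as a plain function from a list to a list.
Algorithm = Callable[[List[int]], List[int]]


def wide_scale(data: Sequence[int], algorithm: Algorithm, num_cops: int) -> List[int]:
    """Split data into num_cops contiguous chunks and run the same algorithm on each."""
    if num_cops < 1:
        raise ValueError("num_cops must be at least 1")
    chunk = -(-len(data) // num_cops)  # ceiling division
    output: List[int] = []
    for i in range(num_cops):
        output.extend(algorithm(list(data[i * chunk:(i + 1) * chunk])))
    return output


def deep_scale(data: Sequence[int], stages: Sequence[Algorithm]) -> List[int]:
    """Pass data through a chain of stages, one per (modelled) COP."""
    result = list(data)
    for stage in stages:
        result = stage(result)
    return result


def double(xs: List[int]) -> List[int]:
    return [2 * x for x in xs]


def increment(xs: List[int]) -> List[int]:
    return [x + 1 for x in xs]


print(wide_scale([1, 2, 3, 4, 5], double, num_cops=3))
print(deep_scale([1, 2], [double, increment]))
```

Wide scaling in this sketch only suits algorithms whose output for one chunk does not depend on neighbouring chunks; a filter with overlapping windows would need the chunks to overlap.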
RASC Debugging
Based on the open-source GNU Debugger (GDB)
Uses extensions to current command set
Can debug host application and FPGA
Provides notification when FPGA starts or stops
Supplies information on FPGA characteristics
Can “single-step” or “run N steps” of the algorithm
Dumps data regarding the set of “registers” that are visible when the FPGA is active
Applications of DWT
Pattern recognition
Feature extraction
Metallurgy: characterization of rough surfaces
Trend detection:
Finance: exploring variation of stock prices
Perfect reconstruction
Communications: wireless channel signals
De-noising noisy data
FBI fingerprint compression
Detecting self-similarity in a time series
Video compression – JPEG 2000
Hyperspectral Dimension Reduction
Image Registration
Multi-Resolution DWT Decomposition
(Mallat Algorithm)
The input image is first convolved along the rows by the two filters L and H and decimated along the columns by two, resulting in two "column-decimated" images, L and H
Each of the two images, L and H, is then convolved along the columns by the two filters L and H and decimated along the rows by two
This decomposition results in four images: LL, LH, HL and HH
The LL image is taken as the new input to perform the next level of decomposition
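The decomposition steps above can be sketched in pure Python as follows. This is a minimal software illustration, not the hardware design from the slides: it assumes periodic (circular) boundary handling, uses averaging Haar taps (0.5, 0.5) and (0.5, -0.5) as an example filter pair, and names the sub-bands by row filter first, column filter second. All function names are hypothetical.

```python
def filter_and_decimate(signal, taps):
    """Circularly convolve signal with taps, then keep every second sample."""
    n = len(signal)
    full = [sum(t * signal[(i - k) % n] for k, t in enumerate(taps)) for i in range(n)]
    return full[1::2]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def filter_columns(image, taps):
    """Apply filter_and_decimate down each column (this halves the row count)."""
    return transpose([filter_and_decimate(col, taps) for col in transpose(image)])


def dwt2_level(image, low, high):
    """One decomposition level: returns the LL, LH, HL and HH sub-images."""
    # Row pass: filter each row, dropping every second column.
    img_l = [filter_and_decimate(row, low) for row in image]
    img_h = [filter_and_decimate(row, high) for row in image]
    # Column pass on each of the two column-decimated images.
    return (filter_columns(img_l, low), filter_columns(img_l, high),
            filter_columns(img_h, low), filter_columns(img_h, high))


def dwt2(image, low, high, levels):
    """Multi-resolution decomposition: LL is fed back in at every level."""
    details = []
    ll = image
    for _ in range(levels):
        ll, lh, hl, hh = dwt2_level(ll, low, high)
        details.append((lh, hl, hh))
    return ll, details


HAAR_LOW, HAAR_HIGH = [0.5, 0.5], [0.5, -0.5]  # example taps (assumption)
ll, lh, hl, hh = dwt2_level([[1, 3], [5, 7]], HAAR_LOW, HAAR_HIGH)
print(ll, lh, hl, hh)
```

Each level halves both image dimensions, so an N x N input yields four N/2 x N/2 sub-images, and only LL is decomposed further.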
DWT Implementation (Top-Level)
[Figure: top-level block diagram of dwt_filter. The 8-bit input data_in (7:0) feeds two filter blocks, LP_filter and HP_filter. Each filter has nine 4-bit coefficient inputs, LP00 to LP08 and HP00 to HP08, each (3:0), connected to ports coeff00 to coeff08. A decimator_2 block takes data_in_L and data_in_H and produces data_out_L and data_out_H together with out_en, driving the 8-bit outputs LOW (7:0) and HIGH (7:0). All blocks are driven by clk and rst.]
FIR Module
(Transposed Form)
[Figure: nine-tap transposed-form FIR filter. data_in drives nine multipliers (MUL) with coefficients coeff08 down to coeff00. The products are summed by a chain of adders (ADD) separated by registers (D/Q) clocked by clk and reset by rst, and the last adder drives data_out.]
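The transposed-form structure can be modelled in software as a minimal sketch: every multiplier sees the same input sample, and each register stores its own product plus the value held by the register upstream of it. This is a cycle-level illustration with hypothetical names and unbounded Python integers, not the fixed-point VHDL design behind the slide.

```python
class TransposedFIR:
    """Cycle-level model of a transposed-form FIR filter."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        # One register between each pair of adjacent taps, cleared as if by rst.
        self.regs = [0] * (len(self.coeffs) - 1)

    def step(self, x):
        """Consume one input sample (one clock cycle) and return one output."""
        products = [c * x for c in self.coeffs]  # all MULs share the same input
        y = products[0] + (self.regs[0] if self.regs else 0)
        n = len(self.regs)
        # Register k latches product k+1 plus the old value of register k+1.
        self.regs = [
            products[k + 1] + (self.regs[k + 1] if k + 1 < n else 0)
            for k in range(n)
        ]
        return y


def fir_filter(samples, coeffs):
    """Run a whole sample sequence through a fresh filter instance."""
    fir = TransposedFIR(coeffs)
    return [fir.step(x) for x in samples]


# Feeding an impulse returns the coefficients in order, as for any FIR filter.
print(fir_filter([1, 0, 0, 0], [1, 2, 3]))
```

Compared with the direct form, the transposed form has no long adder tree: each adder's result goes straight into a register, which keeps the combinational path to one multiplier and one adder per stage.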
DWT End-to-End Throughput
(SRC-6 & SGI-RASC vs. P4)
Filter Type     SRC-6 (MB/sec)   SGI RASC (MB/sec)   P4 (MB/sec)
Daub1 (Haar)    199              130                 12
Daub2           199              130                 9.98
Daub3           199              130                 8.73
Image Size = 512 X 512 pixels
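The speed-up over the microprocessor follows directly from these throughput figures. The sketch below assumes the three values listed for each filter are, in order, SRC-6, SGI RASC, and P4; the dictionary layout and function name are illustrative.

```python
# End-to-end throughput in MB/sec, transcribed from the slide.
throughput_mb_s = {
    "Daub1 (Haar)": {"SRC-6": 199, "SGI RASC": 130, "P4": 12},
    "Daub2": {"SRC-6": 199, "SGI RASC": 130, "P4": 9.98},
    "Daub3": {"SRC-6": 199, "SGI RASC": 130, "P4": 8.73},
}


def speedups(row):
    """Throughput of each reconfigurable system divided by the P4 throughput."""
    return {name: mbps / row["P4"] for name, mbps in row.items() if name != "P4"}


for filter_type, row in throughput_mb_s.items():
    s = speedups(row)
    print(f"{filter_type}: SRC-6 {s['SRC-6']:.1f}x, SGI RASC {s['SGI RASC']:.1f}x")
```

The reconfigurable systems' throughput is the same for all three filters while the P4 throughput drops as the filter grows, so the speed-up increases from Daub1 to Daub3.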
Conclusions
DWT is implemented on both SRC-6 and SGI-RASC systems
Similarities and differences are analyzed with regard to:
System hardware architecture
Ease of programming
Programming model
Development time
Hardware/software libraries
Performance
The speed-up vs. microprocessor is reported
Primary bottlenecks limiting the performance of both systems are recognized
The capability to share and port applications between the SRC and SGI systems is explored