High Productivity Computing System Program

Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers
Esam El-Araby1, Mohamed Taher1, Tarek El-Ghazawi1, and Kris Gaj2
1The George Washington University, 2George Mason University
{esam, mtaher, tarek}@gwu.edu, [email protected]
Introduction
• What are Reconfigurable Computers (RCs)?
  - RCs are computing systems based on the close system-level integration of one or more general-purpose processors and one or more Field Programmable Gate Array (FPGA) chips
• Benefits of RCs
  - A trade-off between traditional hardware and software
  - Hardware-like performance with software-like flexibility
  - Hardware can be modified on-the-fly
  - The programming model is aimed at shielding programmers from the details of the hardware description
  - Orders of magnitude performance improvements over traditional systems
El-Araby
2
1008 / MAPLD2005
Introduction (cont'd)
• Status of RCs
  - An important research subject due to the recent rapid growth of FPGA technology
  - Evolved from:
    "glue logic" between components, to
    accelerator boards, to
    stand-alone general-purpose RCs, to
    parallel reconfigurable computers
• However, multiple challenges remain to be resolved
Challenges
• Performance
  - I/O bandwidth
  - Significant configuration latency
    Some systems spend 25% to 98% of their execution time performing reconfiguration
  - Need for efficient OS and run-time reconfiguration management
    Reconfiguration methods in current systems are not fully dynamic
• Ease of Use
  - Compilers/Languages
    HDLs (VHDL and Verilog) are hard for application scientists to use
    HLLs and simple interfaces
  - Debuggers
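To put the 25% to 98% reconfiguration figure in perspective, the sketch below computes the best-case gain from hiding reconfiguration entirely. It assumes reconfiguration and useful computation never overlap, which the slide does not state, and the function name is purely illustrative.

```python
def gain_if_reconfiguration_hidden(reconfig_fraction):
    """Best-case speed-up if reconfiguration time were removed entirely.

    Assumes reconfiguration and useful work never overlap (an assumption,
    not something stated in the talk).
    """
    if not 0.0 <= reconfig_fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - reconfig_fraction)


# The two ends of the range quoted on the slide.
for fraction in (0.25, 0.98):
    gain = gain_if_reconfiguration_hidden(fraction)
    print(f"{fraction:.0%} reconfiguration -> up to {gain:.1f}x faster")
```

Under this simple model, the low end of the range leaves little to gain (about 1.3x), while the high end leaves roughly 50x on the table, which is why configuration latency is listed as a performance challenge.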
SRC Architecture (Hi-Bar™ Based Systems)
• Hi-Bar sustains 1.4 GB/s per port with 180 ns latency per tier
• Up to 256 input and 256 output ports with two tiers of switch
• Common Memory (CM) has controller with DMA capability
• Controller can perform other functions such as scatter/gather
• Up to 8 GB DDR SDRAM supported per CM node
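As a rough illustration of what the Hi-Bar figures imply, the sketch below estimates the time to stream one 512 x 512 image through a single port. It is a first-order model only: it assumes 1 GB = 10^9 bytes, one byte per pixel, full sustained bandwidth, and no contention or DMA setup cost, none of which the slide specifies.

```python
PORT_BANDWIDTH_BYTES_PER_S = 1.4e9   # 1.4 GB/s per port, taking 1 GB = 10**9 bytes
TIER_LATENCY_S = 180e-9              # 180 ns per switch tier


def transfer_time_s(num_bytes, tiers=1):
    """First-order estimate: switch latency plus streaming time at full rate."""
    return tiers * TIER_LATENCY_S + num_bytes / PORT_BANDWIDTH_BYTES_PER_S


image_bytes = 512 * 512  # one byte per pixel (assumption)
print(f"{transfer_time_s(image_bytes, tiers=2) * 1e6:.1f} microseconds")
```

With these assumptions the transfer takes on the order of 188 microseconds, and the switch latency (0.36 microseconds across two tiers) is negligible next to the streaming time.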
[Figure: SRC-6 system diagram. Labeled components: SRC Hi-Bar switch; µP nodes with memory, PCI-X, and SNAP™ interfaces; MAP® processors with chaining GPIO; Common Memory nodes; disk and storage area network; Gigabit Ethernet etc. to local and wide area networks (customers' existing networks). Source: [SRC]]
SRC Programming Environment
[Figure: SRC compilation flow. Application sources (.c or .f files) feed the µP compiler and the MAP compiler. HDL sources and user macro sources (.vhd or .v files), together with the .v files from the MAP compiler, go through logic synthesis to .edf files and then place & route, which produces .bin configuration bitstreams. The compilers produce object (.o) files, and the linker produces the application executable.]
SRC Application Simulation Process
[Figure: SRC application simulation flow. The HLL source passes through the compiler front end to a CFG/DFG. The Verilog generator, using macro definitions, produces Verilog code for synthesis, place and route, and logic.bin. DFG behavioral simulation uses macro emulation C code (info file); user chip level simulation uses the macro Verilog code and is optional. Source: [SRC]]
Steps to Final Logic
• DFG Simulation
  - Verifies memory allocation
  - Verifies data movement
  - Uses real run-time environment
  - Emulates the CM & OBM relationships
  - Simulates User Logic
Flow: HLL Source → DFG Simulation
Source: [SRC]
Steps to Final Logic
• DFG Simulation
• User Logic Simulation
  - Test user-developed macros
  - The application becomes the "test bench"
  - Push the generated logic one step closer to the actual hardware implementation
  - Requires "logic designer mentality" for debugging
  - Gives full visibility into the logic
Flow: HLL Source → DFG Simulation → UL Simulation
Source: [SRC]
Steps to Final Logic
• DFG Emulation
• User Logic Simulation
• MAP Hardware Execution
  - Full execution using ComList and User Logic on MAP
Flow: HLL Source → DFG Simulation → UL Simulation → MAP Execution
Source: [SRC]
SGI Systems
(System Architecture)
[Figure: SGI system architecture, built from the following node types]
• R: NUMAlink system interconnect
• C: General-purpose compute nodes
• RASC: Reconfigurable Application Specific Computing
• V: Integrated graphics/visualization
• IO: Peer-attached general purpose I/O
RASC Architecture
RASC Architecture (cont'd)
Design Flow (HDLs)
[Figure: RASC design flow for HDL designs, with design iterations]
On the IA-32 Linux machine:
• Design Entry (Verilog, VHDL): .v, .vhd
• Design Verification: Behavioral Simulation (VCS, Modelsim); Static Timing Analysis (ISE Timing Analyzer)
• Design Synthesis (Synplify Pro, Amplify): .edf
• Metadata Processing (Python): .cfg
• Design Implementation (ISE): .ncd, .pcf, .bin
On the Altix:
• Device Programming (RASC Abstraction Layer, Device Manager, Device Driver): .c
• Real-time Verification (gdb)
Design Flow (HLLs)
[Figure: RASC design flow for HLL designs]
On the IA-32 Linux machine:
• HLL Design Entry (Handel-C, Impulse C, Mitrion C, Viva)
• RTL Generation and Integration with Core Services: .v, .vhd
• Design Verification: Behavioral Simulation (VCS, Modelsim); Static Timing Analysis (ISE Timing Analyzer)
• Design Synthesis (Synplify Pro, Amplify): .edf
• Metadata Processing (Python): .cfg
• Design Implementation (ISE): .ncd, .pcf, .bin
On the Altix:
• Device Programming (RASC Abstraction Layer, Device Manager, Device Driver): .c
• Real-time Verification (gdb)
Application Programming Interface
Rasclib:
• Resource allocation in conjunction with the RASC Device Manager
• Data movement to/from the COP via DMA engines
• Algorithm control (start, stop, single step, stepN)
• Automatic scaling across multiple devices
• Interfaces necessary for debugging
Abstraction Layer: Algorithm API
The Abstraction Layer's algorithm API mirrors the COP API with a few additions that enable wide scaling and deep scaling.
[Figure: two diagrams, each showing an Application passing Input Data to an Algorithm running on multiple COPs and receiving Output Data]
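The slide names wide scaling and deep scaling without defining them. The sketch below illustrates one common reading, which is an assumption here: wide scaling splits the input across several COPs that run the same algorithm, and deep scaling chains COPs so that each stage's output feeds the next. The functions are plain Python stand-ins with hypothetical names, not rasclib calls.

```python
def wide_scale(algorithm, data, num_cops):
    """Split data into up to num_cops contiguous chunks and apply one algorithm to each.

    Each chunk stands in for the work given to one COP. Chunks are processed
    sequentially here, whereas real devices would run concurrently.
    """
    if num_cops < 1:
        raise ValueError("need at least one COP")
    chunk = max(1, -(-len(data) // num_cops))  # ceiling division
    results = []
    for start in range(0, len(data), chunk):
        results.extend(algorithm(data[start:start + chunk]))
    return results


def deep_scale(algorithms, data):
    """Chain algorithms so each stage consumes the previous stage's output."""
    for algorithm in algorithms:
        data = algorithm(data)
    return data


def double(values):
    return [2 * v for v in values]


def increment(values):
    return [v + 1 for v in values]


print(wide_scale(double, [1, 2, 3, 4, 5], num_cops=2))
print(deep_scale([double, increment], [1, 2]))
```

In this reading, wide scaling leaves the result unchanged and only distributes the work, while deep scaling composes different processing stages.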
RASC Debugging
• Based on the open-source GNU Debugger (GDB)
• Uses extensions to the current command set
• Can debug the host application and the FPGA
• Provides notification when the FPGA starts or stops
• Supplies information on FPGA characteristics
• Can "single-step" or "run N steps" of the algorithm
• Dumps data regarding the set of "registers" that are visible when the FPGA is active
Applications of DWT
• Pattern recognition
• Feature extraction
  - Metallurgy: characterization of rough surfaces
• Trend detection
  - Finance: exploring variation of stock prices
• Perfect reconstruction
  - Communications: wireless channel signals
• De-noising noisy data
• FBI fingerprint compression
• Detecting self-similarity in a time series
• Video compression: JPEG 2000
• Hyperspectral Dimension Reduction
• Image Registration
Multi-Resolution DWT Decomposition (Mallat Algorithm)
• The input image is first convolved along the rows with the two filters L and H and decimated along the columns by two, resulting in two "column-decimated" images, L and H
• Each of the two images, L and H, is then convolved along the columns with the two filters L and H and decimated along the rows by two
• This decomposition results in four images: LL, LH, HL, and HH
• The LL image is taken as the new input to perform the next level of decomposition
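The steps above can be sketched in software as follows. This is a minimal pure-Python illustration, not the hardware design from the talk. Several choices are assumptions: periodic extension at the image borders, keeping the odd-indexed samples when decimating, naming sub-bands with the row filter first (LH = L on rows, then H on columns), and averaging Haar taps of 0.5 rather than the orthonormal 1/sqrt(2).

```python
def filter_decimate(seq, taps):
    """Circularly convolve seq with taps, then keep every second sample."""
    n = len(seq)
    full = [sum(t * seq[(i - k) % n] for k, t in enumerate(taps))
            for i in range(n)]
    return full[1::2]  # decimate by two (odd-indexed samples, an assumption)


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def column_pass(image, taps):
    """Filter every column, which halves the number of rows."""
    return transpose([filter_decimate(col, taps) for col in transpose(image)])


def dwt2_level(image, lo, hi):
    """One decomposition level: returns the LL, LH, HL, HH sub-images."""
    # Row pass: filtering each row and decimating halves the number of columns.
    low = [filter_decimate(row, lo) for row in image]
    high = [filter_decimate(row, hi) for row in image]
    return (column_pass(low, lo), column_pass(low, hi),
            column_pass(high, lo), column_pass(high, hi))


def dwt2(image, lo, hi, levels):
    """Multi-level decomposition: LL becomes the input of the next level."""
    details = []
    approx = image
    for _ in range(levels):
        approx, lh, hl, hh = dwt2_level(approx, lo, hi)
        details.append((lh, hl, hh))
    return approx, details


HAAR_LO = [0.5, 0.5]    # averaging low-pass taps (assumed normalization)
HAAR_HI = [0.5, -0.5]   # differencing high-pass taps

image = [[1, 1, 2, 2],
         [1, 1, 2, 2],
         [3, 3, 4, 4],
         [3, 3, 4, 4]]
ll, lh, hl, hh = dwt2_level(image, HAAR_LO, HAAR_HI)
print(ll)
```

Because the test image is constant within each 2 x 2 block, the LL image holds the block values and the three detail images are all zero, which makes the sketch easy to check by hand.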
DWT Implementation (Top-Level)
[Figure: top-level block diagram of dwt_filter. Input data_in (7:0) feeds LP_filter, with coefficient inputs LP00 to LP08 (3:0 each), and HP_filter, with coefficient inputs HP00 to HP08 (3:0 each). The filter outputs feed decimator_2 (ports data_in_L, data_in_H, data_out_L, data_out_H, out_en), which drives the outputs LOW (7:0), HIGH (7:0), and out_en. All blocks are driven by clk and rst.]
FIR Module
(Transposed Form)
[Figure: nine-tap transposed-form FIR filter. data_in is multiplied (MUL) by each of the coefficients coeff08 down to coeff00 in parallel. Between taps, the running partial sum is held in a register (D/Q, driven by clk and rst) and added (ADD) to the next tap's product; the last adder, at coeff00, drives data_out.]
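The transposed form can be checked in software with a cycle-by-cycle model. The sketch below is an illustration under assumptions: registers reset to zero, arithmetic is exact (no bit-width truncation as in the 8-bit data and 4-bit coefficient hardware), and the function names are hypothetical. It compares the transposed structure against a direct-form sum to show that both compute the same output.

```python
def fir_transposed(samples, coeffs):
    """Simulate a transposed-form FIR filter one clock cycle at a time.

    regs[k] is the register between tap k+1 and tap k; it holds the
    partial sum that tap k adds to its own product on the next cycle.
    """
    taps = len(coeffs)
    regs = [0] * (taps - 1)  # registers cleared by reset (assumption)
    outputs = []
    for x in samples:
        # Every multiplier sees the same input sample in the same cycle.
        products = [c * x for c in coeffs]
        outputs.append(products[0] + (regs[0] if regs else 0))
        # Clock edge: each register latches its tap's product plus the
        # value of the register one stage further from the output.
        next_regs = []
        for k in range(taps - 1):
            upstream = regs[k + 1] if k + 1 < taps - 1 else 0
            next_regs.append(products[k + 1] + upstream)
        regs = next_regs
    return outputs


def fir_direct(samples, coeffs):
    """Reference: y[n] = sum over k of coeffs[k] * x[n - k]."""
    return [sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
            for n in range(len(samples))]


signal = [1, 2, 3, 4, 0, 0]
taps = [3, 2, 1]
print(fir_transposed(signal, taps))
print(fir_direct(signal, taps))
```

One reason this structure suits FPGAs is that the input is broadcast to all multipliers and the only registered path is one adder deep, so the critical path does not grow with the number of taps.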
DWT End-to-End Throughput
(SRC-6 & SGI-RASC vs. P4)
Filter Type     SRC-6 (MB/sec)   SGI RASC (MB/sec)   P4 (MB/sec)
Daub1 (Haar)    199              130                 12
Daub2           199              130                 9.98
Daub3           199              130                 8.73
Image Size = 512 X 512 pixels
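The speed-up over the P4 follows directly from these throughput figures. The sketch below simply divides each reconfigurable system's throughput by the P4 throughput for the same filter; it assumes the three numbers in each row are ordered SRC-6, SGI RASC, P4, as the slide title suggests.

```python
# End-to-end throughput in MB/sec, as read from the table above.
throughput_mb_s = {
    "Daub1 (Haar)": {"SRC-6": 199, "SGI RASC": 130, "P4": 12},
    "Daub2": {"SRC-6": 199, "SGI RASC": 130, "P4": 9.98},
    "Daub3": {"SRC-6": 199, "SGI RASC": 130, "P4": 8.73},
}


def speedups(table, baseline="P4"):
    """Ratio of each system's throughput to the baseline, per filter type."""
    return {
        filter_type: {
            system: value / row[baseline]
            for system, value in row.items()
            if system != baseline
        }
        for filter_type, row in table.items()
    }


for filter_type, row in speedups(throughput_mb_s).items():
    summary = ", ".join(f"{system}: {ratio:.1f}x" for system, ratio in row.items())
    print(f"{filter_type}: {summary}")
```

The reconfigurable throughput stays constant across filter types while the P4 throughput falls as the filters get longer, so the speed-up grows from Daub1 to Daub3.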
Conclusions
• DWT is implemented on both SRC-6 and SGI-RASC systems
• Similarities and differences are analyzed with regard to:
  - System hardware architecture
  - Ease of programming
    Programming model
    Development time
    Hardware/software libraries
  - Performance
    The speed-up vs. microprocessor is reported
• Primary bottlenecks limiting the performance of both systems are recognized
• The capability to share and port applications between the SRC and SGI systems is explored