
Reconfigurable Computing Systems:
An Overview
Presented by:
Gurwant Kaur Koonar
Vijay Pandya
14th March 2003
Introduction



Reconfigurable Computing (RC) is an emerging
paradigm for digital systems design. Its key feature
is the ability to perform computations in
hardware, aiming to combine ASIC-like performance with
the flexibility of general-purpose (GP) processors.
Technology improvements have made possible new
programmable logic devices (FPGAs, CPLDs).
Objective of the talk: give an overview of
reconfigurable computing, its hardware architectures,
and the software that targets these machines, such as
compilation tools.
Definition

Reconfigurable Computing (RC) is a computing
paradigm in which algorithms are implemented
as a temporally and spatially ordered set of very
complex tasks. These tasks are executed on a
large set of interconnected programmable
hardware elements
Definition (cont’d)




computing paradigm - defines the basic RC computing
model without reference to implementation.
very complex tasks – commonly referred to as
configurations. RC tasks require more time than general-
purpose computing instructions and more area than the
typical general-purpose execution unit.
Spatial and temporal partitioning – algorithms are
decomposed into tasks in both the space and time
domains.
hardware elements - at their core, RC devices consist of
a very large set of simple programmable elements,
collectively called the Reconfigurable Execution Unit
(REU)

General Characteristics of RC





Stored configuration algorithms
No software
Pipeline architectures are common
Real-time applications
Advantages

Flexible


Cost comparable to GPP




Algorithm parallelism exploited in custom architecture
Problem specific operators and control
High-performance


Hardware is readily available
Shorter development cycle than ASICs
Parallelism


Configurable
Reduces memory dependence and exploits fine-grained algorithm parallelism.
Timesharing

Hardware can be time multiplexed by multiple applications
Disadvantages


Additional area requirements
 Configuration memory (internal/external),
Internal switches and other hardware overhead
Time Overhead
 Device configuration, and internal switches
Traditional Computing

Using Application-Specific Integrated Circuits
(ASICs) to “hard-wire” an algorithm in hardware.







Extremely fast
Requires less silicon area
Less power hungry than GP architectures
Extremely inflexible
Expensive both in design and fabrication
Errors are difficult to correct
Examples: consumer electronics,
telecommunications, automotive industry
Traditional Computing(Cont'd)

General-purpose hardware, combined with
application-specific software





Extremely flexible due to versatile instruction set.
Much less expensive to develop.
Poor performance compared to ASICs.
Errors can be dynamically patched.
Examples: Commodity PC hardware running
commercial software.
Reasons for Poor Software
Performance





Fetching of instructions
Interpretation of instructions
Scheduling of instructions
Wrong mix of hardware resources to suit a
particular application’s needs
Reconfigurable computing is therefore intended
to fill the gap between hardware and software.
Flexibility and Efficiency
Tradeoffs
Can we call FPGAs
reconfigurable processing units?



Traditional FPGAs are configurable, but not
run-time reconfigurable
Traditional FPGAs expect to read their
configuration out of a serial EEPROM, one
bit at a time.
Therefore, the FPGA must be reprogrammed in
its entirety, and its previous internal
state cannot be captured beforehand.
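To give a feel for why bit-serial configuration makes full reprogramming costly, here is a minimal sketch that estimates the load time. The function name, the bitstream size and the clock rate are hypothetical values chosen for illustration; they are not figures from the talk or from any datasheet.

```python
def serial_config_time_s(bitstream_bits: int, clock_hz: float) -> float:
    """Seconds needed to shift in a bitstream at one bit per clock cycle."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    return bitstream_bits / clock_hz


# Hypothetical device: 1,000,000 configuration bits, 1 MHz serial clock.
full_load = serial_config_time_s(1_000_000, 1_000_000.0)
print(f"full reconfiguration: {full_load:.3f} s")
```

Because the whole device is reloaded every time, this cost is paid in full even when only a small part of the circuit changes.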
Features for Reconfigurable
Hardware



On-the-Fly Reprogrammability
Partial Reprogrammability
Externally-Visible Internal State
Kress ALU Array-III (KrAA-III)








instruction level parallelism
transparently scalable
fast routing and placement (seconds only)
dynamically and partially reconfigurable
(microseconds)
suitable for full custom design
on microprocessor chip: much higher acceleration than
by caches
on microprocessor chip: fast and low power by full
custom design
acceleration by massive run time to compile time
migration
Kress ALU Array-III (KrAA-III)


KrAA-III consists of PEs
called rDPU-III
(reconfigurable DataPath
Unit III) arranged in a
NEWS network.
The figure shows the
KrAA-III chip containing 9
rDPUs.
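The NEWS (North, East, West, South) interconnect can be sketched as a simple neighbour lookup on a grid. This is only an illustration: the 3 x 3 layout is inferred from the 9 rDPUs mentioned above, the function name is hypothetical, and the assumption that edge elements have no wrap-around links is not stated in the talk.

```python
def news_neighbors(row, col, rows=3, cols=3):
    """Return the N/E/W/S neighbours of the PE at (row, col) in a mesh.

    Assumption: the mesh has no wrap-around, so edge PEs have fewer links.
    """
    candidates = {
        "N": (row - 1, col),
        "E": (row, col + 1),
        "W": (row, col - 1),
        "S": (row + 1, col),
    }
    return {
        direction: (r, c)
        for direction, (r, c) in candidates.items()
        if 0 <= r < rows and 0 <= c < cols
    }


print(news_neighbors(1, 1))  # centre PE: four neighbours
print(news_neighbors(0, 0))  # corner PE: two neighbours
```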
Basic Architecture of today’s
commercial reconfigurable processor
Devices that combine an FPGA with a
standard processor core




Triscend’s E5 and A7
Altera’s two Excalibur families
Atmel’s FPSLIC
Chameleon Systems’ CS2000
Zippy Architecture



It is used to develop
reconfigurable processor
technology for the domain
of handheld and
wearable computing.
It investigates new
trade-offs between
performance, power
consumption and system
cost.
It is an international
research effort led by the
Swiss Federal Institute
of Technology.
Reconfigurable Computing Merging
Efficiency and Versatility
Hardware Design steps
Examples
SPLASH II
Multi-FPGA parallel computer with orchestrated systolic
communications to perform inter-FPGA data transfer
Garp
For general purpose loop acceleration
CMC Rapid Prototyping Platform
RC Applications

RC has demonstrated >10x performance density advantage
over microprocessors and DSPs





Pattern matching
Data encryption
Data compression
Video and image processing
Commercial Push




Handheld devices - PDAs, mobile phones, specialized tools
Networks - telecom switches, network routers, network bridges
High-performance computing – supercomputers, medical
appliances, robot navigation and planning
Defense – ballistic missiles, KV navigation, spacecraft
processing
RC Implementations

Hardware
Catalina Research Incorporated http://www.catalinaresearch.com/Chameleon
 Annapolis Microsystems http://www.annapmicro.com/Wildstar
 Alpha Data Parallel Systems - http://www.alpha-data.com
Tools
 Celoxica - http://www.celoxica.com
 Star Bridge Systems - http://www.starbridgesystems.com
 Annapolis Microsystems http://www.annapmicro.com/CoreFire

Content



Coupling Approaches (Reconfigurable Hardware
with General Processor)
Granularity of the FPGA as an RCS
Implementation Approaches





Compile Time Reconfiguration
Run Time Reconfiguration
Some more advantages
Challenges
Software-like design environment
Coupling Approaches for Reconfigurable
Hardware (RH)
RH can be coupled to GP as:




A functional unit (Tight Coupling)
A Co-processor
An Attached processing unit
A Standalone processing unit (Loosely
coupled)
Coupling Approaches Cont’d

As a Functional Unit:



Within a host processor (General purpose: GP)
Uses data-path of a host machine
As a Coprocessor:




Without constant supervision of the GP
GP initializes the RH
Independent parallel computation
Less communication overhead
Coupling Approaches Cont’d

As an attached processing unit:




Behaves as an additional processor
Memory Cache not visible
Independent Computation but high
communication overhead
As a Standalone:



The most loosely coupled to GP
Infrequent Communication with the GP
Independent computation for very long period of
time
Different levels of coupling
[Figure: block diagram of coupling levels. Labels: Workstation, Coprocessor, Attached Processing Unit, Standalone Processing Unit, CPU, FU, Memory Caches, I/O Interface]
Pros and Cons of different coupling
approaches

The tight integration




Very low communication overhead
RH cannot operate “alone” for long periods of
time
Amount of reconfigurable logic is limited
The loose integration


Greater parallelism
Higher communication overhead
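The trade-off between tight and loose coupling can be illustrated with a toy cost model: execution time is the accelerated compute time plus a fixed cost per transfer between the GP and the RH. The model, the function name and every number below are assumptions made for illustration only; none of them come from the talk.

```python
def accelerated_time(work_s, speedup, transfers, overhead_per_transfer_s):
    """Toy model: compute time on the RH plus GP<->RH communication cost."""
    return work_s / speedup + transfers * overhead_per_transfer_s


# Hypothetical parameters: tight coupling has cheap transfers but little
# logic (small speedup); loose coupling has more logic but costly transfers.
TIGHT = {"speedup": 4.0, "overhead_per_transfer_s": 1e-6}
LOOSE = {"speedup": 20.0, "overhead_per_transfer_s": 1e-3}

for transfers in (10, 100_000):
    tight = accelerated_time(10.0, transfers=transfers, **TIGHT)
    loose = accelerated_time(10.0, transfers=transfers, **LOOSE)
    print(f"{transfers:>7} transfers: tight={tight:.3f} s, loose={loose:.3f} s")
```

Under these assumed numbers the loosely coupled unit wins when communication is infrequent, and the tightly coupled unit wins when the GP and RH exchange data often.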
Logic Block Granularity


Refers to the size and complexity of the
configurable logic block (CLB)
Fine grained logic block





Less complex; e.g. the Altera FLEX 10K logic block consists of a single
4-input LUT with a flip-flop
Useful for bit-level manipulation
Can exceed the performance of a GP processor for operations
on data of variable bit width
Small area with a high amount of computation
(compact)
Suited to encryption and image processing applications
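A fine-grained logic block of the kind described above can be modelled in a few lines: a 16-bit truth table indexed by four input bits, optionally followed by a D flip-flop. This is a simplified sketch; the class and method names are hypothetical, and real devices add carry chains, routing and other features that are omitted here.

```python
class LogicElement:
    """Simplified model of a 4-input LUT followed by an optional flip-flop."""

    def __init__(self, truth_table: int, registered: bool = False):
        if not 0 <= truth_table < 2 ** 16:
            raise ValueError("a 4-input LUT holds exactly 16 configuration bits")
        self.truth_table = truth_table
        self.registered = registered
        self.q = 0  # flip-flop state, reset to 0

    def lut(self, a, b, c, d):
        """Combinational output: look up the bit selected by the inputs."""
        index = a | (b << 1) | (c << 2) | (d << 3)
        return (self.truth_table >> index) & 1

    def clock(self, a, b, c, d):
        """Advance one clock cycle and return the output seen in that cycle.

        In registered mode the flip-flop delays the LUT result by one cycle.
        """
        value = self.lut(a, b, c, d)
        if not self.registered:
            return value
        previous, self.q = self.q, value
        return previous


XOR4 = 0x6996  # bit i is the parity of i: a 4-input XOR
AND4 = 0x8000  # only index 15 (all inputs high) yields 1

xor_gate = LogicElement(XOR4)
print(xor_gate.lut(1, 0, 1, 1))
```

Reconfiguring the element amounts to loading a different 16-bit value, which is why the same hardware can serve as any 4-input Boolean function.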
Logic Block Granularity cont’d

Coarse grained logic block






Larger granularity of the CLB
Helps perform more complex operations
Four 2-bit inputs (GARP) and multiplier in each
logic block for 4 x 4 multiplication
Finite State Machine
Word-width (16-bit) data-path circuits
are implemented in very coarse-grained structures
Logic block is closer to a small processor
Implementation Approaches

Compile Time Reconfiguration (CTR)





Static implementation strategy
Single system wide configuration
Configuration doesn’t change during computation
Similar to using ASIC for application acceleration
Run Time Reconfiguration (RTR)



Dynamic implementation strategy
Multiple time-exclusive configurations
Dynamic hardware allocation (run-time)
RTR


Main task: dividing the algorithm into time-exclusive segments
Global RTR



Allocates whole FPGA resources for each
configuration
Single system wide configuration for each phase
Local RTR



Locally reconfigure subsets of logic at run-time
Partial reconfiguration, flexibility
Functional division of labor
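Dividing an algorithm into time-exclusive segments can be sketched as a packing problem: tasks are grouped into configurations, each of which must fit on the device. The greedy, in-order strategy below is an assumption chosen for brevity and is not the method used by any system named in the talk; the task names, areas and capacity are hypothetical.

```python
def temporal_partition(task_areas, capacity):
    """Greedily pack an ordered list of (name, area) into configurations.

    Each returned sub-list is one time-exclusive configuration whose total
    area does not exceed the device capacity.
    """
    partitions, current, used = [], [], 0
    for name, area in task_areas:
        if area > capacity:
            raise ValueError(f"task {name!r} does not fit on the device")
        if used + area > capacity:
            # Close the current configuration and start a new one.
            partitions.append(current)
            current, used = [], 0
        current.append(name)
        used += area
    if current:
        partitions.append(current)
    return partitions


# Hypothetical areas in logic blocks, device capacity of 100 blocks.
tasks = [("A", 60), ("B", 30), ("C", 50), ("D", 40)]
print(temporal_partition(tasks, capacity=100))
```

A real flow would also account for data dependencies and reconfiguration cost between segments, which is why the partitioning is usually refined iteratively.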
RTR Cont’d
Global RTR: LOAD A, EXE. A, LOAD B, EXE. B, LOAD C, EXE. C
[Figure: Local RTR timeline. Labels: A, A, B, D, EXE., EXE., C]
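The difference between the two schemes can be sketched with a small timing model. Global RTR alternates loading and executing, so every load is paid in full. For local RTR, the sketch assumes that the next configuration can be loaded into a spare region while the current one executes; this overlap model, the function names and the phase times are illustrative assumptions, not details taken from the talk.

```python
def global_rtr_time(phases):
    """Total time when each (load, exe) phase runs strictly in sequence."""
    return sum(load + exe for load, exe in phases)


def local_rtr_time(phases):
    """Total time when the next load overlaps with the current execution."""
    if not phases:
        return 0
    total = phases[0][0]  # the first load cannot be hidden
    for i, (_, exe) in enumerate(phases):
        next_load = phases[i + 1][0] if i + 1 < len(phases) else 0
        # The step lasts until both execution and the next load are done.
        total += max(exe, next_load)
    return total


# Hypothetical phases A, B, C: (load time, execution time) in milliseconds.
phases = [(2, 5), (2, 5), (2, 5)]
print(global_rtr_time(phases), local_rtr_time(phases))
```

With these numbers the reconfiguration time of B and C is completely hidden behind execution; when loads take longer than execution, the benefit shrinks.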
Implementation Issues





Temporal partitioning is an iterative process
Possibly inefficient usage of FPGA resources
in global RTR
Simulation
Efficient usage of hardware in local RTR
Current CAD tools: poor match for local RTR

(Examples of Local RTR: RRANN-2 and DISC )
Power Savings in RC system



Exploitation of numerical properties of an
application
Higher number of operations per clock due to
deep pipelines
Sensor/actuator pre-conditioning and “glue
logic” functions on chip
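The pipelining point can be made concrete with cycle counts. Once a pipeline of a given depth is full, it completes one result per clock, whereas an unpipelined unit of the same latency needs the full latency for every item. The function names and the example sizes are hypothetical, and the model ignores stalls.

```python
def pipelined_cycles(n_items, depth):
    """Cycles to process n_items through a stall-free pipeline of given depth."""
    return depth + n_items - 1 if n_items > 0 else 0


def unpipelined_cycles(n_items, depth):
    """Cycles when each item occupies the whole unit for `depth` cycles."""
    return n_items * depth


n, depth = 1000, 10  # hypothetical workload and pipeline depth
print(pipelined_cycles(n, depth), unpipelined_cycles(n, depth))
```

Finishing the same work in fewer cycles is what allows a lower clock rate for the same throughput.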
Some Challenges




Development of RCS is largely
restricted to hardware developers
Run-time environment, RTR scheduling
Difficulties in routing for RC hardware
having large number of CLBs
Connection scheme in multi-FPGA
system
Software Aspect

Software-like design environment





SystemC (Synopsys), Handel-C (Celoxica)
Hardware-software co-design (ARM Rapid
Prototyping Platform (RPP))
Generation of a detailed gate-level description
(netlist) from an HLL (high-level language)
Technology mapping, placement and routing
Generation of .bit files (language of the
FPGA)
Software Aspect Cont’d




Programming language/HDL
An SoC consists of 50 to 90% software
Wide acceptability of C/C++
Simulation timing


Simulation takes a long time in current CAD
tools
C/C++ debuggers are very efficient
RC1000 Celoxica platform


DK1 design suite (Handel-C)
RC1000 plug-in card, PCI bus interfacing


Xilinx Virtex-1000 FPGA (1 million gates)
Design Flow
[Figure: design-flow diagram. Box labels: Handel C Source Files; Compile; Generate VHDL/Verilog; Generate EDIF (netlist); Simulate & netlist; Place & Route Tools; BitStream Generation]
Hardware-Software Co-design
Amdahl’s Law
\[ T = \frac{1}{(1 - a) + \frac{a}{s}} \]
T = Overall speedup
a = Fraction of the original program that could
be enhanced by transferring to h/w
s = Speedup obtained for that fraction of the
program
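As a worked example of the formula, the sketch below evaluates the overall speedup for a few cases. The function name and the sample values of a and s are chosen for illustration and do not come from the talk.

```python
def amdahl_speedup(a: float, s: float) -> float:
    """Overall speedup T when fraction a of the program is sped up by s."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - a) + a / s)


# Moving 90% of the run time to hardware that is 10x faster:
print(f"{amdahl_speedup(0.9, 10.0):.2f}")
# Moving only 50% to hardware, even with a huge s, stays below 2x:
print(f"{amdahl_speedup(0.5, 1e6):.2f}")
```

The second case shows why hardware-software partitioning matters: the part left in software bounds the achievable speedup at 1 / (1 - a).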
Summary


RCS bridge the gap between software and
hardware (flexibility and performance)
The FPGA is an ideal candidate for RH






Spatial Execution
Reprogrammability
Design time
Design and synthesis flow for CAD tools
Hybrid Architecture
Recent advancement in CAD tools
Questions?