
Wormhole RTR FPGAs with Distributed Configuration Decompression
CSE-670 Final Project Presentation
A Joint Project Presentation by:
Ali Mustafa Zaidi
Mustafa Imran Ali
Introduction

“An FPGA configuration architecture supporting distributed control and fast context switching.”

Aim of Joint Project:
 Explore the potential for dynamically reconfiguring FPGAs by adapting the WRTR approach
 Enable high-speed reconfiguration using:
  Optimized logic blocks
  Distributed configuration decompression techniques
 Study focuses on datapath-oriented FPGAs
Project Methodology

In-depth study of issues.

Definition of basic architecture models.

Definition of area models (for estimation of relative overhead w.r.t. other RTR schemes).

Design of reconfigurable systems around the designed architecture models.

Identification of FPGA resource allocation methods
 For distributing FPGA area between multiple applications/hosts (at runtime, compile-time, or system design-time).

Selection of benchmarks for testing and simulation of the various approaches.

Experimentation with WRTR systems and a corresponding PRTR system without distributed configuration (i.e. single host/application), for comparison of baseline performance.
 Evaluate resource utilization, normalized reconfiguration overhead, etc.

Experimentation with all systems with distributed configurations (i.e. multiple hosts/applications). The PRTR system’s configuration port will be time-multiplexed between the various applications.
 Evaluate resource utilization, normalized reconfiguration overhead, etc.
Configuration Issues with Ultra-high Density FPGAs

FPGA densities are rising dramatically with process technology improvements.

Configuration time for the serial method is becoming prohibitively large.

FPGAs are increasingly used as compute engines, implementing data-intensive portions of applications directly in hardware.

The lack of an efficient method for dynamic reconfiguration of large FPGAs will lead to inefficient utilization of the available resources.
Scalability Issues with Multi-context RTR

Concept: While one plane is operating, configure the other
planes serially.
 Latency hidden by overlapping configuration with computation

As FPGA size, and thus configuration time, grows, the multi-context approach becomes less effective at hiding latency
 Configuring more bits in parallel for each context is only a stopgap solution.
 Only so many pins can be dedicated to configuration.

Overheads in the multi-context approach:
 The number of SRAM cells used for configuration grows linearly with the number of contexts
 Multiplexing circuitry associated with each configurable unit
 Global, low-skew context-select wires.
Scalability Issues with Partial RTR

Concept: The configuration memory is addressable like standard random access memory.

Overheads in the PRTR approach:
 Long global cell-select and data busses required
  Area overhead
  Issues with single-cycle signal transmission as wires grow relative to logic
 Vertical and horizontal decoding circuitry – represents a centralized control resource
  Can be accessed only sequentially by different user applications (one app at a time)
  Potential for underutilization of hardware as FPGA density increases

One solution could be to design the RAM as a multi-ported memory
 But the area of a RAM increases quadratically with the number of ports
 Only so many dedicated configuration ports can be provided
 Not a long-term scalable solution
What is Wormhole RTR

WRTR is a method for reconfiguring a configurable device in an entirely distributed fashion.

Routing and configuration are handled at the local rather than the global level.
Advertised Benefits of WRTR

WRTR is a distributed paradigm
 Allows different parts of the same resource to be independently configured simultaneously.
 Dramatically increases the configuration bandwidth of a device.

Lack of a centralized controller means:
 Fewer single-point failures that can lead to total system failure (e.g. a broken configuration pin)
 Increased resilience: routing around faults – improving chip yields?
 Distributed control provides scalability
 Eliminates the configuration bottleneck.
Origins of Wormhole RTR

Concept developed in the late 1990s at Virginia Tech.

Intended as a method of rapidly creating and
modifying ‘custom computational pathways’ using a
distributed control scheme.

Essence of the WRTR concept:
 Independent, self-steering streams.
 Streams carry both programming information and operand data.
 Streams interact with the architecture to perform computation.
(see DIAGRAM)
Origins of Wormhole RTR

Programming information configures both the pathway of the stream through the system and the operations performed by the computational elements along the path.

Heterogeneity of architectures is supported by these streams.

Runtime determination of the stream’s path is possible, allowing allocation of resources as they become available.
Adapting WRTR for conventional FPGAs

Our aims:
 To achieve fast, parallel reconfiguration
 With minimum area overhead
 With minimum constraints imposed on the underlying FPGA architecture
  The configuration architecture is completely decoupled from the FPGA architecture.

The WRTR model is used as inspiration for developing a new paradigm for dynamic reconfiguration
 It is not necessary that the WRTR method be followed to the letter.
Issues Associated with using WRTR for Conventional FPGAs

The original WRTR was intended for coarse-grained dataflow architectures with localized communications
 Thus operand data was appended to the streams immediately after the programming header.

In conventional FPGAs, dataflow patterns are unrelated to configuration flow, and there is no restriction localizing communications.
 Therefore wormhole routing is used only for configuration (it cannot be used for data).
Issues Associated with using WRTR for Conventional FPGAs

The original model was intended to establish linear pipelines through the system.
 This makes run-time direction determination feasible.

However, for conventional FPGAs, the functions implemented have arbitrary structures.
 The configuration stream cannot change direction arbitrarily (i.e. its path is fixed at compile time).
Issues Associated with using WRTR for Conventional FPGAs

Due to the need for a large number of configuration ports, I/O ports must be shared/multiplexed
 Thus active circuits may need to be stalled to load configurations.
 This should not be a severe issue for high-performance computing oriented tasks.

Should impose minimum constraints on the underlying FPGA architecture
 The constraints applicable in an FPGA with WRTR are the same as those for any PRTR architecture.
A Possible System Architecture for a WRTR FPGA

Many configuration/IO ports, divided between multiple host processors. (See Diagram)

Internally, the FPGA is divided into partitions, usable by each of the hosts.

Partition boundaries may be determined at system design time, or at runtime, based on the requirements of each host at any given time.
The various WRTR Models derived

Our aim was to devise a high-speed, distributed
configuration model with all the benefits of the
original WRTR concept, but with minimum
overhead.

To this end, three models have been devised:
 Basic: with full internal routing
 Second: with perimeter-only routing
 Third: packetized, or parallel, configuration streams, with no internal routing
Basic WRTR Model: with Internal Routing

Each configurable block or “tile” is accompanied by a simple configuration stream router. (See Diagram)
 Overhead scales linearly with FPGA size.

Expected issues with this model:
 Complicated router, arbitration overhead and prioritization, potential for deadlock conditions, etc.
 May be restricted to coarser-grained designs.
 Without data routing, do we really need internal routing?
Second: WRTR with Perimeter-only Routing

The primary requirement for achieving parallel configuration is multiple input ports.
 So why not restrict routing to the chip boundary? (See Diagram)
 Overhead scaling is improved (similar to the PRTR model)

Highlights:
 Internal routing is not a mandatory requirement.
 Finer configuration granularity is achievable.
 Significantly lower overheads as FPGA sizes grow (ratio of perimeter to area).

Issues:
 Longer time is required to reach parts of the FPGA as FPGA size grows.
 Reduced configuration parallelism because of potentially greater arbitration delays at the boundary routers.
Third: Packet-based Distribution of Configuration

One solution to the increased boundary-arbitration issue: use packets instead of streams. (See Diagram)

A single configuration from each application is generated as a stream (a worm), similar to the previous models.

Before entering the device, configuration packets from different streams are grouped according to their target rows (see the sketch below).
Third: Packet-based Distribution of Configuration

Benefit: No need at all for routers in the fabric itself.

Drawbacks:
 Increases overhead on the host system
 Implies a centralized external controller
  Or a limited crossbar interconnect within the FPGA
 Parallel reconfiguration is still possible, but with limited multitasking.

This model may be considered for embedded systems with low configuration parallelism but high resource requirements.
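As a rough illustration of the row-grouping step, here is a minimal Python sketch. It assumes each configuration word already carries a target-row tag; the packet size, field layout, and function names are illustrative assumptions, not part of the presented architecture.

from collections import defaultdict

def split_worm_into_packets(worm, words_per_packet=4):
    # `worm` is assumed to be a list of (target_row, config_word) pairs produced
    # for one host at compile time; the packet size is an arbitrary illustrative choice.
    packets = []
    for row in sorted({r for r, _ in worm}):
        words = [w for r, w in worm if r == row]
        for i in range(0, len(words), words_per_packet):
            packets.append((row, words[i:i + words_per_packet]))
    return packets

def group_by_target_row(worms):
    # Merge packets from all hosts' worms into per-row queues, mimicking the
    # external grouping step performed before the packets enter the device.
    row_queues = defaultdict(list)
    for worm in worms:
        for row, payload in split_worm_into_packets(worm):
            row_queues[row].append(payload)
    return dict(row_queues)

# Example: two hosts whose worms target overlapping rows.
worm_a = [(0, 0xA0), (0, 0xA1), (2, 0xA2)]
worm_b = [(2, 0xB0), (3, 0xB1)]
print(group_by_target_row([worm_a, worm_b]))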
Basic Area Model

A baseline model is required to identify the overheads associated with each RTR model.

Basic building block: (See Diagram)
 A basic array of SRAM cells
 Configured by an arbitrary number of scan chains.

Assumptions for a fair comparison of overhead:
 Each RTR model studied has exactly the same amount of logic resources to configure (rows and columns).
 Each model can be configured at exactly the same granularity.

The given array of logic resources (see Diagram) has an area equivalent to that of a serially configurable FPGA.
 AREA of Basic Model = (A * B) * (x * y)
The PRTR FPGA Area Model

Please See Diagram

Configuration Granularity decided by A and B

AREA = Area of Basic model + Overheads

Overheads =
 Area of ‘log2(x)-to-x’ row-select decoder +
 Area of ‘log2(y)-to-y’ n-bit column de-multiplexer +
 Area of 1 n-bit bus * y
The basic WRTR FPGA Area Model

Please See Diagram

AREA = Area of Basic Model + Overheads

Overheads =
 Area of 1 n-bit bus * 2x +
 Area of 1 n-bit bus * 2y +
 Area of 1 4-D router block * [x * y]
The Perimeter Routing WRTR FPGA Area Model

Please See Diagram

AREA = Area of Basic Model + Overheads

Overheads =
 Area of 1 n-bit bus * 2x +
 Area of 1 n-bit bus * 2y +
 Area of 1 3-D router block * [2(x + y) – 1]

This model can also be made one-dimensional to further reduce overheads (other constraints will apply).
The Packet-based WRTR FPGA Area Model

Please See Diagram

AREA = Area of Basic Model + Overheads

Overheads =
 Area of 1 n-bit bus * 2x +
 Area of 1 n-bit bus * 2y

Additional overheads may appear in host system.

This model can also be made one-dimensional to further reduce overheads (other constraints will apply).
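To make the four overhead expressions above easier to compare, the following Python sketch encodes them directly. The per-unit area constants are placeholders rather than measured values; only the structure of the formulas is taken from the slides.

# Placeholder unit areas (arbitrary units); only the structure of the formulas
# comes from the slides, the constants are illustrative assumptions.
A_SRAM    = 1.0    # area of one SRAM configuration cell
A_BUSBIT  = 0.1    # area contribution of one bus wire spanning one tile
A_DECODE  = 2.0    # area per decoder/de-multiplexer output line
A_ROUTER3 = 40.0   # area of one 3-direction (perimeter) router block
A_ROUTER4 = 60.0   # area of one 4-direction (internal) router block

def basic_area(A, B, x, y):
    # AREA of Basic Model = (A * B) * (x * y) SRAM cells
    return (A * B) * (x * y) * A_SRAM

def prtr_overhead(x, y, n):
    decoder = A_DECODE * x           # 'log2(x)-to-x' row-select decoder
    demux   = A_DECODE * y * n       # 'log2(y)-to-y' n-bit column de-multiplexer
    busses  = A_BUSBIT * n * y       # one n-bit bus per column
    return decoder + demux + busses

def wrtr_internal_overhead(x, y, n):
    busses  = A_BUSBIT * n * (2 * x + 2 * y)
    routers = A_ROUTER4 * (x * y)            # one 4-D router per tile
    return busses + routers

def wrtr_perimeter_overhead(x, y, n):
    busses  = A_BUSBIT * n * (2 * x + 2 * y)
    routers = A_ROUTER3 * (2 * (x + y) - 1)  # routers on the perimeter only
    return busses + routers

def wrtr_packet_overhead(x, y, n):
    return A_BUSBIT * n * (2 * x + 2 * y)    # busses only, no routers in the fabric

# Example: 32 x 32 tiles, each tile A x B = 8 x 8 SRAM cells, 8-bit busses.
A, B, x, y, n = 8, 8, 32, 32, 8
base = basic_area(A, B, x, y)
for name, f in [("PRTR", prtr_overhead),
                ("WRTR, internal routing", wrtr_internal_overhead),
                ("WRTR, perimeter routing", wrtr_perimeter_overhead),
                ("WRTR, packet based", wrtr_packet_overhead)]:
    ovh = f(x, y, n)
    print(f"{name:25s} overhead = {ovh:9.1f}  ({ovh / base:.2%} of basic area)")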
Parameters defined and their Impact

The number of busses (x and y)
 The number of busses varies with reconfiguration granularity.
 For fixed logic capacity, A and B increase with decreasing x and y, i.e. coarser granularity.

Impact of coarser granularity:
 Reduced overhead?
 Reduced reconfiguration flexibility
 Increased reconfiguration time per block
 Thus it is better to have finer granularity

The width of the busses (n bits)
 The smaller the width, the smaller the overhead (for a fixed number of busses)
 But longer reconfiguration times.
Parameters defined and their Impact

It is possible to achieve finer granularity without increasing overhead, at the cost of bus width (and hence reconfiguration time per block).

Impact of coarse-grained vs. fine-grained configurability – methods of handling hazards in the underlying FPGA fabric:
 Coarse-grained configuration places minimum constraints on the FPGA architecture
 Fine-grained reconfiguration is subject to all the issues associated with partial RTR systems.
Approaches to Router Design

Active Routing Mechanism
 Similar to conventional networks
 Routing of streams depends on the stream-specified destination, as well as network metrics (e.g. congestion, deadlock)
 Hazards and conflicts may be dealt with at run-time.
 Significantly complicated routing logic is required.
 Most likely restricted to very coarse-grained systems.

Passive Routing Mechanism
 Routing of streams depends only on the stream-specified direction
 Hazards and conflicts are avoided by compile-time optimization.

We have selected the passive routing mechanism for our WRTR models.
Passive Router Details

Must be able to handle streams from 4 different directions.

Includes mechanisms for stalling streams in case of conflict, etc.
 Streams from different directions are only stalled if there is a conflict in the outgoing direction.
 Detecting and applying back-pressure.

Routing circuitry for one port is defined (see Diagram)
 For a 4-D router, this design is replicated 4 times
 For a 3-D router, this design is replicated 3 times
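A minimal behavioral sketch of one passive router port, assuming a stream header that simply names its outgoing direction and a grant/stall handshake for back-pressure; the class and signal names are illustrative, the actual circuit is the one defined in the diagram.

# Behavioral sketch of one passive router port (one of the 3 or 4 replicated ports).
# A stream header names its outgoing direction; the port grants the output only
# if no other stream currently holds it, otherwise it stalls (back-pressure).

class PassiveRouterPort:
    def __init__(self, name):
        self.name = name
        self.owner = None              # which incoming stream currently owns this output

    def request(self, stream_id):
        # Return True (grant) or False (stall / back-pressure).
        if self.owner is None:
            self.owner = stream_id     # lock the outgoing direction for this worm
            return True
        return self.owner == stream_id # already held by the same stream

    def release(self, stream_id):
        # Tail of the worm has passed: free the output.
        if self.owner == stream_id:
            self.owner = None

# Example: two worms arriving from different directions both want the 'east' output.
east = PassiveRouterPort("east")
print(east.request("worm_north"))  # True  -> worm from the north is routed east
print(east.request("worm_south"))  # False -> southern worm is stalled (back-pressure)
east.release("worm_north")
print(east.request("worm_south"))  # True  -> stalled worm proceeds once the conflict clears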
Utilizing Variable Length Configuration Mechanisms
Support Hardware and Logic Block Issues
Configuration Overhead

Configuration data is huge for large FPGAs

It has to be reduced in order to have fast context switching
Configuration Overhead Minimization

Initial pointers in this direction

Variable Length Configurations

Default Configuration on Power-up
Variable Length Configurations

Ideally:
 Change only the minimum number of bits for a new configuration
 Utilize the idea of short configurations for frequently used configurations
 Start from a default configuration and change the minimum number of bits to reconfigure to a new state
Hurdles:
 Logic blocks always require full configurations to be specified – configuration sizes cannot be varied
 Knowing only what to change requires keeping track of what was configured before – a difficult issue when multiple dynamic applications are switching
 A default power-up configuration can hardly be useful for all application cases
Configuration Overhead Minimization

How to do it?
 Remove redundancy in the configuration data, or “compact” the contents of the configuration stream

Result?
 This will minimize the information required to be conveyed during configuration or reconfiguration

Configuration compression
 Applying “some” sort of compression to the configuration data stream
Configuration Decompression Approaches

Centralized approach
 Decompress the configuration stream at the boundary of the FPGA

Distributed approach – a new paradigm
 Decompress the stream at the boundary of the logic blocks or logic cluster
Centralized Approach

Advantages
 Requires hardware only at the boundary of the device, where the configuration data enters
 Significant reduction in configuration size can be achieved
 Run-length coding and Lempel-Ziv-based compression have been used

Examples
 Atmel 6000 series
 Zhiyuan Li, Scott Hauck, “Configuration Compression for Virtex FPGAs”, IEEE Symposium on FPGAs for Custom Computing Machines, 2001
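For reference, here is a minimal run-length coding sketch over a stream of configuration words; it shows only the generic idea of exploiting repeated words, not the actual Atmel or Virtex compression scheme.

def rle_compress(words):
    # Run-length encode a configuration stream of equal-width words.
    out = []
    i = 0
    while i < len(words):
        run = 1
        while i + run < len(words) and words[i + run] == words[i]:
            run += 1
        out.append((run, words[i]))   # (repeat count, word)
        i += run
    return out

def rle_decompress(pairs):
    out = []
    for count, word in pairs:
        out.extend([word] * count)
    return out

# Configuration streams often contain long runs of identical (e.g. default) words.
stream = [0x00] * 12 + [0x3A, 0x3A, 0x7F] + [0x00] * 20
packed = rle_compress(stream)
assert rle_decompress(packed) == stream
print(f"{len(stream)} words -> {len(packed)} (count, word) pairs")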
Centralized Approach

Limitations
 More efficient variable-length coding is not easy to use because of the large number of symbol possibilities
 It is difficult to quantify symbols in the configuration stream of heterogeneous devices, which can have different types of blocks
Decentralized Approach

Advantages
 Decompressing at the logic-block boundary enables configurations to be easily symbolized, and hence variable-length coding (VLC) to be used
 In other words, we know exactly what we are coding, so Huffman-like codes can be used based on the frequency of configuration occurrences
 Also has advantages specific to Wormhole RTR – discussed next
Decentralized Approach

Limitations
 The decompression hardware has to be replicated
 Optimality issue: over how much programmable logic area should the decompression hardware be amortized?
 In other words, the granularity of the logic area should be determined for an optimal cost/benefit ratio
Suitability of Decentralized Approach to WRTR

If worms are decompressed at the chip boundary, long internal worm lengths will result
 This leads to greater arbitration issues and worm blockages

The decentralized approach thus favors shorter worm lengths and allows parallel worms to traverse with fewer blockages
Variable Length Configuration

Overall idea
 Frequently used configurations of a logic block should have small codes
 Variable-length coding, such as Huffman coding, can be adapted
Configuration Frequency Analysis

How to decide upon the frequencies?
 Hardwired? By the designer, through benchmark analysis?
 Generic? Done by the software generating the configuration stream?

Continued…
 Hardwired determination will be inferior – no large benefit is gained, due to variations across applications
 Software that generates the configurations can optimally identify a given number of frequently used configurations according to the set of applications to be executed
 Code determination should therefore be done by the software generating the configurations, for optimal codes (see the sketch below)
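A sketch of how the configuration-generating software could derive such codes: count how often each distinct logic-block configuration appears across the target application set, then build a Huffman code over those counts. The configuration names and counts below are made-up placeholders.

import heapq
from collections import Counter

def huffman_codes(freqs):
    # Build Huffman codes from {symbol: count}; assumes at least two symbols.
    # Heap entries are (count, unique_id, tree), where tree is a symbol or (left, right).
    heap = [(count, i, sym) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)
        c2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next_id, (t1, t2)))
        next_id += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes

# Hypothetical logic-block configurations observed while mapping a benchmark set
# (e.g. adder, register, multiplier slice, pass-through), counted per occurrence.
observed = ["ADD"] * 40 + ["REG"] * 25 + ["MUL"] * 20 + ["PASS"] * 10 + ["MISC"] * 5
freqs = Counter(observed)
codes = huffman_codes(freqs)
for sym in sorted(codes, key=lambda s: len(codes[s])):
    print(f"{sym:5s} count={freqs[sym]:3d} code={codes[sym]}")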
Decoding Hardware Approaches

Huffman coding the configurations
 A hardwired Huffman decoder
 Adaptive decoder (code table can be changed)

Using a table of frequently used configurations and an address decoder

Huffman coding the table addresses
 Static coding
 Adaptive coding
Decoding Hardware Features

Static Huffman decoders
 Coding the configurations directly
  Requires a very wide decoder

Using an address decoder only
 Reduced hardware but lower compression (fixed-size codes)

Coding the decoder inputs
 Requires a relatively smaller Huffman decoder
Some points to Note

The decompression approach is decoupled from any specific logic block architecture
 Though certain logic blocks will favor more compression (discussed later)

Not every possible configuration will be coded. In particular, random-logic portions will require all of their bits to be transmitted
 A special code will prefix a random-logic configuration to identify it as one to be handled separately (see the sketch below)
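A behavioral sketch of the per-cluster decode step implied above, assuming a 1-bit escape flag followed by either a short table address (frequently used configuration) or a full uncompressed configuration word; the widths, code values, and table contents are illustrative assumptions.

# Assumed illustrative format: a 1-bit escape flag, then either a 4-bit table
# address (frequent configuration) or a full W-bit raw configuration word.

FULL_WIDTH = 16   # bits in one full logic-block configuration (assumed)
ADDR_WIDTH = 4    # 16-entry table of frequently used configurations

# Hypothetical table contents, filled at design time from benchmark analysis.
FREQ_TABLE = {i: (0xA000 | i) for i in range(2 ** ADDR_WIDTH)}

def decode_stream(bits):
    # Decode a bit string into full configuration words.
    out, pos = [], 0
    while pos < len(bits):
        escape = bits[pos]; pos += 1
        if escape == "1":                      # random logic: raw bits follow
            word = int(bits[pos:pos + FULL_WIDTH], 2); pos += FULL_WIDTH
        else:                                  # frequent configuration: table lookup
            addr = int(bits[pos:pos + ADDR_WIDTH], 2); pos += ADDR_WIDTH
            word = FREQ_TABLE[addr]
        out.append(word)
    return out

# Two table hits (addresses 3 and 7) followed by one uncompressed word.
stream = "0" + "0011" + "0" + "0111" + "1" + format(0x1234, "016b")
print([hex(w) for w in decode_stream(stream)])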
Logic Block Selection

High-level issues
 Should be datapath oriented
 Efficient support for random-logic implementation
 High functionality with minimum configuration bits, to support dense implementation along with reconfiguration-overhead reduction
 Well-defined datapath functionality (configuration), to aid in quantifying the frequently-used-configurations idea
Chosen Hi-Functionality Logic Block
A Low Functionality Logic Block
Logic Block Considerations


High-functionality blocks → good for datapath implementations

Low-functionality blocks → less dense datapath implementations
Logic Block Considerations



Low-functionality blocks have fewer configuration bits per block, and vice versa

The memory cost of the frequently-used-configurations table depends on the size of the configurations stored

So the decoder hardware overhead will be less for low-functionality blocks
Logic Block Considerations

What about the configuration time overhead?
 Less dense functionality means more blocks to configure
 This leads to longer configuration streams
Logic Block Considerations


Assumption: Random logic does not benefit from one block type or the other

Consequence: With high-functionality blocks, datapath-oriented designs will require fewer blocks to configure and achieve even higher compression, but random-logic implementations will incur a larger overhead than with low-functionality blocks
Logic Block Issue Conclusions

Proper logic block selection for a particular application affects
 Decoder hardware size
 Configuration compression ratio

Since random logic is not compressed, using high-functionality blocks for less datapath-oriented applications will result in
 A high decoder overhead that goes underutilized
 Less compression and longer configuration streams
Huffman decoder hardware

Basic Huffman decoding hardware is sequential in nature, with a variable input rate and a variable output rate
Huffman Hardware

Sequential decoding is not suitable for WRTR
 The worm would have to be stalled
 This negates the benefit of fast reconfiguration

The hardware should be able to process N bits at a time, where N = bus width

This requires a constant-input-rate architecture with a variable number of codes processed per cycle
Constant-Input-Rate PLA-based Architecture

Input rate: K bits/cycle

The PLA performs the table-lookup process
 • Input bits: determine one unique path along the Huffman tree
 • Next state: fed back to the input of the PLA to indicate the final residing state
 • Indicator: the number of symbols decoded in the cycle

(Figure: the constant-input-rate PLA-based architecture for the VLC decoder)
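A software model of the constant-input-rate idea described above: each cycle consumes exactly K input bits, walks the Huffman tree as far as those bits allow, emits however many symbols complete, and feeds the residual tree state back as the next state. The code table is a toy example, not the project's configuration code.

# Toy Huffman code (symbol -> code); in the real decoder this is the PLA contents.
CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}
ROOT = {}
for sym, code in CODE.items():               # build a decode tree from the code table
    node = ROOT
    for bit in code[:-1]:
        node = node.setdefault(bit, {})
    node[code[-1]] = sym

def decode_cycle(state, in_bits):
    # One clock cycle: consume exactly K = len(in_bits) bits.
    # Returns (next_state, symbols): the residual tree node fed back to the "PLA"
    # input, and the symbols completed this cycle (the "indicator" is len(symbols)).
    node, symbols = state, []
    for bit in in_bits:
        nxt = node[bit]
        if isinstance(nxt, str):              # reached a leaf: one symbol decoded
            symbols.append(nxt)
            node = ROOT
        else:
            node = nxt
    return node, symbols

# Feed the encoded stream 4 bits per cycle (constant input rate, K = 4).
bitstream = "".join(CODE[s] for s in "ABACDDBA")
state = ROOT
for i in range(0, len(bitstream), 4):
    chunk = bitstream[i:i + 4].ljust(4, "0")  # pad the last chunk (a real stream would flag its end)
    state, syms = decode_cycle(state, chunk)
    print(f"cycle in={chunk} decoded={syms} (indicator={len(syms)})")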
PLA Based Architecture

Ref: “VLSI Designs for High Speed Huffman Decoders”, Shih-Fu Chang and David G. Messerschmitt

Decoder model: FSM

FSM implementation:
 ROM
 PLA (lower complexity, high speed)
Implementation Results

The area is a function of the number of inputs and outputs, along with the input rate

Hardware Area Estimates

The number of inputs and outputs depends upon the maximum code length and the symbol sizes

Typically, for a 16-entry table:
 Code sizes range from 1 to 8 bits
 The symbol size will equal the decoder input width, i.e. 4
Handling Multiple Symbol Outputs

More than one code may be decoded, with a maximum of N codes per cycle

To take full advantage of parallel decoding, multiple configuration chains can be employed.

A counter can be used to cycle between the chains and output one configuration per chain, with a maximum of N chains.
Decoder Hardware

For parallel decoding of N codes and M-bit decoded symbols:
 Huffman decoder (discussed before)
 N M-bit decoders
 N-port ROM table
 N parallel configuration chains
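A small sketch of the output side: the symbols decoded in a cycle (at most N) are distributed round-robin by a counter onto N parallel configuration chains, as described above; the chain count and symbol contents are placeholders.

N_CHAINS = 4   # maximum codes decoded per cycle = number of parallel chains (assumed)

class ChainDistributor:
    # Round-robin distribution of decoded configuration symbols onto N chains.
    def __init__(self, n_chains=N_CHAINS):
        self.chains = [[] for _ in range(n_chains)]
        self.counter = 0                      # the cycling counter from the slide

    def push_cycle(self, decoded_symbols):
        # Accept the (0..N) symbols decoded in one cycle.
        assert len(decoded_symbols) <= len(self.chains)
        for sym in decoded_symbols:
            self.chains[self.counter].append(sym)
            self.counter = (self.counter + 1) % len(self.chains)

dist = ChainDistributor()
dist.push_cycle(["cfg_a", "cfg_b", "cfg_c"])   # 3 codes decoded this cycle
dist.push_cycle(["cfg_d"])                     # 1 code decoded this cycle
dist.push_cycle(["cfg_e", "cfg_f"])
for i, chain in enumerate(dist.chains):
    print(f"chain {i}: {chain}")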
Concluding Points

The hardware overhead of the Huffman decoding mechanism discussed has to be incorporated into the area models presented earlier.

Empirical determination of reconfiguration speed-up versus area overhead for a sampling of benchmarks.