Design of Logic Block Architectures for Digital Signal


A Survey of Logic Block Architectures for Digital Signal Processing Applications
Presentation Outline

Considerations in Logic Block Design
- Computation Requirements
- Why Inefficiencies?

Representative Logic Block Architectures
- Proposed
- Commercial

Conclusions: What is suitable Where?
Why DSP??? The Context



- Representative of a computationally intensive class of applications -> datapath oriented and arithmetic oriented
- Increasingly large use of FPGAs for DSP -> multimedia signal processing, communications, and much more
- To study the “issues” in reconfigurable fabric design for compute-intensive applications -> What is involved in making a fabric to accelerate multimedia reconfigurable computing possible?
Elements of a Reconfigurable Architecture

- Logic Block/Processing Element
  - Differing Grains: Fine >> Coarse >> ALUs
- Routing
- Dynamic Reconfiguration

So what’s wrong with the typical FPGA?
- Meant to be general purpose -> lower risks
- Too flexible! -> Result: Efficiency Gap
- Higher implementation cost, larger delay, and larger power consumption than ASICs
- Performance vs. Flexibility Tradeoff -> Postponing Mapping and Silicon Re-use

Solution? See how FPGAs are Used
- FPGAs are being used for “classes” of applications -> Encryption, DSP, Multimedia, etc.
- Here lies the Key -> Design FPGAs for a class of applications
- Application Domain Characterization -> Application Domain Tuning

Domain Specialization
COMPUTATION -> defines -> ARCHITECTURE

Target Application Characteristics known beforehand? Yes
1. Characterize the application domain
2. Determine a balance between flexibility and efficiency
3. Tune the architecture accordingly
Categorizing the “Computation”
- Control -> Random Logic Implementation
- Datapath -> Processing of Multi-bit Data
- Conflicting Requirements???

Datapath Element Requirements
- Operates on Word Slices or Bit Slices
- Produces multi-bit outputs
- Requires many smaller elements to produce each bit output -> i.e. multiple small LUTs

Control Logic Requirements
- Produces a single output from many single-bit inputs
- Benefits from large-grain LUTs, as the number of logic levels is reduced

Logic Block Design: Considerations
- “How much” of “what kinds” of computations to support?
- Tradeoff: Generality vs Specialization

How much of What? Applications benchmarking
So what do we have to support?





- Datapath functionality, in particular arithmetic, is dominant in DSP.
- The datapath functions have different bit-widths.
- DSP designs heavily use multiplexers of various sizes. Thus, an efficient mapping of multiplexers should be supported.
- DSP functions do contain random logic. The amount of random logic varies per design.
- Some DSP designs use wide Boolean functions.
DSP Building Blocks


Some techniques widely used to achieve area-speed efficient DSP implementations:
- Bit-Serial Computations
  - Routing efficient
  - Bit-level pipelining increases throughput even more
- Digit-Serial Computation
  - Combines the “area efficiency” of bit-serial with the “time efficiency” of bit-parallel
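
The trade-off between bit-serial, digit-serial, and bit-parallel arithmetic can be illustrated with a small behavioral model. The sketch below is an illustration only and is not taken from the slides: `serial_add` and its parameters are hypothetical names, and the model assumes unsigned operands processed least-significant digit first, with a single carry register held between clock cycles.

```python
def serial_add(a: int, b: int, width: int, digit: int = 1):
    """Add two unsigned `width`-bit words, `digit` bits per clock cycle.

    Digits are consumed least-significant first. Returns a tuple
    (sum modulo 2**width, final carry-out, number of clock cycles).
    """
    if width % digit != 0:
        raise ValueError("width must be a multiple of digit")
    mask = (1 << digit) - 1
    carry = 0                       # models the carry flip-flop between cycles
    total = 0
    cycles = width // digit
    for i in range(cycles):
        da = (a >> (i * digit)) & mask
        db = (b >> (i * digit)) & mask
        s = da + db + carry         # one digit-wide adder, reused every cycle
        total |= (s & mask) << (i * digit)
        carry = s >> digit          # saved for the next digit
    return total, carry, cycles


if __name__ == "__main__":
    for digit in (1, 4, 8):
        print(digit, serial_add(200, 100, width=8, digit=digit))
```

With `digit=1` the model behaves as a bit-serial adder (8 cycles for 8-bit words), with `digit=4` as a digit-serial adder (2 cycles), and with `digit=8` as a bit-parallel adder (1 cycle). The adder hardware grows with `digit`, while the cycle count shrinks.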
Classes of DSP-optimized FPGA Architectures
1. Architectures with Dedicated DSP Logic
   - Homogeneous
   - Heterogeneous
   - Globally Homogeneous, Locally Heterogeneous
2. Architectures of Coarser Granularity
3. Architectures with DSP-Specific Improvements (e.g. Carry Chains, Input Sharing, CBS)
Some Representative Architectures
Bit-Serial FPGA with SR LUT
- The bit-serial paradigm suits the existing FPGA, so why not optimize the FPGA for it!
- Logic block to support efficient implementation of bit-serial datapaths and bit-level pipelining
- LUTs can be used for combinational logic as well as for shift registers

A Bit-Serial Adder
A Bit-Serial Adder which processes two bits at a time
Interface Block Diagram
A Bit-Serial Multiplier Cell
The Proposed Bit-Serial Logic Block Architecture

Four 4-input LUTs and 6 flip-flops.

The two multiplexers in front of the LUTs are targeted mainly at carry-save operations, which are frequently used in bit-serial computations.

There are 18 signal inputs and 6 signal outputs, plus a clock input.

Feedback inputs c2, c3, c4, c5 can be connected to GND, to VDD, or to one of the 4 outputs d0, d1, d2, d3. Therefore, each LUT can implement any 4-input function controlled by inputs a0, a1, a2, a3 or b0, b1, b2, b3.

Programmable switches connected to inputs a4 and b4 control the functionality of the four multiplexers at the outputs of the LUTs. As a result, 2 LUTs can implement any 5-input function.

The final outputs d0, d1, d2, d3 can be either the direct outputs of the multiplexers or the outputs of the flip-flops. All bit-serial operators use the flip-flop outputs, so for them the attached programmable switches are unnecessary. The switches are present only so that logic functions other than bit-serial datapath circuits can be implemented.

Two flip-flops are added (inputs c0 and c1) to implement shift registers, which are frequently used in bit-serial operations.
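
The statement that 2 LUTs plus an output multiplexer can implement any 5-input function follows from Shannon expansion on the fifth input. The sketch below illustrates the idea in software. It is a generic model, not the wiring of the proposed logic block, and all function names are hypothetical.

```python
from itertools import product


def make_lut4(func):
    """Tabulate a 4-input Boolean function into a 16-entry truth table."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=4)]


def read_lut4(table, x3, x2, x1, x0):
    """Look up one stored bit; x3 is the most significant address bit."""
    return table[(x3 << 3) | (x2 << 2) | (x1 << 1) | x0]


def make_lut5(func5):
    """Split a 5-input function into two 4-LUTs: the cofactors for x4=0 and x4=1."""
    low = make_lut4(lambda x3, x2, x1, x0: func5(0, x3, x2, x1, x0))
    high = make_lut4(lambda x3, x2, x1, x0: func5(1, x3, x2, x1, x0))
    return low, high


def read_lut5(luts, x4, x3, x2, x1, x0):
    """The fifth input drives the multiplexer that selects one LUT output."""
    low, high = luts
    return read_lut4(high if x4 else low, x3, x2, x1, x0)


if __name__ == "__main__":
    def parity5(a, b, c, d, e):
        return a ^ b ^ c ^ d ^ e

    luts = make_lut5(parity5)
    ok = all(read_lut5(luts, *bits) == parity5(*bits)
             for bits in product((0, 1), repeat=5))
    print("5-input parity reproduced by two 4-LUTs:", ok)
```

Because the two tables together hold all 32 entries of the truth table, the construction works for any 5-input function, not only the parity example used here.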
The Modified LUT Implementing a Shift Register
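
The figure for this slide is not part of the transcript. As a rough behavioral sketch of the general idea (the storage cells of a LUT reused as a shift register), the model below assumes a 16-entry LUT whose cells can be chained, with the 4 inputs selecting which cell is read. The class and method names are hypothetical, and the actual modification proposed in the slides may differ.

```python
class ShiftRegisterLUT:
    """A 16x1 LUT whose storage cells can also be chained as a shift register."""

    def __init__(self, init: int = 0):
        # Cell i holds bit i of the 16-bit initialization value.
        self.bits = [(init >> i) & 1 for i in range(16)]

    def lookup(self, addr: int) -> int:
        """Combinational read: the 4 inputs address one stored bit."""
        return self.bits[addr & 0xF]

    def shift_in(self, bit: int) -> int:
        """Clocked shift: the new bit enters cell 0, the bit in cell 15 drops out."""
        out = self.bits[-1]
        self.bits = [bit & 1] + self.bits[:-1]
        return out


if __name__ == "__main__":
    srl = ShiftRegisterLUT()
    for b in (1, 0, 1, 1):
        srl.shift_in(b)
    # Address n-1 taps the bit that entered n cycles ago.
    print([srl.lookup(a) for a in range(4)])
```

In logic mode only `lookup` is used (for example, `ShiftRegisterLUT(0x8000)` behaves as a 4-input AND). In shift mode the address picks the delay tap, so one LUT can stand in for up to 16 flip-flops of delay.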
Performance Results
Digit-Serial Logic Block Architecture
- Digit-serial architectures process one digit (N=4 bits) at a time
- They offer area efficiency similar to bit-serial architectures and time efficiency close to bit-parallel architectures
- N=4 bits can serve as an optimal granularity for processing larger digit sizes (N=8, 16, etc.)
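
A digit-serial unsigned multiplication with N=4 can be sketched behaviorally as follows. This is a simplified model under stated assumptions, not the multiplier structure or logic block of the proposed architecture: the multiplier operand `x` arrives one digit per cycle, least-significant digit first, the multiplicand `y` is available in parallel, and one product digit is emitted per cycle. The function name and parameters are hypothetical.

```python
def digit_serial_multiply(x: int, y: int, x_width: int, digit: int = 4):
    """Multiply unsigned x by y, consuming x one `digit`-bit digit per cycle.

    Returns (product, list of product digits emitted during the input cycles).
    """
    if x_width % digit != 0:
        raise ValueError("x_width must be a multiple of digit")
    if x < 0 or y < 0 or x >= (1 << x_width):
        raise ValueError("operands must be unsigned and x must fit in x_width bits")
    mask = (1 << digit) - 1
    acc = 0                            # running partial sum kept between cycles
    emitted = []
    for i in range(x_width // digit):
        d = (x >> (i * digit)) & mask
        acc += d * y                   # digit-by-word partial product
        emitted.append(acc & mask)     # lowest digit is final, so emit it
        acc >>= digit                  # realign for the next, more significant digit
    # Whatever remains in the accumulator forms the upper part of the product.
    product = acc << x_width
    for i, dig in enumerate(emitted):
        product |= dig << (i * digit)
    return product, emitted


if __name__ == "__main__":
    print(digit_serial_multiply(0xAB, 0xCD, x_width=8))
```

An 8-bit `x` therefore needs two cycles with N=4, compared with eight cycles for a bit-serial design, while each cycle only needs a 4-bit-by-word partial product rather than a full parallel multiplier.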

Digit-Serial Building Blocks
A Digit-Serial Adder
A Digit-Serial Unsigned Multiplier
Digit-Serial Building Blocks
A Pipelined Digit-Serial Unsigned Multiplier For Y=8 bits
Digit-Serial Signed Multiplier Blocks
First Stage Module
Middle Stages Module
Last Stage Module
Signed Digit-Serial Multiplier
A Digit-Serial Signed Booth’s Pipelined Multiplier with Y=8
Proposed Digit-Serial Logic Block
Detailed Structure of Digit-Serial Logic Block
The Basic Logic Module (LM)
Table of Functions Implemented
The Structure of the LM
Examples of Implementations: N=4 Unsigned Multiplier, N=4 Signed Multiplier, Two N=2 Multipliers, Bit-Level Pipelined
Area Comparison with Xilinx 4000 Series
Mixed-Grain Logic Block Architecture
- Exploits the adder inverting property
- Efficiently implements both datapath and random logic in the same logic block design

Adder Inverting Property
Full Adder and Equations Showing the Inverting Property
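
The equations from this slide did not survive in the transcript. Assuming the property meant here is the standard self-duality of the full adder (complementing all three inputs complements both the sum and the carry-out), it can be checked exhaustively with a short sketch. The function names are illustrative only.

```python
from itertools import product


def full_adder(a: int, b: int, cin: int):
    """Return (sum, carry-out) of a 1-bit full adder."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def inverting_property_holds() -> bool:
    """Check that inverting every input inverts both outputs, for all 8 input cases."""
    for a, b, cin in product((0, 1), repeat=3):
        s, cout = full_adder(a, b, cin)
        s_inv, cout_inv = full_adder(1 - a, 1 - b, 1 - cin)
        if (s_inv, cout_inv) != (1 - s, 1 - cout):
            return False
    return True


if __name__ == "__main__":
    print("inverting property holds:", inverting_property_holds())
```

The check passes because both XOR of three inputs and the majority function are self-dual.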
An optimal structure derived from the property
LUT Bits Utilization in Datapath and Logic Modes
Structure of a Single Slice
Complete Logic Block
Modified ALU Like Functionality
Comparison Results
Comparison Results (Cont…)
Comparison Results (cont…)
Coarser ALU-Like Architectures
CHESS Architecture
CHESS ALU Based Logic Block
Structure of a Switch Box
Comparison Results
Computation Field Programmable Architecture
- A heterogeneous architecture with clusters of datapath logic blocks
- Separate LUT-based logic blocks to support random logic mapping
- The basic logic block is called a Partial Adder Subtraction Multiplier (PASM) module

PASM Logic Block of CFPA
Cluster of PASM Logic Blocks
Comparison Results
Some Industry Architecture Designs
Altera APEX II Logic Element
Altera MAX II Logic Element
LE Configuration in Arithmetic Mode
LE in Random Logic Implementation
Altera Stratix Logic Element
Altera Stratix II Architecture
Stratix II Adaptive Logic Module
Stratix II ALM in Arithmetic Mode
Various Configurations in an ALM of Stratix II
Multiplier Resources in Stratix II
Structure of a DSP Block in Stratix II
XILINX Virtex II Pro Architecture
Basic Logic Element of Virtex II Pro
Dedicated Multipliers in Virtex II Pro
Processor-Programmable Logic Coupled Architecture
PiCoGA Architecture Coupled with a VLIW processor
PiCoGA Logic Block
Conclusions




- Traditional general-purpose FPGAs are inefficient for datapath mapping
- Logic blocks with DSP-specific enhancements seem a promising solution
- Coarse-grained logic can achieve better application mapping for datapaths but sacrifices flexibility
- Dedicated blocks (multipliers) increase performance but also increase cost significantly
Conclusions
- PDSPs with embedded FPGA can achieve a good balance between performance and power consumption
- So… which approach is the best? -> No single best exists

Suitability of Approaches
- Highly computationally intensive applications with large amounts of parallelism can use platform FPGAs, where large resources are often required and power consumption is not an issue.
- Here cost/function will be lowest

Suitability of Approaches

- Field Programmable Logic based coprocessors can benefit from coarse-grained blocks, where most control functions are implemented by the PDSP itself
Suitability of Approaches

- Higher flexibility and lower cost can be achieved with logic blocks that have DSP-specific enhancements but keep the flexibility to implement control logic in an efficient manner.