Digital Filtering In
Hardware
Adnan Aziz
Slide 1
Introduction
 Digital filtering vs Analog filtering
– More robust (process variations, temperature),
flexible (bit precision, program), store & recover
– Lower performance (esp high freq), more
area/power, cannot sense, need data-converters
 Can perform digital filtering in hardware or software
– Software (DSP/generic microprocessors): flexible,
less up-front cost
– Hardware (ASIC/FPGA): customized, cheaper in
volume, lower area/power
Slide 2
Applications
 Applications: noise filtering, equalization, image
processing, seismology, radar, ECC, audio/image
compression
 Focus: implementing difference equations
– No feedback – FIR, feedback – IIR
– Assume coefficient synthesis done
– Operate almost exclusively in the time domain
(frequency-domain/FFT methods not covered)
Slide 3
Evolution
Slide 4
Various Representations
 3-tap FIR: y(n) = a·x(n) + b·x(n−1) + c·x(n−2)
 Non-terminating: repeatedly execute the same code
– iteration: execute all operations once; iteration period: time
to perform one iteration; iteration rate: inverse of iteration period
– sampling rate (aka throughput): number of samples per
second, critical path: max combinational delay (no wave
pipelining!)
 Block Diagram
– Close to actual hardware – interconnected functional
blocks, potentially with delay elements between blocks
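The 3-tap FIR above maps directly onto a block diagram: two delay elements feed three constant multipliers and an adder chain. A minimal behavioral sketch (coefficient values and the function name are illustrative, not from the slides):

```python
def fir3(x, a, b, c):
    """Direct-form 3-tap FIR, y(n) = a*x(n) + b*x(n-1) + c*x(n-2).
    The two state variables model the delay elements."""
    d1 = d2 = 0.0          # delay-line registers holding x(n-1), x(n-2)
    y = []
    for xn in x:
        y.append(a * xn + b * d1 + c * d2)
        d2, d1 = d1, xn    # shift the delay line
    return y
```

Feeding an impulse recovers the coefficients as the impulse response, e.g. `fir3([1, 0, 0, 0], 1.0, 2.0, 3.0)` yields `[1.0, 2.0, 3.0, 0.0]`.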
Slide 5
Block Diagram
Slide 6
Block Diagram
Slide 7
Signal Flow Graph
 Unique source, sink (input and output)
– Edges represent const multiplier, delay
– Nodes represent I/O, adder, mult
– Useful for wordlength effects, less for architecture
design
Slide 8
SFG
Slide 9
Dataflow Graph
 DFG
– Nodes: computations (functions, subtasks)
– Edges: datapaths
• Capture data-driven nature of DSP, intraiteration and inter-iteration constraints
 Very general: nonlinear, multirate, asynchronous,
synchronous
 Difference from block diagram:
– Hardware not allocated, scheduled in DFG
Slide 10
DFG
Slide 11
DFG
Slide 12
Multirate DFG
Slide 13
Iteration Bound
 In a DFG, each node executes once per iteration
– Executing all nodes once: one iteration
 Critical path: combinational path with maximum total
execution time (Note: we’re reserving the term delay
for sequential delay)
 Loop (=cycle): path beginning and ending at same
node
– Loop bound for loop L: T_L / W_L (loop computation time / number of delays in the loop)
 Iteration Bound: maximum of all loop bounds
– Lower bound on execution time for DFG
(assuming only pipelining, retiming, unfolding)
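The iteration bound can be computed by enumerating the loops of the DFG and taking the maximum loop bound T_L / W_L. A small sketch, assuming integer node ids; the graph encoding (node times, edge delay counts) is my own, not from the slides:

```python
def iteration_bound(node_time, edges):
    """node_time: {node: computation time}; edges: list of (u, v, delays).
    Enumerates simple cycles by DFS and returns max T_L / W_L."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
    best = 0.0

    def dfs(start, u, t, w, seen):
        nonlocal best
        for v, d in adj.get(u, []):
            if v == start:                       # closed a loop
                if w + d > 0:                    # skip delay-free (combinational) loops
                    best = max(best, t / (w + d))
            elif v not in seen and v > start:    # count each cycle once, from its smallest node
                dfs(start, v, t + node_time[v], w + d, seen | {v})

    for s in sorted(node_time):
        dfs(s, s, node_time[s], 0, {s})
    return best
```

For the classic first-order IIR loop (an adder of time 1 and a multiplier of time 2 around one delay), this gives (1 + 2) / 1 = 3.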
Slide 14
Iteration Bound
Slide 15
Iteration Bound
Slide 16
Iteration Bound
Slide 17
2.3
Slide 18
2.4
Slide 19
2.5
Slide 20
2.6
Slide 21
2.7
Slide 22
Pipeline and Parallelize
 Pipelining: insert delay elements to reduce critical
path length
– Faster (more throughput), lower power
– Added latency, latches/clocking
 Parallelism: compute multiple outputs in a single
clock cycle
– Faster, lower power
– Added hardware, sequencing logic
Slide 23
Pipelining
 General: applicable to microprocessor architectures,
logic circuits, DFGs
– Have to place delays (= flip-flops) carefully
– On "feed-forward" cutsets
Slide 24
Pipelining
Slide 25
Pipelining & Parallel
Slide 26
Pipelining
Slide 27
Feed-forward Cutset
Slide 28
Transposition
Slide 29
Transposition
Slide 30
Data Broadcast
Slide 31
Fine-grain Pipelining
Slide 32
Parallel Processing
 Process blocks of L samples at a time
– Clock period = L * sample period (L samples per clock)
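In an L-parallel system each "clock" consumes L inputs and produces L outputs, so the clock period can be L times the sample period. A sketch for L = 2 applied to the 3-tap FIR y(n) = a·x(n) + b·x(n−1) + c·x(n−2); coefficients and structure here are illustrative:

```python
def fir3_2parallel(x, a, b, c):
    """2-parallel 3-tap FIR: each loop iteration is one hardware clock,
    computing y(2k) and y(2k+1) from two fresh input samples."""
    assert len(x) % 2 == 0
    d1 = d2 = 0.0                            # x(2k-1), x(2k-2) carried across blocks
    y = []
    for k in range(0, len(x), 2):
        x0, x1 = x[k], x[k + 1]
        y.append(a * x0 + b * d1 + c * d2)   # y(2k)
        y.append(a * x1 + b * x0 + c * d1)   # y(2k+1)
        d1, d2 = x1, x0                      # update carried state for next block
    return y
```

Note the extra hardware: two copies of the tap computation run side by side, while the delay-line state advances by two samples per clock.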
Slide 33
Parallelism
Slide 34
Parallelism
Slide 35
Components
Slide 36
Need for Parallelism
Slide 37
Parallelism
 Why not use pipelining?
– May have a single large delay element that
cannot be divided (communication between
chips)
 Can use in conjunction with pipelining
– Relatively less efficient than pipelining (area cost
and power savings)
 Note that we've skirted the issue of parallelizing
general DFGs
– Loops make life hard
Slide 38
Parallelize + Pipeline
Slide 39
Area Efficiency
Slide 40
Pipelining Processors
 Classic DLX processor
– ISA: Load/Store or Mem Access
– 5 stages: IF, ID, EX, MEM, WB
 Pipelining processors is hard
– Data hazards:
• ADD r1, r2, r3; SUB r4, r5, r1
• Solution: Use bypass logic
• LD r1, [r2]; ADD r4, r1, r2
• Solution?
– Branch hazards
• PC not changed till end of ID
• Solution: redo IF (only) if branch taken
 Pipelining DFGs is easy (no control flow!)
Slide 41
Pipelining Processors
Slide 42
Retiming
 Basic idea (for logic circuits)
– Move flops back and forth across gates
– Use for clock period reduction, flop minimization,
power minimization, resynthesis
 Same idea holds for DFGs
– Examples
– Algorithm
– C-slow retiming
Slide 43
Retiming
Slide 44
Retiming
Slide 45
Cutset Retiming
Slide 46
Cutset Retiming
Slide 47
C-Slow Retiming
Slide 48
Min Delay Retiming
 Formalize: use notion of “retiming function” on nodes
– Amount of delay pushed from the outputs to the inputs
of a node (can be negative – think of it as a
retardation function)
 Want to know if cycle time T_C is feasible
– set up constraints
• Long paths have to be broken
• No negative delays on edges
– Solve the resulting ILP
• Its difference-constraint structure admits
efficient graph algorithms
Slide 49
Unfolding
 Analogous to loop unrolling for programs
– for (i = 1; i < 5; i++) { a[i] = b[i] + c[i]; }
– Many benefits, at the price of potential increase in
code size
 Look at 2-unfolding of
– y(n) = x(n) + a y(n−9)
 General algorithm for J-unfolding a DFG
– Uses J nodes for each original node, new delay
values
– Nontrivial fact: algorithm works
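For the example above, 2-unfolding produces two outputs per iteration: y(2k) = x(2k) + a·y(2k−9) and y(2k+1) = x(2k+1) + a·y(2k−8), with the original 9 delays split as 5 and 4 across the two unfolded paths. A behavioral sketch checking that the unfolded program computes the same sequence (the value of a and the test input are illustrative):

```python
def iir_serial(x, a):
    """Original recurrence y(n) = x(n) + a*y(n-9), zero initial state."""
    y = []
    for n, xn in enumerate(x):
        y.append(xn + a * (y[n - 9] if n >= 9 else 0.0))
    return y

def iir_2unfolded(x, a):
    """2-unfolded version: one loop iteration produces y(2k) and y(2k+1)."""
    assert len(x) % 2 == 0
    y = []
    for k in range(len(x) // 2):
        n = 2 * k
        y.append(x[n] + a * (y[n - 9] if n >= 9 else 0.0))           # even output
        y.append(x[n + 1] + a * (y[n - 8] if n + 1 >= 9 else 0.0))   # odd output
    return y
```

Both functions produce identical outputs; the payoff of unfolding is structural (two samples per iteration), not numerical.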
Slide 50
Unfolding
Slide 51
Unfolding
Slide 52
Applications
 Meet iteration bound
– When a single node has large execution time
– When IB is nonintegral
Slide 53
Applications: IB
Slide 54
Application: fractional IB
Slide 55
Applications: Parallelize
 Recall in Chapter 3, we never gave a systematic way
of generating parallel circuits
– Loop unfolding gives a way
Slide 56
Applications: Bit-Digit
 Convert a bit-serial architecture to a digit-serial
architecture
Slide 57
Folding
 Trade area for time
– Use same hardware unit for multiple nodes in the
DFG
 Example: y(n) = a(n) + b(n) + c(n)
 Need general systematic approach to folding
– Math formulation: folding orders, folding sets,
folding factors
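For the example y(n) = a(n) + b(n) + c(n), the two additions can be folded onto a single adder scheduled over two control steps (folding factor N = 2), trading time for area. A behavioral sketch of the folded schedule (names are illustrative):

```python
def folded_sum(a, b, c):
    """One adder time-shared by two additions per sample:
    control step 0 computes t = a(n) + b(n), step 1 computes y(n) = t + c(n)."""
    y = []
    for an, bn, cn in zip(a, b, c):
        t = an + bn        # control step 0: the single adder
        y.append(t + cn)   # control step 1: the same adder, result from a register
    return y
```

In hardware, t lives in a register between the two control steps, and the folding sets/orders decide which operation uses the adder on which step.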
Slide 58
Folding
Slide 59
Folding
Slide 60
Folding
Slide 61
Folding
Slide 62
Folding
Slide 63
Folding
Slide 64
Register Minimization
 Consider DSP program that produces 3 variables:
– a: {1,2,3,4}
– b: {2,3,4,5,6,7}
– c: {5,6,7}
 Number of live variables: {1,2,2,2,2,2,2}
– Intuitively, should be able to get by with 2
registers
 However, DSP programs are periodic
– May have variables live across iterations
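Within one iteration, the register lower bound is the peak number of simultaneously live variables. A sketch of the counting step for the lifetimes on the slide (the dict encoding is my own):

```python
def max_live(lifetimes):
    """lifetimes: {var: (birth, death)} with inclusive endpoints.
    Returns the peak number of simultaneously live variables."""
    times = sorted({t for b, d in lifetimes.values() for t in (b, d)})
    return max(sum(1 for b, d in lifetimes.values() if b <= t <= d)
               for t in times)
```

For a: {1..4}, b: {2..7}, c: {5..7} this gives 2, matching the slide; periodicity (variables live across iterations) can push the true requirement higher, which is what the linear lifetime chart accounts for.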
Slide 65
Linear Lifetime Chart
Slide 66
Lifetime Analysis: Matrix
Slide 67
Lifetime Chart: Matrix
Slide 68
Register Allocation Table
Slide 69
Reg Assignment: Matrix
Slide 70
Reg Assignment: Biquad
Slide 71
Reg Assignment: Biquad
Slide 72
Pipelined & Parallel IIR
Feedback loops make pipelining and parallelism
very hard
– Impossible to beat iteration bound without
rewriting the difference equation
 Example
– Pipeline interleaving of y(n+1) = a y(n) + b u(n)
– Note that IB goes up, but can run multiple
streams in parallel
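Pipeline interleaving in a sketch: with two delays in the loop of y(n+1) = a·y(n) + b·u(n), two independent streams can time-share the same multiply-add hardware, each advancing every other clock. Values and the function name are illustrative:

```python
def interleaved_iir(u_streams, a, b):
    """Two independent streams through the recurrence y(n+1) = a*y(n) + b*u(n),
    alternating clock ticks; returns the two output streams."""
    state = [0.0, 0.0]                     # per-stream y-state
    outs = [[], []]
    n = len(u_streams[0])
    for clk in range(2 * n):               # one stream serviced per clock tick
        s = clk % 2
        state[s] = a * state[s] + b * u_streams[s][clk // 2]
        outs[s].append(state[s])
    return outs
```

Per-stream throughput is halved, but total hardware throughput is unchanged, which is why the iteration bound goes up while multiple streams still run in parallel.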
Slide 73
Pipeline Interleaved IIR
Slide 74
Pipeline Interleaved IIR
Slide 75
Pipeline Interleaved IIR
Slide 76
Pipelining 1-st Order IIR
 y(n) = a y(n−1) + u(n)
– Sample rate set by multiply-and-add time
 Can do better by "look-ahead" pipelining
– Basically, changing the difference equation to get
more delays in the loop
• Key – functionality unchanged
– Best understood in terms of Z-transforms
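One step of look-ahead on y(n) = a·y(n−1) + u(n): substituting the recurrence into itself gives y(n) = a²·y(n−2) + a·u(n−1) + u(n), which has two delays in the loop and so admits one pipeline stage while leaving functionality unchanged. A sketch checking the equivalence (a and the input are illustrative):

```python
def iir1(u, a):
    """Original first-order IIR y(n) = a*y(n-1) + u(n), zero initial state."""
    y, prev = [], 0.0
    for un in u:
        prev = a * prev + un
        y.append(prev)
    return y

def iir1_lookahead(u, a):
    """Look-ahead form y(n) = a^2*y(n-2) + a*u(n-1) + u(n)."""
    y = []
    for n, un in enumerate(u):
        y2 = y[n - 2] if n >= 2 else 0.0
        u1 = u[n - 1] if n >= 1 else 0.0
        y.append(a * a * y2 + a * u1 + un)
    return y
```

The loop now contains the constant a², so the critical multiply-add can be split across two clock cycles.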
Slide 77
Pipelining 1-st Order IIR
Slide 78
Pipelining 1-st Order IIR
Slide 79
Pipelining High Order IIR
 Three basic approaches
– Clustered look-ahead
– Scattered look-ahead
– Direct synthesis with constraints
Slide 80
Pipelining High Order IIR
Slide 81
Pipelining High Order IIR
Slide 82
Pipelining High Order IIR
Slide 83
Pipelining High Order IIR
Slide 84
Pipelining High Order IIR
Slide 85
Pipelining High Order IIR
Slide 86
Pipelining High Order IIR
Slide 87