High Speed FIR Filter Implementation Using Add and Shift Method
Shahnam Mirzaei, Anup Hosangadi, Ryan Kastner
University of California, Santa Barbara
ICCD 2006, San Jose, California, October 2006

UC Santa Barbara, ICCD 2006

Outline
- Introduction
- FIR filter implementation
  - Traditional methods
  - New method: add and shift method and CSE (Common Subexpression Elimination)
- Experiments and results
  - MAC (Multiply Accumulate) implementation
  - DA (Distributed Arithmetic) implementation
  - Resource utilization
  - Power consumption
- Conclusion

Introduction
- Extensive use of FPGAs in computationally intensive applications such as DSP
- More logic resources available in current FPGAs
- Broad applications of FIR filters in multimedia and communications
- Need for efficient design methods to save area and power

Research motivation
- Develop a more efficient implementation method for FIR filters that consumes less area at comparable performance.
- Develop a unified tool for performing redundancy elimination, scheduling and module assignment.
- Perform physically aware optimizations.
- Architecture design exploration for ASIC and FPGA implementations (Distributed Arithmetic based, adder-shifter based, multiplier-adder based).

FIR Filter MAC Implementation
- An L-tap FIR filter computes the convolution of the latest L input samples.
- L is the number of filter coefficients h[k], and x[n] is the input time series.

$$y[n] = \sum_{k=0}^{L-1} h[k]\, x[n-k]$$

[Figure: MAC FIR structure; x[n] is multiplied by the coefficients h_{L-1}, ..., h_0 and the products are combined through a chain of adders and z^{-1} registers to give y[n]]

Disadvantages
- Large area on the FPGA due to the multipliers, even though the full flexibility of general-purpose multipliers is not required
- Limited number of embedded resources such as MAC engines and multipliers in FPGAs

FIR Filter DA (Distributed Arithmetic) Implementation
- DA is an alternative to the MAC implementation and is the most common FPGA FIR implementation, owing to the LUT-rich architecture of FPGAs.
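As a reference point for the alternatives that follow, the L-tap MAC convolution y[n] = Σ h[k] x[n-k] from the MAC slide can be modeled in a few lines of Python. This is only a behavioural sketch: the function name `fir_mac` and the zero initial state (x[n] = 0 for n < 0) are assumptions made for illustration, not something the slides specify.

```python
def fir_mac(h, x):
    """Direct MAC model: y[n] = sum over k of h[k] * x[n-k].

    Assumes x[n] = 0 for n < 0 (hypothetical initial condition).
    """
    L = len(h)
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(L):
            if n - k >= 0:
                acc += h[k] * x[n - k]  # one multiply-accumulate per tap
        y.append(acc)
    return y


if __name__ == "__main__":
    print(fir_mac([1, 2, 3], [1, 0, 0, 4]))
```

Each output sample costs L general-purpose multiplications here, which is the cost the DA and add/shift approaches try to avoid when the coefficients are constants.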
The filter output is an inner product of N coefficients c[n] and N input samples x[n]:

$$y = \sum_{n=0}^{N-1} c[n]\, x[n]$$

Each input x[n] can be written in terms of its bits:

$$x[n] = \sum_{b=0}^{B-1} x_b[n]\, 2^b, \qquad x_b[n] \in \{0, 1\}$$

where x_b[n] is the b-th bit of x[n] and B is the input width. The inner product can be rewritten as follows:

FIR Filter DA (Distributed Arithmetic) Implementation (cont'd)

$$
\begin{aligned}
y &= \sum_{n=0}^{N-1} c[n] \sum_{b=0}^{B-1} x_b[n]\, 2^b \\
  &= c[0]\,\bigl(x_{B-1}[0]\,2^{B-1} + x_{B-2}[0]\,2^{B-2} + \dots + x_0[0]\,2^0\bigr) \\
  &\quad + c[1]\,\bigl(x_{B-1}[1]\,2^{B-1} + x_{B-2}[1]\,2^{B-2} + \dots + x_0[1]\,2^0\bigr) \\
  &\quad + \dots \\
  &\quad + c[N-1]\,\bigl(x_{B-1}[N-1]\,2^{B-1} + x_{B-2}[N-1]\,2^{B-2} + \dots + x_0[N-1]\,2^0\bigr) \\
  &= \bigl(c[0]\,x_{B-1}[0] + c[1]\,x_{B-1}[1] + \dots + c[N-1]\,x_{B-1}[N-1]\bigr)\,2^{B-1} \\
  &\quad + \bigl(c[0]\,x_{B-2}[0] + c[1]\,x_{B-2}[1] + \dots + c[N-1]\,x_{B-2}[N-1]\bigr)\,2^{B-2} \\
  &\quad + \dots \\
  &\quad + \bigl(c[0]\,x_0[0] + c[1]\,x_0[1] + \dots + c[N-1]\,x_0[N-1]\bigr)\,2^0 \\
  &= \sum_{b=0}^{B-1} 2^b \sum_{n=0}^{N-1} c[n]\, x_b[n]
\end{aligned}
$$

where n = 0, 1, ..., N-1 and b = 0, 1, ..., B-1.

DA (Distributed Arithmetic) Implementation: Serial
- n+1 clock cycles are needed for an n-bit-input symmetric filter to generate the output.
- Performance is limited because the next input sample can be processed only after every bit of the current input sample has been processed.
- The tradeoff here is performance for area.

[Figure: A serial DA filter block diagram; the bits x0[i] ... x3[i] and x4[i] ... x7[i] address two LUTs whose outputs are added and fed to a scaling accumulator (adder, shifter and register)]

LUT contents:
Address  Data
0000     0
0001     C0
0010     C1
...      ...
1111     C0+C1+C2+C3

DA (Distributed Arithmetic) Implementation: Parallel
- The performance of the circuit can be improved by changing to a parallel architecture that processes the data bits in groups.
- Increasing the number of bits sampled per clock cycle has a significant effect on FPGA resource utilization.
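The DA reformulation y = Σ_b 2^b Σ_n c[n] x_b[n] can be checked with a small bit-serial model in Python. This is a sketch under stated assumptions: inputs are unsigned B-bit integers, a single LUT covers all N taps (the serial block diagram splits eight taps across two 4-input LUTs instead), and the names `build_da_lut` and `da_inner_product` are hypothetical.

```python
def build_da_lut(c):
    """LUT entry at address a = sum of c[n] for every bit n that is set in a."""
    N = len(c)
    return [sum(c[n] for n in range(N) if (a >> n) & 1) for a in range(1 << N)]


def da_inner_product(c, x, B):
    """Bit-serial DA for unsigned B-bit inputs (assumed): one bit position per cycle."""
    lut = build_da_lut(c)
    acc = 0
    for b in range(B):
        addr = 0
        for n, xn in enumerate(x):
            addr |= ((xn >> b) & 1) << n  # gather bit b of every sample into the LUT address
        acc += lut[addr] << b             # scaling accumulator: weight the LUT output by 2^b
    return acc


if __name__ == "__main__":
    c = [3, -1, 4, 2]
    x = [5, 7, 0, 9]
    print(da_inner_product(c, x, B=4), sum(ci * xi for ci, xi in zip(c, x)))
```

With four coefficients the generated table matches the LUT contents listed on the serial slide: address 0001 holds C0, 0010 holds C1, and 1111 holds C0+C1+C2+C3. The loop over `b` is what makes the serial form slow; a parallel variant would look up several bit positions per cycle at the price of more LUT copies.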
The cost of the added parallelism is more LUTs and a larger scaling accumulator.

[Figure: A 2-bit parallel DA filter block diagram; LUTs addressed by x0[i] ... x7[i] and by x0[i+1] ... x7[i+1] feed the scaling accumulator]

CSE (Common Subexpression Elimination)
- Linear systems can be modeled using polynomials.
- Expressions consist of the operators +, - and <<.
- Polynomial formulation, where L^i denotes a left shift by i bits:

$$C \times X = \sum_i \pm X L^i$$

$$(14)_{10} \times X = (1110)_2 \times X = (X \ll 3) + (X \ll 2) + (X \ll 1) = XL^3 + XL^2 + XL^1$$

CSE Example

$$
\begin{bmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 1 \\
2 & 1 & -1 & -2 \\
1 & -1 & -1 & 1 \\
1 & -2 & 2 & -1
\end{bmatrix}
\begin{bmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{bmatrix}
$$

Written out:
Y0 = X0 + X1 + X2 + X3
Y1 = 2X0 + X1 - X2 - 2X3
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - 2X1 + 2X2 - X3

In polynomial form:
Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 - X2 - X3L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1L + X2L - X3

CSE Example (cont'd)
Candidate subexpressions: D0 = (X0 + X3), D1 = (X1 - X2)

After extracting D0:
Y0 = D0 + X1 + X2
Y1 = X0L + X1 - X2 - X3L
Y2 = D0 - X1 - X2
Y3 = X0 - X1L + X2L - X3

After extracting D1:
Y0 = D0 + X1 + X2
Y1 = X0L + D1 - X3L
Y2 = D0 - X1 - X2
Y3 = X0 - D1L - X3

CSE Example (cont'd)
Candidate subexpressions: D2 = (X1 + X2), D3 = (X0 - X3)

After extracting D2:
Y0 = D0 + D2
Y1 = X0L + D1 - X3L
Y2 = D0 - D2
Y3 = X0 - D1L - X3

CSE Example (cont'd)
Original expressions (12 additions, 4 shifts):
Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 - X2 - X3L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1L + X2L - X3

After CSE (8 additions, 2 shifts):
D0 = X0 + X3
D1 = X1 - X2
D2 = X1 + X2
D3 = X0 - X3
Y0 = D0 + D2
Y1 = D1 + D3L
Y2 = D0 - D2
Y3 = D3 - D1L

FIR Filter Add/Shift Implementation: Replacing Constant Multiplication by Multiplier Block

[Figure: original structure; x[n] is multiplied by each coefficient h_{L-1}, ..., h_0 and the products pass through a chain of adders and z^{-1} registers to y[n]]
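The two ideas this slide builds on, shift-and-add constant multiplication and sharing of common subexpressions, can be sketched in Python. This is an illustrative model rather than the authors' tool: the function names are hypothetical, coefficients are assumed to be plain integers, and the filter is modeled with one product per coefficient feeding a chain of registered adders, which is an assumption about the diagram.

```python
def shift_add_mul(c, x):
    """Multiply x by the integer constant c using only shifts, adds and a negation."""
    mag = -c if c < 0 else c
    acc, shift = 0, 0
    while mag:
        if mag & 1:
            acc += x << shift  # one term X L^shift of the polynomial form
        mag >>= 1
        shift += 1
    return -acc if c < 0 else acc


def fir_add_shift(h, x):
    """Multiplier block (all h[k] * x[n] via shift/add) followed by registered adders."""
    L = len(h)
    regs = [0] * (L + 1)  # regs[1..L-1] model the z^-1 registers; regs[L] stays 0
    y = []
    for xn in x:
        products = [shift_add_mul(c, xn) for c in h]  # multiplier block
        y.append(products[0] + regs[1])
        for k in range(1, L):  # ascending k, so regs[k + 1] is still last cycle's value
            regs[k] = products[k] + regs[k + 1]
    return y


def transform_direct(x0, x1, x2, x3):
    """The CSE example without sharing: 12 additions, 4 shifts."""
    return (x0 + x1 + x2 + x3,
            (x0 << 1) + x1 - x2 - (x3 << 1),
            x0 - x1 - x2 + x3,
            x0 - (x1 << 1) + (x2 << 1) - x3)


def transform_cse(x0, x1, x2, x3):
    """The CSE example with D0..D3 shared: 8 additions, 2 shifts."""
    d0, d1, d2, d3 = x0 + x3, x1 - x2, x1 + x2, x0 - x3
    return (d0 + d2, d1 + (d3 << 1), d0 - d2, d3 - (d1 << 1))


if __name__ == "__main__":
    print(shift_add_mul(14, 5))
    print(fir_add_shift([14, -3, 5], [2, -1, 7]))
    print(transform_direct(3, -1, 4, 7), transform_cse(3, -1, 4, 7))
```

In this sketch every product is decomposed independently; the point of CSE is that a real multiplier block would share partial sums across coefficients in the same way `transform_cse` shares D0 to D3.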
[Figure: add/shift structure; x[n] feeds a single multiplier block of adders that produces y0, y1, y2, ..., y_{L-1}, followed by a delay block of z^{-1} registers and adders that produces y[n]]

FIR Filter Add/Shift Implementation: Registered Adder at no Additional Cost

[Figure: an adder (X + y = s) and a registered adder (X1 + y1 followed by z^{-1}, giving s') mapped onto FPGA logic blocks; each logic block contains a LUT, a carry connection and a D flip-flop, and the registered version takes its outputs s'0, s'1 from the flip-flops]

Extracting Common Subexpressions
F1 = A + B + C + D
F2 = A + B + C + E

[Figure: Optimization; panels titled "Unoptimized Expression Trees", "Extracting Common Expression (A + B + C)" and "Extracting Common Expression (A + B)"]

Synchronization
- Extra registers are needed to synchronize the intermediate values, such that new values for A,B,C,D,E,F can be read in every clock cycle.
- Calculating registers required for fastest evaluation

Experiment Results: Resource Utilization/Performance

Filter implementation using the add and shift method
Taps  Slices  LUTs   FFs    Performance (Msps)
6     264     213    509    251
10    474     406    916    222
13    386     334    749    252
20    856     705    1650   250
28    1294    1145   2508   227
41    2154    1719   4161   223
61    3264    2591   6303   192
119   6009    4821   11551  203
151   7579    6098   14611  180

Filter implementation using Xilinx Coregen (PDA)
Taps  Slices  LUTs   FFs    Performance (Msps)
6     524     774    1012   245
10    781     1103   1480   222
13    929     1311   1775   199
20    1191    1631   2288   199
28    1774    2544   3381   199
41    2475    3642   4748   222
61    3528    5335   6812   199
119   6484    9754   12539  205
151   8274    12525  15988  199

Experiment Results: Resource Utilization

[Figure: Reduction in resources; % reduction in slices, LUTs and FFs versus number of taps]

Experiment Results: Power Consumption

[Figure: Dynamic power consumption; power (mW) for Add/Shift and Coregen versus filter size (number of taps)]

Creating MAC Filters Using Xilinx Coregen
Experiment Results: Comparison with MAC Filters Using Multiplier Blocks

      Add and shift method   MAC filter
Taps  Slices  Msps           Slices  Msps
6     264     296            219     262
10    475     296            418     253
13    387     296            462     253
20    851     271            790     251
28    1303    305            886     251
41    2178    296            1660    243
61    3284    247            1947    242
119   6025    294            3581    241
151   7623    294            7631    215

Experiment Results: Comparison with MAC Filters Using Multiplier Blocks, Resource Utilization

[Figure: Resource utilization; number of slices for MAC and Add and Shift versus number of taps]

Experiment Results: Comparison with MAC Filters Using Multiplier Blocks, Performance

[Figure: Performance; Msps for Add and Shift and MAC versus number of taps]

Conclusion/Observations
- Presented a multiplierless technique, based on the add and shift method and common subexpression elimination, for low-area, low-power and high-speed implementations of FIR filters.
- Validated our techniques on Virtex II/IV devices, where we observed significant area and power reductions over traditional Distributed Arithmetic based techniques.
  - An average reduction of 58.7% in the number of LUTs, and about 25% reduction in the number of slices and FFs
  - Better performance in most cases, even though our algorithm does not optimize for performance
  - Up to 50% reduction in dynamic power consumption
- Higher performance as the filter size increases.
  - The critical path in our design consists of adders, while in the MAC method the critical path consists of multipliers and adders.
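The average reductions quoted in the conclusion can be recomputed from the add/shift and Coregen (PDA) resource table. The sketch below assumes the average is the unweighted mean of the per-filter percentage reductions, which the slides do not state explicitly; under that assumption the LUT figure comes out at about 58.7%, and the slice and FF figures land in the mid-20s. The names `ADD_SHIFT`, `COREGEN` and `mean_reduction_percent` are hypothetical.

```python
from statistics import mean

# Resource counts copied from the table for filters with
# 6, 10, 13, 20, 28, 41, 61, 119 and 151 taps.
ADD_SHIFT = {
    "slices": [264, 474, 386, 856, 1294, 2154, 3264, 6009, 7579],
    "luts": [213, 406, 334, 705, 1145, 1719, 2591, 4821, 6098],
    "ffs": [509, 916, 749, 1650, 2508, 4161, 6303, 11551, 14611],
}
COREGEN = {
    "slices": [524, 781, 929, 1191, 1774, 2475, 3528, 6484, 8274],
    "luts": [774, 1103, 1311, 1631, 2544, 3642, 5335, 9754, 12525],
    "ffs": [1012, 1480, 1775, 2288, 3381, 4748, 6812, 12539, 15988],
}


def mean_reduction_percent(resource):
    """Unweighted mean of 100 * (1 - add_shift / coregen) over the nine filters."""
    pairs = zip(ADD_SHIFT[resource], COREGEN[resource])
    return mean(100.0 * (1.0 - ours / ref) for ours, ref in pairs)


if __name__ == "__main__":
    for name in ("luts", "slices", "ffs"):
        print(f"{name}: {mean_reduction_percent(name):.1f}% average reduction")
```

The per-filter reductions are much larger for the small filters than for the 119- and 151-tap ones, so a size-weighted average would give a noticeably lower number than the unweighted mean used here.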