A Power Efficient Image Convolution Engine for Field Programmable Gate Arrays
Matthew French, [email protected]
University of Southern California, Information Sciences Institute
3811 North Fairfax Dr, Suite 200, Arlington, VA 22203
MAPLD 2004
Xilinx FPGA Power Trend

[Chart: number of logic blocks, maximum operating frequency (MHz), power (mW), and voltage plotted as figures of merit on a log scale across the Xilinx families XC4000E, XC4000XL, XC4000XV, Virtex, Virtex-E, and Virtex-II]

• Number of logic blocks and maximum operating frequency both loosely track Moore's Law
• Voltage reduction is slower
• Resulting power increase is exponential!

Power Sensitive Applications

• Need to consider power as a first-class design constraint
• SRAM-based FPGA: quiescent power depends on total circuit size
• Dynamic power depends on
  – Toggle rates (data dependent)
  – Components used
  – Routing
• Actual quiescent and dynamic power are not known until the circuit is placed and routed
  – For high accuracy, further simulation on a timing model is necessary
• Tools do timing-driven placement and routing
• So how does one design for low power?

Virtex-II Component Power Profile

• Derive micro-architecture feature capacitances from
  – Xilinx power estimation spreadsheets
  – XPower designs
  – Power monitoring testbed
  – Shang, Kaviani, Bathala, "Dynamic Power Consumption in Virtex-II FPGA Family," FPGA '02
• Only trying to establish relative capacitances
  – Models too imprecise to be exact
• Derive low-power design strategy
  – Minimize multipliers
  – Use shortest interconnect

[Chart: mW per component versus frequency (MHz, 20% toggle) for flip-flop, shift register, LUT, Block SelectRAM, and multiplier]

  Resource               Capacitance (pF)
  Embedded Multiplier    1,196
  Block SelectRAM          880
  CLB                       26
  Long-line Route           23
  Hex-line Route            18
  Double-line Route         13
  Direct-Connect Route       5

Traditional Image Convolution

• Slide the tap mask over the image
• Multiply each pixel: Data 1–9 × Tap 1–9 gives partial products PP 1–9
• Sum all partial products, resulting in the new filtered pixel (Out 1)
• Operations: 9 multiplies and 9 additions per output pixel

Straight Forward Implementation

• 3x3 kernel = 9 parallel multipliers
• Multipliers are resource limited in FPGAs
  – Virtex-E: instanced in configurable logic; XCV3200E: ~81 multipliers max; 9 pixels in parallel
  – Virtex-II: embedded multiplier blocks; XC2V8000: 168 multipliers; 18 pixels in parallel
• Adder trees are relatively cheap
  – 100's of slices
  – XCV3200E: 32,000 slices
  – XC2V8000: 46,000 slices
• This also reflects the power prioritization
[Diagram: three input rows feed a bank of registers whose outputs are multiplied by taps 1–9; the nine products are summed in a binary adder tree to produce the output]

Convolution Kernel Types: Closer Look

• Spatial filtering
  – Blurring, smoothing (lowpass)
  – Sharpening (highpass)
  – Noise reduction
  – Edge detection

  Smoothing filter (1 unique tap value):
    1/9 1/9 1/9
    1/9 1/9 1/9
    1/9 1/9 1/9
  Sharpening filter (2 unique tap values):
    -1 -1 -1
    -1  8 -1
    -1 -1 -1
  Edge detection filter (2 unique tap values):
    -1 -1 +1
    -1 +1 -1
    +1 -1 -1

• Derivative filters
  – Roberts
  – Prewitt
  – Sobel

  Prewitt basis (3 unique tap values):
    -1 -1 -1        -1  0 +1
     0  0  0        -1  0 +1
    +1 +1 +1        -1  0 +1
  Sobel basis (5 unique tap values):
    -1 -2 -1        -1  0 +1
     0  0  0        -2  0 +2
    +1 +2 +1        -1  0 +1

• Filter tap values are reused often
• Can we exploit this?
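To make the operation count and the tap-reuse observation concrete, the following minimal C sketch (an illustration added here, not taken from the slides) convolves one pixel the traditional way, using 9 multiplies and 9 additions, and counts the unique tap values in a mask; the sharpening filter above has only two.

#include <stdio.h>

/* Illustrative sketch: a traditional 3x3 convolution of one output pixel
 * (9 multiplies, 9 additions into the accumulator), plus a count of the
 * unique tap values in the mask, which the kernel described later exploits. */

static int convolve_pixel(const int data[9], const float taps[9])
{
    float acc = 0.0f;
    for (int i = 0; i < 9; i++)
        acc += data[i] * taps[i];     /* 9 multiplies, 9 additions */
    return (int)acc;
}

static int count_unique_taps(const float taps[9])
{
    int unique = 0;
    for (int i = 0; i < 9; i++) {
        int seen = 0;
        for (int j = 0; j < i; j++)
            if (taps[j] == taps[i]) { seen = 1; break; }
        if (!seen)
            unique++;
    }
    return unique;
}

int main(void)
{
    /* Sharpening filter from the slide: two unique tap values (-1 and 8). */
    const float sharpen[9] = { -1, -1, -1,  -1, 8, -1,  -1, -1, -1 };
    const int   window[9]  = { 10, 10, 10,  10, 50, 10,  10, 10, 10 };

    printf("filtered pixel = %d\n", convolve_pixel(window, sharpen));
    printf("unique taps    = %d\n", count_unique_taps(sharpen));
    return 0;
}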
1-D Symmetric FIR Filter Lessons

• Telecommunication and radar communities
  – Exploit symmetric filters
  – Reorder additions before the multiplication
  – Only 1/2 the multipliers are necessary
• Can we exploit 2-D symmetry?
  – Tap values reprogrammable
  – Tap symmetry reprogrammable
• Minimize multipliers
• Leverage the large amount of configurable logic blocks
• Benefits of increased parallelism
  – Higher throughput
  – More efficient power utilization over time
[Diagram: symmetric FIR structure with input x(k), delay registers, pre-adders, and coefficients C0–C3 where C(k) = C(K-(k+1)); output y(k)]

Key Ideas

• Number of active multipliers varies with the tap mask
  – Turn off unused multipliers for lower power
  – Or use the unused multipliers to process the next pixel
    • Requires parallel memory accesses
    • Higher throughput
    • Finish sooner, then sleep the device
    • Lower clock rate
• Adder tree layers before and after the multiply vary with the number of multipliers per pixel
• Input data must be routable to each multiplier
• Will the multiplier savings outweigh the extra routing, multiplexing, and larger circuit quiescent power?

Adaptive Convolution Kernel Sizing

• Implementing the multiple-pixel version
• How many multipliers to use?
  – A multiple of 9
  – A size that is easy to place and allows for TMR growth
• 18 multipliers per kernel

  Number of Unique Taps    Number of Masks    Speedup Over
  in 3x3 Conv. Mask        per Kernel         Traditional Convolution
  1                        18                 9x
  2                         9                 4.5x
  3                         6                 3x
  4                         4                 2x
  5                         3                 1.5x
  6                         3                 1.5x
  7-9                       2                 1x

Kernel Block Diagram

[Block diagram: a register delay bank buffers input rows 0–2; a data mux groups data values with common taps; pre-multiply adder trees feed the 18 multipliers; a common-tap mux and an output adder tree produce outputs 0–17; a state machine driven by the number of unique taps, the tap mask, and the tap values controls the datapath]

• Group data values with common taps (illustrated in the software sketch after the energy comparison below)
• Dynamically adjust the multiplier position within the adder tree

Implementation Comparison

                           Baseline     Power Efficient
  Flip-Flops               6,435        7,231
  LUTs                     8,141        9,181
  Block RAMs               6            20
  Multipliers              18           18
  Operating Frequency      100 MHz      100 MHz
  Quiescent Power (mW)     711.6        965.0

• Quiescent power is 35% higher for the power-efficient design

Total Energy Comparison

[Chart: energy (mJ) per 512 x 512 image and energy improvement factor for the baseline kernel and for the power-efficient kernel with 9 down to 1 unique taps]

• For higher tap commonality, the shorter dynamic power consumption window overcomes the higher quiescent power
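The tap-grouping datapath of the kernel block diagram can be mimicked in software. The sketch below (an illustration added here, not the hardware implementation) pre-adds all pixels that share a tap value and then performs one multiply per unique tap, so a two-unique-tap mask needs 2 multiplies instead of 9.

#include <stdio.h>

/* Illustrative software analogue of the tap-grouping kernel: pre-add all
 * pixels that share a tap value (the pre-multiply adder trees), then do one
 * multiply per unique tap (the common-tap path through the multipliers). */

static int convolve_grouped(const int data[9], const float taps[9], int *mults)
{
    float unique_tap[9];
    float group_sum[9];
    int   n_unique = 0;

    for (int i = 0; i < 9; i++) {
        int g = -1;
        for (int j = 0; j < n_unique; j++)           /* find this tap's group */
            if (unique_tap[j] == taps[i]) { g = j; break; }
        if (g < 0) {                                  /* new unique tap value  */
            g = n_unique++;
            unique_tap[g] = taps[i];
            group_sum[g]  = 0.0f;
        }
        group_sum[g] += data[i];                      /* pre-multiply adder tree */
    }

    float acc = 0.0f;
    for (int j = 0; j < n_unique; j++)
        acc += group_sum[j] * unique_tap[j];          /* one multiply per unique tap */

    *mults = n_unique;
    return (int)acc;
}

int main(void)
{
    const float sharpen[9] = { -1, -1, -1,  -1, 8, -1,  -1, -1, -1 };
    const int   window[9]  = { 10, 10, 10,  10, 50, 10,  10, 10, 10 };
    int mults = 0;

    int out = convolve_grouped(window, sharpen, &mults);
    printf("filtered pixel = %d, multiplies used = %d\n", out, mults);
    return 0;
}

With 18 physical multipliers per kernel, a two-unique-tap mask therefore leaves room for 9 masks in parallel, which is the 4.5x entry in the sizing table above, assuming the memory system can supply the extra pixels.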
What is hard?

• Poor tool support for power design
  – Analyzing power trade-offs can be complex and time consuming
  – Fully routed and simulated designs are needed to compare approaches
  – The router is optimized for throughput, not power
    • Relative placement macros can be used to "help" the router
• Finding all chip enables to disable
  – For each of several different multiplexer settings
  – Finding where they are can be time consuming
• Secondary power effects

Analysis

• For higher tap commonality, the shorter dynamic power consumption window overcomes the higher quiescent power
  – The crossover point at 7 taps is an implementation limitation of using 18 multipliers in the kernel
• Quiescent power
  – Not much larger considering the extra circuitry (18 adder trees, 16 Block RAMs)
• Dynamic power consumption
  – Observed to vary by as much as 50% within one circuit from one place and route to another, even using the same settings
  – An average of 3 routes was used for each circuit
• For systems where parallelizing the input data stream is difficult
  – Disabling the extra multipliers is the best approach
  – Power savings are expected to be less

Conclusions

• Substantial power savings can be achieved by making power a first-class design constraint
• Knowledge of the underlying resource capacitance is a key foundation
  – Re-use power-critical components
• Routing can be influenced to yield lower power
  – Over-constrain timing on power-sensitive nets
  – Use relative placement macros (RPMs)
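As a closing back-of-the-envelope illustration of the analysis above, the sketch below models energy per image as power times active time, assuming the device sleeps once the image is done. The quiescent powers are the measured values from the implementation comparison; the dynamic powers, the 2-pixels-per-clock baseline assumption, and the resulting times are hypothetical placeholders, not measured results.

#include <stdio.h>

/* Back-of-the-envelope energy model of the trade-off discussed above.
 * Quiescent powers come from the implementation comparison (711.6 mW vs
 * 965.0 mW at 100 MHz).  The dynamic powers and the baseline throughput
 * assumption are HYPOTHETICAL placeholders, chosen only to show the shape
 * of the trade-off, not measured results. */

static double energy_mj(double p_quiescent_mw, double p_dynamic_mw,
                        double t_active_s)
{
    /* mW x s = mJ; all power is charged only for the active window,
       assuming the device sleeps after the image is finished. */
    return (p_quiescent_mw + p_dynamic_mw) * t_active_s;
}

int main(void)
{
    /* Assumed baseline: 2 pixels per 100 MHz clock for a 512x512 image. */
    const double t_baseline_s = (512.0 * 512.0 / 2.0) / 100e6;
    const double speedup      = 4.5;   /* 2-unique-tap mask, 18 multipliers */

    /* Dynamic power values below are illustrative assumptions only. */
    double e_base = energy_mj(711.6, 1500.0, t_baseline_s);
    double e_eff  = energy_mj(965.0, 1800.0, t_baseline_s / speedup);

    printf("baseline        : %.2f mJ per image\n", e_base);
    printf("power efficient : %.2f mJ per image\n", e_eff);
    printf("improvement     : %.1fx\n", e_base / e_eff);
    return 0;
}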