A Power Efficient Image Convolution Engine for
Field Programmable Gate Arrays
Matthew French
[email protected]
University of Southern California, Information Sciences Institute
3811 North Fairfax Dr, Suite 200
Arlington, VA 22203
MAPLD 2004
Xilinx FPGA Power Trend
[Chart: Figure of Merit (log scale, 1 to 100,000) for Logic Blocks, Frequency (MHz), Power (mW), and Voltage across the Xilinx families XC4000E, XC4000XL, XC4000XV, Virtex, Virtex-E, and Virtex-II]
• Number of Logic Blocks & Maximum Operating Frequency both loosely track Moore's Law
• Voltage Reduction is Slower
• Resulting Power Increase is Exponential!
Power Sensitive Applications
• Need to consider power as a first-class design constraint
• SRAM-based FPGA quiescent power depends on total circuit size
• Dynamic Power depends on:
  – Toggle Rates (Data Dependent)
  – Components Used
  – Routing
• Actual Quiescent and Dynamic Power are not known until the circuit is placed and routed
  – For high accuracy, further simulation on the timing model is necessary
  – Tools do timing-driven placement and routing
• So how does one design for low power?
Virtex-II Component Power Profile
• Derive micro-architecture feature capacitances from:
  – Xilinx Power Estimation Spreadsheets
  – XPower Designs
  – Power Monitoring Testbed
  – Shang, Kaviani, Bathala, "Dynamic Power Consumption in Virtex-II FPGA Family," FPGA '02
• Only trying to establish relative capacitances
  – Models too imprecise to be exact
• Derive a Low-Power Design Strategy
  – Minimize Multipliers
  – Use Shortest Interconnect
[Chart: mW per component (log scale, 0.001 to 10) vs. Frequency (0 to 100 MHz, 20% toggle) for Flip-Flop, Shift Reg, LUT, Block Select RAM, and Multiplier]
Resource                 Capacitance (pF)
Embedded Multiplier      1,196
Block Select RAM         880
CLB                      26
Long-line Route          23
Hex-line Route           18
Double-line Route        13
Direct-Connect Route     5
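As a rough sanity check of how these relative capacitances translate into the design strategy above, the sketch below applies the standard CMOS switching-power relation P ≈ C · V² · f · α to the table. The supply voltage, clock frequency, and toggle rate are illustrative assumptions, not values from this work; only the relative ranking of components is meaningful.

```python
# Minimal sketch: rank Virtex-II resources by estimated dynamic power
# using P_dyn ~= C * V^2 * f * toggle_rate (standard CMOS switching model).
# Voltage, frequency, and toggle rate below are illustrative assumptions.

CAPACITANCE_PF = {           # from the capacitance table above
    "Embedded Multiplier": 1196,
    "Block Select RAM": 880,
    "CLB": 26,
    "Long-line Route": 23,
    "Hex-line Route": 18,
    "Double-line Route": 13,
    "Direct-Connect Route": 5,
}

def dynamic_power_mw(cap_pf, vdd=1.5, freq_mhz=100, toggle=0.20):
    """Estimate dynamic power in mW for one instance of a resource."""
    cap_f = cap_pf * 1e-12            # pF -> F
    freq_hz = freq_mhz * 1e6          # MHz -> Hz
    return cap_f * vdd**2 * freq_hz * toggle * 1e3   # W -> mW

base = dynamic_power_mw(CAPACITANCE_PF["CLB"])
for name, cap_pf in sorted(CAPACITANCE_PF.items(), key=lambda kv: -kv[1]):
    p = dynamic_power_mw(cap_pf)
    print(f"{name:22s} {p:7.3f} mW  ({p / base:5.1f}x a CLB)")
```

The ordering is what drives the strategy: multipliers and Block Select RAMs dominate, so they are the components to minimize or reuse first.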
Traditional Image Convolution
[Diagram: Input Data (Data 1-9) is multiplied element-wise by the Tap Mask (Tap 1-9) to form Partial Products (PP 1-9), which are summed into the filtered Output pixel (Out 1)]
• Slide the Tap Mask over the Image
• Multiply each pixel by its tap value
• Sum all Partial Products, resulting in a new Filtered Pixel
• Operations: 9 Multiplies & 9 Additions per Output Pixel
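For reference, a minimal software model of the traditional convolution described above; the function and variable names are my own, and it simply makes the 9-multiply, 9-add cost per output pixel explicit.

```python
# Minimal sketch of traditional 3x3 image convolution:
# slide the tap mask over the image, multiply each pixel by its tap,
# and sum the nine partial products to form one output pixel.

def convolve3x3(image, taps):
    """image: 2-D list of pixel values; taps: 3x3 list of tap values."""
    rows, cols = len(image), len(image[0])
    output = [[0] * (cols - 2) for _ in range(rows - 2)]
    for r in range(rows - 2):
        for c in range(cols - 2):
            acc = 0
            for i in range(3):          # 9 multiplies and 9 additions
                for j in range(3):      # per output pixel
                    acc += image[r + i][c + j] * taps[i][j]
            output[r][c] = acc
    return output

# Example: 3x3 smoothing (averaging) filter on a small test image
smooth = [[1/9] * 3 for _ in range(3)]
test = [[10, 20, 30, 40],
        [10, 20, 30, 40],
        [10, 20, 30, 40],
        [10, 20, 30, 40]]
print(convolve3x3(test, smooth))   # 2x2 filtered image
```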
Straightforward Implementation
[Schematic: Input Rows 1-3 feed registered pixels that are multiplied by Taps 1-9 in nine parallel multipliers; the nine products feed a binary adder tree that produces the Output]
• 3x3 Kernel = 9 parallel multipliers
• Multipliers are resource limited in FPGAs
  – Virtex-E: instanced in configurable logic
    • XCV3200E: ~81 Multipliers Max, 9 Pixels in Parallel
  – Virtex-II: Embedded Multiplier Blocks
    • XC2V8000: 168 Multipliers, 18 Pixels in Parallel
• Adder Trees are Relatively Cheap
  – 100's of slices
  – XCV3200E: 32,000 slices
  – XC2V8000: 46,000 slices
• This also reflects Power Prioritization
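The parallelism figures above follow from dividing the available multipliers by the nine required per 3x3 kernel; a quick arithmetic check (device multiplier counts taken from this slide):

```python
# Quick check of the parallel-pixel figures quoted above:
# each 3x3 kernel needs 9 multipliers, so pixels processed in
# parallel = available multipliers // 9.
MULTS_PER_KERNEL = 9
devices = {"XCV3200E": 81, "XC2V8000": 168}   # multipliers per the slide
for name, mults in devices.items():
    print(f"{name}: {mults // MULTS_PER_KERNEL} pixels in parallel")
# XCV3200E: 9 pixels in parallel, XC2V8000: 18 pixels in parallel
```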
Convolution Kernel Types: Closer Look
• Spatial Filtering
  – Blurring, Smoothing (Lowpass)
  – Sharpening (Highpass)
  – Noise Reduction
  – Edge Detection

  Smoothing Filter          Sharpening Filter         Edge Detection Filter
  (1 Unique Tap Value)      (2 Unique Tap Values)     (2 Unique Tap Values)
   1/9 1/9 1/9               -1 -1 -1                  -1 -1 +1
   1/9 1/9 1/9               -1  8 -1                  -1 +1 -1
   1/9 1/9 1/9               -1 -1 -1                  +1 -1 -1

• Derivative Filters
  – Roberts
  – Prewitt
  – Sobel

  Prewitt Basis (3 Unique Tap Values)    Sobel Basis (5 Unique Tap Values)
   -1 -1 -1    -1  0 +1                   -1 -2 -1    -1  0 +1
    0  0  0    -1  0 +1                    0  0  0    -2  0 +2
   +1 +1 +1    -1  0 +1                   +1 +2 +1    -1  0 +1

• Filter Tap Values Reused Often
• Can We Exploit This?
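The quantity the rest of the design exploits is simply the number of distinct tap values in a mask. A small sketch, with mask definitions restated from this slide and names of my own choosing, that reproduces the unique-tap counts:

```python
# Count distinct tap values in a 3x3 mask -- the quantity the
# power-efficient kernel exploits. Mask definitions follow the slide.
def unique_taps(mask):
    return len({tap for row in mask for tap in row})

smoothing  = [[1/9, 1/9, 1/9], [1/9, 1/9, 1/9], [1/9, 1/9, 1/9]]
sharpening = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]
prewitt_y  = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]
sobel_y    = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

for name, mask in [("smoothing", smoothing), ("sharpening", sharpening),
                   ("prewitt", prewitt_y), ("sobel", sobel_y)]:
    print(f"{name}: {unique_taps(mask)} unique tap value(s)")
# smoothing: 1, sharpening: 2, prewitt: 3, sobel: 5
```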
1-D Symmetric FIR Filter Lessons
• Telecommunication and Radar Communities
  – Exploit Symmetric Filters
  – Reorder Additions Before Multiplication
  – 1/2 the Multipliers Necessary
• Can We Exploit 2-D Symmetry?
  – Tap Values Reprogrammable
  – Tap Symmetry Reprogrammable
• Minimize Multipliers
• Leverage the Large Number of Configurable Logic Blocks
• Benefits of Increased Parallelism
  – Higher Throughput
  – More Efficient Power Utilization Over Time
[Schematic: symmetric FIR filter on input x(k) with tap symmetry C(k) = C(K-(k+1)); samples from opposite ends of the delay line are pre-added, multiplied by coefficients C0-C3, and summed to form y(k)]
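A behavioral sketch of the 1-D pre-add trick referenced above: for a symmetric filter with C(k) = C(K-(k+1)), the two samples that share each coefficient are added first, so only about half the multiplies are needed per output. This is a software model under my own naming, not the hardware structure on the slide.

```python
# Symmetric FIR: because c[k] == c[K-1-k], add the two samples that
# share each coefficient first, then multiply once -- roughly halving
# the number of multipliers needed per output sample.
def symmetric_fir(x, c):
    """x: input samples; c: full symmetric coefficient list (c[k] == c[-1-k])."""
    K = len(c)
    half = (K + 1) // 2
    y = []
    for n in range(K - 1, len(x)):
        window = x[n - K + 1:n + 1][::-1]       # x(n), x(n-1), ..., x(n-K+1)
        acc = 0
        for k in range(half):                   # ~K/2 multiplies per output
            j = K - 1 - k
            pre = window[k] if k == j else window[k] + window[j]
            acc += c[k] * pre
        y.append(acc)
    return y

coeffs = [1, 3, 3, 1]                            # symmetric: c[0]=c[3], c[1]=c[2]
samples = [0, 1, 2, 3, 4, 5]
print(symmetric_fir(samples, coeffs))            # matches a direct-form FIR
```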
Key Ideas
• Number of Active Multipliers Varies with the Tap Mask
  – Turn off unused Multipliers – lower power
  – Or, use unused Multipliers to process the next pixel
    • Requires parallel memory accesses
    • Higher throughput
    • Finish sooner – sleep the device
    • Lower Clock Rate
• Adder Tree layers before and after the multiply vary with the number of Multipliers per pixel
• Input Data must be routable to each multiplier
• Will multiplier savings outweigh the extra routing, multiplexing, and larger circuit quiescent power?
Adaptive Convolution Kernel Sizing
• Implementing a Multiple-Pixel Version
• How Many Multipliers to Use?
  – Multiple of 9
  – Size that is easy to place and allows for TMR growth
• 18 Multipliers Per Kernel

  Unique Taps in 3x3 Conv. Mask    Masks per Kernel    Speedup Over Traditional Convolution
  1                                18                  9x
  2                                9                   4.5x
  3                                6                   3x
  4                                4                   2x
  5                                3                   1.5x
  6                                3                   1.5x
  7-9                              2                   1x
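The table follows from simple arithmetic: with 18 multipliers per kernel, the number of 3x3 masks that fit is 18 divided by the number of unique taps (rounded down), and the speedup is measured against a traditional kernel that spends its 18 multipliers on 2 pixels. A sketch of my own that reproduces the figures:

```python
# Reproduce the kernel-sizing table: with 18 multipliers per kernel,
# masks_per_kernel = 18 // unique_taps, and the speedup is relative to
# a traditional kernel that uses its 18 multipliers for 2 pixels.
MULTIPLIERS_PER_KERNEL = 18
TRADITIONAL_MASKS = MULTIPLIERS_PER_KERNEL // 9   # = 2 pixels at a time

print("unique taps  masks/kernel  speedup")
for unique_taps in range(1, 10):
    masks = MULTIPLIERS_PER_KERNEL // unique_taps
    speedup = masks / TRADITIONAL_MASKS
    print(f"{unique_taps:11d}  {masks:12d}  {speedup:5.1f}x")
```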
Kernel Block Diagram
[Block diagram: Input Rows 0-2 feed a Register Delay Bank; a Data Mux and a Common Tap Mux, controlled by a State Machine programmed with the Number of Unique Taps, Tap Mask, and Tap Value, group data values with common taps ahead of the multipliers; the products pass through Adder Trees and an Output Adder Tree, dynamically adjusting multiplier position within the adder tree, to produce Outputs 0-17]
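A behavioral model of the grouping the diagram performs: pixels under the mask that share a tap value are pre-summed, so only one multiply per unique tap is needed rather than nine per pixel; this is the 2-D analogue of the symmetric-FIR pre-add. The code is my own restatement of the data flow, not the RTL.

```python
# Group-by-common-tap convolution: pre-add all pixels that share a tap
# value, then do one multiply per unique tap (instead of 9 per pixel).
from collections import defaultdict

def convolve_pixel_grouped(window, taps):
    """window, taps: 3x3 lists. Returns (output value, multiplies used)."""
    groups = defaultdict(int)
    for i in range(3):
        for j in range(3):
            groups[taps[i][j]] += window[i][j]   # pre-add common-tap pixels
    out = sum(tap * acc for tap, acc in groups.items())
    return out, len(groups)                      # multiplies == unique taps

sharpening = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]
window = [[12, 10, 11], [13, 50, 12], [11, 10, 12]]
value, mults = convolve_pixel_grouped(window, sharpening)
print(value, "using", mults, "multiplies")       # 2 multiplies, not 9
```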
Implementation Comparison
                          Baseline     Power Efficient
  Flip-Flops              6,435        7,231
  LUTs                    8,141        9,181
  Block RAMs              6            20
  Multipliers             18           18
  Operating Frequency     100 MHz      100 MHz
  Quiescent Power (mW)    711.6        965.0

Quiescent Power 35% Higher
Total Energy Comparison
[Bar chart: Energy (mJ) per 512 x 512 Image and Energy Improvement Factor (0 to 8) for the Baseline kernel and Power Efficient kernels with 9 down to 1 unique taps]
For Higher Tap Commonality, the Shorter Dynamic Power Consumption Window Overcomes the Higher Quiescent Power
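The shape of this trade-off can be illustrated with the quiescent figures from the comparison table and the speedups from the kernel-sizing table. The dynamic power value in the sketch below is an assumed placeholder, not a measured number, so only the trend is meaningful.

```python
# Energy per image ~= (quiescent + dynamic power) * processing time.
# The power-efficient kernel draws more quiescent power (965.0 mW vs
# 711.6 mW) but finishes (18 // unique_taps) / 2 times sooner, so total
# energy can still drop. DYNAMIC_MW and the normalized processing time
# are illustrative assumptions, not measured figures.
BASELINE_QUIESCENT_MW = 711.6
EFFICIENT_QUIESCENT_MW = 965.0
DYNAMIC_MW = 500.0        # assumed for illustration only
BASELINE_TIME_S = 1.0     # normalized baseline processing window

def energy_mj(quiescent_mw, dynamic_mw, time_s):
    return (quiescent_mw + dynamic_mw) * time_s   # mW * s = mJ

baseline = energy_mj(BASELINE_QUIESCENT_MW, DYNAMIC_MW, BASELINE_TIME_S)
for unique_taps in range(1, 10):
    speedup = (18 // unique_taps) / 2
    efficient = energy_mj(EFFICIENT_QUIESCENT_MW, DYNAMIC_MW,
                          BASELINE_TIME_S / speedup)
    print(f"{unique_taps} unique tap(s): "
          f"{baseline / efficient:4.2f}x energy improvement")
```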
What is hard?
• Poor Tool Support for Power-Aware Design
  – Analyzing power trade-offs can be complex & time consuming
  – Fully routed and simulated designs are required to compare approaches
  – Router is optimized for throughput, not power
• Finding all Chip Enables to Disable
  – For each of several different multiplexer settings
• Secondary Power Effects
  – Can also use Relative Placement Macros to "help" the Router
  – Finding where they help can be time consuming
Analysis
• For Higher Tap Commonality, the Shorter Dynamic Power Consumption Window Overcomes the Higher Quiescent Power
  – The crossover point at 7 taps is an implementation limitation of using 18 multipliers per kernel
• Quiescent Power
  – Not much larger considering the extra circuitry (18 Adder Trees, 16 Block RAMs)
• Dynamic Power Consumption
  – Observed to vary by as much as 50% within one circuit from one place and route to another, even using the same settings
  – An average of 3 routes was used for each circuit
• For Systems Where Parallelizing the Input Data Stream Is Difficult
  – Disabling extra Multipliers is the best approach
  – Power savings are expected to be less
Conclusions
• Substantial Power Savings can be Achieved by Making Power a First-Class Design Constraint
• Knowledge of Underlying Resource Capacitance is a Key Foundation
  – Re-use Power-Critical Components
• Routing Can Be Influenced to Yield Lower Power
  – Over-constrain timing on power-sensitive nets
  – Use Relative Placement Macros (RPMs)