Image Processing With FPGAs Zach Fuchs Sarit Patel EEL6935

Download Report

Transcript Image Processing With FPGAs Zach Fuchs Sarit Patel EEL6935

Image Processing With FPGAs
Zach Fuchs
Sarit Patel
EEL6935
14 April 2008
FPGA-Based Configurable Systolic
Architecture for Window-Based
Image Processing
Authors:
César Torres-Huitzil
Miguel Arias-Estrada
Introduction
• Image processing is a fundamental step in
modern machine vision systems.
• Many complex algorithms use lower level
results to pursue higher level goals.
– e.g.: edge detection to determine object
• Real time performance in video applications is
usually required.
Difficulty Building Systems
• Most computer vision applications are
computationally intensive
– Sequential nature of conventional processors slow
down performance
• Different computations in processing limits
parallelization
• Real time performance is required
Sample Applications
•
•
•
•
•
•
Robotics
Multimedia
Virtual reality
Industrial inspection
Medical engineering
Autonomous navigation
Goals of Paper
• Design 2D systolic architecture for windowbased image processing
• Consider design issues:
–
–
–
–
–
Flexibility
Silicon area
Power consumption
Performance
Area
Window-Based Image Processing
• Large number of repetitive neighbor operations
over image data
• Area of w x w pixels extracted from image
• Transformed according to window mask and
mathematical functions
• Produce single, new output according to
transform
Windows-Based Image Processing
2
1
3
Window-Based Operators
• Same scalar function applied on a pixel by
pixel basis
• Scalar functions
– e.g.: relational, arithmetic, logical, look up tables
• Reduction functions
– Reduce window of results from scalar function to
one output
– e.g.: accumulation, maximum, absolute value
Computational Requirements
• Window-based operations are computationally
expensive tasks
• Focusing on convolution
– Convolution - the amount of overlap between f and
a reversed and translated version of g
• In general, complexity = O(w^2 x M x N)
– w x w window mask
– M x N image
Data Transfer Rate
• Must transfer data between image acquisition
module, memory, and processor
• Input Data Transfer Rate
• Output Data Transfer Rate
– b = # of bits per pixel
– fF = processing rate of images per second
• Requires efficient use of communication
bandwidth and parallel processing
Implementation Technology: FPGA
• Provides massive parallel structures and high density
for logic arithmetic
• Tasks implemented by spatially rather than
temporally
• Possible to control at bit level to build specialized
data paths
• Offer more raw computational power compared to
conventional processors
• Shorter design cycles than ASICs
• Well suited for implementing parallel architectures.
Memory Accesses
• Gap between processor speed and memory
access speed
– Memory access overhead critical issue
• Window-based operations are memory
intensive; require new pixel in each step
• High potential for parallelism since
independent operations are applied to large
regions of image arrays
Memory Accesses
• Pixels might not be stored as neighboring
elements
– Parallelism is hidden
• Windows usually overlap with neighboring
windows
• Must create vectors of data elements and
process them using parallel vectorization
techniques.
Overlapping Windows
• Three windows shown; shaded box indicates
overlapping data.
Overlapping Windows
• Some pixels can be used in computation of all
three windows
• Reduce memory accesses for those pixels by a
factor of 3
• Large number of windows means less overlap
• Must compromise between data overlap and
window count
Data Parallelism
• Can be combined with loop unrolling to
diminish memory accesses for sequential
accesses
• Process one window, then slide to the right and
process next
• Unroll this loop so more windows are
computed in parallel
• Authors use vertical unrolling
– Can apply to horizontal unrolling equally
Data Parallelism
• Number of pixels read per column is directly
dependent on number of rows processed in
parallel
• Number of pixels read = w + NR – 1
– w = windows mask length/width
– NR = rows processed
• Number of Memory Accesses (MxN
Image)
Data Parallelism
Systolic Architecture
• Configurable Window Processor (CWP)
– Processing element in systolic arch.
• Architecture reads data from input memory
– P = image pixel
– W = window mask coefficients
• Transmitted to array of processing elements
for computation
Array of CWPs
• LDC = Local data collector
– Collects results of CWPs
• CWP
– Compute a window operator on same
column of input image
• D = Delay line / shift register
– Used for synchronization purposes
Architecture Flow
• Pixel is broadcast to all CWPs
• At each clock cycle:
– Each CWP receives a different window
coefficient
– New image pixel for all processing elements
• Each CWP multiplies and accumulates
values until all pixels in a window are
processed
• After short latency, the LDC will collect
the data and send it to output memory
CWP
• AP – Arithmetic Processor (ALU)
– Multiplies
• LRM – Local Reduction Module
– Accumulator
• Pc – Result of window operation
• Wd – delayed window coefficient
Systolic Architecture
Processing Time
• Latency
– Time required to start pipeline operation
– Measured between activation of first CWP to last CWP
• Parallel processing time
– Time when all CWPs are working in parallel
– Addition of all times to process set of rows
• Performance compromised with number of rows
processed
– Directly reflects silicon resources allocated to architecture
Throughput
• Number of elemental operations system can
perform per second
• Only scalar function and local reduction
function are considered
Implementation
• Fully parameterizable VHDL description
– Use generics to make design flexible
• Structural description used only elementary
logic operations
• Design is platform, version, technology, and
tool independent
• Used XCV2000E-6 VirtexE FPGA w/ 2
Million Gates
FPGA Technical Data
Performance Results
• I/O time not considered
in results
• 512x512 Image w/ 7x7
Window Mask
Performance Results
• Image processing time
for 7x7 window mask is
8.35 ms
• Leaves enough time for
image acquisition
• 30ms required for realtime constraints
• Post-processing also
possible
Performance Results
• Throughput increases
with number of
processing elements
• Utilization and activity
efficiency of processing
elements decrease
Improving Performance
•
•
•
•
Optimize design mapped on the FPGA
Apply timing restrictions for increased speed
Use better FPGA
Note that performance requirement for realtime operation is still met with lower FPGA
Comparisons to Other Architectures
Area/Performance Tradeoffs
• Low resource utilization allows
implementation in compact mobile apps
• High computational density due to small area
usage
• Can reduce hardware or clock frequency
– Reduces power
– Still meets timing requirements
Reconfigurability
• Flexible enough to support different windowbased image operators
• Allows different image-based applications on a
SoC
Conclusion
• Easy to exploit SIMD for parallelism in image
processing
• FPGAs allow reconfigurability and flexibility
• Real-time constraints can be met with high
performance and low area usage
•
All Images and Graphs from:
Torres-Huitzil, Cesar, and Miguel Arias-Estrada. "FPGA-Based Configurable Systolic Architecture for Window-Based Image
Processing." EURASIP Journal on Applied Signal Processing 7(2005): 1024-1034.
Hardware, Design and
Implementation Issues on a FPGABased Smart Camera
Fabio Dias, Francois Berry, Jocelyn Serot,
Francois Marmoiton
Summary of Paper
• Describe the hardware architecture of a FPGAbased Smart Camera research platform and some of
the hardware design issues.
• Propose a architectural design methodology based
on pre-programmed processing elements.
• Provide a low level image processing example.
• Present an embedded tracking application to show
the camera’s utilization.
What is a Smart Camera?
• Smart cameras utilize embedded processing to
relieve some of the low level computational
burden of the interfacing system.
• Reduce communication flow and overhead.
• Processing resources consist of FPGA devices,
medi/streaming processors, DSP’s, etc.
Why FPGA devices?
• Reconfigurability
– Allows the camera to adapt to a wide range of applications.
• Parallelism
– Take advantage of independence of many computational
tasks in order to meet time restraints.
• Hardware Flexibility
– Capable of interfacing with a wide range of external
devices such as memory or ASICs.
Smart Camera Hardware Architecture
•
•
•
•
•
•
ALTERA Stratix EP1S60F1020C7
4Mpixels LUPA-400 image sensor
(2) 2d accelerometers
(3) gyroscopes
10Mb SRAM
64Mb SDRAM
Smart Camera Hardware Architecture
Design Methodology
• Centralized around reconfiguration of the
FPGA.
– Set of Pre-designed configurable data processing
elements (PE’s).
– Programmable Control Module
• System supervisor, communicating with the PE’s
through registers and hand-shake signals
• Configures and synchronizes different PE’s
Design Methodology
Schematic of a SoPC architecture illustrating the proposed methodological approach.
Generic Window-Based
Processing Element
• Applied over a small defined over a small
defined portion of the input image.
• Deal with large amounts of data because they
are often applied over the entire image.
• Examples
– Convolution
– Correlation estimation
– Morphological transformations
Generic Window-Based
Processing Element
Q(i , j )  FM (FD ( A(i , j ) , B(i , j ) ))
x  FR (Q(i , j ) )
Smart Camera Application
•
•
•
•
•
•
Template Tracking System
VGA images sent to host
computer to be displayed.
The user selects frame of interest
for tracking.
A search window is acquired and
stored into memory.
A sliding window SAD
algorithm is applied.
The portion with the best
correlation score is considered
the as being the new template
location.
A null acceleration model is
employed in order to predict
displacement in the next frame.
Smart Camera Application
Embedded tracking implemented architecture
Experimental Results
Conclusion
• Generic window-based processing element
successfully implemented in an FPGA.
• An image tracking algorithm utilizing the
described design methodology successfully
implemented with adequate performance.
• A flexible FPGA base smart camera research
platform created for future research.
All Images and Graphs from:
Dias, Fabio, Francois Berry, Jocelyn Serot, and Francois Marmoiton, "HARDWARE, DESIGN AND IMPLEMENTATION
ISSUES ON A FPGA-BASED SMART CAMERA." IEEE 1-4244-1354-0/07(2007): 20-26.