FPGA Implementation of Reduced Bit Plane
Motion Estimation
Shrutisagar Chandrasekaran, Abbes Amira and Faycal Bensaali
Overview
September 2004
Chandrasekaran
1
MAPLD 2005/P200
Outline
Research Objectives
Introduction
Reduced Bit-Plane Motion Estimation
Proposed Architecture
FPGA Implementations and Results
Conclusions
Future Work and Acknowledgments
Research Objectives
• To efficiently implement a reduced bit-plane motion estimation algorithm on FPGA using Handel-C for onboard video compression
• To develop efficient low-power architectures for image processing techniques such as Motion Estimation (ME)
• To evaluate and model the power consumption of FPGA-based designs at various levels of abstraction, and to evolve and implement strategies for low-power, energy-efficient design
Introduction
• Block Matching (BM) is a widely used Motion Estimation (ME) technique that calculates motion vectors by minimising a cost function
• Optimal prediction is obtained when a Full Search (FS) algorithm is performed
• The FS algorithm is computationally intensive and requires a large number of I/O pins and high bandwidth for real-time ME
• An effective way to reduce the complexity of the ME architecture is to reduce the number of bit planes used for computing the motion vector
Introduction
• Most of the motion information is in the 6th bit plane, and a significant amount is also available in the 7th bit plane
• The lower bit planes contain significantly less motion information, as they represent the smooth areas of the image
• Reduced bit-plane methods for ME, which use a range of arithmetic units and simple Boolean operations, lead to power- and area-efficient architectures
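
As a minimal illustration of what a bit plane is, the Python sketch below extracts one plane from an 8-bit greyscale frame. The function name bit_plane and the convention that plane 0 is the least significant bit are assumptions made for this example; the slides do not state which numbering they use.

```python
def bit_plane(frame, k):
    """Return plane k of an 8-bit frame as rows of 0/1 values.

    Assumed convention: k = 0 is the least significant bit.
    """
    return [[(pixel >> k) & 1 for pixel in row] for row in frame]

# Tiny 2x2 frame with hand-picked pixel values.
frame = [[0b01000000, 0b10111111],
         [0b11000000, 0b00000001]]
plane6 = bit_plane(frame, 6)   # [[1, 0], [1, 0]]
plane0 = bit_plane(frame, 0)   # [[0, 1], [0, 1]]
```

Each plane needs one bit per pixel instead of eight, which is where the memory and bandwidth savings of a reduced bit-plane method come from.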
Reduced Bit-Plane ME
vec=[];                                   % motion vectors, one row per block
for i=1:dim:M-dim+1,
  for j=1:dim:N-dim+1,
    ii=i+d; jj=j+d;                       % previous_frame is assumed padded by d on each side
    window=previous_frame(ii-d:ii+dim+d-1,jj-d:jj+dim+d-1);
    [m,n]=size(window); I=1; J=1;
    block=current_frame(i:i+dim-1,j:j+dim-1);
    val=sum(sum((block-window(I:I+dim-1,J:J+dim-1)).^2));
    for l=1:m-dim+1,
      for k=1:n-dim+1,
        val_t=sum(sum((block-window(l:l+dim-1,k:k+dim-1)).^2));
        if val_t<val,
          I=l; J=k;                       % remember the best position so far
          val=val_t;
        end
      end
    end
    I=I-d-1; J=J-d-1;                     % convert window index to displacement
    vec=[vec;I,J];
  end
end
Where dim : block size
      d   : border extension for window (square window)
      vec : array of motion vectors
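
The same exhaustive search can be sketched in plain Python. This is an illustrative translation, not the authors' code: the function name full_search, the (dy, dx) ordering of the result, and the assumption that the previous frame has already been padded by d pixels on every side are all choices made for this example.

```python
def full_search(current, previous_padded, dim, d):
    """Exhaustive block matching with a sum-of-squared-differences cost.

    previous_padded is assumed to be the previous frame extended by d
    pixels on every side. Returns one (dy, dx) displacement per
    dim x dim block, in row-major block order.
    """
    rows, cols = len(current), len(current[0])
    vectors = []
    for i in range(0, rows - dim + 1, dim):
        for j in range(0, cols - dim + 1, dim):
            best_cost, best = None, (0, 0)
            for dy in range(-d, d + 1):
                for dx in range(-d, d + 1):
                    cost = 0
                    for r in range(dim):
                        for c in range(dim):
                            ref = previous_padded[i + d + dy + r][j + d + dx + c]
                            diff = current[i + r][j + c] - ref
                            cost += diff * diff
                    # Strict comparison: the first minimum in scan order wins.
                    if best_cost is None or cost < best_cost:
                        best_cost, best = cost, (dy, dx)
            vectors.append(best)
    return vectors

# 6x6 padded previous frame with unique pixel values, and a 4x4 current
# frame that equals the previous frame shifted down by one row.
previous_padded = [[r * 6 + c for c in range(6)] for r in range(6)]
current = [[previous_padded[r + 2][c + 1] for c in range(4)] for r in range(4)]
vectors = full_search(current, previous_padded, dim=2, d=1)
```

With this test pattern every 2x2 block finds a zero-cost match one row below its own position, so all four vectors are (1, 0).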
Pseudo Code
Proposed Architecture
[Figure: block diagram of the proposed architecture. Labelled elements: Control Unit; PSU (Processor Sub-Unit) with 16-bit inputs; Adder 0 to Adder 15 (5-bit outputs); Adder (9 bits); Comparator (least input location, 2 bits); New min registers (9 bits); Comparator; intermediate motion vectors (5 bits); final motion vectors.]
Proposed Architecture
• The architecture exploits the massive parallelism available in hardware to reduce the computation time
• The search window is stored on-chip in an array of 32-bit-wide registers, the width of each register being equal to the size of the search window
• The block size is 16x16 bits (1 bit per pixel), and the block is stored on-chip in an array of registers
• Each Processing Sub-Unit (PSU) contains 256 Processor Elements (PEs), i.e. 256 XOR gates plus 16 5-bit adders, which execute the block matching in parallel and estimate the SAD (Sum of Absolute Differences)
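
For 1-bit pixels the absolute difference of two pixels equals their XOR, so the SAD of a block reduces to counting mismatching bits. The Python sketch below mimics that data path in software, with one bit count per 16-pixel row followed by a final sum. Representing each row as a 16-bit integer and the names row_mismatches and block_sad are assumptions made for illustration.

```python
def row_mismatches(cur_row, ref_row):
    """Count differing pixels in one 16-pixel row: XOR, then a bit count."""
    return bin((cur_row ^ ref_row) & 0xFFFF).count("1")

def block_sad(cur_block, ref_block):
    """SAD of two 16x16 one-bit blocks, each given as 16 row words."""
    # Each per-row count lies in 0..16, so it fits in 5 bits.
    row_sums = [row_mismatches(c, r) for c, r in zip(cur_block, ref_block)]
    # The total lies in 0..256, so it fits in 9 bits.
    return sum(row_sums)

all_ones = [0xFFFF] * 16
all_zeros = [0x0000] * 16
worst_case = block_sad(all_ones, all_zeros)                      # 256
one_row_off = block_sad([0x00F0] * 16, [0x00F0] * 15 + [0x00FF])  # 4
```

The value ranges in the comments line up with the 5-bit and 9-bit widths labelled in the block diagram.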
Proposed Architecture
• 2 PSUs are used to cover the entire search window by shifting the contents of the search window bitwise in the horizontal and vertical directions
• The intermediate motion vector values are stored in an on-chip array, with one location for each PSU
• At the end, the global motion vectors are obtained from the intermediate values and the outputs of the comparators
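
The shift-and-compare search can be sketched sequentially in Python. This is a software model under stated assumptions, not the hardware described above: it does not model the two PSUs or the cycle count, it assumes that the most significant bit of each 32-bit window word is the leftmost pixel, and the name search_window is hypothetical.

```python
import random

def search_window(block, window, positions=16):
    """Best match of a 16x16 one-bit block inside a 32x32 one-bit window.

    block  : 16 words of 16 bits; window : 32 words of 32 bits.
    Returns (dy, dx, sad), with displacements in [-8, 7].
    """
    best = None
    for y in range(positions):            # vertical shift
        for x in range(positions):        # horizontal shift
            sad = 0
            for r in range(16):
                candidate = (window[y + r] >> (16 - x)) & 0xFFFF
                sad += bin(candidate ^ block[r]).count("1")
            if best is None or sad < best[2]:
                best = (y - 8, x - 8, sad)
    return best

# Pseudo-random window; the block is cut out at row 11, column 3,
# i.e. a displacement of (+3, -5) from the centre position (8, 8).
rng = random.Random(2005)
window = [rng.getrandbits(32) for _ in range(32)]
block = [(window[11 + r] >> (16 - 3)) & 0xFFFF for r in range(16)]
best = search_window(block, window)
```

Because the block is copied straight out of the window, the search should return the displacement (3, -5) with a SAD of 0.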
Proposed Architecture
• The proposed architecture compares favourably with existing work: among the 256-PE designs it needs the fewest cycles per motion vector (MV)

Architecture   No. of PEs   Throughput          Search range
Proposed       256          1 MV/308 cycles     [-8, 7]
[1]            1024         1 MV/256 cycles     [-16, 15]
[2]            256          1 MV/496 cycles     [-8, 7]
[3]            256          1 MV/2209 cycles    [-8, 7]

[1] Y-H. Yeh and C-Y. Lee, IEEE Trans. VLSI Syst. 7, 345 (1999)
[2] T. Komarek and P. Pirsch, IEEE Trans. Circuits Syst. 36, 1301 (1989)
[3] C-H. Hsieh and T-P. Lin, IEEE Trans. Circuits Syst. Video Technol. 2, 169 (1992)
FPGA Implementations and Results
• To verify the performance of the proposed architectures, designs have been prototyped on the Celoxica RC1000 board containing the Xilinx XCV2000E FPGA
• Available on-chip logic resources include:
  - Slices: 19,200
  - CLB array: 80 x 120
  - Block RAM: 655,360 bits
  - Distributed RAM: 614,400 bits
• The RC1000 has 4 memory banks, which communicate with the host by means of DMA transfers
FPGA Implementations and Results
Design Flow
FPGA Implementations and Results
• Handel-C adds constructs to ANSI-C so that DK can implement hardware directly
• It is a fully synthesisable hardware programming language based on ANSI-C
• It implements a C algorithm directly as an optimised FPGA design, or outputs RTL from C
[Figure: relationship between ANSI-C and Handel-C]
Software-only ANSI-C constructs: recursion, side effects, standard libraries, malloc
ANSI-C constructs supported by DK (the majority): control statements (if, switch, case, etc.), integer arithmetic, functions, pointers, basic types (structures, arrays, etc.), #define, #include
Handel-C additions for hardware: parallelism, timing, interfaces, clocks, macro pre-processor, RAM/ROM, shared expression, communications, Handel-C libraries, FP library, bit manipulation
FPGA Implementations and Results
[Figure: reduced bit-plane ME data flow on the RC1000. Labels: 16 x 16 pixels; Bank0, Bank1, Bank2, Bank3; XCV2000E; 8x8 blocks; motion vectors.]
FPGA Implementations and Results
• The bit-plane values from the current frame are sent from the host to SRAM Bank 0, and those from the previous frame are sent as 16-bit values to SRAM Bank 1
• The motion vectors are computed by the ME core and stored in SRAM Bank 3
• The host application reads the motion vectors and generates the predicted image in real time
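
One plausible way for the host to prepare 16-bit values from a 1-bit plane is to pack each run of 16 pixels into one word. The sketch below is an assumption about the data layout, not the RC1000 host code: the name pack_rows_16 is hypothetical, and placing the leftmost pixel in the most significant bit is an assumed ordering.

```python
def pack_rows_16(plane):
    """Pack rows of 0/1 pixels into 16-bit words.

    Assumed ordering: the leftmost pixel of each group of 16 goes into
    the most significant bit of the word.
    """
    words = []
    for row in plane:
        if len(row) % 16:
            raise ValueError("row length must be a multiple of 16")
        for start in range(0, len(row), 16):
            word = 0
            for bit in row[start:start + 16]:
                word = (word << 1) | bit
            words.append(word)
    return words

# Two 16-pixel rows: one with only the leftmost pixel set,
# one with only the rightmost pixel set.
plane = [[1] + [0] * 15,
         [0] * 15 + [1]]
words = pack_rows_16(plane)   # [0x8000, 0x0001]
```

Packing this way means a 16-pixel-wide block row occupies exactly one memory word.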
FPGA Implementations and Results
• The proposed architecture is area efficient, as the motion estimation is performed on a single bit plane, requiring compact logic and greatly reduced on-chip memory
• The architecture is compact and can be massively parallelised, as each PE contains only simple 1-bit XOR gates
• Memory access is greatly reduced because only a single bit plane is used, saving a considerable amount of I/O power
FPGA Implementations and Results
• Reduced memory access, along with architectural-level optimisations including parallelism and pipelining, yields a power-efficient implementation
• The design is implemented on the Celoxica RC1000 board equipped with a Xilinx XCV2000E FPGA, and is also synthesised for a Xilinx QPro Virtex-II FPGA
• Results in terms of power, area and maximum frequency show that using reduced bit planes instead of full-resolution images drastically reduces the FPGA resources used
FPGA Implementations and Results
Various performance metrics of the RBFSBM algorithm implemented on the Virtex-E and the QPro Virtex-II FPGAs

Performance metric        Virtex-E   QPro Virtex-II
Area occupied (slices)    1500       1488
Max frequency (MHz)       43.57      76.247
Max power (mW)            432.65     227.31
Energy/CIF frame (mJ)     1.934      1.40
Max throughput (FPS)      89.305     173
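
A back-of-the-envelope estimate shows how a frame rate follows from a clock frequency and a cycle count per motion vector. The sketch assumes a CIF frame of 352 x 288 pixels, one motion vector per 8x8 block (suggested by the "8x8 blocks" label in the data-flow figure), and the 308 cycles per MV quoted for the proposed architecture. It is not claimed to be how the authors derived their figures, and it is not guaranteed to reproduce the other entries in the table.

```python
def frames_per_second(clock_hz, cycles_per_mv, width=352, height=288, block=8):
    """Estimate frame rate when every block costs the same number of cycles."""
    blocks_per_frame = (width // block) * (height // block)
    return clock_hz / (cycles_per_mv * blocks_per_frame)

# Virtex-E maximum frequency from the table, 308 cycles per motion vector.
fps_virtex_e = frames_per_second(43.57e6, 308)
```

Under these assumptions the estimate for the Virtex-E clock comes out at about 89.3 frames per second, close to the reported maximum throughput.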
Conclusions
• A reduced bit-plane architecture for full search block matching has been proposed
• The proposed architecture is low power, area efficient and suitable for VLSI/FPGA implementation
• The developed architecture can be used for space applications such as onboard video compression, as well as for video conferencing
Future work and Acknowledgments
• Develop a complete on-chip compression engine for real-time video compression, with applications ranging from onboard satellite compression to video conferencing
• Explore the effect of algorithmic, architectural and RTL-level optimisations to minimise power consumption
Acknowledgments
Celoxica (Mr. Roger Gook) and EPSRC, for supporting this work