Transcript Slide 1

Hardware Implementation of 2-D Wavelet Transforms in Viva on Starbridge Hypercomputer
S. Gakkhar, A. Dasu
Utah State University
DWT algorithm optimizations for the design
Algorithmic Overview
Let A[i][j] represent the elements of a matrix, and F[N] and G[M] be the two filter banks, having N and M taps respectively. ∴ The matrix transformations are →

Based on this recursive outlook, it is possible to modify the algorithm so as to eliminate the potential pitfalls associated with a hardware implementation of DWT.

Why Wavelet Transforms?

A signal cannot be represented as a point in the time–frequency space.
Fourier transforms and integrals can be used to represent any arbitrary signal in terms of sines and cosines. However, although the Fourier expansion of a signal can have possibly infinite support in the frequency domain, Fourier expansions contain only frequency resolution and no time resolution. That is, though it is possible to determine all frequencies present in the given signal, it cannot be determined when they are present.
Introducing a window on the basis function to extract time resolution translates into convolution between the windowing function and the signal in the time domain, or multiplication in the frequency domain. Since windowing functions tend to contain a wide range of frequencies (for instance, the Dirac pulse comprises all possible frequencies), this leads to smearing of the signal, posing the exact opposite problem to Fourier transforms – presence of time resolution, but absence of frequency resolution.
A[0][0]   … A[0][j-1]          B[0][0]   … B[0][j-1]
   ⋮            ⋮         →       ⋮            ⋮
A[i-1][0] … A[i-1][j-1]        B[i-1][0] … B[i-1][j-1]

B[0][0]   … B[0][j-1]          C[0][0]   … C[0][j-1]
   ⋮            ⋮         →       ⋮            ⋮
B[i-1][0] … B[i-1][j-1]        C[i-1][0] … C[i-1][j-1]
The image is traversed row-wise. An Nx(M-2) matrix is brought in and concatenated with a 2x1 matrix already in cache. Initially the cache contains all zeroes; as such, this gives an intrinsic method for dealing with true edge conditions. The first-stage convolution is computed and fed into the second-stage convolution. For M = N = 4, it has been demonstrated that only one element of the final convolution can be fully computed. This element is written to memory, while partially computed elements are stored in a cache associated with the second-stage convolution. This continues until an entire row is traversed. The cache for the first-stage convolution is then flushed, as a new edge condition needs to be dealt with for the next set of rows. The first-stage convolution is calculated in precisely the same manner, but for the second-stage convolution, partially computed values from the last iteration are used to compute the new second-stage values. The new partial values then overwrite the previous partial values in the second cache.
B[i][j] = A[i][j]·F[0] + … + A[i][j+N-1]·F[N-1]
C[i][j] = B[i][j]·G[0] + … + B[i+M-1][j]·G[M-1]
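A minimal pure-Python sketch of the two convolution stages above; the matrix A and filter taps F, G below are illustrative values, not the design's actual Daubechies coefficients:

```python
def first_stage(A, F):
    """B[i][j] = A[i][j]*F[0] + ... + A[i][j+N-1]*F[N-1] (row-wise)."""
    N = len(F)
    rows, cols = len(A), len(A[0])
    return [[sum(A[i][j + n] * F[n] for n in range(N))
             for j in range(cols - N + 1)]
            for i in range(rows)]

def second_stage(B, G):
    """C[i][j] = B[i][j]*G[0] + ... + B[i+M-1][j]*G[M-1] (column-wise)."""
    M = len(G)
    rows, cols = len(B), len(B[0])
    return [[sum(B[i + m][j] * G[m] for m in range(M))
             for j in range(cols)]
            for i in range(rows - M + 1)]

# Illustrative 4x4 input and 2-tap filters (hypothetical values)
A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
F = [0.5, 0.5]    # low-pass-like taps, for illustration only
G = [0.5, -0.5]   # high-pass-like taps, for illustration only

C = second_stage(first_stage(A, F), G)
```

Note how the second stage consumes first-stage outputs directly, which is exactly the sharing the pipelined hardware design exploits.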
Doubling the throughput
…the solution - Wavelet Transforms
Let convolution with F and convolution with G each be represented by a symbol (shown graphically on the slide). The above transformation can then be represented with these symbols, and the target transform can be represented in a tree structure →

[Figure: tree structure of cascaded LP/HP filter stages]


Wavelet transforms address this issue by introducing a fully scalable and modulated windowing function. As such, wavelet transforms involve multiresolution analysis, where the windowing function is scaled and shifted with regard to the signal while tracking the frequency and 'spatial' spectra.
An important characteristic of wavelet transforms is that they demonstrate perfect reconstructability; that is, it is possible to obtain the original signal by taking the inverse wavelet transform.
[Figure: first-stage convolution feeding the second-stage convolution through HP/LP filter branches, each followed by 2↓ down-sampling; the branches are computed in parallel]

HP – High Pass Filter
LP – Low Pass Filter

And the target transform as →

[Figure: filter-bank tree with 2↓ down-sampling after each filter]
With this factor of 2, the parallelism extracted is M*N*2; for Daubechies' four-tap scaling and wavelet filters, the parallelism extracted is 32, along with pipelining introduced between the two stages of convolution.
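For reference, a sketch of the Daubechies four-tap filter pair and the parallelism count; the orthonormal normalization shown is the standard textbook one, and the slide does not state which scaling the actual design uses:

```python
import math

# Daubechies-4 scaling (low-pass) filter taps; the wavelet (high-pass)
# taps follow from the quadrature-mirror relation g[k] = (-1)**k * h[3-k].
s3 = math.sqrt(3)
h = [(1 + s3) / (4 * math.sqrt(2)),
     (3 + s3) / (4 * math.sqrt(2)),
     (3 - s3) / (4 * math.sqrt(2)),
     (1 - s3) / (4 * math.sqrt(2))]
g = [(-1) ** k * h[3 - k] for k in range(4)]

M = N = 4                   # four taps in each filter
parallelism = M * N * 2     # factor of 2 from sharing first-stage results
```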
Note:
• 2↓ represents down-sampling of the final transformed matrix
• The horizontal bar indicates the convolution performed first
Design Overview

• Hardware: Xilinx 2V6000 FPGA, PE6 on the Starbridge® Hypercomputer
• Software: Viva 2.4.1, a polymorphic data and information rate EDA tool
Continuous Wavelet Transforms - A theoretical foundation
Building on the fundamentals of function spaces, the Continuous Wavelet Transform can be mathematically written as the decomposition of a signal function f(t) onto a set of basis functions called wavelets, represented by ψs,τ(t), where s represents the scale and τ the translation – the new variables after the transformation.

Take for example the following input matrix for two four-tap filters. The matrix

a b c d
e f g h
i j k l
m n o p

transforms to

A B | C D
E F | G H
----+----
I J | K L
M N | O P
[Schematic labels: Cache Associated w/ First Stage; Cache Associated w/ Second Stage]
γ(s, τ) = ∫ f(t) ψ*s,τ(t) dt

The wavelets are generated from a single basic wavelet (called the mother wavelet) ψ(t) by scaling and translation –

ψs,τ(t) = (1/√s) ψ((t − τ)/s)
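As a numerical illustration of γ(s, τ), the integral can be approximated by a Riemann sum; the Haar mother wavelet below is an arbitrary illustrative choice, since the definition holds for any admissible ψ:

```python
def haar(t):
    # Haar mother wavelet: +1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere
    if 0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1.0:
        return -1.0
    return 0.0

def cwt_point(f, s, tau, t0=0.0, t1=1.0, steps=1000):
    """Approximate gamma(s, tau) = integral of f(t) * psi_{s,tau}(t) dt."""
    dt = (t1 - t0) / steps
    total = 0.0
    for k in range(steps):
        t = t0 + k * dt
        psi = (1.0 / s ** 0.5) * haar((t - tau) / s)
        total += f(t) * psi * dt
    return total
```

A constant signal yields a (near-)zero coefficient, since the wavelet has zero mean; a signal with a jump inside the wavelet's support does not.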
Each of the sub-matrices represents the down-sampled result of one set of convolutions; considering only one set of convolutions, the original 16-point matrix transforms to just a four-element matrix. However, only point [0][0] can be computed in the final sub-matrix, as the rest of the points depend on elements external to this matrix. This is easy to visualize if one considers the corner and edge matrices in an image. Such a scenario is referred to as an edge condition. A commonly adopted way of dealing with this is to assume the elements outside the matrix to be zero.
Output
The factor 1/√s is the normalizing factor
DWT – A Quasi Recursive Approach
…but Continuous Wavelet Transform is impractical to use
in the mathematical form



• CWT is redundant, for it involves correlation between a continuously shifting/scaling function and the signal
• Most functions lack an analytical wavelet transform
• The wavelets in the transform are infinite in number
Discrete Wavelet Transform
Discrete Wavelet Transform
First Stage Convolution
Consider a 2x2 matrix with two-tap filters. It is evident that only one step is required to generate the entire first-stage output, and this output can be fed directly into the second stage. Now, if the bigger image can be broken down into small chunks that can be fed directly into the second stage, then the redundant memory read is eliminated, as the two convolution stages have been collapsed into one. But since two-tap filters are quite uncommon (to say the least), the methodology has to be generalized for arbitrary filter sizes (MxN). This leads to an additional complication: the first stage cannot be computed completely, as it depends on data outside the matrix.
Second Stage Convolution
Input
This is quite similar to an edge condition, except that the elements can no longer be treated as zero. Let such a situation be referred to as a pseudo edge condition. Now, for every chunk of data, a "pseudo edge condition" matrix can be defined, which comprises the elements required for complete evaluation of the first-stage convolution of the matrix.
Viva Schematic for Processing Core
A modified definition of the Wavelet Transform, in which wavelets are scaled and shifted in discrete steps, addresses these flaws –

ψj,k(t) = (1/√(s0^j)) ψ((t − k·τ0·s0^j) / s0^j)

j, k are integers, while τ0 is chosen as 1 for dyadic sampling of the time axis and s0 is chosen as 2 for dyadic sampling of the frequency axis.
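The dyadic family ψj,k can be sketched directly from this definition; the Haar ψ below is again only an illustrative mother wavelet:

```python
def haar(t):
    # Illustrative mother wavelet: +1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere
    if 0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1.0:
        return -1.0
    return 0.0

def psi_jk(psi, t, j, k, s0=2.0, tau0=1.0):
    """psi_{j,k}(t) = s0**(-j/2) * psi((t - k*tau0*s0**j) / s0**j)."""
    s = s0 ** j          # dyadic scale for s0 = 2
    return (1.0 / s ** 0.5) * psi((t - k * tau0 * s) / s)
```

Each increment of j doubles the wavelet's width and shrinks its amplitude by 1/√2, which is the dyadic frequency sampling described above.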
Matrix defining pseudo edge
conditions for red sub-matrix
The DWT Algorithm
Subband Coding Principles for DWT
The sub-matrices can be so formed that the edge condition matrix is reduced to -
The computational implementation of DWT utilizes subband coding schemes by modeling the transform as filtering through a set of filter banks. The filter banks used are designated as Wavelet (high pass) and Scaling (low pass), based on their attributes. In a nutshell, the algorithm for 2-D DWT comprises convolution along the x axis, followed by convolution along the y axis (on the previously transformed matrix), with the two filters taken in all possible permutations. The matrix is then down-sampled by 2 along both axes.
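The x-then-y filtering with all filter permutations can be sketched in pure Python; Haar-like two-tap filters are used for brevity (the actual design uses four-tap Daubechies filters), producing the four sub-bands LL, LH, HL, HH:

```python
def conv_down_rows(A, f):
    """Convolve each row with filter f, then keep every other column (2 down)."""
    L = len(f)
    out = []
    for row in A:
        full = [sum(row[j + k] * f[k] for k in range(L))
                for j in range(len(row) - L + 1)]
        out.append(full[::2])
    return out

def transpose(A):
    return [list(col) for col in zip(*A)]

def dwt2_level(A, lo, hi):
    """One 2-D DWT level: filter along x, then along y, 2 down on both axes."""
    low = transpose(conv_down_rows(A, lo))    # x-axis low-pass branch
    high = transpose(conv_down_rows(A, hi))   # x-axis high-pass branch
    LL = transpose(conv_down_rows(low, lo))   # all four filter permutations
    LH = transpose(conv_down_rows(low, hi))
    HL = transpose(conv_down_rows(high, lo))
    HH = transpose(conv_down_rows(high, hi))
    return LL, LH, HL, HH

lo = [0.5, 0.5]    # Haar-like low-pass (un-normalized, for illustration)
hi = [0.5, -0.5]   # Haar-like high-pass
A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
LL, LH, HL, HH = dwt2_level(A, lo, hi)
```

With these averaging filters the LL sub-band is the 2x2 block averages of the input, which makes the down-sampling easy to verify by hand.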
[Figure: subband coding stage – λj is filtered by g(k) and h(k), each followed by 2↓ down-sampling, yielding γj-1 and λj-1]

New matrix defining pseudo edge conditions for red sub-matrix

Data Access Pattern
Hardware Issues with DWT Implementation
High Computational Overhead
High computational redundancy arises due to the way the filtering is required by the algorithm. The combination of filters used in the algorithm allows sharing of the first-stage convolution results between two second-stage convolutions. However, on a scalar processor, the memory accesses (if this scheme of sharing first-stage results is to be utilized) are so scattered that multiple page faults are induced, which hurts performance severely.
Multiple / Redundant Memory Accesses
Implementation of the algorithm requires two sets of memory accesses (reads/writes) –
i) First, a read of the original image and a write of the first-stage convolution result
ii) Second, a read of the first-stage convolution result and a write of the final result
Results
For filters of sizes M and N, the basic matrix size is MxN. Since data is down-sampled after the first convolution (essentially every other row is thrown away), we can cache the first two columns of the matrix; then, at each cycle, Nx(M-2) elements are brought in, and the first two columns of the new matrix replace the old block in cache after the data is processed.
• The Processing Core implements the wavelet transform for two single-precision floating-point four-tap filters. The design is completely polymorphic, though it has been implemented with floating-point multipliers, and all intermediates are cast to single-precision floating-point values.
• The core is pipelined and generates four final pixel values (after both convolutions) every clock cycle. This is equivalent to 24 floating-point multiplies and 18 floating-point additions.
• The logic utilization is 93%.
• Generating 4 pixel values every clock cycle, this design can compute the wavelet transform of a 512x512 image in 65536 clock cycles.
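The 65536-cycle figure follows directly from the 4-pixels-per-clock output rate; a quick arithmetic check:

```python
# Throughput check: 4 fully computed pixels per clock over a 512x512 image
pixels = 512 * 512              # 262144 pixels in the image
pixels_per_cycle = 4            # final pixel values produced per clock
cycles = pixels // pixels_per_cycle
```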
[Figure: data access pattern – the first MxN sub-matrix (dimensions M, N; indices i, j) is the first set of data written to cache; the first two columns are retained in cache]
MAPLD 2005/163