ARM-Optimized JPEG Decoder
HW/SW Implementation of
JPEG Decoder
ARINDAM GOSWAMI
ERIC HUNEKE
MERT USTUN
ADVANCED EMBEDDED SYSTEMS ARCHITECTURE
SPRING 2011
Division of Labor
Software
Profiling – Arindam/Eric
Timing analysis – Arindam/Eric
Interface to hardware - Arindam
Test data for hardware - Eric
Hardware – Mert
C to Verilog Conversion
Scheduling & Resource Allocation on FPGA
Bus Communication Interface
Outline
What is JPEG?
Project Description
JPEG Algorithm
Profile Data
Software Design
Hardware Design
Results
Conclusion
What is JPEG?
Image codec released by the Joint Photographic
Experts Group in 1992
Joint committee between the ISO/IEC JTC1 and ITU-T
standards committees
Informally also used to describe the file format that
JPEG-encoded images are packed in
The file format specified in the original standard,
JPEG Interchange Format (JIF), is rarely used
Exif or JFIF, both based on JIF, are commonly used
What is JPEG? (cont.)
Optimized for realistic images and photographs
Color transitions should be smooth for best results
Lossy compression, which can be tuned to produce
compressions of varying quality and size
Up to 20:1 without noticeable loss in quality for appropriate images
Better ratios than other formats such as GIF, but slower to
compress and decompress
Has a lossless mode, but it is not widely used
Project Description
Selected an existing software JPEG implementation
that we could modify to increase performance
Criteria
Small enough to be easily understood and modified
Reasonably fast, but not optimized
Project Description (cont.)
Most common JPEG implementation out there is
libjpeg, from the Independent JPEG Group
Fast, but hard to modify due to complexity
Various other open source implementations
Tiny Jpeg Decoder
jpeg-compressor
Project Description (cont.)
We ended up choosing NanoJPEG, written by Martin
Fiedler
Reasonably fast, but not optimized
Very small code size (< 1000 lines) in a single file
Easy to understand
I/O
Decompresses grayscale or YCbCr images
Outputs grayscale or RGB raw images
Other details
Written in C
No floating point
JPEG Algorithm
Step 1
Convert the image to the YCbCr color space (typically
from RGB)
Y for brightness
Cb and Cr for blue and red color components
The human eye is less sensitive to color changes than
it is to brightness changes
JPEG takes advantage of this
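A minimal sketch of Step 1, assuming the full-range BT.601 coefficients that JFIF specifies. The names `clamp_u8` and `rgb_to_ycbcr` and the use of floating point are illustrative choices made here; they are not taken from NanoJPEG, which only decodes.

```c
#include <stdint.h>

/* Round to nearest and clamp to the 0..255 range of an 8-bit sample. */
static uint8_t clamp_u8(double v)
{
    if (v < 0.0)   return 0;
    if (v > 255.0) return 255;
    return (uint8_t)(v + 0.5);
}

/* Hypothetical helper: RGB to full-range YCbCr (JFIF convention).
 * Y carries brightness; Cb and Cr are offset by 128 so that a gray
 * pixel maps to Cb = Cr = 128. */
void rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b,
                  uint8_t *y, uint8_t *cb, uint8_t *cr)
{
    *y  = clamp_u8( 0.299    * r + 0.587    * g + 0.114    * b);
    *cb = clamp_u8(-0.168736 * r - 0.331264 * g + 0.5      * b + 128.0);
    *cr = clamp_u8( 0.5      * r - 0.418688 * g - 0.081312 * b + 128.0);
}
```

Gray inputs leave both chroma channels at 128, which is why smooth, low-saturation images lose little when the chroma is later reduced.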
JPEG Algorithm (cont.)
Step 2
Downsample the color data (CbCr) by averaging
neighboring samples horizontally and vertically
Factor of two horizontally (along each row)
Factor of one or two vertically (along each column)
Total data can thus be reduced by 1/3 or 1/2
Imperceptible loss in quality
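The averaging in Step 2 can be sketched as below. The function name `downsample_plane`, its parameters, and the rounded-mean filter are assumptions for illustration; real encoders may use other filters.

```c
#include <stdint.h>

/* Hypothetical helper: shrink one chroma plane by averaging groups of
 * hx-by-vy samples (hx = 2 and vy = 1 or 2 in the cases on the slide).
 * Assumes width is a multiple of hx and height a multiple of vy. */
void downsample_plane(const uint8_t *src, int width, int height,
                      int hx, int vy, uint8_t *dst)
{
    int out_w = width / hx;
    int out_h = height / vy;
    int n = hx * vy;                      /* samples per output value */

    for (int oy = 0; oy < out_h; oy++) {
        for (int ox = 0; ox < out_w; ox++) {
            int sum = 0;
            for (int dy = 0; dy < vy; dy++)
                for (int dx = 0; dx < hx; dx++)
                    sum += src[(oy * vy + dy) * width + (ox * hx + dx)];
            /* Rounded mean of the group. */
            dst[oy * out_w + ox] = (uint8_t)((sum + n / 2) / n);
        }
    }
}
```

With hx = 2 and vy = 1 each chroma plane halves, so the three planes shrink from 3 units to 2 (a 1/3 reduction); with hx = vy = 2 each chroma plane quarters, giving 1.5 units (a 1/2 reduction).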
JPEG Algorithm (cont.)
Step 3
For each component, split the pixel data into 8x8
blocks
Run each block through a discrete cosine transform
(DCT)
End up with a matrix containing one DC value and
63 AC components
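For reference, the 8x8 forward DCT in Step 3 is usually written as follows. The symbols are notation chosen here, not from the slides: s_{xy} is a level-shifted sample of the block and S_{uv} is the resulting coefficient.

```latex
S_{uv} = \frac{1}{4}\, C(u)\, C(v)
         \sum_{x=0}^{7} \sum_{y=0}^{7}
         s_{xy}\,
         \cos\frac{(2x+1)\,u\,\pi}{16}\,
         \cos\frac{(2y+1)\,v\,\pi}{16},
\qquad
C(k) = \begin{cases} 1/\sqrt{2} & k = 0 \\ 1 & k > 0 \end{cases}
```

S_{00} is the DC value (proportional to the block average); the other 63 coefficients S_{uv} are the AC components.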
JPEG Algorithm (cont.)
Step 4
Divide each cell of the matrix by values defined in a
quantization matrix, then round to the nearest
integer
The values in the quantization matrix can be tuned to
trade quality for size
The larger the values, the more cells are reduced to zero, and
hence lost
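A minimal sketch of Step 4 and its decoder-side counterpart. The function names and the flat 64-entry layout (same ordering in both arrays) are assumptions for illustration.

```c
/* Hypothetical helper: divide each coefficient by its table entry and
 * round to the nearest integer (ties away from zero). */
void quantize_block(const int coef[64], const int qtable[64], int out[64])
{
    for (int i = 0; i < 64; i++) {
        int q = qtable[i];
        int c = coef[i];
        out[i] = (c >= 0) ? (c + q / 2) / q : -((-c + q / 2) / q);
    }
}

/* Hypothetical helper: the decoder multiplies back. The rounding error
 * introduced above is the information that is lost. */
void dequantize_block(const int quant[64], const int qtable[64], int out[64])
{
    for (int i = 0; i < 64; i++)
        out[i] = quant[i] * qtable[i];
}
```

With a table entry of 16, a coefficient of 100 becomes 6 and comes back as 96, while coefficients of 7 or -7 become 0 and are gone for good.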
JPEG Algorithm (cont.)
Step 5
Take the reduced blocks and perform Huffman
encoding (or Arithmetic encoding) to eliminate
redundant values
Lossless compression
Step 6
Wrap data in a standard file format, along with
compression data including quantization and
Huffman tables
JPEG Algorithm (cont.)
Decoding is simply the reverse of the encoding
process
Get the reduced matrices back (entropy decoding)
Multiply each one, cell by cell, by the quantization matrix
Run an inverse DCT (IDCT)
Upsample
Convert to RGB
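The last decoding step can be done without floating point. The sketch below uses 8.8 fixed-point arithmetic with the JFIF coefficients scaled by 256; the function names are hypothetical, and the code is an illustration of the technique rather than a copy of any particular decoder.

```c
#include <stdint.h>

/* Clamp an intermediate result to the 0..255 sample range. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical helper: full-range YCbCr to RGB in 8.8 fixed point.
 * Constants are the JFIF coefficients times 256, rounded:
 * 1.402 -> 359, 0.344136 -> 88, 0.714136 -> 183, 1.772 -> 454.
 * The +128 rounds before the shift. Right-shifting a negative int is
 * assumed to be arithmetic, as it is on common ARM compilers. */
void ycbcr_to_rgb(uint8_t y, uint8_t cb, uint8_t cr,
                  uint8_t *r, uint8_t *g, uint8_t *b)
{
    int yy = y << 8;
    int u  = cb - 128;
    int v  = cr - 128;

    *r = clip_u8((yy + 359 * v + 128) >> 8);
    *g = clip_u8((yy - 88 * u - 183 * v + 128) >> 8);
    *b = clip_u8((yy + 454 * u + 128) >> 8);
}
```

Three multiplies, a few adds and one shift per channel is cheap per pixel, but it runs for every pixel of the output image, which is consistent with conversion dominating a profile.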
Profile Data
Profiled NanoJPEG on sample image with armsd
simulator
55.10% of total time spent upsampling and converting
the image to RGB
Logically separate from decode phase
38.34% of total time spent decoding the 8x8 blocks
So block decoding is really 85.39% of the time not spent converting/upsampling
Row and column IDCTs were about half of the block
decode time
Our main focus for speedup, since they took about 42% of decode
time and were an obvious candidate for FPGA implementation
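The percentages above fit together as follows. Treating "about half" as exactly 0.5, and reading "42% of decode time" as a share of the time not spent converting/upsampling, are both assumptions made here.

```latex
\frac{38.34}{100 - 55.10} = \frac{38.34}{44.90} \approx 0.8539
\qquad
0.5 \times 0.8539 \approx 0.427
\qquad
0.5 \times 38.34\% \approx 19.2\% \ \text{of total run time}
```

So the IDCTs account for roughly 42% of the time outside conversion, but only about a fifth of the whole run.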
Software Design
[Figure: block decoding code, with the row and column IDCT calls marked]
Software Design
[Figure: Row IDCT and Column IDCT]
Software Design
Interface –
Write 8x8 integers to FPGA addresses D3000100-1FF
Read 8x8 integers from D3000200-2FF (output of RowIDCT)
Read 8x8 bytes from D3000300-33F (output of ColIDCT)
Code –
Replace calls to IDCT functions with r/w to FPGA addresses
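A hedged sketch of what the replacement for the two IDCT calls might look like. The address macros come from the slide; the function name `fpga_idct_block`, the pointer-based signature, and the byte order inside each word are assumptions. No handshake or wait for the hardware to finish is shown, since the slides do not describe one.

```c
#include <stdint.h>

/* Register windows listed on the slide (board-specific). */
#define FPGA_BLOCK_IN 0xD3000100u  /* 64 x 32-bit words written by ARM   */
#define FPGA_ROW_OUT  0xD3000200u  /* 64 x 32-bit words, row IDCT output */
#define FPGA_COL_OUT  0xD3000300u  /* 64 bytes, column IDCT output       */

/* Hypothetical helper: push one 8x8 block through the FPGA.
 * The windows are passed as pointers so the routine can also be
 * exercised against plain arrays when no board is attached. */
void fpga_idct_block(volatile uint32_t *block_in,
                     const volatile uint32_t *col_out,
                     const int32_t coef[64], uint8_t pixels[64])
{
    /* 64 word writes: send the dequantized coefficients. */
    for (int i = 0; i < 64; i++)
        block_in[i] = (uint32_t)coef[i];

    /* 16 word reads: four output bytes per 32-bit bus read.
     * Lowest byte first within each word is an assumption. */
    for (int i = 0; i < 16; i++) {
        uint32_t w = col_out[i];
        pixels[4 * i + 0] = (uint8_t)(w);
        pixels[4 * i + 1] = (uint8_t)(w >> 8);
        pixels[4 * i + 2] = (uint8_t)(w >> 16);
        pixels[4 * i + 3] = (uint8_t)(w >> 24);
    }
}
```

On the target, the call would pass `(volatile uint32_t *)FPGA_BLOCK_IN` and `(const volatile uint32_t *)FPGA_COL_OUT`; the `volatile` qualifier keeps the compiler from caching or reordering the bus accesses.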
Hardware Design - Architecture
1. ARM writes row 0
2. Row IDCT: row 0, while ARM writes row 1
3. …
4. Row IDCT: row 7, while ARM reads row 0
5. Col IDCT: col 0 - 7, while ARM reads rest of the block
6. ARM reads colIDCT results
[Block diagram labels: AMBA BUS, BUS COMM. IF, 8x8x32b BLOCK Register File, 8x8x8b COL_OUT Register File, IDCT CORE (ROW IDCT, COL IDCT)]
Hardware Design - Optimizations
Register Files are used instead of RAMs to allow
random access to any word in the block matrix
Arithmetic operations were distributed in multiple
stages to share resources and therefore reduce area
Column IDCT and Row IDCT have a lot of common
operations –
Use only a single datapath for both = Core IDCT
Hardware Design – Core IDCT
[Figure: Core IDCT, with parts labeled Row IDCT and Column IDCT]
Hardware Design – Optimizations (2)
The hardware speed is limited by the ARM – FPGA
bus transactions (block transfers).
Optimize bus state machine:
Started with the 6-state bus state machine from Lab 2
Reduced it to only 3 states
Total # of FPGA cycles per 8x8 block process:
3 x (64 Writes + (64+16) Reads ) = 432 Cycles
432 Cycles for 8 Row and 8 Column IDCTs
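The cycle count can be reproduced with a small helper. The function name is hypothetical, and the model (every bus transfer costs one pass through the state machine, one cycle per state, with the 64 output bytes packed four to a word) is an assumption read off the slide's formula.

```c
/* Hypothetical helper: FPGA cycles spent on bus traffic for one
 * 8x8 block, given the number of states per bus transfer. */
unsigned fpga_cycles_per_block(unsigned states_per_transfer)
{
    unsigned writes    = 64;      /* 8x8 coefficients, one word each      */
    unsigned row_reads = 64;      /* 8x8 row IDCT results, one word each  */
    unsigned col_reads = 64 / 4;  /* 64 output bytes, four per word read  */

    return states_per_transfer * (writes + row_reads + col_reads);
}
```

`fpga_cycles_per_block(3)` gives 432. Under the same model, the original 6-state machine would have needed twice as many cycles for the same traffic.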
Results
Hardware produces correct outputs in simulation
Integrated system does not yet match simulation
Communication overhead between ARM and FPGA
is the major bottleneck
Expected speed-up:
ARM: 8 x 60 + 8 x 120 = 1440 ARM cycles (optimistic approximation)
FPGA: 3 x (64 Writes + (64+16) Reads ) = 432 FPGA Cycles
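The two counts are in different clock domains, so the expected speed-up depends on the clock ratio. The frequencies f_ARM and f_FPGA are symbols introduced here; the slides do not give their values.

```latex
\text{speed-up} \approx
\frac{1440 / f_{\mathrm{ARM}}}{432 / f_{\mathrm{FPGA}}}
= \frac{1440}{432} \cdot \frac{f_{\mathrm{FPGA}}}{f_{\mathrm{ARM}}}
\approx 3.33 \cdot \frac{f_{\mathrm{FPGA}}}{f_{\mathrm{ARM}}}
```

Only if both run at the same clock rate does this come out to about 3.3x for the IDCT portion; a slower FPGA clock reduces it proportionally.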
Conclusion
Work Completed
Parallelized IDCT routines for each block decode in FPGA
Work to be completed
Get interface working
What we would have done differently
Used DMA to reduce communication overhead even more
Parallelized ARM and FPGA block processing
Additional speed-up possible by moving njConvert
(upsampling & color conversion) into FPGA
References
Joint Photographic Experts Group
http://www.jpeg.org/jpeg/index.html
Introduction to JPEG
http://www.faqs.org/faqs/compression-faq/part2/
NanoJPEG
http://keyj.s2000.ws/?p=137
Questions?