ARM-Optimized JPEG Decoder


HW/SW Implementation of
JPEG Decoder
ARINDAM GOSWAMI
ERIC HUNEKE
MERT USTUN
ADVANCED EMBEDDED SYSTEMS ARCHITECTURE
SPRING 2011
Division of Labor
 Software
 Profiling – Arindam/Eric
 Timing analysis – Arindam/Eric
 Interface to hardware - Arindam
 Test data for hardware - Eric
 Hardware – Mert
 C to Verilog Conversion
 Scheduling & Resource Allocation on FPGA
 Bus Communication Interface
Outline
 What is JPEG?
 Project Description
 JPEG Algorithm
 Profile Data
 Software Design
 Hardware Design
 Results
 Conclusion
What is JPEG?
 Image codec released by the Joint Photographic
Experts Group in 1992

Joint committee between the ISO/IEC JTC1 and ITU-T
standards committees
 Informally used to describe the file format JPEG-
encoded images are packed in


The file format specified in the original standard,
JPEG Interchange Format (JIF), is rarely used
Exif and JFIF, both based on JIF, are commonly used instead
What is JPEG? (cont.)
 Optimized for realistic images and photographs
 Color transitions should be smooth for best results
 Lossy compression, which can be tuned to produce
compressions of varying quality and size



Up to about 20:1 without visible loss in quality for appropriate images
Better ratios than other formats such as GIF, but slower to
compress and decompress
Has a lossless mode, but it is not widely used
Project Description
 Selected an existing software JPEG implementation
that we could modify to increase performance
 Criteria


Small enough to be easily understood and modified
Reasonably fast, but not optimized
Project Description (cont.)
 Most common JPEG implementation out there is
libjpeg, from the Independent JPEG Group

Fast, but hard to modify due to complexity
 Various other open source implementations
 Tiny Jpeg Decoder
 jpeg-compressor
Project Description (cont.)
 We ended up choosing NanoJPEG, written by Martin
Fiedler



Reasonably fast, but not optimized
Very small code size (< 1000 lines) in a single file
Easy to understand
 I/O
 Decompresses grayscale or YCbCr images
 Outputs grayscale or RGB raw images
 Other details
 Written in C
 No floating point
JPEG Algorithm
 Step 1
 Convert the image to the YCbCr color space (typically
from RGB)


Y for brightness
Cb and Cr for blue and red color components
 The human eye is less sensitive to color changes than
it is to brightness changes

JPEG takes advantage of this
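A minimal sketch of Step 1 for a single pixel, assuming the full-range weights used by JFIF files (Y = 0.299 R + 0.587 G + 0.114 B, with Cb and Cr centered on 128). The function names and the 16-bit fixed-point scaling are illustrative choices; they are not taken from NanoJPEG, which only decodes.

```c
#include <stdint.h>

/* Clamp an intermediate value to the 0..255 range of an 8-bit sample. */
static uint8_t clamp_u8(int32_t v)
{
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Convert one RGB pixel to YCbCr. The weights are the JFIF coefficients
 * scaled by 2^16 (an assumption of this sketch), so only integer
 * arithmetic is needed. Every sum below is non-negative, so the right
 * shift is well defined. */
static void rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b,
                         uint8_t *y, uint8_t *cb, uint8_t *cr)
{
    const int32_t half = 1 << 15;    /* rounds the >> 16 to nearest */
    const int32_t bias = 128 << 16;  /* centers Cb and Cr on 128 */

    *y  = clamp_u8((19595 * r + 38470 * g + 7471 * b + half) >> 16);
    *cb = clamp_u8((bias - 11058 * r - 21710 * g + 32768 * b + half) >> 16);
    *cr = clamp_u8((bias + 32768 * r - 27439 * g - 5329 * b + half) >> 16);
}
```

Any gray pixel (R = G = B) maps to Cb = Cr = 128, which is why the color planes of a photograph tend to be smooth and tolerate the downsampling in Step 2.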
JPEG Algorithm (cont.)
 Step 2
 Downsample the color data (CbCr) by averaging
together neighboring samples horizontally and vertically



Factor of two horizontally
Factor of one or two vertically
Color data is thus halved or quartered, reducing the total data by 1/3 or 1/2
 Imperceptible loss in quality
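A sketch of the averaging in Step 2 for one chroma plane, assuming a horizontal factor of two and a vertical factor of one or two. The function name, the row-major layout, and the round-to-nearest average are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Downsample one chroma plane (Cb or Cr) by 2 horizontally and by
 * `vfactor` (1 or 2) vertically, averaging the samples that fall into
 * each output cell. `width` must be even and `height` a multiple of
 * `vfactor`. `dst` must hold (width / 2) * (height / vfactor) samples. */
static void downsample_chroma(const uint8_t *src, size_t width, size_t height,
                              unsigned vfactor, uint8_t *dst)
{
    size_t out_w = width / 2;
    size_t out_h = height / vfactor;
    unsigned count = 2 * vfactor;  /* input samples per output sample */

    for (size_t oy = 0; oy < out_h; oy++) {
        for (size_t ox = 0; ox < out_w; ox++) {
            unsigned sum = 0;
            for (unsigned dy = 0; dy < vfactor; dy++)
                for (unsigned dx = 0; dx < 2; dx++)
                    sum += src[(oy * vfactor + dy) * width + (ox * 2 + dx)];
            /* Adding count / 2 rounds the average to nearest. */
            dst[oy * out_w + ox] = (uint8_t)((sum + count / 2) / count);
        }
    }
}
```

With Y kept at full resolution, halving both color planes leaves 2/3 of the original samples, and quartering them leaves 1/2.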
JPEG Algorithm (cont.)
 Step 3
 For each component, split the pixel data into 8x8
blocks
 Run each block through a discrete cosine transform
(DCT)
 End up with a matrix containing one DC value and
63 AC components
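A naive reference version of Step 3, written directly from the 8x8 DCT definition. It is a sketch for checking results, not a fast implementation. The cosine table and helper names are assumptions of this example, and floating point is used only for clarity.

```c
/* cos(k * pi / 16) for k = 0..8; other angles follow from symmetry. */
static const double COS_TABLE[9] = {
    1.0, 0.9807852804032304, 0.9238795325112867, 0.8314696123025452,
    0.7071067811865476, 0.5555702330196022, 0.3826834323650898,
    0.1950903220161283, 0.0
};

/* Return cos(k * pi / 16) for any non-negative k using the table above. */
static double cos_pi16(unsigned k)
{
    k %= 32;                               /* one period is 32 steps */
    if (k > 16) k = 32 - k;                /* cos(2*pi - a) = cos(a) */
    if (k > 8) return -COS_TABLE[16 - k];  /* cos(pi - a) = -cos(a) */
    return COS_TABLE[k];
}

/* Naive 2-D DCT of an 8x8 block stored row by row.
 * out[0] is the DC value; out[1..63] are the AC components. */
static void dct8x8(const double in[64], double out[64])
{
    for (unsigned v = 0; v < 8; v++) {
        for (unsigned u = 0; u < 8; u++) {
            double sum = 0.0;
            for (unsigned y = 0; y < 8; y++)
                for (unsigned x = 0; x < 8; x++)
                    sum += in[y * 8 + x]
                         * cos_pi16((2 * x + 1) * u)
                         * cos_pi16((2 * y + 1) * v);
            double cu = (u == 0) ? COS_TABLE[4] : 1.0;  /* 1/sqrt(2) */
            double cv = (v == 0) ? COS_TABLE[4] : 1.0;
            out[v * 8 + u] = 0.25 * cu * cv * sum;
        }
    }
}
```

For a block whose 64 samples all share one value, the DC term is 8 times that value and every AC component is zero, which is why smooth image regions compress well.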
JPEG Algorithm (cont.)
 Step 4
 Divide each cell of the matrix by the corresponding
value in a quantization matrix, then round to the
nearest integer
 The values in the quantization matrix are
customizable

The larger the values, the more cells are reduced to zero, and
hence lost
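Step 4 can be sketched as below, assuming integer DCT coefficients and 8-bit table entries. The function name and the choice to round ties away from zero are assumptions of this example.

```c
#include <assert.h>
#include <stdint.h>

/* Quantize an 8x8 block: divide each DCT coefficient by its table entry
 * and round to the nearest integer (ties away from zero). Table entries
 * must be non-zero. */
static void quantize(const int32_t coef[64], const uint8_t qtable[64],
                     int16_t out[64])
{
    for (int i = 0; i < 64; i++) {
        int32_t q = qtable[i];
        int32_t c = coef[i];
        assert(q > 0);
        out[i] = (int16_t)((c >= 0) ? (c + q / 2) / q
                                    : -((-c + q / 2) / q));
    }
}
```

With a table entry of 16, a coefficient of 100 becomes 6, while a coefficient of 7 becomes 0 and cannot be recovered on decode.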
JPEG Algorithm (cont.)
 Step 5
 Take the reduced blocks and perform Huffman
encoding (or Arithmetic encoding) to eliminate
redundant values

Lossless compression
 Step 6
 Wrap data in a standard file format, along with
compression data including quantization and
Huffman tables
JPEG Algorithm (cont.)
 Decoding is simply the reverse of the encoding
process





Recover the quantized matrices (entropy decoding)
Multiply each one element-wise by the quantization matrix
Run an inverse DCT (IDCT)
Upsample
Convert to RGB
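Two of the decode steps above are small enough to sketch directly: dequantization and the final conversion to RGB. This assumes the JFIF full-range conversion scaled by 2^16; the helper names are hypothetical and this is not NanoJPEG's code.

```c
#include <stdint.h>

/* Clamp an intermediate value to the 0..255 range of an 8-bit sample. */
static uint8_t clamp_u8(int32_t v)
{
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Divide by 2^16 with rounding to nearest, without shifting a negative value. */
static int32_t round_shift16(int32_t v)
{
    const int32_t half = 1 << 15;
    return (v >= 0) ? (v + half) >> 16 : -((-v + half) >> 16);
}

/* Dequantize: multiply each decoded value by its quantization table entry. */
static void dequantize(const int16_t q[64], const uint8_t qtable[64],
                       int32_t coef[64])
{
    for (int i = 0; i < 64; i++)
        coef[i] = (int32_t)q[i] * qtable[i];
}

/* Convert one YCbCr pixel back to RGB (weights scaled by 2^16). */
static void ycbcr_to_rgb(uint8_t y, uint8_t cb, uint8_t cr,
                         uint8_t *r, uint8_t *g, uint8_t *b)
{
    int32_t c = (int32_t)cb - 128;
    int32_t d = (int32_t)cr - 128;

    *r = clamp_u8(y + round_shift16(91881 * d));
    *g = clamp_u8(y - round_shift16(22553 * c + 46802 * d));
    *b = clamp_u8(y + round_shift16(116130 * c));
}
```

The conversion runs once per output pixel, so even this small amount of arithmetic adds up over a whole image.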
Profile Data
 Profiled NanoJPEG on sample image with armsd
simulator
 55.10% of total time spent converting the image to
RGB and upsampling

Logically separate from decode phase
 38.34% of total time spent decoding the 8x8 blocks
 That is 85.39% of the time not spent converting/upsampling (38.34 / 44.90)
 Row and column IDCTs were about half of the block
decode time

Our main focus for speedup, since they took about 42% of decode
time and were an obvious candidate for FPGA implementation
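The percentages above relate as follows; this is only a restatement of the slide's arithmetic, with a hypothetical helper name.

```c
/* Share (in percent) of the remaining run time taken by one phase once
 * another phase is excluded. Both inputs are percentages of total time. */
static double share_excluding(double phase_pct, double excluded_pct)
{
    return 100.0 * phase_pct / (100.0 - excluded_pct);
}
```

`share_excluding(38.34, 55.10)` gives about 85.39, the block-decode share of the time not spent in conversion and upsampling. Half of that is roughly 42 to 43 percent, which is the IDCT share quoted above.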
Software Design
[Figure: block decoding code, showing the row and column IDCT calls]
Software Design
[Figure: row IDCT and column IDCT]
Software Design
 Interface –
 Write 8x8 integers to FPGA addresses D3000100-1FF
 Read 8x8 integers from D3000200-2FF (output of row IDCT)
 Read 8x8 bytes from D3000300-33F (output of column IDCT)
 Code –
 Replace calls to IDCT functions with reads/writes to FPGA addresses
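A sketch of what the replacement code might look like, using the address windows listed above. The function name is hypothetical. The packing of four result bytes per 32-bit word (lowest address in the low byte) is an assumption, and any handshake needed to wait for the FPGA to finish is not modeled.

```c
#include <stdint.h>

/* Word offsets of the three windows relative to the FPGA base address
 * (0xD3000000 per the slides). */
#define FPGA_IN_WORD      (0x100 / 4)  /* 64 x 32-bit coefficients in    */
#define FPGA_ROW_OUT_WORD (0x200 / 4)  /* 64 x 32-bit row IDCT results   */
#define FPGA_COL_OUT_WORD (0x300 / 4)  /* 64 x 8-bit column IDCT results */

/* Hypothetical stand-in for the software row/column IDCT calls:
 * 64 word writes, then 64 + 16 word reads. */
static void fpga_idct_block(volatile uint32_t *fpga, const int32_t coef[64],
                            int32_t row_out[64], uint8_t pixels[64])
{
    for (int i = 0; i < 64; i++)
        fpga[FPGA_IN_WORD + i] = (uint32_t)coef[i];

    for (int i = 0; i < 64; i++)
        row_out[i] = (int32_t)fpga[FPGA_ROW_OUT_WORD + i];

    for (int i = 0; i < 16; i++) {
        uint32_t w = fpga[FPGA_COL_OUT_WORD + i];
        /* ASSUMPTION: four result bytes per word, lowest address first. */
        pixels[4 * i + 0] = (uint8_t)(w & 0xFF);
        pixels[4 * i + 1] = (uint8_t)((w >> 8) & 0xFF);
        pixels[4 * i + 2] = (uint8_t)((w >> 16) & 0xFF);
        pixels[4 * i + 3] = (uint8_t)((w >> 24) & 0xFF);
    }
}
```

On the target, `fpga` would be `(volatile uint32_t *)0xD3000000`. Passing the base as a parameter also lets the routine be exercised against an ordinary array on a host machine.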
Hardware Design - Architecture
1. ARM writes row 0
2. Row IDCT: row 0, while ARM writes row 1
3. …
4. Row IDCT: row 7, while ARM reads row 0
5. Col IDCT: col 0 - 7, while ARM reads rest of the block
6. ARM reads colIDCT results
[Block diagram: AMBA bus, bus communication interface, 8x8x32b BLOCK register file, 8x8x8b COL_OUT register file, and the IDCT core (row IDCT and column IDCT)]
Hardware Design - Optimizations
 Register Files are used instead of RAMs to allow
random access to any word in the block matrix
 Arithmetic operations were distributed across multiple
stages to share resources and therefore reduce area
 Column IDCT and Row IDCT have a lot of common
operations –
  Use only a single datapath for both = Core IDCT
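The single-datapath idea can be modeled in software as one 8-point routine that is applied first to rows and then to columns, with only the stride changing. This is a naive floating-point reference model written for illustration; the names are hypothetical, and the actual routines use integer arithmetic and differ in detail.

```c
/* cos(k * pi / 16) for k = 0..8; other angles follow from symmetry. */
static const double COS_TABLE[9] = {
    1.0, 0.9807852804032304, 0.9238795325112867, 0.8314696123025452,
    0.7071067811865476, 0.5555702330196022, 0.3826834323650898,
    0.1950903220161283, 0.0
};

/* Return cos(k * pi / 16) for any non-negative k using the table above. */
static double cos_pi16(unsigned k)
{
    k %= 32;                               /* one period is 32 steps */
    if (k > 16) k = 32 - k;                /* cos(2*pi - a) = cos(a) */
    if (k > 8) return -COS_TABLE[16 - k];  /* cos(pi - a) = -cos(a) */
    return COS_TABLE[k];
}

/* Shared core: one 8-point inverse DCT over data[0], data[stride], ...,
 * data[7 * stride]. stride = 1 walks a row; stride = 8 walks a column. */
static void idct_core(double *data, unsigned stride)
{
    double tmp[8];
    for (unsigned x = 0; x < 8; x++) {
        double sum = 0.0;
        for (unsigned u = 0; u < 8; u++) {
            double cu = (u == 0) ? COS_TABLE[4] : 1.0;  /* 1/sqrt(2) */
            sum += cu * data[u * stride] * cos_pi16((2 * x + 1) * u);
        }
        tmp[x] = 0.5 * sum;
    }
    for (unsigned x = 0; x < 8; x++)
        data[x * stride] = tmp[x];
}

/* 2-D IDCT of an 8x8 block: 8 row passes, then 8 column passes. */
static void idct8x8(double block[64])
{
    for (unsigned r = 0; r < 8; r++) idct_core(block + r * 8, 1);
    for (unsigned c = 0; c < 8; c++) idct_core(block + c, 8);
}
```

A block holding only a DC coefficient of 80 decodes to 64 samples of value 10, which makes a convenient sanity check for either a software or a hardware implementation.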
Hardware Design – Core IDCT
[Figure: row IDCT and column IDCT]
Hardware Design – Optimizations (2)
 The hardware speed is limited by the ARM-FPGA
bus transactions (block transfers).
  Optimize the bus state machine:


Started with the 6-state bus state machine from Lab 2
Reduced it to only 3 states
 Total number of FPGA cycles to process one 8x8 block:
 3 x (64 writes + (64+16) reads) = 432 cycles
 432 cycles for 8 row and 8 column IDCTs
Results
 Hardware produces correct outputs in simulation
 Integrated system does not yet match simulation
 Communication overhead between ARM and FPGA
is the major bottleneck
 Expected speed-up:
 ARM: 8 x 60 + 8 x 120 = 1440 ARM cycles (optimistic approximation)
 FPGA: 3 x (64 Writes + (64+16) Reads ) = 432 FPGA Cycles
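The two estimates above can be reproduced with the small helpers below (hypothetical names). The per-IDCT cycle counts and the transfer counts are the figures from the slides.

```c
/* Bus cycles per 8x8 block when every 32-bit transfer costs `states`
 * cycles of the bus state machine. */
static unsigned fpga_cycles_per_block(unsigned states, unsigned writes,
                                      unsigned reads)
{
    return states * (writes + reads);
}

/* Software estimate: 8 row IDCTs plus 8 column IDCTs per block. */
static unsigned arm_cycles_per_block(unsigned row_cycles, unsigned col_cycles)
{
    return 8 * row_cycles + 8 * col_cycles;
}
```

`fpga_cycles_per_block(3, 64, 64 + 16)` is 432 and `arm_cycles_per_block(60, 120)` is 1440. With the original 6-state bus machine the FPGA figure would have been 864. The ratio 1440 / 432 (about 3.3) is a speed-up only under the assumption that ARM and FPGA cycles take the same time, which the slides do not state.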
Conclusion
 Work Completed
 Parallelized IDCT routines for each block decode in FPGA
 Work to be completed
 Get interface working
 What we would have done differently
 Use DMA to reduce communication overhead even more
 Parallelize ARM and FPGA block processing
 Additional speed-up possible by moving njConvert
(upsampling & color conversion) into FPGA
References
 Joint Photographic Experts Group
 http://www.jpeg.org/jpeg/index.html
 Introduction to JPEG
 http://www.faqs.org/faqs/compression-faq/part2/
 NanoJPEG
 http://keyj.s2000.ws/?p=137
Questions
?