
cujpeg
A Simple JPEG Encoder with CUDA Technology
Dongyue Mou and Zeng Xing
Outline
• JPEG Algorithm
• Traditional Encoder
• What's new in cujpeg
• Benchmark
• Conclusion
JPEG Algorithm
JPEG is a commonly used method for image compression.
The JPEG encoding algorithm consists of 7 steps:
1. Divide the image into 8x8 blocks
2. [R,G,B] to [Y,Cb,Cr] conversion
3. Downsampling (optional)
4. FDCT (Forward Discrete Cosine Transform)
5. Quantization
6. Serialization in zig-zag style
7. Entropy encoding (Run Length Coding & Huffman coding)
JPEG Algorithm -- Example
[Example image: the text "This is an example", divided into 8x8 blocks]
RGB vs. YCC
For a human eye, the precision of colors suffers less than the precision of contours (based on luminance).
Color space conversion takes advantage of this!
Simple color space model: [R,G,B] per pixel
JPEG uses the [Y,Cb,Cr] model:
Y = brightness (luminance)
Cb = blueness (blue chrominance)
Cr = redness (red chrominance)
Convert RGB to YCC
8x8 pixels, 1 pixel = 3 components
MCU with sampling factor (1, 1, 1)
Downsampling
Y is taken for every pixel; Cb and Cr are taken once per block of 2x2 pixels.
MCU (minimum coded unit): the smallest group of data units that is coded.
With sampling factor (2, 1, 1) an MCU covers 16x16 pixels (4 Y blocks, 1 Cb block, 1 Cr block).
The data size is immediately reduced to half.
Apply FDCT
2D DCT: the bottleneck; the complexity of the direct algorithm is O(n^4).
1D DCT: a 2-D transform is equivalent to 1-D transforms applied in each direction.
The kernel therefore uses 1-D transforms.
Apply FDCT
Shift operations: map sample values from [0, 255] to [-128, 127].
[Figure: DCT result, and the meaning of each position in the DCT result matrix]
Quantization
The DCT result is divided element-wise by the quantization matrix (adjustable according to quality) to give the quantization result.
Zigzag reordering / Run Length Coding
The quantization result is read in zigzag order and run-length coded as pairs:
[number of zeros before me, my value]
Huffman encoding
Group G | Values               | Real saved bits
0       | 0                    | (none)
1       | -1, 1                | 0, 1
2       | -3, -2, 2, 3         | 00, 01, 10, 11
3       | -7..-4, 4..7         | 000, 001, ..., 111
...     | ...                  | ...
15      | -32767..32767        | ...

RLC result:
[0, -3] [0, 12] [0, 3] ...... EOB
After the group number is added:
[0, 2, 00b] [0, 4, 1100b] [0, 2, 11b] ...... EOB
First Huffman coding (e.g. for [0, 2, 00b]):
[0, 2, 00b] => [100b, 00b]
(look up, e.g., the AC chrominance table)
Total input: 512 bits; output: 113 bits.
Traditional Encoder
All stages run on the CPU:
Image -> Load image -> Color conversion -> DCT -> Quantization -> Zigzag Reorder -> Encoding -> .jpg
Algorithm Analysis
1x full 2D DCT scan: O(N^4)
8x row 1D DCT scans + 8x column 1D DCT scans: O(N^3)
8 threads can work in parallel.
Algorithm Analysis
DCT In Place

__device__ void vectorDCTInPlace(float *Vect0, int Step)
{
    float *Vect1 = Vect0 + Step, *Vect2 = Vect1 + Step;
    float *Vect3 = Vect2 + Step, *Vect4 = Vect3 + Step;
    float *Vect5 = Vect4 + Step, *Vect6 = Vect5 + Step;
    float *Vect7 = Vect6 + Step;

    float X07P = (*Vect0) + (*Vect7);
    float X16P = (*Vect1) + (*Vect6);
    float X25P = (*Vect2) + (*Vect5);
    float X34P = (*Vect3) + (*Vect4);

    float X07M = (*Vect0) - (*Vect7);
    float X61M = (*Vect6) - (*Vect1);
    float X25M = (*Vect2) - (*Vect5);
    float X43M = (*Vect4) - (*Vect3);

    float X07P34PP = X07P + X34P;
    float X07P34PM = X07P - X34P;
    float X16P25PP = X16P + X25P;
    float X16P25PM = X16P - X25P;

    (*Vect0) = C_norm * (X07P34PP + X16P25PP);
    (*Vect2) = C_norm * (C_b * X07P34PM + C_e * X16P25PM);
    (*Vect4) = C_norm * (X07P34PP - X16P25PP);
    (*Vect6) = C_norm * (C_e * X07P34PM - C_b * X16P25PM);

    (*Vect1) = C_norm * (C_a * X07M - C_c * X61M + C_d * X25M - C_f * X43M);
    (*Vect3) = C_norm * (C_c * X07M + C_f * X61M - C_a * X25M + C_d * X43M);
    (*Vect5) = C_norm * (C_d * X07M + C_a * X61M + C_f * X25M - C_c * X43M);
    (*Vect7) = C_norm * (C_f * X07M + C_d * X61M + C_c * X25M + C_a * X43M);
}

__device__ void blockDCTInPlace(float *block)
{
    for (int row = 0; row < 64; row += 8)   // 8x row 1D DCT, stride 1
        vectorDCTInPlace(block + row, 1);
    for (int col = 0; col < 8; col++)       // 8x column 1D DCT, stride 8
        vectorDCTInPlace(block + col, 8);
}

__device__ void parallelDCTInPlace(float *block)
{
    int col = threadIdx.x % 8;
    int row = col * 8;
    __syncthreads();
    vectorDCTInPlace(block + row, 1);  // each thread: one row
    __syncthreads();
    vectorDCTInPlace(block + col, 8);  // each thread: one column
    __syncthreads();
}
Allocation
Desktop PC
– CPU: 1 P4 core, 3.0 GHz
– RAM: 2 GB
Graphics card
– GPU: 16 cores at 575 MHz, 8 SPs/core at 1.35 GHz
– RAM: 768 MB
Binding
Huffman encoding stays on the CPU:
• many conditions/branches
• intensive bit operations
• little computation
Color conversion, DCT, and quantization go to the GPU:
• intensive computation
• few conditions/branches
Binding
Hardware: 16 KB shared memory. Problem: 1 MCU contains 702 bytes of data. Result: at most 21 MCUs per CUDA block.
Hardware: 512 threads per block. Problem: 1 MCU contains 3 blocks, and 1 block needs 8 threads, so 1 MCU needs 24 threads. Result: 1 CUDA block = 504 threads.
cujpeg Encoder
CPU: Image -> Load image ... Encoding -> .jpg
GPU: Color conversion -> DCT -> Quantization -> Zigzag Reorder
cujpeg Encoder
Data flow: the image is loaded into host memory (CPU), uploaded to texture memory on the GPU, processed (color conversion, in-place DCT, quantization, reorder) in shared memory, and the result is copied from global memory back to the host for encoding into the .jpg file.

Upload the source image into a CUDA array bound to a 2D texture:

    cudaMallocArray(&textureCache, &channel, scanlineSize, imgHeight);
    cudaMemcpy2DToArray(textureCache, 0, 0, image, imageStride,
                        imageWidth, imageHeight, cudaMemcpyHostToDevice);
    cudaBindTextureToArray(TexSrc, textureCache, channel);

Color conversion fetches pixels through the texture cache:

    int b = tex2D(TexSrc, TexPosX++, TexPosY);
    int g = tex2D(TexSrc, TexPosX++, TexPosY);
    int r = tex2D(TexSrc, TexPosX += 6, TexPosY);
    float y  =  0.299f * r + 0.587f * g + 0.114f * b - 128.0f + 0.5f;
    float cb = -0.168736f * r - 0.331264f * g + 0.500f * b + 0.5f;
    float cr =  0.500f * r - 0.418688f * g - 0.081312f * b + 0.5f;
    myDCTLine[Offset + i]       = y;
    myDCTLine[Offset + 64 + i]  = cb;
    myDCTLine[Offset + 128 + i] = cr;

Quantization and zigzag reorder are fused into one step:

    for (int i = 0; i < BLOCK_WIDTH; i++)
        myDestBlock[myZLine[i]] = (int)(myDCTLine[i] * myDivQLine[i] + 0.5f);

The result buffer is allocated on the device and copied back to the host:

    cudaMalloc((void **)(&ResultDevice), ResultSize);
    cudaMemcpy(ResultHost, ResultDevice, ResultSize, cudaMemcpyDeviceToHost);
Scheduling
For each MCU (24 threads):
RGB Data
  -> 24 threads each convert 2 pixels;
     8 threads convert the remaining 2 pixels each
YCC Block (Y, Cb, Cr)
  -> 24 threads: each does 1 row vector DCT,
     1 column vector DCT, and quantizes 8 scalar values
DCT Block
  -> Quantized/Reordered Data
GPU Occupancy
My configuration: 504 threads per block, 16 registers per thread, 16128 bytes shared memory per block.
Active threads per multiprocessor: 504
Active warps per multiprocessor: 16
Active thread blocks per multiprocessor: 1
Occupancy of each multiprocessor: 67%
Maximum simultaneous blocks per GPU: 16
[Charts: multiprocessor warp occupancy while varying block size, shared memory usage, and register count]
Benchmark
           512x512   1024x1024   2048x2048   4096x4096
cujpeg     0.321s    0.376s      0.560s      1.171s
libjpeg    0.121s    0.237s      0.804s      3.971s
(Q = 80, Sample = 1:1:1)
Benchmark
Time Consumption (4096x4096)
               Load     Transfer   Compute   Encode   Total
Quality = 100  0.132s   0.348s     0.043s    0.837s   1.523s
Quality = 80   0.121s   0.324s     0.043s    0.480s   1.123s
Quality = 50   0.130s   0.353s     0.044s    0.468s   1.167s

Time share: load 10%, transfer 27%, compute 3%, encode 47%, others 13%
Benchmark
Compute throughput (4096x4096, compute time 0.043s from the table above):
Each thread performs 240 operations.
24 threads process 1 MCU.
A 4096x4096 image contains 262144 MCUs.
Total ops: 262144 * 24 * 240 = 1509949440 flops
Speed: (total ops) / 0.043s = 35.12 Gflops
Conclusion
CUDA can clearly accelerate JPEG compression.
The overall performance depends on the speed of the whole system.
Future improvements:
• more bandwidth
• better encoding routine
• downsampling support