Data Compression through Wavelet Decomposition


Scientific Data Compression through Wavelet Transformation

Chris Fleizach, CSE 262

Problem Statement

- Scientific data is hard to compress with traditional general-purpose encoders like gzip or bzip2 because common patterns do not exist.
- If the data can be transformed into a form that retains the information but can be thresholded, it can be compressed.
- Thresholding removes excessively small features and replaces them with a zero, which is much easier to compress than a 64-bit double (see the sketch after this list).
- Wavelet transforms are well suited for this purpose and have been used in image compression (JPEG 2000).
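As an illustration (not code from the report), hard thresholding is a single pass over the transformed coefficients; the long runs of zeros it leaves behind are what gzip compresses so well. A minimal C++ sketch:

    #include <cmath>
    #include <vector>

    // Hard thresholding: zero out every coefficient whose magnitude falls
    // below the limit. The resulting runs of zeros compress far better
    // than arbitrary 64-bit doubles.
    void thresholdCoefficients(std::vector<double>& coeffs, double limit) {
        for (double& c : coeffs) {
            if (std::fabs(c) < limit)
                c = 0.0;
        }
    }

Called with, e.g., limit = 1e-4, this corresponds to the "ZeroLimit" column in the results below.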

Why Wavelets?

 Wavelet transforms encode more information than other techniques like Fourier transforms.

- Both time and frequency information are preserved.
- In practical terms, the transformation is applied at many scales and sizes within the signal.
- This results in vectors that encode approximation and detail information.
- By separating the signal this way, it is easier to threshold and remove information, so the data can be compressed.
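As a concrete example (not from the original slides), one level of the Haar transform maps each pair of samples to a scaled average and a scaled difference:

$$a_k = \frac{x_{2k} + x_{2k+1}}{\sqrt{2}}, \qquad d_k = \frac{x_{2k} - x_{2k+1}}{\sqrt{2}}$$

On smooth data, neighboring samples are nearly equal, so the detail coefficients $d_k$ come out near zero and can be thresholded away.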

Wavelets in Compression

The JPEG 2000 standard gave up the discrete cosine transform in favor of a wavelet transform. The FBI uses wavelets to compress fingerprint scans to 15-20 times smaller than their original size.

Choosing the Right Wavelet: The Transform

$$C(s,\tau) = \int_{-\infty}^{\infty} f(t)\,\psi_{s,\tau}(t)\,dt, \qquad \psi_{s,\tau}(t) = \frac{1}{\sqrt{s}}\,\psi\!\left(\frac{t-\tau}{s}\right)$$

The continuous wavelet transform is the sum over all time of the signal multiplied by scaled, shifted versions of the wavelet. Unfortunately, it is slow, generates far too much data, and is hard to implement. (From mathworks.com)

Choosing the Right Wavelet: The Transform

The discrete transform: if the scales and positions are chosen as powers of two, the transform is much more efficient and just as accurate. The signal is then sent through only two "subband" coders, which extract the approximation and the detail data from the signal.

[Figure: signal decomposed by low-pass and high-pass filters to get approximation and detail information. From mathworks.com]

Choosing the Right Wavelet: The Decomposition

The signal can be recursively decomposed to get finer detail and a more general approximation. This is called multi-level decomposition. A signal can be decomposed as many times as it can be divided in half, so only one approximation signal remains at the end of the process. (From mathworks.com)
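In other words (an added note that follows directly from the halving), a signal of length $N$ supports at most

$$L_{\max} = \lfloor \log_2 N \rfloor$$

decomposition levels; a 1024-sample signal can be decomposed at most 10 times.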

Choosing the Right Wavelet: The Wavelet

The low- and high-pass filters (subband coders) are in reality the wavelet that is used. A wide variety of wavelets have been created over time. The low-pass filter is called the scaling function; the high-pass filter is the wavelet function. Different wavelets give better results depending on the type of data.

[Figure: approximation/low pass/scaling function vs. detail/high pass/wavelet function. From mathworks.com]

Choosing the Right Wavelet: The Wavelet

The wavelets that turned out to give the best results were the biorthogonal wavelets. These were introduced by Cohen, Daubechies, and Feauveau, and they exploit the fact that exact reconstruction with symmetric filters is impossible if the same wavelet is used for both decomposition and reconstruction. Instead, a reconstruction wavelet and a slightly different decomposition wavelet are used.

[Figure: the coefficients of the filters used for convolution, and the actual wavelet and scaling functions. From mathworks.com]
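For a concrete example of such a filter pair (illustrative: these are the classic CDF 5/3 analysis filters behind the bior2.2-style wavelets, in one common normalization; Matlab's reported coefficients differ by factors of sqrt(2)):

    // CDF 5/3 analysis filters (one common normalization).
    // The decomposition pair differs from the reconstruction pair,
    // which is exactly the biorthogonal idea described above.
    const double analysisLow[5]  = { -1.0/8, 2.0/8, 6.0/8, 2.0/8, -1.0/8 };  // scaling
    const double analysisHigh[3] = { -1.0/2, 1.0, -1.0/2 };                  // wavelet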

Testing Methodology

To find the best combination of wavelet, decomposition, and thresholding, an exhaustive search was done with Matlab:

- A 1000x1000 grid of vorticity data from the Navier-Stokes simulator was first compressed with gzip. This was the baseline file to compare against.
- Then each wavelet available in Matlab was tested with 1-, 3-, and 5-level decomposition, in combination with thresholding that removed values smaller than 1x10^-4 down to 1x10^-7.
- The resulting data was saved, compressed with gzip, and compared against the baseline.
- Then the data was reconstructed, and the maximum and average errors were measured (see the sketch after this list).
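A minimal sketch of how the two error figures can be computed from the original and reconstructed grids (names are illustrative, not the actual test harness):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Track the largest single deviation and the mean absolute deviation
    // between the original data and its lossy reconstruction.
    struct ErrorStats { double maxError; double avgError; };

    ErrorStats measureError(const std::vector<double>& orig,
                            const std::vector<double>& recon) {
        ErrorStats s = {0.0, 0.0};
        for (std::size_t i = 0; i < orig.size(); ++i) {
            double e = std::fabs(orig[i] - recon[i]);
            if (e > s.maxError) s.maxError = e;
            s.avgError += e;
        }
        if (!orig.empty()) s.avgError /= orig.size();
        return s;
    }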

Testing Methodology

Sample Results (sizes in bytes):

Desc          Wavelet   ZeroLimit   OldSize   NewSize   Ratio    MaxError      AvgError      StdDev
MultiLevel5   bior3.1   0.0001      6348259     49242   128.92   3.02992E-05   1.54739E-06   8.9255E-07
MultiLevel5   bior3.3   0.0001      6348259     61604   103.05   2.98889E-05   1.39991E-06   5.65952E-07
MultiLevel5   db5       0.0001      6348259     67950    93.43   4.69813E-05   1.65457E-06   2.89868E-07
MultiLevel5   bior3.5   0.0001      6348259     69646    91.15   3.18938E-05   1.29838E-06   5.99841E-07
MultiLevel5   db4       0.0001      6348259     71276    89.07   4.23476E-05   1.95235E-06   7.95851E-07
MultiLevel5   bior4.4   0.0001      6348259     74540    85.17   7.37695E-05   1.65387E-06   3.5692E-07
MultiLevel5   bior5.5   0.0001      6348259     75502    84.08   6.88371E-05   2.03694E-06   6.00806E-07
MultiLevel5   sym4      0.0001      6348259     77540    81.87   6.60595E-05   1.97454E-06   6.83028E-07
1 Level       sym3      0.0000001   6348259   2529604     2.51   1.18008E-07   7.8547E-09    6.15159E-09
1 Level       bior3.1   0.0000001   6348259   2213431     2.87   1.4574E-07    6.39022E-09   6.91756E-09
MultiLevel5   bior3.5   0.00001     6348259    134704    47.13   2.61954E-06   2.23942E-07   1.20665E-07
MultiLevel5   bior3.1   0.00001     6348259    105967    59.91   3.54416E-06   2.55347E-07   1.28242E-07
MultiLevel5   bior3.7   0.00001     6348259    146749    43.26   2.64505E-06   2.19979E-07   1.1741E-07
MultiLevel5   bior3.3   0.00001     6348259    124831    50.85   3.37968E-06   2.29704E-07   1.1771E-07
MultiLevel5   bior3.9   0.00001     6348259    159372    39.83   2.68436E-06   2.22655E-07   1.10396E-07
MultiLevel3   bior2.2   0.0000001   6348259   4072607     1.56   1.11517E-07   5.36598E-09   3.3909E-09
MultiLevel5   bior2.2   0.0000001   6348259   4075903     1.56   1.12578E-07   5.62053E-09   3.35255E-09

For the application, I chose three of the methodologies, representing high compression/high error, medium compression/medium error, and low compression/low error.

Matlab Functions

Four Matlab functions were written for compression and decompression:

- wavecompress (1-D) and wavecompress2 (2-D)
- wavedecompress (1-D) and wavedecompress2 (2-D)

wavecompress2 - lossy compression for 2-D data:

    [savings] = WAVECOMPRESS2(x,mode,outputfile) compresses the 2-D data
    in x using the mode specified and saves it to outputfile. The return
    value is the compression ratio achieved. The valid values for mode are:
      1 = high compression, high error (uses the bior3.1 filter and a 1x10^-4 limit)
      2 = medium compression, medium error (uses the bior3.1 filter and a 1x10^-5 limit)
      3 = low compression, low error (uses the bior5.5 filter and a 1x10^-7 limit)
    To decompress the data, see wavedecompress2.

Some Pictures

C++ Implementation

With the easy work out of the way, the next phase of the project was a C++ implementation. There were a few reasons for reinventing the wheel:

- I wanted to fully understand the process.
- I could try my hand at some parallel processing.
- I could have a native 3-D transformation.
- And Matlab makes my computer very slow.

Demo

./wavecomp -c 1 -d 2 vorticity000.dat

./wavedec vorticity000.dat.wcm

Algorithm

 1.

decomposition Convolve the input with the low pass filter to get the approx. vector. 2.

The basic algorithm for 1-D multi-level Convolve the input with the high pass to get the detail vector 3.

Set the input = approx, and repeat for number of times to get desired decomposition level
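A minimal sketch of that loop (dwt_step stands in for the convolve-and-downsample pass described on the next slide; the names are illustrative, not the program's actual API):

    #include <vector>

    // One analysis step: convolve the input with the low- and high-pass
    // filters and downsample by two (see the convolution slide).
    void dwt_step(const std::vector<double>& input,
                  std::vector<double>& approx, std::vector<double>& detail);

    // Multi-level 1-D decomposition: keep each detail vector, feed the
    // approximation back in, and stop at the requested level.
    std::vector<std::vector<double>> decompose(std::vector<double> signal, int levels) {
        std::vector<std::vector<double>> out;
        for (int l = 0; l < levels; ++l) {
            std::vector<double> approx, detail;
            dwt_step(signal, approx, detail);
            out.push_back(detail);   // step 2: save the detail vector
            signal = approx;         // step 3: input = approx, repeat
        }
        out.push_back(signal);       // the single remaining approximation
        return out;
    }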

Convolution

The convolution step is tricky, though, because the filters use data from before and after a specific point, which makes edges hard to handle. For signals that aren't sized appropriately, the data must be extended; the most common ways are periodic extension, symmetric extension, and zero-padding. The convolution algorithm (with the downsampling by two folded in):

    for (int k = 0; k < signalSize; k += 2) {
        double sum = 0;
        for (int j = 0; j < filterSize; j++) {
            if (k - j >= 0 && k - j < signalSize)  // zero-padding: samples outside the signal count as 0
                sum += filter[j] * input[k - j];
        }
        output[k / 2] = sum;  // stepping k by two downsamples as we go
    }

Implementation

The convolution caused the most problems, as many available libraries didn't seem to do it correctly, or assumed the data was periodic or symmetric. I finally appropriated some code from the PyWavelets project that handled zero-padding extension, determined the appropriate output sizes, and performed the correct convolution along the edges.

2-D Transformation

 1.

The 2-D transformation proved more challenging in terms of how to store the data and how decompose it.

Convolve each row with the low pass filter to get the approx. vector, then downsample rows 2.

3.

1.

2.

3.

4.

5.

Convolve each row with the high pass to get the detail vector, then downsample columns Convolve each remaining low pass column with low pass Convolve each remaining low pass col with high pass Convolve each high pass column with low pass Convolve each high pass column with high pass Downsample each result Store the 3 detail vectors and set input = low pass/low pass. Repeat desired number of levels.
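A sketch of one 2-D level following those steps (filterRows and filterCols are assumed helpers that run the 1-D convolve-and-downsample along every row or column; names are illustrative):

    #include <vector>

    using Matrix = std::vector<std::vector<double>>;

    // Assumed helpers: apply the 1-D convolve-and-downsample along each
    // row (or column) with the given filter, halving that dimension.
    Matrix filterRows(const Matrix& m, const std::vector<double>& filter);
    Matrix filterCols(const Matrix& m, const std::vector<double>& filter);

    struct Subbands { Matrix LL, LH, HL, HH; };

    // One 2-D analysis level: rows first, then the columns of each half.
    Subbands dwt2_step(const Matrix& input,
                       const std::vector<double>& lo,
                       const std::vector<double>& hi) {
        Matrix rowLo = filterRows(input, lo);  // row approximation
        Matrix rowHi = filterRows(input, hi);  // row detail
        Subbands s;
        s.LL = filterCols(rowLo, lo);  // approx/approx: input to the next level
        s.LH = filterCols(rowLo, hi);
        s.HL = filterCols(rowHi, lo);
        s.HH = filterCols(rowHi, hi);  // the three detail subbands get stored
        return s;
    }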

Post Transformation

After the data was transformed and stored in an appropriate data structure, an in-memory gzip compression (which was oddly better than bzip2) was applied to the data, and the result was written out in binary format. Reconstruction is a separate program that does everything in reverse, except that it uses the reconstruction filters.
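The report doesn't name the compression library, but an in-memory gzip-style pass can be sketched with zlib (compress2 produces zlib framing rather than a literal .gz wrapper, so this illustrates the idea rather than the exact format):

    #include <vector>
    #include <zlib.h>

    // Deflate a buffer of doubles in memory and return the compressed bytes.
    std::vector<unsigned char> deflateBuffer(const std::vector<double>& data) {
        uLong srcLen = static_cast<uLong>(data.size() * sizeof(double));
        uLongf dstLen = compressBound(srcLen);      // worst-case output size
        std::vector<unsigned char> out(dstLen);
        int rc = compress2(out.data(), &dstLen,
                           reinterpret_cast<const Bytef*>(data.data()),
                           srcLen, Z_BEST_COMPRESSION);
        if (rc != Z_OK)
            out.clear();                            // signal failure with an empty buffer
        else
            out.resize(dstLen);                     // shrink to the actual size
        return out;
    }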

There was trouble in storing and reading the data back in an appropriate form based on the decomposition structure.

[Figure: storing the data - detail subbands DD, DA, and AD for levels L1 through L3, followed by the final approximation AA3]

2-D Results

Results for 2-D vorticity data sets using "high" compression
(bior3.1 wavelet, 1x10^-4 threshold)

Data Size    Compression Time   Compression Ratio            Max Error
64x64 grid   .066s              2.038                        1.414 x 10^-4
128x128      .090s              3.359                        1.43 x 10^-4
256x256      .161s              6.775                        1.359 x 10^-4
512x512      .390s              16.901                       2.218 x 10^-4
1024x1024    1.474s             41.624                       7.965 x 10^-5
2500x2500    21.21s             295.376 (.33% of original)   4.67 x 10^-5

2500x2500 went from 50,000,016 B to 169,344 B.
1024x1024 went from 8,388,624 B to 201,601 B!

2-D Results

[Figure: bar chart of compression ratio vs. data size, rising from 2.038 at 64x64 to 295.376 at 2500x2500]

Compression Ratio for different 2-D data sizes

2-D Results

Results for 2-D vorticity data sets using "low" compression
(bior5.5 wavelet, which has more coefficients than bior3.1; 1x10^-7 threshold)

Data Size    Compression Time   Compression Ratio   Max Error
64x64 grid   .071s              .6978               9.2 x 10^-8
128x128      .114s              1.350               8.95 x 10^-8
256x256      .222s              2.393               9.27 x 10^-8
512x512      .601s              5.678               9.23 x 10^-8
1024x1024    2.133s             16.11               8.013 x 10^-8
2500x2500    20.33s             23.03               1.389 x 10^-7

The 64x64 case actually increases in size because the decomposition creates matrices whose combined size is larger than the original, and the threshold level is too low.

Comparison

Compared to the adaptive subsampling presented in the thesis of Tallat:

Threshold Level   Wavelet Compression Ratio   Adaptive Subsampling Compression Ratio
                  (1024x1024)                 (best results, 1020x1020)
10^-3             82.0091                     53.84
10^-4             41.6241                     19.57
10^-5             17.256                      7.378

2-D Pictures

128x128 vorticity: original vs. reconstructed, using high compression

2-D Pictures

Difference between 128x128 vorticity original and reconstructed

2-D Pictures

Plot of the maximum difference in each row between the original and reconstructed 1024x1024 data

3-D Data

Similar to 2-D, except more complicated.

My implementation (see the data-structure sketch after this list):

1. Transform along the Z axis and downsample, giving A and D.
2. Transform along the Y axis and downsample, giving AA, AD, DA, DD.
3. Transform along the X axis and downsample, giving AAA, AAD, ADA, ADD, DAA, DAD, DDA, DDD.
4. Take AAA and set it as the input.
5. Repeat for the desired number of levels.

The real trouble was storing the data in a way that can be reconstructed later.

[Figure: one level of 3-D decomposition as octants AAA, AAD, ADA, ADD, DAA, DAD, DDA, DDD, with the next decomposition stored in the AAA octant. From http://taco.poly.edu/WaveletSoftware/standard3D.html]
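One possible way to hold a level of that decomposition (a sketch of a plausible data structure, not the report's actual layout):

    #include <vector>

    // A flattened 3-D block of coefficients plus its dimensions.
    struct Cube {
        std::vector<double> data;
        int nx, ny, nz;
    };

    // One 3-D decomposition level: seven detail octants plus the AAA
    // approximation, which becomes the input to the next level.
    struct Level3D {
        Cube AAD, ADA, ADD, DAA, DAD, DDA, DDD;  // detail octants
        Cube AAA;                                // approximation octant
    };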

3-D Data

It's also problematic to get 3-D data, so vorticity frames were concatenated. Testing used 64x64 vorticity with 64 frames (so not really 3-D data).

Data       Compression Ratio                    Max Error
64x64x64   3.29635 (2,097,176 B to 636,308 B)   2.53188 x 10^-4

3-D Visual Comparison

Original vs. reconstructed 64x64x64 vorticity data

Parallel Processing

The detail and approximation data can be calculated independently. XML-RPC was used to send an input vector to one node to find the detail data and to another node to find the approximation data.

The master node coordinates the sending and receiving.
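A sketch of the master's per-row dispatch (xmlrpcCall is a hypothetical stand-in for whatever XML-RPC client call the program actually used; it makes explicit the per-row round trips blamed for the slowdown below):

    #include <string>
    #include <vector>

    // Hypothetical XML-RPC client call: sends one row to a slave node and
    // blocks until the filtered result comes back.
    std::vector<double> xmlrpcCall(const std::string& nodeUrl,
                                   const std::string& method,
                                   const std::vector<double>& row);

    // Master side: every row costs two blocking remote calls, one per
    // subband -- hence the enormous communication overhead.
    void decomposeRowsRemotely(const std::vector<std::vector<double>>& rows,
                               const std::string& approxNode,
                               const std::string& detailNode,
                               std::vector<std::vector<double>>& approxOut,
                               std::vector<std::vector<double>>& detailOut) {
        for (const std::vector<double>& row : rows) {
            approxOut.push_back(xmlrpcCall(approxNode, "lowpass", row));
            detailOut.push_back(xmlrpcCall(detailNode, "highpass", row));
        }
    }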

As expected, this led to an enormous slowdown in performance:

- XML-RPC adds a huge overhead.
- Data is sent one row at a time instead of an entire decomposition level at once, which creates excessive communication.
- In the 2-D decomposition, when convolving the columns, four operations could be done in parallel, but they are only done two at a time.
- Performance was not the main goal here; rather, this was a proof of concept.

Demo

Master node: ./wavecomp -c 1 -d 2 -p -m -s 132.239.55.175 132.239.55.174 vorticity001.dat

Slave node: ./wavecomp -p -s

Parallel Processing Results

Results for 2-D vorticity data sets using "high" compression, with parallel processing on two nodes

Data Size    Running Time
64x64 grid   .880s
128x128      2.64s
256x256      8.76s
512x512      33.49s

[Figure: processing time vs. data size, comparing single-node and parallel runs]

Known Issues

- This method is not applicable to all kinds of data. If not enough values are thresholded (because the limit is too low or the wavelet wasn't appropriate), the size can actually increase, because the decomposition usually creates detail and approximation vectors whose combined size is larger than the original.
- My implementation does excessive data copying, which could be eliminated to speed up processing. It comes down to the question of whether the transformations should be done in place, which is tricky because sizes can change.

Conclusion

- Lossy compression is applicable to many kinds of data, but the user should have a basic understanding of the thresholding required.
- Wavelets are a good choice for such compression, as evidenced by other applications and these results.
- The finer the resolution of the data, the better the compression.

References

The following helped significantly:

- Matlab Wavelet Toolbox: http://www.mathworks.com/access/helpdesk/help/toolbox/wavelet/wavelet.html
- Robi Polikar, Wavelet Tutorial: http://users.rowan.edu/~polikar/WAVELETS/WTtutorial.html
- PyWavelets: http://www.pybytes.com/pywavelets/
- Geoff Davis, Wavelet Construction Kit: http://www.geoffdavis.net/dartmouth/wavelet/wavelet.html
- Wickerhauser, Adapted Wavelet Analysis from Theory to Software