Data Compression/Multiple transformations


combination

Early text data compression techniques were divided into two very different-seeming approaches: dictionary coders in the Ziv–Lempel style, which replace repeated strings with references to earlier occurrences, and statistical coders, which drive a Huffman or arithmetic coder from a (typically Markov) model of symbol probabilities.

Researchers are searching for some way to combine them, to get the benefits of both.[1]

A typical paragraph of text has several different kinds of patterns. Some archive software applies a series of different transformations, each transformation targeting a different kind of pattern in the text.

Such a transformation is often based on some kind of model of how the file was produced.

Compression software (especially image compression software) often sends the raw data through one or more stages of decorrelation (such as a Karhunen–Loève transform,[2] a color space transformation, linear filtering, a DFT, or other transform coding) -- the "precompressor" stages -- before feeding the result into an entropy compressor.

Many transforms used in compression, counter-intuitively, produce an intermediate result that requires more bits of dynamic range to store. Transforming raw 24-bit RGB pixels into 26-bit YCoCg-R pixels[2] or transforming raw 16-bit audio samples into 17-bit delta coded differences may seem to "waste" bits, but it improves net compression performance: in many cases, the entropy compressor gives smaller compressed file sizes when given such "de-correlated" intermediate data than when given the original raw data. (The improvement from such transformations is often called "coding gain".)
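As a small illustration of that point, here is a minimal sketch of delta coding over signed 16-bit samples (illustrative only; real audio codecs use predictors more elaborate than order-1 deltas):

    def delta_encode(samples):
        """Replace each sample with its difference from the previous sample."""
        prev = 0
        deltas = []
        for s in samples:
            deltas.append(s - prev)
            prev = s
        return deltas

    def delta_decode(deltas):
        """Undo delta coding by running a cumulative sum."""
        prev = 0
        samples = []
        for d in deltas:
            prev += d
            samples.append(prev)
        return samples

    # Signed 16-bit samples span -32768..32767, so differences span -65535..65535:
    # representing every possible difference exactly requires 17 bits.
    assert delta_decode(delta_encode([0, 100, 90, -32768, 32767])) == [0, 100, 90, -32768, 32767]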

color transforms

The JPEG 2000 Reversible Color Transform (RCT) is a YUV-like color space that does not introduce quantization errors, so it is fully reversible.

The compressor converts RGB to YCbCr with:
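    Y  = floor( (R + 2G + B) / 4 )
    Cb = B − G
    Cr = R − G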

and then compresses the image.

The decompressor decompresses the image, then takes YCbCr pixels and restores the original RGB image with:
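    G = Y − floor( (Cb + Cr) / 4 )
    R = Cr + G
    B = Cb + G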

A pixel in the JPEG 2000 YCbCr format requires 26 bits to represent losslessly.
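A minimal sketch of the transform pair in Python (floor division matches the floor() in the equations above), confirming both the exact round trip and the 8 + 9 + 9 = 26 bit budget:

    def rgb_to_rct(r, g, b):
        y = (r + 2 * g + b) // 4      # 0..255: fits in 8 bits
        cb = b - g                    # -255..255: needs 9 bits
        cr = r - g                    # -255..255: needs 9 bits
        return y, cb, cr

    def rct_to_rgb(y, cb, cr):
        g = y - (cb + cr) // 4
        return cr + g, g, cb + g

    # The transform is exactly reversible for every 24-bit RGB value.
    assert all(rct_to_rgb(*rgb_to_rct(r, g, b)) == (r, g, b)
               for r in (0, 1, 127, 255)
               for g in (0, 1, 127, 255)
               for b in (0, 1, 127, 255))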

The Red, Green, and Blue layers of an image are typically highly correlated. So compressing the R, G, and B layers independently ends up storing much of the same data 3 times.

Even though "decorrelating" a 24-bit-per-pixel RGB image into YCbCr form requires *more* bits (26 bits) to hold each pixel, decorrelating helps later stages in the compression pipeline do a better job. So independently compressing the Y, Cb, and Cr planes usually results in a smaller compressed file than independently compressing the R, G, and B layers of the same file.

Most (but not all!) lossless color space transforms, including the JPEG 2000 RCT, are "expanding". "Expanding" color space transforms convert 24 bit RGB pixels to intermediate YCbCr pixels that require *more* than 24 bits to represent losslessly.

A few lossless color space transforms are non-expanding: they convert 24 bit RGB pixels to intermediate pixels (such as YCoCg24) that require the same 24 bits to represent losslessly.

All color space transforms that convert 24 bit RGB pixels to something less than 24 bits per pixel are lossy: there are 2^24 distinct RGB values, so no representation with fewer than 24 bits per pixel can give every one of them its own code.

audio transforms

music

... (fill in more details) ...


Most modern lossy audio codecs use the modified discrete cosine transform (MDCT) as the decorrelation stage (as in Vorbis, AAC, and AC-3) or as one of the decorrelation stages (as in MP3 and ATRAC).
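As a rough illustration of what that decorrelation stage computes, here is a naive MDCT of a single block of 2N samples, written directly from the textbook definition (no windowing, no overlap-add, O(N^2); real codecs use windowed, FFT-based implementations):

    import math

    def mdct(x):
        """Naive MDCT: 2N time-domain samples -> N frequency coefficients."""
        big_n = len(x) // 2
        return [sum(x[n] * math.cos(math.pi / big_n * (n + 0.5 + big_n / 2) * (k + 0.5))
                    for n in range(2 * big_n))
                for k in range(big_n)]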

speech

Speech compression often uses a relatively simple model of the human voice. The glottis produces a buzz at some frequency. The vocal tract resonates or dampens that frequency and its harmonics. The tongue and lips occasionally produce hissing sibilants and popping plosive sounds.

Fairly intelligible speech can be reconstructed at a low bit rate by updating a linear predictive coding model at 30 to 50 frames per second.

There are several ways to represent the linear predictive filter coefficients.

While in principle the linear predictive filter coefficients could be transmitted directly, typically the system is set up so (after entropy decoding) the received bits represent reflection coefficients,[3] log area ratios, line spectral pairs, or some other representation of the filter coefficients.
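To make that model concrete, here is a toy sketch of the synthesis side only: a periodic glottal "buzz" pushed through an all-pole filter whose coefficients stand in for the decoded predictor coefficients (the coefficient values here are made up purely for illustration):

    def lpc_synthesize(coeffs, excitation):
        """All-pole synthesis: each output sample is the excitation plus a weighted sum of past outputs."""
        out = []
        for n, e in enumerate(excitation):
            prediction = sum(a * out[n - 1 - k]
                             for k, a in enumerate(coeffs) if n - 1 - k >= 0)
            out.append(e + prediction)
        return out

    # A crude "buzz": one impulse every 80 samples, i.e. a 100 Hz pitch at an 8 kHz sample rate.
    buzz = [1.0 if n % 80 == 0 else 0.0 for n in range(800)]
    voiced = lpc_synthesize([0.9, -0.2], buzz)   # made-up 2-tap predictor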

Codec2

... (fill in details) ...

DEFLATE

DEFLATE combines LZ77 with Huffman encoding. It is specified in RFC 1951.

It is extremely widely used: ZIP files, gzip files, and PNG image files all use DEFLATE compression.

zlib

The "zlib compressed data format specification" is a standardized way of packaging compressed data in a file, with a simple header that indicates the type of compression and helps detect the most common kinds of data corruption. It is often used to store data compressed with DEFLATE with a window size up to 32K.

The "zlib compressed data format specification" is specified in RFC 1950.

LZX

The LZX file format and compression algorithm were invented by Jonathan Forbes and Tomi Poutanen. LZX is used in a popular stand-alone Amiga file archiver. Microsoft Compressed HTML Help (CHM) files use LZX compression. Windows Imaging Format (".WIM") files support several data compression methods, including LZX. Xbox Live Avatars use LZX compression.

LZX uses both LZ77-style sliding window compression and dynamic Huffman codes, somewhat as DEFLATE uses both. The LZ77-style window size (WINDOW_SIZE) is a power of two, at least 2^15 and at most 2^21. LZX uses several Huffman tree structures, each of them canonical.

The most important Huffman tree in LZX is the "main tree", composed of 256 elements corresponding to the possible literal byte values, plus 8*NUM_POSITION_SLOTS elements corresponding to matches (NUM_POSITION_SLOTS is in the range of 30 to 50 slots, depending on the window size). The second most important Huffman tree in LZX is the "length tree", composed of 249 elements. Other trees have smaller roles.

The LZX compressor divides the file into one or more blocks, at boundaries of its choosing. Each block is one of 3 types: "verbatim block", "aligned offset block", or "uncompressed block". The "uncompressed block" has a few extra fields in its header, followed by a literal copy of the uncompressed data.

In the data portion of the "verbatim block" and "aligned offset block" blocks, LZX uses length-limited Huffman codewords; LZX limits each codeword to at most 16 bits wide. Each "verbatim block" and "aligned offset block" has a larger header that gives the information (in a very compact differentially-compressed form) required for the decoder to regenerate the various Huffman trees used in that block.

The header is decoded to obtain the codeword length of each symbol. Each length is from 0 to 16 bits (inclusive), where "0 bits" indicates that the corresponding symbol never occurs in this block and 16-bit codewords are used for symbols that occur extremely rarely. A variety of techniques are used to compress the header into relatively few bits.
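Here is a sketch of how a decoder can turn those per-symbol codeword lengths back into a canonical code, in the same spirit as the algorithm given in RFC 1951 (the header decoding itself is assumed to have been handled already):

    def canonical_codes(lengths):
        """Assign canonical Huffman codewords, given one codeword length per symbol (0 = symbol unused)."""
        max_len = max(lengths)
        bl_count = [0] * (max_len + 1)
        for l in lengths:
            if l:
                bl_count[l] += 1
        # Compute the smallest codeword of each length.
        next_code = [0] * (max_len + 1)
        code = 0
        for bits in range(1, max_len + 1):
            code = (code + bl_count[bits - 1]) << 1
            next_code[bits] = code
        # Hand out consecutive codewords, in symbol order, within each length.
        codes = {}
        for symbol, l in enumerate(lengths):
            if l:
                codes[symbol] = format(next_code[l], "0{}b".format(l))
                next_code[l] += 1
        return codes

    print(canonical_codes([2, 1, 3, 3]))   # {0: '10', 1: '0', 2: '110', 3: '111'}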

When the "distance" value is decoded, there are several special values. It happens quite often that two long phrases of English text are *almost* identical, but there is a small typo or substitution in the middle of the otherwise-identical long phrase. With the original LZ77, the result is 3 copy items: [long distance X to first part, length of first part], [distance Y to the insertion, length of insertion], [exactly the same long distance X to start of second part, length of second part]. LZX, rather than repeat exactly the same long distance X twice, uses a special "distance" value to indicate that we are "continuing" to copy a previous block after making a substitution in the middle. (The special distance value ends up occupying fewer bits than the full long distance).

After the decompressor uses the block header to set up all the trees, the decompressor's main loop in a verbatim block is:

After the decompressor uses the block header to set up all the trees, the decompressor's main loop in an aligned block is:
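The bit-level details are in the references; the general shape of both loops is the same and looks roughly like the following sketch (Python-style pseudocode with hypothetical helper functions, not the exact LZX bit layout):

    def decode_block(bits, main_tree, length_tree, window, block_size):
        """Rough shape of the LZX decode loop: literals and matches, both driven by the main tree."""
        produced = 0
        while produced < block_size:
            sym = main_tree.decode(bits)          # one codeword from the "main tree"
            if sym < 256:
                window.append(sym)                # a literal byte
                produced += 1
            else:
                # A match: the symbol encodes a length slot and a position slot.
                # (Aligned offset blocks decode the low bits of the offset differently.)
                length = decode_match_length(sym, length_tree, bits)   # hypothetical helper
                distance = decode_match_distance(sym, bits)            # hypothetical helper
                for _ in range(length):
                    window.append(window[-distance])
                produced += length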

The Microsoft "cabinet" (".CAB") compressed archive file format supports 3 data compression methods: DEFLATE, LZX, and Quantum compression. The compressor that puts executable files into a ".CAB" archive may optionally "detect "CALL" instructions, converting their operands from relative addressing to absolute addressing, thus calls to the same location resulted in repeated strings that the compressor could match, improving compression of 80x86 binary code." The ".CAB" decompressor, after doing LZX decompression described above, and if the "preprocessing" bit in the header was set to one, must convert those operands back to relative addressing. We will discuss such preprocessor/decorrelators in general, and Executable Compression specifically, in more detail in a later chapter.

[4][5][6][7][8]

LZX DELTA

LZX DELTA (LZXD) is a variation of LZX modified to support efficient delta compression (patching).

An LZX DELTA decompressor takes the old version of some plaintext file (this must be exactly the same as the old version used by the compressor) and the LZXD-compressed patch file, and uses them to generate the new version of that plaintext file.

The main difference between LZXD and LZX is that the window is pre-loaded with the reference file, with the decompressor decoding the first byte of the output file into the window at the location immediately after the last byte of the reference file. Normal LZX copy items have a distance that is at most the full distance back to the first byte of the output file; LZXD allows a distance that reaches much further back to allow copies from the reference file.

LZXD supports many more "slots", up to 290 slots, which are encoded into the main tree. This allows LZXD copy items to reach much further into the past, so the format can take advantage of very long reference files.

LZXD also supports much larger "length" values. Much as LZX has a "short length" of 0 to 7, where "7" is used as an escape to get a "long length" from 7 to 255, LZXD uses the same "short length" and "long length" and goes one step further, using a long length of "257" as an escape for an "extra-long length" that can copy 32 KByte in a single copy.

[9]

LZARI

LZARI uses an LZSS pre-processor followed by arithmetic coding. LZARI was developed by Haruhiko Okumura.

LZH

LZH uses an LZSS pre-processor followed by Huffman coding. LZH was developed by Haruyasu Yoshizaki for the LHA archiver, based on LZARI.[10]

LZB

An LZB compressor uses an LZSS pre-processor followed by Elias gamma coding of the "match length" and of the "offset". (That extra step gives LZB a better compression ratio than LZSS alone, at the cost of running slower.)

LZB gets compression roughly equal to LZH and LZARI, but is simpler and faster,[10] because a universal code such as Elias gamma coding is simpler and faster to decode than Huffman or arithmetic encoding (respectively).
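For reference, Elias gamma coding represents a positive integer n as (bit-length of n minus one) zero bits followed by n written in binary; a minimal sketch (since gamma codes start at 1, an LZB-style coder would shift its lengths and offsets into that range first):

    def elias_gamma_encode(n):
        """Encode a positive integer as an Elias gamma code (a string of '0'/'1' bits)."""
        assert n >= 1
        binary = bin(n)[2:]
        return "0" * (len(binary) - 1) + binary

    def elias_gamma_decode(bits):
        """Decode one gamma code from the front of a bit string; return (value, bits_used)."""
        zeros = 0
        while bits[zeros] == "0":
            zeros += 1
        return int(bits[zeros:2 * zeros + 1], 2), 2 * zeros + 1

    assert elias_gamma_encode(5) == "00101"
    assert elias_gamma_decode("00101") == (5, 5)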

LZMA

The Lempel–Ziv–Markov chain algorithm (w:LZMA) uses an LZ77-like compression algorithm, whose output is encoded with range coding.

LZMA is one of the compression methods used in the open-source 7-Zip file archiver.

The LZ77-like compression part of LZMA has many similarities with the corresponding LZ77-like compression part of LZX. LZMA does not use Huffman codes -- instead, it uses context-coded binary arithmetic codes.
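The flavor of such a coder shows up in the adaptive probability model attached to each context; here is a sketch of a typical adaptive bit model (the 11-bit precision and shift of 5 are representative constants, shown here only for illustration):

    class AdaptiveBitModel:
        """One binary context: an adaptive estimate of the probability that the next bit is 0."""
        def __init__(self):
            self.prob = 1 << 10                             # start at 50% (out of 2**11)

        def update(self, bit):
            if bit == 0:
                self.prob += ((1 << 11) - self.prob) >> 5   # nudge toward "0 is likely"
            else:
                self.prob -= self.prob >> 5                 # nudge toward "1 is likely"

A binary range coder splits its current interval in proportion to prob before coding each bit, then calls update(), so contexts with heavily skewed statistics quickly cost much less than one bit per symbol.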

LZHAM

LZHAM (LZ, Huffman, Arithmetic, Markov) is a general purpose lossless data compression library by Richard Geldreich, Jr. Both the unstable/experimental version and the "bitstream compatibility" stable version of LZHAM are released under the MIT license.[11][12]

LZW and Huffman Encoding

Some people combine LZW and Huffman encoding.

The decoder for such a system reads a series of Huffman codes from the compressed file. The decoder decodes each variable-length Huffman code into an intermediate fixed-length 12-bit index. With a typical plaintext file, some of those 12-bit numbers are used far more frequently than others, and so are assigned Huffman codes of less than 12 bits. The decoder then uses an LZW decoder to recover the original variable-length plaintext of 1 or more characters associated with that intermediate index.


[13]
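Here is a sketch of the LZW half of that decoder, taking the 12-bit indices as if they had already been recovered by the Huffman decoder (table-full and clear-code handling omitted):

    def lzw_decode(indices):
        """Expand a list of LZW indices back into bytes."""
        table = {i: bytes([i]) for i in range(256)}   # start with all single-byte strings
        next_code = 256
        prev = table[indices[0]]
        out = bytearray(prev)
        for code in indices[1:]:
            if code in table:
                entry = table[code]
            elif code == next_code:                   # the classic "KwKwK" corner case
                entry = prev + prev[:1]
            else:
                raise ValueError("corrupt LZW stream")
            out += entry
            table[next_code] = prev + entry[:1]       # grow the dictionary
            next_code += 1
            prev = entry
        return bytes(out)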

implementation tips

Many compressed file formats must be decompressed using an entropy decoder, followed by a whole series of transformations -- a kind of Wikipedia: Pipeline (software).

There are 4 ways to implement such decompressors (and the corresponding compressors):


Converting software from a "call down to fetch more data" API to a "return up to fetch more data" API is difficult.[14]
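In Python, for example, generators give the "return up to fetch more data" style almost for free, because each stage pulls from the stage below it only when it needs another item (a toy two-stage pipeline, not a real codec):

    def undo_rle(pairs):
        """Stage 1: expand (value, count) pairs."""
        for value, count in pairs:
            for _ in range(count):
                yield value

    def undo_delta(values):
        """Stage 2: turn differences back into absolute values."""
        total = 0
        for v in values:
            total += v
            yield total

    # Each stage "returns up" for more data instead of calling down for it.
    print(list(undo_delta(undo_rle([(5, 1), (1, 3), (-2, 2)]))))   # [5, 6, 7, 8, 6, 4]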

encryption

Because encrypt-then-compress doesn't make the file any smaller, many people do compress-then-encrypt.[15][16][17][18][19][20][21][22][23]

Because modern CPUs process data faster than data can be read off a spinning disk or most external flash drives, it is possible to read and write a compressed, encrypted file in less time than it takes to read and write a normal plain text file containing the same data.[24]

A few people are doing research on algorithms that compress and encrypt simultaneously.[25][26][27][28][29][30]

Some researchers test encryption algorithms by trying to compress the encrypted data. Many encryption algorithms are specifically designed to produce ciphertext that is "indistinguishable from random noise".[31] When testing an implementation of such an algorithm, if compression produces a compressed output file that is smaller than the encrypted input file, that tells the researcher that there is a flaw in the encryption software -- either a bug in the implementation, or a correct implementation of an insecure algorithm.
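A quick way to see the idea, using random bytes as a stand-in for well-encrypted ciphertext:

    import os
    import zlib

    ciphertext_stand_in = os.urandom(1 << 16)          # "indistinguishable from random noise"
    compressed = zlib.compress(ciphertext_stand_in, 9)
    # Incompressible input comes out slightly larger, because of framing overhead.
    print(len(ciphertext_stand_in), len(compressed))

If the compressed output ever came out meaningfully smaller than the input, that would be the red flag described above.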

implementation tips

Some people using lossless compression prefer to use "idempotent" (Is there a better term for this?) compression software, even if it takes a little longer to compress.[32] In other words, they expect that when you compress a file with "maximum compression" (often "-9n") to create one compressed file, then uncompress it, then re-compress it, the resulting second compressed file will be identical to the first compressed file.
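One common culprit is the timestamp that gzip normally embeds in its header, which is part of what the "-n" flag suppresses; with Python's gzip module, for instance, the same effect comes from pinning mtime:

    import gzip

    data = b"example payload " * 100
    first = gzip.compress(data, compresslevel=9, mtime=0)    # fixed header timestamp
    second = gzip.compress(data, compresslevel=9, mtime=0)
    assert first == second                                   # byte-identical, run after run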

This generally requires that each and every transformation implementation (at least when passed the "-9n" option) is idempotent. Some things that are not idempotent (and so should be avoided, at least when passed the "-9n" option) are:

some information theory terminology

(FIXME: replace "idempotent" above with "non-drifting lossy", as defined below? Or is there an even better term?)

Some information theory people[33] classify transformations into 4 categories of loss:


Some transforms are known not to completely remove all redundancy. In particular, researchers have designed special LZ77 encoders and LZW encoders that encode extra "redundant bits" in the choice of which copy was selected, in a way that is backwards-compatible: Standard LZ77 or LZW decoders can extract the original file. Special LZ77 or LZW decoders can extract the original file plus these extra "redundant bits". These extra bits can be used for Reed-Solomon or other forms of error correction, or to store metadata about the original file such as a public signature or message authentication code.[34][35]

Network sparsification

Network sparsification has applications in data compression.[36]

References

  1. "LZRW4: Ziv and Lempel meet Markov" by Ross Williams. 23-Jul-1991.
  2. Henrique S. Malvar, Gary J. Sullivan, and Sridhar Srinivasan. "Lifting-based reversible color transformations for image compression". "the Karhunen-Loève transform... provides maximum decorrelation."
  3. Nimrod Peleg. "Linear Prediction Coding". p. 55
  4. Wikipedia: LZX (algorithm)
  5. "Microsoft LZX Data Compression Format"
  6. "lzx.c - LZX decompression source code"
  7. "lzx.c - LZX decompression source code" (has many comments on the difference between the documentation and the implementation, clarifying several details the "official" documentation neglects to specify)
  8. "Solve the File Format Problem: LZX".
  9. "LZX DELTA Compression and Decompression"
  10. Andy McFadden. "Hacking Data Compression Lesson 11 LZH, LZARI, and LZB". 1993.
  11. "LZHAM codec - unstable/experimental repo."
  12. github: LZHAM - Lossless Data Compression Codec (was Google Code: lzham ) by Rich Geldreich.
  13. Steve Wolfman. "Compression with LZW and Huffman Encoding".
  14. "Hacking Data Compression" by Andy McFadden
  15. "Can I compress an encrypted file?"
  16. "Compress and then encrypt, or vice-versa?"
  17. "Composing Compression and Encryption"
  18. "Compress, then encrypt tapes"
  19. "Is it better to encrypt a message and then compress it or the other way around? Which provides more security?"
  20. "Compressing and Encrypting files on Windows"
  21. "Encryption and Compression"
  22. "Do encrypted compression containers like zip and 7z compress or encrypt first?"
  23. "When compressing and encrypting, should I compress first, or encrypt first?"
  24. Bill Cox. "TinyCrypt".
  25. Dahua Xie and C.-C. Jay Kuo. "Enhanced Multiple Huffman Table (MHT) Encryption Scheme Using Key Hopping". 2004. http://viola.usc.edu/Publication/PDF/ISCAS/2004_ISCAS_Xie.pdf
  26. "Encryption Using Huffman Coding". quote: "If you take a data file and compress it using Huffman coding techniques, you get something that looks random. Make the coding table the key... You could make it more difficult by arbitrarily permuting the ones and zeros and choosing random digraphs to encode... several dozen common digraphs as single symbols".
  27. mok-kong shen. "PREFIXCODING, an encryption scheme with pseudo-random prefix codes substitution and transposition".
  28. Douglas W. Jones. "Splay Tree Based Codes". quote: "Splay compression performs unexpectedly well on images... It is only an accident that splay trees are useful for encryption, and the security of variants on the algorithm is still largely an open question."
  29. Qing Zhou, Kwok-wo Wong, Xiaofeng Liao, Yue Hu. "On the security of multiple Huffman table based encryption" 2011.
  30. Gillman, Mohtashemi, and Rivest. "On Breaking a Huffman Code". 1996. CiteSeerX 10.1.1.309.9256. quote: "The idea of using data compression schemes for encryption is very old, dating back at least to Roger Bacon in the 13th century".
  31. Wikipedia: ciphertext indistinguishability#Indistinguishable from random noise
  32. "gzip -9n sometimes generates a different output file on different architectures"
  33. https://groups.google.com/forum/?fromgroups=#!topic/comp.compression/r96nB43uuLs
  34. Stefano Lonardi and Wojciech Szpankowski. "Joint Source-Channel LZ'77 coding". Data Compression Conference, 273-282, Snowbird, 2003.
  35. Yonghui Wu, Stefano Lonardi, Wojciech Szpankowski. "Error-Resilient LZW Data Compression". Data Compression Conference, 193-202, Snowbird, 2006.
  36. Erica Klarreich. "'Outsiders' Crack a 50-Year-Old Math Problem". 2015.