Data Compression/Order/Entropy

Entropy

Entropy is a measure of unpredictability. Understanding entropy not only helps you understand data compression, but can also help you choose good passwords and avoid easily-guessed passwords.

For example, 1000 bits of data representing 1000 consecutive tosses of a fair coin has an entropy of 1000 bits, since there is no way to predict whether "heads" or "tails" will come up next. 1000 bits of data representing 1000 consecutive tosses of an unfair two-headed coin has an entropy of zero (0) bits, since the coin will always come up heads.
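Worked out with the standard Shannon formula H = -sum( p * log2(p) ) per toss (defined more carefully below), those two numbers come from:

    H(fair coin)       = -( 1/2 * log2(1/2) + 1/2 * log2(1/2) ) = 1 bit per toss
    H(two-headed coin) = -( 1 * log2(1) )                       = 0 bits per toss

so 1000 fair tosses carry 1000 * 1 = 1000 bits, and 1000 two-headed tosses carry 0 bits.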

The entropy of a message is in a certain sense a measure of how much information it really contains.[1]

By removing the consistencies in a stored file -- the redundancies, the repeated components, the longest and most frequently used words, and so on -- compression increases the entropy of the file, until there is so much entropy in the file that it cannot be compressed any further. When compression shrinks a file, it raises the entropy density per unit of file length. This is why a properly compressed file, followed by an encryption pass, is harder to break with a ciphertext-only attack than the same file encrypted without compressing it first. See:

http://en.wikipedia.org/wiki/Unicity_distance

Of course, compression does not help against most plaintext attack modes, such as a chosen-plaintext attack in which the attacker gets to pick the plaintext you encrypt.

At the same time, the decompression algorithm adds more and more consistency and order back into the entropic file, until it means something again. One reason that compression does not make a good form of encryption is that information is hidden in the entropy of the file that tells the decompression program how to recover the original. In fact, with extreme compression, the amount of data needed to recover the compressed text is often a greater percentage of the file than the actual remaining data that has yet to be compressed.

Entropy is always relative to some model. It is not a property of a particular string of bits that you are given, but of the set of bitstrings that you could plausibly have been given. For a given set of bits, the model that gives the lowest entropy is the model that best predicts those bits and therefore compresses the data to the smallest size. Most compressed file formats store (a) what model we used, (b) the particular parameters of that model (if any), and (c) the data compressed according to that model. Some kinds of models have a lot of "parameters" and require a lot of space to store, so often we use completely unrealistic models -- but ones that require very little space to describe -- in order to reduce the net compressed file size (model description + compressed data).

One of the simplest models is "Every byte of the file was independently and randomly generated by throwing darts at a wall while blindfolded. Some letters occur more frequently because they have a larger area on the wall." This is the order-0 Markov model, also known as the Shannon entropy when each byte is considered an independent message. Using that model, the entropy of any particular letter x_i in a file F containing N letters is

    H(x_i) = -log2( P(x_i) ) = -log2( count(x_i) / N )

(This is the number of bits required to represent that letter using an entropy coder)

And the entropy of the entire file is the sum of the entropy of each letter in the file (the number of bits required to represent the entire file is the sum of the number of bits required to represent each letter in that file):

    H(F) = - sum over every letter position in F of log2( P(letter at that position) )

In terms of the number of unique possible messages n (any particular letter in the file is one of a list of n possible letters, from x_1 to x_n, any of which may occur 0, 1, or possibly N times), the same sum can be grouped by unique letter:

    H(F) = - sum from i=1 to n of count(x_i) * log2( count(x_i) / N )

Where

* count(x_i) is the number of times that unique letter x_i occurs in the file
* N is the total number of letters in the file

Often some particular letter x_i in the character set never occurs anywhere in some particular file -- the way the letter 'e' never occurs in Gadsby (1939). There can be any number of such letters that never actually occur. Letters that never occur make no difference to the entropy of any particular letter that does occur, or to the total, which is the entropy of the entire file under the order-0 model. When some letter x_i exists in the character set but is never actually used in some particular file, that letter's frequency is zero, its probability is zero, and its term count(x_i) * log2( count(x_i) / N ) is treated as zero (the limit of p * log2(p) as p approaches zero is zero).
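Here is a minimal sketch of that calculation in Python (the function name is ours, and each byte value is treated as one "letter", as the order-0 model above assumes):

    import math
    from collections import Counter

    def order0_entropy_bits(data: bytes) -> float:
        """Total order-0 (Shannon) entropy of `data`, in bits.
        Each byte is treated as an independent letter; letters that never
        occur contribute nothing to the sum."""
        n_total = len(data)
        if n_total == 0:
            return 0.0
        total_bits = 0.0
        for count in Counter(data).values():
            p = count / n_total                    # estimated probability of this byte value
            total_bits += count * -math.log2(p)    # each occurrence costs -log2(p) bits
        return total_bits

    import random
    print(order0_entropy_bits(bytes(random.randint(0, 1) for _ in range(1000))))  # close to 1000 bits
    print(order0_entropy_bits(bytes([1]) * 1000))                                 # 0.0 bits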

Any arbitrary file, when analyzed using an order-1 Markov model ("first-order model"), will give a number for the entropy that is smaller than or the same as the order-0 entropy.

We discuss order-1 Markov and higher-order Markov models in more detail later in this book (Markov models).

How do you further compress data

At some point, compression becomes a question of how to further compress already entropic data that consists mostly of instructions for decompressing a minimal amount of remaining data.

There are two approaches that might be tried. The first is to inject order, without increasing the length of the file, so that more compression can be recovered. The other approach is much more controversial, because it suggests injecting entropy, without changing the length of the file, so that more compression can be recovered.

The latter approach is based on a little-known principle first described in "A New Kind of Science", which suggests that entropy and order are cyclical, at least in incremental or decremental type IV cellular automata. This principle, which might explain self-organization despite the Second Law of Thermodynamics, would mean that sometimes, when enough entropy is injected into a data field, the data self-organizes into an orderly state again. If we can keep track of the entropic states it has gone through to reach that self-organized state, we can compress the new order, with some hope of recovering the original state.

entropy coding

statistical symbol compression

prefix coding

Huffman encoding

For each prefix code in the compressed data, a Huffman decompressor looks up the compressed symbol in a table and emits the corresponding plaintext symbol.

In the following examples, we assume the plaintext symbol is some 8 bit byte. (Most Huffman decompressors have 257 symbols in their alphabet: the 256 possible bytes, and a special 257th code that indicates "end-of-block". Some Huffman decompressors have additional compressed symbols that decompress to common byte pairs or entire words, which can lead to smaller compressed file sizes). [2][3]
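As a sketch of that lookup, here is a minimal bit-by-bit prefix-code decoder in Python; the tiny code table and the "EOB" end-of-block marker are made up for illustration, not taken from any real file format:

    # Hypothetical canonical prefix code: codeword bitstring -> plaintext symbol.
    # "EOB" plays the role of the special end-of-block symbol.
    CODE_TABLE = {
        "0": "e",
        "10": "t",
        "110": "a",
        "111": "EOB",
    }

    def decode(bits: str) -> str:
        """Decode a string of '0'/'1' characters until the end-of-block code."""
        output = []
        current = ""
        for bit in bits:
            current += bit
            if current in CODE_TABLE:          # a complete codeword has been read
                symbol = CODE_TABLE[current]
                if symbol == "EOB":
                    break
                output.append(symbol)
                current = ""                   # start matching the next codeword
        return "".join(output)

    print(decode("010110111"))   # -> "eta"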

Where does the translation table of code-to-letters come from? The particular source chosen is perhaps the biggest difference between one Huffman compressed file format and another.

(Yes, it's confusing to use "static" as a synonym for "dynamic". Is it too late to change this now?)

When a decompressor uses a fixed Huffman code, it uses a translation table hard-wired into the decompressor that never changes.

When a decompressor uses a static Huffman code, it reads the translation table at the beginning of the compressed block. Then it uses that data to decode the rest of the block. Most of the "optimality proofs" of Huffman assume this sort of static code.

Often one block of text will have slightly different letter frequencies from another, even inside the same file. Most static Huffman decompressors watch for a special 257th "end-of-block" symbol, and then discard the old Huffman table and read in a new static compressed table. The decompressor uses that table to decode the new block.

DEFLATE and some other decompressors put a few bits of "model selector" metadata at the beginning of every block. That metadata commands the decompressor to either copy the block verbatim with no compression (a "stored" block), decode the block using a fixed Huffman code hard-wired into the decompressor, or read a Huffman table from the start of the block and use it to decode the rest of the block.

When a decompressor uses an adaptive Huffman code, it begins with some default translation table. However, it updates that table with every symbol that goes through.

getting the compressor and the decompressor to use the same codewords

To get lossless compression, when the compressor represents the byte "e" with the bitstring "010", the decompressor has to somehow know that the bitstring "010" must be decoded into the byte "e".

There are several ways for a Huffman decompressor to "know" the symbol-to-bitstring map (sometimes called a "frequency table") used by the corresponding Huffman compressor:

* a fixed table can be hard-wired into both the compressor and the decompressor;
* the compressor can store the table (or enough information to rebuild it, such as the canonical code lengths) in the header of each file or block;
* both programs can start from the same default table and update it identically as each symbol is processed (adaptive Huffman).

Per-file and per-block dynamic Huffman codes can guarantee that certain symbols never occur in a block, and so some people write code that entirely eliminates those never-occurs symbols from the tree, reducing the number of bits in at least one codeword that does occur in the compressed text.

Many compression algorithms (including fixed Huffman codes and adaptive Huffman compression) estimate probabilities in a way that cannot guarantee that some symbol will never occur. In particular, if some symbol is so rare that it has not yet occurred in the training text -- it has zero frequency so far -- it is tempting to assign an estimated zero probability to that symbol. That leads to the zero-frequency problem.
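One common workaround (shown here only as an illustrative sketch, not as what any particular compressor does) is add-one smoothing, which keeps every symbol's estimated probability above zero:

    def estimated_probability(counts: dict, symbol, alphabet_size: int) -> float:
        """Add-one ("Laplace") smoothing: a symbol seen zero times so far
        still gets a small nonzero probability, so it can still be coded."""
        total = sum(counts.values())
        return (counts.get(symbol, 0) + 1) / (total + alphabet_size)

    counts = {"e": 5, "t": 3}
    print(estimated_probability(counts, "q", alphabet_size=257))   # small, but not zero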

Canonical Huffman code

A canonical Huffman code is a kind of Huffman code that can be very compactly described. That makes it super-wonderful for static Huffman compression. A large text file may have thousands of sections, each with a slightly different letter frequency. Every unnecessary byte in the format we choose for storing the description of the Huffman code table bloats that file by thousands of unnecessary bytes.

For example, the article on x-rays has a much higher frequency of the letter x than the previous article on yellow-bellied sapsuckers. Giving each section its own optimized Huffman table minimizes the number of bits required to store that section, if we pretend that the Huffman table overhead is not significant. Alas, that Huffman table overhead is often significant. Many compressors will spend a lot of time thinking about whether the total file will be shorter if we (a) use a separate Huffman table for each of several consecutive sections or (b) merge consecutive sections together into a single block and use a single compromise Huffman table for the whole block.

"drop some trailing ones, and then increment"

The decompressor needs two things to regenerate a block of plaintext: the exact Huffman code used by the compressor, and the series of compressed codewords.

Given any one Huffman code, one can construct many other equivalent codes, all of which have exactly the same codeword length for any particular plaintext symbol, but which vary in the exact bit pattern assigned to each plaintext symbol. Since any particular plaintext symbol is assigned a codeword of the same length in every equivalent code, any series of plaintext symbols will be compressed into a series of codewords of exactly the same total length, no matter which of the equivalent codes you choose. If the overhead of the prefix code description were insignificant, it would not matter which of these many equivalent codes you use -- you could just pick one at random. All equivalent codes compress that data to the same number of bits.

However, rather than picking one of them randomly and sending that to the decompressor, we can save bits in the prefix code description by picking one particular canonical Huffman code.

Given only the length, in bits, of the codeword for each symbol of any prefix code, we can construct the canonical Huffman code that gives every symbol the same length as that original prefix code.[5][6][7][8][9] What makes a code a canonical Huffman (CH) code is that codewords are lexicographically ordered by their code lengths.[10] A given set of probabilities does not result in a unique CH code; one is free to reorder symbols that have the same code length.[11]

The steps to create a canonical Huffman (CH) code are as follows:

  1. Construct a Huffman tree and throw out all information other than the symbols and the lengths of the codewords.
  2. Sort the symbols by increasing code lengths.
  3. Store the number of symbols with each code length along with the sorted symbols themselves.
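Here is a minimal Python sketch of steps 2 and 3, assuming step 1 has already produced a codeword length for each symbol (the function name and the example lengths are made up for illustration):

    def canonical_codes(code_lengths: dict) -> dict:
        """Assign canonical Huffman codewords given each symbol's code length.
        Symbols are sorted by (length, symbol); each codeword is the previous
        codeword plus one, shifted left whenever the length increases."""
        codes = {}
        code = 0
        prev_length = 0
        for symbol, length in sorted(code_lengths.items(), key=lambda kv: (kv[1], kv[0])):
            code <<= (length - prev_length)            # grow the codeword when the length increases
            codes[symbol] = format(code, "0{}b".format(length))
            code += 1
            prev_length = length
        return codes

    # Example: codeword lengths taken from some Huffman tree
    print(canonical_codes({"a": 3, "b": 3, "c": 2, "d": 1}))
    # -> {'d': '0', 'c': '10', 'a': '110', 'b': '111'}

Note that the decompressor only needs the list of lengths to rebuild exactly this set of codewords, which is what makes the canonical form so compact to describe.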


Something about trees

Every prefix code can be thought of as analogous to a tree.

Every leaf on the tree is tagged with a plaintext letter or other symbol.

The length, in bits, of each codeword is sometimes called the path length of the symbol.

Dynamic Huffman: storing the frequency table

(This is a continuation of "getting the compressor and the decompressor to use the same codewords", for the common case of dynamic Huffman).

Inside a typical compressed file, there is a header before each block of compressed data. For blocks compressed with dynamic Huffman, that header for that block tells the decompressor the set of codewords used in that block, and the mapping from codewords to plaintext symbols.

Typically the compressor and the decompressor have some hard-coded alphabet of S plaintext symbols they need to handle. Often S=258 symbols, the 256 possible bytes and a few other special symbols such as the end-of-block symbol. Each symbol in the alphabet is represented, in the compressed data, by some codeword with a length in bits from 1 bit to some MAX_CODEWORD_LENGTH bits.

Many simple Huffman file formats simply store that list of 258 bitlengths as a block of 258 bytes. Each position represents some plaintext symbol -- even ones that are never used in the plaintext. Typically position 0 represents the 0 byte, position 32 (0x20) represents the space character, position 101 (0x65) represents the letter 'e', position 255 represents the 255 byte, position 256 represents some special command, and position 257 represents the end-of-block indicator.

The number at each position represents the length, in bits, of the codeword corresponding to that plaintext symbol. (Since we use canonical Huffman codes, the length of the codeword corresponding to each plaintext symbol is all the decompressor needs to regenerate the actual set of codewords used in this block.) Usually the special "length" of "0 bits" is reserved to indicate letters that never occur in this block, and so should not be included in the Huffman tree.

That list of S numbers is often called the "frequency table". Sometimes that list of S numbers is called the "Kraft vector", because it must satisfy Kraft's inequality.
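As a quick sketch (the function name is ours), a Kraft vector can only describe a real prefix code if the sum of 2^(-length) over the symbols that actually occur is at most 1:

    def satisfies_kraft(bit_lengths) -> bool:
        """Check Kraft's inequality for a list of codeword lengths.
        A length of 0 means "symbol never occurs" and is skipped."""
        return sum(2.0 ** -length for length in bit_lengths if length > 0) <= 1.0

    print(satisfies_kraft([1, 2, 3, 3, 0, 0]))   # True:  1/2 + 1/4 + 1/8 + 1/8 = 1
    print(satisfies_kraft([1, 1, 2]))            # False: 1/2 + 1/2 + 1/4 > 1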

That block of 258 bytes in the header represents a lot of overhead, especially if the compressed file includes many separate blocks, each with its own header. Some Huffman compressors, such as English-word-oriented Huffman compressors, use vastly more symbols (one for each word in the English dictionary it uses), requiring a much longer Kraft vector. There are a variety of ways to get smaller compressed file sizes by designing the file format in a way that reduces the size of this block, including:

* storing only the codeword lengths of a canonical Huffman code (as above), rather than the codewords themselves or the raw symbol frequencies;
* run-length encoding the long runs of zero lengths produced by symbols that never occur in the block;
* compressing the list of code lengths itself with a second, smaller Huffman code, as DEFLATE does.

Implementation tips and tricks

People have spent a long time figuring out ways to make Huffman compression and Huffman decompression really, really fast. These are just a few of the techniques used to make decompressors (or compressors, or both) that give bit-for-bit the same output as a non-optimized program, while running faster.

Software to implement a Huffman decompressor typically has 2 parts:

  1. a setup part that reads the description of the code (for example, the list of codeword lengths) from the block header and builds the decoding table or tree;
  2. an inner loop that repeatedly matches the next codeword in the compressed stream and emits the corresponding plaintext symbol.

The inner loop of a Huffman decompressor finds the codeword that matches the next part of the compressed text, and emits the corresponding plaintext symbol. It is possible for a codeword to be more than 32 bits long, even when compressing 7-bit ASCII text, and bit-aligning and comparing such long words can be a bit of a hassle. There are three approaches to avoiding such long comparisons:

(a) A length-limited Huffman code. Often "natural" files never use such long codewords, only files specifically constructed to be worst cases. Some compressors choose to deliberately limit the length of the Huffman codewords to some maximum length. This makes such worst-case files, after compression, slightly longer than they would have been without the limit; it has no effect on the length of compressed "natural" files. Most file formats intended to be decompressed in hardware (video decompressed with an FPGA, etc.) limit the length in order to reduce the cost of that hardware.

(b) A short, fast table that handles short codewords and has a "long codeword" flag, plus other table(s) that handle longer codewords. Since the most frequent Huffman codes are short, the decoder can usually decode a codeword quickly with one hit to the short, fast table that fits entirely in cache; occasionally the decoder hits a flagged entry and takes a little longer to look up the long code in the other table(s). There is a time/space tradeoff in exactly how short to make the first-level table: a decoder with a smaller first-level table can still correctly decode the same files as a decoder with a larger first-level table; the smaller-table decoder uses less RAM, while the larger-table decoder decodes more words in the first hit and therefore runs faster (unless the table gets too big to fit in the cache).

(c) Rather than comparing directly to codeword bitstrings, first do a zero-count (to the next 1 bit, or to the end of file, whichever comes first) finding Z zeros, discard the 1 bit, and then read the next few bits as a value; the pair (zero count, value) is then used to look up the plaintext symbol, as described below.

Prefix code decompressor implementations can be distinguished by how many bits they read from the compressed stream at one time, and how they use those bits to decide what to do next. People have implemented "fast" decompressors a variety of ways, including:

Conceptually the simplest one is the "one big lookup table" approach: read the next MAX_CODEWORD_LENGTH bits of compressed text into a buffer, use that buffer as an index into a table with 2^MAX_CODEWORD_LENGTH entries, and read out of that entry both the plaintext symbol and the true length of its codeword; then discard only that many bits before repeating.

This is probably the fastest technique if your maximum codelength is relatively short.[12]
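Here is a minimal Python sketch of the "one big lookup table" approach (the helper names, the tiny example code, and the 4-bit maximum codeword length are made up for illustration; a real decompressor would build the table from the code lengths read from the block header):

    MAX_CODEWORD_LENGTH = 4   # assumed, for illustration only

    def build_big_table(codes: dict) -> list:
        """codes maps each symbol to its codeword bitstring.
        Returns a table of 2**MAX_CODEWORD_LENGTH entries; every entry whose
        leading bits match a codeword stores (symbol, codeword_length)."""
        table = [None] * (1 << MAX_CODEWORD_LENGTH)
        for symbol, word in codes.items():
            pad = MAX_CODEWORD_LENGTH - len(word)
            start = int(word, 2) << pad
            for tail in range(1 << pad):             # all possible "don't care" trailing bits
                table[start | tail] = (symbol, len(word))
        return table

    def decode(bits: str, table: list) -> str:
        output, pos = [], 0
        bits += "0" * MAX_CODEWORD_LENGTH            # padding so the last lookup never runs short
        while pos < len(bits) - MAX_CODEWORD_LENGTH:
            window = bits[pos:pos + MAX_CODEWORD_LENGTH]
            symbol, length = table[int(window, 2)]
            output.append(symbol)
            pos += length                            # consume only the bits of the matched codeword
        return "".join(output)

    table = build_big_table({"e": "0", "t": "10", "a": "110", "o": "111"})
    print(decode("010110111", table))                # -> "etao"

Because every codeword shorter than the maximum fills several table entries, the table is large but each symbol is decoded with a single lookup.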

One fast inner loop for a fast prefix code decoder reads the compressed data a byte at a time (rather than the bit-by-bit for a "simple" prefix decoder). The "current tree position" and the compressed byte are used as an index into a large table that holds the resulting "new tree position" and up to 8 decompressed plaintext symbols.[13]

Any codeword can be represented by two numbers, (Z count of initial zeros, F value of trailing bits). For example, the codeword "000101" can be represented by the two values (3,5). The codeword "11" can be represented by the two values (0,3).
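A tiny Python sketch of that representation (codewords are handled as '0'/'1' strings here, purely for illustration):

    def to_zero_count_form(codeword: str):
        """Split a codeword into (Z, F): Z leading zeros, then the remaining bits as an integer."""
        z = len(codeword) - len(codeword.lstrip("0"))
        rest = codeword[z:]
        return (z, int(rest, 2) if rest else 0)

    print(to_zero_count_form("000101"))   # -> (3, 5)
    print(to_zero_count_form("11"))       # -> (0, 3)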

One quirk of the canonical Huffman code is that, if all our plaintext symbols fit in an n-bit fixed-length code, then Z can be represented in n bits, and F can also be stored in n bits. (This is *not* true for all prefix codes.) (When the decompressor does the actual zero-count in the compressed text, consecutive all-zero codewords can lead to an indefinite number of consecutive zeros -- the point here is that the number of initial zeros in any one particular codeword can be represented in n bits.) This allows several different time/space tradeoffs that all give exactly the same output file. Perhaps the fastest inner loop is something like the following. For each symbol:

  1. count the leading zeros, Z, up to the next 1 bit;
  2. read the next n bits, starting with that 1 bit, as a value F;
  3. use (Z, F) as an index into a lookup table that gives the plaintext symbol and the true length of the codeword;
  4. emit the symbol and discard only the bits of that codeword, keeping any extra bits read as the start of the next codeword.

Repeat until we hit the end of file.

This ignores the special-case handling of the all-zeros codeword. Many Huffman coders, such as DEFLATE, design the Huffman table with a "fake" entry so that the compressed text never uses the all-zeros codeword; the decompressor then never has to deal with that complication. It also ignores special handling of codewords that map to some special command, rather than mapping to a literal plaintext symbol.

This approach requires a lookup table of size 2^(2n-1). For a typical Huffman code with n = 9 bits, able to handle all 2^8 bytes and a few other special symbols, a canonical Huffman code with this kind of implementation requires a lookup table of 2^17 = 131,072 entries. If a plaintext symbol is mapped to a codelength that is less than the maximum codelength, then multiple entries in that table map to the same symbol. Other implementations use far less RAM, and may even be faster, since more of their working memory fits in cache.

In general, for alphabet_size plaintext symbols, the worst-case maximum "compressed" Huffman code length is alphabet_size - 1 bits long.[12]

For 257 plaintext symbols, the worst (and unlikely) case is that two symbols are each "compressed" to a codeword 256 bits long, so the maximum zero count (except when we have 1 or more consecutive all-zero codewords) is 255.

Handling the end-of-text is often awkward, since the compressed string of bits is usually stored into some integer number of bytes, even though the compressed string of bits is usually *not* a multiple of 8 bits long. Most Huffman decompressors use one or the other of the following techniques for handling end-of-text:

* store the length of the plaintext (the number of symbols or bytes) in the header, and have the decompressor stop after emitting exactly that many symbols;
* reserve a special end-of-text / end-of-block symbol (such as the 257th symbol mentioned earlier), and have the decompressor stop when it decodes that symbol, ignoring any padding bits that follow.

improving on Huffman

The original Huffman algorithm is optimal for symbol-by-symbol coding of a stream of unrelated symbols with a known probability distribution.

However, several algorithms can be used to compress some kinds of files smaller than the Huffman algorithm. These algorithms exploit one or another of the caveats in the Huffman optimality proof:

* "symbol-by-symbol": coding several symbols together, or using arithmetic coding, which in effect spends a fractional number of bits per symbol, can beat coding each symbol with a whole number of bits;
* "unrelated symbols": most real files have correlations between nearby symbols, which dictionary coders (such as LZ77) and higher-order context models (such as PPM) can exploit;
* "known probability distribution": in most real files the statistics drift from one part of the file to another, which adaptive models can track.

This is reminiscent of the way many algorithms compress files smaller than the "optimum" Fixed Bit Length Encoding codeword length[14] by exploiting the caveat in the Fixed Bit Length Encoding optimality proof:

Fixed Bit Length Encoding is optimal for symbol-by-symbol coding into fixed-length codewords of a stream of symbols with a known finite alphabet size A. The optimum codeword length for Fixed Bit Length Encoding is ceil( log2(A) ) bits. For example, an alphabet of A = 27 symbols needs ceil( log2(27) ) = 5 bits per symbol.

prefix codes

Huffman coding, Shannon-Fano coding, and Polar tree coding all map some relatively small known alphabet of symbols to prefix codes. [15] [16]

The "universal codes" map an infinitely large known alphabet -- often the set of all positive integers, or the set of all integers -- to prefix codes.

universal codes

There are many "universal codes", a kind of prefix code.

Any particular prefix code (including any universal code) can be seen as a kind of Huffman code for a particular corresponding frequency distribution.

Typical audio data, and typical audio data prediction residues, have an approximately Laplacian distribution. Prediction error coding for image compression and geometry compression typically produces an approximately Laplacian distribution.[13] Encoding a Laplacian distribution with a general Huffman code gives a prefix code equivalent to using a Golomb code. The Golomb code is much simpler and faster to compress and decompress than a general Huffman code. If the audio data is "close enough" to a Laplacian distribution, a Golomb code may give even better net compression than a general Huffman code, because a Golomb decoder requires only the overhead of a single parameter, while a general Huffman decoder requires the overhead of a full Huffman table of probabilities.

Rice coding is a subset of Golomb coding that is much faster to compress and decompress. Because Rice codes closely approximate Huffman codes for a Laplacian distribution, they compress almost as well as general Huffman codes.
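Here is a minimal Python sketch of Rice coding (the parameter k and the helper names are ours; real audio codecs also map signed prediction residues to unsigned values before this step):

    def rice_encode(value: int, k: int) -> str:
        """Rice code with parameter k (a Golomb code with divisor 2**k):
        a unary quotient in zeros, a '1' terminator, then k remainder bits."""
        q, r = value >> k, value & ((1 << k) - 1)
        rem = format(r, "0{}b".format(k)) if k > 0 else ""
        return "0" * q + "1" + rem

    def rice_decode(bits: str, k: int) -> int:
        q = bits.index("1")                           # count the leading zeros (the quotient)
        r = int(bits[q + 1:q + 1 + k], 2) if k > 0 else 0   # the next k bits are the remainder
        return (q << k) | r

    print(rice_encode(9, k=2))               # -> "00101"  (quotient 2, remainder 1)
    print(rice_decode("00101", k=2))         # -> 9

Because the "divide by 2**k" step is just a shift, encoding and decoding need no multiplications or table lookups, which is the source of the speed advantage mentioned above.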

Todo: add a few words about Fibonacci codes, a kind of "universal code".[17]

audio compression

Fourier transform

perceptual coding

"The Gerzon/Craven theorems show that the level of the optimally de-correlated signal is given by the average of the original signal spectrum when plotted as dB versus linear frequency. ... this dB average can have significantly less power than the original signal, hence the reduction in data rate. ... In practice, the degree to which any section of music data can be ‘whitened’ depends on the content and the complexity allowed in the prediction filter." [18]

Typical audio data, and typical audio data prediction residues (after decorrelating), have approximately Laplacian distribution, and so they are often decoded with a Golomb decoder or Rice decoder rather than a general Huffman decoder.

Some audio decoders can be told to use Rice decoders during passages of music that compress well with them, and later told to switch to some other decoder during passages (such as pure sine-wave tones, or white noise in which all values are equally probable) that don't compress well with Rice decoders.

In some audio applications, in particular 2-way telephone-like communication, end-to-end latency is critical. The recommended maximum time delay for telephone service is 150 milliseconds. This rules out many popular compression algorithms. We discuss latency in more detail in the Data_Compression/Evaluating_Compression_Effectiveness#latency section of this book.

arithmetic coding

Context-adaptive binary arithmetic coding

...

combination

see Data Compression/Multiple transformations


References

  1. Wikipedia: entropy (information theory). Retrieved 2011-07-29.
  2. "Constructing Word-Based Text Compression Algorithms (1992)" by Nigel Horspool and Gordon Cormack.
  3. Tomáš Kuthan and Jan Lánský "Genetic algorithms in syllable based text compression". 2007. "The number of unique syllables in one language is measured in tens of thousands ... HuffSyllable is a statistical syllable-based text compression method. This technique was inspired by the principles of HuffWord algorithm."
  4. "Hacking Data Compression" by Andy McFadden
  5. "Canonical Huffman" by Arturo San Emeterio Campos. 1999.
  6. Canonical Huffman Codes by Michael Schindler.
  7. "Huffman Code Discussion and Implementation", including Canonical Huffman Codes. by Michael Dipperstein.
  8. "Decoding of Canonical Huffman Codes with Look-Up Tables" by Yakov Nekritch. 2000.
  9. "Space- and time-efficient decoding with canonical Huffman trees" by Shmuel T. Klein 1997.
  10. Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes. New York: Van Nostrand Reinhold, 1994. ISBN 9780442018634. pp. 33-35.
  11. "Huffman encoding [DRAFT"] apparently by Lara Hopley and Jo van Schalkwyk (?). mentions canonical Huffman: "We will allocate '1's to shorter codes." mentions "even a canonical Huffman tree needn't be unique. We still have some scope for playing!" (describing situations where, even though we assign *different* lengths to letters and therefore different canonical Huffman trees, we end up with the same number of compressed bits)
  12. Michael Schindler. "Practical Huffman coding".
  13. Renato Pajarola. "Fast Prefix Code Processing". Proceedings IEEE ITCC Conference. 2003.
  14. NES development: Fixed Bit Length Encoding
  15. http://en.wikipedia.org/wiki/User:C-processor/Polar_Tree
  16. Andrew Polar. "Non-Huffman binary tree". July, 2010.
  17. Shmuel T. Klein, Miri Kopel Ben-Nissan. "On the Usefulness of Fibonacci Compression Codes". 2004.
  18. "MLP Lossless Compression" by JR Stuart, PG Craven, MA Gerzon, MJ Law, RJ Wilson 1999. section 4.2 "Prediction".