Posted by dcoetzee on July 8th, 2009
Abstract: The embedded zerotree wavelet algorithm (EZW) is a simple, yet remarkably effective, image compression algorithm, having the property that the bits in the bit stream are generated in order of importance, yielding a fully embedded code. The embedded code represents a sequence of binary decisions that distinguish an image from the “null” image. Using an embedded coding algorithm, an encoder can terminate the encoding at any point thereby allowing a target rate or target distortion metric to be met exactly. Also, given a bit stream, the decoder can cease decoding at any point in the bit stream and still produce exactly the same image that would have been encoded at the bit rate corresponding to the truncated bit stream. In addition to producing a fully embedded bit stream, EZW consistently produces compression results that are competitive with virtually all known compression algorithms on standard test images. Yet this performance is achieved with a technique that requires absolutely no training, no pre-stored tables or codebooks, and requires no prior knowledge of the image source.
The EZW algorithm is based on four key concepts: 1) a discrete wavelet transform or hierarchical subband decomposition, 2) prediction of the absence of significant information across scales by exploiting the self-similarity inherent in images, 3) entropy-coded successive-approximation quantization, and 4) universal lossless data compression which is achieved via adaptive arithmetic coding.
Discussion: This 1993 paper described EZW, one of the earliest successful image compression algorithms using the discrete wavelet transform, and was the first published fully embedded code for images. With such a code, an image file can be truncated at any point and the result will be a valid approximation to the image, enabling a fine-grained tradeoff between file size and quality. It is also, unusually for its time, a compression scheme that requires no training on an existing database of images, instead adapting to the statistics of the image at hand.
If you don’t have a background in signal processing, this one requires quite a bit of preparation, but I’ll summarize the necessary concepts here.
The first concept, adaptive arithmetic coding, is an entropy coding strategy used in a huge variety of both lossless and lossy coding schemes. To simplify discussion, let’s consider the simple scenario where we’re decoding a bilevel (black and white) image file pixel-by-pixel in scan order, and we have access to all preceding pixels in scan order. Suppose the previous pixel is white. How can we use this information to predict the color of the next pixel? One simple way is to keep a count of how many times so far a white pixel has been followed by a black pixel, and how many times it’s been followed by a white pixel; these frequencies are used to estimate the probabilities that the next pixel will be black or white. This is the first part of the entropy coding scheme, called the model; it is adaptive because it depends on what other pixels have been decoded so far.
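The counting scheme above can be sketched in a few lines of Python. This is only an illustration of the model half of the coder (the names and the start-at-one counts are my choices, not from the paper); the arithmetic coder that would consume these probabilities is omitted.

```python
# A minimal sketch of an adaptive probability model for a bilevel image:
# estimate P(next pixel) from counts of previous pixel transitions.
# Counts start at 1 so no outcome is ever assigned zero probability.

class AdaptiveModel:
    def __init__(self):
        # counts[prev][cur]: times pixel value `prev` (0 = black, 1 = white)
        # has been followed by pixel value `cur`
        self.counts = [[1, 1], [1, 1]]

    def prob_white(self, prev):
        """Estimated probability that the pixel after `prev` is white."""
        row = self.counts[prev]
        return row[1] / (row[0] + row[1])

    def update(self, prev, cur):
        self.counts[prev][cur] += 1

model = AdaptiveModel()
pixels = [1, 1, 1, 0, 1, 1, 1, 1]        # a mostly-white scan line
for prev, cur in zip(pixels, pixels[1:]):
    p = model.prob_white(prev)           # estimate *before* seeing cur
    model.update(prev, cur)              # then adapt to what was seen
print(model.prob_white(1))               # white-after-white is now likely
```

As more of the image is decoded, the estimates track its actual statistics, which is exactly what lets the scheme work without any pre-stored tables.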
The second half is the arithmetic coding. Like all entropy coding schemes, it encodes a sequence of symbols, using fewer bits to represent a symbol that is expected to occur with high probability, and more bits to represent a symbol that is expected to occur with low probability. The details of how this coding is done are not important here; the important thing is that it is near-optimal, in the sense that if a symbol is expected to appear with probability p, it can be encoded in close to −log2 p bits, even if this value is not an integer, which is the lower bound of Shannon’s source coding theorem. For example: a symbol expected to appear with probability 0.5 can be encoded in −log2 0.5 = 1 bit, a symbol expected to appear with probability 0.95 can be encoded in −log2 0.95, or about 0.074 bits, and a symbol expected to occur with probability 0.001 requires about 10 bits. Compared to Huffman coding, arithmetic coding is particularly useful in situations where the number of symbols is small, or where the probabilities are close to 1.
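The −log2 p figures above are easy to verify directly:

```python
from math import log2

def bits_needed(p):
    """Ideal code length, in bits, for a symbol of probability p."""
    return -log2(p)

for p in (0.5, 0.95, 0.001):
    print(f"p = {p}: {bits_needed(p):.3f} bits")
# p = 0.5   -> 1.000 bits
# p = 0.95  -> 0.074 bits
# p = 0.001 -> 9.966 bits
```

Huffman coding must round each code length up to a whole number of bits, so a symbol with p = 0.95 still costs 1 bit per occurrence; arithmetic coding gets the fractional 0.074 bits, which is where its advantage near p = 1 comes from.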
The discrete wavelet transform (DWT) is, like the discrete cosine transform (DCT) used by JPEG files, a way of representing an image in the frequency domain instead of the spatial domain. It is a reversible operation that converts an array of pixels into an equal-sized array of coefficients each representing how much energy the image has in a certain range of frequencies, in a certain part of the image. The reason for doing this is that natural images like photographs tend to have most of their energy at low frequencies due to their smooth, continuously-varying regions, with high-frequency information like edges and fine details being limited to only a small portion of the image. The result is that after undergoing such a transformation, most of the coefficients will be close to zero, and can be truncated to zero with little visual impact on the image. The main difference between the DCT and the DWT is the DCT is applied to a fixed-size spatial region: in JPEG, it’s always applied to an 8×8 block of pixels. The DWT is applied in a multiresolution fashion: it uses a high-pass filter to isolate local changes in brightness (details) in the image between adjacent pixels, then low-pass filters and downscales the image and repeats this process. As a consequence its coefficients vary not only in the frequency range they cover but also in the size of the spatial region they cover, allowing them to represent both very fine details and larger-scale details compactly. Since a range of frequencies is called a subband, this is also called a hierarchical subband decomposition.
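To make the filter-and-downscale idea concrete, here is one level of the simplest possible DWT, the Haar transform, on a tiny image. This is my own illustrative sketch: the EZW paper uses longer filters, but the structure (low-pass averages in one corner, high-pass differences elsewhere) is the same.

```python
# One level of a 2-D Haar wavelet transform: pairwise averages act as the
# low-pass filter (a downscaled image), pairwise differences as the
# high-pass filter (local detail).

def haar_1d(row):
    low = [(row[i] + row[i + 1]) / 2 for i in range(0, len(row), 2)]
    high = [(row[i] - row[i + 1]) / 2 for i in range(0, len(row), 2)]
    return low + high

def haar_2d(image):
    # Transform every row, then every column of the result.
    rows = [haar_1d(r) for r in image]
    cols = [haar_1d(col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]

# A smooth 4x4 region: two flat vertical bands of brightness 10 and 12.
image = [[10, 10, 12, 12],
         [10, 10, 12, 12],
         [10, 10, 12, 12],
         [10, 10, 12, 12]]
coeffs = haar_2d(image)
for row in coeffs:
    print(row)
```

Because the input is smooth, every high-frequency coefficient comes out exactly zero and all the energy lands in the 2×2 low-frequency corner; on a real photograph the high-frequency coefficients are merely *close* to zero, which is what the significance threshold exploits. Recursing on that low-frequency corner gives the multiresolution hierarchy described above.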
The array of DWT coefficients can be visualized as a new image, which at low bit rates will be mostly black (close to zero), but with white in some places, representing the significant coefficients. See the example to the right (Alessio Damato / CC-BY-SA). The main challenge in representing such an image is not so much representing the values of the significant coefficients, but their position, using a data structure called a significance map.
This is where the zerotrees come in. Notice how the three images in the upper-left closely resemble the three large images. In particular, where the smaller images are black, the larger images also tend to be black. Zerotrees take advantage of this by encoding the smaller images first, and assigning one of four values to each coefficient: positive significant coefficient, negative significant coefficient, zerotree root, and isolated zero. A zerotree root is an insignificant (near zero) coefficient for which the corresponding coefficients in all larger images are also insignificant. An isolated zero is an insignificant coefficient that does not have this property. Once a zerotree root has been encoded, none of its descendants need to be encoded, since they’re already known to be insignificant. Additionally, since isolated zeros are rare, this symbol is assigned a small probability and the use of arithmetic coding ensures that not too much space is wasted accounting for this possibility. The parameter that determines how much space our zerotree occupies is the threshold between significant and insignificant coefficients.
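The four-way classification of a coefficient can be sketched as a single function. Here `descendants` stands in for the coefficient's children in the finer subbands, which in a real implementation come from the tree structure of the decomposition; the symbol names follow the four values described above.

```python
# Classify one DWT coefficient against a significance threshold, given the
# magnitudes of all its descendants in the finer subbands. `descendants`
# is assumed to be supplied by the surrounding tree traversal.

def classify(coeff, descendants, threshold):
    if abs(coeff) >= threshold:
        return "POS" if coeff > 0 else "NEG"     # significant coefficient
    if all(abs(d) < threshold for d in descendants):
        return "ZTR"                             # zerotree root: whole
                                                 # subtree is insignificant
    return "IZ"                                  # isolated zero

print(classify(40, [3, -2, 5], 32))    # POS
print(classify(-45, [3, -2, 5], 32))   # NEG
print(classify(6, [1, 0, 2], 32))      # ZTR
print(classify(6, [1, 50, 2], 32))     # IZ
```

When the traversal hits a ZTR, it skips the entire subtree, which is where the big savings come from; the rare IZ symbol is cheap precisely because the adaptive arithmetic coder learns to assign it a small probability.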
This is already a great way of compressing images, but how do we turn this into a fully-embedded code that can be truncated at any point? The key to this lies in a successive-approximation scheme. In this scheme, initially all coefficients are assumed to be insignificant, so the initial zerotree is trivial (one zerotree symbol). We then lower the threshold (divide it by 2) and compute and output a new zerotree, and repeat this indefinitely. Any significant coefficients already detected by previous zerotrees are assumed to be zero, so that they don’t get encoded more than once. Simultaneously, each time we produce a new zerotree we also add one bit of precision to the magnitudes of all the already-known significant coefficients. We don’t have to finish writing out any particular zerotree, since each one prioritizes writing out coarse details (smaller coefficient images) before finer details (larger coefficient images); any coefficients left unencoded are assumed to be insignificant.
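A rough sketch of this successive-approximation loop, on a flat list of coefficients (the zerotree scan order and the arithmetic coder are omitted, and the details of Shapiro's dominant/subordinate passes are simplified): each pass halves the threshold, flags newly significant coefficients at the midpoint of their uncertainty interval, and refines every already-known coefficient by one bit.

```python
# Successive-approximation quantization: halve the threshold each pass,
# adding one bit of precision to each known coefficient per pass.

def refine(value, center, width):
    # One refinement bit: which half of the uncertainty interval holds |value|?
    if abs(value) >= center:
        center += width / 4
    else:
        center -= width / 4
    return center, width / 2

def ezw_passes(coeffs, num_passes):
    t = 1
    while t * 2 <= max(abs(c) for c in coeffs):
        t *= 2                              # initial threshold: largest 2^k <= max|c|
    known = {}                              # index -> (center, interval width)
    for _ in range(num_passes):
        for i in list(known):               # refine previously found coefficients
            known[i] = refine(coeffs[i], *known[i])
        for i, c in enumerate(coeffs):      # flag newly significant coefficients
            if i not in known and abs(c) >= t:
                known[i] = (1.5 * t, t)     # |c| in [t, 2t): midpoint guess
        t /= 2
    # In real EZW the sign comes from the POS/NEG symbol; we peek for brevity.
    return {i: (c if coeffs[i] >= 0 else -c) for i, (c, _) in known.items()}

print(ezw_passes([63, -34, 10, 2], 3))   # {0: 60.0, 1: -36.0, 2: 12.0}
```

After three passes the reconstructions 60, −36, and 12 already approximate 63, −34, and 10, and coefficient 2 hasn't been mentioned at all; stopping here (or mid-pass) is exactly the truncation that makes the code embedded.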
What’s happened to EZW today? Its raw compression performance and speed have long since been exceeded by other systems, such as SPIHT (Said, Amir and Pearlman, William A. A new fast and efficient image codec based upon set partitioning in hierarchical trees. June 1996. PDF), which benefits from a more complex partitioning scheme than the simple splitting-into-four shown here, and other improvements. But its ideas remain the foundation of all the best modern image compression systems. These ideas were incorporated into the JPEG 2000 image format, which uses a very similar wavelet-based fully-embedded code. This eliminates JPEG’s blocking artifacts due to its multiresolution nature, permits a fine-grained quality-for-size tradeoff, gives significantly better quality at low bit rates, and allows lossy and lossless encoding in a single framework, among other advantages. JPEG 2000 hasn’t seen wide adoption in consumer applications – one can only speculate as to why – but these techniques remain a valuable way to look at image compression.
The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.