Papers in Computer Science

Discussion of computer science publications

Archive for the ‘Information theory and compression’ Category

Embedded image coding using zerotrees of wavelet coefficients

Posted by dcoetzee on July 8, 2009

Citation: Shapiro, J.M., “Embedded image coding using zerotrees of wavelet coefficients,” Signal Processing, IEEE Transactions on , vol.41, no.12, pp.3445-3462, Dec 1993. (IEEE) (PDF)

Abstract: The embedded zerotree wavelet algorithm (EZW) is a simple, yet remarkably effective, image compression algorithm, having the property that the bits in the bit stream are generated in order of importance, yielding a fully embedded code. The embedded code represents a sequence of binary decisions that distinguish an image from the “null” image. Using an embedded coding algorithm, an encoder can terminate the encoding at any point thereby allowing a target rate or target distortion metric to be met exactly. Also, given a bit stream, the decoder can cease decoding at any point in the bit stream and still produce exactly the same image that would have been encoded at the bit rate corresponding to the truncated bit stream. In addition to producing a fully embedded bit stream, EZW consistently produces compression results that are competitive with virtually all known compression algorithms on standard test images. Yet this performance is achieved with a technique that requires absolutely no training, no pre-stored tables or codebooks, and requires no prior knowledge of the image source.

The EZW algorithm is based on four key concepts: 1) a discrete wavelet transform or hierarchical subband decomposition, 2) prediction of the absence of significant information across scales by exploiting the self-similarity inherent in images, 3) entropy-coded successive-approximation quantization, and 4) universal lossless data compression which is achieved via adaptive arithmetic coding.

Discussion: This 1993 paper described EZW, one of the earliest successful image compression algorithms using the discrete wavelet transform, and was the first published fully embedded code for images. With such a code, an image file can be truncated at any point and the result will be a valid approximation to the image, enabling a fine-grained tradeoff between file size and quality. It is also, unusually for its time, a compression scheme that requires no training on an existing database of images, instead adapting to the statistics of the image at hand.

If you don’t have a background in signal processing there’s quite a bit of background required for this one, but I’ll summarize the necessary concepts here.

The first concept, adaptive arithmetic coding, is an entropy coding strategy used in a huge variety of both lossless and lossy coding schemes. To simplify discussion, let’s consider the simple scenario where we’re decoding a bilevel (black and white) image file pixel-by-pixel in scan order, and we have access to all preceding pixels in scan order. Suppose the previous pixel is white. How can we use this information to predict the color of the next pixel? One simple way is to keep a count of how many times so far a white pixel has been followed by a black pixel, and how many times it’s been followed by a white pixel; these frequencies are used to estimate the probabilities that the next pixel will be black, or white. This is the first part of the entropy coding scheme, called the model; it is adaptive because it depends on what other pixels have been decoded so far.

The second half is the arithmetic coding. Like all entropy coding schemes, it encodes a sequence of symbols, using less bits to represent a symbol that is expected to occur with high probability, and more bits to represent a symbol that is expected to occur with low probability. The details of how this coding is done are not important here; the important thing is that it is near-optimal, in the sense that if a symbol is expected to appear with probability p, it can be encoded in close to −log2 p bits, even if this value is not an integer, which is the lower bound of Shannon’s source coding theorem. For example: a symbol expected to appear with probability 0.5 can be encoded in −log20.5 = 1 bit, a symbol expected to appear with probability 0.95 can be encoded in −log20.95 or about 0.074 bits, and a symbol expected to occur with probability 0.001 requires about 10 bits. Compared to Huffman coding, arithmetic coding is particularly useful in situations where the number of symbols is small, or where the probabilities are close to 1.

The discrete wavelet transform (DWT) is, like the discrete cosine transform (DCT) used by JPEG files, a way of representing an image in the frequency domain instead of the spatial domain. It is a reversible operation that converts an array of pixels into an equal-sized array of coefficients each representing how much energy the image has in a certain range of frequencies, in a certain part of the image. The reason for doing this is that natural images like photographs tend to have most of their energy at low frequencies due to their smooth, continuously-varying regions, with high-frequency information like edges and fine details being limited to only a small portion of the image. The result is that after undergoing such a transformation, most of the coefficients will be close to zero, and can be truncated to zero with little visual impact on the image. The main difference between the DCT and the DWT is the DCT is applied to a fixed-size spatial region: in JPEG, it’s always applied to an 8×8 block of pixels. The DWT is applied in a multiresolution fashion: it uses a high-pass filter to isolate local changes in brightness (details) in the image between adjacent pixels, then low-pass filters and downscales the image and repeats this process. As a consequence its coefficients vary not only in the frequency range they cover but also in the size of the spatial region they cover, allowing them to represent both very fine details and larger-scale details compactly. Since a range of frequencies is called a subband, this is also called a hierarchical subband decomposition.

Jpeg2000_2-level_wavelet_transform-lichtensteinThe array of DWT coefficients can be visualized as a new image, which at low bit rates will be mostly black (close to zero), but with white in some places, representing the significant coefficients. See the example to the right (Alessio Damato / CC-BY-SA). The main challenge in representing such an image is not so much representing the values of the significant coefficients,  but their position, using a data structure called a significance map.

This is where the zerotrees come in. Notice how the three images in the upper-left closely resemble the three large images. In particular, where the smaller images are black, the larger images also tend to be black. Zerotrees take advantage of this by encoding the smaller images first, and assigning one of four values to each coefficient: positive significant coefficient, negative significant coefficient, zerotree root, and isolated zero. A zerotree root is an insignificant (near zero) coefficient for which the corresponding coefficients in all larger images are also insignificant. An isolated zero is an insignificant coefficient that does not have this property. Once a zerotree root has been encoded, none of its descendants need to be encoded, since they’re already known to be insignificant. Additionally, since isolated zeros are rare, this symbol is assigned a small probability and the use of arithmetic coding ensures that not too much space is wasted accounting for this possibility. The parameter that determines how much space our zerotree occupies is the threshold between significant and insignificant coefficients.

This is already a great way of compressing images, but how do we turn this into a fully-embedded code that can be truncated at any point? The key to this lies in a successive-approximation scheme. In this scheme, initially all coefficients are assumed to be insignificant, so the initial zerotree is trivial (one zerotree symbol). We then lower the threshold (divide it by 2) and compute and output a new zerotree, and repeat this indefinitely. Any significant coefficients already detected by previous zerotrees are assumed to be zero, so that they don’t get encoded more than once. Simultaneously, each time we produce a new zerotree we also add one bit of precision to the magnitudes of all the already-known significant coefficients. We don’t have to finish writing out any particular zerotree, since each one prioritizes writing out coarse details (smaller coefficient images) before finer details (larger coefficient images); any coefficients left unencoded are assumed to be insignificant.

What’s happened to EZW today? Its raw compression performance and speed have long been since been exceeded by other systems, such as SPIHT (Said, Amir and Pearlman, William A. A new fast and efficient image codec based upon set partitioning in hierarchical trees. June 1996. PDF),which benefits from a more complex partitioning scheme then the simple splitting-into-four shown here, and other improvements. But its ideas remain the foundation of all the best modern image compression systems. These ideas were incorporated into the JPEG 2000 image format, which uses a very similar wavelet-based fully-embedded code. This eliminates JPEG’s blocking artifacts due to its multiresolution nature, permits a fine-grained quality-for-size tradeoff, gives significantly better quality at low bit rates, and allows lossy and lossless encoding in a single framework, among other advantages. JPEG 2000 hasn’t seen wide adoption in consumer applications – one can only speculate as to why – but these techniques remain a valuable way to look at image compression.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Information theory and compression, Signal processing | Leave a Comment »

Content-Based Image Retrieval at the End of the Early Years

Posted by dcoetzee on April 21, 2009

Citation: Smeulders, A.W.M, Worring, M, Santini, S, Gupta, A, Jain, R. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 22, no 12, pp.1349-1380. Dec 2000. (PDF)

Abstract: The paper presents a review of 200 references in content-based image retrieval. The paper starts with discussing the working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap. Subsequent sections discuss computational steps for image retrieval systems. Step one of the review is image processing for retrieval sorted by color, texture, and local geometry. Features for retrieval are discussed next, sorted by: accumulative and global features, salient points, object and shape features, signs, and structural combinations thereof. Similarity of pictures and objects in pictures is reviewed for each of the feature types, in close connection to the types and means of feedback the user of the systems is capable of giving by interaction. We briefly discuss aspects of system engineering: databases, system architecture, and evaluation. In the concluding section, we present our view on: the driving force of the field, the heritage from computer vision, the influence on computer vision, the role of similarity and of interaction, the need for databases, the problem of evaluation, and the role of the semantic gap.

Discussion: This extensive survey reviews and classifies content-based image retrieval systems from 200 early publications and discusses fundamental problems in the area.

If you go to Google Image Search today and type in search keywords, some searches work very well – searches for which the images are clearly labelled in the source webpage with those keywords, such as “Homer Simpson.” But when the person who published the content didn’t think to mark it with the specific term you’re looking for – such as “man looking at a girl” or “paintings of people wearing capes” – it falls apart rapidly. Stock photography sites like Getty Images try to compensate for this by paying people to tag images with every keyword they can possibly think of – needless to say, this is an expensive proposition, and one that still fails to satisfy many useful queries.

We can approach something close to an ideal image search using people: we give them a list of images and explain our query, and they exhaustively examine every image to determine if it satisfies the query. A person does not need to rely on surrounding text; they can look at the image itself, identify objects and their relationships, and incorporate cultural context and application-specific background to resolve ambiguities and identify actions and states. These are the sorts of tasks people are particularly good at.

Content-based image retrieval (CBIR) attempts to respond to image queries as a human would, subject to limitations of feasibility and performance. It relies on image content, using techniques from computer vision to interpret and understand it, while using techniques from information retrieval and databases to rapidly locate images suiting a specific query. Since its inception it has exploded into a field in its own right, with hundreds to thousands of papers each making its own tradeoff between the many variables involved.

A natural-language query to an image search engine rapidly runs into issues of intractability due to our very limited progress on the problem of natural language processing and the need for vast common-sense knowledge about the world to process many queries. For example, consider the rules you might use to determine if an image depicts an “indoor scene” or a “frightened person.” The authors of this paper label this problem the semantic gap. Rather than tackle these problems, most CBIR systems rely on alternate user interfaces, such as search by assocation, search by example, and search by sketching, all of which involve using images to search for other related images. Often this search process is iterative: at each stage, the user clicks an image “more like” their target image, refining the set of candidates. You can see a demonstration of a preliminary version of such an interface at the new Google Labs Similar Images.

There are a number of important variables which strongly influence the design of a CBIR system. These include:

  • The scope of the domain: How are the images constrained? How are the queries constrained? If the images can be taken under controlled conditions, and queries are easy to predict, the problem often admits a highly accurate domain-specific solution. At the other end of the spectrum is general image search of all online images.
  • The type of search activity: Will the user typically be looking for a specific image? Or just browsing? Or do they want to classify or categorize an image collection?
  • What is the user interface?
  • Performance constraints: how quickly must queries be answered?

Since the semantic gap is considered insurmountable, the most fundamental limitation in practical CBIR systems is what the authors of this paper call the sensory gap: “the gap between the object in the world and the information in a (computational) description derived from a recording of that scene.” In other words, it’s figuring out what you’re looking at based only on an image. In narrow domains with images taken under fixed conditions, this can be attainable even using simple schemes. In a general domain like web image search, this is still considered an insurmountable problem.

One of the main reasons the sensory gap is insurmountable in general domains is the issue of segmentation: separating the image into regions, each corresponding to a particular object. Doing this perfectly is called strong segmentation; this is infeasible or impossible in a general setting, particularly in the presence of occlusion (one object in front of another one). An alternative – weak segmentation – only identifies part of the region corresponding to an object, and hopes it is enough to identify the object. A third approach, accumulation, does not rely on segmentation at all: it computes a function across the entire image, which is designed to be insensitive to variations in the part of the image not corresponding to the object. Color indexing, which I described in a previous post, is an example of accumulation, where the color histogram over the image, corrected for lighting conditions, is used to identify the object it depicts.

Generally, object identification relies on a great deal of imprecise ad hoc knowledge about general image characteristics. For example, as a convention in most images the horizon is at about the same position – comparison of object locations to the horizon allows their actual size to be estimated. Much of the same physics of light, reflectance, geometry, and texture that comes into play in computer graphics is useful here in separating the object description from the environment description such as lighting and viewing angle. The rough shapes of objects can be identified with edge detection and invariant transforms to account for changes in viewing angle. Finally, cultural conventions such as the use of certain symbols in diagrams or the existence of perpendicular angles indoors are frequently exploited. In some cases, rather than writing rules for object recognition by hand, a large training set of images labelled with tags is used for statistical learning of image features (as with eigenfaces).

To facilitate efficient lookup, CIBR systems do not perform processing on each image for every query. Instead, they generally map an image to a feature vector, an array of values that succinctly describes the semantic objects in the image and their relationship, whether explicitly or implicitly. These are stored in a database. Lookup can then be done by comparing the feature vector of the query image to the feature vectors of the images in the database, a type of query the database supports efficiently.

A big challenge in CIBR systems has been evaluation: given two systems purporting to solve the same search problem, which one does a better job? In traditional search engines this problem is attacked with the notions of precision and recall, but these depend on a reliable notion of whether or not an image is relevant, and do not take into account the notion of iterative refinement of results. In 2000, it was typical to rely on derived measurements such as the time to complete tasks with access to the tool.

This discussion is but a summary of an already-dense survey, so if you’re interested in CBIR, I recommend consulting the survey for details. After more than a decade, there are increasing signs of wider deployment of CBIR systems both in limited domains and to the general public – don’t be surprised if you end up using one soon.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Artificial intelligence, Information theory and compression | 9 Comments »

A Mathematical Theory of Communication

Posted by dcoetzee on April 12, 2009

Citation: Claude E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, July, October, 1948. (PDF)

Abstract: The recent development of various methods of modulation such as PCM and PPM which exchange bandwidth for signal-to-noise ratio has intensified the interest in a general theory of communication. A basis for such a theory is contained in the important papers of Nyquist and Hartley on this subject. In the present paper we will extend the theory to include a number of new factors, in particular the effect of noise in the channel, and the savings possible due to the statistical structure of the original message and due to the nature of the final destination of the information.

Discussion: This seminal 1948 paper by Claude Shannon first introduced many of the central ideas and terminology of the then-nascent field of information theory, including entropy, channel capacity, the noisy-channel coding theorem, Shannon-Fano coding, language modelling, and Shannon-Fano-Elias coding, an early version of modern arithmetic coding.

A communication system is a system that sends data from a source to a destination over a channel. In the 1940s, communication systems such as radio, telephony, and telegraphy systems had been in use for many decades, and were effective, but had little theoretical basis. Simple questions remained unanswered, such as “how fast can I send a particular type of information over a particular channel?”, “what is the best way to represent the data while sending it over the channel?”, and “how do I measure how much information is represented by a piece of data?” This single sweeping paper answered these questions, both for discrete channels (designed to send a sequence of discrete symbols such as bits) and for continuous channels (designed to send continuously-varying signals such as audio or video signals).

The first point emphasized by this paper is how information should be measured: there are two approaches to this. One is to represent the amount of information by the number of possible values that the datum could represent; for example, a 32-bit register can store 4,294,967,296 distinct values. Although it is possible to describe information in these terms, it becomes rapidly unwieldly, and does not correspond to intuition (for example, one would expect two CDs to store twice as much information as one CD, not the square). Shannon chose to use a logarithmic measure, and to represent it in terms of “binary digits or bits“, a word popularized by this paper. This does lead to some counterintuitive results: for example, a piece of information taking on 10 distinct values stores about 3.3 bits of information, and it seems strange to talk of fractional bits.

With this measure, the capacity of a discrete channel, which transmits a sequence of discrete symbols such as bits, is defined as log N(T)/T as T approaches infinity. Here, N(T) is the number of possible messages one can transmit in time T. The source itself is approximated by a Markov process, which generates a sequence of symbols while using a fixed finite amount of state to assign probabilities to which symbol will be generated next.

The Markov process model is surprisingly general: many different models of an information source can be constructed as Markov models, ranging from the trivial one that just treats each character independently and assigns the symbols uniform probability, to one that examines several previous words in an English text to determine what the next word is most likely to be. Designing a Markov process that approximates the behavior of a real information source, like an English text document or an image, is a large part of the problem of language modelling, a practice with diverse applications. In this paper, armed with nothing but a book of random numbers, Shannon emulates four different simple language models for English text and shows subjectively their increasing similarity to actual English text.

Shannon posits a measurement of information or uncertainty that satisfies some basic axioms called entropy, denoted by H, a term borrowed from statistical mechanics. This is defined for a discrete source with n symbols as:

-sum(pi log pi) as i goes from 1 to n

For example, if you have a single bit that is either zero or one with probability 1/2, the total information conveyed by it is:

-[(1/2) log(1/2) + (1/2) log(1/2)] = 1 bit.

If on the other hand the bit is zero with probability 7/8, then it conveys much less information:

-[(7/8) log(7/8) + (1/8) log(1/8)] = about 0.164 bits.

One way of looking at this is that a sequence of k bits each independently produced with these probabilities can (with high probability) be compressed down to a sequence of slightly more than 0.164k bits. As k becomes larger, this bound can be approached more closely.

Shannon defines a channel’s capacity as the number of bits it can transfer per second – for a noise-free discrete channel where each symbol is the same length, this is trivial and is just a constant. Shannon’s noiseless channel coding theorem asserts that if a channel has capacity C and a source has entropy H, we can transmit only up to C/H symbols per second. He gives an explicit construction of a code capable of doing such a transmission, later dubbed Shannon-Fano-Elias coding, which is a near-optimal code very similar to the modern arithmetic coding that forms the basis of nearly all modern data compression methods. Conversely, to show that this rate cannot be exceeded, he shows that essentially any operation that can be done on a sequence of symbols in finite memory can only lower, not increase, its entropy; hence one cannot recover more any more information from the output of the channel than it already has, which is determined by the capacity.

When using a noisy channel, where the channel can modify symbols in transit, the situation is different. He presents an abstract model in which an “observer” machine watches both the input and output of the channel, and then sends “correction” data to the receiver, with the aid of which the receiver can perfectly reconstruct the input. The minimum capacity of the channel required for this correction data to be transmitted is called the equivocation, and denotes the uncertainty in the received data due to noise. The capacity of a noisy channel is defined as:

C = Max(entropy of input – equivocation), with maximum taken over all input sources.

By the use of forward error correction, or redundancy in the transmitted information, Shannon showed the surprising result that with this definition, one can transmit information at a rate of up to C/H symbols per second, with an arbitrarily low probability of errors. This cannot be exceeded; if one eliminates redundancy in an attempt to increase the transmission rate, the increase in errors will eliminate any added benefit. Shannon’s proof demonstrates this constructively by presenting a large class of theoretical error-correcting codes that achieve the desired bound in the average case, meaning most of them are near-optimal. On the other hand, none of these codes are practical for real use. He gives an isolated example, a Hamming code also designed in the 1940s, to demonstrate a more practical error correction code, but a more complete family of practical error correction codes, such as the Reed-Solomon error correction popular today, would have to wait for later research.

Shannon’s model was very general and admitted possibilities that today would be considered unnecessary for most practical scenarios. For example, he allows different symbols to require different lengths of time to transmit, and allows noise to affect different symbols differently. On the other hand, he also makes some important restrictions on the statistical behavior of the source that are often overlooked, the most notable of which being that it must be ergodic: this means that the long-term behavior of any one infinite character sequence must have the same statistical properties as the average over all possible sequences.

It’s easy to give an example of a set of sequences for which this is not the case: suppose we have a set of sequences where every other character is a, and the remaining characters are b, but we can start with either one. Since half of these sequences will have a’s in the odd positions, and half of them have a’s in the even positions, the character a occurs with equal probability in every position over all possible sequences. Yet in any particular infinite sequence, it occurs in half the positions with probability 1 and the other half with probability 0. Ergodic theory is a rich and complex mathematical theory only lightly touched upon in this work, but ergodic sequences form one of the fundamental assumptions of Shannon’s work.

The second and more mathematically sophisticated section of this paper, often overlooked today by computer scientists, deals with transmission of continuously-varying signals such as audio or video signals. In the 1940s this was of greater relevance, since most communication systems dealt with this type of signal; today, many of them are being replaced by digital systems. However, even these discrete signals must eventually be translated into continuous signals to be sent over wires, whether electrical or optic fiber, and the ability to analyze the action of noise on these signals directly rather than only at the higher level of discrete noise allows for a more precise analysis.

Shannon’s treatment of continuous signals and the resulting theorems are largely the same as the discrete case, but in the limit as time units become small. Entropy is now an integral instead of a sum, and is maximized when the distribution is Gaussian (producing a white Gaussian noise signal). Entropy is more complex to work with in this setting, because it depends on the choice of coordinate system. He shows closed-form results for a few cases where the average power of the input signal is bounded, and the case where the peak or maximum power is bounded, both with a simple additive white Gaussian noise (AWGN) model.

Additionally, the concept of “lossless” transmission becomes meaningless in this setting: a continuous signal carries an infinite amount of information, only a finite amount of which can be recovered in the presence of any noise in the channel. Instead of discussing the probability with which the signal can be reconstructed perfectly, he describes a function v that places a limit on how close the input and output functions are, and then defines capacity subject to this limitation.

Sometimes “A Mathematical Theory of Communication” is credited for single-handedly founding information theory. Although it was influential, this is an overstatement – it was essentially a generalization of the 1928 foundational work of Harry Nyquist (“Certain factors affecting telegraph speed”), which described the Nyquist rate and allowed Shannon to represent continuous signals in an essentially discrete way, and Ralph Hartley (“Transmission of Information”), which was the first paper to attempt to measure the information that could be put through a channel (although its conclusions were comparatively conservative). It also adapted the terminology and techniques of statistical thermodynamics as described by Boltzmann and Gibbs in its description of information entropy. Shannon’s inspiration also came from new developments in signal coding techniques, such as pulse-code modulation, which was used by SIGSALY to digitize speech signals for encryption during World War II. The potential for differences in coding techniques to yield improvements in transmission rate raised questions about the fundamental limitations of transmission, creating an environment conducive to such a work.

Conversely, this paper was not the last foundational work of information theory. It did not begin to explore applications to compressed, reliable data storage (which can be seen as sending data over a channel “through time”), nor algorithmic information theory, which examines the computational limitations of information, in particular Kolmogorov complexity. The primitive language models it mentions would be replaced by sophisticated ones based on a variety of statistical techniques such as smoothing, and a huge variety of lossless and lossy codes would be described not only for English text but for quantized signals such as digital images and digital audio, forming the basis of modern standardized formats like ZIP, JPEG, and MP3 files.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Information theory and compression, Networking | 2 Comments »