Papers in Computer Science

Discussion of computer science publications

On understanding data abstraction, revisited

Posted by dcoetzee on December 5, 2009

Citation: William R. Cook. On understanding data abstraction, revisited. ACM SIGPLAN Notices 44, 10, 557-572. October 2009. (PDF)

Abstract: In 1985 Luca Cardelli and Peter Wegner, my advisor, published an ACM Computing Surveys paper called “On understanding types, data abstraction, and polymorphism”. Their work kicked off a flood of research on semantics and type theory for object-oriented programming, which continues to this day. Despite 25 years of research, there is still widespread confusion about the two forms of data abstraction, abstract data types and objects. This essay attempts to explain the differences and also why the differences matter.

Discussion: This October 2009 essay by Assistant Professor William R. Cook of the University of Texas at Austin, an active programming languages researcher, seeks to clarify the fundamental difference between two often-confused forms of data abstraction: abstract data types (ADTs) and objects (as in object-oriented programming).

Both abstract data types and objects have the same basic purpose: they allow the same functionality to be implemented in different ways, and allow the implementation to be modified without affecting client code. For example, when using a “dictionary” module mapping keys to values, the client shouldn’t have to care whether that functionality is internally implemented using a hash table or a red-black tree; and the implementer should be free to modify this kind of implementation detail at any time.

The “objects” provided by mainstream object-oriented programming languages such as Java, C++, and C# are actually a sort of hybrid of abstract data types and true objects, able to effectively simulate both abstractions. To demonstrate the distinction, here we outline an implementation for an “IntQueue” class representing a first-in first-out queue of integers in Java using both methods. First using objects:

interface IIntQueue {
  public boolean isEmpty();
  public void enqueue(int i);
  public int dequeue();
  public IIntQueue append(IIntQueue q);
}

class IntQueue implements IIntQueue {
  java.util.ArrayList list;

  public IntQueue() { list = new java.util.ArrayList(); }
  public boolean isEmpty() { return list.size() == 0; }
  public void enqueue(int i) { list.add(i); }
  public int dequeue() { int i = (Integer)list.get(0); list.remove(0); return i; }
  public IIntQueue append(IIntQueue q) {
    IntQueue result = new IntQueue();
    while (!this.isEmpty()) { result.enqueue(this.dequeue()); }
    while (!q.isEmpty()) { result.enqueue(q.dequeue()); }
    return result;
  }
}

public class Program {
  public static void main(String[] args) {
    IIntQueue q1 = new IntQueue(), q2 = new IntQueue();
    q1.enqueue(1); q1.enqueue(2);
    q2.enqueue(3); q2.enqueue(4);
    IIntQueue q3 = q1.append(q2);
    while (!q3.isEmpty()) {
      System.out.println(q3.dequeue());
    }
  }
}

This Java sample follows two important disciplines:

  1. Concrete class names, such as IntQueue, are only ever used in “new” expressions, not as parameter types, return types, or even local variable types – and this includes in the definition of the IntQueue class itself. Instead, we are constrained to exclusively use interfaces (pure virtual classes in C++) for our static types.
  2. Reference equality (==) is never used.

Although they may seem draconian, adhering to this “pure object” discipline introduces some powerful flexibility. For example, instead of having the “append” method construct the combined queue all at once, we can introduce a new class that allows us to rewrite it as a constant-time operation:

class IntQueue implements IIntQueue {
  // (same as before)
  public IIntQueue append(IIntQueue q) { return new LazyAppendIntQueue(this, q); }
}

class LazyAppendIntQueue implements IIntQueue {
  IIntQueue q1, q2;

  public LazyAppendIntQueue(IIntQueue q1, IIntQueue q2) { this.q1 = q1; this.q2 = q2; }
  public boolean isEmpty() { return q1.isEmpty() && q2.isEmpty(); }
  public void enqueue(int i) { q2.enqueue(i); }
  public int dequeue() { if (q1.isEmpty()) return q2.dequeue(); else return q1.dequeue(); }
  public IIntQueue append(IIntQueue q) { return new LazyAppendIntQueue(this, q); }
}

This modification requires no changes to the client code; indeed, no client code could possibly detect any change in behavior, as long as it’s constrained to accessing objects through the IIntQueue interface.

An alternate way of optimizing the append() method is to append the underlying arrays, and then make this the underlying array of a new IntQueue. In the strict objects model, however, this is impossible: the append() method can see the representation of this, but can only access its argument through the IIntQueue interface. We might try to fix this by adding a method to the interface to get the underlying array, but that would severely limit the possible implementations of IIntQueue (in particular, LazyAppendIntQueue above would be excluded). The restriction that objects cannot access other objects except through their interface, even other instances of the same type, is called autognosis, and is a fundamental limitation of objects.

Now let’s consider how this same data structure could be implemented in the style of an abstract data type (ADT):

class IntQueue {
  java.util.ArrayList list;

  public IntQueue() { list = new java.util.ArrayList(); }
  public boolean isEmpty() { return list.size() == 0; }
  public void enqueue(int i) { list.add(i); }
  public int dequeue() { int i = (Integer)list.get(0); list.remove(0); return i; }
  public IntQueue append(IntQueue q) {
    IntQueue result = new IntQueue();
    result.list = this.list;
    result.list.addAll(q.list);
    return result;
  }
}

This looks more like typical Java code – each method of IntQueue is free to ravage the internal state of any instance of IntQueue, and interfaces are not used; consequently, different implementations of the same module are not interchangeable at runtime. They are, however, interchangeable at compile-time – if we had two different implementations of “class IntQueue” in two different namespaces, we could decide with a single “import” statement which one we’d like to use. Different parts of the program might choose to use different implementations, but they wouldn’t “mix” – they couldn’t be passed to one another’s append() methods. For these reasons, ADTs are considerably less flexible than true objects, but what they lose in flexibility they make up for in simplicity and the potential for optimization.

Considering how much more ADTs look like normal Java code, you might be asking yourself where objects are really used in practice. In general, they tend to pop up in places where a single, relatively stable interface has many different implementations; common examples include “windows, filesystems, or device drivers.” Filesystems are the most illustrative example: by providing a single fixed API, not only can many different filesystems be accessed through the same uniform filesystem interface, but they can be composed in interesting ways, such as mounting an ISO (a file on one filesystem contains the underlying storage for a different filesystem).

The filesystem example also exposes one of the strongest disadvantages of objects: once many implementations have been written to a stable interface, changing that interface in any way is very difficult. ADT signatures on the other hand are straightforward to extend; for example, if a client of Dictionary needs to iterate over the dictionary in order, you can just add that functionality to the version based on red-black trees, and then compel that client to use that particular implementation.

Following a development process based on refactoring, it’s not uncommon to introduce an abstraction using ADTs and later refine it into a true object when more flexibility is needed. Conversely, the need for optimization using autognosis may require introducing some elements of ADTs. Using type inspection like “instanceof” and reflection, it may be possible to get the “best of both worlds” by optimizing certain common cases, but falling back on the generic interface for unrecognized types. This is sort of a poor man’s version of multiple dispatch.

Neither Cook nor I promote the use of either ADTs or true objects over the other. The most important takeaway from this is: know how to use both ADTs and true objects, recognize which one you’re using, and know their advantages and disadvantages. By avoiding the conceptual conflation that hybrid data abstraction systems tend to induce, you can better predict where to expect issues with extensibility and efficiency in your system.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Programming languages | 2 Comments »

Embedded image coding using zerotrees of wavelet coefficients

Posted by dcoetzee on July 8, 2009

Citation: Shapiro, J.M., “Embedded image coding using zerotrees of wavelet coefficients,” Signal Processing, IEEE Transactions on , vol.41, no.12, pp.3445-3462, Dec 1993. (IEEE) (PDF)

Abstract: The embedded zerotree wavelet algorithm (EZW) is a simple, yet remarkably effective, image compression algorithm, having the property that the bits in the bit stream are generated in order of importance, yielding a fully embedded code. The embedded code represents a sequence of binary decisions that distinguish an image from the “null” image. Using an embedded coding algorithm, an encoder can terminate the encoding at any point thereby allowing a target rate or target distortion metric to be met exactly. Also, given a bit stream, the decoder can cease decoding at any point in the bit stream and still produce exactly the same image that would have been encoded at the bit rate corresponding to the truncated bit stream. In addition to producing a fully embedded bit stream, EZW consistently produces compression results that are competitive with virtually all known compression algorithms on standard test images. Yet this performance is achieved with a technique that requires absolutely no training, no pre-stored tables or codebooks, and requires no prior knowledge of the image source.

The EZW algorithm is based on four key concepts: 1) a discrete wavelet transform or hierarchical subband decomposition, 2) prediction of the absence of significant information across scales by exploiting the self-similarity inherent in images, 3) entropy-coded successive-approximation quantization, and 4) universal lossless data compression which is achieved via adaptive arithmetic coding.

Discussion: This 1993 paper described EZW, one of the earliest successful image compression algorithms using the discrete wavelet transform, and was the first published fully embedded code for images. With such a code, an image file can be truncated at any point and the result will be a valid approximation to the image, enabling a fine-grained tradeoff between file size and quality. It is also, unusually for its time, a compression scheme that requires no training on an existing database of images, instead adapting to the statistics of the image at hand.

If you don’t have a background in signal processing there’s quite a bit of background required for this one, but I’ll summarize the necessary concepts here.

The first concept, adaptive arithmetic coding, is an entropy coding strategy used in a huge variety of both lossless and lossy coding schemes. To simplify discussion, let’s consider the simple scenario where we’re decoding a bilevel (black and white) image file pixel-by-pixel in scan order, and we have access to all preceding pixels in scan order. Suppose the previous pixel is white. How can we use this information to predict the color of the next pixel? One simple way is to keep a count of how many times so far a white pixel has been followed by a black pixel, and how many times it’s been followed by a white pixel; these frequencies are used to estimate the probabilities that the next pixel will be black, or white. This is the first part of the entropy coding scheme, called the model; it is adaptive because it depends on what other pixels have been decoded so far.

The second half is the arithmetic coding. Like all entropy coding schemes, it encodes a sequence of symbols, using less bits to represent a symbol that is expected to occur with high probability, and more bits to represent a symbol that is expected to occur with low probability. The details of how this coding is done are not important here; the important thing is that it is near-optimal, in the sense that if a symbol is expected to appear with probability p, it can be encoded in close to −log2 p bits, even if this value is not an integer, which is the lower bound of Shannon’s source coding theorem. For example: a symbol expected to appear with probability 0.5 can be encoded in −log20.5 = 1 bit, a symbol expected to appear with probability 0.95 can be encoded in −log20.95 or about 0.074 bits, and a symbol expected to occur with probability 0.001 requires about 10 bits. Compared to Huffman coding, arithmetic coding is particularly useful in situations where the number of symbols is small, or where the probabilities are close to 1.

The discrete wavelet transform (DWT) is, like the discrete cosine transform (DCT) used by JPEG files, a way of representing an image in the frequency domain instead of the spatial domain. It is a reversible operation that converts an array of pixels into an equal-sized array of coefficients each representing how much energy the image has in a certain range of frequencies, in a certain part of the image. The reason for doing this is that natural images like photographs tend to have most of their energy at low frequencies due to their smooth, continuously-varying regions, with high-frequency information like edges and fine details being limited to only a small portion of the image. The result is that after undergoing such a transformation, most of the coefficients will be close to zero, and can be truncated to zero with little visual impact on the image. The main difference between the DCT and the DWT is the DCT is applied to a fixed-size spatial region: in JPEG, it’s always applied to an 8×8 block of pixels. The DWT is applied in a multiresolution fashion: it uses a high-pass filter to isolate local changes in brightness (details) in the image between adjacent pixels, then low-pass filters and downscales the image and repeats this process. As a consequence its coefficients vary not only in the frequency range they cover but also in the size of the spatial region they cover, allowing them to represent both very fine details and larger-scale details compactly. Since a range of frequencies is called a subband, this is also called a hierarchical subband decomposition.

Jpeg2000_2-level_wavelet_transform-lichtensteinThe array of DWT coefficients can be visualized as a new image, which at low bit rates will be mostly black (close to zero), but with white in some places, representing the significant coefficients. See the example to the right (Alessio Damato / CC-BY-SA). The main challenge in representing such an image is not so much representing the values of the significant coefficients,  but their position, using a data structure called a significance map.

This is where the zerotrees come in. Notice how the three images in the upper-left closely resemble the three large images. In particular, where the smaller images are black, the larger images also tend to be black. Zerotrees take advantage of this by encoding the smaller images first, and assigning one of four values to each coefficient: positive significant coefficient, negative significant coefficient, zerotree root, and isolated zero. A zerotree root is an insignificant (near zero) coefficient for which the corresponding coefficients in all larger images are also insignificant. An isolated zero is an insignificant coefficient that does not have this property. Once a zerotree root has been encoded, none of its descendants need to be encoded, since they’re already known to be insignificant. Additionally, since isolated zeros are rare, this symbol is assigned a small probability and the use of arithmetic coding ensures that not too much space is wasted accounting for this possibility. The parameter that determines how much space our zerotree occupies is the threshold between significant and insignificant coefficients.

This is already a great way of compressing images, but how do we turn this into a fully-embedded code that can be truncated at any point? The key to this lies in a successive-approximation scheme. In this scheme, initially all coefficients are assumed to be insignificant, so the initial zerotree is trivial (one zerotree symbol). We then lower the threshold (divide it by 2) and compute and output a new zerotree, and repeat this indefinitely. Any significant coefficients already detected by previous zerotrees are assumed to be zero, so that they don’t get encoded more than once. Simultaneously, each time we produce a new zerotree we also add one bit of precision to the magnitudes of all the already-known significant coefficients. We don’t have to finish writing out any particular zerotree, since each one prioritizes writing out coarse details (smaller coefficient images) before finer details (larger coefficient images); any coefficients left unencoded are assumed to be insignificant.

What’s happened to EZW today? Its raw compression performance and speed have long been since been exceeded by other systems, such as SPIHT (Said, Amir and Pearlman, William A. A new fast and efficient image codec based upon set partitioning in hierarchical trees. June 1996. PDF),which benefits from a more complex partitioning scheme then the simple splitting-into-four shown here, and other improvements. But its ideas remain the foundation of all the best modern image compression systems. These ideas were incorporated into the JPEG 2000 image format, which uses a very similar wavelet-based fully-embedded code. This eliminates JPEG’s blocking artifacts due to its multiresolution nature, permits a fine-grained quality-for-size tradeoff, gives significantly better quality at low bit rates, and allows lossy and lossless encoding in a single framework, among other advantages. JPEG 2000 hasn’t seen wide adoption in consumer applications – one can only speculate as to why – but these techniques remain a valuable way to look at image compression.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Information theory and compression, Signal processing | Leave a Comment »

The Soft Heap: An Approximate Priority Queue with Optimal Error Rate

Posted by dcoetzee on July 6, 2009

Citation: Chazelle, B. The soft heap: an approximate priority queue with optimal error rate. Journal of the ACM 47, 6 (Nov. 2000), 1012-1027. (PDF)

Abstract: A simple variant of a priority queue, called a soft heap, is introduced. The data structure supports the usual operations: insert, delete, meld, and findmin. Its novelty is to beat the logarithmic bound on the complexity of a heap in a comparison-based model. To break this information-theoretic barrier, the entropy of the data structure is reduced by artificially raising the values of certain keys. Given any mixed sequence of n operations, a soft heap with error rate ε (for any 0 < ε ≤ 1/2) ensures that, at any time, at most εn of its items have their keys raised. The amortized complexity of each operation is constant, except for insert, which takes O(log 1/ε) time. The soft heap is optimal for any value of ε in a comparison-based model. The data structure is purely pointer-based. No arrays are used and no numeric assumptions are made on the keys. The main idea behind the soft heap is to move items across the data structure not individually, as is customary, but in groups, in a data-structuring equivalent of “car pooling.” Keys must be raised as a result, in order to preserve the heap ordering of the data structure. The soft heap can be used to compute exact or approximate medians and percentiles optimally. It is also useful for approximate sorting and for computing minimum spanning trees of general graphs.

Discussion: This 2000 paper by Bernard Chazelle introduced a data structure called a soft heap, which is a heap data structure that can perform any operation in amortized constant time. In exchange for this ideal performance bound, a new limitation is imposed: at any time, any element in the heap may have its key increased at the discretion of the data structure; such an element is said to be corrupted. There is an upper limit of εn corrupted elements, where n is the number of operations completed so far and ε is a fixed parameter. For example, if ε = 0.01, and we insert 1000 elements into the heap, at most 10 elements will be corrupted.

Heaps are one of the fundamental data structures studied in computer science. Like dictionaries or hash tables, they store a set of elements, each with an associated key. Heaps support four fundamental operations:

  • insert: Insert a new element, given the element and its key.
  • extract minimum: Get the element with minimum key and delete it from the heap.
  • decrease key: Update an existing element’s key to a smaller value.
  • merge or meld: Combine two existing heaps together into one large heap.

The most obvious application of these operations is a priority queue: if the keys indicate priority (with smaller values indicating higher priority), then insert adds a new task to the queue, and extract minimum retrieves the current highest-priority task. Heaps also form the basis of the heapsort sorting algorithm, which in its simplest form inserts each element into a heap one-by-one and then uses extract minimum repeatedly to get the result list in order.

The sorting application demonstrates an important lower bound regarding heaps: it is impossible to design a heap data structure that can do both the insert and extract minimum operations in amortized constant time, for if this were possible we would be able to do a comparison sort in linear (O(n)) time; but any such sort requires Ω(n log n) time. Indeed, the best known heap data structure, the Fibonacci heap, developed in 1984, can do all these operations in amortized constant time except for extract minimum, which requires Θ(log n) amortized time.

But what if there were some way to modify the heap data structure that let us break through this lower bound and do all operations in amortized constant time without contradicting the above result? This is exactly what the soft heap does. A soft heap is like a normal heap, but with the troubling property that the key you assign to an element when you insert it may not stay that way; more formally, the data structure may choose at its discretion to increase the key of any element at any time. Any element that has had its key increased is called corrupted. It’s easy to see how this might be helpful in reducing complexity; if the data structure just decided to increase the keys of all elements to infinity, it would be trivial to do all operations in constant time.

But to make soft heaps actually useful and not just fast, we need an additional constraint. This constraint is as follows: each soft heap has a fixed parameter ε called the error rate. Roughly speaking, this controls what proportion of the elements may be corrupted at any given time. More formally, it specifies that if we start with the empty heap and do n operations, the resulting heap will have at most εn corrupted elements. If all n operations are inserts, then indeed ε will be an upper bound on the proportion of corrupted elements. If extract minimum operations are included, the relationship is no longer so clear – it’s possible all elements may become corrupt.

As Chazelle so elegantly says, “despite this apparent weakness, the soft heap is optimal and—perhaps even more surprising—useful.” Perhaps the simplest nontrivial application of soft heaps is finding the median of a list of values in linear time. There is a classical algorithm for doing this, but the implementation and analysis of the algorithm based on the soft heap is much simpler. All we have to do is set ε = 1/3, then insert all n elements into the heap, and then do extract minimum n/3 times. Depending on exactly which elements got corrupted, the final element extracted, the pivot, will be somewhere between the (n/3)th and (2n/3)th smallest element in the list. We then partition the list into the sublist of elements less than or equal to the pivot, and those greater than the pivot, and recursively invoke this procedure on the one containing the median. This is similar to the quicksort algorithm, but only recursively processing one of the sublists instead of both, and it runs in worst-case linear time.

Historically, soft heaps were invented with a specific application in mind: computing minimum spanning trees. Chazelle has achieved some fame for discovering the asymptotically best known algorithms for several problems, and this is one of them: his soft-heap based algorithm can compute minimum spanning trees in O(n α(n)) time, where α is the inverse Ackermann function, a very slowly-growing function that is less than 5 for all remotely practical values of n. I don’t discuss the details of this algorithm here because they’re pretty complex, but you can see Chazelle’s paper (Bernard Chazelle. A Minimum Spanning Tree Algorithm with Inverse-Ackermann Type Complexity. JACM 47(6):1028–1047, 2000, PDF). In 2002, Pettie and Ramachandran presented a minimum spanning tree algorithm based on soft heaps that is known to be asymptotically optimal.

So how does it work? In Chazelle’s original implementation, the soft heap was a simple variant on the binomial heap. Instead of storing a single element at each node, it would store a list of elements, associated with a single key that is an upper bound on the original keys of all the elements. This is the source of the “corruption” – any values in this list whose original keys are smaller than the common key are corrupted. Such “multivalued” nodes could only appear near the top levels of the trees making up the heap, which limits how many corrupted elements there can be. These lists are formed during the soft heap’s special “sift-up” routine, which under certain conditions will run twice on the same node, causing it to merge (concatenate) element lists with one of its children. There are a number of subtleties to this process, however, and to explicate them Chazelle gives complete C source code for the data structure in a literate programming style.

Recently a simpler implementation of soft heaps by Kaplan and Zwick has appeared (A simpler implementation and analysis of Chazelle’s soft heaps. In Proceedings of the Nineteenth Annual ACM -SIAM Symposium on Discrete Algorithms (Jan 4-6, 2009).) They use the original soft heaps concept of storing a list of items at each node, and of occasionally running sift on elements twice during sifting, but rather than being based on binomial heaps, this implementation is based on a set of heap-ordered binary trees. Also, their condition for running sift twice on a node is somewhat less arbitrary: each node x has an associated “size” value that increases exponentially with rank, and the size |list(x)| of the element list at any internal node x must always satisfy size(x) ≤ |list(x)| ≤ 3size(x).

How do you know if soft heaps might be useful for something you’re doing? Due to their peculiar behavior, they tend to be most useful for inventing new algorithms with better worst-case performance than existing algorithms, particularly in situations where the “approximate rank” – or approximate position in sorted order – of an element in a list is useful. It tends to be applied in places where ordinary heap data structures have traditionally been applied in the past, like graph algorithms and sorting. Beyond this though, there may still remain a rich set of untapped applications of soft heaps to be found that employ them in novel ways.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Algorithms and optimization | Leave a Comment »

Pattern Classification

Posted by dcoetzee on June 17, 2009

Citation: R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification, 2nd edition. New York: John Wiley & Sons, 654 pages, 2001, ISBN: 0-471-05669-3.

Discussion: This classic machine learning textbook discusses a variety of well-studied application-independent methods of classifying inputs into one of several distinct classes, as well as some of the theory behind them. The book’s organization is constructed around the amount of information available to the classifier, beginning with scenarios where the actual distribution of features of classes is known, and finishing with scenarios where even the classes themselves must be inferred. Among the many techniques considered are decision trees, neural networks, genetic algorithms, Boltzmann learning, nearest neighbor classification, k-means clustering, principal and independent component analysis, formal grammars, linear discriminants, resampling methods like bagging and boosting, and maximum-likelihood estimation; most of these are treated in great detail.

Perhaps the best demonstration of the level of detail that the book goes into is its discussion of neural networks in chapter 6. Whereas any general book on artificial intelligence is bound to include basic discussion of neural networks and their dominant learning algorithm, backpropagation, the discussion here is both denser and covers a broad range of extensions and relations to other methods. First of all, it contextualizes neural networks and places them in their historical context by preceding the chapter with a thorough chapter on linear discriminants, a powerful tool in their own right that are basically equivalent to neural networks with no hidden layers. Unlike more generalized references which tend to merely describe the most common implementation of neural networks without any justification, every design choice going into the neural network is derived and justified and alternative choices are discussed, from the update rules to the activation function to the number of layers and hidden units to the initial weights to the training protocol to the stopping condition. Perhaps most important from an engineer’s point of view is section 6.8, which discusses a variety of extensions needed to make neural networks efficient and accurate in practice, such as input standardization, momentum, weight decay, hints, use of second-order information, and so on. Each of these occupies no more than a page, but still effectively conveys the essence of the technique. Besides these essentials, the chapter also informally proves the power of three-level networks, and incorporates a variety of interesting visualizations of network structure, network weights, error surfaces, convergence, and so on, which helps a lot in developing an intuition for the behavior of neural networks. In short, this chapter contains about as much information on neural networks as I would normally expect to get from an entire book on the subject.

The book isn’t just a bunch of survey papers stapled together though; it has a number of things that tie it together. Most important is its organization of progressively less well-specified problems. Chapter 2 begins with Bayesian decision theory and describes how to perform classification optimally, given the complete probability distribution of each feature for each class. This also brings in the first major theoretical result, the lower bound on classification error given by the Bayes error. From there it progresses to maximum-likelihood estimation, in which the distribution type (e.g. Gaussian) is known but not its parameters (such as mean and standard deviation), and these parameters must be determined from manually-classified example data. After this comes simple nonparametric techniques like nearest neighbor estimation and linear discriminants that make no assumptions about the distribution but only work well when the classification function is simple and the dimension isn’t too high. Neural networks can describe more complex functions, but still get tripped up by complex classification functions with many local minima and maxima. The next chapter addresses stochastic methods like simulated annealing and genetic algorithms that cope well with complex classification functions, at the expense of much greater training time. Skipping over a couple chapters, the final one deals with unsupervised classification, where the example data have not been assigned to classes and the classes are not even known and must be inferred; a typical technique in this scenario is clustering. This organization is effective at emphasizing the point that a classifier needs to take advantage of as much information as the application domain makes available to be effective.

Chapters 8 and 9 don’t fall well into the natural progression. Most of the chapters deal with numeric features, like width or age, leaving discrete features such as color or gender, and the classifers that process them such as decision trees and formal grammars, to chapter 8. Chapter 9, entitled “algorithm-independent machine learning”, is based around theory and metalearning techniques that can be used to enhance all classifiers. It begins with the fundamental No Free Lunch theorem, which essentially states that no classifier performs better in all cases than any other classifier, including the naive classifer that merely guesses at random. It also discusses resampling methods that aim to improve unstable classifiers or combine different types of classifiers, and describes how to compare one classifier with another, which is essential when espousing the benefits of any new classifier.

The progressive organization, while useful, also necessarily implies that some essential basic topics, such as decision trees (chapter 8), resampling methods (chapter 9), and k-means clustering (chapter 10), are not introduced until long after more specialized topics have been visited, such as Bayesian belief networks (chapter 2) and hidden Markov models (chapter 3). As a consequence it probably makes more sense to just read the first few sections of each chapter first, rather than reading the book straight through.

Another frustration with this book is that there are a number of forward references: in particular, cross-validation is mentioned several times before it’s finally explained in Chapter 9, and probabilistic neural networks were discussed in chapter 4 in the context of Parzen windows, with neural networks discussed in chapter 6. They also assume quite a bit of math background in areas that aren’t covered in-depth in most undergraduate programs, like matrix calculus and statistics, with only a terse appendix for a refresher on this material. It would have been helpful to demonstrate of these techniques in the one-dimensional case, where the familiar intuitions of real number calculus apply.

Well dense and thorough, there are necessarily some things the book doesn’t cover, in order to fit into 600 pages. Some theorems are not proved in the text, particularly the No Free Lunch Theorem, but also many of the convergence results and probability bounds. This is generally not a big deal since the book is intended more for those looking to apply machine learning to an application domain, rather than those looking to study theoretical machine learning and conceive new techniques – and its references are thorough enough that this information can be located if needed. On the other hand, it also doesn’t contain any real code or references to real machine learning libraries – it’s not a programmer’s handbook. Finally, the second edition in particular concentrates heavily on application-independent models, and does not delve into the intricacies of the immense number of applications of machine learning; problems like determining the best features to use and how to compute them are discussed but motivated only by toy examples and not real-world problems like speech recognition, image processing, or expert systems. It helps when reading it to have an application domain in mind to mentally plug into the models.

In short, if you’re looking for a dense and application-independent survey of major classification algorithms in machine learning, this is the book to get you started, and will take you a few steps beyond anything you might have studied in an AI intro class. But I would precede it with some reading on matrix calculus, and follow it with some reading on a specific application area of your choice, to help make the concepts more concrete.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Artificial intelligence | 2 Comments »

Learning from hints in neural networks

Posted by dcoetzee on May 24, 2009

Citation: Abu-Mostafa, Y. S. 1990. Learning from hints in neural networks. Journal of Complexity, 6, 2 (Jun. 1990), 192-198. (PDF)

Abstract: Learning from examples is the process of taking input-output examples of an unknown function f and infering an implementation of f. Learning from hints allows for general information about f to be used instead of just input-output examples. We introduce a method for incorporating any invariance hint about f in any descent method for learning from examples. We also show that learning in a neural network remains NP-complete with a certain, biologically plausible, hint about the network. We discuss the information value and the complexity value of hints.

Discussion: This short 1990 machine learning paper introduced a technique for learning with hints in a neural network; that is, it allows a neural network to learn a function for which we have some limited information about its properties.

In machine learning, neural networks are a simple way of representing functions that is sufficiently powerful to approximate any function. They consist of a set of at least three layers of processing nodes connected by edges labelled with weights. All the inputs into each node are multiplied by the weights on their edges, then a sigmoid function (a particular strictly increasing function bounded between -1 and 1) is applied to produce the output. By adjusting the weights, we can gradually modify the function being computed.

Neural networks are particularly useful in classification problems; for example, one might construct a neural network that takes as input an image and outputs 1 if it looks like a hamburger, or else -1. It would be really difficult to manually write a program that does this. Instead, neural networks can be trained with a set of inputs and their associated outputs; the weights are adjusted based on the examples using the backpropagation algorithm until it computes a function close to the actual one.

For our purposes, the most important thing to note is that the backpropagation works by feeding the training input to the network and determining how far the output is from the desired training output, called the error. It then adjusts the weights in a way that decreases that error. It does this repeatedly until it reaches a stopping point.

Backpropagation is efficient and general but can run into two important related problems:

  • Insufficient data: There may not be enough training data to learn weights that generalize well to new inputs.
  • Overfitting: The resulting function may end up being oversensitive to parameters of the training examples that are actually irrelevant.

For example, say you train a hamburger recognizer on many images off the Internet, and then I give it a picture of an upside-down hamburger. Because it’s never seen an upside-down hamburger before, it’s quite likely to claim that it’s not a hamburger, despite the fact that intuitively we know that orientation does not affect an image’s hamburgerness. Likewise it may fail to recognize an image that is smaller or larger than those in its training set, or where lighting is unusual. These kinds of restrictions on the function representation are called invariants. Invariants cannot be directly expressed as input-output examples; they are larger restrictions on the scope of functions under consideration.

The most obvious way to deal with invariants is to expand your training set – turn all your training images upside-down, and add them to your training set. But when we begin to consider more and more combinations of invariants, this approach can rapidly grow infeasible. Not only does the training set become large, but if there are not enough inputs in the original training set to teach the invariant, then it will not be properly learned.

The key observation of Abu-Mostafa’s work is that we can don’t need to rely entirely on training examples. Whereas training examples specify the constraint that f (x) = y, where f is the function we’re learning and x and y are the input/output, for invariants it’s more useful to deal with equality examples, which are pairs where f (x1) = f (x2). The backpropagation algorithm can be easily modified to accomodate this: instead of computing the error based on the distance between the actual output and desired output, we compute it as the distance between the two outputs produced by the two examples. There’s no requirement to know what the value of f (x1) or f (x2) is. Using this advantage in our example, we can take any image, even if we don’t know whether or not it looks like a hamburger, rotate it, and use the two images to train our network. We could even generate random images and rotate these.

Another way of framing this is that we want to encourage the network to learn new features that describe the input but in some way summarize or interpret the original input features. These new features can then be leveraged by the network to guide the final output. To take an example from the book Pattern Classification (Richard Duda, Peter Hart, David Stork, section 6.8.12): if the input is a soundwave, and we want to determine what speech phoneme it represents,  a useful intermediate feature would be deciding whether it’s a vowel or a consonant. To encourage the network to learn this feature, we can add a new output node that is set to 1 or -1 depending on whether the input soundwave is a vowel or consonant. By incorporating an initial training phase that modifies the network weights to predict this output well, we now have a starting network that can already distinguish vowels and consonants, which is a big help in making finer subclassification. Without the hints, the network may have learned this on its own, but the more domain-specific information we can give it, the quicker it will train and the better it will generalize. In the case of our original example, these intermediate features would be properties of the original image that are insensitive to the invariants like rotation.

For learning with hints, neural networks with more than three layers of nodes are often helpful as well. The idea is that the first layer can convert the input features to the new intermediate features of interest, and then normal learning can be applied to these. Where the invariants are very simple, they can even be applied to the inputs before training (and prediction), placing them in canonical form. For example, to help deal with brightness variation in images, it helps to scale all images to the same brightness range.

My apologies for the long delay in this post – I’m currently engaged in reading the book Pattern Classification, and intend to follow up here with a discussion of it when I’m done.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Artificial intelligence | 2 Comments »

Bagging Predictors

Posted by dcoetzee on May 7, 2009

Citation: Leo Breiman. Bagging predictors. Machine Learning, 24, 2 (Aug. 1996), 123-140.

Abstract: Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.

Discussion: This 1996 machine learning paper described a very simple mechanism called bagging (short for “bootstrap aggregating“) for improving the accuracy of predictors by averaging the results of a number of independent predictors trained over samples from the same training set.

Suppose we’re creating a decision tree to predict, based on a number of medical tests, whether or not a patient has cancer. A typical way to do this is to gather some cases from a medical database, then use them to train a predictor model, such as a decision tree or neural network.

The issue with this straightforward approach is that many predictors used in machine learning, including decision trees and neural networks, are unstable: training them on a slightly different data set can produce significantly different results. For example, if we split our training set in half, and use each half to build a decision tree, those two decision trees are likely to be quite different and yield different predictions. This is a result of overfitting, the phenomenon where a model fits itself too well to the training data, and not well enough to the complete population of past and future cases underlying the sample.

To overcome this, we introduce ideas from a statistical technique called bootstrapping. To motivate this concept, suppose you have a random sample of 1000 test scores from a pool of 1 million, and you want to estimate what the median test score is. Because we can’t know this exactly, we instead seek a confidence interval that has a 98% chance of containing the median. Finding a confidence interval for the mean test score is easy using standard techniques, because the sample mean and population mean are analytically closely related, but the same is not true of the median.

When analysis fails, we take a more computational approach. The idea is to use the sample itself as an approximate model of the underlying distribution: we can replicate each score 1000 times to obtain a pool of 1 million scores that will statistically resemble the real 1 million scores. We then take (say) 200 independent samples of size 1000 from this pool, and compute their medians. Finally, we take the 1% and the 99% percentiles of these median scores to get our confidence interval. In practice, rather than duplicating scores many times, it’s more typical to simply sample with replacement, so that each score can be chosen multiple times. Provided the original sample is a good empirical approximation of the complete population, the result is a very good prediction of the actual median.

What does this have to with machine learning? Well, suppose instead of test scores we have medical test results, and instead of the median, we’re estimating whether they have cancer. We draw (with replacement) bootstrapping samples from the original data, and for each one we use it as training data to construct a predictor. Then, for new cases we take a majority vote among the predictions to determine the overall prediction. Provided that the data is a good empirical approximation of the complete population of past and future cases, the result will be a good prediction that resists overfitting.

This technique is extremely general, and can be applied to virtually any type of classification problem and any model. It can even use completely different models for different bootstrap samples, which can make it easy to sidestep difficult questions about choosing good models and model parameters. It has three main disadvantages:

  • Performance: For each of the bootstrap samples, we have to construct a separate model from scratch, and each query must bear the overhead of querying each of these models and combining their results. Fortunately, bagging is easily parallelizable – because of this, and because it can extract good accuracy from fast-to-query models like decision trees, it can actually be more efficient in some cases.
  • Relies on instability: If the model being used is stable, or in other words does not change significantly between bootstrap samples, the resulting system can actually be less accurate than a model built directly on the original data. An example of a stable model is a nearest-neighbor classifier. To vividly illustrate this point, the paper describes an experiment with a linear regression model where the stability of the model depends on the number of variables used. Sure enough, there is a crossover point in accuracy between the bagged and the unbagged versions of this predictor.
  • Interpretability: One of the widely cited advantages of models like the decision tree is that it’s easy for a human to understand how it makes it decisions – like a flowchart, it just performs a series of tests. But when you have twenty decision trees, each making different decisions on the same inputs, it’s harder to get an intuitive feel for their aggregate decision making – it’s like predicting the decision that a committee will make by learning about its members.

Bagging is just one of a family of resampling ensemble methods; another popular one is called boosting, a technique that iteratively combines multiple predictors by encouraging each new model to focus on cases misclassified by previous models. Boosting can learn more quickly than bagging because of its focus on misclassified cases, but is also less robust to errors in training data. A 2005 paper by Kotsiantis and Pintelas described a scheme for combining the two (“Combining Bagging and Boosting”, International Journal of Computational Intelligence, PDF).

Bagging has already played a strong role in the design of machine learning systems. Now that we’ve hit the clock speed barrier and are seeing more highly parallel and multicore systems, it’s bound to become an even more attractive option. Keep it in mind if you ever need to design a simple but accurate machine learning system.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Artificial intelligence | Tagged: , | 1 Comment »

Random Oracles are Practical: A Paradigm for Designing Efficient Protocols

Posted by dcoetzee on May 3, 2009

Citation: Bellare, M. and Rogaway, P. Random oracles are practical: a paradigm for designing efficient protocols. In Proceedings of the 1st ACM Conference on Computer and Communications Security (Fairfax, Virginia, United States, November 3-5, 1993). CCS 1993. ACM, New York, NY, 62-73. (PDF)

Abstract: We argue that the random oracle model — where all parties have access to a public random oracle — provides a bridge between cryptographic theory and cryptographic practice. In the paradigm we suggest, a practical protocol P is produced by first devising and proving correct a protocol PR for the random oracle model, and then replacing oracle accesses by the computation of an “appropriately chosen” function h. This paradigm yields protocols much more efficient than standard ones while retaining many of the advantages of provable security. We illustrate these gains for problems including encryption, signatures, and zero-knowledge proofs.

Discussion: This influential 1993 paper introduced the random oracle model, designed to enable a methodology for the design of efficient formally sound cryptographic protocols such as encryption and authentication, and is one of the milestones in the philosophy of cryptographic protocol design.

The holy grail of cryptography is cryptographic protocols – algorithms for tasks like encryption and signing – that are provably secure against attack. However, in the study of algorithms, negative results such as “no efficient algorithm exists to break this scheme” are notoriously difficult to prove; this is one reason why major problems like whether P=NP defy solution.

Like complexity theorists, cryptographers turned to conditional results: they assume that some cryptographic primitive is available – a very simple building block such as one-way functions, or one-way trapdoor functions – and then build more complex primitives, including complete protocols, using them. In “The Random Oracle Metholodgy, Revisited,” discussed later, Ran Canetti defended this practice:

One of the great contributions of complexity-based modern cryptography, developed in the past quarter of a century, is the ability to base the security of many varied protocols on a small number of well-defined and well-studied complexity assumptions. Furthermore, typically the proof of security of a protocol provides us with a method for transforming [an] adversary that breaks the security of said protocol into an adversary that refutes one of the well-studied assumptions. The Random Oracle Methodology does away with these advantages.

One may draw an analogy with the P=NP problem: why do we suppose that NP-hard problems are difficult to solve? Because if we could solve any one of them in polynomial time, it would resolve the longstanding P=NP problem and provide fast algorithms for a variety of well-studied problems. Likewise, if we could show that a protocol relying on a particular cryptographic primitive is insecure, it would solve a longstanding open problem (e.g. do one way functions exist?) and simultaneously break many other protocols relying on that primitive.

However, unlike the case of P=NP, systems proven conditionally secure by reduction to cryptographic primitives are rarely implemented. Indeed, until the publication of “Random Oracles are Practical,” most cryptographic systems used in practice had no formal justification whatsoever. This is because conditionally secure formal systems are much too inefficient for practical use. Yet there were ad hoc efficient cryptographic protocols in use in practice that clearly were good at withstanding concerted attack – without any formal backing, what was it that gave them their strength? Bellare and Rogaway write:

Theorists view certain primitives (e.g. one-way functions) as “basic” and build more powerful primitivees (e.g. pseudorandom functions) out of them in inefficient ways; but in practice, powerful primitives are readily available and the so-called basic ones seem to be no easier to implement. In fact theorists deny themselves the capabilities of practical primitives which satisfy not only the strongest kinds of assumptions they like to make, but even have strengths which have not been defined or formalized.

To facilitate the design of more efficient protocols,  the authors advance the following methodology: suppose that all parties in a cryptographic protocol, including the attacker, have access to a random oracle. A random oracle is like an ideal pseudorandom number generator: you seed it with a particular initial value, and then it gives you an arbitrarily long sequence of random bits. The sequence depends on the seed, but has no other predictable patterns (and in particular, does not cycle).

In real life, random oracles cannot be implemented; deterministic programs cannot produce arbitrarily-long random outputs. But this paper asserts the following thesis: if a protocol can be proven secure in the presence of a random oracle, and we then replace calls to the random oracle with calls to a good cryptographically secure pseudorandom number generator, a process called instantiation, the resulting protocol is expected to be secure in practice. Moreover, schemes designed using this method are expected to be more efficient than conditionally secure schemes.

It’s not hard to see that this methodology is not perfectly sound in theory. For example, one can design a cryptographic protocol that passes zero to the oracle, and if the result is f(0), it reveals its secret key. In the random oracle model, this is very unlikely to occur, so the system remains secure; but if the system is instantiated with f, then it always occurs, and the system is completely insecure. Part of the thesis is that “an appropriate instantiation for a random oracle ought to work for any protocol which did not intentionally frustrate our method by anticipating the exact mechanism which would instantiate its oracles” (section 6, Instantiation).

To provide some practical justification for its thesis, “Random Oracles are Practical” presents a very simple, efficient encryption algorithm for long messages that is secure against a variety of attacks in the random oracle model. We begin by assuming (as in the conditionally secure model) that we have access to two primitives:

  • a trapdoor permutation f, which is a function that, like RSA, can encrypt or decrypt a short k-bit value;
  • a random hash function H that inputs arbitrarily long strings and outputs k-bit results.

Now suppose we want to encrypt the message x. We choose a k-bit random value s which will be used as the seed for the random oracle. We XOR the output of the oracle with the message x to generate the encrypted message. Finally, we append the encryption f(s) of the seed with the trapdoor permutation, so that the receiver can determine s, and the hash H(sx) of the seed together with the message to protect against tampering. This scheme is secure against three types of attacks in the random oracle model:

  • In a chosen-plaintext attack, the polynomial-time adversary supplies two strings, one of them is encrypted, and the adversary must determine by examining the ciphertext which one was encrypted. They don’t have to do this perfectly, just significantly more than 50% of the time.
  • In a chosen-ciphertext attack, the adversary is additionally given access to the decryption function, and they can ask it to decrypt any string except the encrypted string they’re given.
  • Finally, a malleability adversary is one that, given the encryption of one string, can determine the encryption of a related string, such as an extension of the original string or one where some characters are replaced by other characters.

Sure enough, when the random oracle is replaced by a cryptographically secure pseudorandom number generator, the result is a system that is secure in practice, in the sense that no practical attack against the system has ever been found. The authors go on to describe similar efficient systems for signing and non-interactive zero-knowledge proofs that are sound in the random oracle model.

Over the years, many researchers took the advice of this paper and designed new efficient protocols based on the random oracle model. Encouragingly, these systems withstood the test of the time and have not been broken. This seems to say that the methodology is sound. But in 2002, a strong result by Canetti, Goldreich, and Halevi cast doubt upon the thesis (“The Random Oracle Methodology, Revisited.” J. ACM 51, 4 (Jul. 2004), 557-594. PDF). Their main result is as follows:

There exist signature and encryption schemes that are secure in the Random Oracle Model, but for which any implementation of the random oracle results in insecure schemes. [...] Moreover, each of these schemes has a “generic adversary”, that when given as input the description of an implementation of the oracle, breaks the scheme that uses this implementation.

In other words, not only can these encryption schemes be broken for any possible instantiation, but they supply a constructive method to do so fully automatically.

The proof method is similar to the example cited earlier of an encryption protocol revealing secret information if the oracle returns f(0) when given zero for the seed, where f is the cryptographically secure pseudorandom number generator used to instantiate the protocol. The difference is that instead of modifying the protocol to always fail for one particular f, it causes it to fail on a different f for each input. The input that causes failure under instantiation with f is simply the description of f (i.e., the machine code for f). Because the attacker has access to this information, they know just how to break the system. Fundamentally, this attack relies on a sort of “white box” reverse engineering – not only can the attacker execute the function f, but they can examine its code as well.

This does not say that the practical schemes designed to be secure in the random oracle model are insecure — to the contrary, they have stood the test of time quite robustly — but it does call the theoretical advantages of the paradigm into doubt. The authors of “The Random Oracle Methodology, Revisited” disagree sharply about the philosophical implications of this result for protocol design. Ran concludes that “the Random Oracle model is a bad abstraction of protocols for the purpose of analyzing security,” and strongly prefers the older approach of reductions to hard problems. Oded sees the random oracle model as an effective “sanity check” for rapidly ruling out insecure designs. Shai speculates on two possible reasons why existing protocols based on the random oracle model have defied attack:

  1. The systems are more difficult to break because a proof of security in the random oracle model rules out a large class of attacks – those that work in the random oracle model. Methods of attack that overcome this will take longer to develop.
  2. Existing protocols have some as-yet-unidentified feature that makes them more resilient than the contrived protocols described in this work.

Shai’s view of the utility of the model is more optimistic: he sees it as an “engineering tool” that is better than having no formal proof of security at all, and makes finding attacks more difficult.

What’s the future of the random oracle model? Perhaps we will identify some class of protocols for which instantiation is a sound procedure, or at least for which the generic attacks in this paper are inapplicable. Or perhaps we will adopt a new model that provides stronger guarantees. For now, formalizing and justifying the design of cryptographic protocols remains a difficult problem and a contentious philosophical debate.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Cryptography | Leave a Comment »

Efficient Online Index Construction for Text Databases

Posted by dcoetzee on April 27, 2009

Citation: Lester, N., Moffat, A., and Zobel, J. Efficient online index construction for text databases. ACM Trans. Database Syst. 33, 3 (Aug. 2008), 1-33. (PDF)

Abstract: Inverted index structures are a core element of current text retrieval systems. They can be constructed quickly using offline approaches, in which one or more passes are made over a static set of input data, and, at the completion of the process, an index is available for querying. However, there are search environments in which even a small delay in timeliness cannot be tolerated, and the index must always be queryable and up to date. Here we describe and analyze a geometric partitioning mechanism for online index construction that provides a range of tradeoffs between costs, and can be adapted to different balances of insertion and querying operations. Detailed experimental results are provided that show the extent of these tradeoffs, and that these new methods can yield substantial savings in online indexing costs.

Discussion: This 2008 information retrieval paper discusses how to efficiently incrementally update a search index as new documents are added. This allows new documents to be indexed and available for search immediately. Total index construction time is an order of magnitude faster than older, more trivial incremental update mechanisms, at a moderate expense in query time, about 12% on average.

As a motivating application, consider desktop search: in this scenario, you have a set of documents on your hard disk, and you would like to rapidly search them for keywords at any time. To facilitate this, your operating system keeps an on-disk data structure called an inverted index that specifies, for each possible search keyword (or term), a list of all files that contain it. Using modern index compression techniques, such an index can be stored in about 25% the space of the documents that it is indexing.

Constructing an index for a fixed set of documents is a relatively simple procedure; the following procedure comes from Heinz and Zobel’s “Efficient single-pass index construction for text databases.” With each document we associate a document ID number. We keep an in-memory data structure associating each term with a sorted list of document IDs of documents containing that keyword. We go through documents and expand this data structure until we fill the indexer’s buffer, and then write it out to disk and start again with an empty data structure. At the very end, we go through the complete list of search terms and for each one we perform a disk-based multiway merge of all sorted lists associated with that term.

The organization on disk of a typical index is also simple: it’s typically stored in two parts, the lexicon (or vocabulary) and the lists. The lexicon simply maps each term to the location on disk of its associated list; this can be done for example with a standard data structure like a B-tree. The lexicon is usually the smaller part. The lists part stores the compressed lists of documents associated with each term, all packed tightly together.

This procedure is efficient, but isn’t very flexible: it only works for a set of fixed, unchanging documents. If we add, remove, or modify documents, we have to rebuild the index all over again. In current practice, the way this is usually dealt with is by doing search queries against the old index while the new index is being built. Once the new index is done, we switch over to it all at once. The problem with this is that it introduces a delay period during which queries will not detect recent changes. In a desktop search scenario, it can be very annoying if the search doesn’t turn up that document you were editing five minutes ago, or that e-mail you just got, because it hasn’t been indexed yet.

This is where online index construction comes in. Online index construction attempts to update an existing index to incorporate recent changes, so that the index is up-to-date at all times. All these schemes share a general framework: we keep an in-memory data structure of the most recent changes, and during each query we check both this data structure and the on-disk index. Once the in-memory index gets too big, it has to get merged into the on-disk index. Updating the lexicon is relatively simple if it is stored in a B-tree: we just add, update, or remove keys from the B-tree as necessary.

More problematic is updating the lists. There are a number of simple techniques for doing this, none of which are very efficient in general:

  • In the in-place update method, we find the list for each term being merged and insert the new document IDs into that list. If the list becomes too large and no longer fits in its place in the lists file, it has to get moved to a new location. Updating the index is slow because it requires random access; and the resulting index becomes fragmented, slowing down future queries and updates. Even if we do a good job of overprovisioning by leaving space for each list to grow into, it does not overcome the detrimental effects of this method.
  • Related to the in-place method is the method of Tomasic et al (“Incremental updates of inverted lists for text document retrieval“, 1994), which instead of moving lists when they become too large, simply allows lists to occupy multiple discontiguous segments that get chained together in a linked list. This improves update performance at the expense of query performance, which as the index grows will have to follow more and more links to read a complete list, each necessitating a random disk access.
  • The remerge update scheme avoids the random access and fragmentation issues of in-place update by simply reading through the entire index and writing out the updated index as it goes. The problem with this approach is that as the on-disk index becomes large compared to the in-memory index, it has to process more and more lists that are not being updated, which is wasteful.

The insight of this paper is this: the cost of merging two indexes is roughly proportional to the sum of their sizes; consequently, it takes less time to merge new updates into a smaller index than into a larger index. We can take advantage of this by creating a hierarchy of indexes on disk, with each one having a maximum capacity r times bigger than the previous one. We always merge the in-memory index into the smallest one. Whenever an index at one level fills up, it is merged into the index at the next level. Each merge is done using the remerge update technique described above, and queries are run against all indexes independently. The parameter r represents a critical tradeoff between index update time and query performance – in experiments, queries were 70% faster for r = 12 than for r = 2,  but total index construction time was 3.5 times faster for r = 2 than for r = 12. The authors call this technique geometric partitioning.

Is that it? Well, that’s the most important part. Additionally, relying on B-trees to efficiently manage the vocabulary wasn’t a safe bet, and they need to be structured with geometric partitioning as well, as described in section 5.4. Nevertheless, this is a strikingly simple technique that has the potential to greatly expand the scope of indexing applications.

As with most recent results, there also appears to be considerable room for improvement and exploration – for example, could a superior disk-based data structure achieve simultaneous good query performance and index construction time? Could these ideas be used to define a new general disk-based lookup technique superior to B-trees? Is there an analogous method that relies on the in-place update instead of the remerge update as a subroutine? How would a system like this function in a distributed environment? It’s interesting to me to note that many of the same problems faced in in-place update are also encountered in the context of in-memory dynamic arrays, where some sophisticated techniques allow constant-time update with less wasted space (e.g. Brodnik et al’s “Resizable Arrays in Optimal Time and Space”). It will be interesting to see in which direction the study of online index construction goes.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Information retrieval | 1 Comment »

Virtual Time

Posted by dcoetzee on April 23, 2009

Citation: David R. Jefferson. Virtual time. ACM Transactions on Programming Languages and Systems, 7, 3 (Jul. 1985), 404-425. (PDF)

Abstract: Virtual time is a new paradigm for organizing and synchronizing distributed systems which can be applied to such problems as distributed discrete event simulation and distributed database concurrency control. Virtual time provides a flexible abstraction of real time in much the same way that virtual memory provides an abstraction of real memory. It is implemented using the Time Warp mechanism, a synchronization protocol distinguished by its reliance on lookahead-rollback, and by its implementation of rollback via antimessages.

Discussion: This 1985 paper introduced virtual time, a concept that allows a distributed system to be organized around a linear global clock; rather than maintain a synchronized clock, it achieves efficiency by having each node maintain its own local virtual time and performing rollback when a node receives a message “in the past.” Although not widely adopted, it has served as an influential model of a general system with optimistic concurrency and rollback.

To motivate the concept, let’s for a moment consider a magical distributed system with the following properties:

  • The system is a set of nodes, each capable of doing any amount of local processing instantaneously.
  • Each node can choose at any time to send a message to any other node. Moreover, it can specify precisely when that message will be received – the only constraint is that it cannot be received in the past.

Such a system is capable of intuitively describing many different distributed systems. For example, you could have a distributed simulation (such as a physical simulation) in which each message tells a node to simulate a particular kind of event, and the times at which the messages arrive correspond to the times at which the events they simulate occur. Consequently all events are simulated in order. You could ensure that the transactions of a database system are committed in order by assigning a time to them indicating when they’re each supposed to atomically occur. Finally, most simply, you could create a system in which all messages are received in order, by merely setting the received times to be the same as the sending time. (These scenarios are based on section 5 from the paper.)

In real life, of course, we don’t have instantaneous processing, and we can’t control when messages will be received. The concept behind this paper is to get rid of “real time”, and replace it by a new sort of time called virtual time (by analogy with virtual memory) that we have more control over. We can control the rate at which virtual time advances, and in fact time it can advance at different rates at different nodes in the system: each node keeps a local clock. To control the time at which messages are received, we introduce a queue at each node that queues up all received messages and does not process them until their virtual time arrives.

Not having to agree on a global virtual time greatly decreases the communication cost, but the problem with this flexibility is that once the local clocks have gotten out of sync, some strange scenarios can arise: if node A is lagging behind node B, it may send a message to B that arrives “in the past” from B’s perspective. But B may have already taken action based on the fact that it didn’t receive that message, including sending new messages to other nodes.

There are a number of solutions to this problem. The most obvious one is to have each node wait until its clock is the furthest in the past of all nodes before proceeding. This corresponds to pessimistic concurrency, and is the equivalent of having a global lock in a threaded program, with all the same consequences of low concurrency and poor throughput. The one taken by Jefferson is a form of optimistic concurrency: each local clock simply jumps to the received time of the next waiting event, so that the node is never idle, maximizing throughput. If it ever receives an new message “in the past,” it does a rollback to a point in time before that message was received and then proceeds forward again. To facilitate this it takes periodic snapshots of its state, tagged with virtual times. This is all transparent to the node’s program; from its perspective, virtual time never decreases. This is unlike, say, transactions, where a failed transaction needs to be detected and retried.

The big remaining problem is that during rollback there are certain things that can’t be undone locally, most importantly the sending of messages to other nodes. There are, again, multiple ways of dealing with this, but the one used by Jefferson is that whenever a message send is rolled back, an “antimessage” indicating this is sent to the same node with the same received time. At the destination, if the corresponding message is still queued, they collide and eliminate one another. If the corresponding message has already been processed, the antimessage arrives in the past and causes that node in turn to rollback and not receive the original message. One requirement for implementing this is that all nodes keep a queue of messages already processed.

A nontrivial problem is to show that the system as a whole makes progress amidst all these cascading rollbacks. Another serious problem is the performance issue of how to prevent the nodes from running out of memory from storing unbounded queues of messages and state snapshots. To deal with these issues, Jefferson defines the concept of global virtual time (GVT), which is the earliest virtual time at which anything in the system is currently happening. Because no node can send a message into its own past, no message can be received before the GVT, and no node will ever have to rollback to a time before the GVT. Consequently, any state snapshot or already-processed message marked with a virtual time prior to GVT can be discarded, in a process Jefferson calls “fossil collection” (this is oddly prophetic, as the term “garbage collection” was not coined until 1994 by Kaushik Ghosh). Additionally, it can be shown that, even though individual local clocks may roll back, GVT never decreases. As long as no message is lost and no node blocks indefinitely, it will eventually increase, which is enough to guarantee that the system makes progress. Eventually, all message queues will be empty, and the system will terminate.

The main downside to a virtual time system is that it fundamentally relies on the assumption of what the author calls “temporal locality” – that messages won’t arrive in the past too often. Less obviously, it also depends on the efficiency of the implementation of state snapshotting, message queuing, and determining the global virtual time. Because virtual time is such a general framework, applying to distributed systems from multiprocessors to databases to LANs, this empirical analysis depends very much on the specific implementation. Evaluations of a highly-optimized practical implementation on multiprocessors were done in a series of papers by Richard M. Fujimoto (“Time warp on a shared memory multiprocessor” (1989), “The effect of memory capacity on Time Warp performance” (1993), “An adaptive memory management protocol for Time Warp parallel simulation” (1994), “An Empirical Evaluation of Performance-Memory Trade-Offs in Time Warp” (1997)). I’m not aware of any thorough evaluations in other scenarios.

The concept of virtual time was later generalized by Friedemann Mattern in his “Virtual Time and Global States of Distributed Systems,” which instead of using a totally ordered linear virtual time relies on a partially ordered virtual time (Parallel and Distributed Algorithms, 1989, 215-226, PDF). The motivation is to eliminate unnecessary ordering constraints imposed by the total order, at the cost of some conceptual complexity. Jefferson briefly hinted at this possibility (“[v]irtual times may be only partially ordered”).

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Distributed systems | 1 Comment »

Content-Based Image Retrieval at the End of the Early Years

Posted by dcoetzee on April 21, 2009

Citation: Smeulders, A.W.M, Worring, M, Santini, S, Gupta, A, Jain, R. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 22, no 12, pp.1349-1380. Dec 2000. (PDF)

Abstract: The paper presents a review of 200 references in content-based image retrieval. The paper starts with discussing the working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap. Subsequent sections discuss computational steps for image retrieval systems. Step one of the review is image processing for retrieval sorted by color, texture, and local geometry. Features for retrieval are discussed next, sorted by: accumulative and global features, salient points, object and shape features, signs, and structural combinations thereof. Similarity of pictures and objects in pictures is reviewed for each of the feature types, in close connection to the types and means of feedback the user of the systems is capable of giving by interaction. We briefly discuss aspects of system engineering: databases, system architecture, and evaluation. In the concluding section, we present our view on: the driving force of the field, the heritage from computer vision, the influence on computer vision, the role of similarity and of interaction, the need for databases, the problem of evaluation, and the role of the semantic gap.

Discussion: This extensive survey reviews and classifies content-based image retrieval systems from 200 early publications and discusses fundamental problems in the area.

If you go to Google Image Search today and type in search keywords, some searches work very well – searches for which the images are clearly labelled in the source webpage with those keywords, such as “Homer Simpson.” But when the person who published the content didn’t think to mark it with the specific term you’re looking for – such as “man looking at a girl” or “paintings of people wearing capes” – it falls apart rapidly. Stock photography sites like Getty Images try to compensate for this by paying people to tag images with every keyword they can possibly think of – needless to say, this is an expensive proposition, and one that still fails to satisfy many useful queries.

We can approach something close to an ideal image search using people: we give them a list of images and explain our query, and they exhaustively examine every image to determine if it satisfies the query. A person does not need to rely on surrounding text; they can look at the image itself, identify objects and their relationships, and incorporate cultural context and application-specific background to resolve ambiguities and identify actions and states. These are the sorts of tasks people are particularly good at.

Content-based image retrieval (CBIR) attempts to respond to image queries as a human would, subject to limitations of feasibility and performance. It relies on image content, using techniques from computer vision to interpret and understand it, while using techniques from information retrieval and databases to rapidly locate images suiting a specific query. Since its inception it has exploded into a field in its own right, with hundreds to thousands of papers each making its own tradeoff between the many variables involved.

A natural-language query to an image search engine rapidly runs into issues of intractability due to our very limited progress on the problem of natural language processing and the need for vast common-sense knowledge about the world to process many queries. For example, consider the rules you might use to determine if an image depicts an “indoor scene” or a “frightened person.” The authors of this paper label this problem the semantic gap. Rather than tackle these problems, most CBIR systems rely on alternate user interfaces, such as search by assocation, search by example, and search by sketching, all of which involve using images to search for other related images. Often this search process is iterative: at each stage, the user clicks an image “more like” their target image, refining the set of candidates. You can see a demonstration of a preliminary version of such an interface at the new Google Labs Similar Images.

There are a number of important variables which strongly influence the design of a CBIR system. These include:

  • The scope of the domain: How are the images constrained? How are the queries constrained? If the images can be taken under controlled conditions, and queries are easy to predict, the problem often admits a highly accurate domain-specific solution. At the other end of the spectrum is general image search of all online images.
  • The type of search activity: Will the user typically be looking for a specific image? Or just browsing? Or do they want to classify or categorize an image collection?
  • What is the user interface?
  • Performance constraints: how quickly must queries be answered?

Since the semantic gap is considered insurmountable, the most fundamental limitation in practical CBIR systems is what the authors of this paper call the sensory gap: “the gap between the object in the world and the information in a (computational) description derived from a recording of that scene.” In other words, it’s figuring out what you’re looking at based only on an image. In narrow domains with images taken under fixed conditions, this can be attainable even using simple schemes. In a general domain like web image search, this is still considered an insurmountable problem.

One of the main reasons the sensory gap is insurmountable in general domains is the issue of segmentation: separating the image into regions, each corresponding to a particular object. Doing this perfectly is called strong segmentation; this is infeasible or impossible in a general setting, particularly in the presence of occlusion (one object in front of another one). An alternative – weak segmentation – only identifies part of the region corresponding to an object, and hopes it is enough to identify the object. A third approach, accumulation, does not rely on segmentation at all: it computes a function across the entire image, which is designed to be insensitive to variations in the part of the image not corresponding to the object. Color indexing, which I described in a previous post, is an example of accumulation, where the color histogram over the image, corrected for lighting conditions, is used to identify the object it depicts.

Generally, object identification relies on a great deal of imprecise ad hoc knowledge about general image characteristics. For example, as a convention in most images the horizon is at about the same position – comparison of object locations to the horizon allows their actual size to be estimated. Much of the same physics of light, reflectance, geometry, and texture that comes into play in computer graphics is useful here in separating the object description from the environment description such as lighting and viewing angle. The rough shapes of objects can be identified with edge detection and invariant transforms to account for changes in viewing angle. Finally, cultural conventions such as the use of certain symbols in diagrams or the existence of perpendicular angles indoors are frequently exploited. In some cases, rather than writing rules for object recognition by hand, a large training set of images labelled with tags is used for statistical learning of image features (as with eigenfaces).

To facilitate efficient lookup, CIBR systems do not perform processing on each image for every query. Instead, they generally map an image to a feature vector, an array of values that succinctly describes the semantic objects in the image and their relationship, whether explicitly or implicitly. These are stored in a database. Lookup can then be done by comparing the feature vector of the query image to the feature vectors of the images in the database, a type of query the database supports efficiently.

A big challenge in CIBR systems has been evaluation: given two systems purporting to solve the same search problem, which one does a better job? In traditional search engines this problem is attacked with the notions of precision and recall, but these depend on a reliable notion of whether or not an image is relevant, and do not take into account the notion of iterative refinement of results. In 2000, it was typical to rely on derived measurements such as the time to complete tasks with access to the tool.

This discussion is but a summary of an already-dense survey, so if you’re interested in CBIR, I recommend consulting the survey for details. After more than a decade, there are increasing signs of wider deployment of CBIR systems both in limited domains and to the general public – don’t be surprised if you end up using one soon.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Artificial intelligence, Information theory and compression | 7 Comments »