Papers in Computer Science

Discussion of computer science publications

Archive for the ‘Artificial intelligence’ Category

Pattern Classification

Posted by dcoetzee on June 17, 2009

Citation: R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification, 2nd edition. New York: John Wiley & Sons, 654 pages, 2001, ISBN: 0-471-05669-3.

Discussion: This classic machine learning textbook discusses a variety of well-studied application-independent methods of classifying inputs into one of several distinct classes, as well as some of the theory behind them. The book’s organization is constructed around the amount of information available to the classifier, beginning with scenarios where the actual distribution of features of classes is known, and finishing with scenarios where even the classes themselves must be inferred. Among the many techniques considered are decision trees, neural networks, genetic algorithms, Boltzmann learning, nearest neighbor classification, k-means clustering, principal and independent component analysis, formal grammars, linear discriminants, resampling methods like bagging and boosting, and maximum-likelihood estimation; most of these are treated in great detail.

Perhaps the best demonstration of the level of detail that the book goes into is its discussion of neural networks in chapter 6. Whereas any general book on artificial intelligence is bound to include basic discussion of neural networks and their dominant learning algorithm, backpropagation, the discussion here is both denser and covers a broad range of extensions and relations to other methods. First of all, it contextualizes neural networks and places them in their historical context by preceding the chapter with a thorough chapter on linear discriminants, a powerful tool in their own right that are basically equivalent to neural networks with no hidden layers. Unlike more generalized references which tend to merely describe the most common implementation of neural networks without any justification, every design choice going into the neural network is derived and justified and alternative choices are discussed, from the update rules to the activation function to the number of layers and hidden units to the initial weights to the training protocol to the stopping condition. Perhaps most important from an engineer’s point of view is section 6.8, which discusses a variety of extensions needed to make neural networks efficient and accurate in practice, such as input standardization, momentum, weight decay, hints, use of second-order information, and so on. Each of these occupies no more than a page, but still effectively conveys the essence of the technique. Besides these essentials, the chapter also informally proves the power of three-level networks, and incorporates a variety of interesting visualizations of network structure, network weights, error surfaces, convergence, and so on, which helps a lot in developing an intuition for the behavior of neural networks. In short, this chapter contains about as much information on neural networks as I would normally expect to get from an entire book on the subject.

The book isn’t just a bunch of survey papers stapled together though; it has a number of things that tie it together. Most important is its organization of progressively less well-specified problems. Chapter 2 begins with Bayesian decision theory and describes how to perform classification optimally, given the complete probability distribution of each feature for each class. This also brings in the first major theoretical result, the lower bound on classification error given by the Bayes error. From there it progresses to maximum-likelihood estimation, in which the distribution type (e.g. Gaussian) is known but not its parameters (such as mean and standard deviation), and these parameters must be determined from manually-classified example data. After this comes simple nonparametric techniques like nearest neighbor estimation and linear discriminants that make no assumptions about the distribution but only work well when the classification function is simple and the dimension isn’t too high. Neural networks can describe more complex functions, but still get tripped up by complex classification functions with many local minima and maxima. The next chapter addresses stochastic methods like simulated annealing and genetic algorithms that cope well with complex classification functions, at the expense of much greater training time. Skipping over a couple chapters, the final one deals with unsupervised classification, where the example data have not been assigned to classes and the classes are not even known and must be inferred; a typical technique in this scenario is clustering. This organization is effective at emphasizing the point that a classifier needs to take advantage of as much information as the application domain makes available to be effective.

Chapters 8 and 9 don’t fall well into the natural progression. Most of the chapters deal with numeric features, like width or age, leaving discrete features such as color or gender, and the classifers that process them such as decision trees and formal grammars, to chapter 8. Chapter 9, entitled “algorithm-independent machine learning”, is based around theory and metalearning techniques that can be used to enhance all classifiers. It begins with the fundamental No Free Lunch theorem, which essentially states that no classifier performs better in all cases than any other classifier, including the naive classifer that merely guesses at random. It also discusses resampling methods that aim to improve unstable classifiers or combine different types of classifiers, and describes how to compare one classifier with another, which is essential when espousing the benefits of any new classifier.

The progressive organization, while useful, also necessarily implies that some essential basic topics, such as decision trees (chapter 8), resampling methods (chapter 9), and k-means clustering (chapter 10), are not introduced until long after more specialized topics have been visited, such as Bayesian belief networks (chapter 2) and hidden Markov models (chapter 3). As a consequence it probably makes more sense to just read the first few sections of each chapter first, rather than reading the book straight through.

Another frustration with this book is that there are a number of forward references: in particular, cross-validation is mentioned several times before it’s finally explained in Chapter 9, and probabilistic neural networks were discussed in chapter 4 in the context of Parzen windows, with neural networks discussed in chapter 6. They also assume quite a bit of math background in areas that aren’t covered in-depth in most undergraduate programs, like matrix calculus and statistics, with only a terse appendix for a refresher on this material. It would have been helpful to demonstrate of these techniques in the one-dimensional case, where the familiar intuitions of real number calculus apply.

Well dense and thorough, there are necessarily some things the book doesn’t cover, in order to fit into 600 pages. Some theorems are not proved in the text, particularly the No Free Lunch Theorem, but also many of the convergence results and probability bounds. This is generally not a big deal since the book is intended more for those looking to apply machine learning to an application domain, rather than those looking to study theoretical machine learning and conceive new techniques – and its references are thorough enough that this information can be located if needed. On the other hand, it also doesn’t contain any real code or references to real machine learning libraries – it’s not a programmer’s handbook. Finally, the second edition in particular concentrates heavily on application-independent models, and does not delve into the intricacies of the immense number of applications of machine learning; problems like determining the best features to use and how to compute them are discussed but motivated only by toy examples and not real-world problems like speech recognition, image processing, or expert systems. It helps when reading it to have an application domain in mind to mentally plug into the models.

In short, if you’re looking for a dense and application-independent survey of major classification algorithms in machine learning, this is the book to get you started, and will take you a few steps beyond anything you might have studied in an AI intro class. But I would precede it with some reading on matrix calculus, and follow it with some reading on a specific application area of your choice, to help make the concepts more concrete.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Artificial intelligence | 2 Comments »

Learning from hints in neural networks

Posted by dcoetzee on May 24, 2009

Citation: Abu-Mostafa, Y. S. 1990. Learning from hints in neural networks. Journal of Complexity, 6, 2 (Jun. 1990), 192-198. (PDF)

Abstract: Learning from examples is the process of taking input-output examples of an unknown function f and infering an implementation of f. Learning from hints allows for general information about f to be used instead of just input-output examples. We introduce a method for incorporating any invariance hint about f in any descent method for learning from examples. We also show that learning in a neural network remains NP-complete with a certain, biologically plausible, hint about the network. We discuss the information value and the complexity value of hints.

Discussion: This short 1990 machine learning paper introduced a technique for learning with hints in a neural network; that is, it allows a neural network to learn a function for which we have some limited information about its properties.

In machine learning, neural networks are a simple way of representing functions that is sufficiently powerful to approximate any function. They consist of a set of at least three layers of processing nodes connected by edges labelled with weights. All the inputs into each node are multiplied by the weights on their edges, then a sigmoid function (a particular strictly increasing function bounded between -1 and 1) is applied to produce the output. By adjusting the weights, we can gradually modify the function being computed.

Neural networks are particularly useful in classification problems; for example, one might construct a neural network that takes as input an image and outputs 1 if it looks like a hamburger, or else -1. It would be really difficult to manually write a program that does this. Instead, neural networks can be trained with a set of inputs and their associated outputs; the weights are adjusted based on the examples using the backpropagation algorithm until it computes a function close to the actual one.

For our purposes, the most important thing to note is that the backpropagation works by feeding the training input to the network and determining how far the output is from the desired training output, called the error. It then adjusts the weights in a way that decreases that error. It does this repeatedly until it reaches a stopping point.

Backpropagation is efficient and general but can run into two important related problems:

  • Insufficient data: There may not be enough training data to learn weights that generalize well to new inputs.
  • Overfitting: The resulting function may end up being oversensitive to parameters of the training examples that are actually irrelevant.

For example, say you train a hamburger recognizer on many images off the Internet, and then I give it a picture of an upside-down hamburger. Because it’s never seen an upside-down hamburger before, it’s quite likely to claim that it’s not a hamburger, despite the fact that intuitively we know that orientation does not affect an image’s hamburgerness. Likewise it may fail to recognize an image that is smaller or larger than those in its training set, or where lighting is unusual. These kinds of restrictions on the function representation are called invariants. Invariants cannot be directly expressed as input-output examples; they are larger restrictions on the scope of functions under consideration.

The most obvious way to deal with invariants is to expand your training set – turn all your training images upside-down, and add them to your training set. But when we begin to consider more and more combinations of invariants, this approach can rapidly grow infeasible. Not only does the training set become large, but if there are not enough inputs in the original training set to teach the invariant, then it will not be properly learned.

The key observation of Abu-Mostafa’s work is that we can don’t need to rely entirely on training examples. Whereas training examples specify the constraint that f (x) = y, where f is the function we’re learning and x and y are the input/output, for invariants it’s more useful to deal with equality examples, which are pairs where f (x1) = f (x2). The backpropagation algorithm can be easily modified to accomodate this: instead of computing the error based on the distance between the actual output and desired output, we compute it as the distance between the two outputs produced by the two examples. There’s no requirement to know what the value of f (x1) or f (x2) is. Using this advantage in our example, we can take any image, even if we don’t know whether or not it looks like a hamburger, rotate it, and use the two images to train our network. We could even generate random images and rotate these.

Another way of framing this is that we want to encourage the network to learn new features that describe the input but in some way summarize or interpret the original input features. These new features can then be leveraged by the network to guide the final output. To take an example from the book Pattern Classification (Richard Duda, Peter Hart, David Stork, section 6.8.12): if the input is a soundwave, and we want to determine what speech phoneme it represents,  a useful intermediate feature would be deciding whether it’s a vowel or a consonant. To encourage the network to learn this feature, we can add a new output node that is set to 1 or -1 depending on whether the input soundwave is a vowel or consonant. By incorporating an initial training phase that modifies the network weights to predict this output well, we now have a starting network that can already distinguish vowels and consonants, which is a big help in making finer subclassification. Without the hints, the network may have learned this on its own, but the more domain-specific information we can give it, the quicker it will train and the better it will generalize. In the case of our original example, these intermediate features would be properties of the original image that are insensitive to the invariants like rotation.

For learning with hints, neural networks with more than three layers of nodes are often helpful as well. The idea is that the first layer can convert the input features to the new intermediate features of interest, and then normal learning can be applied to these. Where the invariants are very simple, they can even be applied to the inputs before training (and prediction), placing them in canonical form. For example, to help deal with brightness variation in images, it helps to scale all images to the same brightness range.

My apologies for the long delay in this post – I’m currently engaged in reading the book Pattern Classification, and intend to follow up here with a discussion of it when I’m done.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Artificial intelligence | 2 Comments »

Bagging Predictors

Posted by dcoetzee on May 7, 2009

Citation: Leo Breiman. Bagging predictors. Machine Learning, 24, 2 (Aug. 1996), 123-140.

Abstract: Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.

Discussion: This 1996 machine learning paper described a very simple mechanism called bagging (short for “bootstrap aggregating“) for improving the accuracy of predictors by averaging the results of a number of independent predictors trained over samples from the same training set.

Suppose we’re creating a decision tree to predict, based on a number of medical tests, whether or not a patient has cancer. A typical way to do this is to gather some cases from a medical database, then use them to train a predictor model, such as a decision tree or neural network.

The issue with this straightforward approach is that many predictors used in machine learning, including decision trees and neural networks, are unstable: training them on a slightly different data set can produce significantly different results. For example, if we split our training set in half, and use each half to build a decision tree, those two decision trees are likely to be quite different and yield different predictions. This is a result of overfitting, the phenomenon where a model fits itself too well to the training data, and not well enough to the complete population of past and future cases underlying the sample.

To overcome this, we introduce ideas from a statistical technique called bootstrapping. To motivate this concept, suppose you have a random sample of 1000 test scores from a pool of 1 million, and you want to estimate what the median test score is. Because we can’t know this exactly, we instead seek a confidence interval that has a 98% chance of containing the median. Finding a confidence interval for the mean test score is easy using standard techniques, because the sample mean and population mean are analytically closely related, but the same is not true of the median.

When analysis fails, we take a more computational approach. The idea is to use the sample itself as an approximate model of the underlying distribution: we can replicate each score 1000 times to obtain a pool of 1 million scores that will statistically resemble the real 1 million scores. We then take (say) 200 independent samples of size 1000 from this pool, and compute their medians. Finally, we take the 1% and the 99% percentiles of these median scores to get our confidence interval. In practice, rather than duplicating scores many times, it’s more typical to simply sample with replacement, so that each score can be chosen multiple times. Provided the original sample is a good empirical approximation of the complete population, the result is a very good prediction of the actual median.

What does this have to with machine learning? Well, suppose instead of test scores we have medical test results, and instead of the median, we’re estimating whether they have cancer. We draw (with replacement) bootstrapping samples from the original data, and for each one we use it as training data to construct a predictor. Then, for new cases we take a majority vote among the predictions to determine the overall prediction. Provided that the data is a good empirical approximation of the complete population of past and future cases, the result will be a good prediction that resists overfitting.

This technique is extremely general, and can be applied to virtually any type of classification problem and any model. It can even use completely different models for different bootstrap samples, which can make it easy to sidestep difficult questions about choosing good models and model parameters. It has three main disadvantages:

  • Performance: For each of the bootstrap samples, we have to construct a separate model from scratch, and each query must bear the overhead of querying each of these models and combining their results. Fortunately, bagging is easily parallelizable – because of this, and because it can extract good accuracy from fast-to-query models like decision trees, it can actually be more efficient in some cases.
  • Relies on instability: If the model being used is stable, or in other words does not change significantly between bootstrap samples, the resulting system can actually be less accurate than a model built directly on the original data. An example of a stable model is a nearest-neighbor classifier. To vividly illustrate this point, the paper describes an experiment with a linear regression model where the stability of the model depends on the number of variables used. Sure enough, there is a crossover point in accuracy between the bagged and the unbagged versions of this predictor.
  • Interpretability: One of the widely cited advantages of models like the decision tree is that it’s easy for a human to understand how it makes it decisions – like a flowchart, it just performs a series of tests. But when you have twenty decision trees, each making different decisions on the same inputs, it’s harder to get an intuitive feel for their aggregate decision making – it’s like predicting the decision that a committee will make by learning about its members.

Bagging is just one of a family of resampling ensemble methods; another popular one is called boosting, a technique that iteratively combines multiple predictors by encouraging each new model to focus on cases misclassified by previous models. Boosting can learn more quickly than bagging because of its focus on misclassified cases, but is also less robust to errors in training data. A 2005 paper by Kotsiantis and Pintelas described a scheme for combining the two (“Combining Bagging and Boosting”, International Journal of Computational Intelligence, PDF).

Bagging has already played a strong role in the design of machine learning systems. Now that we’ve hit the clock speed barrier and are seeing more highly parallel and multicore systems, it’s bound to become an even more attractive option. Keep it in mind if you ever need to design a simple but accurate machine learning system.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Artificial intelligence | Tagged: , | 1 Comment »

Content-Based Image Retrieval at the End of the Early Years

Posted by dcoetzee on April 21, 2009

Citation: Smeulders, A.W.M, Worring, M, Santini, S, Gupta, A, Jain, R. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 22, no 12, pp.1349-1380. Dec 2000. (PDF)

Abstract: The paper presents a review of 200 references in content-based image retrieval. The paper starts with discussing the working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap. Subsequent sections discuss computational steps for image retrieval systems. Step one of the review is image processing for retrieval sorted by color, texture, and local geometry. Features for retrieval are discussed next, sorted by: accumulative and global features, salient points, object and shape features, signs, and structural combinations thereof. Similarity of pictures and objects in pictures is reviewed for each of the feature types, in close connection to the types and means of feedback the user of the systems is capable of giving by interaction. We briefly discuss aspects of system engineering: databases, system architecture, and evaluation. In the concluding section, we present our view on: the driving force of the field, the heritage from computer vision, the influence on computer vision, the role of similarity and of interaction, the need for databases, the problem of evaluation, and the role of the semantic gap.

Discussion: This extensive survey reviews and classifies content-based image retrieval systems from 200 early publications and discusses fundamental problems in the area.

If you go to Google Image Search today and type in search keywords, some searches work very well – searches for which the images are clearly labelled in the source webpage with those keywords, such as “Homer Simpson.” But when the person who published the content didn’t think to mark it with the specific term you’re looking for – such as “man looking at a girl” or “paintings of people wearing capes” – it falls apart rapidly. Stock photography sites like Getty Images try to compensate for this by paying people to tag images with every keyword they can possibly think of – needless to say, this is an expensive proposition, and one that still fails to satisfy many useful queries.

We can approach something close to an ideal image search using people: we give them a list of images and explain our query, and they exhaustively examine every image to determine if it satisfies the query. A person does not need to rely on surrounding text; they can look at the image itself, identify objects and their relationships, and incorporate cultural context and application-specific background to resolve ambiguities and identify actions and states. These are the sorts of tasks people are particularly good at.

Content-based image retrieval (CBIR) attempts to respond to image queries as a human would, subject to limitations of feasibility and performance. It relies on image content, using techniques from computer vision to interpret and understand it, while using techniques from information retrieval and databases to rapidly locate images suiting a specific query. Since its inception it has exploded into a field in its own right, with hundreds to thousands of papers each making its own tradeoff between the many variables involved.

A natural-language query to an image search engine rapidly runs into issues of intractability due to our very limited progress on the problem of natural language processing and the need for vast common-sense knowledge about the world to process many queries. For example, consider the rules you might use to determine if an image depicts an “indoor scene” or a “frightened person.” The authors of this paper label this problem the semantic gap. Rather than tackle these problems, most CBIR systems rely on alternate user interfaces, such as search by assocation, search by example, and search by sketching, all of which involve using images to search for other related images. Often this search process is iterative: at each stage, the user clicks an image “more like” their target image, refining the set of candidates. You can see a demonstration of a preliminary version of such an interface at the new Google Labs Similar Images.

There are a number of important variables which strongly influence the design of a CBIR system. These include:

  • The scope of the domain: How are the images constrained? How are the queries constrained? If the images can be taken under controlled conditions, and queries are easy to predict, the problem often admits a highly accurate domain-specific solution. At the other end of the spectrum is general image search of all online images.
  • The type of search activity: Will the user typically be looking for a specific image? Or just browsing? Or do they want to classify or categorize an image collection?
  • What is the user interface?
  • Performance constraints: how quickly must queries be answered?

Since the semantic gap is considered insurmountable, the most fundamental limitation in practical CBIR systems is what the authors of this paper call the sensory gap: “the gap between the object in the world and the information in a (computational) description derived from a recording of that scene.” In other words, it’s figuring out what you’re looking at based only on an image. In narrow domains with images taken under fixed conditions, this can be attainable even using simple schemes. In a general domain like web image search, this is still considered an insurmountable problem.

One of the main reasons the sensory gap is insurmountable in general domains is the issue of segmentation: separating the image into regions, each corresponding to a particular object. Doing this perfectly is called strong segmentation; this is infeasible or impossible in a general setting, particularly in the presence of occlusion (one object in front of another one). An alternative – weak segmentation – only identifies part of the region corresponding to an object, and hopes it is enough to identify the object. A third approach, accumulation, does not rely on segmentation at all: it computes a function across the entire image, which is designed to be insensitive to variations in the part of the image not corresponding to the object. Color indexing, which I described in a previous post, is an example of accumulation, where the color histogram over the image, corrected for lighting conditions, is used to identify the object it depicts.

Generally, object identification relies on a great deal of imprecise ad hoc knowledge about general image characteristics. For example, as a convention in most images the horizon is at about the same position – comparison of object locations to the horizon allows their actual size to be estimated. Much of the same physics of light, reflectance, geometry, and texture that comes into play in computer graphics is useful here in separating the object description from the environment description such as lighting and viewing angle. The rough shapes of objects can be identified with edge detection and invariant transforms to account for changes in viewing angle. Finally, cultural conventions such as the use of certain symbols in diagrams or the existence of perpendicular angles indoors are frequently exploited. In some cases, rather than writing rules for object recognition by hand, a large training set of images labelled with tags is used for statistical learning of image features (as with eigenfaces).

To facilitate efficient lookup, CIBR systems do not perform processing on each image for every query. Instead, they generally map an image to a feature vector, an array of values that succinctly describes the semantic objects in the image and their relationship, whether explicitly or implicitly. These are stored in a database. Lookup can then be done by comparing the feature vector of the query image to the feature vectors of the images in the database, a type of query the database supports efficiently.

A big challenge in CIBR systems has been evaluation: given two systems purporting to solve the same search problem, which one does a better job? In traditional search engines this problem is attacked with the notions of precision and recall, but these depend on a reliable notion of whether or not an image is relevant, and do not take into account the notion of iterative refinement of results. In 2000, it was typical to rely on derived measurements such as the time to complete tasks with access to the tool.

This discussion is but a summary of an already-dense survey, so if you’re interested in CBIR, I recommend consulting the survey for details. After more than a decade, there are increasing signs of wider deployment of CBIR systems both in limited domains and to the general public – don’t be surprised if you end up using one soon.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Artificial intelligence, Information theory and compression | 9 Comments »

Color indexing

Posted by dcoetzee on March 26, 2009

Citation: Swain, M. J. and Ballard, D. H. Color indexing. International Journal of Computer Vision 7, 1 (Nov. 1991), 11-32. (PDF)

Abstract: Computer vision is embracing a new research focus in which the aim is to develop vision skills for robots that allow them to interact with a dynamic, realistic environment. To achieve this aim, new kinds of vision algorithms need to be developed which run in real time and subserve the robot’s goals. Two fundamental goals are [identifying an object at a known location and] determining the location of a known object. Color can be successfully used for both tasks.

This article demonstrates that color histograms of multicolored objects provide a robust, efficient cue for indexing into a large database of models. It shows that color histograms are stable object representations in the presence of occlusion and over change in view, and that they can differentiate among a large number of objects. For solving the identification problem, it introduces a technique called Histogram Intersection, which matches model and image histograms and a fast incremental version of Histogram Intersection, which allows real-time indexing into a large database of stored models. For solving the location problem it introduces an algorithm called Histogram Backprojection, which performs this task efficiently in crowded scenes.

Discussion: This 1991 computer vision paper introduced the concept of identifying and finding objects in images using their color histograms.

In 1991, as computer vision systems were moving away from offline processing of static photographs and into real-time use by mobile robots with inexpensive cameras attached, there was a pressing need for efficient algorithms for doing simple vision tasks like identifying, finding, and tracking objects. Until the publication of this paper, most of these techniques were based on the most obvious attribute, shape recognition, but this was both computationally expensive and fragile: the slightest rotation or occlusion of an object (stuff going in front of it) could radically alter its perceived shape. Color-based recognition is more robust: the colors of an object don’t change very much as it moves, rotates, or becomes occluded by other objects. It can work with extremely low-resolution images (in one experiment, the authors got acceptable performance on 8 × 5 pixel images!). On the other hand, it’s very sensitive to the color and intensity of lighting, and the image has to be normalized to account for this. It also, obviously, has difficulty distinguishing objects that have similar colors.

The details of the system are simple: divide the space of all colors up into fairly large “buckets” – typical would be 16 buckets along each color axis (e.g. red, green, blue). For example, the RGB color (170, 24, 255) might get bucketed in the bucket (10, 1, 15). Then, for each object you want to recognize or find, take a model photo of it and count the number of pixels falling into each bucket; this is called the color histogram. To compare two color histograms, they use the Histogram Intersection operator: this takes the minimum of the two counts (from each image) for each bucket, and then adds them up. This value is normalized by dividing it by the number of pixels in the source image. Images with very similar histograms will have intersection values close to 1, whereas ones with different histograms will have much smaller values. By comparing an image to be identified against each image in a database of models, each with a precomputed color histogram, this can rapidly locate the desired image.

Locating objects is very similar: each pixel is assigned a value based on its bucket and how common that bucket is in the model image of the object being sought. Then it looks for a region containing many large values. This is facilitated by using a convolution with a circle – effectively “blurring” the image and mixing nearby values – followed by location of the pixel with the largest value.

The approach of the paper is actually more general than it sounds: virtually any image feature that you can construct a histogram of, can be applied to the same tasks using the same approach. This includes features such as local geometry, local texture, rough estimated size, and so on. The sensitivity of the algorithm to any particular feature can be tuned by adjusting the number of bucket divisions along that dimension. For example, Niblack et al’s QBIC system (1993), still in use today by IBM in DB2, uses color, texture, and shape simultaneously. In 1999 a combination of features using this approach was applied effectively to content-based image retrieval in Tao and Grosky’s “Spatial Color Indexing: A Novel Approach for Content-Based Image Retrieval” (at Citeseer).

Unsurprisingly, the paper demonstrated that the performance of color indexing is severely degraded by changes in lighting; this is one reason that the work did not appear until 1991, after work on color correction and normalization had appeared that can be used in a preprocessing step to effectively cope with these issues. This original paper only examined differences in brightness, and performed a trivial normalization of brightness values to demonstrate its advantage. Dealing with illumination invariant color matching – particularly when the light changed in color – would be the subject of several later publications such as Matas et al’s “Color-Based Object Recognition under Spectrally Nonuniform Illumination” and “On Representation and Matching of Multi-Coloured Objects” (1995).

The color indexing paper shows its age when it comes to scale: concerned only with robotics applications, they considered 66 objects to be a “large database.” Today, object identification, classification, and location is studied primarily in the context of image retrieval and image filtering, where there are frequently millions of images to consider with thousands of objects and with no control over image conditions. The “incremental” histogram intersection optimization presented in this paper – essentially, only looking at a few of the buckets with the largest counts – enables it to scale to moderate-sized databases, but not anything as large as modern applications require. Since then, more scalable approaches to color indexing databases have been developed, such as Albuz et al’s “Scalable Color Image Indexing and Retrieval Using Vector Wavelets” (2001) and hierarchical clustering based approaches such as those of Abdel-Mottaleb et al (“Performance Evaluation of Clustering Algorithms for Scalable Image Retrieval”, 1998).

The paper employs combinatorial logic, and a bit of guesswork, to argue that the number of distinct color profiles is sufficiently large to allow a very large number of potential objects to be distinguished. In Stricker and Swain’s “The Capacity of Color Histogram Indexing” (1994, Citeseer) an interesting connection between color indexing and coding theory provides a bound on the actual number of distinguishable images in a color indexing database, which they call its capacity.

However, objects are not necessarily well-distributed among these in practice; in the original experiments the test objects were relatively easy to identify and distinguish and have clear color histograms, such as household items with illustrated packaging. Humans regularly distinguish objects with nearly identical color histograms, such as different types of trees or different brands of cars of the same color; and also regularly identify objects with very different color histograms as being the same, such as a person wearing two different sets of clothes. Color indexing has not, as far as I am aware, been extended to such difficult identification problems.

A peculiar property of color indexing is that, unlike many other object identification and location methods, it has no clear analogy in human visual processing – indeed, the paper cites work by cognitive psychology researcher Anne Treisman showing that humans have demonstrated poor performance at locating objects based on their colors. I’m not aware of any new psychology research investigating the role of color histograms in object location and attention in humans.

Keith Price maintains a bibliography of papers related to recognition by color indexing.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Artificial intelligence | 2 Comments »

Induction of Decision Trees

Posted by dcoetzee on March 1, 2009

Citation: Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81-106. (PDF)

Abstract: The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.

Discussion: J. R. Quinlan, also known as Ross Quinlan, is a researcher out of Australia who is best known for having designed the popular ID3 and C4.5 algorithms that are most commonly used in practice for generating decision trees. This 1986 paper surveyed the area of decision tree learning, dealt with a number of outstanding practical issues with the ID3 algorithm, and successfully motivated wide use of decision trees in applications (to quote the conclusion, “[t]he aim of this paper is to demonstrate that the technology for building decision trees from examples is fairly robust”). It is commonly cited as the best-known detailed description of the ID3 algorithm, although that algorithm was actually first published in 1979. (Quinlan, J.R. Discovering rules by induction from large collections of examples. In Expert systems in the micro electronic age.)

In a typical decision tree problem, a list of objects is given providing a list of attributes and a class. For example, an object may have attributes describing a particular patient’s symptoms (does the patient have a cough? what is the patient’s temperature?) and a class describing the medical issue they were subsequently diagnosed with. The goal of a decision tree is to predict the class (the diagnosis) based on the other attributes (the symptoms). It does this by examining one attribute at a time and deciding what to do next based on whether or not that attribute satisfies some property; it continues in this manner until it has enough information to identify the class. For example, if a patient has a cough and a temperature above 98, they might be diagnosed with a cold, whereas if they have a cough and their temperature is below 98 they might be diagnosed with bronchitis.

At one time these trees were generated painstakingly by hand via interviews with experts. Decision tree learning attempts to generate trees automatically from a large collection of training cases (in this case, a database of patient symptoms and diagnoses) that can be used to predict outcomes in new cases. These examples may come from real-world historical data, or may be generated by experts by hand.

The subject of this work was a family of algorithm that the author calls “the TDIDT family of learning systems,” standing for “Top Down Induction of Decision Trees.” This is actually a subset of set of all decision tree learning algorithms; top-down indicates a strategy where initially all training cases belong to a single root node which is then successively split to form a tree. This is in contrast to incremental approaches where a tree is built based on the data seen so far and then modified as new data arrives, as was used for example by Winston in his 1975 paper “Learning structural descriptions from examples.”

The TDIDT family began with Hunt’s Concept Learning System (CLS) framework as early as 1963, which was built upon by Quinlan with ID3 in 1979. Quinlan was motivated by a specific learning problem posed by Donald Michie involving chess endgames, in which the learner was asked to identify a win for black or white based solely on pattern-based features without search. However, ID3 was still limited to attributes with a finite number of discrete values; systems such as ACLS, ASSISTANT, and later C4.5 allowed attributes with integer or even real continuous values, and could generate boolean predicates to examine them.

The primary goal of these systems is to produce a decision tree that performs well on unseen new cases; they typically seek to achieve this by producing a decision tree with a small number of nodes that still performs adequately on the training set. Because it has a small number of nodes, it is believed to generalize better, avoid overfitting, and to be a simpler representation of knowledge for humans to understand, which is one of the primary advantages of decision trees over more opaque predictors such as artificial neural networks.

The critical choices to be made in building decision trees are which tests to apply to attributes, and the order in which to apply them. The goal of ID3 is to do this efficiently for large-scale problems with many attributes and objects. It accomplishes this by using the idea of information entropy from information theory. This can be explained by analogy to data compression: say we want to store in compressed form the class of each object. Using simple arithmetic coding, which examines the frequencies of each class, we can store this information in a number of bits close to the information entropy lower bound. However, once we’ve performed a test on the object (such as “does the patient have a cough?”), we know more information about it, and so we can compress its class into a smaller number of bits. In ID3, each possible test is examined and then the compressed sizes of the classes of the two partitions (e.g. people with and without a cough) is summed. The test that minimizes this value is selected, because it provides the most information.

This entropy-based test has some drawbacks; it tends to favor attributes with more possible values. This can be dealt with either by restricting attributes to two values or by taking the information content of the attribute itself into account (section 7).

Additionally, ID3 implements an orthogonal improvement for scalability: the tree is built incrementally by starting out with a small subset of all training cases, building a tree, and then applying it to all training cases. A selection of incorrectly classified training cases is then added to the training set and the process is repeated. Because a random sample of the training set tends to be representative, this converges rapidly on a tree that suits the entire training set without the need to build a tree on the whole training set directly (an expensive proposition).

This approach works excellently in practice in situations where all data is correct and complete, but many practical scenarios also have to cope with noise (errors in the training data). The naive approach above would rapidly overfit to try and classify noisy entries. ID3 relies on a statistical approach based on the chi square test that is able to determine when an attribute does not have a significant effect on the relative frequencies of classes (that is, each partition corresponding to a particular value of the attribute has about the same proportion of items in each class).

Another problem with naive ID3 is unknown attribute values. The most obvious approach, adding a value for “unknown” to the attribute, yields poor performance. Quinlan’s approach was essentially to treat each unknown value as a probabilistic value that takes on each possible attribute value with a probability based on how frequently that attribute value occurs. This complicated decision tree lookup, since it may now have to follow multiple paths and then produce a probability distribution of output classes. This approach also extends naturally to attributes with partial information (distributions over values).

In 1993 Quinlan created an improved decision tree learning method called C4.5 based on ID3. In addition to support for attributes with infinitely many values, and some of the improvements discussed above, it also prunes completed trees by replacing subtrees with leaf nodes wherever this demonstrates improved generalization (as measured by evaluating a separate test set).

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Artificial intelligence | Tagged: , , , , | 1 Comment »