## Pattern Classification

Posted by dcoetzee on June 17th, 2009

**Citation**: R.O. Duda, P.E. Hart, and D.G. Stork. *Pattern Classification*, 2nd edition. New York: John Wiley & Sons, 654 pages, 2001, ISBN: 0-471-05669-3.

**Discussion**: This classic machine learning textbook discusses a variety of well-studied, application-independent methods of classifying inputs into one of several distinct classes, as well as some of the theory behind them. The book is organized around the amount of information available to the classifier, beginning with scenarios where the actual distribution of features of classes is known, and finishing with scenarios where even the classes themselves must be inferred. Among the many techniques considered are decision trees, neural networks, genetic algorithms, Boltzmann learning, nearest neighbor classification, *k*-means clustering, principal and independent component analysis, formal grammars, linear discriminants, resampling methods like bagging and boosting, and maximum-likelihood estimation; most of these are treated in great detail.

Perhaps the best demonstration of the book’s level of detail is its discussion of neural networks in chapter 6. Whereas any general book on artificial intelligence is bound to include a basic discussion of neural networks and their dominant learning algorithm, backpropagation, the discussion here is both denser and covers a broad range of extensions and relations to other methods. First, it places neural networks in their historical context by preceding the chapter with a thorough chapter on linear discriminants, a powerful tool in their own right and essentially equivalent to neural networks with no hidden layers. Unlike more general references, which tend to merely describe the most common implementation of neural networks without any justification, here every design choice going into the network is derived and justified, and alternative choices are discussed, from the update rules to the activation function to the number of layers and hidden units to the initial weights to the training protocol to the stopping condition. Perhaps most important from an engineer’s point of view is section 6.8, which discusses a variety of extensions needed to make neural networks efficient and accurate in practice, such as input standardization, momentum, weight decay, hints, use of second-order information, and so on. Each of these occupies no more than a page, but still effectively conveys the essence of the technique. Besides these essentials, the chapter also informally proves the expressive power of three-layer networks, and incorporates a variety of interesting visualizations of network structure, network weights, error surfaces, convergence, and so on, which help a lot in developing an intuition for the behavior of neural networks. In short, this chapter contains about as much information on neural networks as I would normally expect to get from an entire book on the subject.
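Two of the section 6.8 refinements, momentum and weight decay, fit in a few lines of code. Here is a minimal sketch of a gradient-descent weight update using both (my own toy illustration, not the book’s notation; the name `sgd_step` and the hyperparameter values are assumptions):

```python
def sgd_step(w, grad, velocity, lr=0.1, momentum=0.9, weight_decay=1e-4):
    # Momentum reuses part of the previous step direction to smooth descent;
    # weight decay shrinks weights toward zero to discourage overfitting.
    velocity = momentum * velocity - lr * (grad + weight_decay * w)
    return w + velocity, velocity

# Toy usage: descend f(w) = w^2, whose gradient is 2*w.
w, v = 1.0, 0.0
for _ in range(100):
    w, v = sgd_step(w, 2 * w, v)
print(w)  # close to 0
```

In a real network the same update would be applied to every weight, with `grad` supplied by backpropagation.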

The book isn’t just a bunch of survey papers stapled together, though; several threads tie it together. Most important is its organization around progressively less well-specified problems. Chapter 2 begins with Bayesian decision theory and describes how to perform classification optimally, given the complete probability distribution of each feature for each class. This also brings in the first major theoretical result, the lower bound on classification error given by the *Bayes error*. From there it progresses to maximum-likelihood estimation, in which the distribution type (e.g. Gaussian) is known but not its parameters (such as mean and standard deviation), and these parameters must be determined from manually classified example data. After this come simple nonparametric techniques like nearest neighbor estimation and linear discriminants, which make no assumptions about the distribution but only work well when the classification function is simple and the dimension isn’t too high. Neural networks can describe more complex functions, but still get tripped up by complex classification functions with many local minima and maxima. The next chapter addresses stochastic methods like simulated annealing and genetic algorithms, which cope well with complex classification functions at the expense of much greater training time. Skipping over a couple of chapters, the final one deals with unsupervised classification, where the example data have not been assigned to classes, and the classes themselves are unknown and must be inferred; a typical technique in this scenario is clustering. This organization effectively drives home the point that, to perform well, a classifier needs to take advantage of as much information as the application domain makes available.
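The chapter 2 starting point is worth seeing in miniature. As a hedged sketch (my own toy example, not from the book), the Bayes decision rule with fully known one-dimensional Gaussian class-conditional densities just picks the class maximizing prior times likelihood:

```python
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def bayes_classify(x, classes):
    # With the densities exactly known, choosing the class that maximizes
    # prior * likelihood minimizes the probability of error (the Bayes error).
    return max(classes, key=lambda c: c["prior"] * gaussian_pdf(x, c["mean"], c["std"]))

# Two equiprobable classes; the optimal boundary falls midway between the means.
classes = [
    {"name": "A", "prior": 0.5, "mean": 0.0, "std": 1.0},
    {"name": "B", "prior": 0.5, "mean": 4.0, "std": 1.0},
]
print(bayes_classify(1.0, classes)["name"])  # "A"
print(bayes_classify(3.0, classes)["name"])  # "B"
```

Everything after chapter 2 can be read as progressively weakening the assumption that these densities are known.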

Chapters 8 and 9 don’t fit neatly into this progression. Most of the chapters deal with numeric features, like width or age, leaving discrete features such as color or gender, and the classifiers that process them, such as decision trees and formal grammars, to chapter 8. Chapter 9, entitled “algorithm-independent machine learning”, is built around theory and metalearning techniques that can be used to enhance any classifier. It begins with the fundamental No Free Lunch theorem, which essentially states that no classifier performs better in all cases than any other classifier, including the naive classifier that merely guesses at random. It also discusses resampling methods that aim to improve unstable classifiers or combine different types of classifiers, and describes how to compare one classifier with another, which is essential when espousing the benefits of any new classifier.
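The resampling idea is easy to sketch. The following toy version of bagging (my own illustration, not the book’s pseudocode; `train_stump` is an assumed placeholder for any unstable base classifier) trains each model on a bootstrap sample and predicts by majority vote:

```python
import random

def train_stump(data):
    # A trivial threshold classifier: split at the midpoint of the class means.
    m0 = sum(x for x, y in data if y == 0) / max(1, sum(1 for _, y in data if y == 0))
    m1 = sum(x for x, y in data if y == 1) / max(1, sum(1 for _, y in data if y == 1))
    threshold = (m0 + m1) / 2
    return lambda x: 1 if x > threshold else 0

def bag(data, n_models=25, seed=0):
    # Bagging: each model sees a bootstrap resample (drawn with replacement);
    # the ensemble predicts by majority vote, smoothing out unstable models.
    rng = random.Random(seed)
    models = [train_stump([rng.choice(data) for _ in data]) for _ in range(n_models)]
    return lambda x: 1 if sum(m(x) for m in models) * 2 > n_models else 0

data = [(x / 10, 0) for x in range(10)] + [(2 + x / 10, 1) for x in range(10)]
predict = bag(data)
print(predict(0.2), predict(2.5))  # 0 1
```

The benefit shows up mainly when the base classifier is unstable, so that the bootstrap copies disagree and the vote averages away their variance.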

The progressive organization, while useful, also necessarily implies that some essential basic topics, such as decision trees (chapter 8), resampling methods (chapter 9), and *k*-means clustering (chapter 10), are not introduced until long after more specialized topics have been visited, such as Bayesian belief networks (chapter 2) and hidden Markov models (chapter 3). As a consequence it probably makes more sense to just read the first few sections of each chapter first, rather than reading the book straight through.

Another frustration with this book is its number of forward references: in particular, cross-validation is mentioned several times before it’s finally explained in chapter 9, and probabilistic neural networks are discussed in chapter 4, in the context of Parzen windows, even though neural networks themselves aren’t covered until chapter 6. The authors also assume quite a bit of mathematical background in areas that aren’t covered in depth in most undergraduate programs, like matrix calculus and statistics, with only a terse appendix as a refresher on this material. It would have been helpful to demonstrate some of these techniques in the one-dimensional case, where the familiar intuitions of real-number calculus apply.
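Since cross-validation comes up repeatedly before chapter 9 explains it, a brief sketch may help. This is a generic k-fold error estimate (my own illustration, not the book’s code; `train_majority` is an assumed toy classifier that always predicts the most common training label):

```python
def k_fold_error(data, train, k=5):
    # Hold out each of k folds in turn, train on the remaining folds,
    # and average the held-out error rates to estimate generalization error.
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        held_out = folds[i]
        rest = [pt for j in range(k) if j != i for pt in folds[j]]
        model = train(rest)
        errors.append(sum(model(x) != y for x, y in held_out) / len(held_out))
    return sum(errors) / k

def train_majority(data):
    ones = sum(y for _, y in data)
    label = 1 if ones * 2 >= len(data) else 0
    return lambda x: label

data = [(i, 1) for i in range(8)] + [(i, 0) for i in range(2)]
print(k_fold_error(data, train_majority, k=5))  # 0.2
```

In practice the folds would be shuffled first; the interleaved split here just keeps the example deterministic.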

While dense and thorough, there are necessarily some things the book doesn’t cover in order to fit into roughly 650 pages. Some theorems are not proved in the text, particularly the No Free Lunch theorem, but also many of the convergence results and probability bounds. This is generally not a big deal, since the book is intended more for those looking to apply machine learning to an application domain than for those looking to study theoretical machine learning and conceive new techniques – and its references are thorough enough that this information can be located if needed. On the other hand, it also doesn’t contain any real code or references to real machine learning libraries – it’s not a programmer’s handbook. Finally, the second edition in particular concentrates heavily on application-independent models, and does not delve into the intricacies of the immense number of applications of machine learning; problems like determining the best features to use and how to compute them are discussed, but motivated only by toy examples rather than real-world problems like speech recognition, image processing, or expert systems. It helps when reading to have an application domain in mind to mentally plug into the models.

In short, if you’re looking for a dense, application-independent survey of major classification algorithms in machine learning, this is the book to get you started, and it will take you a few steps beyond anything you might have studied in an intro AI class. But I would precede it with some reading on matrix calculus, and follow it with some reading on a specific application area of your choice, to help make the concepts more concrete.

*The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.*
