Posted by dcoetzee on May 24th, 2009
Citation: Abu-Mostafa, Y. S. 1990. Learning from hints in neural networks. Journal of Complexity, 6, 2 (Jun. 1990), 192-198. (PDF)
Abstract: Learning from examples is the process of taking input-output examples of an unknown function f and inferring an implementation of f. Learning from hints allows for general information about f to be used instead of just input-output examples. We introduce a method for incorporating any invariance hint about f in any descent method for learning from examples. We also show that learning in a neural network remains NP-complete with a certain, biologically plausible, hint about the network. We discuss the information value and the complexity value of hints.
Discussion: This short 1990 machine learning paper introduced a technique for learning with hints in a neural network; that is, it lets a neural network learn a function when we have some limited information about that function's properties in addition to input-output examples.
In machine learning, neural networks are a simple way of representing functions that is nonetheless powerful enough to approximate any continuous function on a bounded domain, given enough nodes. They consist of at least three layers of processing nodes connected by edges labelled with weights. At each node, the inputs are multiplied by the weights on their edges and summed, then a sigmoid function (a particular strictly increasing function, here bounded between -1 and 1) is applied to produce the output. By adjusting the weights, we can gradually modify the function being computed.
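As a rough illustration of the description above, here is a minimal sketch of the forward computation in a three-layer network. It assumes tanh as the sigmoid (it is strictly increasing and bounded between -1 and 1) and a single output node; the names forward, w_hidden and w_out are made up for this example and do not come from the paper.

```python
import math

def forward(x, w_hidden, w_out):
    """Compute the output of a three-layer network (input, hidden, output).

    w_hidden[j][i] is the weight on the edge from input i to hidden node j;
    w_out[j] is the weight on the edge from hidden node j to the output node.
    """
    # Each hidden node sums its weighted inputs, then applies the sigmoid.
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    # The output node does the same with the hidden activations.
    return math.tanh(sum(w * h for w, h in zip(w_out, hidden)))

# Arbitrary illustrative values: two inputs, two hidden nodes, one output.
x = [0.5, -1.0]
w_hidden = [[0.1, 0.4], [-0.3, 0.2]]
w_out = [0.7, -0.5]
print(forward(x, w_hidden, w_out))
```

Changing any entry of w_hidden or w_out changes the function that forward computes, which is all that training does.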
Neural networks are particularly useful in classification problems; for example, one might construct a neural network that takes as input an image and outputs 1 if it looks like a hamburger, or else -1. It would be really difficult to manually write a program that does this. Instead, neural networks can be trained with a set of inputs and their associated outputs; the weights are adjusted based on the examples using the backpropagation algorithm until the network computes a function close to the actual one.
For our purposes, the most important thing to note is that backpropagation works by feeding a training input to the network and measuring how far the output is from the desired training output; this distance is called the error. It then adjusts the weights in a way that decreases that error. It does this repeatedly until it reaches a stopping point.
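The loop just described can be sketched as follows. This is a deliberately simplified sketch, not full backpropagation: it assumes a single tanh node with no hidden layer, squared error, and a fixed number of passes as the stopping point. The chain-rule step is the same one backpropagation applies layer by layer. The toy data and the names train and total_error are invented for illustration.

```python
import math

def train(examples, weights, rate=0.1, epochs=200):
    """Gradient descent on squared error for one tanh node (no hidden layer)."""
    weights = list(weights)
    for _ in range(epochs):
        for x, target in examples:
            out = math.tanh(sum(w * xi for w, xi in zip(weights, x)))
            error = out - target  # how far the output is from the desired output
            # Chain rule: d(error^2)/dw_i = 2 * error * (1 - out^2) * x_i
            grad_scale = 2 * error * (1 - out * out)
            weights = [w - rate * grad_scale * xi for w, xi in zip(weights, x)]
    return weights

def total_error(examples, weights):
    """Sum of squared errors over all examples."""
    return sum((math.tanh(sum(w * xi for w, xi in zip(weights, x))) - t) ** 2
               for x, t in examples)

# Toy data: the label is the sign of the first input; the last input is a constant bias.
examples = [([1.0, 0.2, 1.0], 1), ([0.8, -0.5, 1.0], 1),
            ([-1.0, 0.3, 1.0], -1), ([-0.7, -0.4, 1.0], -1)]
start = [0.0, 0.0, 0.0]
trained = train(examples, start)
print(total_error(examples, start), total_error(examples, trained))
```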
Backpropagation is efficient and general but can run into two important related problems:
- Insufficient data: There may not be enough training data to learn weights that generalize well to new inputs.
- Overfitting: The resulting function may end up oversensitive to properties of the training examples that are actually irrelevant.
For example, say you train a hamburger recognizer on many images off the Internet, and then I give it a picture of an upside-down hamburger. Because it’s never seen an upside-down hamburger before, it’s quite likely to claim that it’s not a hamburger, despite the fact that intuitively we know that orientation does not affect an image’s hamburgerness. Likewise it may fail to recognize an image that is smaller or larger than those in its training set, or where lighting is unusual. Requirements like these, that the output must not change when the input is transformed in some way, are restrictions on the function being learned and are called invariants. Invariants cannot be directly expressed as input-output examples; they are broader restrictions on the set of functions under consideration.
The most obvious way to deal with invariants is to expand your training set – turn all your training images upside-down, and add them to your training set. But when we begin to consider more and more combinations of invariants, this approach can rapidly grow infeasible. Not only does the training set become large, but if there are not enough inputs in the original training set to teach the invariant, then it will not be properly learned.
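To make the growth concrete, here is a small sketch of this expand-the-training-set approach. It assumes images are plain lists of pixel rows and that we want one extra copy of every example for each non-empty combination of transformations; upside_down, mirror and augment are hypothetical helpers written for this post.

```python
from itertools import combinations

def upside_down(image):
    """Rotate a 2D image (a list of rows) by 180 degrees."""
    return [row[::-1] for row in image[::-1]]

def mirror(image):
    """Flip a 2D image left to right."""
    return [row[::-1] for row in image]

def augment(training_set, transforms):
    """Add a copy of each example for every non-empty combination of transforms."""
    augmented = list(training_set)
    for size in range(1, len(transforms) + 1):
        for combo in combinations(transforms, size):
            for image, label in training_set:
                for transform in combo:
                    image = transform(image)
                # The label is unchanged: that is exactly what the invariant says.
                augmented.append((image, label))
    return augmented

training_set = [([[0, 1], [2, 3]], 1), ([[5, 5], [0, 9]], -1)]
bigger = augment(training_set, [upside_down, mirror])
print(len(training_set), len(bigger))
```

With k transformations the set grows by a factor of 2^k under this scheme, and every added image still needs a known label.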
The key observation of Abu-Mostafa’s work is that we don’t need to rely entirely on training examples. Whereas training examples specify the constraint that f(x) = y, where f is the function we’re learning and x and y are the input and output, for invariants it’s more useful to deal with equality examples, which are pairs (x1, x2) for which f(x1) = f(x2). The backpropagation algorithm can be easily modified to accommodate this: instead of computing the error based on the distance between the actual output and desired output, we compute it as the distance between the two outputs the network produces for the two inputs. There’s no requirement to know what the value of f(x1) or f(x2) is. Using this advantage in our example, we can take any image, even if we don’t know whether or not it looks like a hamburger, rotate it, and use the two images to train our network. We could even generate random images and rotate these.
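Here is a minimal sketch of descent on equality examples, under several assumptions that are mine rather than the paper's: a single tanh node instead of a full network, squared difference as the distance, and a toy invariance (the order of the two inputs should not matter) standing in for image rotation. The names hint_step, swap and hint_error are invented for this example.

```python
import math
import random

def f(weights, x):
    """The function being learned: a single tanh node."""
    return math.tanh(sum(w * xi for w, xi in zip(weights, x)))

def hint_step(weights, x1, x2, rate=0.1):
    """One descent step on the hint error (f(x1) - f(x2))^2; no labels are needed."""
    o1, o2 = f(weights, x1), f(weights, x2)
    diff = o1 - o2
    # Chain rule through both evaluations of the same node.
    return [w - rate * 2 * diff * ((1 - o1 * o1) * a - (1 - o2 * o2) * b)
            for w, a, b in zip(weights, x1, x2)]

def swap(x):
    """Toy invariance: f should not care about the order of its two inputs."""
    return [x[1], x[0]]

def hint_error(weights, pairs):
    """Total squared difference between outputs over all equality examples."""
    return sum((f(weights, a) - f(weights, b)) ** 2 for a, b in pairs)

rng = random.Random(0)
# Unlabelled random inputs, each paired with its transformed copy.
pairs = []
for _ in range(50):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    pairs.append((x, swap(x)))

weights = [0.9, -0.2]
before = hint_error(weights, pairs)
for _ in range(100):
    for x1, x2 in pairs:
        weights = hint_step(weights, x1, x2)
after = hint_error(weights, pairs)
print(before, after, weights)
```

The two weights are pulled towards each other, which is what order-invariance means for this node. Note that any constant function also satisfies the hint perfectly, so in practice steps like these would presumably be mixed with steps on ordinary labelled examples rather than used alone.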
Another way of framing this is that we want to encourage the network to learn new features that describe the input but in some way summarize or interpret the original input features. These new features can then be leveraged by the network to guide the final output. To take an example from the book Pattern Classification (Richard Duda, Peter Hart, David Stork, section 6.8.12): if the input is a soundwave, and we want to determine what speech phoneme it represents, a useful intermediate feature would be whether it’s a vowel or a consonant. To encourage the network to learn this feature, we can add a new output node whose target is 1 or -1 depending on whether the input soundwave is a vowel or consonant. By incorporating an initial training phase that modifies the network weights to predict this output well, we obtain a starting network that can already distinguish vowels and consonants, which is a big help in making finer subclassifications. Without the hints, the network might have learned this on its own, but the more domain-specific information we can give it, the quicker it will train and the better it will generalize. In the case of our original example, these intermediate features would be properties of the original image that are insensitive to transformations like rotation.
For learning with hints, neural networks with more than three layers of nodes are often helpful as well. The idea is that the first layer can convert the input features to the new intermediate features of interest, and then normal learning can be applied to these. Where the invariants are very simple, they can even be applied to the inputs before training (and prediction), placing them in canonical form. For example, to help deal with brightness variation in images, it helps to scale all images to the same brightness range.
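As a small sketch of the canonical-form idea, the function below rescales an image's pixel values to a fixed range before it would be handed to the network. It assumes a greyscale image stored as a list of rows and a simple linear stretch to [0, 1]; normalize_brightness is a hypothetical helper, and real preprocessing pipelines may use other schemes.

```python
def normalize_brightness(image):
    """Rescale pixel values linearly so the darkest is 0.0 and the brightest is 1.0."""
    pixels = [p for row in image for p in row]
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # A flat image has no contrast to stretch; map it to all zeros.
        return [[0.0 for _ in row] for row in image]
    return [[(p - lo) / (hi - lo) for p in row] for row in image]

dim = [[10, 20], [30, 40]]
bright = [[110, 140], [170, 200]]  # the same picture, brighter and with more contrast
print(normalize_brightness(dim) == normalize_brightness(bright))
```

Because both versions of the picture map to the same canonical image, the network never has to learn this particular invariant; the same function must be applied at prediction time as well.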
My apologies for the long delay in this post – I’m currently engaged in reading the book Pattern Classification, and intend to follow up here with a discussion of it when I’m done.
The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.