Posted by dcoetzee on April 21st, 2009
Citation: Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R. "Content-based image retrieval at the end of the early years." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1349–1380, Dec. 2000.
Abstract: The paper presents a review of 200 references in content-based image retrieval. The paper starts with discussing the working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap. Subsequent sections discuss computational steps for image retrieval systems. Step one of the review is image processing for retrieval sorted by color, texture, and local geometry. Features for retrieval are discussed next, sorted by: accumulative and global features, salient points, object and shape features, signs, and structural combinations thereof. Similarity of pictures and objects in pictures is reviewed for each of the feature types, in close connection to the types and means of feedback the user of the systems is capable of giving by interaction. We briefly discuss aspects of system engineering: databases, system architecture, and evaluation. In the concluding section, we present our view on: the driving force of the field, the heritage from computer vision, the influence on computer vision, the role of similarity and of interaction, the need for databases, the problem of evaluation, and the role of the semantic gap.
Discussion: This extensive survey reviews and classifies content-based image retrieval systems from 200 early publications and discusses fundamental problems in the area.
If you go to Google Image Search today and type in search keywords, some searches work very well – searches for which the images are clearly labelled in the source webpage with those keywords, such as “Homer Simpson.” But when the person who published the content didn’t think to mark it with the specific term you’re looking for – such as “man looking at a girl” or “paintings of people wearing capes” – it falls apart rapidly. Stock photography sites like Getty Images try to compensate for this by paying people to tag images with every keyword they can possibly think of – needless to say, this is an expensive proposition, and one that still fails to satisfy many useful queries.
We can approach something close to an ideal image search using people: we give them a list of images and explain our query, and they exhaustively examine every image to determine if it satisfies the query. A person does not need to rely on surrounding text; they can look at the image itself, identify objects and their relationships, and incorporate cultural context and application-specific background to resolve ambiguities and identify actions and states. These are the sorts of tasks people are particularly good at.
Content-based image retrieval (CBIR) attempts to respond to image queries as a human would, subject to limitations of feasibility and performance. It relies on image content, using techniques from computer vision to interpret and understand it, while using techniques from information retrieval and databases to rapidly locate images matching a specific query. Since its inception it has exploded into a field in its own right, with hundreds of papers, each making its own tradeoff among the many variables involved.
A natural-language query to an image search engine rapidly runs into issues of intractability due to our very limited progress on the problem of natural language processing and the need for vast common-sense knowledge about the world to process many queries. For example, consider the rules you might use to determine if an image depicts an “indoor scene” or a “frightened person.” The authors of this paper label this problem the semantic gap. Rather than tackle these problems, most CBIR systems rely on alternate user interfaces, such as search by association, search by example, and search by sketching, all of which involve using images to search for other related images. Often this search process is iterative: at each stage, the user clicks an image “more like” their target image, refining the set of candidates. You can see a demonstration of a preliminary version of such an interface at the new Google Labs Similar Images.
There are a number of important variables which strongly influence the design of a CBIR system. These include:
- The scope of the domain: How are the images constrained? How are the queries constrained? If the images can be taken under controlled conditions, and queries are easy to predict, the problem often admits a highly accurate domain-specific solution. At the other end of the spectrum is general image search of all online images.
- The type of search activity: Will the user typically be looking for a specific image? Or just browsing? Or do they want to classify or categorize an image collection?
- The user interface: how does the user express and refine a query?
- Performance constraints: how quickly must queries be answered?
Since the semantic gap is considered insurmountable, the most fundamental limitation in practical CBIR systems is what the authors of this paper call the sensory gap: “the gap between the object in the world and the information in a (computational) description derived from a recording of that scene.” In other words, it’s figuring out what you’re looking at based only on an image. In narrow domains with images taken under fixed conditions, this can be attained even with simple schemes. In a general domain like web image search, this is still considered an insurmountable problem.
One of the main reasons the sensory gap is insurmountable in general domains is the issue of segmentation: separating the image into regions, each corresponding to a particular object. Doing this perfectly is called strong segmentation; this is infeasible or impossible in a general setting, particularly in the presence of occlusion (one object in front of another one). An alternative – weak segmentation – only identifies part of the region corresponding to an object, and hopes it is enough to identify the object. A third approach, accumulation, does not rely on segmentation at all: it computes a function across the entire image, which is designed to be insensitive to variations in the part of the image not corresponding to the object. Color indexing, which I described in a previous post, is an example of accumulation, where the color histogram over the image, corrected for lighting conditions, is used to identify the object it depicts.
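To make accumulation concrete, here is a minimal sketch of color-histogram comparison in the spirit of color indexing. It is illustrative only: it quantizes raw RGB values and skips the lighting correction the real scheme applies; the function names are my own.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Accumulate a joint RGB histogram over the whole image.

    `image` is an (H, W, 3) uint8 array. No segmentation is needed:
    every pixel contributes, which is what makes this "accumulation."
    """
    # Quantize each channel into `bins` levels, then combine into one index.
    quantized = (image.astype(np.int64) * bins) // 256
    index = (quantized[..., 0] * bins + quantized[..., 1]) * bins + quantized[..., 2]
    hist = np.bincount(index.ravel(), minlength=bins ** 3).astype(float)
    return hist / hist.sum()  # normalize so images of different sizes compare

def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint."""
    return float(np.minimum(h1, h2).sum())
```

Two photos of the same object under similar lighting tend to land in the same color bins, so their intersection score stays high even if the object moves or is partly occluded.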
Generally, object identification relies on a great deal of imprecise ad hoc knowledge about general image characteristics. For example, as a convention in most images the horizon is at about the same position – comparing object locations to the horizon allows their actual size to be estimated. Much of the same physics of light, reflectance, geometry, and texture that comes into play in computer graphics is useful here in separating the object description from environmental factors such as lighting and viewing angle. The rough shapes of objects can be identified with edge detection and invariant transforms to account for changes in viewing angle. Finally, cultural conventions such as the use of certain symbols in diagrams or the prevalence of perpendicular angles indoors are frequently exploited. In some cases, rather than writing rules for object recognition by hand, a large training set of labelled images is used for statistical learning of image features (as with eigenfaces).
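The eigenfaces idea reduces to principal component analysis over flattened training images: learn a small orthonormal basis from the data, then describe any new image by its coordinates in that basis. A minimal sketch (my own function names, synthetic data assumed):

```python
import numpy as np

def eigen_basis(images, k):
    """Learn the top-k principal components ("eigenfaces") from a
    training set of flattened images, one image per row."""
    mean = images.mean(axis=0)
    centered = images - mean
    # SVD of the centered data gives principal directions directly,
    # without explicitly forming the covariance matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def project(image, mean, basis):
    """Describe an image by its k coordinates in the learned basis."""
    return basis @ (image - mean)
```

Recognition then becomes comparison of these low-dimensional coordinate vectors rather than raw pixels, which is both faster and less sensitive to pixel-level noise.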
To facilitate efficient lookup, CBIR systems do not perform processing on each image for every query. Instead, they generally map each image to a feature vector, an array of values that succinctly describes the semantic objects in the image and their relationships, whether explicitly or implicitly. These are stored in a database. Lookup can then be done by comparing the feature vector of the query image to the feature vectors of the images in the database, a type of query the database supports efficiently.
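A toy version of that lookup, assuming Euclidean distance as the similarity measure and a plain linear scan (real systems use indexing structures so the scan is sublinear; the class and names here are mine):

```python
import numpy as np

class FeatureIndex:
    """Store one precomputed feature vector per image; answer a query
    by returning the names of the k nearest vectors."""

    def __init__(self):
        self.names = []
        self.vectors = []

    def add(self, name, vector):
        self.names.append(name)
        self.vectors.append(np.asarray(vector, dtype=float))

    def query(self, vector, k=1):
        # Euclidean distance from the query to every stored vector.
        dists = np.linalg.norm(np.stack(self.vectors) - np.asarray(vector, dtype=float), axis=1)
        order = np.argsort(dists)[:k]
        return [self.names[i] for i in order]
```

The expensive computer-vision work happens once per image at indexing time; query time is just vector arithmetic.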
A big challenge in CBIR systems has been evaluation: given two systems purporting to solve the same search problem, which one does a better job? In traditional search engines this problem is attacked with the notions of precision and recall, but these depend on a reliable notion of whether or not an image is relevant, and do not take into account the iterative refinement of results. In 2000, it was typical to rely on derived measurements such as the time users took to complete tasks using the tool.
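For reference, the two standard measures are simple to state: precision is the fraction of retrieved images that are relevant, and recall is the fraction of relevant images that were retrieved. A one-function sketch:

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)
```

Both numbers presuppose a ground-truth relevance judgment for every image, which is exactly what is hard to obtain for image queries.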
This discussion is but a summary of an already-dense survey, so if you’re interested in CBIR, I recommend consulting the survey for details. After more than a decade, there are increasing signs of wider deployment of CBIR systems both in limited domains and to the general public – don’t be surprised if you end up using one soon.
The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.