Papers in Computer Science

Discussion of computer science publications

Efficient Online Index Construction for Text Databases

Posted by dcoetzee on April 27, 2009

Citation: Lester, N., Moffat, A., and Zobel, J. Efficient online index construction for text databases. ACM Trans. Database Syst. 33, 3 (Aug. 2008), 1-33. (PDF)

Abstract: Inverted index structures are a core element of current text retrieval systems. They can be constructed quickly using offline approaches, in which one or more passes are made over a static set of input data, and, at the completion of the process, an index is available for querying. However, there are search environments in which even a small delay in timeliness cannot be tolerated, and the index must always be queryable and up to date. Here we describe and analyze a geometric partitioning mechanism for online index construction that provides a range of tradeoffs between costs, and can be adapted to different balances of insertion and querying operations. Detailed experimental results are provided that show the extent of these tradeoffs, and that these new methods can yield substantial savings in online indexing costs.

Discussion: This 2008 information retrieval paper discusses how to efficiently incrementally update a search index as new documents are added. This allows new documents to be indexed and available for search immediately. Total index construction time is an order of magnitude faster than older, more trivial incremental update mechanisms, at a moderate expense in query time, about 12% on average.

As a motivating application, consider desktop search: in this scenario, you have a set of documents on your hard disk, and you would like to rapidly search them for keywords at any time. To facilitate this, your operating system keeps an on-disk data structure called an inverted index that specifies, for each possible search keyword (or term), a list of all files that contain it. Using modern index compression techniques, such an index can be stored in about 25% the space of the documents that it is indexing.

Constructing an index for a fixed set of documents is a relatively simple procedure; the following procedure comes from Heinz and Zobel’s “Efficient single-pass index construction for text databases.” With each document we associate a document ID number. We keep an in-memory data structure associating each term with a sorted list of document IDs of documents containing that keyword. We go through documents and expand this data structure until we fill the indexer’s buffer, and then write it out to disk and start again with an empty data structure. At the very end, we go through the complete list of search terms and for each one we perform a disk-based multiway merge of all sorted lists associated with that term.

The organization on disk of a typical index is also simple: it’s typically stored in two parts, the lexicon (or vocabulary) and the lists. The lexicon simply maps each term to the location on disk of its associated list; this can be done for example with a standard data structure like a B-tree. The lexicon is usually the smaller part. The lists part stores the compressed lists of documents associated with each term, all packed tightly together.

This procedure is efficient, but isn’t very flexible: it only works for a set of fixed, unchanging documents. If we add, remove, or modify documents, we have to rebuild the index all over again. In current practice, the way this is usually dealt with is by doing search queries against the old index while the new index is being built. Once the new index is done, we switch over to it all at once. The problem with this is that it introduces a delay period during which queries will not detect recent changes. In a desktop search scenario, it can be very annoying if the search doesn’t turn up that document you were editing five minutes ago, or that e-mail you just got, because it hasn’t been indexed yet.

This is where online index construction comes in. Online index construction attempts to update an existing index to incorporate recent changes, so that the index is up-to-date at all times. All these schemes share a general framework: we keep an in-memory data structure of the most recent changes, and during each query we check both this data structure and the on-disk index. Once the in-memory index gets too big, it has to get merged into the on-disk index. Updating the lexicon is relatively simple if it is stored in a B-tree: we just add, update, or remove keys from the B-tree as necessary.

More problematic is updating the lists. There are a number of simple techniques for doing this, none of which are very efficient in general:

  • In the in-place update method, we find the list for each term being merged and insert the new document IDs into that list. If the list becomes too large and no longer fits in its place in the lists file, it has to get moved to a new location. Updating the index is slow because it requires random access; and the resulting index becomes fragmented, slowing down future queries and updates. Even if we do a good job of overprovisioning by leaving space for each list to grow into, it does not overcome the detrimental effects of this method.
  • Related to the in-place method is the method of Tomasic et al (“Incremental updates of inverted lists for text document retrieval“, 1994), which instead of moving lists when they become too large, simply allows lists to occupy multiple discontiguous segments that get chained together in a linked list. This improves update performance at the expense of query performance, which as the index grows will have to follow more and more links to read a complete list, each necessitating a random disk access.
  • The remerge update scheme avoids the random access and fragmentation issues of in-place update by simply reading through the entire index and writing out the updated index as it goes. The problem with this approach is that as the on-disk index becomes large compared to the in-memory index, it has to process more and more lists that are not being updated, which is wasteful.

The insight of this paper is this: the cost of merging two indexes is roughly proportional to the sum of their sizes; consequently, it takes less time to merge new updates into a smaller index than into a larger index. We can take advantage of this by creating a hierarchy of indexes on disk, with each one having a maximum capacity r times bigger than the previous one. We always merge the in-memory index into the smallest one. Whenever an index at one level fills up, it is merged into the index at the next level. Each merge is done using the remerge update technique described above, and queries are run against all indexes independently. The parameter r represents a critical tradeoff between index update time and query performance – in experiments, queries were 70% faster for r = 12 than for r = 2,  but total index construction time was 3.5 times faster for r = 2 than for r = 12. The authors call this technique geometric partitioning.

Is that it? Well, that’s the most important part. Additionally, relying on B-trees to efficiently manage the vocabulary wasn’t a safe bet, and they need to be structured with geometric partitioning as well, as described in section 5.4. Nevertheless, this is a strikingly simple technique that has the potential to greatly expand the scope of indexing applications.

As with most recent results, there also appears to be considerable room for improvement and exploration – for example, could a superior disk-based data structure achieve simultaneous good query performance and index construction time? Could these ideas be used to define a new general disk-based lookup technique superior to B-trees? Is there an analogous method that relies on the in-place update instead of the remerge update as a subroutine? How would a system like this function in a distributed environment? It’s interesting to me to note that many of the same problems faced in in-place update are also encountered in the context of in-memory dynamic arrays, where some sophisticated techniques allow constant-time update with less wasted space (e.g. Brodnik et al’s “Resizable Arrays in Optimal Time and Space”). It will be interesting to see in which direction the study of online index construction goes.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

One Response to “Efficient Online Index Construction for Text Databases”

  1. Ah. Figures I wouldn’t be the first person to invent this.

    Note that this is applicable to any problem where you want to perform updates on a key value store, and the updates are associative (aka a semigroup). So you can do fast wordcount this way on massive documents. Or your updates might be updates to individual fields of a row of a table. Or your updates might add and remove items from a sorted list, allowing documents to be deleted from as well as inserted into an inverted index.

    The nice thing is that you don’t have to retrieve a key in order to update it. You defer the update to the most convenient time.

    http://www.logarithmic.net/pfh/blog/01244327900

Leave a Reply

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <pre> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>