Papers in Computer Science

Discussion of computer science publications

Archive for April, 2009

Efficient Online Index Construction for Text Databases

Posted by dcoetzee on April 27, 2009

Citation: Lester, N., Moffat, A., and Zobel, J. Efficient online index construction for text databases. ACM Trans. Database Syst. 33, 3 (Aug. 2008), 1-33. (PDF)

Abstract: Inverted index structures are a core element of current text retrieval systems. They can be constructed quickly using offline approaches, in which one or more passes are made over a static set of input data, and, at the completion of the process, an index is available for querying. However, there are search environments in which even a small delay in timeliness cannot be tolerated, and the index must always be queryable and up to date. Here we describe and analyze a geometric partitioning mechanism for online index construction that provides a range of tradeoffs between costs, and can be adapted to different balances of insertion and querying operations. Detailed experimental results are provided that show the extent of these tradeoffs, and that these new methods can yield substantial savings in online indexing costs.

Discussion: This 2008 information retrieval paper discusses how to efficiently incrementally update a search index as new documents are added. This allows new documents to be indexed and available for search immediately. Total index construction time is an order of magnitude faster than older, more trivial incremental update mechanisms, at a moderate expense in query time, about 12% on average.

As a motivating application, consider desktop search: in this scenario, you have a set of documents on your hard disk, and you would like to rapidly search them for keywords at any time. To facilitate this, your operating system keeps an on-disk data structure called an inverted index that specifies, for each possible search keyword (or term), a list of all files that contain it. Using modern index compression techniques, such an index can be stored in about 25% the space of the documents that it is indexing.

Constructing an index for a fixed set of documents is a relatively simple procedure; the following procedure comes from Heinz and Zobel’s “Efficient single-pass index construction for text databases.” With each document we associate a document ID number. We keep an in-memory data structure associating each term with a sorted list of document IDs of documents containing that keyword. We go through documents and expand this data structure until we fill the indexer’s buffer, and then write it out to disk and start again with an empty data structure. At the very end, we go through the complete list of search terms and for each one we perform a disk-based multiway merge of all sorted lists associated with that term.

The organization on disk of a typical index is also simple: it’s typically stored in two parts, the lexicon (or vocabulary) and the lists. The lexicon simply maps each term to the location on disk of its associated list; this can be done for example with a standard data structure like a B-tree. The lexicon is usually the smaller part. The lists part stores the compressed lists of documents associated with each term, all packed tightly together.

This procedure is efficient, but isn’t very flexible: it only works for a set of fixed, unchanging documents. If we add, remove, or modify documents, we have to rebuild the index all over again. In current practice, the way this is usually dealt with is by doing search queries against the old index while the new index is being built. Once the new index is done, we switch over to it all at once. The problem with this is that it introduces a delay period during which queries will not detect recent changes. In a desktop search scenario, it can be very annoying if the search doesn’t turn up that document you were editing five minutes ago, or that e-mail you just got, because it hasn’t been indexed yet.

This is where online index construction comes in. Online index construction attempts to update an existing index to incorporate recent changes, so that the index is up-to-date at all times. All these schemes share a general framework: we keep an in-memory data structure of the most recent changes, and during each query we check both this data structure and the on-disk index. Once the in-memory index gets too big, it has to get merged into the on-disk index. Updating the lexicon is relatively simple if it is stored in a B-tree: we just add, update, or remove keys from the B-tree as necessary.

More problematic is updating the lists. There are a number of simple techniques for doing this, none of which are very efficient in general:

  • In the in-place update method, we find the list for each term being merged and insert the new document IDs into that list. If the list becomes too large and no longer fits in its place in the lists file, it has to get moved to a new location. Updating the index is slow because it requires random access; and the resulting index becomes fragmented, slowing down future queries and updates. Even if we do a good job of overprovisioning by leaving space for each list to grow into, it does not overcome the detrimental effects of this method.
  • Related to the in-place method is the method of Tomasic et al (“Incremental updates of inverted lists for text document retrieval“, 1994), which instead of moving lists when they become too large, simply allows lists to occupy multiple discontiguous segments that get chained together in a linked list. This improves update performance at the expense of query performance, which as the index grows will have to follow more and more links to read a complete list, each necessitating a random disk access.
  • The remerge update scheme avoids the random access and fragmentation issues of in-place update by simply reading through the entire index and writing out the updated index as it goes. The problem with this approach is that as the on-disk index becomes large compared to the in-memory index, it has to process more and more lists that are not being updated, which is wasteful.

The insight of this paper is this: the cost of merging two indexes is roughly proportional to the sum of their sizes; consequently, it takes less time to merge new updates into a smaller index than into a larger index. We can take advantage of this by creating a hierarchy of indexes on disk, with each one having a maximum capacity r times bigger than the previous one. We always merge the in-memory index into the smallest one. Whenever an index at one level fills up, it is merged into the index at the next level. Each merge is done using the remerge update technique described above, and queries are run against all indexes independently. The parameter r represents a critical tradeoff between index update time and query performance – in experiments, queries were 70% faster for r = 12 than for r = 2,  but total index construction time was 3.5 times faster for r = 2 than for r = 12. The authors call this technique geometric partitioning.

Is that it? Well, that’s the most important part. Additionally, relying on B-trees to efficiently manage the vocabulary wasn’t a safe bet, and they need to be structured with geometric partitioning as well, as described in section 5.4. Nevertheless, this is a strikingly simple technique that has the potential to greatly expand the scope of indexing applications.

As with most recent results, there also appears to be considerable room for improvement and exploration – for example, could a superior disk-based data structure achieve simultaneous good query performance and index construction time? Could these ideas be used to define a new general disk-based lookup technique superior to B-trees? Is there an analogous method that relies on the in-place update instead of the remerge update as a subroutine? How would a system like this function in a distributed environment? It’s interesting to me to note that many of the same problems faced in in-place update are also encountered in the context of in-memory dynamic arrays, where some sophisticated techniques allow constant-time update with less wasted space (e.g. Brodnik et al’s “Resizable Arrays in Optimal Time and Space”). It will be interesting to see in which direction the study of online index construction goes.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Information retrieval | 1 Comment »

Virtual Time

Posted by dcoetzee on April 23, 2009

Citation: David R. Jefferson. Virtual time. ACM Transactions on Programming Languages and Systems, 7, 3 (Jul. 1985), 404-425. (PDF)

Abstract: Virtual time is a new paradigm for organizing and synchronizing distributed systems which can be applied to such problems as distributed discrete event simulation and distributed database concurrency control. Virtual time provides a flexible abstraction of real time in much the same way that virtual memory provides an abstraction of real memory. It is implemented using the Time Warp mechanism, a synchronization protocol distinguished by its reliance on lookahead-rollback, and by its implementation of rollback via antimessages.

Discussion: This 1985 paper introduced virtual time, a concept that allows a distributed system to be organized around a linear global clock; rather than maintain a synchronized clock, it achieves efficiency by having each node maintain its own local virtual time and performing rollback when a node receives a message “in the past.” Although not widely adopted, it has served as an influential model of a general system with optimistic concurrency and rollback.

To motivate the concept, let’s for a moment consider a magical distributed system with the following properties:

  • The system is a set of nodes, each capable of doing any amount of local processing instantaneously.
  • Each node can choose at any time to send a message to any other node. Moreover, it can specify precisely when that message will be received – the only constraint is that it cannot be received in the past.

Such a system is capable of intuitively describing many different distributed systems. For example, you could have a distributed simulation (such as a physical simulation) in which each message tells a node to simulate a particular kind of event, and the times at which the messages arrive correspond to the times at which the events they simulate occur. Consequently all events are simulated in order. You could ensure that the transactions of a database system are committed in order by assigning a time to them indicating when they’re each supposed to atomically occur. Finally, most simply, you could create a system in which all messages are received in order, by merely setting the received times to be the same as the sending time. (These scenarios are based on section 5 from the paper.)

In real life, of course, we don’t have instantaneous processing, and we can’t control when messages will be received. The concept behind this paper is to get rid of “real time”, and replace it by a new sort of time called virtual time (by analogy with virtual memory) that we have more control over. We can control the rate at which virtual time advances, and in fact time it can advance at different rates at different nodes in the system: each node keeps a local clock. To control the time at which messages are received, we introduce a queue at each node that queues up all received messages and does not process them until their virtual time arrives.

Not having to agree on a global virtual time greatly decreases the communication cost, but the problem with this flexibility is that once the local clocks have gotten out of sync, some strange scenarios can arise: if node A is lagging behind node B, it may send a message to B that arrives “in the past” from B’s perspective. But B may have already taken action based on the fact that it didn’t receive that message, including sending new messages to other nodes.

There are a number of solutions to this problem. The most obvious one is to have each node wait until its clock is the furthest in the past of all nodes before proceeding. This corresponds to pessimistic concurrency, and is the equivalent of having a global lock in a threaded program, with all the same consequences of low concurrency and poor throughput. The one taken by Jefferson is a form of optimistic concurrency: each local clock simply jumps to the received time of the next waiting event, so that the node is never idle, maximizing throughput. If it ever receives an new message “in the past,” it does a rollback to a point in time before that message was received and then proceeds forward again. To facilitate this it takes periodic snapshots of its state, tagged with virtual times. This is all transparent to the node’s program; from its perspective, virtual time never decreases. This is unlike, say, transactions, where a failed transaction needs to be detected and retried.

The big remaining problem is that during rollback there are certain things that can’t be undone locally, most importantly the sending of messages to other nodes. There are, again, multiple ways of dealing with this, but the one used by Jefferson is that whenever a message send is rolled back, an “antimessage” indicating this is sent to the same node with the same received time. At the destination, if the corresponding message is still queued, they collide and eliminate one another. If the corresponding message has already been processed, the antimessage arrives in the past and causes that node in turn to rollback and not receive the original message. One requirement for implementing this is that all nodes keep a queue of messages already processed.

A nontrivial problem is to show that the system as a whole makes progress amidst all these cascading rollbacks. Another serious problem is the performance issue of how to prevent the nodes from running out of memory from storing unbounded queues of messages and state snapshots. To deal with these issues, Jefferson defines the concept of global virtual time (GVT), which is the earliest virtual time at which anything in the system is currently happening. Because no node can send a message into its own past, no message can be received before the GVT, and no node will ever have to rollback to a time before the GVT. Consequently, any state snapshot or already-processed message marked with a virtual time prior to GVT can be discarded, in a process Jefferson calls “fossil collection” (this is oddly prophetic, as the term “garbage collection” was not coined until 1994 by Kaushik Ghosh). Additionally, it can be shown that, even though individual local clocks may roll back, GVT never decreases. As long as no message is lost and no node blocks indefinitely, it will eventually increase, which is enough to guarantee that the system makes progress. Eventually, all message queues will be empty, and the system will terminate.

The main downside to a virtual time system is that it fundamentally relies on the assumption of what the author calls “temporal locality” – that messages won’t arrive in the past too often. Less obviously, it also depends on the efficiency of the implementation of state snapshotting, message queuing, and determining the global virtual time. Because virtual time is such a general framework, applying to distributed systems from multiprocessors to databases to LANs, this empirical analysis depends very much on the specific implementation. Evaluations of a highly-optimized practical implementation on multiprocessors were done in a series of papers by Richard M. Fujimoto (“Time warp on a shared memory multiprocessor” (1989), “The effect of memory capacity on Time Warp performance” (1993), “An adaptive memory management protocol for Time Warp parallel simulation” (1994), “An Empirical Evaluation of Performance-Memory Trade-Offs in Time Warp” (1997)). I’m not aware of any thorough evaluations in other scenarios.

The concept of virtual time was later generalized by Friedemann Mattern in his “Virtual Time and Global States of Distributed Systems,” which instead of using a totally ordered linear virtual time relies on a partially ordered virtual time (Parallel and Distributed Algorithms, 1989, 215-226, PDF). The motivation is to eliminate unnecessary ordering constraints imposed by the total order, at the cost of some conceptual complexity. Jefferson briefly hinted at this possibility (“[v]irtual times may be only partially ordered”).

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Distributed systems | Leave a Comment »

Content-Based Image Retrieval at the End of the Early Years

Posted by dcoetzee on April 21, 2009

Citation: Smeulders, A.W.M, Worring, M, Santini, S, Gupta, A, Jain, R. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 22, no 12, pp.1349-1380. Dec 2000. (PDF)

Abstract: The paper presents a review of 200 references in content-based image retrieval. The paper starts with discussing the working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap. Subsequent sections discuss computational steps for image retrieval systems. Step one of the review is image processing for retrieval sorted by color, texture, and local geometry. Features for retrieval are discussed next, sorted by: accumulative and global features, salient points, object and shape features, signs, and structural combinations thereof. Similarity of pictures and objects in pictures is reviewed for each of the feature types, in close connection to the types and means of feedback the user of the systems is capable of giving by interaction. We briefly discuss aspects of system engineering: databases, system architecture, and evaluation. In the concluding section, we present our view on: the driving force of the field, the heritage from computer vision, the influence on computer vision, the role of similarity and of interaction, the need for databases, the problem of evaluation, and the role of the semantic gap.

Discussion: This extensive survey reviews and classifies content-based image retrieval systems from 200 early publications and discusses fundamental problems in the area.

If you go to Google Image Search today and type in search keywords, some searches work very well – searches for which the images are clearly labelled in the source webpage with those keywords, such as “Homer Simpson.” But when the person who published the content didn’t think to mark it with the specific term you’re looking for – such as “man looking at a girl” or “paintings of people wearing capes” – it falls apart rapidly. Stock photography sites like Getty Images try to compensate for this by paying people to tag images with every keyword they can possibly think of – needless to say, this is an expensive proposition, and one that still fails to satisfy many useful queries.

We can approach something close to an ideal image search using people: we give them a list of images and explain our query, and they exhaustively examine every image to determine if it satisfies the query. A person does not need to rely on surrounding text; they can look at the image itself, identify objects and their relationships, and incorporate cultural context and application-specific background to resolve ambiguities and identify actions and states. These are the sorts of tasks people are particularly good at.

Content-based image retrieval (CBIR) attempts to respond to image queries as a human would, subject to limitations of feasibility and performance. It relies on image content, using techniques from computer vision to interpret and understand it, while using techniques from information retrieval and databases to rapidly locate images suiting a specific query. Since its inception it has exploded into a field in its own right, with hundreds to thousands of papers each making its own tradeoff between the many variables involved.

A natural-language query to an image search engine rapidly runs into issues of intractability due to our very limited progress on the problem of natural language processing and the need for vast common-sense knowledge about the world to process many queries. For example, consider the rules you might use to determine if an image depicts an “indoor scene” or a “frightened person.” The authors of this paper label this problem the semantic gap. Rather than tackle these problems, most CBIR systems rely on alternate user interfaces, such as search by assocation, search by example, and search by sketching, all of which involve using images to search for other related images. Often this search process is iterative: at each stage, the user clicks an image “more like” their target image, refining the set of candidates. You can see a demonstration of a preliminary version of such an interface at the new Google Labs Similar Images.

There are a number of important variables which strongly influence the design of a CBIR system. These include:

  • The scope of the domain: How are the images constrained? How are the queries constrained? If the images can be taken under controlled conditions, and queries are easy to predict, the problem often admits a highly accurate domain-specific solution. At the other end of the spectrum is general image search of all online images.
  • The type of search activity: Will the user typically be looking for a specific image? Or just browsing? Or do they want to classify or categorize an image collection?
  • What is the user interface?
  • Performance constraints: how quickly must queries be answered?

Since the semantic gap is considered insurmountable, the most fundamental limitation in practical CBIR systems is what the authors of this paper call the sensory gap: “the gap between the object in the world and the information in a (computational) description derived from a recording of that scene.” In other words, it’s figuring out what you’re looking at based only on an image. In narrow domains with images taken under fixed conditions, this can be attainable even using simple schemes. In a general domain like web image search, this is still considered an insurmountable problem.

One of the main reasons the sensory gap is insurmountable in general domains is the issue of segmentation: separating the image into regions, each corresponding to a particular object. Doing this perfectly is called strong segmentation; this is infeasible or impossible in a general setting, particularly in the presence of occlusion (one object in front of another one). An alternative – weak segmentation – only identifies part of the region corresponding to an object, and hopes it is enough to identify the object. A third approach, accumulation, does not rely on segmentation at all: it computes a function across the entire image, which is designed to be insensitive to variations in the part of the image not corresponding to the object. Color indexing, which I described in a previous post, is an example of accumulation, where the color histogram over the image, corrected for lighting conditions, is used to identify the object it depicts.

Generally, object identification relies on a great deal of imprecise ad hoc knowledge about general image characteristics. For example, as a convention in most images the horizon is at about the same position – comparison of object locations to the horizon allows their actual size to be estimated. Much of the same physics of light, reflectance, geometry, and texture that comes into play in computer graphics is useful here in separating the object description from the environment description such as lighting and viewing angle. The rough shapes of objects can be identified with edge detection and invariant transforms to account for changes in viewing angle. Finally, cultural conventions such as the use of certain symbols in diagrams or the existence of perpendicular angles indoors are frequently exploited. In some cases, rather than writing rules for object recognition by hand, a large training set of images labelled with tags is used for statistical learning of image features (as with eigenfaces).

To facilitate efficient lookup, CIBR systems do not perform processing on each image for every query. Instead, they generally map an image to a feature vector, an array of values that succinctly describes the semantic objects in the image and their relationship, whether explicitly or implicitly. These are stored in a database. Lookup can then be done by comparing the feature vector of the query image to the feature vectors of the images in the database, a type of query the database supports efficiently.

A big challenge in CIBR systems has been evaluation: given two systems purporting to solve the same search problem, which one does a better job? In traditional search engines this problem is attacked with the notions of precision and recall, but these depend on a reliable notion of whether or not an image is relevant, and do not take into account the notion of iterative refinement of results. In 2000, it was typical to rely on derived measurements such as the time to complete tasks with access to the tool.

This discussion is but a summary of an already-dense survey, so if you’re interested in CBIR, I recommend consulting the survey for details. After more than a decade, there are increasing signs of wider deployment of CBIR systems both in limited domains and to the general public – don’t be surprised if you end up using one soon.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Artificial intelligence, Information theory and compression | 9 Comments »

A Mathematical Theory of Communication

Posted by dcoetzee on April 12, 2009

Citation: Claude E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, July, October, 1948. (PDF)

Abstract: The recent development of various methods of modulation such as PCM and PPM which exchange bandwidth for signal-to-noise ratio has intensified the interest in a general theory of communication. A basis for such a theory is contained in the important papers of Nyquist and Hartley on this subject. In the present paper we will extend the theory to include a number of new factors, in particular the effect of noise in the channel, and the savings possible due to the statistical structure of the original message and due to the nature of the final destination of the information.

Discussion: This seminal 1948 paper by Claude Shannon first introduced many of the central ideas and terminology of the then-nascent field of information theory, including entropy, channel capacity, the noisy-channel coding theorem, Shannon-Fano coding, language modelling, and Shannon-Fano-Elias coding, an early version of modern arithmetic coding.

A communication system is a system that sends data from a source to a destination over a channel. In the 1940s, communication systems such as radio, telephony, and telegraphy systems had been in use for many decades, and were effective, but had little theoretical basis. Simple questions remained unanswered, such as “how fast can I send a particular type of information over a particular channel?”, “what is the best way to represent the data while sending it over the channel?”, and “how do I measure how much information is represented by a piece of data?” This single sweeping paper answered these questions, both for discrete channels (designed to send a sequence of discrete symbols such as bits) and for continuous channels (designed to send continuously-varying signals such as audio or video signals).

The first point emphasized by this paper is how information should be measured: there are two approaches to this. One is to represent the amount of information by the number of possible values that the datum could represent; for example, a 32-bit register can store 4,294,967,296 distinct values. Although it is possible to describe information in these terms, it becomes rapidly unwieldly, and does not correspond to intuition (for example, one would expect two CDs to store twice as much information as one CD, not the square). Shannon chose to use a logarithmic measure, and to represent it in terms of “binary digits or bits“, a word popularized by this paper. This does lead to some counterintuitive results: for example, a piece of information taking on 10 distinct values stores about 3.3 bits of information, and it seems strange to talk of fractional bits.

With this measure, the capacity of a discrete channel, which transmits a sequence of discrete symbols such as bits, is defined as log N(T)/T as T approaches infinity. Here, N(T) is the number of possible messages one can transmit in time T. The source itself is approximated by a Markov process, which generates a sequence of symbols while using a fixed finite amount of state to assign probabilities to which symbol will be generated next.

The Markov process model is surprisingly general: many different models of an information source can be constructed as Markov models, ranging from the trivial one that just treats each character independently and assigns the symbols uniform probability, to one that examines several previous words in an English text to determine what the next word is most likely to be. Designing a Markov process that approximates the behavior of a real information source, like an English text document or an image, is a large part of the problem of language modelling, a practice with diverse applications. In this paper, armed with nothing but a book of random numbers, Shannon emulates four different simple language models for English text and shows subjectively their increasing similarity to actual English text.

Shannon posits a measurement of information or uncertainty that satisfies some basic axioms called entropy, denoted by H, a term borrowed from statistical mechanics. This is defined for a discrete source with n symbols as:

-sum(pi log pi) as i goes from 1 to n

For example, if you have a single bit that is either zero or one with probability 1/2, the total information conveyed by it is:

-[(1/2) log(1/2) + (1/2) log(1/2)] = 1 bit.

If on the other hand the bit is zero with probability 7/8, then it conveys much less information:

-[(7/8) log(7/8) + (1/8) log(1/8)] = about 0.164 bits.

One way of looking at this is that a sequence of k bits each independently produced with these probabilities can (with high probability) be compressed down to a sequence of slightly more than 0.164k bits. As k becomes larger, this bound can be approached more closely.

Shannon defines a channel’s capacity as the number of bits it can transfer per second – for a noise-free discrete channel where each symbol is the same length, this is trivial and is just a constant. Shannon’s noiseless channel coding theorem asserts that if a channel has capacity C and a source has entropy H, we can transmit only up to C/H symbols per second. He gives an explicit construction of a code capable of doing such a transmission, later dubbed Shannon-Fano-Elias coding, which is a near-optimal code very similar to the modern arithmetic coding that forms the basis of nearly all modern data compression methods. Conversely, to show that this rate cannot be exceeded, he shows that essentially any operation that can be done on a sequence of symbols in finite memory can only lower, not increase, its entropy; hence one cannot recover more any more information from the output of the channel than it already has, which is determined by the capacity.

When using a noisy channel, where the channel can modify symbols in transit, the situation is different. He presents an abstract model in which an “observer” machine watches both the input and output of the channel, and then sends “correction” data to the receiver, with the aid of which the receiver can perfectly reconstruct the input. The minimum capacity of the channel required for this correction data to be transmitted is called the equivocation, and denotes the uncertainty in the received data due to noise. The capacity of a noisy channel is defined as:

C = Max(entropy of input – equivocation), with maximum taken over all input sources.

By the use of forward error correction, or redundancy in the transmitted information, Shannon showed the surprising result that with this definition, one can transmit information at a rate of up to C/H symbols per second, with an arbitrarily low probability of errors. This cannot be exceeded; if one eliminates redundancy in an attempt to increase the transmission rate, the increase in errors will eliminate any added benefit. Shannon’s proof demonstrates this constructively by presenting a large class of theoretical error-correcting codes that achieve the desired bound in the average case, meaning most of them are near-optimal. On the other hand, none of these codes are practical for real use. He gives an isolated example, a Hamming code also designed in the 1940s, to demonstrate a more practical error correction code, but a more complete family of practical error correction codes, such as the Reed-Solomon error correction popular today, would have to wait for later research.

Shannon’s model was very general and admitted possibilities that today would be considered unnecessary for most practical scenarios. For example, he allows different symbols to require different lengths of time to transmit, and allows noise to affect different symbols differently. On the other hand, he also makes some important restrictions on the statistical behavior of the source that are often overlooked, the most notable of which being that it must be ergodic: this means that the long-term behavior of any one infinite character sequence must have the same statistical properties as the average over all possible sequences.

It’s easy to give an example of a set of sequences for which this is not the case: suppose we have a set of sequences where every other character is a, and the remaining characters are b, but we can start with either one. Since half of these sequences will have a’s in the odd positions, and half of them have a’s in the even positions, the character a occurs with equal probability in every position over all possible sequences. Yet in any particular infinite sequence, it occurs in half the positions with probability 1 and the other half with probability 0. Ergodic theory is a rich and complex mathematical theory only lightly touched upon in this work, but ergodic sequences form one of the fundamental assumptions of Shannon’s work.

The second and more mathematically sophisticated section of this paper, often overlooked today by computer scientists, deals with transmission of continuously-varying signals such as audio or video signals. In the 1940s this was of greater relevance, since most communication systems dealt with this type of signal; today, many of them are being replaced by digital systems. However, even these discrete signals must eventually be translated into continuous signals to be sent over wires, whether electrical or optic fiber, and the ability to analyze the action of noise on these signals directly rather than only at the higher level of discrete noise allows for a more precise analysis.

Shannon’s treatment of continuous signals and the resulting theorems are largely the same as the discrete case, but in the limit as time units become small. Entropy is now an integral instead of a sum, and is maximized when the distribution is Gaussian (producing a white Gaussian noise signal). Entropy is more complex to work with in this setting, because it depends on the choice of coordinate system. He shows closed-form results for a few cases where the average power of the input signal is bounded, and the case where the peak or maximum power is bounded, both with a simple additive white Gaussian noise (AWGN) model.

Additionally, the concept of “lossless” transmission becomes meaningless in this setting: a continuous signal carries an infinite amount of information, only a finite amount of which can be recovered in the presence of any noise in the channel. Instead of discussing the probability with which the signal can be reconstructed perfectly, he describes a function v that places a limit on how close the input and output functions are, and then defines capacity subject to this limitation.

Sometimes “A Mathematical Theory of Communication” is credited for single-handedly founding information theory. Although it was influential, this is an overstatement – it was essentially a generalization of the 1928 foundational work of Harry Nyquist (“Certain factors affecting telegraph speed”), which described the Nyquist rate and allowed Shannon to represent continuous signals in an essentially discrete way, and Ralph Hartley (“Transmission of Information”), which was the first paper to attempt to measure the information that could be put through a channel (although its conclusions were comparatively conservative). It also adapted the terminology and techniques of statistical thermodynamics as described by Boltzmann and Gibbs in its description of information entropy. Shannon’s inspiration also came from new developments in signal coding techniques, such as pulse-code modulation, which was used by SIGSALY to digitize speech signals for encryption during World War II. The potential for differences in coding techniques to yield improvements in transmission rate raised questions about the fundamental limitations of transmission, creating an environment conducive to such a work.

Conversely, this paper was not the last foundational work of information theory. It did not begin to explore applications to compressed, reliable data storage (which can be seen as sending data over a channel “through time”), nor algorithmic information theory, which examines the computational limitations of information, in particular Kolmogorov complexity. The primitive language models it mentions would be replaced by sophisticated ones based on a variety of statistical techniques such as smoothing, and a huge variety of lossless and lossy codes would be described not only for English text but for quantized signals such as digital images and digital audio, forming the basis of modern standardized formats like ZIP, JPEG, and MP3 files.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Information theory and compression, Networking | 2 Comments »

IP = PSPACE

Posted by dcoetzee on April 4, 2009

Citation:  Adi Shamir. IP = PSPACE. Journal of the ACM 39, 4 (Oct. 1992), pp.869-877. (PDF)

Abstract: In this paper, it is proven that when both randomization and interaction are allowed, the proofs that can be verified in polynomial time are exactly those proofs that can be generated in polynomial space.

Discussion: PSPACE is the set of all problems solvable by a Turing machine in polynomial space (i.e., using polynomially many bits of memory). This 1992 paper by Adi Shamir provides an alternate characterization of PSPACE in terms of interactive proof systems.

These two graphs are isomorphic

These two graphs are isomorphic

To begin with we’re going to describe a problem called the graph isomorphism problem. A graph is a set of objects where some pairs of objects are linked or associated. For example, our object set might be {1,2,3} and the pairs {1,2} and {1,3} are associated (but not {2,3}). Graphs are frequently drawn as dots representing objects connected by lines representing the associations, as shown to the right. The objects are referred to as vertices and the associations between them as edges.

The second graph shown to the right looks just the same as the first graph, and is structurally identical, but the names of the vertices {1,2,3} have been shuffled around. Two graphs that have this relationship are called isomorphic. The graph isomorphism problem asks whether, given two graphs, those graphs are isomorphic.

The simplest and most common way to explain the complexity class NP is in terms of verification: the class NP is the set of problems such that, given an instance of the problem and a candidate solution, you can verify in polynomial time whether or not that solution is correct. The graph isomorphism problem is in NP because in order to verify that two graphs are isomorphic, all you need to know is how the names were changed around. For example, in the example to the right all you need to know is that 1 was changed to 2, 2 was changed to 1, and 3 remained 3.

One way of looking at this definition is as a communication between two parties. Say you have two computers hooked up over a network, a magical supercomputer that can solve any problem instantly (the prover), and an ordinary computer that can only run polynomial-time algorithms (the verifier). The prover is powerful, but the verifier is skeptical: it will not believe a solution exists unless the prover gives it a correct solution that it can check on its own. This system obeys two important properties:

  1. If a solution exists, the prover can give the verifier a proof of this that the verifier can verify in polynomial time.
  2. If no solution exists, then regardless of what the prover does, the verifier will not be convinced.

Notice that this is asymmetric: if two graphs are isomorphic, the prover can prove it to the verifier by telling it how to rearrange the names of the objects; but it two graphs are not isomorphic, there’s no known way for the prover to prove this to the verifier. This is called the graph nonisomorphism problem and is not known to be in NP. But what if we could generalize this system to one that can prove both of these? There are two key extensions we must make:

  1. We allow the verifier and prover to communicate back and forth as much as they want before the verifier makes its final decision. Each exchange of messages is called a round.
  2. We allow the verifier to use a source of random bits.

This generalized system is called an interactive proof system. How does this let us prove that two graphs are not isomorphic? It works like this:

  1. Suppose the input graphs are G and H. First the verifier generates m random bits. For each random bit, it generates a new graph: if the bit was zero, it generates it by randomly changing around the vertices of G, and if the bit was one, then it generates it by randomly changing around the vertices of H. It sends all these graphs to the prover.
  2. The prover looks at each graph and determines whether it’s isomorphic to G or to H, and sends its answers back to the verifier.
  3. The verifier checks the prover’s answers to see if they’re correct. If so, it is convinced that the graphs are not isomorphic.

The key here is that if G and H are isomorphic, then all the graphs are isomorphic to both of them, and there’s no way the prover can reliably distinguish between them. The prover might still guess the right answer on every graph – but without any clues, its chances of doing this are 1 in 2m, which are pretty long odds. If the graphs are not isomorphic, then with its infinite power the prover can easily get all the answers correct. Notice here that we have to change the definition of acceptance a bit: before it was not possible for the verifier to accept incorrectly, whereas now it is just very unlikely. This protocol is discussed in Goldreich et al’s “Proofs that Yield Nothing But Their Validity” (1991).

The ability of this sytem to solve problems not known to be in NP is pretty interesting, but the natural question to ask is, can it solve other problems too, perhaps even harder problems? How far can it go? The set of all problems that can be solved by an interactive proof system is called IP, and (as you might have guessed from the paper title) it is able to solve precisely the same set of problems that be solved by an ordinary machine in polynomial space.

The easier direction is to show that PSPACE contains IP: that is, anything that can be shown by an interacting prover and verifier can be shown in polynomial space. Although the prover is infinitely powerful, it is deterministic, and it can only produce polynomial-sized messages (or else the verifier couldn’t read it all). Hence, for each round, it can be regarded as a fixed function from the set of polynomial-length strings to the polynomial-length strings. In polynomial space with no time restriction, we can brute-force iterate over all possible such prover functions and calculate the probability that each one will lead to acceptance (by iterating over all choices of random bits). We then pick the optimal prover that leads to acceptance most often over all choices of random bits; if it leads to acceptance more than 2/3 of the time, we accept, otherwise we reject.

The more difficult direction is showing that IP contains PSPACE: that is, anything that can be computed in polynomial space, can be computed by an interactive proof system. To do this, we choose a particular PSPACE-complete problem and demonstrate an interactive proof system protocol for it; if it can solve this problem, it can solve all problems in PSPACE. The problem we choose is the true quantified boolean formula (TQBF) problem. This problem asks, given a formula over boolean (true/false) variables using AND, OR, NOT, and the quantifiers for-all and there-exists, is the formula true?

The key insight lies in an idea called arithmetization. The basics of this idea will be familiar to C programmers: we change the boolean variables into integer variables, letting zero correspond to false and nonzero correspond to true. The operation x AND y becomes x times y, and the operation x OR y becomes x plus y.  Because the formula is given in a canonical form where only variables are negated directly and not expressions, NOT x can be done as (1-x). Finally, quantifiers: we take the subexpression covered by the quantifier, assign zero to the variable being quantified over, and evaluate it. Then we assign one to the variable being quantified over and evaluate it again. Finally we take the sum (for there-exists) or product (for for-all) of these two values. The result is an arithmetic expression that is nonzero if and only if the quantified boolean formula is true. Moreover, its value (and the values of all subexpressions) is O(2^(2^n)), so it can be stored in polynomial space and manipulated in polynomial time.

For the prover to prove that an arithmetized formula is nonzero, first it chooses a prime p and all computation will be done mod p. It can be trusted to do this because a bad choice of p may make the result zero instead of nonzero, but not the other way around. Next, it tells the verifier the value mod p of the formula, called a. But the verifier won’t believe this by itself. To prove it, the prover first removes the outermost quantifier, leaving a polynomial in the one free variable. It reduces this polynomial to its simplest form; again, we rely on a canonical form of the QBF to ensure that the degree of this polynomial will not be too large. This polynomial q and the value a of the formula are handed to the verifier, who can easily verify that q(0)q(1) or q(0) + q(1) (depending on the type of quantifier) has the desired value a. Next, the verifier chooses a random value z < p for the free variable, substitutes it, computes the value q(z) of the remaining expression, and sends z to the prover. The prover removes the next quantifier and produces a new polynomial, and this repeats until all the quantifiers are gone and the verifier can verify the remainder by itself. Because the prover can’t anticipate the random values the verifier will choose for the variables along the way, it has very little chance of guessing the right nonzero value for the entire formula up front.

IP = PSPACE was a theoretically shocking result because it was one of the first major results in complexity theory that does not relativize. In simple terms, to say that a result relativizes means that it can be proven using a restricted set of proof techniques that includes most classical methods; in 1992, it was a well-known result that these classical techniques could not show that IP = PSPACE. This was presumed to be evidence that in fact IP ≠ PSPACE. In reality, all it meant was that new techniques had to be constructed that escaped the bounds of relativization. In the wake of this result Lance Fortnow wrote a paper “The Role of Relativization in Complexity Theory” revisiting relativization and its benefits and limitations (Bulletin of the European Association for Theoretical Computer Science, vol. 52, pp. 52-229, 1994.  Citeseer).

It was shown long ago that relativizing techniques cannot prove whether P = NP. At the time of IP = PSPACE, there was hope that perhaps arithmetization, as a fundamentally nonrelativizing technique, might be able to resolve this longstanding open question. Unfortunately, in 2008 it was shown that a broader class of techniques called algebrizing proofs, which include the technique of arithmeticization, cannot solve P vs NP either (S. Aaronson and A. Wigderson. Algebrization: A New Barrier in Complexity Theory, in Proceedings of ACM STOC’2008, pp. 731-740).

IP = PSPACE was just the first step on a long road of surprising and valuable results involving interactive proof systems. A related result proved soon thereafter was MIP = NEXPTIME: that is, if there are two provers, each communicating with the verifier but not each other, the set of problems that can be solved is the set of all problems whose proofs can be verified in exponential time, a very large class indeed. The next goal was to try to “scale down” these techniques to prove results about NP. This culminated in the famous PCP theorem, which in its best form states that:

PCP(log n, 3) = NP

What this means is the following: suppose that rather than a prover, the verifier is given a polynomial-length proof string with random access to any bit location. The verifier gets O(log n) random bits, and is only allowed to access 3 bits of the proof of its choosing (yes, just 3). This surprising theorem states that for any problem in NP, given a problem instance, if there is a solution then there is a proof string that will cause the verifier to accept with high probability; and if there is no solution then the verifier will reject with high probability regardless of the proof string. The PCP theorem forms the heart of the theory of inapproximability, in which it is used to show that for a variety of NP-hard optimization problems, there is no way to get a solution “close to” the optimal solution unless P = NP. It also leads to a philosophical revision of the idea of what a “proof” is: the idea that a proof can be made so robust that it can be verified with high probability by examining it in only a few locations is a departure from the traditional proofs that require careful exhaustive verification for correctness.

One interesting way of reframing the result IP = PSPACE raised by Madhu Sudan (ref) is in terms of games. Finding an optimal next move in generalized n by n versions of many two-player games has been proven PSPACE-complete, such as Tic-Tac-Toe and Reversi (see Game complexity). The result IP=PSPACE effectively asserts that for every such game, there is an “equally difficult” game where one of the players is replaced by a polynomial time algorithm and dice rolls, which can easily be implemented on a computer. It would be interesting to see how this concept could be exploited in game design.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Computability and complexity | Tagged: , , , , | 2 Comments »

The Byzantine Generals Problem

Posted by dcoetzee on April 1, 2009

Citation: Lamport, L., Shostak, R., and Pease, M. 1982. The Byzantine Generals Problem. ACM Trans. Program. Lang. Syst. 4, 3 (Jul. 1982), 382-401. (PDF)

Abstract:Reliable computer systems must handling malfunctioning components that give conflicting information to different parts of the system. This situation can be expressed abstractly in terms of a group of generals of the Byzantine army camped with their troops around an enemy city. Communicating only by messenger, the generals must agree upon a common battle plan. However, one or more of them may be traitors who will try to confuse the others. The problem is to find an algorithm to ensure that the loyal generals will reach agreement. It is shown that, using only oral messages, this problem is solvable if and only if more than two-thirds of the generals are loyal; so a single traitor can confound two loyal generals. With unforgeable written messages, the problem is solvable for any number of generals and possible traitors. Applications of the solutions to reliable computer systems are then discussed.

Discussion: This 1982 paper introduced the Byzantine Generals Problem, an abstract story problem that launched the field of Byzantine fault tolerance. A system is said to be Byzantine fault tolerant if it is resilient to unpredictable and even malicious behavior of faulty or compromised component systems.

For example, in a distributed computing system, a single computer may be taken over by a hacker, who can then attempt to trick the entire system into doing their bidding. In high security or safety-critical systems, such as spacecraft, weapon systems, or voting machines, there are often multiple independent systems built by multiple vendors with no interaction; they cannot rule out the possibility that the employees of one of these vendors furtively included code to attempt to circumvent the system. And this doesn’t just apply to malicious behavior: many software bugs will cause a system to behave incorrectly without necessarily crashing. Obviously if all the component systems are compromised or fail, you’re just plain out of luck – but this is much more unlikely.  The main question this paper answers is: how many compromised components do you need to compromise the integrity of the system as a whole?

To make this problem more abstract and memorable, the authors phrase it in terms of a story problem involving Byzantine generals (the Byzantine Empire or Eastern Roman Empire was the continuation of the Roman Empire in the Middle Ages). In this story, the Byzantine army surrounds an enemy city. Each general leads a single division, and the generals must come to a common decision whether to attack or retreat. They can communicate only by messenger (forging messages or killing messengers is not allowed). However, it’s well known that some of the generals are traitors. To deal with this, the generals decide to vote on whether to attack or retreat.

Let’s say we have three generals, named Armatus, Belisarius, and Comentiolus. Belisarius tells the others that he votes to attack, and Comentiolus tells the others that he votes to retreat. Now, Armatus, the traitor, tells Belisarius that he votes to attack, while telling Comentiolus that he votes to retreat. The result is that Belisarius ends up attacking while Comentiolus retreats, with disastrous results.

To protect against this possibility, the generals decide they need to be more suspicious of each other. Not only does each general send his vote to the other generals, but he also sends copies of the votes that he received to each of the other generals. Now it would seem that Armatus’s scheme is foiled, because Belisarius and Comentiolus will send each other messages and see that he sent them conflicting messages and is a traitor.

However, Armatus quickly devises a new scheme. He sends a message to Comentiolus saying that Belisarius voted to retreat. Since Comentiolus received an “attack” vote from Belisarius, he concludes that Belisarius is a traitor, disregards his message that Armatus told him to attack, and retreats, with the same disastrous results.

In fact, as long as they are limited to sending simple messages, there is essentially no way of resolving this dilemma. If there are three generals and at least one traitor, the traitor can manipulate the outcome of the decision. One way of resolving this is to rely on unforgeable signatures. In this case, each general signs their votes; the generals still send all the votes they receive to each other. Now Armatus cannot tell Comentiolus that Belisarius voted to retreat, because he does not have a signed “retreat” vote from Belisarius.

A different way of resolving the dilemma is to add a fourth (loyal) general, Droctulf. Now, regardless of what Armatus tells Comentiolus about Belisarius’s vote, Droctulf will tell Comentiolus the truth – that Belisarius voted to attack. Since two people said that he voted to attack (Belisarius himself and Doctulf) and only one said he voted to retreat (Armatus), Comentiolus now believes that Belisarius voted to attack.

More generally, the main results of this paper are:

  1. Simple messages: If there are m traitors, there must be at least 3m + 1 generals in order to achieve a reliable result matching the outcome of the vote.
  2. Signed messages: Any number of traitors can be handled; the generals will always agree on a plan matching the outcome of the vote.

The odd thing about this paper is that the formal results are not new; they were first shown (in greater detail in fact) in the authors’ 1980 paper “Reaching Agreement in the Presence of Faults” (J. ACM 27, 2 (Apr. 1980), 228-234, PDF). This paper does not mention the Byzantine Empire and is only 7 pages instead of 20. “The Byzantine Generals Problem” adds some new elements, such as dealing with missing communication paths, but these have received little attention. So what was this paper for? In short, it’s just good teaching – a colorful metaphor makes the problem and its solution more memorable, and reaches a larger audience, popularizing the technique and terminology.

Another interesting historical note: it will be obvious to anyone who reads this paper that “signed messages” can be implemented with public-key cryptography; is this merely a metaphor? Well, no – the first practical public-key cryptography system, RSA, was only invented in 1984 (see my post “The RSA paper: A Method for Obtaining Digital Signatures and Public-Key Cryptosystems“). If anything, it’s the reverse: this application may have helped to motivate it. In their original 1980 paper they relied on a probabilistic mechanism for authentication called “authenticators” which were subsequently abandoned, because they were only effective against random errors and not malicious adversaries.

The authors of this paper acknowledge that the number of messages that need to be exchanged to implement the full Byzantine General Problems solution is large (quadratic), which often makes it quite expensive in practice. Although the ideas were occasionally used over the years, e.g. in a clock synchronization scheme in 1991 by N. Shankar, performance remained a significant barrier to wider use. In 2000 M. Castro did a PhD thesis on “Practical Byzantine Fault Tolerance” which implemented a number of important practical optimizations for getting Byzantine fault tolerance working efficiently on a real network server, a replicated NFS server with only about 3% empirical overhead for its reliability features. Since then there has been increasing effort on adding practical Byzantine fault tolerance to a variety of real systems, such as a domain name system (2001, “A Scalable Byzantine Fault Tolerant Secure Domain Name System“), a block storage system (2007, “Low-overhead byzantine fault-tolerant storage”), a database replication system (2008, Byzantium), and a coordination service (2008, DepSpace). With much of the research in this area still new, it will be interesting to see how useful Byzantine fault tolerance is in practical commercial systems in the long run.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Cryptography, Operating systems | Tagged: , , | 3 Comments »