Papers in Computer Science

Discussion of computer science publications

Archive for March, 2009

Color indexing

Posted by dcoetzee on March 26, 2009

Citation: Swain, M. J. and Ballard, D. H. Color indexing. International Journal of Computer Vision 7, 1 (Nov. 1991), 11-32. (PDF)

Abstract: Computer vision is embracing a new research focus in which the aim is to develop vision skills for robots that allow them to interact with a dynamic, realistic environment. To achieve this aim, new kinds of vision algorithms need to be developed which run in real time and subserve the robot’s goals. Two fundamental goals are [identifying an object at a known location and] determining the location of a known object. Color can be successfully used for both tasks.

This article demonstrates that color histograms of multicolored objects provide a robust, efficient cue for indexing into a large database of models. It shows that color histograms are stable object representations in the presence of occlusion and over change in view, and that they can differentiate among a large number of objects. For solving the identification problem, it introduces a technique called Histogram Intersection, which matches model and image histograms and a fast incremental version of Histogram Intersection, which allows real-time indexing into a large database of stored models. For solving the location problem it introduces an algorithm called Histogram Backprojection, which performs this task efficiently in crowded scenes.

Discussion: This 1991 computer vision paper introduced the concept of identifying and finding objects in images using their color histograms.

In 1991, as computer vision systems were moving away from offline processing of static photographs and into real-time use by mobile robots with inexpensive cameras attached, there was a pressing need for efficient algorithms for doing simple vision tasks like identifying, finding, and tracking objects. Until the publication of this paper, most of these techniques were based on the most obvious attribute, shape recognition, but this was both computationally expensive and fragile: the slightest rotation or occlusion of an object (stuff going in front of it) could radically alter its perceived shape. Color-based recognition is more robust: the colors of an object don’t change very much as it moves, rotates, or becomes occluded by other objects. It can work with extremely low-resolution images (in one experiment, the authors got acceptable performance on 8 × 5 pixel images!). On the other hand, it’s very sensitive to the color and intensity of lighting, and the image has to be normalized to account for this. It also, obviously, has difficulty distinguishing objects that have similar colors.

The details of the system are simple: divide the space of all colors up into fairly large “buckets” – typical would be 16 buckets along each color axis (e.g. red, green, blue). For example, the RGB color (170, 24, 255) might get bucketed in the bucket (10, 1, 15). Then, for each object you want to recognize or find, take a model photo of it and count the number of pixels falling into each bucket; this is called the color histogram. To compare two color histograms, they use the Histogram Intersection operator: this takes the minimum of the two counts (from each image) for each bucket, and then adds them up. This value is normalized by dividing it by the number of pixels in the source image. Images with very similar histograms will have intersection values close to 1, whereas ones with different histograms will have much smaller values. By comparing an image to be identified against each image in a database of models, each with a precomputed color histogram, this can rapidly locate the desired image.

Locating objects is very similar: each pixel is assigned a value based on its bucket and how common that bucket is in the model image of the object being sought. Then it looks for a region containing many large values. This is facilitated by using a convolution with a circle – effectively “blurring” the image and mixing nearby values – followed by location of the pixel with the largest value.

The approach of the paper is actually more general than it sounds: virtually any image feature that you can construct a histogram of, can be applied to the same tasks using the same approach. This includes features such as local geometry, local texture, rough estimated size, and so on. The sensitivity of the algorithm to any particular feature can be tuned by adjusting the number of bucket divisions along that dimension. For example, Niblack et al’s QBIC system (1993), still in use today by IBM in DB2, uses color, texture, and shape simultaneously. In 1999 a combination of features using this approach was applied effectively to content-based image retrieval in Tao and Grosky’s “Spatial Color Indexing: A Novel Approach for Content-Based Image Retrieval” (at Citeseer).

Unsurprisingly, the paper demonstrated that the performance of color indexing is severely degraded by changes in lighting; this is one reason that the work did not appear until 1991, after work on color correction and normalization had appeared that can be used in a preprocessing step to effectively cope with these issues. This original paper only examined differences in brightness, and performed a trivial normalization of brightness values to demonstrate its advantage. Dealing with illumination invariant color matching – particularly when the light changed in color – would be the subject of several later publications such as Matas et al’s “Color-Based Object Recognition under Spectrally Nonuniform Illumination” and “On Representation and Matching of Multi-Coloured Objects” (1995).

The color indexing paper shows its age when it comes to scale: concerned only with robotics applications, they considered 66 objects to be a “large database.” Today, object identification, classification, and location is studied primarily in the context of image retrieval and image filtering, where there are frequently millions of images to consider with thousands of objects and with no control over image conditions. The “incremental” histogram intersection optimization presented in this paper – essentially, only looking at a few of the buckets with the largest counts – enables it to scale to moderate-sized databases, but not anything as large as modern applications require. Since then, more scalable approaches to color indexing databases have been developed, such as Albuz et al’s “Scalable Color Image Indexing and Retrieval Using Vector Wavelets” (2001) and hierarchical clustering based approaches such as those of Abdel-Mottaleb et al (“Performance Evaluation of Clustering Algorithms for Scalable Image Retrieval”, 1998).

The paper employs combinatorial logic, and a bit of guesswork, to argue that the number of distinct color profiles is sufficiently large to allow a very large number of potential objects to be distinguished. In Stricker and Swain’s “The Capacity of Color Histogram Indexing” (1994, Citeseer) an interesting connection between color indexing and coding theory provides a bound on the actual number of distinguishable images in a color indexing database, which they call its capacity.

However, objects are not necessarily well-distributed among these in practice; in the original experiments the test objects were relatively easy to identify and distinguish and have clear color histograms, such as household items with illustrated packaging. Humans regularly distinguish objects with nearly identical color histograms, such as different types of trees or different brands of cars of the same color; and also regularly identify objects with very different color histograms as being the same, such as a person wearing two different sets of clothes. Color indexing has not, as far as I am aware, been extended to such difficult identification problems.

A peculiar property of color indexing is that, unlike many other object identification and location methods, it has no clear analogy in human visual processing – indeed, the paper cites work by cognitive psychology researcher Anne Treisman showing that humans have demonstrated poor performance at locating objects based on their colors. I’m not aware of any new psychology research investigating the role of color histograms in object location and attention in humans.

Keith Price maintains a bibliography of papers related to recognition by color indexing.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Artificial intelligence | 2 Comments »

Reducibility Among Combinatorial Problems

Posted by dcoetzee on March 22, 2009

Citation: Richard M. Karp, Raymond E. Miller, James W. Thatcher. Reducibility Among Combinatorial Problems. The Journal of Symbolic Logic, Vol. 40, No. 4 (Dec., 1975), pp.618-619. (PDF)

Abstract: A large class of computational problems involve the determination of properties of graphs, digraphs, integers, arrays of integers, finite families of finite sets, boolean formulas and elements of other countable domains. Through simple encodings from such domains into the set of words over a finite alphabet these problems can be converted into language recognition problems, and we can inquire into their computational complexity. It is reasonable to consider such a problem satisfactorily solved when an algorithm for its solution is found which terminates within a number of steps bounded by a polynomial in the length of the input. We show that a large number of classic unsolved problems of covering, matching, packing, routing, assignment and sequencing are equivalent, in the sense that either each of them possesses a polynomial-bounded algorithm or none of them does.

Discussion: This 1975 complexity theory paper showed that a diverse set of 21 classic intractable problems are NP-complete, highlighting the importance of NP-completeness and providing the first strong evidence that P ≠ NP.

To briefly review, a problem is said to be solvable in polynomial time if there is an algorithm that can produce a solution in a number of steps that is bounded by a polynomial of the input length in bits. The set of all problems solvable in polynomial time is called P. A problem is said to be verifiable in polynomial time if, given a problem instance and a candidate solution, there is a polynomial-time algorithm to determine whether or not the candidate is a correct solution. The set of all problems verifiable in polynomial time is called NP. The NP-complete problems are a particularly difficult-to-solve subset of problems in NP having the property that if any of them can be solved in polynomial time, then all problems in NP can be solved in polynomial time.

To understand the importance of this paper it is essential to ignore much of what we know about NP-completeness today and examine the context at the time: Cook, who introduced the concept of NP-completeness, had shown that NP-complete problems exist; this by itself was surprising, in that by solving a single problem efficiently, we could solve an entire class of problems efficiently, an attractive proposition. He also demonstrated a “natural” example of an NP-complete problem: the boolean satisfiability problem. Classical progress on this problem had been such that it was not unreasonable to suppose that a polynomial-time algorithm for this problem might yet be found if we just worked hard on the problem (see my prior post on the DPLL algorithm).

Karp’s paper dashed any hope of such a simple resolution. All of the 21 problems shown to be NP-complete in this paper were well-known, well-studied problems that had seen a barrage of algorithm development from researchers for many years; many of them hoped to find a polynomial-time solution to the problem, but failed to do so. Given the two possibilities, that either all of them had polynomial-time solutions, or none of them do, it seems more natural to conclude that the latter is the case; this is equivalent to the statement that P ≠ NP, which is now widely believed to be true, although it has not been proven (see P vs. NP problem).

Simultaneously, Karp’s paper also opened the floodgates for thousands of more problems that would be shown NP-complete in the following years; because problems are shown to be NP-complete by reducing known NP-complete problems to them, the more NP-complete problems are known, the easier it becomes. This paper critically aided in bootstrapping that process.

Moving on from impact to practicum, the methods of this paper were straightforward, even for the time; this was very much a case where what was proven mattered more than how. The algorithms, relevant complexity classes, and notions of completeness are defined in the conventional manner in terms of Turing machines, with remarks that the classes are robust against reasonable changes in the computing model and input encodings, and they review Cook’s results. They list some classical problems with nontrivial polynomial time solutions, such as 2SAT, minimum spanning tree, and shortest path.

Finally, they show their main result. To show a problem NP-complete, it suffices to show that is in NP (verifiable in polynomial time) and that an existing NP-complete problem can be reduced to it, in the sense that there is a polynomial-time algorithm that can take any instance of the existing NP-complete and translate it to an equivalent instance of the new problem with the same yes/no solution. Karp defines the 21 problems and includes a diagram (pg.96) showing which problems are reduced from which other problems in the proof. Compared to a typical modern homework assignment, Karp’s proof is very sparse: he omits as trivial showing that the problems are in NP, and the reductions are specified in only two or three lines each defining a reduction function; that the functions specify a correct reduction and are polynomial-time computable is considered self-evident. He also shows that a number of problems from automata theory are NP-hard; in fact, these problems turned out to be PSPACE-complete in later work.

Finally, he lists three problems whose NP-completeness was unknown at the time, and today they are considered very unlikely to be NP-complete: linear programming, primality testing, and graph isomorphism. In 1979, the ellipsoid method provided a polynomial-time solution for linear programming. In 2002, the AKS primality test solved primality testing in polynomial time; the related factoring problem is not believed to be NP-complete either, since that would imply NP=coNP. The graph isomorphism problem does not yet have a polynomial time solution, but in 1988 Schöning showed that if it is NP-complete, the polynomial time hierarchy collapses to the second level, which is also considered unlikely.

Returning to impact, Karp’s paper may have also had the unfortunate side effect of stymying research in some areas: researchers were no longer motivated to seek polynomial-time solutions to NP-complete problems, because under the assumption that P ≠ NP, such a search is fruitless. Although investing in research that is more likely to succeed is a good plan for the individual researcher, if it turns out that P = NP, the intuition to the contrary generated by Karp’s paper may delay and impede the eventual proof of this fact.

In some cases this change in research direction extended to not working on algorithms for these problems at all. But Karp’s 21 problems are still important practical problems that need efficient solutions, if not in general then at least in practice. A familiar result from computability theory is that the halting problem is unsolvable, but for most programs that arise in practice it’s trivial to tell whether they eventually terminate or not. Similarly, many NP-complete problems have effective practical solutions: the boolean satisfiability problem is “easy on average” and modern heuristic SAT solvers can handle typical inputs with millions of clauses; problems like the knapsack problem have a polynomial time approximation scheme that allows solutions very close to the optimum solution to be found quickly; and problems like the travelling salesman problem have been attacked with great effectiveness by virtually every new optimization method, as well as new computing paradigms such as DNA computing. These approaches gain credibility from the NP-completeness of their test problems, because they have a theoretical basis with which to fend off claims that they should instead have sought a solution with good worst-case behavior. Considering slow progress on the P vs. NP problem itself, practical approaches such as this are likely to dominate the study of NP-complete problems for the foreseeable future.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Computability and complexity | 3 Comments »

Optimization by simulated annealing

Posted by dcoetzee on March 19, 2009

Citation: S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi. Optimization by Simulated Annealing. Science, May 1983, Vol. 220, no. 4598, pp. 671-680. (PDF).

Abstract: There is a deep and useful connection between statistical mechanics (the behavior of systems with many degrees of freedom in thermal equilibrium at a finite temperature) and multivariate or combinatorial optimization (finding the minimum of a given function depending on many parameters). A detailed analogy with annealing in solids provides a framework for optimization of the properties of very large and complex systems. This connection to statistical mechanics exposes new information and provides an unfamiliar perspective on traditional optimization problems and methods.

Discussion: This 1983 paper introduced the heuristic optimization technique of simulated annealing, inspired by physical simulation algorithms in statistical mechanics, and applied it to problems of hardware design and the traveling salesman problem.

Annealing is a centuries-old technique in metallurgy, once practiced by blacksmiths and now widely used in industry. It aims to regularize the crystal microstructure of the material, removing defects, which softens it and makes it easier to work. The Gibbs free energy of the material, a measure of the potential energy contained in its structure, is effectively minimized by this procedure.

To perform annealing, a metal is heated to its annealing temperature, slightly above its freezing point, and kept at that temperature for a while. Then it is allowed to slowly cool until it recrystallizes. The system can be seen as a massively parallel solver for minimizing the potential energy: at high temperature, molecules travel rapidly around the substance in search of a global minimum, but will fail to converge on a local minimum. As the substance cools, they can no longer travel as far and they begin to seek and converge on the nearest local energy minimum.

From a more algorithmic point of view, this can be modelled as a random walk on a search graph. Each node represents a solution or state and has a certain fitness value, and the nodes adjacent to each node represent other similar solutions (in the sense that their fitness value is not expected to be dramatically different). At each step the algorithm selects an adjacent node at random. It will transition to the new state if it is a better solution; it may also transition to the new state if it a worse solution, with a probability determined by the global “temperature” parameter. This behavior allows the search to escape local minimums. As the temperature drops, this becomes less likely and the search drops into a nearby local minimum.

A simple example of this is the travelling salesman problem: given a set of cities and distances between each pair of cities, find a minimum-length tour that passes through each city once and returns to the starting point. Finding the minimum-length tour is an NP-complete problem and so not likely to admit an efficient solution any time soon, but simulated annealing can effectively find a solution close to the minimum. In this case a node in the search graph represents a particular tour of the cities, and the neighbors of a node are found by reversing two adjacent cities on the path. Solution quality is determined by total tour length.

Prior to simulated annealing, the most prominent heuristic methods were based on random-restart hill climbing, where we choose a number of random starting states and then make small changes to the solution as long as those small changes improve the fitness. From the physical perspective, this is the equivalent of taking, say, a set of 100 swords, heating them each to the annealing point, and then rapidly quenching them in water and picking the one that turns out softest. It’s ineffective at locating global minimums in large problems with many parameters.

The specific parameters used in simulated annealing, for example determining the probability with which a transition to a higher-energy state should be made, were borrowed from statistical mechanics and in particular the Metropolis–Hastings algorithm for simulation of microscopic systems.

Most of the applications demonstrated in the original simulated annealing paper were not simple classical ones like TSP but rather practical hardware design problems, largely because the authors had background in this area. This proceeds in stages: first the circuits are partitioned into groups small enough to fit on chips, then the locations of the chips are selected (placement), then routing of wires between the chips is done. The main parameter being minimized is the number of signals crossing between each group of circuits.

Unfortunately, simulated annealing is also a notoriously difficult technique to analyze and requires expertise to set up – its performance depends on how the probabilities are calculated, the rate at which the temperature is lowered, the allowed transitions between states, and so on, each of which generally must be empirically determined for a specific problem.

It’s also more applicable to some types of problems than others: for example, it tends to do well for problems for which many near-optimal solutions exist and are a relatively small number of steps away from the starting random solution. Physical substances tend to have this property: it’s extremely unlikely that they would enter a perfect lowest energy configuration, but with high probability annealing will bring them to a configuration close to the lowest possible energy.

Simulated annealing also tends to perform poorly for problems where the fitness function has many “deep wells,” very low local minimums that are difficult to escape from even at high temperature. Substances like glass exhibit this property in real life: even when cooled slowly, they will tend to fail to arrange themselves in a regular crystal structure, preferring the amorphous structure of glass.

Since the original work a number of improvements have been proposed to the method. Adaptive simulated annealing (Ingber, 1989) attempts to avoid the design problem of choosing parameters like the cooling schedule by automatically choosing these as the algorithm progresses, based on how the search is behaving. The methods of quantum annealing (Apolloni et al, 1989) and stochastic tunneling (Wenzel and Hamacher, 1999) seek to escape deep wells by occasionally “jumping” across nearby energy barriers. Other stochastic optimization techniques, such as genetic algorithms and ant colony optimization, generate new solutions in ways other than local modification.

An interesting question to ask as we enter the age of parallel computing is: if we reconsider the original motivating physical problem for simulated annealing, a group of particles independently seeking minima and responding to the global temperature, could a parallel method be deduced from this whereby a large number of small circuits or processors each simulate the behavior of a single component as the global temperature decreases? I haven’t yet seen any work in this area.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Algorithms and optimization | Tagged: , , , | 1 Comment »

A Machine Program for Theorem-Proving

Posted by dcoetzee on March 14, 2009

Citation: Davis, M., Logemann, G., and Loveland, D. A machine program for theorem-proving. Communications of the ACM 5, 7 (Jul. 1962), 394-397. (PDF)

Abstract: The programming of a proof procedure is discussed in connection with trial runs and possible improvements.

Discussion: This short 1962 paper introduced the Davis-Logemann-Loveland (DLL) or Davis-Putnam-Logemann-Loveland (DPLL) algorithm, the basic algorithm for solving the boolean satisfiability (SAT) problem; it remains the foundation of all the most efficient SAT solvers used today. The paper shows it age with obsolete terminology and special attention to implementation details, making it difficult to read in some places, but the algorithm itself is straightforward to describe in modern terms.

In the boolean satisfiability problem, we are given a formula made up of three operations, AND, OR, and NOT, and a set of boolean variables, each of which may take on either the value true or false. The goal is to determine whether there is a particular assignment of true-false values to the variables that will make the formula as a whole true. For example, given this formula:

(x1 OR NOT(x2)) AND (NOT(x1) OR x2),

a satisfying assignment would be x1=true,x2=true. If a formula has no satisfying assignments, it is said to be unsatisfiable; the name of this paper comes from the useful fact that many simple theorems can be proven to be true by showing that a particular boolean formula is unsatisfiable.

The most naive algorithm for solving the problem is exhaustive search, trying every possible combination of truth values; but if there are n variables, there are 2n of these.

Another simple but more effective method is backtracking search: choose a value for one variable, plug it in the formula, and then simplify the formula. There are many ways to simplify the formula, but one obvious way is using boolean identities such as NOT true = false, NOT false = true, (x AND false) = false, and (x OR true) = true. Then choose a value for another variable and continue in this manner until the formula reduces to either true or false. If it reduces to true, a satisfying assignment has been found; otherwise, we backtrack to the last variable choice (that we have not already tried both values for) and switch it to the other value. A conceptual binary tree is constructed, with a variable at each node, and each edge representing an assignment of a value to a variable; this tree is called the search tree, and the search effectively performs an in-order traversal of it. The leaves are either true or false. If the formula is unsatisfiable, eventually it will walk the entire tree without encountering a true leaf.

The trouble with backtracking search is that if the variables are chosen in a poor order, the formula won’t reduce very much, the search tree will be large, and the runtime will still be similar to exhaustive search. To deal with this, the early Davis-Putnam (DP) method took a different approach: they observed that any boolean formula could be reduced to a canonical form called conjunctive normal form (CNF). In this form, the formula is written as an

  • A literal is defined as either a variable or its negative (NOT(x)).
  • A clause is defined as an OR (disjunction) over one or more literals.
  • The formula is written as an AND (conjunction) of clauses.

The example formula above was in this form. This form is convenient because it’s expressive enough to express many interesting constraints, but also simple enough to build more efficient algorithms for. The original techniques used by the authors to convert formulas to conjunctive normal form actually do not scale in some cases, but later it would be shown that there is a way to efficiently reduce any boolean formula to a formula in conjunctive normal form that isn’t too much larger.

Once the formula is in this form, we can improve backtracking search with an important heuristic called unit propagation: the idea is to search the formula for clauses containing only a single literal (either x or NOT(x)); these are called unit clauses. Because this literal must be true to satisfy the clause, and every clause must be true to satisfy the formula, the value of that variable is determined, so we can assign that variable a single value and simplify the formula. Unit clauses rarely appear in the original formula, but often pop up as we begin to assign some of the variables and simplify it. The effect compared to ordinary backtracking search is that the search tree has some nodes with only one child, reducing its average degree and the overall search time. This is essentially the DPLL algorithm in its simplest form.

The DPLL technique was intended as an improvement over the classical Davis-Putnam (DP) algorithm from 1960. The DP algorithm was also based on conjunctive normal form, but instead of performing a backtracking search, it eliminated a variable x by separating the clauses into three groups: the clauses A containing x (without negation), the clauses B containing NOT(x), and the clauses R containing neither. Then, the original formula is unsatisfiable if and only if (A OR B) AND R is unsatisfiable. Unfortunately, once this is reduced back to conjunctive normal form, it may become much longer; if this is done several times, the formula can become rapidly very long and unwieldly. DPLL provided an early example of a space-time tradeoff: instead of checking (A OR B) AND R for unsatisfiability, it checks first A AND R, then B AND R; both of these are already in conjunctive normal form, so there’s no formula size blowup. This was even more important on computers of the day with limited storage; computational complexity was not yet born then, so much of their discussion of this matter is informal.

Another technique introduced in DPLL that has since fallen out of favor is pure literal elimination: this says that if a variable is either never negated or always negated, it can always be assigned the value that makes all its literals true. This similarly reduces average search time, but only in cases where this occurs, which are rare in practice.

This simple scheme is amenable to many improvements: if there are no unit clauses, we still have great freedom in choosing which variables to reduce in which order; we can visit the nodes of the search tree in different orders than a simple in-order search, accelerating discovery of a satisfying assignment; we can perform more sophisticated simplification of the formula at each step, shrinking the tree size. All of these can dramatically affect overall running time, and together modern improvements like this have allowed SAT solvers to scale to formulas with millions of clauses. But because the SAT problem is NP-complete, there is no known general algorithm that is efficient on all instances; even the best methods will encounter instances on which they make poor variable choices and end up spending immense amounts of computing time searching a large search tree.

BPLL schemes were effective at the time because 1960s computers had a single CPU and were memory-starved. Many modern speedup efforts central around parallelization of the traversal of the search tree. An example of this is massively parallel SAT solvers based on FPGAs, which replicate a simple circuit many times to traverse part of the search space in parallel with many other simple solvers. As we move out of the age of increasing clock speed and into the age of parallel systems, solvers that can exploit parallelism effectively will become more important.

An interesting question to ask is: could there be an efficient variant of a scheme like the original BP that expands the formula at each step, but requires no backtracking? On modern systems with much greater memory and external storage capacities, this might be a win on certain instances. I haven’t seen any work along these lines.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Algorithms and optimization, Formal verification | Tagged: , , , , | 1 Comment »

A Case for Redundant Arrays of Inexpensive Disks (RAID)

Posted by dcoetzee on March 12, 2009

Citation: Patterson, D. A., Gibson, G., and Katz, R. H. 1988. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the 1988 ACM SIGMOD international Conference on Management of Data (Chicago, Illinois, United States, June 01 – 03, 1988). H. Boral and P. Larson, Eds. SIGMOD ‘88. ACM, New York, NY, 109-116.

Abstract: Increasing performance of CPUs and memories will be squandered if not matched by a similar performance increase in I/O. While the capacity of Single Large Expensive Disks (SLED) has grown rapidly, the performance improvement of SLED has been modest. Redundant Arrays of Inexpensive Disks (RAID), based on the magnetic disk technology developed for personal computers, offers an attractive alternative to SLED, promising improvements of an order of magnitude in performance, reliability, power consumption, and scalability. This paper introduces five levels of RAIDs, giving their relative cost/performance, and compares RAID to an IBM 3380 and a Fujitsu Super Eagle.

Discussion: This 1988 paper popularized RAID, a method for constructing one logical disk with superior performance out of many physical disks that is now widespread in commercial applications, and invented the RAID “level” terminology describing a series of successive design improvements.

In 1988, as today, I/O was the bottleneck in most practical applications; the performance of a disk, as measured in metrics such as access time and throughput, is inherently constrained by its mechanical nature. Manufacturing limitations limit the speed at which the platters can spin and the speed at which the mechanical head can be robotically moved between tracks. Despite increases in RAM capacity and advances in caching technologies, I/O only dominates an even larger proportion of a typical application’s runtime today, and at the cost of data loss in the event of system failure; this anticipated event was then called the “I/O Crisis.” Although in the modern age we are moving towards solid-state technologies for large-capacity storage, for now they remain an order of magnitude more expensive per gigabyte than disks, and so are limited to targeting special markets like laptops, rather than servers.

Early efforts to increase disk performance revolved around purely mechanical improvements: more revolutions per minute (RPM), faster head movement, more platters, denser magnetic material. These improved performance but only modestly; performance of the best disks remained within a factor of 2 of performance of cheap disks. RAID takes a different approach by observing that by having multiple disks accessing different data at the same time, mechanical delays can be shortened or avoided.

To give a very simple example, suppose you have a large read-only database used by a number of users. If two users try to access two different parts of it at once, the mechanical head will have to move back and forth between the two parts, dramatically affecting access time and overall throughput for both. But if you write copies of the database onto two independent disks, both users can read their data concurrently with full throughput without either disk having to move its head. This solution is also reliable, because if either disk fails, all requests can continue to be serviced from the remaining disk with reduced performance (a RAID in this state is today said to be running in degraded mode). Many modern systems even allow faulty disks to be replaced on the fly without an interruption in service (hot swapping), and their contents reconstructed from the other disks. They’ll also frequently feature “standby spares,” which are disks that remain available but unused until a disk in the array fails; this makes the potential window for two simultaneous failures very small.

Of course, in real life databases are not usually read-only, and we need a RAID that we can write data to as well. In the example described above, the simplest way to do this is to write to both disks at the same time. On average, this will be somewhat slower than writing to a single disk, because the write will not be complete until both disks have written it, and one may start out farther from the area being written than the other. This is one of the simplest RAID systems, referred to in the paper as “First Level RAID” or “Mirrored Disks” and today as “RAID 1.”

A different approach is to try to maximize performance and disk capacity for a single user: instead of putting the same data on both disks, put half the data on one disk and half on the other. The data is split up at the block level, so that except for very small files, half of each file will reside on each disk. Then, the combined logical disk has the capacity of both disks, and can be read of written with high throughput by having both disks access blocks of the desired file simultaneously. The more disks are added, the more quickly large files can be accessed, provided adequate transfer infrastructure is available. This allows an array of inexpensive disks with slow mechanical parts to match and even exceed the performance of expensive disks at lower overall cost.

This scheme, which has since become known as “RAID 0″ or “striped disks”, is not even seriously considered by the authors of the RAID paper because of its devestating downside: a RAID 0 array of n disks has a mean time to failure that is n times faster than a single disk. If one of the disks is lost, half of every file is lost, so nothing is recoverable. For this reason this type of RAID is only used in scenarios where performance is paramount and none of the data is important.

The next level of RAID, “Second Level RAID”, is rarely used today but illustrates the progression of ideas. A Hamming code is an error-correcting code invented for data transmission that can detect and correct a single bit error in a group of bits. For example, a (7,4) Hamming code represents each group of 4 bits using 7 bits, and if any one bit is flipped, it can detect both that the flip occurred and which bit was flipped, so that it can be repaired. Codes like this are used in reliable ECC RAM to detect and correct random bit flips caused by radiation and manufacturing defects. A RAID built around this concept would have 7 disks, 4 disks containing data and 3 containing check bits, and whenever a bit is corrupted on one disk it could detect and correct this problem, assuming that the disk is still writable in that location.

However, this requires a lot of extra disks for functionality that is effectively unnecessary: by 1988, disks were already programmed to store checksums for each block. Corrupt blocks are automatically detected and transparently remapped to other locations. Indeed, RAIDs have no need to detect problems, as that’s the disk’s job; they only need to be able to reconstruct the contents of a known disk in case of catastrophic disk failure.

This line of thinking gave rise to “Third Level RAID” (now known as “RAID 3″) where there is only one check disk, and in each location it contains the parity bit computed from XORing the corresponding bits on the other disks. If any one of these disks is lost (including the check disk), the contents of the remaining disks can be XORed to reconstruct its contents. As long as there are more than two disks, this provides better space utilization than RAID 1, since only one disk is “wasted” on redundant information. Providing the cost of computing the parity bits is negligible and the number of disks is large, performance is similar to RAID 0, since only one extra disk needs to be written to, and it does not participate in reads. The RAID lasts as long as two drives do not fail simultaneously, which is unlikely to occur in normal conditions.

RAID 3 has two main disadvantages: the first is that it is still grounded in the “one transfer at a time” motivation of RAID 0, and cannot service multiple transfers simultaneously. Fourth level RAID (RAID 4) partly addresses this by allowing small reads or writes that don’t need to access more than one block to just access the one disk they need. Writes need to update the check disk as well, but it’s not necessary to read all disks to compute parity; the new parity values can be computed from the old parity values and comparison of the old and new values of the disk block being written.

The second disadvantage of both RAID 3 and 4, and the main reason they’re rarely used in practice, is because write access to the check disk becomes a bottleneck: even if there are many outstanding small writes to different disks, they must all read and write the check disk to complete. “Fifth Level RAID”, now called “RAID 5″ eliminates this problem by striping parity information across the disks: each disk is responsible for storing an equal fraction of the parity values.

Unsurprisingly for such a simple concept, the ideas behind RAIDs actually predated the RAID paper by decades; a 1978 IBM patent entitled “System for recovering data stored in failed memory unit” described what amounted to a RAID 5 system and described . Nevertheless, this paper made a vital contribution in popularizing and motivating the method, and providing a terminology that could be used to effectively differentiate the capabilities and guarantees of RAID systems.

Single disk failures are the most common failure mode for RAIDs and the only one addressed by their design. Although the RAID paper anticipated some other failure modes RAIDs leave unhandled – such as correlated failures due to an earthquake or power outage – some issues with RAIDs were not anticipated by the RAID paper. Most notable among these is that typical RAIDs do not implement transactional semantics, so any interruption in service may leave parts of the logical disk in an irrecoverable inconsistent state – for example if one disk has completed a write but another has completed it only partially, and yet another has not started it. This is confounded further by modern on-disk caching, which may claim to have completed a write to disk when it has not.

They also don’t enjoy the simple interoperability of single disks, which can be removed from virtually any PC and attached to virtually any other, or placed in a USB enclosure for easy portable use; they instead require careful, typically vendor-specific configuration. This may be a problem that can be addressed with suitable standards.

The original RAID paper did not construct a practical RAID system – it only performed basic simulations. This explains some of its speculation about RAIDs with “1000 disks,” which is now held to be generally infeasible. A number of follow-up papers on RAIDs appeared in the early 90s constructing and benching large-scale real RAID systems; one of the most notable is Katz et al’s RAID-II (Drapeau, A. L., Shirriff, K. W., Hartman, J. H., Miller, E. L., Seshan, S., Katz, R. H., Lutz, K., Patterson, D. A., Lee, E. K., Chen, P. M., and Gibson, G. A. 1994. RAID-II: a high-bandwidth network file server. In Proceedings of the 21st Annual international Symposium on Computer Architecture, 234-244). RAID-II was a RAID of over 100 high performance disks which provided 40 GB of capacity, a staggering amount in 1994. Adjusting for Kryder’s Law, which states that disk density doubles annually, a similar array today would store roughly 1.3 exabytes.

Today RAIDs are widespread in the server market and cheap, efficient SATA drive arrays are easy for hobbyists to construct. They continue to function virtually the same as described in this paper, and as anticipated by the authors have both software and hardware implementations (generally, RAID 5 implementations are reserved for hardware due to the computational overhead of computing parity). New RAID levels have emerged: a RAID 10 (or RAID 1+0) that is effectively a RAID 1 array of RAID 0 arrays; in this mode 50% of storage is redundant, but performance is similar to RAID 0 without the computational overhead of parity calculation. RAID 6 uses an extended parity scheme to recover from two-disk failures. In the future, it’s likely that new dense, inexpensive nonvolatile memory technologies such as phase-change memory will supplant disks in all markets, while helping to cope with their disadvantages with transactional semantics; until then, RAIDs continue to provide one of the most economical solutions for high-capacity, high-performance storage.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Operating systems | Tagged: , , , | Leave a Comment »

Random Early Detection Gateways for Congestion Avoidance

Posted by dcoetzee on March 9, 2009

Citation: Floyd, S. and Jacobson, V. Random early detection gateways for congestion avoidance. IEEE/ACM Trans. Netw. 1, 4 (Aug. 1993), 397-413. (PDF)

Abstract: This paper presents Random Early Detection (RED) gateways for congestion avoidance in packet-switched networks. The gateway detects incipient congestion by computing the average queue size. The gateway could notify connections of congestion either by dropping packets arriving at the gateway or by setting a bit in packet headers. When the average queue size exceeds a preset threshold, the gateway drops or marks each arriving packet with a certain probability, where the exact probability is a function of the average queue size.

RED gateways keep the average queue size low while allowing occasional bursts of packets in the queue. During congestion, the probability that the gateway notifies a particular connection to reduce its window is roughly proportional to that connection’s share of the bandwidth through the gateway. RED gateways are designed to accompany a transport-layer congestion control protocol such as TCP. The RED gateway has no bias against bursty traffic and avoids the global synchronization of many connections decreasing their window at the same time. Simulations of a TCP/IP network are used to illustrate the performance
of RED gateways.

Discussion: This 1993 paper introduced the first widely-used active queue management (AQM) algorithm, known as random early detection (RED); this algorithm is deployed today in many commercial routers in order to manage network congestion.

First some background. A router connects two or more networks together; it receives network packets, examines them to determine which network they should be sent to, and then sends them to that network. Because it handles traffic from many clients, sometimes packets will arrive more quickly than they can be processed. In this case they are temporarily placed in a fixed-size queue. Network congestion occurs when packets arrive so quickly that this queue fills completely; in this case the router is forced to drop, or discard, packets. Transport protocols like TCP are constructed with this possibility in mind; they detect dropped packets and retransmit them later. Moreover, because dropped packets are a good signal of network congestion, a well-behaved TCP client will decrease its sending rate when it notices a packet has been dropped.

The simplest method of queue management, known as tail drop or Drop Tail, is to drop any packet received while the queue is full. TCP clients respond to the drops by throttling their sending rate and soon the queue begins to empty and new packets can be handled. However, tail drop has three main problems:

  • Global synchronization: If many clients are sending data through the router simultaneously, it is likely that by dropping many consecutive packets, all clients will get some dropped packets. Consequently, they all reduce their sending rate simultaneously, leading to underutilization. The router responds by dropping no packets, leading to a cycle of underutilization and congestion.
  • Bias against bursty traffic: Some clients send large numbers of packets in short bursts; they may use only a small percentage of the bandwidth, but are disproportionately likely to have packets dropped by tail drop, since the average queue size is high and inserting many packets into a mostly-full buffer at once will cause many of them to be dropped. The result is less throughput for these clients.
  • High latency: because of tail drop’s high average queue size, each packet must wait in line to be processed for a relatively long time, leading to increased latency, a serious problem in real-time network applications like audio/video calls and interactive remote applications.

Random early detection is one solution to these problems. It operates by keeping a running estimate of the average queue size; if it’s too small, no packets will be dropped. If it’s too large, all packets will be dropped. Between these two thresholds, it smoothly increases the probability that each incoming packet will be dropped, from zero up to a fixed maximum probability maxp.

Although it seems strange at first to be dropping packets while there’s still perfectly good space in the queue, the point is to send an implicit signal to the sender that they need to cut their sending rate. By only dropping a small percentage of packets, only some connections will be affected, avoiding global synchronization, but still effectively limiting growth of the queue. By keeping the average queue size small, average latency is kept small, and there’s room for bursts of traffic, avoiding a bias against bursty senders. Because the average queue size is a running average taken over a period of time, it’s not significantly distorted by these bursts. Global synchronization may still occur if the average queue length exceeds the maximum threshold, but this is a last-resort measure expected to occur only in the face of severe congestion.

One important optimization that RED uses is that the probability that it will drop a packet depends not only on the average queue size but also on the number of packets count processed since the last dropped packet according to the formula p = pb/(1 – countpb). The result is that the dropped packets will be spread out more evenly with less clustering, which further decreases the risk of bias against bursty traffic and global synchronization.

A difficult problem in any networking publication is empirical studies: constructing a real large-scale network is far too expensive, and the delays introduced by measurement can affect results. The RED paper relied on simulations of small, simple networks (up to about 10 nodes). One experiment used clients performing full-speed file transfer to measure utilization in this best-case scenario; another featured more mixed traffic. It would be years before limitations of RED in large-scale networks would begin to emerge (discussed below).

Although the most well-known, RED was not the first active queue management algorithm; at least two similar schemes predated it. The Early Random Drop (ERD) algorithm, proposed by Hashem in 1989, was already well-studied by 1993 (Hashem, E., Analysis of random drop for gateway congestion control. Report LCS TR-465, Laboratory for Computer Science, MIT, Cambridge, MA). In this simple scheme, there is no tracking of average queue length; instead, whenever the queue exceeds a certain size, the gateway begins to drop packets with a fixed drop probability. Although this preliminary scheme was effective in reducing global synchronization, it still allowed misbehaving users who do not obey standard protocols to achieve higher throughput.

More similar to RED was DECbit, a 1988 scheme deployed with a custom transport protocol (Raj Jain and K. K. Ramakrishnan. Congestion Avoidance in Computer Networks with a Connectionless Network Layer: Concepts, Goals and Methodology. Proceedings of the IEEE Computer Networking Symposium, pp. 134-143). This scheme kept an estimate of average queue length and marked (set a bit in) packets that arrived when the queue length is high. If clients began to receive many marked packets, they would reduce their sending rate. In TCP, if a client observes even a single dropped packet, it will dramatically reduce its sending rate, exacerbating issues of global synchronization (see slow-start). With its custom protocol DECbit was less severely affected by this problem. However, the RED paper demonstrates that DECbit still demonstrates a bias against bursty traffic. These problems, together with the inability to deploy DECbit in existing TCP networks, have made it unpopular in industry.

One of the main practical problems with RED is that it has a variety of parameters that need to be tuned by a designer depending on specific network conditions:

  • minth: The minimum average queue size below which all packets are kept.
  • maxth: The maximum average queue size above which all packets are dropped (recommended to be at least twice minth).
  • maxp: The maximum value for the probability that each incoming packet will be dropped. If this probability is not sufficient to control traffic, the maximum threshold will be reached and all packets will be dropped. A typical value is 0.02.
  • wq: This factor controls how long a period the running average queue length is taken over. If it’s too large, the period will be short and the running average will be too strongly influenced by short bursts of traffic. If it’s too small, the period will be too long and the running average will respond too slowly to actual long-term congestion.

To allow RED to work in a more general setting with less tweaking, a technique known as Adaptive RED or Active RED was developed in 2001 that adjusts parameters in response to the average queue size (Floyd S, Gummadi R, Shenker S. Adaptive RED: an algorithm for increasing the robustness of RED’s active queue management. PDF). A number of other modern schemes for active queue management have been developed that are also more “automatic,” notably Blue (1999) and its variants (W. Feng, D. Kandlur, D. Saha, K. Shin. Blue: A New Class of Active Queue Management Algorithms. U. Michigan CSE-TR-387-99. PDF).

Another issue with RED is that it has trouble scaling to wide area networks (WANs) such as the Internet: in 2003 Low et al showed that large propagation delays could cause RED to update its estimate of the average queue length too slowly (Low, S. H., Paganini, F., Wang, J., and Doyle, J. C. 2003. Linear stability of TCP/RED and a scalable control. Comput. Netw. 43, 5, 633-647. PDF). New variants such as Ohsaki et al’s SPRED were designed to cope with WANs (Ohsaki, H., Yamamoto, H., and Imase, M. SPRED: Active Queue Management Mechanism for Wide-Area Networks. In Proceedings of the 2007 international Symposium on Applications and the internet, January 15 – 19, 2007).

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Networking | Tagged: , , , , , , | Leave a Comment »

Safe kernel extensions without run-time checking

Posted by dcoetzee on March 6, 2009

Citation:  Necula, G. C. and Lee, P. Safe kernel extensions without run-time checking. SIGOPS Oper. Syst. Rev. 30, SI (Oct. 1996), 229-243. (PDF).

Abstract: This paper describes a mechanism by which an operating system kernel can determine with certainty that it is safe to execute a binary supplied by an untrusted source. The kernel first defines a safety policy and makes it public. Then, using this policy, an application can provide binaries in a special form called proof-carrying code, or simply PCC. Each PCC binary contains, in addition to the native code, a formal proof that the code obeys the safety policy. The kernel can easily validate the proof without using cryptography and without consulting any external trusted entities. If the validation succeeds, the code is guaranteed to respect the safety policy without relying on run-time checks.

The main practical difficulty of PCC is in generating the safety proofs. In order to gain some preliminary experience with this, we have written several network packet filters in hand-tuned DEC Alpha assembly language, and then generated PCC binaries for them using a special prototype assembler. The PCC binaries can be executed with no run-time overhead, beyond a one-time cost of 1 to 3 milliseconds for validating the enclosed proofs. The net result is that our packet filters are formally guaranteed to be safe and are faster than packet filters created using Berkeley Packet Filters, Software Fault Isolation, or safe languages such as Modula-3.

Discussion: This 1996 paper by then-PhD-student George Necula first described the concept of proof-carrying code. It received a Best Paper Award at the Operating Systems Design and Implementation (OSDI) 1996 conference.

The first experimental scenario for proof-carrying code is illustrative: in this scenario, an application wishes to receive notification from the kernel when packets arrive, but because going between the kernel and user-mode applications is expensive, it only wants to be notified of packets that are relevant to it.  To accomplish this in the most general way possible, the application gives the kernel a machine code function called a packet filter which is run against every packet that arrives to determine if the application is interested in it.

The problem with this is security: the packet filter runs in kernel mode, and any malicious code in it could overwrite kernel data structures, crash or freeze the machine, or access private data. A variety of solutions were proposed to solve this, but they all have performance problems:

  • Berkeley Packet Filters: packet filters are written in a domain specific language for packet filters and are interpreted. Safe, but very slow.
  • Software Fault Isolation: checks for all memory accesses are automatically added. Adds runtime overhead, does not generalize to any other safety properties besides memory safety.
  • Write the packet filter in a safe language: This works, but you have to trust the compiler, you miss optimization opportunities from hand-tuning, and you have to either prove the compiler produced it with cryptography, or run the compiler at the time the filter is installed (which is very slow).

Proof-carrying code solves the problem in a new way that avoids almost all performance overhead. The kernel publishes a security policy that describes properties a valid packet filter should have (for example, it should only read the packet and read/write its own memory area). A theorem prover can then be used to show that a given piece of machine code satisfies this policy. However, the kernel doesn’t run the theorem prover: theorem provers are slow, and too large and complex to be trusted for security. Instead, the theorem prover is run once offline and comes up with a proof that the code obeys the policy; this proof is then passed into the kernel along with the packet filter. The kernel can rapidly verify that the proof is valid. It can then run the packet filter as many times as it likes with no further checks. Just as importantly, the proof verification is so simple and the code for it so small that the verifier is easy to carefully scrutinize for errors.

Proof-carrying code is a very general idea: in this paper’s experiment, all the filters were hand-coded in assembly language with clever optimizations, but as long as they could come up with a proof to show it was safe, the kernel would accept it. This allowed it to run faster than all other solutions. In general the technique is language-agnostic: any language that can be compiled down to machine code, and correct proofs attached to it, will be accepted just as readily. The proof can also be used to show not only memory safety, but literally any property that you can prove: that the filter respects a certain locking protocol, that it treats a file handle as an opaque value, that it does not exceed its memory or CPU quota, or whatever. It also applies to a variety of situations in which untrusted code is handled, including downloading apps onto mobile devices, plug-ins for web browsers, drivers, scripts in online games, and so on.

Later, the ideas of proof-carrying code would be extended to the idea of a certifying compiler. An ordinary compiler takes a high-level language and maps it down through a series of intermediate forms and program optimizations to executable machine code. A certifying compiler is able to simultaneously take proofs in the source language (such as type safety) and map them down through the same series of transformations to proofs about the final machine code. Anyone who can verify the proofs can check that the machine code obeys the same type safety properties as the original source code; critically, they can do this without having to trust the compiler, which is a good thing, because optimizing compilers are much too large and complex to formally verify. If the compiler has a bug and produces incorrect unsafe code, the proofs will not verify and the code will not be allowed to run.

The first certifying compiler was designed by Necula himself in 1998 and compiled a typesafe subset of C to DEC Alpha machine code (Necula, G. C. and Lee, P. The design and implementation of a certifying compiler. SIGPLAN Not. 33, 5 (May. 1998), 333-344.) In 2000 a certifying compiler called Touchstone for a large subset of Java was published (Christopher Colby, Peter Lee, George C. Necula, Fred Blau, Mark Plesko, and Kenneth Cline. A certifying compiler for Java. In Proceedings of the Conference on Programming Language Design and Implementation, pages 95-107, June 2000, PDF). In 2008 I helped work on a large-scale certifying compiler for MSIL (the bytecode used by Microsoft’s .NET Framework) (Juan Chen, Chris Hawblitzel, Frances Perry, Mike Emmi, Jeremy Condit, Derrick Coetzee, Polyvios Pratikaki. Type-preserving compilation for large-scale optimizing object-oriented compilers. PLDI 2008: 183-192. PDF). All of these efforts emphasized not only that certifying compilers allowed consumers of output binaries to trust those binaries, but also that validation on a unit test suite was invaluable for finding compiler bugs.

So what’s the catch with proof-carrying code? One major concern is proof size: although a great deal of work went into balancing small proof representations with proofs that are quick to verify, there are still no promises that in theory some proof will not become exponentially large. The argument heuristically is that people don’t write programs that cause this behavior because they’re too complex, but the possibility remains that some programs may need runtime checks added to them in order to mitigate their proof size. Some publications have tried to address this by defining richer proof representations, such as Crary and Vanderwaart’s LTT paper “An expressive, scalable type theory for certified code.” (Technical Report CMU-CS-01-113, School of Computer Science, Carnegie Mellon University, May 2001. PDF).

Another problem is producing the proofs in the first place: building a certifying compiler is a lot of work just to preserve safety properties of the source language, and theorem provers are incomplete; there are many properties they cannot prove unassisted. Hand-tuned code, particularly assembly code, will baffle your average theorem prover and require human assistance; and new effort may be required if the code is updated. Proofs of complex properties, such as correctness of an algorithm, still require similar human intervention.

This paper was just the first preliminary paper in a series of papers related to proof-carrying code; the most widely-cited is its successor, Necula’s 1997 paper “Proof-Carrying Code.” (In Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 106-119. POPL ‘97.) In late 1997 he published a longer paper with more details, notably including how to generate proofs of liveness as well as safety; in this case, proving that the filter does not take too many CPU cycles (Necula, G. C. and Lee, P. 1998. Safe, Untrusted Agents Using Proof-Carrying Code. In Mobile Agents and Security Lecture Notes In Computer Science, vol. 1419, 61-91). In 1998 came his aforementioned work with certifying compilers, as well as a 70-page detailed technical report about the proof encoding and the logic used to represent it (which was condensed to: Necula, G. C. and Lee, P. Efficient Representation and Validation of Proofs. In Proceedings of the 13th Annual IEEE Symposium on Logic in Computer Science. June 21 – 24, 1998).

Since that time, Necula, now an associate professor at UC Berkeley, has worked on a number of other systems to support better safety properties for legacy code, especially systems written in C. These include Deputy, a C compiler that detects safety problems and is designed to require little annotation effort, and CCured, a source-to-source translator that adds run-time checks to C programs to make them safe.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Operating systems | Tagged: , , , | Leave a Comment »

Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications

Posted by dcoetzee on March 4, 2009

Citation: Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols For Computer Communications (San Diego, California, United States). SIGCOMM ‘01. ACM, New York, NY, 149-160. (PDF).

Abstract: A fundamental problem that confronts peer-to-peer applications is to efficiently locate the node that stores a particular data item. This paper presents Chord, a distributed lookup protocol that addresses this problem. Chord provides support for just one operation: given a key, it maps the key onto a node. Data location can be easily implemented on top of Chord by associating a key with each data item, and storing the key/data item pair at the node to which the key maps. Chord adapts efficiently as nodes join and leave the system, and can answer queries even if the system is continuously changing. Results from theoretical analysis, simulations, and experiments show that Chord is scalable, with communication cost and the state maintained by each node scaling logarithmically with the number of Chord nodes.

Discussion: This 2001 paper introduced Chord, one of the first published distributed hash tables and one of the most popular ones in practice due to its simplicity and good practical performance.

Distributed hash tables are data structures that provide a “lookup” function for a distributed group of computers: given the key for a piece of data, find the computer where that piece of data is stored. This is easy to do with a large centralized server that keeps track of where each piece of data is stored, but the problem becomes more complicated when we want to implement one on a peer-to-peer network where all nodes are running the same code and none is willing to commit the resources to track the entire network. Chord is one of the earliest and simplest data structures of this sort. In a system with n machines, it is able to perform a lookup by contacting only O(log N) other nodes, and only has to store data about O(log N) other nodes; no node has complete knowledge of the network.

The first task a distributed hash table must do is decide which nodes store which data items. Chord relies on the use of a hash function, typically a cryptographic hash function such as SHA1, to assign a unique integer identifier to each node (the hash of its IP address) and to each key that may be looked up. Then, a key k is assigned to a node n if and only if hash(k) ≤ hash(n) and there is no other node with a hash between hash(k) and hash(n). If there are no nodes with hashes greater than hash(k), it is assigned to the node with smallest hash. This can be visualized by imagining all the hash values laid out on a large circle or ring, increasing in the clockwise direction; to find the node for a key, you go to the key’s hash on the circle and proceed clockwise until you encounter a node’s hash. This system is called consistent hashing, and was conceived by D. Lewin et al in 1997 (Karger, D., Lehman, E., Leighton, F., Levine, M., Lewin, D., and Panigrahy, R. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing, May 1997, pp. 654-663. PDF). It’s convenient for distributed hash tables because if a node joins or leaves the system, only a small number of data items have to be reassigned (those between it and the previous node), and only two nodes have to communicate to reassign them; the most obvious alternative, a system that assigns keys at random, would need to contact all nodes to locate the data to be reassigned.

The most trivial distributed hash table in this arrangement would be for each node to keep the IP address of the next node in the clockwise direction on the circle, called its successor; this connects them all in a circularly linked list. A node can follow these links by asking each node in turn for its successor, until the one storing the desired key is reached. This works fine, but is very slow: it requires roundtrip communication with a series of N/2 nodes on average, and N-1 nodes in the worst case.

To speed this up, Chord introduces the concept of fingers: the ith finger of a node n is the first node in the clockwise direction from the hash value hash(n) + 2i-1. Every node stores a table of fingers and the IP addresses of the corresponding nodes. To locate the node for a key, a node contacts the largest-indexed finger on its table that precedes the key on the circle, and forwards the request to that node. This repeats until the request reaches the node carrying the data. The request only needs to be forwarded at most log N times, because each node stores the most information about nodes closely following it on the circle, and the closer we get the more precise information we have.

When a node joins or leaves the network, it needs to set up its finger tables, and update the finger tables of other nodes to include it. Using the above lookup primitive, it is easy to locate each of the necessary nodes in log N time, so that nodes can join or leave the network in O(log2 N) time.

Part of why Chord is useful in peer-to-peer systems is that it’s resilient to nodes continuously joining and leaving the network: as long as the successor links remain correct, lookup remains correct; finger tables are just an optimization and they can be updated more infrequently without significantly impacting performance. A stabilization procedure is responsible for ensuring that successors remain correct and nodes have a consistent view of successors. Similarly, node failures, where a node drops from the network without warning, can be dealt with by adding a list of the next log N successors to each node; if the successor ever fails, the request can be forwarded to the next live successor. It’s unlikely all log N successors will fail simultaneously before the stabilization procedure can update the successor list. To prevent data loss in the case of node failure, a replication protocol can be built on top of Chord which simply stores the same data under two or more distinct keys.

In line with its goal as a practical system, a large section (section 6) was dedicated to experimental results. Most of these were based on simulations with large numbers of nodes (say 10000) to demonstrate scalability and stability. They show that the average number of nodes contacted for lookups is actually close to (log N)/2; and they give an important optimization, which is for each physical node to run a number of “virtual” nodes with the same IP address; this helps to divide the hash space more evenly and decrease the variance in the number of keys per physical node. They also ran an experiment on the Internet with a smaller network of 10 machines across the United States, showing a total latency of about 200 ms for lookups.

As is frequently the case with scientific discoveries, a number of researchers independently invented systems with similar capabilities to Chord at about the same time in 2001. These systems included Pastry and Tapestry, which both emphasize minimizing latency over simplicity, and CAN (content addressable network), which organizes its nodes in a d-dimensional Cartesian coordinate space instead of a ring, and requires more parameter tuning than Chord. Since then a number of other distributed hash tables technologies have been developed, such as Maymounkov and Mazières’s Kademlia in 2002 (PDF), which relies on an XOR-based metric topology.

Chord, particularly the popular MIT implementation MIT Chord, has undergone a great deal of performance refinement and further analysis since, including techniques like proximity routing (preferring to route to nodes with nearby nodes), geographic overlay construction (using physical location to inform the structure of the topology), and landmark routing. These all aim to lower the latency of lookup in practice. A Berkeley course project by Li Zhuang and Feng Zhou explained and evaluated a variety of these refinements.

An important limitation of Chord (discussed briefly at the end of section 4.3) is its fundamental assumption that the hash function behaves randomly. Although it uses cryptographic hashes that are collision-resistant, it isn’t difficult to generate IPs or keys that hash to a small segment of the hash space, by simply generating random values until you find one with a hash in that segment, rendering it vulnerable to certain attacks. Other distributed hash tables, such as Pastry, Tapestry, and Symphony, incorporate randomization into their topology.

A more extensive modern discussion of Chord is given in Chapter 2 of Monica Haladyna Braunisch’s 2006 Master’s Thesis at MIT (PDF).

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Distributed systems | Tagged: , , , , | 3 Comments »

Graph-based algorithms for Boolean function manipulation

Posted by dcoetzee on March 2, 2009

Citation: Bryant, R. E. 1986. Graph-Based Algorithms for Boolean Function Manipulation. IEEE Trans. Comput. 35, 8 (Aug. 1986), 677-691. (PDF)

Abstract: In this paper we present a new data structure for representing Boolean functions and an associated set of manipulation algorithms. Functions are represented by directed, acyclic graphs in a manner similar to the representations introduced by Lee [1] and Akers [2], but with further restrictions on the ordering of decision variables in the graph. Although a function requires, in the worst case, a graph of size exponential in the number of arguments, many of the functions encountered in typical applications have a more reasonable representation. Our algorithms have time complexity proportional to the sizes of the graphs being operated on, and hence are quite efficient as long as the graphs do not grow too large. We present experimental results from applying these algorithms to problems in logic design verification that demonstrate the practicality of our approach.

Discussion: This widely-cited 1986 paper introduced the notion of the reduced ordered binary decision diagram (ROBDD).

Let’s start with some background. A boolean function is a function that takes a set of boolean inputs (conventionally labelled x1 through xn) and produces a boolean (true/false) output. A binary decision diagram is a way of compactly representing a boolean function as a directed graph: each node tests a particular input parameter to see whether it’s true or false, and then proceeds to one of two child vertices depending on the result. Eventually, it reaches a terminal node labelled either true or false, and that’s the result.

BDDs can also be thought of as branching programs, functions that do nothing other than test boolean variables using if-else statements, return true or false, and call other boolean functions satisfying these constraints, as in this C++ example for the XOR function of 3 variables:

bool xor3(bool x1, bool x2, bool x3) {
    if (x1) {
        if (xor2(x2, x3))
            return false;
        else
            return true;
    } else {
        if (xor2(x2, x3))
            return true;
        else
            return false;
    }
}

bool xor2(bool x1, bool x2) {
    if (x1) {
        if (x2)
            return false;
        else
            return true;
    } else {
        if (x2)
            return true;
        else
            return false;
    }
}

BDDs have some nice properties: they can represent any boolean function, and they can compactly represent most boolean functions that occur in practice. It’s easy to complement them (just complement the terminal nodes) or combine them with AND, OR, or other boolean operations. Unfortunately, they’re so general that many problems involving them are computationally intractable. The diagram may contain cycles and fail to terminate, and some paths may be unreachable, as in these snippets:

bool nonterminating(bool x1) {
    if (x1) {
        if (nonterminating(x1)) {
            return true;
        } else {
            return false;
        }
    } else {
        return false;
    }
}

bool unreachable(bool x1, bool x2) {
    if (x1) {
        return false;
    } else {
        if (x1)
            return true;
        else
            return false;
    }
}

The above branching function unreachable() always returns false, since the “return true” statement is unreachable; but trying to determine in general whether or not a branching function always return false (satisfiability) is a difficult (NP-hard) problem. Likewise, trying to determine if two BDDs or branching programs compute the same boolean function is coNP-hard. Other difficult problems are generating a particular input that will cause a BDD to return true, or counting the number of inputs that make it return true.

BDDs are classical, dating back to a 1959 work by C. Y. Lee (Representation of switching circuits by binary-decision programs. Bell System Technical Journal, vol. 38, July 1959, pp 985-999). A 1978 work by S. B. Akers popularized BDDs and optimized them by reducing their size heuristically, but did not eliminate the issues described above (Binary decision diagrams. IEEE Transactions on Computers, Vol. C-27, No. 6, June 1978, pp. 509-516).

Bryant’s insight with reduced ordered BDDs (ROBDDs) is that by imposing additional constraints on BDDs, we can force them into a canonical form: any two ROBDDs representing the same boolean function will be identical. This means that it’s trivial to determine whether two ROBDDs are equivalent. This also makes the satisfiability problem simple (an ROBDD always return false only if it identical to the trivial ROBDD that always returns false). There are also efficient algorithms for ROBDDs to give a satisfying input (one that causes it to return true) and to count the number of satisfying inputs.

Compared to general BDDs, the main limitation of ROBDDs is that some boolean functions will require much larger graphs; in fact, any ROBDD describing integer multiplication functions will necessarily be of exponential size, as later proven by Bryant in 1991 (On the complexity of VLSI implementations and graph representations of boolean functions with application to integer multiplication. IEEE Transactions on Computers 40-2, pp.205-213).

The first constraint, the “ordered” part of reduced ordered BDDs, limits the order in which input parameters can be tested: an ROBDD cannot test an input multiple times, and must test inputs in increasing order by index. For example, if it has already tested x2, it can’t go back and test x1. This implies that cycles and nontermination can’t exist (an ROBDD is a directed acyclic graph).

A side effect of the ordered constraint is that the size and complexity of the representation depends critically on the ordering of the inputs; because the runtime of the algorithms that manipulate ROBDDs depends on the size of the graph, it’s important to choose a good ordering. Finding an optimal ordering is coNP-complete, so heuristics have to be used for this.

The second constraint, the “reduced” part of reduced order BDDs, specifies that the BDD graph must have the smallest possible number of nodes needed to represent the function. Because of the ordering constraint, there is an efficient algorithm to do this (section 4.2). Let the subgraph rooted at v be the set of all nodes reachable from node v by directed edges. A reduced BDD has two properties: 1. no node transitions to the same node regardless of whether the variable tested is true or false; 2. there are no two nodes such that the subgraphs rooted at those nodes are ismorphic. In the first case, because the test has no influence on the result, the test can simply be removed and all incoming edges directed to the next state. In the second case, all incoming edges into the two nodes with ismorphic subgraphs can be redirected to just one of them, and then any nodes with in-degree zero can be removed. To identify isomorphic subgraphs in the first place, the graph isomorphism algorithm for directed acyclic graphs is used.

ROBDDs are a general data structure, but they’re most useful in the contexts of logic synthesis and formal verification. In logic synthesis, their small size helps to generate a small logic circuit to implement a given boolean function. In formal verification, one can construct a formula representing whether a system has a desired property as a BDD, and then see whether that formula is ever false; if it is, it can given an example of a specific situation that makes it false and facilitates diagnosis. In this context, the ability of BDDs to describe a large, complex boolean function implicitly and succinctly is of benefit.

A large part of the effort going into this work (Section 5) was demonstrating that good input orderings could be found that heuristically would tend to produce small ROBDDs. Bryant sought to demonstrate this empirically by applying them to a variety of practical scenarios; in this paper, he limited his evaluation to “the problem of verifying that the implementation of a logic function (in terms of a combinatorial logic gate network) satisfies its specification (in terms of boolean expressions).” Essentially he verified that the addition and logical operation circuits of some arithmetic logic units (ALUs) were correct. The resulting verification took only a few minutes, even on computers of the time, but was only efficient “as long as we choose an ordering in which the successive bits of the input words are interleaved.” In 1992 he would follow up with a 26-page survey going into considerably more detail about specific implementation techniques, encoding of systems as boolean formulas, and applications in digital system design, finite-state system analysis, and truth maintenance systems (Symbolic Boolean manipulation with ordered binary-decision diagrams. ACM Comput. Surv. 24, 3 (Sep. 1992), 293-318.)

A number of alternatives to BDDs in system verification have been proposed; one of the most popular, propositional decision procedures, was evaluated by Biere, Cimatti, and Clarke in 1999 ( Biere, A., Cimatti, A., Clarke, E. M., and Zhu, Y. Symbolic Model Checking without BDDs. In Proceedings of the 5th international Conference on Tools and Algorithms For Construction and Analysis of Systems. PDF). This construction takes advantage of the fact that most verification procedures can be reduced to the problem of determining whether a given boolean formula is satisfiable; once this formula is generated, a boolean satisfiability algorithm (or SAT solver) can be applied to systematically search for an input that makes the formula true. This creates a time-space tradeoff: BDDs rapidly exceed space constraints for certain problems or when the input order is not chosen correctly, but as long as the graphs remain small they can determine satisfiability efficiently. SAT solvers allow smaller representations, but cannot reliably produce solutions for all possible formulas because the boolean satisfiability problem is NP-complete. Nevertheless, as SAT solvers have become more sophisticated and CPUs have become faster, this technique has begun to rapidly dominate the traditional approach of using BDDs.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Formal verification | Tagged: , , , | 1 Comment »

Induction of Decision Trees

Posted by dcoetzee on March 1, 2009

Citation: Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81-106. (PDF)

Abstract: The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.

Discussion: J. R. Quinlan, also known as Ross Quinlan, is a researcher out of Australia who is best known for having designed the popular ID3 and C4.5 algorithms that are most commonly used in practice for generating decision trees. This 1986 paper surveyed the area of decision tree learning, dealt with a number of outstanding practical issues with the ID3 algorithm, and successfully motivated wide use of decision trees in applications (to quote the conclusion, “[t]he aim of this paper is to demonstrate that the technology for building decision trees from examples is fairly robust”). It is commonly cited as the best-known detailed description of the ID3 algorithm, although that algorithm was actually first published in 1979. (Quinlan, J.R. Discovering rules by induction from large collections of examples. In Expert systems in the micro electronic age.)

In a typical decision tree problem, a list of objects is given providing a list of attributes and a class. For example, an object may have attributes describing a particular patient’s symptoms (does the patient have a cough? what is the patient’s temperature?) and a class describing the medical issue they were subsequently diagnosed with. The goal of a decision tree is to predict the class (the diagnosis) based on the other attributes (the symptoms). It does this by examining one attribute at a time and deciding what to do next based on whether or not that attribute satisfies some property; it continues in this manner until it has enough information to identify the class. For example, if a patient has a cough and a temperature above 98, they might be diagnosed with a cold, whereas if they have a cough and their temperature is below 98 they might be diagnosed with bronchitis.

At one time these trees were generated painstakingly by hand via interviews with experts. Decision tree learning attempts to generate trees automatically from a large collection of training cases (in this case, a database of patient symptoms and diagnoses) that can be used to predict outcomes in new cases. These examples may come from real-world historical data, or may be generated by experts by hand.

The subject of this work was a family of algorithm that the author calls “the TDIDT family of learning systems,” standing for “Top Down Induction of Decision Trees.” This is actually a subset of set of all decision tree learning algorithms; top-down indicates a strategy where initially all training cases belong to a single root node which is then successively split to form a tree. This is in contrast to incremental approaches where a tree is built based on the data seen so far and then modified as new data arrives, as was used for example by Winston in his 1975 paper “Learning structural descriptions from examples.”

The TDIDT family began with Hunt’s Concept Learning System (CLS) framework as early as 1963, which was built upon by Quinlan with ID3 in 1979. Quinlan was motivated by a specific learning problem posed by Donald Michie involving chess endgames, in which the learner was asked to identify a win for black or white based solely on pattern-based features without search. However, ID3 was still limited to attributes with a finite number of discrete values; systems such as ACLS, ASSISTANT, and later C4.5 allowed attributes with integer or even real continuous values, and could generate boolean predicates to examine them.

The primary goal of these systems is to produce a decision tree that performs well on unseen new cases; they typically seek to achieve this by producing a decision tree with a small number of nodes that still performs adequately on the training set. Because it has a small number of nodes, it is believed to generalize better, avoid overfitting, and to be a simpler representation of knowledge for humans to understand, which is one of the primary advantages of decision trees over more opaque predictors such as artificial neural networks.

The critical choices to be made in building decision trees are which tests to apply to attributes, and the order in which to apply them. The goal of ID3 is to do this efficiently for large-scale problems with many attributes and objects. It accomplishes this by using the idea of information entropy from information theory. This can be explained by analogy to data compression: say we want to store in compressed form the class of each object. Using simple arithmetic coding, which examines the frequencies of each class, we can store this information in a number of bits close to the information entropy lower bound. However, once we’ve performed a test on the object (such as “does the patient have a cough?”), we know more information about it, and so we can compress its class into a smaller number of bits. In ID3, each possible test is examined and then the compressed sizes of the classes of the two partitions (e.g. people with and without a cough) is summed. The test that minimizes this value is selected, because it provides the most information.

This entropy-based test has some drawbacks; it tends to favor attributes with more possible values. This can be dealt with either by restricting attributes to two values or by taking the information content of the attribute itself into account (section 7).

Additionally, ID3 implements an orthogonal improvement for scalability: the tree is built incrementally by starting out with a small subset of all training cases, building a tree, and then applying it to all training cases. A selection of incorrectly classified training cases is then added to the training set and the process is repeated. Because a random sample of the training set tends to be representative, this converges rapidly on a tree that suits the entire training set without the need to build a tree on the whole training set directly (an expensive proposition).

This approach works excellently in practice in situations where all data is correct and complete, but many practical scenarios also have to cope with noise (errors in the training data). The naive approach above would rapidly overfit to try and classify noisy entries. ID3 relies on a statistical approach based on the chi square test that is able to determine when an attribute does not have a significant effect on the relative frequencies of classes (that is, each partition corresponding to a particular value of the attribute has about the same proportion of items in each class).

Another problem with naive ID3 is unknown attribute values. The most obvious approach, adding a value for “unknown” to the attribute, yields poor performance. Quinlan’s approach was essentially to treat each unknown value as a probabilistic value that takes on each possible attribute value with a probability based on how frequently that attribute value occurs. This complicated decision tree lookup, since it may now have to follow multiple paths and then produce a probability distribution of output classes. This approach also extends naturally to attributes with partial information (distributions over values).

In 1993 Quinlan created an improved decision tree learning method called C4.5 based on ID3. In addition to support for attributes with infinitely many values, and some of the improvements discussed above, it also prunes completed trees by replacing subtrees with leaf nodes wherever this demonstrates improved generalization (as measured by evaluating a separate test set).

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Artificial intelligence | Tagged: , , , , | 1 Comment »