Papers in Computer Science

Discussion of computer science publications

Archive for the 'Operating systems' Category

The Byzantine Generals Problem

Posted by dcoetzee on 1st April 2009

Citation: Lamport, L., Shostak, R., and Pease, M. 1982. The Byzantine Generals Problem. ACM Trans. Program. Lang. Syst. 4, 3 (Jul. 1982), 382-401. (PDF)

Abstract:Reliable computer systems must handling malfunctioning components that give conflicting information to different parts of the system. This situation can be expressed abstractly in terms of a group of generals of the Byzantine army camped with their troops around an enemy city. Communicating only by messenger, the generals must agree upon a common battle plan. However, one or more of them may be traitors who will try to confuse the others. The problem is to find an algorithm to ensure that the loyal generals will reach agreement. It is shown that, using only oral messages, this problem is solvable if and only if more than two-thirds of the generals are loyal; so a single traitor can confound two loyal generals. With unforgeable written messages, the problem is solvable for any number of generals and possible traitors. Applications of the solutions to reliable computer systems are then discussed.

Discussion: This 1982 paper introduced the Byzantine Generals Problem, an abstract story problem that launched the field of Byzantine fault tolerance. A system is said to be Byzantine fault tolerant if it is resilient to unpredictable and even malicious behavior of faulty or compromised component systems.

For example, in a distributed computing system, a single computer may be taken over by a hacker, who can then attempt to trick the entire system into doing their bidding. In high security or safety-critical systems, such as spacecraft, weapon systems, or voting machines, there are often multiple independent systems built by multiple vendors with no interaction; they cannot rule out the possibility that the employees of one of these vendors furtively included code to attempt to circumvent the system. And this doesn’t just apply to malicious behavior: many software bugs will cause a system to behave incorrectly without necessarily crashing. Obviously if all the component systems are compromised or fail, you’re just plain out of luck – but this is much more unlikely.  The main question this paper answers is: how many compromised components do you need to compromise the integrity of the system as a whole?

To make this problem more abstract and memorable, the authors phrase it in terms of a story problem involving Byzantine generals (the Byzantine Empire or Eastern Roman Empire was the continuation of the Roman Empire in the Middle Ages). In this story, the Byzantine army surrounds an enemy city. Each general leads a single division, and the generals must come to a common decision whether to attack or retreat. They can communicate only by messenger (forging messages or killing messengers is not allowed). However, it’s well known that some of the generals are traitors. To deal with this, the generals decide to vote on whether to attack or retreat.

Let’s say we have three generals, named Armatus, Belisarius, and Comentiolus. Belisarius tells the others that he votes to attack, and Comentiolus tells the others that he votes to retreat. Now, Armatus, the traitor, tells Belisarius that he votes to attack, while telling Comentiolus that he votes to retreat. The result is that Belisarius ends up attacking while Comentiolus retreats, with disastrous results.

To protect against this possibility, the generals decide they need to be more suspicious of each other. Not only does each general send his vote to the other generals, but he also sends copies of the votes that he received to each of the other generals. Now it would seem that Armatus’s scheme is foiled, because Belisarius and Comentiolus will send each other messages and see that he sent them conflicting messages and is a traitor.

However, Armatus quickly devises a new scheme. He sends a message to Comentiolus saying that Belisarius voted to retreat. Since Comentiolus received an “attack” vote from Belisarius, he concludes that Belisarius is a traitor, disregards his message that Armatus told him to attack, and retreats, with the same disastrous results.

In fact, as long as they are limited to sending simple messages, there is essentially no way of resolving this dilemma. If there are three generals and at least one traitor, the traitor can manipulate the outcome of the decision. One way of resolving this is to rely on unforgeable signatures. In this case, each general signs their votes; the generals still send all the votes they receive to each other. Now Armatus cannot tell Comentiolus that Belisarius voted to retreat, because he does not have a signed “retreat” vote from Belisarius.

A different way of resolving the dilemma is to add a fourth (loyal) general, Droctulf. Now, regardless of what Armatus tells Comentiolus about Belisarius’s vote, Droctulf will tell Comentiolus the truth – that Belisarius voted to attack. Since two people said that he voted to attack (Belisarius himself and Doctulf) and only one said he voted to retreat (Armatus), Comentiolus now believes that Belisarius voted to attack.

More generally, the main results of this paper are:

  1. Simple messages: If there are m traitors, there must be at least 3m + 1 generals in order to achieve a reliable result matching the outcome of the vote.
  2. Signed messages: Any number of traitors can be handled; the generals will always agree on a plan matching the outcome of the vote.

The odd thing about this paper is that the formal results are not new; they were first shown (in greater detail in fact) in the authors’ 1980 paper “Reaching Agreement in the Presence of Faults” (J. ACM 27, 2 (Apr. 1980), 228-234, PDF). This paper does not mention the Byzantine Empire and is only 7 pages instead of 20. “The Byzantine Generals Problem” adds some new elements, such as dealing with missing communication paths, but these have received little attention. So what was this paper for? In short, it’s just good teaching – a colorful metaphor makes the problem and its solution more memorable, and reaches a larger audience, popularizing the technique and terminology.

Another interesting historical note: it will be obvious to anyone who reads this paper that “signed messages” can be implemented with public-key cryptography; is this merely a metaphor? Well, no – the first practical public-key cryptography system, RSA, was only invented in 1984 (see my post “The RSA paper: A Method for Obtaining Digital Signatures and Public-Key Cryptosystems“). If anything, it’s the reverse: this application may have helped to motivate it. In their original 1980 paper they relied on a probabilistic mechanism for authentication called “authenticators” which were subsequently abandoned, because they were only effective against random errors and not malicious adversaries.

The authors of this paper acknowledge that the number of messages that need to be exchanged to implement the full Byzantine General Problems solution is large (quadratic), which often makes it quite expensive in practice. Although the ideas were occasionally used over the years, e.g. in a clock synchronization scheme in 1991 by N. Shankar, performance remained a significant barrier to wider use. In 2000 M. Castro did a PhD thesis on “Practical Byzantine Fault Tolerance” which implemented a number of important practical optimizations for getting Byzantine fault tolerance working efficiently on a real network server, a replicated NFS server with only about 3% empirical overhead for its reliability features. Since then there has been increasing effort on adding practical Byzantine fault tolerance to a variety of real systems, such as a domain name system (2001, “A Scalable Byzantine Fault Tolerant Secure Domain Name System“), a block storage system (2007, “Low-overhead byzantine fault-tolerant storage”), a database replication system (2008, Byzantium), and a coordination service (2008, DepSpace). With much of the research in this area still new, it will be interesting to see how useful Byzantine fault tolerance is in practical commercial systems in the long run.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Tags: , ,
Posted in Cryptography, Operating systems | 3 Comments »