A Mathematical Theory of Communication
Posted by dcoetzee on 12th April 2009
Citation: Claude E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, July, October, 1948. (PDF)
Abstract: The recent development of various methods of modulation such as PCM and PPM which exchange bandwidth for signal-to-noise ratio has intensified the interest in a general theory of communication. A basis for such a theory is contained in the important papers of Nyquist and Hartley on this subject. In the present paper we will extend the theory to include a number of new factors, in particular the effect of noise in the channel, and the savings possible due to the statistical structure of the original message and due to the nature of the final destination of the information.
Discussion: This seminal 1948 paper by Claude Shannon first introduced many of the central ideas and terminology of the then-nascent field of information theory, including entropy, channel capacity, the noisy-channel coding theorem, Shannon-Fano coding, language modelling, and Shannon-Fano-Elias coding, an early version of modern arithmetic coding.
A communication system is a system that sends data from a source to a destination over a channel. In the 1940s, communication systems such as radio, telephony, and telegraphy systems had been in use for many decades, and were effective, but had little theoretical basis. Simple questions remained unanswered, such as “how fast can I send a particular type of information over a particular channel?”, “what is the best way to represent the data while sending it over the channel?”, and “how do I measure how much information is represented by a piece of data?” This single sweeping paper answered these questions, both for discrete channels (designed to send a sequence of discrete symbols such as bits) and for continuous channels (designed to send continuously-varying signals such as audio or video signals).
The first point emphasized by this paper is how information should be measured: there are two approaches to this. One is to represent the amount of information by the number of possible values that the datum could represent; for example, a 32-bit register can store 4,294,967,296 distinct values. Although it is possible to describe information in these terms, it becomes rapidly unwieldly, and does not correspond to intuition (for example, one would expect two CDs to store twice as much information as one CD, not the square). Shannon chose to use a logarithmic measure, and to represent it in terms of “binary digits or bits“, a word popularized by this paper. This does lead to some counterintuitive results: for example, a piece of information taking on 10 distinct values stores about 3.3 bits of information, and it seems strange to talk of fractional bits.
With this measure, the capacity of a discrete channel, which transmits a sequence of discrete symbols such as bits, is defined as log N(T)/T as T approaches infinity. Here, N(T) is the number of possible messages one can transmit in time T. The source itself is approximated by a Markov process, which generates a sequence of symbols while using a fixed finite amount of state to assign probabilities to which symbol will be generated next.
The Markov process model is surprisingly general: many different models of an information source can be constructed as Markov models, ranging from the trivial one that just treats each character independently and assigns the symbols uniform probability, to one that examines several previous words in an English text to determine what the next word is most likely to be. Designing a Markov process that approximates the behavior of a real information source, like an English text document or an image, is a large part of the problem of language modelling, a practice with diverse applications. In this paper, armed with nothing but a book of random numbers, Shannon emulates four different simple language models for English text and shows subjectively their increasing similarity to actual English text.
Shannon posits a measurement of information or uncertainty that satisfies some basic axioms called entropy, denoted by H, a term borrowed from statistical mechanics. This is defined for a discrete source with n symbols as:
-sum(pi log pi) as i goes from 1 to n
For example, if you have a single bit that is either zero or one with probability 1/2, the total information conveyed by it is:
-[(1/2) log(1/2) + (1/2) log(1/2)] = 1 bit.
If on the other hand the bit is zero with probability 7/8, then it conveys much less information:
-[(7/8) log(7/8) + (1/8) log(1/8)] = about 0.164 bits.
One way of looking at this is that a sequence of k bits each independently produced with these probabilities can (with high probability) be compressed down to a sequence of slightly more than 0.164k bits. As k becomes larger, this bound can be approached more closely.
Shannon defines a channel’s capacity as the number of bits it can transfer per second – for a noise-free discrete channel where each symbol is the same length, this is trivial and is just a constant. Shannon’s noiseless channel coding theorem asserts that if a channel has capacity C and a source has entropy H, we can transmit only up to C/H symbols per second. He gives an explicit construction of a code capable of doing such a transmission, later dubbed Shannon-Fano-Elias coding, which is a near-optimal code very similar to the modern arithmetic coding that forms the basis of nearly all modern data compression methods. Conversely, to show that this rate cannot be exceeded, he shows that essentially any operation that can be done on a sequence of symbols in finite memory can only lower, not increase, its entropy; hence one cannot recover more any more information from the output of the channel than it already has, which is determined by the capacity.
When using a noisy channel, where the channel can modify symbols in transit, the situation is different. He presents an abstract model in which an “observer” machine watches both the input and output of the channel, and then sends “correction” data to the receiver, with the aid of which the receiver can perfectly reconstruct the input. The minimum capacity of the channel required for this correction data to be transmitted is called the equivocation, and denotes the uncertainty in the received data due to noise. The capacity of a noisy channel is defined as:
C = Max(entropy of input – equivocation), with maximum taken over all input sources.
By the use of forward error correction, or redundancy in the transmitted information, Shannon showed the surprising result that with this definition, one can transmit information at a rate of up to C/H symbols per second, with an arbitrarily low probability of errors. This cannot be exceeded; if one eliminates redundancy in an attempt to increase the transmission rate, the increase in errors will eliminate any added benefit. Shannon’s proof demonstrates this constructively by presenting a large class of theoretical error-correcting codes that achieve the desired bound in the average case, meaning most of them are near-optimal. On the other hand, none of these codes are practical for real use. He gives an isolated example, a Hamming code also designed in the 1940s, to demonstrate a more practical error correction code, but a more complete family of practical error correction codes, such as the Reed-Solomon error correction popular today, would have to wait for later research.
Shannon’s model was very general and admitted possibilities that today would be considered unnecessary for most practical scenarios. For example, he allows different symbols to require different lengths of time to transmit, and allows noise to affect different symbols differently. On the other hand, he also makes some important restrictions on the statistical behavior of the source that are often overlooked, the most notable of which being that it must be ergodic: this means that the long-term behavior of any one infinite character sequence must have the same statistical properties as the average over all possible sequences.
It’s easy to give an example of a set of sequences for which this is not the case: suppose we have a set of sequences where every other character is a, and the remaining characters are b, but we can start with either one. Since half of these sequences will have a‘s in the odd positions, and half of them have a‘s in the even positions, the character a occurs with equal probability in every position over all possible sequences. Yet in any particular infinite sequence, it occurs in half the positions with probability 1 and the other half with probability 0. Ergodic theory is a rich and complex mathematical theory only lightly touched upon in this work, but ergodic sequences form one of the fundamental assumptions of Shannon’s work.
The second and more mathematically sophisticated section of this paper, often overlooked today by computer scientists, deals with transmission of continuously-varying signals such as audio or video signals. In the 1940s this was of greater relevance, since most communication systems dealt with this type of signal; today, many of them are being replaced by digital systems. However, even these discrete signals must eventually be translated into continuous signals to be sent over wires, whether electrical or optic fiber, and the ability to analyze the action of noise on these signals directly rather than only at the higher level of discrete noise allows for a more precise analysis.
Shannon’s treatment of continuous signals and the resulting theorems are largely the same as the discrete case, but in the limit as time units become small. Entropy is now an integral instead of a sum, and is maximized when the distribution is Gaussian (producing a white Gaussian noise signal). Entropy is more complex to work with in this setting, because it depends on the choice of coordinate system. He shows closed-form results for a few cases where the average power of the input signal is bounded, and the case where the peak or maximum power is bounded, both with a simple additive white Gaussian noise (AWGN) model.
Additionally, the concept of “lossless” transmission becomes meaningless in this setting: a continuous signal carries an infinite amount of information, only a finite amount of which can be recovered in the presence of any noise in the channel. Instead of discussing the probability with which the signal can be reconstructed perfectly, he describes a function v that places a limit on how close the input and output functions are, and then defines capacity subject to this limitation.
Sometimes “A Mathematical Theory of Communication” is credited for single-handedly founding information theory. Although it was influential, this is an overstatement – it was essentially a generalization of the 1928 foundational work of Harry Nyquist (“Certain factors affecting telegraph speed”), which described the Nyquist rate and allowed Shannon to represent continuous signals in an essentially discrete way, and Ralph Hartley (“Transmission of Information”), which was the first paper to attempt to measure the information that could be put through a channel (although its conclusions were comparatively conservative). It also adapted the terminology and techniques of statistical thermodynamics as described by Boltzmann and Gibbs in its description of information entropy. Shannon’s inspiration also came from new developments in signal coding techniques, such as pulse-code modulation, which was used by SIGSALY to digitize speech signals for encryption during World War II. The potential for differences in coding techniques to yield improvements in transmission rate raised questions about the fundamental limitations of transmission, creating an environment conducive to such a work.
Conversely, this paper was not the last foundational work of information theory. It did not begin to explore applications to compressed, reliable data storage (which can be seen as sending data over a channel “through time”), nor algorithmic information theory, which examines the computational limitations of information, in particular Kolmogorov complexity. The primitive language models it mentions would be replaced by sophisticated ones based on a variety of statistical techniques such as smoothing, and a huge variety of lossless and lossy codes would be described not only for English text but for quantized signals such as digital images and digital audio, forming the basis of modern standardized formats like ZIP, JPEG, and MP3 files.
The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.
Posted in Information theory and compression, Networking | 2 Comments »