Posted by dcoetzee on December 19th, 2009
Citation: Wahbe, R., Lucco, S., Anderson, T. E., and Graham, S. L. 1993. Efficient software-based fault isolation. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles (Asheville, North Carolina, United States, December 05 – 08, 1993). SOSP ’93. ACM, New York, NY, 203-216.
Abstract: One way to provide fault isolation among cooperating software modules is to place each in its own address space. However, for tightly-coupled modules, this solution incurs prohibitive context switch overhead. In this paper, we present a software approach to implementing fault isolation within a single address space. Our approach has two parts. First, we load the code and data for a distrusted module into its own fault domain, a logically separate portion of the application’s address space. Second, we modify the object code of a distrusted module to prevent it from writing or jumping to an address outside its fault domain. Both these software operations are portable and programming language independent.
Our approach poses a tradeoff relative to hardware fault isolation: substantially faster communication between fault domains, at a cost of slightly increased execution time for distrusted modules. We demonstrate that for frequently communicating modules, implementing fault isolation in software rather than hardware can substantially improve end-to-end application performance.
Discussion: This 1993 paper by Robert Wahbe and Steven Lucco (both now at Microsoft), Thomas E. Anderson (now at the University of Washington), and Susan L. Graham describes a software-based method for isolating modules on RISC machines, with immediate application to fine-grained privilege separation. Address spaces, implemented in hardware, are used to isolate processes in modern commodity OSes; software fault isolation (SFI) is an alternative with several advantages: most notably, it requires no hardware support, and the cost of communication between protection domains is much lower.
Suppose a system is divided into modules (components) – for example, a multitasking operating system may have a kernel with various applications running on top of it. A system is said to provide fault isolation if it can recover from the failure of a module without risking the integrity of the rest of the system. For example, if you’re playing a game and it crashes, that shouldn’t cause your web browser – or your entire computer – to crash, or even to malfunction. Fault isolation is valuable for mitigating failures in large software systems where failures are inevitable.
Fault isolation also forms the foundation for a valuable security technique called privilege separation: if each module is only permitted to perform a limited set of certain operations, then even if a module is compromised by an attacker, the attacker only gains access to the privileges held by that module, rather than the entire system. For example, if you’re playing a game it should only have access to the game data files, not the files containing your bank information, which it doesn’t need; then, even if an attacker hijacks your game, they still can’t access your bank information. Privilege separation allows security vulnerabilities in large systems to be mitigated and allows the effort of security verification and auditing to be concentrated on the modules that hold the most dangerous privileges.
In a typical fault isolation system, each module owns some subset of memory where it stores its code and private data structures. To ensure that a misbehaving module does not compromise the integrity of other modules, modules must not be able to write to memory owned by other modules. Moreover, modules must communicate in a controlled manner, usually through an explicit interface, so that other modules can’t trick a module into modifying its own memory maliciously on their behalf.
Today, the primary mechanism used to enforce these properties is dynamic address translation, which maps virtual addresses (memory locations accessed by the application) to physical addresses (locations on the memory device itself) on-the-fly. This functionality is implemented by a specialized piece of hardware called a memory management unit (MMU). By introducing this layer of indirection, the operating system can ensure that a process has no ability to read or write memory belonging to other processes simply by configuring the MMU to provide no mapping from any virtual address to these locations. Because only the operating system kernel can reconfigure the MMU, this protection is secure enough to use with malicious code.
The downside to MMU-based protection is that communication is expensive. Suppose module A wants to send a message to module B. If both modules shared the same view of memory, A could simply call B and pass it a pointer to the message; this takes only a few instructions and is very fast. When the two modules have different address space mappings, however, the cost rises dramatically: just switching from running one module to the other requires a context switch, which involves saving and restoring the register set, reconfiguring the MMU, and flushing the MMU’s cache (the translation lookaside buffer). On top of that, A can’t just pass a pointer to the message, because in module B’s address space the same virtual address may map to a different physical location, or to nothing at all. There are many ways to send messages from one process to another, but all involve overhead.
In applications where the number of messages sent is large, this overhead rapidly becomes prohibitive. An example is a Gigabit Ethernet driver, which has to send a new packet to the network stack about every 12 microseconds. Considering the CPU time needed to parse the headers, this leaves very little time for expensive context switches. As a consequence, in practice components like this are simply not modularized; if they crash, the system crashes.
More generally, any system that is decomposed into fine-grained modules will tend to exhibit a lot of communication between modules – the performance disadvantage of doing this under the MMU model outweighs the security and robustness advantages of having smaller modules.
Software fault isolation (SFI)
Software-based fault isolation, or SFI, aims to provide a substantially different model for fault isolation emphasizing fast communication between modules. In this model, all modules have the same view of memory (run in the same address space), but different subsets of memory are owned by different modules, and every write to memory is checked to make sure the current module owns that memory. Additionally, function calls between modules are controlled so that each module can only be entered at a predetermined set of entry points described in its interface specification.
Despite the name, the important thing here is not that these mechanisms are implemented in software – this offers the convenience of deploying the solution on existing commodity platforms, but is not essential, and indeed most SFI systems rely on some creative combination of software mechanisms and (existing) hardware mechanisms.
In its simplest form, this model is straightforward to provide: imagine you have a trusted compiler and you use it to compile an application consisting of multiple modules. Each module is assigned a fixed range of memory for its code and data. Whenever the compiler emits a write instruction, it also emits a check to make sure that the address being written to is in the range of the module being executed. Likewise, whenever the compiler emits an indirect branch or jump, it inserts a check to make sure that the jump either lies within the current module or targets a valid entry point of some other module. This simple solution has two main problems:
- It’s really slow.
- It’s possible for a malicious module to circumvent the checks. For example, it could insert an indirect branch to one of its own write instructions, skipping the check in front of it.
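The naive check-before-every-write scheme can be sketched as a small simulation (the `Module` class and all names here are illustrative, not from the paper): the trusted compiler would emit the equivalent of `checked_write` before every store.

```python
# Naive SFI: a compare-and-branch check is inserted before each write.

class FaultDomainViolation(Exception):
    pass

class Module:
    def __init__(self, base, size):
        self.base = base           # start of this module's fault domain
        self.limit = base + size   # one past its end

    def checked_write(self, memory, addr, value):
        # The check the trusted compiler inserts before each store.
        if not (self.base <= addr < self.limit):
            raise FaultDomainViolation(hex(addr))
        memory[addr] = value

memory = {}
mod = Module(base=0x1F000000, size=0x01000000)
mod.checked_write(memory, 0x1F000010, 42)       # in-bounds: allowed
blocked = False
try:
    mod.checked_write(memory, 0x20000000, 99)   # out-of-bounds: trapped
except FaultDomainViolation:
    blocked = True
```

As the two problems above note, this is both slow (a branch per store) and circumventable if the module can jump past the check straight to the store.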
To deal with the circumvention problem, Wahbe et al use a dedicated register that holds an address. When the compiler emits code, it maintains two invariants:
- All writes must be performed to the address stored in the dedicated register;
- The dedicated register should almost always point to a valid address inside the current module; if any instruction invalidates this invariant, the program must either fail or restore the invariant before the next write or indirect branch instruction.
Now, a module is free to jump to any instruction it wants within its own bounds, without risking writing to another module’s memory. This does imply that the dedicated register can’t be used for any other purpose, but on a RISC machine with 32 registers this is not a problem. The same trick can be used to protect indirect branches (jumps to data must be excluded; this can be done either by using a second dedicated register for code, or by marking all data non-executable).
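The dedicated-register discipline can be sketched on a toy machine model (the `Machine` class and names are illustrative, not from the paper). The only store "instruction" writes through the dedicated register, and the only "instruction" that changes the register validates its operand first, so a module that jumps into the middle of its own code still cannot produce an out-of-region write.

```python
# Dedicated-register invariant: stores are unchecked, but the register
# they write through can only ever hold an in-region address.

REGION_LO = 0x1F000000
REGION_HI = 0x20000000  # one past the end of the module's region

class Machine:
    def __init__(self):
        self.memory = {}
        self.dedicated = REGION_LO  # invariant: always inside the region

    def set_dedicated(self, addr):
        # The only way to change the register; the emitted check keeps
        # the invariant even if `addr` came from hostile code.
        if not (REGION_LO <= addr < REGION_HI):
            raise RuntimeError("fault: address outside fault domain")
        self.dedicated = addr

    def store(self, value):
        # Every store goes through the dedicated register, unchecked.
        self.memory[self.dedicated] = value

m = Machine()
m.set_dedicated(0x1F000040)
m.store(1)                        # a legal write
trapped = False
try:
    m.set_dedicated(0x20000010)   # attempt to escape the region
except RuntimeError:
    trapped = True
m.store(2)                        # the register is still in-region
```

Because a jump can only land on whole instructions, there is no way to reach `store` with an unvalidated address in the register.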
This leaves open the question of how to allow calls between modules (called cross-fault-domain RPCs by Wahbe et al). The scheme implemented in this work stores a jump table inside each module, with one entry for each possible cross-module call. The trusted compiler permits this jump table (and only this jump table) to contain branches to points outside the module. This allows the checks on indirect jumps to be very simple (they just have to make sure modules only jump to their own code region). Rather than transferring control directly to another domain, these jump tables transfer control to call stubs which perform several important operations before invoking the actual call:
- Because each module needs to access its own data on the call stack, each module is given a private stack in its own data region. When transferring control to another module, we have to switch to its stack and copy across any stack-based arguments. A similar mechanism is used by commodity OSs when trapping in and out of the kernel.
- Standard calling conventions include callee-save registers, registers which must not be altered by the function being called. Since the called module may be malicious, the call stub saves these registers instead.
- The dedicated register invariants must be restored upon return.
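The three call-stub duties above can be sketched on the same kind of toy machine model (the `Domain`/`Machine` classes and register names are illustrative, not the paper's actual stub code):

```python
# What a cross-fault-domain call stub must do before and after the call.

class Domain:
    def __init__(self, base, stack_top):
        self.base = base            # used to restore the dedicated register
        self.stack_top = stack_top  # top of this domain's private stack

class Machine:
    def __init__(self):
        self.memory = {}
        self.callee_save = {"s0": 0}  # one representative callee-save register
        self.sp = 0
        self.dedicated = 0

def cross_domain_call(machine, caller, callee, entry, args):
    saved_regs = dict(machine.callee_save)  # the stub, not the callee, saves these
    saved_sp = machine.sp
    machine.sp = callee.stack_top           # switch to the callee's private stack
    for a in args:                          # copy stack-based arguments across
        machine.sp -= 1
        machine.memory[machine.sp] = a
    result = entry(machine)                 # enter via the jump table
    machine.sp = saved_sp                   # back on the caller's stack
    machine.callee_save = saved_regs        # undo anything the callee clobbered
    machine.dedicated = caller.base         # restore the register invariant
    return result

caller = Domain(base=0x1F000000, stack_top=0x1FFFFF00)
callee = Domain(base=0x20000000, stack_top=0x20FFFF00)

def misbehaving_entry(m):
    m.callee_save["s0"] = 999     # a callee that ignores the convention
    return m.memory[m.sp] + 1     # read its stack argument

mach = Machine()
mach.sp = caller.stack_top
mach.callee_save["s0"] = 7
result = cross_domain_call(mach, caller, callee, misbehaving_entry, [41])
```

Even though the callee clobbers a callee-save register, the stub restores the caller's state on return.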
Finally, we return to the problem of performance: inserting complex checks before every write and indirect jump is expensive, especially if those checks involve branches. The insight of Wahbe et al is that if the code and data region for each module is the set of all addresses with some common bit prefix (say, addresses 0x1f000000 through 0x1fffffff) then we can make sure that all writes and jumps go into this region by merely bitmasking the target address to have the correct prefix. If the original address lies outside the region, this will cause some random part of the module to get overwritten or jumped to – but only malicious or invalid code will encounter this behavior, so it’s not a problem (except, perhaps, during debugging).
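The sandboxing-by-masking trick amounts to two cheap ALU operations and no branch (the constants below are illustrative, matching the example region in the text):

```python
# Sandboxing a store target by masking instead of checking.

SEGMENT_ID = 0x1F        # module owns 0x1F000000 through 0x1FFFFFFF
SEGMENT_SHIFT = 24
OFFSET_MASK = (1 << SEGMENT_SHIFT) - 1

def sandbox(addr):
    # Keep the low bits, force the high bits to the module's prefix.
    return (SEGMENT_ID << SEGMENT_SHIFT) | (addr & OFFSET_MASK)

legal = sandbox(0x1F000040)    # in-region address: unchanged
hostile = sandbox(0x2EADBEEF)  # out-of-region address: redirected inward
```

An out-of-region address is silently redirected to some in-region location – harmless to everyone but the misbehaving module itself.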
Another important optimization involves the stack: writes to the stack are considerably more common than writes to the heap in typical programs, and are usually made at a small fixed offset from either the stack pointer or the frame pointer. Rather than check every one of these, they take advantage of the stack’s locality by maintaining the invariant that the stack pointer (or frame pointer) points inside the module’s data region, and only checking it when it’s modified. Since writes through the stack pointer may have a small offset attached, they create “guard regions” containing nothing useful before and after each module’s data region; even if the stack pointer is at the beginning or end of the region and the offset is set to the minimum or maximum value, it still can’t write past the guard regions.
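The guard-region optimization can be sketched as follows (sizes and names are illustrative): the stack pointer is checked only when it is set, and stores at a small constant offset from it go unchecked, because the 16-bit signed immediate offsets of typical RISC load/store instructions can reach at worst into an unmapped guard zone, never into another module.

```python
# Guard regions: check the stack pointer only on modification.

REGION_LO = 0x1F000000
REGION_HI = 0x20000000
GUARD = 0x10000          # guard zones exceed the largest immediate offset

def set_stack_pointer(addr):
    # The only checked operation on the stack pointer.
    if not (REGION_LO <= addr < REGION_HI):
        raise RuntimeError("stack pointer left the fault domain")
    return addr

def unchecked_store(memory, sp, offset, value):
    assert -0x8000 <= offset < 0x8000  # a 16-bit signed immediate
    # No range check on the address: by construction it falls in the
    # data region or a guard zone, never in another module.
    memory[sp + offset] = value

memory = {}
sp = set_stack_pointer(REGION_HI - 4)
unchecked_store(memory, sp, -8, 1)       # ordinary local-variable write
unchecked_store(memory, sp, 0x7FFF, 2)   # worst case: lands in the guard zone
```

In a real system the guard zones are left unmapped, so a worst-case store traps in hardware rather than corrupting a neighbor.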
To actually implement privilege separation with SFI, a little something extra is needed – if all modules in an application could make the same system calls, they would effectively have the same privilege. As described in section 3.4, Wahbe et al’s scheme uses a simple but flexible model in which one module is permitted to make system calls freely and no other module is permitted to make any; it acts as an arbitrator and can implement arbitrary fine-grained policies by observing which module invoked it.
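The arbitration model can be sketched like this (the policy table, module names, and call names are all illustrative, not from the paper):

```python
# One trusted module arbitrates system calls for all fault domains.

POLICY = {
    "game":      {"read_game_data"},
    "net_stack": {"send_packet", "read_game_data"},
}

def arbitrate(caller, syscall, do_syscall):
    # The cross-domain RPC mechanism tells the arbitrator which fault
    # domain invoked it, so it can enforce fine-grained policies.
    allowed = POLICY.get(caller, set())
    if syscall not in allowed:
        raise PermissionError(f"{caller} may not invoke {syscall}")
    return do_syscall(syscall)

issued = []
arbitrate("net_stack", "send_packet", issued.append)
denied = False
try:
    arbitrate("game", "send_packet", issued.append)
except PermissionError:
    denied = True
```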
Evaluations show that with all the optimizations described above, the overall scheme is quite efficient – runtime overhead ranged from 0 to 12%, with an average of 4.3%. However, this is only checking writes – checking reads, which is done in the MMU model and is important for protecting sensitive module-private data such as passwords, requires a much higher overhead due to the larger number of reads (21.8% on average).
One practical issue with Wahbe et al’s scheme as a security solution is that it depends on a large, trusted compiler for correctness – later works such as MIT’s PittSFIeld mitigated this problem by using a separate, smaller verifier that is run on machine code when loading it.
Practical impact and later work
Despite the apparent utility of Wahbe et al’s scheme, sometimes termed classical SFI, it is not widely used in practice. This may be because the scheme depends critically on a number of features of RISC machines, and RISC machines are not widely deployed in the PC market (they are, however, quite common in the mobile and embedded market). It may be because it was never effectively developed into a robust product or integrated well with other tools. More recent works, such as MIT’s PittSFIeld (2006) and Vx32 (2008), have extended SFI to CISC platforms such as the x86 and put more effort into producing robust prototypes.
Although SFI is a hot research area now, I remain skeptical of approaches like Vx32 that depend on creative exploitation of legacy hardware that is in the process of being phased out. The way I see it, SFI presents a useful fault isolation model that is independent of its software-based implementation, and deserves explicit hardware support to mitigate its performance costs. Doing an efficient check for each read and write is exactly the sort of thing hardware is good at (and software is really bad at). The hard question is exactly what hardware support should be implemented to support typical applications – SFI is so poorly deployed that few applications have ever been architected with fine-grained modularization and privilege separation in mind, and the array of possible design choices is overwhelming. Some proposed schemes include Mondriaan Memory Protection (2002) and the Hard Object (2009) project I worked on at UC Berkeley. Just as important is more work on programming tools to describe and implement module isolation and interfaces in a generic way that can be implemented using a variety of fault isolation schemes, ranging from no protection to dynamic address translation to SFI to new hardware schemes – this will help to bootstrap the process of getting large enough applications built that new platforms for isolation can be evaluated. Regardless, this is going to be an important area of research in the near future and I’m excited to see what techniques will gain adoption.
The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.