<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Papers in Computer Science</title>
	<atom:link href="http://papersincomputerscience.org/feed/" rel="self" type="application/rss+xml" />
	<link>http://papersincomputerscience.org</link>
	<description>Discussion of computer science publications</description>
	<lastBuildDate>Fri, 25 Mar 2011 20:24:56 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>A Learning-Based Approach to Reactive Security</title>
		<link>http://papersincomputerscience.org/2011/03/25/a-learning-based-approach-to-reactive-security/</link>
		<comments>http://papersincomputerscience.org/2011/03/25/a-learning-based-approach-to-reactive-security/#comments</comments>
		<pubDate>Fri, 25 Mar 2011 20:21:56 +0000</pubDate>
		<dc:creator>dcoetzee</dc:creator>
				<category><![CDATA[Security]]></category>

		<guid isPermaLink="false">http://papersincomputerscience.org/?p=186</guid>
		<description><![CDATA[This 2010 paper, a collaboration between security and machine learning researchers, makes the bold claim that, rather than investing your resources in making your system as secure as possible up-front, in many scenarios it's just as good, or even <em>preferable</em>, to fix security problems as attackers discover and exploit them, a paradigm they call <em>reactive security</em>.]]></description>
			<content:encoded><![CDATA[<p><strong>Citation</strong>: A. Barth, Benjamin I. P. Rubinstein, M. Sundararajan, J. C. Mitchell, Dawn Song, and Peter L. Bartlett. A learning-based approach to reactive security. In Proceedings of Financial Cryptography and Data Security (FC10), 2010. (<a href="http://www.adambarth.com/papers/2010/barth-rubinstein-sundararajan-mitchell-song-bartlett.pdf">PDF</a>)</p>
<p><strong>Abstract</strong>: Despite the conventional wisdom that proactive security is superior to reactive security, we show that reactive security can be competitive with proactive security as long as the reactive defender learns from past attacks instead of myopically overreacting to the last attack. Our game-theoretic model follows common practice in the security literature by making worst-case assumptions about the attacker: we grant the attacker complete knowledge of the defender’s strategy and do not require the attacker to act rationally. In this model, we bound the competitive ratio between a reactive defense algorithm (which is inspired by online learning theory) and the best fixed proactive defense. Additionally, we show that, unlike proactive defenses, this reactive strategy is robust to a lack of information about the attacker’s incentives and knowledge.</p>
<p><strong>Discussion</strong>: This 2010 paper, a collaboration between security and machine learning researchers, makes the bold claim that, rather than investing your resources in making your system as secure as possible up-front, in many scenarios it&#8217;s just as good, or even <em>preferable</em>, to fix security problems based on which systems attackers target, a paradigm they call <em>reactive security</em>. It justifies this with a simple game-theoretic model in which a defender with finite resources, typically a high-level security manager, must allocate them among particular lower-level security tasks.</p>
<div style="float: right;"><a href="http://papersincomputerscience.org/wp-content/uploads/2011/03/Reactive_security_example.png"><img class="alignnone size-full wp-image-189" title="Reactive_security_example" src="http://papersincomputerscience.org/wp-content/uploads/2011/03/Reactive_security_example.png" alt="" width="300" height="155" /></a></div>
<p>The primary model used by the paper, of interest on its own, is a two-player game taking place on a graph: the attacker begins at a <em>start vertex</em> and moves through the graph by executing successful attacks. Each vertex has a <em>payoff</em> value that the attacker receives once that vertex is reached, and each edge has a cost, indicating how much the attacker must pay to cross it (representing effort invested in an attack). The initial cost of an edge is based on the surface area of the system being attacked, and the defender can <em>invest</em> in defending an edge, temporarily increasing its cost for one round, but has only finite resources to go around during each round. Attackers and defenders alternate in rounds: the defender picks a set of edges to invest in, then the attacker gets to execute a series of attacks, always beginning at the start vertex. Finally, edge costs are hidden from the defender until the attacker uses the edge; this models how difficult it is to determine <em>a priori</em> where the security weaknesses in a system are.</p>
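<p>To make the game concrete, here is a minimal sketch of one round as described above. It follows this post&#8217;s description rather than the paper&#8217;s exact formulation: the class name, the vertex names, the numbers, and the assumption that a defender&#8217;s investment is simply added to the edge cost are all mine.</p>

```python
from dataclasses import dataclass


@dataclass
class AttackGraph:
    """Toy attack graph; every name and number here is illustrative."""
    payoff: dict       # vertex -> reward the attacker collects on reaching it
    base_cost: dict    # (u, v) edge -> attacker's cost when the edge is undefended
    start: str = "start"

    def edge_cost(self, edge, defense):
        # Assumption: investment adds to the edge cost, for this round only.
        return self.base_cost[edge] + defense.get(edge, 0.0)

    def attack_profit(self, path, defense):
        """Payoff of every vertex reached minus the cost of every edge crossed."""
        assert path[0] == self.start, "attacks always begin at the start vertex"
        edges = list(zip(path, path[1:]))
        cost = sum(self.edge_cost(e, defense) for e in edges)
        reward = sum(self.payoff[v] for v in path[1:])
        return reward - cost


graph = AttackGraph(
    payoff={"start": 0.0, "frontend": 1.0, "backend": 9.0},
    base_cost={("start", "frontend"): 2.0, ("frontend", "backend"): 3.0},
)
path = ["start", "frontend", "backend"]
undefended = graph.attack_profit(path, defense={})
defended = graph.attack_profit(path, defense={("frontend", "backend"): 4.0})
print(undefended, defended)  # 5.0 1.0
```

<p>Spending four units on the back-end edge cuts the attacker&#8217;s profit on this path from 5 to 1. The defender&#8217;s difficulty is that the base costs are hidden until an edge is actually used.</p>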
<p>Although the examples in the paper have vertices representing system components like machines on a network, I think vertices in the graph are best thought of not as particular systems being exploited, but rather a <em>set of resources</em> controlled by the attacker, or more generally, <em>the current attacking capabilities of the attacker</em>. At the start vertex, they control nothing but their own system; as each edge is traversed, they add new attacking resources to their collection. This allows systems like the one shown to the right, where an attacker may or may not have <em>full</em> control over a front-end system before attacking a back-end system, to be modelled.</p>
<p>The main result of the paper is this: under the assumption that the defender&#8217;s investment in an edge is linearly reflected in the attacker&#8217;s cost for that edge, a specific machine learning algorithm for reactive security (based on placing more weight on recently attacked nodes and exponentially decaying the weight over time) performs comparably to a <em>pure proactive</em> approach, in which the defender knows all edge costs <em>a priori</em> and picks a fixed, optimal strategy. This is done essentially by reducing the problem to a standard online learning problem and using known results. Moreover, they argue that in cases where the edge costs are estimated incorrectly or the attacker acts unexpectedly due to incomplete knowledge, the reactive security algorithm is superior to the proactive approach. Besides providing support for reactive security, this model also provides formal support for other pragmatic measures like <a href="http://en.wikipedia.org/wiki/Defense_in_depth_%28computing%29"><em>defense in depth</em></a>, which is investing in defense measures that are never needed unless some other defensive measure is first overcome.</p>
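<p>The flavor of such a learner can be sketched in a few lines. This is not the paper&#8217;s algorithm, only an illustration of the idea described above (more weight on recently attacked edges, decaying over time); the function name, the <code>decay</code> parameter, and the scoring rule are assumptions of mine.</p>

```python
def reactive_allocations(attack_history, edges, budget=1.0, decay=0.5):
    """Yield the defender's budget split for each round.

    Every edge keeps a score. After each round all scores shrink by `decay`
    and the edges that were just attacked gain one point, so recent attacks
    count most but older ones are not forgotten immediately.
    """
    score = {e: 1.0 for e in edges}  # uniform start: nothing observed yet
    for attacked in attack_history:
        total = sum(score.values())
        yield {e: budget * s / total for e, s in score.items()}
        for e in edges:
            score[e] = decay * score[e] + (1.0 if e in attacked else 0.0)


edges = ["web", "db"]
history = [{"web"}, {"web"}, {"db"}]  # edges the attacker used in rounds 1..3
allocations = list(reactive_allocations(history, edges))
for round_number, allocation in enumerate(allocations, start=1):
    print(round_number, allocation)
```

<p>With these inputs the share spent on <code>web</code> grows from 0.5 to 0.75 to 0.875 as attacks on it accumulate, while <code>db</code> never drops to zero. Each allocation is chosen before that round&#8217;s attack is observed, matching the order of play in the model.</p>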
<p>Although the model is simple, it is quite general: not only can vertices represent machines on a LAN, but also components of an application, or a hybrid thereof. Traversing an edge can correspond to an attack from outside, a privilege elevation on a single machine, or to an attack between machines on a LAN. Two different attacks on the same system can be expressed by distinct edges with distinct costs, if the attacker possesses a different set of resources in each case, which makes sense. Even social engineering can be modelled: once the attacker has invested in tricking or bribing the employee, they can add the employee to their set of resources, lowering the cost of attacking new systems.</p>
<p>On the other hand, the model also has a number of weaknesses. A couple were made explicit in the paper:</p>
<ul>
<li>The model assumes that the mapping from defender investment in an edge to attacker cost to overcome it is linear. This is justified heuristically with the claim that the rate of discovering new security defects in software is roughly constant. Experience shows, however, that vulnerabilities that are easy to exploit can be hard to fix (e.g. DoS attacks), and vulnerabilities that are hard to exploit can be easy to fix (e.g. a buffer overflow in code that doesn&#8217;t directly handle user input). Moreover, the same &#8220;investment&#8221; in a system can be spent in many different ways, and the model offers little insight into which way is most effective. One solution is to apply the model recursively, decomposing vertices into subgraphs.</li>
<li>Reactive security is unsuitable for dealing with situations in which an attack is so devastating to the company that it cannot recover. For example, a company that suffers a high-profile blow to its reputation in the press may see a drop in sales that ultimately drives it into bankruptcy, or may have major investors pull out. Even an optimal response in such a scenario is too little, too late.</li>
</ul>
<p>Some other weaknesses are less obvious:</p>
<ul>
<li>The model assumes that an edge&#8217;s cost becomes known as soon as an attacker exploits that edge. In reality, the detection of a single exploit of a system gives very little information about the overall security of that system, particularly after the exploit is repaired. The frequency of exploits of an edge might be a better indicator (the learning algorithm may already effectively take this into account).</li>
<li>The model assumes that edges have a fixed cost which only increases temporarily if the defender invests in them. In reality, the cost of an attack depends on many dynamic factors, including the skill set and resources available to the attacker marketplace, and the behavior of the system under attack. As new attackers enter the attacker marketplace, as attackers learn new techniques, as transaction costs among attackers change (or as they form teams), or as new hacking tools become available, the cost of attack can go up or down. Patches, upgrades, installation of new applications, or even changes in load can expose new vulnerabilities or fix old ones (in particular, patches implemented in reaction to particular attacks do not go away after the round is over). All these factors can change quickly and are largely invisible to the defender.</li>
<li>Similarly, payoff is assumed to be fixed from round to round. Payoff changes by the minute based not only on what new information the attacked resources are storing, but also the marketplace value of the information, which can be rapidly shifting and unpredictable. Reactive investment in defense of a system that, tomorrow, is of little interest to attackers is a bad idea.</li>
<li>Edge costs are assumed to be independent, in the sense that investing in one edge does not affect the cost of any other edge. In reality, as is easy to see in the diagram above, improvements to one edge may quite directly affect the cost of another related edge as well. Another common example would be rolling out an operating system upgrade in the data center, which could increase the cost of all edges simultaneously.</li>
<li>Attackers are assumed, after each round, to lose control over all their attacked resources. Long-lived attacks such as rootkits can go undetected during attempts to clean up after an attack, allowing the attacker to start in the next round with more resources in the bag at the beginning. In the worst case, the rootkit itself delivers a positive payoff to the attacker, and the attacker doesn&#8217;t need to take any additional action at all!</li>
</ul>
<p>Although the system provides formal evidence that reactive security can be beneficial, it also provides formal evidence that <em>extreme</em> reactive security, pouring all your resources into the system that was just attacked last week, is a terrible strategy. Simple attack strategies which alternate between systems can exploit this kind of reactionary tactic.</p>
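<p>A toy simulation shows why the myopic strategy fails. Everything here is a simplification of mine rather than something taken from the paper: two systems, an attacker who simply alternates between them, and two hypothetical defenders, one that pours its whole budget into the system attacked last and one that spreads the budget according to the whole history.</p>

```python
def undefended_hits(defend, rounds=10):
    """Count rounds in which an alternating attacker hits a system with no defense."""
    systems = ["A", "B"]
    history = []
    hits = 0
    for r in range(rounds):
        allocation = defend(history, systems)
        target = systems[r % 2]  # the attacker alternates A, B, A, B, ...
        if allocation[target] == 0.0:
            hits += 1
        history.append(target)
    return hits


def myopic(history, systems):
    # Pour the whole budget into whatever was attacked in the previous round.
    if not history:
        return {s: 1.0 / len(systems) for s in systems}
    return {s: (1.0 if s == history[-1] else 0.0) for s in systems}


def averaging(history, systems):
    # Spread the budget in proportion to how often each system has been attacked
    # (plus one, so that no system ever starts at zero).
    counts = {s: 1 + history.count(s) for s in systems}
    total = sum(counts.values())
    return {s: counts[s] / total for s in systems}


myopic_hits = undefended_hits(myopic)
averaging_hits = undefended_hits(averaging)
print(myopic_hits, averaging_hits)  # 9 0
```

<p>After the first round the myopic defender is always guarding the wrong system, so nine of the ten attacks land on a system with no defense at all. The averaging defender never leaves either system bare.</p>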
<p>Despite the many weaknesses and limitations enumerated above, as I would expect to find in any nascent research area, I find this work exciting and think it opens up a range of possibilities for software security management and challenges the intuition that reacting to attackers is impulsive or short-sighted. Perhaps future work in this area may provide richer models that will offer more new and surprising strategies to defenders.</p>
]]></content:encoded>
			<wfw:commentRss>http://papersincomputerscience.org/2011/03/25/a-learning-based-approach-to-reactive-security/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Conditioned-safe Ceremonies and a User Study of an Application to Web Authentication</title>
		<link>http://papersincomputerscience.org/2010/06/13/conditioned-safe-ceremonies-and-a-user-study-of-an-application-to-web-authentication/</link>
		<comments>http://papersincomputerscience.org/2010/06/13/conditioned-safe-ceremonies-and-a-user-study-of-an-application-to-web-authentication/#comments</comments>
		<pubDate>Sun, 13 Jun 2010 10:30:38 +0000</pubDate>
		<dc:creator>dcoetzee</dc:creator>
				<category><![CDATA[Security]]></category>

		<guid isPermaLink="false">http://papersincomputerscience.org/?p=181</guid>
		<description><![CDATA[This paper introduces conditioned-safe ceremonies, an informal model for security protocols that explicitly models the actions of users. Rather than conservatively considering users to be unpredictable agents capable of any action, it takes advantage of their properties as creatures of habit to help facilitate the desired, secure outcome.]]></description>
			<content:encoded><![CDATA[<p><strong>Citation</strong>: Chris Karlof, J. Doug Tygar, David Wagner. Conditioned-safe Ceremonies and a User Study of an Application to Web Authentication. Sixteenth Annual Network and Distributed Systems Security Symposium, 2009. (<a href="http://www.cs.berkeley.edu/~daw/papers/condsafe-ndss09.pdf">PDF</a>)</p>
<p><strong>Abstract</strong>: We introduce the notion of a conditioned-safe ceremony. A “ceremony” is similar to the conventional notion of a protocol, except that a ceremony explicitly includes human participants. Our formulation of a conditioned-safe ceremony draws on several ideas and lessons learned from the human factors and human reliability community: forcing functions, defense in depth, and the use of human tendencies, such as rule-based decision making. We propose design principles for building conditioned-safe ceremonies and apply these principles to develop a registration ceremony for machine authentication based on email. We evaluated our email registration ceremony with a user study of 200 participants. We designed our study to be as ecologically valid as possible: we employed deception, did not use a laboratory environment, and attempted to create an experience of risk. We simulated attacks against the users and found that email registration was significantly more secure than challenge question based registration. We also found evidence that conditioning helped email registration users resist attacks, but contributed towards making challenge question users more vulnerable.</p>
<p><strong>Discussion</strong>: This paper from NDSS 2009 introduces <em>conditioned-safe ceremonies</em>, an informal model for security protocols that explicitly models the actions of users. Rather than conservatively considering users to be unpredictable agents capable of any action, it takes advantage of their properties as creatures of habit to help facilitate the desired, secure outcome.</p>
<p>Many of us are familiar with the problem of security warnings sometimes known as <em>click fatigue</em>: when a user is asked to dismiss a security warning frequently during normal operation, they begin to disregard it in all situations. This was for example the primary criticism of Windows Vista&#8217;s <a href="http://en.wikipedia.org/wiki/User_Account_Control">User Account Control</a> (UAC) feature. There are two reasons for this: one is that security is not the primary concern of users who are focused on completing their primary task; the other is that humans asked to perform a process repeatedly will naturally begin to streamline the process by omitting optional steps and completing mandatory steps using rapid rule-based processing and simple pattern matching. If a situation called for a particular response in the past, visually similar stimuli will encourage users to perform the same task nearly automatically. In psychology this kind of decision-making strategy that settles upon an adequate solution is known as <a href="http://en.wikipedia.org/wiki/Satisficing"><em>satisficing</em></a>, and is very difficult to reverse.</p>
<p>Unfortunately this is precisely the type of user behavior exploited by phishers: a typical user presented with a log-in form resembling one they have used many times in the past will thoughtlessly enter their credentials, having long since eliminated the optional steps of carefully examining the security indicators that would expose it as a fraudulent replica.</p>
<p>A <a href="http://en.wikipedia.org/wiki/Cryptographic_protocol">cryptographic protocol</a> &#8211; such as <a href="http://en.wikipedia.org/wiki/Transport_Layer_Security">SSL</a> &#8211; is usually described in terms of a number of nodes representing participating machines which exchange messages over channels. A <em>ceremony</em>, a term coined by Intel&#8217;s Jesse Walker, extends the concept of protocols by incorporating nodes for the human users themselves and explicitly representing communication between users and their machines via I/O devices (Carl Ellison. <a href="http://eprint.iacr.org/2007/399.pdf">Ceremony Design and Analysis</a>. Cryptology ePrint Archive, Report 2007/399, 2007). This model opens up opportunities for modelling user behavior.</p>
<p>A <em>conditioned-safe</em> ceremony is one designed under the assumption that users will satisfice and behave according to habit; it operates by <em>conditioning</em> the user to follow certain <em>rules</em> during the ceremony. A simple example of conditioning is the Windows log-in screen, which asks the user to press CTRL+ALT+DEL before logging in. This key combination signals to the operating system that the information entered in the log-in dialog should not be made available to applications or keyboard interception drivers. Because users are always asked to do this, and are not permitted to skip this step, they develop a consistent habit of doing so. The inability of a user to log in without first pressing this key combination is called a <em>forcing function</em>: forcing functions encourage conditioning and discourage omitting steps which may seem unimportant.</p>
<p>A conditioned-safe ceremony should satisfy several important properties:</p>
<ul>
<li>It should only condition <em>safe</em> rules &#8211; &#8220;rules that are harmless to apply in the presence of an adversary.&#8221;</li>
<li>It should condition at least one <em>immunizing</em> rule &#8211; &#8220;a rule which when applied during an attack causes the attack to fail.&#8221;</li>
<li>Conditioned rules should be safe to follow under all circumstances, without any complex decision-making.</li>
<li>It should not assume users will reliably perform any action that is not conditioned by the ceremony.</li>
</ul>
<p>There are two different types of errors a user can make during a ceremony which may threaten its security:</p>
<ul>
<li>An error of <em>omission</em>: The user was expected to apply a rule but took no action.</li>
<li>An error of <em>commission</em>: The user took an unexpected action not conditioned by the ceremony.</li>
</ul>
<p>An attacker may attempt to induce either type of error. An error of <em>omission</em> occurs, for example, if an application pops up a Windows log-in box and the user unthinkingly enters their password without pressing the protective CTRL+ALT+DEL key combination, because they were not instructed to do so in this instance. An error of <em>commission</em> is usually induced by the attacker giving the user specific instructions, such as &#8220;visit this URL in your web browser.&#8221; Users tend to be suspicious of unfamiliar instructions, making attacks of this type more difficult. An ideal conditioned-safe ceremony should protect against as many errors of both types as possible.</p>
<p>The paper presents an example conditioned-safe ceremony for machine validation: the first time a user logs into a site from a particular machine, they must validate their identity. To do this, the site sends them an e-mail containing a link that they must click; after the link is clicked, a cookie is installed and the user has full access to the site from their current machine. The link only works once. The goal of the attacker is to trick the user into <em>not</em> clicking on the link in the e-mail, and instead giving it to the attacker; to accomplish this, the attacker displays a phishing web page with specific instructions to paste the link into that page. This involves both an error of omission (not clicking on a link that they usually click on), and an error of commission (pasting the link into the website, an action they do not normally take). The expectation is that, if users tend to perform actions they are accustomed to, they will ignore or fail to complete the attacker&#8217;s instructions, and the attack will fail.</p>
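<p>The server side of such a ceremony can be sketched as follows. This is a minimal illustration under my own assumptions, not the implementation used in the study: the class and method names are made up, and the e-mail delivery and cookie handling are reduced to returning random strings.</p>

```python
import secrets


class EmailRegistration:
    """Toy server side of the ceremony; class and method names are made up."""

    def __init__(self):
        self.pending = {}     # one-time token -> user who requested it
        self.trusted = set()  # (user, machine cookie) pairs allowed to log in

    def start(self, user):
        """First login from a new machine: returns the token to put in the e-mailed link."""
        token = secrets.token_urlsafe(16)
        self.pending[token] = user
        return token

    def click_link(self, token):
        """The e-mailed link was opened: returns a cookie for that browser, or None."""
        user = self.pending.pop(token, None)  # pop makes the link single-use
        if user is None:
            return None
        cookie = secrets.token_urlsafe(16)
        self.trusted.add((user, cookie))
        return cookie


site = EmailRegistration()
token = site.start("alice")
first_click = site.click_link(token)
second_click = site.click_link(token)
print(first_click is not None, second_click is None)  # True True
```

<p>The cookie goes to whichever browser opens the link, which is exactly why the attacker needs the user to hand over the link instead of clicking it, and why a user who clicks out of habit defeats the attack.</p>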
<p>Sure enough, the experiment bears this out: although as many as 40% of users fall for the attack described above, an alternative design that does not follow the principles of conditioned-safe ceremonies leads to attack success rates of over 90%. Interviews with the subjects who didn&#8217;t fall for the attack show that over half of them didn&#8217;t notice the attacker&#8217;s special instructions, or thought they were unimportant &#8211; the same inattentive attitude that makes security warnings useless now <em>benefits</em> users!</p>
<p>The most exciting thing about this work to me is that it&#8217;s one of the first to adopt and exploit a successful model of human behavior in the design of security protocols &#8211; a critical step, as humans all too often remain the weakest link in any secure system. Previous efforts have given great insight into the types of errors people make, but not into how designs can work around these limitations in human behavior.</p>
<p>On the other hand, the informal model presented in this work stands in stark contrast to the mathematical models used in cryptography, where cryptographic protocols are routinely subjected to formal verification techniques such as model checking and theorem proving &#8211; an important future direction is to generalize these same tools to ceremonies. Moreover, although satisficing behavior is evidently an important component in user behavior, it is obviously not the only such component: 40% of users were persuaded to commit multiple errors in the ceremony. It should come as no surprise that humans are complex creatures that cannot be adequately modelled by a simple set of conditioned rules. In this case, what processes underlie these divergent behaviors, and how can they be modelled? Another important question involves modelling of errors, or divergence of users from the model: can we empirically predict the likelihood of certain sets of errors occurring, and then formally validate that attacks are not possible in the most likely scenarios? The attacks in this work were relatively <em>ad hoc</em> and don&#8217;t seem to rule out the possibility of another attack involving only a single user error.</p>
<p>In short, the area of ceremonies is fertile ground for the development of new models that can effectively predict the behavior of the system as a whole, facilitating the development of protocols that will subtly push users towards making all the right security decisions, even when security is the last thing on their mind.</p>
<p><em>The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://papersincomputerscience.org/2010/06/13/conditioned-safe-ceremonies-and-a-user-study-of-an-application-to-web-authentication/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Efficient Software-Based Fault Isolation</title>
		<link>http://papersincomputerscience.org/2009/12/19/efficient-software-based-fault-isolation/</link>
		<comments>http://papersincomputerscience.org/2009/12/19/efficient-software-based-fault-isolation/#comments</comments>
		<pubDate>Sat, 19 Dec 2009 22:37:27 +0000</pubDate>
		<dc:creator>dcoetzee</dc:creator>
				<category><![CDATA[Operating systems]]></category>
		<category><![CDATA[Security]]></category>

		<guid isPermaLink="false">http://papersincomputerscience.org/?p=174</guid>
		<description><![CDATA[This 1993 paper describes a software-based method for isolating modules on RISC machines, with immediate application to fine-grained privilege separation. Address spaces, implemented in hardware, are used to isolate processes in modern commodity OS's, and software fault isolation (SFI) is an alternative with several advantages: most notably, it requires no hardware support, and communication cost between protection domains is much lower.]]></description>
			<content:encoded><![CDATA[<p><strong>Citation</strong>: Wahbe, R., Lucco, S., Anderson, T. E., and Graham, S. L. 1993. Efficient software-based fault isolation. In <em>Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles</em> (Asheville, North Carolina, United States, December 05 &#8211; 08, 1993). SOSP &#8217;93. ACM, New York, NY, 203-216. (<a href="http://www.cs.washington.edu/homes/tom/pubs/sfi.ps">PS</a>) (<a href="http://crypto.stanford.edu/cs155/papers/sfi.pdf">PDF</a>)</p>
<p><strong>Abstract</strong>: One way to provide fault isolation among cooperating software modules is to place each in its own address space. However, for tightly-coupled modules, this solution incurs prohibitive context switch overhead. In this paper, we present a software approach to implementing fault isolation within a single address space. Our approach has two parts. First, we load the code and data for a distrusted module into its own <em>fault domain</em>, a logically separate portion of the application&#8217;s address space. Second, we modify the object code of a distrusted module to prevent it from writing or jumping to an address outside its fault domain. Both these software operations are portable and programming language independent.</p>
<p>Our approach poses a tradeoff relative to hardware fault isolation: substantially faster communication between fault domains, at a cost of slightly increased execution time for distrusted modules. We demonstrate that for frequently communicating modules, implementing fault isolation in software rather than hardware can substantially improve end-to-end application performance.</p>
<p><strong>Discussion</strong>: This 1993 paper by <a href="http://www.microsoft.com/presspass/exec/wahbe/">Wahbe</a> and <a href="http://www.microsoft.com/presspass/exec/de/Lucco/default.mspx">Lucco</a> (now at Microsoft), <a href="http://www.cs.washington.edu/homes/tom/">Thomas E. Anderson</a> (now at the University of Washington), and <a href="http://www.eecs.berkeley.edu/~graham/">Susan L. Graham</a> describes a software-based method for isolating modules on RISC machines, with immediate application to fine-grained privilege separation. Address spaces, implemented in hardware, are used to isolate processes in modern commodity OS&#8217;s, and software fault isolation (SFI) is an alternative with several advantages: most notably, it requires no hardware support, and communication cost between protection domains is much lower.</p>
<p><strong>Background</strong></p>
<p>Suppose a system is divided into <em>modules </em>(components) &#8211; for example, a multitasking operating system may have a kernel with various applications running on top of it. A system is said to provide <em>fault isolation</em> if it can recover from the failure of a module without risking the integrity of the rest of the system. For example, if you&#8217;re playing a game and it crashes, that shouldn&#8217;t cause your web browser &#8211; or your entire computer &#8211; to crash, or even to malfunction. Fault isolation is valuable for mitigating failures in large software systems where failures are inevitable.</p>
<p>Fault isolation also forms the foundation for a valuable security technique called <a href="http://en.wikipedia.org/wiki/Privilege_separation"><em>privilege separation</em></a>: if each module is only permitted to perform a limited set of certain operations, then even if a module is compromised by an attacker, the attacker only gains access to the privileges held by that module, rather than the entire system. For example, if you&#8217;re playing a game it should only have access to the game data files, not the files containing your bank information, which it doesn&#8217;t need; then, even if an attacker hijacks your game, they still can&#8217;t access your bank information. Privilege separation allows security vulnerabilities in large systems to be mitigated and allows the effort of security verification and auditing to be concentrated on the modules that hold the most dangerous privileges.</p>
<p>In a typical fault isolation system, each module <em>owns</em> some subset of memory where it stores its code and private data structures. To ensure that a misbehaving module does not compromise the integrity of other modules, modules must not be able to write to memory owned by other modules. Moreover, modules must communicate in a controlled manner, usually through an explicit interface, so that other modules can&#8217;t trick a module into modifying its own memory maliciously on their behalf.</p>
<p>Today, the primary mechanism used to enforce these properties is <em>dynamic address translation</em>, which maps virtual addresses (memory locations accessed by the application) to physical addresses (locations on the memory device itself) on-the-fly. This functionality is implemented by a specialized piece of hardware called a <a href="http://en.wikipedia.org/wiki/Memory_management_unit">memory management unit</a> (MMU). By introducing this layer of indirection, the operating system can ensure that a process has no ability to read or write memory belonging to other processes simply by configuring the MMU to provide no mapping from any virtual address to these locations. Because only the operating system kernel can reconfigure the MMU, this protection is secure enough to use with malicious code.</p>
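<p>The effect of this indirection can be mimicked in a few lines. The sketch below is only a model of what the hardware does: the 4096-byte page size, the single-level table, and the names <code>translate</code> and <code>ProtectionFault</code> are assumptions chosen for illustration.</p>

```python
PAGE_SIZE = 4096  # bytes per page; a common size, assumed here


class ProtectionFault(Exception):
    """Raised when a process touches an address it has no mapping for."""


def translate(page_table, virtual_address):
    """Map a virtual address to a physical one, or fault if no mapping exists."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    if page not in page_table:
        raise ProtectionFault(f"no mapping for virtual page {page}")
    return page_table[page] * PAGE_SIZE + offset


# Each process has its own table: virtual page number -> physical frame number.
process_a = {0: 7, 1: 3}
process_b = {0: 5}

print(translate(process_a, 4100))  # 12292: page 1, offset 4, in frame 3
try:
    translate(process_b, 4100)
except ProtectionFault as fault:
    print("fault:", fault)
```

<p>Process B cannot even name frame 3 or frame 7, because nothing in its table points there. Protection comes from the absence of a mapping rather than from a check on each individual write.</p>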
<p>The downside to MMU-based protection is that communication is expensive. Suppose module A wants to send a message to module B. Normally, if both modules had the same view of memory, it would just call module B and pass it a pointer to the message. This takes only a few instructions and is very fast. When two modules have different address space mappings, however, the cost rises dramatically: just switching from running one module to running the other requires a context switch, which involves saving and restoring the complete register set, reconfiguring the MMU, and flushing the MMU&#8217;s cache (the <a href="http://en.wikipedia.org/wiki/Translation_lookaside_buffer">translation lookaside buffer</a>). On top of that, it can&#8217;t just pass a pointer to the message, because for module B that same pointer may map to a different location in physical memory. There are <a href="http://en.wikipedia.org/wiki/Inter-process_communication">many different ways</a> to send messages from one process to another, but these all involve overhead.</p>
<p>In applications where the number of messages sent is large, this overhead rapidly becomes prohibitive. An example is a Gigabit Ethernet driver, which has to send a new packet to the network stack about every 12 microseconds. Considering the CPU time needed to parse the headers, this leaves very little time for expensive context switches. As a consequence, in practice components like this are simply not modularized; if they crash, the system crashes.</p>
<p>More generally, any system that is decomposed into fine-grained modules will tend to exhibit a lot of communication between modules &#8211; the performance disadvantage of doing this under the MMU model outweighs the security and robustness advantages of having smaller modules.</p>
<p><strong>Software fault isolation (SFI)<br />
</strong></p>
<p>Software-based fault isolation, or SFI, aims to provide a substantially different model for fault isolation emphasizing fast communication between modules. In this model, all modules have the same view of memory (run in the same address space), but different subsets of memory are owned by different modules, and every write to memory is checked to make sure the current module owns that memory. Additionally, function calls between modules are controlled so that each module can only be entered at a predetermined set of entry points described in its interface specification.</p>
<p>Despite the name, the important thing here is not that these mechanisms are implemented in software &#8211; this offers the convenience of deploying the solution on existing commodity platforms, but is not essential, and indeed most SFI systems rely on some creative combination of software mechanisms and (existing) hardware mechanisms.</p>
<p>In its simplest form, this model is straightforward to provide: imagine you have a trusted compiler and you use it to compile an application consisting of multiple modules.  Each module is assigned a fixed range of memory for its code and data. Whenever the compiler emits a write instruction, it also emits a check to make sure that the address being written to is in the range of the module being executed. Likewise, whenever the compiler emits an indirect branch or jump, it inserts a check to make sure that that jump either lies within the current module, or is a valid entry point of some other module. This simple solution has two main problems:</p>
<ol>
<li>It&#8217;s really slow.</li>
<li>It&#8217;s possible for a malicious module to circumvent the checks. For example, it could insert an indirect branch to one of its own write instructions, skipping the check in front of it.</li>
</ol>
<p>To deal with the circumvention problem, Wahbe et al use a <em>dedicated register</em> that holds an address. When the compiler emits code, it maintains two invariants:</p>
<ol>
<li>All writes must be performed to the address stored in the dedicated register;</li>
<li>The dedicated register should almost always point to a valid address inside the current module; if any instruction violates this invariant, the program must either fail or restore the invariant before the next write or indirect branch instruction.</li>
</ol>
<p>Now, a module is free to jump to any instruction it wants within its own bounds, without risking writing to another module&#8217;s memory. This does imply that the dedicated register can&#8217;t be used for any other purpose, but on a RISC machine with 32 registers this is not a problem. The same trick can be used to protect indirect branches (jumps to data must be excluded; this can be done either by using a second dedicated register for code, or by marking all data non-executable).</p>
<p>This leaves open the question of how to allow calls between modules (called <em>cross-fault-domain RPCs</em> by Wahbe et al). The scheme implemented in this work stores a jump table inside each module, with one entry for each possible cross-module call. The trusted compiler permits this jump table (and only this jump table) to contain branches to points outside the module. This allows the checks on indirect jumps to be very simple (they just have to make sure modules only jump to their own code region). Rather than transferring control directly to another domain, these jump tables transfer control to call stubs which perform several important operations before invoking the actual call:</p>
<ul>
<li>Because each module needs to access its own data on the call stack, each module is given a private stack in its own data region. When transferring control to another module, we have to switch to its stack and copy across any stack-based arguments. A similar mechanism is used by commodity OSs when trapping in and out of the kernel.</li>
<li>Standard calling conventions include callee-save registers, registers which must not be altered by the function being called. Since the called module may be malicious, the call stub saves these registers instead.</li>
<li>The dedicated register invariants must be restored upon return.</li>
</ul>
<p>Finally, we return to the problem of performance: inserting complex checks before every write and indirect jump is expensive, especially if those checks involve branches. The insight of Wahbe et al is that if the code and data region for each module is the set of all addresses with some common bit prefix (say, addresses <tt>0x1f000000</tt> through <tt>0x1fffffff</tt>) then we can make sure that all writes and jumps go into this region by merely bitmasking the target address to have the correct prefix. If the original address lies outside the region, this will cause some arbitrary part of the module to get overwritten or jumped to &#8211; but only malicious or invalid code will encounter this behavior, so it&#8217;s not a problem (except, perhaps, during debugging).</p>
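<p>A minimal sketch of the masking idea, written as a Java simulation rather than as the instruction sequence a real SFI compiler would emit. The 8-bit segment prefix <tt>0x1f</tt>, the 24-bit offset width, and the names <tt>SandboxMask</tt> and <tt>sandbox</tt> are illustrative assumptions, not details taken from the paper:</p>

```java
final class SandboxMask {
    // Hypothetical layout: the top 8 bits of a 32-bit address name the
    // segment, the low 24 bits are the offset within it.
    static final int OFFSET_MASK = 0x00ffffff;
    static final int SEGMENT_BITS = 0x1f000000;

    // Force an arbitrary address into the module's segment: clear the
    // segment bits, then set them to the module's own prefix. No branch
    // is needed, which is what makes this cheaper than a range check.
    static int sandbox(int address) {
        return (address & OFFSET_MASK) | SEGMENT_BITS;
    }

    public static void main(String[] args) {
        // An in-segment address is left unchanged: prints 1f001234
        System.out.println(Integer.toHexString(sandbox(0x1f001234)));
        // An out-of-segment address is redirected into the segment:
        // also prints 1f001234
        System.out.println(Integer.toHexString(sandbox(0x2a001234)));
    }
}
```

<p>Note that the second call does not report an error; it silently lands somewhere inside the module&#8217;s own segment, which is the behavior the paragraph above describes.</p>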
<p>Another important optimization involves the stack: writes to the stack are considerably more common than writes to the heap in typical programs, and are usually made at a small fixed offset from either the stack pointer or the frame pointer. Rather than check every one of these, they take advantage of the stack&#8217;s locality by maintaining the invariant that the stack pointer (or frame pointer) points inside the module&#8217;s data region, and only checking it when it&#8217;s modified. Since writes through the stack pointer may have a small offset attached, they create &#8220;guard regions&#8221; containing nothing useful before and after each module&#8217;s data region; even if the stack pointer is at the beginning or end of the region and the offset is set to the minimum or maximum value, it still can&#8217;t write past the guard regions.</p>
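<p>The guard-region argument is just interval arithmetic, sketched below in Java. The region base and size and the signed 16-bit displacement range are hypothetical numbers picked for illustration; the only requirement is that each guard zone is at least as large as the biggest displacement a store instruction can encode:</p>

```java
final class GuardRegion {
    // Hypothetical numbers: a 16 MiB data region and store instructions
    // with a signed 16-bit displacement (-32768 .. 32767).
    static final long REGION_BASE = 0x1f000000L;
    static final long REGION_SIZE = 0x01000000L;
    static final long MIN_OFFSET = -32768;
    static final long MAX_OFFSET = 32767;
    // Each guard zone covers the largest displacement magnitude.
    static final long GUARD_SIZE = 32768;

    // True if a store at base + offset lands inside the data region or
    // one of its guard zones, i.e. never in another module's memory.
    static boolean staysWithinGuards(long base, long offset) {
        long target = base + offset;
        return target >= REGION_BASE - GUARD_SIZE
            && target < REGION_BASE + REGION_SIZE + GUARD_SIZE;
    }
}
```

<p>With these numbers, a stack pointer at either extreme of the region combined with the most extreme legal displacement still passes the check, so only changes to the stack pointer itself need to be sandboxed.</p>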
<p>To actually implement privilege separation with SFI, a little something extra is needed &#8211; if all modules in an application could make the same system calls, they would effectively have the same privilege. As described in section 3.4, Wahbe et al&#8217;s scheme uses a simple but flexible model in which one module is permitted to make system calls freely and no other module is permitted to make any; it acts as an <em>arbitrator</em> and can implement arbitrary fine-grained policies by observing which module invoked it.</p>
<p>Evaluations show that with all the optimizations described above, the overall scheme is quite efficient &#8211; runtime overhead ranged from 0 to 12%, with an average of 4.3%. However, this is only checking writes &#8211; checking reads, which is done in the MMU model and is important for protecting sensitive module-private data such as passwords, requires a much higher overhead due to the larger number of reads (21.8% on average).</p>
<p>One practical issue with Wahbe et al&#8217;s scheme as a security solution is that it depends on a large, trusted compiler for correctness &#8211; later works such as MIT&#8217;s PittSFIeld mitigated this problem by using a separate, smaller verifier that is run on machine code when loading it.</p>
<p><strong>Practical impact and later work<br />
</strong></p>
<p>Despite the apparent utility of Wahbe et al&#8217;s scheme, sometimes termed <em>classical SFI</em>, it is not widely used in practice. This may be because the scheme depends critically on a number of features of RISC machines, and RISC machines are not widely deployed in the PC market (they are, however, quite common in the mobile and embedded market). It may also be because it was never effectively developed into a robust product or integrated well with other tools. More recent work, such as MIT&#8217;s <a href="http://people.csail.mit.edu/smcc/projects/pittsfield/">PittSFIeld</a> (2006) and <a href="http://pdos.csail.mit.edu/~baford/vm/">Vx32</a> (2008), has effectively extended SFI to CISC platforms such as the x86 and put more effort into promulgating a robust prototype.</p>
<p>Although SFI is a hot research area now, I remain skeptical of approaches like Vx32 that depend on creative exploitation of legacy hardware that is in the process of being phased out. The way I see it, SFI presents a useful fault isolation model that is independent of its software-based implementation, and deserves explicit hardware support to mitigate its performance costs. Doing an efficient check for each read and write is exactly the sort of thing hardware is good at (and software is really bad at). The hard question is exactly what hardware support should be implemented to support typical applications &#8211; SFI is so rarely deployed that few applications have ever been architected with fine-grained modularization and privilege separation in mind, and the array of possible design choices is overwhelming. Some proposed schemes include <a href="http://groups.csail.mit.edu/cag/scale/mondriaan/index.html">Mondriaan Memory Protection</a> (2002) and the <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-97.html">Hard Object</a> (2009) project I worked on at UC Berkeley. Just as important is more work on programming tools to describe and implement module isolation and interfaces in a generic way that can be implemented using a variety of fault isolation schemes, ranging from no protection to dynamic address translation to SFI to new hardware schemes &#8211; this will help to bootstrap the process of getting large enough applications built that new platforms for isolation can be evaluated. Regardless, this is going to be an important area of research in the near future and I&#8217;m excited to see what techniques will gain adoption.</p>
<p><em>The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://papersincomputerscience.org/2009/12/19/efficient-software-based-fault-isolation/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>On understanding data abstraction, revisited</title>
		<link>http://papersincomputerscience.org/2009/12/05/on-understanding-data-abstraction-revisited/</link>
		<comments>http://papersincomputerscience.org/2009/12/05/on-understanding-data-abstraction-revisited/#comments</comments>
		<pubDate>Sat, 05 Dec 2009 04:57:11 +0000</pubDate>
		<dc:creator>dcoetzee</dc:creator>
				<category><![CDATA[Programming languages]]></category>

		<guid isPermaLink="false">http://papersincomputerscience.org/?p=165</guid>
		<description><![CDATA[This October 2009 essay by William R. Cook seeks to clarify the fundamental difference between two often-confused forms of data abstraction: abstract data types (ADTs) and objects (as in object-oriented programming).]]></description>
			<content:encoded><![CDATA[<p><strong>Citation</strong>: William R. Cook. On understanding data abstraction, revisited. ACM SIGPLAN Notices 44, 10, 557-572. October 2009. (<a href="http://www.cs.utexas.edu/~wcook/Drafts/2009/essay.pdf">PDF</a>)</p>
<p><strong>Abstract</strong>: In 1985 Luca Cardelli and Peter Wegner, my advisor, published an ACM Computing Surveys paper called “On understanding types, data abstraction, and polymorphism”. Their work kicked off a flood of research on semantics and type theory for object-oriented programming, which continues to this day. Despite 25 years of research, there is still widespread confusion about the two forms of data abstraction, abstract data types and objects. This essay attempts to explain the differences and also why the differences matter.</p>
<p><strong>Discussion</strong>: This October 2009 essay by Assistant Professor William R. Cook of the University of Texas at Austin, an active programming languages researcher, seeks to clarify the fundamental difference between two often-confused forms of data abstraction: <a href="http://en.wikipedia.org/wiki/Abstract_data_type">abstract data types</a> (ADTs) and <a href="http://en.wikipedia.org/wiki/Object_%28computer_science%29">objects</a> (as in <a href="http://en.wikipedia.org/wiki/Object-oriented_programming">object-oriented programming</a>).</p>
<p>Both abstract data types and objects have the same basic purpose: they allow the same functionality to be implemented in different ways, and allow the implementation to be modified without affecting client code. For example, when using a &#8220;<a href="http://en.wikipedia.org/wiki/Associative_array">dictionary</a>&#8221; module mapping keys to values, the client shouldn&#8217;t have to care whether that functionality is internally implemented using a hash table or a red-black tree; and the implementer should be free to modify this kind of implementation detail at any time.</p>
<p>The &#8220;objects&#8221; provided by mainstream object-oriented programming languages such as Java, C++, and C# are actually a sort of hybrid of abstract data types and true objects, able to effectively simulate both abstractions. To demonstrate the distinction, here we outline an implementation for an &#8220;IntQueue&#8221; class representing a first-in first-out queue of integers in Java using both methods. First using objects:</p>
<p>[sourcecode language="java"]<br />
interface IIntQueue {<br />
  public boolean isEmpty();<br />
  public void enqueue(int i);<br />
  public int dequeue();<br />
  public IIntQueue append(IIntQueue q);<br />
}</p>
<p>class IntQueue implements IIntQueue {<br />
  java.util.ArrayList&lt;Integer&gt; list;</p>
<p>  public IntQueue() { list = new java.util.ArrayList&lt;Integer&gt;(); }<br />
  public boolean isEmpty() { return list.isEmpty(); }<br />
  public void enqueue(int i) { list.add(i); }<br />
  // remove(0) both removes and returns the front element<br />
  public int dequeue() { return list.remove(0); }<br />
  public IIntQueue append(IIntQueue q) {<br />
    IIntQueue result = new IntQueue();<br />
    while (!this.isEmpty()) { result.enqueue(this.dequeue()); }<br />
    while (!q.isEmpty()) { result.enqueue(q.dequeue()); }<br />
    return result;<br />
  }<br />
}</p>
<p>public class Program {<br />
  public static void main(String[] args) {<br />
    IIntQueue q1 = new IntQueue(), q2 = new IntQueue();<br />
    q1.enqueue(1); q1.enqueue(2);<br />
    q2.enqueue(3); q2.enqueue(4);<br />
    IIntQueue q3 = q1.append(q2);<br />
    while (!q3.isEmpty()) {<br />
      System.out.println(q3.dequeue());<br />
    }<br />
  }<br />
}<br />
[/sourcecode]</p>
<p>This Java sample follows two important disciplines:</p>
<ol>
<li>Concrete class names, such as IntQueue, are only ever used in &#8220;new&#8221; expressions, not as parameter types, return types, or even local variable types &#8211; and this includes in the definition of the IntQueue class itself. Instead, we are constrained to exclusively use interfaces (pure virtual classes in C++) for our static types.</li>
<li>Reference equality (==) is never used.</li>
</ol>
<p>Although it may seem draconian, adhering to this &#8220;pure object&#8221; discipline buys some powerful flexibility. For example, instead of having the &#8220;append&#8221; method construct the combined queue all at once, we can introduce a new class that allows us to rewrite it as a constant-time operation:</p>
<p>[sourcecode language="java"]<br />
class IntQueue implements IIntQueue {<br />
  // (same as before)<br />
  public IIntQueue append(IIntQueue q) { return new LazyAppendIntQueue(this, q); }<br />
}</p>
<p>class LazyAppendIntQueue implements IIntQueue {<br />
  IIntQueue q1, q2;</p>
<p>  public LazyAppendIntQueue(IIntQueue q1, IIntQueue q2) { this.q1 = q1; this.q2 = q2; }<br />
  public boolean isEmpty() { return q1.isEmpty() &amp;&amp; q2.isEmpty(); }<br />
  public void enqueue(int i) { q2.enqueue(i); }<br />
  public int dequeue() { if (q1.isEmpty()) return q2.dequeue(); else return q1.dequeue(); }<br />
  public IIntQueue append(IIntQueue q) { return new LazyAppendIntQueue(this, q); }<br />
}<br />
[/sourcecode]</p>
<p>This modification requires no changes to the client code; indeed, no client code could possibly detect any change in behavior, as long as it&#8217;s constrained to accessing objects through the IIntQueue interface.</p>
<p>An alternate way of optimizing the append() method is to append the underlying arrays, and then make this the underlying array of a new IntQueue. In the strict objects model, however, this is impossible: the append() method can see the representation of <em>this</em>, but can only access its argument through the IIntQueue interface. We might try to fix this by adding a method to the interface to get the underlying array, but that would severely limit the possible implementations of IIntQueue (in particular, LazyAppendIntQueue above would be excluded). The restriction that objects cannot access other objects except through their interface, even other instances of the same type, is called autognosis, and is a fundamental limitation of objects.</p>
<p>Now let&#8217;s consider how this same data structure could be implemented in the style of an abstract data type (ADT):</p>
<p>[sourcecode language="java"]<br />
class IntQueue {<br />
  java.util.ArrayList&lt;Integer&gt; list;</p>
<p>  public IntQueue() { list = new java.util.ArrayList&lt;Integer&gt;(); }<br />
  public boolean isEmpty() { return list.isEmpty(); }<br />
  public void enqueue(int i) { list.add(i); }<br />
  // remove(0) both removes and returns the front element<br />
  public int dequeue() { return list.remove(0); }<br />
  public IntQueue append(IntQueue q) {<br />
    IntQueue result = new IntQueue();<br />
    result.list.addAll(this.list);  // copy, so result does not alias this.list<br />
    result.list.addAll(q.list);     // direct access to q's representation<br />
    return result;<br />
  }<br />
}<br />
[/sourcecode]</p>
<p>This looks more like typical Java code &#8211; each method of IntQueue is free to ravage the internal state of any instance of IntQueue, and interfaces are not used; consequently, different implementations of the same module are not interchangeable at runtime. They are, however, interchangeable at compile-time &#8211; if we had two different implementations of &#8220;class IntQueue&#8221; in two different namespaces, we could decide with a single &#8220;import&#8221; statement which one we&#8217;d like to use. Different parts of the program might choose to use different implementations, but they wouldn&#8217;t &#8220;mix&#8221; &#8211; they couldn&#8217;t be passed to one another&#8217;s append() methods. For these reasons, ADTs are considerably less flexible than true objects, but what they lose in flexibility they make up for in simplicity and the potential for optimization.</p>
<p>Considering how much more ADTs look like normal Java code, you might be asking yourself where objects are really used in practice. In general, they tend to pop up in places where a single, relatively stable interface has many different implementations; common examples include &#8220;windows, filesystems, or device drivers.&#8221; Filesystems are the most illustrative example: by providing a single fixed API, not only can many different filesystems be accessed through the same uniform filesystem interface, but they can be composed in interesting ways, such as mounting an ISO (a file on one filesystem contains the underlying storage for a different filesystem).</p>
<p>The filesystem example also exposes one of the strongest disadvantages of objects: once many implementations have been written to a stable interface, changing that interface in any way is very difficult. ADT signatures on the other hand are straightforward to extend; for example, if a client of Dictionary needs to iterate over the dictionary in order, you can just add that functionality to the version based on red-black trees, and then compel that client to use that particular implementation.</p>
<p>Following a development process based on refactoring, it&#8217;s not uncommon to introduce an abstraction using ADTs and later refine it into a true object when more flexibility is needed. Conversely, the need for optimization using autognosis may require introducing some elements of ADTs. Using type inspection like &#8220;instanceof&#8221; and reflection, it may be possible to get the &#8220;best of both worlds&#8221; by optimizing certain common cases, but falling back on the generic interface for unrecognized types. This is sort of a poor man&#8217;s version of <a href="http://en.wikipedia.org/wiki/Multiple_dispatch">multiple dispatch</a>.</p>
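<p>A minimal sketch of that &#8220;best of both worlds&#8221; pattern, assuming the same queue interface as above. The class names <tt>HybridAppendDemo</tt>, <tt>ArrayIntQueue</tt>, and <tt>DequeIntQueue</tt>, and the <tt>builtViaFastPath</tt> flag, are hypothetical and exist only to make the two paths observable:</p>

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

final class HybridAppendDemo {
    interface IIntQueue {
        boolean isEmpty();
        void enqueue(int i);
        int dequeue();
        IIntQueue append(IIntQueue q);
    }

    // Array-backed queue whose append() special-cases its own class.
    static final class ArrayIntQueue implements IIntQueue {
        private final List<Integer> list = new ArrayList<>();
        // Set on the result of append(); demo-only bookkeeping.
        boolean builtViaFastPath;

        public boolean isEmpty() { return list.isEmpty(); }
        public void enqueue(int i) { list.add(i); }
        public int dequeue() { return list.remove(0); }

        public IIntQueue append(IIntQueue q) {
            ArrayIntQueue result = new ArrayIntQueue();
            result.list.addAll(this.list);
            if (q instanceof ArrayIntQueue) {
                // Recognized type: read its representation directly (ADT style).
                ArrayIntQueue other = (ArrayIntQueue) q;
                result.list.addAll(other.list);
                other.list.clear(); // leave q drained, as the fallback does
                result.builtViaFastPath = true;
            } else {
                // Unrecognized type: only the interface is available (object style).
                while (!q.isEmpty()) { result.enqueue(q.dequeue()); }
            }
            return result;
        }
    }

    // A second, unrelated implementation used to exercise the fallback.
    static final class DequeIntQueue implements IIntQueue {
        private final ArrayDeque<Integer> deque = new ArrayDeque<>();

        public boolean isEmpty() { return deque.isEmpty(); }
        public void enqueue(int i) { deque.addLast(i); }
        public int dequeue() { return deque.removeFirst(); }
        public IIntQueue append(IIntQueue q) {
            DequeIntQueue result = new DequeIntQueue();
            result.deque.addAll(this.deque);
            while (!q.isEmpty()) { result.enqueue(q.dequeue()); }
            return result;
        }
    }
}
```

<p>The cost of this design is that <tt>ArrayIntQueue</tt> now names a concrete class in a cast, so it no longer follows the pure object discipline; the fallback branch is what keeps other implementations usable.</p>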
<p>Neither Cook nor I promote the use of either ADTs or true objects over the other. The most important takeaway from this is: know how to use both ADTs and true objects, recognize which one you&#8217;re using, and know their advantages and disadvantages. By avoiding the conceptual conflation that hybrid data abstraction systems tend to induce, you can better predict where to expect issues with extensibility and efficiency in your system.</p>
<p><em>The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://papersincomputerscience.org/2009/12/05/on-understanding-data-abstraction-revisited/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Embedded image coding using zerotrees of wavelet coefficients</title>
		<link>http://papersincomputerscience.org/2009/07/08/embedded-image-coding-using-zerotrees-of-wavelet-coefficients/</link>
		<comments>http://papersincomputerscience.org/2009/07/08/embedded-image-coding-using-zerotrees-of-wavelet-coefficients/#comments</comments>
		<pubDate>Wed, 08 Jul 2009 06:04:03 +0000</pubDate>
		<dc:creator>dcoetzee</dc:creator>
				<category><![CDATA[Information theory and compression]]></category>
		<category><![CDATA[Signal processing]]></category>

		<guid isPermaLink="false">http://papersincomputerscience.org/?p=155</guid>
		<description><![CDATA[This 1993 paper described one of the earliest successful image compression algorithms using the discrete wavelet transform, and was the first published fully embedded code for images. With such a code, an image file can be truncated at any point and the result will be a valid approximation to the image, enabling a fine-grained tradeoff between file size and quality.]]></description>
			<content:encoded><![CDATA[<p><strong>Citation</strong>: Shapiro, J.M., &#8220;Embedded image coding using zerotrees of wavelet coefficients,&#8221; <em>Signal Processing, IEEE Transactions on</em> , vol.41, no.12, pp.3445-3462, Dec 1993. (<a href="http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=258085">IEEE</a>) (<a href="http://www.cs.tut.fi/~tabus/course/SC/Shapiro.pdf">PDF</a>)</p>
<p><strong>Abstract</strong>: The embedded zerotree wavelet algorithm (EZW) is a simple, yet remarkably effective, image compression algorithm, having the property that the bits in the bit stream are generated in order of importance, yielding a fully embedded code. The embedded code represents a sequence of binary decisions that distinguish an image from the &#8220;null&#8221; image. Using an embedded coding algorithm, an encoder can terminate the encoding at any point thereby allowing a target rate or target distortion metric to be met exactly. Also, given a bit stream, the decoder can cease decoding at any point in the bit stream and still produce exactly the same image that would have been encoded at the bit rate corresponding to the truncated bit stream. In addition to producing a fully embedded bit stream, EZW consistently produces compression results that are competitive with virtually all known compression algorithms on standard test images. Yet this performance is achieved with a technique that requires absolutely no training, no pre-stored tables or codebooks, and requires no prior knowledge of the image source.</p>
<p>The EZW algorithm is based on four key concepts: 1) a discrete wavelet transform or hierarchical subband decomposition, 2) prediction of the absence of significant information across scales by exploiting the self-similarity inherent in images, 3) entropy-coded successive-approximation quantization, and 4) universal lossless data compression which is achieved via adaptive arithmetic coding.</p>
<p><strong>Discussion</strong>: This 1993 paper described <a href="http://en.wikipedia.org/wiki/Embedded_Zerotrees_of_Wavelet_transforms">EZW</a>, one of the earliest successful image compression algorithms using the <a href="http://en.wikipedia.org/wiki/Discrete_wavelet_transform">discrete wavelet transform</a>, and was the first published <em>fully embedded code</em> for images. With such a code, an image file can be truncated at any point and the result will be a valid approximation to the image, enabling a fine-grained tradeoff between file size and quality. It is also, unusually for its time, a compression scheme that requires no training on an existing database of images, instead adapting to the statistics of the image at hand.</p>
<p>This paper assumes quite a bit of signal processing background, so I&#8217;ll summarize the necessary concepts here.</p>
<p>The first concept, adaptive <a href="http://en.wikipedia.org/wiki/Arithmetic_coding">arithmetic coding</a>, is an entropy coding strategy used in a huge variety of both lossless and lossy coding schemes. To simplify discussion, let&#8217;s consider the simple scenario where we&#8217;re decoding a bilevel (black and white) image file pixel-by-pixel in scan order, and we have access to all preceding pixels in scan order. Suppose the previous pixel is white. How can we use this information to predict the color of the next pixel? One simple way is to keep a count of how many times so far a white pixel has been followed by a black pixel, and how many times it&#8217;s been followed by a white pixel; these frequencies are used to estimate the probabilities that the next pixel will be black, or white. This is the first part of the entropy coding scheme, called the <em>model</em>; it is <em>adaptive</em> because it depends on what other pixels have been decoded so far.</p>
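<p>The counting model described above can be sketched in a few lines of Java. Starting every count at 1 (so that no outcome ever gets probability zero) is an assumption of this sketch, as are the names <tt>AdaptiveBilevelModel</tt>, <tt>probability</tt>, and <tt>update</tt>:</p>

```java
final class AdaptiveBilevelModel {
    // counts[prev][next]: how often pixel value `next` (0 = black,
    // 1 = white) has followed pixel value `prev` so far. Every count
    // starts at 1 so that no outcome is ever assigned probability zero.
    private final int[][] counts = { {1, 1}, {1, 1} };

    // Estimated probability that the next pixel is `next`, given that
    // the previous pixel was `prev`.
    double probability(int prev, int next) {
        int total = counts[prev][0] + counts[prev][1];
        return (double) counts[prev][next] / total;
    }

    // Called after each pixel is coded. Encoder and decoder both apply
    // the same updates, so their models stay in step without any side
    // information being transmitted.
    void update(int prev, int next) {
        counts[prev][next]++;
    }
}
```

<p>For example, after seeing white followed by white three times, the model estimates the probability of white-after-white as 4/5 rather than the initial 1/2.</p>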
<p>The second half is the arithmetic coding. Like all entropy coding schemes, it encodes a sequence of symbols, using fewer bits to represent a symbol that is expected to occur with high probability, and more bits to represent a symbol that is expected to occur with low probability. The details of how this coding is done are not important here; the important thing is that it is <em>near-optimal</em>, in the sense that if a symbol is expected to appear with probability <em>p</em>, it can be encoded in close to −log<sub>2</sub><em> p</em> bits, the lower bound given by <a href="http://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem">Shannon&#8217;s source coding theorem</a>, even when this value is not an integer. For example: a symbol expected to appear with probability 0.5 can be encoded in −log<sub>2</sub>0.5 = 1 bit, a symbol expected to appear with probability 0.95 can be encoded in −log<sub>2</sub>0.95 or about 0.074 bits, and a symbol expected to occur with probability 0.001 requires about 10 bits. Compared to Huffman coding, arithmetic coding is particularly useful in situations where the number of symbols is small, or where the probabilities are close to 1.</p>
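<p>The figures quoted above can be checked with a short Java helper. It computes only the ideal code length; it is not an arithmetic coder, and the name <tt>CodeLength</tt> is made up for this sketch:</p>

```java
final class CodeLength {
    // Ideal code length in bits for a symbol of probability p: -log2(p).
    // Java has no log2, so divide the natural log by ln(2).
    static double bits(double p) {
        return -Math.log(p) / Math.log(2.0);
    }

    public static void main(String[] args) {
        for (double p : new double[] { 0.5, 0.95, 0.001 }) {
            System.out.printf("p = %.3f -> %.3f bits%n", p, bits(p));
        }
    }
}
```

<p>A Huffman code must spend at least one whole bit per symbol, so the 0.95 case is where arithmetic coding gains the most.</p>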
<p>The <a href="http://en.wikipedia.org/wiki/Discrete_wavelet_transform">discrete wavelet transform</a> (DWT) is, like the <a href="http://en.wikipedia.org/wiki/Discrete_cosine_transform">discrete cosine transform</a> (DCT) used by JPEG files, a way of representing an image in the frequency domain instead of the spatial domain. It is a reversible operation that converts an array of pixels into an equal-sized array of <em>coefficients</em> each representing how much energy the image has in a certain range of frequencies, in a certain part of the image. The reason for doing this is that natural images like photographs tend to have most of their energy at low frequencies due to their smooth, continuously-varying regions, with high-frequency information like edges and fine details being limited to only a small portion of the image. The result is that after undergoing such a transformation, most of the coefficients will be close to zero, and can be truncated to zero with little visual impact on the image. The main difference between the DCT and the DWT is that the DCT is applied to a fixed-size spatial region: in JPEG, it&#8217;s always applied to an 8×8 block of pixels. The DWT, by contrast, is applied in a multiresolution fashion: it uses a high-pass filter to isolate local changes in brightness (details) in the image between adjacent pixels, then low-pass filters and downscales the image and repeats this process. As a consequence its coefficients vary not only in the frequency range they cover but also in the size of the spatial region they cover, allowing them to represent both very fine details and larger-scale details compactly. Since a range of frequencies is called a <em>subband</em>, this is also called a <em>hierarchical subband decomposition</em>.</p>
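<p>To make the high-pass/low-pass split concrete, here is one level of a one-dimensional transform in Java using the Haar wavelet. The choice of Haar, the unnormalized average/difference form, and the class name <tt>HaarSketch</tt> are assumptions made for simplicity; this is not the filter bank used in the paper:</p>

```java
final class HaarSketch {
    // One level of a 1-D Haar transform on an even-length signal.
    // First half of the output: pairwise averages (low-pass, downscaled).
    // Second half: pairwise half-differences (high-pass detail).
    static double[] forward(double[] x) {
        int half = x.length / 2;
        double[] out = new double[x.length];
        for (int i = 0; i < half; i++) {
            out[i] = (x[2 * i] + x[2 * i + 1]) / 2.0;
            out[half + i] = (x[2 * i] - x[2 * i + 1]) / 2.0;
        }
        return out;
    }

    // Exact inverse of forward(): each pair is recovered from its
    // average and half-difference.
    static double[] inverse(double[] c) {
        int half = c.length / 2;
        double[] x = new double[c.length];
        for (int i = 0; i < half; i++) {
            x[2 * i] = c[i] + c[half + i];
            x[2 * i + 1] = c[i] - c[half + i];
        }
        return x;
    }
}
```

<p>Applied to the smooth signal {10, 10, 10, 12}, <tt>forward</tt> yields {10, 11, 0, -1}: the detail half is nearly zero, as the paragraph above predicts. A multiresolution decomposition would apply <tt>forward</tt> again to the averages alone, and a 2-D version would apply it along rows and then columns.</p>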
<p><a href="http://commons.wikimedia.org/wiki/File:Jpeg2000_2-level_wavelet_transform-lichtenstein.png"><img class="alignright size-medium wp-image-157" title="Jpeg2000_2-level_wavelet_transform-lichtenstein" src="http://papersincomputerscience.org/wp-content/uploads/2009/07/jpeg2000_2-level_wavelet_transform-lichtenstein.png?w=300" alt="Jpeg2000_2-level_wavelet_transform-lichtenstein" width="300" height="300" /></a>The array of DWT coefficients can be visualized as a new image, which at low bit rates will be mostly black (close to zero), but with white in some places, representing the significant coefficients. See the example to the right (<a href="http://commons.wikimedia.org/wiki/File:Jpeg2000_2-level_wavelet_transform-lichtenstein.png">Alessio Damato / CC-BY-SA</a>). The main challenge in representing such an image is not so much representing the <em>values</em> of the significant coefficients,  but their <em>position</em>, using a data structure called a <em>significance map</em>.</p>
<p>This is where the zerotrees come in. Notice how the three images in the upper-left closely resemble the three large images. In particular, where the smaller images are black, the larger images also tend to be black. Zerotrees take advantage of this by encoding the smaller images first, and assigning one of four values to each coefficient: positive significant coefficient, negative significant coefficient, zerotree root, and isolated zero. A <em>zerotree root</em> is an insignificant (near zero) coefficient for which the corresponding coefficients in all larger images are also insignificant. An isolated zero is an insignificant coefficient that does not have this property. Once a zerotree root has been encoded, none of its descendants need to be encoded, since they&#8217;re already known to be insignificant. Additionally, since isolated zeros are rare, this symbol is assigned a small probability and the use of arithmetic coding ensures that not too much space is wasted accounting for this possibility. The parameter that determines how much space our zerotree occupies is the threshold between significant and insignificant coefficients.</p>
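<p>The four-symbol classification can be sketched as follows in Java. The tree representation (a <tt>Node</tt> holding one coefficient plus its descendants at finer scales) and all names are assumptions of this sketch; a coefficient is treated as significant when its magnitude is at least the threshold:</p>

```java
import java.util.ArrayList;
import java.util.List;

final class ZerotreeSketch {
    enum Symbol { POSITIVE, NEGATIVE, ZEROTREE_ROOT, ISOLATED_ZERO }

    // One coefficient plus the coefficients covering the same spatial
    // area in the next finer subband (its children).
    static final class Node {
        final int value;
        final List<Node> children = new ArrayList<>();
        Node(int value) { this.value = value; }
        Node add(Node child) { children.add(child); return this; }
    }

    // True if this coefficient and every descendant are insignificant.
    static boolean allInsignificant(Node n, int threshold) {
        if (Math.abs(n.value) >= threshold) return false;
        for (Node c : n.children) {
            if (!allInsignificant(c, threshold)) return false;
        }
        return true;
    }

    static Symbol classify(Node n, int threshold) {
        if (Math.abs(n.value) >= threshold) {
            return n.value > 0 ? Symbol.POSITIVE : Symbol.NEGATIVE;
        }
        // Insignificant: a zerotree root only if the whole subtree is too.
        return allInsignificant(n, threshold)
            ? Symbol.ZEROTREE_ROOT
            : Symbol.ISOLATED_ZERO;
    }
}
```

<p>An encoder built on this would skip the whole subtree after emitting <tt>ZEROTREE_ROOT</tt>, which is where the savings come from.</p>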
<p>This is already a great way of compressing images, but how do we turn this into a fully-embedded code that can be truncated at any point? The key to this lies in a successive-approximation scheme. In this scheme, initially all coefficients are assumed to be insignificant, so the initial zerotree is trivial (one zerotree symbol). We then lower the threshold (divide it by 2) and compute and output a new zerotree, and repeat this indefinitely. Any significant coefficients already detected by previous zerotrees are assumed to be zero, so that they don&#8217;t get encoded more than once. Simultaneously, each time we produce a new zerotree we also add one bit of precision to the magnitudes of all the already-known significant coefficients. We don&#8217;t have to finish writing out any particular zerotree, since each one prioritizes writing out coarse details (smaller coefficient images) before finer details (larger coefficient images); any coefficients left unencoded are assumed to be insignificant.</p>
<p>What&#8217;s happened to EZW today? Its raw compression performance and speed have long since been exceeded by other systems, such as <a href="http://en.wikipedia.org/wiki/Set_partitioning_in_hierarchical_trees">SPIHT</a> (Said, Amir and Pearlman, William A. A new fast and efficient image codec based upon set partitioning in hierarchical trees. June 1996. <a href="http://guohanwei.51.net/compression/SPIHT/SPIHT.pdf">PDF</a>), which benefits from a more complex partitioning scheme than the simple splitting-into-four shown here, and other improvements. But its ideas remain the foundation of all the best modern image compression systems. These ideas were incorporated into the <a href="http://en.wikipedia.org/wiki/JPEG_2000">JPEG 2000</a> image format, which uses a very similar wavelet-based fully-embedded code. Thanks to its multiresolution nature, this code eliminates JPEG&#8217;s blocking artifacts, permits a fine-grained quality-for-size tradeoff, gives significantly better quality at low bit rates, and allows lossy and lossless encoding in a single framework, among other advantages. JPEG 2000 hasn&#8217;t seen wide adoption in consumer applications &#8211; one can only speculate as to why &#8211; but these techniques remain a valuable way to look at image compression.</p>
<p><em>The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://papersincomputerscience.org/2009/07/08/embedded-image-coding-using-zerotrees-of-wavelet-coefficients/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Soft Heap: An Approximate Priority Queue with Optimal Error Rate</title>
		<link>http://papersincomputerscience.org/2009/07/06/the-soft-heap-an-approximate-priority-queue-with-optimal-error-rate/</link>
		<comments>http://papersincomputerscience.org/2009/07/06/the-soft-heap-an-approximate-priority-queue-with-optimal-error-rate/#comments</comments>
		<pubDate>Mon, 06 Jul 2009 04:39:18 +0000</pubDate>
		<dc:creator>dcoetzee</dc:creator>
				<category><![CDATA[Algorithms and optimization]]></category>

		<guid isPermaLink="false">http://papersincomputerscience.org/?p=152</guid>
		<description><![CDATA[This 2000 paper by Bernard Chazelle introduced a data structure called a soft heap, which is a heap data structure that can perform any operation in amortized constant time. In exchange for this ideal performance bound, a new limitation is imposed [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Citation</strong>: Chazelle, B. The soft heap: an approximate priority queue with optimal error rate. <em>Journal of the ACM</em> 47, 6 (Nov. 2000), 1012-1027. (<a href="http://www.cs.princeton.edu/~chazelle/pubs/sheap.pdf">PDF</a>)</p>
<p><strong>Abstract</strong>: A simple variant of a priority queue, called a <em>soft heap</em>, is introduced. The data structure supports the usual operations: insert, delete, meld, and findmin. Its novelty is to beat the logarithmic bound on the complexity of a heap in a comparison-based model. To break this information-theoretic barrier, the entropy of the data structure is reduced by artificially raising the values of certain keys. Given any mixed sequence of <em>n</em> operations, a soft heap with error rate ε (for any 0 &lt; ε ≤ 1/2) ensures that, at any time, at most ε<em>n</em> of its items have their keys raised. The amortized complexity of each operation is constant, except for insert, which takes O(log 1/ε) time. The soft heap is optimal for any value of ε in a comparison-based model. The data structure is purely pointer-based. No arrays are used and no numeric assumptions are made on the keys. The main idea behind the soft heap is to move items across the data structure not individually, as is customary, but in groups, in a data-structuring equivalent of &#8220;car pooling.&#8221; Keys must be raised as a result, in order to preserve the heap ordering of the data structure. The soft heap can be used to compute exact or approximate medians and percentiles optimally. It is also useful for approximate sorting and for computing minimum spanning trees of general graphs.</p>
<p><strong>Discussion</strong>: This 2000 paper by Bernard Chazelle introduced a data structure called a <a href="http://en.wikipedia.org/wiki/Soft_heap">soft heap</a>, which is a <a href="http://en.wikipedia.org/wiki/Heap_(data_structure)">heap data structure</a> that can perform any operation in <a href="http://en.wikipedia.org/wiki/Amortized_analysis">amortized</a> constant time. In exchange for this ideal performance bound, a new limitation is imposed: at any time, any element in the heap may have its key increased at the discretion of the data structure; such an element is said to be <em>corrupted</em>. There is an upper limit of ε<em>n</em> corrupted elements, where <em>n</em> is the number of operations completed so far and ε is a fixed parameter. For example, if ε = 0.01, and we insert 1000 elements into the heap, at most 10 elements will be corrupted.</p>
<p><a href="http://en.wikipedia.org/wiki/Heap_(data_structure)">Heaps</a> are one of the fundamental data structures studied in computer science. Like dictionaries or hash tables, they store a set of elements, each with an associated key. Heaps support four fundamental operations:</p>
<ul>
<li>insert: Insert a new element, given the element and its key.</li>
<li>extract minimum: Get the element with minimum key and delete it from the heap.</li>
<li>decrease key: Update an existing element&#8217;s key to a smaller value.</li>
<li>merge or meld: Combine two existing heaps together into one large heap.</li>
</ul>
<p>The most obvious application of these operations is a <a href="http://en.wikipedia.org/wiki/Priority_queue">priority queue</a>: if the keys indicate priority (with smaller values indicating higher priority), then <em>insert</em> adds a new task to the queue, and <em>extract minimum</em> retrieves the current highest-priority task. Heaps also form the basis of the <a href="http://en.wikipedia.org/wiki/Heapsort">heapsort</a> sorting algorithm, which in its simplest form inserts each element into a heap one-by-one and then uses extract minimum repeatedly to get the result list in order.</p>
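<p>The simple form of heapsort described above can be sketched in a few lines. This is only an illustration: it assumes Python and uses the standard library&#8217;s <code>heapq</code> module (an ordinary binary heap) as the heap, and the function name <code>heapsort</code> is chosen here for the example.</p>

```python
import heapq


def heapsort(items):
    """Sort by inserting every item into a heap, then extracting the minimum repeatedly."""
    heap = []
    for item in items:
        heapq.heappush(heap, item)          # insert
    result = []
    while heap:
        result.append(heapq.heappop(heap))  # extract minimum
    return result
```

<p>Each of the <em>n</em> inserts and <em>n</em> extractions costs O(log <em>n</em>) with this heap, giving the familiar O(<em>n</em> log <em>n</em>) total.</p>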
<p>The sorting application demonstrates an important lower bound regarding heaps: it is impossible to design a heap data structure that can do both the <em>insert</em> and <em>extract minimum</em> operations in amortized constant time, for if this were possible we would be able to do a <a href="http://en.wikipedia.org/wiki/Comparison_sort">comparison sort</a> in linear (O(<em>n</em>)) time; but any such sort requires Ω(<em>n</em> log <em>n</em>) time. Indeed, the best known heap data structure, the <a href="http://en.wikipedia.org/wiki/Fibonacci_heap">Fibonacci heap</a>, developed in 1984, can do all these operations in amortized constant time except for <em>extract minimum</em>, which requires Θ(log <em>n</em>) amortized time.</p>
<p>But what if there were some way to modify the heap data structure that let us break through this lower bound and do all operations in amortized constant time without contradicting the above result? This is exactly what the soft heap does. A soft heap is like a normal heap, but with the troubling property that the key you assign to an element when you insert it may not stay that way; more formally, the data structure may choose at its discretion to increase the key of any element at any time. Any element that has had its key increased is called <em>corrupted</em>. It&#8217;s easy to see how this might be helpful in reducing complexity; if the data structure just decided to increase the keys of all elements to infinity, it would be trivial to do all operations in constant time.</p>
<p>But to make soft heaps actually <em>useful</em> and not just <em>fast</em>, we need an additional constraint. This constraint is as follows: each soft heap has a fixed parameter ε called the <em>error rate</em>. Roughly speaking, this controls what proportion of the elements may be corrupted at any given time. More formally, it specifies that if we start with the empty heap and do <em>n</em> operations, the resulting heap will have at most ε<em>n</em> corrupted elements. If all <em>n</em> operations are inserts, then indeed ε will be an upper bound on the proportion of corrupted elements. If <em>extract minimum</em> operations are included, the relationship is no longer so clear &#8211; it&#8217;s possible that <em>all</em> elements remaining in the heap are corrupted.</p>
<p>As Chazelle so elegantly says, &#8220;despite this apparent weakness, the soft heap is optimal and—perhaps even more surprising—useful.&#8221; Perhaps the simplest nontrivial application of soft heaps is <a href="http://en.wikipedia.org/wiki/Selection_algorithm">finding the median</a> of a list of values in linear time. There is a classical algorithm for doing this, but the implementation and analysis of the algorithm based on the soft heap is much simpler. All we have to do is set ε = 1/3, then insert all <em>n</em> elements into the heap, and then do <em>extract minimum</em> <em>n</em>/3 times. Depending on exactly which elements got corrupted, the largest element extracted (judged by its original key), the <em>pivot</em>, will be somewhere between the (<em>n</em>/3)th and (2<em>n</em>/3)th smallest element in the list. We then partition the list into the sublist of elements less than or equal to the pivot, and those greater than the pivot, and recursively invoke this procedure on the one containing the median. This is similar to the <a href="http://en.wikipedia.org/wiki/Quicksort">quicksort</a> algorithm, but only recursively processing one of the sublists instead of both, and it runs in worst-case linear time.</p>
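<p>Here is a rough sketch of the partition-and-recurse loop, written in Python. One loud caveat: it does <em>not</em> implement a soft heap. The hypothetical helper <code>approximate_pivot</code> uses an ordinary exact heap from <code>heapq</code> as a stand-in, so the pivot step costs more than linear time and the pivot is always exactly the (<em>n</em>/3)th smallest element (with an exact heap, the largest element extracted is also the last one extracted). The sketch also splits out elements equal to the pivot, an assumption added here so that lists with duplicates always shrink.</p>

```python
import heapq


def approximate_pivot(values):
    """Stand-in for the soft-heap step: insert everything, extract about n/3
    times, and return the largest element extracted. An exact heap is used
    here, NOT a soft heap, so this step is not linear time."""
    heap = list(values)
    heapq.heapify(heap)
    count = max(1, len(values) // 3)
    extracted = [heapq.heappop(heap) for _ in range(count)]
    return max(extracted)


def select(values, k):
    """Return the k-th smallest element of values (k = 0 is the minimum)."""
    values = list(values)
    while True:
        pivot = approximate_pivot(values)
        less = [v for v in values if v < pivot]
        equal = [v for v in values if v == pivot]
        greater = [v for v in values if v > pivot]
        if k < len(less):
            values = less                      # answer lies below the pivot
        elif k < len(less) + len(equal):
            return pivot                       # the pivot itself is the answer
        else:
            k -= len(less) + len(equal)        # answer lies above the pivot
            values = greater


def median(values):
    """Lower median of a non-empty list."""
    return select(values, (len(values) - 1) // 2)
```

<p>Because the pivot is guaranteed to land in the middle third, each round discards at least a constant fraction of the list; with a real soft heap supplying the pivot in linear time, the total work is a geometric series that sums to O(<em>n</em>).</p>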
<p>Historically, soft heaps were invented with a specific application in mind: computing <a href="http://en.wikipedia.org/wiki/Minimum_spanning_tree">minimum spanning trees</a>. Chazelle has achieved some fame for discovering the asymptotically best known algorithms for several problems, and this is one of them: his soft-heap based algorithm can compute the minimum spanning tree of a graph with <em>n</em> vertices and <em>m</em> edges in O(<em>m</em> α(<em>m</em>, <em>n</em>)) time, where α is the inverse <a href="http://en.wikipedia.org/wiki/Ackermann_function">Ackermann function</a>, a very slowly-growing function that is less than 5 for all remotely practical input sizes. I don&#8217;t discuss the details of this algorithm here because they&#8217;re pretty complex, but you can see Chazelle&#8217;s paper (Bernard Chazelle. A Minimum Spanning Tree Algorithm with Inverse-Ackermann Type Complexity. JACM 47(6):1028–1047, 2000, <a href="http://www.cs.princeton.edu/~chazelle/pubs/mst.pdf">PDF</a>). In 2002, Pettie and Ramachandran presented a minimum spanning tree algorithm based on soft heaps that is known to be asymptotically optimal.</p>
<p>So how does it work? In Chazelle&#8217;s original implementation, the soft heap was a simple variant on the <a href="http://en.wikipedia.org/wiki/Binomial_heap">binomial heap</a>. Instead of storing a single element at each node, it would store a list of elements, associated with a single key that is an upper bound on the original keys of all the elements. This is the source of the &#8220;corruption&#8221; &#8211; any values in this list whose original keys are smaller than the common key are corrupted. Such &#8220;multivalued&#8221; nodes could only appear near the top levels of the trees making up the heap, which limits how many corrupted elements there can be. These lists are formed during the soft heap&#8217;s special &#8220;sift-up&#8221; routine, which under certain conditions will run twice on the same node, causing it to merge (concatenate) element lists with one of its children. There are a number of subtleties to this process, however, and to explicate them Chazelle gives complete C source code for the data structure in a <a href="http://en.wikipedia.org/wiki/Literate_programming">literate programming</a> style.</p>
<p>Recently a simpler implementation of soft heaps by Kaplan and Zwick has appeared (<a href="http://www.siam.org/proceedings/soda/2009/SODA09_053_kaplanh.pdf">A simpler implementation and analysis of Chazelle&#8217;s soft heaps</a>. In <em>Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms</em> (Jan 4-6, 2009).) They use the original soft heap concept of storing a list of items at each node, and of occasionally running sift twice on the same node, but rather than being based on binomial heaps, this implementation is based on a set of heap-ordered binary trees. Also, their condition for running sift twice on a node is somewhat less arbitrary: each node <em>x</em> has an associated &#8220;size&#8221; value that increases exponentially with rank, and the size |list(<em>x</em>)| of the element list at any internal node <em>x</em> must always satisfy size(<em>x</em>) ≤ |list(<em>x</em>)| ≤ 3size(<em>x</em>).</p>
<p>How do you know if soft heaps might be useful for something you&#8217;re doing? Due to their peculiar behavior, they tend to be most useful for inventing new algorithms with better worst-case performance than existing algorithms, particularly in situations where the &#8220;approximate rank&#8221; &#8211; or approximate position in sorted order &#8211; of an element in a list is useful. They tend to be applied in places where ordinary heap data structures have traditionally been applied, like graph algorithms and sorting. Beyond this, though, there may still remain a rich set of untapped applications of soft heaps that employ them in novel ways.</p>
<p><em>The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://papersincomputerscience.org/2009/07/06/the-soft-heap-an-approximate-priority-queue-with-optimal-error-rate/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Pattern Classification</title>
		<link>http://papersincomputerscience.org/2009/06/17/pattern-classification/</link>
		<comments>http://papersincomputerscience.org/2009/06/17/pattern-classification/#comments</comments>
		<pubDate>Wed, 17 Jun 2009 05:45:35 +0000</pubDate>
		<dc:creator>dcoetzee</dc:creator>
				<category><![CDATA[Artificial intelligence]]></category>

		<guid isPermaLink="false">http://papersincomputerscience.org/?p=146</guid>
		<description><![CDATA[This classic machine learning textbook discusses a variety of well-studied application-independent methods of classifying inputs into one of several distinct classes, as well as some of the theory behind them. The book's organization is constructed around the amount of information available to the classifier, beginning with scenarios where the actual distribution of features of classes is known, and finishing with scenarios where even the classes themselves must be inferred.]]></description>
			<content:encoded><![CDATA[<p><strong>Citation</strong>: R.O. Duda, P.E. Hart, and D.G. Stork. <em>Pattern Classification</em>, 2nd edition. New York: John Wiley &amp; Sons, 654 pages, 2001, ISBN: 0-471-05669-3.</p>
<p><strong>Discussion</strong>: This classic machine learning textbook discusses a variety of well-studied application-independent methods of classifying inputs into one of several distinct classes, as well as some of the theory behind them. The book&#8217;s organization is constructed around the amount of information available to the classifier, beginning with scenarios where the actual distribution of features of classes is known, and finishing with scenarios where even the classes themselves must be inferred. Among the many techniques considered are <a href="http://en.wikipedia.org/wiki/Decision_tree">decision trees</a>, <a href="http://en.wikipedia.org/wiki/Artificial_neural_network">neural networks</a>, <a href="http://en.wikipedia.org/wiki/Genetic_algorithm">genetic algorithms</a>, <a href="http://en.wikipedia.org/wiki/Boltzmann_machine">Boltzmann learning</a>, <a href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm">nearest neighbor classification</a>, <a href="http://en.wikipedia.org/wiki/K-means_algorithm"><em>k</em>-means clustering</a>, <a href="http://en.wikipedia.org/wiki/Principal_components_analysis">principal</a> and <a href="http://en.wikipedia.org/wiki/Independent_component_analysis">independent component analysis</a>, <a href="http://en.wikipedia.org/wiki/Formal_grammar">formal grammars</a>, <a href="http://en.wikipedia.org/wiki/Linear_discriminant_analysis">linear discriminants</a>, <a href="http://en.wikipedia.org/wiki/Resampling_(statistics)">resampling</a> methods like <a href="http://en.wikipedia.org/wiki/Bootstrap_aggregating">bagging</a> and <a href="http://en.wikipedia.org/wiki/Boosting">boosting</a>, and <a href="http://en.wikipedia.org/wiki/Maximum_likelihood">maximum-likelihood estimation</a>; most of these are treated in great detail.</p>
<p>Perhaps the best demonstration of the level of detail that the book goes into is its discussion of neural networks in chapter 6. Whereas any general book on artificial intelligence is bound to include basic discussion of neural networks and their dominant learning algorithm, <a href="http://en.wikipedia.org/wiki/Backpropagation">backpropagation</a>, the discussion here is denser and covers a broad range of extensions and relations to other methods. First of all, it places neural networks in their historical context by preceding the chapter with a thorough chapter on linear discriminants, a powerful tool in their own right and basically equivalent to neural networks with no hidden layers. Unlike more general references, which tend to merely describe the most common implementation of neural networks without any justification, every design choice going into the neural network is derived and justified and alternative choices are discussed, from the update rules to the activation function to the number of layers and hidden units to the initial weights to the training protocol to the stopping condition. Perhaps most important from an engineer&#8217;s point of view is section 6.8, which discusses a variety of extensions needed to make neural networks efficient and accurate in practice, such as input standardization, momentum, weight decay, hints, use of second-order information, and so on. Each of these occupies no more than a page, but still effectively conveys the essence of the technique. Besides these essentials, the chapter also informally proves the power of three-level networks, and incorporates a variety of interesting visualizations of network structure, network weights, error surfaces, convergence, and so on, which help a lot in developing an intuition for the behavior of neural networks. In short, this chapter contains about as much information on neural networks as I would normally expect to get from an entire book on the subject.</p>
<p>The book isn&#8217;t just a bunch of survey papers stapled together though; it has a number of things that tie it together. Most important is its organization of progressively less well-specified problems. Chapter 2 begins with Bayesian decision theory and describes how to perform classification optimally, given the complete probability distribution of each feature for each class. This also brings in the first major theoretical result, the lower bound on classification error given by the <em>Bayes error</em>. From there it progresses to maximum-likelihood estimation, in which the distribution type (e.g. Gaussian) is known but not its parameters (such as mean and standard deviation), and these parameters must be determined from manually-classified example data. After this come simple nonparametric techniques like nearest neighbor estimation and linear discriminants that make no assumptions about the distribution but only work well when the classification function is simple and the dimension isn&#8217;t too high. Neural networks can describe more complex functions, but still get tripped up by complex classification functions with many local minima and maxima. The next chapter addresses stochastic methods like simulated annealing and genetic algorithms that cope well with complex classification functions, at the expense of much greater training time. Skipping over a couple of chapters, the final one deals with unsupervised classification, where the example data have not been assigned to classes and the classes are not even known and must be inferred; a typical technique in this scenario is clustering. This organization is effective at emphasizing the point that, to be effective, a classifier needs to take advantage of as much information as the application domain makes available.</p>
<p>Chapters 8 and 9 don&#8217;t fall well into the natural progression. Most of the chapters deal with numeric features, like width or age, leaving discrete features such as color or gender, and the classifiers that process them such as decision trees and formal grammars, to chapter 8. Chapter 9, entitled &#8220;algorithm-independent machine learning&#8221;, is based around theory and metalearning techniques that can be used to enhance all classifiers. It begins with the fundamental <a href="http://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization">No Free Lunch theorem</a>, which essentially states that no classifier performs better in all cases than any other classifier, including the naive classifier that merely guesses at random. It also discusses resampling methods that aim to improve unstable classifiers or combine different types of classifiers, and describes how to compare one classifier with another, which is essential when espousing the benefits of any new classifier.</p>
<p>The progressive organization, while useful, also necessarily implies that some essential basic topics, such as decision trees (chapter 8), resampling methods (chapter 9), and <em>k</em>-means clustering (chapter 10), are not introduced until long after more specialized topics have been visited, such as Bayesian belief networks (chapter 2) and hidden Markov models (chapter 3). As a consequence it probably makes more sense to just read the first few sections of each chapter first, rather than reading the book straight through.</p>
<p>Another frustration with this book is that there are a number of forward references: in particular, cross-validation is mentioned several times before it&#8217;s finally explained in chapter 9, and probabilistic neural networks are discussed in chapter 4 in the context of Parzen windows, before neural networks themselves are introduced in chapter 6. The authors also assume quite a bit of math background in areas that aren&#8217;t covered in-depth in most undergraduate programs, like matrix calculus and statistics, with only a terse appendix for a refresher on this material. It would have been helpful to demonstrate some of these techniques in the one-dimensional case, where the familiar intuitions of real number calculus apply.</p>
<p>While dense and thorough, there are necessarily some things the book doesn&#8217;t cover, in order to fit into 600 pages. Some theorems are not proved in the text, particularly the No Free Lunch Theorem, but also many of the convergence results and probability bounds. This is generally not a big deal since the book is intended more for those looking to apply machine learning to an application domain, rather than those looking to study theoretical machine learning and conceive new techniques &#8211; and its references are thorough enough that this information can be located if needed. On the other hand, it also doesn&#8217;t contain any real code or references to real machine learning libraries &#8211; it&#8217;s not a programmer&#8217;s handbook. Finally, the second edition in particular concentrates heavily on application-independent models, and does not delve into the intricacies of the immense number of applications of machine learning; problems like determining the best features to use and how to compute them are discussed but motivated only by toy examples and not real-world problems like speech recognition, image processing, or expert systems. It helps when reading it to have an application domain in mind to mentally plug into the models.</p>
<p>In short, if you&#8217;re looking for a dense and application-independent survey of major classification algorithms in machine learning, this is the book to get you started, and will take you a few steps beyond anything you might have studied in an AI intro class. But I would precede it with some reading on matrix calculus, and follow it with some reading on a specific application area of your choice, to help make the concepts more concrete.</p>
<p><em>The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://papersincomputerscience.org/2009/06/17/pattern-classification/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Learning from hints in neural networks</title>
		<link>http://papersincomputerscience.org/2009/05/24/learning-from-hints-in-neural-networks/</link>
		<comments>http://papersincomputerscience.org/2009/05/24/learning-from-hints-in-neural-networks/#comments</comments>
		<pubDate>Sun, 24 May 2009 14:52:03 +0000</pubDate>
		<dc:creator>dcoetzee</dc:creator>
				<category><![CDATA[Artificial intelligence]]></category>

		<guid isPermaLink="false">http://papersincomputerscience.org/?p=142</guid>
		<description><![CDATA[This short 1990 machine learning paper introduced a technique for learning with hints in a neural network; that is, it allows a neural network to learn a function for which we have some limited information about its properties.]]></description>
			<content:encoded><![CDATA[<p><strong>Citation</strong>: Abu-Mostafa, Y. S. 1990. Learning from hints in neural networks. <em>Journal of Complexity,</em> 6, 2 (Jun. 1990), 192-198. (<a href="http://www.work.caltech.edu/pub/Abu-Mostafa1990hints.pdf">PDF</a>)</p>
<p><strong>Abstract</strong>: Learning from examples is the process of taking input-output examples of an unknown function <em>f</em> and infering an implementation of <em>f</em>. Learning from hints allows for general information about <em>f</em> to be used instead of just input-output examples. We introduce a method for incorporating any invariance hint about <em>f</em> in any descent method for learning from examples. We also show that learning in a neural network remains NP-complete with a certain, biologically plausible, hint about the network. We discuss the information value and the complexity value of hints.</p>
<p><strong>Discussion</strong>: This short 1990 machine learning paper introduced a technique for <em>learning with hints</em> in a <a href="http://en.wikipedia.org/wiki/Artificial_neural_network">neural network</a>; that is, it allows a neural network to learn a function for which we have some limited information about its properties.</p>
<p>In machine learning, neural networks are a simple way of representing functions that is sufficiently powerful to approximate any continuous function. They consist of a set of at least three layers of processing nodes connected by edges labelled with weights. All the inputs into each node are multiplied by the weights on their edges and summed, then a sigmoid function (a particular strictly increasing function bounded between -1 and 1) is applied to the sum to produce the output. By adjusting the weights, we can gradually modify the function being computed.</p>
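<p>To make the node computation concrete, here is a minimal Python sketch of a single node and of a network with one hidden layer. The function names and the optional <code>bias</code> term are assumptions made for this illustration, and <code>math.tanh</code> is used as one common choice of sigmoid bounded between -1 and 1.</p>

```python
import math


def node_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, passed through a sigmoid (tanh here)."""
    total = bias + sum(x * w for x, w in zip(inputs, weights))
    return math.tanh(total)


def forward(inputs, hidden_weights, output_weights):
    """Input layer -> one hidden layer -> a single output node.

    hidden_weights holds one weight list per hidden node;
    output_weights holds one weight per hidden node.
    """
    hidden = [node_output(inputs, weights) for weights in hidden_weights]
    return node_output(hidden, output_weights)
```

<p>For instance, <code>forward([0.5, -0.2], [[0.1, 0.4], [-0.3, 0.2]], [0.7, -0.5])</code> evaluates a network with two inputs, two hidden nodes, and one output; changing any weight changes the function being computed.</p>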
<p>Neural networks are particularly useful in classification problems; for example, one might construct a neural network that takes as input an image and outputs 1 if it looks like a hamburger, or else -1. It would be really difficult to manually write a program that does this. Instead, neural networks can be <em>trained</em> with a set of inputs and their associated outputs; the weights are adjusted based on the examples using the <a href="http://en.wikipedia.org/wiki/Backpropagation">backpropagation</a> algorithm until the network computes a function close to the actual one.</p>
<p>For our purposes, the most important thing to note is that backpropagation works by feeding the training input to the network and determining how far the output is from the desired training output, called the <em>error</em>. It then adjusts the weights in a way that decreases that error. It does this repeatedly until it reaches a stopping point.</p>
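<p>The error-driven update can be sketched for the simplest possible case, a single tanh node with no hidden layer. This is plain gradient descent on the squared error, not full backpropagation (which applies the same kind of step to every layer via the chain rule), and the names, learning rate, and epoch count are assumptions picked for the example.</p>

```python
import math


def unit_output(inputs, weights):
    """Output of a single tanh node."""
    return math.tanh(sum(x * w for x, w in zip(inputs, weights)))


def total_error(examples, weights):
    """Sum over examples of half the squared distance to the desired output."""
    return sum(0.5 * (unit_output(x, weights) - y) ** 2 for x, y in examples)


def train(examples, weights, learning_rate=0.1, epochs=200):
    """Repeatedly nudge the weights in the direction that decreases the error."""
    weights = list(weights)
    for _ in range(epochs):
        for inputs, target in examples:
            output = unit_output(inputs, weights)
            # Derivative of 0.5 * (output - target)^2 with respect to the
            # weighted sum, using tanh'(s) = 1 - tanh(s)^2.
            delta = (output - target) * (1.0 - output * output)
            weights = [w - learning_rate * delta * x
                       for w, x in zip(weights, inputs)]
    return weights


examples = [([1.0, 0.0], 0.5), ([0.0, 1.0], -0.5)]
trained = train(examples, [0.0, 0.0])
```

<p>Here the stopping point is simply a fixed number of passes over the examples; comparing <code>total_error(examples, trained)</code> with <code>total_error(examples, [0.0, 0.0])</code> shows how far the error has dropped.</p>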
<p>Backpropagation is efficient and general but can run into two important related problems:</p>
<ul>
<li><em>Insufficient data</em>: There may not be enough training data to learn weights that generalize well to new inputs.</li>
<li><em>Overfitting</em>: The resulting function may end up being oversensitive to parameters of the training examples that are actually irrelevant.</li>
</ul>
<p>For example, say you train a hamburger recognizer on many images off the Internet, and then I give it a picture of an upside-down hamburger. Because it&#8217;s never seen an upside-down hamburger before, it&#8217;s quite likely to claim that it&#8217;s not a hamburger, despite the fact that intuitively we know that orientation does not affect an image&#8217;s hamburgerness. Likewise it may fail to recognize an image that is smaller or larger than those in its training set, or where lighting is unusual. Properties like these, under which the function&#8217;s output should stay the same when the input is transformed, are called <em>invariants</em>. Invariants cannot be directly expressed as input-output examples; they are larger restrictions on the scope of functions under consideration.</p>
<p>The most obvious way to deal with invariants is to expand your training set &#8211; turn all your training images upside-down, and add them to your training set. But when we begin to consider more and more combinations of invariants, this approach can rapidly grow infeasible. Not only does the training set become large, but if there are not enough inputs in the original training set to teach the invariant, then it will not be properly learned.</p>
<p>The key observation of Abu-Mostafa&#8217;s work is that we don&#8217;t need to rely entirely on training examples. Whereas training examples specify the constraint that <em>f</em> (<em>x</em>) = <em>y</em>, where <em>f</em> is the function we&#8217;re learning and <em>x</em> and <em>y</em> are the input/output, for invariants it&#8217;s more useful to deal with <em>equality examples</em>, which are pairs where <em>f</em> (<em>x</em><sub>1</sub>) = <em>f</em> (<em>x</em><sub>2</sub>). The backpropagation algorithm can be easily modified to accommodate this: instead of computing the error based on the distance between the actual output and desired output, we compute it as the distance between the two outputs produced by the two examples. There&#8217;s no requirement to know what the value of <em>f</em> (<em>x</em><sub>1</sub>) or <em>f</em> (<em>x</em><sub>2</sub>) is. Applying this to our example, we can take any image, even if we don&#8217;t know whether or not it looks like a hamburger, rotate it, and use the two images to train our network. We could even generate random images and rotate these.</p>
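<p>As a toy illustration of training on equality examples, the Python sketch below uses a single tanh node and a made-up invariance hint: swapping the two inputs should not change the output. Everything here (the names, the swap hint, the starting weights, the learning rate) is an assumption chosen for the example rather than something taken from the paper; the only idea borrowed is that the error is the distance between the two outputs of a pair.</p>

```python
import math


def unit_output(inputs, weights):
    """Output of a single tanh node."""
    return math.tanh(sum(x * w for x, w in zip(inputs, weights)))


def hint_error(pairs, weights):
    """Half the squared gap between the outputs of each equality pair."""
    return sum(0.5 * (unit_output(a, weights) - unit_output(b, weights)) ** 2
               for a, b in pairs)


def train_on_hint(pairs, weights, learning_rate=0.1, epochs=500):
    """Gradient descent on the hint error; no target outputs are needed."""
    weights = list(weights)
    for _ in range(epochs):
        for first, second in pairs:
            out1 = unit_output(first, weights)
            out2 = unit_output(second, weights)
            gap = out1 - out2
            slope1 = 1.0 - out1 * out1   # tanh derivative at the first input
            slope2 = 1.0 - out2 * out2   # tanh derivative at the second input
            weights = [w - learning_rate * gap * (slope1 * x1 - slope2 * x2)
                       for w, x1, x2 in zip(weights, first, second)]
    return weights


# Each pair is an input together with its swapped version.
swap_pairs = [([1.0, 0.0], [0.0, 1.0]),
              ([2.0, 1.0], [1.0, 2.0]),
              ([0.5, -1.0], [-1.0, 0.5])]
start = [1.0, -0.5]
trained = train_on_hint(swap_pairs, start)
```

<p>For this node, being invariant to swapping means the two weights are equal, and training on the pairs pulls them together even though no pair says what the output should actually be. In practice such hint updates would be interleaved with ordinary example updates, since the hint alone does not pin down the function.</p>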
<p>Another way of framing this is that we want to encourage the network to learn <em>new features</em> that describe the input but in some way summarize or interpret the original input features. These new features can then be leveraged by the network to guide the final output. To take an example from the book <em>Pattern Classification</em> (Richard Duda, Peter Hart, David Stork, section 6.8.12): if the input is a soundwave, and we want to determine what speech phoneme it represents, a useful intermediate feature would be deciding whether it&#8217;s a vowel or a consonant. To encourage the network to learn this feature, we can add a new output node that is set to 1 or -1 depending on whether the input soundwave is a vowel or consonant. By incorporating an initial training phase that modifies the network weights to predict this output well, we now have a starting network that can already distinguish vowels and consonants, which is a big help in making finer subclassifications. Without the hints, the network might have learned this on its own, but the more domain-specific information we can give it, the quicker it will train and the better it will generalize. In the case of our original example, these intermediate features would be properties of the original image that are insensitive to the invariants like rotation.</p>
<p>For learning with hints, neural networks with more than three layers of nodes are often helpful as well. The idea is that the first layer can convert the input features to the new intermediate features of interest, and then normal learning can be applied to these. Where the invariants are very simple, they can even be applied to the inputs before training (and prediction), placing them in canonical form. For example, to help deal with brightness variation in images, it helps to scale all images to the same brightness range.</p>
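<p>As an illustration of placing inputs in canonical form, here is a small Python sketch of brightness normalization. The function name and the representation of an image as a flat list of pixel intensities are assumptions made for the example.</p>

```python
def normalize_brightness(pixels):
    """Rescale pixel intensities linearly so the darkest maps to 0.0 and the brightest to 1.0."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # A uniformly colored image carries no contrast; map it to all zeros.
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]

dim = [10, 20, 30]
bright = [110, 130, 150]   # the same image with more contrast and a brighter exposure
print(normalize_brightness(dim) == normalize_brightness(bright))
```

<p>Applying the same function before both training and prediction means the network only ever sees images in one brightness range, so it does not have to learn that invariant at all.</p>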
<p>My apologies for the long delay in this post &#8211; I&#8217;m currently engaged in reading the book <em>Pattern Classification</em>, and intend to follow up here with a discussion of it when I&#8217;m done.</p>
<p><em>The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://papersincomputerscience.org/2009/05/24/learning-from-hints-in-neural-networks/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Bagging Predictors</title>
		<link>http://papersincomputerscience.org/2009/05/07/bagging-predictors/</link>
		<comments>http://papersincomputerscience.org/2009/05/07/bagging-predictors/#comments</comments>
		<pubDate>Thu, 07 May 2009 10:38:45 +0000</pubDate>
		<dc:creator>dcoetzee</dc:creator>
				<category><![CDATA[Artificial intelligence]]></category>
		<category><![CDATA[bagging]]></category>
		<category><![CDATA[machine learning]]></category>

		<guid isPermaLink="false">http://papersincomputerscience.org/?p=140</guid>
		<description><![CDATA[This 1996 machine learning paper described a very simple mechanism called bagging (short for "bootstrap aggregating") for improving the accuracy of predictors by averaging the results of a number of independent predictors trained over samples from the same training set. It is one of a family of resampling methods.]]></description>
			<content:encoded><![CDATA[<p><strong>Citation</strong>: Leo Breiman. Bagging predictors. <em>Machine Learning,</em> 24, 2 (Aug. 1996), 123-140.</p>
<p><strong>Abstract</strong>: Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.</p>
<p><strong>Discussion</strong>: This 1996 <a href="http://en.wikipedia.org/wiki/Machine_learning">machine learning</a> paper described a very simple mechanism called <em>bagging</em> (short for &#8220;<a href="http://en.wikipedia.org/wiki/Bootstrap_aggregating"><em>bootstrap aggregating</em></a>&#8221;) for improving the accuracy of predictors by averaging the results of a number of independent predictors trained over samples from the same training set.</p>
<p>Suppose we want to predict, based on a number of medical tests, whether or not a patient has cancer. A typical way to do this is to gather some cases from a medical database, then use them to train a predictor model, such as a <a href="http://en.wikipedia.org/wiki/Decision_tree">decision tree</a> or <a href="http://en.wikipedia.org/wiki/Neural_network">neural network</a>.</p>
<p>The issue with this straightforward approach is that many predictors used in machine learning, including decision trees and neural networks, are <em>unstable</em>: training them on a slightly different data set can produce significantly different results. For example, if we split our training set in half, and use each half to build a decision tree, those two decision trees are likely to be quite different and yield different predictions. This is a result of <a href="http://en.wikipedia.org/wiki/Overfitting"><em>overfitting</em></a>, the phenomenon where a model fits itself too well to the training data, and not well enough to the complete population of past and future cases underlying the sample.</p>
<p>To overcome this, we introduce ideas from a statistical technique called <a href="http://en.wikipedia.org/wiki/Bootstrapping_(statistics)"><em>bootstrapping</em></a>. To motivate this concept, suppose you have a random sample of 1000 test scores from a pool of 1 million, and you want to estimate what the median test score is. Because we can&#8217;t know this exactly, we instead seek a confidence interval that has a 98% chance of containing the median. Finding a confidence interval for the <em>mean</em> test score is easy using standard techniques, because the sample mean and population mean are analytically closely related, but the same is not true of the median.</p>
<p>When analysis fails, we take a more computational approach. The idea is to use the sample itself as an approximate model of the underlying distribution: we can replicate each score 1000 times to obtain a pool of 1 million scores that will statistically resemble the real 1 million scores. We then take (say) 200 independent samples of size 1000 from this pool, and compute their medians. Finally, we take the 1st and 99th percentiles of these median scores to get our confidence interval. In practice, rather than duplicating scores many times, it&#8217;s more typical to simply sample with replacement, so that each score can be chosen multiple times. Provided the original sample is a good empirical approximation of the complete population, the result is a very good estimate of the actual median.</p>
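<p>The procedure just described can be sketched in a few lines of Python. This is only an illustration: the function name, the default parameters, and the simple index-based percentile rule are assumptions, and the synthetic test scores are made up.</p>

```python
import random
import statistics

def bootstrap_median_interval(sample, n_resamples=200, lower_pct=1, upper_pct=99, seed=0):
    """Estimate a confidence interval for the median by resampling with replacement."""
    rng = random.Random(seed)
    medians = []
    for _ in range(n_resamples):
        # Draw a resample of the same size as the original, with replacement.
        resample = rng.choices(sample, k=len(sample))
        medians.append(statistics.median(resample))
    medians.sort()
    # Pick the entries nearest the requested percentiles of the sorted medians.
    lo_idx = round(lower_pct / 100 * (n_resamples - 1))
    hi_idx = round(upper_pct / 100 * (n_resamples - 1))
    return medians[lo_idx], medians[hi_idx]

rng = random.Random(1)
scores = [rng.gauss(70, 10) for _ in range(1000)]   # synthetic sample of 1000 test scores
low, high = bootstrap_median_interval(scores)
print(low, statistics.median(scores), high)
```

<p>Nothing here is specific to the median: replacing <code>statistics.median</code> with any other statistic gives an interval for that statistic instead.</p>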
<p>What does this have to do with machine learning? Well, suppose instead of test scores we have medical test results, and instead of the median, we&#8217;re estimating whether a patient has cancer. We draw (with replacement) bootstrap samples from the original data, and use each one as training data to construct a predictor. Then, for new cases we take a majority vote among the predictions to determine the overall prediction. Provided that the data is a good empirical approximation of the complete population of past and future cases, the result will be a good prediction that resists overfitting.</p>
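<p>Here is a minimal Python sketch of bagging for classification. It is not code from the paper: the base learner is a one-feature threshold classifier (a depth-one decision tree) chosen only to keep the example short, and the function names and toy data are assumptions.</p>

```python
import random
from collections import Counter

def train_stump(data):
    """Fit a one-feature threshold classifier on a list of (x, label) pairs, label in {0, 1}."""
    candidates = [(threshold, above) for threshold, _ in data for above in (0, 1)]

    def errors(candidate):
        threshold, above = candidate
        # Count training cases that this threshold and orientation misclassify.
        return sum((above if x >= threshold else 1 - above) != y for x, y in data)

    threshold, above = min(candidates, key=errors)
    return lambda x: above if x >= threshold else 1 - above

def bagged_predictor(data, n_models=25, seed=0):
    """Train one model per bootstrap sample and combine them by plurality vote."""
    rng = random.Random(seed)
    models = [train_stump(rng.choices(data, k=len(data))) for _ in range(n_models)]

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict

rng = random.Random(2)
data = [(x, 1 if x >= 0.5 else 0) for x in (rng.random() for _ in range(100))]
predict = bagged_predictor(data)
print(predict(0.9), predict(0.1))
```

<p>For a numerical outcome, the vote in <code>predict</code> would be replaced by an average of the models&#8217; outputs, as the abstract describes.</p>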
<p>This technique is extremely general, and can be applied to virtually any type of classification problem and any model. It can even use completely different models for different bootstrap samples, which can make it easy to sidestep difficult questions about choosing good models and model parameters. It has three main disadvantages:</p>
<ul>
<li>Performance: For each of the bootstrap samples, we have to construct a separate model from scratch, and each query must bear the overhead of querying each of these models and combining their results. Fortunately, bagging is easily parallelizable &#8211; because of this, and because it can extract good accuracy from fast-to-query models like decision trees, it can actually be more efficient in some cases.</li>
<li>Relies on instability: If the model being used is <em>stable</em>, or in other words does not change significantly between bootstrap samples, the resulting system can actually be less accurate than a model built directly on the original data. An example of a stable model is a nearest-neighbor classifier. To vividly illustrate this point, the paper describes an experiment with a linear regression model where the stability of the model depends on the number of variables used. Sure enough, there is a crossover point in accuracy between the bagged and the unbagged versions of this predictor.</li>
<li>Interpretability: One of the widely cited advantages of models like the decision tree is that it&#8217;s easy for a human to understand how it makes its decisions &#8211; like a flowchart, it just performs a series of tests. But when you have twenty decision trees, each making different decisions on the same inputs, it&#8217;s harder to get an intuitive feel for their aggregate decision making &#8211; it&#8217;s like predicting the decision that a committee will make by learning about its members.</li>
</ul>
<p>Bagging is just one of a family of <em>resampling ensemble methods</em>; another popular one is called <a href="http://en.wikipedia.org/wiki/Boosting"><em>boosting</em></a>, a technique that iteratively combines multiple predictors by encouraging each new model to focus on cases misclassified by previous models. Boosting can learn more quickly than bagging because of its focus on misclassified cases, but is also less robust to errors in training data. A 2005 paper by Kotsiantis and Pintelas described a scheme for combining the two (&#8220;Combining Bagging and Boosting&#8221;, International Journal of Computational Intelligence, <a href="http://www.math.upatras.gr/~esdlab/en/members/kotsiantis/ijci%20paper%20kotsiantis.pdf">PDF</a>).</p>
<p>Bagging has already played a strong role in the design of machine learning systems. Now that we&#8217;ve hit the clock speed barrier and are seeing more highly parallel and multicore systems, it&#8217;s bound to become an even more attractive option. Keep it in mind if you ever need to design a simple but accurate machine learning system.</p>
<p><em>The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://papersincomputerscience.org/2009/05/07/bagging-predictors/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Random Oracles are Practical: A Paradigm for Designing Efficient Protocols</title>
		<link>http://papersincomputerscience.org/2009/05/03/random-oracles-are-practical-a-paradigm-for-designing-efficient-protocols/</link>
		<comments>http://papersincomputerscience.org/2009/05/03/random-oracles-are-practical-a-paradigm-for-designing-efficient-protocols/#comments</comments>
		<pubDate>Sun, 03 May 2009 08:16:40 +0000</pubDate>
		<dc:creator>dcoetzee</dc:creator>
				<category><![CDATA[Cryptography]]></category>

		<guid isPermaLink="false">http://papersincomputerscience.org/?p=133</guid>
		<description><![CDATA[This influential 1993 paper introduced the Random Oracle Model, designed to enable a methodology for the design of efficient formally sound cryptographic protocols such as encryption and authentication, and is one of the milestones in the philosophy of cryptographic protocol design.]]></description>
			<content:encoded><![CDATA[<p><strong>Citation</strong>: Bellare, M. and Rogaway, P. Random oracles are practical: a paradigm for designing efficient protocols. In <em>Proceedings of the 1st ACM Conference on Computer and Communications Security</em> (Fairfax, Virginia, United States, November 3-5, 1993). CCS 1993. ACM, New York, NY, 62-73. (<a href="http://www-cse.ucsd.edu/~mihir/papers/ro.pdf">PDF</a>)</p>
<p><strong>Abstract</strong>: We argue that the random oracle model — where all parties have access to a public random oracle — provides a bridge between cryptographic theory and cryptographic practice. In the paradigm we suggest, a practical protocol P is produced by first devising and proving correct a protocol PR for the random oracle model, and then replacing oracle accesses by the computation of an &#8220;appropriately chosen&#8221; function h. This paradigm yields protocols much more efficient than standard ones while retaining many of the advantages of provable security. We illustrate these gains for problems including encryption, signatures, and zero-knowledge proofs.</p>
<p><strong>Discussion</strong>: This influential 1993 paper introduced the <em>random oracle model</em>, designed to enable a methodology for the design of efficient formally sound cryptographic protocols such as encryption and authentication, and is one of the milestones in the philosophy of cryptographic protocol design.</p>
<p>The holy grail of cryptography is cryptographic protocols &#8211; algorithms for tasks like encryption and signing &#8211; that are provably secure against attack. However, in the study of algorithms, negative results such as &#8220;no efficient algorithm exists to break this scheme&#8221; are notoriously difficult to prove; this is one reason why major problems like whether P=NP defy solution.</p>
<p>Like complexity theorists, cryptographers turned to conditional results: they assume that some <em>cryptographic primitive</em> is available &#8211; a very simple building block such as <a href="http://en.wikipedia.org/wiki/One-way_function">one-way functions</a>, or <a href="http://en.wikipedia.org/wiki/Trapdoor_function">one-way trapdoor functions</a> &#8211; and then build more complex primitives, including complete protocols, using them. In &#8220;The Random Oracle Methodology, Revisited,&#8221; discussed later, Ran Canetti defended this practice:</p>
<blockquote><p><em>One of the great contributions of complexity-based modern cryptography, developed in the past quarter of a century, is the ability to base the security of many varied protocols on a small number of well-defined and well-studied complexity assumptions. Furthermore, typically the proof of security of a protocol provides us with a method for transforming [an] adversary that breaks the security of said protocol into an adversary that refutes one of the well-studied assumptions. The Random Oracle Methodology does away with these advantages.</em></p></blockquote>
<p>One may draw an analogy with the P=NP problem: why do we suppose that NP-hard problems are difficult to solve? Because if we could solve any one of them in polynomial time, it would resolve the longstanding P=NP problem and provide fast algorithms for a variety of well-studied problems. Likewise, if we could show that a protocol relying on a particular cryptographic primitive is insecure, it would solve a longstanding open problem (e.g. do one-way functions exist?) and simultaneously break many other protocols relying on that primitive.</p>
<p>However, unlike the case of P=NP, systems proven conditionally secure by reduction to cryptographic primitives are rarely implemented. Indeed, until the publication of &#8220;Random Oracles are Practical,&#8221; most cryptographic systems <em>used in practice</em> had no formal justification whatsoever. This is because conditionally secure formal systems are much too inefficient for practical use. Yet there were <em>ad hoc </em>efficient cryptographic protocols in use in practice that clearly were good at withstanding concerted attack &#8211; without any formal backing, what was it that gave them their strength? Bellare and Rogaway write:</p>
<blockquote><p><em>Theorists view certain primitives (e.g. one-way functions) as &#8220;basic&#8221; and build more powerful primitives (e.g. <a href="http://en.wikipedia.org/wiki/Pseudorandom_function_family">pseudorandom functions</a>) out of them in inefficient ways; but in practice, powerful primitives are readily available and the so-called basic ones seem to be no easier to implement. In fact theorists deny themselves the capabilities of practical primitives which satisfy not only the strongest kinds of assumptions they like to make, but even have strengths which have not been defined or formalized.</em></p></blockquote>
<p>To facilitate the design of more efficient protocols, the authors advance the following methodology: suppose that all parties in a cryptographic protocol, including the attacker, have access to a random oracle. A random oracle is like an ideal pseudorandom number generator: you <em>seed</em> it with a particular initial value, and then it gives you an arbitrarily long sequence of random bits. The sequence depends on the seed, but has no other predictable patterns (and in particular, does not cycle).</p>
<p>In real life, random oracles cannot be implemented; deterministic programs cannot produce arbitrarily-long random outputs. But this paper asserts the following <em>thesis</em>: if a protocol can be proven secure in the presence of a random oracle, and we then replace calls to the random oracle with calls to a good <a href="http://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator">cryptographically secure pseudorandom number generator</a>, a process called <em>instantiation</em>, the resulting protocol is expected to be secure in practice. Moreover, schemes designed using this method are expected to be more efficient than conditionally secure schemes.</p>
<p>It&#8217;s not hard to see that this methodology is not perfectly sound in theory. For example, one can design a cryptographic protocol that passes zero to the oracle, and if the result is <em>f</em>(0) for some fixed function <em>f</em>, it reveals its secret key. In the random oracle model, this is very unlikely to occur, so the system remains secure; but if the system is instantiated with <em>f</em>, then it always occurs, and the system is completely insecure. Part of the thesis is that &#8220;an appropriate instantiation for a random oracle ought to work for any protocol which did not intentionally frustrate our method by anticipating the exact mechanism which would instantiate its oracles&#8221; (section 6, Instantiation).</p>
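<p>The contrived counterexample can be sketched in Python as follows. Everything here is an assumption made for illustration: SHA-256 plays the role of the fixed function <em>f</em>, the random oracle is simulated by a table of lazily drawn random answers, and the &#8220;protocol&#8221; is reduced to the single step that leaks the key.</p>

```python
import hashlib
import os

def f(data):
    """The fixed function the protocol anticipates (SHA-256 is an arbitrary stand-in)."""
    return hashlib.sha256(data).digest()

def make_random_oracle():
    """Simulate a random oracle: each new query gets a fresh random answer, remembered afterwards."""
    table = {}

    def oracle(query):
        if query not in table:
            table[query] = os.urandom(32)
        return table[query]

    return oracle

def contrived_protocol(oracle, secret_key):
    """Reveal the secret key exactly when the oracle answers the zero query the way f would."""
    if oracle(b"\x00") == f(b"\x00"):
        return secret_key      # deliberate self-destruct
    return None

print(contrived_protocol(make_random_oracle(), b"my secret key"))   # leak is astronomically unlikely
print(contrived_protocol(f, b"my secret key"))                      # instantiated with f: always leaks
```

<p>Against the simulated oracle the leak happens with probability 2<sup>-256</sup>, so a security proof in the random oracle model goes through, yet the instantiated protocol fails every time.</p>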
<p>To provide some practical justification for its thesis, &#8220;Random Oracles are Practical&#8221; presents a very simple, efficient encryption algorithm for long messages that is secure against a variety of attacks in the random oracle model. We begin by assuming (as in the conditionally secure model) that we have access to two primitives:</p>
<ul>
<li>a trapdoor permutation <em>f</em>, which is a function that, like RSA, can encrypt or decrypt a short <em>k</em>-bit value;</li>
<li>a random hash function H that inputs arbitrarily long strings and outputs <em>k</em>-bit results.</li>
</ul>
<p>Now suppose we want to encrypt the message <em>x</em>. We choose a <em>k</em>-bit random value <em>s</em> which will be used as the seed for the random oracle. We XOR the output of the oracle with the message <em>x</em> to generate the encrypted message. Finally, we append the encryption <em>f</em>(<em>s</em>) of the seed with the trapdoor permutation, so that the receiver can determine <em>s</em>, and the hash H(<em>sx</em>) of the seed together with the message to protect against tampering. This scheme is secure against three types of attacks in the random oracle model:</p>
<ul>
<li>In a <em>chosen-plaintext</em> attack, the polynomial-time adversary supplies two strings, one of them is encrypted, and the adversary must determine by examining the ciphertext which one was encrypted. They don&#8217;t have to do this perfectly, just significantly more than 50% of the time.</li>
<li>In a <em>chosen-ciphertext</em> attack, the adversary is additionally given access to the decryption function, and they can ask it to decrypt any string <em>except</em> the encrypted string they&#8217;re given.</li>
<li>Finally, a <em>malleability adversary</em> is one that, given the encryption of one string, can determine the encryption of a related string, such as an extension of the original string or one where some characters are replaced by other characters.</li>
</ul>
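<p>The encryption scheme described before the list of attacks can be sketched in Python. This is a toy under loudly stated assumptions, not the paper&#8217;s construction as published: the trapdoor permutation <code>f</code> is textbook RSA with tiny parameters (completely insecure), and both the oracle&#8217;s bit stream <code>G</code> and the hash <code>H</code> are instantiated with SHA-256 under different prefixes. All names are hypothetical.</p>

```python
import hashlib
import secrets

# Toy RSA trapdoor permutation on the integers 0..N-1 (textbook parameters, NOT secure).
N, E, D = 3233, 17, 2753   # N = 61 * 53, and E * D = 1 modulo (60 * 52)

def f(s):
    return pow(s, E, N)

def f_inverse(c):
    return pow(c, D, N)    # requires the trapdoor D

def G(seed, n_bytes):
    """Stand-in for the seeded oracle output: SHA-256 in counter mode."""
    blocks = (n_bytes + 31) // 32
    stream = b"".join(
        hashlib.sha256(b"G" + seed.to_bytes(2, "big") + i.to_bytes(4, "big")).digest()
        for i in range(blocks)
    )
    return stream[:n_bytes]

def H(seed, message):
    """Stand-in for the hash of the seed together with the message."""
    return hashlib.sha256(b"H" + seed.to_bytes(2, "big") + message).digest()

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encrypt(message):
    s = secrets.randbelow(N)                       # fresh random seed
    return f(s), xor(G(s, len(message)), message), H(s, message)

def decrypt(ciphertext):
    encrypted_seed, masked, tag = ciphertext
    s = f_inverse(encrypted_seed)                  # recover the seed with the trapdoor
    message = xor(G(s, len(masked)), masked)
    if H(s, message) != tag:
        raise ValueError("ciphertext rejected: tag mismatch")
    return message

ciphertext = encrypt(b"attack at dawn")
print(decrypt(ciphertext))
```

<p>The tag is what defeats the malleability adversary in this sketch: flipping any bit of the masked message changes the recovered plaintext, so the recomputed hash no longer matches and decryption refuses to answer.</p>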
<p>Sure enough, when the random oracle is replaced by a cryptographically secure pseudorandom number generator, the result is a system that is secure in practice, in the sense that no practical attack against the system has ever been found. The authors go on to describe similar efficient systems for signing and non-interactive zero-knowledge proofs that are sound in the random oracle model.</p>
<p>Over the years, many researchers took the advice of this paper and designed new efficient protocols based on the random oracle model. Encouragingly, these systems withstood the test of time and have not been broken. This seems to say that the methodology is sound. But a strong result by Canetti, Goldreich, and Halevi, first circulated in 1998, cast doubt upon the thesis (&#8220;The Random Oracle Methodology, Revisited.&#8221; <em>J. ACM</em> 51, 4 (Jul. 2004), 557-594. <a href="http://eprint.iacr.org/1998/011.pdf">PDF</a>). Their main result is as follows:</p>
<blockquote><p>There exist signature and encryption schemes that are secure in the Random Oracle Model, but for which <em>any implementation</em> of the random oracle results in insecure schemes. [...] Moreover, each of these schemes has a &#8220;generic adversary&#8221;, that when given as input the description of an implementation of the oracle, breaks the scheme that uses this implementation.</p></blockquote>
<p>In other words, not only can these encryption schemes be broken for any possible instantiation, but they supply a constructive method to do so <em>fully automatically</em>.</p>
<p>The proof method is similar to the example cited earlier of an encryption protocol revealing secret information if the oracle returns <em>f</em>(0) when given zero for the seed, where <em>f</em> is the cryptographically secure pseudorandom number generator used to instantiate the protocol. The difference is that instead of modifying the protocol to always fail for one particular <em>f</em>, it causes it to fail on a different <em>f</em> for each input. The input that causes failure under instantiation with <em>f</em> is simply the description of <em>f</em> (i.e., the machine code for <em>f</em>). Because the attacker has access to this information, they know just how to break the system. Fundamentally, this attack relies on a sort of &#8220;white box&#8221; reverse engineering &#8211; not only can the attacker <em>execute</em> the function <em>f</em>, but they can examine its code as well.</p>
<p>This does not say that the practical schemes designed to be secure in the random oracle model are insecure — to the contrary, they have stood the test of time quite robustly — but it does call the theoretical advantages of the paradigm into doubt. The authors of &#8220;The Random Oracle Methodology, Revisited&#8221; disagree sharply about the philosophical implications of this result for protocol design. Ran concludes that &#8220;the Random Oracle model is a bad abstraction of protocols for the purpose of analyzing security,&#8221; and strongly prefers the older approach of reductions to hard problems. Oded sees the random oracle model as an effective &#8220;sanity check&#8221; for rapidly ruling out insecure designs. Shai speculates on two possible reasons why existing protocols based on the random oracle model have defied attack:</p>
<ol>
<li>The systems are more difficult to break because a proof of security in the random oracle model rules out a large class of attacks &#8211; those that work in the random oracle model. Methods of attack that overcome this will take longer to develop.</li>
<li>Existing protocols have some as-yet-unidentified feature that makes them more resilient than the contrived protocols described in this work.</li>
</ol>
<p>Shai&#8217;s view of the utility of the model is more optimistic: he sees it as an &#8220;engineering tool&#8221; that is better than having no formal proof of security at all, and makes finding attacks more difficult.</p>
<p>What&#8217;s the future of the random oracle model? Perhaps we will identify some class of protocols for which instantiation is a sound procedure, or at least for which the generic attacks in this paper are inapplicable. Or perhaps we will adopt a new model that provides stronger guarantees. For now, formalizing and justifying the design of cryptographic protocols remains a difficult problem and a contentious philosophical debate.</p>
<p><em>The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://papersincomputerscience.org/2009/05/03/random-oracles-are-practical-a-paradigm-for-designing-efficient-protocols/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

