<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Papers in Computer Science &#187; Distributed systems</title>
	<atom:link href="http://papersincomputerscience.org/category/distributed-systems/feed/" rel="self" type="application/rss+xml" />
	<link>http://papersincomputerscience.org</link>
	<description>Discussion of computer science publications</description>
	<lastBuildDate>Fri, 25 Mar 2011 20:24:56 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Virtual Time</title>
		<link>http://papersincomputerscience.org/2009/04/23/virtual-time/</link>
		<comments>http://papersincomputerscience.org/2009/04/23/virtual-time/#comments</comments>
		<pubDate>Thu, 23 Apr 2009 08:44:15 +0000</pubDate>
		<dc:creator>dcoetzee</dc:creator>
				<category><![CDATA[Distributed systems]]></category>

		<guid isPermaLink="false">http://papersincomputerscience.org/?p=125</guid>
		<description><![CDATA[Citation: David R. Jefferson. Virtual time. ACM Transactions on Programming Languages and Systems, 7, 3 (Jul. 1985), 404-425. (PDF) Abstract: Virtual time is a new paradigm for organizing and synchronizing distributed systems which can be applied to such problems as distributed discrete event simulation and distributed database concurrency control. Virtual time provides a flexible abstraction [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Citation</strong>: David R. Jefferson. Virtual time. <em>ACM Transactions on Programming Languages and Systems,</em> 7, 3 (Jul. 1985), 404-425. (<a href="http://www.cs.uga.edu/~maria/pads/papers/p404-jefferson.pdf">PDF</a>)</p>
<p><strong>Abstract</strong>: Virtual time is a new paradigm for organizing and synchronizing distributed systems which can be applied to such problems as distributed discrete event simulation and distributed database concurrency control. Virtual time provides a flexible abstraction of real time in much the same way that virtual memory provides an abstraction of real memory. It is implemented using the Time Warp mechanism, a synchronization protocol distinguished by its reliance on lookahead-rollback, and by its implementation of rollback via antimessages.</p>
<p><strong>Discussion</strong>: This 1985 paper introduced <em>virtual time</em>, a concept that allows a distributed system to be organized around a linear global clock; rather than maintain a synchronized clock, it achieves efficiency by having each node maintain its own local virtual time and perform a rollback when it receives a message &#8220;in the past.&#8221; Although not widely adopted, it has served as an influential model of a general system with optimistic concurrency and rollback.</p>
<p>To motivate the concept, let&#8217;s for a moment consider a magical distributed system with the following properties:</p>
<ul>
<li>The system is a set of nodes, each capable of doing any amount of local processing instantaneously.</li>
<li>Each node can choose at any time to send a message to any other node. Moreover, it can specify precisely when that message will be received &#8211; the only constraint is that it cannot be received in the past.</li>
</ul>
<p>Such a system is capable of intuitively describing many different distributed systems. For example, you could have a distributed simulation (such as a physical simulation) in which each message tells a node to simulate a particular kind of event, and the times at which the messages arrive correspond to the times at which the events they simulate occur. Consequently all events are simulated in order. You could ensure that the transactions of a database system are committed in order by assigning a time to them indicating when they&#8217;re each supposed to atomically occur. Finally, most simply, you could create a system in which all messages are received in order, by merely setting the received times to be the same as the sending time. (These scenarios are based on section 5 from the paper.)</p>
<p>In real life, of course, we don&#8217;t have instantaneous processing, and we can&#8217;t control when messages will be received. The concept behind this paper is to get rid of &#8220;real time&#8221; and replace it with a new sort of time called <em>virtual time</em> (by analogy with virtual memory) that we have more control over. We can control the rate at which virtual time advances, and in fact it can advance at different rates at different nodes in the system: each node keeps a <em>local clock</em>. To control the time at which messages are received, we introduce a queue at each node that holds all received messages and does not process them until their virtual time arrives.</p>
<p>Not having to agree on a global virtual time greatly decreases the communication cost, but the problem with this flexibility is that once the local clocks have gotten out of sync, some strange scenarios can arise: if node A is lagging behind node B, it may send a message to B that arrives &#8220;in the past&#8221; from B&#8217;s perspective. But B may have already taken action based on the fact that it didn&#8217;t receive that message, including sending new messages to other nodes.</p>
<p>There are a number of solutions to this problem. The most obvious one is to have each node wait until its clock is the furthest in the past of all nodes before proceeding. This corresponds to pessimistic concurrency, and is the equivalent of having a global lock in a threaded program, with all the same consequences of low concurrency and poor throughput. The one taken by Jefferson is a form of optimistic concurrency: each local clock simply jumps to the received time of the next waiting event, so that the node is never idle, maximizing throughput. If it ever receives a new message &#8220;in the past,&#8221; it does a <em>rollback</em> to a point in time before that message was received and then proceeds forward again. To facilitate this it takes periodic snapshots of its state, tagged with virtual times. This is all transparent to the node&#8217;s program; from its perspective, virtual time never decreases. This is unlike, say, transactions, where a failed transaction needs to be detected and retried.</p>
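<p>To make the rollback step concrete, here is a minimal sketch in Python. It is not Jefferson&#8217;s implementation: the <code>Node</code> class, its fields, and the choice to snapshot after every message (rather than periodically) are assumptions made for illustration, and message sending is left out entirely. The state is an order-sensitive log, so a late message is visible unless the node really does rewind and replay.</p>

```python
import copy

class Node:
    """Toy Time Warp process (hypothetical names; sends are not modeled)."""

    def __init__(self):
        self.lvt = 0                      # local virtual time
        self.state = {"log": []}          # order-sensitive application state
        self.processed = []               # (recv_time, payload) already handled
        # Assumption: snapshot after every message instead of periodically.
        self.snapshots = [(0, copy.deepcopy(self.state))]

    def _apply(self, recv_time, payload):
        self.state["log"].append(payload)
        self.lvt = recv_time              # the clock jumps to the message's time
        self.processed.append((recv_time, payload))
        self.snapshots.append((recv_time, copy.deepcopy(self.state)))

    def receive(self, recv_time, payload):
        """Handle a message; return how many processed messages were undone."""
        if recv_time >= self.lvt:
            self._apply(recv_time, payload)
            return 0
        # Straggler: restore the latest snapshot not after recv_time ...
        redo = [m for m in self.processed if m[0] > recv_time]
        self.processed = [m for m in self.processed if m[0] <= recv_time]
        self.snapshots = [s for s in self.snapshots if s[0] <= recv_time]
        self.lvt, saved = self.snapshots[-1]
        self.state = copy.deepcopy(saved)
        # ... then replay the straggler and the undone messages in time order.
        for t, p in sorted(redo + [(recv_time, payload)]):
            self._apply(t, p)
        return len(redo)

node = Node()
node.receive(10, "a")
node.receive(30, "c")
undone = node.receive(20, "b")            # arrives "in the past"
print(undone, node.state["log"], node.lvt)
```

<p>After the late message at time 20 arrives, the log is in virtual-time order again, exactly as if the messages had been delivered in order.</p>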
<p>The big remaining problem is that during rollback there are certain things that can&#8217;t be undone locally, most importantly the sending of messages to other nodes. There are, again, multiple ways of dealing with this, but the one used by Jefferson is that whenever a message send is rolled back, an &#8220;antimessage&#8221; cancelling it is sent to the same node with the same received time. At the destination, if the corresponding message is still queued, the two collide and eliminate one another. If the corresponding message has already been processed, the antimessage arrives in the past and causes that node in turn to roll back and not receive the original message. One requirement for implementing this is that all nodes keep a queue of messages already processed.</p>
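<p>The two antimessage cases can be sketched as follows. The <code>Message</code> and <code>InputQueue</code> names and the string return values are invented for this example, and the actual state rollback is left to the caller; the sketch only shows how a message and its antimessage find and cancel each other.</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    sender: str
    recv_time: int
    payload: str
    sign: int = 1                 # +1 ordinary message, -1 antimessage

    def anti(self):
        """The twin that cancels this message (same contents, opposite sign)."""
        return Message(self.sender, self.recv_time, self.payload, -self.sign)

class InputQueue:
    """Hypothetical per-node queue; rollback of state is the caller's job."""

    def __init__(self):
        self.pending = []         # not yet processed, kept sorted by recv_time
        self.processed = []       # already processed, kept for cancellation

    def enqueue(self, msg):
        twin = msg.anti()
        if twin in self.pending:          # still queued: the pair annihilates
            self.pending.remove(twin)
            return "annihilated"
        if twin in self.processed:        # already consumed: must be undone
            self.processed.remove(twin)
            return "rollback"             # caller rolls back before msg.recv_time
        self.pending.append(msg)
        self.pending.sort(key=lambda m: m.recv_time)
        return "queued"

    def process_next(self):
        msg = self.pending.pop(0)
        self.processed.append(msg)
        return msg

q = InputQueue()
m = Message("A", 5, "x")
print(q.enqueue(m), q.enqueue(m.anti()))   # queued, then cancelled in the queue
q.enqueue(m)
q.process_next()
print(q.enqueue(m.anti()))                 # too late to cancel quietly
```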
<p>A nontrivial problem is to show that the system as a whole makes progress amidst all these cascading rollbacks. Another serious problem is how to prevent the nodes from running out of memory from storing unbounded queues of messages and state snapshots. To deal with these issues, Jefferson defines the concept of <em>global virtual time</em> (GVT): the minimum of all local clocks and of the virtual send times of all messages that have been sent but not yet processed. Informally, it is the earliest virtual time at which anything in the system is currently happening. Because no node can send a message into its own past, no message can be received before the GVT, and no node will ever have to roll back to a time before the GVT. Consequently, any state snapshot or already-processed message marked with a virtual time prior to GVT can be discarded, in a process Jefferson calls &#8220;fossil collection&#8221; (by analogy with garbage collection, a term that long predates the paper). Additionally, it can be shown that, even though individual local clocks may roll back, GVT never decreases. As long as no message is lost and no node blocks indefinitely, it will eventually increase, which is enough to guarantee that the system makes progress. Eventually, all message queues will be empty, and the system will terminate.</p>
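<p>As a rough sketch of GVT and fossil collection, assume we are handed every local clock and the timestamps of all messages still in flight (a real system has to estimate this with a distributed algorithm, which is not shown). The function names are made up for the example. One detail is a choice of this sketch: it keeps the newest snapshot at or before GVT, so that a node can still be restored to a state no later than GVT.</p>

```python
def global_virtual_time(local_clocks, in_transit_times):
    """Minimum over all local clocks and all not-yet-processed message times."""
    return min(list(local_clocks) + list(in_transit_times))

def fossil_collect(snapshots, gvt):
    """Drop snapshots that can never be needed again.

    snapshots is a list of (virtual_time, state) pairs sorted by time.
    Keeps the newest snapshot at or before gvt, plus everything after it.
    """
    older = [s for s in snapshots if s[0] <= gvt]
    newer = [s for s in snapshots if s[0] > gvt]
    return older[-1:] + newer

gvt = global_virtual_time([40, 25, 60], [30])
snaps = [(0, "s0"), (10, "s1"), (20, "s2"), (30, "s3")]
print(gvt, fossil_collect(snaps, gvt))
```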
<p>The main downside to a virtual time system is that it fundamentally relies on the assumption of what the author calls &#8220;temporal locality&#8221; &#8211; that messages won&#8217;t arrive in the past too often. Less obviously, it also depends on the efficiency of the implementation of state snapshotting, message queuing, and determining the global virtual time. Because virtual time is such a general framework, applying to distributed systems from multiprocessors to databases to LANs, any empirical analysis of these costs depends very much on the specific implementation. Evaluations of a highly-optimized practical implementation on multiprocessors were done in a series of papers by Richard M. Fujimoto (&#8220;Time warp on a shared memory multiprocessor&#8221; (1989), &#8220;The effect of memory capacity on Time Warp performance&#8221; (1993), &#8220;An adaptive memory management protocol for Time Warp parallel simulation&#8221; (1994), &#8220;An Empirical Evaluation of Performance-Memory Trade-Offs in Time Warp&#8221; (1997)). I&#8217;m not aware of any thorough evaluations in other scenarios.</p>
<p>The concept of virtual time was later generalized by Friedemann Mattern in his &#8220;Virtual Time and Global States of Distributed Systems,&#8221; which instead of using a totally ordered linear virtual time relies on a partially ordered virtual time (<em>Parallel and Distributed Algorithms</em>, 1989, 215-226, <a href="http://www.vs.inf.ethz.ch/publ/papers/VirtTimeGlobStates.pdf">PDF</a>). The motivation is to eliminate unnecessary ordering constraints imposed by the total order, at the cost of some conceptual complexity. Jefferson briefly hinted at this possibility (&#8220;[v]irtual times may be only partially ordered&#8221;).</p>
<p><em>The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://papersincomputerscience.org/2009/04/23/virtual-time/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications</title>
		<link>http://papersincomputerscience.org/2009/03/04/chord-a-scalable-peer-to-peer-lookup-service-for-internet-applications/</link>
		<comments>http://papersincomputerscience.org/2009/03/04/chord-a-scalable-peer-to-peer-lookup-service-for-internet-applications/#comments</comments>
		<pubDate>Wed, 04 Mar 2009 21:55:48 +0000</pubDate>
		<dc:creator>dcoetzee</dc:creator>
				<category><![CDATA[Distributed systems]]></category>
		<category><![CDATA[chord]]></category>
		<category><![CDATA[consistent hashing]]></category>
		<category><![CDATA[distributed hash table]]></category>
		<category><![CDATA[hash table]]></category>
		<category><![CDATA[peer-to-peer]]></category>

		<guid isPermaLink="false">http://papersincomputerscience.wordpress.com/?p=46</guid>
		<description><![CDATA[This 2001 paper introduced Chord, one of the first published distributed hash tables. Distributed hash tables are data structures that provide a "lookup" function for a distributed group of computers: given the key for a piece of data, find the computer where that piece of data is stored. Chord is one of the earliest and simplest data structures of this sort, and is designed for use with peer-to-peer networks.]]></description>
			<content:encoded><![CDATA[<p><strong>Citation</strong>: Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for internet applications. In <em>Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols For Computer Communications</em> (San Diego, California, United States). SIGCOMM &#8217;01. ACM, New York, NY, 149-160. (<a href="http://www.sigcomm.org/sigcomm2001/p12-stoica.pdf">PDF</a>).</p>
<p><strong>Abstract</strong>: A fundamental problem that confronts peer-to-peer applications is to efficiently locate the node that stores a particular data item. This paper presents <em>Chord</em>, a distributed lookup protocol that addresses this problem. Chord provides support for just one operation: given a key, it maps the key onto a node. Data location can be easily implemented on top of Chord by associating a key with each data item, and storing the key/data item pair at the node to which the key maps. Chord adapts efficiently as nodes join and leave the system, and can answer queries even if the system is continuously changing. Results from theoretical analysis, simulations, and experiments show that Chord is scalable, with communication cost and the state maintained by each node scaling logarithmically with the number of Chord nodes.</p>
<p><strong>Discussion</strong>: This 2001 paper introduced <a href="http://en.wikipedia.org/wiki/Chord_(distributed_hash_table)">Chord</a>, one of the first published <a href="http://en.wikipedia.org/wiki/Distributed_hash_table">distributed hash tables</a> and one of the most popular ones in practice due to its simplicity and good practical performance.</p>
<p>Distributed hash tables are data structures that provide a &#8220;lookup&#8221; function for a distributed group of computers: given the key for a piece of data, find the computer where that piece of data is stored. This is easy to do with a large centralized server that keeps track of where each piece of data is stored, but the problem becomes more complicated when we want to implement one on a <a href="http://en.wikipedia.org/wiki/Peer-to-peer">peer-to-peer</a> network where all nodes are running the same code and none is willing to commit the resources to track the entire network. Chord is one of the earliest and simplest data structures of this sort. In a system with <em>N</em> machines, it is able to perform a lookup by contacting only O(log N) other nodes, and each node only has to store data about O(log N) other nodes; no node has complete knowledge of the network.</p>
<p>The first task a distributed hash table must do is decide which nodes store which data items. Chord relies on the use of a hash function, typically a cryptographic hash function such as <a href="http://en.wikipedia.org/wiki/SHA_hash_functions">SHA-1</a>, to assign a unique integer identifier to each node (the hash of its IP address) and to each key that may be looked up. Then, a key <em>k</em> is assigned to a node <em>n</em> if and only if hash(<em>k</em>) ≤ hash(<em>n</em>) and there is no other node with a hash between hash(<em>k</em>) and hash(<em>n</em>). If no node has a hash greater than or equal to hash(<em>k</em>), the key wraps around and is assigned to the node with the smallest hash. This can be visualized by imagining all the hash values laid out on a large circle or ring, increasing in the clockwise direction; to find the node for a key, you go to the key&#8217;s hash on the circle and proceed clockwise until you encounter a node&#8217;s hash. This system is called <a href="http://en.wikipedia.org/wiki/Consistent_hashing">consistent hashing</a>, and was introduced by Karger, Lewin, et al. in 1997 (Karger, D., Lehman, E., Leighton, F., Levine, M., Lewin, D., and Panigrahy, R. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In <em>Proceedings of the 29th Annual ACM Symposium on Theory of Computing</em>, May 1997, pp. 654-663. <a href="http://www.akamai.com/dl/technical_publications/ConsistenHashingandRandomTreesDistributedCachingprotocolsforrelievingHotSpotsontheworldwideweb.pdf">PDF</a>). It&#8217;s convenient for distributed hash tables because if a node joins or leaves the system, only a small number of data items have to be reassigned (those between it and the previous node), and only two nodes have to communicate to reassign them; the most obvious alternative, a system that assigns keys at random, would need to contact all nodes to locate the data to be reassigned.</p>
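<p>The assignment rule is short enough to write down. The sketch below is only an illustration: <code>ident</code> and <code>successor</code> are names chosen for this example, and the demonstration uses a tiny ring of hand-picked identifiers (0 to 255) instead of real SHA-1 values so the result is easy to check by eye. It also shows the property that makes consistent hashing attractive: when a node joins, only the keys between its predecessor and itself change owner.</p>

```python
import bisect
import hashlib

def ident(name):
    """160-bit identifier: the SHA-1 hash of a key or an IP address string."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

def successor(sorted_ids, key_id):
    """First node identifier at or clockwise after key_id, wrapping around."""
    i = bisect.bisect_left(sorted_ids, key_id)
    return sorted_ids[i % len(sorted_ids)]

# Toy ring with made-up identifiers in the range 0..255.
old_ring = [10, 50, 200]
new_ring = [10, 50, 120, 200]             # node 120 joins

print(successor(old_ring, 60))            # owned by node 200
print(successor(old_ring, 201))           # wraps around to node 10

moved = [k for k in range(256) if successor(old_ring, k) != successor(new_ring, k)]
print(moved[0], moved[-1], len(moved))    # only keys in (50, 120] change owner
```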
<p>The most trivial distributed hash table in this arrangement would be for each node to keep the IP address of the next node in the clockwise direction on the circle, called its <em>successor</em>; this connects them all in a circularly linked list. A node can follow these links by asking each node in turn for its successor, until the one storing the desired key is reached. This works fine, but is very slow: it requires roundtrip communication with a series of N/2 nodes on average, and N-1 nodes in the worst case.</p>
<p>To speed this up, Chord introduces the concept of <em>fingers</em>: the <em>i</em>th finger of a node <em>n</em> is the first node in the clockwise direction from the hash value hash(<em>n</em>) + 2<sup><em>i</em>-1</sup> (taken modulo the size of the hash space). Every node stores a table of fingers and the IP addresses of the corresponding nodes. To locate the node for a key, a node contacts the largest-indexed finger on its table that precedes the key on the circle, and forwards the request to that node. This repeats until the request reaches the node carrying the data. With high probability the request only needs to be forwarded O(log N) times, because each hop at least halves the remaining distance to the key on the circle: each node stores the most information about nodes closely following it, so the closer the request gets, the more precise the routing information becomes.</p>
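<p>Here is a small simulation of finger-based lookup, written as a sketch rather than as the paper&#8217;s pseudocode. It uses a 6-bit identifier space and a fixed set of ten node identifiers; every node&#8217;s finger table is computed on demand from a global list, which a real node could not do. The helper names (<code>dist</code>, <code>fingers</code>, <code>lookup</code>) are invented for the example.</p>

```python
import bisect

M_BITS = 6
RING = 2 ** M_BITS                        # identifiers are 0..63
NODES = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # sorted node identifiers

def successor(ident):
    """First node at or clockwise after ident."""
    i = bisect.bisect_left(NODES, ident)
    return NODES[i % len(NODES)]

def fingers(n):
    """Finger i (1-based) is successor(n + 2^(i-1)), modulo the ring size."""
    return [successor((n + 2 ** (i - 1)) % RING) for i in range(1, M_BITS + 1)]

def dist(a, b):
    """Clockwise distance from a to b; a full lap when a == b."""
    return (b - a) % RING or RING

def lookup(start, key):
    """Return (owner, hops): the node owning key and the nodes that forwarded."""
    n, hops = start, [start]
    while True:
        table = fingers(n)
        if dist(n, key) <= dist(n, table[0]):     # key lies in (n, successor]
            return table[0], hops
        # Forward to the largest-indexed finger that still precedes the key.
        n = next(f for f in reversed(table) if dist(n, f) < dist(n, key))
        hops.append(n)

print(fingers(8))
print(lookup(8, 54))
```

<p>Starting at node 8 and looking up key 54, the request jumps to node 42, then to node 51, which finds that its successor 56 owns the key. The jumps get shorter as the request closes in on the key.</p>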
<p>When a node joins the network, it needs to set up its own finger table, and the finger tables of other nodes must be updated to include it; when a node leaves, the entries that pointed to it must be repaired. Using the above lookup primitive, each of the necessary nodes can be located with O(log N) messages, so that a node can join or leave the network using O(log<sup>2</sup> N) messages in total.</p>
<p>Part of why Chord is useful in peer-to-peer systems is that it&#8217;s resilient to nodes continuously joining and leaving the network: as long as the successor links remain correct, lookup remains correct; finger tables are just an optimization and they can be updated less frequently without significantly impacting performance. A stabilization procedure is responsible for ensuring that successors remain correct and nodes have a consistent view of successors. Similarly, node failures, where a node drops from the network without warning, can be dealt with by adding a list of the next log N successors to each node; if the successor ever fails, the request can be forwarded to the next live successor. It&#8217;s unlikely all log N successors will fail simultaneously before the stabilization procedure can update the successor list. To prevent data loss in the case of node failure, a replication protocol can be built on top of Chord which simply stores the same data under two or more distinct keys.</p>
<p>In line with its goal as a practical system, a large section (section 6) was dedicated to experimental results. Most of these were based on simulations with large numbers of nodes (say 10000) to demonstrate scalability and stability. They show that the average number of nodes contacted for lookups is actually close to (log N)/2; and they give an important optimization, which is for each physical node to run a number of &#8220;virtual&#8221; nodes with the same IP address; this helps to divide the hash space more evenly and decrease the variance in the number of keys per physical node. They also ran an experiment on the Internet with a smaller network of 10 machines across the United States, showing a total latency of about 200 ms for lookups.</p>
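<p>The effect of virtual nodes is easy to try out. The following sketch is not the paper&#8217;s experiment: the host addresses, the key names, and the <code>host#index</code> scheme for naming virtual nodes are all assumptions made for illustration. It places each host on the ring several times and counts how many keys each physical host ends up owning, so the smallest and largest loads can be compared for different numbers of virtual nodes.</p>

```python
import bisect
import hashlib
from collections import Counter

def ident(name):
    """160-bit identifier from SHA-1."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

def key_counts(hosts, keys, vnodes):
    """Number of keys owned by each physical host with vnodes ring positions."""
    # Assumption: virtual node v of a host is named "host#v" before hashing.
    ring = sorted((ident(f"{h}#{v}"), h) for h in hosts for v in range(vnodes))
    ids = [i for i, _ in ring]
    counts = Counter({h: 0 for h in hosts})
    for k in keys:
        i = bisect.bisect_left(ids, ident(k)) % len(ring)
        counts[ring[i][1]] += 1
    return counts

hosts = [f"10.0.0.{i}" for i in range(20)]
keys = [f"key-{i}" for i in range(5000)]
for vnodes in (1, 20):
    counts = key_counts(hosts, keys, vnodes)
    print(vnodes, min(counts.values()), max(counts.values()))
```

<p>With more virtual nodes per host, the gap between the least and most loaded host should typically shrink, at the cost of each host maintaining more routing state.</p>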
<p>As is frequently the case with scientific discoveries, a number of researchers independently invented systems with similar capabilities to Chord at about the same time in 2001. These systems included <a href="http://en.wikipedia.org/wiki/Pastry_(DHT)">Pastry</a> and <a href="http://en.wikipedia.org/wiki/Tapestry_(DHT)">Tapestry</a>, which both emphasize minimizing latency over simplicity, and <a href="http://en.wikipedia.org/wiki/Content_addressable_network">CAN (content addressable network)</a>, which organizes its nodes in a <em>d</em>-dimensional Cartesian coordinate space instead of a ring, and requires more parameter tuning than Chord. Since then a number of other distributed hash table technologies have been developed, such as Maymounkov and Mazières&#8217;s <a href="http://en.wikipedia.org/wiki/Kademlia">Kademlia</a> in 2002 (<a href="http://www.cs.rice.edu/Conferences/IPTPS02/109.pdf">PDF</a>), which relies on an XOR-based metric topology.</p>
<p>Chord, particularly the popular MIT implementation, has undergone a great deal of performance refinement and further analysis since, including techniques like proximity routing (preferring to route through nodes that are nearby in the network), geographic overlay construction (using physical location to inform the structure of the topology), and landmark routing. These all aim to lower the latency of lookup in practice. A <a href="http://www.cs.berkeley.edu/~zf/papers/chord_perf.pdf">Berkeley course project</a> by Li Zhuang and Feng Zhou explained and evaluated a variety of these refinements.</p>
<p>An important limitation of Chord (discussed briefly at the end of section 4.3) is its fundamental assumption that the hash function behaves randomly. Although it uses cryptographic hashes that are collision-resistant, it isn&#8217;t difficult to generate IPs or keys that hash to a small segment of the hash space, by simply generating random values until you find one with a hash in that segment, rendering it vulnerable to certain attacks. Other distributed hash tables, such as Pastry, Tapestry, and Symphony, incorporate randomization into their topology.</p>
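<p>To see why collision resistance does not help here, consider the search described above. This sketch shrinks the identifier space to the top 16 bits of SHA-1 so the numbers are readable; the function names and the <code>key-</code> naming are hypothetical. Landing in a segment that covers 1/16 of the ring takes about 16 guesses on average, and in general the cost is inversely proportional to the segment&#8217;s share of the ring, which is cheap for any segment an attacker would plausibly target.</p>

```python
import hashlib
import itertools

BITS = 16                                  # toy identifier space for readability

def ident(name):
    """Top BITS bits of the SHA-1 hash of name."""
    full = int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")
    return full >> (160 - BITS)

def find_name_in_segment(lo, hi):
    """Try names until one hashes into [lo, hi); return it with the try count."""
    for tries in itertools.count(1):
        name = f"key-{tries}"
        if lo <= ident(name) < hi:
            return name, tries

name, tries = find_name_in_segment(0, 2 ** 12)   # a segment 1/16 of the ring
print(name, ident(name), tries)
```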
<p>A more extensive modern discussion of Chord is given in Chapter 2 of Monica Haladyna Braunisch&#8217;s 2006 Master&#8217;s Thesis at MIT (<a href="http://ast-deim.urv.cat/trac/planetsim/chrome/site/HaladynaBraunischMALMThesis20060430.pdf">PDF</a>).</p>
<p><em>The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://papersincomputerscience.org/2009/03/04/chord-a-scalable-peer-to-peer-lookup-service-for-internet-applications/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

