Posted by dcoetzee on July 6th, 2009
Citation: Chazelle, B. The soft heap: an approximate priority queue with optimal error rate. Journal of the ACM 47, 6 (Nov. 2000), 1012-1027. (PDF)
Abstract: A simple variant of a priority queue, called a soft heap, is introduced. The data structure supports the usual operations: insert, delete, meld, and findmin. Its novelty is to beat the logarithmic bound on the complexity of a heap in a comparison-based model. To break this information-theoretic barrier, the entropy of the data structure is reduced by artificially raising the values of certain keys. Given any mixed sequence of n operations, a soft heap with error rate ε (for any 0 < ε ≤ 1/2) ensures that, at any time, at most εn of its items have their keys raised. The amortized complexity of each operation is constant, except for insert, which takes O(log 1/ε) time. The soft heap is optimal for any value of ε in a comparison-based model. The data structure is purely pointer-based. No arrays are used and no numeric assumptions are made on the keys. The main idea behind the soft heap is to move items across the data structure not individually, as is customary, but in groups, in a data-structuring equivalent of “car pooling.” Keys must be raised as a result, in order to preserve the heap ordering of the data structure. The soft heap can be used to compute exact or approximate medians and percentiles optimally. It is also useful for approximate sorting and for computing minimum spanning trees of general graphs.
Discussion: This 2000 paper by Bernard Chazelle introduced a data structure called a soft heap, which is a heap data structure that can perform any operation in amortized constant time. In exchange for this ideal performance bound, a new limitation is imposed: at any time, any element in the heap may have its key increased at the discretion of the data structure; such an element is said to be corrupted. There is an upper limit of εn corrupted elements, where n is the number of operations completed so far and ε is a fixed parameter. For example, if ε = 0.01, and we insert 1000 elements into the heap, at most 10 elements will be corrupted.
Heaps are one of the fundamental data structures studied in computer science. Like dictionaries or hash tables, they store a set of elements, each with an associated key. Heaps support four fundamental operations:
- insert: Insert a new element, given the element and its key.
- extract minimum: Get the element with minimum key and delete it from the heap.
- decrease key: Update an existing element’s key to a smaller value.
- merge or meld: Combine two existing heaps together into one large heap.
The most obvious application of these operations is a priority queue: if the keys indicate priority (with smaller values indicating higher priority), then insert adds a new task to the queue, and extract minimum retrieves the current highest-priority task. Heaps also form the basis of the heapsort sorting algorithm, which in its simplest form inserts each element into a heap one-by-one and then uses extract minimum repeatedly to get the result list in order.
The sorting application demonstrates an important lower bound regarding heaps: it is impossible to design a heap data structure that can do both the insert and extract minimum operations in amortized constant time, for if this were possible we would be able to do a comparison sort in linear (O(n)) time; but any such sort requires Ω(n log n) time. Indeed, the best known heap data structure, the Fibonacci heap, developed in 1984, can do all these operations in amortized constant time except for extract minimum, which requires Θ(log n) amortized time.
But what if there were some way to modify the heap data structure that let us break through this lower bound and do all operations in amortized constant time without contradicting the above result? This is exactly what the soft heap does. A soft heap is like a normal heap, but with the troubling property that the key you assign to an element when you insert it may not stay that way; more formally, the data structure may choose at its discretion to increase the key of any element at any time. Any element that has had its key increased is called corrupted. It’s easy to see how this might be helpful in reducing complexity; if the data structure just decided to increase the keys of all elements to infinity, it would be trivial to do all operations in constant time.
But to make soft heaps actually useful and not just fast, we need an additional constraint. This constraint is as follows: each soft heap has a fixed parameter ε called the error rate. Roughly speaking, this controls what proportion of the elements may be corrupted at any given time. More formally, it specifies that if we start with the empty heap and do n operations, the resulting heap will have at most εn corrupted elements. If all n operations are inserts, then indeed ε will be an upper bound on the proportion of corrupted elements. If extract minimum operations are included, the relationship is no longer so clear – it’s possible all elements may become corrupt.
As Chazelle so elegantly says, “despite this apparent weakness, the soft heap is optimal and—perhaps even more surprising—useful.” Perhaps the simplest nontrivial application of soft heaps is finding the median of a list of values in linear time. There is a classical algorithm for doing this, but the implementation and analysis of the algorithm based on the soft heap is much simpler. All we have to do is set ε = 1/3, then insert all n elements into the heap, and then do extract minimum n/3 times. Depending on exactly which elements got corrupted, the final element extracted, the pivot, will be somewhere between the (n/3)th and (2n/3)th smallest element in the list. We then partition the list into the sublist of elements less than or equal to the pivot, and those greater than the pivot, and recursively invoke this procedure on the one containing the median. This is similar to the quicksort algorithm, but only recursively processing one of the sublists instead of both, and it runs in worst-case linear time.
Historically, soft heaps were invented with a specific application in mind: computing minimum spanning trees. Chazelle has achieved some fame for discovering the asymptotically best known algorithms for several problems, and this is one of them: his soft-heap based algorithm can compute minimum spanning trees in O(n α(n)) time, where α is the inverse Ackermann function, a very slowly-growing function that is less than 5 for all remotely practical values of n. I don’t discuss the details of this algorithm here because they’re pretty complex, but you can see Chazelle’s paper (Bernard Chazelle. A Minimum Spanning Tree Algorithm with Inverse-Ackermann Type Complexity. JACM 47(6):1028–1047, 2000, PDF). In 2002, Pettie and Ramachandran presented a minimum spanning tree algorithm based on soft heaps that is known to be asymptotically optimal.
So how does it work? In Chazelle’s original implementation, the soft heap was a simple variant on the binomial heap. Instead of storing a single element at each node, it would store a list of elements, associated with a single key that is an upper bound on the original keys of all the elements. This is the source of the “corruption” – any values in this list whose original keys are smaller than the common key are corrupted. Such “multivalued” nodes could only appear near the top levels of the trees making up the heap, which limits how many corrupted elements there can be. These lists are formed during the soft heap’s special “sift-up” routine, which under certain conditions will run twice on the same node, causing it to merge (concatenate) element lists with one of its children. There are a number of subtleties to this process, however, and to explicate them Chazelle gives complete C source code for the data structure in a literate programming style.
Recently a simpler implementation of soft heaps by Kaplan and Zwick has appeared (A simpler implementation and analysis of Chazelle’s soft heaps. In Proceedings of the Nineteenth Annual ACM -SIAM Symposium on Discrete Algorithms (Jan 4-6, 2009).) They use the original soft heaps concept of storing a list of items at each node, and of occasionally running sift on elements twice during sifting, but rather than being based on binomial heaps, this implementation is based on a set of heap-ordered binary trees. Also, their condition for running sift twice on a node is somewhat less arbitrary: each node x has an associated “size” value that increases exponentially with rank, and the size |list(x)| of the element list at any internal node x must always satisfy size(x) ≤ |list(x)| ≤ 3size(x).
How do you know if soft heaps might be useful for something you’re doing? Due to their peculiar behavior, they tend to be most useful for inventing new algorithms with better worst-case performance than existing algorithms, particularly in situations where the “approximate rank” – or approximate position in sorted order – of an element in a list is useful. It tends to be applied in places where ordinary heap data structures have traditionally been applied in the past, like graph algorithms and sorting. Beyond this though, there may still remain a rich set of untapped applications of soft heaps to be found that employ them in novel ways.
The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.