Papers in Computer Science

Discussion of computer science publications

Archive for the ‘Programming languages’ Category

On understanding data abstraction, revisited

Posted by dcoetzee on December 5, 2009

Citation: William R. Cook. On understanding data abstraction, revisited. ACM SIGPLAN Notices 44, 10, 557-572. October 2009. (PDF)

Abstract: In 1985 Luca Cardelli and Peter Wegner, my advisor, published an ACM Computing Surveys paper called “On understanding types, data abstraction, and polymorphism”. Their work kicked off a flood of research on semantics and type theory for object-oriented programming, which continues to this day. Despite 25 years of research, there is still widespread confusion about the two forms of data abstraction, abstract data types and objects. This essay attempts to explain the differences and also why the differences matter.

Discussion: This October 2009 essay by Assistant Professor William R. Cook of the University of Texas at Austin, an active programming languages researcher, seeks to clarify the fundamental difference between two often-confused forms of data abstraction: abstract data types (ADTs) and objects (as in object-oriented programming).

Both abstract data types and objects have the same basic purpose: they allow the same functionality to be implemented in different ways, and allow the implementation to be modified without affecting client code. For example, when using a “dictionary” module mapping keys to values, the client shouldn’t have to care whether that functionality is internally implemented using a hash table or a red-black tree; and the implementer should be free to modify this kind of implementation detail at any time.

The “objects” provided by mainstream object-oriented programming languages such as Java, C++, and C# are actually a sort of hybrid of abstract data types and true objects, able to effectively simulate both abstractions. To demonstrate the distinction, here we outline an implementation for an “IntQueue” class representing a first-in first-out queue of integers in Java using both methods. First using objects:

interface IIntQueue {
  public boolean isEmpty();
  public void enqueue(int i);
  public int dequeue();
  public IIntQueue append(IIntQueue q);
}

class IntQueue implements IIntQueue {
  java.util.ArrayList list;

  public IntQueue() { list = new java.util.ArrayList(); }
  public boolean isEmpty() { return list.size() == 0; }
  public void enqueue(int i) { list.add(i); }
  public int dequeue() { int i = (Integer)list.get(0); list.remove(0); return i; }
  public IIntQueue append(IIntQueue q) {
    IntQueue result = new IntQueue();
    while (!this.isEmpty()) { result.enqueue(this.dequeue()); }
    while (!q.isEmpty()) { result.enqueue(q.dequeue()); }
    return result;
  }
}

public class Program {
  public static void main(String[] args) {
    IIntQueue q1 = new IntQueue(), q2 = new IntQueue();
    q1.enqueue(1); q1.enqueue(2);
    q2.enqueue(3); q2.enqueue(4);
    IIntQueue q3 = q1.append(q2);
    while (!q3.isEmpty()) {
      System.out.println(q3.dequeue());
    }
  }
}

This Java sample follows two important disciplines:

  1. Concrete class names, such as IntQueue, are only ever used in “new” expressions, not as parameter types, return types, or even local variable types – and this includes in the definition of the IntQueue class itself. Instead, we are constrained to exclusively use interfaces (pure virtual classes in C++) for our static types.
  2. Reference equality (==) is never used.

Although they may seem draconian, adhering to this “pure object” discipline introduces some powerful flexibility. For example, instead of having the “append” method construct the combined queue all at once, we can introduce a new class that allows us to rewrite it as a constant-time operation:

class IntQueue implements IIntQueue {
  // (same as before)
  public IIntQueue append(IIntQueue q) { return new LazyAppendIntQueue(this, q); }
}

class LazyAppendIntQueue implements IIntQueue {
  IIntQueue q1, q2;

  public LazyAppendIntQueue(IIntQueue q1, IIntQueue q2) { this.q1 = q1; this.q2 = q2; }
  public boolean isEmpty() { return q1.isEmpty() && q2.isEmpty(); }
  public void enqueue(int i) { q2.enqueue(i); }
  public int dequeue() { if (q1.isEmpty()) return q2.dequeue(); else return q1.dequeue(); }
  public IIntQueue append(IIntQueue q) { return new LazyAppendIntQueue(this, q); }
}

This modification requires no changes to the client code; indeed, no client code could possibly detect any change in behavior, as long as it’s constrained to accessing objects through the IIntQueue interface.

An alternate way of optimizing the append() method is to append the underlying arrays, and then make this the underlying array of a new IntQueue. In the strict objects model, however, this is impossible: the append() method can see the representation of this, but can only access its argument through the IIntQueue interface. We might try to fix this by adding a method to the interface to get the underlying array, but that would severely limit the possible implementations of IIntQueue (in particular, LazyAppendIntQueue above would be excluded). The restriction that objects cannot access other objects except through their interface, even other instances of the same type, is called autognosis, and is a fundamental limitation of objects.

Now let’s consider how this same data structure could be implemented in the style of an abstract data type (ADT):

class IntQueue {
  java.util.ArrayList list;

  public IntQueue() { list = new java.util.ArrayList(); }
  public boolean isEmpty() { return list.size() == 0; }
  public void enqueue(int i) { list.add(i); }
  public int dequeue() { int i = (Integer)list.get(0); list.remove(0); return i; }
  public IntQueue append(IntQueue q) {
    IntQueue result = new IntQueue();
    result.list = this.list;
    result.list.addAll(q.list);
    return result;
  }
}

This looks more like typical Java code – each method of IntQueue is free to ravage the internal state of any instance of IntQueue, and interfaces are not used; consequently, different implementations of the same module are not interchangeable at runtime. They are, however, interchangeable at compile-time – if we had two different implementations of “class IntQueue” in two different namespaces, we could decide with a single “import” statement which one we’d like to use. Different parts of the program might choose to use different implementations, but they wouldn’t “mix” – they couldn’t be passed to one another’s append() methods. For these reasons, ADTs are considerably less flexible than true objects, but what they lose in flexibility they make up for in simplicity and the potential for optimization.

Considering how much more ADTs look like normal Java code, you might be asking yourself where objects are really used in practice. In general, they tend to pop up in places where a single, relatively stable interface has many different implementations; common examples include “windows, filesystems, or device drivers.” Filesystems are the most illustrative example: by providing a single fixed API, not only can many different filesystems be accessed through the same uniform filesystem interface, but they can be composed in interesting ways, such as mounting an ISO (a file on one filesystem contains the underlying storage for a different filesystem).

The filesystem example also exposes one of the strongest disadvantages of objects: once many implementations have been written to a stable interface, changing that interface in any way is very difficult. ADT signatures on the other hand are straightforward to extend; for example, if a client of Dictionary needs to iterate over the dictionary in order, you can just add that functionality to the version based on red-black trees, and then compel that client to use that particular implementation.

Following a development process based on refactoring, it’s not uncommon to introduce an abstraction using ADTs and later refine it into a true object when more flexibility is needed. Conversely, the need for optimization using autognosis may require introducing some elements of ADTs. Using type inspection like “instanceof” and reflection, it may be possible to get the “best of both worlds” by optimizing certain common cases, but falling back on the generic interface for unrecognized types. This is sort of a poor man’s version of multiple dispatch.

Neither Cook nor I promote the use of either ADTs or true objects over the other. The most important takeaway from this is: know how to use both ADTs and true objects, recognize which one you’re using, and know their advantages and disadvantages. By avoiding the conceptual conflation that hybrid data abstraction systems tend to induce, you can better predict where to expect issues with extensibility and efficiency in your system.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

Posted in Programming languages | 5 Comments »