Papers in Computer Science

Discussion of computer science publications

On understanding data abstraction, revisited

Posted by dcoetzee on December 5th, 2009

Citation: William R. Cook. On understanding data abstraction, revisited. ACM SIGPLAN Notices 44, 10, 557-572. October 2009. (PDF)

Abstract: In 1985 Luca Cardelli and Peter Wegner, my advisor, published an ACM Computing Surveys paper called “On understanding types, data abstraction, and polymorphism”. Their work kicked off a flood of research on semantics and type theory for object-oriented programming, which continues to this day. Despite 25 years of research, there is still widespread confusion about the two forms of data abstraction, abstract data types and objects. This essay attempts to explain the differences and also why the differences matter.

Discussion: This October 2009 essay by Assistant Professor William R. Cook of the University of Texas at Austin, an active programming languages researcher, seeks to clarify the fundamental difference between two often-confused forms of data abstraction: abstract data types (ADTs) and objects (as in object-oriented programming).

Both abstract data types and objects have the same basic purpose: they allow the same functionality to be implemented in different ways, and allow the implementation to be modified without affecting client code. For example, when using a “dictionary” module mapping keys to values, the client shouldn’t have to care whether that functionality is internally implemented using a hash table or a red-black tree; and the implementer should be free to modify this kind of implementation detail at any time.

The “objects” provided by mainstream object-oriented programming languages such as Java, C++, and C# are actually a sort of hybrid of abstract data types and true objects, able to effectively simulate both abstractions. To demonstrate the distinction, here we outline an implementation for an “IntQueue” class representing a first-in first-out queue of integers in Java using both methods. First using objects:

[sourcecode language="java"]
interface IIntQueue {
public boolean isEmpty();
public void enqueue(int i);
public int dequeue();
public IIntQueue append(IIntQueue q);
}

class IntQueue implements IIntQueue {
java.util.ArrayList list;

public IntQueue() { list = new java.util.ArrayList(); }
public boolean isEmpty() { return list.size() == 0; }
public void enqueue(int i) { list.add(i); }
public int dequeue() { int i = (Integer)list.get(0); list.remove(0); return i; }
public IIntQueue append(IIntQueue q) {
IntQueue result = new IntQueue();
while (!this.isEmpty()) { result.enqueue(this.dequeue()); }
while (!q.isEmpty()) { result.enqueue(q.dequeue()); }
return result;
}
}

public class Program {
public static void main(String[] args) {
IIntQueue q1 = new IntQueue(), q2 = new IntQueue();
q1.enqueue(1); q1.enqueue(2);
q2.enqueue(3); q2.enqueue(4);
IIntQueue q3 = q1.append(q2);
while (!q3.isEmpty()) {
System.out.println(q3.dequeue());
}
}
}
[/sourcecode]

This Java sample follows two important disciplines:

  1. Concrete class names, such as IntQueue, are only ever used in “new” expressions, not as parameter types, return types, or even local variable types – and this includes in the definition of the IntQueue class itself. Instead, we are constrained to exclusively use interfaces (pure virtual classes in C++) for our static types.
  2. Reference equality (==) is never used.

Although they may seem draconian, adhering to this “pure object” discipline introduces some powerful flexibility. For example, instead of having the “append” method construct the combined queue all at once, we can introduce a new class that allows us to rewrite it as a constant-time operation:

[sourcecode language="java"]
class IntQueue implements IIntQueue {
// (same as before)
public IIntQueue append(IIntQueue q) { return new LazyAppendIntQueue(this, q); }
}

class LazyAppendIntQueue implements IIntQueue {
IIntQueue q1, q2;

public LazyAppendIntQueue(IIntQueue q1, IIntQueue q2) { this.q1 = q1; this.q2 = q2; }
public boolean isEmpty() { return q1.isEmpty() && q2.isEmpty(); }
public void enqueue(int i) { q2.enqueue(i); }
public int dequeue() { if (q1.isEmpty()) return q2.dequeue(); else return q1.dequeue(); }
public IIntQueue append(IIntQueue q) { return new LazyAppendIntQueue(this, q); }
}
[/sourcecode]

This modification requires no changes to the client code; indeed, no client code could possibly detect any change in behavior, as long as it’s constrained to accessing objects through the IIntQueue interface.

An alternate way of optimizing the append() method is to append the underlying arrays, and then make this the underlying array of a new IntQueue. In the strict objects model, however, this is impossible: the append() method can see the representation of this, but can only access its argument through the IIntQueue interface. We might try to fix this by adding a method to the interface to get the underlying array, but that would severely limit the possible implementations of IIntQueue (in particular, LazyAppendIntQueue above would be excluded). The restriction that objects cannot access other objects except through their interface, even other instances of the same type, is called autognosis, and is a fundamental limitation of objects.

Now let’s consider how this same data structure could be implemented in the style of an abstract data type (ADT):

[sourcecode language="java"]
class IntQueue {
java.util.ArrayList list;

public IntQueue() { list = new java.util.ArrayList(); }
public boolean isEmpty() { return list.size() == 0; }
public void enqueue(int i) { list.add(i); }
public int dequeue() { int i = (Integer)list.get(0); list.remove(0); return i; }
public IntQueue append(IntQueue q) {
IntQueue result = new IntQueue();
result.list = this.list;
result.list.addAll(q.list);
return result;
}
}
[/sourcecode]

This looks more like typical Java code – each method of IntQueue is free to ravage the internal state of any instance of IntQueue, and interfaces are not used; consequently, different implementations of the same module are not interchangeable at runtime. They are, however, interchangeable at compile-time – if we had two different implementations of “class IntQueue” in two different namespaces, we could decide with a single “import” statement which one we’d like to use. Different parts of the program might choose to use different implementations, but they wouldn’t “mix” – they couldn’t be passed to one another’s append() methods. For these reasons, ADTs are considerably less flexible than true objects, but what they lose in flexibility they make up for in simplicity and the potential for optimization.

Considering how much more ADTs look like normal Java code, you might be asking yourself where objects are really used in practice. In general, they tend to pop up in places where a single, relatively stable interface has many different implementations; common examples include “windows, filesystems, or device drivers.” Filesystems are the most illustrative example: by providing a single fixed API, not only can many different filesystems be accessed through the same uniform filesystem interface, but they can be composed in interesting ways, such as mounting an ISO (a file on one filesystem contains the underlying storage for a different filesystem).

The filesystem example also exposes one of the strongest disadvantages of objects: once many implementations have been written to a stable interface, changing that interface in any way is very difficult. ADT signatures on the other hand are straightforward to extend; for example, if a client of Dictionary needs to iterate over the dictionary in order, you can just add that functionality to the version based on red-black trees, and then compel that client to use that particular implementation.

Following a development process based on refactoring, it’s not uncommon to introduce an abstraction using ADTs and later refine it into a true object when more flexibility is needed. Conversely, the need for optimization using autognosis may require introducing some elements of ADTs. Using type inspection like “instanceof” and reflection, it may be possible to get the “best of both worlds” by optimizing certain common cases, but falling back on the generic interface for unrecognized types. This is sort of a poor man’s version of multiple dispatch.

Neither Cook nor I promote the use of either ADTs or true objects over the other. The most important takeaway from this is: know how to use both ADTs and true objects, recognize which one you’re using, and know their advantages and disadvantages. By avoiding the conceptual conflation that hybrid data abstraction systems tend to induce, you can better predict where to expect issues with extensibility and efficiency in your system.

The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.

5 Responses to “On understanding data abstraction, revisited”

  1. Neil Conway Says:

    Perhaps I’m missing something, but it seems to me that clients can observe a difference between LazyAppendIntQueue and the first IntQueue implementation: IntQueue#append() does not share state, whereas LazyAppendIntQueue#append() does. That is, if you pass a Queue to LazyAppendIntQueue#append() and then subsequently modify the Queue, you get different behavior (IntQueue does a deep copy, LazyAppendIntQueue does a shallow copy of mutable state).

    BTW, you could replace the second IntQueue implementation with a LazyAppendIntQueue where “q1″ is null.

  2. dcoetzee Says:

    Hi Neil, thanks for reading. I should have been clearer about that – my unstated assumption (for the sake of simplicity) was that after appending q1 and q2, neither q1 nor q2 is accessed again. In the first case, IntQueue.append() destroys the contents of both q1 and q2 via dequeue(); in the second case, the appended queue is expected to read and modify them in the future. If the queues were implemented as immutable types this would all work out, but the enqueue() operation would be less efficient. Another example of the tradeoff between flexibility and efficiency.

    I’m not quite sure what you mean about replacing the second IntQueue implementation with a LazyAppendIntQueue where q1 is null – could you clarify? Note that the second IntQueue implementation employs autognosis (accessing the “list” data member of its parameter) whereas LazyAppendIntQueue does not.

  3. Vinod Grover Says:

    There was a paper by John Reynolds called “User-defined types and procedural data structures as complementary approaches to data abstraction” which was published in 75 and captures precisely what people call objects. The idea is that the representation of a type can be changed by the user without changing the interface… Cook’s paper mentions that I believe.

    Anyway Reynold’s paper is worth reading still

  4. William Cook Says:

    Hi Derrick,
    Thanks for the nice discussion and new example. My only comment is that I am not sure I agree that your second example is “more like typical Java code”. There are many different styles of Java coding and I’m not sure that this one is representative. However, you have described the two styles nicely.
    Thanks again.
    PS: The Reynolds paper is quite nice. If you can’t find the original, it was reprinted in here: Theoretical aspects of object-oriented programming: types, semantics, and language design

  5. dcoetzee Says:

    Thank you for your feedback! I hope it’ll help clarify these concepts for everyone. You’re right that I was a bit narrow-minded in describing the ADT implementation as “typical” Java – certainly Java programmers do take advantage of interfaces regularly, for example. This may be due to my limited exposure to Java. And thanks for the pointer to the Reynolds paper.

Leave a Reply

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>