Friday, May 15, 2020

Doubly-Linked Lists Are Useless in Collections Libraries

The doubly-linked list is a ubiquitous topic in data structures courses in Computer Science curricula. It is a course I have taught many times over my career as a Computer Science professor. As with most of the standard data structures covered in these courses (such as dynamic arrays, stacks, queues, heaps, singly-linked lists, hash tables, and binary search trees), doubly-linked lists routinely appear as standard collections in data structure libraries.

Most of these standard data structures are routinely useful. I mention them here in order of the rough frequency with which I seem to employ them:
  • Dynamic arrays are, by far, the most useful data structure for representing an arbitrary collection of data. They are friendly to the CPU cache, relatively conservative with memory, and enable constant-time random access.
  • Hash tables enable the fastest practical implementations of unordered sets and maps, especially because (like dynamic arrays) they enable cache-friendly memory layout.
  • Binary search trees enable fast implementations of totally ordered sets and maps.
  • Stacks, queues, and heaps are frequently components of other algorithms, such as graph traversal and A* search algorithms.
  • Singly-linked lists are foundational data structures for purely functional programming and purely functional data structures. They are also a prerequisite for understanding binary search trees.
Unlike these other data structures, however, I haven't found doubly-linked lists to be especially useful in practice. The advantages of the doubly-linked list are as follows:
  1. Constant-time insertion and deletion of elements anywhere in the list, provided that an iterator is pointing to the pertinent location.
  2. Constant-time insertion and deletion of elements at the start or end of the list. This is suitable for a queue or deque.
  3. Constant-time splicing of two lists together. 
Its disadvantages include the following:
  1. Storing each element also requires storing two pointers, making it somewhat memory-hungry.
  2. Accessing an arbitrary element requires linear time.
  3. Lots of tiny linked objects are unlikely to be laid out together in memory in a way that is friendly to the CPU cache.
Before discussing the first two advantages of the doubly-linked list, I want to mention that, with respect to the third advantage (splicing), the doubly-linked list really is the most time-efficient sequence data structure available. Curiously, however, some collection implementations (e.g. Java) don't support this operation at all, while others (e.g. Rust, C++) do.
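In Rust, for example, the splice operation is exposed as LinkedList::append, which relinks every node of the second list onto the end of the first in constant time:

use std::collections::LinkedList;

fn main() {
    let mut front: LinkedList<i32> = (1..=3).collect();
    let mut back: LinkedList<i32> = (4..=6).collect();

    // Splicing: all of back's nodes are moved onto the end of front in O(1),
    // without copying any elements; back is left empty.
    front.append(&mut back);

    assert!(back.is_empty());
    assert_eq!(front.into_iter().collect::<Vec<_>>(), vec![1, 2, 3, 4, 5, 6]);
}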

That said, I have a hard time imagining a situation in which constant-time splicing is useful. I see an analogy between the movement of elements in a list splice and the amortized analysis of the movement of elements when resizing a dynamic array. Once an element is added to a list, it is likely to be part of only a small number of splices, so moving it through the same number of linear-time array splices could probably be amortized away. I need to think more about this, but that is the direction my intuition is heading.

Moving on from there, equally suitable (or better) for the first situation are the following:
  1. The gap buffer (implementation in Rust) stores elements in an array, with values accumulating at both ends, leaving a gap in the middle. Insertions take place at a location designated by a cursor, fulfilling a role similar to that of the iterator in a doubly-linked list. Insertions to the left of the cursor accumulate at the start of the array, while insertions to the right of the cursor accumulate towards the end. When the cursor moves left, an element is transferred from the front to the back of the array. When it moves right, an element is transferred in the other direction. As with any array, the gap buffer can be resized in amortized constant time as needed.
  2. The zipper is similar, except that instead of an array there are two singly-linked lists representing the sequences on each side of the cursor. This avoids the need for amortization to yield constant-time insertion and deletion without paying the memory penalty of the doubly-linked list.
Both of these approaches are much less memory-hungry than the doubly-linked list. A particular advantage of the zipper (as with any data structure based on a singly-linked list) is that it can be implemented as a purely functional data structure.
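To make the gap-buffer mechanics concrete, here is a minimal fixed-capacity sketch (illustrative names, not the implementation linked above); each insertion and each cursor move touches only a constant number of slots:

struct GapBuffer<T> {
    buf: Vec<Option<T>>, // backing array; None marks the unused gap slots
    gap_start: usize,    // first gap slot, i.e. the cursor position
    gap_end: usize,      // one past the last gap slot
}

impl<T> GapBuffer<T> {
    fn with_capacity(capacity: usize) -> Self {
        let mut buf = Vec::with_capacity(capacity);
        buf.resize_with(capacity, || None);
        GapBuffer { buf, gap_start: 0, gap_end: capacity }
    }

    // Insert at the cursor: O(1), just fill the first gap slot. (A real
    // implementation would grow the buffer when the gap closes, giving
    // amortized constant time; that resizing is omitted here.)
    fn insert(&mut self, t: T) {
        assert!(self.gap_start < self.gap_end, "gap is closed");
        self.buf[self.gap_start] = Some(t);
        self.gap_start += 1;
    }

    // Move the cursor left: transfer one element from just before the gap
    // (the front region) to just after it (the back region).
    fn move_left(&mut self) {
        if self.gap_start > 0 {
            let moved = self.buf[self.gap_start - 1].take();
            self.gap_end -= 1;
            self.buf[self.gap_end] = moved;
            self.gap_start -= 1;
        }
    }

    // Move the cursor right: the mirror-image transfer.
    fn move_right(&mut self) {
        if self.gap_end < self.buf.len() {
            let moved = self.buf[self.gap_end].take();
            self.buf[self.gap_start] = moved;
            self.gap_start += 1;
            self.gap_end += 1;
        }
    }
}

fn main() {
    let mut gb = GapBuffer::with_capacity(8);
    gb.insert('a');
    gb.insert('c');
    gb.move_left();  // cursor now sits between 'a' and 'c'
    gb.insert('b');  // logical contents: a, b, c
    gb.move_right(); // cursor moves back past 'c'
}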

For the second situation, a ring buffer is preferable. This is an array in which two integers point to the indices of the start and end of the sequence. This makes both insertion and deletion at either end highly efficient. Random access is still constant time, albeit with a significant performance penalty in comparison to a standard array.

An arguable advantage of the doubly-linked list is that there is no need for amortization to yield constant-time performance for the target operations of insertion and deletion at the ends. I personally do not think this matters very much, for two reasons:
  1. Amortized constant time postulates that occasional linear-time operations can, in effect, "spread out" their time penalty across all operations. In a real-time system, it is possible that a task might miss its deadline when the occasional expensive operation arises. I would argue, however, that a fixed-size array is preferable to either a dynamic array or a linked list in this situation. Real-time systems with constant time bounds also tend to have constant space bounds and an aversion to frequent use of the heap. Determining a suitable fixed-array size for the application leads to guaranteed constant time.
  2. The dynamic resizing operations are really not very expensive in practice.
Of course, a claim like the second requires empirical verification. To that end, I implemented a simple experiment in Rust. (Note that, contrary to widely-held views about Rust's ownership system, it is perfectly possible to implement a doubly-linked list in safe Rust by using Weak pointers.) The experiment adds a specified number of elements to the back of each data structure, then removes them all from the front. I recorded not only the total time taken but also the maximum time used for a single operation.
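The experiment had roughly the following shape (a simplified sketch, not the exact code I ran; the preallocated variant simply replaces VecDeque::new() with VecDeque::with_capacity(n)):

use std::collections::{LinkedList, VecDeque};
use std::time::{Duration, Instant};

fn main() {
    let n = 1_000_000;

    // VecDeque: push n items onto the back, then pop them all from the front,
    // timing every individual operation.
    let mut deque = VecDeque::new();
    let (mut total, mut max_op) = (Duration::new(0, 0), Duration::new(0, 0));
    for i in 0..n {
        let start = Instant::now();
        deque.push_back(i);
        let elapsed = start.elapsed();
        total += elapsed;
        if elapsed > max_op { max_op = elapsed; }
    }
    while !deque.is_empty() {
        let start = Instant::now();
        let _ = deque.pop_front();
        let elapsed = start.elapsed();
        total += elapsed;
        if elapsed > max_op { max_op = elapsed; }
    }
    println!("VecDeque:   total {:?}  max op {:?}", total, max_op);

    // LinkedList: the identical workload on the doubly-linked list.
    let mut list = LinkedList::new();
    let (mut total, mut max_op) = (Duration::new(0, 0), Duration::new(0, 0));
    for i in 0..n {
        let start = Instant::now();
        list.push_back(i);
        let elapsed = start.elapsed();
        total += elapsed;
        if elapsed > max_op { max_op = elapsed; }
    }
    while !list.is_empty() {
        let start = Instant::now();
        let _ = list.pop_front();
        let elapsed = start.elapsed();
        total += elapsed;
        if elapsed > max_op { max_op = elapsed; }
    }
    println!("LinkedList: total {:?}  max op {:?}", total, max_op);
}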

Here are some results:

     # items   VecDeque     VecDeque     VecDeque fixed size  VecDeque fixed size  LinkedList   LinkedList
               total (s)    max op (s)   total (s)            max op (s)           total (s)    max op (s)
     100,000    0.028143    0.000203      0.028134             0.000022             0.035405    0.000089
   1,000,000    0.327679    0.002232      0.314516             0.000061             0.402862    0.000277
  10,000,000    2.997163    0.034366      2.966092             0.000251             3.838949    0.002055
 100,000,000   32.289343    0.303334     28.891364             0.001622            47.137053    0.019145

I was actually surprised that LinkedList performed as well as it did. While it was always slower than the VecDeque, it was still within about 50% of its performance. Unsurprisingly, preallocating sufficient space for the elements in the VecDeque resulted in by far the best performance.

What was more interesting was to see just how many items need to be in play before the linear-time resizing operation has a noticeable impact. With 100,000,000 items, the occasional resize could represent a noticeable pause, but this is a particularly extreme situation. If that many items were anticipated, preallocating space is the way to go.

This isn't to say that there are no imaginable uses of the doubly-linked list - just not the kind of uses that would cause it to be featured in a collections framework. For example, using a doubly-linked list to represent the insertion order of elements in a hash table seems useful. When removing an element, we already have a pointer into the linked list handy, so the ability to remove in place in constant time really is pertinent. But notice how intrusive this is into the data structure - using an off-the-shelf linked list really isn't sufficient.

To conclude, then, there really doesn't seem to be any use for having a doubly-linked list in a collections library. Anything it can do can be done equivalently or better by another data structure. I also think it is particularly problematic to dedicate a lot of class time to a data structure that isn't really very useful. It gives students the impression that it is important, that it is used a lot, and that it matters, when it really doesn't. This, in turn, influences designers of collections frameworks, who then mislead developers into thinking that it is actually useful.

Saturday, May 9, 2020

Using the trait alias: labeling collections of trait constraints in Rust

When programming in Rust, a major challenge to the DRY principle arises with trait constraints. Let's say, for example, that I want to create a trait for random-access sequences whose values are primitive (in Rust terms, Copy) types. Furthermore, I want to be able to sort these sequences, so my values must also be orderable. The result could look something like this:
use len_trait::Len; // I employed len-trait = "0.6.1"
use std::ops::{Index, IndexMut};

pub trait VTest<T:Copy+PartialEq+PartialOrd+Sized> 
    : Index<usize,Output=T> + IndexMut<usize,Output=T> + Len {
    fn add(&mut self, t: T);

    fn swap(&mut self, i: usize, j: usize) {
        let temp: T = self[i];
        self[i] = self[j];
        self[j] = temp;
    }

    fn selection_sort(&mut self) {
        for i in 0..self.len() {
            let lowest = (i+1..self.len())
                .fold(i, |lowest, j|
                    if self[lowest] < self[j] {lowest} else {j});
            self.swap(lowest, i);
        }
    }
}

Now, in order to implement this trait, we are faced with code duplication for the constraints on the generic variable T:

use std::collections::VecDeque;

impl <T:Copy+PartialEq+PartialOrd+Sized> VTest<T> for Vec<T> {
    fn add(&mut self, t: T) {
        self.push(t);
    }
}

impl <T:Copy+PartialEq+PartialOrd+Sized> VTest<T> for VecDeque<T> {
    fn add(&mut self, t: T) {
        self.push_back(t);
    }
}

When coding, the most expedient way forward is to copy-and-paste this sequence of constraints on T. From both an aesthetic and a maintenance perspective, this is a nightmare. 

Option 1: Create an empty trait

So I undertook a search to determine what others had done when faced with this situation. The top search result suggested something like the following:

pub trait SimpleValue : Copy+PartialEq+PartialOrd+Sized {}

This, in turn, enables me to write the following:

pub trait VTest<T:SimpleValue> 
    : Index<usize,Output=T> + IndexMut<usize,Output=T> + Len {
    // ...
}

impl <T:SimpleValue> VTest<T> for Vec<T> {
    // ...
}

impl <T:SimpleValue> VTest<T> for VecDeque<T> {
    // ...
}

Now that looks much better! But there is a small wrinkle. Rust won't automatically infer that, say, usize implements the SimpleValue trait. So I do have to declare explicitly that it does:

impl SimpleValue for usize {}

Again, this doesn't seem like much of a price to pay. And within the scope of a single crate, it works perfectly well and solves the problem. Where we run into trouble is when our new VTest trait is so successful that we want to use it from a different crate. We would, of course, import SimpleValue as well.

And in fact this works most of the time, unless you try something like this:

impl SimpleValue for i64 {}

In response, the compiler will reply with an error message along the following lines:

error[E0117]: only traits defined in the current crate can be implemented for arbitrary types
  --> src\lib.rs:21:5
   |
21 |     impl SimpleValue for i64 {}
   |     ^^^^^^^^^^^^^^---
   |     |             |
   |     |             `i64` is not defined in the current crate
   |     impl doesn't use only types from inside the current crate
   |
   = note: define and implement a trait or new type instead

Something that is extremely cool about the Rust compiler is its inclusion of hyperlinks to explanations of compiler errors. When we follow this particular link, we learn:

This error indicates a violation of one of Rust's orphan rules for trait implementations. The rule prohibits any implementation of a foreign trait (a trait defined in another crate) where
  • the type that is implementing the trait is foreign
  • all of the parameters being passed to the trait (if there are any) are also foreign.

Orphan rule? What's that? Imagine two crates, each of which implements the same foreign trait for the same foreign datatype. If someone tries to use both crates in a project, we get an ambiguity; that is, it is not clear which trait implementation we should actually use. So the Rust solution is to prohibit the implementation of a foreign trait on a foreign type in all situations, so as to prevent this error from ever occurring. This is actually allowed in the Haskell language, although it triggers a compilation warning and is widely discouraged.

This rule has proven hugely unpopular, as it prevents lots of perfectly correct code (such as the motivating situation in this post) from being compiled. One could imagine expressing a preference for which crate's trait implementation to use, but this runs into trouble when code from the other crate relies on its own version. And simply prohibiting crates from being imported together impedes code reuse. It is clear that there is no straightforward solution.

Option 2: The Trait Alias

This leads to another possible solution, the concept of the trait alias, which is really all we wanted to begin with. With a trait alias, a collection of traits can be given a name, purely local to the crate in which it is embedded. Any type that satisfies the aliased constraints can then be used wherever the alias is required, with no explicit implementation needed. Here is what our example looks like with trait aliases:

#![feature(trait_alias)]

use len_trait::Len;
use std::collections::VecDeque;
use std::ops::{Index, IndexMut};

pub trait Item = Copy+PartialEq+PartialOrd+Sized;

pub trait VTest<T:Item> : Index<usize,Output=T> + IndexMut<usize,Output=T> + Len {
    fn add(&mut self, t: T);

    fn swap(&mut self, i: usize, j: usize) {
        let temp: T = self[i];
        self[i] = self[j];
        self[j] = temp;
    }

    fn selection_sort(&mut self) {
        for i in 0..self.len() {
            let lowest = (i+1..self.len())
                .fold(i, |lowest, j|
                    if self[lowest] < self[j] {lowest} else {j});
            self.swap(lowest, i);
        }
    }
}

impl <T:Item> VTest<T> for Vec<T> {
    fn add(&mut self, t: T) {
        self.push(t);
    }
}

impl <T:Item> VTest<T> for VecDeque<T> {
    fn add(&mut self, t: T) {
        self.push_back(t);
    }
}

Currently, trait aliases are only available in nightly Rust. There is still considerable discussion about how the feature should work and what the syntax should look like, so it is not clear when the feature is going to be stabilized. In the meantime, I think this feature is so essential for ergonomic use of Rust that I am defaulting to the nightly compiler for all of my projects until this feature is stabilized.

Friday, May 8, 2020

Arrays vs. Ring Buffers: A Performance Comparison in Rust

The VecDeque collection in Rust implements a double-ended queue with O(1) operations to push and pop on either end. (Other language libraries have similar data structures, such as ArrayDeque in Java.) It also provides O(1) random access to individual elements.

These ring buffers are implemented behind the scenes with an array and two index variables: a start and an end. For the sake of simplicity, imagine that the deque is initialized with a starting capacity of 10. Both the start and end variables could then be initialized to point at index 5. Any push to the front decreases the starting index, and any push to the back increases the ending index.

When accessing an individual element, the requested index is added to the starting index to determine the actual array position to use. If this sum reaches or exceeds the size of the array, it "wraps around" to the lower part of the array.
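In other words, the translation from a logical index to a physical slot is just a wrapping addition, along these lines (an illustration of the idea; the standard library's actual implementation differs in its details):

// Translate a logical index into a physical array slot in a ring buffer
// whose occupied region begins at physical position `start`.
fn physical_index(start: usize, logical: usize, capacity: usize) -> usize {
    (start + logical) % capacity
}

fn main() {
    // Capacity 10, sequence currently starting at physical slot 7:
    assert_eq!(physical_index(7, 1, 10), 8);
    assert_eq!(physical_index(7, 4, 10), 1); // wraps around to the low end
}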

What is not obvious is the degree to which this extra addition when computing the index represents a performance penalty. The documentation for the Rust collections library states that whenever Vec and VecDeque have the same asymptotic time bounds for an operation, the Vec will generally be faster, but no further details are provided.

I became interested in this question because the VecDeque's constant-time double-ended pushes and pops come with no asymptotic time penalty for direct random access. If the constant-factor penalty is, in fact, insignificant, then maybe there is no need to use a regular Vec at all. So to satisfy my curiosity, I constructed an experiment.

In this experiment, we create a Vec with a specified number of elements. (I used 100,000 to produce the data below.) We then randomly shuffle these elements. At this point, the experiment begins. We copy these values into another Vec and then into a VecDeque, and then sort each copy using the selection sort algorithm. While this is not a very good sorting algorithm in general, it is ideal for this purpose, as it requires a large number of random accesses that are widely spaced across the sequence being sorted. Using the same sequence of shuffled values for each data structure also keeps the test fair.
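A sketch of that setup follows (assuming the rand crate for shuffling; the actual experiment is organized differently but does the same work). The sort is written against the Index/IndexMut interface so that exactly the same code exercises both Vec and VecDeque:

use rand::seq::SliceRandom; // assumes the rand crate
use std::collections::VecDeque;
use std::ops::{Index, IndexMut};
use std::time::Instant;

// Selection sort expressed purely through indexing, so it runs unchanged
// on any sequence offering Index and IndexMut.
fn selection_sort<C>(seq: &mut C, len: usize)
where
    C: Index<usize, Output = u64> + IndexMut<usize>,
{
    for i in 0..len {
        let lowest = (i + 1..len)
            .fold(i, |lowest, j| if seq[lowest] < seq[j] { lowest } else { j });
        let temp = seq[lowest];
        seq[lowest] = seq[i];
        seq[i] = temp;
    }
}

fn main() {
    let mut values: Vec<u64> = (0..100_000).collect();
    values.shuffle(&mut rand::thread_rng());

    let mut v = values.clone();
    let start = Instant::now();
    selection_sort(&mut v, values.len());
    println!("Vec:      {:?}", start.elapsed());

    let mut d: VecDeque<u64> = values.iter().copied().collect();
    let start = Instant::now();
    selection_sort(&mut d, values.len());
    println!("VecDeque: {:?}", start.elapsed());
}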

Here are the results. For each data structure, we performed 30 trials with 100,000 elements. Each trial featured a different permutation of those values.

 Data structure   Mean completion time (ms)   Standard deviation (ms)
 Vec                    3238.4                      205.1558
 VecDeque               4813.2                      213.019

It is pretty clear, then, that for random access the Vec is consistently faster than the VecDeque. In exchange for constant-time front-end pushes and pops, the VecDeque takes about 50% longer to complete each random access. That seems to me well worth the cost when that functionality is pertinent, but when it is not, the plain Vec is strongly preferable.

Please feel free to download and experiment with my source code if you are interested.