Most of these standard data structures are routinely useful. I mention them here in rough order of how frequently I find myself employing them:
- Dynamic arrays are, by far, the most useful data structure for representing an arbitrary collection of data. They are friendly to the CPU cache, relatively conservative with memory, and enable constant-time random access.
- Hash tables enable the fastest practical implementations of unordered sets and maps, especially because (like dynamic arrays) they enable cache-friendly memory layout.
- Binary search trees enable fast implementations of totally ordered sets and maps.
- Stacks, queues, and heaps are frequently components of other algorithms, such as graph traversal and A* search algorithms.
- Singly-linked lists are foundational data structures for purely functional programming and purely functional data structures (a small example follows this list). They are also a prerequisite for understanding binary search trees.
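To make the purely functional point concrete, here is a minimal persistent cons list in Rust. This is an illustrative sketch (the `List` type is my own, not from any library); the key property is that prepending allocates one new node and shares the entire existing tail, so older versions of the list remain intact:

```rust
use std::rc::Rc;

// A minimal persistent singly-linked list: prepending allocates one
// new head node and shares the existing tail, so every older version
// of the list remains valid and unchanged.
enum List<T> {
    Nil,
    Cons(T, Rc<List<T>>),
}

fn main() {
    let empty: Rc<List<i32>> = Rc::new(List::Nil);
    let one = Rc::new(List::Cons(1, Rc::clone(&empty)));
    let two = Rc::new(List::Cons(2, Rc::clone(&one)));
    // `one` and `two` share the same tail; neither copies the other.
    if let List::Cons(head, _) = &*two {
        println!("head of the longer list: {head}");
    }
}
```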
The doubly-linked list, however, is one data structure I haven't found especially useful in practice. Its advantages are as follows:
- Constant-time insertion and deletion of elements anywhere in the list, provided that an iterator is pointing to the pertinent location.
- Constant-time insertion and deletion of elements at the start or end of the list. This is suitable for a queue or deque.
- Constant-time splicing of two lists together.
Its disadvantages include the following:
- Storing each element also requires storing two pointers, making it somewhat memory-hungry.
- Accessing an arbitrary element requires linear time.
- Lots of tiny linked objects are unlikely to be laid out together in memory in a way that is friendly to the CPU cache.
Before discussing the first two advantages of the doubly-linked list, I want to address the third (splicing): here the doubly-linked list really is the most time-efficient sequence data structure available. Curiously, however, some collection implementations (e.g. Java's) don't support this operation at all, while others (e.g. Rust's and C++'s) do.
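In Rust, for example, splicing is exposed as `LinkedList::append`, which relinks one list's nodes onto the end of another in constant time:

```rust
use std::collections::LinkedList;

fn main() {
    let mut front: LinkedList<i32> = (1..=3).collect();
    let mut back: LinkedList<i32> = (4..=6).collect();

    // O(1): relinks `back`'s nodes onto the end of `front` and leaves
    // `back` empty, without moving any elements in memory.
    front.append(&mut back);

    assert!(back.is_empty());
    assert_eq!(front.into_iter().collect::<Vec<_>>(), vec![1, 2, 3, 4, 5, 6]);
}
```

In C++, the analogous operation is `std::list::splice`.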
That said, I have a hard time imagining a situation in which constant-time splicing is essential. I see an analogy between the movement of elements in a list splice and the amortized analysis of element movement when a dynamic array is resized: once an element has been added to a list, it is likely to participate in only a handful of splices, so the cost of moving it through the same number of linear-time array splices could probably be amortized away. I need to think more about this, but it is the direction my intuition is moving.
Moving on to the first advantage (constant-time insertion and deletion at a cursor), the following data structures are equally suitable or better:
- The gap buffer (implementation in Rust) stores elements in an array, with values accumulating at both ends, leaving a gap in the middle; a minimal sketch follows this list. Insertions take place at a location designated by a cursor, which fills a role similar to that of the iterator in a doubly-linked list. Insertions to the left of the cursor accumulate at the start of the array, while insertions to the right of the cursor accumulate towards the end. When the cursor moves left, an element is transferred from the front portion of the array to the back; when it moves right, one is transferred in the other direction. As with any array, the gap buffer can be resized in amortized constant time as needed.
- The zipper is similar, except that instead of an array it uses two singly-linked lists, representing the sequences on either side of the cursor (sketched after the next paragraph). This avoids the need for amortization to achieve constant-time insertion and deletion, without paying the memory penalty of the doubly-linked list.
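First, a minimal gap buffer sketch in Rust. This is an illustration with names of my own choosing, and it omits the resizing logic a real implementation would need:

```rust
// Illustrative sketch (names are mine): the live elements occupy
// buf[..gap_start] and buf[gap_end..]; the cursor sits at gap_start.
struct GapBuffer<T> {
    buf: Vec<Option<T>>,
    gap_start: usize, // first gap slot; also the cursor position
    gap_end: usize,   // one past the last gap slot
}

impl<T> GapBuffer<T> {
    fn with_capacity(cap: usize) -> Self {
        let mut buf = Vec::with_capacity(cap);
        buf.resize_with(cap, || None);
        GapBuffer { buf, gap_start: 0, gap_end: cap }
    }

    // O(1) insertion at the cursor (amortized, once resizing is added).
    fn insert(&mut self, value: T) {
        assert!(self.gap_start < self.gap_end, "resizing omitted from sketch");
        self.buf[self.gap_start] = Some(value);
        self.gap_start += 1;
    }

    // O(1) deletion just before the cursor: widen the gap leftward.
    fn delete_before(&mut self) -> Option<T> {
        if self.gap_start == 0 {
            return None;
        }
        self.gap_start -= 1;
        self.buf[self.gap_start].take()
    }

    // Moving the cursor transfers one element across the gap.
    fn move_left(&mut self) {
        if self.gap_start > 0 {
            self.gap_start -= 1;
            self.gap_end -= 1;
            self.buf[self.gap_end] = self.buf[self.gap_start].take();
        }
    }
}

fn main() {
    let mut gb = GapBuffer::with_capacity(8);
    gb.insert('a');
    gb.insert('b');
    gb.move_left(); // cursor is now between 'a' and 'b'
    gb.insert('x'); // sequence is now a, x, b
    assert_eq!(gb.delete_before(), Some('x'));
}
```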
Both of these approaches are much less memory-hungry than the doubly-linked list. A particular advantage of the zipper (as with any data structure based on a singly-linked list) is that it can be implemented as a purely functional data structure.
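Here is a minimal zipper sketch along those lines, using an Rc-based cons list (again, the names are mine, not from any library). Because both sides are persistent lists, each operation returns a new zipper that shares structure with the old one:

```rust
use std::rc::Rc;

// Illustrative sketch: a persistent cons list and a zipper over it.
enum List<T> {
    Nil,
    Cons(T, Rc<List<T>>),
}

// `left` holds the elements before the cursor in reverse order;
// `right` holds the cursor element and everything after it.
struct Zipper<T> {
    left: Rc<List<T>>,
    right: Rc<List<T>>,
}

impl<T: Clone> Zipper<T> {
    fn new() -> Self {
        Zipper { left: Rc::new(List::Nil), right: Rc::new(List::Nil) }
    }

    // O(1) insertion at the cursor: cons onto the right-hand list.
    fn insert(&self, value: T) -> Zipper<T> {
        Zipper {
            left: Rc::clone(&self.left),
            right: Rc::new(List::Cons(value, Rc::clone(&self.right))),
        }
    }

    // O(1) deletion at the cursor: drop the right-hand list's head.
    fn delete(&self) -> Option<Zipper<T>> {
        match &*self.right {
            List::Nil => None,
            List::Cons(_, rest) => Some(Zipper {
                left: Rc::clone(&self.left),
                right: Rc::clone(rest),
            }),
        }
    }

    // O(1) cursor movement: pop one side's head, push it on the other.
    fn move_right(&self) -> Option<Zipper<T>> {
        match &*self.right {
            List::Nil => None,
            List::Cons(x, rest) => Some(Zipper {
                left: Rc::new(List::Cons(x.clone(), Rc::clone(&self.left))),
                right: Rc::clone(rest),
            }),
        }
    }
}

fn main() {
    let z = Zipper::new().insert(2).insert(1); // sequence: 1, 2
    let z = z.move_right().unwrap();           // cursor between 1 and 2
    let z = z.insert(9);                       // sequence: 1, 9, 2
    assert!(z.delete().is_some());             // removes 9 again
}
```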
For the second situation, a ring buffer is preferable. This is an array in which two integers track the indices of the start and end of the sequence, making insertion and deletion at both ends highly efficient. Random access remains constant-time, albeit with a significant performance penalty in comparison to a standard array.
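Rust's `VecDeque` is exactly this structure. The essential trick is modular index arithmetic, as in this simplified fixed-capacity sketch (names are mine; it tracks a start index and a length rather than two end indices, which sidesteps the empty-versus-full ambiguity):

```rust
// Illustrative fixed-capacity ring buffer: `head` is the index of the
// first element and `len` counts the live elements; indices wrap
// modulo the capacity, so both ends support O(1) insertion/deletion.
struct RingBuffer<T> {
    buf: Vec<Option<T>>,
    head: usize,
    len: usize,
}

impl<T> RingBuffer<T> {
    fn with_capacity(cap: usize) -> Self {
        let mut buf = Vec::with_capacity(cap);
        buf.resize_with(cap, || None);
        RingBuffer { buf, head: 0, len: 0 }
    }

    fn push_back(&mut self, value: T) {
        assert!(self.len < self.buf.len(), "resizing omitted from sketch");
        let tail = (self.head + self.len) % self.buf.len();
        self.buf[tail] = Some(value);
        self.len += 1;
    }

    fn pop_front(&mut self) -> Option<T> {
        if self.len == 0 {
            return None;
        }
        let value = self.buf[self.head].take();
        self.head = (self.head + 1) % self.buf.len();
        self.len -= 1;
        value
    }

    // Random access is still O(1), at the cost of the modulo operation.
    fn get(&self, i: usize) -> Option<&T> {
        if i >= self.len {
            return None;
        }
        self.buf[(self.head + i) % self.buf.len()].as_ref()
    }
}

fn main() {
    let mut rb = RingBuffer::with_capacity(4);
    for i in 0..4 {
        rb.push_back(i);
    }
    assert_eq!(rb.pop_front(), Some(0));
    rb.push_back(4); // wraps around into the slot just vacated
    assert_eq!(rb.get(3), Some(&4));
}
```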
An arguable advantage of the doubly-linked list is that there is no need for amortization to yield constant-time performance for the target operations of insertion and deletion at the ends. I personally do not think this matters very much, for two reasons:
- Amortized constant time postulates that occasional linear-time operations can, in effect, "spread out" their time penalty across all operations. In a real-time system, it is possible that a task might miss its deadline when the occasional expensive operation arises. I would argue, however, that a fixed-size array is preferable to either a dynamic array or a linked list in this situation. Real-time systems with constant time bounds also tend to have constant space bounds and an aversion to frequent use of the heap. Determining a suitable fixed-array size for the application leads to guaranteed constant time.
- The dynamic resizing operations are really not very expensive in practice.
Of course, a claim like the second requires empirical verification. To that end, I implemented a simple experiment in Rust. (Note that, contrary to widely-held views about Rust's ownership system, it is perfectly possible to implement a doubly-linked list in safe Rust by using Weak pointers.) The experiment adds a specified number of elements at the end of each data structure, then removes them all from the front. I recorded not only the total time required but also the maximum time taken by any single operation.
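In outline, the experiment looks something like the following sketch (a simplification for presentation, not the exact benchmark code; per-operation timing with `Instant` is how I measured the maximum):

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Run `op` n times, recording the total elapsed time and the single
// most expensive call.
fn bench<F: FnMut()>(n: usize, mut op: F) -> (Duration, Duration) {
    let mut total = Duration::ZERO;
    let mut max_op = Duration::ZERO;
    for _ in 0..n {
        let start = Instant::now();
        op();
        let elapsed = start.elapsed();
        total += elapsed;
        max_op = max_op.max(elapsed);
    }
    (total, max_op)
}

fn main() {
    let n = 1_000_000;
    // Fixed-size variant: VecDeque::with_capacity(n). The LinkedList
    // run is the same with std::collections::LinkedList substituted.
    let mut dq: VecDeque<usize> = VecDeque::new();
    let mut i = 0;
    let (push_total, push_max) = bench(n, || {
        dq.push_back(i);
        i += 1;
    });
    let (pop_total, pop_max) = bench(n, || {
        let _ = dq.pop_front();
    });
    println!(
        "total: {:?}  max op: {:?}",
        push_total + pop_total,
        push_max.max(pop_max)
    );
}
```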
Here are some results:
| # items | VecDeque total (s) | VecDeque max op (s) | Fixed-size VecDeque total (s) | Fixed-size VecDeque max op (s) | LinkedList total (s) | LinkedList max op (s) |
|---|---|---|---|---|---|---|
| 100,000 | 0.028143 | 0.000203 | 0.028134 | 0.000022 | 0.035405 | 0.000089 |
| 1,000,000 | 0.327679 | 0.002232 | 0.314516 | 0.000061 | 0.402862 | 0.000277 |
| 10,000,000 | 2.997163 | 0.034366 | 2.966092 | 0.000251 | 3.838949 | 0.002055 |
| 100,000,000 | 32.289343 | 0.303334 | 28.891364 | 0.001622 | 47.137053 | 0.019145 |
I was actually surprised that LinkedList performed as well as it did. While it was always slower than VecDeque, its total time stayed within about 50% of VecDeque's. Unsurprisingly, preallocating sufficient space for the elements in the VecDeque yielded by far the best performance.
More interesting was seeing just how many items need to be in play before the linear-time resizing operations have a noticeable impact. With 100,000,000 items, the most expensive single operation amounted to a noticeable pause of about 0.3 seconds, but this is a particularly extreme situation. If that many items are anticipated, preallocating space is the way to go.
This isn't to say that there are no imaginable uses of the doubly-linked list - just not the kind of uses that would cause it to be featured in a collections framework. For example, using a doubly-linked list to represent the insertion order of the elements in a hash table seems genuinely useful. When removing an element, we already have a pointer into the linked list in hand, so the ability to remove a node in place in constant time really is pertinent. But notice how intrusive this is: the list must be woven into the hash table itself, so an off-the-shelf linked list really isn't sufficient. A sketch of the idea follows.
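Here is my own illustration of that idea, in the spirit of Java's LinkedHashMap: each entry stores the keys of its insertion-order neighbors, so removal can unlink a node using the hash lookup we are already doing. The names and structure are mine, chosen for clarity rather than efficiency:

```rust
use std::collections::HashMap;
use std::hash::Hash;

// Each entry carries the keys of its insertion-order neighbors, so
// the doubly-linked list is woven through the hash table itself.
struct Entry<K, V> {
    value: V,
    prev: Option<K>,
    next: Option<K>,
}

struct InsertionOrderMap<K, V> {
    map: HashMap<K, Entry<K, V>>,
    first: Option<K>,
    last: Option<K>,
}

impl<K: Eq + Hash + Clone, V> InsertionOrderMap<K, V> {
    fn new() -> Self {
        InsertionOrderMap { map: HashMap::new(), first: None, last: None }
    }

    // Append at the back of the insertion order (assumes `key` is new).
    fn insert(&mut self, key: K, value: V) {
        let entry = Entry { value, prev: self.last.clone(), next: None };
        if let Some(last) = &self.last {
            self.map.get_mut(last).unwrap().next = Some(key.clone());
        } else {
            self.first = Some(key.clone());
        }
        self.last = Some(key.clone());
        self.map.insert(key, entry);
    }

    // The hash lookup already hands us the node, so unlinking it from
    // its neighbors is O(1); this is where in-place removal pays off.
    fn remove(&mut self, key: &K) -> Option<V> {
        let entry = self.map.remove(key)?;
        match &entry.prev {
            Some(p) => self.map.get_mut(p).unwrap().next = entry.next.clone(),
            None => self.first = entry.next.clone(),
        }
        match &entry.next {
            Some(n) => self.map.get_mut(n).unwrap().prev = entry.prev.clone(),
            None => self.last = entry.prev.clone(),
        }
        Some(entry.value)
    }
}

fn main() {
    let mut m = InsertionOrderMap::new();
    m.insert("a", 1);
    m.insert("b", 2);
    m.insert("c", 3);
    assert_eq!(m.remove(&"b"), Some(2)); // O(1) unlink; order is now a, c
    assert_eq!(m.first, Some("a"));
    assert_eq!(m.last, Some("c"));
}
```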
To conclude, then, there really doesn't seem to be a good case for including a doubly-linked list in a collections library: anything it can do, some other data structure does as well or better. I also think it is particularly problematic to dedicate a lot of class time to a data structure that isn't very useful. It gives students the impression that it is important, that it is used a lot, and that it matters, when it really doesn't. This impression even influences the designers of collections frameworks, who in turn mislead developers into thinking it is actually useful.