I've been working on a project which implements a number of data structures in Python that are stored entirely on disk. It started when I needed data structures that were gigabytes in size, but I only had megabytes of ram available to me. I figured I'd just throw it all on disk... it'd be a lot slower than ram, but at least it'd be able to run at all (plus we can use the ram as a cache)
I was working on a dynamic array implementation and found that when I needed to resize it, copying gigabytes or terabytes of data simply was taking too long. In theory this cost is amortized, but in practice you're chugging along at 25,000 writes per second and then suddenly stop for a few minutes or hours while the array resizes, then continue chugging along again. I didn't like it and couldn't use it in a user facing application.
What I needed was a fast dynamic array with consistent throughput that's stored on persistent storage. Here is how I achieved that.
Essentially, I removed the need to copy any data, turning the resize into a constant time operation (but we still retain data locality).
Typically dynamic arrays work like:
- We allocate some portion of our memory for the array:
- As our program runs, other memory will be used for other data:
- Eventually, we'll use all of the available slots in our array and will need to resize it. To do this, we allocate a new chunk of memory twice as big as the original:
- Finally, we copy over the old array into the new array and delete the old array:
That last bit is what kills us, so instead of copying the data we just leave it alone.
Effectively, we stop at step 3. We still allocate the new memory but don't do any copying. This leaves us with:
We treat the two arrays as separate. When we are given an index of an item to look up, we simply need to figure out which array the element falls into and then where specifically in the array it is.
It can get a little confusing when we resize a few times, but effectively we're just making an array of fixed size arrays. If we resized our array 4 times we'd get something like:
Suppose the first array holds N elements and every other array holds twice as many as the one before it. We can figure out which array an index (i) would fall into with:
or in python:
def get_array_index(i, N): return int(log((i / N) + 1, 2))
So given an index of 70, this function will return 0 because the 70th element falls into the first array. Given an index of 629, this function will return 2 because the 629th element is in the third array.
Once we know which array the index is in, we need to find the exact position in that array for the element. We calculate that with:
or in python:
def get_relative_index(i, N): return i - (2**get_array_index(i, N) - 1)*N
These two equations will tell us that index 629 is in the 2nd array at position 329 or that index 1243 is in the 3rd array at position 543.
The nice thing about this is that figuring out where the element is takes constant time. It's a larger constant than traditional dynamic arrays, but it's so much quicker than disk access speeds that it is negligible when accessing an array directly on disk.
We still retain data locality except at a few points, but that cost is amortized. For all intents, our data locality is the same as it would be with a standard dynamic array.
Allocating space for a new array can also be done in constant time. If you dedicate an entire chunk of disk (i.e. a partition) to the array, allocating space for a new array is as simple as changing your end-of-file pointer within that partition. Even if you can't allocate a large chunk of space initialy, appending a bunch of null bytes to the end of a file is a pretty quick process.
Anyway, that's how I implemented a fast dynamic array with consistent throughput that lives entirely on persistent storage.
About the author
I'm Steve Krenzel, a software engineer and co-founder of Thinkfuse. Contact me at email@example.com.