Steve Krenzel - 18. Distributed Computing: MapReduce and HashFold

MapReduce and HashFold

This is a post about HashFold, a theoretical framework (for now) that is suited to performing the same types of distributed computation that MapReduce can perform, but improves upon MapReduce both in runtime and memory usage. This is quite a claim so I'm putting the idea out there for public scrutiny.

If you're familiar with MapReduce, HashFold does away with sorting values and the concept of a combiner, it removes the need for distinguishing between map and reduce phases, and in general it simplifies the overall architecture in many ways while allowing us to leverage existing distributed key-value stores.

The way HashFold works is simple. First, we need a hash table. Then we need a mapper that takes inputs and produces key-value pairs. Finally, we need an associative fold function. Once we have these three things we're good to go.

The short version of HashFold is, given an input, i, a hashmap, h, a mapper, m, and a fold function, f, HashFold is effectively: h[m(i).key] = f(h[m(i).key], m(i).value).

This deserves an explanation.

We simply pass a given input to our map function. Our map function then returns a key and a value (referred to as v1). We look up the key in the hash table and get a value (referred to as v2). We then pass v1and v2 to our fold function, which returns a value (referred to as v3). We then set v3 as the value for the respective key in the hash table. Then we repeat for the next input.

Despite the simplicity, this works great and is easy to distribute. Each node hash-folds its local inputs and when it's finished simply redistributes the key-value pairs so that all of the values for a certain key wind up on the same node. Then we just do the fold again.

Here are some examples in pseudo-code (pythonic pseudo-code, but pseudo-code none-the-less):

To count the number of times a word occurs in a corpus:

def map(document):
    for word in document.split():
        yield word, 1

def fold(count1, count2):
    return count1 + count2

To create an index for words to documents:

def map(document):
    for word in document.split():
        yield word, document

def fold(doc_set, new_doc):
    doc_set.union(new_doc)
    return doc_set

To find friends:

def map(user, friends):
    for friend in friends:
        yield sorted(user, friend), friends

def fold(friends1, friends2):
    friends1.intersect(friends2)

Note: In the above examples we gloss over the case when no value exists for a given key yet... it's trivial to handle, I just didn't want to convolute the examples.

So I claimed that HashFold can be more memory efficient than MapReduce. I make this claim because MapReduce needs to store all of the key-value pairs generated by the mapper, whereas HashFold only needs to store one key-value pair at any given time (in addition to the hash-table).

I also claimed that HashFold can be faster than MapReduce. I make this claim because MapReduce must sort all of the key-value pairs generated. Typically any given key has many values, and sorting requires N*log(N) operations where N is the number of values. There is no sorting in HashFold.

One immediate limitation of HashFold might appear to be that the hash-table only stores one value and the fold function only operates on two values at any given time whereas MapReduce makes all of the values available to the reduce function. Should you require this, HashFold can achieve the equivalent functionality, but two HashFold jobs are required. The first job makes the list of key->values (see the document index example), the second job then just passes the entire values list to a fold function (which is now the equivalent to your reduce function).

This may sound like HashFold is less efficient because it requires two separate jobs, but we don't need to sort the values like we do in MapReduce and instead we only need to do a linear iteration through the keys (there are typically far fewer keys than values as well).

There is a lot more to this, but I'll stop there.

Thoughts?

About the author
I'm Steve Krenzel, a software engineer and co-founder of Thinkfuse. Contact me at steve@thinkfuse.com.