Steve Krenzel

Odd that we think definitions are definitive. 
« Back to blog

Improving MapReduce with HashFold

This is a post about HashFold, a theoretical framework (for now) that is suited to performing the same types of distributed computation that MapReduce can perform, but improves upon MapReduce both in runtime and memory usage. This is quite a claim so I'm putting the idea out there for public scrutiny.

If you’re familiar with MapReduce, HashFold does away with sorting values and the concept of a combiner, it removes the need for distinguishing between map and reduce phases, and in general it simplifies the overall architecture in many ways while allowing us to leverage existing distributed key-value stores.

The way HashFold works is simple. First, we need a hash table. Then we need a mapper that takes inputs and produces key-value pairs. Finally, we need an associative fold function. Once we have these three things we’re good to go.

The short version of HashFold is, given an input, i, a hashmap, h, a mapper, m, and a fold function, f, HashFold is effectively: h[m(i).key] = f(h[m(i).key], m(i).value).

This deserves an explanation.

We simply pass a given input to our map function. Our map function then returns a key and a value (referred to as v1). We look up the key in the hash table and get a value (referred to as v2). We then pass v1 and v2 to our fold function, which returns a value (referred to as v3). We then set v3 as the value for the respective key in the hash table. Then we repeat for the next input.

Despite the simplicity, this works great and is easy to distribute. Each node hash-folds its local inputs and when it’s finished simply redistributes the key-value pairs so that all of the values for a certain key wind up on the same node. Then we just do the fold again.

Here are some examples in pseudo-code (pythonic pseudo-code, but pseudo-code none-the-less):

To count the number of times a word occurs in a corpus:

def map(document):
    for word in document.split():
        yield word, 1

def fold(count1, count2):
    return count1 + count2

To create an index for words to documents:

def map(document):
    for word in document.split():
        yield word, document

def fold(doc_set, new_doc):
    doc_set.union(new_doc)
    return doc_set

To find friends:

def map(user, friends):
    for friend in friends:
        yield (user, friend), friends

def fold(friends1, friends2):
    friends1.intersect(friends2)
    return friends1

Note: In the above examples we gloss over the case when no value exists for a given key yet… it’s trivial to handle, I just didn’t want to convolute the examples.

So I claimed that HashFold can be more memory efficient than MapReduce. I make this claim because MapReduce needs to store all of the key-value pairs generated by the mapper, whereas HashFold only needs to store one key-value pair at any given time (in addition to the hash-table).

I also claimed that HashFold can be faster than MapReduce. I make this claim because MapReduce must sort all of the key-value pairs generated. Typically any given key has many values, and sorting requires N*log(N) operations where N is the number of values. There is no sorting in HashFold.

One immediate limitation of HashFold might appear to be that the hash-table only stores one value and the fold function only operates on two values at any given time whereas MapReduce makes all of the values available to the reduce function. Should you require this, HashFold can achieve the equivalent functionality, but two HashFold jobs are required. The first job makes the list of key->values (see the document index example), the second job then just passes the entire values list to a fold function (which is now the equivalent to your reduce function).

This may sound like HashFold is less efficient because it requires two separate jobs, but we don’t need to sort the values like we do in MapReduce and instead we only need to do a linear iteration through the keys (there are typically far fewer keys than values as well).

There is a lot more to this, but I'll stop there.

Thoughts?

 

Comments (3)

Jun 22, 2009
Robert Barta said...
Thoughts?

Not yet. But you made me curious. :-) Any paper, white paper or one programming example to see how this works?

Jun 23, 2009
Nick Johnson said...
Interesting idea. A few thoughts:
- This won't work for any situation where the reducer (in MapReduce parlance) expects to get all the values in sorted order before producing an output value. No examples come immediately to mind, but they definitely exist. :)
- You don't explicitly describe, but seem to implicitly expect, that the fold function can be called with _both_ arguments being the output of a previous fold - eg, f(f(a, b), f(c, d)). This is necessary if you want to apply the "fold values locally" optimization, but only works if the output of your map and the output of your fold are of the same type.
- I'm not convinced the 'fold values locally' optimization is actually necessary, though. Given available network bandwidth on a well-connected cluster, folding on a single node per key is probably perfectly satisfactory.
- How do you handle the case where the total 'folded' output size is greater than available RAM?
Jun 23, 2009
Steve Krenzel said...
Robert,
Thanks... working on a functioning demo. I used a prototype to solve a facebook puzzle, but that code isn't suitable for public consumption.

Nick,
Thanks for the great thoughts.

1. We can get around that limitation with two HashFold jobs. the first one creates a list, the second one operates on that list (and sorts it in the same step). Surprisingly, this may still be more efficient than the way MapReduce does things now (at least in theory, we'll have to wait and see how practice plays out).

2. That is accurate. I should have been more explicit about it. As described above the map return value, the arguments for the fold, and the fold return value must all be the same type. This isn't as limiting as it sounds, but if it turns out to be too limiting there are solutions to each of those issues that won't break the overall concept and theory behind HashFold (i.e.we could have a local fold function and a global fold function without any loss in concurrency)

3. The local fold is important. I wish network bandwidth wasn't an issue, but it is one of the biggest issues in distributed computing. You want as little data as possible to hit the network. It's very easy to saturate pipes by moving terabytes or petabytes of data around. Folding locally is in many ways equivalent to MapReduce's combiner. It helps with performance because the data is already in memory (we can generally process data faster than we can read, write or move it), and it helps with network performance because the output of a fold will often be smaller than the sum of its inputs.

4. That's more of an implementation detail. We could use some persistent storage and process it in batches.

Keep the feedback coming.

All the best,
Steve

Leave a comment...

 
Got an account with one of these? Login here, or just enter your comment below.
Posterous-login    twitter