Steve Krenzel - 14. Algorithms: O(N*K) without the N

O(N*K) without the N

There was a good post on proggit the other day about optimizing code and making the final code 80 to 100 times faster than the original.

The task was: Given a list of words, and two words from that list, find a series of words that will get you from one word to the other by only changing one character at a time.

For instance to go from wolf to moon:

wolf

woof

goof

goon

moon

The core problem for the poster was that finding words that were different by only one character was taking a very long time. If he could solve that, then he could use whatever search algorithm he wanted inorder to find his way from one word to the other (Depth-First, Breadth-First, A*, etc...).

The initial approach taken by the poster was to take the key (the word we're starting with) and compare it to every word in the word list. For every word in the list of the same length as the key, he calculated how many letters were different. If the difference was of length one, he'd add that word to his results.

If the length of the key is K and the length of the word list is N, the expected runtime of this algorithm is O(N*K).

Someone recommended that it might be faster to take the key and generate every possible one-letter variation of it. If the key was 10 characters long, this would generate 250 words (25 variations for each character in the key), or 25*K for any key of length K. You could then just check which of those 250 words was in your word list. If the data structure for the word list was a hashset, then this would run in O(25*K) because each of the 250 look ups would be done in constant time.

This solution turned out to be by far the fastest of everything the poster tried. The reason was that it completely removed N from the equation, which was the dominating factor. We can do even better though. By trading memory for time, we can get it to O(K).

For every word in the word list, we substitute each character in the word with a special character. In this case we'll use a space (" "). The word happy would become:

" appy"

"h ppy"

"ha py"

"hap y"

"happ "

Then each of these new words with spaces becomes a key in a hashmap. The value associated with that key is the set of all words that generated that key. The code to do this in python is something like:

lookup = {}
for word in wordlist:
    for i in xrange(len(word)):
        key = word[:i] + " " + word[i+1:]
        lookup.setdefault(key, set()).add(word)

The variable lookup is then a hashmap that contains every possible one character variation and the set of words associated with that variation.

To find the words that are one different than the key, we simply iterate through the key substituting each character with a space. For each key generated, we look up the set of words associated with that key and take the union of them all, like so:

matches = set()
for i in xrange(len(word)):
    key = word[:i] + " " + word[i+1:]
    matches = matches.union(lookup.get(key, set()))
if word in matches:
    matches.remove(word)

This generates a set that contains every word that is one character different than the key. The set will contain the key itself as well, which is why the last two lines remove it. Since we're doing K iterations each taking O(1), the total time is O(K).

About the author
I'm Steve Krenzel, a software engineer and co-founder of Thinkfuse. Contact me at steve@thinkfuse.com.