Steve Krenzel - 24. Code Jam: Welcome to Google Code Jam

Welcome to Google Code Jam

This challenge was my favorite one, it was the only one of the three challenges that forced me to go back and rethink my solution after it worked on the small data set but not the larger one.

The goal of the challenge is to count how many times a phrase occurs in a body of text. The catch is, there can be any number of characters in between the characters in the phrase. For instance, the phrase "dog" appears twice in the text "did you go?" It would match twice because of "d i d y o u g o ?" and "d i d y o u g o ?" In other words, how many ways can you delete the characters in a body of text in such a way that you'll be left with the desired phrase.

The first approach I took was the simple approach, recursing through and individually counting each match. This approach of using the easy but dumb algorithm worked fine for the first two challenges, so I figured it couldn't hurt to try the same approach here. Unfortunately it failed horribly when you got to larger bodies of text that had a quintillion or more matches. It failed because we counted each occurence individually. The code looked like:

from sys import stdin

#Ignore the first line
stdin.readline()

def count_text(line, text):
    if text == "":
        return 1
    return sum(count_text(line[i + 1:], text[1:])
               for i, c in enumerate(line) if c == text[0])

for i, line in enumerate(l.strip() for l in stdin.readlines()):
    print "Case #%d %04d"%(i + 1, count_text(line, "welcome to code jam"))

This worked wonderfully for the small data set, which maxed out at 256 matches, but failed for the larger data set where there were up to 2,598,789,532,778,836,923,762,772 matches. It turned out I needed to be more clever.

I don't know what the name of this algorithm would be. I came up with it on the spot and just referred to it as "Valid Prefix Counting", but I'm sure there is a more formal name for it. In the following, Pi refers to the i'th character in the phrase. C refers to the body of text we are matching against.

We make an array of (position, count) tuples. Position is the position of Pi, and count is the sum of all of the counts of P[i-1] where the position of P[i-1] is less than the position of Pi. Once we construct this table for Pi, we can construct the table for P[i+1] very quickly. Once we reach the last character, we sum up all of the counts and we've got our number of matches.

For example, we start by creating an array, A1, of all of the positions of P0 (the first character in the phrase) in C, along with a count of 1. Then we make a second array, A2, of all the occurences of P1 in C, along with a sum of how many P0's occur before P1's position.

At this point, if we were to sum up all of the counts in A2 it would tell us how many matches the prefix P0P1 has in C. We then set A1 to A2, and use A1 to construct the array for P2, which we store in A2. We just keep repeating this process until we've gone through the whole phrase.

Here is the code:

from sys import stdin

#Eat the first line
stdin.readline()

def count_prefixes(prefixes, position):
    return sum(i[1] for i in prefixes if i[0] < position)

def count_text(line, text):
    curr =[(-1, 1)]
    for c in text:
        curr = [(i, count_prefixes(curr, i))
                 for i, d in enumerate(line) if d == c]
    return sum(i[1] for i in curr)

for i, line in enumerate(l.strip() for l in stdin.readlines()):
    print "Case #%d %04d"%(i + 1, count_text(line, "welcome to code jam"))

This turns out to work really well and is very simple. The scoreboard leaders and Google's analysis used a similar algorithm, but it involved a matrix instead of two arrays. Google's explanation took a little bit for me to really wrap my head around, and in hind sight it's fairly simple, but I'd argue that the above is even simpler (and easier to code).

About the author
I'm Steve Krenzel, a software engineer and co-founder of Thinkfuse. Contact me at steve@thinkfuse.com.