Steve Krenzel

Odd that we think definitions are definitive. 

Stepwise Interactive Development

I was reading a usenet thread from 1985. The topic of discussion was if there will be a problem with computers in the year 2000 (now known as the y2k bug). One person chimed in that they had already had a similar problem with banking software in 1980. The software only saved the last digit of the year and was written in the 70's. As soon as 1980 hit it would think it was 1970 again. After discussing the bug with his colleague, they determined that fixing the bug would also require compiling "billions" of other COBOL programs. Neither wanted to do that, so their solution was:

First, I modified the daily demand deposit program with code that
checked for the date and about mid-1979 started printed warnings on the
console of what would happen come new year.  Then the systems analyst
and I got new jobs.  This is known as stepwise interactive development.

Brilliant.

Comments [0]

Code Jam: Welcome to Code Jam

This challenge was my favorite one, it was the only one of the three challenges that forced me to go back and rethink my solution after it worked on the small data set but not the larger one.

The goal of the challenge is to count how many times a phrase occurs in a body of text. The catch is, there can be any number of characters in between the characters in the phrase. For instance, the phrase "dog" appears twice in the text "did you go?" It would match twice because of  "did you go?"  and "did you go?" In other words, how many ways can you delete the characters in a body of text in such a way that you'll be left with the desired phrase.

The first approach I took was the simple approach, recursing through and individually counting each match. This approach of using the easy but dumb algorithm worked fine for the first two challenges, so I figured it couldn't hurt to try the same approach here. Unfortunately it failed horribly when you got to larger bodies of text that had a quintillion or more matches. It failed because we counted each occurence individually. The code looked like:

#!/usr/bin/python
from sys import stdin

#Ignore the first line
stdin.readline()

def count_text(line, text):
    if text == "":
        return 1
    return sum(count_text(line[i + 1:], text[1:]) for i, c in enumerate(line) if c == text[0])

for i, line in enumerate(l.strip() for l in stdin.readlines()):
    print "Case #%d %04d"%(i + 1, count_text(line, "welcome to code jam"))

This worked wonderfully for the small data set, which maxed out at 256 matches, but failed for the larger data set where there were up to 2,598,789,532,778,836,923,762,772 matches. It turned out I needed to be more clever.

I don't know what the name of this algorithm would be. I came up with it on the spot and just referred to it as "Valid Prefix Counting", but I'm sure there is a more formal name for it. In the following, Pi refers to the i'th character in the phrase. C refers to the body of text we are matching against.

We make an array of (position, count) tuples. Position is the position of Pi, and count is the sum of all of the counts of P[i-1] where the position of P[i-1] is less than the position of Pi. Once we construct this table for Pi, we can construct the table for P[i+1] very quickly. Once we reach the last character, we sum up all of the counts and we've got our number of matches.

For example, we start by creating an array, A1, of all of the positions of P0 (the first character in the phrase) in C, along with a count of 1. Then we make a second array, A2, of all the occurences of P1 in C, along with a sum of how many P0's occur before P1's position.

At this point, if we were to sum up all of the counts in A2 it would tell us how many matches the prefix P0P1 has in C. We then set A1 to A2, and use A1 to construct the array for P2, which we store in A2. We just keep repeating this process until we've gone through the whole phrase.

Here is the code:

#!/usr/bin/python
from sys import stdin

#Eat the first line
stdin.readline()

def count_prefixes(prefixes, position):
    return sum(i[1] for i in prefixes if i[0] < position)

def count_text(line, text):
    curr =[(-1, 1)]
    for c in text:
        curr = [(i, count_prefixes(curr, i)) for i, d in enumerate(line) if d == c]
    return sum(i[1] for i in curr)

for i, line in enumerate(l.strip() for l in stdin.readlines()):
    print "Case #%d %04d"%(i + 1, count_text(line, "welcome to code jam"))

This turns out to work really well and is very simple. The scoreboard leaders and Google's analysis used a similar algorithm, but it involved a matrix instead of two arrays. Google's explanation took a little bit for me to really wrap my head around, and in hind sight it's fairly simple, but I'd argue that the above is even simpler (and easier to code).

Comments [2]

Code Jam: Watersheds

This challenge by far involved the most code.

The problem is: You're given a topology and need to determine how water will flow on it. When you calculate where each point of water will flow, you also determine a basin where the water would build up. Each point in the topology is in a basin, your job is to find which basin and give that basin a label.

The topology is given in a grid of M-by-N integers. Starting at each integer you can just traverse to the neighbor that is lowest until there are no neighbors that are lower, at which point you've found the bottom of that basin. This is a brute force approach but it's a pretty simple recursive function and works more than fast enough for even the larger data set.

The code I'm presenting here is pretty simple and short at only 30 lines long. In my humble opinion this is a better reference than the 45 lines long solution given by Google in their analysis. I'm not saying it's a better reference because it's shorter, but simply because it's a little more straight foward and easier to wrap your head around. Here is my code:

#!/usr/bin/python
from sys import stdin

def neighbors(topo, x, y):
    if y > 0:                yield (y - 1, x)
    if x > 0:                yield (y, x - 1)
    if x < len(topo[0]) - 1: yield (y, x + 1)
    if y < len(topo) - 1   : yield (y + 1, x)

def find_basin(topo, x, y):
    xs, ys = x, y
    for i, j in neighbors(topo, x, y):
        if topo[i][j] < topo[ys][xs]:
            xs, ys = j, i
    return (x, y) if (x, y) == (xs, ys) else find_basin(topo, xs, ys)

def print_basins(topo):
    labels = {}
    for y in xrange(len(topo)):
        for x in xrange(len(topo[y])):
            basin = find_basin(topo, x, y)
            print labels.setdefault(basin, chr(ord('a') + len(labels))),
        print

# Function to read in a line from stdin and split it into ints
read_line = lambda: map(int, stdin.readline().split())

for m in xrange(int(stdin.readline())):
    print "Case #%d:"%(m+1)
    h, w = read_line()
    print_basins([read_line() for i in range(h)])

To try it just save the code in a file (e.g. watersheds.py), download sample input from the challenge site, and run "python watersheds.py < B-small-practice.in" at the command line.

Comments [0]

Code Jam: Alien Language

This challenge was the easiest and should have taken most people just a few minutes.

You're given a list of words that define a language, and then a list of patterns that should represent a word in that language. The catch is that the patterns have unknown characters with potential solutions in parentheses. For example, if our language had three words "bad", "dad" and "sad" and we were given a pattern "(bd)ad", either "b" or "d" could be the first letter of the desired word. There are two words in our language that match the pattern: "bad" and "dad". On the other hand if we were given the pattern "(bt)a(ds)", we'd have four potential words "bad", "bas", "tad", and "tas" but only one of them is in our language so only one possible combination matches.

The challenge was simply to count how many words in the alien language matched a given pattern. We can simply translate the patterns into regular expressions by substituting parentheses for square brackets, and then use our regular expression library to count the matches found. The code looks like:

#!/usr/bin/python
from
 string import maketrans
from
 sys    import stdin
from
 re     import findall

length = int(stdin.readline().split()[1])
lang   = "".join(stdin.readline() for i in range(length))


for
i, test in enumerate(stdin.readlines()):
    test = test.translate(maketrans("()", "[]"))
    print "Case #%d: %d"%(i, len(findall(test, lang)))

To try it just save the code in a file (e.g. alien.py), download sample input from the challenge site, and run "python alien.py < A-small-practice.in" at the command line.

Comments [1]

The Pareto Principle and Google Code Jam

The Google Code Jam qualifying round was a few days ago and I thought it'd be fun to go over the three solutions that I came up with. I didn't actually compete, this was just done on my own time for fun. The next 3 posts will be about those challenges, but this one is about something else I observed.

I was curious how I'd compare against the leaderboard, so I timed myself. As a result, I knew how long it took to get a working solution, even if it was ugly, and then how much longer it took to get a solution I'd be comfortable presenting, committing, and maintaining.

The interesting thing that I found while doing these solutions was that writing an initial solution that worked typically took around 15 minutes (plus or minus a few). Then I spent some time cleaning up the code in various ways. I didn't properly comment it or write tests for it, but given the total time spent on it thus far, my observation leads me to believe that code (or at least my coding) follows the Pareto principle.

The Pareto principle states that 80% of effects come from 20% of the causes and vice-versa. It seems that getting code that solved the challenge (80% of the effects) took about 20% of the time spent. Assuming I had spent more time properly commenting the code, writing tests for it, etc... it appears 80% of the total time spent would have been spent doing the final 20% of work.

Comments [2]

Interesting Month

This past month has been crazy.

I was in Orlando for a week for a conference, then me and my girlfriend went to Disney World for a week. We flew back from Disney on a Monday night, went to sleep and the next morning I flew to Seattle. That Wednesday I interviewed with Microsoft for an amazing team and the next day they gave me an offer. That Monday I resigned from Lockheed and put in my two weeks. After my two weeks was up, I went up to Ithaca for a week to start my masters at Cornell. I just got back from Ithaca two days ago. This upcoming Sunday I'll be moving across the country to Seattle and continuing my masters there remotely while working at Microsoft.

In the past month I've stayed in 4 different cities, resgined from my job, got an offer from Microsoft, started my masters, and am about to move across the country. It's been crazy, but goddamn things are exciting right now.

Comments [3]

Obnoxious Ninja, The Follow Up

So this is an explanation of how the "Hello, World!" program works in my previous post. Most people seemed to think it had to do with modular arithmetic, but really it was all about primes... I just used mod to help with factoring.

The first thing to note is that the for loop generates all of the primes up to 10,000.

The second thing to note is that
01225064762735104601006456677135667000644077661030607
is an octal number that is the product of 13 primes:

373 1039 1847 2689 3583 3931 4219 4969 5897 6917 7901 8819 9127

373 happens to be the 72nd prime number. 72 happens to be the ascii code for "H".

1039 happens to be the 173rd prime. 173 - 72 = 101, and 101 happens to be the ascii code for "e".

1847 is the 281st prime. 281 - 173 = 108, and 108 is the ascii code for "l".

And so on.

What we're doing in the code is factoring 01225064762735104601006456677135667000644077661030607, then figuring out what ith prime each is and then taking the distance between these i's. This distance is then converted to ascii chars.

Now you might be wondering why we didn't just multiply the 72nd prime by the 101st prime, but rather the 72nd prime by the 173rd prime. We did this because if you simply converted the ascii char to an int and found the associated ith prime and then multiplied them all together, you lose the order of the characters.

By using progressively larger primes we preserve ordering information, rather than simply preserving what characters are present in the string.

Cool, eh?

Comments [0]

Obnoxious Ninja

There was a CL post making rounds earlier: http://sfbay.craigslist.org/sfc/eng/1246353621.html

Here is my "Hello, World!" response:

#usr/bin/python
r,e,a,l = range,enumerate,all,0
for i,x in e(i for i in r(4,10**4) if a(i%j!=0 for j in r(3,i))):
    if not 01225064762735104601006456677135667000644077661030607%x:
        print chr(i-l),
        l=i

So how does it work?

Comments [2]

Improving MapReduce with HashFold

This is a post about HashFold, a theoretical framework (for now) that is suited to performing the same types of distributed computation that MapReduce can perform, but improves upon MapReduce both in runtime and memory usage. This is quite a claim so I'm putting the idea out there for public scrutiny.

If you’re familiar with MapReduce, HashFold does away with sorting values and the concept of a combiner, it removes the need for distinguishing between map and reduce phases, and in general it simplifies the overall architecture in many ways while allowing us to leverage existing distributed key-value stores.

The way HashFold works is simple. First, we need a hash table. Then we need a mapper that takes inputs and produces key-value pairs. Finally, we need an associative fold function. Once we have these three things we’re good to go.

The short version of HashFold is, given an input, i, a hashmap, h, a mapper, m, and a fold function, f, HashFold is effectively: h[m(i).key] = f(h[m(i).key], m(i).value).

This deserves an explanation.

We simply pass a given input to our map function. Our map function then returns a key and a value (referred to as v1). We look up the key in the hash table and get a value (referred to as v2). We then pass v1 and v2 to our fold function, which returns a value (referred to as v3). We then set v3 as the value for the respective key in the hash table. Then we repeat for the next input.

Despite the simplicity, this works great and is easy to distribute. Each node hash-folds its local inputs and when it’s finished simply redistributes the key-value pairs so that all of the values for a certain key wind up on the same node. Then we just do the fold again.

Here are some examples in pseudo-code (pythonic pseudo-code, but pseudo-code none-the-less):

To count the number of times a word occurs in a corpus:

def map(document):
    for word in document.split():
        yield word, 1

def fold(count1, count2):
    return count1 + count2

To create an index for words to documents:

def map(document):
    for word in document.split():
        yield word, document

def fold(doc_set, new_doc):
    doc_set.union(new_doc)
    return doc_set

To find friends:

def map(user, friends):
    for friend in friends:
        yield (user, friend), friends

def fold(friends1, friends2):
    friends1.intersect(friends2)
    return friends1

Note: In the above examples we gloss over the case when no value exists for a given key yet… it’s trivial to handle, I just didn’t want to convolute the examples.

So I claimed that HashFold can be more memory efficient than MapReduce. I make this claim because MapReduce needs to store all of the key-value pairs generated by the mapper, whereas HashFold only needs to store one key-value pair at any given time (in addition to the hash-table).

I also claimed that HashFold can be faster than MapReduce. I make this claim because MapReduce must sort all of the key-value pairs generated. Typically any given key has many values, and sorting requires N*log(N) operations where N is the number of values. There is no sorting in HashFold.

One immediate limitation of HashFold might appear to be that the hash-table only stores one value and the fold function only operates on two values at any given time whereas MapReduce makes all of the values available to the reduce function. Should you require this, HashFold can achieve the equivalent functionality, but two HashFold jobs are required. The first job makes the list of key->values (see the document index example), the second job then just passes the entire values list to a fold function (which is now the equivalent to your reduce function).

This may sound like HashFold is less efficient because it requires two separate jobs, but we don’t need to sort the values like we do in MapReduce and instead we only need to do a linear iteration through the keys (there are typically far fewer keys than values as well).

There is a lot more to this, but I'll stop there.

Thoughts?

 

Comments [3]

Longest Palindrome

I was asked about finding the longest palindrome in a string today. Here is how I said I'd approach the problem.

Algorithm:

Assume for simplicity that the palindrome is a length that is an odd number. Also, we treat a single character as a palindrome of length 1.

For each position in the string we store an upper and lower pointer that tracks the upper and lower bounds of the largest palindrome centered at that position.

For instance, position 2 in the string "racecar" would start as:

At this point, the longest known palindrome at position 2 is the single character 'c'. We then increment the upper pointer and decrement the lower pointer and compare their respective values.

Since the upper and lower values are not the same, we stop and now know that the longest palindrome centered at position 2 is 'c'. We then start the process again at the next string position, 3.

We increment and decrement the upper and lower pointers and see that they are equal. Our current longest palindrome centered at position 3 is now 'cec'.

Since they were equal, we increment and decrement the pointers again and see that they are still equal. Our new longest palindrome centered at position 3 is now 'aceca'.

We continue incrementing and decrementing the upper and lower pointers until they don't match or they hit one of the ends of the string.

Since our pointers are equal, we increment and decrement them again. We hit an end of the string and the pointers are equal so we know that our longest palindrome centered at position 3 is 'racecar'.

We're done with position 3 and continue this same process again at position 4. We'll do this until we've covered every position in the string.

NOTE: We initally assumed that the palindrome had an odd length. This algorithm works for even palindromes as well. The only change we make is that when we initialize the pointers, instead of having the lower and upper pointer point to the same position we have the upper pointer point at the next position. Everything else with the algorithm remains the same. i.e. The initial pointers for an even palindrome 'centered' at position 3 would look like:

Runtime:

Worst-Case: O(N^2)
Best-Case:  O(N)

The worst case is achieved when the string is a single repeated character because every substring is a palindrome.

The best case is achieved when the string contains no palindromes.

Typically if a string only contains a single palindrome (the fewer the better), the closer to O(N) it will run. This is because every time it checks a position in the string, it checks the character before and after that position, and if they don't match then it stops looking for the palindrome. Positions in the string can be discarded after only one lookup if that position doesn't have a palindrome, so if there are no palindromes you only do N comparisons.

Code:

Here is some code that implements this algorithm, included with doctests.

Comments [6]