2. Statistical Substring
By: Steve Krenzel
Substring Searching at Scale

Here is an extremely simple implementation of a statistical full-text substring search with search time independent of the corpus size. Recall is 100% and precision is adjustable.

We use the sample index from http://www.dotnetdotcom.org/ as the test corpus (2MB in size) and break the source into lowercased n-grams, in this case of length 5. Given the string "The dog ran.", we get 8 5-grams: "the d", "he do", "e dog", " dog ", "dog r", "og ra", "g ran", and " ran.".
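
For concreteness, here's a minimal sketch of that enumeration (the helper name grams is mine, not part of the script below):

def grams(text, n=5):
    """Return every lowercased n-gram of text, in order."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# "The dog ran." is 12 characters, so it yields 12 - 5 + 1 = 8 grams.
print(grams("The dog ran."))
# ['the d', 'he do', 'e dog', ' dog ', 'dog r', 'og ra', 'g ran', ' ran.']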

After all of the grams have been enumerated, the space required grows roughly fivefold (or n-fold for n-grams). We then iterate through the grams and insert each one into a hashmap, where the key is the gram and the value is the set of URLs containing that gram.

Searching consists of breaking the search string into grams, looking up the set of URLs for each gram, and intersecting all of the sets. Any URL in the resulting set contains the search string with high probability (a probability I'll quantify in another post). The rare false positive is a page where every gram of the query appears, just not contiguously: with 5-grams, a page containing both "hotdog rack" and "big range" matches every gram of "dog ran" without actually containing it.
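
As a standalone sketch of that search step (again, the function name is mine; it assumes the grams helper above and an index mapping grams to URL sets):

def search(index, query, n=5):
    """Intersect the URL sets of every gram in the query."""
    gs = grams(query, n)
    if not gs:  # the query is shorter than the gram length
        return set()
    results = set(index.get(gs[0], set()))
    for g in gs[1:]:
        results &= index.get(g, set())
    return results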

The following code implements what we've described. The script accepts two arguments: the file with the web content (from http://www.dotnetdotcom.org) and the gram length. You can run it with "python3 script.py web_index 5". The first loop does the indexing, and the second does the searching.

#!/usr/bin/env python3
from sys import argv

# Usage: python3 script.py web_index 5
# The index file is a null-delimited sequence of three-field records;
# the first field is the URL and the third is the page HTML.
fin     = open(argv[1], errors="ignore")
index   = {}
content = fin.read().split(chr(0))
pairs   = [(content[i], content[i + 2]) for i in range(0, len(content) - 1, 3)]
min_tok = int(argv[2])  # the gram length, n
fin.close()

# Indexing: map every lowercased gram to the set of URLs that contain it.
for url, html in pairs:
    for i in range(len(html) - min_tok + 1):
        key   = html[i:i + min_tok].lower()
        value = index.setdefault(key, set())
        value.add(url)

# Searching: break the query into grams and intersect their URL sets.
# An empty query exits the loop.
inp = input("Search: ").lower()
while inp != "":
    results = index.get(inp[:min_tok], set())
    for i in range(len(inp) - min_tok + 1):
        results = results.intersection(index.get(inp[i:i + min_tok], set()))
    for i, result in enumerate(results):
        print("%d) %s" % (i + 1, result))
    print()
    inp = input("Search: ").lower()

There are two big drawbacks to this approach: 1) the search string can't be shorter than the gram length, and 2) the index becomes huge, in this case growing to 13MB from a 2MB corpus. The benefit is that you can do full-text searching extremely quickly, even over very large corpora, and recall is always 100%. You can also grow or shrink the index by adjusting the gram length and the frequency of hash collisions. By hash collisions, I mean two different grams pointing to the same set of URLs, even though not every URL in that set contains both grams. By adjusting these parameters, we can finely tune the precision of the search and the size of the index. I plan on exploring these tradeoffs in a follow-up post.
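
To make that collision knob concrete, here's a hedged sketch (the bucket count and helper name are mine): key the hashmap on a hash of the gram modulo a fixed bucket count rather than on the gram itself, so that distinct grams can share a URL set.

NUM_BUCKETS = 1 << 20  # fewer buckets -> smaller index, more collisions, lower precision

def bucket(gram):
    # Distinct grams may land in the same slot; that is the precision tradeoff.
    # Python salts str hashes per process, which is fine here because the
    # index is built and queried within a single run.
    return hash(gram) % NUM_BUCKETS

# Indexing becomes:  index.setdefault(bucket(key), set()).add(url)
# Searching becomes: intersect index.get(bucket(g), set()) for each query gram g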

About the author
I'm Steve Krenzel, a software engineer and co-founder of Thinkfuse. Contact me at steve@thinkfuse.com.