Steve Krenzel - 10. Statistical Substring: Follow Up

Follow Up

In my previous post on statistical substring searching I mentioned that I'd follow up with some stats on it's accuracy. Here they are.

To generate the "results accuracy" plot, I generated every list of substrings in my test corpus for lengths of 4 through 20. I also generated 16 different indices where each index was indexed using a different length of n-gram. Keep in mind that because of the way this search was implemented, whatever length n-gram was used determines the minimal number of characters that can be searched for (so an index built on 5-grams can only search for terms of 5 characters or more). Then I simply searched for each substring using the indices and kept statistics on how many documents returned actually contained the string searched for. The numbers for 10-grams were pretty impressive, leveling off at about 96%.

The "substring accuracy" plot was generated in a similar fashion, but instead it tracked how many substrings returned results where every result contained the substring searched for. In other words, what percentage of strings returned perfect results with no false positives (precision). The probabilities tended to be much higher, most were above 90%. That means that even though few strings returned false positives in the results, those that did returned a lot of them. Read that again, it's important. When examining the strings responsible for these false positives most tended to be repeated characters or sequential new-lines or other outliers. My subjective analysis is that in practice these search strings would be unlikely to be searched for.

Finally, the "queries per second" plot gives you an idea of how many queries per second each index was able to perform and how it varied as the search strings increased in length.

I know I also mentioned in the first post that I'd talk about hash collisions and how they effect space and accuracy, but I decided that it wouldn't be of much value. Just know that all of these values have no collisions, and collisions will only reduce the accuracy of the system. In a future post I'll talk about a data structure that gives us all of the current benefits, but also the index will be able to handle any length search string and take up less space. The current 10-gram index takes up 36MB when serialized to disk, and it is generated from 2MB of data. You'll see that we can reduce the size of the index significantly.

About the author
I'm Steve Krenzel, a software engineer and co-founder of Thinkfuse. Contact me at steve@thinkfuse.com.