Wednesday, June 11, 2008

Regular Expression vs. Space Delimiting

There's been a little debate brewing over at the Wide Finder mailing list regarding the use of space delimiters to parse log entries, as opposed to a full blown regular expression. I struggled with this decision myself before finally choosing to use a regular expression. Having done so, I found my results differed from those of the reference implementation as I described here. I never expected that the reference would have to change or be re-run, but I did want to understand why my results differed from it. In fact this entire exercise is really a way for me to become a more effective Scala programmer, and to become a more effective concurrent programmer. How my results compare to others', either in performance or accuracy is only relevant in so far as I can learn something from their successes and missteps. That being said, here's the regex I used in my most recent submission followed by some thoughts about it:

val logEntryRegex = 
    ("""\n(\S+) - \S+ \S+ \S+ "GET ((/ongoing/When/\d+x/\d+/\d+/\d+/[^" .]+"""+
    """(?=[ "]))|[^" ]+)[^"]*" (\d+) (\S+) "([^\\"]*(?:\\.[^\\"]*)*)" [^\n]*""").r

The idea behind using such a beastly regular expression was four-fold:

  1. Optimize for common cases while accepting as many edge cases as practical
  2. Do the line splitting during the same pass
  3. Do the GET request filtering during the same pass
  4. Do the hit detection during the same pass

The leading newline is used instead of the ^ boundary marker because I'm splitting buffers arbitrarily relative to line boundaries, and don't want false positives on buffer boundaries. However, you do need to prepend the input stream with a newline to match the first line.

While there are many regular expressions that are equivalent to the above expression in what they match, some have exponentially worse performance. I would argue that while using a regular expression can make your wide finder more precise, it can also make it much slower, given the reality of combinatorial explosions that plague complex, poorly written expressions. If you can, profile your expression using a profiler like RegexBuddy. This profiler does a pretty good job of uncovering backtracking overload.

Also, I know of at least two log entries which my regular expression matches, but does not extract the "correct" URL. One was reported by me, and the one reported by Erik Engbrecht starting the mailing list thread.

ded947.cotswold-internet.net - - [09/Jan/2008:05:23:29 -0800] "GET /ongoing/ When/200x/2005/03/08/BloggingIsGood HTTP/1.0" 200 15326 "-" "-"
customer-reverse-entry.208.96.54.93 - - [12/Oct/2007:18:41:47 -0700] "GET /ongoing/What/The World/Products HTTP/1.0" 404 371 "-" "disco/Nutch-1.0-dev (experimental crawler; www.discoveryengine.com; disco-crawl@discoveryengine.com)"

In both of these cases the first case, my Wide Finder recorded "/ongoing/" as the URL requested. I'm not sure what was fetchedMy Wide Finder made the same assumption about what was requested as the reference implementation and, apparently as the web server in the second case. I've collected some other unusual log entries found while debugging for use while testing, my favorites being:

r61h80.res.gatech.edu - - [18/Mar/2003:05:19:16 -0800] "GET /ongoing/rug70.jpg HTTP/1.1" 200 2163 "http://www.tbray.org/ongoing/rug70.jpg" "Lynx Opera Mozilla MSIE Telnet PCap 4.0 5.0 6.0 <a href=\"http://members.fortunecity.com/randomn\"> http://members.fortunecity.com/randomn</a> <h1><b>DEATH TO SPAM</b></h1>"
corsair.digi.com.br - - [13/Nov/2003:11:44:21 -0800] "GET /ongoing/scripts/..%c1%1c../winnt/system32/cmd.exe?/c+dir+c:\\ HTTP/1.0" 404 239 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"

Maybe this is a good chance to reflect on Postel's Law?

No comments:

Post a Comment