textmass: WideFinder 2 Strawman implemented in Scala

Tim Bray has posted a strawman of the WideFinder 2 Benchmark and also posted an implementation in Ruby. It's similar in spirit to the original WideFinder, with the following differences:

much more logic: more pattern matching, counting, and conditional logic per record, which is designed to increase the CPU load
much more data: over 40GB in fact.
much more power: Sun Fire T2000 UltraSparcT1, 32 cores

Once again, recreating this basic implementation in Scala will help later to factor out any differences in performance based on language/runtime alone. Once pure language/runtime differences are understood, it will be easier to discern differences between various algorithmic approaches to concurrency. So without further ado, here is the basic, single-threaded, Scala version of the benchmark:

import scala.collection.mutable._

/** A normal HTML report: Top 10 popular URLs by hits and bytes, top 10 404s, *
 *  top 10 client IPs by hits, and the top 10 referrers.  Only, skip the HTML */
object WideFinder2 extends Application {
    class Counter extends HashMap[String,Long] {
        override def default(key:String) = 0L
        // override update to workaround http://lampsvn.epfl.ch/trac/scala/ticket/904
        override def update(k:String,v:Long) = super.update(k,v) 
    }
    
    val hits, bytes, s404s, clients, refs = new Counter
    val uriRegex = """/ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+$""".r
    val refExclRegex = "-|http://www\\.tbray\\.org/ongoing/".r

    def record(host: String, uri: String, ref: String) {
        if(uriRegex.findPrefixOf(uri).isDefined) {
            hits(uri) += 1
            clients(host) += 1
            if(refExclRegex.findPrefixOf(ref).isEmpty) 
                refs(ref) += 1
        }
    }

    def report(label: String, map: Map[String,Long], megabytes: Boolean) {
        printf("Top %s:\n", label)
        val keysByCount = map.keySet.toList.sort { (a,b) =>
            (map(a) - map(b)) match {
                case 0 => a < b
                case n => n > 0
            }
        }
        for(key <- keysByCount.take(10)) {
            val format = if(megabytes) "%10.1fM: %s\n" else "%11.0f: %s\n"
            val value: Double = if(megabytes) map(key)/(1024.0d*1024.0d) else map(key)
            printf(format, value, if(key.length > 60) key.take(60) + "..." else key)
        }   
        println
    }

    override def main(args: Array[String]) {
        def toInt(a:String) = try {a.toInt} catch {case _=>0}
        for(line <- io.Source.fromFile(args(0)).getLines) {
            val f = line.split("\\s+")
            if(f(5) == "\"GET") {
                val (host, uri, status, size, ref) = 
                    (f(0), f(6), toInt(f(8)), toInt(f(9)), f(10).slice(1,f(10).length-1))

                status match {
                    case 200 => 
                        bytes(uri) += size
                        record(host, uri, ref)
                    case 304 => record(host, uri, ref)
                    case 404 => s404s(uri) += 1
                    case _ =>
                }
            }
        } 
        report("URIs by hit", hits, false)
        report("URIs by bytes", bytes, true)
        report("404s", s404s, false)
        report("client addresses", clients, false)
        report("refs", refs, false)
    }
}

The program can be run using this command:

java -cp .:scala-library.jar WideFinder2 O.100k

Interestingly, it has similar performance to the Ruby version, taking 23 seconds to process 100k lines on the test system, producing results identical to these. Besides input performance, which is neccessarily single threaded and can only be improved incrementally, there are three areas to focus on distributing:

matching
counting
sorting

However, the incremental improvement to input, which I hinted at in my last post, changes the nature of the problem quite a bit, so before getting into these three areas, maybe I should tackled that first. The basic idea is, instead of tying up the single input thread with extra work (namely character decoding and newline detection), just read large arbitrarily aligned blocks of bytes, and let worker theads deal with them in parallel. This causes two new problems:

multibyte and variable byte character encodings are more difficult to handle as a single character may span across multiple blocks
the content you're looking for (let's call this the target data) may span across multiple blocks

Luckily he first one is easy to deal with: only support ASCII for the time-being. Any fixed-width character encoding can be handled pretty trivially by buffering and appending/truncating each input block to make it an even multiple of N. Variable-width encoding, like UTF-8, for example, would be much harder to handle. This might be a good subject for a future project.

Likewise, the second problem has a straightforward solution to deal the WideFinder data, and more complex solutions to deal with the general case. The WideFinder data is line based text, and the WideFinder benchmark deals with finding patterns on single lines of text. So we can save the head and tail of each line that spans multiple blocks and re-join them for processing (i.e. join the tail of block 1 with the head of block 2 to get the line that spans blocks 1 and 2). For the general case, there are a few other strategies to collect and join endpieces, each with a requirement on the nature of the target data:

if the target data is guaranteed to not overlap, then the head and tail to be re-joined is any data between the end of the last match of a block, and the beginning of the first match in the next block. If the targets are too few and far between, and the size of the re-joined endpieces are frequently comparable to the input blocks themselves, this strategy results in too much repeated matching to be useful.
if the target data is guaranteed to only overlap partially, the above strategy changes to taking all data between just after the start of the last match in a block, and just before the end of the first match in the next block. This strategy has the same problem for sparse matches
if the target is guaranteed to be no bigger than some reasonably small (relative to the input block) size, then either of the two overlap strategies can be augmented with a maximum size which should reduce the redundant matching in huge endpieces.

For now, there is no need for any of these complex strategies to deal with re-joining endpieces, and we only have to worry about ASCII encoding. Maybe this can all be tackled in some future WideFinder 3.

textmass

Tuesday, May 27, 2008

WideFinder 2 Strawman implemented in Scala

No comments:

Post a Comment

Blog Archive

About Me