Thursday, May 15, 2008

WideFinder in Scala

Tim Bray has put forth another challenge in Wide Finder 2. The purpose of this round is to try to take advantage of concurrent programming techniques on a single machine to better utilize the multiple cores commonly found in modern CPUs.

I didn't take part in the first Wide Finder project, so I feel I have a little catching up to do before tackling the current version. Since I'm also taking every opportunity I can to learn more Scala, I decided it would be best to just recreate the original version in Scala, preserving as much as possible of the elegance and simplicity of Tim's original Beautiful Code version. Besides, reproducing his results with a minimal version happens to be a requirement for participating... Here's what I came up with:

import java.util.regex.Pattern
import scala.collection.mutable.{Map,HashMap}
import scala.io.Source.fromFile

object WideFinder {
    def main(args: Array[String]): Unit = {
        // Map from article key to hit count, defaulting to 0 for unseen keys
        val counts: Map[String, Int] = new HashMap[String, Int] {
            override def default(key: String) = 0
        }
        
        // Matches fetches of /ongoing articles; group 1 is the article key
        val pattern = "GET /ongoing/When/\\d{3}x/(\\d{4}(/\\d{2}){2}[^ .]+) "
        val matcher = Pattern.compile(pattern).matcher("")
        
        // Tally every log line that matches the pattern
        for (input <- fromFile(args(0), "ASCII").getLines() if matcher.reset(input).find) {
            counts(matcher.group(1)) += 1
        }
        
        // Report the ten most requested articles
        val keysByCount = counts.keySet.toList.sortWith {(a, b) => counts(a) > counts(b)}
        for (key <- keysByCount.take(10)) {
            println(counts(key) + ": " + key)
        }
    }
}

It ended up pretty close to the original Ruby code. Performance-wise, it seems reasonable when tested against the 100K sample files on my 2.33 GHz MacBook Pro, taking about a second to complete. I'm champing at the bit to do a concurrent version to handle the 45GB of log data on a 32 core server.
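As a teaser, here is a rough sketch of one way a concurrent version might look: split the lines into one chunk per core, count each chunk in its own future with its own local map and matcher (so no mutable state is shared), then merge the per-chunk tallies. This is only an illustration of the general fan-out/merge idea, not an actual Wide Finder 2 entry — the chunking strategy, the `topArticles` helper, and the use of `scala.concurrent.Future` are all my own assumptions here.

```scala
import java.util.regex.Pattern
import scala.collection.mutable
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import scala.io.Source

object ConcurrentWideFinder {
  // Pattern objects are thread-safe and can be shared across futures;
  // Matcher objects are not, so each worker creates its own below.
  private val pattern =
    Pattern.compile("GET /ongoing/When/\\d{3}x/(\\d{4}(/\\d{2}){2}[^ .]+) ")

  // Count matching article fetches across `lines`, one future per chunk,
  // and return the n most requested article keys with their counts.
  def topArticles(lines: Seq[String], n: Int)
                 (implicit ec: ExecutionContext): List[(String, Int)] = {
    val cores = Runtime.getRuntime.availableProcessors
    val chunkSize = math.max(1, (lines.length + cores - 1) / cores)

    // Each chunk gets its own matcher and local count map, so the
    // futures never touch shared mutable state while running.
    val futures = lines.grouped(chunkSize).map { chunk =>
      Future {
        val local = mutable.Map.empty[String, Int].withDefaultValue(0)
        val m = pattern.matcher("")
        for (line <- chunk if m.reset(line).find) local(m.group(1)) += 1
        local
      }
    }.toList

    // Merge the per-chunk maps, then pick the n most popular keys.
    val merged = mutable.Map.empty[String, Int].withDefaultValue(0)
    for (partial <- Await.result(Future.sequence(futures), Duration.Inf);
         (k, v) <- partial) merged(k) += v
    merged.toList.sortBy(-_._2).take(n)
  }

  def main(args: Array[String]): Unit = {
    implicit val ec: ExecutionContext = ExecutionContext.global
    val lines = Source.fromFile(args(0), "ASCII").getLines().toIndexedSeq
    for ((key, count) <- topArticles(lines, 10))
      println(count + ": " + key)
  }
}
```

Reading the whole file into memory before chunking is obviously a cheat for 45GB of logs; a real entry would have to hand out byte ranges of the file instead, but the count-locally-then-merge shape stays the same.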
