textmass: WiderFinder

Continuing on the theme of Scala WideFinding, I made a couple of attempts to scale the performance across multiple cores without much luck. These mostly involved using Actors in various ways to distribute load. For example, I tried distributing the regex matching, the counter lookup and incrementing, as well as the result sorting. None of that yielded much in the way of improved performance.

From what I can tell, there is a common denominator in all cases that keeps performance from improving, and it seems to be the use of Source.getLines in the main thread. Here's how it looked in the original version:

for (input <- fromFile(args(0),"ASCII").getLines if matcher.reset(input).find) {
    counts(matcher.group(1)) += 1
}

That didn't change too much across the experiments so far, with the one major exception of distributing the regex matching. If you haven't seen how Actors are used in Scala, it might seem a little foreign, so let me briefly digress and explain how they work at a very high level so you'll understand what follows. The basic idea is that each Actor is an object that may own a thread or may participate in a thread pool. You send message objects to the Actors, and they decide when and how to handle each message in their queue. The syntax is dead simple:

actor ! message

where actor is an implementation of Actor, and message is basically any old object. Using Actors to distribute the regular expression matching made the above for expression look something like this:

var n = 0
for (input <- fromFile(args(0),"ASCII").getLines) {
    regexMatchers(n % numMatchers) ! Some(input)
    n += 1
}

This expression sends input lines to a number of different regexMatcher Actors in a round-robin fashion. Regex matching is quite CPU intensive and each line is independent of the next, so this seems like a worthy candidate for parallelization. Unfortunately, the benefit is never realized since it sits behind what I suspect is a slow input source. Two things that stand out are:

Source.fromFile() returns a BufferedSource which is an Iterator[Char], and getLines() returns an Iterator[String] which itself iterates over the BufferedSource character by character, looking for a newline, building up a string as it goes. Seems rather tedious for the main input thread to be doing this while multiple regex matching worker threads wait for input.
By default, BufferedSource uses a 2048-character buffer. That's smaller than a single cluster on most filesystems, and is likely another source of slowness.
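
One way to test both suspicions would be to bypass Source altogether and read lines through a java.io.BufferedReader with an explicit buffer size. What follows is only a sketch under assumptions: the BufferedCount object, the countMatches helper and the bufferChars parameter are names made up for illustration, the pattern is passed in rather than being the one used in the original program, and I have not measured whether this is actually faster.

```scala
import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.util.regex.Pattern
import scala.collection.mutable

object BufferedCount {
  // Hypothetical helper: counts group(1) of every matching line, reading
  // through a BufferedReader whose buffer size is chosen by the caller.
  def countMatches(path: String, pattern: Pattern, bufferChars: Int): mutable.Map[String, Int] = {
    val counts = mutable.Map.empty[String, Int].withDefaultValue(0)
    val matcher = pattern.matcher("")
    val reader = new BufferedReader(
      new InputStreamReader(new FileInputStream(path), "US-ASCII"), bufferChars)
    try {
      // readLine() scans the buffer for line terminators itself, so the
      // main loop no longer walks the input one character at a time.
      var line = reader.readLine()
      while (line != null) {
        if (matcher.reset(line).find()) counts(matcher.group(1)) += 1
        line = reader.readLine()
      }
    } finally reader.close()
    counts
  }
}
```

Calling it with something like BufferedCount.countMatches(args(0), pattern, 1 << 16) would use a 64K-character buffer. That figure is an arbitrary starting point, not a tuned value.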

Other areas to investigate are:

NIO offers scattering reads that can read into a number of buffers at once.
java.util.concurrent has some non-blocking and limited-blocking data structures that may help speed things up.
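
To make the java.util.concurrent idea a bit more concrete, here is a rough sketch of one possible arrangement: the reading thread pushes lines onto a bounded ArrayBlockingQueue, and a fixed pool of worker threads does the regex matching and updates a ConcurrentHashMap of AtomicInteger counters. The QueuePipeline object, the count method, the queue capacity of 1024 and the one-minute shutdown timeout are all assumptions chosen for illustration. It is not the Actor-based code described above, and I make no claim that it performs better.

```scala
import java.util.concurrent.{ArrayBlockingQueue, ConcurrentHashMap, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger
import java.util.regex.Pattern

object QueuePipeline {
  // Sentinel telling a worker to stop. It is compared by reference (ne),
  // so an input line with the same text cannot be mistaken for it.
  private val Poison = new String("<eof>")

  def count(lines: Iterator[String], pattern: Pattern, workers: Int): ConcurrentHashMap[String, AtomicInteger] = {
    val queue = new ArrayBlockingQueue[String](1024) // bounded: put() blocks when full
    val counts = new ConcurrentHashMap[String, AtomicInteger]()
    val pool = Executors.newFixedThreadPool(workers)

    for (_ <- 1 to workers) pool.execute(new Runnable {
      def run(): Unit = {
        val matcher = pattern.matcher("") // Matcher is not thread-safe: one per worker
        var line = queue.take()
        while (line ne Poison) {
          if (matcher.reset(line).find()) {
            val fresh = new AtomicInteger(0)
            // putIfAbsent returns the existing counter, or null if ours was stored
            val existing = counts.putIfAbsent(matcher.group(1), fresh)
            (if (existing == null) fresh else existing).incrementAndGet()
          }
          line = queue.take()
        }
      }
    })

    lines.foreach(line => queue.put(line))     // producer: the reading thread
    for (_ <- 1 to workers) queue.put(Poison)  // one stop signal per worker
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES) // wait for the queue to drain
    counts
  }
}
```

The bounded queue keeps the reader from running arbitrarily far ahead of the workers. If the reader really is the bottleneck, though, the workers will simply spend most of their time blocked in take(), which would be consistent with what I saw with the Actors.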

Friday, May 16, 2008