Friday, June 6, 2008

Results so far

Here's an update on my Wide Finder 2 submission, including program output, some performance data, and some thoughts about what's shaping the current performance. I've been making improvements to the code while writing these posts over the past couple of weeks, so the picture you get when reading them all at once may seem somewhat smeary and inconsistent. If you want a point-in-time snapshot to see how it all fits together, you can download this executable jar + source code. The jar requires scala-library.jar from Scala 2.7.1-final to be in the same directory when executed, and requires a JVM version 1.5 or higher. Here's the command line used to invoke the application as well as the output and timings:

% java -server -Xss1m -Xms3g -Xmx3g -jar WideFinder2-Scala-RayWaldin.jar /wf1/data/logs/O.all

43,178.0 MB (48.94 MBps)

Top URIs by hit:
   218540: /ongoing/When/200x/2003/09/18/NXML
   148630: /ongoing/When/200x/2003/03/16/XML-Prog
   129746: /ongoing/When/200x/2003/10/16/Debbie
   110427: /ongoing/When/200x/2003/07/17/BrowserDream
   102513: /ongoing/When/200x/2003/07/30/OnSearchTOC
   100934: /ongoing/When/200x/2003/04/26/UTF
    94408: /ongoing/When/200x/2003/07/25/NotGaming
    93060: /ongoing/When/200x/2003/04/06/Unicode
    92006: /ongoing/When/200x/2006/01/31/Data-Protection
    89795: /ongoing/When/200x/2004/12/12/BMS

Top URIs by bytes:
870656.0M: /ongoing/ongoing.atom
373564.5M: /ongoing/potd.png
278894.3M: /ongoing/ongoing.rss
 91443.9M: /ongoing/rsslogo.jpg
 63061.5M: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
 39645.0M: /ongoing/When/200x/2006/05/16/J1d0.mov
 36557.5M: /ongoing/When/200x/2007/12/14/Shonen-Knife.mov
 36257.1M: /ongoing/
 33337.3M: /ongoing/moss60.jpg
 31966.7M: /ongoing/When/200x/2004/02/18/Bump.png

Top 404s:
    54271: /ongoing/ongoing.atom.xml
    28030: /ongoing/ongoing.pie
    27364: /ongoing/favicon.ico
    24631: /ongoing/When/200x/2004/04/27/-//W3C//DTD%20XHTML%201.1//EN
    24005: /ongoing/Browser-Market-Share.png
    23924: /ongoing/Browsers-via-search.png
    23850: /ongoing/Search-Engines.png
    22637: /ongoing/ongoing.atom'
    22619: //ongoing/ongoing.atom'
    20578: /ongoing/Feeds.png

Top client addresses:
   215496: msnbot.msn.com
   110909: crawl-66-249-72-173.googlebot.com
   106361: crawler14.googlebot.com
    96366: crawl-66-249-72-172.googlebot.com
    74220: crawl-66-249-72-72.googlebot.com
    67821: sv-crawlfw3.looksmart.com
    66361: h161.c221.tor.velocet.net
    65828: 125.19.50.169
    63676: crawl-66-249-72-165.googlebot.com
    56092: crawl-66-249-72-38.googlebot.com

Top referrers:
   182996: http://tbray.org/ongoing/
   158562: http://slashdot.org/
    62218: http://www.google.com/reader/view/
    40243: http://daringfireball.net/
    30197: http://programming.reddit.com/
    20463: http://reddit.com/
    15182: http://www.scripting.com/
    13864: http://www.tbray.org/
    13618: http://www.ask.com/web?q=page=23
     9917: http://www.macsurfer.com/

[total 912,280ms]
hits: 4680, bytes: 45692, 404s: 128489, refs: 1902292, clients: 3573466

real    15:17.0
user  7:35:07.7
sys      9:13.0

Two minor differences remain between my results and those of the reference implementation, shown below. Both stem from my choice not to assume that fields are space delimited, an assumption that many of the other entries, including the reference implementation, do make.

--- coolstack.log   2008-06-06 22:45:33.000000000 -0700
+++ myresults.txt   2008-06-08 14:02:40.000000000 -0700
@@ -5,5 +5,5 @@
     110427: /ongoing/When/200x/2003/07/17/BrowserDream
     102513: /ongoing/When/200x/2003/07/30/OnSearchTOC
-    100932: /ongoing/When/200x/2003/04/26/UTF
+    100934: /ongoing/When/200x/2003/04/26/UTF
      94408: /ongoing/When/200x/2003/07/25/NotGaming
      93060: /ongoing/When/200x/2003/04/06/Unicode
@@ -19,5 +19,5 @@
  39645.0M: /ongoing/When/200x/2006/05/16/J1d0.mov
  36557.5M: /ongoing/When/200x/2007/12/14/Shonen-Knife.mov
- 36257.0M: /ongoing/
+ 36257.1M: /ongoing/
  33337.3M: /ongoing/moss60.jpg
  31966.7M: /ongoing/When/200x/2004/02/18/Bump.png

Profiling has shown that this process is CPU bound. It's been noted by several people that even a simple, single threaded, non-NIO Java application can traverse the entire 42GB file in about 5 minutes on the T2000 server, yet the best performance to date from any JVM based submission is still under 50MBps.  It's becoming pretty clear that there's a major obstacle to any further performance improvements: the Java Wide Character Tax.

There are two forms of overhead caused by Java's support for wide characters. The most obvious, but least taxing, form is character decoding. Since we're dealing with ASCII data, the overhead of decoding it is pretty light. It's definitely best not to decode in the single blocking thread that reads the bytes, but rather to let each worker thread decode its own chunk of data. Even so, it's a pretty trivial process: my CPU sampling shows character decoding active only about 2% of the time.
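The per-chunk decoding arrangement can be sketched roughly as follows. This is an illustrative sketch, not the submission's actual code; the names are mine, and the point is simply that each worker owns its own decoder so the reader thread does nothing but blocking I/O:

```scala
import java.nio.ByteBuffer
import java.nio.charset.Charset

// Hypothetical helper: each worker decodes its own byte chunk.
object ChunkDecoder {
  def decodeChunk(chunk: Array[Byte]): String = {
    // A CharsetDecoder is not thread-safe, so create one per call
    // (or per worker thread) rather than sharing a single instance.
    val decoder = Charset.forName("US-ASCII").newDecoder
    decoder.decode(ByteBuffer.wrap(chunk)).toString
  }

  def main(args: Array[String]): Unit = {
    val bytes = "GET /ongoing/ongoing.atom HTTP/1.1".getBytes("US-ASCII")
    println(ChunkDecoder.decodeChunk(bytes))
  }
}
```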

The more insidious form of the Wide Character Tax comes from the fact that, while Java characters are two bytes wide, what they actually hold are variable-width UTF-16 representations of characters, which occupy either two bytes or four bytes, depending on the character. I call this the Double Wide Character Tax, even though its effects are much more than twice those of the decoding tax. From the Javadocs for java.lang.Character:

The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. ... The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

Every character must be checked for the possibility that it's part of a supplementary character, both to find character boundaries and to compute character counts correctly. The numbers below come from a 22-second CPU sampling of a single thread, as captured by Shark:


Self      Total     Library              Symbol                          
10.2 s    10.2 s    java.lang            Character::codePointAt
4.0 s     4.0 s     java.util.regex      Pattern$Curly::match0
2.1 s     2.1 s     java.util.regex      Pattern$Sub::match
1.6 s     1.6 s     java.lang            String::equals
702.6 ms  702.6 ms  scala.util.matching  Regex$MatchData$class::group
437.5 ms  437.5 ms  java.lang            StringCoding$CharsetSD::decode

As you can see, in this worker thread, almost 50% of the time is spent computing code points. Like ASCII character decoding, this should be a trivial operation as there are no ASCII characters that translate into supplementary characters. Unlike character decoding, however, this has to be done repeatedly, every time a character position is traversed. Regular expression application is particularly prone to this type of penalty because it frequently revisits character positions, and the regular expression evaluator built into Java does nothing to memoize or otherwise optimize away the repeated checks for supplementary characters. From the above numbers, it seems that a significant portion of the time in each worker thread is spent paying this Double Wide Character Tax by the narrowest of characters.
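To make the tax concrete, here is a small illustration (mine, not from the submission) of the per-position work the regex engine ends up doing: codePointAt must test every char against the surrogate range before it can answer, even when the input is pure ASCII and the answer is always the char itself:

```scala
object CodePointCheck {
  // Count code points the way a code-point-aware scanner must:
  // codePointAt on every position, even though pure ASCII input
  // can never contain a surrogate pair.
  def countCodePoints(s: String): Int = {
    var i = 0
    var n = 0
    while (i < s.length) {
      val cp = s.codePointAt(i)     // surrogate-range test on every call
      i += Character.charCount(cp)  // 1 for BMP chars, 2 for supplementary
      n += 1
    }
    n
  }

  def main(args: Array[String]): Unit = {
    println(countCodePoints("/ongoing/ongoing.atom"))  // all ASCII: 21
    println(countCodePoints("a\uD835\uDD0Ab"))         // one surrogate pair: 3
  }
}
```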

So what next? I'm not sure there's a way around this particular obstacle short of creating a byte-oriented regular expression engine on the JVM, which is far more trouble than it's worth for this challenge. If there's an easier solution out there, I'm sure someone will find it, and we'll know what it is soon.
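The byte-oriented idea can at least be hand-sketched for a fixed pattern. The sketch below is mine, and it uses the simple space-delimited parsing this submission deliberately avoids; it's here only to show how scanning raw bytes sidesteps both character decoding and the per-position surrogate checks:

```scala
object ByteScan {
  // Return the nth space-delimited field of an ASCII log line,
  // scanning bytes directly: no char decoding, no codePointAt.
  def field(line: Array[Byte], n: Int): String = {
    val space = ' '.toByte
    var i = 0
    var f = 0
    while (f < n && i < line.length) {
      if (line(i) == space) f += 1
      i += 1
    }
    val start = i
    while (i < line.length && line(i) != space) i += 1
    new String(line, start, i - start, "US-ASCII")
  }

  def main(args: Array[String]): Unit = {
    val line = "host - - [date] \"GET /ongoing/ongoing.atom HTTP/1.1\" 200 1234"
      .getBytes("US-ASCII")
    println(field(line, 5))  // the requested URI
  }
}
```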

4 comments:

  1. JRuby is using a port, called Joni, of a regex engine called Oniguruma. One reason they moved to this was to avoid the byte/char conversion overhead--something to look into?

  2. Thanks Patrick. I hadn't thought to look there and it looks very promising.

  3. Why did you choose to use what looks like a form of optimistic locking on your Count map instead of using an AtomicLong?

  4. Hi Erik,

    I did use AtomicLongs for a while, but switched to the current scheme after noticing that there was no noticeable difference in performance, and it greatly simplified the use of the map values elsewhere. I talk about that a little here.
