Here's an update on my Wide Finder 2 submission, including program output, some performance data, and some thoughts about what's shaping the current performance. I've been making improvements to the code while writing these posts over the past couple of weeks, so the picture you get when reading them all at once may seem somewhat smeary and inconsistent. If you want a point-in-time snapshot to see how it all fits together, you can download this executable jar + source code. The jar requires scala-library.jar from Scala 2.7.1-final to be in the same directory when executed, and requires a JVM version 1.5 or higher. Here's the command line used to invoke the application as well as the output and timings:
% java -server -Xss1m -Xms3g -Xmx3g -jar WideFinder2-Scala-RayWaldin.jar /wf1/data/logs/O.all
43,178.0 MB (48.94 MBps)
Top URIs by hit:
218540: /ongoing/When/200x/2003/09/18/NXML
148630: /ongoing/When/200x/2003/03/16/XML-Prog
129746: /ongoing/When/200x/2003/10/16/Debbie
110427: /ongoing/When/200x/2003/07/17/BrowserDream
102513: /ongoing/When/200x/2003/07/30/OnSearchTOC
100934: /ongoing/When/200x/2003/04/26/UTF
94408: /ongoing/When/200x/2003/07/25/NotGaming
93060: /ongoing/When/200x/2003/04/06/Unicode
92006: /ongoing/When/200x/2006/01/31/Data-Protection
89795: /ongoing/When/200x/2004/12/12/BMS
Top URIs by bytes:
870656.0M: /ongoing/ongoing.atom
373564.5M: /ongoing/potd.png
278894.3M: /ongoing/ongoing.rss
91443.9M: /ongoing/rsslogo.jpg
63061.5M: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
39645.0M: /ongoing/When/200x/2006/05/16/J1d0.mov
36557.5M: /ongoing/When/200x/2007/12/14/Shonen-Knife.mov
36257.1M: /ongoing/
33337.3M: /ongoing/moss60.jpg
31966.7M: /ongoing/When/200x/2004/02/18/Bump.png
Top 404s:
54271: /ongoing/ongoing.atom.xml
28030: /ongoing/ongoing.pie
27364: /ongoing/favicon.ico
24631: /ongoing/When/200x/2004/04/27/-//W3C//DTD%20XHTML%201.1//EN
24005: /ongoing/Browser-Market-Share.png
23924: /ongoing/Browsers-via-search.png
23850: /ongoing/Search-Engines.png
22637: /ongoing/ongoing.atom'
22619: //ongoing/ongoing.atom'
20578: /ongoing/Feeds.png
Top client addresses:
215496: msnbot.msn.com
110909: crawl-66-249-72-173.googlebot.com
106361: crawler14.googlebot.com
96366: crawl-66-249-72-172.googlebot.com
74220: crawl-66-249-72-72.googlebot.com
67821: sv-crawlfw3.looksmart.com
66361: h161.c221.tor.velocet.net
65828: 125.19.50.169
63676: crawl-66-249-72-165.googlebot.com
56092: crawl-66-249-72-38.googlebot.com
Top referrers:
182996: http://tbray.org/ongoing/
158562: http://slashdot.org/
62218: http://www.google.com/reader/view/
40243: http://daringfireball.net/
30197: http://programming.reddit.com/
20463: http://reddit.com/
15182: http://www.scripting.com/
13864: http://www.tbray.org/
13618: http://www.ask.com/web?q=page=23
9917: http://www.macsurfer.com/
[total 912,280ms]
hits: 4680, bytes: 45692, 404s: 128489, refs: 1902292, clients: 3573466
real 15:17.0
user 7:35:07.7
sys 9:13.0
Two minor differences remain between my results and those of the reference implementation, shown in the diff below. Both stem from my choice not to assume fields are space-delimited, as many of the other entries do, including the reference implementation.
@@ -5,5 +5,5 @@
110427: /ongoing/When/200x/2003/07/17/BrowserDream
102513: /ongoing/When/200x/2003/07/30/OnSearchTOC
- 100932: /ongoing/When/200x/2003/04/26/UTF
+ 100934: /ongoing/When/200x/2003/04/26/UTF
94408: /ongoing/When/200x/2003/07/25/NotGaming
93060: /ongoing/When/200x/2003/04/06/Unicode
@@ -19,5 +19,5 @@
39645.0M: /ongoing/When/200x/2006/05/16/J1d0.mov
36557.5M: /ongoing/When/200x/2007/12/14/Shonen-Knife.mov
- 36257.0M: /ongoing/
+ 36257.1M: /ongoing/
33337.3M: /ongoing/moss60.jpg
31966.7M: /ongoing/When/200x/2004/02/18/Bump.png
Profiling has shown that this process is CPU bound. Several people have noted that even a simple, single-threaded, non-NIO Java application can traverse the entire 42GB file in about 5 minutes on the T2000 server, yet the best performance to date from any JVM-based submission is still under 50 MBps. It's becoming pretty clear that there's a major obstacle to any further performance improvements: the Java Wide Character Tax.
There are two forms of overhead caused by Java's support for wide characters. The most obvious, but least taxing, form is character decoding. Since we're dealing with ASCII data, the overhead of decoding it is pretty light. It's definitely best not to do this in the single blocking thread that reads bytes, but rather to let each worker thread decode its own chunk of data. Either way, it's still a pretty trivial process: my profiling shows that character decoding was active only about 2% of the time while sampling CPU performance.
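That division of labor can be sketched roughly as follows (class and method names are mine, not from my actual submission): the reader thread hands raw byte chunks to workers, and each worker decodes with its own CharsetDecoder, since decoders aren't thread-safe.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

// Illustrative sketch: keep decoding out of the single blocking reader
// thread by giving each worker thread its own CharsetDecoder.
public class ChunkDecoder {
    // CharsetDecoder is stateful and not thread-safe, so use a
    // per-thread instance (this style works on JVM 1.5 and up).
    private static final ThreadLocal<CharsetDecoder> DECODER =
        new ThreadLocal<CharsetDecoder>() {
            @Override protected CharsetDecoder initialValue() {
                return Charset.forName("US-ASCII").newDecoder();
            }
        };

    // Called by a worker thread on its own chunk of the log file.
    public static String decodeChunk(byte[] chunk) throws Exception {
        CharBuffer chars = DECODER.get().decode(ByteBuffer.wrap(chunk));
        return chars.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] line = "GET /ongoing/ongoing.atom HTTP/1.1".getBytes("US-ASCII");
        System.out.println(decodeChunk(line));
    }
}
```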
The more insidious form of the Wide Character Tax comes from the fact that, while Java characters are two bytes wide, what they actually hold are variable-width UTF-16 representations of characters, which are either two bytes or four bytes, depending on the character. I call this the Double Wide Character Tax, even though its effects are much more than twice those of the decoding tax. From the Javadocs for java.lang.Character:
The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. ... The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
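The quoted behavior is easy to demonstrate. A character outside the Basic Multilingual Plane, such as U+1D11E MUSICAL SYMBOL G CLEF, occupies two Java chars (a surrogate pair) but counts as a single code point:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D11E lies above U+FFFF, so UTF-16 encodes it as a
        // high-surrogate/low-surrogate pair of char values.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                          // 2 chars
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 code point
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
    }
}
```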
Each and every character must be checked for the possibility that it's part of a supplementary character, both to find character boundaries and to compute character counts correctly. The numbers below come from a 22-second CPU sample of a single thread, captured by Shark:
Self Total Library Symbol
10.2 s 10.2 s java.lang Character::codePointAt
4.0 s 4.0 s java.util.regex Pattern$Curly::match0
2.1 s 2.1 s java.util.regex Pattern$Sub::match
1.6 s 1.6 s java.lang String::equals
702.6 ms 702.6 ms scala.util.matching Regex$MatchData$class::group
437.5 ms 437.5 ms java.lang StringCoding$CharsetSD::decode
As you can see, in this worker thread, almost 50% of the time is spent computing code points. Like ASCII character decoding, this should be a trivial operation as there are no ASCII characters that translate into supplementary characters. Unlike character decoding, however, this has to be done repeatedly, every time a character position is traversed. Regular expression application is particularly prone to this type of penalty because it frequently revisits character positions, and the regular expression evaluator built into Java does nothing to memoize or otherwise optimize away the repeated checks for supplementary characters. From the above numbers, it seems that a significant portion of the time in each worker thread is spent paying this Double Wide Character Tax by the narrowest of characters.
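To make that cost concrete, here's a toy comparison (method names are mine): a general scan via codePointAt, which must test every position for a surrogate and take a variable-width step, versus a trivial count that is valid whenever the input is known to be pure ASCII. On ASCII data both produce the same answer; only the first pays the per-position check that dominates the profile above.

```java
public class CodePointScan {
    // General scan: codePointAt checks each position for a high surrogate,
    // and the step size depends on Character.charCount -- this is the work
    // the regex engine repeats every time it revisits a position.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            count++;
        }
        return count;
    }

    // ASCII fast path: no ASCII char can be a surrogate, so every char is
    // its own code point and no per-position check is needed at all.
    static int countAsciiChars(String s) {
        return s.length();
    }

    public static void main(String[] args) {
        String line = "GET /ongoing/When/200x/2003/09/18/NXML HTTP/1.1";
        System.out.println(countCodePoints(line)); // 47
        System.out.println(countAsciiChars(line)); // 47
    }
}
```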
So what next? I'm not sure if there's a way around this particular obstacle short of creating a byte-oriented regular expression engine on the JVM, far more trouble than it's worth for this challenge. If there's an easier solution out there, I'm sure someone will find it and we'll know what it is soon.
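For flavor, here's the kind of primitive a byte-oriented engine would be built from: a literal-pattern scan over raw bytes that never decodes to chars and so never asks the surrogate question. This is a sketch, not a regex engine, and all names are mine.

```java
public class ByteScan {
    // Find the first occurrence of a literal byte pattern in a buffer.
    // Working directly on bytes sidesteps both decoding and the
    // per-position supplementary-character checks entirely.
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        byte[] line = "127.0.0.1 - - \"GET /ongoing/ongoing.atom HTTP/1.1\" 200"
                .getBytes("US-ASCII");
        byte[] get = "GET ".getBytes("US-ASCII");
        System.out.println(indexOf(line, get)); // 15
    }
}
```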