Here's an update on my Wide Finder 2 submission, including program output, some performance data, and some thoughts about what's shaping the current performance. I've been making improvements to the code while writing these posts over the past couple of weeks, so the picture you get when reading them all at once may seem somewhat smeary and inconsistent. If you want a point-in-time snapshot to see how it all fits together, you can download this executable jar + source code. The jar requires scala-library.jar from Scala 2.7.1-final to be in the same directory when executed, and requires a JVM version 1.5 or higher. Here's the command line used to invoke the application as well as the output and timings:
% java -server -Xss1m -Xms3g -Xmx3g -jar WideFinder2-Scala-RayWaldin.jar /wf1/data/logs/O.all
43,178.0 MB (48.94 MBps)

Top URIs by hit:
 218540: /ongoing/When/200x/2003/09/18/NXML
 148630: /ongoing/When/200x/2003/03/16/XML-Prog
 129746: /ongoing/When/200x/2003/10/16/Debbie
 110427: /ongoing/When/200x/2003/07/17/BrowserDream
 102513: /ongoing/When/200x/2003/07/30/OnSearchTOC
 100934: /ongoing/When/200x/2003/04/26/UTF
 94408: /ongoing/When/200x/2003/07/25/NotGaming
 93060: /ongoing/When/200x/2003/04/06/Unicode
 92006: /ongoing/When/200x/2006/01/31/Data-Protection
 89795: /ongoing/When/200x/2004/12/12/BMS

Top URIs by bytes:
 870656.0M: /ongoing/ongoing.atom
 373564.5M: /ongoing/potd.png
 278894.3M: /ongoing/ongoing.rss
 91443.9M: /ongoing/rsslogo.jpg
 63061.5M: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
 39645.0M: /ongoing/When/200x/2006/05/16/J1d0.mov
 36557.5M: /ongoing/When/200x/2007/12/14/Shonen-Knife.mov
 36257.1M: /ongoing/
 33337.3M: /ongoing/moss60.jpg
 31966.7M: /ongoing/When/200x/2004/02/18/Bump.png

Top 404s:
 54271: /ongoing/ongoing.atom.xml
 28030: /ongoing/ongoing.pie
 27364: /ongoing/favicon.ico
 24631: /ongoing/When/200x/2004/04/27/-//W3C//DTD%20XHTML%201.1//EN
 24005: /ongoing/Browser-Market-Share.png
 23924: /ongoing/Browsers-via-search.png
 23850: /ongoing/Search-Engines.png
 22637: /ongoing/ongoing.atom'
 22619: //ongoing/ongoing.atom'
 20578: /ongoing/Feeds.png

Top client addresses:
 215496: msnbot.msn.com
 110909: crawl-66-249-72-173.googlebot.com
 106361: crawler14.googlebot.com
 96366: crawl-66-249-72-172.googlebot.com
 74220: crawl-66-249-72-72.googlebot.com
 67821: sv-crawlfw3.looksmart.com
 66361: h161.c221.tor.velocet.net
 65828: 125.19.50.169
 63676: crawl-66-249-72-165.googlebot.com
 56092: crawl-66-249-72-38.googlebot.com

Top referrers:
 182996: http://tbray.org/ongoing/
 158562: http://slashdot.org/
 62218: http://www.google.com/reader/view/
 40243: http://daringfireball.net/
 30197: http://programming.reddit.com/
 20463: http://reddit.com/
 15182: http://www.scripting.com/
 13864: http://www.tbray.org/
 13618: http://www.ask.com/web?q=page=23
 9917: http://www.macsurfer.com/

[total 912,280ms] hits: 4680, bytes: 45692, 404s: 128489, refs: 1902292, clients: 3573466

real 15:17.0
user 7:35:07.7
sys 9:13.0
Two minor differences remain between my results and those of the reference implementation, shown below. They stem from my choice not to assume that fields are space-delimited, an assumption many of the other entries make, including the reference implementation.
  110427: /ongoing/When/200x/2003/07/17/BrowserDream
  102513: /ongoing/When/200x/2003/07/30/OnSearchTOC
- 100932: /ongoing/When/200x/2003/04/26/UTF
+ 100934: /ongoing/When/200x/2003/04/26/UTF
  94408: /ongoing/When/200x/2003/07/25/NotGaming
  93060: /ongoing/When/200x/2003/04/06/Unicode

  39645.0M: /ongoing/When/200x/2006/05/16/J1d0.mov
  36557.5M: /ongoing/When/200x/2007/12/14/Shonen-Knife.mov
- 36257.0M: /ongoing/
+ 36257.1M: /ongoing/
  33337.3M: /ongoing/moss60.jpg
  31966.7M: /ongoing/When/200x/2004/02/18/Bump.png
Profiling has shown that this process is CPU bound. Several people have noted that even a simple, single-threaded, non-NIO Java application can traverse the entire 42GB file in about 5 minutes on the T2000 server, yet the best performance to date from any JVM-based submission is still under 50 MBps. It's becoming pretty clear that there's a major obstacle to any further performance improvements: the Java Wide Character Tax.
There are two forms of overhead caused by Java's support for wide characters. The most obvious, but least taxing, form is character decoding. Since we're dealing with ASCII data, the overhead of decoding it is pretty light. It's definitely best not to do this in the single blocking thread that reads the bytes, but rather to let each worker thread decode characters for its own chunk of data. Even so, it's a pretty trivial process: my profiling shows that character decoding was active only about 2% of the time while sampling CPU performance.
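A minimal sketch of that per-chunk decoding approach (the class and method names here are my own for illustration, not from the actual submission):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class ChunkDecode {
    // Decode one worker's chunk of bytes into chars. Each worker should
    // construct its own CharsetDecoder: decoders carry internal state and
    // are not thread-safe, so sharing one would force synchronization.
    static String decodeChunk(byte[] chunk) throws CharacterCodingException {
        CharsetDecoder decoder = Charset.forName("US-ASCII").newDecoder();
        CharBuffer chars = decoder.decode(ByteBuffer.wrap(chunk));
        return chars.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] chunk = "127.0.0.1 - - \"GET /ongoing/ HTTP/1.1\" 200"
                .getBytes("US-ASCII");
        System.out.println(decodeChunk(chunk));
    }
}
```

Keeping the decode in the workers means the single reader thread does nothing but pull bytes off disk, which is exactly where you want the blocking to happen.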
The more insidious form of the Wide Character Tax comes from the fact that, while Java characters are two bytes wide, what they actually hold are variable-width UTF-16 representations of characters, which are either two bytes or four bytes, depending on the character. I call this the Double Wide Character Tax, even though its effects are much more than twice those of the decoding tax. From the Javadocs for java.lang.Character:
The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. ... The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
Each and every character must be checked for the possibility that it's a supplementary character, both to find character boundaries and to compute character counts correctly. The numbers below come from a 22-second CPU sampling of a single thread, as captured by Shark:
Self      Total     Library              Symbol
10.2 s    10.2 s    java.lang            Character::codePointAt
4.0 s     4.0 s     java.util.regex      Pattern$Curly::match0
2.1 s     2.1 s     java.util.regex      Pattern$Sub::match
1.6 s     1.6 s     java.lang            String::equals
702.6 ms  702.6 ms  scala.util.matching  Regex$MatchData$class::group
437.5 ms  437.5 ms  java.lang            StringCoding$CharsetSD::decode
As you can see, in this worker thread, almost 50% of the time is spent computing code points. Like ASCII character decoding, this should be a trivial operation as there are no ASCII characters that translate into supplementary characters. Unlike character decoding, however, this has to be done repeatedly, every time a character position is traversed. Regular expression application is particularly prone to this type of penalty because it frequently revisits character positions, and the regular expression evaluator built into Java does nothing to memoize or otherwise optimize away the repeated checks for supplementary characters. From the above numbers, it seems that a significant portion of the time in each worker thread is spent paying this Double Wide Character Tax by the narrowest of characters.
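To see why this adds up, here is a rough sketch (my own simplification, not the actual JDK source) of the test that codePointAt must perform on every single call, even when the input is pure ASCII and the surrogate branch can never be taken:

```java
public class SurrogateCheck {
    // Approximation of the work Character/String codePointAt does per call:
    // every lookup first tests whether the char is a high surrogate, and
    // only then can it return a code point. For ASCII text the branch is
    // never taken, but the check itself is paid at every character position.
    static int codePointAtSketch(CharSequence s, int i) {
        char c1 = s.charAt(i);
        if (Character.isHighSurrogate(c1) && i + 1 < s.length()) {
            char c2 = s.charAt(i + 1);
            if (Character.isLowSurrogate(c2)) {
                return Character.toCodePoint(c1, c2);
            }
        }
        return c1;
    }

    public static void main(String[] args) {
        // 'G' is code point 71; the surrogate branch is dead weight here.
        System.out.println(codePointAtSketch("GET /ongoing/", 0));
    }
}
```

A regex engine that revisits positions during backtracking repeats this check at the same index again and again, which is consistent with codePointAt dominating the profile above.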
So what next? I'm not sure if there's a way around this particular obstacle short of creating a byte-oriented regular expression engine on the JVM, far more trouble than it's worth for this challenge. If there's an easier solution out there, I'm sure someone will find it and we'll know what it is soon.
JRuby is using Joni, a port of a regex engine called Oniguruma. One reason they moved to it was to avoid the byte/char conversion overhead--something to look into?
Thanks Patrick. I hadn't thought to look there and it looks very promising.
Why did you choose to use what looks like a form of optimistic locking on your Count map instead of using an AtomicLong?
Hi Erik,
I did use AtomicLongs for a while, but switched to the current scheme after noticing that there was no noticeable difference in performance, and it greatly simplified the use of the map values elsewhere. I talk about that a little here.