Tuesday, June 10, 2008

java.util.regex leaks

There's a resource leak waiting to bite you if you use java.util.regex.* much. The problem is related to a known behavior of String.substring() where the returned substring is backed by the original string. Actually, the substring is backed by the same char[] that backs the original string to be more precise, which is a great optimization if:

  • both strings are of relatively equivalent sizes, or
  • you're planning on keeping the larger of the two strings around long term anyway.

After all, why copy characters out to another array when there's a perfectly good one to reuse?  Well, if either of these assumptions are untrue, you'll end up use more memory than you realize, and it's worse the longer your code runs, or the bigger the difference in size between the two strings.

It bit me again recently while working on my Wide Finder 2 submission, where the Matcher component would repeatedly read 3 megabytes from disk, decode it into a string backed by a 6 megabyte character array, and then perform regular expression matches on that string.  Whenever it found previously unseen value, it would store it as the key in some hash map.  The values were obtained from calls to java.util.regex.Matcher.group() or .group(n), and it turns out these Matcher methods use substring() on the string that was searched. So each HashMap key (just a few tens of bytes long, for sure) was backed by a 6 megabyte character array. The result was that most of the 6 megabyte character arrays were retained in memory for the duration of the application, even though only a few substrings from each string were required.

The solution I chose was to clone the string using the String copy constructor which copies the portion of the character array in the process of cloning. Applying the fix as late as possible ensures that the intended optimization remains beneficial whenever it can, but doesn't get in the way when it's inappropriate. Here's an example, in Java:

import java.util.*;
import java.util.regex.*;

public class Test {
    public static void main(String[] args) {
        String string = "really, really, really, really, really, really long string";
        Matcher matcher = Pattern.compile("(\\w+),").matcher(string);
        Map<String,List<Integer>> map = new HashMap<String,List<Integer>>();

        while(matcher.find()) {
            String found = matcher.group(1);
            List positions = map.get(found);
            if(positions == null) {
                positions = new ArrayList<Integer>();
                map.put(new String(found), positions);
            }
            positions.add(matcher.start(1));
        }
    }
}

I thought it would be best to demonstrate this in Java since that would benefit the widest potential audience, but after being so fully immersed in Groovy and Scala lately, writing even this much Java was quite tedious. There really is a lot of Syntactic Noise that I will be happy to forget.

2 comments:

  1. Nice article, thank you

    I had the same ugly problem today and searched the bug several hours.

    ReplyDelete
  2. This just saved my ass, thanks! I was running onto this with Scala and doing a large number of regex in some XML templates to do ant-style variable substitution

    ReplyDelete