Spring StringTokenizer performance issue.

Since JDK 1.8, split() has been shown to be faster than StringTokenizer, with a difference of approximately 30.5% in my test results. Tested JDK versions: 1.8, 11, 17.

StringTokenizer is still used elsewhere, but I've recently replaced it in one of the AOP classes in the spring-framework project, which is the change under review here. Although I'm not sure how much it contributes to the project, the change not only improves performance but also makes the code more readable compared to StringTokenizer.

If this change proves beneficial, I'll look for and update similar instances throughout the project.

Comment From: snicoll

Thanks for the PR.

"the change not only improves performance but also makes the code more readable compared to StringTokenizer."

Can you share the benchmark you've used to assess the performance improvement?

Comment From: dukbong

@snicoll I've created a benchmark report.

Performance Benchmark: StringTokenizer vs. split()

Overview

This report compares the performance of StringTokenizer and the split() method in Java for splitting strings using simple delimiters.

Experimental Environment

Language: Java
Development Environment: Eclipse
Number of Trials: 1,000,000 iterations x 10 trials
JDK Versions: 1.8, 11, 17

Methods Utilized in Performance Testing

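    // ARGUMENT_NAMES is a String constant defined elsewhere in the test class (not shown).
    // The Method parameter is unused in both methods.

    // Existing approach: tokenize with StringTokenizer.
    public static String[] TestPerformanceOfExistingCode(Method method) {
        StringTokenizer nameTokens = new StringTokenizer(ARGUMENT_NAMES, ",");
        int numTokens = nameTokens.countTokens();
        if (numTokens > 0) {
            String[] names = new String[numTokens];
            for (int i = 0; i < names.length; i++) {
                names[i] = nameTokens.nextToken();
            }
            return names;
        } else {
            return null;
        }
    }

    // Proposed approach: String#split with a single-character delimiter.
    public static String[] TestPerformanceOfImprovedCode(Method method) {
        String[] names = ARGUMENT_NAMES.split(",");
        if (names.length > 0) {
            return names;
        }
        return null;
    }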
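
Note that the two methods are not drop-in equivalents for every input: StringTokenizer skips empty tokens, while split() keeps leading and interior empty strings and drops only trailing ones. The following is a minimal sketch of that difference; the class name, helper names, and sample input are made up for illustration and are not part of the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class TokenSemantics {

    // Collects tokens with StringTokenizer, which never returns empty tokens.
    static String[] withTokenizer(String input) {
        StringTokenizer tokens = new StringTokenizer(input, ",");
        List<String> result = new ArrayList<>();
        while (tokens.hasMoreTokens()) {
            result.add(tokens.nextToken());
        }
        return result.toArray(new String[0]);
    }

    // Splits with String#split, which keeps interior empty strings
    // and removes trailing empty strings.
    static String[] withSplit(String input) {
        return input.split(",");
    }

    public static void main(String[] args) {
        String input = "a,,b,"; // hypothetical input with an empty and a trailing element
        System.out.println(Arrays.toString(withTokenizer(input))); // [a, b]
        System.out.println(Arrays.toString(withSplit(input)));     // [a, , b]
    }
}
```

For inputs without empty elements, such as a plain list of argument names, both approaches return the same array.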

Results

JDK 1.8

Average Time for StringTokenizer: 127.8ms
Average Time for split(): 161.3ms
Test Result: split() method is approximately 26.2% slower.

JDK 11

Average Time for StringTokenizer: 167ms
Average Time for split(): 149.8ms
Test Result: split() method is approximately 10.3% faster.

JDK 17

Average Time for StringTokenizer: 149.8ms
Average Time for split(): 125.4ms
Test Result: split() method is approximately 16.3% faster.

Analysis

In these tests, split() was faster than StringTokenizer on JDK 11 and 17 for simple splitting on a single-character delimiter, but slower on JDK 1.8. Complex delimiters and special cases were not tested; only the case relevant to this change was measured. While the performance difference may not be significant, using split() can also improve code readability and maintainability.

Conclusion

On JDK 11 and 17, split() was the more efficient choice for this kind of string splitting; the JDK 1.8 run did not show the same gain. Depending on specific requirements and scenarios, StringTokenizer may still be appropriate. Given its potential benefits for performance, readability, and maintainability, split() is worth considering as the preferred option.

Future Considerations

Before converting all StringTokenizer usages to split(), the change should be validated through further comparative experiments.

Observed Performance Improvement of split() Method

It is hard to find an explicit statement that split() performance has improved since JDK 1.8, but the test results above suggest that it has on the newer JDKs.

Comment From: bclozel

Thanks for the proposal, but we're going to decline it.

I had a go with our StringUtils.tokenizeToStringArray method that uses a StringTokenizer and applied your recommendation.

    public static String[] tokenizeToStringArray(
            @Nullable String str, String delimiters, boolean trimTokens, boolean ignoreEmptyTokens) {

        if (str == null) {
            return EMPTY_STRING_ARRAY;
        }

        List<String> tokens = new ArrayList<>();
        String[] split = str.split(delimiters);
        for (String token : split) {
            if (trimTokens) {
                token = token.trim();
            }
            if (!ignoreEmptyTokens || token.length() > 0) {
                tokens.add(token);
            }
        }
        return toStringArray(tokens);
    }
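
One caveat with this quick substitution: StringTokenizer treats its `delimiters` argument as a set of individual delimiter characters, whereas String#split treats its argument as a regular expression. The two only agree for a single non-metacharacter delimiter such as ",". The sketch below illustrates the difference; the class and helper names are hypothetical and not part of StringUtils.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

class DelimiterSemantics {

    // StringTokenizer: every character in `delimiters` is a separate delimiter.
    static List<String> tokenize(String input, String delimiters) {
        StringTokenizer st = new StringTokenizer(input, delimiters);
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // Naive replacement: `delimiters` is interpreted as a regular expression.
    static List<String> naiveSplit(String input, String delimiters) {
        return Arrays.asList(input.split(delimiters));
    }

    // Closer replacement (assumes a non-empty delimiter string): builds an
    // alternation of the quoted delimiter characters, e.g. "\Q,\E|\Q;\E".
    static List<String> splitOnAnyOf(String input, String delimiters) {
        StringBuilder regex = new StringBuilder();
        for (char c : delimiters.toCharArray()) {
            if (regex.length() > 0) {
                regex.append('|');
            }
            regex.append(Pattern.quote(String.valueOf(c)));
        }
        return Arrays.asList(input.split(regex.toString()));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a,b;c", ",;"));     // [a, b, c]
        System.out.println(naiveSplit("a,b;c", ",;"));   // [a,b;c]  (no literal ",;" in the input)
        System.out.println(splitOnAnyOf("a,b;c", ",;")); // [a, b, c]
        System.out.println(tokenize("a.b", "."));        // [a, b]
        System.out.println(naiveSplit("a.b", "."));      // []  ("." matches every character)
    }
}
```

Even the closer variant still differs from StringTokenizer in how empty tokens are handled, so it is a sketch of the delimiter issue only.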

With the help of a quick JMH benchmark:

    @Benchmark
    public void tokenizeToStringArray(TokenizerState state, Blackhole blackhole) {
        for (String source : state.source) {
            blackhole.consume(StringUtils.tokenizeToStringArray(source, ","));
        }
    }

    @State(Scope.Benchmark)
    public static class TokenizerState {

        @Param("10")
        int elementCount;

        @Param("20")
        int inputCount;

        Collection<String> source;

        @Setup(Level.Iteration)
        public void setup() {
            Random random = new Random();
            this.source = new ArrayList<>(this.inputCount);
            for (int i = 0; i < this.inputCount; i++) {
                ArrayList<String> tokens = new ArrayList<>(this.elementCount);
                for (int j = 0; j < this.elementCount; j++) {
                    tokens.add(String.format("%0" + (random.nextInt(9) + 1)  + "d", 1));
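                    // Note: this runs inside the inner loop, so a joined string is added after
                    // every token (strings of 1..elementCount tokens for each input).
                    this.source.add(String.join(",", tokens));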
                }
            }
        }

    }

I'm seeing the following results:

With StringTokenizer:

Benchmark                                                      (elementCount)  (inputCount)   Mode  Cnt      Score      Error   Units
StringUtilsBenchmark.tokenizeToStringArray                                 10            20  thrpt   10  41373.314 ± 4674.514   ops/s
StringUtilsBenchmark.tokenizeToStringArray:gc.alloc.rate                   10            20  thrpt   10   2840.427 ±  323.345  MB/sec
StringUtilsBenchmark.tokenizeToStringArray:gc.alloc.rate.norm              10            20  thrpt   10  71993.649 ±  294.345    B/op
StringUtilsBenchmark.tokenizeToStringArray:gc.count                        10            20  thrpt   10   1282.000             counts
StringUtilsBenchmark.tokenizeToStringArray:gc.time                         10            20  thrpt   10    732.000                 ms



With String#split:

Benchmark                                                      (elementCount)  (inputCount)   Mode  Cnt      Score      Error   Units
StringUtilsBenchmark.tokenizeToStringArray                                 10            20  thrpt   10  33193.063 ± 2277.633   ops/s
StringUtilsBenchmark.tokenizeToStringArray:gc.alloc.rate                   10            20  thrpt   10   3002.365 ±  212.610  MB/sec
StringUtilsBenchmark.tokenizeToStringArray:gc.alloc.rate.norm              10            20  thrpt   10  94849.055 ±  352.119    B/op
StringUtilsBenchmark.tokenizeToStringArray:gc.count                        10            20  thrpt   10   1355.000             counts
StringUtilsBenchmark.tokenizeToStringArray:gc.time                         10            20  thrpt   10    741.000                 ms

We're seeing a throughput decrease of about 20% and a higher allocation rate, so I don't think StringTokenizer vs. String#split is an easy call to make in all situations. The situation may be a bit different in the case of AbstractAspectJAdvisorFactory, but I think the JMH benchmark I used is quite close. Microbenchmarking is a complex subject, and I think the code you've shared probably has a number of biases that make the results irrelevant.
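
As general background rather than a diagnosis of the shared code: two common sources of bias in hand-written timing loops are measuring before the JIT compiler has warmed up and discarding results, which can let the JIT remove the work being measured. The sketch below shows one way a manual loop might address both; the class name, input string, and iteration counts are assumptions for illustration, and it is not a substitute for JMH, which also controls forking, iteration structure, and statistics.

```java
import java.util.StringTokenizer;
import java.util.function.ToIntFunction;

class ManualTimingSketch {

    // Hypothetical sample input; the original test data was not shared.
    static final String INPUT = "alpha,beta,gamma,delta";

    // Tokenizes with StringTokenizer and returns the total token length.
    static int tokenizerLength(String s) {
        StringTokenizer st = new StringTokenizer(s, ",");
        int total = 0;
        while (st.hasMoreTokens()) {
            total += st.nextToken().length();
        }
        return total;
    }

    // Splits with String#split and returns the total token length.
    static int splitLength(String s) {
        int total = 0;
        for (String token : s.split(",")) {
            total += token.length();
        }
        return total;
    }

    // Runs the task repeatedly and accumulates its results into a checksum,
    // so the work has an observable effect.
    static long run(ToIntFunction<String> task, int iterations) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            checksum += task.applyAsInt(INPUT);
        }
        return checksum;
    }

    // Warms up first, then times only the measured iterations.
    static long timeNanos(ToIntFunction<String> task, int warmup, int iterations) {
        long sink = run(task, warmup);
        long start = System.nanoTime();
        sink += run(task, iterations);
        long elapsed = System.nanoTime() - start;
        if (sink == 0) {
            throw new IllegalStateException("unexpected checksum");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int warmup = 100_000;
        int iterations = 1_000_000;
        long tokenizer = timeNanos(ManualTimingSketch::tokenizerLength, warmup, iterations);
        long split = timeNanos(ManualTimingSketch::splitLength, warmup, iterations);
        System.out.printf("StringTokenizer: %d ms%n", tokenizer / 1_000_000);
        System.out.printf("split():         %d ms%n", split / 1_000_000);
    }
}
```

Numbers from a loop like this still depend on run order, garbage collection, and the machine, so they are best treated as a rough indication.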

Comment From: dukbong

Thank you. It's an honor to receive feedback. I'll continue to study so that I can provide insights that could be helpful to open source projects in the future.