The use of an okio.Buffer instead of a ByteArrayOutputStream reduces the number of allocated byte arrays by 50% to 66%.
With the current ByteArrayOutputStream-based implementation, 2-3 byte arrays are allocated: one when the stream is created, one when toByteArray() is called, and possibly a third when the stream must grow because the overhead (headers) exceeds 128 bytes.
With an okio.Buffer, only one byte array has to be allocated: when readByteArray() is called. All other needed arrays are taken from Okio's internal segment pool.
The pattern of conditionally using classes when they are present on the classpath is inspired by RestTemplate.
I've chosen this older Okio version because it's the same version that would be pulled in by the already managed OkHttp version.
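The classpath-detection pattern mentioned above can be sketched with a plain Class.forName check (a stand-in for Spring's ClassUtils.isPresent; the class and method names here are illustrative, not part of the actual patch):

```java
public class ConditionalBufferSupport {

    // True if okio.Buffer can be loaded; evaluated once at class-load time,
    // similar to how RestTemplate detects optional HTTP client libraries.
    private static final boolean OKIO_PRESENT = isPresent("okio.Buffer");

    static boolean isPresent(String className) {
        try {
            // initialize=false: just probe for the class, don't run static initializers
            Class.forName(className, false, ConditionalBufferSupport.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException ex) {
            return false;
        }
    }

    // Choose the encoding strategy based on what the classpath offers.
    static String chosenStrategy() {
        return OKIO_PRESENT ? "okio.Buffer" : "ByteArrayOutputStream";
    }
}
```

The check runs once in a static initializer, so the per-message encode path pays no reflection cost.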
Context
While profiling our application, we noticed that a considerable number of byte arrays is created in org.springframework.messaging.simp.stomp.StompEncoder.encode.
In other parts of our application, we have already had good experiences with using okio.Buffer instead of ByteArrayOutputStream.
Comment From: rstoyanchev
Good suggestion.
Couldn't we achieve that with java.nio.ByteBuffer? It's not too hard to expose it as an OutputStream, as in our DefaultDataBuffer, and at the end get the backing array from the ByteBuffer. It doesn't require remembering to add an extra dependency.
Comment From: larsgrefer
I like the idea of not having to add/use an additional external dependency, but I'm not sure a ByteBuffer would help here.
ByteBuffer.allocate() returns a java.nio.HeapByteBuffer which just seems to be a wrapper around a regular byte[].
Using ByteBuffer.allocateDirect() would return a java.nio.DirectByteBuffer, which seems to avoid byte[] allocations by doing some Unsafe work outside the JVM heap, but the Javadoc states the following:
The buffers returned by this method typically have somewhat higher allocation and deallocation costs than non-direct buffers. [...] It is therefore recommended that direct buffers be allocated primarily for large, long-lived buffers [...].
Since ByteBuffers are fixed-size, we'd have the same allocation and growth problems as before.
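For illustration, a heap ByteBuffer really is just a wrapper over a byte[], and writing past its capacity throws rather than growing. A small stdlib-only demonstration (not part of the proposed change):

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;

public class ByteBufferLimits {

    // Returns true if writing `count` bytes overflows a buffer of size `capacity`.
    static boolean overflows(int capacity, int count) {
        ByteBuffer buffer = ByteBuffer.allocate(capacity); // backed by one byte[capacity]
        try {
            for (int i = 0; i < count; i++) {
                buffer.put((byte) i); // no automatic growth: throws once the buffer is full
            }
            return false;
        } catch (BufferOverflowException ex) {
            return true;
        }
    }

    // For heap buffers, the backing byte[] is directly accessible via array().
    static boolean hasBackingArray() {
        return ByteBuffer.allocate(16).hasArray();
    }
}
```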
Comment From: rstoyanchev
Yes, I meant a heap buffer. A byte[] is created initially, but subsequent writes to the ByteBuffer do not require a copy, and in the end you can get the underlying byte array. That would account for the top 2 allocations (34.9% and 22.5%) from the breakdown above.
With an okio.Buffer, only one byte array has to be allocated: when readByteArray() is called. All other needed arrays are taken from Okio's internal segment pool.
This is saying there is still at least one allocation. I don't quite follow: what other arrays are needed here that would be taken from a pool?
Comment From: larsgrefer
This is saying there is still at least one allocation. I don't quite follow: what other arrays are needed here that would be taken from a pool?
In our case (because our STOMP headers are bigger than 128 bytes), the current implementation allocates 3 byte arrays for each STOMP message we send:
- The initial internal buffer of the ByteArrayOutputStream (initialized as 128 + payload.length) (20.0%, 14414 samples)
- The grown internal buffer of the ByteArrayOutputStream, which is needed because we write more than payload.length + 128 bytes. (22.5%, 16233 samples)
- The result array which is created as subsequence copy of the internal buffer (34.9%, 25101 samples)
(The numbers don't add up correctly because the profiler I used doesn't track every single allocation as explained here: https://github.com/jvm-profiling-tools/async-profiler#allocation-profiling )
When sending 10,000 messages, this makes 30,000 byte arrays.
When using okio.Buffer, the first two arrays would be taken from a pool and only the result array would be allocated. This would make 10,001-10,002 byte arrays for 10,000 messages.
A byte[] is created initially, but subsequent writes to the ByteBuffer do not require a copy, and in the end you can get the underlying byte array.
How would that work if the underlying byte array is too small, or too big? If it's too small, we'd have to allocate a second, bigger ByteBuffer and copy the contents from the old one to the new one (effectively step 2 mentioned above). If it's too big, we'd need to allocate a correctly sized result array and copy "our" subsequence of the buffer array into it (effectively step 3 mentioned above).
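Those two extra copies can be made concrete with plain arrays: Arrays.copyOf is essentially what a grow step (step 2) and a trim-to-size step (step 3) each amount to. An illustrative sketch, not actual encoder code:

```java
import java.util.Arrays;

public class GrowAndTrim {

    // Step 2: buffer too small -> allocate a bigger array and copy everything over.
    // Doubling is the typical growth policy; neededCapacity is the floor.
    static byte[] grow(byte[] buffer, int neededCapacity) {
        return Arrays.copyOf(buffer, Math.max(buffer.length * 2, neededCapacity));
    }

    // Step 3: buffer too big -> allocate an exactly-sized result and copy the used prefix.
    static byte[] trim(byte[] buffer, int usedLength) {
        return Arrays.copyOf(buffer, usedLength);
    }
}
```

Each call allocates a fresh array and copies, which is exactly the overhead the discussion is trying to avoid.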
Comment From: rstoyanchev
Thanks for elaborating. So we are doing a very poor job of estimating the required size.
We have all the input at hand. We should be able to estimate more precisely by turning the Map<String, Object> into a Map<byte[], byte[]> upfront, and only then allocating the overall array. That should avoid allocations to expand.
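The "encode headers first, then allocate once" idea could look roughly like this. This is a hypothetical sketch, not the actual commit; the header layout is a simplified STOMP-like "name:value\n" format, and getBytes still allocates per header:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class PreSizedEncoder {

    // Encode all headers up front, sum the exact total size, then fill a
    // single result array: no growth copy, no trim copy.
    static byte[] encode(Map<String, String> headers, byte[] payload) {
        List<byte[]> headerBytes = new ArrayList<>();
        int size = 0;
        for (Map.Entry<String, String> entry : headers.entrySet()) {
            byte[] line = (entry.getKey() + ":" + entry.getValue() + "\n")
                    .getBytes(StandardCharsets.UTF_8);
            headerBytes.add(line);
            size += line.length;
        }
        size += 1 + payload.length; // blank line separating headers from payload

        byte[] result = new byte[size]; // allocated once, exactly sized
        int pos = 0;
        for (byte[] line : headerBytes) {
            System.arraycopy(line, 0, result, pos, line.length);
            pos += line.length;
        }
        result[pos++] = '\n';
        System.arraycopy(payload, 0, result, pos, payload.length);
        return result;
    }
}
```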
Come to think of it, we also have FastByteArrayOutputStream (https://docs.spring.io/spring/docs/current/javadoc-api/org/springframework/util/FastByteArrayOutputStream.html), which should be able to do the same.
Comment From: larsgrefer
I did a quick check for our application: we'd need an initial buffer of approx. payload.length + 300 in order to avoid resizing.
FastByteArrayOutputStream seems to be similar to okio.Buffer, in that both are linked lists of byte array segments. The main difference is that okio.Buffer uses fixed-size segments, which are pooled, while FastByteArrayOutputStream uses dynamically sized segments (which aren't pooled).
Comment From: rstoyanchev
Something like this: https://github.com/rstoyanchev/spring-framework/commit/fa220ff4ab43c769c96a80be19fb8e31a4ced1cf. It basically just aggregates and defers allocation until the end, which eliminates resizing.
I'm also wondering if we can avoid the array copy when taking a byte[] from a String, by iterating over its chars and encoding them individually.
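For ASCII-only header content, writing chars directly is straightforward and skips the temporary byte[] that String.getBytes(...) allocates. A hedged sketch of the idea (class and method names are illustrative):

```java
import java.io.ByteArrayOutputStream;

public class AsciiWriter {

    // Writes each char of an ASCII string as a single byte, avoiding the
    // intermediate byte[] that value.getBytes(...) would allocate.
    static void writeAscii(String value, ByteArrayOutputStream out) {
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c > 127) {
                throw new IllegalArgumentException("non-ASCII char at index " + i);
            }
            out.write(c); // write(int) stores the low 8 bits as one byte
        }
    }
}
```

This only holds for US-ASCII content; headers containing non-ASCII characters would still need a real charset encoder.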
Comment From: larsgrefer
The commit you've linked looks very promising. I've left some inline comments.
Comment From: rstoyanchev
I've incorporated your feedback, thanks.
Comment From: larsgrefer
Thanks for fixing this so fast :+1:
Will this only be part of 5.3 M1 or also back-ported to 5.2.x?
Comment From: rstoyanchev
This is in master and we haven't branched for 5.3 yet, so it will be in 5.2.6.