Ensure streaming works in "real-time"
fix: Ensure streaming chunks are emitted individually to the client
Problem:
The previous implementation combined multiple chunks into a single response, causing all data to be sent to the client at once instead of streaming each chunk individually. This behavior was due to the use of `reduce` and `concatMapIterable`, which aggregated the data before emitting it.
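To illustrate the failure mode, here is a minimal Reactor sketch (illustrative only, not the actual Spring AI pipeline): `reduce` buffers every element and emits a single aggregate only after the upstream completes, so nothing reaches the subscriber until the whole response has arrived.

```java
import reactor.core.publisher.Flux;

public class ReduceBuffersEverything {

    public static void main(String[] args) {
        // Stand-in for the streamed completion chunks.
        Flux<String> chunks = Flux.just("Hel", "lo,", " wo", "rld");

        // reduce() collapses the Flux into a Mono that fires exactly once,
        // after the source completes -- the "all at once" behavior.
        chunks.reduce(String::concat)
              .subscribe(whole -> System.out.println("emitted once: " + whole));
    }
}
```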
Solution:
1. Removed the use of `reduce` and `concatMapIterable`, which were combining chunks into a single response.
2. Updated the code to use `flatMap` on the `window` to ensure each item within the window is processed and emitted individually (see the sketch after this list).
3. Streamed `ChatCompletions` directly to maintain the streaming behavior, ensuring that each chunk is processed and passed downstream as soon as it is received.
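A minimal sketch of the corrected shape, assuming a Reactor pipeline like the one described (names and the "|" boundary marker are illustrative, not the actual Spring AI code): `windowUntil` splits the stream into sub-Fluxes, and `flatMap` re-emits each element of every window as soon as it arrives.

```java
import reactor.core.publisher.Flux;

public class WindowFlatMapStreaming {

    public static void main(String[] args) {
        // Stand-in for streamed chunks; "|" marks a window boundary
        // (e.g. the end of a function-call segment).
        Flux<String> chunks = Flux.just("a", "b", "|", "c", "d", "|");

        chunks.windowUntil("|"::equals)
              // flatMap subscribes to each window eagerly and forwards
              // every element downstream the moment it is emitted,
              // instead of waiting for the window to complete.
              .flatMap(window -> window)
              .subscribe(chunk -> System.out.println("chunk: " + chunk));
    }
}
```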
Changes:
- Removed the `reduce` call to avoid combining chunks.
- Replaced `concatMapIterable` with `flatMap` to process each chunk individually.
- Modified the `windowUntil` logic to correctly handle function calls and stream each chunk separately.
This change ensures that each chunk of data is streamed to the client individually, providing a smoother and more immediate streaming experience.
Tested:
- Verified that the client receives each chunk individually as it is streamed.
- Confirmed that the function call handling logic works correctly with the updated streaming approach.
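One way such a check could look with reactor-test (a hypothetical sketch, not the project's actual integration tests): `StepVerifier` asserts that each chunk surfaces as its own `onNext` signal rather than as one aggregated emission.

```java
import java.time.Duration;

import reactor.core.publisher.Flux;
import reactor.test.StepVerifier;

class ChunkedStreamingCheck {

    void chunksArriveIndividually() {
        // Simulated streaming source: one chunk per 100 ms tick.
        Flux<String> stream = Flux.just("one", "two", "three")
                                  .delayElements(Duration.ofMillis(100));

        // Each chunk must be observed as a separate signal; an
        // aggregated pipeline would fail these expectations.
        StepVerifier.create(stream)
                    .expectNext("one")
                    .expectNext("two")
                    .expectNext("three")
                    .verifyComplete();
    }
}
```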
Comment From: bruno-oliveira
@markpollack @joshlong Hello, the streaming issue was not addressed by the previous fix; the chunks were still arriving all at once. This MR fixes that.
Comment From: bruno-oliveira
@joshlong @markpollack I think the issue lies with the actual Azure SDK client: it has a piece of logic that turns the streaming of the chunks into a blocking call, which essentially makes streaming with Azure OpenAI and Spring AI impossible at the moment. Do you think this could get higher priority?
Comment From: tzolov
@bruno-oliveira I cannot confirm your observation. In fact, there are integration tests that confirm that the streamed response is indeed chunked for both function and non-function calls:
- https://github.com/spring-projects/spring-ai/blob/16c531c36c143babdda59eff32b24ac05b291bc6/models/spring-ai-azure-openai/src/test/java/org/springframework/ai/azure/openai/function/AzureOpenAiChatModelFunctionCallIT.java#L89
- https://github.com/spring-projects/spring-ai/blob/16c531c36c143babdda59eff32b24ac05b291bc6/spring-ai-spring-boot-autoconfigure/src/test/java/org/springframework/ai/autoconfigure/azure/AzureOpenAiAutoConfigurationIT.java#L87
Please review the above tests and let me know if there is an issue with them.
If you still think there is an issue please write a test that illustrates it.
Comment From: bruno-oliveira
@tzolov Thanks for reaching out! It's a pleasure to engage with the community that builds such amazing tools that tons of devs use every day!
I have created a ticket, not here, but in the Azure SDK repo: https://github.com/Azure/azure-sdk-for-java/issues/40629#issuecomment-2177944316
Note that several people confirmed this is an issue and observe this exact behavior. If I make a "streaming call" in Postman, the response "blocks" until all the chunks are "internally streamed", which makes it exactly the same as a normal, non-streaming call.
Initially I started by investigating the Spring AI library, but that led me to think that the bug must be in the Azure SDK after all.
Maybe this one can be closed as it's not directly related to Spring AI. However, for all intents and purposes, streaming "doesn't work" at the moment with the Azure OpenAI client.
Comment From: timostark
@bruno-oliveira can you check https://github.com/spring-projects/spring-ai/pull/1054 if that solves the issue for you?