OpenAI has recently introduced audio multimodality support, for both input and output.

Input audio modality support was added in https://github.com/spring-projects/spring-ai/issues/1560 and is exposed all the way up through the Spring AI abstractions.

The output audio modality is currently supported only at the lower level (OpenAiApi). Its usage is demonstrated in this integration test: https://github.com/spring-projects/spring-ai/blob/bdb66e5770836dc9dec6be40af801d9cd9e41e2a/models/spring-ai-openai/src/test/java/org/springframework/ai/openai/api/OpenAiApiIT.java#L98-L118

It would be useful to start identifying what kind of abstractions the ChatResponse API needs in order to carry audio response data.
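To make the discussion concrete, here is a minimal, self-contained sketch of what such an abstraction *might* look like. All type and field names (`AudioOutput`, `Generation`, `ChatResponse`, and their members) are assumptions for illustration only, not actual Spring AI types; the real design would need to fit the existing `ChatResponse`/`Generation` hierarchy and media abstractions.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: a generation that can carry audio output alongside text.
// None of these names are actual Spring AI API; they only illustrate the shape
// of the data an audio-capable ChatResponse would need to hold.
public class AudioResponseSketch {

    // Raw audio returned by the model, plus the metadata needed to play,
    // persist, or transcribe it (mirroring what the OpenAI API returns).
    record AudioOutput(String id, byte[] data, String format, String transcript) {}

    // A single model generation: text content plus optional audio output.
    record Generation(String text, Optional<AudioOutput> audio) {}

    // The response wrapper holding one or more generations.
    record ChatResponse(List<Generation> generations) {}

    public static void main(String[] args) {
        AudioOutput audio = new AudioOutput(
                "audio-abc123", new byte[] {1, 2, 3}, "wav", "Hello from the model");
        ChatResponse response = new ChatResponse(
                List.of(new Generation("Hello from the model", Optional.of(audio))));

        // A consumer can check for audio and read its format and transcript.
        response.generations().get(0).audio()
                .ifPresent(a -> System.out.println(a.format() + ":" + a.transcript()));
    }
}
```

One open design question this sketch surfaces: whether audio belongs on the `Generation` itself (as above) or in a more general media/content list, so future modalities (images, video) do not each require a new field.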