Expected Behavior

It should be possible to send an audio sample containing speech (in a supported audio format) to the OpenAI transcription endpoint and retrieve the recognized speech as text.
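For illustration, here is a minimal sketch of what such a call looks like against the raw OpenAI endpoint (`POST /v1/audio/transcriptions`, a multipart/form-data request with a `file` part and a `model` field). The class and method names are ours, not part of any Spring AI API; the snippet only sends the request when an `OPENAI_API_KEY` and a local `sample.mp3` are present.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TranscriptionSketch {

    // Build a multipart/form-data body with a `model` field and a `file` part,
    // the two required inputs of OpenAI's transcription endpoint.
    static byte[] multipartBody(String boundary, String fileName, byte[] audio, String model)
            throws IOException {
        var out = new ByteArrayOutputStream();
        var sep = ("--" + boundary + "\r\n").getBytes(StandardCharsets.UTF_8);

        out.write(sep);
        out.write(("Content-Disposition: form-data; name=\"model\"\r\n\r\n" + model + "\r\n")
                .getBytes(StandardCharsets.UTF_8));
        out.write(sep);
        out.write(("Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/octet-stream\r\n\r\n").getBytes(StandardCharsets.UTF_8));
        out.write(audio);
        out.write(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Path audioFile = Path.of("sample.mp3"); // any supported format: mp3, wav, m4a, ...
        if (!Files.exists(audioFile) || System.getenv("OPENAI_API_KEY") == null) {
            System.out.println("Set OPENAI_API_KEY and provide sample.mp3 to run the request.");
            return;
        }

        String boundary = "sketch-boundary";
        byte[] body = multipartBody(boundary, audioFile.getFileName().toString(),
                Files.readAllBytes(audioFile), "whisper-1");

        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("https://api.openai.com/v1/audio/transcriptions"))
                .header("Authorization", "Bearer " + System.getenv("OPENAI_API_KEY"))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON payload with a "text" field
    }
}
```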

Current Behavior

Speech to text is currently not supported by the OpenAI integration, so users must fall back to custom implementations. See: https://github.com/thomasdarimont/quadropole-welcome-hero/blob/main/src/main/java/com/welcomehero/app/openai/OpenAiFacade.java#L34

Context

The motivating use case for this was to build an AI-augmented information kiosk for hospitals, enabling visitors, patients, and non-native staff to ask questions about the hospital environment with natural speech, based on a controlled knowledge base. See: https://github.com/thomasdarimont/quadropole-welcome-hero

Greetings to the team :)

Comment From: markpollack

Hi, thanks, glad our paths cross again. There is a long list of model types that we should eventually cover, to say nothing of whatever happens with multi-modal support within a single model: audio, video, voice, code (perhaps something more specific than text), images... See https://github.com/spring-projects/spring-ai/issues/144 for example.

I plan to put in a sort of 'sandbox' module for people to contribute clients for these various models, so we can see what abstractions and commonalities come out of it. Your sample code looks like it could be extracted into a VoiceClient or SpeechClient class. Once I have this area set up, perhaps you can extract the essence of what you did and make a PR?
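A sketch of what such an extracted abstraction might look like. All names here (SpeechClient, TranscriptionRequest, TranscriptionResponse) are hypothetical, invented for this discussion, and not part of Spring AI:

```java
import java.util.function.Function;

// Hypothetical client abstraction for speech-to-text models; the type names
// are illustrative only and do not exist in Spring AI.
public interface SpeechClient {

    // The raw audio bytes plus minimal metadata a provider would need.
    record TranscriptionRequest(byte[] audio, String format, String language) {}

    record TranscriptionResponse(String text) {}

    TranscriptionResponse transcribe(TranscriptionRequest request);

    // Convenience factory: wrap any audio->text function as a SpeechClient,
    // which makes the contract easy to stub in tests.
    static SpeechClient of(Function<TranscriptionRequest, String> fn) {
        return request -> new TranscriptionResponse(fn.apply(request));
    }
}
```

Keeping the interface provider-agnostic (bytes in, text out) would let an OpenAI-backed implementation live alongside clients for other transcription models.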

Comment From: ENate

Hi. I would like to contribute to this project (Spring AI) and am looking forward to a good first issue. Greetings, and great work from the team.

Comment From: michaellavelle

Hi - I've created a PR to add support for audio transcriptions here: https://github.com/spring-projects/spring-ai/pull/300

Comment From: markpollack

Closing as it has been merged. Docs will follow shortly.