Expected Behavior
It should be possible to send an audio sample containing speech (in a supported audio format) to the OpenAI transcription endpoint and retrieve the recognized speech as text.
Current Behavior
Speech to text is currently not supported by the OpenAI integration, so users must fall back to custom implementations. See: https://github.com/thomasdarimont/quadropole-welcome-hero/blob/main/src/main/java/com/welcomehero/app/openai/OpenAiFacade.java#L34
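For illustration, a custom fallback along the lines of the linked facade might call OpenAI's POST /v1/audio/transcriptions endpoint directly with a multipart/form-data request. The sketch below uses only the JDK's built-in HttpClient; the class and method names are hypothetical, while the endpoint URL, the "model" and "file" form fields, and the "whisper-1" model name come from the OpenAI API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical minimal client for OpenAI's audio transcription endpoint.
public class WhisperTranscriber {

    // Builds a multipart/form-data body with the "model" and "file" parts
    // expected by /v1/audio/transcriptions.
    static byte[] buildMultipartBody(String boundary, String model,
                                     String filename, byte[] audio) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        String partHeaders =
            "--" + boundary + "\r\n"
            + "Content-Disposition: form-data; name=\"model\"\r\n\r\n"
            + model + "\r\n"
            + "--" + boundary + "\r\n"
            + "Content-Disposition: form-data; name=\"file\"; filename=\"" + filename + "\"\r\n"
            + "Content-Type: application/octet-stream\r\n\r\n";
        out.write(partHeaders.getBytes(StandardCharsets.UTF_8));
        out.write(audio);
        out.write(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    // Sends the audio file and returns the raw JSON response,
    // which contains the transcription under the "text" field.
    public static String transcribe(Path audioFile, String apiKey) throws Exception {
        String boundary = "boundary-" + System.nanoTime();
        byte[] body = buildMultipartBody(boundary, "whisper-1",
            audioFile.getFileName().toString(), Files.readAllBytes(audioFile));
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://api.openai.com/v1/audio/transcriptions"))
            .header("Authorization", "Bearer " + apiKey)
            .header("Content-Type", "multipart/form-data; boundary=" + boundary)
            .POST(HttpRequest.BodyPublishers.ofByteArray(body))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```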
Context
The motivating use case for this was to build an AI-augmented information kiosk for hospitals, enabling visitors, patients, and non-native staff to ask questions about the hospital environment using natural speech, backed by a controlled knowledge base. See: https://github.com/thomasdarimont/quadropole-welcome-hero
Greetings to the team :)
Comment From: markpollack
Hi, thanks, glad our paths cross again. There is a long list of model types that we should eventually cover, let alone whatever happens with multi-modal support within a single model: Audio, Video, Voice, Code (perhaps something more specific than text), Images... See https://github.com/spring-projects/spring-ai/issues/144 for example.
I plan to put in a sort of 'sandbox' module for people to contribute clients for these various models, and see what abstractions and commonalities come out of it. Your sample code looks like it can be extracted into a VoiceClient or SpeechClient class. Once I have this area set up, perhaps you can extract the essence of what you did and make a PR?
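To illustrate the kind of abstraction being discussed, a minimal provider-neutral interface might look like the sketch below. All names here are hypothetical, not an actual Spring AI API; the stand-in implementation exists only to show the shape, where a real OpenAI-backed client would call the transcription endpoint instead.

```java
import java.util.Objects;

// Hypothetical sketch: a minimal SpeechClient abstraction that any
// provider-specific transcription client could implement.
interface SpeechClient {
    // Transcribes raw audio bytes (in a supported format) to text.
    String transcribe(byte[] audio);
}

// Illustrative stand-in implementation; a real OpenAI-backed client would
// send the audio to the OpenAI transcription endpoint and return its text.
class EchoSpeechClient implements SpeechClient {
    @Override
    public String transcribe(byte[] audio) {
        Objects.requireNonNull(audio, "audio must not be null");
        return "transcribed " + audio.length + " bytes";
    }
}
```

Callers would then depend only on the SpeechClient interface, keeping provider details (OpenAI, or another speech service) swappable behind it.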
Comment From: ENate
Hi. I would like to contribute to this project (Spring AI) and look forward to a good first issue. Greetings and great work to the team.
Comment From: michaellavelle
Hi - I've created a PR to add support for audio transcriptions here: https://github.com/spring-projects/spring-ai/pull/300
Comment From: markpollack
Closing as it has been merged. Docs will follow shortly.