Expected Behavior

Spring AI should support speech-to-text functionality.

Current Behavior

Spring AI supports OpenAI Text-to-Speech (TTS).

Context

We are building an application that converts speech to text, with audio input coming from a website or mobile device. For reference, see cognitive-services-speech-sdk.

Comment From: ThomasVitale

Hi @radhakrishna67. Spring AI provides APIs for Audio use cases, including OpenAI implementations for text-to-speech (https://docs.spring.io/spring-ai/reference/api/audio/speech/openai-speech.html) and speech-to-text (https://docs.spring.io/spring-ai/reference/api/audio/transcriptions/openai-transcriptions.html).

Does that solve your problem?

Comment From: radhakrishna67

@ThomasVitale I forgot to mention Azure cognitive-services-speech and Google Speech-to-Text support.

Comment From: csterwa

@ThomasVitale does Spring AI support the Azure and Google services that @radhakrishna67 mentioned?

Comment From: ThomasVitale

Spring AI has support for LLM-based text-to-speech (OpenAI) and speech-to-text (OpenAI and Azure OpenAI). To complete the picture, I think it would make sense to have a feature request to implement support for text-to-speech for Azure OpenAI as well.

Azure provides speech services other than Azure OpenAI. Those are not supported by Spring AI. @radhakrishna67 what is the name of the specific service/API you'd like to integrate your app with? A link to the service documentation would also help, thanks.

About Google, same question: what's the specific service/API you're interested in?

Spring AI provides integrations with Google Vertex (which is about to be removed from the project since Google deprecated the service) and Google Gemini. As far as I know, Gemini itself doesn't provide speech-related capabilities.

Comment From: ddobrin

@ThomasVitale: The PaLM2 models are being deprecated by Google, and the recommendation is to switch to Gemini models: https://ai.google.dev/palm_docs/palm

PaLM2 support is to be removed from Spring AI.

Comment From: ddobrin

Please see these two examples of transcribing audio and video data using multimodality, which Vertex supports directly with Gemini models: https://github.com/ddobrin/gemini-workshop-for-spring-ai-java-developers/blob/main/src/main/java/gemini/workshop/MultimodalAudioExample.java

https://github.com/ddobrin/gemini-workshop-for-spring-ai-java-developers/blob/main/src/main/java/gemini/workshop/MultimodalVideoExample.java