The current Media API provides a generic way of adding multimedia content to a prompt when calling models with multimodality support.

So far, the Media API has been used for images. Starting with https://github.com/spring-projects/spring-ai/issues/1560, it's also used for audio files. This works correctly, but two points could be improved:

  1. MimeTypeUtils from Spring Framework doesn't define constants for any audio MIME types. Developers therefore have to parse one explicitly, for example MimeTypeUtils.parseMimeType("audio/mp3"). Perhaps we can introduce an audio-specific utility in Spring AI?

  2. When media content is extracted from the Spring AI UserMessage and mapped to a provider-specific API (such as OpenAI), there's no straightforward way to filter it by whether it's image or audio content. For now, audio is supported only in the OpenAI integration, where each media item is checked individually (see: https://github.com/spring-projects/spring-ai/pull/1561). That code also has the additional challenge of mapping the MIME type to an OpenAI-specific enum. There might be room for streamlining this logic.
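
To make the two points above more concrete, here's a rough, JDK-only sketch of the shape such helpers could take. Everything in it is hypothetical: AudioMediaSketch, MediaItem (a stand-in for Spring AI's Media), ProviderAudioFormat (a stand-in for a provider-specific enum), and the AUDIO_MP3 / AUDIO_WAV constants are names invented for illustration, not existing Spring AI or Spring Framework APIs. The subtype-to-format table is also an assumption.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: none of these types exist in Spring AI or Spring Framework.
final class AudioMediaSketch {

    // Point 1: audio constants, analogous to the image constants in MimeTypeUtils.
    static final String AUDIO_MP3 = "audio/mp3";
    static final String AUDIO_WAV = "audio/wav";

    // Stand-in for Spring AI's Media, reduced to what this sketch needs.
    record MediaItem(String mimeType, Object data) {

        // "audio" for "audio/mp3"; lower-cased so comparisons are case-insensitive.
        String topLevelType() {
            int slash = mimeType.indexOf('/');
            String type = slash < 0 ? mimeType : mimeType.substring(0, slash);
            return type.trim().toLowerCase(Locale.ROOT);
        }

        // "mp3" for "audio/mp3; rate=16000"; parameters after ';' are ignored.
        String subtype() {
            int slash = mimeType.indexOf('/');
            if (slash < 0) {
                return "";
            }
            String rest = mimeType.substring(slash + 1);
            int semicolon = rest.indexOf(';');
            String subtype = semicolon < 0 ? rest : rest.substring(0, semicolon);
            return subtype.trim().toLowerCase(Locale.ROOT);
        }
    }

    // Stand-in for a provider-specific enum (assumed values).
    enum ProviderAudioFormat { MP3, WAV }

    // Assumed mapping from MIME subtype to provider format.
    private static final Map<String, ProviderAudioFormat> FORMATS_BY_SUBTYPE = Map.of(
            "mp3", ProviderAudioFormat.MP3,
            "mpeg", ProviderAudioFormat.MP3,
            "wav", ProviderAudioFormat.WAV,
            "x-wav", ProviderAudioFormat.WAV);

    // Point 2: filter media by top-level type ("image", "audio", ...) in one place.
    static List<MediaItem> filterByType(List<MediaItem> media, String topLevelType) {
        String wanted = topLevelType.toLowerCase(Locale.ROOT);
        return media.stream()
                .filter(item -> item.topLevelType().equals(wanted))
                .toList();
    }

    // Empty for non-audio media and for audio subtypes with no known mapping.
    static Optional<ProviderAudioFormat> toProviderFormat(MediaItem item) {
        if (!item.topLevelType().equals("audio")) {
            return Optional.empty();
        }
        return Optional.ofNullable(FORMATS_BY_SUBTYPE.get(item.subtype()));
    }
}
```

In a real implementation the string parsing would presumably go away, since Spring's MimeType already exposes getType() and getSubtype(). The point is only that the type filter and the MIME-to-enum mapping could each live in one shared place instead of being repeated per provider integration.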