The current VectorStore interface is lacking in two important areas:

  • Does not support search types other than similarity search
  • Does not provide full API coverage for administrative cases, e.g. delete and update operations

Comment From: maxhov

In its current form it also assumes that the content is next to the embeddings. Does it make sense to separate these two out? One could keep the actual data on disk, while keeping the (embedding) vector in a different place (e.g. Postgres).

Comment From: mp911de

> In its current form it also assumes that the content is next to the embeddings. Does it make sense to separate these two out? One could keep the actual data on disk, while keeping the (embedding) vector in a different place (e.g. Postgres).

Such a case could be handled by using delegation or better, a domain object that describes the file path (an identifier, actual path, or some other form of resource locator).

Comment From: mp911de

I started exploring a domain-bound VectorStore abstraction, mostly related to lifecycle use cases (insert, update, delete). The overlap between AI and data is where we want to associate data with some embedding and where we want to retrieve data (by similarity, returning some way to access the content).

Most databases provide some form of batching for inserting data, and a few also offer batching for updating/deleting. It makes sense to distinguish updating from inserting.

Going a step further, a lot of applications have some sort of data already that they want to associate with embeddings. Using a domain-object-oriented approach makes much more sense than trying to pour everything through Document, especially since the additional metadata is the actual data that is relevant for other cases.

A typical application would have batch orchestration for initial data embedding while transactional updates target individual objects. I assume that Spring AI isn't the best place to provide production-ready batching, instead some orchestrator like Spring Batch might provide something.

It would be good to have a component that can associate objects with embeddings (T Embedder.embed(T), List<T> Embedder.embedAll(List<T>)) potentially with different strategies (a single vector field, a vector field per model identifier, a data row per model identifier). So separating VectorStore from the actual EmbeddingModel would be a good thing.
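The Embedder idea above could be sketched roughly as follows. This is a minimal, hypothetical design, not actual Spring AI API: `Embedder`, `SingleFieldEmbedder`, and the function-based model stand-in are all illustrative, showing how the "single vector field" strategy could be one pluggable implementation among several.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical sketch: an Embedder associates domain objects with embeddings,
// keeping the VectorStore itself free of any EmbeddingModel dependency.
interface Embedder<T> {
    T embed(T object);                 // enrich a single object with its embedding
    List<T> embedAll(List<T> objects); // batch variant, mapping to the model's batch API
}

// One possible strategy: a single vector field computed from extracted text.
class SingleFieldEmbedder<T> implements Embedder<T> {
    private final Function<T, String> textExtractor;
    private final Function<String, float[]> model; // stand-in for an embedding model call
    private final BiFunction<T, float[], T> vectorSetter;

    SingleFieldEmbedder(Function<T, String> textExtractor,
                        Function<String, float[]> model,
                        BiFunction<T, float[], T> vectorSetter) {
        this.textExtractor = textExtractor;
        this.model = model;
        this.vectorSetter = vectorSetter;
    }

    public T embed(T object) {
        return vectorSetter.apply(object, model.apply(textExtractor.apply(object)));
    }

    public List<T> embedAll(List<T> objects) {
        return objects.stream().map(this::embed).toList();
    }
}
```

Other strategies (a vector field per model identifier, a data row per model identifier) would be further implementations of the same interface.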

Another aspect is that there should be a way to translate domain objects into the representation that is needed for RAG (maybe a DomainVectorStoreRetriever<T> accepting Function<T, Document> to glue components into their current form).
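A rough sketch of what such a glue component could look like, under the assumption that a domain-aware store exposes a similarity search returning domain objects. The `Document` record here is a simplified stand-in for Spring AI's `Document` class, and `DomainVectorStoreRetriever` is the hypothetical name from the comment above, not an existing type:

```java
import java.util.List;
import java.util.function.Function;

// Simplified stand-in for Spring AI's Document; the real class carries metadata too.
record Document(String content) {}

// Hypothetical adapter: retrieves domain objects by similarity and converts
// them into Documents so they can flow into the existing RAG machinery.
class DomainVectorStoreRetriever<T> {
    private final Function<String, List<T>> similaritySearch; // delegate to a domain-aware store
    private final Function<T, Document> mapper;               // glue: domain object -> Document

    DomainVectorStoreRetriever(Function<String, List<T>> similaritySearch,
                               Function<T, Document> mapper) {
        this.similaritySearch = similaritySearch;
        this.mapper = mapper;
    }

    List<Document> retrieve(String query) {
        return similaritySearch.apply(query).stream().map(mapper).toList();
    }
}
```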

Keeping the lifecycle separate, supporting it with a tiny abstraction that isn't overly prominent for getting-started-quickly seems a neat balance.

I toyed around with an initial domain-based design at https://github.com/mp911de/spring-ai/commit/c58f6ee9aec598a7aa689e7134cc8c13de14b893 that explores the domain part. From the lifecycle side, it feels much more natural and avoids duplicating model-metadata of the "other component" that deals with the non-AI part of data.

Comment From: maxhov

> In its current form it also assumes that the content is next to the embeddings. Does it make sense to separate these two out? One could keep the actual data on disk, while keeping the (embedding) vector in a different place (e.g. Postgres).
>
> Such a case could be handled by using delegation or better, a domain object that describes the file path (an identifier, actual path, or some other form of resource locator).

After my comment I indeed explored that idea and implemented it. It makes a lot of sense to extend already existing domain entities with embeddings and just treat it as yet another way of retrieving data from the store.

> Another aspect is that there should be a way to translate domain objects into the representation that is needed for RAG (maybe a DomainVectorStoreRetriever<T> accepting Function<T, Document> to glue components into their current form).

I've had an experience that might be worth sharing here. We are processing documents that exceed the 8191-token limit of the current embedding models, so the data cannot be embedded in its entirety. The only way I see is to split the documents into chunks of at most ~8000 tokens and embed each chunk, which yields one vector per chunk. However, a chunk (quite literally) tells only part of the story, and you may miss context when supplying only the matching chunks to the LLM via RAG. So I was playing with the idea of having multiple embedding vectors refer to the same object, such that when any of the chunk vectors match, you retrieve the entire document and use that for RAG. I think that supports the idea of having a DomainVectorStoreRetriever<T>.
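The chunk-to-parent idea could be sketched as below. Everything here is illustrative (the class name, the in-memory index, the cosine helper): each chunk vector keeps a reference to its parent document, and retrieval deduplicates matching chunks back to whole documents before handing them to the LLM.

```java
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical sketch: multiple chunk vectors refer to one parent document;
// a similarity hit on any chunk resolves to the full document.
class ChunkedRetriever {
    record Chunk(String parentId, float[] vector) {}

    private final List<Chunk> chunks = new ArrayList<>();
    private final Map<String, String> documentsById = new HashMap<>();

    void index(String docId, String fullText, List<float[]> chunkVectors) {
        documentsById.put(docId, fullText);
        chunkVectors.forEach(v -> chunks.add(new Chunk(docId, v)));
    }

    // Return the *full* documents whose chunks best match the query,
    // deduplicated by parent id (LinkedHashSet keeps similarity order).
    List<String> retrieve(float[] query, int topChunks) {
        return chunks.stream()
                .sorted(Comparator.comparingDouble((Chunk c) -> -cosine(c.vector(), query)))
                .limit(topChunks)
                .map(Chunk::parentId)
                .collect(Collectors.toCollection(LinkedHashSet::new))
                .stream()
                .map(documentsById::get)
                .toList();
    }

    private static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```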

Comment From: youngmoneee

Regarding the design of VectorStore, I have the following opinions:

  1. The current ETL pipeline is implemented synchronously, which feels unnatural and lacks the essence of a “pipeline.” If the design is revised, I suggest considering a “stream”-based implementation. This approach would allow data to flow through the pipeline while transforming, making it more intuitive. An example of such an interface is discussed in #1253.

  2. Currently, VectorStore has an unnecessary dependency on the embedding model, resulting in boilerplate code in methods like doAdd. This creates challenges during refactoring processes, such as in getEmbedding, as highlighted in #1239. By accepting DocumentWithEmbedding—a document transformed via a stream—rather than calculating embeddings internally, VectorStore’s dependency on the embedding model can be eliminated.
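Point 2 above could be sketched as follows. These types are hypothetical illustrations of the proposal (the name `DocumentWithEmbedding` comes from the comment; `EmbeddingFreeVectorStore` and `EmbeddingStage` are made up here): the embedding is computed upstream as a pipeline stage, so the store only ever sees already-embedded documents and needs no embedding-model dependency.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical: a document that already carries its vector, produced upstream.
record DocumentWithEmbedding(String id, String content, float[] embedding) {}

// Hypothetical: a store that accepts pre-embedded documents, eliminating
// doAdd-style boilerplate and the dependency on an EmbeddingModel.
interface EmbeddingFreeVectorStore {
    void add(List<DocumentWithEmbedding> documents);
}

// The pipeline stage that attaches embeddings; any model can be plugged in.
class EmbeddingStage {
    private final Function<String, float[]> model; // stand-in for an embedding model call

    EmbeddingStage(Function<String, float[]> model) {
        this.model = model;
    }

    DocumentWithEmbedding embed(String id, String content) {
        return new DocumentWithEmbedding(id, content, model.apply(content));
    }
}
```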

Comment From: markpollack

The current scope of this effort, in order to go GA, is a refinement of the current API, not a major change like sync/async. We can revisit larger design considerations post GA. I always appreciate your input @youngmoneee.

  1. Delete operation (with a Filter Expression). The use case: upload v2 of a doc, then delete v1 of the doc based on its metadata.

post GA we can add similar for upsert.

  2. Similar to the Spring cache abstraction, have a getNativeClient method that returns the underlying vector store API class, so users always have an escape hatch to client-specific functionality.
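The escape-hatch idea could look roughly like the sketch below, modeled on `Cache.getNativeCache()` from Spring's cache abstraction. The interface shape and the `PgVectorStore` example are assumptions for illustration, not the actual Spring AI API:

```java
import java.util.Optional;

// Hypothetical portable interface: stays small, but lets users unwrap the
// native client when they need store-specific functionality.
interface NativeClientAware {
    <T> Optional<T> getNativeClient(Class<T> clientType);
}

// Example: a Postgres/pgvector-backed store exposing its lower-level client.
class PgVectorStore implements NativeClientAware {
    private final Object jdbcTemplate; // stand-in for the real JdbcTemplate

    PgVectorStore(Object jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public <T> Optional<T> getNativeClient(Class<T> clientType) {
        // Only unwrap when the requested type matches; empty otherwise.
        return clientType.isInstance(jdbcTemplate)
                ? Optional.of(clientType.cast(jdbcTemplate))
                : Optional.empty();
    }
}
```

Returning `Optional` (rather than throwing) keeps the portable code path safe when a store cannot satisfy the requested client type.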