Spring-ai Avoid duplicate entries in VectorStore(s) by allowing the Document ID to be generated from the hashed document content.

Currently, a Document that is not given an explicit ID generates a random UUID. A new ID is generated every time, even if the document content and metadata haven't changed, so adding the same document again duplicates its content in the vector store.

To prevent this unnecessary duplication, we can allow the Document ID to be generated from a hash of the document content and metadata.

The following snippet is inspired by the langchain4j vector store implementations.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

....

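// Derives a deterministic ID: the same input string always yields the same UUID.
public static String generateIdFrom(String contentWithMetadata) {
    try {
        // Hash the UTF-8 bytes of the combined content and metadata.
        byte[] hashBytes = MessageDigest.getInstance("SHA-256").digest(contentWithMetadata.getBytes(StandardCharsets.UTF_8));
        // Hex-encode the digest, two lowercase hex characters per byte.
        StringBuilder sb = new StringBuilder();
        for (byte b : hashBytes) {
            sb.append(String.format("%02x", b));
        }
        // Fold the hex string into a name-based (version 3) UUID.
        return UUID.nameUUIDFromBytes(sb.toString().getBytes(StandardCharsets.UTF_8)).toString();
    }
    catch (NoSuchAlgorithmException e) {
        throw new IllegalArgumentException(e);
    }
}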
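
To illustrate the intended effect, here is a minimal, self-contained sketch that reuses the method above. The `serialize` helper and the way content and metadata are joined into one string are assumptions made for this example (the snippet above does not say how `contentWithMetadata` is built). Sorting the metadata keys is one possible way to keep the ID independent of map iteration order.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

class ContentHashIdDemo {

    // Same logic as the snippet above: SHA-256 of the input, folded into a name-based UUID.
    static String generateIdFrom(String contentWithMetadata) {
        try {
            byte[] hashBytes = MessageDigest.getInstance("SHA-256")
                .digest(contentWithMetadata.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : hashBytes) {
                sb.append(String.format("%02x", b));
            }
            return UUID.nameUUIDFromBytes(sb.toString().getBytes(StandardCharsets.UTF_8)).toString();
        }
        catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Hypothetical helper (not part of the proposal): copies the metadata into a
    // TreeMap so that keys are sorted before the map is rendered as a string.
    static String serialize(String content, Map<String, Object> metadata) {
        return content + new TreeMap<>(metadata);
    }

    public static void main(String[] args) {
        Map<String, Object> meta1 = Map.of("source", "a.txt", "page", 1);
        Map<String, Object> meta2 = Map.of("page", 1, "source", "a.txt");

        String id1 = generateIdFrom(serialize("Hello", meta1));
        String id2 = generateIdFrom(serialize("Hello", meta2));
        String id3 = generateIdFrom(serialize("Hello, world", meta1));

        // Same content and metadata: the IDs match, so a store can recognize the repeat.
        System.out.println("same input, same id: " + id1.equals(id2));
        // Changed content: the ID changes as well.
        System.out.println("changed content, same id: " + id1.equals(id3));
    }
}
```

Whether a repeated ID results in an overwrite or an error depends on the individual vector store, so the sketch only shows that the ID itself is stable.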

Comment From: markpollack

Currently we have

    public Document(String content, Map<String, Object> metadata) {
        this(UUID.randomUUID().toString(), content, metadata);
    }

Perhaps we can add a strategy interface that can optionally be passed in, with an implementation based on the snippet listed above.

    public Document(String content, Map<String, Object> metadata, IdGenerator idGenerator) {

...
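
A rough sketch of how such a strategy could look is below. Only the `IdGenerator` name and the three-argument constructor shape come from this comment. The `generateId` signature, the two implementations, and the `SketchDocument` stand-in class are hypothetical and are not the actual Spring AI API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

// Hypothetical strategy interface; the method name and parameters are assumptions.
interface IdGenerator {
    String generateId(String content, Map<String, Object> metadata);
}

// Keeps the current behavior: a fresh random UUID on every call.
class RandomIdGenerator implements IdGenerator {
    @Override
    public String generateId(String content, Map<String, Object> metadata) {
        return UUID.randomUUID().toString();
    }
}

// Content-based IDs, following the SHA-256 approach from the issue description.
class ContentHashIdGenerator implements IdGenerator {
    @Override
    public String generateId(String content, Map<String, Object> metadata) {
        // Assumption: metadata is rendered with sorted keys so ordering does not matter.
        String input = content + new TreeMap<>(metadata);
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256")
                .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return UUID.nameUUIDFromBytes(hex.toString().getBytes(StandardCharsets.UTF_8)).toString();
        }
        catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException(e);
        }
    }
}

// Simplified stand-in for Document, used only to show where the strategy plugs in.
class SketchDocument {
    final String id;
    final String content;
    final Map<String, Object> metadata;

    // Default constructor keeps today's random-UUID behavior.
    SketchDocument(String content, Map<String, Object> metadata) {
        this(content, metadata, new RandomIdGenerator());
    }

    // Callers opt in to a different ID strategy by passing it explicitly.
    SketchDocument(String content, Map<String, Object> metadata, IdGenerator idGenerator) {
        this.id = idGenerator.generateId(content, metadata);
        this.content = content;
        this.metadata = metadata;
    }
}
```

With this shape, `new SketchDocument(content, metadata, new ContentHashIdGenerator())` would produce the same ID for identical content and metadata, while the two-argument constructor would behave as it does today.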

Comment From: nurlicht

I have just created a PR for this feature (my first PR here): https://github.com/spring-projects/spring-ai/pull/272