Currently, a Document that is not given an explicit ID generates a random UUID for itself.
Even if the document content and metadata haven't changed, a new ID is generated every time.
This leads to duplicated document content in the vector store.
To prevent this unnecessary duplication, we can allow the Document ID to be generated from a hash of the document content and metadata.
The following snippet is inspired by the langchain4j vector store implementations.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;
....
public static String generateIdFrom(String contentWithMetadata) {
    try {
        // Hash the combined content and metadata with SHA-256.
        byte[] hashBytes = MessageDigest.getInstance("SHA-256")
            .digest(contentWithMetadata.getBytes(StandardCharsets.UTF_8));
        // Hex-encode the digest.
        StringBuilder sb = new StringBuilder();
        for (byte b : hashBytes) {
            sb.append(String.format("%02x", b));
        }
        // Derive a deterministic, name-based UUID from the hex string.
        return UUID.nameUUIDFromBytes(sb.toString().getBytes(StandardCharsets.UTF_8)).toString();
    }
    catch (NoSuchAlgorithmException e) {
        throw new IllegalArgumentException(e);
    }
}
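The snippet takes a single contentWithMetadata string but does not say how content and metadata are combined. A minimal sketch of one option follows, assuming metadata values have a stable toString(). The class name ContentIds and the helper canonicalize are hypothetical and not part of Spring AI or langchain4j. Sorting the metadata keys keeps the resulting ID independent of map iteration order.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

// Hypothetical holder class, for illustration only.
final class ContentIds {

    private ContentIds() {
    }

    // Hypothetical helper: joins content and metadata into one string.
    // TreeMap sorts the keys, so equal maps always produce the same text.
    static String canonicalize(String content, Map<String, Object> metadata) {
        return content + "\n" + new TreeMap<>(metadata);
    }

    // Same logic as the snippet above: SHA-256 hex digest, then a name-based UUID.
    static String generateIdFrom(String contentWithMetadata) {
        try {
            byte[] hashBytes = MessageDigest.getInstance("SHA-256")
                .digest(contentWithMetadata.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : hashBytes) {
                sb.append(String.format("%02x", b));
            }
            return UUID.nameUUIDFromBytes(sb.toString().getBytes(StandardCharsets.UTF_8)).toString();
        }
        catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With this, ContentIds.generateIdFrom(ContentIds.canonicalize(content, metadata)) returns the same ID for the same content and metadata regardless of the order in which the metadata entries were added, and a different ID as soon as either one changes.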
Comment From: markpollack
Currently we have
public Document(String content, Map<String, Object> metadata) {
    this(UUID.randomUUID().toString(), content, metadata);
}
Perhaps we can add a strategy interface that can optionally be passed to the constructor, with an implementation based on the snippet listed above.
public Document(String content, Map<String, Object> metadata, IdGenerator idGenerator) {
...
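A rough sketch of how such a strategy could look is below. Only the name IdGenerator and the three-argument constructor come from the proposal above. The generateId method and its signature, the RandomIdGenerator and ContentHashIdGenerator classes, and the simplified Document stand-in are all assumptions made for illustration. To keep it short, the content-based variant calls UUID.nameUUIDFromBytes directly (an MD5-based, version 3 UUID) instead of hashing with SHA-256 first.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

// Strategy interface; the method name and signature are assumptions.
interface IdGenerator {
    String generateId(String content, Map<String, Object> metadata);
}

// Mirrors the current behavior: a fresh random UUID on every call.
final class RandomIdGenerator implements IdGenerator {
    @Override
    public String generateId(String content, Map<String, Object> metadata) {
        return UUID.randomUUID().toString();
    }
}

// Derives the ID from content and metadata, so identical input yields an identical ID.
final class ContentHashIdGenerator implements IdGenerator {
    @Override
    public String generateId(String content, Map<String, Object> metadata) {
        // Sort metadata keys so the ID does not depend on map iteration order.
        String canonical = content + "\n" + new TreeMap<>(metadata);
        return UUID.nameUUIDFromBytes(canonical.getBytes(StandardCharsets.UTF_8)).toString();
    }
}

// Simplified stand-in for the real Document class, for illustration only.
final class Document {
    private final String id;
    private final String content;
    private final Map<String, Object> metadata;

    // Existing behavior stays the default: random IDs.
    Document(String content, Map<String, Object> metadata) {
        this(content, metadata, new RandomIdGenerator());
    }

    // Proposed overload: the caller chooses how the ID is produced.
    Document(String content, Map<String, Object> metadata, IdGenerator idGenerator) {
        this(idGenerator.generateId(content, metadata), content, metadata);
    }

    Document(String id, String content, Map<String, Object> metadata) {
        this.id = id;
        this.content = content;
        this.metadata = metadata;
    }

    String getId() {
        return id;
    }
}
```

Two calls to new Document(content, metadata, new ContentHashIdGenerator()) with equal arguments would then share an ID, while the two-argument constructor keeps producing random ones, so existing callers are unaffected.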
Comment From: nurlicht
I have just created a PR for this feature (my first PR here): https://github.com/spring-projects/spring-ai/pull/272