Bug description: When a document is given to TokenTextSplitter, the resulting documents after the split have a different id than the input. In a pipeline scenario this is completely unexpected.

Environment: Kotlin + Java 21, Spring Boot 3.2.1, Spring AI 1.0.0-M1 (also on 1.0.0-SNAPSHOT)

Steps to reproduce: Create a document with a specific id and pass it to TokenTextSplitter; the returned document will have a different id.

Expected behavior: The document id should not change in a TextSplitter, or in any DocumentTransformer for that matter, unless explicitly requested.

Minimal, complete, reproducible example:

val input = Document("123", "content", emptyMap())
val output = TokenTextSplitter().split(input).first()
assert(input.id == output.id)

Comment From: Devansh-Rastogi

@hamedmonji, I think it's fair to have separate ids for the split documents, as your input document will be split into multiple output documents and it won't be good to keep the same id for each output document. However, do you think an additional field mapping the output id to the input id would be good?

Comment From: hamedmonji

@Devansh-Rastogi I don't see any reason why the outputs can't have the same ids. If anything, having the same ids further implies that the documents are sub-parts of the original document. It is not really an issue of losing the id, as in many cases the use case would be something like

documents.map { doc -> TokenTextSplitter().split(doc) }

so I can easily put the ids back in. But to me it was strange that the ids changed; it was unexpected, and in fact I only noticed it while debugging an issue. I expect TokenTextSplitter to only split a document into multiple documents and do nothing else, which is why I don't think we can say it's fair to have separate ids: it really is not within the boundary of the TokenTextSplitter to be concerned with whether the id is unique, and therefore it should preserve the id as given.
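A minimal sketch of what "putting the ids back in" looks like. The `Doc` record and the `split` method are simplified stand-ins for Spring AI's `Document` and `TokenTextSplitter` (which are not reproduced here), used only to show the id-copying step:

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Simplified stand-in for Spring AI's Document (hypothetical; the real class has more fields).
record Doc(String id, String content, Map<String, Object> metadata) {}

public class RestoreIds {

    // Stand-in splitter that, like TokenTextSplitter today, assigns fresh random ids to chunks.
    static List<Doc> split(Doc doc) {
        int mid = doc.content().length() / 2;
        return List.of(
            new Doc(UUID.randomUUID().toString(), doc.content().substring(0, mid), doc.metadata()),
            new Doc(UUID.randomUUID().toString(), doc.content().substring(mid), doc.metadata()));
    }

    // Split, then copy the parent id back onto every chunk.
    static List<Doc> splitKeepingId(Doc doc) {
        return split(doc).stream()
            .map(chunk -> new Doc(doc.id(), chunk.content(), chunk.metadata()))
            .toList();
    }
}
```

The workaround is easy, but the point of the issue is that the caller shouldn't need it.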

As a side note, I think defaults like this can become the root of somewhat hard-to-find bugs. Another issue I faced: after splitting my document with TokenTextSplitter, with the chunk size set to the maximum length allowed for OpenAI's ada-002 model, I would get an error saying I had exceeded the maximum size for the model. After debugging, I realized that EmbeddingModel by default includes the metadata as well as the document content for embedding, which caused the model size overflow.

Comment From: Devansh-Rastogi

@hamedmonji Cool, makes total sense. I have actually found the issue: the TokenTextSplitter doesn't even look at the id; it creates a fresh set of documents, and that's why the ids are different.

Comment From: novakma2

Honestly, I can also see another way. I ran into the same issue as @hamedmonji. Maybe predictable id generation, just appending a sub-document index: i.e., a document with id id0 that is split into three documents would yield ids id00, id01, id02, or some similar formatting.
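A sketch of that deterministic scheme (the `-` separator is my own choice for readability; nothing like this exists in Spring AI today):

```java
import java.util.ArrayList;
import java.util.List;

public class DeterministicIds {

    // Derive predictable chunk ids from the parent id and the chunk index.
    static List<String> childIds(String parentId, int chunkCount) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < chunkCount; i++) {
            ids.add(parentId + "-" + i); // "-" separator is an arbitrary choice
        }
        return ids;
    }
}
```

With this, knowing the parent id is enough to compute every chunk id without querying the store.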

Comment From: markpollack

My perspective is that when you split a large document, for example the text of an entire PDF, you get new document instances, so they would have new ids. For tracking purposes, the new documents can have a reference back to the original document from which they came if you want to somehow link them back together. I don't quite follow the logic of having sequential IDs or giving them structure.

As for the surprise of having metadata added: it is true that splitting up documents is a bit of an art, and metadata is crucial so that you can show the user what information was used in a RAG scenario to supply an answer. Not having metadata means all the entries in the vector store are impossible to distinguish from each other; for example, one can't use the query languages (and our portable filter expression syntax) to help the search by picking a subset of all the entries.

In short, I'd propose keeping the linkage and making the docs clearer about what metadata will be added and how to customize it.
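A sketch of what that linkage could look like from the caller's side today, copying the parent id into each chunk's metadata. The `Chunk` record is a simplified stand-in for Spring AI's `Document`, and the `parent_document_id` key is an arbitrary name, not a Spring AI convention:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal stand-in for Spring AI's Document.
record Chunk(String id, String content, Map<String, Object> metadata) {}

public class LinkChunks {

    // Give each chunk a metadata reference back to the document it was split from.
    static List<Chunk> withParentRef(String parentId, List<Chunk> chunks) {
        return chunks.stream().map(chunk -> {
            Map<String, Object> meta = new HashMap<>(chunk.metadata());
            meta.put("parent_document_id", parentId);
            return new Chunk(chunk.id(), chunk.content(), meta);
        }).toList();
    }
}
```

Because the reference lives in metadata, it survives further transformations and can be targeted by filter expressions when searching or deleting.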

Comment From: hamedmonji

To me, the id is what "semantically" identifies a document. If I had a PDF document, as you said, and transformed it into, let's say, a CSV document, having the document keep the same id is what makes this new CSV document semantically the same document as the PDF, even though the format or even the content may have changed.

However, I can also see your perspective. As I said, this was an expectation from my perspective as a user; it could be that it is just me, and other users' mental model expected the id to change.

Either way, I do think it is good to have something, if not the original id, that allows recognizing a document even after many transformations.

As for the metadata, I did not mean that metadata should not be included. I meant that the EmbeddingModel includes a document's metadata in the call to the embedding service by default, which, again from my perspective, was not obvious. A document has a content field, which in my opinion strongly implies that it is what is sent to the embedding service.

embeddingModel.embed(document)

I don't think this expresses that the metadata of that document will also be sent to the embedding service (and metadata could contain more private data).

My point for this part was that I think this inclusion of metadata in the EmbeddingModel's embedding process should be opt-in rather than opt-out.

(I may have used the word semantically wrong, not sure)
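A sketch of the opt-in shape being suggested: the caller chooses whether metadata is appended to the text sent to the embedding service, with content-only as the default. The types here are simplified stand-ins, not the actual Spring AI EmbeddingModel API (Spring AI does have a MetadataMode concept in this area; whether it can be configured per EmbeddingModel is worth checking in the docs):

```java
import java.util.Map;
import java.util.stream.Collectors;

public class EmbeddingInput {

    // Build the text to send to the embedding service; metadata is included
    // only when the caller explicitly opts in.
    static String embeddingText(String content, Map<String, String> metadata, boolean includeMetadata) {
        if (!includeMetadata || metadata.isEmpty()) {
            return content;
        }
        String meta = metadata.entrySet().stream()
            .map(e -> e.getKey() + ": " + e.getValue())
            .collect(Collectors.joining("\n"));
        return meta + "\n\n" + content;
    }
}
```

With `includeMetadata` defaulting to false at the call site, the content field alone determines the token count, which would have avoided the ada-002 overflow described above.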

Comment From: novakma2

The way I see it, the problem is simply the inability to know the next id deterministically. I quite like @markpollack's approach of referencing back to the original document. I would personally even provide an enum in the constructor so we can choose random or sequential ids, or some sort of factory to pass in.

Comment From: eric-eldard

Agree with @markpollack, for both the RAG identification reason and because (per the Spring AI PGvector setup guide) my documents' ids are also their unique PK. I have other identifiers that go into metadata to aggregate docs with the same source.

{"type":"WEB_CONTENT","source":"https://url.of/scraped/article"}

If knowing the pre-split doc is actually important, then I can imagine the caller adding something like:

{"type":"DOCUMENT","source":"ORIGINAL-DOC-UUID"}

I'm wondering if folks want the split chunks to share the same id because it's how they perform updates, since the id is easy to query against. Example use case: I've re-scraped a URL and want to discard any docs from the previous scrape. What would smooth this out is the ability to easily query against the metadata. Here's my hack, which I'd love to have a framework replacement for:

public List<Document> metadataExactSearch(Filter.Expression filterExpression)
{
    return jdbcTemplate.query(
        "SELECT *, 0 AS distance FROM vector_store WHERE metadata::jsonb @@ ?::jsonpath;",
        makeRowMapper(), // don't ask about what gross gymnastics went into getting the private inner class PgVectorStore.DocumentRowMapper
        expressionConverter.convertExpression(filterExpression)
    );
}

I don't see the case for knowing the doc order, but if it's really needed, the caller can always add that into metadata too. The splitter isn't parallelized, and the resulting docs come out in an ordered list.

{"type":"DOCUMENT","source":"ORIGINAL-DOC-UUID","chunk":1}