When trying to load a large PDF with `TikaDocumentReader`, I get this exception:

org.apache.tika.exception.WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
It seems that it is possible to overcome this limit by setting the write limit on Tika's `BodyContentHandler`; setting it to -1 means no limit. See https://tika.apache.org/1.4/api/org/apache/tika/sax/BodyContentHandler.html#BodyContentHandler%28int%29

Unfortunately, this setting isn't exposed through `TikaDocumentReader`, so it can't be set by an application developer using Spring AI. In fact, the `BodyContentHandler` used in `TikaDocumentReader` is created directly at https://github.com/spring-projects/spring-ai/blob/main/document-readers/tika-reader/src/main/java/org/springframework/ai/reader/tika/TikaDocumentReader.java#L119 and can't be changed.
Exposing configuration on `TikaDocumentReader` to set the write limit would address this problem. (And there may be other settings that could or should be exposed for developers to fine-tune how `TikaDocumentReader` works.)
Comment From: habuma
Okay, I see now that I can construct a `TikaDocumentReader` with a `ContentHandler`, for which a `BodyContentHandler` could be given. I'm going to try that and (assuming it works) will close this issue.
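For the record, this is roughly what I plan to try. It's a sketch, not verified against the actual `TikaDocumentReader` constructor signature (the three-argument form with a `ContentHandler` and an `ExtractedTextFormatter` is my reading of the source, and `large-document.pdf` is a placeholder):

```java
import java.util.List;

import org.apache.tika.sax.BodyContentHandler;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.FileSystemResource;

public class UnlimitedTikaRead {

    public static void main(String[] args) {
        // A write limit of -1 tells Tika's BodyContentHandler to impose
        // no character limit, avoiding the WriteLimitReachedException.
        BodyContentHandler handler = new BodyContentHandler(-1);

        TikaDocumentReader reader = new TikaDocumentReader(
                new FileSystemResource("large-document.pdf"), // placeholder path
                handler,
                ExtractedTextFormatter.defaults());

        List<Document> documents = reader.get();
        System.out.println("Documents read: " + documents.size());
    }
}
```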
Comment From: habuma
Followup: It does work, in the sense that if I give it a `BodyContentHandler` with a write limit of -1, it succeeds in loading the entire large PDF without failing. But then it fails because it loads the entire PDF into a single `Document` object (without splitting it up into multiple document nodes), which is too large to send to the embedding API. It seems that the 100000-character limit was helping avoid an oversized `Document` problem, while limiting the size of the PDFs that `TikaDocumentReader` can load.
In light of this new understanding, I'm going to close this issue and think over how to overcome the new problem I've encountered.
(Bear in mind that `TikaDocumentReader` is much more flexible with regard to PDF formatting than the other PDF document readers, and it is also able to read multiple types of documents...not just PDF. So it would be nice if it could handle larger PDF documents better, by splitting them into several document nodes small enough to be sent to the embedding API.)
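For anyone hitting the same wall: one workaround I'm considering is to pair the unlimited read with a document splitter, so the single oversized `Document` is chunked before embedding. A sketch, assuming Spring AI's `TokenTextSplitter` and the same hypothetical three-argument `TikaDocumentReader` constructor as above (signatures from memory, may not match exactly):

```java
import java.util.List;

import org.apache.tika.sax.BodyContentHandler;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.core.io.FileSystemResource;

public class ReadAndSplit {

    public static void main(String[] args) {
        // Read the whole PDF with no write limit...
        TikaDocumentReader reader = new TikaDocumentReader(
                new FileSystemResource("large-document.pdf"), // placeholder path
                new BodyContentHandler(-1),
                ExtractedTextFormatter.defaults());

        // ...then split the single large Document into token-sized chunks
        // small enough for the embedding API before handing them on
        // (e.g. to a VectorStore).
        TokenTextSplitter splitter = new TokenTextSplitter();
        List<Document> chunks = splitter.apply(reader.get());

        System.out.println("Chunks produced: " + chunks.size());
    }
}
```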