When trying to load a large PDF with `TikaDocumentReader`, I get this exception:

org.apache.tika.exception.WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
It seems that it is possible to overcome this limit by setting the write limit on Tika's `BodyContentHandler`; setting it to -1 means no limit. See https://tika.apache.org/1.4/api/org/apache/tika/sax/BodyContentHandler.html#BodyContentHandler%28int%29

Unfortunately, this setting isn't exposed through `TikaDocumentReader`, so it can't be set by an application developer using Spring AI. In fact, the `BodyContentHandler` used in `TikaDocumentReader` is created directly at https://github.com/spring-projects/spring-ai/blob/main/document-readers/tika-reader/src/main/java/org/springframework/ai/reader/tika/TikaDocumentReader.java#L119 and can't be changed.
Exposing configuration on `TikaDocumentReader` to set the write limit would address this problem. (And there may be other settings that could or should be exposed for developers to fine-tune how `TikaDocumentReader` works.)
Comment From: habuma
Okay, I see now that I can construct a `TikaDocumentReader` with a `ContentHandler`, for which a `BodyContentHandler` could be given. I'm going to try that and (assuming it works) will close this issue.
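For the record, this is roughly what I plan to try. It's a sketch, not verified against the actual `TikaDocumentReader` constructor signature (the three-argument form with a `ContentHandler` and an `ExtractedTextFormatter` is my reading of the source, and `large-document.pdf` is a placeholder):

```java
import java.util.List;

import org.apache.tika.sax.BodyContentHandler;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.FileSystemResource;

public class UnlimitedTikaRead {

    public static void main(String[] args) {
        // A write limit of -1 tells Tika's BodyContentHandler to impose
        // no character limit, avoiding the WriteLimitReachedException.
        BodyContentHandler handler = new BodyContentHandler(-1);

        TikaDocumentReader reader = new TikaDocumentReader(
                new FileSystemResource("large-document.pdf"), // placeholder path
                handler,
                ExtractedTextFormatter.defaults());

        List<Document> documents = reader.get();
        System.out.println("Documents read: " + documents.size());
    }
}
```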
Comment From: habuma
Followup: It does work, in the sense that if I give it a `BodyContentHandler` with a write limit of -1, it succeeds in loading the entire large PDF without failing. But then it fails because it loads the entire PDF into a single `Document` object (without splitting it up into multiple document nodes), which is too large to send to the embedding API. It seems that the 100000-character limit was helping avoid an oversized `Document` problem, while limiting the size of the PDFs that `TikaDocumentReader` can load.
In light of this new understanding, I'm going to close this issue and think over how to overcome the new problem I've encountered.
(Bear in mind that `TikaDocumentReader` is much more flexible with regard to PDF formatting than the other PDF document readers, and it is also able to read multiple types of documents...not just PDF. So it would be nice if it could handle larger PDF documents better, by splitting them into several document nodes small enough to be sent to the embedding API.)
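For anyone hitting the same wall: one workaround I'm considering is to pair the unlimited read with a document splitter, so the single oversized `Document` is chunked before embedding. A sketch, assuming Spring AI's `TokenTextSplitter` and the same hypothetical three-argument `TikaDocumentReader` constructor as above (signatures from memory, may not match exactly):

```java
import java.util.List;

import org.apache.tika.sax.BodyContentHandler;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.core.io.FileSystemResource;

public class ReadAndSplit {

    public static void main(String[] args) {
        // Read the whole PDF with no write limit...
        TikaDocumentReader reader = new TikaDocumentReader(
                new FileSystemResource("large-document.pdf"), // placeholder path
                new BodyContentHandler(-1),
                ExtractedTextFormatter.defaults());

        // ...then split the single large Document into token-sized chunks
        // small enough for the embedding API before handing them on
        // (e.g. to a VectorStore).
        TokenTextSplitter splitter = new TokenTextSplitter();
        List<Document> chunks = splitter.apply(reader.get());

        System.out.println("Chunks produced: " + chunks.size());
    }
}
```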