This makes it possible for TikaDocumentReader to read large PDFs.
Comment From: habuma
Added a simple setter for chunk size, still defaulting to 800 (a rough sketch of that shape appears after the list below). I considered other options:
- Yet another constructor argument. I decided against this because TikaDocumentReader already has a lot of constructors and, generally, I favor constructors for things that must be set and setters for things that can optionally be set (as is the case here).
- A separate configuration class to carry this property (and potentially other properties to configure the reader). I decided against this because, at the moment, this is the only property to be set.
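For illustration only, the setter-based approach might look something like the following; the field and method names here are hypothetical and may not match the actual code in this PR:

// Hypothetical sketch of the setter-based approach; names may differ from the PR.
public class TikaDocumentReader {

    // Default chunk size stays at 800 but can now be overridden.
    private int chunkSize = 800;

    public void setChunkSize(int chunkSize) {
        this.chunkSize = chunkSize;
    }
}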
I still kinda think that TextSplitter's splitText() method could be public. If it were, then TikaDocumentReader could be injected with a TextSplitter and simply call splitText() without having to concern itself with the chunk size. Chunk size would be solely a TextSplitter concern. It'd also open up the possibility of other TextSplitter implementations being used. In short, I'd likely do the following (a rough sketch follows the list):
- Make TextSplitter's splitText() method public.
- Change TokenTextSplitter to not hardcode 800 as the chunk size and allow it to be configured (while maybe still defaulting to 800).
- Either create a no-op implementation of TextSplitter as the default TextSplitter that TikaDocumentReader uses, or simply use the extracted text as-is if no TextSplitter has been specified for TikaDocumentReader.
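Roughly, and using simplified class outlines rather than the real Spring AI types (the NoOpTextSplitter name and the naive splitting logic below are illustrative only):

import java.util.ArrayList;
import java.util.List;

// Simplified sketch, not the actual Spring AI classes.
abstract class TextSplitter {
    // Proposed change: splitText() is public, so a reader can delegate chunking to it.
    public abstract List<String> splitText(String text);
}

class TokenTextSplitter extends TextSplitter {

    private int chunkSize = 800; // still defaults to 800, but now configurable

    public void setChunkSize(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    @Override
    public List<String> splitText(String text) {
        // Real token-based splitting omitted; naive fixed-width split for illustration.
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < text.length(); i += chunkSize) {
            chunks.add(text.substring(i, Math.min(text.length(), i + chunkSize)));
        }
        return chunks;
    }
}

// Hypothetical no-op default: hands the extracted text back unchanged.
class NoOpTextSplitter extends TextSplitter {
    @Override
    public List<String> splitText(String text) {
        return List.of(text);
    }
}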
This was a bit removed from my goal of making TikaDocumentReader capable of handling large PDFs, and it would be a more involved change that I'd rather get feedback on before applying it. Instead, I chose the more direct approach in this PR. But I'll be happy to alter this PR (or close it and open a new one) with the TextSplitter changes described above if you think that would be a good change.
Comment From: markpollack
I did a bit of research and found this: https://stackoverflow.com/questions/31079433/how-to-read-large-files-using-tika
It seems like changing
public TikaDocumentReader(Resource resource, ExtractedTextFormatter textFormatter) {
    this(resource, new BodyContentHandler(), textFormatter);
}
to
public TikaDocumentReader(Resource resource, ExtractedTextFormatter textFormatter) {
    this(resource, new BodyContentHandler(-1), textFormatter);
}
will remove the limitation you encountered. I think getting the full text in the TikaDocumentReader and then chaining the output of TikaDocumentReader into the input of TokenTextSplitter gives full control. If a user wants to enforce a limit inside Tika, then passing in a BodyContentHandler instance configured with that limit would achieve that goal.
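For example, a caller-supplied limit could look roughly like this; the three-argument constructor is the one the snippet above delegates to, while the package names and the 1,000,000-character limit are my assumptions:

import org.apache.tika.sax.BodyContentHandler;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.Resource;

class LimitedReaderExample {

    // Sketch: cap the extracted text via Tika's writeLimit instead of a reader property.
    TikaDocumentReader readerWithLimit(Resource resource, ExtractedTextFormatter textFormatter) {
        BodyContentHandler handler = new BodyContentHandler(1_000_000); // assumed limit
        return new TikaDocumentReader(resource, handler, textFormatter);
    }
}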
Chaining functions in the "ETL Pipeline", rather than making TokenTextSplitter a property of TikaDocumentReader, gives the most flexibility in the spirit of a Java-based functional pipeline. The same approach applies to anyone using any other implementation of DocumentReader.
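As a sketch of that composition (the get() and apply() calls reflect my reading of the DocumentReader/DocumentTransformer contracts at the time and should be double-checked against the current API):

import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.core.io.Resource;

class EtlPipelineExample {

    // Read the full document first, then chunk it as a separate pipeline step.
    List<Document> readAndSplit(Resource pdf) {
        TikaDocumentReader reader = new TikaDocumentReader(pdf);
        List<Document> documents = reader.get();   // DocumentReader acts as a Supplier of documents

        TokenTextSplitter splitter = new TokenTextSplitter();
        return splitter.apply(documents);          // splitter consumes and returns a document list
    }
}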
Comment From: markpollack
@habuma I'd like to close this; please read my last comment.
Comment From: markpollack
Closing, as adding a text splitter to the functional composition chain achieves the request.