This makes it possible for TikaDocumentReader to read large PDFs.

Comment From: habuma

Added a simple setter for chunk size, still defaulting to 800 (see the sketch after the list below). I considered other options...

  • Yet another constructor argument. I decided against this because TikaDocumentReader already has a lot of constructors, and generally I favor constructors for things that must be set and setters for things that can optionally be set (as is the case here).
  • A separate configuration class to carry this property (and potentially other properties for configuring the reader). I decided against this because at the moment this is the only property to be set.
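For illustration only, a minimal sketch of the setter approach; the class, field, and method names here are hypothetical and may not match the actual names in this PR:

    // Hypothetical sketch of the setter approach; names are illustrative only.
    class ChunkingReaderSketch {

        // Chunk size used when splitting the extracted text; defaults to 800.
        private int chunkSize = 800;

        // Optional setting, so a setter rather than yet another constructor argument.
        public void setChunkSize(int chunkSize) {
            this.chunkSize = chunkSize;
        }
    }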

I still kinda think that TextSplitter's splitText() method could be public. If it were, then TikaDocumentReader could be injected with a TextSplitter and simply call splitText() without having to concern itself with the chunk size. Chunk size would be solely a TextSplitter concern. It'd also open up the possibility of using other TextSplitter implementations. In short... I'd likely do the following (roughly sketched after the list):

  • Make TextSplitter's splitText() method public.
  • Change TokenTextSplitter to not hardcode 800 as the chunk size and allow it to be configured. (While maybe still defaulting to 800.)
  • Either create a no-op implementation of TextSplitter as the default TextSplitter that TikaDocumentReader uses or simply use the extracted text as-is if no TextSplitter has been specified for TikaDocumentReader.
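A rough sketch of how the reader side could look under those changes. The Splitter interface below is only a stand-in for a TextSplitter whose splitText() has been made public (it is not the real Spring AI type), and the setter name is an assumption:

    import java.util.List;

    // Stand-in for a TextSplitter whose splitText(String) has been made public;
    // not the real Spring AI type.
    interface Splitter {
        List<String> splitText(String text);
    }

    class ReaderWithSplitterSketch {

        // Null means no splitter was configured; the extracted text is used as-is.
        private Splitter textSplitter;

        public void setTextSplitter(Splitter textSplitter) {
            this.textSplitter = textSplitter;
        }

        // Chunking becomes the splitter's concern; the reader no longer needs
        // a chunk-size property of its own.
        List<String> chunk(String extractedText) {
            return (textSplitter == null)
                    ? List.of(extractedText)
                    : textSplitter.splitText(extractedText);
        }
    }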

This was a bit removed from my goal of making TikaDocumentReader capable of handling large PDFs, and it would be a more involved change that I'd rather get feedback on before applying it. Instead, I chose the more direct approach in this PR. But I'll be happy to alter this PR (or close it and open a new one) with the TextSplitter changes described above if you think that'd be a good change.

Comment From: markpollack

I did a bit of research and found this https://stackoverflow.com/questions/31079433/how-to-read-large-files-using-tika

It seems like changing

    public TikaDocumentReader(Resource resource, ExtractedTextFormatter textFormatter) {
        this(resource, new BodyContentHandler(), textFormatter);
    }

to

    public TikaDocumentReader(Resource resource, ExtractedTextFormatter textFormatter) {
        this(resource, new BodyContentHandler(-1), textFormatter);
    }

will remove the limitation you encountered. I think getting the full text in the TikaDocumentReader and then composing/chaining the output of TikaDocumentReader into the input of TokenTextSplitter gives full control. If a user wants to have a limit inside Tika, then passing in a properly configured BodyContentHandler instance would achieve that goal.

Chaining functions in the "ETL Pipeline", rather than having TokenTextSplitter as a property of TikaDocumentReader, gives the most flexibility, in the spirit of a Java-based functional pipeline. The same approach applies to anyone using any other implementation of DocumentReader.
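For reference, a rough sketch of that composition, assuming the three-argument constructor the diff above delegates to, current Spring AI package names, and the plain Supplier/Function contracts (get()/apply()); the ExtractedTextFormatter.defaults() helper and exact signatures may differ:

    import java.util.List;

    import org.apache.tika.sax.BodyContentHandler;
    import org.springframework.ai.document.Document;
    import org.springframework.ai.reader.ExtractedTextFormatter;
    import org.springframework.ai.reader.tika.TikaDocumentReader;
    import org.springframework.ai.transformer.splitter.TokenTextSplitter;
    import org.springframework.core.io.Resource;

    class LargePdfPipeline {

        List<Document> read(Resource pdf) {
            // BodyContentHandler(-1) removes Tika's write limit, so the reader
            // returns the full extracted text.
            var reader = new TikaDocumentReader(pdf, new BodyContentHandler(-1),
                    ExtractedTextFormatter.defaults());

            // Chain the reader's output into the splitter: chunking stays a
            // TokenTextSplitter concern rather than a reader concern.
            return new TokenTextSplitter().apply(reader.get());
        }
    }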

Comment From: markpollack

@habuma I'd like to close this; please read my last comment.

Comment From: markpollack

Closing, as adding a text splitter to the functional composition chain achieves the request.