This pull request introduces a new PdfLoader class to the project. The PdfLoader class is responsible for loading PDF documents and converting them into a list of Document objects. This is achieved by using the Apache PDFBox library to read the PDF file and extract the text.
Comment From: tzolov
Thank you for your contribution @nikitsenka,
The PDF parsing/extraction presents various configuration options and I've been looking for an "optimal" reader implementation that can benefit from the internal structure of the PDF documents.
For example, I would assume that preserving the PDF text formatting helps the LLM better detect and interpret any tabular information in the document.
The default PDFTextStripper doesn't preserve the format. There are implementations, like PDFLayoutTextStripper, that try to do this.
Also, I wonder if stripping the page footer/headers from the text would improve the LLM responses.
There is a PDFTextStripperByArea that allows cutting off the page margins before text extraction.
Alternatively, this can be achieved by altering the extracted text directly.
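To make the "alter the extracted text directly" option concrete, here is a minimal sketch in plain Java. It assumes the text has already been extracted one string per page, and that headers and footers occupy a fixed number of lines at the top and bottom of each page. The class name `PageTextTrimmer`, the method `trim`, and its parameters are hypothetical, made up for illustration; they are not part of PDFBox or of this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox or Spring AI.
class PageTextTrimmer {

    /**
     * Drops the first headerLines and the last footerLines lines from each page's text.
     * Assumption: headers and footers take a fixed number of lines on every page.
     */
    static List<String> trim(List<String> pages, int headerLines, int footerLines) {
        List<String> result = new ArrayList<>();
        for (String page : pages) {
            // "\\R" matches any line break (\n, \r\n, ...).
            List<String> lines = Arrays.asList(page.split("\\R"));
            // Clamp both bounds so short pages yield empty text instead of an exception.
            int from = Math.min(headerLines, lines.size());
            int to = Math.max(from, lines.size() - footerLines);
            result.add(String.join("\n", lines.subList(from, to)));
        }
        return result;
    }
}
```

Calling `PageTextTrimmer.trim(pages, 1, 1)` on a page whose text is a header line, two body lines, and a page-number line would keep only the two body lines. Whether a fixed line count is good enough for real documents is exactly the kind of thing that would need evaluation.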
Another important (IMO) area is semantically (pre)splitting the documents, e.g. using knowledge of the PDF's internal structure.
For example, splitting the PDF by paragraphs (e.g. PDOutlineNode). Unfortunately, not all PDFs come with a TOC.
Also, even if the paragraphs are present, there is no guarantee that they always have an in-page position reference (e.g. PDPageXYZDestination) ...
So I've been playing around with all those ideas, and here is the current work: https://github.com/tzolov/spring-ai/tree/pdf-loader-2/spring-ai-core/src/main/java/org/springframework/ai/reader/pdf
I'm afraid that until we build LLM evaluation utilities, it will be difficult to objectively justify any of the ideas.
What is your opinion on this?
Comment From: nikitsenka
Preserving the PDF text formatting can indeed be beneficial, especially when dealing with tabular information. However, it requires more parameters to be provided from the client side. In real life, I mostly need to load a lot of different PDF documents without knowing their structure and contents. In addition to tabular information, PDF documents can contain various types of metadata: document structure, fonts and colors, hyperlinks, images and multimedia. By analyzing the document beforehand, we can gather information such as the presence of tabular data, scanned images or non-searchable text, the types of content (text, images, etc.), and other metadata.
My intention was to start with a common approach that applies to all types of documents and move forward with improvements without losing ease of use.
Comment From: markpollack
Hi. I totally agree with all the challenges you mention regarding 'just throwing a bunch of PDFs at the code and having it be smart enough to figure it all out'. I'm not sure how we get there, as PDFs vary so much in their content and structure. The code you have here was done while other PDF functionality was in flight at the same time. This PR doesn't go beyond what is now in the code base, so I will close it. If there are any ideas on how to make the current implementations in the code base more robust, to achieve things like autodetecting columns, tables, etc., let's chat; contributions are always welcome. Thanks for submitting the PR, and I hope to see more.