Bug description ParagraphPdfDocumentReader throws a NullPointerException when reading sample1.pdf: https://github.com/spring-projects-experimental/spring-ai/blob/main/document-readers/pdf-reader/src/test/resources/sample1.pdf
Environment Spring Boot version: 3.1.4; Spring AI version: 0.7.0-SNAPSHOT; Java version: openjdk version "17.0.2" 2022-01-18
Steps to reproduce Add the spring-ai-pdf-document-reader dependency, version 0.7.0-SNAPSHOT, to pom.xml:
`
<dependency>
<groupId>org.springframework.experimental.ai</groupId>
<artifactId>spring-ai-pdf-document-reader</artifactId>
<version>0.7.0-SNAPSHOT</version>
</dependency>
`
Code to read paragraphs: `
// Build the reader first; the stack trace shows the NPE is thrown from this constructor.
ParagraphPdfDocumentReader pdfReader = new ParagraphPdfDocumentReader(
"file:\\C:\\Users\\test\\sample1.pdf",
PdfDocumentReaderConfig.builder()
.build());
var documents = pdfReader.get();
for (Document document : documents) {
System.out.println(document.getContent());
}
`
Expected behavior The reader should return each paragraph of sample1.pdf.
Exception `
java.lang.NullPointerException: Cannot invoke "org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode.getFirstChild()" because "bookmark" is null
at org.springframework.ai.reader.pdf.config.ParagraphManager.generateParagraphs(ParagraphManager.java:131) ~[spring-ai-pdf-document-reader-0.7.0-20231019.142632-5.jar:0.7.0-SNAPSHOT]
at org.springframework.ai.reader.pdf.config.ParagraphManager.<init>(ParagraphManager.java:82) ~[spring-ai-pdf-document-reader-0.7.0-20231019.142632-5.jar:0.7.0-SNAPSHOT]
at org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader.<init>(ParagraphPdfDocumentReader.java:109) ~[spring-ai-pdf-document-reader-0.7.0-20231019.142632-5.jar:0.7.0-SNAPSHOT]
at org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader.<init>(ParagraphPdfDocumentReader.java:92) ~[spring-ai-pdf-document-reader-0.7.0-20231019.142632-5.jar:0.7.0-SNAPSHOT]
`
Comment From: markpollack
Thanks for reporting this. Parsing PDFs is a challenge, and ParagraphPdfDocumentReader cannot be used on every PDF; when it does not work, one should look for another strategy to parse the PDF. ParagraphPdfDocumentReader relies on a PDF object called 'outline'. The other options in Spring AI are PagePdfDocumentReader and TikaDocumentReader. I would also suggest looking into https://developer.adobe.com/document-services/apis/pdf-extract/
All that said, there should be no NPE. This will be fixed by adding a check on the result of this.document.getDocumentCatalog().getDocumentOutline(); if it returns null, the reader will indicate that ParagraphPdfDocumentReader cannot be used for the provided PDF since it contains no document outline.
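A minimal sketch of the kind of guard described above, not the actual fix. To keep it compilable without PDFBox on the classpath, `Outline` and `Catalog` are hypothetical stand-ins for PDFBox's outline and catalog objects, and `OutlineGuard`, `requireOutline`, and the choice of `IllegalArgumentException` are assumptions made for illustration.

```java
// Sketch only: Outline and Catalog are hypothetical stand-ins for the PDFBox
// outline and catalog types, so the example compiles without PDFBox.
final class OutlineGuard {

    /** Stand-in for a PDF document outline (the bookmark tree). */
    static final class Outline {
        final String title;
        Outline(String title) { this.title = title; }
    }

    /** Stand-in for the document catalog; the outline is null when the PDF has none. */
    static final class Catalog {
        private final Outline outline;
        Catalog(Outline outline) { this.outline = outline; }
        Outline getDocumentOutline() { return outline; }
    }

    /** Returns the outline, or fails with a descriptive message instead of a later NPE. */
    static Outline requireOutline(Catalog catalog) {
        Outline outline = catalog.getDocumentOutline();
        if (outline == null) {
            // Assumed exception type; the real fix may report this differently.
            throw new IllegalArgumentException(
                "ParagraphPdfDocumentReader cannot be used: the PDF contains no document outline");
        }
        return outline;
    }
}
```

With a guard like this, a PDF without an outline fails fast with a message that points the caller toward PagePdfDocumentReader or TikaDocumentReader, rather than surfacing as a NullPointerException deep inside ParagraphManager.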