Bug description ParagraphPdfDocumentReader throws a NullPointerException when reading sample1.pdf: https://github.com/spring-projects-experimental/spring-ai/blob/main/document-readers/pdf-reader/src/test/resources/sample1.pdf

Environment Spring Boot version: 3.1.4 Spring AI version: 0.7.0-SNAPSHOT Java version: openjdk version "17.0.2" 2022-01-18

Steps to reproduce Add the spring-ai-pdf-document-reader dependency, version 0.7.0-SNAPSHOT, to pom.xml

`

    <dependency>
        <groupId>org.springframework.experimental.ai</groupId>
        <artifactId>spring-ai-pdf-document-reader</artifactId>
        <version>0.7.0-SNAPSHOT</version>
    </dependency>

`

Code to read paragraphs: `

    ParagraphPdfDocumentReader pdfReader = new ParagraphPdfDocumentReader(
            "file:\\C:\\Users\\test\\sample1.pdf",
            PdfDocumentReaderConfig.builder()
                    .build());

    var documents = pdfReader.get();

    for (Document document : documents) {
        System.out.println(document.getContent());
    }

`

Expected behavior The reader should return each paragraph of sample1.pdf.

Exception `

  java.lang.NullPointerException: Cannot invoke "org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode.getFirstChild()" because "bookmark" is null
at org.springframework.ai.reader.pdf.config.ParagraphManager.generateParagraphs(ParagraphManager.java:131) ~[spring-ai-pdf-document-reader-0.7.0-20231019.142632-5.jar:0.7.0-SNAPSHOT]
at org.springframework.ai.reader.pdf.config.ParagraphManager.<init>(ParagraphManager.java:82) ~[spring-ai-pdf-document-reader-0.7.0-20231019.142632-5.jar:0.7.0-SNAPSHOT]
at org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader.<init>(ParagraphPdfDocumentReader.java:109) ~[spring-ai-pdf-document-reader-0.7.0-20231019.142632-5.jar:0.7.0-SNAPSHOT]
at org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader.<init>(ParagraphPdfDocumentReader.java:92) ~[spring-ai-pdf-document-reader-0.7.0-20231019.142632-5.jar:0.7.0-SNAPSHOT]

`

Comment From: markpollack

Thanks for reporting this. Parsing PDFs is challenging, and ParagraphPdfDocumentReader will not work on every PDF; when it does not, you need another parsing strategy. ParagraphPdfDocumentReader relies on a PDF object called the 'outline' (the document's bookmarks), which not every PDF contains. The other options in Spring AI are PagePdfDocumentReader and TikaDocumentReader. I would also suggest looking into https://developer.adobe.com/document-services/apis/pdf-extract/

All that said, there should be no NPE. The fix is to add a null check on this.document.getDocumentCatalog().getDocumentOutline(); if it returns null, the reader will report that ParagraphPdfDocumentReader cannot be used for the provided PDF because it contains no document outline.
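
To make the intended check concrete, here is a minimal sketch of the guard, not the actual Spring AI change. So that it runs without PDFBox on the classpath, `Catalog` and `Outline` are hypothetical stand-ins for PDFBox's `PDDocumentCatalog` and `PDDocumentOutline`; the `OutlineGuard` class, the `requireOutline` method, the exception type, and the message text are all assumptions made for illustration.

```java
import java.util.List;

// Sketch only: Catalog and Outline are stand-ins for PDFBox's
// PDDocumentCatalog and PDDocumentOutline (assumption for illustration).
class OutlineGuard {

    /** Stand-in for the PDF outline (bookmark tree), reduced to a list of titles. */
    static final class Outline {
        final List<String> titles;

        Outline(List<String> titles) {
            this.titles = titles;
        }
    }

    /** Stand-in for the document catalog; the outline is null when the PDF has none. */
    static final class Catalog {
        private final Outline outline;

        Catalog(Outline outline) {
            this.outline = outline;
        }

        Outline getDocumentOutline() {
            return outline;
        }
    }

    /**
     * Returns the outline, or fails with a descriptive message instead of
     * letting a NullPointerException surface later during paragraph generation.
     */
    static Outline requireOutline(Catalog catalog, String resourceName) {
        Outline outline = catalog.getDocumentOutline();
        if (outline == null) {
            throw new IllegalArgumentException(
                    "Document outline (bookmarks) is null for " + resourceName
                            + "; a paragraph-based reader cannot be used for this PDF.");
        }
        return outline;
    }

    public static void main(String[] args) {
        // A PDF with an outline passes the guard.
        Outline ok = requireOutline(new Catalog(new Outline(List.of("Chapter 1"))), "with-outline.pdf");
        System.out.println("Outline entries: " + ok.titles.size());

        // A PDF without an outline fails fast with a clear message.
        try {
            requireOutline(new Catalog(null), "sample1.pdf");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a guard like this in place, calling code can catch the descriptive exception and switch to PagePdfDocumentReader or TikaDocumentReader for PDFs that have no outline.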