
Bug description When using the class PagePdfDocumentReader to process PDF files, there is a bug in org.springframework.ai.reader.pdf.layout.TextLine. The method getNextValidIndex decrements the index by one, which can produce a negative index: it calls isSpaceCharacterAtIndex(index - 1), which then leads to a call into StringLatin1 with index -1. This raises a StringIndexOutOfBoundsException. It happens with many PDF files.
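
To illustrate the failure mode in isolation, here is a minimal sketch. It assumes the check boils down to a `String.charAt` call on the line content; the class `NegativeIndexDemo` and its method are hypothetical stand-ins, not the actual TextLine code.

```java
// Hypothetical stand-in for the unguarded check; NOT the real TextLine source.
final class NegativeIndexDemo {

    // Assumed shape of the check: a direct charAt on the line content, no bounds check.
    static boolean isSpaceCharacterAtIndex(String line, int index) {
        return line.charAt(index) == ' ';
    }

    public static void main(String[] args) {
        String line = "abc";
        int index = 0;
        try {
            // Mirrors the pattern from the report: index - 1 with index == 0.
            isSpaceCharacterAtIndex(line, index - 1);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("Caught: " + e.getMessage());
        }
    }
}
```

Any caller that passes index 0 into such a decrement hits the exception, regardless of the PDF content.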

Environment Spring AI component: spring-ai-pdf-document-reader-1.0.0-20241106.071720, Java 21

Steps to reproduce

new PagePdfDocumentReader(new InputStreamResource(file.getInputStream()),
        PdfDocumentReaderConfig.builder()
                .withPageExtractedTextFormatter(ExtractedTextFormatter.builder().build())
                .withPagesPerDocument(1)
                .build())
        .get()

on the attached PDF file: service-manual-whirlpool-dishwashers.pdf

Expected behavior A proper bounds check on the index, of course :)

Minimal Complete Reproducible example See the attached code. Look at the isSpaceCharacterAtIndex method of the TextLine class: it needs a check for a negative index.
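
A possible shape for such a check, as a sketch only: the class `GuardedTextLine` is hypothetical, and treating an out-of-range index as "not a space" is an assumption about the desired behavior, not what the project necessarily chose.

```java
// Hypothetical sketch of a guarded check; NOT the actual fix in spring-ai.
final class GuardedTextLine {

    private final String line;

    GuardedTextLine(String line) {
        this.line = line;
    }

    // Assumption: an index outside [0, length) is treated as "not a space"
    // instead of being passed on to charAt.
    boolean isSpaceCharacterAtIndex(int index) {
        if (index < 0 || index >= line.length()) {
            return false;
        }
        return line.charAt(index) == ' ';
    }
}
```

With this guard, a call such as `new GuardedTextLine("a b").isSpaceCharacterAtIndex(-1)` returns false rather than throwing. Whether false is the right answer for callers like getNextValidIndex depends on how they use the result.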

Comment From: jazzm0

To reproduce, simply call the method getNextValidIndex(0, false): https://github.com/spring-projects/spring-ai/blob/main/document-readers/pdf-reader/src/main/java/org/springframework/ai/reader/pdf/layout/TextLine.java#L88

Comment From: jazzm0

I've created a PR showing the bug: https://github.com/spring-projects/spring-ai/pull/1690/files

Comment From: jazzm0

I also added a unit test showing the problematic code. Btw, TextLine has no test coverage at all.

Comment From: jazzm0

@ilayaperumalg Hi, I added tests and rewrote the class to be more efficient. The string methods used while changing the line content can be made much more performant. Please double check! I also fixed the index out of bounds issue.

Comment From: ilayaperumalg

@jazzm0 sure, thank you!