Bug description
When PagePdfDocumentReader is used to process PDF files, a bug surfaces in org.springframework.ai.reader.pdf.layout.TextLine. The method getNextValidIndex decrements the index by one, which can produce a negative index: isSpaceCharacterAtIndex(index - 1)
This then leads to a StringLatin1 call with index -1, which raises a StringIndexOutOfBoundsException. This happens with many PDF files.
Environment
Spring component version: spring-ai-pdf-document-reader-1.0.0-20241106.071720
Java version: 21
Steps to reproduce
new PagePdfDocumentReader(new InputStreamResource(file.getInputStream()),
        PdfDocumentReaderConfig.builder()
                .withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
                        .build())
                .withPagesPerDocument(1)
                .build()).get()
on the attached PDF file (service-manual-whirlpool-dishwashers.pdf).
Expected behavior
A proper boundary check, of course :)
Minimal Complete Reproducible example
See the attached code. Pay attention to the isSpaceCharacterAtIndex method of the TextLine class: it should check that the index is not negative before using it.
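To make the requested check concrete, here is a minimal standalone sketch. It is not the actual TextLine code: the class name SpaceCheckSketch, both method names, and the use of a plain String as the line buffer are assumptions made for illustration. Treating an out-of-range index as "not a space" is also an assumption about the desired behavior.

```java
public final class SpaceCheckSketch {

    private SpaceCheckSketch() {
    }

    // Unchecked lookup (hypothetical): mirrors the failure mode in the report.
    // A negative index makes String.charAt throw StringIndexOutOfBoundsException.
    static boolean isSpaceUnchecked(String line, int index) {
        return line.charAt(index) == ' ';
    }

    // Guarded lookup (hypothetical): any index outside the line is reported
    // as "not a space" instead of throwing.
    static boolean isSpaceChecked(String line, int index) {
        return index >= 0 && index < line.length() && line.charAt(index) == ' ';
    }

    public static void main(String[] args) {
        String line = "abc";
        int index = 0;

        // With index 0, index - 1 is -1, the case described in the report.
        System.out.println(isSpaceChecked(line, index - 1)); // prints "false"

        try {
            isSpaceUnchecked(line, index - 1);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unchecked lookup threw " + e.getClass().getSimpleName());
        }
    }
}
```

An alternative would be to guard at the call site, so that getNextValidIndex never passes index - 1 when index is 0. Which of the two fits TextLine better depends on how the method is used elsewhere in the class.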
Comment From: jazzm0
To reproduce, simply call getNextValidIndex(0, false): https://github.com/spring-projects/spring-ai/blob/main/document-readers/pdf-reader/src/main/java/org/springframework/ai/reader/pdf/layout/TextLine.java#L88
Comment From: jazzm0
I've created a PR showing the bug https://github.com/spring-projects/spring-ai/pull/1690/files
Comment From: jazzm0
I also added a unit test showing the problematic code. Btw, TextLine has no test coverage at all.
Comment From: jazzm0
@ilayaperumalg Hi, I added tests and rewrote the class to be more efficient. The string methods used while changing the line content can be made much more performant. Please double check! I also fixed the index out of bounds issue.
Comment From: ilayaperumalg
@jazzm0 sure, thank you!