The current version of Spring AI provides only JsonReader and TextReader.

Expected Behavior

More DocumentReader implementations should be available so that all kinds of documents (PDF, Word, HTML, ...) can be read.

Current Behavior

DocumentReader is implemented only for JSON and text content types.

Context

I am working on a document-querying use case with Azure OpenAI: loading all the documents and creating embeddings to store in Redis. So I request that the team implement DocumentLoader.

Comment From: markpollack

Yep, we need more implementations. I'll start a sort of 'priority'-ordered list here:

  • [x] PDF
  • [ ] markdown (GH)
  • [x] text

Also, note that we can use pandoc to convert several formats into Markdown, which can be a very effective way to support them. For example, the MediaWiki format (used by Wikipedia) is notoriously difficult to parse because there is no spec: the implementation is the spec!
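
To make the pandoc idea concrete, here is a minimal sketch of shelling out to pandoc from Java and capturing the Markdown it writes to standard output. The class and method names (PandocMarkdownConverter, buildCommand, convertToMarkdown) are hypothetical and not part of Spring AI, and the sketch assumes the pandoc executable is installed and on the PATH. The resulting string could then be handed to whatever Markdown reader is eventually implemented.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of Spring AI: converts a file to Markdown via the pandoc CLI.
public class PandocMarkdownConverter {

    // Builds the pandoc command line. Kept separate so it can be inspected without running pandoc.
    // "gfm" is pandoc's GitHub-Flavored Markdown output format.
    static List<String> buildCommand(Path input, String fromFormat) {
        return List.of("pandoc", "--from", fromFormat, "--to", "gfm", input.toString());
    }

    // Runs pandoc and returns the Markdown it prints to standard output.
    // Assumes pandoc is installed and available on the PATH.
    static String convertToMarkdown(Path input, String fromFormat)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(input, fromFormat))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // surface pandoc's diagnostics
                .start();
        // Read stdout fully before waiting, so a large document cannot fill the pipe buffer.
        String markdown = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pandoc exited with code " + exitCode);
        }
        return markdown;
    }
}
```

For a MediaWiki page saved locally, a call might look like `PandocMarkdownConverter.convertToMarkdown(Path.of("page.wiki"), "mediawiki")`.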

Comment From: radhakrishna67

@markpollack, thanks for considering this. I hope it will extract text, images, and tabular data from PDFs.

Comment From: markpollack

We have several document readers already available. I will close this issue and open one specifically for Markdown. There is also a companion issue about restructuring the signatures of the readers and writers to be functions and not suppliers/consumers. See https://github.com/spring-projects-experimental/spring-ai/issues/105 and https://github.com/spring-projects-experimental/spring-ai/issues/106

Comment From: mzhafez

Might be worthwhile to support AsciiDoc (ADOC) as well.