Motivation
@markpollack and @tzolov, maybe you are interested in the new `MarkdownDocumentReader`, which can read structured Markdown documents. As @markpollack wrote in #105, it could be valuable. I agree.
So, I've prepared a simple implementation of that `DocumentReader`.
Description
For parsing Markdown documents, I've used the commonmark/commonmark-java library.
Document dividing
By default, all documents are divided by headers. This includes all header types from 1 to 6. For a simple document like:
# AAA
content 1
## BBB
content 2
### CCC
content 3
#### DDD
content 4
##### EEE
content 5
###### FFF
content 6
Six documents will be generated. Each of these documents will have entries in the metadata as follows:
- `category` => `header_X`, where `X` is the number of the header
- `title` => `<header title>`, e.g. `BBB` from the example
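To make the header-based division concrete, here is a minimal sketch of the idea. It does not use commonmark-java or the actual reader: it matches ATX headings with a plain regular expression, and the `HeaderSplitSketch` and `Section` names are hypothetical stand-ins. It only illustrates how one section per header, with `category` and `title` metadata, could come out of the example above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the reader's real implementation.
class HeaderSplitSketch {

    /** Stand-in for a generated document: its text plus its metadata. */
    static final class Section {
        public final String text;
        public final Map<String, Object> metadata;

        Section(String text, Map<String, Object> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    // ATX heading: one to six '#' characters, whitespace, then the title.
    private static final Pattern HEADING = Pattern.compile("^(#{1,6})\\s+(.*?)\\s*$");

    static List<Section> split(String markdown) {
        List<Section> sections = new ArrayList<>();
        Map<String, Object> metadata = null; // null until the first heading is seen
        StringBuilder text = new StringBuilder();

        for (String line : markdown.split("\\R")) {
            Matcher m = HEADING.matcher(line);
            if (m.matches()) {
                // A new heading closes the section collected so far.
                if (metadata != null) {
                    sections.add(new Section(text.toString().trim(), metadata));
                }
                metadata = new LinkedHashMap<>();
                metadata.put("category", "header_" + m.group(1).length());
                metadata.put("title", m.group(2));
                text.setLength(0);
            } else if (metadata != null) {
                text.append(line).append('\n');
            }
            // Simplification: text before the first heading is ignored here.
        }
        if (metadata != null) {
            sections.add(new Section(text.toString().trim(), metadata));
        }
        return sections;
    }
}
```

A real parser also handles cases this regex ignores, such as `#` lines inside code blocks or setext-style headings.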
There is also an option to divide the Markdown document by horizontal lines. This is not the default option, but it can be turned on through configuration.
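As a rough illustration of dividing by horizontal lines, the sketch below splits text on lines that look like a Markdown horizontal rule (three or more `-`, `*`, or `_`). The `HorizontalRuleSketch` class is hypothetical and regex-based; the reader itself relies on the parser, and the name of the configuration option that enables this mode is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the reader's real implementation.
class HorizontalRuleSketch {

    // Up to three leading spaces, then three or more of the same marker
    // character (-, * or _), optionally separated by spaces or tabs.
    private static final Pattern RULE =
            Pattern.compile("^ {0,3}([-*_])(?:[ \\t]*\\1){2,}[ \\t]*$");

    static List<String> split(String markdown) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : markdown.split("\\R")) {
            if (RULE.matcher(line).matches()) {
                flush(parts, current); // a rule ends the current part
            } else {
                current.append(line).append('\n');
            }
        }
        flush(parts, current);
        return parts;
    }

    // Adds the collected text as a part unless it is blank, then resets the buffer.
    private static void flush(List<String> parts, StringBuilder current) {
        String text = current.toString().trim();
        if (!text.isEmpty()) {
            parts.add(text);
        }
        current.setLength(0);
    }
}
```

Note that this simplification treats every matching line as a rule; in CommonMark, a `---` line directly below a paragraph is a setext heading underline instead.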
Blockquotes and Code Blocks support
All blockquotes and code blocks are treated as separate documents. For code blocks where the language can be determined, it is included in the `lang` metadata entry.
This behavior can be changed by setting options.
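The code block handling can be pictured with the following sketch, which pulls fenced code blocks out of a Markdown string and records the language from the fence's info string under `lang`. It is a simplified, hypothetical stand-in (`CodeBlockSketch`, `Block`) that scans lines by hand instead of using the parser, and it leaves out blockquotes and indented code blocks.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the reader's real implementation.
class CodeBlockSketch {

    /** Stand-in for a generated document: the code plus its metadata. */
    static final class Block {
        public final String text;
        public final Map<String, Object> metadata;

        Block(String text, Map<String, Object> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    // Three backticks, built by concatenation so that this sample can
    // itself be placed inside a fenced block.
    private static final String FENCE = "``" + "`";

    static List<Block> codeBlocks(String markdown) {
        List<Block> blocks = new ArrayList<>();
        StringBuilder code = null; // non-null while inside a fence
        String lang = "";
        for (String line : markdown.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.startsWith(FENCE)) {
                if (code == null) {
                    // Opening fence: the info string, if any, names the language.
                    lang = trimmed.substring(FENCE.length()).trim();
                    code = new StringBuilder();
                } else {
                    // Closing fence: emit the block as its own document.
                    Map<String, Object> metadata = new LinkedHashMap<>();
                    if (!lang.isEmpty()) {
                        metadata.put("lang", lang);
                    }
                    blocks.add(new Block(code.toString(), metadata));
                    code = null;
                }
            } else if (code != null) {
                code.append(line).append('\n');
            }
        }
        return blocks;
    }
}
```

A block opened with a bare fence simply gets no `lang` entry in this sketch, which mirrors "where the language can be determined".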
Additional metadata
The Markdown reader configuration also provides support for additional metadata, which may be set for all processed documents. It contains fixed values that offer more context about the created document, such as the service name that provides the document, or the environment in which it was created.
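One way to picture the additional metadata is as a fixed map merged into the metadata of every generated document. The sketch below shows that merge with plain maps; `AdditionalMetadataSketch` is a hypothetical name, the `service` and `environment` keys in the usage are example values, and letting document-specific entries win on a key clash is an assumption, not something the reader is confirmed to do.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the reader's real implementation.
class AdditionalMetadataSketch {

    static List<Map<String, Object>> apply(List<Map<String, Object>> perDocument,
                                           Map<String, Object> additional) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> metadata : perDocument) {
            // Start from the fixed values, then layer the document's own entries.
            Map<String, Object> merged = new LinkedHashMap<>(additional);
            merged.putAll(metadata); // assumption: document entries win on clashes
            result.add(merged);
        }
        return result;
    }
}
```

For example, merging `{service=docs-service, environment=test}` into a document carrying `{category=header_1, title=AAA}` yields a map with all four entries, while the input maps stay untouched.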
TODO
- [x] Add `MarkdownDocumentReader` to the documentation
- [ ] Update the ETL Class Diagram
Comment From: markpollack
Hi, I'm super happy to see this. I'll review asap. I have a version of this on my machine from way back when I started the project. Markdown is a great "lingua franca" for document ETL.
Comment From: piotrooo
@markpollack great to hear it!
Any ideas or enhancements are more than appreciated.
I also have some future ideas about handling tables, but first things first. Baby steps.
Comment From: markpollack
I've added docs. I haven't tried it in anger yet but it looks great. Merged in 56e678c487e64b9731d4f16e9bbaafe4a729b7f3