GitHub code discovery adventures!
Introduction
If you are familiar with my posts, you know that I spend a lot of time digging into all kinds of code, often maintaining subscriptions to far too many developer blogs and repositories. My recent rabbit hole began when I spotted the docling-hierarchical-pdf repository. Curious about its claims regarding robust PDF-to-Markdown conversion and document post-processing, especially since it's not actually produced by the official Docling team, I decided to take it for a serious spin. What follows is my complete test, including setting up a recursive processing pipeline, and all the discoveries I made along the way.
Disclaimer: The author/developer of the repository is Roman, Kayan (https://www.linkedin.com/in/roman-kreuzhuber/).
Reminder - What is Docling
Docling is an open-source framework designed to help applications understand, convert, and process documents, with a strong focus on turning complex, unstructured content such as PDFs, Excel files, and images into structured, usable data formats. It achieves this by combining several state-of-the-art machine learning models, such as computer vision for layout analysis and natural language processing (NLP) for content extraction, to accurately interpret the hierarchy, tables, and paragraphs within a document. Docling is hosted as a project in the LF AI & Data Foundation.
Implementation and Test
- First of all, prepare the environment:
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
# this includes all docling packages too
pip install docling-hierarchical-pdf
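Before running any conversion, it can help to confirm that both packages actually landed in the virtual environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is my own and not part of either package, and the distribution names are the ones used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names taken from the pip install step above.
    for name in ("docling", "docling-hierarchical-pdf"):
        found = installed_version(name)
        print(f"{name}: {found if found else 'NOT INSTALLED'}")
```

If either line reports NOT INSTALLED, the virtual environment is probably not the one that is currently activated.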
- Starting from the sample code provided by the developer, I restructured it to match my usual coding style.
#### sample provided
from docling.document_converter import DocumentConverter
from hierarchical.postprocessor import ResultPostprocessor
source = "my_file.pdf" # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
# the postprocessor modifies the result.document in place.
ResultPostprocessor(result).process()
# enjoy the reordered document - for example convert it to markdown
result.document.export_to_markdown()
# or use a chunker on it...
- Modified code:
from datetime import datetime
from pathlib import Path
from typing import List, Optional

from docling.document_converter import DocumentConverter
from hierarchical.postprocessor import ResultPostprocessor


def process_documents(input_dir: str, output_dir: str, extensions: Optional[List[str]] = None) -> None:
    """
    Recursively reads documents from input_dir, converts them, and saves
    timestamped Markdown files to output_dir.

    Args:
        input_dir: The source directory for documents.
        output_dir: The destination directory for Markdown files.
        extensions: List of file extensions to process (e.g., ['.pdf', '.docx']).
    """
    if extensions is None:
        extensions = ['.pdf', '.doc', '.docx', '.txt']

    print("Starting document conversion pipeline.")
    print(f"Input Directory: {input_dir}")
    print(f"Output Directory: {output_dir}\n")

    input_path = Path(input_dir)
    output_path = Path(output_dir)

    # 1. Create the output directory if it doesn't exist
    print("Checking output directory...")
    output_path.mkdir(parents=True, exist_ok=True)
    print(f"Output directory '{output_dir}' is ready.\n")

    converter = DocumentConverter()
    converted_count = 0

    # 2. Recursively find documents; the set guards against handling a file twice
    found_files = set()
    for ext in extensions:
        search_pattern = f"**/*{ext}"
        print(f"Searching for files with pattern: {search_pattern}")
        for source_file in input_path.glob(search_pattern):
            if not source_file.is_file() or source_file in found_files:
                continue
            found_files.add(source_file)
            try:
                print(f"Processing file: {source_file.name}")
                # Convert the source file into a Docling document
                result = converter.convert(source_file)
                # Post-process the result (modifies result.document in place)
                ResultPostprocessor(result).process()
                # Export to Markdown
                markdown_content = result.document.export_to_markdown()
                # Generate a timestamped filename
                base_name = source_file.stem
                timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                output_filename = f"{base_name}_{timestamp}.md"
                output_filepath = output_path / output_filename
                output_filepath.write_text(markdown_content, encoding='utf-8')
                # Count only files that were converted and written successfully
                converted_count += 1
                print(f"-> SUCCESS: Saved as '{output_filename}'\n")
            except Exception as e:
                print(f"-> ERROR processing {source_file}: {e}\n")

    if not found_files:
        print(
            "Completed: No documents found matching the specified extensions. "
            f"Ensure files exist in '{input_dir}' and its subdirectories."
        )
    else:
        print(f"Completed: Converted {converted_count} of {len(found_files)} documents.")


if __name__ == "__main__":
    INPUT_FOLDER = "./input"
    OUTPUT_FOLDER = "./output"

    Path(INPUT_FOLDER).mkdir(parents=True, exist_ok=True)

    print("--- Running Document Processor ---")
    print(f"Please ensure your target files are located in the '{INPUT_FOLDER}' directory (and its subdirectories).")

    process_documents(INPUT_FOLDER, OUTPUT_FOLDER, extensions=['.pdf', '.docx', '.doc'])

    print("\n--- Program Execution Complete ---")
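The file discovery step in the script runs one glob per extension and relies on a set to avoid duplicates. As an alternative sketch (not part of the repository or of Docling), the same idea can be written as a single directory walk that also matches extensions case-insensitively, which matters on case-sensitive file systems where `report.PDF` would not match `**/*.pdf`. The function name `find_documents` is hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def find_documents(input_dir: str, extensions: Iterable[str]) -> List[Path]:
    """Return sorted files under input_dir whose suffix matches one of extensions.

    Extensions are compared case-insensitively and may be given with or
    without the leading dot (".pdf" and "pdf" are treated the same).
    """
    wanted = {
        ext.lower() if ext.startswith(".") else f".{ext.lower()}"
        for ext in extensions
    }
    root = Path(input_dir)
    # A single recursive walk visits each file once, so no duplicate check is needed.
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )


if __name__ == "__main__":
    for doc in find_documents("./input", [".pdf", ".docx", ".doc"]):
        print(doc)
```

The returned list could replace the nested loops in `process_documents`, leaving the conversion code unchanged.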
- The input data can be found here: https://github.com/krrome/docling-hierarchical-pdf/tree/main/tests/samples
- The output excerpt in Markdown format:
## Some kind of text document
## 1. Introduction
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis consequat tincidunt tempor.
### 1.1 Background
Praesent venenatis leo vel nibh tincidunt, nec iaculis est dictum. Integer ac purus at mi volutpat feugiat.
### 1.2 Purpose
Nunc rhoncus, risus ut pretium hendrerit, ipsum nulla lobortis augue, nec vestibulum magna turpis sit amet erat.
## 2. Main Content
Aliquam erat volutpat. Morbi interdum, eros eget venenatis euismod, velit sapien tristique risus, ac tincidunt massa lorem nec felis.
### 2.1 Section One
Phasellus ullamcorper nunc a lacus imperdiet, at luctus sem euismod. Integer feugiat commodo orci.
#### 2.1.1 Subsection
Duis porttitor magna ut urna efficitur, ac sollicitudin odio volutpat. Suspendisse potenti.
#### 2.1.2 Another Subsection
Sed vestibulum porta sem, eu egestas nulla gravida vel. Morbi feugiat vulputate odio sed placerat.
### 2.2 Section Two
Vivamus venenatis magna at metus imperdiet consequat. Ut semper risus nec odio ultrices gravida.
## 3. Conclusion
In hac habitasse platea dictumst. Sed pretium, felis ac dignissim dictum, lectus erat pharetra quam, vel ullamcorper est lorem vitae ligula.
## Some kind of text document
1. Introduction
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis consequat tincidunt tempor.
## 1.1 Background
Praesent venenatis leo vel nibh tincidunt, nec iaculis est dictum. Integer ac purus at mi volutpat feugiat.
- 1.2 Purpose
Nunc rhoncus, risus ut pretium hendrerit, ipsum nulla lobortis augue, nec vestibulum magna turpis sit amet erat.
## 2. Main Content
Aliquam erat volutpat. Morbi interdum, eros eget venenatis euismod, velit sapien tristique risus, ac tincidunt massa lorem nec felis.
### 2.1 Section One
Phasellus ullamcorper nunc a lacus imperdiet, at luctus sem euismod. Integer feugiat commodo orci.
- 2.1.1 Subsection
Duis porttitor magna ut urna efficitur, ac sollicitudin odio volutpat. Suspendisse potenti.
- 2.1.2 Another Subsection
Sed vestibulum porta sem, eu egestas nulla gravida vel. Morbi feugiat vulputate odio sed placerat.
### 2.2 Section Two
Vivamus venenatis magna at metus imperdiet consequat. Ut semper risus nec odio ultrices gravida.
## 3. Conclusion
In hac habitasse platea dictumst. Sed pretium, felis ac dignissim dictum, lectus erat pharetra quam, vel ullamcorper est lorem vitae ligula.
It seems quite cool!
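Because the two excerpts above differ mainly in their heading levels, one quick way to compare Markdown exports is to reduce each to its heading outline. The sketch below uses only the standard library and a simple regular expression for ATX headings (lines starting with one to six `#` characters); the name `heading_outline` is my own, and the approach deliberately ignores edge cases such as headings inside code blocks.

```python
import re
from typing import List, Tuple

# One to six '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def heading_outline(markdown: str) -> List[Tuple[int, str]]:
    """Return (level, title) for every line that looks like an ATX heading."""
    outline = []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline


if __name__ == "__main__":
    sample = "## 1. Introduction\ntext\n### 1.1 Background\ntext\n#### 2.1.1 Subsection\n"
    for level, title in heading_outline(sample):
        # Indent each title by its heading level to make the nesting visible.
        print("  " * (level - 1) + title)
```

Running it on two exported files and comparing the resulting lists shows at a glance where section numbers ended up as headings, list items, or plain text.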
Conclusion
Overall, this unofficial Docling Hierarchical code base seems like a well-built and promising repository that is absolutely worth a try. If you ever find yourself running into specific shortcomings or needing a highly customized processing pipeline that the official Docling packages or services don't quite accommodate, this project could be an excellent alternative to explore. Thank you very much for reading, and happy coding!
Links
- GitHub Repository for docling-hierarchical-pdf: https://github.com/krrome/docling-hierarchical-pdf?tab=readme-ov-file
- Documentation Page: https://krrome.github.io/docling-hierarchical-pdf/
- Docling Documentation: https://docling-project.github.io/docling/
- Docling Repository: https://github.com/docling-project/docling

