Docling does not handle Excel files natively for now; here is a workaround and its implementation.
Introduction
In the world of document processing and AI ingestion, finding the right tool for the job is half the battle. We’ve all been there: you discover a powerful library like Docling, with its impressive ability to parse and structure various file formats, only to realize your crucial data is locked away in a format it doesn’t natively support, like Excel. My journey recently hit this exact roadblock. The initial workaround seemed simple: a two-step process of converting Excel to CSV and then feeding the resulting file to Docling’s robust engine. But I quickly ran into a major glitch: my Excel workbooks with multiple, data-rich tabs were losing crucial information, because only the first sheet was being processed. This post details a refined code solution to (at least partially) overcome this specific challenge, so that no data is left behind as you unlock the full potential of your documents.
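For context, here is roughly what that naive first step looks like; the original conversion script is not reproduced in this post, so treat this as an illustrative sketch with placeholder file names. The key detail is that pandas’ read_excel defaults to sheet_name=0, which is exactly why only the first tab ever reached the CSV.

import pandas as pd

# Naive conversion: read_excel with its default sheet_name=0
# silently returns only the FIRST sheet of the workbook.
df = pd.read_excel("Bank Extraction.xlsx")
df.to_csv("Bank Extraction.csv", index=False)
# Every other tab is dropped before Docling ever sees the file.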
Docling provides a multi-format conversion sample, shown below, which is really useful, but Excel is not among the supported inputs.
import json
import logging
from pathlib import Path

import yaml

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.document_converter import (
    DocumentConverter,
    PdfFormatOption,
    WordFormatOption,
)
from docling.pipeline.simple_pipeline import SimplePipeline
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

_log = logging.getLogger(__name__)


def main():
    input_paths = [
        Path("README.md"),
        Path("tests/data/html/wiki_duck.html"),
        Path("tests/data/docx/word_sample.docx"),
        Path("tests/data/docx/lorem_ipsum.docx"),
        Path("tests/data/pptx/powerpoint_sample.pptx"),
        Path("tests/data/2305.03393v1-pg9-img.png"),
        Path("tests/data/pdf/2206.01062.pdf"),
        Path("tests/data/asciidoc/test_01.asciidoc"),
    ]

    ## for defaults use:
    # doc_converter = DocumentConverter()

    ## to customize use:
    # Below we explicitly whitelist formats and override behavior for some of them.
    # You can omit this block and use the defaults (see above) for a quick start.
    doc_converter = DocumentConverter(  # all of the below is optional, has internal defaults.
        allowed_formats=[
            InputFormat.PDF,
            InputFormat.IMAGE,
            InputFormat.DOCX,
            InputFormat.HTML,
            InputFormat.PPTX,
            InputFormat.ASCIIDOC,
            InputFormat.CSV,
            InputFormat.MD,
        ],  # whitelist formats, non-matching files are ignored.
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=StandardPdfPipeline, backend=PyPdfiumDocumentBackend
            ),
            InputFormat.DOCX: WordFormatOption(
                pipeline_cls=SimplePipeline  # or set a backend, e.g., MsWordDocumentBackend
                # If you change the backend, remember to import it, e.g.:
                # from docling.backend.msword_backend import MsWordDocumentBackend
            ),
        },
    )

    conv_results = doc_converter.convert_all(input_paths)

    for res in conv_results:
        out_path = Path("scratch")  # ensure this directory exists before running
        print(
            f"Document {res.input.file.name} converted."
            f"\nSaved markdown output to: {out_path!s}"
        )
        _log.debug(res.document._export_to_indented_text(max_text_len=16))
        # Export Docling document to Markdown:
        with (out_path / f"{res.input.file.stem}.md").open("w") as fp:
            fp.write(res.document.export_to_markdown())
        with (out_path / f"{res.input.file.stem}.json").open("w") as fp:
            fp.write(json.dumps(res.document.export_to_dict()))
        with (out_path / f"{res.input.file.stem}.yaml").open("w") as fp:
            fp.write(yaml.safe_dump(res.document.export_to_dict()))


if __name__ == "__main__":
    main()
There is also a very simple code sample provided for CSV conversions.
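For reference, a minimal sketch of that kind of CSV conversion, using only the DocumentConverter calls that also appear later in this post (the input path is a placeholder):

from docling.document_converter import DocumentConverter

# The default converter accepts CSV among its input formats.
converter = DocumentConverter()
result = converter.convert("data/example.csv")  # placeholder path
print(result.document.export_to_markdown())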
Tests and implementation
To tackle this, I’ve built a custom solution based on the foundational examples provided by the Docling project itself. The following implementation acts as a bridge, transforming Excel documents with multiple sheets into a single, comprehensive CSV file. Once this conversion is complete, the application then leverages Docling’s powerful capabilities to process the intermediate CSV and generate both clean Markdown and structured JSON outputs, providing a flexible and robust document processing pipeline.
- The first step is to get the Python environment ready (or use a requirements.txt file, which is more professional 🙂; a sketch is shown right after the commands below).
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install docling
pip install pandas
pip install xlrd
pip install openpyxl
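If you prefer the requirements.txt route mentioned above, a minimal equivalent of the commands could look like this (versions left unpinned here; pin them for reproducible builds):

docling
pandas
xlrd
openpyxl

Then install everything in one go with pip install -r requirements.txt.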
- And here is the code ⬇️
import pandas as pd
from pathlib import Path
from docling.document_converter import DocumentConverter
import json


def main():
    """
    Main function to handle document conversion from an input directory to an output directory.
    """
    # Define the input, output, and temporary directories
    input_dir = Path("./input")
    output_dir = Path("./output")
    tmp_csv_dir = Path("./tmp_csv")

    # Create the output and temporary directories if they don't exist
    try:
        output_dir.mkdir(parents=True, exist_ok=True)
        tmp_csv_dir.mkdir(parents=True, exist_ok=True)
        print(f"Ensured '{output_dir.resolve()}' and '{tmp_csv_dir.resolve()}' directories exist.")
    except OSError as e:
        print(f"Error creating directories: {e}")
        return

    # Initialize the DocumentConverter
    converter = DocumentConverter()

    # Process files in the input directory
    for file_path in input_dir.iterdir():
        # Check if the path is a file and not a directory
        if file_path.is_file():
            print(f"Processing file: {file_path.name}")
            original_file_path = file_path

            # Check if the file is an Excel file
            if file_path.suffix.lower() in ['.xlsx', '.xls']:
                print(f"Detected Excel file. Converting '{file_path.name}' to CSV first.")
                try:
                    # Read all sheets from the Excel file
                    xls = pd.ExcelFile(file_path)
                    all_sheets_df = pd.DataFrame()
                    for sheet_name in xls.sheet_names:
                        print(f" - Reading sheet: {sheet_name}")
                        df = pd.read_excel(xls, sheet_name=sheet_name)
                        all_sheets_df = pd.concat([all_sheets_df, df], ignore_index=True)

                    csv_file_path = tmp_csv_dir / file_path.with_suffix('.csv').name
                    all_sheets_df.to_csv(csv_file_path, index=False)
                    file_path = csv_file_path
                    print(f"Successfully converted Excel to temporary CSV: {file_path.name}")
                except Exception as e:
                    print(f"Could not convert Excel file '{original_file_path.name}' to CSV. Error: {e}")
                    continue

            try:
                # Convert the document using the converter
                result = converter.convert(file_path)

                # Export the document to markdown format
                output_content = result.document.export_to_markdown()

                # Define the output file path based on the original file name
                output_file_path = output_dir / original_file_path.with_suffix('.md').name

                # Write the output content to the new file
                with open(output_file_path, "w", encoding="utf-8") as f:
                    f.write(output_content)
                print(f"Successfully converted to markdown: '{output_file_path.name}'")

                # Export the document to JSON format
                json_output = result.document.export_to_dict()
                json_file_path = output_dir / original_file_path.with_suffix('.json').name
                with open(json_file_path, "w") as fp:
                    fp.write(json.dumps(json_output, indent=4))
                print(f"Successfully created JSON output: '{json_file_path.name}'")
            except Exception as e:
                print(f"Could not convert file '{file_path.name}'. Error: {e}")
        else:
            print(f"Skipping non-file path: {file_path}")


if __name__ == "__main__":
    # Call the main function when the script is executed
    main()
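A possible variation, not part of the code above: since all sheets are concatenated into a single DataFrame, workbooks whose tabs have different column layouts end up in one very wide CSV, which triggers the “inconsistent column lengths” warning visible in the logs below. If that matters for your data, one option is to write one temporary CSV per sheet and convert each separately. The helper below is a hypothetical sketch of that idea, not the author’s implementation.

from pathlib import Path

import pandas as pd


def excel_to_csvs(excel_path: Path, tmp_dir: Path) -> list[Path]:
    """Hypothetical helper: write each sheet to its own CSV and return the paths."""
    tmp_dir.mkdir(parents=True, exist_ok=True)
    csv_paths = []
    xls = pd.ExcelFile(excel_path)
    for sheet_name in xls.sheet_names:
        df = pd.read_excel(xls, sheet_name=sheet_name)
        csv_path = tmp_dir / f"{excel_path.stem}_{sheet_name}.csv"
        df.to_csv(csv_path, index=False)
        csv_paths.append(csv_path)
    return csv_paths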
Based on the console output below, the application performs as intended: it iterates through every sheet of the Excel file, and the logs provide a clear, verifiable record that each tab is read into the intermediate CSV before the final Markdown and JSON files are generated.
python app4.py
Ensured '/Users/xxx/Devs/docling-excel/output' and '/Users/xxx/Devs/docling-excel/tmp_csv' directories exist.
Processing file: Bank Extraction.xlsx
Detected Excel file. Converting 'Bank Extraction.xlsx' to CSV first.
- Reading sheet: RECAP 2013-2014 old
- Reading sheet: RECAP2012-2013
- Reading sheet: RECAP 2014
- Reading sheet: Chéquier n°1
- Reading sheet: JUIN12
- Reading sheet: JUILL12
- Reading sheet: AOUT12
- Reading sheet: SEPT12
- Reading sheet: OCT12
- Reading sheet: NOV12
- Reading sheet: DEC12
- Reading sheet: RECAP 2015
- Reading sheet: Bilan 2012
- Reading sheet: JAN13
- Reading sheet: FEV13
- Reading sheet: MARS13
- Reading sheet: AVRIL13
- Reading sheet: MAI13
- Reading sheet: JUIN13
- Reading sheet: JUILL13
- Reading sheet: AOUT13
- Reading sheet: Bilan 2013
- Reading sheet: Bilan 2014
- Reading sheet: JAN14
- Reading sheet: FEV14
- Reading sheet: MAR14
- Reading sheet: AVR14
- Reading sheet: MAI14
- Reading sheet: JUIN14
- Reading sheet: JUILL14
- Reading sheet: AOUT14
- Reading sheet: SEPT14
- Reading sheet: OCT14
- Reading sheet: Diagrammes 2013-2014 old
- Reading sheet: NOV14
- Reading sheet: DEC14
- Reading sheet: PASS 92
- Reading sheet: Mvts en attente
- Reading sheet: Bilan 2015
- Reading sheet: JAN16
- Reading sheet: FEV2016
- Reading sheet: MAR2016
- Reading sheet: AVR2016
- Reading sheet: MAI2016
- Reading sheet: JUIN2016
- Reading sheet: JUIL2016
- Reading sheet: AOUT2016
- Reading sheet: SEPT2016
- Reading sheet: OCT2016
- Reading sheet: NOV2016
- Reading sheet: DEC2016
- Reading sheet: Bilan 2016
- Reading sheet: JAN2017
- Reading sheet: FEV2017
- Reading sheet: MAR2017
- Reading sheet: AVR2017
- Reading sheet: MAI2017
- Reading sheet: JUN2017
- Reading sheet: JUIL2017
- Reading sheet: AOUT2017
- Reading sheet: SEPT2017
- Reading sheet: OCT2017
- Reading sheet: NOV2017
- Reading sheet: DEC2017
- Reading sheet: JAN2018
- Reading sheet: FEV2018
- Reading sheet: MAR2018
- Reading sheet: AVR2018
- Reading sheet: MAI2018
- Reading sheet: JUIN2018
- Reading sheet: JUIL2018
- Reading sheet: AOUT2018
- Reading sheet: SEPT2018
- Reading sheet: OCT2018
- Reading sheet: NOV2018
- Reading sheet: DEC2018
- Reading sheet: JAN2019
- Reading sheet: FEV2019
- Reading sheet: MAR2019
- Reading sheet: AVR2019
- Reading sheet: MAI2019
- Reading sheet: JUIN2019
- Reading sheet: JUIL2019
- Reading sheet: AOUT2019
- Reading sheet: SEPT2019
- Reading sheet: OCT2019
- Reading sheet: NOV2019
- Reading sheet: DEC2019
- Reading sheet: JAN2020
- Reading sheet: FEV2020
- Reading sheet: MAR2020
- Reading sheet: AVR2020
- Reading sheet: MAI2020
- Reading sheet: JUIN2020
- Reading sheet: JUIL2020
- Reading sheet: AOUT2020
- Reading sheet: SEPT2020
- Reading sheet: OCT2020
- Reading sheet: NOV2020
- Reading sheet: DEC2020
- Reading sheet: JAN2021
- Reading sheet: FEB2021
- Reading sheet: MAR2021
- Reading sheet: APR2021
- Reading sheet: MAI2021
- Reading sheet: JUIN2021
- Reading sheet: JUIL2021
- Reading sheet: AOUT2021
- Reading sheet: SEPT2021
- Reading sheet: OCT2021
- Reading sheet: NOV2021
- Reading sheet: DEC2021
- Reading sheet: JAN2022
- Reading sheet: FEB2022
- Reading sheet: MAR2022
- Reading sheet: APR2022
- Reading sheet: MAI2022
- Reading sheet: JUIN2022
- Reading sheet: JUIL2022
- Reading sheet: AOUT2022
- Reading sheet: SEPT2022
- Reading sheet: OCT2022
- Reading sheet: NOV2022
- Reading sheet: DEC2022
- Reading sheet: JAN2023
- Reading sheet: FEV2023
- Reading sheet: MAR2023
- Reading sheet: AVR2023
- Reading sheet: MAI2023
- Reading sheet: JUIN2023
- Reading sheet: JUI2023
- Reading sheet: AOU2023
- Reading sheet: SEP2023
- Reading sheet: OCT2023
- Reading sheet: NOV2023
- Reading sheet: DEC2023
- Reading sheet: JAN2024
- Reading sheet: FEV2024
- Reading sheet: MARS2024
- Reading sheet: AVR2024
- Reading sheet: MAI2024
- Reading sheet: JUIN2024
- Reading sheet: JUIL2024
- Reading sheet: AOU2024
- Reading sheet: SEPT2024
- Reading sheet: OCT2024
- Reading sheet: NOV2024
- Reading sheet: DEC2024
- Reading sheet: JAN2025
- Reading sheet: FEV2025
- Reading sheet: MARS2025
- Reading sheet: AVR2025
- Reading sheet: MAI2025
- Reading sheet: JUIN2025
- Reading sheet: JUIL2025
- Reading sheet: AOUT2025
- Reading sheet: SEPT2025
- Reading sheet: Bilans
- Reading sheet: bilan-0923-0824
- Reading sheet: Liste cdes tee-shirts
- Reading sheet: Résultats
- Reading sheet: Budget prév jan-aout 2014
- Reading sheet: Budget prévisionnel-association
- Reading sheet: Compte de Résultat-Association
- Reading sheet: Bilan financier- Association
- Reading sheet: Bilan financier- Associatio (2)
Successfully converted Excel to temporary CSV: Bank Extraction.csv
2025-09-22 10:47:45,240 - INFO - detected formats: [<InputFormat.CSV: 'csv'>]
2025-09-22 10:47:45,244 - INFO - Going to convert document batch...
2025-09-22 10:47:45,244 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2025-09-22 10:47:45,248 - INFO - Loading plugin 'docling_defaults'
2025-09-22 10:47:45,249 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-09-22 10:47:45,249 - INFO - Processing document Bank Extraction.csv
2025-09-22 10:47:45,249 - INFO - Parsing CSV with delimiter: ","
2025-09-22 10:47:45,292 - INFO - Detected 8510 lines
/Users/xxx/Devs/docling-excel/venv/lib/python3.12/site-packages/docling/backend/csv_backend.py:76: UserWarning: Inconsistent column lengths detected in CSV data. Expected 280 columns, but found rows with varying lengths. Ensure all rows have the same number of columns.
warnings.warn(
2025-09-22 10:47:52,088 - INFO - Finished converting document Bank Extraction.csv in 6.85 sec.
Successfully converted to markdown: 'Bank Extraction.md'
Successfully created JSON output: 'Bank Extraction.json'
Processing file: Inscription.xls
Detected Excel file. Converting 'Inscription.xls' to CSV first.
- Reading sheet: Worksheet1
Successfully converted Excel to temporary CSV: Inscription.csv
2025-09-22 10:48:52,853 - INFO - detected formats: [<InputFormat.CSV: 'csv'>]
2025-09-22 10:48:52,854 - INFO - Going to convert document batch...
2025-09-22 10:48:52,854 - INFO - Processing document Inscription.csv
2025-09-22 10:48:52,855 - INFO - Parsing CSV with delimiter: ","
2025-09-22 10:48:52,855 - INFO - Detected 7 lines
2025-09-22 10:48:52,856 - INFO - Finished converting document Inscription.csv in 0.00 sec.
Successfully converted to markdown: 'Inscription.md'
Successfully created JSON output: 'Inscription.json'
This “busyness” of the Excel file is perfectly reflected in the resulting JSON output. As demonstrated below, the more complex and data-rich the original Excel document, the more comprehensive — and indeed, “heavier” — its corresponding JSON representation becomes. This is a direct testament to the code’s ability to accurately capture and preserve the entire data structure, including all the intricate details from every sheet, transforming it into a fully structured, machine-readable format.
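If you want to see this in numbers, a quick way is to list the sizes of the generated files; the path below matches the output directory used in the script.

from pathlib import Path

# Print the size of every file produced in the output directory.
for out_file in sorted(Path("./output").iterdir()):
    print(f"{out_file.name}: {out_file.stat().st_size / 1024:.1f} KiB")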
Conclusion
Navigating the world of document processing with powerful tools like Docling can sometimes feel like a puzzle, especially when you encounter file formats that aren’t natively supported. This post has shown a robust and repeatable workaround for a common challenge: ingesting Excel data, including from multi-tab workbooks. By leveraging a simple Python application to act as a crucial bridge, we’ve demonstrated how to successfully transform complex Excel files into a single, unified CSV. From there, Docling takes over, turning that data into both readable Markdown and structured JSON. This approach ensures no information is lost, and as we’ve seen, the console output confirms a complete, successful conversion, even for the most complex files. Ultimately, this solution empowers you to integrate your Excel data into any workflow, unlocking its full potential for further analysis and AI-driven applications.
One crucial point to close on: while this open-source solution is an excellent proof of concept, ideal for specific tasks and internal projects, its design has limitations. For genuinely heavy-duty, industrial-scale use cases that demand high performance, stringent security, and robust business capabilities, specialized platforms become essential. Solutions like watsonx BI (see the link below) are built to handle the complexities of business-critical data pipelines, offering advanced features such as built-in governance, massive scalability, and integrated business intelligence tools that are not typically found in a DIY implementation. It’s all about choosing the right tool for the job.
Thanks for reading! 🤗
Links
- Docling Documentation: https://docling-project.github.io/docling/
- Docling Project: https://github.com/docling-project/docling
- watsonx BI: https://www.ibm.com/products/watsonx-bi