When building AI agents, one of the most crucial steps is preparing the data you feed to large language models (LLMs). If your data is not well-structured and context-ready, your agents may severely underperform and fail to deliver the results you expect.
While there are well-known data parsers such as LlamaParse, Amazon Textract, and Azure AI Document Intelligence, this article focuses on setting up your own data parser with an open-source alternative called Docling. Docling lets you efficiently transform messy documents into structured formats ready for AI workflows.
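If you want to follow along, you'll need a few packages installed. The article doesn't list its dependencies, so this is my best guess based on the code below:
pip install docling chromadb openai transformers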
What this article covers:
- Extracting document content
- Creating document chunks
- Creating embeddings and storing them in ChromaDB
- Testing basic search functionality
EXTRACT DOCUMENT CONTENT
First, we import the DocumentConverter from Docling and initialize it:
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
Next, convert your PDF (or other document formats) to a structured Docling document and export it as JSON:
result = converter.convert("https://arxiv.org/pdf/2408.09869")
document = result.document
json_output = document.export_to_dict()
Docling supports a wide range of formats, including PDF, DOCX, HTML, Markdown, and even PowerPoint. You can also pass URLs to extract HTML content directly.
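Besides JSON, the parsed document can be exported to other representations. Markdown, for example, often works well as LLM context:
markdown_output = document.export_to_markdown()
print(markdown_output[:500])  # preview the first 500 characters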
DOCUMENT CHUNKING
Instead of storing the entire document at once, we split it into smaller, meaningful pieces called chunks. This improves retrieval relevance and reduces the amount of data sent to the language model at once.
Docling offers powerful chunking methods that understand document structure beyond just splitting text blindly:
- Hierarchical chunker - Recognizes natural breaks in documents, such as sections and paragraphs (see the sketch after this list).
- Hybrid chunker - Builds on hierarchical chunking and further splits chunks too large for your embedding model's token limits.
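For reference, here is a minimal sketch of hierarchical chunking on the document we converted earlier, assuming the HierarchicalChunker class shipped with docling-core:
from docling_core.transforms.chunker import HierarchicalChunker

# Chunks follow the document's own structure: sections, paragraphs, tables
hier_chunks = list(HierarchicalChunker().chunk(dl_doc=result.document))
print(len(hier_chunks), "structure-aware chunks")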
Here's an example using Docling's HybridChunker with an open-source tokenizer:
from docling.chunking import HybridChunker
from transformers import AutoTokenizer

# Open-source tokenizer used to count tokens when sizing chunks
# (an approximation of the OpenAI tokenizer we embed with later)
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
MAX_TOKENS = 8191  # input limit of OpenAI's text-embedding-3-small

chunker = HybridChunker(
    tokenizer=tokenizer,
    max_tokens=MAX_TOKENS
)
chunk_iter = chunker.chunk(dl_doc=result.document)
chunks = list(chunk_iter)
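To sanity-check the output, you can peek at the first few chunks; each chunk exposes its content through a text attribute:
for chunk in chunks[:3]:
    print(f"[{len(chunk.text)} chars] {chunk.text[:80]}...")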
EMBEDDING THE CHUNKS
Now that our chunks are ready, we can store them in a vector database. First, we initialize a persistent ChromaDB client and create a collection with an OpenAI embedding function:
import chromadb
from chromadb.utils import embedding_functions

# Initialize a persistent Chroma client for local storage
client = chromadb.PersistentClient(path="chroma_persistent_storage")
# Set up OpenAI embedding function
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="<OPENAI_API_KEY>",
    model_name="text-embedding-3-small"
)
# Create a collection with the embedding function
collection = client.create_collection(
    name="document_collection",
    embedding_function=openai_ef
)
# Add all chunks to the collection in one batched call
collection.add(
    documents=[chunk.text for chunk in chunks],
    metadatas=[{"chunk_index": idx} for idx in range(len(chunks))],
    ids=[f"chunk_{idx}" for idx in range(len(chunks))]
)
# No explicit persist call is needed: PersistentClient saves to disk automatically.
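Because the collection is persisted locally, a later script (or your agent) can reopen the same store. A small sketch, assuming the same directory and embedding function as above:
client = chromadb.PersistentClient(path="chroma_persistent_storage")
collection = client.get_collection(
    name="document_collection",
    embedding_function=openai_ef
)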
TEST BASIC SEARCH FUNCTIONALITY
Test if your setup works by querying the vector database with sample questions and retrieving relevant chunks:
query = "What is the main contribution of the paper?"
results = collection.query(query_texts=[query], n_results=3)
for doc in results['documents'][0]:
    print(doc)
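To judge how well each match fits, you can also ask Chroma to return metadata and distances alongside the documents. A quick sketch using the include parameter:
results = collection.query(
    query_texts=[query],
    n_results=3,
    include=["documents", "metadatas", "distances"]
)
for doc, meta, dist in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0]
):
    print(f"chunk {meta['chunk_index']} (distance {dist:.3f}): {doc[:80]}...")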
SUMMARY
With Docling, you can easily:
- Extract content from multiple document formats into a clean, structured representation.
- Chunk documents intelligently, preserving their logical hierarchy.
- Generate embeddings using your preferred model (like OpenAI).
- Store and retrieve embeddings efficiently in a local vector database like Chroma.