When building AI agents, one of the most crucial steps is preparing the data you feed to large language models (LLMs). If your data is not well-structured and context-ready, your agents may severely underperform and fail to deliver the results you expect.
While there are well-known data parsers such as LlamaParse, Amazon Textract, and Azure AI Document Intelligence, this article focuses on setting up your own data parser with an open-source alternative called Docling. Docling lets you efficiently transform messy documents into structured formats ready for AI workflows.
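If you want to follow along, you'll need a few packages installed. The article doesn't list its dependencies, so this is my best guess based on the code below:
pip install docling chromadb openai transformers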
What this article covers:
- Extracting document content
- Creating document chunks
- Creating embeddings and storing them in ChromaDB
- Testing basic search functionality
EXTRACT DOCUMENT CONTENT
First, we import the DocumentConverter from Docling and initialize it:
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
Next, convert your PDF (or other document formats) to a structured Docling document and export it as JSON:
result = converter.convert("https://arxiv.org/pdf/2408.09869")
document = result.document
json_output = document.export_to_dict()
Docling supports a wide range of formats, including PDF, DOCX, HTML, Markdown, and even PowerPoint. You can also pass URLs to extract HTML content directly.
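Besides JSON, the parsed document can be exported to other representations. Markdown, for example, often works well as LLM context:
markdown_output = document.export_to_markdown()
print(markdown_output[:500])  # preview the first 500 characters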
DOCUMENT CHUNKING
Instead of storing the entire document at once, we split it into smaller, meaningful pieces called chunks. This improves retrieval relevance and reduces the amount of data sent to the language model at once.
Docling offers powerful chunking methods that understand document structure beyond just splitting text blindly:
- Hierarchical chunker - Recognizes natural breaks in documents, such as sections and paragraphs (see the sketch after this list).
- Hybrid chunker - Builds on hierarchical chunking and further splits chunks too large for your embedding model's token limits.
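For reference, here is a minimal sketch of hierarchical chunking on the document we converted earlier, assuming the HierarchicalChunker class shipped with docling-core:
from docling_core.transforms.chunker import HierarchicalChunker

# Chunks follow the document's own structure: sections, paragraphs, tables
hier_chunks = list(HierarchicalChunker().chunk(dl_doc=result.document))
print(len(hier_chunks), "structure-aware chunks")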
Here's an example using Docling's HybridChunker with an open-source tokenizer:
from docling.chunking import HybridChunker
from transformers import AutoTokenizer

# Open-source tokenizer used to count tokens when sizing chunks
# (an approximation of the OpenAI tokenizer we embed with later)
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
MAX_TOKENS = 8191  # input limit of OpenAI's text-embedding-3-small

chunker = HybridChunker(
    tokenizer=tokenizer,
    max_tokens=MAX_TOKENS
)
chunk_iter = chunker.chunk(dl_doc=result.document)
chunks = list(chunk_iter)
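To sanity-check the output, you can peek at the first few chunks; each chunk exposes its content through a text attribute:
for chunk in chunks[:3]:
    print(f"[{len(chunk.text)} chars] {chunk.text[:80]}...")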
EMBEDDING THE CHUNKS
Now that our chunks are ready, we can store them in a vector database. First, we initialize a persistent ChromaDB client and create a collection with an OpenAI embedding function:
import chromadb
from chromadb.utils import embedding_functions

# Initialize a persistent Chroma client for local storage
client = chromadb.PersistentClient(path="chroma_persistent_storage")
# Set up OpenAI embedding function
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="<OPENAI_API_KEY>",
    model_name="text-embedding-3-small"
)
# Create a collection with the embedding function
collection = client.create_collection(
    name="document_collection",
    embedding_function=openai_ef
)
# Add all chunks to the collection in one batched call
collection.add(
    documents=[chunk.text for chunk in chunks],
    metadatas=[{"chunk_index": idx} for idx in range(len(chunks))],
    ids=[f"chunk_{idx}" for idx in range(len(chunks))]
)
# No explicit persist call is needed: PersistentClient saves to disk automatically.
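Because the collection is persisted locally, a later script (or your agent) can reopen the same store. A small sketch, assuming the same directory and embedding function as above:
client = chromadb.PersistentClient(path="chroma_persistent_storage")
collection = client.get_collection(
    name="document_collection",
    embedding_function=openai_ef
)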
TEST BASIC SEARCH FUNCTIONALITY
Test if your setup works by querying the vector database with sample questions and retrieving relevant chunks:
query = "What is the main contribution of the paper?"
results = collection.query(query_texts=[query], n_results=3)
for doc in results['documents'][0]:
    print(doc)
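To judge how well each match fits, you can also ask Chroma to return metadata and distances alongside the documents. A quick sketch using the include parameter:
results = collection.query(
    query_texts=[query],
    n_results=3,
    include=["documents", "metadatas", "distances"]
)
for doc, meta, dist in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0]
):
    print(f"chunk {meta['chunk_index']} (distance {dist:.3f}): {doc[:80]}...")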
SUMMARY
With Docling, you can easily:
- Extract content from multiple document formats into a clean, structured representation.
- Chunk documents intelligently, preserving their logical hierarchy.
- Generate embeddings using your preferred model (like OpenAI).
- Store and retrieve embeddings efficiently in a local vector database like Chroma.