# Testing Chonkie 🧪
## Introduction
It’s been a few months now since I first spotted the Chonkie project page, thanks to a link that appeared in one of my many tech blog subscriptions. I’ve been fascinated by its potential ever since, and although I haven’t had the chance to commit to a full-scale deployment, I immediately began the process of integrating and testing its features — slowly but surely — to evaluate its capabilities for my complex document handling needs.
First of all, what is Chonkie?
ℹ️ Excerpt from the repository
Tired of making your gazillionth chunker? Sick of the overhead of large libraries? Want to chunk your texts quickly and efficiently? Chonkie the mighty hippo is here to help!
🚀 Feature-rich: All the CHONKs you’d ever need
🔄 End-to-end: Fetch, CHONK, refine, embed and ship straight to your vector DB!
✨ Easy to use: Install, Import, CHONK
⚡ Fast: CHONK at the speed of light! zooooom
🪶 Light-weight: No bloat, just CHONK
🔌 32+ integrations: Works with your favorite tools and vector DBs out of the box!
💬 Multilingual: Out-of-the-box support for 56 languages
☁️ Cloud-Friendly: CHONK locally or in the Cloud
🦛 Cute CHONK mascot: psst it’s a pygmy hippo btw
❤️ Moto Moto’s favorite python library
Chonkie is a chunking library that “just works” ✨
## What is Chunking? TL;DR

*Image from [Chunking Strategies for Fine-tuning LLMs](https://medium.com/@bijit211987/chunking-strategies-for-fine-tuning-llms-30d2988c3b7a) by Bijit Ghosh*
In the domain of Large Language Models (LLMs) and related applications like Retrieval-Augmented Generation (RAG), chunking is the crucial process of breaking down large documents or extended text into smaller, manageable, and semantically coherent segments called “chunks.”
This technique is essential for two primary reasons:
- Context Window Management: LLMs have a strict limit on the amount of text (measured in tokens) they can process at one time. Chunking ensures that the retrieved information fits within this context window limit, preventing errors or incomplete processing.
- Improved Retrieval Accuracy (in RAG): For RAG systems, large documents are first chunked, converted into vector embeddings, and stored in a database. When a user asks a question, the system searches the database to retrieve only the most relevant chunks. If chunks are too large, they mix too many topics, leading to a noisy, “averaged” embedding that reduces search accuracy. Smaller, focused chunks lead to more precise retrieval and, ultimately, a more accurate and grounded response from the LLM.
| Strategy | Description | Best For |
| ---------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| **Fixed-Size Chunking** | Splits text into segments of a predetermined, uniform size (e.g., 512 tokens), usually with a specified **overlap** to preserve context continuity. | Simple, structured text like tables or code where content is already organized. |
| **Recursive Chunking** | Attempts to split text using a hierarchical list of separators (e.g., first by paragraph, then by sentence, then by word) until the chunks fit the desired size. | Unstructured, free-flowing text like articles or reports, as it prioritizes natural boundaries. |
| **Semantic Chunking** | Uses an embedding model to measure the **semantic similarity** between sentences. A significant drop in similarity indicates a topic change, which is then used as the breakpoint. | Highly topic-dense or complex documents where maintaining complete ideas is paramount, such as research papers. |
| **Structural/Content-Aware** | Splits based on document structure elements like headers, sections, or tables (e.g., Markdown or HTML tags). | Semi-structured documents where the visual layout reflects the logical organization. |
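To make the first strategy concrete, here is a tiny, illustrative sketch (not from the repository) of fixed-size chunking with overlap, splitting on words rather than tokens for simplicity:

```python
# Toy illustration of fixed-size chunking with overlap.
# Word-based for simplicity; real chunkers like Chonkie count tokens.
def fixed_size_chunks(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    assert 0 <= overlap < size, "overlap must be smaller than the chunk size"
    words = text.split()
    step = size - overlap  # each new chunk re-includes the last `overlap` words
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

chunks = fixed_size_chunks("some long document text ...", size=100, overlap=20)
```

The overlap is what preserves context continuity: a sentence cut in half at a chunk boundary still appears whole in the neighboring chunk.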
The repository gives some code snippets which could work out-of-the-box… but I wanted to try my own use case.
My application started life as a very basic proof of concept, but the scope expanded quickly as my testing and integration progressed. Over successive iterations I added more and more components, tackling everything from robust document conversion with Docling to real-time vector database connectivity with Milvus. The result of this steady, iterative process (and, frankly, a lot of debugging and frustrating failures 💀) is what you see now: the sixteenth version of the application.
So let’s jump into the code 🪂
## Test and Implementation
Elements used to build this application:
- Well, obviously, Chonkie to begin with…
- Docling for document conversion and preparation
- Milvus as the vector database backing the RAG workflow
- Ollama for hosting a local LLM
Below is a functional description of the components ⬇️
| Component | Role in the Pipeline | Description |
| ---------------------------------- | ---------------------------------------- | ------------------------------------------------------------ |
| **Docling** | **Document Conversion & Parsing** | Docling is the crucial first step. It is responsible for ingesting diverse file types (like PDFs, DOCX, etc.) and converting them into a structured, unified format (Markdown text). It handles the complexities of layout, table extraction, and text recognition, ensuring the data is clean and ready for the next step. |
| **Chonkie** | **Text Chunking & Structuring** | After Docling produces the clean text, **Chonkie's RecursiveChunker** takes over. Its role is to intelligently split the large documents into smaller, semantically coherent blocks (or chunks). This recursive process is essential for RAG, as sending a small, relevant chunk to the LLM works much better than sending an entire document. |
| **Custom Gemini Embeddings Class** | **Vectorization** | This custom class is the pipeline's connection to the powerful Gemini embedding model. Its sole purpose is to transform each textual chunk generated by Chonkie into a **high-dimensional vector** (a list of numbers). This vector representation is how the Milvus database understands the meaning and context of the text. |
| **Milvus** | **Vector Database (Storage & Indexing)** | Milvus is the final destination for your vectorized data. Its role is to store and efficiently index the millions of text chunks and their corresponding Gemini vector embeddings. When a user runs a query, Milvus performs the ultra-fast similarity search (Retrieval) to find the most relevant chunks. |
| **Ollama** | **Local LLM Hosting** | **Ollama's** primary role in the overall RAG system is to serve as the local, open-source platform for running the **Large Language Model (LLM)**. After Milvus retrieves the relevant chunks, the LLM hosted by Ollama will use that context to generate the final, grounded answer (Generation). |
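Taken in isolation, the Chonkie step is pleasantly small. Here is a minimal usage sketch, mirroring exactly what the full script below does:

```python
from chonkie import RecursiveChunker

chunker = RecursiveChunker()
chunks = chunker.chunk("Some Markdown text produced by Docling...")
for chunk in chunks:
    # each chunk carries its text plus positional/token metadata
    print(chunk.text[:60])
```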
Since my test and development environment is my laptop, all of these components need to be up and running locally.
- Prepare the Python environment
```bash
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
```
- For once I created a requirements.txt 😅
```text
pymilvus
ollama
requests
chonkie[all]
docling[all]
```

```bash
pip install -r requirements.txt
```
- Make sure Milvus is running (for the first versions of the app I was simulating Milvus)
```bash
curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh
bash standalone_embed.sh start

# in my case, as I use Podman (and even though Docker=Podman on my laptop)
# I have another .sh
bash standalone_embed_podman.sh start
```
- Test your Milvus instance
```python
from pymilvus import connections

try:
    connections.connect(host="127.0.0.1", port="19530")
    print("Milvus connection successful!")
    connections.disconnect("default")
except Exception as e:
    print(f"Milvus connection failed: {e}")
```
- If you want to use GeminiEmbeddings as I did, you need a Gemini API key, which you should make available to the app, either via a `.env` file or manually, like so:
```bash
export GEMINI_API_KEY="YOUR-API-KEY"
```
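If you go the `.env` route instead, a minimal sketch using the python-dotenv package (an extra dependency, not listed in the requirements above) could look like this:

```python
# Hypothetical .env loading; assumes `pip install python-dotenv`
# and a .env file containing: GEMINI_API_KEY=YOUR-API-KEY
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory into the process env
api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise SystemExit("GEMINI_API_KEY not found in environment or .env file")
```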
OK… now comes the huge part of the code… 😰. It is basically a three-stage process focused on extracting complex document structure, generating high-quality vector embeddings, and ensuring robust storage. The logic first uses Docling's DocumentConverter to ingest multi-format files from the ./input folder, reliably transforming them (including complex PDFs, via the StandardPdfPipeline and PyPdfiumDocumentBackend) into structured content. The process then leverages Chonkie's RecursiveChunker to intelligently break this structured content into meaningful textual segments, which are passed to a custom GeminiEmbeddings class. This class calls the Gemini API directly to generate the high-dimensional vector embeddings needed for search. Finally, the resulting text chunks, metadata, and embeddings are handed to the MilvusManager class, which handles the connection, collection setup, and data insertion into the real Milvus vector database, completing the ingestion cycle for the RAG system.
```python
import os
import requests
import time
from pathlib import Path
from typing import List, Optional
from datetime import datetime

try:
    from pymilvus import connections, utility, FieldSchema, CollectionSchema, DataType, Collection
    PYMILVUS_INSTALLED = True
except ImportError:
    PYMILVUS_INSTALLED = False
    print("⚠️ pymilvus not installed. Milvus operations will be simulated.")

from chonkie import RecursiveChunker
from chonkie.types import Chunk

# Docling Imports
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.pipeline.simple_pipeline import SimplePipeline
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
# --- Custom Gemini Embeddings Class ---
class GeminiEmbeddings:
    def __init__(self, model: str):
        self.model = model
        self.base_url = "https://generativelanguage.googleapis.com/v1beta/models"
        self._embedding_dimension = 0
        self.api_key = os.getenv("GEMINI_API_KEY")
        if not self.api_key:
            raise ValueError("GEMINI_API_KEY environment variable is not set.")

    @property
    def embedding_dimension(self) -> int:
        if self._embedding_dimension == 0:
            self.embed(["test for dimension"])
        if self._embedding_dimension == 0:
            raise RuntimeError("Embedding dimension could not be determined after dummy call.")
        return self._embedding_dimension

    def embed(self, texts: List[str]) -> List[List[float]]:
        embeddings_vectors = []
        for text in texts:
            url = f"{self.base_url}/{self.model}:embedContent?key={self.api_key}"
            payload = {
                "content": {"parts": [{"text": text}]}
            }
            try:
                # Retry with exponential backoff on rate-limit (429) responses
                max_retries = 3
                wait_time = 1
                for attempt in range(max_retries):
                    response = requests.post(url, json=payload, timeout=60)
                    if response.status_code == 429 and attempt < max_retries - 1:
                        time.sleep(wait_time)
                        wait_time *= 2
                        continue
                    response.raise_for_status()
                    break
                data = response.json()
                vector = data.get('embedding', {}).get('values', [])
                if not vector:
                    raise ValueError(f"Gemini API returned an empty embedding for: '{text[:20]}...'")
                embeddings_vectors.append(vector)
                if self._embedding_dimension == 0:
                    self._embedding_dimension = len(vector)
            except requests.exceptions.RequestException as e:
                # Request failures (e.g., an HTTP 500) land here; surface them to
                # the caller, which reports the error and moves on to the next file.
                raise RuntimeError(f"Gemini API Request Failed for text: '{text[:20]}...'. Error: {e}")
        return embeddings_vectors
# --- Configuration (Gemini & Milvus) ---
INPUT_DIR = Path("./input")
OUTPUT_DIR = Path("./output")

# Milvus Configuration
MILVUS_HOST = os.getenv("MILVUS_HOST", "localhost")
MILVUS_PORT = os.getenv("MILVUS_PORT", "19530")
MILVUS_COLLECTION_NAME = "document_chunks"

# Gemini
GEMINI_MODEL_ID = "text-embedding-004"
# --- Milvus
class MilvusManager:
    """
    Manages the Milvus connection, collection setup, and chunk insertion,
    falling back to simulation mode when pymilvus is unavailable.
    """
    def __init__(self, collection_name: str, dim: int, host: str, port: str):
        self.collection_name = collection_name
        self.dim = dim
        self.host = host
        self.port = port
        self.collection: Optional[Collection] = None
        self.data_count = 0
        self.is_real_milvus = PYMILVUS_INSTALLED
        if self.is_real_milvus:
            print(f"✅ Initialized Milvus Manager for real connection to {host}:{port}")
        else:
            print("⚠️ Initialized Milvus Manager in SIMULATION mode.")

    def connect(self):
        if not self.is_real_milvus:
            print("🔗 Connecting to local Milvus instance (Simulated)...")
            print("🔗 Connection successful (Simulated).")
            return
        try:
            print(f"🔗 Connecting to Milvus at {self.host}:{self.port}...")
            connections.connect(host=self.host, port=self.port)
            print("🔗 Connection successful.")
            self._setup_collection()
        except Exception as e:
            print(f"❌ Failed to connect to Milvus: {e}")
            print("❌ Switching to SIMULATION mode.")
            self.is_real_milvus = False
            self.connect()  # Call simulation connect

    def _setup_collection(self):
        """Creates the collection if it doesn't exist, otherwise loads it."""
        fields = [
            FieldSchema(name="chunk_id", dtype=DataType.INT64, is_primary=True, auto_id=True),
            FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
            FieldSchema(name="source_file", dtype=DataType.VARCHAR, max_length=256),
            FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=self.dim)
        ]
        schema = CollectionSchema(fields, description="Chunks from document ingestion pipeline")
        if utility.has_collection(self.collection_name):
            print(f"📂 Collection '{self.collection_name}' already exists. Loading...")
            self.collection = Collection(self.collection_name)
        else:
            print(f"✨ Creating new collection: '{self.collection_name}'...")
            self.collection = Collection(self.collection_name, schema, consistency_level="Strong")
            # Create an index
            index_params = {
                "metric_type": "COSINE",
                "index_type": "IVF_FLAT",
                "params": {"nlist": 128}
            }
            self.collection.create_index(
                field_name="embedding",
                index_params=index_params
            )
            print("✅ Collection created and index applied.")
        # Ensure collection is loaded into memory for operations
        self.collection.load()

    def insert_chunks(self, chunks: List[Chunk], embeddings: List[List[float]], source_file: str):
        """Inserts chunks and embeddings into the vector store."""
        if len(chunks) != len(embeddings):
            raise ValueError("Mismatched number of chunks and embeddings.")
        chunk_texts = [c.text for c in chunks]
        source_files = [source_file] * len(chunks)
        if self.is_real_milvus and self.collection:
            data = [
                chunk_texts,
                source_files,
                embeddings
            ]
            try:
                # Insert the data: [text, source_file, embedding]
                insert_result = self.collection.insert(data)
                self.collection.flush()
                inserted_count = len(insert_result.primary_keys)
                self.data_count += inserted_count
                print(f"   -> Successfully inserted {inserted_count} records into Milvus for {Path(source_file).name}.")
            except Exception as e:
                print(f"   [ERROR] Failed to insert data into Milvus for {Path(source_file).name}: {e}")
        else:
            # Simulation mode
            inserted_count = len(chunks)
            self.data_count += inserted_count
            print(f"   -> Successfully prepared {inserted_count} records from {Path(source_file).name} for storage (Simulated).")
# Core
def prepare_and_store_documents():
    """Main function to orchestrate reading, conversion (Docling), chunking, embedding, and storage."""
    if not INPUT_DIR.exists():
        INPUT_DIR.mkdir()
        print(f"📂 Created input directory: '{INPUT_DIR}'")
        print("\n**Please add your documents (PDF, DOCX, TXT, etc.) to the './input' folder and run again.**\n")
        return
    if not os.getenv("GEMINI_API_KEY"):
        print("\n" + "="*70)
        print("❌ CRITICAL: GEMINI_API_KEY environment variable is missing.")
        print("Please set your API key using:")
        print("export GEMINI_API_KEY='your-api-key-here'")
        print("="*70 + "\n")
        return
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    print(f"📂 Output directory '{OUTPUT_DIR}' ensured.")
    file_paths: List[Path] = [p for p in INPUT_DIR.rglob('*.*') if p.is_file()]
    if not file_paths:
        print("⚠️ No documents found in the './input' directory to process. Exiting.")
        return
    start_time = datetime.now()
    pdf_pipeline_options = PdfPipelineOptions()
    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=StandardPdfPipeline,
                backend=PyPdfiumDocumentBackend,
                pipeline_options=pdf_pipeline_options,
            )
        }
    )
    print("📚 Initialized Docling DocumentConverter for multi-format support (PDF using StandardPdfPipeline and PyPdfiumDocumentBackend).")
    try:
        print(f"\n--- Initializing Gemini Embeddings (Custom Provider: {GEMINI_MODEL_ID}) ---")
        embeddings = GeminiEmbeddings(model=GEMINI_MODEL_ID)
        print("   -> Generating dummy embedding to determine dimension...")
        embedding_dim = embeddings.embedding_dimension
        actual_model = GEMINI_MODEL_ID
        if embedding_dim < 100:
            raise ValueError(f"Invalid embedding dimension ({embedding_dim}).")
        print(f"✨ Initialized Embeddings Model: {actual_model} (Custom Implementation). Dim: {embedding_dim}")
    except Exception as e:
        print(f"❌ Failed to initialize embedding model (Gemini). Error: {e}")
        return
    chunker = RecursiveChunker()
    print("✂️ Initialized Recursive Chunker.")
    milvus_manager = MilvusManager(
        MILVUS_COLLECTION_NAME,
        embedding_dim,
        MILVUS_HOST,
        MILVUS_PORT
    )
    milvus_manager.connect()
    processed_files_count = 0
    file_reports: List[str] = []
    print("\n--- Starting Docling Conversion ---")
    conversion_results = doc_converter.convert_all(file_paths)
    print("--- Docling Conversion Complete ---")
    for res in conversion_results:
        file_name = res.input.file.name
        print(f"\n--- Processing: {file_name} ---")
        if res.document is None:
            error_message = res.conversion_error if res.conversion_error else "Unknown conversion error."
            print(f"   [SKIP] Docling failed to convert document: {file_name}. Reason: {error_message}")
            file_reports.append(f"- **{file_name}**: SKIPPED (Docling conversion failed: `{error_message}`).")
            continue
        try:
            text_content = res.document.export_to_markdown()
            chunks: List[Chunk] = chunker.chunk(text_content)
            print(f"   -> Generated {len(chunks)} chunks.")
            chunk_texts = [c.text for c in chunks]
            try:
                embeddings_vectors = embeddings.embed(chunk_texts)
                print(f"   -> Generated {len(embeddings_vectors)} embeddings.")
            except RuntimeError as api_error:
                raise RuntimeError(f"Gemini API Error during embedding: {api_error}")
            # C. Storage / Milvus
            milvus_manager.insert_chunks(chunks, embeddings_vectors, str(res.input.file))
            processed_files_count += 1
            file_reports.append(f"- **{file_name}**: {len(chunks)} chunks ingested (Milvus Status: {'Real' if milvus_manager.is_real_milvus else 'Simulated'}).")
        except Exception as e:
            print(f"   [ERROR] Failed to chunk/embed/store {file_name}: {e}")
            file_reports.append(f"- **{file_name}**: ERROR (Chunking/Embedding failed: `{e}`).")
    end_time = datetime.now()
    total_time = end_time - start_time
    # timestamp
    timestamp_str = start_time.strftime("%Y%m%d_%H%M%S")
    report_lines = [
        f"# Document Ingestion Report (Gemini / {GEMINI_MODEL_ID} via Custom Implementation)",
        f"**Run Timestamp (Start Time):** {start_time.strftime('%Y-%m-%d %H:%M:%S')}",
        f"**Input Directory:** `{INPUT_DIR.resolve()}`",
        f"**Vector Store Status:** {'REAL Milvus connection' if milvus_manager.is_real_milvus else 'SIMULATION MODE'}",
        f"**Milvus Target:** `{MILVUS_HOST}:{MILVUS_PORT}`",
        f"**Total Files Found**: {len(file_paths)}\n",
        "## Processed Files\n"
    ]
    report_lines.extend(file_reports)
    report_lines.append("\n## Final Summary\n")
    report_lines.append(f"- **Total Files Scanned**: {len(file_paths)}")
    report_lines.append(f"- **Total Files Processed (Successfully Ingested)**: {processed_files_count}")
    report_lines.append(f"- **Total Chunks Created**: {milvus_manager.data_count}")
    report_lines.append(f"- **Milvus Collection**: `{MILVUS_COLLECTION_NAME}`")
    report_lines.append(f"- **Total Processing Time**: {total_time}\n")
    report_lines.append("\n---")
    report_lines.append("Note: Documents were pre-processed using **Docling** to handle multi-format input.")
    report_filepath = OUTPUT_DIR / f"ingestion_report_{timestamp_str}.md"
    with open(report_filepath, 'w', encoding='utf-8') as f:
        f.write("\n".join(report_lines))
    print("\n" + "=" * 60)
    print("🎉 Document Ingestion Pipeline Completed!")
    print(f"⏱️ Total time taken: {total_time}")
    print(f"📝 Report saved to: {report_filepath.resolve()}")
    print(f"💾 Total records stored/prepared: {milvus_manager.data_count}")
    print("=" * 60)


if __name__ == "__main__":
    prepare_and_store_documents()
```
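The script above stops at ingestion. For completeness, here is a hypothetical sketch of the retrieval and generation side, which is not part of the repository code: it embeds a question with the same GeminiEmbeddings class, asks Milvus for the nearest chunks, and hands them as context to a model served locally by Ollama. The module name `ingest` and the model name `llama3.2` are assumptions; use your own file layout and whatever model you have pulled.

```python
# Hypothetical retrieval + generation sketch; NOT part of the repository code.
# Assumes the ingestion script above already ran, and that its GeminiEmbeddings
# class is importable (the module name "ingest" is made up here).
import ollama
from ingest import GeminiEmbeddings
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")
collection = Collection("document_chunks")
collection.load()

question = "What does the paper say about table extraction?"
query_vector = GeminiEmbeddings(model="text-embedding-004").embed([question])[0]

# Retrieve the 3 most similar chunks (COSINE matches the index created above)
results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"nprobe": 10}},
    limit=3,
    output_fields=["text", "source_file"],
)
context = "\n\n".join(hit.entity.get("text") for hit in results[0])

# Hand the retrieved context to a locally hosted model via Ollama
# (the model name is an assumption; use whatever you have pulled)
response = ollama.chat(
    model="llama3.2",
    messages=[{
        "role": "user",
        "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
    }],
)
print(response["message"]["content"])
```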
- The documents I used to run the application are in .pdf, .xlsx and .pptx format (thanks to Docling).
As I said earlier, I ran the application several times while debugging it, hence the timestamped Markdown report for each run.
```markdown
# Document Ingestion Report (Gemini / text-embedding-004 via Custom Implementation)
**Run Timestamp (Start Time):** 2025-10-23 21:05:37
**Input Directory:** `/Users/xxx/Devs/chonkie-milvus-docling/input`
**Vector Store Status:** REAL Milvus connection
**Milvus Target:** `localhost:19530`
**Total Files Found**: 3

## Processed Files

- **Medium-Articles.xlsx**: 84 chunks ingested (Milvus Status: Real).
- **2203.01017v2.pdf**: 44 chunks ingested (Milvus Status: Real).
- **powerpoint_with_image.pptx**: 1 chunks ingested (Milvus Status: Real).

## Final Summary

- **Total Files Scanned**: 3
- **Total Files Processed (Successfully Ingested)**: 3
- **Total Chunks Created**: 129
- **Milvus Collection**: `document_chunks`
- **Total Processing Time**: 0:01:20.684797

---
Note: Documents were pre-processed using **Docling** to handle multi-format input.
```
## Conclusion
This project demonstrates the power of combining specialized tools for scalable RAG implementation. The use of Docling is a game-changer because it moves beyond simple text extraction; it intelligently parses complex structures like tables and headers across diverse file formats (PDFs, DOCX, etc.), delivering clean, semantically rich Markdown.
This structured input is precisely why Chonkie is so effective. By receiving a clean, hierarchical text format, Chonkie’s RecursiveChunker can accelerate the chunking process and—crucially—ensure that chunks maintain semantic coherence. This leads to higher-quality embeddings and, ultimately, significantly more accurate answers from the final RAG system. By minimizing manual preprocessing and maximizing chunk relevance, this pipeline lays a solid foundation for robust LLM-powered applications.
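If you want to lean on that Markdown structure even more explicitly, Chonkie also ships format-specific recipes for its recursive rules; if I read the documentation correctly, switching to the Markdown recipe should be a one-line change (verify against the current API before relying on it):

```python
from chonkie import RecursiveChunker

# Assumption based on Chonkie's documented recipes; check the current docs.
# Tunes the recursive split rules to Markdown structure (headings, lists, etc.).
chunker = RecursiveChunker.from_recipe("markdown", lang="en")
```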
Thanks for reading 👋
## Links
- Chonkie Repository: https://github.com/chonkie-inc/chonkie
- Docling: https://docling-project.github.io/docling/
- Milvus: https://github.com/milvus-io/milvus
- This code repository: https://github.com/aairom/chonkie-milvus-docling

