# Testing Chonkie 🧪
## Introduction
It’s been a few months now since I first spotted the Chonkie project page, thanks to a link that appeared in one of my many tech blog subscriptions. I’ve been fascinated by its potential ever since, and although I haven’t had the chance to commit to a full-scale deployment, I immediately began the process of integrating and testing its features — slowly but surely — to evaluate its capabilities for my complex document handling needs.
First of all, what is Chonkie?
ℹ️ Excerpt from the repository
Tired of making your gazillionth chunker? Sick of the overhead of large libraries? Want to chunk your texts quickly and efficiently? Chonkie the mighty hippo is here to help!
🚀 Feature-rich: All the CHONKs you’d ever need
🔄 End-to-end: Fetch, CHONK, refine, embed and ship straight to your vector DB!
✨ Easy to use: Install, Import, CHONK
⚡ Fast: CHONK at the speed of light! zooooom
🪶 Light-weight: No bloat, just CHONK
🔌 32+ integrations: Works with your favorite tools and vector DBs out of the box!
💬 Multilingual: Out-of-the-box support for 56 languages
☁️ Cloud-Friendly: CHONK locally or in the Cloud
🦛 Cute CHONK mascot: psst it’s a pygmy hippo btw
❤️ Moto Moto’s favorite python library
Chonkie is a chunking library that “just works” ✨
## What is Chunking? TL;DR

*Image from [Chunking Strategies for Fine-tuning LLMs](https://medium.com/@bijit211987/chunking-strategies-for-fine-tuning-llms-30d2988c3b7a) by Bijit Ghosh*
In the domain of Large Language Models (LLMs) and related applications like Retrieval-Augmented Generation (RAG), chunking is the crucial process of breaking down large documents or extended text into smaller, manageable, and semantically coherent segments called “chunks.”
This technique is essential for two primary reasons:
- Context Window Management: LLMs have a strict limit on the amount of text (measured in tokens) they can process at one time. Chunking ensures that the retrieved information fits within this context window limit, preventing errors or incomplete processing.
- Improved Retrieval Accuracy (in RAG): For RAG systems, large documents are first chunked, converted into vector embeddings, and stored in a database. When a user asks a question, the system searches the database to retrieve only the most relevant chunks. If chunks are too large, they mix too many topics, leading to a noisy, “averaged” embedding that reduces search accuracy. Smaller, focused chunks lead to more precise retrieval and, ultimately, a more accurate and grounded response from the LLM.
| Strategy | Description | Best For |
| ---------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| **Fixed-Size Chunking** | Splits text into segments of a predetermined, uniform size (e.g., 512 tokens), usually with a specified **overlap** to preserve context continuity. | Simple, structured text like tables or code where content is already organized. |
| **Recursive Chunking** | Attempts to split text using a hierarchical list of separators (e.g., first by paragraph, then by sentence, then by word) until the chunks fit the desired size. | Unstructured, free-flowing text like articles or reports, as it prioritizes natural boundaries. |
| **Semantic Chunking** | Uses an embedding model to measure the **semantic similarity** between sentences. A significant drop in similarity indicates a topic change, which is then used as the breakpoint. | Highly topic-dense or complex documents where maintaining complete ideas is paramount, such as research papers. |
| **Structural/Content-Aware** | Splits based on document structure elements like headers, sections, or tables (e.g., Markdown or HTML tags). | Semi-structured documents where the visual layout reflects the logical organization. |
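To make the first strategy concrete, here is a tiny, illustrative sketch (not from the repository) of fixed-size chunking with overlap, splitting on words rather than tokens for simplicity:

```python
# Toy illustration of fixed-size chunking with overlap.
# Word-based for simplicity; real chunkers like Chonkie count tokens.
def fixed_size_chunks(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    assert 0 <= overlap < size, "overlap must be smaller than the chunk size"
    words = text.split()
    step = size - overlap  # each new chunk re-includes the last `overlap` words
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

chunks = fixed_size_chunks("some long document text ...", size=100, overlap=20)
```

The overlap is what preserves context continuity: a sentence cut in half at a chunk boundary still appears whole in the neighboring chunk.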
The repository gives some code snippets which could work out-of-the-box… but I wanted to try my own use case.
My application started life as a very basic proof of concept, but the scope expanded quickly as my testing and integration progressed. Over successive iterations I added more and more components, tackling everything from robust document conversion with Docling to real-time vector database connectivity with Milvus. The result of this steady, iterative process (and, frankly, a lot of debugging and frustrating failures 💀) is what you see now: the sixteenth version of the application.
So let’s jump into the code 🪂
## Test and Implementation
Elements used to build this application:
- Well, obviously, Chonkie to begin with…
- Docling for document conversion and preparation
- Milvus as the vector database backing the RAG workflow
- Ollama for hosting a local LLM
Below is a functional description of the components ⬇️
| Component | Role in the Pipeline | Description |
| ---------------------------------- | ---------------------------------------- | ------------------------------------------------------------ |
| **Docling** | **Document Conversion & Parsing** | Docling is the crucial first step. It is responsible for ingesting diverse file types (like PDFs, DOCX, etc.) and converting them into a structured, unified format (Markdown text). It handles the complexities of layout, table extraction, and text recognition, ensuring the data is clean and ready for the next step. |
| **Chonkie** | **Text Chunking & Structuring** | After Docling produces the clean text, **Chonkie's RecursiveChunker** takes over. Its role is to intelligently split the large documents into smaller, semantically coherent blocks (or chunks). This recursive process is essential for RAG, as sending a small, relevant chunk to the LLM works much better than sending an entire document. |
| **Custom Gemini Embeddings Class** | **Vectorization** | This custom class is the pipeline's connection to the powerful Gemini embedding model. Its sole purpose is to transform each textual chunk generated by Chonkie into a **high-dimensional vector** (a list of numbers). This vector representation is how the Milvus database understands the meaning and context of the text. |
| **Milvus** | **Vector Database (Storage & Indexing)** | Milvus is the final destination for your vectorized data. Its role is to store and efficiently index the millions of text chunks and their corresponding Gemini vector embeddings. When a user runs a query, Milvus performs the ultra-fast similarity search (Retrieval) to find the most relevant chunks. |
| **Ollama** | **Local LLM Hosting** | **Ollama's** primary role in the overall RAG system is to serve as the local, open-source platform for running the **Large Language Model (LLM)**. After Milvus retrieves the relevant chunks, the LLM hosted by Ollama will use that context to generate the final, grounded answer (Generation). |
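Taken in isolation, the Chonkie step is pleasantly small. Here is a minimal usage sketch, mirroring exactly what the full script below does:

```python
from chonkie import RecursiveChunker

chunker = RecursiveChunker()
chunks = chunker.chunk("Some Markdown text produced by Docling...")
for chunk in chunks:
    # each chunk carries its text plus positional/token metadata
    print(chunk.text[:60])
```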
Since my test and development environment is my laptop, all of these components need to be up and running locally.
- Prepare the Python environment
```bash
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
```
- For once I created a requirements.txt 😅
```text
pymilvus
ollama
requests
chonkie[all]
docling[all]
```

```bash
pip install -r requirements.txt
```
- Make sure Milvus is running (for the first versions of the app I was simulating Milvus)
```bash
curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh
bash standalone_embed.sh start

# in my case, as I use Podman (and even though Docker=Podman on my laptop)
# I have another .sh
bash standalone_embed_podman.sh start
```
- Test your Milvus instance
```python
from pymilvus import connections

try:
    connections.connect(host="127.0.0.1", port="19530")
    print("Milvus connection successful!")
    connections.disconnect("default")
except Exception as e:
    print(f"Milvus connection failed: {e}")
```
- If you want to use GeminiEmbeddings as I did, you need a Gemini API key, which you should make available to the app, either via a `.env` file or manually, like so:
```bash
export GEMINI_API_KEY="YOUR-API-KEY"
```
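If you go the `.env` route instead, a minimal sketch using the python-dotenv package (an extra dependency, not listed in the requirements above) could look like this:

```python
# Hypothetical .env loading; assumes `pip install python-dotenv`
# and a .env file containing: GEMINI_API_KEY=YOUR-API-KEY
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory into the process env
api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise SystemExit("GEMINI_API_KEY not found in environment or .env file")
```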
OK… now comes the huge part of the code… 😰. It is basically a three-stage process focused on extracting complex document structure, generating high-quality vector embeddings, and ensuring robust storage. The logic first uses Docling's DocumentConverter to ingest multi-format files from the ./input folder, reliably transforming them (including complex PDFs, via the StandardPdfPipeline and PyPdfiumDocumentBackend) into structured content. The process then leverages Chonkie's RecursiveChunker to intelligently break this structured content into meaningful textual segments, which are passed to a custom GeminiEmbeddings class. This class calls the Gemini API directly to generate the high-dimensional vector embeddings needed for search. Finally, the resulting text chunks, metadata, and embeddings are handed to the MilvusManager class, which handles the connection, collection setup, and data insertion into the real Milvus vector database, completing the ingestion cycle for the RAG system.
```python
import os
import requests
import time
from pathlib import Path
from typing import List, Optional
from datetime import datetime

try:
    from pymilvus import connections, utility, FieldSchema, CollectionSchema, DataType, Collection
    PYMILVUS_INSTALLED = True
except ImportError:
    PYMILVUS_INSTALLED = False
    print("⚠️ pymilvus not installed. Milvus operations will be simulated.")

from chonkie import RecursiveChunker
from chonkie.types import Chunk

# Docling Imports
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.pipeline.simple_pipeline import SimplePipeline
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
# --- Custom Gemini Embeddings Class ---
class GeminiEmbeddings:
    def __init__(self, model: str):
        self.model = model
        self.base_url = "https://generativelanguage.googleapis.com/v1beta/models"
        self._embedding_dimension = 0
        self.api_key = os.getenv("GEMINI_API_KEY")
        if not self.api_key:
            raise ValueError("GEMINI_API_KEY environment variable is not set.")

    @property
    def embedding_dimension(self) -> int:
        if self._embedding_dimension == 0:
            self.embed(["test for dimension"])
        if self._embedding_dimension == 0:
            raise RuntimeError("Embedding dimension could not be determined after dummy call.")
        return self._embedding_dimension

    def embed(self, texts: List[str]) -> List[List[float]]:
        embeddings_vectors = []
        for text in texts:
            url = f"{self.base_url}/{self.model}:embedContent?key={self.api_key}"
            payload = {
                "content": {"parts": [{"text": text}]}
            }
            try:
                # Retry with exponential backoff on rate-limit (429) responses
                max_retries = 3
                wait_time = 1
                for attempt in range(max_retries):
                    response = requests.post(url, json=payload, timeout=60)
                    if response.status_code == 429 and attempt < max_retries - 1:
                        time.sleep(wait_time)
                        wait_time *= 2
                        continue
                    response.raise_for_status()
                    break
                data = response.json()
                vector = data.get('embedding', {}).get('values', [])
                if not vector:
                    raise ValueError(f"Gemini API returned an empty embedding for: '{text[:20]}...'")
                embeddings_vectors.append(vector)
                if self._embedding_dimension == 0:
                    self._embedding_dimension = len(vector)
            except requests.exceptions.RequestException as e:
                # Request failures (e.g., an HTTP 500) land here; surface them to
                # the caller, which reports the error and moves on to the next file.
                raise RuntimeError(f"Gemini API Request Failed for text: '{text[:20]}...'. Error: {e}")
        return embeddings_vectors
# --- Configuration (Gemini & Milvus) ---
INPUT_DIR = Path("./input")
OUTPUT_DIR = Path("./output")

# Milvus Configuration
MILVUS_HOST = os.getenv("MILVUS_HOST", "localhost")
MILVUS_PORT = os.getenv("MILVUS_PORT", "19530")
MILVUS_COLLECTION_NAME = "document_chunks"

# Gemini
GEMINI_MODEL_ID = "text-embedding-004"
# --- Milvus
class MilvusManager:
    """
    Manages the Milvus connection, collection setup, and chunk insertion,
    falling back to simulation mode when pymilvus is unavailable.
    """
    def __init__(self, collection_name: str, dim: int, host: str, port: str):
        self.collection_name = collection_name
        self.dim = dim
        self.host = host
        self.port = port
        self.collection: Optional[Collection] = None
        self.data_count = 0
        self.is_real_milvus = PYMILVUS_INSTALLED
        if self.is_real_milvus:
            print(f"✅ Initialized Milvus Manager for real connection to {host}:{port}")
        else:
            print("⚠️ Initialized Milvus Manager in SIMULATION mode.")

    def connect(self):
        if not self.is_real_milvus:
            print("🔗 Connecting to local Milvus instance (Simulated)...")
            print("🔗 Connection successful (Simulated).")
            return
        try:
            print(f"🔗 Connecting to Milvus at {self.host}:{self.port}...")
            connections.connect(host=self.host, port=self.port)
            print("🔗 Connection successful.")
            self._setup_collection()
        except Exception as e:
            print(f"❌ Failed to connect to Milvus: {e}")
            print("❌ Switching to SIMULATION mode.")
            self.is_real_milvus = False
            self.connect()  # Call simulation connect

    def _setup_collection(self):
        """Creates the collection if it doesn't exist, otherwise loads it."""
        fields = [
            FieldSchema(name="chunk_id", dtype=DataType.INT64, is_primary=True, auto_id=True),
            FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
            FieldSchema(name="source_file", dtype=DataType.VARCHAR, max_length=256),
            FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=self.dim)
        ]
        schema = CollectionSchema(fields, description="Chunks from document ingestion pipeline")
        if utility.has_collection(self.collection_name):
            print(f"📂 Collection '{self.collection_name}' already exists. Loading...")
            self.collection = Collection(self.collection_name)
        else:
            print(f"✨ Creating new collection: '{self.collection_name}'...")
            self.collection = Collection(self.collection_name, schema, consistency_level="Strong")
            # Create an index
            index_params = {
                "metric_type": "COSINE",
                "index_type": "IVF_FLAT",
                "params": {"nlist": 128}
            }
            self.collection.create_index(
                field_name="embedding",
                index_params=index_params
            )
            print("✅ Collection created and index applied.")
        # Ensure collection is loaded into memory for operations
        self.collection.load()

    def insert_chunks(self, chunks: List[Chunk], embeddings: List[List[float]], source_file: str):
        """Inserts chunks and embeddings into the vector store."""
        if len(chunks) != len(embeddings):
            raise ValueError("Mismatched number of chunks and embeddings.")
        chunk_texts = [c.text for c in chunks]
        source_files = [source_file] * len(chunks)
        if self.is_real_milvus and self.collection:
            data = [
                chunk_texts,
                source_files,
                embeddings
            ]
            try:
                # Insert the data: [text, source_file, embedding]
                insert_result = self.collection.insert(data)
                self.collection.flush()
                inserted_count = len(insert_result.primary_keys)
                self.data_count += inserted_count
                print(f"   -> Successfully inserted {inserted_count} records into Milvus for {Path(source_file).name}.")
            except Exception as e:
                print(f"   [ERROR] Failed to insert data into Milvus for {Path(source_file).name}: {e}")
        else:
            # Simulation mode
            inserted_count = len(chunks)
            self.data_count += inserted_count
            print(f"   -> Successfully prepared {inserted_count} records from {Path(source_file).name} for storage (Simulated).")
# Core
def prepare_and_store_documents():
    """Main function to orchestrate reading, conversion (Docling), chunking, embedding, and storage."""
    if not INPUT_DIR.exists():
        INPUT_DIR.mkdir()
        print(f"📂 Created input directory: '{INPUT_DIR}'")
        print("\n**Please add your documents (PDF, DOCX, TXT, etc.) to the './input' folder and run again.**\n")
        return
    if not os.getenv("GEMINI_API_KEY"):
        print("\n" + "="*70)
        print("❌ CRITICAL: GEMINI_API_KEY environment variable is missing.")
        print("Please set your API key using:")
        print("export GEMINI_API_KEY='your-api-key-here'")
        print("="*70 + "\n")
        return
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    print(f"📂 Output directory '{OUTPUT_DIR}' ensured.")
    file_paths: List[Path] = [p for p in INPUT_DIR.rglob('*.*') if p.is_file()]
    if not file_paths:
        print("⚠️ No documents found in the './input' directory to process. Exiting.")
        return
    start_time = datetime.now()
    pdf_pipeline_options = PdfPipelineOptions()
    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=StandardPdfPipeline,
                backend=PyPdfiumDocumentBackend,
                pipeline_options=pdf_pipeline_options,
            )
        }
    )
    print("📚 Initialized Docling DocumentConverter for multi-format support (PDF using StandardPdfPipeline and PyPdfiumDocumentBackend).")
    try:
        print(f"\n--- Initializing Gemini Embeddings (Custom Provider: {GEMINI_MODEL_ID}) ---")
        embeddings = GeminiEmbeddings(model=GEMINI_MODEL_ID)
        print("   -> Generating dummy embedding to determine dimension...")
        embedding_dim = embeddings.embedding_dimension
        actual_model = GEMINI_MODEL_ID
        if embedding_dim < 100:
            raise ValueError(f"Invalid embedding dimension ({embedding_dim}).")
        print(f"✨ Initialized Embeddings Model: {actual_model} (Custom Implementation). Dim: {embedding_dim}")
    except Exception as e:
        print(f"❌ Failed to initialize embedding model (Gemini). Error: {e}")
        return
    chunker = RecursiveChunker()
    print("✂️ Initialized Recursive Chunker.")
    milvus_manager = MilvusManager(
        MILVUS_COLLECTION_NAME,
        embedding_dim,
        MILVUS_HOST,
        MILVUS_PORT
    )
    milvus_manager.connect()
    processed_files_count = 0
    file_reports: List[str] = []
    print("\n--- Starting Docling Conversion ---")
    conversion_results = doc_converter.convert_all(file_paths)
    print("--- Docling Conversion Complete ---")
    for res in conversion_results:
        file_name = res.input.file.name
        print(f"\n--- Processing: {file_name} ---")
        if res.document is None:
            error_message = res.conversion_error if res.conversion_error else "Unknown conversion error."
            print(f"   [SKIP] Docling failed to convert document: {file_name}. Reason: {error_message}")
            file_reports.append(f"- **{file_name}**: SKIPPED (Docling conversion failed: `{error_message}`).")
            continue
        try:
            text_content = res.document.export_to_markdown()
            chunks: List[Chunk] = chunker.chunk(text_content)
            print(f"   -> Generated {len(chunks)} chunks.")
            chunk_texts = [c.text for c in chunks]
            try:
                embeddings_vectors = embeddings.embed(chunk_texts)
                print(f"   -> Generated {len(embeddings_vectors)} embeddings.")
            except RuntimeError as api_error:
                raise RuntimeError(f"Gemini API Error during embedding: {api_error}")
            # C. Storage / Milvus
            milvus_manager.insert_chunks(chunks, embeddings_vectors, str(res.input.file))
            processed_files_count += 1
            file_reports.append(f"- **{file_name}**: {len(chunks)} chunks ingested (Milvus Status: {'Real' if milvus_manager.is_real_milvus else 'Simulated'}).")
        except Exception as e:
            print(f"   [ERROR] Failed to chunk/embed/store {file_name}: {e}")
            file_reports.append(f"- **{file_name}**: ERROR (Chunking/Embedding failed: `{e}`).")
    end_time = datetime.now()
    total_time = end_time - start_time
    # timestamp
    timestamp_str = start_time.strftime("%Y%m%d_%H%M%S")
    report_lines = [
        f"# Document Ingestion Report (Gemini / {GEMINI_MODEL_ID} via Custom Implementation)",
        f"**Run Timestamp (Start Time):** {start_time.strftime('%Y-%m-%d %H:%M:%S')}",
        f"**Input Directory:** `{INPUT_DIR.resolve()}`",
        f"**Vector Store Status:** {'REAL Milvus connection' if milvus_manager.is_real_milvus else 'SIMULATION MODE'}",
        f"**Milvus Target:** `{MILVUS_HOST}:{MILVUS_PORT}`",
        f"**Total Files Found**: {len(file_paths)}\n",
        "## Processed Files\n"
    ]
    report_lines.extend(file_reports)
    report_lines.append("\n## Final Summary\n")
    report_lines.append(f"- **Total Files Scanned**: {len(file_paths)}")
    report_lines.append(f"- **Total Files Processed (Successfully Ingested)**: {processed_files_count}")
    report_lines.append(f"- **Total Chunks Created**: {milvus_manager.data_count}")
    report_lines.append(f"- **Milvus Collection**: `{MILVUS_COLLECTION_NAME}`")
    report_lines.append(f"- **Total Processing Time**: {total_time}\n")
    report_lines.append("\n---")
    report_lines.append("Note: Documents were pre-processed using **Docling** to handle multi-format input.")
    report_filepath = OUTPUT_DIR / f"ingestion_report_{timestamp_str}.md"
    with open(report_filepath, 'w', encoding='utf-8') as f:
        f.write("\n".join(report_lines))
    print("\n" + "=" * 60)
    print("🎉 Document Ingestion Pipeline Completed!")
    print(f"⏱️ Total time taken: {total_time}")
    print(f"📝 Report saved to: {report_filepath.resolve()}")
    print(f"💾 Total records stored/prepared: {milvus_manager.data_count}")
    print("=" * 60)


if __name__ == "__main__":
    prepare_and_store_documents()
```
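The script above stops at ingestion. For completeness, here is a hypothetical sketch of the retrieval and generation side, which is not part of the repository code: it embeds a question with the same GeminiEmbeddings class, asks Milvus for the nearest chunks, and hands them as context to a model served locally by Ollama. The module name `ingest` and the model name `llama3.2` are assumptions; use your own file layout and whatever model you have pulled.

```python
# Hypothetical retrieval + generation sketch; NOT part of the repository code.
# Assumes the ingestion script above already ran, and that its GeminiEmbeddings
# class is importable (the module name "ingest" is made up here).
import ollama
from ingest import GeminiEmbeddings
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")
collection = Collection("document_chunks")
collection.load()

question = "What does the paper say about table extraction?"
query_vector = GeminiEmbeddings(model="text-embedding-004").embed([question])[0]

# Retrieve the 3 most similar chunks (COSINE matches the index created above)
results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"nprobe": 10}},
    limit=3,
    output_fields=["text", "source_file"],
)
context = "\n\n".join(hit.entity.get("text") for hit in results[0])

# Hand the retrieved context to a locally hosted model via Ollama
# (the model name is an assumption; use whatever you have pulled)
response = ollama.chat(
    model="llama3.2",
    messages=[{
        "role": "user",
        "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
    }],
)
print(response["message"]["content"])
```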
- The documents I used to run the application are in .pdf, .xlsx and .pptx format (thanks to Docling).
As I said earlier, I ran the application several times while debugging it, hence the timestamped Markdown report for each run.
```markdown
# Document Ingestion Report (Gemini / text-embedding-004 via Custom Implementation)
**Run Timestamp (Start Time):** 2025-10-23 21:05:37
**Input Directory:** `/Users/xxx/Devs/chonkie-milvus-docling/input`
**Vector Store Status:** REAL Milvus connection
**Milvus Target:** `localhost:19530`
**Total Files Found**: 3

## Processed Files

- **Medium-Articles.xlsx**: 84 chunks ingested (Milvus Status: Real).
- **2203.01017v2.pdf**: 44 chunks ingested (Milvus Status: Real).
- **powerpoint_with_image.pptx**: 1 chunks ingested (Milvus Status: Real).

## Final Summary

- **Total Files Scanned**: 3
- **Total Files Processed (Successfully Ingested)**: 3
- **Total Chunks Created**: 129
- **Milvus Collection**: `document_chunks`
- **Total Processing Time**: 0:01:20.684797

---
Note: Documents were pre-processed using **Docling** to handle multi-format input.
```
## Conclusion
This project demonstrates the power of combining specialized tools for scalable RAG implementation. The use of Docling is a game-changer because it moves beyond simple text extraction; it intelligently parses complex structures like tables and headers across diverse file formats (PDFs, DOCX, etc.), delivering clean, semantically rich Markdown.
This structured input is precisely why Chonkie is so effective. By receiving a clean, hierarchical text format, Chonkie’s RecursiveChunker can accelerate the chunking process and—crucially—ensure that chunks maintain semantic coherence. This leads to higher-quality embeddings and, ultimately, significantly more accurate answers from the final RAG system. By minimizing manual preprocessing and maximizing chunk relevance, this pipeline lays a solid foundation for robust LLM-powered applications.
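If you want to lean on that Markdown structure even more explicitly, Chonkie also ships format-specific recipes for its recursive rules; if I read the documentation correctly, switching to the Markdown recipe should be a one-line change (verify against the current API before relying on it):

```python
from chonkie import RecursiveChunker

# Assumption based on Chonkie's documented recipes; check the current docs.
# Tunes the recursive split rules to Markdown structure (headings, lists, etc.).
chunker = RecursiveChunker.from_recipe("markdown", lang="en")
```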
Thanks for reading 👋
## Links
- Chonkie Repository: https://github.com/chonkie-inc/chonkie
- Docling: https://docling-project.github.io/docling/
- Milvus: https://github.com/milvus-io/milvus
- This code repository: https://github.com/aairom/chonkie-milvus-docling

