Testing an open-source library I just read about, based on the work of Oleksii Aleksapolskyi!
Introduction and Disclaimer
I’m always on the lookout for innovative open-source projects, and the article ‘Right Document, Wrong Chunk’ immediately caught my eye. After exploring the author’s repository and finding the approach incredibly compelling, I decided to build a custom interface and application to experiment with it firsthand.
Disclaimer: A huge shout-out to Oleksii Aleksapolskyi (Medium / LinkedIn), whose original research and backbone work made this project possible.
Excerpt from the author’s article:
Most RAG tutorials stop at: split, embed, retrieve top-K. Works fine on a README. Point the same pipeline at a 200-page regulation, a pile of clinical notes, or an RFC. Retrieval looks fine. Recall@5 is fine. MRR is fine. Users still complain.
Often it isn’t the wrong document. It’s the chunk: Article 17 cut in half, two RFC sections welded together, a table header orphaned from its rows. The model gets text that sort of matches the query and sort of doesn’t contain the answer. Right document. Wrong chunk. Standard metrics mostly won’t flag that.
Image from the author Oleksii Aleksapolskyi
The article links to the GitHub repository of “chunkweaver”, and based on the samples provided in that repository, I built an application for my own tests!
The sample architecture schema provided by the author clearly explains the purpose of his implementation!
PDF / DOCX / HTML                       Your vector DB
        │                                     ▲
        ▼                                     │
┌───────────────┐      ┌──────────────┐       │
│   Extractor   │─────▶│  chunkweaver │───────┘
│               │      │              │
│ unstructured  │      │ boundaries   │   Embed + upsert
│ marker-pdf    │      │ detectors    │   into Pinecone,
│ docling       │      │ presets      │   Qdrant, Weaviate,
│ pdfminer      │      │ overlap      │   ChromaDB, etc.
└───────────────┘      └──────────────┘
More interestingly, what I found really great in the GitHub repository are the following paragraphs:
Features
Zero dependencies — stdlib only, no LangChain/LlamaIndex tax
Regex boundaries — you tell the chunker where sections start (^Article \d+, ^## , ^Item 1.)
Hierarchical levels — CHAPTER > Section > Article > clause; split only as deep as needed
Heuristic detectors — HeadingDetector, TableDetector discover structure from text patterns
Annotation ingestion — accept pre-computed structure from any extractor
Semantic overlap — sentences, not characters
Full metadata — offsets, boundary types, hierarchy levels, overlap tracking
Integrations — LangChain and LlamaIndex drop-ins
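To build an intuition for the "regex boundaries" feature before wiring up the full stack, here is a minimal stdlib-only sketch of my own. It does not use chunkweaver's actual API; it only illustrates the underlying idea of splitting a document wherever a boundary pattern such as `^Article \d+` matches, instead of at a fixed character count:

```python
import re

def split_on_boundaries(text: str, pattern: str) -> list[str]:
    """Split text into chunks that each start at a boundary match.

    Illustrative only -- chunkweaver adds hierarchy levels, detectors,
    sentence overlap and rich metadata on top of this basic idea.
    """
    starts = [m.start() for m in re.finditer(pattern, text, flags=re.MULTILINE)]
    if not starts:
        return [text]          # no boundaries found: keep the document whole
    if starts[0] != 0:
        starts.insert(0, 0)    # keep any preamble before the first boundary
    starts.append(len(text))
    return [text[a:b].strip() for a, b in zip(starts, starts[1:])]

doc = "Preamble...\nArticle 1\nScope.\nArticle 2\nDefinitions.\n"
chunks = split_on_boundaries(doc, r"^Article \d+")
print(len(chunks))                         # 3: preamble, Article 1, Article 2
print(chunks[1].startswith("Article 1"))   # True
```

Fixed-size splitting on the same input could easily cut "Article 1" away from its body text, which is exactly the "right document, wrong chunk" failure mode the article describes.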
Presets
Built-in boundary patterns for common document types:
from chunkweaver.presets import (
LEGAL_EU, LEGAL_US, RFC, MARKDOWN,
CHAT, CLINICAL, FINANCIAL, FINANCIAL_TABLE,
SEC_10K, FDA_LABEL, PLAIN,
)
| Preset | Domain | Detects |
| ----------------- | ------------------ | ------------------------------------------------- |
| `LEGAL_EU` | EU legislation | `Article N`, `CHAPTER`, `SECTION`, `(1)` recitals |
| `LEGAL_US` | US law / contracts | `§ N`, `Section N`, `WHEREAS`, `1.1` clauses |
| `RFC` | IETF RFCs | `1. Intro`, `3.1 Overview`, `Appendix A` |
| `MARKDOWN` | Markdown | `# headings`, `---` rules |
| `CHAT` | Chat logs | `[14:30]`, ISO timestamps, `speaker:` turns |
| `CLINICAL` | Medical notes | `HPI:`, `ASSESSMENT:`, `PLAN:`, etc. |
| `FINANCIAL` | SEC filings | `Item 1.`, `PART I`, `NOTE 1`, `Schedule A` |
| `FINANCIAL_TABLE` | Data tables | `TABLE N`, markdown/ASCII separators |
| `SEC_10K` | SEC annual reports | `PART I`–`IV`, `Item N.`, ALL-CAPS sub-headings |
| `FDA_LABEL` | Drug labels | `1 INDICATIONS`, `## 2.1 Adult Dosage` |
| `PLAIN` | Any | No boundaries — pure paragraph/sentence fallback |
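One feature from the list above, "Semantic overlap — sentences, not characters", is also easy to illustrate. The following is a sketch of my own (not the library's implementation): each chunk is prefixed with the last N sentences of the previous chunk, so the overlap never cuts a sentence in half the way character-based overlap can.

```python
import re

def add_sentence_overlap(chunks: list[str], n_sentences: int = 2) -> list[str]:
    """Prefix each chunk with the last n sentences of the previous chunk.

    Sketch of the 'sentences, not characters' overlap idea; the real
    library tracks overlap in chunk metadata rather than naive strings.
    """
    out = [chunks[0]]
    for prev, cur in zip(chunks, chunks[1:]):
        # crude sentence split, good enough for a demo
        sentences = re.split(r"(?<=[.!?])\s+", prev.strip())
        tail = " ".join(sentences[-n_sentences:])
        out.append(tail + " " + cur)
    return out

chunks = ["First point. Second point. Third point.", "Next section begins here."]
overlapped = add_sentence_overlap(chunks, n_sentences=2)
print(overlapped[1])
# Second point. Third point. Next section begins here.
```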
My local Implementation
My codebase for using and testing the “chunkweaver” library is based on:
- a Streamlit UI application,
- a Milvus vector database,
- Ollama for local chat with the RAG’s content,
- and sample data from public sources (at this point I haven’t implemented any document transformation from other formats to text yet; I used original text documents, but I saw that the author mentioned Docling as a solution among the document transformation tools 🤗).
This gives the following architecture for my application ⬇️
The main application
As mentioned above, I built a Python application using Streamlit for the UI, which provides document uploading and/or batch-processing capabilities. The chunked data is then stored in my local Milvus instance.
# app.py
import os
import requests
import streamlit as st
from pathlib import Path

# Chunking & Embedding
from chunkweaver import Chunker
from chunkweaver.presets import LEGAL_EU
from chunkweaver.detector_heading import HeadingDetector
from chunkweaver.detector_table import TableDetector
from sentence_transformers import SentenceTransformer

# Vector DB
from pymilvus import MilvusClient, DataType, CollectionSchema, FieldSchema

# --- CONFIGURATION ---
INPUT_DIR = Path("input")
OUTPUT_DIR = Path("output")
INPUT_DIR.mkdir(exist_ok=True)
OUTPUT_DIR.mkdir(exist_ok=True)

COLLECTION_NAME = "chunkweaver_rag"
MODEL_NAME = 'all-MiniLM-L6-v2'
DIMENSION = 384  # Dimension for the MiniLM model

# --- RESOURCE INITIALIZATION ---
@st.cache_resource
def load_resources():
    # Load embedding model
    embedder = SentenceTransformer(MODEL_NAME)
    # Connect to local Milvus
    client = MilvusClient(uri="http://localhost:19530")
    return embedder, client

embedder, milvus_client = load_resources()

# --- MILVUS ---
def init_milvus_collection():
    """Ensures collection exists, has an index, and is loaded."""
    if not milvus_client.has_collection(COLLECTION_NAME):
        fields = [
            FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
            FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=DIMENSION),
            FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
            FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=500),
            FieldSchema(name="boundary_type", dtype=DataType.VARCHAR, max_length=100),
        ]
        schema = CollectionSchema(fields, description="Chunkweaver RAG collection")
        milvus_client.create_collection(collection_name=COLLECTION_NAME, schema=schema)
        # Create the vector index
        index_params = milvus_client.prepare_index_params()
        index_params.add_index(
            field_name="vector",
            index_type="HNSW",
            metric_type="L2",
            params={"M": 8, "efConstruction": 64}
        )
        milvus_client.create_index(COLLECTION_NAME, index_params)
        st.success("Collection and Index created!")
    milvus_client.load_collection(COLLECTION_NAME)

def run_indexing(file_paths, target_size, overlap):
    init_milvus_collection()
    chunker = Chunker(
        target_size=target_size,
        overlap=overlap,
        boundaries=LEGAL_EU,
        detectors=[HeadingDetector(), TableDetector()]
    )
    all_records = []
    for file_path in file_paths:
        text = file_path.read_text(encoding="utf-8")
        # Process using chunkweaver's metadata-rich chunking
        chunks = chunker.chunk_with_metadata(text)
        for i, c in enumerate(chunks):
            # Save raw chunk to output folder for inspection
            out_file = OUTPUT_DIR / f"{file_path.stem}_chunk_{i}.txt"
            out_file.write_text(c.text, encoding="utf-8")
            # Create embedding
            vector = embedder.encode(c.text).tolist()
            all_records.append({
                "vector": vector,
                "text": c.text,
                "source": file_path.name,
                "boundary_type": c.boundary_type
            })
    if all_records:
        milvus_client.insert(collection_name=COLLECTION_NAME, data=all_records)
        return len(all_records)
    return 0

# --- OLLAMA ---
def get_ollama_models():
    try:
        response = requests.get("http://localhost:11434/api/tags")
        return [m['name'] for m in response.json().get('models', [])]
    except requests.RequestException:
        return []

def query_ollama(model, prompt):
    try:
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=60
        )
        return response.json().get("response", "Error: No response field.")
    except Exception as e:
        return f"Error connecting to Ollama: {e}"

# --- STREAMLIT ---
st.set_page_config(page_title="Chunkweaver RAG", layout="wide")
st.title("🧶 Chunkweaver + Local Milvus + Ollama")

tab1, tab2, tab3 = st.tabs(["📁 Ingest Documents", "🔍 Milvus Data Viewer", "💬 RAG Chat"])

# TAB 1: INGESTION
with tab1:
    col1, col2 = st.columns([1, 2])
    with col1:
        st.subheader("Settings")
        t_size = st.number_input("Target Chunk Size (chars)", 128, 4096, 1024)
        ov_size = st.slider("Sentence Overlap", 0, 5, 2)
        st.divider()
        uploaded_files = st.file_uploader("Upload .txt files", type=['txt'], accept_multiple_files=True)
        mode = st.radio("Processing Mode", ["Only Process Newly Uploaded", "Batch Process ALL in 'input/' folder"])
        if st.button("🚀 Run Chunkweaver & Index"):
            files_to_process = []
            if mode == "Only Process Newly Uploaded":
                if not uploaded_files:
                    st.error("No files uploaded.")
                for f in uploaded_files:
                    p = INPUT_DIR / f.name
                    p.write_bytes(f.getbuffer())
                    files_to_process.append(p)
            else:
                files_to_process = list(INPUT_DIR.glob("*.txt"))
            if files_to_process:
                with st.spinner(f"Processing {len(files_to_process)} files..."):
                    count = run_indexing(files_to_process, t_size, ov_size)
                st.success(f"Success! {count} chunks added to Milvus.")
            else:
                st.warning("No files found to process.")
    with col2:
        st.subheader("Files in 'input/' folder")
        existing_files = list(INPUT_DIR.glob("*.txt"))
        if existing_files:
            for f in existing_files:
                st.text(f"📄 {f.name} ({f.stat().st_size / 1024:.1f} KB)")
        else:
            st.info("The input folder is currently empty.")

# TAB 2: VIEWER
with tab2:
    st.subheader("Current Data in Milvus")
    if st.button("Refresh View"):
        try:
            if milvus_client.has_collection(COLLECTION_NAME):
                milvus_client.load_collection(COLLECTION_NAME)
                results = milvus_client.query(
                    collection_name=COLLECTION_NAME,
                    filter="",
                    limit=20,
                    output_fields=["source", "boundary_type", "text"]
                )
                if results:
                    st.table(results)
                else:
                    st.info("Collection is empty.")
            else:
                st.warning("Collection not created yet. Please index some documents.")
        except Exception as e:
            st.error(f"Milvus Error: {e}")

# TAB 3: CHAT
with tab3:
    st.subheader("Chat with your Data")
    models = get_ollama_models()
    if not models:
        st.error("Ollama not detected. Please ensure 'ollama serve' is running.")
    else:
        selected_model = st.selectbox("Select your Local Model", models)
        user_input = st.text_input("Ask a question about your documents:")
        if user_input:
            try:
                # 1. Retrieval
                milvus_client.load_collection(COLLECTION_NAME)
                query_vec = embedder.encode(user_input).tolist()
                search_results = milvus_client.search(
                    collection_name=COLLECTION_NAME,
                    data=[query_vec],
                    limit=3,
                    output_fields=["text", "source"]
                )
                # 2. Context construction
                retrieved_chunks = []
                for hits in search_results:
                    for hit in hits:
                        retrieved_chunks.append(hit['entity']['text'])
                context = "\n---\n".join(retrieved_chunks)
                # 3. Generation
                full_prompt = f"""Use the following context to answer the user's question.
If the answer isn't in the context, say you don't know.
Context:
{context}
Question: {user_input}
Answer:"""
                with st.spinner("Ollama generating response..."):
                    answer = query_ollama(selected_model, full_prompt)
                st.markdown(f"### Answer\n{answer}")
                with st.expander("View Source Context"):
                    st.write(context)
            except Exception as e:
                st.error(f"RAG Error: {e}")
- Install the dependencies from requirements.txt (note that requests, used for the Ollama calls, is also needed)
pip install -r requirements.txt
chunkweaver
streamlit
pymilvus
sentence-transformers
requests
Milvus Vector DB Implementation
To use Milvus locally, I use Podman along with Podman Desktop.
- Create the local Milvus folder(s)
mkdir volumes && mkdir volumes/milvus
- Download Milvus Images and Run them
# Download the official standalone script
curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh
# Replace 'docker' with 'podman' for compatibility
sed -i 's/docker/podman/g' standalone_embed.sh
# or, if you get an error message on macOS:
sed -i '' 's/docker/podman/g' standalone_embed.sh
# Start Milvus
bash standalone_embed.sh start
- Once the script has run, you’ll get the following structure (or something like it!) ⬇️
- Since I tried creating chunks several times, I used the following script to test my Milvus connection and remove the collections I had already created (very basic, but it works fine).
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")
try:
    # Any round-trip call confirms the server is reachable
    client.list_collections()
    print("Milvus connection successful!")
except Exception as e:
    print(f"Milvus connection failed: {e}")

if client.has_collection("chunkweaver_rag"):
    client.drop_collection("chunkweaver_rag")
    print("Dropped broken collection.")
- You’ll also have the chunk outputs 📩
2024/1689
12.7.2024
REGULATION (EU) 2024/1689 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL
of 13 June 2024
laying down harmonised rules on artificial intelligence and amending Regulations (EC) No 300/2008, (EU) No 167/2013, (EU) No 168/2013, (EU) 2018/858, (EU) 2018/1139 and (EU) 2019/2144 and Directives 2014/90/EU, (EU) 2016/797 and (EU) 2020/1828 (Artificial Intelligence Act)
(Text with EEA relevance)
The chat implementation using Ollama
As always, I rely on my local Ollama setup along with my downloaded models. The UI lets the user pick any installed LLM.
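The chat tab's retrieval-then-generate flow boils down to a simple prompt-assembly step before the call to Ollama. Here is a stripped-down, stdlib-only sketch of that step; the chunk list is hard-coded for illustration, whereas in the app it comes from the Milvus search results:

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble the grounded prompt sent to the local Ollama model."""
    context = "\n---\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the user's question.\n"
        "If the answer isn't in the context, say you don't know.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

chunks = [
    "Article 5: The following AI practices shall be prohibited...",
    "Article 6: Classification rules for high-risk AI systems...",
]
prompt = build_rag_prompt("Which practices are prohibited?", chunks)
print(prompt.splitlines()[0])
# Use the following context to answer the user's question.
```

Keeping the "say you don't know" instruction in the prompt is what makes the chat degrade gracefully when retrieval pulls back the wrong chunk.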
Documents I used to enrich the RAG
Trying to stay as close as possible to the original intent of the implementation, I used the following sites/documents and created two text-format files to chunk and embed into the Milvus vector database.
- Eur-lex — Access to European Union law (Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence and amending Regulations (EC) No 300/2008, (EU) No 167/2013, (EU) No 168/2013, (EU) 2018/858, (EU) 2018/1139 and (EU) 2019/2144 and Directives 2014/90/EU, (EU) 2016/797 and (EU) 2020/1828 (Artificial Intelligence Act) (Text with EEA relevance)): https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689
- United Nations site (Universal Declaration of Human Rights): https://www.un.org/en/about-us/universal-declaration-of-human-rights
Conclusion
Exploring and implementing the chunkweaver library has been an eye-opening experience. The results demonstrate that moving beyond rigid, fixed-size splitting toward a structure-aware approach is not just a marginal improvement, but a fundamental shift in ensuring RAG reliability. This project has truly given me a new perspective on how we should handle document ingestion: prioritizing the ‘natural’ boundaries of information to preserve the author’s original context. Huge kudos to Oleksii Aleksapolskyi for his innovative work and for providing the community with such a promising tool; it has significantly changed how I approach building high-quality retrieval systems.
>>> Thanks for reading <<<
Links
- Medium Article “Right Document, Wrong Chunk”: https://medium.com/@TheWake/right-document-wrong-chunk-5b2b45f50130
- GitHub repository of “chunkweaver”: https://github.com/metawake/chunkweaver








