Split Long Documents Into Overlapping Chunks Without Losing Context

#hermeschallenge #ai #python #agents

Your agent needs to process a 50-page PDF. The document is 80,000 tokens. The model's context window is 200,000 tokens, so it fits on the first call. But every later call adds conversation history and tool results on top of those 80,000 tokens, and eventually the request no longer fits.

Or maybe you want to build a retrieval index. You need to split documents into chunks, embed them, and store them. The chunk size matters: too large and retrieval is imprecise, too small and you lose context that spans chunk boundaries.

llm-token-split handles document chunking with overlapping windows and token-aware boundaries.

The Shape of the Fix

from llm_token_split import TokenSplitter, Chunk

splitter = TokenSplitter(
    chunk_tokens=2000,
    overlap_tokens=200,
    model="claude-sonnet-4-6",
)

chunks: list[Chunk] = splitter.split(long_document)

for chunk in chunks:
    print(f"Chunk {chunk.index}: tokens={chunk.token_count}, "
          f"chars={chunk.char_start}-{chunk.char_end}")
    embed_and_store(chunk.text)

Each chunk overlaps the previous one by about 200 tokens (counts are approximate by default). A sentence that falls at a chunk boundary appears in both chunks, preserving context for retrieval.

What It Does NOT Do

llm-token-split does not use a real tokenizer by default. Token counts are approximated from a character-to-token ratio (roughly 4 characters per token for English text). For exact token counts, pass exact=True, which uses tiktoken if it is installed.
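To make the approximation concrete, here is a minimal sketch of a character-ratio estimate with an optional tiktoken path. The function names and the fallback behaviour are assumptions for illustration, not the library's internals; only the 4-characters-per-token ratio comes from the description above.

```python
import math

CHARS_PER_TOKEN = 4.0  # assumed ratio for English text


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Approximate a token count from character length alone."""
    if not text:
        return 0
    return max(1, math.ceil(len(text) / chars_per_token))


def count_tokens(text: str, exact: bool = False) -> int:
    """Use tiktoken when requested and importable, otherwise the estimate."""
    if exact:
        try:
            import tiktoken  # optional dependency
        except ImportError:
            pass  # hypothetical fallback: silently use the approximation
        else:
            encoding = tiktoken.get_encoding("cl100k_base")
            return len(encoding.encode(text))
    return estimate_tokens(text)
```

The estimate drifts for code, non-English text, and long runs of punctuation, so leave some headroom in chunk_tokens when the downstream model has a hard input limit.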

It does not split on sentence boundaries by default. It splits on whitespace. For boundary-aware chunking (splitting at paragraph or sentence boundaries), use the boundary="sentence" or boundary="paragraph" options.

It does not embed or store the chunks. It produces chunk text and metadata. Embedding and storage are separate concerns.

Inside the Library

The core is a sliding-window text splitter:

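def split(self, text: str) -> list[Chunk]:
    words = text.split()
    if not words:
        return []
    chunks = []

    # Average characters per word, counting the single space that follows it.
    avg_word_length = sum(len(w) for w in words) / len(words) + 1

    # Convert the token budgets into word counts; always step at least one word.
    step_tokens = self._chunk_tokens - self._overlap_tokens
    step_words = max(1, int(step_tokens * self._chars_per_token / avg_word_length))
    chunk_words = max(step_words, int(self._chunk_tokens * self._chars_per_token / avg_word_length))

    for i in range(0, len(words), step_words):
        chunk_text = " ".join(words[i:i + chunk_words])
        token_count = self._estimate_tokens(chunk_text)
        # Offsets index into the whitespace-normalized text, " ".join(words).
        char_start = len(" ".join(words[:i])) + (1 if i else 0)
        chunks.append(Chunk(
            index=len(chunks),
            text=chunk_text,
            token_count=token_count,
            char_start=char_start,
            char_end=char_start + len(chunk_text),
        ))
        if i + chunk_words >= len(words):
            break  # this window reached the end; later ones would only repeat it

    return chunks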

The word-count approximation is fast and good enough for most use cases. The exact=True path uses tiktoken (cl100k_base for OpenAI models, and a compatible implementation of the Claude tokenizer for Anthropic models).
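One detail worth noting: the listing computes char_start and char_end from the re-joined words, so the offsets refer to whitespace-normalized text rather than to the input string. If you need offsets that slice the original document, a variant along these lines may help. It is a standalone sketch that counts in words rather than tokens, and WordChunk and split_words are hypothetical names, not part of the library.

```python
import re
from dataclasses import dataclass


@dataclass
class WordChunk:
    index: int
    text: str
    char_start: int  # offset into the original text
    char_end: int    # exclusive end offset into the original text


def split_words(text: str, chunk_words: int, overlap_words: int) -> list[WordChunk]:
    if chunk_words <= 0 or not 0 <= overlap_words < chunk_words:
        raise ValueError("need chunk_words > 0 and 0 <= overlap_words < chunk_words")

    # Each match records where a word sits in the original string.
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    step = chunk_words - overlap_words
    chunks: list[WordChunk] = []

    for i in range(0, len(spans), step):
        window = spans[i:i + chunk_words]
        start, end = window[0][0], window[-1][1]
        # Slicing the original keeps its internal whitespace intact.
        chunks.append(WordChunk(len(chunks), text[start:end], start, end))
        if i + chunk_words >= len(spans):
            break  # the final window already reaches the end of the text

    return chunks
```

With this variant, text[chunk.char_start:chunk.char_end] always equals chunk.text, which makes it straightforward to highlight a retrieved chunk in the source document.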

Boundary modes:

boundary="word" (default): split on whitespace
boundary="sentence": split at ., !, ? followed by whitespace
boundary="paragraph": split on double newlines

Boundary splitting makes chunks more natural but may produce variable chunk sizes. The splitter caps overlap at the actual boundary, so the overlap_tokens parameter is a maximum, not a guarantee.
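The boundary modes and the "overlap is a maximum" behaviour can be sketched as follows. This is an illustration under assumptions, not the library's code: split_units and pack_units are hypothetical helpers, token costs use the 4-characters-per-token approximation, and units are re-joined with a single space.

```python
import re


def split_units(text: str, boundary: str = "word") -> list[str]:
    """Cut text into words, sentences, or paragraphs."""
    if boundary == "word":
        return text.split()
    if boundary == "sentence":
        # Break after ., ! or ? when whitespace follows.
        parts = re.split(r"(?<=[.!?])\s+", text)
    elif boundary == "paragraph":
        # Break on blank lines (double newlines, possibly with stray spaces).
        parts = re.split(r"\n\s*\n", text)
    else:
        raise ValueError(f"unknown boundary: {boundary!r}")
    return [p.strip() for p in parts if p.strip()]


def pack_units(units: list[str], chunk_tokens: int, overlap_tokens: int,
               chars_per_token: float = 4.0) -> list[str]:
    """Greedily pack whole units into chunks, carrying over trailing units
    only while they fit inside the overlap budget."""
    def cost(unit: str) -> int:
        return max(1, round(len(unit) / chars_per_token))

    chunks: list[str] = []
    current: list[str] = []
    current_cost = 0
    for unit in units:
        unit_cost = cost(unit)
        if current and current_cost + unit_cost > chunk_tokens:
            chunks.append(" ".join(current))
            carried: list[str] = []
            carried_cost = 0
            for prev in reversed(current):
                if carried_cost + cost(prev) > overlap_tokens:
                    break  # the next whole unit would exceed the overlap cap
                carried.insert(0, prev)
                carried_cost += cost(prev)
            current, current_cost = carried, carried_cost
        current.append(unit)
        current_cost += unit_cost
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because only whole units are carried over, an overlap budget smaller than the last sentence or paragraph yields no overlap at all for that boundary. A single unit larger than chunk_tokens still ends up in an oversized chunk in this sketch.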

When to Use It

Use it for any pipeline that processes documents longer than a few pages. Typical cases are RAG indexing (split, embed, store, retrieve), batch document analysis (split, process each chunk, aggregate), and summarization of long documents (split, summarize each chunk, combine).
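The summarization pattern (split, summarize each chunk, combine) is short enough to sketch in full. The split and summarize callables are placeholders you would supply, for example splitter.split plus an LLM call; nothing here is part of llm-token-split.

```python
from typing import Callable


def summarize_long(
    text: str,
    split: Callable[[str], list[str]],
    summarize: Callable[[str], str],
) -> str:
    """Map-reduce summary: summarize each chunk, then summarize the summaries."""
    chunks = split(text)
    if not chunks:
        return ""
    if len(chunks) == 1:
        return summarize(chunks[0])
    partials = [summarize(chunk) for chunk in chunks]
    # One final pass turns the partial summaries into a single result.
    return summarize("\n".join(partials))
```

For very long documents the joined partial summaries can themselves exceed the context window, in which case the same function can be applied to them again.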

The overlap is important for retrieval. A question about content near a chunk boundary should be answerable from retrieval, not fail because the relevant text is split across two non-overlapping chunks. An overlap of 10-15% (200 to 300 tokens on 2000-token chunks) is a good starting point.
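Overlap is not free: every shared token is embedded and stored twice. A small calculation shows the cost for the numbers used in this post. chunk_plan is a hypothetical helper, and it assumes every chunk except the last is exactly chunk_tokens long.

```python
import math


def chunk_plan(total_tokens: int, chunk_tokens: int, overlap_tokens: int) -> dict:
    """Estimate chunk count and duplicated tokens for a sliding window."""
    if chunk_tokens <= 0 or not 0 <= overlap_tokens < chunk_tokens:
        raise ValueError("need chunk_tokens > 0 and 0 <= overlap_tokens < chunk_tokens")
    step = chunk_tokens - overlap_tokens  # new tokens contributed by each chunk
    if total_tokens <= 0:
        chunks = 0
    else:
        chunks = max(1, math.ceil((total_tokens - overlap_tokens) / step))
    return {
        "chunks": chunks,
        "overlap_ratio": overlap_tokens / chunk_tokens,
        "duplicated_tokens": max(0, chunks - 1) * overlap_tokens,
    }


plan = chunk_plan(total_tokens=80_000, chunk_tokens=2000, overlap_tokens=200)
# plan == {"chunks": 45, "overlap_ratio": 0.1, "duplicated_tokens": 8800}
```

For the 80,000-token document, a 10% overlap means 45 chunks and 8,800 tokens embedded twice, roughly 11% more embedding work than non-overlapping chunks.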

Skip it for documents that fit in the model's context window with room to spare. If your document is 5,000 tokens and your context is 200,000, chunking adds complexity without benefit.

Install

pip install git+https://github.com/MukundaKatta/llm-token-split

# With exact tokenization
pip install "git+https://github.com/MukundaKatta/llm-token-split[tiktoken]"

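A RAG indexing example (embed and vector_store are placeholders for your own embedding function and vector database client):

from llm_token_split import TokenSplitter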

splitter = TokenSplitter(
    chunk_tokens=4000,
    overlap_tokens=400,
    boundary="paragraph",
    model="claude-sonnet-4-6",
)

def build_rag_index(document: str, doc_id: str) -> None:
    chunks = splitter.split(document)

    for chunk in chunks:
        embedding = embed(chunk.text)
        vector_store.upsert(
            id=f"{doc_id}-{chunk.index}",
            vector=embedding,
            metadata={
                "doc_id": doc_id,
                "chunk_index": chunk.index,
                "char_start": chunk.char_start,
                "char_end": chunk.char_end,
                "token_count": chunk.token_count,
            },
            text=chunk.text,
        )

Sibling Libraries

Library	What it solves
`prompt-token-counter`	Approximate token counts for prompts
`agent-message-window`	Manage conversation history for context window
`agentfit`	Fit content into a token budget
`llm-context-rotate`	Stateful rolling chat history
`llm-cost-cap`	Pre-flight cost gate before LLM calls on each chunk

In a RAG pipeline these compose as follows: llm-token-split for chunking, prompt-token-counter for budget checking, and llm-cost-cap to keep batch processing within budget.

What's Next

Semantic chunking is the main gap. Instead of splitting on word or paragraph boundaries, split where meaning changes, using embedding similarity between adjacent sentences. This produces more coherent chunks for retrieval but requires an embedding model call during indexing.
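As a rough sketch of the idea, the code below starts a new chunk wherever the similarity between adjacent sentences drops below a threshold. Since the feature does not exist yet, everything here is an assumption: toy_embed is a bag-of-words stand-in for a real embedding model, and the 0.2 threshold is arbitrary.

```python
import math
import re
from collections import Counter
from typing import Callable


def toy_embed(sentence: str) -> Counter:
    """Stand-in embedding: lowercase word counts. Swap in a real model."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: list[str],
    embed: Callable[[str], Counter] = toy_embed,
    threshold: float = 0.2,
) -> list[str]:
    """Group consecutive sentences, breaking where similarity drops."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            groups.append([sentence])  # topic shift: start a new chunk
        else:
            groups[-1].append(sentence)
    return [" ".join(group) for group in groups]
```

With a real embedding model the vectors would be dense float lists and cosine would need to change accordingly. A chunk_tokens cap would still be needed on top, since a long run of similar sentences can grow without bound.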

Recursive splitting: if a paragraph is larger than chunk_tokens, split it further on sentences. If a sentence is larger than chunk_tokens, split on words. This handles the edge case of very long paragraphs without producing oversized chunks.
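A possible shape for recursive splitting, assuming the character-ratio token estimate and regex boundaries (both illustrative choices, as are the names SPLITTERS and recursive_split):

```python
import math
import re

# Coarse-to-fine splitters: paragraphs, then sentences, then words.
SPLITTERS = [
    lambda t: re.split(r"\n\s*\n", t),
    lambda t: re.split(r"(?<=[.!?])\s+", t),
    lambda t: t.split(),
]


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    return math.ceil(len(text) / chars_per_token) if text else 0


def recursive_split(text: str, chunk_tokens: int, level: int = 0) -> list[str]:
    """Return pieces no larger than chunk_tokens, splitting finer only when needed."""
    text = text.strip()
    if not text:
        return []
    # Small enough, or nothing finer than words left to try.
    if estimate_tokens(text) <= chunk_tokens or level >= len(SPLITTERS):
        return [text]
    pieces: list[str] = []
    for part in SPLITTERS[level](text):
        pieces.extend(recursive_split(part, chunk_tokens, level + 1))
    return pieces
```

A single word longer than chunk_tokens is returned as-is in this sketch. The output can also contain many small pieces, because a paragraph that does not fit is broken into individual sentences rather than repacked.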

A merge pass: after splitting, merge adjacent small chunks so that none is under a minimum size. This avoids index entries that are too small to carry meaningful context.
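One way a merge pass might look, assuming non-overlapping input pieces and the same character-ratio estimate. merge_small and its parameters are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    return math.ceil(len(text) / chars_per_token) if text else 0


def merge_small(chunks: list[str], min_tokens: int, max_tokens: int) -> list[str]:
    """Fold undersized chunks into a neighbour without exceeding max_tokens."""
    merged: list[str] = []
    for chunk in chunks:
        if (merged
                and estimate_tokens(merged[-1]) < min_tokens
                and estimate_tokens(merged[-1] + " " + chunk) <= max_tokens):
            # The previous chunk is still too small: grow it forward.
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)

    # A small final chunk has no successor, so fold it backward if it fits.
    if (len(merged) >= 2
            and estimate_tokens(merged[-1]) < min_tokens
            and estimate_tokens(merged[-2] + " " + merged[-1]) <= max_tokens):
        tail = merged.pop()
        merged[-1] = merged[-1] + " " + tail
    return merged
```

A small chunk wedged between two neighbours that are already near max_tokens stays small here. Whether to leave it or drop the size cap in that case is a design choice the sketch does not settle.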

Built as part of the agent-stack family: composable Python primitives for production LLM agents.