Our chunking strategy keeps chapters coherent, respects context windows, and handles multilingual books.
The problem: books don’t fit in a prompt
At LectuLibre, we translate entire books — novels, technical manuals, poetry — using large language models. It sounds simple: feed each paragraph to an LLM, concatenate results, done. But the moment we tried a 300‑page EPUB, chaos ensued. Chapters bled into each other, sentences were chopped mid‑word, and the translation of chapter 5 had no idea what happened in chapter 4.
LLMs have limited context windows. Even the 200K‑token window of Claude 3 is a tight fit for a 150K‑word book, which runs to roughly 200K tokens of English before you count the prompt and the translated output. And even if it did fit, the cost and latency would be absurd. We needed a way to split the book into manageable chunks while preserving enough context to keep the translation coherent across hundreds of pages.
Here’s how we designed a chunking pipeline that respects your wallet, the context window, and the book’s narrative flow.
Step 1: extract structure, not just text
Naively splitting by character count is a recipe for disaster. Instead, we first parse the document to understand its logical units: chapters, sections, headings. For EPUB, we use ebooklib; for PDF, pdfplumber. Both give us a stream of items (paragraphs, headings) that we then organize into a tree of chapters and sub‑sections.
import ebooklib
from ebooklib import epub

def get_chapters(epub_path):
    book = epub.read_epub(epub_path)
    chapters = []
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        # Simplified: each document is a chapter
        content = item.get_content().decode('utf-8')
        chapters.append(content)
    return chapters
In practice, we use BeautifulSoup to extract <body> text and identify heading tags (<h1>–<h6>) to build a table of contents. This way, even if a chapter is 20,000 tokens, we keep it together as a single unit until later splitting.
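To make the heading-based grouping concrete, here is a minimal sketch of splitting one chapter document into sections at its heading tags. It is not our production parser: it swaps BeautifulSoup for the standard library's `html.parser` so it runs without dependencies, and the names `SectionParser` and `split_into_sections` are made up for this example.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionParser(HTMLParser):
    """Collects sections (heading level, title, body text) from one document."""

    def __init__(self):
        super().__init__()
        self.sections = []
        self._in_body = False
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag in HEADING_TAGS and self._in_body:
            # A heading opens a new section; the level comes from the tag name.
            self.sections.append({"level": int(tag[1]), "title": [], "text": []})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False
        elif tag == "body":
            self._in_body = False

    def handle_data(self, data):
        data = data.strip()
        if not data or not self._in_body:
            return  # skip whitespace and anything outside <body>, e.g. <title>
        if self._in_heading:
            self.sections[-1]["title"].append(data)
            return
        if not self.sections:
            # Text before the first heading goes into an untitled section.
            self.sections.append({"level": 0, "title": [], "text": []})
        self.sections[-1]["text"].append(data)


def split_into_sections(html_text):
    parser = SectionParser()
    parser.feed(html_text)
    parser.close()
    return [
        {
            "level": s["level"],
            "title": " ".join(s["title"]),
            "text": " ".join(s["text"]),
        }
        for s in parser.sections
    ]
```

Each returned section keeps its heading level, so nesting `<h2>` sections under the preceding `<h1>` is a matter of walking the list in order.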
Step 2: sentence‑aware splitting with token budgets
A chapter still needs to be broken down to fit the model’s context window. But we never split mid‑sentence. We use spaCy to tokenize the text into sentences, then greedily group them until we hit a token limit.
Why not simple character‑based splitting? Because sentences carry semantic boundaries. Breaking inside a sentence occasionally produces artefacts like “He walked to the sta‑” / “‑tion.” LLMs are forgiving but not that forgiving.
import spacy
from transformers import AutoTokenizer  # for accurate token counts

nlp = spacy.load("en_core_web_sm")
tokenizer = AutoTokenizer.from_pretrained("claude-tokenizer")  # custom tokenizer for Claude

def sentence_split(text):
    doc = nlp(text)
    return [sent.text for sent in doc.sents]

def chunk_sentences(sentences, max_tokens=1800, overlap_sentences=5):
    chunks = []
    current_chunk = []
    current_token_count = 0
    for sent in sentences:
        sent_tokens = len(tokenizer.encode(sent))
        # Close the chunk once the next sentence would exceed the budget.
        # The `current_chunk` guard avoids emitting an empty chunk when a
        # single sentence is longer than max_tokens.
        if current_chunk and current_token_count + sent_tokens > max_tokens:
            chunks.append(current_chunk)
            # Sliding overlap: carry over the last `overlap_sentences`
            # sentences of the chunk just concluded.
            current_chunk = current_chunk[-overlap_sentences:] if overlap_sentences > 0 else []
            current_token_count = sum(len(tokenizer.encode(s)) for s in current_chunk)
        current_chunk.append(sent)
        current_token_count += sent_tokens
    if current_chunk:
        chunks.append(current_chunk)
    return chunks
We set max_tokens to 1800, leaving room for the system prompt, context from previous chunks, and the model’s response. That is far below the 200K‑token context window of Claude 3 Haiku, and deliberately so: the translated output also has to fit within the model’s per‑response output limit, and smaller chunks mean faster, cheaper API calls. For other models we’d tune the number, but we would still keep chunks small.
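The budget reasoning can be written down as a small calculation. This is a rough sketch, not our actual configuration: the function name `max_source_tokens` is made up, and the overhead and expansion-ratio values in the usage line are illustrative assumptions rather than measurements.

```python
def max_source_tokens(context_window, max_output_tokens, overhead_tokens,
                      expansion_ratio):
    """Upper bound on source tokens per chunk under two constraints.

    overhead_tokens: system prompt + running summary + sliding context
                     (assumed value, measure your own prompts).
    expansion_ratio: expected target tokens per source token (assumed).
    """
    # Constraint 1: prompt plus reserved output must fit in the window.
    fits_window = context_window - max_output_tokens - overhead_tokens
    # Constraint 2: the translation of the chunk must fit in the output cap.
    fits_output = int(max_output_tokens / expansion_ratio)
    return max(0, min(fits_window, fits_output))


# Illustrative numbers only: a large window, a 4096-token output cap,
# 600 tokens of prompt overhead, and a translation 1.3x the source length.
budget = max_source_tokens(200_000, 4096, 600, 1.3)
```

With a large context window the output cap is usually the binding constraint, which is one reason a chunk size well below the window makes sense.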
Step 3: passing context across chunks
The real magic is what we do between chunks. A standalone translation of chunk #5 has no clue that the protagonist just entered a dark cave in chunk #4. Two techniques solved this:
- Sliding window of previous sentences — we include the last 5–10 sentences from the preceding chunk directly in the prompt as “context left.”
- A running summary — after translating a chunk, we ask the LLM to generate a one‑sentence summary of that chunk. This summary is accumulated and fed into every subsequent prompt, so the model remembers high‑level events.
def build_prompt(chunk, previous_context_sentences, summary_so_far):
    context_left = " ".join(previous_context_sentences)
    prompt = f"""You are translating a book. Here is a summary of the story so far:
{summary_so_far}
And the previous text (for immediate context):
"{context_left}"
Now translate the following text to Spanish, preserving tone and style:
{chunk}"""
    return prompt
The summary is generated using a separate, cheap call (we use DeepSeek for summaries, even if the main translation uses Claude). This keeps the context token usage minimal while still giving long‑range coherence.
Why not just include the entire previous chunk? That roughly doubles the input tokens per call, and on a 200K‑word book the extra spend adds up quickly, especially with larger models. In our runs, summaries cut that context overhead by ~80% with negligible quality loss.
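A back-of-the-envelope comparison shows where the saving comes from. Every number below is an assumption chosen for illustration, not a measurement from our pipeline, and `summary_cap` in particular is a hypothetical limit: without some cap, a running summary grows with every chunk and eats into the saving.

```python
def extra_context_tokens(n_chunks, chunk_tokens=1800, window_tokens=150,
                         summary_tokens=30, summary_cap=300):
    """Extra input tokens spent on context, summed over all calls.

    Returns (full_previous, summary_based):
      full_previous: every call also carries the whole previous chunk.
      summary_based: every call carries a sliding window of sentences
                     plus the running summary, capped at summary_cap.
    """
    full_previous = 0
    summary_based = 0
    for i in range(1, n_chunks):  # chunk 0 has nothing before it
        full_previous += chunk_tokens
        running_summary = min(i * summary_tokens, summary_cap)
        summary_based += window_tokens + running_summary
    return full_previous, summary_based


full, lean = extra_context_tokens(n_chunks=100)
saving = 1 - lean / full  # fraction of context overhead avoided
```

With these made-up defaults the summary-based approach avoids roughly three quarters of the context overhead; the real figure depends on how long your summaries and sliding windows are.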
The translation loop then looks like this:
overall_summary = ""
previous_context = []
full_translation = []
for chapter_chunks in all_chunks_by_chapter:
    chapter_summary = ""
    for chunk in chapter_chunks:
        chunk_text = " ".join(chunk)
        # Story so far: earlier chapters first, then this chapter's chunks.
        # (Empty only for the very first chunk of the book.)
        summary_so_far = (overall_summary + "\n" + chapter_summary).strip()
        prompt = build_prompt(chunk_text, previous_context, summary_so_far)
        translated = call_llm(prompt)
        full_translation.append(translated)
        # Update context: keep last 5 sentences of the translated chunk as next context
        trans_sents = sentence_split(translated)
        previous_context = trans_sents[-5:]
        # Summarize the source chunk (shown as a blocking call for simplicity)
        chunk_summary = call_llm(f"Summarize this passage in one sentence: {chunk_text}")
        chapter_summary += chunk_summary + " "
    overall_summary += chapter_summary
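The loop is shown sequentially because each chunk depends on the context and summary produced by the one before it. In production we use asyncio and httpx to overlap the calls that don’t depend on each other, which keeps translation times reasonable. For example, the summary call only needs the source text, so it can run alongside that chunk’s translation.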
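One place where concurrency fits without breaking the chunk-to-chunk dependency is running a chunk's translation and its summary at the same time. The sketch below shows that pattern with `asyncio.gather`. It is an illustration under assumptions: `call_llm` here is a stub that stands in for a real HTTP request (httpx is not used), and `translate_chunk` and `translate_chapter` are hypothetical names.

```python
import asyncio


async def call_llm(prompt):
    """Stub for an async API call; a real client would await an HTTP request."""
    await asyncio.sleep(0)  # yield control, as network I/O would
    return f"echo({prompt[-20:]})"


async def translate_chunk(chunk_text, context_left, summary_so_far):
    translate_prompt = (
        f"Summary so far:\n{summary_so_far}\n"
        f"Previous text:\n{context_left}\n"
        f"Translate:\n{chunk_text}"
    )
    summary_prompt = f"Summarize this passage in one sentence: {chunk_text}"
    # Both calls depend only on data we already have, so they can overlap.
    translated, chunk_summary = await asyncio.gather(
        call_llm(translate_prompt),
        call_llm(summary_prompt),
    )
    return translated, chunk_summary


async def translate_chapter(chunk_texts):
    summary_so_far = ""
    context_left = ""
    translations = []
    # Chunks stay sequential: each one needs the previous chunk's results.
    for chunk_text in chunk_texts:
        translated, chunk_summary = await translate_chunk(
            chunk_text, context_left, summary_so_far
        )
        translations.append(translated)
        summary_so_far += chunk_summary + " "
        context_left = translated
    return translations, summary_so_far.strip()


translations, summary = asyncio.run(translate_chapter(["First chunk.", "Second chunk."]))
```

With real network calls, this roughly halves the wall-clock time per chunk whenever the two requests take similar time.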
Real‑world results and trade‑offs
Translating a 120K‑word Spanish novel (“El Quijote”) into English took about 4 minutes end‑to‑end with Claude 3 Haiku. Total API cost: $0.67. The translation was surprisingly fluid — chapters felt connected, and the occasional flashback or pronoun reference (“she” referring to a character introduced three pages earlier) was correctly resolved. Without the context pipeline, the same book would have been riddled with inconsistencies.
We experimented with other models: DeepSeek‑V3 gave similar quality at half the price but with higher latency, making it better for batch jobs where speed isn’t critical. GPT‑4 Turbo reproduced stylistic flourishes more naturally, but in our setup we ended up using even smaller chunks with it, which sometimes fragmented dialogue. Claude struck the best balance.
But it’s not perfect. Humor and idioms still occasionally fall flat because the summary can’t encapsulate a running joke. Code blocks and tables inside technical books need special handling — we’re working on a parser that detects them and wraps them in [CODE] markers so the LLM doesn’t try to translate variable names. And poetry, with its line breaks and meter, remains a challenge; we’re considering a dedicated poetry‑aware chunker.
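As a rough idea of what the code-marker step could look like for EPUB input, here is a minimal sketch that wraps the contents of `<pre>` blocks in markers. It is not the parser we are building: the closing `[/CODE]` marker, the regex-based approach, and the function name `wrap_code_blocks` are all assumptions made for this example, and a regex will not cope with nested `<pre>` elements.

```python
import html
import re

PRE_BLOCK = re.compile(r"<pre\b[^>]*>(.*?)</pre>", re.DOTALL | re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def wrap_code_blocks(xhtml):
    """Replace each <pre>...</pre> block with [CODE]...[/CODE] plain text."""

    def _wrap(match):
        # Drop nested markup such as <code> or <span>, then decode entities.
        code = ANY_TAG.sub("", match.group(1))
        code = html.unescape(code).strip("\n")
        return "[CODE]\n" + code + "\n[/CODE]"

    return PRE_BLOCK.sub(_wrap, xhtml)
```

The translation prompt would then need an instruction to copy anything between the markers verbatim.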
The key takeaway
If you’re building long‑document translation using LLMs, invest in a pipeline that:
- Respects document structure (chapters, paragraphs) before splitting.
- Splits on sentences, and always leaves room for context.
- Provides both immediate context (last few sentences) and global context (summaries) to each chunk.
- Uses separate, cheap models for auxiliary tasks like summarization to keep costs down.
Our code is not open‑source yet, but we plan to release the core chunking library once we’ve battle‑tested it on more formats.
How do you handle context in LLM translations? We’re especially curious about handling highly technical books with equations, footnotes, and cross‑references. Drop your ideas in the comments — let’s figure this out together.