Breaking long documents into overlapping chunks, preserving context, and reassembling with FastAPI
At LectuLibre, we’ve built an AI-powered platform that translates entire books (EPUBs and PDFs) using large language models. When we first hooked up Claude’s API, we naively fed it a 300-page PDF in one request. It failed immediately. Claude 3 Opus has a 200K-token context window, but a dense 300-page book can run to 300K tokens or more. Even when the input does fit, the output gets truncated and quality degrades at the extremes of the context window.
So we faced a classic long‑document problem: how do you translate a book that’s larger than the model’s context window? Here’s the real approach we ended up with, the code we wrote, and the lessons we learned.
The Problem: Token Limits Are Real
Claude 3 Opus and Haiku (like most LLMs) have a maximum context length: 200,000 tokens for Opus. A token is roughly ¾ of a word, so a 300-page novel of ~75,000 words comes to about 100K tokens. It should fit, right? Not once you count the output: the translation has to share the window with the source, and English-to-Spanish translations can expand by 15–20%, so roughly 100K tokens in plus 115–120K out already overshoots 200K. The system message and prompt instructions eat into the same budget. Plus, giving the model full context would mean sending the entire source text in every call. That’s not feasible.
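The budget arithmetic can be made concrete. This is a rough sketch using the figures quoted in this section (¾ of a word per token, 15–20% expansion, a 200K window), and it assumes the source and its translation must both fit in the window. The function name estimate_token_budget and the 1,000-token allowance for instructions are illustrative assumptions, not part of our pipeline.

```python
def estimate_token_budget(
    word_count: int,
    words_per_token: float = 0.75,   # rule of thumb quoted in the text
    expansion: float = 0.20,         # upper end of the 15-20% EN->ES growth
    prompt_overhead: int = 1_000,    # assumed allowance for instructions
    context_window: int = 200_000,   # Claude 3 Opus
) -> dict:
    """Estimate whether source + translation fit in one context window."""
    input_tokens = round(word_count / words_per_token)
    output_tokens = round(input_tokens * (1 + expansion))
    total = input_tokens + output_tokens + prompt_overhead
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": total,
        "fits": total <= context_window,
    }


budget = estimate_token_budget(75_000)
print(budget)
```

With these assumptions a 75,000-word novel does not fit in a single request, which is what pushed us toward chunking.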
We could have tried a simple split: cut the book at arbitrary page boundaries and translate piecemeal. That fails spectacularly. Narrative breaks mid‑sentence, and phrases like “the previous chapter” lose their referents. We needed a more intelligent chunking strategy.
Our Approach: Sliding Window with Overlapping Paragraphs
We settled on a sliding window chunking algorithm based on paragraphs, with a generous overlap. Here’s the idea:
- Split the source text into paragraphs (using \n\n).
- Build chunks of max_chunk_tokens (we used 180,000 to keep a safety margin), adding paragraphs one by one and counting tokens with tiktoken.
- When the chunk exceeds the limit, we start a new chunk but we include the last few paragraphs of the previous chunk as context. This overlap (we used 5 paragraphs) gives the model continuity across chunk boundaries.
- We translate each chunk independently, then stitch them back together, removing the overlap.
This isn’t perfect—some chapters may still be split—but it preserves far more context than any fixed‑size split.
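To see the windowing on a small input, here is a toy sketch that counts whitespace-separated words instead of real tokens, so it runs without any dependencies. The name toy_chunks and the tiny limits are illustrative only; the real pipeline counts tokens with tiktoken and handles oversized paragraphs.

```python
from typing import List


def toy_chunks(paragraphs: List[str], max_words: int, overlap: int) -> List[List[str]]:
    """Greedy paragraph packing with a fixed paragraph overlap (toy version)."""
    chunks: List[List[str]] = []
    current: List[str] = []
    for para in paragraphs:
        size = sum(len(p.split()) for p in current) + len(para.split())
        if current and size > max_words:
            chunks.append(current)
            # Start the next chunk with the tail of the previous one
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)
    if current:
        chunks.append(current)
    return chunks


paras = ["a a", "b b", "c c", "d d", "e e"]
result = toy_chunks(paras, max_words=6, overlap=1)
for chunk in result:
    print(chunk)
```

The paragraph "c c" closes the first chunk and opens the second, which is the overlap the merge step later has to remove.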
Implementation in Python with FastAPI
We built our translation pipeline inside a FastAPI background task. Here’s the core chunking function:
import tiktoken
from typing import List
from langchain_text_splitters import RecursiveCharacterTextSplitter


def chunk_by_paragraphs(text: str, max_tokens: int = 180000, overlap_paragraphs: int = 5) -> List[str]:
    """
    Split text into chunks of at most `max_tokens` tokens,
    using paragraphs as atomic units and overlapping the last
    `overlap_paragraphs` from the previous chunk.
    """
    # cl100k_base is an OpenAI encoding; it only approximates Claude's tokenizer
    enc = tiktoken.get_encoding("cl100k_base")

    def count_tokens(s: str) -> int:
        return len(enc.encode(s))

    # Fallback for a single paragraph that exceeds the limit (rare): split it
    # on line breaks, then spaces. chunk_overlap=0 so no text is duplicated.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=max_tokens, chunk_overlap=0, length_function=count_tokens
    )

    chunks = []
    current_chunk = []
    current_token_count = 0
    for para in text.split('\n\n'):
        pieces = [para] if count_tokens(para) <= max_tokens else splitter.split_text(para)
        for piece in pieces:
            piece_tokens = count_tokens(piece)
            if current_token_count + piece_tokens > max_tokens and current_chunk:
                chunks.append('\n\n'.join(current_chunk))
                # Keep overlapping paragraphs (guarding the [-0:] case), dropping
                # the oldest until the new piece fits; a trimmed overlap is then
                # shorter than `overlap_paragraphs`
                overlap = current_chunk[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
                while overlap and sum(map(count_tokens, overlap)) + piece_tokens > max_tokens:
                    overlap = overlap[1:]
                current_chunk = list(overlap)
                current_token_count = sum(map(count_tokens, overlap))
            current_chunk.append(piece)
            current_token_count += piece_tokens
    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))
    return chunks
Then we translate each chunk using Anthropic’s Python SDK, with back‑pressure and retry logic to handle rate limits:
from anthropic import Anthropic, RateLimitError
import asyncio
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


async def translate_chunk(client: Anthropic, chunk: str, target_lang: str) -> str:
    system_prompt = (
        "You are a professional translator. Translate the following text "
        f"from English to {target_lang}. Preserve all formatting, line breaks, "
        "and special characters. Do not add commentary."
    )

    # Retry only on rate limits, with exponential backoff (4s up to 60s)
    @retry(
        retry=retry_if_exception_type(RateLimitError),
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=60),
    )
    async def _call():
        # The client is synchronous, so run the call in a worker thread
        response = await asyncio.to_thread(
            client.messages.create,
            model="claude-3-opus-20240229",
            max_tokens=4096,  # caps the length of the response, not the input
            system=system_prompt,
            messages=[{"role": "user", "content": chunk}],
        )
        # A translation longer than max_tokens is cut off; fail loudly
        # rather than merge a truncated chunk
        if response.stop_reason == "max_tokens":
            raise RuntimeError("translation truncated at max_tokens")
        return response.content[0].text

    return await _call()
We use asyncio.to_thread because the Anthropic client we pass in is synchronous, and in a FastAPI app we don’t want to block the event loop. (The SDK also ships an AsyncAnthropic client, which would avoid the thread hop.) The tenacity library gives us exponential backoff for rate limits. After translating all chunks in parallel with asyncio.gather, we merge them:
def merge_chunks(translated_chunks: List[str], overlap_paragraphs: int = 5) -> str:
    """
    Concatenate translated chunks, removing the overlapping paragraphs
    except from the first chunk.
    """
    if not translated_chunks:
        return ""
    result = translated_chunks[0]
    for i in range(1, len(translated_chunks)):
        # Each subsequent chunk starts with `overlap_paragraphs` overlap paragraphs; skip them
        chunk_paragraphs = translated_chunks[i].split('\n\n')
        # We assume the translation preserved paragraph boundaries
        main_text = chunk_paragraphs[overlap_paragraphs:] if len(chunk_paragraphs) > overlap_paragraphs else chunk_paragraphs
        result += '\n\n' + '\n\n'.join(main_text)
    return result
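The merge relies on the translation keeping the same paragraph breaks as the source. One way to guard that assumption is to compare paragraph counts before stripping the overlap. The helper below is a hypothetical sketch (paragraphs_align is not part of our pipeline), and what to do on a mismatch, such as retrying the chunk, is left open.

```python
def paragraphs_align(source_chunk: str, translated_chunk: str) -> bool:
    """True if both texts contain the same number of non-empty paragraphs."""
    def count(text: str) -> int:
        return sum(1 for p in text.split("\n\n") if p.strip())
    return count(source_chunk) == count(translated_chunk)


source = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
good = "Primer párrafo.\n\nSegundo párrafo.\n\nTercer párrafo."
bad = "Primer párrafo. Segundo párrafo.\n\nTercer párrafo."

print(paragraphs_align(source, good))
print(paragraphs_align(source, bad))
```

A count match does not prove the paragraphs correspond one to one, but a mismatch is a cheap signal that stripping a fixed number of paragraphs would cut in the wrong place.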
Parallel Translation and Performance
We run the chunk translations concurrently. For a 300-page book, we typically get 5–8 chunks of ~180K tokens each. With Claude 3 Opus, each chunk takes about 15–30 seconds to translate. We cap concurrency at 4 simultaneous calls to avoid hitting Anthropic’s rate caps. Overall, a full-book translation completes in 2–5 minutes.
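The concurrency cap can be sketched with asyncio.gather and a semaphore. The names translate_all and fake_translate are hypothetical; fake_translate stands in for the real API call so the example runs offline, and in practice the translate argument would wrap something like translate_chunk.

```python
import asyncio
from typing import Awaitable, Callable, List


async def translate_all(
    chunks: List[str],
    translate: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> List[str]:
    """Translate every chunk, with at most `max_concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(chunk: str) -> str:
        async with semaphore:
            return await translate(chunk)

    # gather returns results in input order, which the merge step relies on
    return await asyncio.gather(*(bounded(c) for c in chunks))


async def fake_translate(chunk: str) -> str:
    # Stand-in for the real API call so the sketch runs offline
    await asyncio.sleep(0.01)
    return chunk.upper()


results = asyncio.run(translate_all(["uno", "dos", "tres"], fake_translate))
print(results)
```

All tasks are created up front, but the semaphore only lets four of them reach the API at a time; the rest wait their turn without blocking the event loop.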
Cost: Claude 3 Opus is expensive. At $15 per million input tokens, a 300‑page book (~100K input tokens per chunk, ~8 chunks) costs around $12–15. We mitigated this by offering Claude 3 Haiku (cheaper, faster, but lower quality) and DeepSeek as alternatives. Users can choose.
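The cost figure follows from simple arithmetic on token counts. Below is a minimal sketch of such an estimate; estimate_cost_usd is a hypothetical helper, the input price is the one quoted above, and the output price of $75 per million tokens is an assumption that should be checked against current pricing.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float = 15.0,   # Claude 3 Opus input price quoted in the text
    output_price_per_mtok: float = 75.0,  # assumed output price; check current pricing
) -> float:
    """Estimated cost in USD for one translation job."""
    cost = (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000
    return round(cost, 2)


# 8 chunks of ~100K input tokens each, ignoring output for the moment
input_only = estimate_cost_usd(input_tokens=8 * 100_000, output_tokens=0)
print(input_only)
```

Input alone accounts for the low end of the range; output tokens are billed on top of that.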
Quality trade‑offs: The overlap strategy works well for most texts, but sometimes a chapter ends exactly at a chunk boundary and the narrative flow feels a bit disjointed. We experimented with dynamic overlap based on chapter markers (e.g., force a split only at chapter headings), but that added complexity and didn’t always align with token limits. We’re sticking with paragraph‑level overlap for now.
Lessons Learned
- Token counting is tricky. tiktoken’s cl100k_base is close to Claude’s tokenizer but not identical. We saw a 5% discrepancy in token counts, so we kept a safety margin of 20K tokens below the limit.
- Overlap size matters. Too little overlap and you lose context; too much wastes tokens and money. Five paragraphs proved a sweet spot for most books.
- Rate limits forced us to build robust retries. Anthropic’s API will 429 you aggressively if you fire too many concurrent requests. tenacity and a concurrency semaphore saved us.
- The merge step must handle formatting. Splitting and rejoining on \n\n works for prose, but tables, lists, and code blocks get mangled. We’re now exploring a markdown-aware splitter.
- Cost transparency is crucial. Users understand that translating a 300-page book isn’t free. We show an upfront cost estimate based on token counts.
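As an illustration of what a markdown-aware splitter might look like, here is a minimal sketch that splits on blank lines but never inside a fenced code block. It is not the splitter we are building: split_blocks is a hypothetical name, and the sketch only handles code fences, not tables or nested lists.

```python
from typing import List

FENCE = "`" * 3  # a Markdown code fence marker


def split_blocks(text: str) -> List[str]:
    """Split on blank lines, but never inside a fenced code block."""
    blocks: List[str] = []
    current: List[str] = []
    in_fence = False
    for line in text.split("\n"):
        if line.strip().startswith(FENCE):
            in_fence = not in_fence
        if line.strip() == "" and not in_fence:
            # A blank line outside a fence closes the current block
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


sample = "\n".join([
    "Intro paragraph.",
    "",
    FENCE + "python",
    "x = 1",
    "",
    "y = 2",
    FENCE,
    "",
    "Closing paragraph.",
])
blocks = split_blocks(sample)
print(len(blocks))
```

The blank line inside the code block no longer causes a split, so the whole block travels in one chunk and survives the rejoin.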
Where We Are Now
LectuLibre’s translation pipeline currently handles EPUBs and PDFs up to ~1000 pages. We’ve translated novels, technical manuals, and even a PhD thesis. The chunking approach has held up surprisingly well, but there’s room for improvement: dynamic overlap detection, better table handling, and perhaps a two‑stage translation where we first summarize each chunk’s context.
If you’re building a similar system, don’t underestimate the merge logic. The chunking is easy; making the final output read like a single, coherent book is the real challenge.
What’s your experience with long‑form AI translation? Have you found a better chunking heuristic? We’d love to hear your thoughts in the comments.