DEV Community: 龚旭东

Parsing and Rebuilding EPUB Files in Python: Lessons from Building an AI Book Translator

龚旭东 — Wed, 29 Jul 2026 03:02:19 +0000

How we used ebooklib and Beautiful Soup to process thousands of EPUBs without losing metadata, formatting, or our sanity.

The Dream: One-Click Book Translation

When we started building LectuLibre, an AI-powered book translation service, we knew the core challenge wouldn't be the translation itself. LLMs like Claude and DeepSeek are remarkably good at handling text. The real headache was the book container: EPUB files. Users upload an EPUB, we translate it, and they download a perfectly formatted translated book. Sound simple? We thought so too, until we actually tried it.

EPUB is a deceptively complex format. It's a ZIP archive containing XHTML chapters, CSS, images, fonts, and a few XML control files (notably the content.opf and toc.ncx). To translate a book, you must:

Extract all text from the XHTML files while preserving the surrounding markup.
Translate only the text nodes, leaving tags, anchors, and images untouched.
Rebuild the EPUB with the same metadata, spine order, and resources.

This is a parsing and restructuring problem that sits somewhere between web scraping and document assembly. Here's how we tackled it with Python, and what we learned along the way.

Why Not Just Use Calibre?

Calibre is the Swiss Army knife of ebooks, but it's a desktop application, not a library we could embed in a FastAPI service. We needed something lightweight, scriptable, and open-source. Enter ebooklib, a pure-Python library that reads and writes EPUB files. It does the job, but as we discovered, it has quirks you'll only find when you throw thousands of real-world EPUBs at it.

Our Parsing Pipeline: ebooklib + Beautiful Soup

The basic workflow with ebooklib looks like this:

from ebooklib import epub

book = epub.read_epub('book.epub')

That one-liner unpacks the ZIP, parses the XML, and gives you an EpubBook object. But the devil is in the details. Here's the actual production code we landed on:

import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

def parse_epub(filepath: str):
    book = epub.read_epub(filepath, options={'ignore_ncx': False})
    items = list(book.get_items_of_type(ebooklib.ITEM_DOCUMENT))

    chapters = []
    for item in items:
        content = item.get_content().decode('utf-8')
        soup = BeautifulSoup(content, 'lxml-xml')  # Use XML parser to avoid HTML closing tag issues
        # Extract text nodes while preserving structure
        chapters.append({
            'id': item.get_id(),
            'soup': soup,
            'href': item.get_name()
        })
    return book, chapters

We used lxml-xml as the BeautifulSoup parser because EPUB content documents are XHTML, which is XML rather than tag-soup HTML. With the default HTML parser, self-closing tags got mangled. A subtle but critical choice.

Navigating the OPF Spine

The content.opf defines the reading order via the <spine> element. ebooklib exposes this as book.spine; when a book is read from a file, each entry is an (idref, linear) tuple whose idref points to a manifest item. So to walk the book in order:

spine_order = []
for spine_item in book.spine:
    item = book.get_item_with_id(spine_item[0])
    if item is None:
        # Spine entry points to an ID missing from the manifest: skip it
        continue
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        spine_order.append(item)

But we found that some EPUBs have spine entries pointing to non-existent IDs. Our approach: skip them and log a warning rather than fail. Robustness over strictness.

Translating In-Place Without Breaking Markup

After extracting the text, we send it to an LLM in chunks (respecting token limits). The tricky part is replacing the original text with the translated version while keeping every tag, class, and attribute intact.

We traverse the BeautifulSoup tree and only replace NavigableString nodes:

def translate_soup(soup: BeautifulSoup, translation_map: dict) -> BeautifulSoup:
    for text_node in soup.find_all(string=True):
        stripped = text_node.strip()
        if stripped and stripped in translation_map:
            text_node.replace_with(translation_map[stripped])
    return soup

This works for simple cases, but we quickly ran into issues when the translation changed the number of paragraphs (e.g., a single <p> became two). The solution? We stitch translations back by splitting on paragraph boundaries and reassembling manually. It's not perfect, but it handles 95% of cases.
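
To make the stitching step concrete, here is a minimal sketch of one way to do it. It is illustrative rather than our production code: the helper name realign_paragraphs and the merge-or-pad policy are assumptions. It splits the translated text on blank lines and coerces the result back to the number of paragraphs in the source, so each piece can be written back into its original element.

```python
import re

def realign_paragraphs(translated: str, expected: int) -> list[str]:
    """Split translated text on blank lines and coerce it to `expected` paragraphs.

    Hypothetical helper: extra paragraphs are merged into the last slot,
    missing ones are padded with empty strings so indexes still line up.
    """
    if expected < 1:
        raise ValueError("expected must be at least 1")
    parts = [p.strip() for p in re.split(r"\n\s*\n", translated) if p.strip()]
    if len(parts) > expected:
        # Too many paragraphs: fold the overflow into the final expected slot
        head, tail = parts[:expected - 1], parts[expected - 1:]
        parts = head + [" ".join(tail)]
    elif len(parts) < expected:
        # Too few paragraphs: pad so the caller can detect and flag the gap
        parts += [""] * (expected - len(parts))
    return parts
```

An empty string in the result is a useful signal that the segment needs a retry or a manual look.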

Rebuilding the EPUB: Writing Back Without Losing Your Mind

Once all soup objects have been translated, we rebuild the EPUB:

def save_epub(book, chapters, output_path):
    for ch in chapters:
        item = book.get_item_with_id(ch['id'])
        # Update content with translated soup
        item.set_content(str(ch['soup']).encode('utf-8'))

    # Write to disk
    epub.write_epub(output_path, book)

Seems simple, but ebooklib's write_epub can create invalid archives if you’ve added or removed items without updating the manifest and spine. Our rule: never add or remove items; only modify existing XHTML documents. This preserves the existing content.opf and avoids UUID mismatches.

A Word on Memory

Processing a 200 MB EPUB with hundreds of images can spike memory to over 1 GB with naive loading. We reduce that by releasing the binary content of everything except document items as soon as the book has been read:

book = epub.read_epub(filepath, options={'ignore_ncx': False})
# read_epub loads every item, so release images and fonts right away
for item in book.get_items():
    if item.get_type() != ebooklib.ITEM_DOCUMENT:
        # Frees memory early. The original bytes must be put back (e.g. from
        # the source ZIP) before write_epub, or these resources are written empty.
        item.content = b''

This dropped memory usage by 60% on large files.

Dealing with Concurrency

In our FastAPI backend, we handle uploads asynchronously. We stream the file to a temporary location, then hand it off to a translation worker. Because the translation step is CPU-bound and memory-heavy, we use a ThreadPoolExecutor capped at os.cpu_count() workers. This keeps our VPS responsive and avoids OOM kills.
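
A minimal sketch of that hand-off, assuming a blocking function does the real work. The names process_book and handle_upload are placeholders for illustration, not our actual endpoints.

```python
import asyncio
import os
from concurrent.futures import ThreadPoolExecutor

# Cap workers at the core count, as described above
executor = ThreadPoolExecutor(max_workers=os.cpu_count() or 1)

def process_book(path: str) -> str:
    """Placeholder for the blocking parse/translate/rebuild work (hypothetical)."""
    return path.replace(".epub", ".translated.epub")

async def handle_upload(path: str) -> str:
    loop = asyncio.get_running_loop()
    # Run the blocking job off the event loop so other requests stay responsive
    return await loop.run_in_executor(executor, process_book, path)
```

Threads keep the event loop free, but in standard CPython pure-Python CPU work still contends for the GIL. If that becomes the bottleneck, a ProcessPoolExecutor is the usual alternative.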

Wrestling with Malformed EPUBs

Real-world EPUBs are a mess. We've encountered:

Missing content.opf → fall back to looking for any .opf file in the ZIP.
Non-UTF-8 encodings → we use chardet to detect encoding before decoding.
Cyclic references in the manifest → we added a visited set to our parsing loop.
Empty spine → we treat all XHTML files as a reading order.
CSS with @import → we inline critical styles because Kindle doesn't support imports.
External fonts loaded from CDNs → we swap in a local fallback.
CDATA sections wrapping entire documents → we had to pre-process the XML to unwrap them.

We implemented a preflight validation step that checks for these issues and tries to auto-heal where possible. For the tough cases, we tell the user that the uploaded file is "quirky" and that the translation will proceed on a best-effort basis.
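
As an example of one preflight check, the sketch below locates the package document. It assumes the standard layout in which META-INF/container.xml names the rootfile, and falls back to scanning for any .opf file as described in the list above. The function name find_opf_path is ours for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import Optional

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

def find_opf_path(zf: zipfile.ZipFile) -> Optional[str]:
    """Return the package document path, or None if the archive has no .opf at all."""
    names = zf.namelist()
    # Preferred route: META-INF/container.xml names the rootfile explicitly
    if "META-INF/container.xml" in names:
        try:
            root = ET.fromstring(zf.read("META-INF/container.xml"))
            rootfile = root.find(".//c:rootfile", CONTAINER_NS)
            if rootfile is not None:
                declared = rootfile.get("full-path")
                if declared in names:
                    return declared
        except ET.ParseError:
            pass  # malformed container.xml: fall through to the scan
    # Fallback: take the first .opf file found anywhere in the ZIP
    for name in names:
        if name.lower().endswith(".opf"):
            return name
    return None
```

A None result is the signal to reject the upload or flag it for manual handling.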

The Table of Contents Conundrum

Many translators overlook the navigation. We initially left the toc.ncx (or nav.xhtml for EPUB3) untouched—but then users complained that the table of contents was still in the original language. We're now developing a separate step that translates the nav labels (the <navLabel> elements) using the same LLM, but with a simpler context. The challenge is that some labels are just numbers or chapter abbreviations, and over-translating breaks the user experience. We're exploring a hybrid approach: only translate labels that contain natural language.
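
One possible shape for that hybrid filter is sketched below. The rules and thresholds are assumptions for illustration, not what we have shipped: labels that are purely numeric, made of Roman numerals, or a single short Latin-script abbreviation are left alone, and everything else is sent for translation.

```python
import re

# Matches tokens made only of Roman-numeral letters, e.g. "IV" or "xii"
_ROMAN = re.compile(r"^[ivxlcdm]+$", re.IGNORECASE)

def should_translate_label(label: str) -> bool:
    """Heuristic guess at whether a TOC label contains natural language."""
    # Alphabetic runs only: digits and punctuation carry nothing to translate
    words = re.findall(r"[^\W\d_]+", label)
    if not words:
        return False  # purely numeric or empty, e.g. "12" or "3.1"
    if all(_ROMAN.match(w) for w in words):
        return False  # Roman numerals, e.g. "IV"
    if len(words) == 1 and words[0].isascii() and len(words[0]) <= 3:
        return False  # short Latin-script abbreviation, e.g. "Ch. 3"
    return True
```

The heuristic has known blind spots: a one-word title such as "Mix" looks like a Roman numeral and would be skipped, so a real implementation would want an override list.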

Performance and Scalability Numbers

Our backend is a FastAPI app on a 4‑core VPS. A typical translation workflow takes:

Parsing: 0.2–0.5 seconds for a 300‑page novel.
Translation (LLM call): 30–50 seconds (the bottleneck by far).
Rebuilding EPUB: 0.5–1.5 seconds depending on complexity.

We can handle 8 concurrent translations before hitting the LLM API rate limit or memory pressure. Our worker pool uses asyncio to manage these, with a semaphore limiting parallelism.

Testing with a Corpus of Horrors

To ensure we don't regress, we built a corpus of 500 reference EPUBs—including many deliberately broken ones. Before every deploy, we run integration tests that:

Check that the spine order is preserved.
Verify that the number of XHTML files remains unchanged.
Compare the translated output against a human‑reviewed gold standard for a subset.
Validate the output EPUB with EPUBCheck (the official validator, originally from the IDPF and now maintained under the W3C).

This saved us multiple times when a library update changed parsing behavior.

Lessons Learned

Never trust an EPUB. Always assume the XML might be invalid.
Separation of concerns: Keep parsing, translation, and rebuilding as separate, testable functions.
ebooklib is great but not bulletproof. For advanced features like media overlays or EPUB3 audio, you may need to manipulate the ZIP directly.
Memory matters. Don't load binary blobs you don't need.
Test with a corpus. We maintain a set of 500 reference EPUBs (including broken ones) to run before every deploy.
Navigation is part of the content. Translating the book but not the TOC leads to an inconsistent experience.

The Future: PDFs and Streaming

PDF parsing is next on our roadmap, with an entirely different set of challenges (we're eyeing pymupdf). But the EPUB pipeline taught us that ebooks are closer to web pages than to documents, and using web-native tools (BeautifulSoup, CSS inlining) pays off.

If you're building something similar, I'd love to hear your approach. Did you stick with ebooklib or roll your own? How do you handle RTL languages? Join the discussion in the comments.

—

LectuLibre is in private beta at lectulibre.com. We built this pipeline because we love books and believe language shouldn't be a barrier.

How We Built Precise Translation and Language Identification for AI Book Translation

龚旭东 — Sat, 25 Jul 2026 03:01:44 +0000

How we tackled 精准翻译与语言识别 (precise translation and language identification) for AI-powered book translation.

The Problem: Garbage In, Garbage Out

When we first launched LectuLibre, our AI book translation service, we thought the hardest part would be fine-tuning LLM prompts for literary quality. But we quickly discovered a more fundamental hurdle: if the source language of an uploaded book is misidentified, no amount of prompt engineering can salvage the translation.

Users upload EPUBs and PDFs from all over the world. Some contain metadata specifying the language, but many don't. Others are multilingual books, or have prefaces in a different language. Our initial language detection using Python's langdetect library was correct only about 85% of the time on real-world uploads. That 15% error rate meant entirely garbled translations, frustrated users, and wasted LLM API credits.

We needed something far more robust—what we internally call 精准翻译与语言识别 (precise translation and language identification). Here’s how we built it.

The Language Detection Pipeline: From 85% to 98% Accuracy

Our first instinct was to try heavier models like fastText's pre-trained language identification model, which is known for high accuracy. But when we tested it on book excerpts, we hit a new problem: short paragraphs or dialogues in one language embedded in a book of another language (e.g., French phrases in an English novel) would throw off chunk-level detection.

We realized that we needed a two-tier approach: book-level language detection with confidence scoring, and per-chunk verification before translation.

Combining Multiple Detectors with Voting

We created a LanguageDetector class that runs several detectors and picks the majority vote, with a fallback to user-specified language when available. The detectors we use are:

fastText with the official lid.176.bin model (loaded once, not per request)
langdetect, which is lightweight and good for long texts
cld3 (Compact Language Detector 3) from Google, which works well on short snippets
A custom fine-tuned fastText model on 10,000 book excerpts we curated from multilingual EPUBs (we'll open source this soon)

For a given text, we take the top-1 prediction from each detector, and if at least three of them agree, we trust that result. If there's a tie, we fall back to the one with the highest overall confidence (from the model's probability). If the user explicitly set a source language during upload, we honor that, but we still run detection and warn if a mismatch is strong.

Here's a simplified version of our detector:

import fasttext
import langdetect
import cld3

class LanguageDetector:
    def __init__(self):
        self.ft_model = fasttext.load_model('lid.176.bin')

    def detect(self, text, user_lang=None):
        # Clean text: replace newlines, keep only first 10000 chars
        clean = text.strip().replace('\n', ' ')[:10000]
        votes = []
        # fastText
        pred_ft = self.ft_model.predict(clean, k=1)
        lang_ft = pred_ft[0][0].replace('__label__', '')
        conf_ft = pred_ft[1][0]
        votes.append((lang_ft, conf_ft))
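        # langdetect: detect_langs() returns candidates sorted by probability,
        # so the fallback below compares real scores instead of a fixed 1.0
        try:
            top_ld = langdetect.detect_langs(clean)[0]
            votes.append((top_ld.lang, top_ld.prob))
        except Exception:
            votes.append(('unknown', 0.0))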
        # cld3
        pred_cld = cld3.get_language(clean)
        if pred_cld.is_reliable:
            votes.append((pred_cld.language, pred_cld.probability))
        else:
            votes.append(('unknown', 0.0))

        # Voting: count occurrences, pick max with tie-break on confidence
        lang_counts = {}
        for lang, conf in votes:
            lang_counts[lang] = lang_counts.get(lang, 0) + 1
        majority_lang = max(lang_counts, key=lambda l: (lang_counts[l], 
                            sum(c for v_l, c in votes if v_l == l)))
        if majority_lang != 'unknown' and lang_counts[majority_lang] >= 3:
            # Strong consensus: use the detected language even if it
            # contradicts user_lang, and log a warning in that case
            if user_lang and user_lang != majority_lang:
                pass  # log warning here
            return majority_lang
        # If no consensus, fallback to user setting or the highest confidence from all
        if user_lang:
            return user_lang
        # Pick the highest confidence individually
        best = max(votes, key=lambda x: x[1])
        return best[0]

In production, we load the fastText model at startup to avoid per-request overhead. We also cache detection results per book (based on a hash of the first 10k characters plus random sampling) to speed up subsequent operations.

Handling Multilingual Books

For books that contain multiple languages (e.g., a language textbook with side-by-side translation), a single book-level label isn't enough. We first try to identify the dominant language (the one covering more than 80% of the pages). When the user requests translation, we translate only the chunks that match the source language and leave the rest untouched. This requires per-chunk language identification: our pipeline runs detection on each chunk (around 1,000 tokens) before sending it to the LLM and skips chunks that are clearly not in the source language. That keeps the LLM from stumbling over foreign phrases.
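
The per-chunk gate can be sketched as follows. This is a simplified illustration: split_by_language is a hypothetical helper, and the detect argument stands in for any callable that returns a language code, such as the detect method of the LanguageDetector class above.

```python
from typing import Callable, Iterable

def split_by_language(
    chunks: Iterable[str],
    source_lang: str,
    detect: Callable[[str], str],
) -> list[tuple[str, bool]]:
    """Pair each chunk with a flag saying whether it should be translated."""
    plan = []
    for chunk in chunks:
        # Translate only chunks detected as the source language; pass the rest through
        plan.append((chunk, detect(chunk) == source_lang))
    return plan
```

In this sketch a chunk detected as 'unknown' is passed through untranslated, which is a conservative choice rather than the only reasonable one.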

Chunking for LLMs: Preserving Context Without Breaking the Bank

Large language models are great at translation, but they have token limits and cost per token. Books are long, and sending them whole would bust context windows and API budgets. We needed a chunking strategy that keeps enough context for accurate translation while minimizing token waste.

Dynamic Chunking with Overlap

We use tiktoken (OpenAI's tokenizer library) to count tokens. Claude uses a different tokenizer, so for Claude models these counts are an approximation rather than exact. We set a maximum chunk size of 3,500 tokens (to leave headroom for the system prompt and response) and prepend a 200-token overlap from the previous chunk for context continuity, so a chunk can reach 3,700 tokens once the overlap is added. The overlap is crucial for sentences that span chunk boundaries.

import tiktoken

def chunk_text(text, max_tokens=3500, overlap=200):
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunk = tokens[start:end]
        # if we're not at the beginning, prepend overlap from previous chunk
        if start > 0:
            overlap_start = max(0, start - overlap)
            prepend = tokens[overlap_start:start]
            chunk = prepend + chunk
        chunks.append(enc.decode(chunk))
        start = end
    return chunks

We then pass each chunk to the LLM with a system prompt that includes the source and target languages, and instructions to preserve formatting (markdown, special characters). We use Anthropic's Claude 3 Haiku for cost efficiency, but fall back to Claude Sonnet for complex passages when the model returns a confidence score below a threshold.

Async Translation Pipeline with Rate Limiting

Translating a full book might require 50+ chunks. We run these in parallel using asyncio, but we must respect API rate limits (for Anthropic, it's 5 requests per second on our tier). We combined a rate limiter (aiolimiter, which implements a leaky-bucket algorithm) with a semaphore:

import asyncio
from aiolimiter import AsyncLimiter

rate_limiter = AsyncLimiter(5, 1)  # 5 requests per 1 second
sem = asyncio.Semaphore(10)  # max 10 in-flight

async def translate_chunk(chunk, src_lang, dst_lang):
    async with sem:
        async with rate_limiter:
            # Call Claude API
            ...

We also batch large books into groups of chunks and process them with a small number of workers. This keeps the API usage smooth and prevents 429 errors.

Quality Control and Fallback

Even with good language detection and chunking, LLMs sometimes produce translations that are literal and miss literary nuance. For academic books, that might be fine, but for novels, we needed a way to improve fluency. We added a post-processing step where we run each translated chunk through a second LLM call (a smaller model like Claude Haiku with a higher temperature) to "polish" the language only if the user selected the "literary" translation mode. This adds cost but significantly improves readability. We found that around 30% of users opt for literary mode.

Results and Lessons Learned

After deploying the multi-model language detector, our language identification accuracy on a test set of 5,000 book excerpts jumped from 85% to 98.2%. Misidentifications dropped to rare edge cases (e.g., very short texts in dialect). In production, language misdetections fell to near zero, and user complaints about garbled translations ceased.

The chunking strategy with overlap reduced API retries due to incomplete sentences by 40%, and the literary post-polish improved average user readability ratings (from A/B tests) by 15% on fiction books.

We also saved about 20% on API costs because the better language detection meant we weren't wasting credits on incorrect translations, and the dynamic chunking allowed us to pack more text per API call.

Key Takeaways for Developers

Don't trust a single detector; ensemble methods are far more reliable for language ID.
Always run per-chunk verification before translating to catch mixed-language content.
Overlap is essential when chunking for LLM translations—200 tokens of context made a big difference in our tests.
Rate limiting and async are your friends; they turn a fragile pipeline into a robust one.

Open Questions

We're still exploring: How to handle ancient languages or highly specialized jargon? Is there a way to dynamically adjust chunk size based on sentence boundaries? We'd love to hear from the community—especially anyone who's tackled translation of technical manuals or poetry.

LectuLibre's translation pipeline continues to evolve, and we're planning to open-source our language detection ensemble and curated dataset. If you're building something similar, reach out!

How We Translate Entire Books with LLMs Without Losing Context

龚旭东 — Wed, 22 Jul 2026 03:02:04 +0000

Solving the context-window puzzle for book-length AI translation.

At LectuLibre, we set out to build a service that translates entire books using large language models. The idea is simple: upload an EPUB or PDF, choose a language, and receive a polished translation. But behind the scenes, translating a hundred-thousand-word novel with LLMs isn't straightforward. The core challenge is context — LLMs have limited context windows, and books are long. Simply chopping the text into chunks and feeding each one independently leads to incoherent output. Character names change, pronouns lose referents, and tone veers wildly. Here’s how we solved that with a chunking strategy that preserves context, and the Python code that makes it tick.

The Problem: Long Documents vs. Short Context Windows

Modern LLMs like Claude 3 Opus can handle 200,000 tokens of context, while DeepSeek-V2 offers 128,000 tokens. That’s a lot, and a 50,000-word English novel comes to roughly 67,000 tokens (using Claude’s tokenizer), so it fits. But what about a 150,000-word fantasy epic? At the same ratio that is about 200,000 tokens of input, before leaving any room for the translated output. Even when a book fits, sending it in one prompt is costly, slow, and often degrades attention quality on long texts. The prevailing approach is to chunk the document.
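
The arithmetic behind those figures is simple enough to script. The sketch below just scales the ratio quoted above (50,000 words to roughly 67,000 tokens); the ratio itself is an empirical figure for English prose and will differ for other languages and tokenizers.

```python
def estimate_tokens(word_count: int, tokens_per_50_words: int = 67) -> int:
    """Scale the figure quoted above (50,000 words ~ 67,000 tokens) to other lengths."""
    # Integer arithmetic avoids float rounding; 67 tokens per 50 words = 1.34 per word
    return word_count * tokens_per_50_words // 50

print(estimate_tokens(50_000))   # 67000
print(estimate_tokens(150_000))  # 201000
```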

Naive chunking — say, splitting by a fixed token count — creates hard boundaries. One chunk ends, another begins, and the LLM has no idea what happened before. The result reads like a patchwork of isolated translations. We needed a method that gives each chunk enough surrounding context without exceeding token limits or breaking the bank.

Our Approach: Sliding Window + Context Retrieval via Embeddings

We adopted a two‑pronged strategy:

Overlapping chunks: each chunk shares some sentences with the previous one, so the LLM can transition smoothly.
Injected context: for every chunk, we retrieve and prepend the most relevant previous chunks, determined by embedding similarity.

This way, the model always has a sense of what’s happening before and after the current segment. Overlap handles local continuity; similarity retrieval provides broader narrative context.

Step 1: Parse and Preprocess

First, we extract text from uploaded files. For PDFs we use PyPDF2 or pdfplumber; for EPUBs, ebooklib. The raw text is cleaned: excess whitespace is removed and chapter titles are detected (helpful for later splitting). We then split the text into paragraphs, and the paragraphs into sentences, using spaCy for reliable sentence segmentation.

import spacy
nlp = spacy.load("en_core_web_sm")

def get_sentences(text: str) -> list:
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents]

Step 2: Create Overlapping Token-Based Chunks

We built a custom chunker that respects sentence boundaries but packs as many tokens as possible into a chunk, while maintaining an overlap. We use tiktoken for token counts; it is OpenAI's tokenizer, so the numbers are an approximation for Claude models rather than exact.

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")  # OpenAI encoding; approximate for Claude models

def chunk_sentences(sentences: list, max_tokens=3000, overlap_tokens=500) -> list:
    chunks = []
    current_chunk = []
    current_tokens = 0

    for sent in sentences:
        sent_tokens = len(enc.encode(sent))
        if current_tokens + sent_tokens > max_tokens and current_chunk:
            # finalize current chunk
            chunks.append(' '.join(current_chunk))
            # start new chunk with overlap: keep last N tokens from previous chunk
            overlap_sents = []
            ov_tokens = 0
            for s in reversed(current_chunk):
                t = len(enc.encode(s))
                if ov_tokens + t <= overlap_tokens:
                    overlap_sents.insert(0, s)
                    ov_tokens += t
                else:
                    break
            current_chunk = overlap_sents + [sent]
            current_tokens = sum(len(enc.encode(s)) for s in current_chunk)
        else:
            current_chunk.append(sent)
            current_tokens += sent_tokens
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks

For a 50,000-word book with max_tokens=3000 and overlap_tokens=500, we get around 25–30 chunks. Splitting on sentence boundaries means no sentence is cut off mid-thought, and the overlap lets the model see the tail of the previous chunk, reducing boundary artifacts.
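
The chunk count can be sanity-checked with a back-of-the-envelope formula: the first chunk holds up to max_tokens, and every later chunk contributes max_tokens minus overlap_tokens of new text. The helper below is an idealized estimate written for illustration. Because real chunks end on sentence boundaries and rarely fill completely, actual counts tend to land a little higher.

```python
import math

def estimate_chunk_count(total_tokens: int, max_tokens: int = 3000, overlap_tokens: int = 500) -> int:
    """Idealized count: each chunk after the first adds (max - overlap) new tokens."""
    if total_tokens <= max_tokens:
        return 1
    stride = max_tokens - overlap_tokens
    return 1 + math.ceil((total_tokens - max_tokens) / stride)

print(estimate_chunk_count(67_000))  # 27
```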

Step 3: Context Injection with Embedding Similarity

Overlap helps locally, but for global coherence (character consistency, jargon) we need a broader view. We use sentence-transformers to embed each chunk (we embed only the first ~1000 characters to keep it fast). Before translating a chunk at index i, we compute cosine similarity with all previous chunks and pick the top 3 most similar ones to prepend as “context.”

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')  # small, fast, local

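def embed_chunks(chunks: list) -> np.ndarray:
    # normalize_embeddings=True guarantees unit vectors, so a dot product is cosine similarity
    return model.encode([c[:1000] for c in chunks], convert_to_numpy=True, normalize_embeddings=True)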

def get_context_for_chunk(chunks, embeddings, current_idx, top_k=3):
    if current_idx == 0:
        return []
    query_embedding = embeddings[current_idx]
    # Cosine similarity (vectors already L2‑normalized by the model)
    similarities = np.dot(embeddings[:current_idx], query_embedding)
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [chunks[i] for i in top_indices]

The retrieved chunks are concatenated and placed above the chunk to translate. We guard against exceeding the model’s max context: if the prompt would be too long, we truncate the context or drop the least similar chunks.
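
A minimal sketch of that guard, under the assumption that the context list arrives ordered most-similar first (which is how get_context_for_chunk returns it). The helper name fit_context and the injected count_tokens callable are illustrative choices.

```python
from typing import Callable

def fit_context(
    context_chunks: list[str],
    chunk: str,
    budget_tokens: int,
    count_tokens: Callable[[str], int],
) -> list[str]:
    """Drop the least similar context chunks until context + chunk fits the budget.

    `context_chunks` is assumed to be ordered most-similar first.
    """
    kept = list(context_chunks)
    used = count_tokens(chunk) + sum(count_tokens(c) for c in kept)
    while kept and used > budget_tokens:
        # The last element is the least similar, so it goes first
        used -= count_tokens(kept.pop())
    return kept
```

With tiktoken, count_tokens could be `lambda s: len(enc.encode(s))`. Passing it in keeps the function easy to test with a cheaper counter.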

Step 4: Translating with Asyncio

We call Claude and DeepSeek APIs asynchronously using httpx. We wrap each call with retry logic and cap concurrency with a simple semaphore.

import asyncio
import httpx

async def translate_chunk(client, chunk, context_chunks, target_lang="Spanish"):
    context_text = "\n---\n".join(context_chunks)
    prompt = f"""You are a professional book translator. Below is context from earlier parts of the book.
Use it to maintain consistency.

CONTEXT:
{context_text}

TEXT TO TRANSLATE ({target_lang}):
{chunk}"""
    # Simplified API call – actual code includes system prompt, temperature, etc.
    response = await client.post(
        "https://api.anthropic.com/v1/messages",
        json={
            "model": "claude-3-opus-20240229",
            "max_tokens": 4096,
            "messages": [{"role": "user", "content": prompt}],
        },
        headers={"x-api-key": API_KEY, "anthropic-version": "2023-06-01"}
    )
    # ... error handling ...
    return response.json()["content"][0]["text"]

async def translate_book(chunks, embeddings):
    sem = asyncio.Semaphore(5)  # max concurrent API calls
    async with httpx.AsyncClient(timeout=60) as client:
        async def translate_one(i):
            async with sem:
                context = get_context_for_chunk(chunks, embeddings, i)
                return await translate_chunk(client, chunks[i], context)
        tasks = [translate_one(i) for i in range(len(chunks))]
        return await asyncio.gather(*tasks)

Results and Trade-offs

We translated a 50,000-word English fantasy novel into Spanish. With the naive chunking approach (no overlap, no context), we saw inconsistent character names (e.g., “Eldrin” became “Eldrín” in some chunks, “Eldrin” in others), and the narrative tone shifted abruptly. With our full pipeline:

Coherence improved dramatically. Names, places, and invented terminology remained consistent 98% of the time.
Total API calls: 31 chunks (with overlap) vs. 25 without – overlap adds a few extra calls but it’s negligible.
Cost: Claude 3 Opus, ~$0.015 per 1K input tokens. The entire book translation cost about $4.50 (including context injection). DeepSeek‑V2 was cheaper but we preferred Opus for nuance.
Embedding time: Generating embeddings for all chunks with all-MiniLM-L6-v2 took less than 2 seconds on a CPU, no noticeable overhead.

Trade-offs:

Sometimes the similarity search grabs a “context” chunk that isn’t truly relevant, leading to mild confusion. Filtering by minimum similarity score helped.
Overlap can cause duplicate content if the model inadvertently translates the overlapping part twice, but with careful overlap sizing (10–15% of chunk tokens) this was rare.
Rate limiting on the API side meant we couldn’t max out parallelism; we found 5 concurrent calls a safe spot for our tier.
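
The minimum-similarity filter mentioned in the first trade-off can be sketched in a few lines. The 0.3 threshold is an arbitrary placeholder rather than a tuned value, and select_context is a hypothetical name. The input is the list of cosine similarities between the current chunk and each earlier chunk.

```python
def select_context(similarities: list[float], top_k: int = 3, min_score: float = 0.3) -> list[int]:
    """Return indices of up to top_k previous chunks whose similarity clears min_score."""
    # Rank chunk indices from most to least similar
    ranked = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)
    # Keep the top_k, then discard any that fall below the floor
    return [i for i in ranked[:top_k] if similarities[i] >= min_score]

print(select_context([0.1, 0.8, 0.25, 0.5]))  # [1, 3]
```

Returning fewer than top_k chunks (or none) is intentional: no context is better than misleading context.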

Lessons Learned

Overlap is essential even with context injection. It smooths the mechanical join between chunks.
Embedding similarity is a cheap proxy for relevance. It works surprisingly well for narrative text, but for technical books you might need keyword‑based retrieval.
Async processing is a must. A synchronous loop would have taken hours; with asyncio the whole book translated in under 10 minutes.
Chapter boundaries are natural split points. We didn’t implement it here, but you can align chunks to chapter starts to reduce overlap and improve coherence.

What’s Next?

We’re exploring hierarchical summarization: instead of injecting raw context, generate a running summary of the book so far and feed that as context. That might be more efficient for very long works. Also, longer‑context models keep getting better — maybe one day we can just toss the whole book in.

Takeaway for developers: translating long documents with LLMs is a solved problem in principle, but nailing the details — chunk size, overlap, context retrieval — makes the difference between a decent translation and a great one. The code I shared is the backbone of our system at LectuLibre; adapt it to your own use case, and you’ll be 90% of the way there.

What strategies have you used for long‑form LLM tasks? Do you think summarization beats retrieval for context? We’d love to hear your ideas in the comments.

How We Built Non-Standard Text Translation and Meaning Confirmation (非标准文本翻译与含义确认): A Context-Aware Book Translation Pipeline with Python and LLMs

龚旭东 — Sat, 18 Jul 2026 03:02:05 +0000

Tackling idioms, cultural references, and ambiguous phrases in AI-powered book translation.

At LectuLibre, we’ve been working on an AI-powered book translation service. One of the toughest challenges we ran into wasn’t the straightforward sentences — it was the non-standard text: idioms, metaphors, cultural references, and ambiguous phrases that machine translation consistently butchers. We needed a way to not only translate these correctly but also let users verify and edit the translations, because in literary works, getting them wrong breaks the entire reading experience.

That’s how we built our 非标准文本翻译与含义确认 (non‑standard text translation and meaning confirmation) feature. It’s a pipeline that detects tricky sentences, proposes a contextual translation with a full meaning explanation, and gives users a final say. Here’s the engineering story, warts and all.

The Problem

Standard LLM translation does an impressive job on factual, literal text. But when a book says “it’s raining cats and dogs,” it could be rendered as “raining animals” in the target language, which is either brilliant or absurd depending on context. Idioms often carry cultural weight that a word-for-word translation loses. Additionally, metaphors and ambiguous phrases can have multiple valid interpretations. For a translator, understanding the intent behind the phrase is half the work.

We wanted a system that:

Automatically identifies sentences containing non‑standard language.
Generates a translation that preserves the original meaning rather than just the literal words.
Provides a plain‑language explanation of what the phrase actually means (e.g., “This is an English idiom meaning it’s raining heavily”), so the user can judge the translation’s accuracy.
Allows the user to confirm, edit, or retranslate those segments.

A book can easily run to hundreds of thousands of words, so cost and speed were critical. We couldn’t just throw everything at a single high‑end LLM and call it a day.

Our Approach

We broke the problem into a pipeline of three stages:

Detection – scan every sentence and flag ones that likely contain idioms, metaphors, or cultural references.
Translation & Explanation – for each flagged sentence, call a more powerful model to produce a translation, a glossary of key terms, and a meaning explanation.
User Confirmation – present the results in the UI, let the user review and approve or edit.

This pipeline is orchestrated asynchronously inside our FastAPI backend. We chose DeepSeek for detection because it’s fast and cheap, and Claude 3.5 Sonnet for the final translation because it consistently gave the best contextual output in our tests. Both were accessed via their REST APIs using the official Python SDKs (openai for DeepSeek, anthropic for Claude).

Detection Prompt

The detection prompt is few‑shot: it includes a handful of examples of what is and isn’t non‑standard, and asks the model to output a JSON object with is_non_standard (boolean) and a reason string.

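import json
from openai import AsyncOpenAI

# DeepSeek exposes an OpenAI-compatible API, so the openai SDK works with a custom base_url
deepseek_client = AsyncOpenAI(api_key="your-api-key", base_url="https://api.deepseek.com")

detection_prompt = """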
You are a linguistic classifier. Your job is to determine whether a sentence contains non‑standard language:
- Idioms (e.g., "kick the bucket")
- Metaphors that can't be translated literally
- Cultural references (e.g., "his Waterloo")
- Ambiguous phrases that require context

For each sentence, respond with a JSON object:
{"is_non_standard": true/false, "reason": "..."}

Examples:
Sentence: "The early bird catches the worm."
Response: {"is_non_standard": true, "reason": "Idiom meaning that those who act promptly gain an advantage."}

Sentence: "She opened the door and walked outside."
Response: {"is_non_standard": false, "reason": "Literal statement."}
"""

async def detect_non_standard(sentence: str) -> dict:
    response = await deepseek_client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": detection_prompt},
            {"role": "user", "content": f"Sentence: {sentence}"}
        ],
        temperature=0.0
    )
    return json.loads(response.choices[0].message.content)

We found that asking for a reason forced the model to think a bit harder, reducing false positives. On our test set of 2,000 sentences, detection accuracy was around 92% (precision 0.88, recall 0.91). False negatives were the nasty ones because they’d never get a human review. To mitigate that, we added a UI toggle that lets users manually mark any sentence as non‑standard.
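To make the relationship between those figures concrete, here is a minimal sketch of how such metrics can be computed from a labelled test set. The function name and the counts in the usage note are illustrative assumptions, not our actual evaluation code or data. The key point is that the false negative rate is simply 1 minus recall, so a recall of 0.91 means roughly 9% of truly non‑standard sentences slip through.

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute classifier metrics from confusion-matrix counts.

    tp: non-standard sentences correctly flagged
    fp: literal sentences wrongly flagged
    fn: non-standard sentences that were missed
    tn: literal sentences correctly left alone
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        # Share of truly non-standard sentences that never reach review
        "false_negative_rate": (1.0 - recall) if (tp + fn) else 0.0,
    }
```

For example, `detection_metrics(tp=90, fp=10, fn=10, tn=890)` (made‑up counts) gives precision 0.9, recall 0.9, and a false negative rate of 0.1.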

Translation Prompt

For flagged sentences, we call Claude with a more elaborate prompt. It must return a JSON object containing the translation, a meaning_explanation in the target language (so the user can read it easily), and an optional glossary of key culturally‑specific terms.

import json
from anthropic import AsyncAnthropic

anthropic_client = AsyncAnthropic(api_key="your-api-key")

# Literal braces in the JSON example are doubled so that str.format() leaves them intact
translation_prompt = """
You are a literary translator. Translate the following sentence from {source_lang} to {target_lang}.
If the sentence contains an idiom, metaphor, or cultural reference, adapt it so the meaning is preserved, not the literal words.
Also provide:
- meaning_explanation: a short explanation (in {target_lang}) of what the original expression means.
- glossary (optional): a list of objects with "term" and "explanation" for any special phrases.

Return a JSON object:
{{"translation": "...", "meaning_explanation": "...", "glossary": [...]}}
"""

async def translate_non_standard(sentence: str, source: str, target: str) -> dict:
    response = await anthropic_client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        system=translation_prompt.format(source_lang=source, target_lang=target),
        messages=[{"role": "user", "content": sentence}]
    )
    # The first content block holds the model's text reply
    return json.loads(response.content[0].text)

Because LLMs occasionally output invalid JSON, we wrap the json.loads with a retry loop and fallback logic. After a few iterations, if parsing fails, we store the raw output and flag it for manual review.
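A minimal sketch of that retry‑and‑fallback logic might look like the following. The names `extract_json` and `call_with_json_retry`, the `needs_review` flag, and the idea of pulling the outermost `{...}` span out of surrounding prose are assumptions made for illustration; the production code differs in detail.

```python
import json
import re
from typing import Any, Awaitable, Callable


def extract_json(raw: str) -> dict[str, Any]:
    """Parse a JSON object from model output, tolerating surrounding prose."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span in the text
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match is None:
            raise ValueError("no JSON object found in model output")
        parsed = json.loads(match.group(0))
    if not isinstance(parsed, dict):
        raise ValueError("model output is valid JSON but not an object")
    return parsed


async def call_with_json_retry(
    call: Callable[[], Awaitable[str]],
    max_attempts: int = 3,
) -> dict[str, Any]:
    """Invoke `call` until its output parses, else return a review placeholder."""
    raw = ""
    for _ in range(max_attempts):
        raw = await call()
        try:
            return extract_json(raw)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue
    # Every attempt failed: keep the raw text and flag it for manual review
    return {"raw_output": raw, "needs_review": True}
```

Here `call` is any zero‑argument coroutine function that returns the model's raw text, for example a small wrapper around the Claude request above.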

The Async Pipeline

Processing an entire book means thousands of sentences. We didn’t want to block the user’s request, so we run the pipeline as a background asyncio task. The controller is a simple function that:

Splits the book into sentences using a regex and a language‑aware tokenizer (we used sacremoses for European languages).
Batches sentences into groups of 5 for detection (to reduce API calls).
Processes detection batches with a concurrency semaphore.
Collects flagged sentences and processes them in parallel with another semaphore for translation.
Saves results to PostgreSQL via SQLAlchemy async.

Here’s a sketch of the core loop:

import asyncio

# Semaphores to respect rate limits
DETECT_SEM = asyncio.Semaphore(10)   # 10 concurrent detection calls
TRANSLATE_SEM = asyncio.Semaphore(5) # 5 concurrent Claude calls

async def process_book(book_id: str, source_lang: str, target_lang: str):
    sentences = split_into_sentences(load_book_text(book_id))
    results = []

    # Detection phase
    detect_tasks = []
    batch = []
    for sent in sentences:
        batch.append(sent)
        if len(batch) == 5:
            detect_tasks.append(detect_batch(batch))
            batch = []
    if batch:
        detect_tasks.append(detect_batch(batch))

    detection_flags = await asyncio.gather(*detect_tasks)

    # Flatten and collect non-standard
    flagged = []
    for batch_res in detection_flags:
        # detect_batch must echo each input back, so every dict has the keys
        # "sentence", "is_non_standard" and "reason"
        for flag in batch_res:
            if flag["is_non_standard"]:
                flagged.append(flag["sentence"])

    # Translation phase
    translate_tasks = [
        translate_with_semaphore(sent, source_lang, target_lang)
        for sent in flagged
    ]
    translations = await asyncio.gather(*translate_tasks)

    # Store to DB (simplified)
    await store_translations(book_id, sentences, flagged, translations)

async def detect_batch(batch: list[str]) -> list[dict]:
    async with DETECT_SEM:
        # Call the detection API for the whole batch using a multi‑sentence prompt
        ...

async def translate_with_semaphore(sentence: str, source: str, target: str) -> dict:
    async with TRANSLATE_SEM:
        return await translate_non_standard(sentence, source, target)

We used asyncio.gather for simplicity; in production we added proper error handling and per‑call retries with exponential backoff. For a typical 50k‑word book (~3,000 sentences), the detection phase took about 2 minutes (600 batched API calls with concurrency 10, ~200ms each), and translating the roughly 20% of sentences that were flagged (600 calls, ~1s each with concurrency 5) took another 2 minutes. Total processing time was around 4–5 minutes. Cost: roughly $0.50 for detection (DeepSeek) and $4–$8 for translation (Claude), depending on output length. Not free, but acceptable for the quality gain.
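For readers who want a starting point for the per‑call retries, here is a small sketch of exponential backoff with jitter. The helper name `with_backoff` and its default delays are assumptions chosen for illustration, and a real version would catch only the SDK's transient error types (rate limits, timeouts) rather than every exception.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_backoff(
    call: Callable[[], Awaitable[T]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
) -> T:
    """Await `call`, retrying failures with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return await call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt, capped at max_delay
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps concurrent workers from retrying in lockstep
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("unreachable")
```

Inside `translate_with_semaphore` this could wrap the call as `await with_backoff(lambda: translate_non_standard(sentence, source, target))`. Passing `return_exceptions=True` to `asyncio.gather` then keeps one permanently failing sentence from aborting the whole book.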

User Confirmation & Database

The final piece was letting users review and edit. We stored every translation in a translations table. Each row has:

sentence_index (position in the book)
original_text
is_non_standard (bool)
translation (nullable)
meaning_explanation (nullable)
glossary (JSONB)
user_confirmed (bool, default false)
user_translation (nullable, set when user overwrites)

SQLAlchemy async model (simplified):

from sqlalchemy import Integer, String, Boolean, ForeignKey
from sqlalchemy.dialects.postgresql import JSONB  # JSONB is PostgreSQL-specific
from sqlalchemy.ext.asyncio import AsyncAttrs
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(AsyncAttrs, DeclarativeBase):
    pass

class Translation(Base):
    __tablename__ = "translations"

    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    book_id: Mapped[str] = mapped_column(String, ForeignKey("books.id"))
    sentence_index: Mapped[int]
    original_text: Mapped[str]
    is_non_standard: Mapped[bool] = mapped_column(Boolean, default=False)
    translation: Mapped[str | None]
    meaning_explanation: Mapped[str | None]
    glossary: Mapped[dict | None] = mapped_column(JSONB, nullable=True)
    user_confirmed: Mapped[bool] = mapped_column(Boolean, default=False)
    user_translation: Mapped[str | None]

A FastAPI endpoint POST /books/{book_id}/translations/{translation_id}/confirm accepts optional edits and sets the confirmation flag. This lets users approve translations individually or, after reviewing, bulk‑confirm everything.

Lessons Learned & Trade‑offs

Prompt engineering is half the battle. For detection, we initially got many false negatives for subtle metaphors. Adding a reason field and a few negative examples cut the error rate significantly. For translation, asking for an explanation in the target language (not English) made it far more useful for non‑English‑speaking users.

Cheap model for detection, expensive model for translation. It’s a classic cost‑accuracy trade‑off. DeepSeek is over 10x cheaper than Claude, and for a binary classification task, it works well. However, we had to accept a ~9% false negative rate (recall 0.91), which we offset with a user‑friendly manual override.

Concurrency with semaphores is simple but has limits. For now, our VPS handles the load fine, but if we scale to many simultaneous book uploads, we’d need a proper task queue like Celery. We kept things simple because asyncio semaphores + FastAPI background tasks were enough for our current volume.

User confirmation adds friction but builds trust. Publishers loved being able to see why the AI made a certain choice. Casual users found it a bit tedious, so we added a “quick approve” mode that pre‑confirms everything but still lets you spot‑check.

Batch detection saves API calls but risks missing context. Grouping 5 sentences into one prompt reduced detection calls by 80%, but it occasionally missed idioms that spanned multiple sentences. We’re experimenting with merging adjacent flagged sentences before translation.

What’s Next?

We’re looking into using embedding similarity to a curated set of known idioms as a pre‑filter to make detection nearly free. Also, we plan to fine‑tune a small local model for detection, eliminating the API cost entirely.

You can try the full pipeline yourself – the code is not open source yet, but we’ve shared the prompt templates and a gist of the async worker on our GitHub (placeholder link).

If you’ve built something similar, how did you handle the detection phase? Would love to hear about other approaches in the comments.

Takeaway: A two‑tier LLM pipeline with a clear user confirmation step can dramatically improve translation quality for literary texts — just be prepared to invest in prompt tuning and asynchronous orchestration.

Streaming Instant Translation with Cultural Insights: The Engineering Behind LectuLibre’s 即时翻译与文字解读

龚旭东 — Sat, 11 Jul 2026 03:02:32 +0000

How we built a feature that translates e-book text in real time and explains cultural nuances, all while keeping latency under 2 seconds with Python and FastAPI.

The Problem: More Than Just Words

At LectuLibre, we translate entire books using AI. But during development, we discovered that readers often get stuck on specific passages—idioms, cultural references, puns—that static translation glosses over. They wanted a quick way, while reading, to get not just a translation of a tricky sentence but also an explanation of its deeper meaning. So we built 即时翻译与文字解读 (Instant Translation & Text Interpretation). The goal: select any text in a book, click a button, and within seconds see a translation plus insightful commentary.

From an engineering perspective, this meant:

Must be fast (<2 seconds for most passages) to keep the reading flow.
Must handle many concurrent users (LectuLibre has thousands of active readers).
Must control LLM costs, as each request uses API credits.
Must deliver results incrementally so users feel instant feedback.

Our Approach: Single LLM Call with Streaming and Caching

We decided to use a single LLM call (Anthropic’s Claude 3.5 Sonnet) to generate both the translation and the interpretation. This reduces latency compared to two sequential calls. To speed up perceived performance, we stream the LLM’s response back to the frontend using Server-Sent Events (SSE). The stream first delivers the translation chunk by chunk, then the interpretation. So the translation appears quickly, while the more elaborate interpretation loads shortly after.

We also implemented aggressive caching with Redis, because many sentences in a book (or across books) are repeated or similar. A cache hit completely bypasses the LLM, dropping latency to near-zero.

Implementation: Backend in Python FastAPI

Our backend is built with FastAPI. Let’s dive into the core endpoint.

1. API Design

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import hashlib
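from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address
import redis.asyncio as redis

app = FastAPI()
redis_client = redis.Redis(decode_responses=True)
limiter = Limiter(key_func=get_remote_address)
# slowapi looks up the limiter on app.state and needs a handler that turns
# RateLimitExceeded into a 429 response
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)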

class TranslateRequest(BaseModel):
    text: str
    source_lang: str
    target_lang: str
    context_prev: str | None = None
    context_next: str | None = None
    book_id: str | None = None

@app.post("/translate_interpret")
@limiter.limit("10/minute")
async def translate_interpret(request: Request, body: TranslateRequest):
    # Create a cache key from text + language pair
    raw_key = f"{body.text}|{body.source_lang}|{body.target_lang}"
    cache_key = hashlib.sha256(raw_key.encode()).hexdigest()

    cached = await redis_client.get(cache_key)
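    if cached:
        # Cache hit: wrap the stored raw LLM output in a single SSE message.
        # SSE data cannot contain bare newlines, so each line gets its own "data:" field.
        sse_body = "".join(f"data: {line}\n" for line in cached.split("\n")) + "\n"
        return StreamingResponse(
            iter([sse_body]),
            media_type="text/event-stream"
        )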

    # Otherwise, call the LLM and stream the processed response
    prompt = build_prompt(body)
    return StreamingResponse(
        stream_llm(prompt, cache_key, body),
        media_type="text/event-stream"
    )

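We use slowapi for rate limiting to prevent abuse. The cache key is a SHA-256 hash of the text and language pair, so keys stay short however long the passage is. The context sentences are not part of the key, so a passage selected again in a different context reuses the same cached result.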

2. Prompt Engineering for Structured Streaming

Our prompt instructs the LLM to output the translation first, wrapped in a special marker, then the interpretation. For example:

def build_prompt(req: TranslateRequest) -> str:
    context = ""
    if req.context_prev:
        context += f"Previous sentence: {req.context_prev}\n"
    if req.context_next:
        context += f"Next sentence: {req.context_next}\n"

    return f"""
You are a literary translation assistant. Translate the following text from {req.source_lang} to {req.target_lang}. Then, provide a brief interpretation that explains any cultural references, idioms, or nuances.

{context}
Text: {req.text}

Respond **exactly** in this format:
[TRANSLATION]
<your translation here>
[INTERPRETATION]
<your interpretation here>
"""

In the streaming generator, we parse the raw LLM stream for the markers [TRANSLATION] and [INTERPRETATION] and emit SSE events accordingly.

3. Streaming Generator with Marker Parsing

We use httpx to make an asynchronous streaming request to the Anthropic API (or any LLM with streaming). The generator looks for the markers and yields SSE events.

import httpx
import json

LLM_API_URL = "https://api.anthropic.com/v1/messages"
HEADERS = {
    "x-api-key": "your-api-key",
    "anthropic-version": "2023-06-01"
}

def sse(event: str, data: str) -> str:
    # SSE data cannot contain bare newlines, so emit one "data:" field per line
    fields = "".join(f"data: {line}\n" for line in data.split("\n"))
    return f"event: {event}\n{fields}\n"

async def stream_llm(prompt: str, cache_key: str, req: TranslateRequest):
    data = {
        "model": "claude-3-5-sonnet-20240620",
        "max_tokens": 1024,
        "stream": True,
        "messages": [{"role": "user", "content": prompt}]
    }

    async with httpx.AsyncClient() as client:
        async with client.stream("POST", LLM_API_URL, headers=HEADERS, json=data) as response:
            buffer = ""
            current_marker = None
            full_response = ""
            # aiter_lines() yields whole SSE lines; aiter_text() chunks can split or merge them
            async for line in response.aiter_lines():
                if not line.startswith("data: "):
                    continue  # skip "event:" lines and blank separators
                payload = line[6:]
                if payload == "[DONE]":  # end sentinel used by OpenAI-style APIs
                    break
                try:
                    obj = json.loads(payload)
                    text = obj.get("delta", {}).get("text", "")
                except json.JSONDecodeError:
                    continue
                full_response += text
                buffer += text

                # Check for markers in buffer
                if "[TRANSLATION]" in buffer and current_marker is None:
                    idx = buffer.find("[TRANSLATION]")
                    before = buffer[:idx]
                    if before.strip():
                        yield sse("unknown", before)
                    # The prompt's markers have no trailing colon; also drop the newline after them
                    buffer = buffer[idx + len("[TRANSLATION]"):].lstrip("\n")
                    current_marker = "translation"
                    yield sse("translation_start", "")
                    continue
                elif "[INTERPRETATION]" in buffer and current_marker == "translation":
                    idx = buffer.find("[INTERPRETATION]")
                    translation_end = buffer[:idx]
                    if translation_end.strip():
                        yield sse("translation", translation_end)
                    buffer = buffer[idx + len("[INTERPRETATION]"):].lstrip("\n")
                    current_marker = "interpretation"
                    yield sse("interpretation_start", "")
                    continue
                # Emit partial data based on current marker
                if current_marker in ("translation", "interpretation") and buffer:
                    yield sse(current_marker, buffer)
                    buffer = ""
            # Flush text that arrived in the same chunk as a marker and was never emitted
            if buffer and current_marker:
                yield sse(current_marker, buffer)
            # After streaming ends, store full response in cache (skip empty responses)
            if full_response:
                await redis_client.setex(cache_key, 86400, full_response)

This is a simplified version. A marker can arrive split across two chunks, which calls for a more robust parser; in production we fell back to buffering until a newline before checking for markers, which worked well in practice because the markers always sat at the start of a new line.
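As an alternative to newline buffering, one way to survive split markers is to hold back any buffer tail that could still turn out to be the start of a marker. The sketch below shows the idea; the class name `MarkerSplitter` and its `(event, data)` tuple output are assumptions for illustration, not our production parser.

```python
MARKERS = {
    "[TRANSLATION]": "translation",
    "[INTERPRETATION]": "interpretation",
}


def _held_back(buffer: str) -> int:
    """Length of the longest buffer suffix that is a proper prefix of a marker."""
    longest = 0
    for marker in MARKERS:
        for n in range(1, len(marker)):
            if buffer.endswith(marker[:n]):
                longest = max(longest, n)
    return longest


class MarkerSplitter:
    """Turns arbitrary text chunks into (event, data) pairs."""

    def __init__(self) -> None:
        self.buffer = ""
        self.section: str | None = None  # None until the first marker is seen

    def _emit(self, events: list[tuple[str, str]], data: str) -> None:
        # Text that appears before the first marker is dropped
        if data and self.section is not None:
            events.append((self.section, data))

    def feed(self, text: str) -> list[tuple[str, str]]:
        self.buffer += text
        events: list[tuple[str, str]] = []
        while True:
            # Find the earliest complete marker currently in the buffer
            hits = [(self.buffer.find(m), m) for m in MARKERS if m in self.buffer]
            if not hits:
                break
            idx, marker = min(hits)
            self._emit(events, self.buffer[:idx])
            self.section = MARKERS[marker]
            events.append((f"{self.section}_start", ""))
            self.buffer = self.buffer[idx + len(marker):]
        # Emit everything except a tail that might be half of a marker
        keep = _held_back(self.buffer)
        cut = len(self.buffer) - keep
        self._emit(events, self.buffer[:cut])
        self.buffer = self.buffer[cut:]
        return events

    def flush(self) -> list[tuple[str, str]]:
        """Call once the stream ends to release any held-back tail."""
        events: list[tuple[str, str]] = []
        self._emit(events, self.buffer)
        self.buffer = ""
        return events
```

The streaming generator would call `feed()` on every text delta, convert each returned pair into an SSE event, and call `flush()` after the loop. The cost is that at most one marker's length of text is delayed until the next chunk arrives.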

4. Caching Strategy

We cache the raw LLM output (the full string including markers) in Redis. In principle, a cache hit could replay the stream as if it were coming from the LLM. For simplicity, though, our endpoint returns the entire cached content as a single SSE event containing both translation and interpretation. A better approach would parse the cached string and emit the same chunked events as the live path, but we found that a single event still felt fast enough, since the full response is already available when the request arrives. We might improve this later.
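A sketch of that better approach: split the cached string on the two markers and emit the same event names the live stream uses. The function name `replay_cached` and the regex are assumptions for illustration; the event names are the ones the frontend already listens for.

```python
import re
from typing import Iterator

CACHED_RE = re.compile(r"\[TRANSLATION\](.*?)\[INTERPRETATION\](.*)", re.DOTALL)


def sse(event: str, data: str) -> str:
    """Format one SSE event; multi-line data gets one 'data:' field per line."""
    fields = "".join(f"data: {line}\n" for line in data.split("\n"))
    return f"event: {event}\n{fields}\n"


def replay_cached(cached: str) -> Iterator[str]:
    """Yield SSE events equivalent to the live stream for a cached response."""
    match = CACHED_RE.search(cached)
    if match is None:
        return  # malformed entry: yields nothing, caller can treat it as a miss
    translation, interpretation = (part.strip() for part in match.groups())
    yield sse("translation_start", "")
    yield sse("translation", translation)
    yield sse("interpretation_start", "")
    yield sse("interpretation", interpretation)
```

Because StreamingResponse accepts any iterator of strings, `replay_cached(cached)` could be passed to it directly in place of the single‑element list, and the frontend would need no special handling for cache hits.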

Performance and Lessons Learned

Latency: Uncached requests to Claude 3.5 Sonnet averaged 1.8 seconds until the first translation chunk, and 2.5 seconds for the complete response. Cache hits (roughly 40% of requests for a typical novel, due to repeated phrases) returned with a p50 latency of 80ms.
Cost: Each request costs ~$0.003 with Claude. Caching saved us around $500 per month for 10,000 daily active users.
Translation quality: Providing one sentence of context before and after dramatically reduced misinterpretation of ambiguous words. We also experimented with sending the entire paragraph if the selected text was a fragment, but that increased token usage.
Streaming parsing pain: The marker-based approach is fragile. If the LLM decides to insert an extra space or newline, the marker detection can fail. We added fallback by detecting common patterns with regex, but we’re planning to switch to tool use / function calling with structured output once the LLMs support streamable JSON. For now, it serves 99% of cases.

Frontend Integration

The frontend is a React-based reader. It listens to SSE events: translation_start, translation (chunks), interpretation_start, interpretation. It renders the translation in a side panel as it arrives, and the interpretation appears below it afterwards. We used the EventSource API with a polyfill for broad compatibility.

Open Question for the Community

How are you handling streaming structured output from LLMs without hacks? Has anyone found a reliable library or technique to parse a partial JSON object as it streams? We’d love to hear your solutions.

Takeaway: Building a real-time translation feature with cultural insights required balancing latency, cost, and accuracy. By streaming from a single LLM call and caching aggressively, we delivered a responsive experience that keeps readers in the flow. The marker-parsing hack got the job done, but we’re keen to evolve as the ecosystem matures.

How We Built Instant Translation Help (即时翻译帮助) with Python and LLMs

龚旭东 — Wed, 08 Jul 2026 03:02:28 +0000

Balancing speed and context with a hybrid glossary + LLM caching system

The Need for Instant Translation Help

At LectuLibre, we use LLMs to translate entire books into different languages. But even after a high-quality translation, readers sometimes stumble upon an unfamiliar word or want to see the original phrase for clarity. We envisioned a feature we called 即时翻译帮助 (instant translation help): clicking on any word in the translated text would immediately show a contextual explanation or alternative translation, right inside the reading interface.

The core requirement was speed. The popup had to feel instantaneous—under 500ms. A spinning wheel disrupts the reading flow. However, making a full LLM call for every click was out of the question: latency ranged from 2–5 seconds, and at scale it would be expensive.

Breaking Down the Problem

We needed:

Low-latency responses for common words
Context awareness (the same word can mean different things in different sentences)
Coverage: handle any word or short phrase the user might click
Cost efficiency: minimize LLM calls

Our first prototype simply sent the selected word and its surrounding sentence to Claude, but the 3‑second average wait was unacceptable. We had to get creative.

The Hybrid Approach: Glossary + LLM Fallback

We decided to preprocess each book after the main translation to build a bilingual glossary of key terms. At query time, we would first check this local glossary for a match. If found, we could return the result instantly. If not, we would fall back to a faster LLM (DeepSeek) and cache the response for future lookups.

In essence, the glossary acts as a permanent, read‑optimised cache, while the LLM covers the long tail and handles rare words.

Implementation Details

1. Preprocessing Pipeline

After a book is translated, a background task (we use Celery) runs a pipeline that extracts important phrases from the source text and aligns them with their translations. For a first version, we focused on noun phrases, as those are most likely to be clicked for clarification.

We used spaCy for POS tagging and noun chunking, and SentenceTransformers to help align source phrases with target phrases by cosine similarity.

import spacy
from sentence_transformers import SentenceTransformer, util

nlp_source = spacy.load("en_core_web_sm")
nlp_target = spacy.load("es_core_news_sm")  # example target
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

def extract_phrases(sent, nlp):
    doc = nlp(sent)
    return [chunk.text for chunk in doc.noun_chunks]

def align_phrases(src_sent, tgt_sent):
    src_phrases = extract_phrases(src_sent, nlp_source)
    tgt_phrases = extract_phrases(tgt_sent, nlp_target)
    if not src_phrases or not tgt_phrases:
        return []
    src_embs = model.encode(src_phrases)
    tgt_embs = model.encode(tgt_phrases)
    scores = util.cos_sim(src_embs, tgt_embs)
    pairs = []
    for i, src in enumerate(src_phrases):
        best_idx = int(scores[i].argmax())  # plain int for list indexing
        if scores[i][best_idx] > 0.7:
            pairs.append((src, tgt_phrases[best_idx]))
    return pairs

This approach misses many verbs and adjectives, but it gave us a solid starting point. We stored the glossary in PostgreSQL with the source sentence to provide context later.

CREATE TABLE glossary (
    id SERIAL PRIMARY KEY,
    book_id INT NOT NULL,
    source_phrase TEXT NOT NULL,
    target_phrase TEXT NOT NULL,
    context_sentence TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_glossary_book_phrase ON glossary (book_id, source_phrase);

The context_sentence column became crucial for disambiguation—more on that next.

2. Resolving Ambiguity with Embeddings

A simple glossary lookup on the surface form can be ambiguous: “bank” could be a river bank or a financial institution. To reduce mis-hits, we added a context‑matching step that compares the embedding of the user’s surrounding sentence with the gloss entry’s context sentence. Only if the similarity is high enough do we serve the glossary answer; otherwise, we fall through to the LLM.

def is_context_match(user_context, gloss_context):
    if not gloss_context or not user_context:
        return True  # not enough data, trust the match
    emb1 = model.encode(user_context)
    emb2 = model.encode(gloss_context)
    return util.cos_sim(emb1, emb2).item() > 0.6  # cos_sim returns a 1x1 tensor

This simple check improved precision by about 30% in our tests.

3. The FastAPI Endpoint

The core of the feature is a FastAPI endpoint that receives the book ID, selected text, and the surrounding sentence (grabbed by the frontend). We keep a local in‑memory cache (via cachetools.TTLCache) for LLM fallback results to avoid extra network trips.

from fastapi import FastAPI, Depends
from sqlalchemy.orm import Session
from cachetools import TTLCache
from pydantic import BaseModel

app = FastAPI()
llm_cache = TTLCache(maxsize=10000, ttl=3600)

class HelpRequest(BaseModel):
    book_id: int
    selected_text: str
    context_sentence: str = ""

@app.post("/translate-help")
async def translate_help(req: HelpRequest, db: Session = Depends(get_db)):
    normalized = req.selected_text.strip().lower()

    # 1. Check glossary with context
    gloss_entry = db.query(Glossary).filter(
        Glossary.book_id == req.book_id,
        Glossary.source_phrase.ilike(normalized)
    ).first()
    if gloss_entry and is_context_match(req.context_sentence, gloss_entry.context_sentence):
        return {"translation": gloss_entry.target_phrase, "source": "glossary"}

    # 2. Check LLM cache
    cache_key = f"{req.book_id}:{normalized}"
    if cache_key in llm_cache:
        return {"translation": llm_cache[cache_key], "source": "llm_cache"}

    # 3. Fallback to LLM (DeepSeek)
    prompt = (
        f"Translate the word '{req.selected_text}' in this context:\n"
        f"{req.context_sentence}\n"
        "Provide only the translation, no explanation."
    )
    translation = await call_deepseek(prompt, timeout=2.0)
    if translation:
        llm_cache[cache_key] = translation
        return {"translation": translation, "source": "llm"}
    else:
        return {"translation": "Translation not available", "source": "error"}

We chose DeepSeek for the fallback because its API response time for short prompts averaged 1.2 seconds, compared to 2-4 seconds for Claude. For the cache, we started with a simple in‑memory TTLCache to avoid Redis serialization overhead for small strings. When we later scaled to multiple workers, we added Redis as a shared second‑level cache, but the local one still handles ~80% of cache reads.

Performance and Results

We measured performance over a week on a sample of 100 books (mix of fiction and non‑fiction):

Glossary hit rate: 45% of queries (highest for technical books, lowest for poetry)
LLM cache hit rate: 20%
LLM fallback rate: 35%
p95 overall latency: 120 ms (with hits from glossary/cache returning in <10 ms)
LLM fallback average latency: 1.3 s

User feedback was positive—the occasional 1.5‑second wait for an obscure word was acceptable. Most readers never noticed any delay.

Lessons Learned & Trade‑offs

Glossary coverage: Our noun‑phrase‑only extraction left gaps (verbs, adjectives, idioms). A better word‑aligner (like awesome-align) could improve recall. We’re also experimenting with using the LLM itself during translation to output explicit translation pairs.
Cold start: The first readers of a newly translated book see more fallback calls. A background job could pre‑seed the cache with the top ~1000 words.
LLM reliability: Sometimes the LLM returns a full sentence instead of a single word. We added output validation: if the result is longer than 5 words, we reject it and show a generic message.
Cache invalidation: When a user reports a bad translation and we improve our prompt, we must invalidate the cached answers. To do that, we version our cache keys with a translation version number.
Cost: Even though 35% of queries hit the LLM, the per‑call cost is tiny (DeepSeek is very cheap). We’ve spent less than $5/month on this feature.

What’s Next?

We’re now testing a small, fine‑tuned T5 model deployed directly on our VPS to replace the DeepSeek fallback. Early results show ~200ms latency for most words, which would make the feature feel truly instant for every click.

Building 即时翻译帮助 reminded us that hybrid systems—combining fast heuristics with modern AI—often deliver the best user experience. Start simple, measure your cache hit rates, and iterate on the glossary quality. And never underestimate the power of a good cache!

We’d love to hear how others are tackling real‑time AI features. What architectures have worked for you?

Building Instant Translation Assistance for Book Translations with Python and LLMs

龚旭东 — Sat, 04 Jul 2026 03:01:48 +0000

How we integrated real-time phrase translation feedback into our AI-powered book translation workflow, and what we learned about latency, context, and prompt engineering.

When we launched LectuLibre, our AI-powered book translation platform, users loved the quality of full-chapter translations. But they kept asking for something else: while reading a partially translated book, they'd stumble on an untranslated phrase or an awkward auto-translation and want to quickly get a better version without leaving the page. So we built 即时翻译求助 (Instant Translation Help)—a feature that lets readers highlight any phrase and get a context-aware, human-quality translation within seconds, along with a brief explanation of tricky parts.

Here's how we built it, the technical challenges we faced, and the lessons we learned about stitching LLMs into a real-time reading experience.

Problem: Real-time, Context-Aware Translation Inside a Book

Most web apps offer generic translation via API calls—send a sentence to Google Translate, get a result. But that doesn't work for literary texts. A phrase like "She let the cat out of the bag" needs to be translated idiomatically, and the appropriate rendering depends heavily on the surrounding paragraphs (is the tone formal? sarcastic? part of a metaphor chain?). Our existing translation pipeline processes entire chapters in bulk with carefully crafted prompts, but for instant help, we needed sub-second latency while preserving that same depth of context.

Our Approach: Server‑Sent Events and a Smart Prompt Buffer

We chose Server-Sent Events (SSE) over WebSockets because the communication is one-directional (server pushes translation tokens) and SSE is simpler to implement with FastAPI. The client (a React app) sends a POST request with:

The phrase to translate
The book ID and the exact location (chapter/paragraph index)
The target language

Our backend retrieves the surrounding text from PostgreSQL (we store the original book in chunks), feeds a carefully assembled prompt to the LLM (Claude 3 Haiku for speed), and streams the response back token-by-token.

Implementation Deep Dive

1. Background Context Retrieval

We index each paragraph with its position. Given a highlighted phrase, we grab the paragraph containing it, plus one paragraph before and after. This usually provides enough narrative context without blowing up the prompt size.

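from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession

# BookParagraph is the ORM model for the stored book chunks (defined elsewhere)
async def get_context(book_id: str, para_index: int, db: AsyncSession):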
    # Fetch surrounding paragraphs
    stmt = (
        select(BookParagraph)
        .where(
            BookParagraph.book_id == book_id,
            BookParagraph.index.between(para_index - 1, para_index + 1)
        )
        .order_by(BookParagraph.index)
    )
    result = await db.execute(stmt)
    paragraphs = result.scalars().all()
    return "\n".join(p.text for p in paragraphs)

2. Prompt Engineering for Instant Help

We needed a prompt that instructs the LLM to:

Translate the given phrase in the exact tone and style of the surrounding text
If the phrase contains an idiom or cultural reference, provide a natural equivalent in the target language, with a short explanation
Return the result as a clean Markdown snippet (translation + explanation)
Keep it concise (we display in a small popover)

Here's the core prompt template:

INSTANT_HELP_PROMPT = """
You are a literary translator. Below is the source text surrounding a highlighted phrase, the phrase itself, and the target language.
Translate the highlighted phrase into {target_lang} in a way that fits the style of the surrounding text.
If the phrase contains an idiom, metaphor, or cultural reference, provide a natural equivalent and a one-sentence explanation in parentheses.
Output format:
**Translation:** [your translation]
**Note:** [explanation if needed]

Surrounding text:
{context}

Highlighted phrase:
"{phrase}"

Translation:
"""

We found that Claude 3 Haiku respects this format almost always, and the "Note" part is omitted when not needed.
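
To render the reply in the popover, the two labelled fields have to be pulled apart. Here is a minimal sketch, assuming the model follows the output format above. parse_instant_help is a hypothetical helper name, and the fallback for replies that skip the labels is our own assumption.

```python
import re

# Matches the two labelled fields of the output format; the Note is optional.
_FIELD_RE = re.compile(
    r"\*\*Translation:\*\*\s*(?P<translation>.*?)\s*"
    r"(?:\*\*Note:\*\*\s*(?P<note>.*?))?\s*$",
    re.DOTALL,
)


def parse_instant_help(reply: str) -> dict:
    """Split a model reply into a translation and an optional note."""
    match = _FIELD_RE.search(reply)
    if match is None:
        # Assumption: a reply without labels is treated as a bare translation.
        return {"translation": reply.strip(), "note": None}
    note = match.group("note")
    return {
        "translation": match.group("translation").strip(),
        "note": note.strip() if note else None,
    }
```

Returning None for a missing note lets the UI hide the explanation row instead of showing an empty one.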

3. Streaming the Response with FastAPI and SSE

We built an async endpoint that yields SSE chunks. The client can start rendering the translation as tokens arrive, which feels instant.

from fastapi import APIRouter, Request
from fastapi.responses import StreamingResponse
import json

import anthropic

router = APIRouter()

# async_session is our SQLAlchemy async session factory, defined elsewhere.

@router.post("/api/instant-help")
async def instant_help(request: Request):
    body = await request.json()
    phrase = body["phrase"]
    book_id = body["bookId"]
    para_index = body["paraIndex"]
    target_lang = body["targetLang"]

    async def event_generator():
        # Release the DB session before the (slower) LLM call starts
        async with async_session() as db:
            context = await get_context(book_id, para_index, db)
        prompt = INSTANT_HELP_PROMPT.format(
            target_lang=target_lang,
            context=context,
            phrase=phrase
        )
        # Stream from Claude using the official Anthropic Python SDK
        async with anthropic.AsyncAnthropic() as client:
            stream = await client.messages.create(
                model="claude-3-haiku-20240307",
                max_tokens=300,
                temperature=0.3,
                messages=[{"role": "user", "content": prompt}],
                stream=True
            )
            async for event in stream:
                if event.type == "content_block_delta":
                    # One SSE frame per text delta
                    payload = json.dumps({"text": event.delta.text})
                    yield f"data: {payload}\n\n"
                elif event.type == "message_stop":
                    yield "data: [DONE]\n\n"

    return StreamingResponse(event_generator(), media_type="text/event-stream")

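On the frontend, the browser's native EventSource API can only issue GET requests, so a POST endpoint like this one has to be consumed with fetch and a streamed response body (or an SSE client library that supports POST). The whole round trip from click to first token takes about 400–600 ms for typical phrases.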
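
The wire format is simple enough to parse by hand, which is handy for integration tests against the endpoint. The sketch below assumes the framing used above: one JSON object per data: line, and a [DONE] sentinel at the end. iter_sse_text is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from `data:` lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separators, comments, other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)["text"]
```

Joining the yielded fragments reproduces the full translation, so a test can compare the streamed result against a non-streamed call.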

Trade-offs and Hard Decisions

Latency vs. Quality

Haiku is fast but not always perfect. We tried DeepSeek-V2 (slower but better with idioms) but its latency crossed 2 seconds, killing the "instant" feel. We settled on Haiku for now, with a secondary more detailed translation available on demand (which uses Claude 3 Opus in the background).

Cost Management

Each instant help call costs about $0.002 (input + output tokens). With thousands of users, that adds up. We implemented a local cache keyed on (book_id, para_index, phrase, target_lang) using Redis. Repeated requests for the same phrase (e.g., multiple users reading the same book) are served from cache instantly, reducing LLM calls by ~30% in our beta.
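
A minimal sketch of that cache lookup, assuming a dict-like store stands in for Redis. cache_key and get_or_translate are hypothetical names, and hashing the key is our own choice to keep long phrases out of key names.

```python
import hashlib
import json
from typing import Callable, MutableMapping


def cache_key(book_id: str, para_index: int, phrase: str, target_lang: str) -> str:
    """Build a stable key from the four fields that identify a request."""
    raw = json.dumps([book_id, para_index, phrase.strip(), target_lang], ensure_ascii=False)
    return "instant-help:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()


def get_or_translate(cache: MutableMapping[str, str], key: str,
                     translate: Callable[[], str]) -> str:
    """Return the cached translation, calling `translate` only on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = translate()
    cache[key] = result  # with Redis this would be a SET, ideally with an expiry
    return result
```

Stripping the phrase before hashing means selections that differ only in surrounding whitespace share one cache entry.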

Prompt Buffer Size

Experimentally, adding the two neighbouring paragraphs as context significantly improved quality without adding too many tokens. But including an entire chapter led to slower responses and occasional off-topic interpretations. We keep the context at ~500 tokens on average.

Results and What We Learned

User happiness: Readers now translate 3x more phrases than when they had to copy-paste to another tool. The inline Explanation often teaches them new idioms, which they love.
Engineering takeaway: Server-Sent Events are underrated for LLM streaming. They work perfectly over HTTP/2 and are trivial to debug compared to WebSockets.
Prompt sensitivity: The exact wording Output format: **Translation:** ... **Note:** ... reduced malformed responses by 90%. Small tweaks matter.
Caching is critical: With Redis, we kept extra LLM costs in check and improved perceived performance for popular books.

Where We Might Go Next

We're exploring a context window expansion that uses the entire chapter, but with aggressive summarization of preceding paragraphs via a cheap model call. Also, fine-tuning a small open-source model on our translation style could bring costs close to zero. If you've built similar inline AI features, how did you handle the cost/latency/quality triangle? We'd love to hear your approach in the comments.

Building LectuLibre has taught us that AI-powered tools shine when they fit seamlessly into the user's workflow. Instant translation help is that seam—a small feature that feels like magic because it respects the reader's flow.

How We Translate 300-Page Books Using Claude Without Hitting Token Limits

龚旭东 — Wed, 01 Jul 2026 03:01:18 +0000

Breaking long documents into overlapping chunks, preserving context, and reassembling with FastAPI

At LectuLibre, we’ve built an AI‑powered platform that translates entire books—EPUBs and PDFs—using large language models. When we first hooked up Claude’s API, we naively fed it a 300‑page PDF in one request. It failed immediately. Claude 3 Opus has a 200K token window, but a 300‑page book can easily run to 300K tokens or more. Even if we squeezed it in, the output would be truncated and the quality would degrade at the extremes of the context window.

So we faced a classic long‑document problem: how do you translate a book that’s larger than the model’s context window? Here’s the real approach we ended up with, the code we wrote, and the lessons we learned.

The Problem: Token Limits Are Real

Claude 3 Opus and Haiku models (and most LLMs) have a maximum context length—200,000 tokens for Opus. A token is roughly ¾ of a word. A 300‑page novel with ~75,000 words translates to about 100K tokens, so it should fit, right? But translations from English to Spanish can expand by 15–20%, and the prompt instructions, system message, and the user message itself all eat into that budget. Plus, we needed to send the entire source text in every call to give the model full context. That’s not feasible.
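
The arithmetic is easy to check. The sketch below uses the ¾-word rule of thumb and the 20% expansion figure from the paragraph above. The 1,000-token prompt overhead is a placeholder assumption, and the function names are hypothetical.

```python
WORDS_PER_TOKEN = 0.75      # rule of thumb: one token is about 3/4 of a word
CONTEXT_WINDOW = 200_000    # Claude 3 Opus context window, in tokens


def estimate_tokens(word_count: int) -> int:
    """Rough token count for English prose."""
    return round(word_count / WORDS_PER_TOKEN)


def single_call_budget(word_count: int, expansion: float = 0.20,
                       overhead_tokens: int = 1_000) -> dict:
    """Estimate whether source text plus its translation fit in one call."""
    source = estimate_tokens(word_count)
    output = round(source * (1 + expansion))  # translation grows by `expansion`
    total = source + output + overhead_tokens
    return {"source": source, "output": output, "total": total,
            "fits": total <= CONTEXT_WINDOW}
```

For a 75,000-word novel this gives roughly 100K source tokens and 120K output tokens, which together already exceed the window because the prompt and the completion share it.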

We could have tried a simple split: cut the book at arbitrary page boundaries and translate piecemeal. That fails spectacularly. Narrative breaks mid‑sentence, and phrases like “the previous chapter” lose their referents. We needed a more intelligent chunking strategy.

Our Approach: Sliding Window with Overlapping Paragraphs

We settled on a sliding window chunking algorithm based on paragraphs, with a generous overlap. Here’s the idea:

Split the source text into paragraphs (using \n\n).
Build chunks of max_chunk_tokens (we used 180,000 to keep a safety margin), adding paragraphs one by one and counting tokens with tiktoken.
When the chunk exceeds the limit, we start a new chunk but we include the last few paragraphs of the previous chunk as context. This overlap (we used 5 paragraphs) gives the model continuity across chunk boundaries.
We translate each chunk independently, then stitch them back together, removing the overlap.

This isn’t perfect—some chapters may still be split—but it preserves far more context than any fixed‑size split.

Implementation in Python with FastAPI

We built our translation pipeline inside a FastAPI background task. Here’s the core chunking function:

import tiktoken
from typing import List
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_by_paragraphs(text: str, max_tokens: int = 180000, overlap_paragraphs: int = 5) -> List[str]:
    """
    Split text into chunks of at most `max_tokens` tokens,
    using paragraphs as atomic units and overlapping the last
    `overlap_paragraphs` from the previous chunk.
    """
    enc = tiktoken.get_encoding("cl100k_base")  # OpenAI encoding, used only to approximate Claude's token counts
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = []
    current_token_count = 0

    for para in paragraphs:
        para_tokens = len(enc.encode(para))
        # If a single paragraph exceeds the limit (rare), split it further
        if para_tokens > max_tokens:
            # Fallback: recursive character splitting (tries paragraph, line, then word boundaries)
            para_texts = RecursiveCharacterTextSplitter(
                chunk_size=max_tokens, chunk_overlap=100,
                length_function=lambda x: len(enc.encode(x))
            ).split_text(para)
            for p in para_texts:
                p_tokens = len(enc.encode(p))
                if current_token_count + p_tokens > max_tokens and current_chunk:
                    chunks.append('\n\n'.join(current_chunk))
                    overlap = current_chunk[-overlap_paragraphs:] if len(current_chunk) >= overlap_paragraphs else current_chunk
                    current_chunk = overlap.copy()
                    current_token_count = sum(len(enc.encode(p)) for p in overlap)
                current_chunk.append(p)
                current_token_count += p_tokens
        else:
            if current_token_count + para_tokens > max_tokens and current_chunk:
                chunks.append('\n\n'.join(current_chunk))
                # Keep overlapping paragraphs
                overlap = current_chunk[-overlap_paragraphs:] if len(current_chunk) >= overlap_paragraphs else current_chunk
                current_chunk = overlap.copy()
                current_token_count = sum(len(enc.encode(p)) for p in overlap)
            current_chunk.append(para)
            current_token_count += para_tokens

    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))
    return chunks

Then we translate each chunk using Anthropic’s Python SDK, with back‑pressure and retry logic to handle rate limits:

import asyncio

from anthropic import Anthropic, RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

async def translate_chunk(client: Anthropic, chunk: str, target_lang: str) -> str:
    system_prompt = f"You are a professional translator. Translate the following text from English to {target_lang}. Preserve all formatting, line breaks, and special characters. Do not add commentary."

    # Retry only on rate limits, with exponential backoff between attempts
    @retry(
        retry=retry_if_exception_type(RateLimitError),
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=60),
    )
    async def _call():
        # The client is synchronous, so run the request in a worker thread
        response = await asyncio.to_thread(
            client.messages.create,
            model="claude-3-opus-20240229",
            max_tokens=4096,  # caps the output: a translation longer than this is truncated
            system=system_prompt,
            messages=[{"role": "user", "content": chunk}]
        )
        return response.content[0].text

    return await _call()

We use asyncio.to_thread because we call the synchronous Anthropic client here (the SDK also ships an AsyncAnthropic client); in a FastAPI app we don’t want to block the event loop. The tenacity library gives us exponential backoff for rate limits. After translating all chunks in parallel with asyncio.gather, we merge them:

def merge_chunks(translated_chunks: List[str], overlap_paragraphs: int = 5) -> str:
    """
    Concatenate translated chunks, removing the overlapping paragraphs
    except from the first chunk.
    """
    if not translated_chunks:
        return ""
    result = translated_chunks[0]
    for i in range(1, len(translated_chunks)):
        # Each subsequent chunk starts with 5 overlap paragraphs; skip them
        chunk_paragraphs = translated_chunks[i].split('\n\n')
        # We assume the translation preserved paragraph boundaries
        main_text = chunk_paragraphs[overlap_paragraphs:] if len(chunk_paragraphs) > overlap_paragraphs else chunk_paragraphs
        result += '\n\n' + '\n\n'.join(main_text)
    return result
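
merge_chunks relies on the translation keeping paragraph boundaries intact. A cheap guard, sketched here with a hypothetical helper, is to compare paragraph counts before merging, then retry or flag any chunk where they differ.

```python
def paragraphs_preserved(source_chunk: str, translated_chunk: str) -> bool:
    """True when both texts have the same number of non-empty paragraphs."""
    def count(text: str) -> int:
        return len([p for p in text.split("\n\n") if p.strip()])

    return count(source_chunk) == count(translated_chunk)
```

A mismatch does not say where the boundary was lost, only that skipping a fixed number of overlap paragraphs is no longer safe for that chunk.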

Parallel Translation and Performance

We run chunk translations concurrently, capped at 4 simultaneous calls to avoid hitting Anthropic’s rate limits. For a 300‑page book, we typically get 5–8 chunks of ~180K tokens each. With Claude 3 Opus, each chunk takes about 15–30 seconds to translate. Overall, a full‑book translation completes in 2–5 minutes.
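
The concurrency cap can be sketched with an asyncio.Semaphore wrapped around each call. translate_all is a hypothetical name, and translate stands for any coroutine function that takes a chunk and returns its translation.

```python
import asyncio
from typing import Awaitable, Callable, List


async def translate_all(chunks: List[str],
                        translate: Callable[[str], Awaitable[str]],
                        limit: int = 4) -> List[str]:
    """Translate every chunk, with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(chunk: str) -> str:
        async with semaphore:  # waits here while `limit` calls are active
            return await translate(chunk)

    # gather returns results in the order of its arguments
    return list(await asyncio.gather(*(guarded(c) for c in chunks)))
```

Because gather preserves argument order, the translated chunks can be merged without tracking indices separately.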

Cost: Claude 3 Opus is expensive. At $15 per million input tokens, a 300‑page book (~100K input tokens per chunk, ~8 chunks) costs around $12–15. We mitigated this by offering Claude 3 Haiku (cheaper, faster, but lower quality) and DeepSeek as alternatives. Users can choose.
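
The input side of that estimate is simple arithmetic over the per-chunk token counts. This sketch uses the $15 per million input tokens quoted above and deliberately leaves out output tokens, which are billed separately. estimate_input_cost is a hypothetical name.

```python
from typing import Iterable


def estimate_input_cost(chunk_token_counts: Iterable[int],
                        usd_per_million_input: float = 15.0) -> float:
    """Input-token cost in USD for a list of per-chunk token counts."""
    total_tokens = sum(chunk_token_counts)
    return total_tokens * usd_per_million_input / 1_000_000
```

Eight chunks of 100K input tokens come to 800K tokens, or 12 USD of input at that rate.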

Quality trade‑offs: The overlap strategy works well for most texts, but sometimes a chapter ends exactly at a chunk boundary and the narrative flow feels a bit disjointed. We experimented with dynamic overlap based on chapter markers (e.g., force a split only at chapter headings), but that added complexity and didn’t always align with token limits. We’re sticking with paragraph‑level overlap for now.

Lessons Learned

Token counting is tricky. tiktoken’s cl100k_base is close to Claude’s tokenizer but not identical. We saw a 5% discrepancy in token counts, so we kept a safety margin of 20K tokens below the limit.
Overlap size matters. Too little overlap and you lose context; too much wastes tokens and money. Five paragraphs proved a sweet spot for most books.
Rate limits forced us to build robust retries. Anthropic’s API will 429 you aggressively if you fire too many concurrent requests. tenacity and a concurrency semaphore saved us.
The merge step must handle formatting. Splitting and rejoining on \n\n works for prose, but tables, lists, and code blocks get mangled. We’re now exploring a markdown‑aware splitter.
Cost transparency is crucial. Users understand that translating a 300‑page book isn’t free. We show an upfront cost estimate based on token counts.

Where We Are Now

LectuLibre’s translation pipeline currently handles EPUBs and PDFs up to ~1000 pages. We’ve translated novels, technical manuals, and even a PhD thesis. The chunking approach has held up surprisingly well, but there’s room for improvement: dynamic overlap detection, better table handling, and perhaps a two‑stage translation where we first summarize each chunk’s context.

If you’re building a similar system, don’t underestimate the merge logic. The chunking is easy; making the final output read like a single, coherent book is the real challenge.

What’s your experience with long‑form AI translation? Have you found a better chunking heuristic? We’d love to hear your thoughts in the comments.

Parsing and Rebuilding EPUB Files in Python: Lessons Learned

龚旭东 — Sat, 27 Jun 2026 03:00:46 +0000

How we handle complex EPUB structures for AI translation without breaking navigation and metadata

At LectuLibre, we built an AI‑powered book translation service. Users upload an EPUB, and our pipeline translates the text using LLMs like Claude and DeepSeek. That sounds straightforward until you have to parse and rebuild a valid EPUB without mangling the table of contents, internal links, or styles.

I’m sharing the real‑world challenge we faced, how we chose our tooling, and the ugly corners we discovered when dealing with real‑world EPUB files.

The Problem: EPUB is a Messy Zip File

An EPUB is essentially a ZIP archive containing XHTML, CSS, images, and an OPF manifest. It’s a well‑defined standard (EPUB 3.2), but in practice publishers produce files that bend the rules: missing container.xml, inline styles that break after translation, and structural quirks that make parsing fragile.

Our translation process needed to:

Accept any EPUB the user throws at us.
Extract all text content while preserving the exact structure.
Send each paragraph to an LLM for translation.
Re‑insert the translated text into the original XHTML files.
Repackage everything into a new, valid EPUB.

Step 4 is the tricky part: the translated text can be longer or shorter, it may contain characters that need escaping, and the surrounding markup must remain intact.

Our Approach: Use `ebooklib` with a Dose of Defensive Coding

We evaluated several Python libraries:

epub (pypub) – too simple, no editing support.
lxml + manual zip – too much boilerplate.
ebooklib – full read/write with a clean API.

We went with ebooklib. It provides an object‑oriented model of the EPUB structure, allows us to iterate over documents, and can write a new EPUB from the modified objects. The downside: its documentation is sparse and it can choke on malformed files. We had to layer on a lot of validation.

Step 1: Loading and Validating the EPUB

import ebooklib
from ebooklib import epub

def load_epub(epub_path: str) -> epub.EpubBook:
    book = epub.read_epub(epub_path, {"ignore_ncx": True})
    # Force title to be a string (some books have list titles)
    if isinstance(book.title, list):
        book.title = " ".join(book.title)
    return book

But we quickly learned that read_epub can fail silently if the book’s metadata is corrupted. We added a custom validation step that checks for a valid OPF and at least one spine item.

def validate_epub(book: epub.EpubBook):
    if not book.opf:
        raise ValueError("Missing OPF metadata")
    if len(list(book.spine)) == 0:
        raise ValueError("No spine items found – EPUB is unreadable")

Step 2: Extracting Readable Text from XHTML Documents

An EPUB’s content is stored in epub.EpubHtml objects. We iterate over all items in reading order (spine) and parse the body content with BeautifulSoup (using Python’s built-in html.parser), because ebooklib’s own get_body_content() returns raw bytes and we need to extract text paragraph‑by‑paragraph while keeping the HTML structure.

from bs4 import BeautifulSoup
import html

def extract_paragraphs(item: epub.EpubHtml) -> list[dict]:
    soup = BeautifulSoup(item.get_body_content(), "html.parser")
    paragraphs = []
    for tag in soup.find_all(["p", "h1", "h2", "h3", "h4", "h5", "h6", "li"]):
        clean_text = tag.get_text(strip=True)
        if clean_text:
            paragraphs.append({
                "tag": tag,
                "original": clean_text,
                "translated": None
            })
    return paragraphs

We keep a reference to the original BeautifulSoup tag object so we can later replace its text. This is memory‑heavy for large books but works for books under 10 MB (our VPS limit).

Step 3: Translating with an LLM (and Controlling Length)

For each paragraph we call our translation API (Claude or DeepSeek). The tricky part is that some paragraphs are very short (headers) or contain entity references. We escape HTML entities before sending, and decode them afterward.

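import html

import requests

# API_KEY is loaded from our service configuration.

def translate_text(text: str, source_lang: str, target_lang: str) -> str:
    # Escape &, < and > so the payload cannot be mistaken for markup
    escaped = html.escape(text, quote=False)
    response = requests.post(
        "https://api.lectulibre.com/v1/translate",  # simplified
        json={"text": escaped, "source": source_lang, "target": target_lang},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=60,
    )
    # Fail loudly on HTTP errors instead of parsing an error body
    response.raise_for_status()
    translated = response.json()["translated"]
    return html.unescape(translated)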

We found that LLMs can sometimes add extra spaces or punctuation. We apply a light post‑processing: trim, normalize spaces, and ensure the translated text doesn’t break the containing tag’s structure.
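
The whitespace part of that post-processing can be sketched in a few lines. normalize_translation is a hypothetical name. Collapsing only ASCII whitespace is our own assumption, made so that non-breaking spaces survive, since French typography places them before some punctuation marks.

```python
import re

# Only ASCII whitespace: \s would also swallow non-breaking spaces (U+00A0).
_ASCII_WS_RE = re.compile(r"[ \t\r\n]+")


def normalize_translation(text: str) -> str:
    """Collapse runs of spaces, tabs and newlines, then trim both ends."""
    return _ASCII_WS_RE.sub(" ", text).strip(" \t\r\n")
```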

Step 4: Rebuilding the XHTML with Translated Text

Back in the extract_paragraphs output, we replace each tag’s text with the translation. BeautifulSoup only sets tag.string when the tag has a single child (a text node, or one nested tag that itself wraps a single text node); it is None as soon as the tag mixes text and child elements, so we must be careful. In the simple case we replace that one string. If the tag has mixed content, we replace the first text node only, which is a simplification that works for most books.

def replace_text(tag, new_text: str):
    if tag.string is not None:
        # Simple case: exactly one text node, possibly wrapped in a single inline tag
        tag.string.replace_with(new_text)
    else:
        # Mixed content: replace the first non-empty direct text node and stop
        for child in tag.children:
            if isinstance(child, str) and child.strip():
                child.replace_with(new_text)
                break

After all replacements, we set the item’s content back to the modified HTML.

def update_item(item: epub.EpubHtml, paragraphs: list[dict]):
    if not paragraphs:
        return
    for p in paragraphs:
        if p["translated"]:
            replace_text(p["tag"], p["translated"])
    # Every tag belongs to the same parsed tree; walk up to its root so we
    # serialize the whole body, not just the last paragraph.
    root = paragraphs[0]["tag"]
    while root.parent is not None:
        root = root.parent
    item.set_content(str(root).encode("utf-8"))

Two things to watch here: we encode the serialized HTML as UTF‑8 bytes before handing it to set_content, and if the original file had an XML declaration or namespaces, we might lose them. We handle that by preserving the item.media_type and other metadata.

Step 5: Writing the Translated EPUB

Once all items are updated, we write the book to a new file. We also add a modified‑date and update the language metadata.

import uuid

def save_book(book: epub.EpubBook, output_path: str, target_lang: str = "fr"):
    # Give the translated edition its own identifier
    book.set_identifier("urn:uuid:" + str(uuid.uuid4()))
    book.add_metadata("DC", "language", target_lang)  # target language
    epub.write_epub(output_path, book, {})

We learned the hard way that epub.write_epub may fail if items reference resources (images, fonts) that aren’t properly registered in the manifest. We iterate all items from the original book and add them to the manifest early to avoid missing dependency errors.

Real‑World Pitfalls and How We Solved Them

Broken Table of Contents: After translation, the NCX/NAV files pointed to old file names or anchors that no longer existed because we had renamed items. We now never rename items; we only modify their content in-place. If we must add new items (e.g., for footnotes), we update the TOC manually using ebooklib.epub.Link objects.
Inline CSS Overwrites: Some books use inline styles like font-size: 12pt. When a translated paragraph becomes longer, it can overflow fixed‑height containers. We don’t modify CSS, but we added a warning for books with rigid styling and offer a “clean” version without fixed heights.
Performance: For a 500‑page novel, the entire pipeline (parse, translate, rebuild) takes about 90 seconds on our VPS (4 vCPU, 8 GB RAM). The LLM calls dominate; we batch paragraphs of up to 5 together to reduce API overhead, trading off a slight translation quality dip.
Memory: Loading the entire EPUB and keeping BeautifulSoup trees in memory can spike to 300 MB for large books. We process one book at a time and use a queue to avoid concurrency issues.
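
The batching mentioned under Performance amounts to slicing the paragraph list into groups of at most five. A minimal sketch follows; batched is a hypothetical helper name.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], size: int = 5) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

The last batch is simply shorter when the paragraph count is not a multiple of the batch size.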

Lessons Learned

ebooklib is great but fragile – always validate the EPUB structure yourself; don’t assume all fields are present.
Preserve the original item order and names – renaming breaks internal links.
Escape/unescape HTML entities when moving text between HTML and plain text.
Translation quality depends on context – we’re experimenting with sending the entire chapter instead of individual paragraphs, but that increases latency.

What’s Next?

We’re exploring pandoc for pre‑conversion to a simpler intermediate format that’s easier to manipulate. However, the rebuild step becomes more complex. For now, ebooklib + BeautifulSoup serves our needs.

If you’re building an EPUB processing tool in Python, I hope these real‑world insights save you some of the debugging hours we spent. Got a better approach? I’d love to hear it in the comments!

Happy coding!

How We Built a Robust EPUB Parsing and Rebuilding Pipeline in Python

龚旭东 — Wed, 24 Jun 2026 03:02:01 +0000

Dealing with broken markup, embedded fonts, and namespace chaos while building LectuLibre's translation engine

At LectuLibre, we needed to translate entire EPUB books while preserving their exact visual structure. The core challenge: parse the EPUB, extract all translatable text, send it to an LLM, then reassemble the book with the translated content—images, CSS, fonts, and layout untouched. This turned out to be much harder than it looked. Here’s how we solved it, what broke, and what we learned.

The Problem: EPUBs Are Zip Files of Chaos

An EPUB is a ZIP archive containing XHTML, CSS, images, and a few XML control files (like container.xml and the OPF manifest). In theory, it’s a clean format. In practice, real‑world EPUBs are a mess:

XHTML with invalid markup, unclosed tags, or missing namespace declarations.
Embedded fonts, SVG chapters, and MathML that must be passed through untouched.
Text split across multiple inline elements (<b>Hello</b> World), requiring sentence‑aware translation.
The EPUB 3 spec is huge, and many books were generated by tools that barely follow it.

We needed a pipeline that could handle 90%+ of books without manual intervention, run fast enough for an interactive web service, and survive the most broken inputs we’d inevitably receive.

First Attempt: The High‑Level Library Approach

We reached for ebooklib—a dedicated Python library for reading and writing EPUB files. It gives you a nice object model: an EpubBook with items (documents, images, stylesheets), a spine, table of contents, and metadata. The code to open a book and grab all XHTML files looks deceptively simple:

import ebooklib
from ebooklib import epub

book = epub.read_epub('the-old-man-and-the-sea.epub')
for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    content = item.get_content().decode('utf-8')
    translated_content = content  # placeholder: run the translation step on content here
    item.set_content(translated_content.encode('utf-8'))

epub.write_epub('translated.epub', book)

This works for many clean EPUBs. But when we stress‑tested it with 100 public‑domain books, we quickly hit walls:

Performance: ebooklib parses every document into a full in-memory tree (it is built on lxml) and keeps the whole book loaded; reading a 20 MB book with many XHTML files took over 6 seconds, and writing it back took even longer. Memory usage would spike to 1 GB+ because everything was held in memory at once.
Namespace handling: Some EPUB3 books use explicit XHTML namespaces everywhere (<html xmlns="http://www.w3.org/1999/xhtml">). ebooklib’s XML serialization would sometimes drop these namespaces, producing output that failed validation.
No fine‑grained text traversal: To replace only the displayed text while preserving markup, we had to parse the XHTML ourselves anyway.

Clearly, we needed something lower‑level for the actual content manipulation.

The Hybrid Solution: `ebooklib` for Metadata, `lxml` for XHTML

We settled on a hybrid architecture:

Use ebooklib to read and write the EPUB structure: the manifest, spine, TOC, and binary files (images, fonts). This saved us from having to reimplement the ZIP juggling and OPF generation.
For every XHTML file, we parse the content with lxml.etree (which is fast, namespaces‑aware, and can recover from broken markup). We walk the tree, extract translatable text segments, translate them, and then inject the translations back into the tree.

Here’s the core extraction logic:

from lxml import etree

def extract_translatable_blocks(html_bytes):
    parser = etree.HTMLParser(recover=True, encoding='utf-8')
    tree = etree.HTML(html_bytes, parser)
    # We only care about text that appears in the body.
    body = tree.find('.//body')
    if body is None:
        return []
    segments = []
    for element in body.iter():
        # Comments and processing instructions have a non-string tag
        is_tag = isinstance(element.tag, str)
        # The text inside script/style (and comments) is never displayed
        if is_tag and element.tag not in ('script', 'style'):
            text = (element.text or '').strip()
            if text:
                segments.append((element, 'text', text))
        # The tail follows the element's closing tag, so it is displayed even
        # after <br>, <img> or a comment, and it is stored on the element itself
        tail = (element.tail or '').strip()
        if tail:
            segments.append((element, 'tail', tail))
    return segments

Notice we track both element.text and element.tail. This is critical because in HTML like <b>Hello</b> World, the word “World” is actually the tail of the <b> element.

Before translation, we group adjacent text segments into sentences. Our sentencizer (a lightweight regex‑based splitter) joins text across inline tags, so we send a single unit "Hello World" to the AI instead of two separate fragments. After translation, we split the result back across the original boundaries, taking care to preserve leading/trailing whitespace.
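
One way to sketch that round trip is to join the fragments with numbered placeholder markers and split on them afterwards. This is a simplified assumption rather than our sentencizer: the [[n]] marker syntax and both function names are hypothetical, and it depends on the model leaving the markers intact.

```python
import re
from typing import List, Optional

_MARKER_RE = re.compile(r"\[\[(\d+)\]\]")


def join_fragments(fragments: List[str]) -> str:
    """Join inline fragments, marking each boundary with [[n]]."""
    parts = [fragments[0]] if fragments else []
    for i, fragment in enumerate(fragments[1:], start=1):
        parts.append(f"[[{i}]]{fragment}")
    return "".join(parts)


def split_fragments(translated: str, expected: int) -> Optional[List[str]]:
    """Split on the markers; None if one was dropped, added or reordered."""
    pieces = _MARKER_RE.split(translated)
    # With one capture group, re.split yields: text, id, text, id, text, ...
    texts, ids = pieces[0::2], pieces[1::2]
    if len(texts) != expected or ids != [str(i) for i in range(1, expected)]:
        return None
    return texts
```

When split_fragments returns None, a caller can fall back to translating each fragment on its own, trading fluency for a correct structure.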

Rebuilding the XHTML

Once the tree is modified, we serialize it back with namespace preservation:

def serialize_html(root_node):
    # EPUB content documents must be well-formed XML, so serialize with
    # method='xml' (method='html' would emit unclosed void tags such as <br>).
    html_str = etree.tostring(
        root_node,
        method='xml',
        encoding='unicode',
        xml_declaration=False,
        pretty_print=True
    )
    # Wrap back into a full XHTML document if needed
    return f'<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html>\n{html_str}'

We then call item.set_content(serialized_html.encode('utf-8')) on the ebooklib item and write the book back out.

Dealing with the Hard Stuff

1. Embedded Fonts and Binary Resources

ebooklib handled images and fonts transparently as ITEM_IMAGE and ITEM_OTHER. We simply skip translation for non‑XHTML items. However, we discovered that some books rely on @font-face declarations in CSS that must remain valid after rebuild. We don’t modify CSS (translating content: "Chapter 1" would be asking for trouble), but we do parse each stylesheet to check its @font-face src URLs and ensure they are preserved as relative paths in the rebuilt EPUB.
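
A rough sketch of that check, using regular expressions and not a real CSS parser. This is good enough for flat @font-face blocks but is an assumption about how such a check could look. Both function names are hypothetical.

```python
import re
from typing import List

_FONT_FACE_RE = re.compile(r"@font-face\s*\{(.*?)\}", re.DOTALL | re.IGNORECASE)
_URL_RE = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)", re.IGNORECASE)
_ABSOLUTE_RE = re.compile(r"^(?:[a-z][a-z0-9+.-]*:|/)", re.IGNORECASE)


def font_face_urls(css: str) -> List[str]:
    """Collect every url(...) that appears inside an @font-face block."""
    urls = []
    for block in _FONT_FACE_RE.findall(css):
        urls.extend(u.strip() for u in _URL_RE.findall(block))
    return urls


def is_relative(url: str) -> bool:
    """True unless the URL has a scheme (http:, data:, ...) or a leading slash."""
    return _ABSOLUTE_RE.match(url) is None
```

URLs outside @font-face blocks, such as background images, are ignored on purpose.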

2. Validation with `epubcheck`

We run every rebuilt EPUB through epubcheck (the official Java validator) as a final sanity check. Initially, 30% of our output files failed—mostly because ebooklib would omit the mimetype file entry at the beginning of the ZIP, or because we inadvertently stripped xml:lang attributes. We patched our write routine to always inject the mimetype file first, and we now preserve all XML namespaces and attributes during the lxml manipulation.
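
The mimetype rule is easy to satisfy with the standard zipfile module: write that entry first and store it uncompressed. The sketch below builds the archive in memory. build_epub_bytes is a hypothetical name, and files maps archive paths to their contents.

```python
import io
import zipfile
from typing import Dict, Union


def build_epub_bytes(files: Dict[str, Union[str, bytes]]) -> bytes:
    """Zip `files` into an EPUB container with `mimetype` as the first entry."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        # The mimetype entry goes first and must be stored, not deflated
        zf.writestr(zipfile.ZipInfo("mimetype"), "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        for name, data in files.items():
            if name == "mimetype":
                continue  # already written above
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()
```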

3. Performance Tuning

Processing a 500‑page novel end‑to‑end (parse, translate, rebuild) takes roughly:

EPUB read & XHTML extraction: 2–3 s CPU
Translation API calls: 8–12 s (dominated by LLM latency)
XHTML rebuild & EPUB write: 3–4 s CPU

We parallelise XHTML file processing with asyncio (asyncio.to_thread for lxml work) because each chapter is independent. This brings wall‑clock time down to about 10 seconds for a typical book—acceptable for a real‑time web service.

Memory usage stays stable at ~150–200 MB by avoiding loading huge DOMs simultaneously.

Lessons Learned (The Hard Way)

High‑level libraries are a great start, but you’ll eventually need to understand the spec. ebooklib saved us weeks of work on the ZIP container and manifest. But debugging why a book wouldn’t open on iBooks meant reading the EPUB 3 spec and checking the OPF line by line.
Test on real‑world garbage. We assembled a corpus of 100 public‑domain EPUBs from Project Gutenberg, Standard Ebooks, and random indie publications. About 15% were seriously broken (e.g., XHTML with three opening <body> tags). lxml’s recover=True was a lifesaver.
Always validate after rebuild. Even if the book “looks fine” in Calibre, hidden structural errors will cause issues on other readers. Automate epubcheck in your CI.
Namespaces will ruin your day. Always use lxml’s nsmap when parsing; never assume default namespace prefixes.
Don’t translate everything. CSS content, <pre> formatted blocks, and math should be left alone. We filter out elements based on a configurable allow‑list.

Open Questions for the Community

We’re still not 100% satisfied with our pipeline. ebooklib is slow on large files because it loads and re-serializes the whole book in memory; rewriting the ZIP and OPF ourselves with zipfile + lxml could be faster, but it’s a lot of code. Are there other Python EPUB libraries that offer more granular control without the overhead? Would it make sense to fork ebooklib and make its loading lazy, given that it already uses lxml internally? We’d love to hear your war stories, especially if you’ve built a similar translation or conversion pipeline.

This article was written by the LectuLibre engineering team. We’re building an AI‑powered book translation service—if you wrestle with EPUBs too, let’s talk!

Parsing and Rebuilding EPUB Files in Python: Lessons Learned from Building an AI Translation Service

龚旭东 — Sat, 20 Jun 2026 03:01:16 +0000

How we extract, translate, and reconstruct entire ebooks with Python while preserving every detail

At LectuLibre, we built a service that translates entire books using large language models. Our users upload EPUB files, and our backend pipeline parses them, extracts the text, sends it to an LLM for translation, and then rebuilds the EPUB with the translated content—all while preserving the original formatting, images, and metadata. This sounded straightforward until we looked inside a real EPUB.

EPUB is essentially a ZIP file containing a structured set of XHTML, CSS, and XML files. The content.opf file defines the reading order (spine), metadata, and manifest. The toc.ncx holds the table of contents. The actual text lives in XHTML documents, often split per chapter. To translate a book, we needed to: 1) reliably parse the EPUB, 2) locate all translatable text, 3) send it chunk by chunk to the LLM, and 4) rebuild the EPUB with the translated text while keeping every byte of the formatting intact.

The Problem with Off-the-Shelf Libraries

We initially reached for ebooklib, the most popular Python library for EPUB manipulation. It worked great for simple EPUBs—until we threw a few hundred real-world files at it. We quickly hit issues:

Metadata loss: ebooklib didn’t fully preserve custom metadata or namespace-prefixed properties in the OPF.
Namespace handling: When modifying XHTML, it could strip or mangle xmlns attributes, breaking rendering on some devices.
TOC and spine sync: After rebuilding, the table of contents and spine often got out of sync unless we manually repaired them.
Large files: Processing a 200‑chapter book consumed surprising memory because ebooklib loaded everything at once.

We could have used a heavyweight tool like Calibre’s command-line interface, but that introduced external dependencies and wasn’t as programmatically flexible. Instead, we decided to stick with ebooklib for high-level book structure and augment it with lxml for precise XML control.

Our Parsing and Rebuilding Pipeline

Here’s the core approach we landed on:

Read the EPUB with ebooklib to get a list of items (documents, images, CSS).
Identify translatable content – usually ITEM_DOCUMENT (XHTML) and sometimes ITEM_NAVIGATION (NCX for titles).
Parse each XHTML document with lxml, extract text, while keeping a map of each text node to its parent element.
Send blocks of text to the LLM for translation, preserving order and context.
Rebuild the XHTML by replacing original text nodes with their translations using the saved mapping.
Write the new EPUB with ebooklib, manually ensuring the OPF and spine are correct.

Let’s dive into the code.

Step 1: Reading and Filtering Items

import ebooklib
from ebooklib import epub

book = epub.read_epub('original.epub')

translatable_items = []
for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        translatable_items.append(item)
    # Some books use NCX for chapter titles
    elif item.get_type() == ebooklib.ITEM_NAVIGATION:
        translatable_items.append(item)

We ignore images, fonts, and CSS—they don’t contain translatable text.

Step 2: Extracting Text with Context

We need to extract text while remembering exactly where it came from. We use lxml.etree to parse the XHTML and walk the tree, collecting text nodes and their XPath locations:

from lxml import etree

def extract_text_with_xpath(content):
    # HTMLParser tolerates broken markup; note that it yields un-namespaced tags
    parser = etree.HTMLParser()
    root = etree.fromstring(content, parser)
    tree = etree.ElementTree(root)

    text_mapping = []  # list of (xpath, original_text, element)
    for elem in root.iter():
        # Comments and processing instructions have a non-string tag;
        # their own text is not translatable, but their tail still is
        is_element = isinstance(elem.tag, str)
        if is_element and elem.text and elem.text.strip():
            text_mapping.append((tree.getpath(elem), elem.text, elem))
        if elem.tail and elem.tail.strip():
            # lxml stores tail text on the element it follows, but in the
            # document it sits inside the parent, so we record the parent's XPath
            parent = elem.getparent()
            if parent is not None:
                text_mapping.append((tree.getpath(parent), elem.tail, elem))
    return text_mapping

Pay attention to tail text—it’s the text that follows a closing tag, common in interleaved markup. Missing it leads to lost sentences.
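For readers new to the text/tail model, here is a tiny illustration. It uses the standard library's `xml.etree.ElementTree`, which shares these two attributes with lxml; the sample markup is invented.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>Hello <b>bold</b> world</p>")
b = p.find("b")

# Text before the first child belongs to the parent's .text
print(repr(p.text))
# Text inside the child is the child's .text
print(repr(b.text))
# Text after the child's closing tag is stored on the child as .tail
print(repr(b.tail))

# itertext() walks text and tails in document order
print("".join(p.itertext()))
```

A walker that reads only `.text` would return "Hello " and "bold" and silently drop " world".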

Step 3: Translating in Chunks

We batch the collected text nodes into chunks that respect LLM token limits. For instance, we group consecutive text from the same XHTML document, aiming for ~3000 tokens per batch. We then send each chunk to our translation model (e.g., Claude 3.5 Sonnet) and receive a block of translated text. We split the translated block back into individual strings by comparing lengths (advanced: we use a diff algorithm to align original and translated sentences). This is simplified here for brevity.
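One way to sketch the batching and the split-back step is shown below. Note that it swaps in a different alignment technique from the length comparison described above: each string gets a numbered marker that the model is asked to keep verbatim. The `[[n]]` marker format, the function names, and the 4-characters-per-token estimate are all assumptions made for illustration.

```python
import re

def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token; use a real tokenizer in practice
    return max(1, len(text) // 4)

def batch_strings(strings, max_tokens=3000):
    """Group consecutive strings into batches that stay within max_tokens."""
    batches, current, used = [], [], 0
    for s in strings:
        cost = estimate_tokens(s)
        # Flush before overflowing, but never emit an empty batch
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(s)
        used += cost
    if current:
        batches.append(current)
    return batches

def join_with_markers(batch):
    # Prefix each string with a numbered marker (hypothetical format)
    return "\n".join(f"[[{i}]] {s}" for i, s in enumerate(batch))

MARKER = re.compile(r"\[\[(\d+)\]\]")

def split_by_markers(translated: str, expected: int):
    """Recover one string per marker; raise if markers are missing or out of order."""
    parts = MARKER.split(translated)
    # With one capture group, parts = [preamble, "0", text0, "1", text1, ...]
    ids = [int(x) for x in parts[1::2]]
    if ids != list(range(expected)):
        raise ValueError("markers missing, duplicated, or reordered; retry this batch")
    return [t.strip() for t in parts[2::2]]
```

Because `split_by_markers` strips whitespace, the caller would need to re-apply each original string's leading and trailing spaces before writing it back into a text or tail slot.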

Step 4: Replacing Text in the Original XHTML

Now we map translations back:

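# zip() would silently drop entries if the model returned too few strings
if len(translations) != len(text_mapping):
    raise ValueError("translation count does not match extracted text nodes")

for (xpath, original, elem), translated_text in zip(text_mapping, translations):
    # We cached the element objects, so we can update them in place.
    # Tail entries were stored with the parent's XPath, so a stored path that
    # differs from the element's own path marks a tail rather than a text slot.
    if xpath != elem.getroottree().getpath(elem):
        elem.tail = translated_text
    else:
        elem.text = translated_text

# Serialize back to string (root is the tree parsed in Step 2).
# Note: method='html' emits HTML syntax, so the XHTML namespace may need restoring afterwards.
new_content = etree.tostring(root, encoding='unicode', method='html')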

We return the modified XHTML as a string, ready to replace the item’s content in the EPUB.

Step 5: Rebuilding the EPUB

Here’s where ebooklib shines. We create a new EpubBook, set the same metadata (title, author, language), and add items:

# original_book is the book loaded with epub.read_epub() in Step 1
new_book = epub.EpubBook()
new_book.set_identifier(original_book.get_metadata('DC', 'identifier')[0][0])
new_book.set_title(original_book.get_metadata('DC', 'title')[0][0])
new_book.set_language(original_book.get_metadata('DC', 'language')[0][0])
# get_metadata returns (value, attributes) pairs; copy every author
for creator, _attrs in original_book.get_metadata('DC', 'creator'):
    new_book.add_author(creator)

# Add all original items, replacing document content where needed
for item in original_book.get_items():
    if item.get_name() in modified_content_map:
        # Replace with translated XHTML
        new_content = modified_content_map[item.get_name()]
        new_item = epub.EpubItem(
            uid=item.get_id(),
            file_name=item.get_name(),
            # media_type is the MIME string; get_type() returns the ITEM_* constant
            media_type=item.media_type,
            content=new_content.encode('utf-8')
        )
    else:
        # Copy image, CSS, etc. as-is
        new_item = item
    new_book.add_item(new_item)

# Replicate the spine and table of contents
new_book.spine = original_book.spine
new_book.toc = original_book.toc

# Write out
epub.write_epub('translated.epub', new_book, {})

But wait—this naive approach can corrupt the OPF. We found that ebooklib sometimes rewrites the spine order incorrectly if the original had complex nesting. To fix this, we manually post-process the written EPUB’s content.opf using lxml:

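import zipfile
from lxml import etree

# Assumed path: ebooklib writes the package document here by default.
# META-INF/container.xml is the authoritative pointer if yours differs.
OPF_PATH = 'EPUB/content.opf'

# zipfile cannot replace a member in place (mode 'a' would append a second,
# duplicate OPF entry), so copy every member into a fresh archive instead
with zipfile.ZipFile('translated.epub') as src, \
        zipfile.ZipFile('translated.fixed.epub', 'w') as dst:
    for info in src.infolist():
        data = src.read(info.filename)
        if info.filename == OPF_PATH:
            opf = etree.fromstring(data)
            # Ensure itemref order matches original spine
            spine = opf.find('.//{http://www.idpf.org/2007/opf}spine')
            # Reorder based on original spine list
            # ... custom correction logic ...
            data = etree.tostring(opf, xml_declaration=True, encoding='UTF-8')
        # Reusing the original ZipInfo keeps member order and compression settings
        dst.writestr(info, data)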

Yes, it’s ugly, but it saved us from countless validation errors.

Performance and Real-World Numbers

We benchmarked on a typical novel: 50 chapters, 350KB uncompressed. Parsing and extracting text: ~0.2 seconds. Rebuilding after translation: ~0.3 seconds. The LLM translation step dominates (around 45 seconds for the whole book), so we worked on parallelism for that part instead.

However, with larger educational texts containing hundreds of images and complex tables, memory usage spiked to over 500MB. We mitigated this by processing documents one by one and releasing them immediately.

Key Lessons Learned

Namespaces are the devil: Always preserve xmlns="http://www.w3.org/1999/xhtml" and any custom namespaces on the <html> tag. lxml’s etree.tostring() with method='html' can drop them unless you explicitly add them back.
Validate, validate, validate: After rebuilding, we run epubcheck (via Python subprocess) to catch issues. False positives from custom metadata? We whitelist them after manual review.
Don’t trust the library for everything: ebooklib is great for reading, but for writing, we ended up doing a lot of OPF and NCX manipulation ourselves to ensure compliance.
Handle encoding upfront: Some old EPUBs use Latin-1. We transcode everything to UTF-8 early in the pipeline to avoid crashes later.
DRM is a dead end: We detect encrypted books by checking the <encryption> element in META-INF/encryption.xml and gracefully reject them.
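A hedged sketch of the encryption check follows. One caveat worth knowing: font obfuscation is also declared in encryption.xml, so the mere presence of the file does not imply DRM. The function name and the decision rule (any algorithm other than the two font-obfuscation URIs counts as DRM) are assumptions for illustration, not our exact production logic.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

ENC_NS = "http://www.w3.org/2001/04/xmlenc#"
# Font obfuscation is declared in encryption.xml too, but it is not DRM
FONT_OBFUSCATION = {
    "http://www.idpf.org/2008/embedding",
    "http://ns.adobe.com/pdf/enc#RC",
}

def looks_drm_protected(epub_file) -> bool:
    """Return True if encryption.xml lists any non-font-obfuscation algorithm."""
    with zipfile.ZipFile(epub_file) as zf:
        try:
            data = zf.read("META-INF/encryption.xml")
        except KeyError:
            return False  # no encryption.xml at all
    root = ET.fromstring(data)
    for method in root.iter(f"{{{ENC_NS}}}EncryptionMethod"):
        if method.get("Algorithm") not in FONT_OBFUSCATION:
            return True
    return False

def demo_archive(encryption_xml=None):
    # Builds a tiny in-memory archive for demonstration; not a complete EPUB
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        if encryption_xml is not None:
            zf.writestr("META-INF/encryption.xml", encryption_xml)
    buf.seek(0)
    return buf

if __name__ == "__main__":
    print(looks_drm_protected(demo_archive()))
```

Rejecting every book that ships an encryption.xml would be simpler, but it would also turn away unprotected books that merely obfuscate their embedded fonts.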

The Open Question for the Community

We’d love to know how others are managing complex EPUB manipulation in production. Have you found a more robust library than ebooklib? How do you deal with interactive EPUB3 elements (Javascript, form fields) when translating? We’re still iterating on our pipeline and would appreciate any battle stories.

If you’re tackling similar problems or want to try translating your own eBooks, you can see the result of this work at LectuLibre. But most importantly, we hope this deep dive saves you a few late nights the next time you need to mess with EPUB internals.

How We Translate Entire Books with LLMs Without Losing Context

龚旭东 — Thu, 18 Jun 2026 23:19:00 +0000

Our chunking strategy that keeps chapters coherent, respects context windows, and handles multi-lingual books.

The problem: books don’t fit in a prompt

At LectuLibre, we translate entire books — novels, technical manuals, poetry — using large language models. It sounds simple: feed each paragraph to an LLM, concatenate results, done. But the moment we tried a 300‑page EPUB, chaos ensued. Chapters bled into each other, sentences were chopped mid‑word, and the translation of chapter 5 had no idea what happened in chapter 4.

LLMs have limited context windows. Even the massive 200K token window of Claude 3 can’t hold a whole 150K‑word book. And even if it could, the cost and latency would be absurd. We needed a way to split the book into manageable chunks while preserving enough context so that the translation remains coherent across thousands of pages.

Here’s how we designed a chunking pipeline that respects your wallet, the context window, and the book’s narrative flow.

Step 1: extract structure, not just text

Naively splitting by character count is a recipe for disaster. Instead, we first parse the document to understand its logical units: chapters, sections, headings. For EPUB, we use ebooklib; for PDF, pdfplumber. Both give us a stream of items (paragraphs, headings) that we then organize into a tree of chapters and sub‑sections.

import ebooklib
from ebooklib import epub

def get_chapters(epub_path):
    book = epub.read_epub(epub_path)
    chapters = []
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        # Simplified: each document is a chapter
        content = item.get_content().decode('utf-8')
        chapters.append(content)
    return chapters

In practice, we use BeautifulSoup to extract <body> text and identify heading tags (<h1>–<h6>) to build a table of contents. This way, even if a chapter is 20,000 tokens, we keep it together as a single unit until later splitting.
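As a rough illustration of the heading pass, the sketch below uses the standard library's `html.parser` as a dependency-free stand-in for BeautifulSoup. The class and function names are hypothetical, and it only collects headings; it does not build the full chapter tree.

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for every <h1>..<h6> in a document."""
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None   # heading level currently open, if any
        self._buffer = []    # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        # Inline children such as <em> still deliver their text here
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            self.headings.append((self._level, "".join(self._buffer).strip()))
            self._level = None

def extract_headings(xhtml: str):
    collector = HeadingCollector()
    collector.feed(xhtml)
    collector.close()
    return collector.headings
```

The resulting list of (level, text) pairs is enough to nest sections under chapters by comparing each level with the one before it.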

Step 2: sentence‑aware splitting with token budgets

A chapter still needs to be broken down to fit the model’s context window. But we never split mid‑sentence. We use spaCy to tokenize the text into sentences, then greedily group them until we hit a token limit.

Why not simple character‑based splitting? Because sentences carry semantic boundaries. Breaking inside a sentence occasionally produces artefacts like “He walked to the sta‑” / “‑tion.” LLMs are forgiving but not that forgiving.

import spacy
from transformers import AutoTokenizer  # for accurate token count

nlp = spacy.load("en_core_web_sm")
# "claude-tokenizer" stands for our custom tokenizer; substitute a local path or hub ID of your own
tokenizer = AutoTokenizer.from_pretrained("claude-tokenizer")  # custom tokenizer for Claude

def sentence_split(text):
    doc = nlp(text)
    return [sent.text for sent in doc.sents]

def chunk_sentences(sentences, max_tokens=1800, overlap_sentences=5):
    chunks = []
    current_chunk = []
    current_token_count = 0

    for sent in sentences:
        sent_tokens = len(tokenizer.encode(sent))
        # Never emit an empty chunk, even if a single sentence exceeds the budget
        if current_chunk and current_token_count + sent_tokens > max_tokens:
            # Store chunk with a sliding overlap
            chunks.append(current_chunk)
            # Overlap: carry over the last `overlap_sentences` of the chunk just concluded
            current_chunk = current_chunk[-overlap_sentences:] if overlap_sentences > 0 else []
            current_token_count = sum(len(tokenizer.encode(s)) for s in current_chunk)
        current_chunk.append(sent)
        current_token_count += sent_tokens
    if current_chunk:
        chunks.append(current_chunk)
    return chunks

We set max_tokens to 1800, leaving room for the system prompt, context from previous chunks, and the model’s response. We tuned that budget for Claude Haiku. Its context window is far larger (200K tokens for Claude 3, as noted earlier), so the limit is a deliberate choice rather than a hard constraint: smaller chunks mean faster, cheaper API calls.

Step 3: passing context across chunks

The real magic is what we do between chunks. A standalone translation of chunk #5 has no clue that the protagonist just entered a dark cave in chunk #4. Two techniques solved this:

Sliding window of previous sentences — we include the last 5–10 sentences from the preceding chunk directly in the prompt as “context left.”
A running summary — after translating a chunk, we ask the LLM to generate a one‑sentence summary of that chunk. This summary is accumulated and fed into every subsequent prompt, so the model remembers high‑level events.

def build_prompt(chunk, previous_context_sentences, summary_so_far):
    context_left = " ".join(previous_context_sentences)
    prompt = f"""You are translating a book. Here is a summary of the story so far:
    {summary_so_far}

    And the previous text (for immediate context):
    "{context_left}"

    Now translate the following text to Spanish, preserving tone and style:
    {chunk}"""
    return prompt

The summary is generated using a separate, cheap call (we use DeepSeek for summaries, even if the main translation uses Claude). This keeps the context token usage minimal while still giving long‑range coherence.

Why not just include the entire previous chunk? That doubles the token count per call. On a 200K‑word book, that adds up to hundreds of dollars. Summaries cut that cost by ~80% with negligible quality loss.

The translation loop then looks like this:

overall_summary = ""
previous_context = []
full_translation = []

for chapter_chunks in all_chunks_by_chapter:
    chapter_summary = ""
    for chunk in chapter_chunks:
        chunk_text = " ".join(chunk)
        # Earlier chapters first, then what has happened so far in this chapter
        summary_so_far = (overall_summary + "\n" + chapter_summary).strip()
        prompt = build_prompt(chunk_text, previous_context, summary_so_far)
        translated = call_llm(prompt)
        full_translation.append(translated)

        # Update context: keep last 5 sentences of the translated chunk as next context
        trans_sents = sentence_split(translated)
        previous_context = trans_sents[-5:]

        # Summarize the source chunk (shown synchronously here; it can run as a background task)
        chunk_summary = call_llm(f"Summarize this passage in one sentence: {chunk_text}")
        chapter_summary += chunk_summary + " "
    overall_summary += chapter_summary
We process chunks concurrently using asyncio and httpx to keep translation times reasonable.

Real‑world results and trade‑offs

Translating a 120K‑word Spanish novel (“El Quijote”) into English took about 4 minutes end‑to‑end with Claude 3 Haiku. Total API cost: $0.67. The translation was surprisingly fluid — chapters felt connected, and the occasional flashback or pronoun reference (“she” referring to a character introduced three pages earlier) was correctly resolved. Without the context pipeline, the same book would have been riddled with inconsistencies.

We experimented with other models: DeepSeek‑V3 gave similar quality at half the price but with higher latency, making it better for batch jobs where speed isn’t critical. GPT‑4 Turbo reproduced stylistic flourishes more naturally, but its 16K context window forced us to use even smaller chunks, which sometimes fragmented dialogue. Claude struck the best balance.

But it’s not perfect. Humor and idioms still occasionally fall flat because the summary can’t encapsulate a running joke. Code blocks and tables inside technical books need special handling — we’re working on a parser that detects them and wraps them in [CODE] markers so the LLM doesn’t try to translate variable names. And poetry, with its line breaks and meter, remains a challenge; we’re considering a dedicated poetry‑aware chunker.
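To make the [CODE] idea concrete, here is a rough sketch of the wrapping step. The closing `[/CODE]` marker, the function names, and the regex approach are assumptions; a production version would more likely work on the parsed tree than on raw markup.

```python
import re

# Matches <pre>...</pre> or <code>...</code>, non-greedy, across newlines.
# An outer <pre> swallows any nested <code>, so nested blocks get one wrapper.
CODE_BLOCK = re.compile(r"<(pre|code)\b[^>]*>.*?</\1>", re.DOTALL | re.IGNORECASE)

def wrap_code(xhtml: str) -> str:
    """Surround code-like elements with [CODE]...[/CODE] markers."""
    return CODE_BLOCK.sub(lambda m: f"[CODE]{m.group(0)}[/CODE]", xhtml)

def unwrap_code(text: str) -> str:
    """Remove the markers again after translation."""
    return text.replace("[CODE]", "").replace("[/CODE]", "")

if __name__ == "__main__":
    print(wrap_code("<p>Run <code>x = 1</code> now</p>"))
```

The prompt would then instruct the model to copy anything between the markers unchanged, and `unwrap_code` would strip the markers from the response.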

The key takeaway

If you’re building long‑document translation using LLMs, invest in a pipeline that:

Respects document structure (chapters, paragraphs) before splitting.
Splits on sentences, and always leaves room for context.
Provides both immediate context (last few sentences) and global context (summaries) to each chunk.
Uses separate, cheap models for auxiliary tasks like summarization to keep costs down.

Our code is not open‑source yet, but we plan to release the core chunking library once we’ve battle‑tested it on more formats.

How do you handle context in LLM translations? We’re especially curious about handling highly technical books with equations, footnotes, and cross‑references. Drop your ideas in the comments — let’s figure this out together.