How we integrated real-time phrase translation feedback into our AI-powered book translation workflow, and what we learned about latency, context, and prompt engineering.
When we launched LectuLibre, our AI-powered book translation platform, users loved the quality of full-chapter translations. But they kept asking for something else: while reading a partially translated book, they'd stumble on an untranslated phrase or an awkward auto-translation and want a better version without leaving the page. So we built Instant Translation Help (即时翻译求助), a feature that lets readers highlight any phrase and get a context-aware, human-quality translation within seconds, along with a brief explanation of tricky parts.
Here's how we built it, the technical challenges we faced, and the lessons we learned about stitching LLMs into a real-time reading experience.
Problem: Real-time, Context-Aware Translation Inside a Book
Most web apps offer generic translation via API calls—send a sentence to Google Translate, get a result. But that doesn't work for literary texts. A phrase like "She let the cat out of the bag" needs to be translated idiomatically, and the appropriate rendering depends heavily on the surrounding paragraphs (is the tone formal? sarcastic? part of a metaphor chain?). Our existing translation pipeline processes entire chapters in bulk with carefully crafted prompts, but for instant help, we needed sub-second latency while preserving that same depth of context.
Our Approach: Server‑Sent Events and a Smart Prompt Buffer
We chose Server-Sent Events (SSE) over WebSockets because the communication is one-directional (server pushes translation tokens) and SSE is simpler to implement with FastAPI. The client (a React app) sends a POST request with:
- The phrase to translate
- The book ID and the exact location (chapter/paragraph index)
- The target language
Our backend retrieves the surrounding text from PostgreSQL (we store the original book in chunks), feeds a carefully assembled prompt to the LLM (Claude 3 Haiku for speed), and streams the response back token-by-token.
Implementation Deep Dive
1. Background Context Retrieval
We index each paragraph with its position. Given a highlighted phrase, we grab the paragraph containing it, plus one paragraph before and after. This usually provides enough narrative context without blowing up the prompt size.
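from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession

# BookParagraph is our ORM model (columns used here: book_id, index, text).


async def get_context(book_id: str, para_index: int, db: AsyncSession) -> str:
    """Return the highlighted paragraph plus one neighbor on each side."""
    stmt = (
        select(BookParagraph)
        .where(
            BookParagraph.book_id == book_id,
            BookParagraph.index.between(para_index - 1, para_index + 1),
        )
        .order_by(BookParagraph.index)
    )
    result = await db.execute(stmt)
    paragraphs = result.scalars().all()
    return "\n".join(p.text for p in paragraphs)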
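To make the windowing behavior concrete without a database, here is a minimal in-memory sketch. The function name `context_window` and its `before`/`after` parameters are hypothetical and only illustrate the idea; the real lookup is the SQL query above.

```python
from typing import Sequence


def context_window(
    paragraphs: Sequence[str], para_index: int, before: int = 1, after: int = 1
) -> str:
    """Join the target paragraph with its neighbors, clamped to the book's bounds."""
    if not 0 <= para_index < len(paragraphs):
        raise IndexError(f"paragraph {para_index} is out of range")
    start = max(0, para_index - before)
    end = min(len(paragraphs), para_index + after + 1)
    return "\n".join(paragraphs[start:end])


book = [
    "First paragraph.",
    "Second paragraph.",
    "Third paragraph.",
    "Fourth paragraph.",
]
print(context_window(book, 0))  # the first paragraph has no predecessor
print(context_window(book, 2))  # a middle paragraph gets both neighbors
```

At the edges of a chapter the window simply shrinks, which matches what the `between` query returns when a neighboring index does not exist.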
2. Prompt Engineering for Instant Help
We needed a prompt that instructs the LLM to:
- Translate the given phrase in the exact tone and style of the surrounding text
- If the phrase contains an idiom or cultural reference, provide a natural equivalent in the target language, with a short explanation
- Return the result as a clean Markdown snippet (translation + explanation)
- Keep it concise (we display in a small popover)
Here's the core prompt template:
INSTANT_HELP_PROMPT = """
You are a literary translator. Below is the source text surrounding a highlighted phrase, the phrase itself, and the target language.
Translate the highlighted phrase into {target_lang} in a way that fits the style of the surrounding text.
If the phrase contains an idiom, metaphor, or cultural reference, provide a natural equivalent and a one-sentence explanation in parentheses.
Output format:
**Translation:** [your translation]
**Note:** [explanation if needed]
Surrounding text:
{context}
Highlighted phrase:
"{phrase}"
Translation:
"""
In our testing, Claude 3 Haiku follows this format almost every time and omits the "Note" line when there is nothing to explain.
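Because the format is followed almost always rather than always, the consumer of the response should tolerate a miss. A minimal sketch of a parser for the two-label format, assuming the labels appear exactly as written in the prompt; `HelpResult` and `parse_help_response` are hypothetical names, not part of any library:

```python
from typing import NamedTuple, Optional

TRANSLATION_LABEL = "**Translation:**"
NOTE_LABEL = "**Note:**"


class HelpResult(NamedTuple):
    translation: str
    note: Optional[str]


def parse_help_response(text: str) -> Optional[HelpResult]:
    """Split a model response into translation and optional note.

    Returns None when the response does not follow the expected format,
    so the caller can fall back to showing the raw text.
    """
    body = text.strip()
    if not body.startswith(TRANSLATION_LABEL):
        return None
    body = body[len(TRANSLATION_LABEL):]
    translation, _, note = body.partition(NOTE_LABEL)
    translation = translation.strip()
    note = note.strip()
    if not translation:
        return None
    return HelpResult(translation, note or None)


sample = "**Translation:** Elle a vendu la mèche.\n**Note:** French idiom for revealing a secret."
print(parse_help_response(sample))
```

Returning `None` instead of raising keeps the popover usable: a malformed response can still be displayed as plain text.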
3. Streaming the Response with FastAPI and SSE
We built an async endpoint that yields SSE chunks. The client can start rendering the translation as tokens arrive, which feels instant.
import json

import anthropic
from fastapi import APIRouter, Request
from fastapi.responses import StreamingResponse

router = APIRouter()

# async_session is our SQLAlchemy async session factory, defined elsewhere.


@router.post("/api/instant-help")
async def instant_help(request: Request):
    body = await request.json()
    phrase = body["phrase"]
    book_id = body["bookId"]
    para_index = body["paraIndex"]
    target_lang = body["targetLang"]

    async def event_generator():
        # Release the DB session before the much slower LLM call starts.
        async with async_session() as db:
            context = await get_context(book_id, para_index, db)

        prompt = INSTANT_HELP_PROMPT.format(
            target_lang=target_lang,
            context=context,
            phrase=phrase,
        )

        # Stream from Claude using the official Anthropic Python SDK
        async with anthropic.AsyncAnthropic() as client:
            stream = await client.messages.create(
                model="claude-3-haiku-20240307",
                max_tokens=300,
                temperature=0.3,
                messages=[{"role": "user", "content": prompt}],
                stream=True,
            )
            async for event in stream:
                if event.type == "content_block_delta":
                    # JSON-encode so newlines in the text cannot break SSE framing.
                    payload = json.dumps({"text": event.delta.text})
                    yield f"data: {payload}\n\n"
                elif event.type == "message_stop":
                    yield "data: [DONE]\n\n"

    return StreamingResponse(event_generator(), media_type="text/event-stream")
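On the frontend, the client renders these events as they arrive. One caveat: the browser's native EventSource API only issues GET requests, so a POST endpoint like this one has to be read with fetch and a streaming body reader (or an SSE client library that supports POST). The time from click to first token is about 400–600 ms for typical phrases.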
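The events on the wire are simple enough to parse by hand. Here is a minimal sketch of that parsing logic, written in Python to match the backend; a browser client would mirror the same steps. `iter_sse_text` is a hypothetical helper, and it assumes each event carries a single `data:` line, which holds for the endpoint above because the payload is JSON-encoded.

```python
import json
from typing import Iterable, Iterator


def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from `data:` lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separator lines, comments, or other fields
        payload = line[len("data:"):]
        if payload.startswith(" "):
            payload = payload[1:]  # SSE strips one leading space after the colon
        if payload == "[DONE]":
            return
        yield json.loads(payload)["text"]


wire = [
    'data: {"text": "**Translation:** "}',
    "",
    'data: {"text": "She spilled the secret."}',
    "",
    "data: [DONE]",
    "",
    'data: {"text": "ignored"}',
]
print("".join(iter_sse_text(wire)))
```

Stopping at the sentinel rather than at connection close lets the client finalize the popover as soon as the model is done.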
Trade-offs and Hard Decisions
Latency vs. Quality
Haiku is fast but not always perfect. We tried DeepSeek-V2, which was better with idioms, but its latency exceeded 2 seconds and killed the "instant" feel. We settled on Haiku for now, with a second, more detailed translation available on demand (generated by Claude 3 Opus in the background).
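When comparing models for a feature like this, time to first token is the number that matters most. A minimal sketch of a timing wrapper that works with any async token stream; `time_to_first_token` and `fake_stream` are hypothetical names, and `fake_stream` merely stands in for a real model stream:

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple


async def time_to_first_token(
    stream: AsyncIterator[str],
) -> Tuple[Optional[float], str]:
    """Consume a token stream; return (seconds until first token, full text)."""
    start = time.perf_counter()
    first: Optional[float] = None
    parts = []
    async for token in stream:
        if first is None:
            first = time.perf_counter() - start
        parts.append(token)
    return first, "".join(parts)


async def fake_stream(delay: float) -> AsyncIterator[str]:
    """Stand-in for a model stream: waits, then emits two tokens."""
    await asyncio.sleep(delay)
    yield "Hello"
    yield " world"


ttft, text = asyncio.run(time_to_first_token(fake_stream(0.05)))
print(f"first token after {ttft * 1000:.0f} ms, text={text!r}")
```

Running the same wrapper against each candidate model with the same prompts gives a like-for-like latency comparison.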
Cost Management
Each instant help call costs about $0.002 (input + output tokens). With thousands of users, that adds up. We added a Redis cache keyed on (book_id, para_index, phrase, target_lang). Repeated requests for the same phrase (e.g., multiple users reading the same book) are served from the cache without calling the LLM, which cut LLM calls by ~30% in our beta.
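The cache lookup can be sketched as follows. This is an illustration under assumptions, not our production code: a plain dict stands in for Redis, hashing the key fields is one possible way to keep keys short, and `cache_key` and `get_or_translate` are hypothetical names.

```python
import hashlib
import json
from typing import Callable, MutableMapping


def cache_key(book_id: str, para_index: int, phrase: str, target_lang: str) -> str:
    """Build a fixed-length key from the four fields that identify a request."""
    raw = json.dumps([book_id, para_index, phrase, target_lang], ensure_ascii=False)
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    return f"instant-help:{digest}"


def get_or_translate(
    store: MutableMapping[str, str], key: str, translate: Callable[[], str]
) -> str:
    """Return the cached translation, calling the LLM only on a miss."""
    cached = store.get(key)
    if cached is not None:
        return cached
    result = translate()
    store[key] = result
    return result


calls = []


def fake_translate() -> str:
    """Stand-in for the LLM call; records how often it runs."""
    calls.append(1)
    return "**Translation:** Elle a vendu la mèche."


store: dict = {}
key = cache_key("book-42", 17, "let the cat out of the bag", "fr")
first = get_or_translate(store, key, fake_translate)
second = get_or_translate(store, key, fake_translate)
print(len(calls))  # the second request is served from the cache
```

With a real Redis client, the dict operations would become `GET` and `SET` calls, typically with an expiry so that stale translations age out.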
Prompt Buffer Size
In our experiments, adding the two neighboring paragraphs (one before, one after) clearly improved quality without adding too many tokens. Including an entire chapter, by contrast, led to slower responses and occasional off-topic interpretations. We keep the context at ~500 tokens on average.
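One way to enforce a budget like this is to always keep the highlighted paragraph and add each neighbor only if it still fits. The sketch below is an assumption-laden illustration: the four-characters-per-token estimate is a rough heuristic for English text rather than a real tokenizer, and `estimate_tokens` and `fit_context` are hypothetical names.

```python
from typing import Sequence


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about four characters per token (assumption)."""
    return max(1, len(text) // 4)


def fit_context(paragraphs: Sequence[str], target: int, budget: int = 500) -> str:
    """Keep the target paragraph, then add each neighbor that fits the budget."""
    chosen = [target]
    used = estimate_tokens(paragraphs[target])
    for neighbor in (target - 1, target + 1):
        if 0 <= neighbor < len(paragraphs):
            cost = estimate_tokens(paragraphs[neighbor])
            if used + cost <= budget:
                chosen.append(neighbor)
                used += cost
    # Emit in reading order regardless of the order neighbors were added.
    return "\n".join(paragraphs[i] for i in sorted(chosen))


chapter = ["a" * 400, "b" * 800, "c" * 4000]
context = fit_context(chapter, target=1, budget=500)
print(estimate_tokens(context))
```

The target paragraph is kept even if it alone exceeds the budget, since a translation without the sentence containing the phrase would be of little use.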
Results and What We Learned
- User happiness: Readers now translate 3x more phrases than when they had to copy-paste to another tool. The inline note often teaches them new idioms, which they love.
- Engineering takeaway: Server-Sent Events are underrated for LLM streaming. They work well over HTTP/2 and are much easier to debug than WebSockets, since each event is plain text in an ordinary HTTP response.
- Prompt sensitivity: The exact wording "Output format: **Translation:** ... **Note:** ..." reduced malformed responses by 90%. Small tweaks matter.
- Caching is critical: With Redis, we kept extra LLM costs in check and improved perceived performance for popular books.
Where We Might Go Next
We're exploring a context window expansion that uses the entire chapter, but with aggressive summarization of preceding paragraphs via a cheap model call. Also, fine-tuning a small open-source model on our translation style could bring costs close to zero. If you've built similar inline AI features, how did you handle the cost/latency/quality triangle? We'd love to hear your approach in the comments.
Building LectuLibre has taught us that AI-powered tools shine when they fit seamlessly into the user's workflow. Instant translation help is that seam—a small feature that feels like magic because it respects the reader's flow.