Adding 70-language translation to an image API without paying per word

#selfhosted #ai #machinelearning #webdev

I run PixelDrive, an API + editor that turns templates
into branded images. You design once, mark layers as variables, then POST data
and get a PNG. The most common request was localization: teams were making the
same graphic in 10 languages by hand.

Here's how I built translation into the render pipeline, and why I self-hosted
it instead of calling a cloud translation API.

The shape of the problem

Translation here is not "translate documents." It's short, repetitive design
copy (headlines, CTAs, labels) rendered onto images, on a server that's
already CPU-bound doing the actual rendering. Two facts drove every decision:

The text is tiny and cacheable. Translate "Spring Sale" -> es once, store it, and every future use is a sub-millisecond cache hit.
The box has no spare CPU and no GPU. Anything I run competes with the image renderer.

Why self-host

With caching, the per-word cost of a paid API basically disappears, so cost was
not the deciding factor. The deciding factors were control and not wanting a
network hop in the render path. The trade-off is that you have to pick a model
that is small, fast on CPU, and good on short copy.

The sweet spot:

Model: facebook/nllb-200-distilled-1.3B (200 languages, purpose-built for translation).
Runtime: CTranslate2 with int8 quantization. This is the key piece. At roughly one byte per weight, the 1.3B-parameter model shrinks to ~1.3 GB, and CPU inference is fast enough for short design copy. Do not run raw transformers on CPU for this.

A multi-stage Docker build does the conversion once, so the runtime image
carries no torch:

# stage 1: convert + quantize (needs torch, thrown away)
FROM python:3.11-slim AS converter
RUN pip install ctranslate2 "transformers[sentencepiece]" torch \
    --extra-index-url https://download.pytorch.org/whl/cpu
RUN ct2-transformers-converter \
      --model facebook/nllb-200-distilled-1.3B \
      --quantization int8 --output_dir /models/nllb-ct2

# stage 2: runtime (ctranslate2 + tokenizer only)
FROM python:3.11-slim
RUN pip install ctranslate2 "transformers[sentencepiece]" fastapi "uvicorn[standard]"
COPY --from=converter /models /models
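
A minimal sketch of how the runtime side can call the converted model. It follows the NLLB pattern from the CTranslate2 documentation; the function names, the English source language, and loading the tokenizer by its Hub name are assumptions for illustration, not the exact service code.

```python
def load_model(model_dir="/models/nllb-ct2",
               tokenizer_name="facebook/nllb-200-distilled-1.3B",
               src_lang="eng_Latn"):
    """Load the int8 CTranslate2 model and the matching tokenizer.

    Imports are local so the helper below can be exercised without the
    heavy dependencies installed. src_lang is an assumed default.
    """
    import ctranslate2
    import transformers

    translator = ctranslate2.Translator(model_dir, device="cpu", compute_type="int8")
    tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer_name, src_lang=src_lang)
    return translator, tokenizer


def translate_texts(translator, tokenizer, texts, tgt_lang):
    """Translate a batch of short strings into tgt_lang (a FLORES-200 code)."""
    if not texts:
        return []
    # CTranslate2 works on token strings, not token ids.
    batch = [tokenizer.convert_ids_to_tokens(tokenizer.encode(t)) for t in texts]
    # NLLB selects the output language via a target prefix token.
    results = translator.translate_batch(batch, target_prefix=[[tgt_lang]] * len(batch))
    translated = []
    for result in results:
        tokens = result.hypotheses[0][1:]  # drop the leading language token
        ids = tokenizer.convert_tokens_to_ids(tokens)
        translated.append(tokenizer.decode(ids, skip_special_tokens=True))
    return translated
```

Usage would be `translator, tokenizer = load_model()` once at startup, then `translate_texts(translator, tokenizer, ["Spring Sale"], "spa_Latn")` per request.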

The service itself is a tiny FastAPI app with /translate and
/translate_batch endpoints. NLLB uses FLORES-200 codes (spa_Latn), so I map
ISO codes (es) to them and normalize typographic punctuation (em dashes, curly
quotes) that the tokenizer would otherwise turn into <unk>.
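
The two pre-processing steps are small enough to sketch. The helper names and the five-language subset of the mapping are illustrative assumptions; the FLORES-200 codes themselves are the real ones.

```python
# Illustrative subset of an ISO 639-1 -> FLORES-200 mapping.
ISO_TO_FLORES = {
    "en": "eng_Latn",
    "es": "spa_Latn",
    "fr": "fra_Latn",
    "de": "deu_Latn",
    "ja": "jpn_Jpan",
}

# Typographic characters replaced with plain ASCII before tokenizing.
PUNCT_MAP = str.maketrans({
    "\u2014": "-",   # em dash
    "\u2018": "'",   # left single curly quote
    "\u2019": "'",   # right single curly quote
    "\u201c": '"',   # left double curly quote
    "\u201d": '"',   # right double curly quote
})


def normalize_punctuation(text):
    """Swap em dashes and curly quotes for ASCII equivalents."""
    return text.translate(PUNCT_MAP)


def to_flores(iso_code):
    """Map an ISO code such as 'es' to its FLORES-200 code."""
    try:
        return ISO_TO_FLORES[iso_code.lower()]
    except KeyError:
        raise ValueError(f"unsupported language: {iso_code!r}") from None
```

Rejecting unknown codes early keeps bad input from reaching the model as an invalid target prefix.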

Wiring it into rendering

The render service is the only place that needs translation, and it already has
a Redis/Kvrocks cache. The flow:

A text field can carry a language: { "headline": { "text": "Spring Sale", "lang": "es" } }.
The render cache key already includes the full payload (so es and fr are different cache entries). Translation runs only on a render cache miss.
translatePayload() collects every text value with a lang, checks the cache (tr:<lang>:<sha(text)>), batches the misses to the translation service, writes results back to the cache, and swaps the text in.
The render harness draws the translated text. The editor uses the same translation cache keys (tr:<lang>:<sha(text)>), so a translation done in the editor is reused at render time and vice versa.
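
The collect, check, batch, write-back flow can be sketched as below. This is a Python illustration of the logic, not the real translatePayload(): a plain dict stands in for Redis/Kvrocks, SHA-256 is an assumed choice for the hash, only top-level fields are handled, and translate_batch is a hypothetical callable that wraps the translation service.

```python
import hashlib


def cache_key(lang, text):
    """Build the tr:<lang>:<sha(text)> key; SHA-256 is an assumption."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"tr:{lang}:{digest}"


def translate_payload(payload, cache, translate_batch):
    """Return a copy of payload with every {text, lang} field translated.

    cache is dict-like; translate_batch(texts, lang) returns one
    translation per input text, in order.
    """
    out = dict(payload)
    misses = {}  # lang -> {text: [field names that use it]}

    for field, value in payload.items():
        if not (isinstance(value, dict) and value.get("lang") and "text" in value):
            continue  # not a translatable text field
        lang, text = value["lang"], value["text"]
        cached = cache.get(cache_key(lang, text))
        if cached is not None:
            out[field] = {**value, "text": cached}
        else:
            misses.setdefault(lang, {}).setdefault(text, []).append(field)

    # One service call per target language, covering all misses at once.
    for lang, by_text in misses.items():
        texts = list(by_text)
        for text, translated in zip(texts, translate_batch(texts, lang)):
            cache[cache_key(lang, text)] = translated
            for field in by_text[text]:
                out[field] = {**payload[field], "text": translated}
    return out
```

Grouping misses by text means a string that appears in several fields is translated once and written to every field that uses it.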

Two operational details that mattered:

The translation container is CPU-capped (a couple of cores). Even a burst of cache misses can't starve the renderer.
Translation is best-effort: if the service is unavailable, the render falls back to the original text. Translation never breaks a render.
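
A minimal sketch of the best-effort behavior, assuming a hypothetical translate_batch(texts, lang) callable that raises when the service is down. The function name and the returned flag are illustrative.

```python
def translate_or_original(texts, lang, translate_batch):
    """Try to translate; on any failure return the original texts.

    Returns (results, ok). ok is False when the fallback was used, so the
    caller can skip caching untranslated strings as if they were translations.
    """
    try:
        results = translate_batch(texts, lang)
        if len(results) != len(texts):
            raise ValueError("translation count mismatch")
        return list(results), True
    except Exception:
        # Deliberately broad: no translation failure should break a render.
        return list(texts), False
```

Returning the flag alongside the text is one way to keep a temporary outage from poisoning a long-lived cache.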

The result

POST /v1/render
{
  "templateId": "...",
  "payload": {
    "headline": { "text": "Welcome to our spring sale", "lang": "es" }
  }
}

renders "Bienvenido a nuestra venta de primavera" onto the image. In the
editor there's a one-click "translate the whole template" button with a
searchable 70-language picker, and the bulk-CSV flow supports a per-row lang
column so a single upload can render a batch across markets.
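
To illustrate the per-row lang idea, here is a sketch that turns CSV rows into render payloads. The column layout (every column except lang is a text layer) and the function name are assumptions for the example, not the documented CSV format.

```python
import csv
import io


def rows_to_payloads(csv_text, lang_column="lang"):
    """Build one payload per CSV row, tagging each text with the row's lang."""
    payloads = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lang = (row.pop(lang_column, "") or "").strip()
        payload = {}
        for field, text in row.items():
            # Rows with an empty lang keep their text untranslated.
            payload[field] = {"text": text, "lang": lang} if lang else {"text": text}
        payloads.append(payload)
    return payloads
```

Each resulting payload has the same shape as the single-render request above, so one upload can mix markets row by row.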

Because every (text, language) pair is cached forever, the model only runs on
genuinely new strings, which for marketing copy is rare after warmup. A small
CPU model plus aggressive caching turned out to be a better fit than a cloud API
for this particular shape of problem.

If you want to see it in action: pixeldrive.pro.
Happy to answer questions about the setup in the comments.