DEV Community

kevinbai

How AI Translates Manga: The Full Pipeline

Translating manga sounds simple — just read the text and translate it, right? In practice, it's one of the most technically demanding NLP + computer vision problems you can tackle. The text is embedded in images, stylized, often arranged vertically, and packed inside speech bubbles that need to look natural after translation.

In this post I'll walk through the full AI pipeline behind automated manga translation — from raw image to a fully rendered, translated page.

The Full Pipeline

```
Input Image
    │
    ├─ [Optional] Upscaling
    ├─ 1. Text Detection
    ├─ 2. OCR
    ├─ 3. Textline Merge
    ├─ 4. Translation
    ├─ 5. Inpainting
    └─ 6. Rendering
         │
    Output Image
```

Six steps. Each one is a non-trivial problem on its own.

Pre-processing: Upscaling

Small panels or low-DPI scans contain text that's only a few pixels tall. Running detection and OCR on 8px-high characters produces garbage. Super-resolution models like Waifu2x — built specifically for anime/manga art — can 2x or 4x the resolution before anything else runs, dramatically improving downstream accuracy.
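The gating logic around this step matters as much as the model: you only pay the super-resolution cost when the page is actually small. A minimal sketch, using a naive nearest-neighbor repeat as a stand-in for a real model like Waifu2x (the threshold and factor are illustrative, not values from the post):

```python
import numpy as np

def upscale_if_small(img: np.ndarray, min_height: int = 1000, factor: int = 2) -> np.ndarray:
    """Upscale the page only if it is below a minimum height.

    Stand-in for a real super-resolution model (e.g. Waifu2x):
    nearest-neighbor repeat just illustrates where the step sits
    in the pipeline and how the decision is gated.
    """
    if img.shape[0] >= min_height:
        return img  # already large enough; skip the extra compute
    # np.repeat along each spatial axis = naive nearest-neighbor upscale
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)
```

In a real pipeline the `np.repeat` call is replaced by a model inference, but the shape of the step stays the same: a cheap size check, then a conditional transform before detection runs.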

Step 1: Text Detection

The first challenge is finding where text is. Manga text comes in every orientation, size, and style imaginable — axis-aligned bounding boxes aren't enough.

Modern text detectors predict a pixel-level probability map of "text-ness" across the entire image, then threshold and trace it into polygons. This produces arbitrarily shaped regions that conform to rotated, curved, or irregularly laid-out text — something a simple rectangle detector can't handle.
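The threshold-and-trace step can be sketched without any model at all. Given a probability map, a flood fill over the thresholded mask recovers each connected text region — real detectors then fit a polygon around those pixels, but the grouping itself looks like this (thresholds are illustrative):

```python
import numpy as np

def trace_text_regions(prob_map: np.ndarray, thresh: float = 0.5):
    """Threshold a per-pixel 'text-ness' map and trace connected regions.

    Returns one list of (y, x) pixel coordinates per region — the raw
    material a detector would fit a polygon around.
    """
    mask = prob_map > thresh
    visited = np.zeros_like(mask, dtype=bool)
    regions = []
    h, w = mask.shape
    for sy in range(h):
        for sx in range(w):
            if not mask[sy, sx] or visited[sy, sx]:
                continue
            # flood fill (4-connectivity) to gather one region
            stack, pixels = [(sy, sx)], []
            visited[sy, sx] = True
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not visited[ny, nx]:
                        visited[ny, nx] = True
                        stack.append((ny, nx))
            regions.append(pixels)
    return regions
```

Because regions are pixel sets rather than rectangles, a tilted or curved line of text comes out as one coherent blob instead of being sliced into axis-aligned boxes.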

Step 2: OCR — Why Generic OCR Fails

Standard OCR tools fail badly on manga. Stylized fonts, vertical text, low contrast against screentone backgrounds — none of this matches what Tesseract or cloud OCR was trained on.

The solution is domain-specific models trained on manga panels. One example: MangaOCR, a ViT-based model fine-tuned specifically on Japanese manga. Because it's seen thousands of speech bubbles during training, it handles stylized lettering and vertical text far better than any general-purpose OCR.
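One practical detail before the OCR model runs: each detected region is cropped out with a small margin, since a model trained on whole speech bubbles does better with a little surrounding context than with a tight crop. A minimal sketch (the padding value and box format are assumptions, not from MangaOCR itself):

```python
import numpy as np

def crop_for_ocr(img: np.ndarray, box, pad: int = 4) -> np.ndarray:
    """Crop a detected text box with a small margin, clamped to the page.

    box = (top, left, bottom, right) in pixels. The resulting crop is
    what gets handed to the OCR model, one region at a time.
    """
    top, left, bottom, right = box
    h, w = img.shape[:2]
    return img[max(0, top - pad):min(h, bottom + pad),
               max(0, left - pad):min(w, right + pad)]
```

Each crop then goes through the OCR model individually, so a bad detection in one bubble can't corrupt the reading of another.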

Step 3: Textline Merge — A Graph Theory Problem

Detection gives you individual text lines. But a single speech bubble might contain five detected lines that need to be treated as one block for translation.

Merging by proximity alone fails — nearby lines from different bubbles get incorrectly grouped. The better approach: model it as a graph problem. Each textline is a node; edges connect candidates weighted by distance, font size similarity, and alignment direction. Cut the graph at the right edges and the remaining connected components become coherent text blocks, ready for translation.
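The connected-components idea can be made concrete with a union-find over pairwise edges. A minimal sketch, where the edge criterion is "close AND similar font size" — the thresholds and the choice of line height as a font-size proxy are illustrative, not the post's exact weighting:

```python
from dataclasses import dataclass

@dataclass
class TextLine:
    cx: float      # center x
    cy: float      # center y
    height: float  # proxy for font size

def merge_textlines(lines, max_dist=50.0, max_size_ratio=1.5):
    """Group detected textlines into blocks via graph connected components.

    Nodes are textlines; an edge joins two lines when they are close
    and similarly sized. Connected components of that graph are the
    merged text blocks, returned as lists of indices into `lines`.
    """
    n = len(lines)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            a, b = lines[i], lines[j]
            dist = ((a.cx - b.cx) ** 2 + (a.cy - b.cy) ** 2) ** 0.5
            ratio = max(a.height, b.height) / max(1e-6, min(a.height, b.height))
            if dist <= max_dist and ratio <= max_size_ratio:
                parent[find(i)] = find(j)  # union: same block

    blocks = {}
    for i in range(n):
        blocks.setdefault(find(i), []).append(i)
    return list(blocks.values())
```

A production version would add alignment direction (vertical vs. horizontal flow) as a third edge condition, but the structure is the same: score pairs, keep edges above threshold, take components.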

Step 4: Translation

With clean text extracted, translation is where you have the most flexibility. You can plug in any API (GPT-4, DeepL, Gemini) or run local models (Meta's NLLB, which covers 200+ languages, or lighter models specialized for Japanese→English).

One interesting pattern: chaining translators. Route text through one model to get an intermediate language, then through a second model specialized in the final target language. Sometimes two mediocre steps beat one expensive one.
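Chaining is just function composition over translator backends, which makes it easy to swap models in and out. A minimal sketch — `ja_to_en_rough` and `en_polisher` are hypothetical stand-ins for real model or API calls:

```python
from typing import Callable

def chain(*translators: Callable[[str], str]) -> Callable[[str], str]:
    """Compose translators left to right: each one's output feeds the next.

    Usage (names hypothetical):
        translate = chain(ja_to_en_rough, en_polisher)
        english = translate(japanese_block)
    """
    def run(text: str) -> str:
        for t in translators:
            text = t(text)
        return text
    return run
```

Because each stage is just a `str -> str` callable, a cloud API, a local NLLB model, or a glossary-substitution pass all plug into the same chain.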

Step 5: Inpainting — Erasing the Original Text

Before rendering the translation, the original text has to be erased and the background reconstructed. This is harder than it sounds — the model needs to hallucinate the screentone, hatching, or artwork that was hiding underneath.

LaMa (Large Mask Inpainting) handles this well. Its large receptive field lets it understand the global structure of the image before deciding what to fill in — so it correctly continues a crosshatch pattern or background gradient across the erased region, rather than just blending nearby pixels.
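The inpainting model itself is a black box, but the mask it consumes is built from the detection output, and one detail is easy to get wrong: the mask must be dilated a few pixels past the text boxes, or stroke edges survive the erase and leave ink fragments. A minimal numpy sketch (box format and dilation amount are assumptions):

```python
import numpy as np

def text_mask(shape, boxes, dilate: int = 3) -> np.ndarray:
    """Build the binary mask an inpainting model (e.g. LaMa) consumes.

    Every detected text box is marked 1, then dilated so strokes at
    the box edge are fully covered before inpainting.
    """
    mask = np.zeros(shape, dtype=np.uint8)
    for top, left, bottom, right in boxes:
        mask[top:bottom, left:right] = 1
    for _ in range(dilate):
        # one step of 4-neighbour dilation via shifted copies
        m = mask.copy()
        m[1:, :] |= mask[:-1, :]
        m[:-1, :] |= mask[1:, :]
        m[:, 1:] |= mask[:, :-1]
        m[:, :-1] |= mask[:, 1:]
        mask = m
    return mask
```

The model then fills exactly the masked pixels, so a slightly-too-large mask costs a little reconstruction quality while a slightly-too-small one leaves visible text remnants — over-masking is the safer error.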

Step 6: Rendering — The Hardest Part Nobody Talks About

Most DIY translation tools fall flat here. Slapping text onto an image is trivial. Making it look like it belongs there is genuinely hard.

A few of the problems the renderer has to solve:

  • Length mismatch: a 10-character Japanese phrase might translate to 40 English characters. The renderer must dynamically shrink the font, reflow the text, and keep it inside the bubble.
  • Rotation: dialogue in action panels is often tilted. The text layer is warped to match using a homography transform — the same math behind AR markers and image stitching.
  • Vertical text: Japanese manga uses top-to-bottom columns. Simply rotating glyphs 90° doesn't produce correct vertical typography — specific characters have dedicated vertical variants that need to be substituted.

Getting all three right simultaneously, for every panel on a page, is what separates a usable translation from a broken one.
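The length-mismatch problem in particular reduces to a search: try a font scale, wrap the text at that scale, and shrink until the wrapped block fits the bubble. A purely geometric sketch, assuming character width and line height scale linearly with the font (all constants are illustrative):

```python
import textwrap

def fit_text(text, bubble_w, bubble_h, char_w=10, line_h=18, min_scale=0.5):
    """Find the largest font scale at which wrapped text fits a bubble.

    Returns (scale, wrapped_lines), or None if even the minimum scale
    overflows — the signal to fall back to e.g. enlarging the bubble.
    """
    scale = 1.0
    while scale >= min_scale:
        cols = max(1, int(bubble_w / (char_w * scale)))
        lines = textwrap.wrap(text, width=cols)
        needed_h = len(lines) * line_h * scale
        widest = max((len(l) for l in lines), default=0) * char_w * scale
        if needed_h <= bubble_h and widest <= bubble_w:
            return scale, lines
        scale -= 0.05  # shrink and retry
    return None
```

A real renderer measures glyphs with actual font metrics instead of a fixed `char_w`, but the shrink-wrap-retry loop is the core of keeping a 40-character English phrase inside a bubble drawn for 10 Japanese characters.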

Try It Without Setting Any of This Up

Configuring the full stack — models, CUDA, dependencies, API keys — takes real effort. If you just want to see what this pipeline produces on your own manga pages, mangatranslator.me runs it online without any local setup. Upload an image, pick a target language, done.

Final Thoughts

Automated manga translation is a good example of how real-world AI applications are rarely about one model — they're about a sequence of specialized models where each step's output quality determines the ceiling of the next. The rendering step in particular is a reminder that "put text on image" hides a surprising amount of engineering.
