At first glance, building a manga translator Chrome extension sounds straightforward.
Inject a script.
Detect text.
Translate it.
Render the result.
Done… right?
Not really.
The Assumption Most Developers Start With
Most implementations follow a simple pipeline:
Find text on the page
Run OCR if needed
Send to a translation API
Overlay translated text
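As a control-flow sketch, that naive pipeline looks roughly like this. The stages are stubbed out here so the flow is runnable; `detectText`, `runOcr`, and `translate` are illustrative names, not a real API.

```javascript
// Naive pipeline with stubbed stages so the control flow is runnable.
// detectText, runOcr, and translate are placeholders, not a real API.
const detectText = (img) => [{ x: 10, y: 20, w: 120, h: 40 }];
const runOcr = async (region) => "こんにちは";        // pretend OCR result
const translate = async (text) => `Hello (${text})`; // pretend API call

async function translatePage(img, render) {
  for (const region of detectText(img)) {  // 1. find text on the page
    const raw = await runOcr(region);      // 2. run OCR
    const out = await translate(raw);      // 3. send to translation API
    render(region, out);                   // 4. overlay translated text
  }
}
```

Every problem below is a place where one of those four lines quietly fails on manga.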
This works well for standard web content.
But manga breaks almost every assumption in that pipeline.
Problem 1: Manga Text Isn’t “Text”
Unlike typical web pages, manga content is:
image-based
layout-dependent
often non-linear
Text is not separate from the UI.
It is the UI.
Problem 2: OCR Is the First Bottleneck
To translate manga, you need OCR.
But OCR struggles with:
vertical Japanese text
stylized fonts
distorted or perspective text
Even worse:
OCR errors cascade downstream
A single misread character can:
change meaning
break sentence structure
confuse the translation model
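One way to see why a single misread matters: if you treat per-character OCR confidences as roughly independent, a line is only about as reliable as the product of its characters' scores. This is a toy model (real engines don't guarantee independence), but the intuition holds:

```javascript
// Toy model: line-level reliability as the product of per-character
// OCR confidences. Assumes independence, which real OCR engines
// don't guarantee, but it shows how one bad read sinks the line.
function lineConfidence(charConfidences) {
  return charConfidences.reduce((acc, c) => acc * c, 1);
}

const clean  = Array(10).fill(0.99);           // ten well-read characters
const oneBad = [...Array(9).fill(0.99), 0.5];  // same line, one misread

// lineConfidence(clean)  ≈ 0.90
// lineConfidence(oneBad) ≈ 0.46 — one character halves the line's reliability
```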
Problem 3: The DOM vs Image Reality
Chrome extensions operate on the DOM.
Manga lives inside images.
This mismatch creates a fundamental limitation:
You can’t “select” manga text like normal text
Instead, you need:
image segmentation
region detection
layout understanding
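To make "region detection" concrete, here is a toy 4-connected component labeller over a binary ink mask. It is a drastically simplified stand-in for what a real segmentation model does, but it shows the shape of the work: you get bounding boxes out of pixels, not nodes out of a DOM.

```javascript
// Toy region detection: 4-connected component labelling over a
// binary "ink" mask (1 = text pixel). Returns bounding boxes.
function findRegions(mask) {
  const h = mask.length, w = mask[0].length;
  const seen = mask.map((row) => row.map(() => false));
  const regions = [];
  for (let y = 0; y < h; y++) {
    for (let x = 0; x < w; x++) {
      if (!mask[y][x] || seen[y][x]) continue;
      // Flood fill from (x, y), tracking the bounding box.
      let minX = x, maxX = x, minY = y, maxY = y;
      const stack = [[x, y]];
      seen[y][x] = true;
      while (stack.length) {
        const [cx, cy] = stack.pop();
        minX = Math.min(minX, cx); maxX = Math.max(maxX, cx);
        minY = Math.min(minY, cy); maxY = Math.max(maxY, cy);
        for (const [dx, dy] of [[1, 0], [-1, 0], [0, 1], [0, -1]]) {
          const nx = cx + dx, ny = cy + dy;
          if (nx >= 0 && nx < w && ny >= 0 && ny < h &&
              mask[ny][nx] && !seen[ny][nx]) {
            seen[ny][nx] = true;
            stack.push([nx, ny]);
          }
        }
      }
      regions.push({ x: minX, y: minY, w: maxX - minX + 1, h: maxY - minY + 1 });
    }
  }
  return regions;
}
```

Real systems replace this with trained detectors, but the output contract is the same: regions, then reading order, then OCR.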
Problem 4: Overlay Is a Hack (But Often the Only Option)
Most extensions solve rendering like this:
add a positioned <div> on top of the image
This leads to:
overlapping text
broken speech bubbles
inconsistent alignment
From an engineering perspective, it’s understandable.
From a user perspective:
it looks broken
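Part of why overlays misalign: OCR coordinates live in the image's natural pixel space, while the overlay lives in the rendered (often scaled) space, so every position has to be rescaled. A pure helper makes the fragility visible (hypothetical names, CSS-in-JS style):

```javascript
// Convert a text region from natural-image pixels into a style object
// for an absolutely positioned overlay. Any mismatch between these
// scale factors and the actual rendering shows up as misalignment.
function overlayStyle(region, natural, display) {
  const sx = display.w / natural.w;
  const sy = display.h / natural.h;
  return {
    position: "absolute",
    left: `${region.x * sx}px`,
    top: `${region.y * sy}px`,
    width: `${region.w * sx}px`,
    height: `${region.h * sy}px`,
  };
}
```

And this is the easy case: responsive layouts, zoom, lazy-loaded images, and CSS transforms all change the display size after you've positioned the overlay.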
Problem 5: Site Dependency
Extensions depend on:
DOM structure
CSS selectors
site layout
When a site updates:
your extension breaks
This makes long-term maintenance expensive and fragile.
Problem 6: No Control Over the Source
Extensions operate on third-party pages.
That means:
no control over image resolution
no control over compression
no control over layout consistency
Which makes reliable processing harder.
What Would a Better Architecture Look Like?
Instead of forcing everything into a browser extension model, a more robust approach is:
Extract the image
Run layout-aware detection
Perform OCR with context
Remove original text (inpainting)
Re-typeset the translation
This shifts the problem from:
“modify the page”
to:
“reconstruct the content”
The move is from DOM manipulation to a server-side pipeline: send the image blob to a GPU-accelerated worker, run a layout-aware detection model, and return a reconstructed canvas.
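Structurally, that pipeline is a chain of async stages. The stubs below stand in for real models (detector, OCR, translator, inpainting, typesetter) so the orchestration is runnable; the stage names are illustrative, not a real framework:

```javascript
// Server-side pipeline as composable async stages. Each stub below
// stands in for a real model; the orchestration is the point.
const demoStages = {
  detectLayout: async (img) => [{ bubble: 1, x: 0, y: 0, w: 10, h: 10 }],
  ocr: async (img, layout) => layout.map(() => "original text"),
  translate: async (texts) => texts.map((t) => `translated: ${t}`),
  inpaint: async (img, layout) => ({ ...img, textErased: true }),
  typeset: async (img, layout, texts) => ({ ...img, rendered: texts }),
};

async function processPage(image, stages) {
  const layout = await stages.detectLayout(image);   // 1. layout-aware detection
  const texts = await stages.ocr(image, layout);     // 2. OCR with context
  const clean = await stages.inpaint(image, layout); // 3. remove original text
  return stages.typeset(clean, layout,               // 4. re-typeset translation
    await stages.translate(texts));
}
```

Notice that the browser's only jobs left are extracting the image and displaying the result; everything hard happens where you control the environment.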
Why This Matters
Many developers underestimate this space because:
it looks like a simple translation problem
But in reality, it’s a:
computer vision problem
layout reconstruction problem
UX problem
Final Thoughts
Chrome extensions are great for quick experiments and lightweight use cases.
But when it comes to manga translation:
the constraints of the browser environment become the bottleneck
If you’re building in this space, the question isn’t:
“how do I translate this text?”
It’s:
“how do I understand and rebuild this page?”