DEV Community

Vin Xu

Why no text-to-speech tool can read Kindle Cloud Reader — and how I fixed it with OCR

I build a read-aloud Chrome extension. Early on, a user asked the obvious question: "Can it read my Kindle books?" Kindle has a web reader at read.amazon.com, so surely it's just another webpage, right?

It is not. And the reason turned out to be a genuinely interesting front-end rabbit hole.

The DOM returns garbage

The naive approach to any read-aloud tool is simple: grab the text from the DOM and send it to a speech engine. This works on the vast majority of the web.

On Kindle Cloud Reader it returns nonsense. Select a paragraph, copy it, paste it into a text editor — and you get a jumble of characters that don't match what's on screen.

The reason: Amazon renders book text using custom, obfuscated fonts. The character codes sitting in the DOM aren't the real letters — they're scrambled glyph indices, and only Amazon's font maps them back to the letters you actually see. The visible text is correct; the underlying text is deliberately not. It's an anti-scraping measure, and it breaks every TTS extension that reads the DOM.
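
To make the mismatch concrete, here is a toy model of it. The mapping table is invented for illustration and is not Amazon's actual font data; `obfuscatedFont`, `readDom`, and `render` are hypothetical names. The point is only that a DOM read sees the stored characters, while the screen shows whatever the font draws for them.

```typescript
// Toy model only: this table is invented, not taken from any real Kindle font.
// Left side: the character stored in the DOM. Right side: the glyph the font draws.
type GlyphMap = Record<string, string>;

const obfuscatedFont: GlyphMap = { q: "c", x: "a", j: "t", " ": " " };

// What textContent or copy/paste gives you: the stored characters, unchanged.
function readDom(storedText: string): string {
  return storedText;
}

// What the reader sees: each stored character drawn through the font.
// Characters the font does not cover are shown as U+FFFD.
function render(storedText: string, font: GlyphMap): string {
  return storedText
    .split("")
    .map((ch) => font[ch] ?? "\uFFFD")
    .join("");
}

const stored = "qxj";
const fromDom = readDom(stored); // "qxj": what a DOM-based TTS tool would speak
const onScreen = render(stored, obfuscatedFont); // "cat": what is actually displayed
```

A speech engine fed `fromDom` would read out noise, even though the page looks fine to a sighted reader.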

So the real text lives in exactly one place: the pixels on screen.

If the DOM lies, read the render

The fix follows directly from the problem. The only faithful representation of the text is what's rendered — so instead of reading the DOM, read the rendered page, with OCR.

The pipeline ends up looking like this:

  1. Capture the rendered reader area as an image.

  2. Run OCR on it to recover the actual text.

  3. Send that text to the TTS engine.

  4. Auto-advance to the next page and repeat, so it reads continuously.
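
The four steps above can be sketched as a loop. This is a minimal outline, not the extension's actual code: `ReaderSteps`, `readAloud`, and the four step functions are hypothetical names, and the caller is assumed to supply real implementations (a screenshot call, an OCR call, a speech call, and a page-turn call).

```typescript
// Hypothetical interface: each function is a placeholder the caller provides.
interface ReaderSteps<Image> {
  capture: () => Promise<Image>; // step 1: image of the rendered reader area
  recognize: (image: Image) => Promise<string>; // step 2: OCR on that image
  speak: (text: string) => Promise<void>; // step 3: resolves when speech ends
  nextPage: () => Promise<boolean>; // step 4: false when there is no next page
}

// Reads pages in order until nextPage() reports the end or maxPages is reached.
// Returns the number of pages processed.
async function readAloud<Image>(
  steps: ReaderSteps<Image>,
  maxPages = 1000,
): Promise<number> {
  let pagesRead = 0;
  while (pagesRead < maxPages) {
    const image = await steps.capture();
    const text = (await steps.recognize(image)).trim();
    // Blank pages (or pages where OCR found nothing) are skipped, not spoken.
    if (text.length > 0) {
      await steps.speak(text);
    }
    pagesRead += 1;
    const advanced = await steps.nextPage();
    if (!advanced) break;
  }
  return pagesRead;
}
```

Passing the steps in as functions keeps the loop independent of any particular capture or OCR library, which also makes it easy to exercise with fakes.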

A few years ago "run OCR inside a browser extension" would've been a non-starter. Today it's a WASM build of Tesseract running in an offscreen document — no server round-trip, and the page image never leaves the machine.

The parts that were harder than expected

  • OCR quality vs. speed. OCR is far heavier than a textContent read: you trade a few milliseconds for a few hundred. Caching recognized pages and pre-processing the image (contrast, scaling) before OCR made the difference between "usable" and "painful."

  • Page boundaries. A book isn't one long scroll; it's paginated and virtualized. Reading continuously meant detecting when a page finished, flipping to the next programmatically, then re-running the capture.

  • Layout noise. Headers, page numbers, and footnotes get OCR'd too. You need light heuristics to drop them so the voice doesn't read "page 214" in the middle of a sentence.
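
Three of the ideas above (contrast pre-processing, caching recognized pages, and dropping layout noise) can be sketched in a few small functions. These are illustrative assumptions rather than the extension's real heuristics: `stretchContrast`, `withCache`, `dropLayoutNoise`, the page-number pattern, and the choice of cache key are all hypothetical. The pixel layout assumed is RGBA bytes, as in a canvas `ImageData.data` array.

```typescript
// Pre-processing: grayscale plus a linear contrast stretch, so the darkest
// pixel becomes 0 and the lightest becomes 255. Input is assumed to be RGBA.
function stretchContrast(rgba: Uint8ClampedArray): Uint8ClampedArray {
  const gray: number[] = [];
  let min = 255;
  let max = 0;
  for (let i = 0; i < rgba.length; i += 4) {
    // Rec. 601 luma weights for RGB to grayscale.
    const g = 0.299 * rgba[i] + 0.587 * rgba[i + 1] + 0.114 * rgba[i + 2];
    gray.push(g);
    if (g < min) min = g;
    if (g > max) max = g;
  }
  const range = max - min || 1; // a flat image would otherwise divide by zero
  const out = new Uint8ClampedArray(rgba.length);
  for (let p = 0; p < gray.length; p++) {
    const v = ((gray[p] - min) / range) * 255;
    out[p * 4] = out[p * 4 + 1] = out[p * 4 + 2] = v;
    out[p * 4 + 3] = rgba[p * 4 + 3]; // alpha is copied unchanged
  }
  return out;
}

// Caching: wrap an OCR function so each page key is recognized at most once.
// keyOf is a caller-supplied assumption (for example, a hash of the pixels).
// The cache is unbounded and also remembers failures; a real one would not.
function withCache<Image>(
  recognize: (image: Image) => Promise<string>,
  keyOf: (image: Image) => string,
): (image: Image) => Promise<string> {
  const seen = new Map<string, Promise<string>>();
  return (image: Image) => {
    const key = keyOf(image);
    let hit = seen.get(key);
    if (hit === undefined) {
      hit = recognize(image);
      seen.set(key, hit);
    }
    return hit;
  };
}

// Layout noise: drop lines that are only a page number ("214", "Page 214")
// or that match a known running header, then join the rest for speech.
const PAGE_NUMBER = /^\s*(page\s+)?\d{1,4}\s*$/i;

function dropLayoutNoise(ocrText: string, runningHeaders: string[] = []): string {
  const headers = runningHeaders.map((h) => h.trim().toLowerCase());
  return ocrText
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .filter((line) => !PAGE_NUMBER.test(line))
    .filter((line) => headers.indexOf(line.toLowerCase()) === -1)
    .join(" ");
}
```

A text-only filter like this will also drop a legitimate line that happens to be just a number, so in practice it would likely be combined with position information from the OCR result (lines at the very top or bottom of the page).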

A note on what this is (and isn't)

Worth being explicit, because it matters: this is an accessibility technique. It reads aloud a book you've already bought and are already viewing — the same thing your phone's screen reader does with on-screen content. It doesn't decrypt anything, doesn't export the book to a file, and doesn't produce a copy you could share. The book stays locked inside Kindle; all that leaves is audio of the page in front of you. If you can read it with your eyes, this lets you read it with your ears.

The takeaway

The general lesson outlived the specific feature for me: when the DOM lies, read the pixels. Obfuscated fonts, canvas-rendered text, gnarly SPA shadow DOMs — sometimes the faithful representation of the content isn't in the markup at all, and OCR-on-the-render is a surprisingly practical fallback now that it runs client-side.

I built this into CastReader. If you want to see it reading a Kindle page aloud, it's at castreader.com — a Chrome extension, free to start, 40+ languages.

Happy to get into the OCR pre-processing or page-advance details in the comments.
