Why Arabic text comes out backwards when you extract it from a PDF (and how to fix it)

#webdev #i18n #pdf #unicode

If you've ever built a feature that extracts text from PDFs, an Arabic-speaking user has probably filed this bug: "the words come out in reverse order." Not the letters — the words. Every line reads last-word-first.

I spent the better part of a year fixing this class of bugs while building Confileo, a free PDF toolkit with first-class Arabic support. Here's what's actually going on, because almost every explanation online is wrong or incomplete.

The four distinct failure modes

People say "Arabic breaks" as if it's one bug. It's four:

1. Visual vs logical order (the reversed-words bug)

A PDF doesn't store text the way a Word file does — it stores positioned glyph runs: "paint these shapes at these coordinates." For left-to-right scripts, the paint order happens to match the reading order, so naive extraction works by accident.

Arabic is right-to-left. Many PDF generators emit the glyph runs in visual order — the order they appear on screen, left to right. A naive extractor concatenates the runs as stored and produces every line word-reversed. The text was never "reversed" in the file; your extractor just assumed paint order == reading order.

Fix: reconstruct logical order using glyph positions + the Unicode Bidirectional Algorithm (UAX #9), not the content-stream order. Libraries like PyMuPDF already return text in logical order — a common mistake is "fixing" that output by reversing it again, which is how you get double-reversed text. Rule of thumb: never reverse Arabic yourself. If it looks backwards, your rendering layer lacks bidi support; the data is usually fine.
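To make the fix concrete, here is a minimal word-level sketch. It is a run-level heuristic, not a full UAX #9 implementation. It assumes your extractor already hands you one line's words with their left x-coordinate, and that characters inside each word are in logical order. The Word type and the line_to_logical name are made up for illustration. It guesses the base direction by majority vote over strong characters, reads an RTL line right-to-left by position, and keeps runs of Latin words in their left-to-right order.

```python
import unicodedata
from typing import NamedTuple, Optional


class Word(NamedTuple):
    text: str   # characters in logical order within the word (assumption)
    x0: float   # left edge of the word on the page


def strong_direction(text: str) -> Optional[str]:
    """Return 'R' or 'L' for the first strong bidi character, else None."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in ("R", "AL"):
            return "R"
        if cls == "L":
            return "L"
    return None  # digits, punctuation, whitespace only


def line_to_logical(words: list[Word]) -> str:
    """Rebuild reading order for one line from word positions (heuristic)."""
    dirs = [strong_direction(w.text) for w in words]
    # The PDF never stored a paragraph direction, so guess it by majority.
    is_rtl = dirs.count("R") > dirs.count("L")
    if not is_rtl:
        return " ".join(w.text for w in sorted(words, key=lambda w: w.x0))

    out: list[str] = []
    ltr_run: list[str] = []
    # In an RTL line the reader starts at the rightmost word.
    for w in sorted(words, key=lambda w: w.x0, reverse=True):
        if strong_direction(w.text) == "L":
            ltr_run.append(w.text)        # collected right-to-left
        else:
            out.extend(reversed(ltr_run))  # restore left-to-right order
            ltr_run = []
            out.append(w.text)
    out.extend(reversed(ltr_run))
    return " ".join(out)
```

This handles the year-range and brand-name cases, but a number sandwiched between Latin words will split the Latin run, which is exactly where a real bidi pass earns its keep.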

2. Disconnected letters (the ransom-note bug)

Arabic letters are contextual: ع renders differently in initial, medial, final and isolated positions, and letters join. That joining is applied at render time by a shaping engine (HarfBuzz being the standard).

If any step of your pipeline round-trips text through a non-shaping renderer — a canvas library, a barebones PDF writer, an image caption filter — you get isolated forms: م ر ح ب ا instead of مرحبا. (Fun fact: ffmpeg burns subtitles correctly only because libass links HarfBuzz + FriBidi. Strip those and Arabic subtitles shatter.)
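One way to see that joining lives in the renderer and not in the text: Unicode stores one code point per letter, and the separate code points for each contextual shape exist only as legacy compatibility "presentation forms". The sketch below uses the standard library to show that all four shapes of ع fold back to the same letter, and to flag text that some earlier step has pre-shaped. The helper names are hypothetical.

```python
import unicodedata

# Legacy presentation-form code points for the four shapes of ain (U+0639).
AIN_FORMS = {
    "isolated": "\uFEC9",
    "final": "\uFECA",
    "initial": "\uFECB",
    "medial": "\uFECC",
}


def base_letters(text: str) -> str:
    """Fold pre-shaped presentation forms back to plain letters via NFKC.

    Note: NFKC also folds other compatibility characters (Latin ligatures,
    full-width forms), so apply it only where that is acceptable.
    """
    return unicodedata.normalize("NFKC", text)


def has_presentation_forms(text: str) -> bool:
    """True if text contains Arabic Presentation Forms-A or Forms-B code points."""
    return any(
        "\uFB50" <= ch <= "\uFDFF" or "\uFE70" <= ch <= "\uFEFC"
        for ch in text
    )


for position, shaped in AIN_FORMS.items():
    # Every contextual shape maps back to the same stored letter.
    print(position, hex(ord(base_letters(shaped))))
```

If has_presentation_forms returns True on text coming out of your pipeline, something upstream baked shapes into the string, and search, copy-paste and re-shaping downstream may all behave oddly.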

Fix: only render Arabic through stacks that do real shaping. If you're generating PDFs headlessly, Chromium/Puppeteer shape correctly; many lightweight PDF writers do not.

3. Font substitution (the hollow-boxes bug)

Your server converts a document referencing a font it doesn't have. The substitution fallback has no Arabic glyphs → tofu boxes. This is purely an ops problem: conversion servers get provisioned with Latin fonts only.

Fix: install a proper Arabic set (Noto Naskh Arabic, Amiri, Cairo) on every box that renders documents, and configure fontconfig so substitution prefers them.
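A small preflight check can catch a badly provisioned box before users do. This is a sketch under assumptions: it relies on fontconfig's fc-list being on the PATH, it treats the three families named above as the required set, and the function names are invented for illustration. The parsing is kept separate from the subprocess call so it can be tested without fontconfig installed.

```python
import subprocess

# Families from the fix above; adjust to whatever your documents reference.
REQUIRED = ("Noto Naskh Arabic", "Amiri", "Cairo")


def parse_families(fc_list_output: str) -> set[str]:
    """Collect family names from `fc-list ... family` output.

    Each line may hold several comma-separated names for one font.
    """
    families = set()
    for line in fc_list_output.splitlines():
        for name in line.split(","):
            name = name.strip()
            if name:
                families.add(name)
    return families


def missing_fonts(fc_list_output: str, required=REQUIRED) -> list[str]:
    """Return the required families that do not appear in the output."""
    installed = parse_families(fc_list_output)
    return [family for family in required if family not in installed]


def check_host() -> list[str]:
    """Ask fontconfig for fonts that declare Arabic coverage on this machine."""
    result = subprocess.run(
        ["fc-list", ":lang=ar", "family"],
        capture_output=True,
        text=True,
        check=True,
    )
    return missing_fonts(result.stdout)
```

Running check_host() in a container health check or CI step and failing on a non-empty list is usually enough to stop tofu from reaching production.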

4. Scanned PDFs (the empty-output bug)

A scanned PDF has no text layer at all — each page is a bitmap. Extraction returns nothing, or worse, your pipeline silently emits an "editable" file full of images. The only real answer is OCR, and generic OCR models trained mostly on Latin do poorly on connected Arabic script; use one with genuine Arabic training (PaddleOCR and Tesseract's ara traineddata are workable starting points).
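Detecting this case up front is cheap, so the pipeline can route pages to OCR instead of silently emitting nothing. The sketch below is one possible heuristic, not a definitive test: a page with almost no extractable characters but at least one image is treated as a scan. The 20-character threshold and the function names are assumptions, and the PyMuPDF wrapper is imported lazily so the decision logic runs without it.

```python
def page_needs_ocr(text: str, image_count: int, min_chars: int = 20) -> bool:
    """Heuristic: nearly no text layer plus at least one image means a scan."""
    visible = "".join(text.split())  # ignore whitespace-only text layers
    return len(visible) < min_chars and image_count > 0


def pages_needing_ocr(path: str) -> list[int]:
    """Return 0-based page numbers that look like scans (requires PyMuPDF)."""
    import fitz  # PyMuPDF; imported here so page_needs_ocr works without it

    flagged = []
    with fitz.open(path) as doc:
        for page in doc:
            text = page.get_text()
            image_count = len(page.get_images())
            if page_needs_ocr(text, image_count):
                flagged.append(page.number)
    return flagged
```

A blank page (no text, no images) is deliberately not flagged, and a scan that carries a poor-quality hidden OCR layer will slip through, so treat the result as a routing hint.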

Numbers and mixed text: the boss level

Dates like 2024/05/13 inside an RTL sentence get special handling in the bidi algorithm: digits are weak characters that keep their left-to-right internal order inside an RTL context, while the words around them still run right-to-left. Get the algorithm wrong and "من 2020 إلى 2024" turns into "من 2024 إلى 2020", silently corrupting the meaning. This is why you implement UAX #9 (or use a library that does) instead of hand-rolling direction logic.
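You can inspect the raw inputs to the algorithm with nothing but the standard library. The sketch below (bidi_runs is a hypothetical helper name) compresses a string into runs of Unicode bidi classes. Arabic letters come out as AL, ASCII digits as EN (European number), Arabic-Indic digits as AN, and the slash as CS (common separator). Rule W4 of UAX #9 folds a single separator between two numbers of the same type into the number, which is why a date travels through the text as one unit.

```python
import unicodedata
from itertools import groupby


def bidi_runs(text: str) -> list[tuple[str, int]]:
    """Return (bidi_class, run_length) pairs for consecutive characters."""
    classes = (unicodedata.bidirectional(ch) for ch in text)
    return [(cls, len(list(group))) for cls, group in groupby(classes)]


# A date: three digit runs held together by common separators.
print(bidi_runs("2024/05/13"))

# Two Arabic letters, a space, then a European-digit year.
print(bidi_runs("\u0645\u0646 2020"))
```

This only shows the character classes the algorithm starts from. Resolving them into embedding levels and a display order is the part you should leave to a UAX #9 implementation.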

Try it / steal the approach

I've put all of this into production converters you can poke at for free — Arabic Word → PDF (shaping + bundled fonts) and Arabic PDF → Word (logical-order extraction). No signup needed, so they're easy to use as a reference for what correct output should look like when testing your own pipeline.

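If you're dealing with the same class of bugs in Farsi, it's the same four failure modes and the same fixes. Hebrew shares three of them: its letters don't join, so the disconnected-letters bug doesn't apply, but ordering, fonts and OCR bite the same way. Questions welcome in the comments; I've probably hit whatever edge case you're stuck on. 🐪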

Top comments (2)

i18nagent • Jul 6

Great breakdown — the visual-vs-logical-order point is the one almost everyone misses, because for LTR scripts paint order and reading order coincide by accident and the bug never shows up until Arabic. The part that makes it genuinely hard: reconstructing logical order isn't just "reverse the runs," it's re-running the Unicode Bidi Algorithm (UAX #9) over the positioned glyphs — and the algorithm needs a base paragraph direction the PDF never stored, so you're inferring it from the script of the runs. And the moment a line mixes Arabic with an English brand name or a number (128 ريال), a naive reversal flips the Latin and digits too and you've turned one bug into two. Did you end up running a full bidi pass or approximating with run-level heuristics? That mixed-direction case is usually where the heuristics fall apart.

Franco Ortiz • Jul 10

A year of bidi bugs shows: this is the most complete explanation of the reversal problem I've seen outside the Unicode spec itself.

Adjacent thought: with Arabic-first users, is Confileo's own UI localized? I build i1n (i1n.ai), AI translation with context plus CI checks on locale files, RTL languages included. Different layer than PDF extraction, but for a product whose whole pitch is first-class Arabic support, an Arabic UI closes the loop. Free tier if it's on the roadmap.