If you've ever built a feature that extracts text from PDFs, an Arabic-speaking user has probably filed this bug: "the words come out in reverse order." Not the letters — the words. Every line reads last-word-first.
I spent the better part of a year fixing this class of bugs while building Confileo, a free PDF toolkit with first-class Arabic support. Here's what's actually going on, because almost every explanation online is wrong or incomplete.
The four distinct failure modes
People say "Arabic breaks" as if it's one bug. It's four:
1. Visual vs logical order (the reversed-words bug)
A PDF doesn't store text the way a Word file does — it stores positioned glyph runs: "paint these shapes at these coordinates." For left-to-right scripts, the paint order happens to match the reading order, so naive extraction works by accident.
Arabic is right-to-left. Many PDF generators emit the glyph runs in visual order — the order they appear on screen, left to right. A naive extractor concatenates the runs as stored and produces every line word-reversed. The text was never "reversed" in the file; your extractor just assumed paint order == reading order.
Fix: reconstruct logical order from glyph positions plus the Unicode Bidirectional Algorithm (UAX #9), not from content-stream order. Some libraries, PyMuPDF among them, often already return Arabic text in logical order; a common mistake is "fixing" that output by reversing it again, which is how you get double-reversed text. Rule of thumb: never blindly reverse Arabic yourself. Inspect the code points first: if the extracted string is in logical order but looks backwards on screen, your rendering layer lacks bidi support and the data is fine.
2. Disconnected letters (the ransom-note bug)
Arabic letters are contextual: ع renders differently in initial, medial, final and isolated positions, and letters join. That joining is applied at render time by a shaping engine (HarfBuzz being the standard).
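Unicode happens to carry legacy "presentation form" code points for each contextual shape, which makes the four forms easy to inspect. The sketch below is only an illustration: properly stored text contains just the base letter (U+0639 for ع), and the shaping engine picks the glyph at render time. The helper name base_letter is mine.

```python
import unicodedata

AIN = "\u0639"  # the base letter as it should be stored in text

# Legacy presentation-form code points, one per contextual shape.
AIN_FORMS = {
    "isolated": "\uFEC9",
    "final": "\uFECA",
    "initial": "\uFECB",
    "medial": "\uFECC",
}


def base_letter(ch):
    """Fold a presentation form back to its base letter via NFKC."""
    return unicodedata.normalize("NFKC", ch)


for position, glyph in AIN_FORMS.items():
    # Each form has its own name but normalizes to the same base letter.
    print(position, unicodedata.name(glyph), base_letter(glyph) == AIN)
```

If extracted text turns out to contain code points from these presentation-form blocks, NFKC normalization is one way to fold them back to base letters before searching or indexing.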
If any step of your pipeline round-trips text through a non-shaping renderer — a canvas library, a barebones PDF writer, an image caption filter — you get isolated forms: م ر ح ب ا instead of مرحبا. (Fun fact: ffmpeg burns subtitles correctly only because libass links HarfBuzz + FriBidi. Strip those and Arabic subtitles shatter.)
Fix: only render Arabic through stacks that do real shaping. If you're generating PDFs headlessly, Chromium/Puppeteer shape correctly; many lightweight PDF writers do not.
3. Font substitution (the hollow-boxes bug)
Your server converts a document referencing a font it doesn't have. The substitution fallback has no Arabic glyphs → tofu boxes. This is purely an ops problem: conversion servers get provisioned with Latin fonts only.
Fix: install a proper Arabic set (Noto Naskh Arabic, Amiri, Cairo) on every box that renders documents, and configure fontconfig so substitution prefers them.
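A quick way to catch this before users do is a startup check on each rendering box. The sketch below assumes a Linux host with fontconfig installed and shells out to fc-list with an Arabic language filter; the function names are hypothetical, and the check returns an empty list when fc-list is not on the PATH.

```python
import shutil
import subprocess


def parse_fc_list(output):
    """Turn `fc-list ... family` output into a sorted list of family names.

    fc-list prints one font per line and may list several comma-separated
    names for the same family.
    """
    families = set()
    for line in output.splitlines():
        for name in line.split(","):
            name = name.strip()
            if name:
                families.add(name)
    return sorted(families)


def arabic_font_families():
    """Families that fontconfig reports as covering Arabic ([] if unknown)."""
    if shutil.which("fc-list") is None:
        return []  # fontconfig tools not installed on this host
    proc = subprocess.run(
        ["fc-list", ":lang=ar", "family"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_fc_list(proc.stdout)
```

Calling arabic_font_families() at service start and failing loudly on an empty result turns silent tofu into a deployment error.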
4. Scanned PDFs (the empty-output bug)
A scanned PDF has no text layer at all — each page is a bitmap. Extraction returns nothing, or worse, your pipeline silently emits an "editable" file full of images. The only real answer is OCR, and generic OCR models trained mostly on Latin do poorly on connected Arabic script; use one with genuine Arabic training (PaddleOCR and Tesseract's ara traineddata are workable starting points).
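To avoid the silent-failure case, it helps to classify pages before conversion and route image-only pages to OCR. This is a rough sketch: the 20-character threshold is an arbitrary assumption, the function names are mine, and the second helper assumes PyMuPDF is installed (it is imported lazily so the classifier itself has no dependencies).

```python
def pages_needing_ocr(page_stats, min_chars=20):
    """Return 0-based indexes of pages that look like scans.

    page_stats: iterable of (extracted_text, image_count), one per page.
    A page is flagged when it contains images but almost no extractable text.
    """
    flagged = []
    for index, (text, image_count) in enumerate(page_stats):
        if image_count > 0 and len(text.strip()) < min_chars:
            flagged.append(index)
    return flagged


def stats_from_pdf(path):
    """Collect (text, image_count) per page using PyMuPDF."""
    import fitz  # PyMuPDF; assumed to be installed

    with fitz.open(path) as doc:
        return [(page.get_text(), len(page.get_images())) for page in doc]
```

With both pieces, pages_needing_ocr(stats_from_pdf("input.pdf")) gives the list of pages to send through OCR, where "input.pdf" is a placeholder path.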
Numbers and mixed text: the boss level
Dates like 2024/05/13 inside an RTL sentence follow their own bidi rules: digits form LTR runs inside the RTL context, and separators such as "/" take their direction from the digits around them. Get the algorithm wrong and "من 2020 إلى 2024" turns into "من 2024 إلى 2020", silently corrupting the meaning. This is why you implement UAX #9 (or use a library that does) instead of hand-rolling direction logic.
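You can see why numbers are special by looking at the bidi classes UAX #9 starts from. The sketch below only groups characters by class using the standard library; it does not run the algorithm itself, and the function name bidi_runs is mine.

```python
import unicodedata
from itertools import groupby


def bidi_runs(text):
    """Group consecutive characters that share a Unicode bidi class."""
    return [
        (bidi_class, "".join(chars))
        for bidi_class, chars in groupby(text, key=unicodedata.bidirectional)
    ]


# AL = Arabic letter, EN = European number, CS = common separator, WS = whitespace
for bidi_class, run in bidi_runs("من 2020 إلى 2024"):
    print(bidi_class, repr(run))

for bidi_class, run in bidi_runs("2024/05/13"):
    print(bidi_class, repr(run))
```

Digits are a weak class (EN) rather than strong RTL or LTR, and the slash is a separator (CS) whose direction is resolved from its neighbours. That context-dependent resolution is exactly the part hand-rolled direction logic tends to get wrong.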
Try it / steal the approach
I've put all of this into production converters you can poke at for free — Arabic Word → PDF (shaping + bundled fonts) and Arabic PDF → Word (logical-order extraction). No signup needed, so they're easy to use as a reference for what correct output should look like when testing your own pipeline.
If you're dealing with the same class of bugs in Hebrew or Farsi, the same failure modes and fixes apply (Hebrew letters don't join, so it is spared the disconnected-letters bug). Questions welcome in the comments; I've probably hit whatever edge case you're stuck on. 🐪