
Hani Amro


Arabic OCR with an API: Make Scanned Arabic PDFs Searchable (Python)

If you've ever tried to extract text from a scanned Arabic document, you already know the pain. Most OCR tooling is built English-first. Arabic adds three problems on top:

  1. Right-to-left (RTL) text that breaks naive layout assumptions.
  2. Connected letters (cursive joining, plus ligatures such as lam-alef): the same letter changes shape depending on its position in the word.
  3. Diacritics and a different numeral set that generic models drop or mangle.

The result: you run a scanned Arabic contract, invoice, or government form through a typical "PDF to text" tool and get back garbage — reversed words, missing letters, or nothing at all.
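The three properties listed above are visible in Unicode itself, which may help explain why English-first tooling stumbles. What follows is a small illustration using only Python's standard unicodedata module; the variable names are made up for this example, and none of it is part of the OCR API.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the code point normally stored in text


def bidi_class(ch: str) -> str:
    """Return the Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)


# 1. RTL: Arabic letters have bidi class "AL", Latin letters have "L".
beh_direction = bidi_class(BEH)
latin_direction = bidi_class("a")

# 2. Positional shapes: Unicode has compatibility code points for the
#    isolated, final, initial and medial shapes of BEH. NFKC folds all
#    four back to the single base letter.
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]
form_names = [unicodedata.name(c) for c in BEH_FORMS]
all_fold_to_beh = all(unicodedata.normalize("NFKC", c) == BEH for c in BEH_FORMS)

# 3a. Diacritics are separate combining marks (category "Mn") that sit on
#     top of a base letter, so they are easy to drop or misattach.
FATHA = "\u064E"
fatha_category = unicodedata.category(FATHA)

# 3b. Arabic-Indic digits are distinct code points from ASCII 0-9.
#     Python's int() accepts any Unicode decimal digits.
arabic_indic_value = int("\u0663\u0664\u0665")

if __name__ == "__main__":
    print(beh_direction, latin_direction)
    for name in form_names:
        print(name)
    print(all_fold_to_beh, fatha_category, arabic_indic_value)
```

An OCR engine has to get all of these right at once: reading order, which shape maps to which letter, which mark belongs to which base, and which digit set is on the page.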

This post shows a practical way to turn a scanned Arabic PDF into a searchable PDF (a real, selectable text layer underneath the original page image) with a single API call — no ML pipeline to build, no GPU, no model weights to host. Code is in Python, cURL, and JavaScript.


What "searchable PDF" actually means

There are two different things people call "OCR":

  • Text extraction — you get back a string of the recognized text.
  • Searchable PDF — you get back a PDF that looks identical to the scan, but now has an invisible text layer, so Ctrl+F, copy-paste, and indexing all work.

The second is what most real workflows need: you keep the original document exactly as scanned (important for legal/official docs), but it becomes searchable and accessible. That's what we'll produce here.

The approach

We'll use the PDF Tools API /ocr endpoint. Under the hood it runs Tesseract with the Arabic (ara) and English (eng) language models and rebuilds the PDF with an invisible OCR text layer. The relevant detail for us: you can pass lang=eng+ara to recognize mixed Arabic/English documents in one pass, which is what most real MENA paperwork actually is (Arabic body text, English brand names, Western 0-9 digits).

You'll need a free API key from the API's RapidAPI listing (the free tier is 1,000 requests/month, no card). Then:

Python

import requests

API_KEY = "YOUR_RAPIDAPI_KEY"
HOST = "pdf-tools-api2.p.rapidapi.com"

with open("arabic_scan.pdf", "rb") as f:
    resp = requests.post(
        f"https://{HOST}/ocr",
        headers={"X-RapidAPI-Key": API_KEY, "X-RapidAPI-Host": HOST},
        files={"file": ("arabic_scan.pdf", f, "application/pdf")},
        data={"lang": "eng+ara"},   # mixed Arabic + English
    )
resp.raise_for_status()

with open("searchable.pdf", "wb") as out:
    out.write(resp.content)

print("Done — searchable.pdf now has a real text layer.")

Open searchable.pdf and try selecting the Arabic text or searching it. It's there now.
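Before writing the response to disk, it can be worth confirming that the body really is a PDF rather than, say, an error payload. The sketch below checks for the standard PDF header marker. The helper name looks_like_pdf is hypothetical, and how this particular API reports errors is an assumption not covered in this post.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header marker.

    Every PDF file begins with the bytes "%PDF-" followed by a version
    number, so this is a cheap sanity check, not a full validation.
    """
    return data.startswith(b"%PDF-")


if __name__ == "__main__":
    # Hypothetical payloads standing in for resp.content.
    print(looks_like_pdf(b"%PDF-1.7\n..."))          # a PDF-like body
    print(looks_like_pdf(b'{"error": "bad file"}'))  # a JSON-like body
```

In the Python example above, this would slot in after resp.raise_for_status(), for instance by raising a ValueError when looks_like_pdf(resp.content) is false.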

cURL

curl -X POST "https://pdf-tools-api2.p.rapidapi.com/ocr" \
  -H "X-RapidAPI-Key: YOUR_RAPIDAPI_KEY" \
  -H "X-RapidAPI-Host: pdf-tools-api2.p.rapidapi.com" \
  -F "file=@arabic_scan.pdf" \
  -F "lang=eng+ara" \
  --output searchable.pdf

JavaScript (Node / browser)

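// Browser: fileInput is an <input type="file"> element holding the scanned PDF.
// Run this inside an async function or an ES module (it uses await).
const form = new FormData();
form.append("file", fileInput.files[0]);
form.append("lang", "eng+ara");

const res = await fetch("https://pdf-tools-api2.p.rapidapi.com/ocr", {
  method: "POST",
  headers: {
    "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",
    "X-RapidAPI-Host": "pdf-tools-api2.p.rapidapi.com",
  },
  body: form,
});
// Fail loudly on HTTP errors instead of saving an error body as a PDF.
if (!res.ok) throw new Error(`OCR request failed: HTTP ${res.status}`);
const blob = await res.blob(); // application/pdf, now searchable

// Browser: download the searchable PDF
const url = URL.createObjectURL(blob);
const a = Object.assign(document.createElement("a"), { href: url, download: "searchable.pdf" });
a.click();
URL.revokeObjectURL(url);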

Just need the raw text instead of a searchable PDF?

If you only want the extracted string (for a database, a search index, an LLM pipeline), run the searchable PDF through /extract-text:

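# Reuses requests, API_KEY and HOST from the Python example above.
with open("searchable.pdf", "rb") as f:  # context manager closes the file handle
    resp = requests.post(
        f"https://{HOST}/extract-text",
        headers={"X-RapidAPI-Key": API_KEY, "X-RapidAPI-Host": HOST},
        files={"file": ("searchable.pdf", f, "application/pdf")},
    )
resp.raise_for_status()  # stop on HTTP errors before parsing JSON
print(resp.json()["text"])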
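If the extracted string is headed for a search index, some light normalization usually makes matching more forgiving. The sketch below is one possible approach using only the standard library; it is not something the API does for you, the function name normalize_for_search is made up for this example, and whether stripping diacritics suits your use case is a judgment call.

```python
import re
import unicodedata

TATWEEL = "\u0640"  # elongation character used for justification


def _fold_digit(ch: str) -> str:
    """Map any Unicode decimal digit (e.g. Arabic-Indic) to ASCII 0-9."""
    value = unicodedata.decimal(ch, None)
    return ch if value is None else str(value)


def normalize_for_search(text: str) -> str:
    """Fold extracted Arabic text into a form that is easier to match."""
    # Fold presentation forms and ligatures back to base letters.
    text = unicodedata.normalize("NFKC", text)
    # Drop combining marks (short vowels, tanween, shadda and similar).
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Remove tatweel, which carries no meaning.
    text = text.replace(TATWEEL, "")
    # Unify digit sets so "2024" matches its Arabic-Indic spelling.
    text = "".join(_fold_digit(ch) for ch in text)
    # Collapse whitespace left over from page layout.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    # "kitabun" with diacritics, followed by 2024 in Arabic-Indic digits.
    sample = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C  \u0662\u0660\u0662\u0664"
    print(normalize_for_search(sample))
```

Applying the same function to both the indexed text and the user's query keeps the two sides comparable.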

Tips for better Arabic OCR results

OCR quality depends mostly on the input scan, not the engine. To get clean output:

  • Scan at 300 DPI or higher. Below ~200 DPI, connected Arabic letters blur together.
  • Deskew crooked scans before sending. Even 2–3° of rotation hurts RTL recognition.
  • Use eng+ara, not ara alone, for any document that mixes Latin characters (almost all real-world ones do).
  • Keep it under 15 pages per request (split larger docs first — there's a /split endpoint).
  • Black-on-white beats colored backgrounds; if your scan is noisy, cleaning it up toward plain black-on-white is the biggest quality lever.
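Two of the tips above are simple arithmetic and can be checked before sending anything. A minimal sketch, with hypothetical helper names: effective_dpi estimates scan resolution from pixel width and physical page width, and page_ranges computes 1-based page spans of at most 15 pages. How those spans are passed to the /split endpoint is not shown, since its parameters are not described in this post.

```python
def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Estimate scan resolution: pixels across divided by inches across."""
    if page_width_inches <= 0:
        raise ValueError("page_width_inches must be positive")
    return pixel_width / page_width_inches


def page_ranges(total_pages: int, max_pages: int = 15) -> list[tuple[int, int]]:
    """Split 1..total_pages into inclusive (start, end) spans of <= max_pages."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


if __name__ == "__main__":
    A4_WIDTH_IN = 8.27  # A4 is 210 mm wide, about 8.27 inches
    print(round(effective_dpi(2480, A4_WIDTH_IN)))  # a 2480 px wide A4 scan
    print(round(effective_dpi(1240, A4_WIDTH_IN)))  # a 1240 px wide A4 scan
    print(page_ranges(40))
```

A 1240-pixel-wide A4 page works out to roughly 150 DPI, which falls under the ~200 DPI floor mentioned above, so it would be a candidate for rescanning.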

Honest limitations


This is Tesseract-based OCR, not a frontier vision model. It's excellent for printed Arabic (forms, contracts, books, invoices). It is not built for handwritten Arabic, heavily stylized calligraphy, or low-resolution phone photos — accuracy drops sharply there, same as every OCR engine. For clean printed scans it's genuinely good and, importantly, it's available — which is more than most PDF APIs can say for Arabic at all.

Why an API instead of self-hosting Tesseract

You can apt install tesseract-ocr-ara and wire up the PDF rebuild yourself. People do. But you then own:

  • installing and updating Tesseract + the Arabic language data,
  • the rasterize → OCR → re-embed-text-layer pipeline (the fiddly part),
  • font/encoding edge cases for the invisible RTL text layer,
  • scaling it without melting your server on a 15-page scan.
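For a sense of what the self-hosted route involves, here is a rough sketch of the rasterize, OCR, and re-assemble steps using command-line tools driven from Python. It assumes Poppler's pdftoppm and pdfunite plus Tesseract with the Arabic data are installed; it is not how the API is implemented, and the function names are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def rasterize_cmd(pdf_path: str, out_prefix: str, dpi: int = 300) -> list[str]:
    """pdftoppm (Poppler): one PNG per page, named <out_prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def ocr_cmd(image_path: str, out_base: str, langs: str = "eng+ara") -> list[str]:
    """Tesseract: the trailing "pdf" config writes <out_base>.pdf, which is
    the page image with an invisible text layer."""
    return ["tesseract", image_path, out_base, "-l", langs, "pdf"]


def merge_cmd(page_pdfs: list[str], out_pdf: str) -> list[str]:
    """pdfunite (Poppler): concatenate per-page PDFs into one file."""
    return ["pdfunite", *page_pdfs, out_pdf]


def make_searchable(pdf_path: str, workdir: str, out_pdf: str) -> None:
    """Run the three steps end to end; raises if any tool fails."""
    work = Path(workdir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf_path, str(work / "page")), check=True)

    page_pdfs = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(work.glob("page-*.png")):
        out_base = str(image.with_suffix(""))
        subprocess.run(ocr_cmd(str(image), out_base), check=True)
        page_pdfs.append(out_base + ".pdf")

    if not page_pdfs:
        raise RuntimeError("no pages were rasterized")
    if len(page_pdfs) == 1:
        shutil.copyfile(page_pdfs[0], out_pdf)  # nothing to merge
    else:
        subprocess.run(merge_cmd(page_pdfs, out_pdf), check=True)


if __name__ == "__main__":
    make_searchable("arabic_scan.pdf", "ocr_work", "searchable.pdf")
```

Even this short version leaves out cleanup, deskewing, timeouts, and concurrency limits, which is where most of the maintenance cost tends to come from.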

If Arabic OCR is core to your product, self-hosting is fine. If it's one feature among many, one HTTP call you can put in a spreadsheet beats a maintenance project.

Pricing, briefly

The API is flat per-request — one OCR call is one request, whether it's a 1-page or 15-page scan. No credit tables, no per-page billing (iLovePDF, for comparison, charges OCR per page in credits). Free tier is 1,000 requests/month, permanently, no card. The same key also does merge, split, compress, encrypt, HTML→PDF, Office→PDF, redaction, and table extraction — 26 endpoints total.
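Because billing is per request and each request should stay under about 15 pages, the number of OCR calls for a batch is easy to estimate. A minimal sketch using the figures quoted in this post; the function name is hypothetical, and it counts OCR calls only (any /split calls used to chunk large files would presumably count as requests too, which is an assumption).

```python
import math

FREE_TIER_REQUESTS = 1000   # requests per month, as stated in this post
MAX_PAGES_PER_REQUEST = 15  # per-request page guidance, as stated in this post


def ocr_requests_needed(page_counts: list[int]) -> int:
    """Number of /ocr calls needed if each call covers at most 15 pages."""
    return sum(math.ceil(pages / MAX_PAGES_PER_REQUEST) for pages in page_counts)


if __name__ == "__main__":
    # Hypothetical monthly batch: 200 contracts of 12 pages, 30 books of 40 pages.
    batch = [12] * 200 + [40] * 30
    needed = ocr_requests_needed(batch)
    print(needed, needed <= FREE_TIER_REQUESTS)
```

For that hypothetical batch the contracts take one call each and each 40-page book takes three, so the total stays well inside the free tier.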

Wrap-up

Arabic OCR has a reputation for being painful, and self-hosting it is. But for printed documents, turning a scanned Arabic PDF into a searchable one is now a single API call with lang=eng+ara. If you're digitizing Arabic archives, building a MENA document-management product, or just need Ctrl+F to work on a scanned contract, this gets you there in five minutes.

Your turn: what trips you up most with Arabic OCR — RTL layout, connected-letter ligatures, or diacritics getting dropped? And what are you digitizing: contracts, old books, or handwritten notes? Tell me in the comments. 👇

Try the Arabic OCR API free — 1,000 requests/month, no card

Built and maintained by a solo developer (based in Syria) who actually answers — questions welcome in the comments.

Top comments (1)

Hani Amro

Author here 👋 I built this because I couldn't find a single PDF tool that handled Arabic without mangling it — so it's the one I wish I'd had.

One tip that didn't make the post: use eng+ara even for "pure" Arabic docs — there's almost always a Latin number or brand name hiding in there that ara alone chokes on.

Got an Arabic doc that's been giving you grief — a contract, an old book, a form? Describe it in the replies and I'll tell you honestly whether this handles it. I actually read these.