If you've ever tried to extract text from a scanned Arabic document, you already know the pain. Most OCR tooling is built English-first. Arabic adds three problems on top:
- Right-to-left (RTL) text that breaks naive layout assumptions.
- Connected letters (ligatures) — the same letter changes shape depending on its position in the word.
- Diacritics and a different numeral set that generic models drop or mangle.
The result: you run a scanned Arabic contract, invoice, or government form through a typical "PDF to text" tool and get back garbage — reversed words, missing letters, or nothing at all.
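To make the first two problems concrete, here is a small sketch using only Python's standard `unicodedata` module. It takes one letter (BEH, chosen arbitrarily for illustration) and shows that Unicode defines four positional shapes for it, and that Arabic letters and Arabic-Indic digits carry different bidirectional classes than Latin text. It illustrates the script's behavior, not how any particular OCR engine works internally.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the abstract letter as stored in normal text

# Unicode keeps legacy "presentation form" code points for each positional shape.
shapes = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}

for position, glyph in shapes.items():
    # NFKC normalization folds every positional shape back to the base letter.
    folded = unicodedata.normalize("NFKC", glyph)
    print(position, unicodedata.name(glyph), folded == BEH)

# Bidi classes drive layout: Arabic letters are "AL", Latin letters are "L",
# and Arabic-Indic digits are "AN" (digits run left-to-right inside RTL text).
for ch in (BEH, "A", "\u0663"):
    print(unicodedata.name(ch), unicodedata.bidirectional(ch))
```

A tool that assumes one glyph per letter and one reading direction per line has to get all of this right before the recognized text is usable.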
This post shows a practical way to turn a scanned Arabic PDF into a searchable PDF (a real, selectable text layer underneath the original page image) with a single API call — no ML pipeline to build, no GPU, no model weights to host. Code is in Python, cURL, and JavaScript.
Contents
- What "searchable PDF" actually means
- The approach
- Tips for better Arabic OCR results
- Honest limitations
- Why an API instead of self-hosting Tesseract
- Pricing
- Wrap-up
What "searchable PDF" actually means
There are two different things people call "OCR":
- Text extraction: you get back a string of the recognized text.
- Searchable PDF: you get back a PDF that looks identical to the scan, but now has an invisible text layer, so Ctrl+F, copy-paste, and indexing all work.
The second is what most real workflows need: you keep the original document exactly as scanned (important for legal/official docs), but it becomes searchable and accessible. That's what we'll produce here.
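One quick way to tell the two kinds of PDF apart is to ask a PDF library for the text on each page: an image-only scan yields nothing, a searchable PDF yields real strings. The sketch below assumes the third-party `pypdf` package is installed (`pip install pypdf`); the helper names are made up for illustration and are not part of the API discussed in this post.

```python
def has_text_layer(page_texts, min_chars=1):
    """Return True if any page yields at least min_chars non-whitespace characters."""
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def pdf_page_texts(path):
    """Extract text per page with pypdf (assumption: pypdf is installed)."""
    from pypdf import PdfReader  # lazy import so has_text_layer has no dependencies

    return [page.extract_text() or "" for page in PdfReader(path).pages]


# An image-only scan yields empty strings; a searchable PDF yields real text.
print(has_text_layer(["", "  \n"]))        # False
print(has_text_layer(["", "عقد إيجار"]))   # True
```

On a real file this would be called as `has_text_layer(pdf_page_texts("arabic_scan.pdf"))`, once before OCR and once after.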
The approach
We'll use the PDF Tools API /ocr endpoint. Under the hood it runs Tesseract with the Arabic (ara) and English (eng) language models and rebuilds the PDF with an invisible OCR text layer. The relevant detail for us: you can pass lang=eng+ara to recognize mixed Arabic/English documents in one pass — which is what most real MENA paperwork actually is (Arabic body text, English brand names, Latin numbers).
You'll need a free API key from the API's RapidAPI listing (the free tier is 1,000 requests/month, no card). Then:
Python
import requests

API_KEY = "YOUR_RAPIDAPI_KEY"
HOST = "pdf-tools-api2.p.rapidapi.com"

# Upload the scan as multipart form data and ask for Arabic + English recognition
with open("arabic_scan.pdf", "rb") as f:
    resp = requests.post(
        f"https://{HOST}/ocr",
        headers={"X-RapidAPI-Key": API_KEY, "X-RapidAPI-Host": HOST},
        files={"file": ("arabic_scan.pdf", f, "application/pdf")},
        data={"lang": "eng+ara"},  # mixed Arabic + English
    )
resp.raise_for_status()

# The response body is the searchable PDF itself
with open("searchable.pdf", "wb") as out:
    out.write(resp.content)
print("Done: searchable.pdf now has a real text layer.")
Open searchable.pdf and try selecting the Arabic text or searching it. It's there now.
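Before writing the response to disk, it can be worth confirming that the body really is a PDF, in case the service ever answers with an error message instead. This is a defensive sketch, not something the API requires; `looks_like_pdf` and `save_pdf` are hypothetical helper names, and the sample error payload is invented.

```python
def looks_like_pdf(payload: bytes) -> bool:
    """True if the bytes begin with the %PDF- header that PDF files start with."""
    return payload.lstrip()[:5] == b"%PDF-"


def save_pdf(payload: bytes, path: str) -> None:
    """Write payload to path, refusing anything that is not a PDF."""
    if not looks_like_pdf(payload):
        preview = payload[:200].decode("utf-8", errors="replace")
        raise ValueError(f"Expected a PDF, got: {preview!r}")
    with open(path, "wb") as out:
        out.write(payload)


print(looks_like_pdf(b"%PDF-1.7\n%..."))              # True
print(looks_like_pdf(b'{"message": "Invalid key"}'))  # False
```

In the Python example this would replace the final `open(...)` block with `save_pdf(resp.content, "searchable.pdf")`.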
cURL
curl -X POST "https://pdf-tools-api2.p.rapidapi.com/ocr" \
  -H "X-RapidAPI-Key: YOUR_RAPIDAPI_KEY" \
  -H "X-RapidAPI-Host: pdf-tools-api2.p.rapidapi.com" \
  -F "file=@arabic_scan.pdf" \
  -F "lang=eng+ara" \
  --output searchable.pdf
JavaScript (Node / browser)
// Browser: fileInput is an <input type="file"> element holding the scanned PDF
const form = new FormData();
form.append("file", fileInput.files[0]);
form.append("lang", "eng+ara");

const res = await fetch("https://pdf-tools-api2.p.rapidapi.com/ocr", {
  method: "POST",
  headers: {
    "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",
    "X-RapidAPI-Host": "pdf-tools-api2.p.rapidapi.com",
  },
  body: form,
});
// Fail loudly on HTTP errors instead of saving an error body as a PDF
if (!res.ok) throw new Error(`OCR request failed: ${res.status}`);
const blob = await res.blob(); // application/pdf, now searchable

// Browser: download the searchable PDF
const url = URL.createObjectURL(blob);
const a = Object.assign(document.createElement("a"), { href: url, download: "searchable.pdf" });
a.click();
URL.revokeObjectURL(url);
Just need the raw text instead of a searchable PDF?
If you only want the extracted string (for a database, a search index, an LLM pipeline), run the searchable PDF through /extract-text:
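# Reuses API_KEY and HOST from the first Python example
with open("searchable.pdf", "rb") as f:  # close the file handle when done
    resp = requests.post(
        "https://pdf-tools-api2.p.rapidapi.com/extract-text",
        headers={"X-RapidAPI-Key": API_KEY, "X-RapidAPI-Host": HOST},
        files={"file": ("searchable.pdf", f, "application/pdf")},
    )
resp.raise_for_status()
print(resp.json()["text"])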
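If the extracted string is headed for a search index, a light normalization pass usually helps queries match regardless of diacritics or numeral style. The sketch below is one possible approach using only the standard library; `normalize_for_search` is a hypothetical helper, and whether to strip diacritics at all is a judgment call for your data (for fully vocalized texts you may want to keep them).

```python
import unicodedata

TATWEEL = "\u0640"  # elongation stroke used for justification, carries no meaning

# Map Arabic-Indic (U+0660..) and Extended Arabic-Indic (U+06F0..) digits to ASCII.
DIGITS = str.maketrans(
    "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669"
    "\u06F0\u06F1\u06F2\u06F3\u06F4\u06F5\u06F6\u06F7\u06F8\u06F9",
    "0123456789" * 2,
)


def normalize_for_search(text: str) -> str:
    # 1. Fold presentation forms and other compatibility variants.
    text = unicodedata.normalize("NFKC", text)
    # 2. Drop combining marks (category Mn), which covers the Arabic diacritics.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # 3. Remove tatweel and unify the digit sets.
    return text.replace(TATWEEL, "").translate(DIGITS)


print(normalize_for_search("مُحَمَّد ١٢٣"))  # prints the bare letters followed by 123
```

Applying the same function to both the indexed text and the user's query keeps the two sides comparable.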
Tips for better Arabic OCR results
OCR quality depends mostly on the input scan, not the engine. To get clean output:
- Scan at 300 DPI or higher. Below ~200 DPI, connected Arabic letters blur together.
- Deskew crooked scans before sending. Even 2–3° of rotation hurts RTL recognition.
- Use eng+ara, not ara alone, for any document that mixes Latin characters (almost all real-world ones do).
- Keep it under 15 pages per request (split larger docs first; there's a /split endpoint).
- Black-on-white beats colored backgrounds; if your scan is noisy, that's the biggest quality lever.
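Two of the tips above reduce to simple arithmetic you can run before uploading anything: estimating the effective DPI of a scan from its pixel width, and working out which page ranges fit under the 15-page limit. The helpers below are illustrative names, not part of the API, and the A4 width of 8.27 inches is used only as an example.

```python
def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Dots per inch implied by an image's pixel width and the paper's physical width."""
    return pixel_width / page_width_inches


def page_chunks(total_pages: int, max_pages: int = 15):
    """Return inclusive 1-based (first, last) page ranges of at most max_pages each."""
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


print(round(effective_dpi(1654, 8.27)))  # 200, below the 300 DPI recommended above
print(page_chunks(40))                   # [(1, 15), (16, 30), (31, 40)]
```

Each range from `page_chunks` corresponds to one piece to split off and send as its own OCR request.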
Honest limitations
Why an API instead of self-hosting Tesseract
You can apt install tesseract-ocr-ara and wire up the PDF rebuild yourself. People do. But you then own:
- installing and updating Tesseract + the Arabic language data,
- the rasterize → OCR → re-embed-text-layer pipeline (the fiddly part),
- font/encoding edge cases for the invisible RTL text layer,
- scaling it without melting your server on a 15-page scan.
If Arabic OCR is core to your product, self-hosting is fine. If it's one feature among many, one HTTP call you can put in a spreadsheet beats a maintenance project.
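For a sense of what the self-hosted route looks like, here is a rough sketch of the rasterize and OCR steps driven from Python. It assumes the `pdftoppm` (poppler-utils) and `tesseract` command-line tools plus the Arabic language data are installed; the function names are made up for illustration, and this is not a description of how the API is implemented. It also stops short of merging the per-page PDFs back into one file.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf_path: str, out_prefix: str, dpi: int = 300):
    # pdftoppm writes one PNG per page, named <out_prefix>-<page number>.png
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def ocr_cmd(image_path: str, out_base: str, langs: str = "eng+ara"):
    # The trailing "pdf" config asks Tesseract to write <out_base>.pdf
    # containing the page image plus an invisible text layer.
    return ["tesseract", image_path, out_base, "-l", langs, "pdf"]


def ocr_pdf_locally(pdf_path: str, workdir: str):
    """Rasterize a PDF and OCR each page; returns the per-page PDF paths."""
    work = Path(workdir)
    subprocess.run(rasterize_cmd(pdf_path, str(work / "page")), check=True)
    outputs = []
    for image in sorted(work.glob("page-*.png")):
        subprocess.run(ocr_cmd(str(image), str(image.with_suffix(""))), check=True)
        outputs.append(image.with_suffix(".pdf"))
    return outputs  # merging these into one document is a further step


print(" ".join(ocr_cmd("page-1.png", "page-1")))
```

Even in this small form, error handling, temp-file cleanup, merging, and concurrency are all left to you, which is the maintenance cost described above.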
Pricing, briefly
The API is flat per-request — one OCR call is one request, whether it's a 1-page or 15-page scan. No credit tables, no per-page billing (iLovePDF, for comparison, charges OCR per page in credits). Free tier is 1,000 requests/month, permanently, no card. The same key also does merge, split, compress, encrypt, HTML→PDF, Office→PDF, redaction, and table extraction — 26 endpoints total.
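To estimate how far the free tier goes, count requests rather than pages. The sketch below assumes that documents longer than 15 pages are split locally and that each resulting piece counts as one OCR request; if you split through the API instead, those split calls would presumably count as requests too. The batch of page counts is invented for the example.

```python
import math


def requests_needed(page_counts, max_pages_per_request=15):
    """OCR requests required when each request covers at most max_pages_per_request pages."""
    return sum(math.ceil(pages / max_pages_per_request) for pages in page_counts)


batch = [1, 15, 40, 6]            # pages per document (made-up batch)
used = requests_needed(batch)     # 1 + 1 + 3 + 1
print(used, "requests;", 1000 - used, "left")  # 6 requests; 994 left
```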
Wrap-up
Arabic OCR has a reputation for being painful, and self-hosting it is. But for printed documents, turning a scanned Arabic PDF into a searchable one is now a single API call with lang=eng+ara. If you're digitizing Arabic archives, building a MENA document-management product, or just need Ctrl+F to work on a scanned contract, this gets you there in five minutes.
Your turn: what trips you up most with Arabic OCR — RTL layout, connected-letter ligatures, or diacritics getting dropped? And what are you digitizing: contracts, old books, or handwritten notes? Tell me in the comments. 👇
Try the Arabic OCR API free — 1,000 requests/month, no card
Built and maintained by a solo developer (based in Syria) who actually answers — questions welcome in the comments.
Top comments (1)
Author here 👋 I built this because I couldn't find a single PDF tool that handled Arabic without mangling it — so it's the one I wish I'd had.
One tip that didn't make the post: use eng+ara even for "pure" Arabic docs. There's almost always a Latin number or brand name hiding in there that ara alone chokes on.
Got an Arabic doc that's been giving you grief (a contract, an old book, a form)? Describe it in the replies and I'll tell you honestly whether this handles it. I actually read these.