FLOW by Vestelon

Posted on Jun 19

Parsing Arabic PDF Bank Statements: What I Learned Building a Multi-Language Analyzer

#python #arabic #fintech #tutorial

When I started building FLOW — a bank statement PDF analyzer — I assumed Arabic would be one of the harder languages to support. I was right, but not for the reasons I expected.

The Right-to-Left Problem Isn't the Hard Part

Most developers assume Arabic PDF parsing is hard because of right-to-left text direction. In practice, modern PDF libraries (pdfplumber, PyMuPDF) handle RTL text reasonably well. The text extraction itself usually works.

The real challenge is positional logic.

Column Detection Breaks

In a typical Western bank statement, you have columns roughly like this:

Date | Description | Debit | Credit | Balance

The date is on the left, balance on the right. Column detection heuristics work because the layout matches left-to-right reading order.

In an Arabic statement from ENBD or Mashreq, the layout is mirrored:

Balance | Credit | Debit | Description | Date

If you try to apply the same column extraction logic, you get the balance where you expect the date, and vice versa. Your "date" column suddenly contains AED amounts, and your amount parsing fails silently.

Fix: Detect document direction first. We check the dominant text alignment in the header rows. If the majority of header cells are right-aligned, we mirror our column index mapping before extraction.
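Here is a rough sketch of that check, not FLOW's actual code. It assumes each header cell is already available as the x-bounds of its text plus the x-bounds of the column it sits in. The `HeaderCell` type, its field names, and `LTR_COLUMNS` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Logical column order for a left-to-right statement (matches the layout above).
LTR_COLUMNS = ["date", "description", "debit", "credit", "balance"]


@dataclass
class HeaderCell:
    text_x0: float  # left edge of the header text
    text_x1: float  # right edge of the header text
    col_x0: float   # left edge of the column
    col_x1: float   # right edge of the column


def is_right_aligned(cell: HeaderCell) -> bool:
    """True when the text sits closer to the column's right edge than its left."""
    left_gap = cell.text_x0 - cell.col_x0
    right_gap = cell.col_x1 - cell.text_x1
    return right_gap < left_gap


def column_mapping(header_cells: List[HeaderCell]) -> Dict[int, str]:
    """Map physical column index (0 = leftmost) to a logical field name."""
    right_aligned = sum(is_right_aligned(c) for c in header_cells)
    rtl = right_aligned > len(header_cells) / 2
    order = list(reversed(LTR_COLUMNS)) if rtl else LTR_COLUMNS
    return dict(enumerate(order))
```

One caveat: numeric columns are often right-aligned even in left-to-right statements, so alignment alone can misfire. Combining it with a script check on the header text is probably safer.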

Mixed-Direction Documents

UAE bank statements often contain both Arabic and English — Arabic for legal text and section headers, English for transaction descriptions and amounts. This creates mixed-direction paragraphs that confuse naive extraction.

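# Simplified direction detection
def detect_dominant_direction(page_text_blocks):
    # Accepts a list of text blocks (or a single string) and flattens it,
    # so the counts below run over characters rather than whole blocks.
    text = "".join(page_text_blocks)
    # Characters in the basic Arabic block (U+0600 to U+06FF)
    rtl_chars = sum(1 for c in text if '\u0600' <= c <= '\u06FF')
    # ASCII letters as a proxy for left-to-right text
    ltr_chars = sum(1 for c in text if c.isascii() and c.isalpha())
    return 'rtl' if rtl_chars > ltr_chars else 'ltr'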
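The range check above only covers the basic Arabic block. Text extracted from PDFs sometimes comes back as Arabic Presentation Forms (U+FB50 to U+FDFF and U+FE70 to U+FEFF), which that range misses. One possible alternative, sketched here as an assumption rather than what FLOW ships, is to classify each line by Unicode bidirectional class using the standard library. The function name `line_direction` is made up for this example.

```python
import unicodedata


def line_direction(line: str) -> str:
    """Classify one line as 'rtl', 'ltr', or 'neutral' by strong bidi classes."""
    rtl = ltr = 0
    for ch in line:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # right-to-left letters, including Arabic
            rtl += 1
        elif bidi == "L":        # left-to-right letters such as Latin
            ltr += 1
    if rtl == 0 and ltr == 0:
        return "neutral"         # digits and punctuation only, e.g. an amount
    return "rtl" if rtl > ltr else "ltr"
```

Running this per line instead of per page lets an English transaction description stay left-to-right inside an otherwise Arabic page, and amounts come back as neutral so they can inherit the direction of the surrounding row.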

Merchant Names: Arabic vs Transliterated

This was the real surprise. UAE bank statements frequently show merchant names in three formats, depending on how the merchant registered with the bank:

Arabic text: actual Arabic script for the merchant name
Transliterated: Arabic merchant name written in Latin characters (e.g., "MAKTABAT AL JARIR" for a bookstore)
English: international merchant names in English

For categorization, you need to handle all three. We built a lookup table that maps known UAE merchants in all three formats to categories. It's manual work — there's no clean automated way to do this.
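A minimal sketch of what such a lookup might look like, assuming names are normalized before the lookup so that spacing, casing, diacritics, and the Arabic tatweel (elongation) character don't produce misses. The table entries, category labels, and function names are illustrative only, not our real data.

```python
import re
import unicodedata
from typing import Optional

# Illustrative entries only; a real table is curated by hand.
# Keys must already be in normalized form (see normalize_merchant).
MERCHANT_CATEGORIES = {
    "مكتبة جرير": "books",         # Arabic script
    "maktabat al jarir": "books",  # transliterated
    "jarir bookstore": "books",    # English
}

TATWEEL = "\u0640"  # Arabic elongation character, purely cosmetic


def normalize_merchant(name: str) -> str:
    """Fold a raw merchant string into a stable lookup key."""
    # NFKC maps Arabic presentation forms back to their base letters.
    text = unicodedata.normalize("NFKC", name)
    # Drop combining marks (Arabic short-vowel diacritics) and tatweel.
    text = "".join(
        ch for ch in text
        if not unicodedata.combining(ch) and ch != TATWEEL
    )
    # Collapse whitespace and ignore case for Latin-script names.
    return re.sub(r"\s+", " ", text).strip().casefold()


def categorize(name: str) -> Optional[str]:
    """Return the category for a known merchant, or None if unknown."""
    return MERCHANT_CATEGORIES.get(normalize_merchant(name))
```

Unknown merchants return `None`, which makes it easy to log them and grow the table over time.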

Date Format Variations

Arabic dates in UAE bank statements come in at least 4 formats:

DD/MM/YYYY (most common)
DD-MMM-YYYY with English month abbreviations
DD-MMM-YYYY with Arabic month names
Hijri calendar dates (rare, but present in some government-linked banks)

We process Hijri dates using the hijri-converter Python library. Miss this and you'll silently drop transactions or misparse dates by roughly 580 years, which is the current gap between Hijri and Gregorian year numbers.
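The four formats can be tried in sequence. The sketch below uses only the standard library and is an illustration under stated assumptions, not our production parser: it treats any year between 1300 and 1599 as Hijri, assumes the Arabic month spellings listed in `ARABIC_MONTHS`, and returns Hijri dates unconverted so a calendar library can handle them. All names here are hypothetical.

```python
from datetime import date
from typing import Optional, Tuple

ENGLISH_MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
                  "jul", "aug", "sep", "oct", "nov", "dec"]
# Gregorian month names in Arabic (assumed spellings; banks may vary).
ARABIC_MONTHS = ["يناير", "فبراير", "مارس", "أبريل", "مايو", "يونيو",
                 "يوليو", "أغسطس", "سبتمبر", "أكتوبر", "نوفمبر", "ديسمبر"]
MONTHS = {name: i for i, name in enumerate(ENGLISH_MONTHS, start=1)}
MONTHS.update({name: i for i, name in enumerate(ARABIC_MONTHS, start=1)})

# Assumption: a four-digit year in this range is Hijri, not Gregorian.
HIJRI_YEARS = range(1300, 1600)

ParsedDate = Tuple[str, int, int, int]  # (calendar, year, month, day)


def _gregorian(year: int, month: int, day: int) -> Optional[ParsedDate]:
    try:
        d = date(year, month, day)  # rejects impossible dates like 31 Feb
    except ValueError:
        return None
    return ("gregorian", d.year, d.month, d.day)


def parse_statement_date(raw: str) -> Optional[ParsedDate]:
    """Try each known format in turn; return None if none matches."""
    text = raw.strip()

    # Formats 1 and 4: numeric DD/MM/YYYY, Gregorian or Hijri by year.
    parts = text.split("/")
    if len(parts) == 3 and all(p.isdecimal() for p in parts):
        day, month, year = (int(p) for p in parts)
        if year in HIJRI_YEARS:
            # Hijri months have at most 30 days; conversion happens elsewhere.
            if 1 <= month <= 12 and 1 <= day <= 30:
                return ("hijri", year, month, day)
            return None
        return _gregorian(year, month, day)

    # Formats 2 and 3: DD-MMM-YYYY with an English or Arabic month name.
    parts = text.split("-")
    if len(parts) == 3 and parts[0].isdecimal() and parts[2].isdecimal():
        month = MONTHS.get(parts[1].strip().lower())
        if month is not None:
            return _gregorian(int(parts[2]), month, int(parts[0]))

    return None
```

Returning the calendar name alongside the fields keeps the Hijri case explicit: the caller has to convert it deliberately instead of accidentally storing a year like 1445 as Gregorian.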

What Actually Works

After months of edge cases, our pipeline for Arabic UAE statements:

Extract raw text with PyMuPDF (better RTL handling than pdfplumber for Arabic)
Detect dominant direction per page
Apply mirrored column mapping if RTL
Normalize merchant names through Arabic → Latin transliteration
Attempt date parsing in all 4 formats, cascade until one succeeds
Fall back to OCR (AWS Textract) if text extraction confidence is low
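For the last step, "confidence" needs some concrete definition. The sketch below shows one assumed heuristic, not necessarily the one FLOW uses: treat a page as unreliable if it has very little text or a high share of replacement, private-use, unassigned, or control characters. The function name and both thresholds are placeholders to tune against real statements.

```python
import unicodedata


def extraction_looks_unreliable(
    text: str,
    min_chars: int = 50,           # placeholder threshold
    max_bad_ratio: float = 0.05,   # placeholder threshold
) -> bool:
    """Heuristic check for whether a page should be sent to OCR instead."""
    visible = [ch for ch in text if not ch.isspace()]
    if len(visible) < min_chars:
        # Likely a scanned page with little or no text layer.
        return True
    bad = sum(
        1 for ch in visible
        # U+FFFD is the replacement character; Co = private use,
        # Cn = unassigned, Cc = control.
        if ch == "\ufffd" or unicodedata.category(ch) in ("Co", "Cn", "Cc")
    )
    return bad / len(visible) > max_bad_ratio
```

Private-use characters are worth counting because fonts with broken or missing Unicode mappings tend to surface that way in extracted text.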

Banks we support well now: ENBD, ADCB, Mashreq, FAB, DIB, RAK Bank.

If you're parsing UAE bank statements and hitting edge cases, drop a comment — happy to share more specifics.

FLOW is a PDF bank statement analyzer supporting 8 languages including Arabic. First analysis free.
