DEV Community

FLOW by Vestelon


Parsing Arabic PDF Bank Statements: What I Learned Building a Multi-Language Analyzer


When I started building FLOW — a bank statement PDF analyzer — I assumed Arabic would be one of the harder languages to support. I was right, but not for the reasons I expected.

The Right-to-Left Problem Isn't the Hard Part

Most developers assume Arabic PDF parsing is hard because of right-to-left text direction. In practice, modern PDF libraries (pdfplumber, PyMuPDF) handle RTL text reasonably well. The text extraction itself usually works.

The real challenge is positional logic.

Column Detection Breaks

In a typical Western bank statement, you have columns roughly like this:

Date | Description | Debit | Credit | Balance

The date is on the left, balance on the right. Column detection heuristics work because the layout matches left-to-right reading order.

In an Arabic statement from ENBD or Mashreq, the layout is mirrored:

Balance | Credit | Debit | Description | Date

If you try to apply the same column extraction logic, you get the balance where you expect the date, and vice versa. Your "date" column suddenly contains AED amounts, and your amount parsing fails silently.

Fix: Detect document direction first. We check the dominant text alignment in the header rows. If the majority of header cells are right-aligned, we mirror our column index mapping before extraction.
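Here is a minimal sketch of that header check. The `HeaderCell` shape and the helper names are hypothetical, not FLOW's actual code: it assumes you already have each header cell's text bounds and the bounds of the column it sits in, and it treats a cell as right-aligned when the text sits closer to the column's right edge than to its left.

```python
from dataclasses import dataclass
from typing import Dict, List

# Logical field order as it appears in a left-to-right statement.
LTR_FIELDS = ["date", "description", "debit", "credit", "balance"]


@dataclass
class HeaderCell:
    """Hypothetical header cell: text extent plus the extent of its column."""
    text_x0: float
    text_x1: float
    col_x0: float
    col_x1: float


def is_right_aligned(cell: HeaderCell) -> bool:
    # Right-aligned text leaves less slack on the right than on the left.
    left_gap = cell.text_x0 - cell.col_x0
    right_gap = cell.col_x1 - cell.text_x1
    return right_gap < left_gap


def column_mapping(header_cells: List[HeaderCell]) -> Dict[str, int]:
    """Map each field name to its physical column index, counted left to right."""
    right_aligned = sum(is_right_aligned(cell) for cell in header_cells)
    fields = LTR_FIELDS
    if right_aligned > len(header_cells) / 2:
        # Majority of header cells are right-aligned: treat the layout as mirrored.
        fields = list(reversed(LTR_FIELDS))
    return {name: index for index, name in enumerate(fields)}
```

With a mirrored mapping, the rest of the extraction code can keep asking for `mapping["date"]` or `mapping["balance"]` without caring which side of the page the column is on.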

Mixed-Direction Documents

UAE bank statements often contain both Arabic and English — Arabic for legal text and section headers, English for transaction descriptions and amounts. This creates mixed-direction paragraphs that confuse naive extraction.

# Simplified direction detection: compare Arabic letters against Latin letters.
def detect_dominant_direction(page_text_blocks):
    # page_text_blocks is an iterable of strings, one per extracted text block.
    text = ''.join(page_text_blocks)
    rtl_chars = sum(1 for c in text if '\u0600' <= c <= '\u06FF')
    ltr_chars = sum(1 for c in text if c.isascii() and c.isalpha())
    return 'rtl' if rtl_chars > ltr_chars else 'ltr'

Merchant Names: Arabic vs Transliterated

This was the real surprise. UAE bank statements show merchant names in one of three formats, depending on how the merchant registered with the bank:

  • Arabic text: actual Arabic script for the merchant name
  • Transliterated: Arabic merchant name written in Latin characters (e.g., "MAKTABAT AL JARIR" for a bookstore)
  • English: international merchant names in English

For categorization, you need to handle all three. We built a lookup table that maps known UAE merchants in all three formats to categories. It's manual work — there's no clean automated way to do this.
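A rough sketch of what such a lookup can look like. The table entries, category label, and function names below are illustrative assumptions, not our production data. The idea is to normalize both the table keys and the incoming name the same way, so spacing, casing, and Arabic diacritics don't cause misses.

```python
import re
import unicodedata

# Illustrative entries only; a real table is curated by hand.
MERCHANT_CATEGORIES = {
    "مكتبة جرير": "books",          # Arabic script
    "MAKTABAT AL JARIR": "books",   # transliterated
    "JARIR BOOKSTORE": "books",     # English
}


def normalize_merchant(name: str) -> str:
    # NFKC folds Arabic presentation forms back into their base letters.
    text = unicodedata.normalize("NFKC", name)
    # Drop tatweel (U+0640) and the short-vowel diacritics (U+064B to U+0652).
    text = re.sub(r"[\u0640\u064B-\u0652]", "", text)
    # Collapse whitespace and uppercase any Latin letters.
    return re.sub(r"\s+", " ", text).strip().upper()


# Normalize the keys once so lookups compare like with like.
_LOOKUP = {normalize_merchant(key): cat for key, cat in MERCHANT_CATEGORIES.items()}


def categorize(name: str) -> str:
    return _LOOKUP.get(normalize_merchant(name), "uncategorized")
```

Anything that comes back as "uncategorized" is a candidate for the next round of manual additions to the table.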

Date Format Variations

Dates in Arabic UAE bank statements come in at least four formats:

  • DD/MM/YYYY (most common)
  • DD-MMM-YYYY with English month abbreviations
  • DD-MMM-YYYY with Arabic month names
  • Hijri calendar dates (rare, but present in some government-linked banks)

We process Hijri dates using the hijri-converter Python library. Miss this and you'll silently drop transactions or misparse dates by roughly 580 years, the current gap between Hijri and Gregorian year numbers.
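To make the four formats concrete, here is a simplified parser sketch. Several things in it are assumptions for illustration: the Arabic month spellings, the `HIJRI_YEAR_CUTOFF` heuristic (any year below 1700 is treated as Hijri), and the function name. The hijri-converter import is optional so the Gregorian paths run without it.

```python
import re
from datetime import date
from typing import Optional

try:
    from hijri_converter import Hijri  # only needed for Hijri dates
except ImportError:
    Hijri = None

ENGLISH_MONTHS = {name: num for num, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# Gregorian month names in Arabic (assumed spellings; banks may vary).
ARABIC_MONTHS = {
    "يناير": 1, "فبراير": 2, "مارس": 3, "أبريل": 4, "مايو": 5, "يونيو": 6,
    "يوليو": 7, "أغسطس": 8, "سبتمبر": 9, "أكتوبر": 10, "نوفمبر": 11, "ديسمبر": 12,
}

HIJRI_YEAR_CUTOFF = 1700  # assumption: statement years below this are Hijri

_NUMERIC = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})$")   # DD/MM/YYYY
_NAMED = re.compile(r"^(\d{1,2})-([^\s-]+)-(\d{4})$")     # DD-MMM-YYYY


def parse_statement_date(raw: str) -> Optional[date]:
    """Try each known format in turn; return None if nothing matches."""
    text = raw.strip()
    match = _NUMERIC.match(text)
    if match:
        day, month, year = (int(group) for group in match.groups())
    else:
        match = _NAMED.match(text)
        if not match:
            return None
        day, name, year = int(match[1]), match[2], int(match[3])
        month = ENGLISH_MONTHS.get(name.lower()) or ARABIC_MONTHS.get(name)
        if month is None:
            return None
    if year < HIJRI_YEAR_CUTOFF:
        if Hijri is None:
            # Fail loudly rather than silently dropping the transaction.
            raise RuntimeError("Hijri date found but hijri-converter is not installed")
        return Hijri(year, month, day).to_gregorian()
    try:
        return date(year, month, day)
    except ValueError:  # e.g. 31/02/2024
        return None
```

The numeric path deliberately avoids `strptime`, since a Hijri date like the 30th of the second month would be rejected as an invalid Gregorian date before the calendar check ever ran.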

What Actually Works

After months of edge cases, our pipeline for Arabic UAE statements looks like this:

  1. Extract raw text with PyMuPDF (better RTL handling than pdfplumber for Arabic)
  2. Detect dominant direction per page
  3. Apply mirrored column mapping if RTL
  4. Normalize merchant names through Arabic → Latin transliteration
  5. Attempt date parsing in all 4 formats, cascade until one succeeds
  6. Fall back to OCR (AWS Textract) if text extraction confidence is low

Banks we support well now: ENBD, ADCB, Mashreq, FAB, DIB, RAK Bank.

If you're parsing UAE bank statements and hitting edge cases, drop a comment — happy to share more specifics.


FLOW is a PDF bank statement analyzer supporting 8 languages including Arabic. First analysis free.
