How I built a PDF bank statement analyzer in 8 languages (and what I learned)

#python #fintech #showdev #ai

I spent months building FLOW (vestelonflow.com) — a tool that analyzes bank statement PDFs and finds forgotten subscriptions, hidden fees, and recurring charges.

Here's what I learned making it work across 8 human languages.

The Problem

Most personal finance apps require you to connect your bank account. For many people (especially in Europe), that's a dealbreaker, whether because of GDPR concerns, privacy fears, or simply not trusting third-party apps with banking credentials.

My insight: the data people need is already in their PDF bank statements. Every bank generates them. Most people never look past the total.

The Tech Stack

The core flow:

User uploads PDF bank statement
PDF text extraction (pdfplumber + fallback OCR)
Transaction parsing — this is the hard part
LLM categorization pipeline
Subscription detection (recurring charges from the same merchant)
Report generation
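
As a rough illustration of the subscription detection step, here is a minimal sketch. It is not FLOW's actual code: the `Transaction` type, the `find_recurring` name, the 25 to 35 day window, and the 10% amount tolerance are all assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Transaction:
    booked: date
    merchant: str
    amount: float  # negative for charges


def find_recurring(transactions, min_hits=3, min_gap=25, max_gap=35, tolerance=0.10):
    """Return {merchant: typical amount} for roughly monthly, similar-sized charges."""
    by_merchant = defaultdict(list)
    for tx in transactions:
        if tx.amount < 0:  # only look at money going out
            by_merchant[tx.merchant.strip().lower()].append(tx)

    recurring = {}
    for merchant, txs in by_merchant.items():
        if len(txs) < min_hits:
            continue
        txs.sort(key=lambda t: t.booked)
        # Days between consecutive charges from the same merchant
        gaps = [(b.booked - a.booked).days for a, b in zip(txs, txs[1:])]
        amounts = [abs(t.amount) for t in txs]
        typical = sorted(amounts)[len(amounts) // 2]  # middle value as the reference
        regular = all(min_gap <= g <= max_gap for g in gaps)
        stable = all(abs(a - typical) <= tolerance * typical for a in amounts)
        if regular and stable:
            recurring[merchant] = typical
    return recurring
```

A real detector would also need to handle weekly and yearly cycles and merchant-name variants, which this sketch ignores.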

The trickiest part was transaction parsing. Every bank formats its PDFs differently. German statements look nothing like Slovak ones. We ended up building bank-specific parsers for the most common formats and a generic parser as a fallback.
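
The "bank-specific parsers plus generic fallback" idea can be sketched as a small dispatch table. Everything here is hypothetical: "Example Bank AG" is a made-up bank, and the generic line shape (date, description, amount on one line) is an assumption that real statements often break.

```python
import re

# Assumed generic line shape: "<dd.mm.yyyy> <description> <amount>"
GENERIC_LINE = re.compile(r"^(\d{2}\.\d{2}\.\d{4})\s+(.+?)\s+(-?\d[\d.,]*)$")


def parse_generic(text):
    """Fallback: keep every line that looks like date, description, amount."""
    rows = []
    for line in text.splitlines():
        match = GENERIC_LINE.match(line.strip())
        if match:
            day, description, amount = match.groups()
            rows.append({"date": day, "description": description, "amount": amount})
    return rows


def parse_example_bank(text):
    """Stand-in for a bank-specific parser; a real one would know the bank's layout."""
    rows = parse_generic(text)
    for row in rows:
        row["source"] = "example_bank"
    return rows


# (detector, parser) pairs, checked in order. The detector pattern is illustrative.
BANK_PARSERS = [
    (re.compile(r"Example Bank AG", re.IGNORECASE), parse_example_bank),
]


def parse_statement(text):
    """Use the first bank-specific parser whose detector matches, else the generic one."""
    for detector, parser in BANK_PARSERS:
        if detector.search(text):
            return parser(text)
    return parse_generic(text)
```

Keeping detection separate from parsing means a new bank format is one more entry in the list, and unknown statements still get a best-effort result.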

The 8-Language Challenge

Supporting Slovak, Czech, German, French, Spanish, Polish, Arabic, and Chinese wasn't just about translating the UI. The financial terminology itself varies significantly:

"Standing order" in English = "Trvalý príkaz" in Slovak = "Dauerauftrag" in German
Subscription detection keywords differ by region
Date/amount formats are locale-specific
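
To make the date/amount point concrete, here is a minimal sketch of locale-aware normalization. The `LOCALES` table and function names are assumptions for illustration, and the conventions listed are only the common ones; individual banks can deviate from them.

```python
from datetime import datetime
from decimal import Decimal

# Illustrative conventions only; real statements vary by bank, not just by country.
LOCALES = {
    "de": {"thousands": ".", "decimal": ",", "date": "%d.%m.%Y"},
    "sk": {"thousands": " ", "decimal": ",", "date": "%d.%m.%Y"},
    "zh": {"thousands": ",", "decimal": ".", "date": "%Y-%m-%d"},
}


def parse_amount(raw, locale):
    """Turn a locale-formatted amount string into a Decimal."""
    conv = LOCALES[locale]
    cleaned = raw.replace("\u00a0", " ").strip()  # PDFs often use non-breaking spaces
    cleaned = cleaned.replace(conv["thousands"], "").replace(conv["decimal"], ".")
    return Decimal(cleaned)


def parse_date(raw, locale):
    """Parse a date string using the locale's assumed format."""
    return datetime.strptime(raw.strip(), LOCALES[locale]["date"]).date()
```

Using `Decimal` rather than `float` avoids rounding surprises when amounts are later summed.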

We ended up with language-specific merchant dictionaries for common subscription services in each market.

What Actually Matters

The biggest lesson: people don't want a budgeting dashboard. They want a specific, actionable number.

"You're spending €137/month on forgotten subscriptions" converts. "Your spending breakdown by category" does not.

The product is live at vestelonflow.com — first report is free, no card required, no bank connection needed.

Happy to answer questions about the PDF parsing approach, the LLM pipeline, or the localization challenges.

Top comments (2)

Alex Shev • Jun 18

PDF statements are a good case because the hard part is not the LLM call, it is confidence around messy extraction.

I would want the report to expose "why this transaction was classified this way" and where OCR/parsing confidence was low. In finance tooling, a useful uncertainty marker is better than a clean-looking but wrong category.

FLOW by Vestelon • Jun 19

You're exactly right — and this is one of the things that took the longest to get right. The hard part isn't extraction accuracy in isolation; it's knowing when to trust what was extracted.

We ended up tracking several confidence signals per transaction: OCR character confidence, regex match quality on the amount pattern, whether the date could be unambiguously parsed, and whether the merchant name matched known patterns. Transactions that score low on multiple signals get flagged rather than silently miscategorized.
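
A minimal sketch of how such per-transaction signals might be combined. The field names, the 0.8 threshold, and the "more than one weak signal" rule are assumptions for illustration, not the values FLOW uses.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    ocr_confidence: float   # mean OCR character confidence, 0.0 to 1.0
    amount_match: float     # how cleanly the amount pattern matched, 0.0 to 1.0
    date_unambiguous: bool  # False when the date could be read more than one way
    merchant_known: bool    # merchant name matched a known pattern


def weak_signals(signals, threshold=0.8):
    """List the names of signals that fall below the (assumed) threshold."""
    weak = []
    if signals.ocr_confidence < threshold:
        weak.append("ocr")
    if signals.amount_match < threshold:
        weak.append("amount")
    if not signals.date_unambiguous:
        weak.append("date")
    if not signals.merchant_known:
        weak.append("merchant")
    return weak


def needs_review(signals, max_weak=1):
    """Flag a transaction when more than max_weak signals are weak."""
    return len(weak_signals(signals)) > max_weak
```

Returning the names of the weak signals, not just a boolean, is what would later let a report explain why a transaction was flagged.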

The 'why was this categorized this way' feature is on the roadmap — it's genuinely useful for both end users and for our own debugging. Right now the most common confusion is corporate card merchant names (truncated, all-caps, reference codes) which look completely different from the same merchant on a personal account.