Praval Parikh

I Built a Scalable Financial Transaction System That Stays Correct Under Load

Financial systems rarely fail because they are slow.

They fail because they are wrong.

A duplicated transaction, an out-of-order withdrawal, or a race condition in a balance update is enough to permanently break user trust. As traffic grows, correctness becomes harder, not easier.

In this post, I’ll walk through how I designed a scalable financial transaction system that puts correctness first while still achieving high throughput under heavy concurrency.

👉 Source code:

https://github.com/plebsicle/microfin


Why Naive Financial Systems Break at Scale

A typical starting point looks simple:

  1. A request hits the server
  2. The server validates it
  3. A database transaction updates the balance

This works fine at low traffic. Under concurrency, it quickly breaks.

Common failure modes:

  • Multiple concurrent updates to the same balance
  • Database connection exhaustion
  • Lock contention increasing latency
  • Race conditions causing incorrect balances

Even with database transactions, allowing concurrent writes to shared financial state is dangerous. Locking harder reduces throughput, and scaling vertically only delays the problem.

Direct concurrent writes to financial state do not scale safely.
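
To make the race concrete, here's a minimal sketch of the naive read-modify-write pattern (a hypothetical handler using node-postgres; table and column names are assumptions). Two concurrent calls can both read the same starting balance, and the later write silently erases the earlier one:

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection details from PG* env vars

// Naive handler: read, compute, write. Nothing stops two concurrent
// calls from reading the same balance, so one update overwrites the
// other -- the classic lost update.
async function naiveWithdraw(accountId: string, amount: number) {
  const { rows } = await pool.query(
    "SELECT balance FROM accounts WHERE id = $1",
    [accountId]
  );
  const newBalance = Number(rows[0].balance) - amount;
  // A concurrent request may have changed the balance between the
  // read above and the write below.
  await pool.query(
    "UPDATE accounts SET balance = $1 WHERE id = $2",
    [newBalance, accountId]
  );
}
```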


Defining the Non-Negotiables

Before choosing any technology, I defined the system requirements:

  • Strong consistency for account balances
  • Sequential execution of transactions per user/account
  • High throughput under concurrent load
  • Horizontal scalability
  • Fault tolerance
  • A system that is easy to reason about

Every architectural decision follows from these constraints.


High-Level Architecture

The key idea is to separate request ingestion from state mutation.

At a high level:

  • Requests are accepted and validated by application servers
  • Valid transactions are queued
  • Transactions are processed sequentially per user
  • Durable state is updated in the database
  • Caching is used purely as a performance optimization

This allows the system to absorb traffic spikes without immediately mutating shared state.
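
To make the ingestion side concrete, here's a hypothetical Express handler in that shape (the framework, route, and payload are assumptions, not the repo's actual API). It validates, enqueues, and acknowledges without touching account state:

```typescript
import express from "express";
// Hypothetical module: the keyed Kafka publish sketched later in this post.
import { enqueueTransaction } from "./queue";

const app = express();
app.use(express.json());

app.post("/transactions", async (req, res) => {
  const { accountId, amount } = req.body ?? {};
  if (typeof accountId !== "string" || typeof amount !== "number") {
    return res.status(400).json({ error: "invalid transaction" });
  }
  // No balance is read or written here; the transaction is only
  // validated and queued, so spikes pile up in the queue, not the DB.
  await enqueueTransaction(accountId, { accountId, amount });
  res.status(202).json({ status: "queued" });
});

app.listen(3000);
```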


System Architecture Diagram

The diagram below shows the high-level architecture of the system and how requests flow through different components, from request ingestion to durable storage.

High-level architecture of the financial transaction system

Each incoming request is first load-balanced by Nginx, validated by application servers, and then published to Kafka. Kafka enforces strict per-user ordering before workers update PostgreSQL (the system of record) and Redis (the cache).


The Core Insight: Ordering Beats Locking

The most important insight in this design is simple:

Correct ordering is more scalable than locking.

Instead of allowing concurrent updates and then trying to protect state with locks, the system ensures that transactions for a given user are never processed concurrently.

This is achieved through deterministic ordering, not mutual exclusion.


Enforcing Sequential Execution with Kafka Partitioning

Each transaction is associated with a user ID / account number.

A hash of this identifier determines which Kafka partition the transaction is sent to.

This gives us three critical guarantees:

  • All transactions for a user always go to the same partition
  • Each partition is consumed by exactly one worker
  • Kafka guarantees ordering within a partition

As a result:

  • No two transactions for the same user are processed concurrently
  • Race conditions are eliminated
  • Database-level locking is unnecessary

Scaling is achieved by increasing the number of partitions and workers, while preserving per-user ordering.
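
In kafkajs terms, the routing is just the message key (client id and broker address below are placeholders). Kafka's default partitioner hashes the key, so every transaction for a given account lands on the same partition:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "microfin-api", brokers: ["localhost:9092"] });
const producer = kafka.producer();
await producer.connect();

// The key is hashed to pick the partition, so all transactions for
// one account share a partition and are consumed strictly in order.
export async function enqueueTransaction(accountId: string, tx: unknown) {
  await producer.send({
    topic: "transactions",
    messages: [{ key: accountId, value: JSON.stringify(tx) }],
  });
}
```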


What Happens When a User Double-Clicks?

Users double-click buttons, retry requests, or refresh pages during slow network conditions.

In this system:

  • Each request is treated as a distinct transaction
  • Multiple similar requests may be enqueued
  • They are processed strictly sequentially

Rather than rejecting or deduplicating requests at the application layer, the system relies on deterministic execution order to guarantee correctness.
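
Here's what that looks like on the consumer side, as a kafkajs sketch (`applyTransaction` is the database step shown later). kafkajs awaits `eachMessage` before delivering the next message from the same partition, so a double-clicked withdrawal is applied twice in order, and the second attempt sees whatever balance the first one left behind:

```typescript
import { Kafka } from "kafkajs";
// Hypothetical module; see the PostgreSQL section below.
import { applyTransaction } from "./apply-transaction";

const kafka = new Kafka({ clientId: "microfin-worker", brokers: ["localhost:9092"] });
const consumer = kafka.consumer({ groupId: "tx-workers" });

await consumer.connect();
await consumer.subscribe({ topics: ["transactions"] });

await consumer.run({
  // The handler is awaited before the next message from the same
  // partition is delivered, so per-account execution is sequential.
  eachMessage: async ({ message }) => {
    const tx = JSON.parse(message.value!.toString());
    await applyTransaction(tx);
  },
});
```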


Why Kafka

Kafka isn’t used here just as a queue; it’s a core correctness mechanism.

Kafka provides:

  • Strong ordering guarantees within partitions
  • Horizontal scalability through partitioning
  • Fault tolerance via replication

Kafka producers are configured with idempotency enabled. This ensures that:

  • Duplicate messages are not written during retries
  • Ordering is preserved
  • Exactly-once production semantics are achieved per producer session

This is producer-side idempotency, not consumer-side deduplication. Correctness comes from ordering, not from rejecting duplicate-looking requests.
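
In kafkajs, this is a producer constructor flag (a sketch; broker address is a placeholder). Enabling it makes the broker de-duplicate producer retries using sequence numbers, and kafkajs enforces acks=all for this mode:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "microfin-api", brokers: ["kafka:9092"] });

const producer = kafka.producer({
  idempotent: true,       // broker drops duplicate retries, preserving order
  maxInFlightRequests: 1, // conservative: one request in flight at a time
});
```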


PostgreSQL as the Source of Truth

PostgreSQL acts as the system of record.

Why PostgreSQL:

  • Strong ACID guarantees
  • Mature transactional behavior
  • Predictable correctness under concurrency

PgBouncer is used for connection pooling to prevent database connection exhaustion under heavy load.

All balance updates occur inside database transactions, ensuring durability and correctness even in failure scenarios.
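
A minimal sketch of that update, using node-postgres pointed at PgBouncer (host, port, and schema are assumptions; the real schema lives in the repo):

```typescript
import { Pool } from "pg";

// Connect through PgBouncer (conventionally port 6432) rather than
// directly to PostgreSQL, so worker connections are pooled.
const pool = new Pool({ host: "pgbouncer", port: 6432, database: "microfin" });

export async function applyTransaction(tx: { accountId: string; amount: number }) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    // No explicit locking (SELECT ... FOR UPDATE) needed: per-account
    // ordering guarantees this is the only in-flight write for this
    // account. The WHERE clause still rejects overdrafts atomically.
    const { rowCount } = await client.query(
      `UPDATE accounts
          SET balance = balance + $1
        WHERE id = $2 AND balance + $1 >= 0`,
      [tx.amount, tx.accountId]
    );
    if (rowCount === 0) throw new Error("insufficient funds or unknown account");
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```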


Redis: Performance Optimization, Not a Dependency

Redis is used for:

  • Session management
  • Caching frequently accessed data
  • Reducing read load on PostgreSQL

Redis is not part of the correctness path. There is no distributed transaction protocol between PostgreSQL and Redis. If Redis fails, the system remains correct — only performance is affected.
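
A sketch of the opportunistic cache write (ioredis; the key format and TTL are assumptions). The failure path is deliberately swallowed, because a Redis outage must never fail a committed transaction:

```typescript
import Redis from "ioredis";

const redis = new Redis({ host: "redis", port: 6379 });

// Best-effort, called after the PostgreSQL commit. A cache failure is
// logged and ignored -- PostgreSQL remains the source of truth.
async function updateBalanceCache(accountId: string, balance: number) {
  try {
    await redis.set(`balance:${accountId}`, String(balance), "EX", 60);
  } catch (err) {
    console.warn("cache update skipped:", err);
  }
}
```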


End-to-End Transaction Flow

  1. Client sends a transaction request
  2. Nginx load-balances the request
  3. Application server validates it
  4. Transaction is published to Kafka
  5. Kafka assigns it to a partition via hashing
  6. A worker processes transactions sequentially
  7. PostgreSQL updates occur inside a transaction
  8. Redis is updated opportunistically
  9. Response is returned to the client

Load Testing and Results

The system was stress-tested using K6 with scenarios including:

  • Account creation
  • User sign-ins
  • Deposits, withdrawals, and transfers
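
For flavor, here is a minimal k6 scenario along those lines (endpoint, payload, and virtual-user count are placeholders; the real scripts cover all the flows above):

```typescript
import http from "k6/http";
import { check } from "k6";

declare const __VU: number; // injected by k6: the current virtual-user id

export const options = { vus: 200, duration: "60s" };

export default function () {
  // Vary the account per virtual user so load spreads across Kafka
  // partitions instead of funnelling into one.
  const payload = JSON.stringify({ accountId: `acct-${__VU}`, amount: 10 });
  const res = http.post("http://localhost:3000/transactions", payload, {
    headers: { "Content-Type": "application/json" },
  });
  check(res, { "status is 2xx": (r) => r.status >= 200 && r.status < 300 });
}
```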

Results:

  • Sustained 3600+ requests per second
  • 100% success rate
  • Sub-second p95 latencies

Correctness was maintained throughout.


Observability (and Its Limits)

Prometheus and Grafana are used to monitor:

  • Request rate
  • Latency
  • CPU and memory usage of application servers

Infrastructure-level metrics for Kafka, Redis, and PostgreSQL are not instrumented in this version.


Trade-offs and Limitations

This system intentionally avoids:

  • Semantic request deduplication
  • Distributed transactions across PostgreSQL and Redis
  • Full infrastructure-level observability

These trade-offs keep the system simpler and more reliable.


What I’d Improve Next

Future improvements could include:

  • Kafka, Redis, and PostgreSQL metrics
  • Database sharding strategies
  • Advanced failure recovery mechanisms

Final Takeaways

  • Ordering simplifies concurrency
  • Correctness matters more than raw speed
  • Distributed systems are about trade-offs, not perfection
  • Designing for correctness early prevents painful rewrites

Scalability isn’t about handling more requests —

it’s about handling them correctly.


🔗 Project Link

https://github.com/plebsicle/microfin
