Praval Parikh

I Built a Scalable Financial Transaction System That Stays Correct Under Load

Financial systems rarely fail because they are slow.

They fail because they are wrong.

A duplicated transaction, an out-of-order withdrawal, or a race condition in a balance update is enough to permanently break user trust. As traffic grows, correctness becomes harder, not easier.

In this post, I’ll walk through how I designed a scalable financial transaction system that puts correctness first while still achieving high throughput under heavy concurrency.

👉 Source code:

https://github.com/plebsicle/microfin


Why Naive Financial Systems Break at Scale

A typical starting point looks simple:

  1. A request hits the server
  2. The server validates it
  3. A database transaction updates the balance

This works fine at low traffic. Under concurrency, it quickly breaks.

Common failure modes:

  • Multiple concurrent updates to the same balance
  • Database connection exhaustion
  • Lock contention increasing latency
  • Race conditions causing incorrect balances

Even with database transactions, allowing concurrent writes to shared financial state is dangerous. Locking harder reduces throughput, and scaling vertically only delays the problem.

Direct concurrent writes to financial state do not scale safely.
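
To make the race concrete, here's a minimal sketch of the naive read-modify-write pattern (a hypothetical handler using node-postgres; table and column names are assumptions). Two concurrent calls can both read the same starting balance, and the later write silently erases the earlier one:

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection details from PG* env vars

// Naive handler: read, compute, write. Nothing stops two concurrent
// calls from reading the same balance, so one update overwrites the
// other -- the classic lost update.
async function naiveWithdraw(accountId: string, amount: number) {
  const { rows } = await pool.query(
    "SELECT balance FROM accounts WHERE id = $1",
    [accountId]
  );
  const newBalance = Number(rows[0].balance) - amount;
  // A concurrent request may have changed the balance between the
  // read above and the write below.
  await pool.query(
    "UPDATE accounts SET balance = $1 WHERE id = $2",
    [newBalance, accountId]
  );
}
```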


Defining the Non-Negotiables

Before choosing any technology, I defined the system requirements:

  • Strong consistency for account balances
  • Sequential execution of transactions per user/account
  • High throughput under concurrent load
  • Horizontal scalability
  • Fault tolerance
  • A system that is easy to reason about

Every architectural decision follows from these constraints.


High-Level Architecture

The key idea is to separate request ingestion from state mutation.

At a high level:

  • Requests are accepted and validated by application servers
  • Valid transactions are queued
  • Transactions are processed sequentially per user
  • Durable state is updated in the database
  • Caching is used purely as a performance optimization

This allows the system to absorb traffic spikes without immediately mutating shared state.
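
To make the ingestion side concrete, here's a hypothetical Express handler in that shape (the framework, route, and payload are assumptions, not the repo's actual API). It validates, enqueues, and acknowledges without touching account state:

```typescript
import express from "express";
// Hypothetical module: the keyed Kafka publish sketched later in this post.
import { enqueueTransaction } from "./queue";

const app = express();
app.use(express.json());

app.post("/transactions", async (req, res) => {
  const { accountId, amount } = req.body ?? {};
  if (typeof accountId !== "string" || typeof amount !== "number") {
    return res.status(400).json({ error: "invalid transaction" });
  }
  // No balance is read or written here; the transaction is only
  // validated and queued, so spikes pile up in the queue, not the DB.
  await enqueueTransaction(accountId, { accountId, amount });
  res.status(202).json({ status: "queued" });
});

app.listen(3000);
```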


System Architecture Diagram

The diagram below shows the high-level architecture of the system and how requests flow through different components, from request ingestion to durable storage.

High-level architecture of the financial transaction system

Each incoming request is first load-balanced by Nginx, validated by application servers, and then published to Kafka. Kafka enforces strict per-user ordering before workers update PostgreSQL (the system of record) and Redis (the cache).


The Core Insight: Ordering Beats Locking

The most important insight in this design is simple:

Correct ordering is more scalable than locking.

Instead of allowing concurrent updates and then trying to protect state with locks, the system ensures that transactions for a given user are never processed concurrently.

This is achieved through deterministic ordering, not mutual exclusion.


Enforcing Sequential Execution with Kafka Partitioning

Each transaction is associated with a user ID / account number.

A hash of this identifier determines which Kafka partition the transaction is sent to.

This gives us three critical guarantees:

  • All transactions for a user always go to the same partition
  • Each partition is consumed by exactly one worker
  • Kafka guarantees ordering within a partition

As a result:

  • No two transactions for the same user are processed concurrently
  • Race conditions are eliminated
  • Database-level locking is unnecessary

Scaling is achieved by increasing the number of partitions and workers, while preserving per-user ordering.
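
In kafkajs terms, the routing is just the message key (client id and broker address below are placeholders). Kafka's default partitioner hashes the key, so every transaction for a given account lands on the same partition:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "microfin-api", brokers: ["localhost:9092"] });
const producer = kafka.producer();
await producer.connect();

// The key is hashed to pick the partition, so all transactions for
// one account share a partition and are consumed strictly in order.
export async function enqueueTransaction(accountId: string, tx: unknown) {
  await producer.send({
    topic: "transactions",
    messages: [{ key: accountId, value: JSON.stringify(tx) }],
  });
}
```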


What Happens When a User Double-Clicks?

Users double-click buttons, retry requests, or refresh pages during slow network conditions.

In this system:

  • Each request is treated as a distinct transaction
  • Multiple similar requests may be enqueued
  • They are processed strictly sequentially

Rather than rejecting or deduplicating requests at the application layer, the system relies on deterministic execution order to guarantee correctness.
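
Here's what that looks like on the consumer side, as a kafkajs sketch (`applyTransaction` is the database step shown later). kafkajs awaits `eachMessage` before delivering the next message from the same partition, so a double-clicked withdrawal is applied twice in order, and the second attempt sees whatever balance the first one left behind:

```typescript
import { Kafka } from "kafkajs";
// Hypothetical module; see the PostgreSQL section below.
import { applyTransaction } from "./apply-transaction";

const kafka = new Kafka({ clientId: "microfin-worker", brokers: ["localhost:9092"] });
const consumer = kafka.consumer({ groupId: "tx-workers" });

await consumer.connect();
await consumer.subscribe({ topics: ["transactions"] });

await consumer.run({
  // The handler is awaited before the next message from the same
  // partition is delivered, so per-account execution is sequential.
  eachMessage: async ({ message }) => {
    const tx = JSON.parse(message.value!.toString());
    await applyTransaction(tx);
  },
});
```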


Why Kafka

Kafka isn’t used here just as a queue; it’s a core correctness mechanism.

Kafka provides:

  • Strong ordering guarantees within partitions
  • Horizontal scalability through partitioning
  • Fault tolerance via replication

Kafka producers are configured with idempotency enabled. This ensures that:

  • Duplicate messages are not written during retries
  • Ordering is preserved
  • Exactly-once production semantics are achieved per producer session

This is producer-side idempotency, not consumer-side deduplication. Correctness comes from ordering, not from rejecting duplicate-looking requests.
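
In kafkajs, this is a producer constructor flag (a sketch; broker address is a placeholder). Enabling it makes the broker de-duplicate producer retries using sequence numbers, and kafkajs enforces acks=all for this mode:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "microfin-api", brokers: ["kafka:9092"] });

const producer = kafka.producer({
  idempotent: true,       // broker drops duplicate retries, preserving order
  maxInFlightRequests: 1, // conservative: one request in flight at a time
});
```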


PostgreSQL as the Source of Truth

PostgreSQL acts as the system of record.

Why PostgreSQL:

  • Strong ACID guarantees
  • Mature transactional behavior
  • Predictable correctness under concurrency

PgBouncer is used for connection pooling to prevent database connection exhaustion under heavy load.

All balance updates occur inside database transactions, ensuring durability and correctness even in failure scenarios.
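
A minimal sketch of that update, using node-postgres pointed at PgBouncer (host, port, and schema are assumptions; the real schema lives in the repo):

```typescript
import { Pool } from "pg";

// Connect through PgBouncer (conventionally port 6432) rather than
// directly to PostgreSQL, so worker connections are pooled.
const pool = new Pool({ host: "pgbouncer", port: 6432, database: "microfin" });

export async function applyTransaction(tx: { accountId: string; amount: number }) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    // No explicit locking (SELECT ... FOR UPDATE) needed: per-account
    // ordering guarantees this is the only in-flight write for this
    // account. The WHERE clause still rejects overdrafts atomically.
    const { rowCount } = await client.query(
      `UPDATE accounts
          SET balance = balance + $1
        WHERE id = $2 AND balance + $1 >= 0`,
      [tx.amount, tx.accountId]
    );
    if (rowCount === 0) throw new Error("insufficient funds or unknown account");
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```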


Redis: Performance Optimization, Not a Dependency

Redis is used for:

  • Session management
  • Caching frequently accessed data
  • Reducing read load on PostgreSQL

Redis is not part of the correctness path. There is no distributed transaction protocol between PostgreSQL and Redis. If Redis fails, the system remains correct — only performance is affected.
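
A sketch of the opportunistic cache write (ioredis; the key format and TTL are assumptions). The failure path is deliberately swallowed, because a Redis outage must never fail a committed transaction:

```typescript
import Redis from "ioredis";

const redis = new Redis({ host: "redis", port: 6379 });

// Best-effort, called after the PostgreSQL commit. A cache failure is
// logged and ignored -- PostgreSQL remains the source of truth.
async function updateBalanceCache(accountId: string, balance: number) {
  try {
    await redis.set(`balance:${accountId}`, String(balance), "EX", 60);
  } catch (err) {
    console.warn("cache update skipped:", err);
  }
}
```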


End-to-End Transaction Flow

  1. Client sends a transaction request
  2. Nginx load-balances the request
  3. Application server validates it
  4. Transaction is published to Kafka
  5. Kafka assigns it to a partition via hashing
  6. A worker processes transactions sequentially
  7. PostgreSQL updates occur inside a transaction
  8. Redis is updated opportunistically
  9. Response is returned to the client

Load Testing and Results

The system was stress-tested using K6 with scenarios including:

  • Account creation
  • User sign-ins
  • Deposits, withdrawals, and transfers
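
For flavor, here is a minimal k6 scenario along those lines (endpoint, payload, and virtual-user count are placeholders; the real scripts cover all the flows above):

```typescript
import http from "k6/http";
import { check } from "k6";

declare const __VU: number; // injected by k6: the current virtual-user id

export const options = { vus: 200, duration: "60s" };

export default function () {
  // Vary the account per virtual user so load spreads across Kafka
  // partitions instead of funnelling into one.
  const payload = JSON.stringify({ accountId: `acct-${__VU}`, amount: 10 });
  const res = http.post("http://localhost:3000/transactions", payload, {
    headers: { "Content-Type": "application/json" },
  });
  check(res, { "status is 2xx": (r) => r.status >= 200 && r.status < 300 });
}
```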

Results:

  • Sustained 3600+ requests per second
  • 100% success rate
  • Sub-second p95 latencies

Correctness was maintained throughout.


Observability (and Its Limits)

Prometheus and Grafana are used to monitor:

  • Request rate
  • Latency
  • CPU and memory usage of application servers

Infrastructure-level metrics for Kafka, Redis, and PostgreSQL are not instrumented in this version.


Trade-offs and Limitations

This system intentionally avoids:

  • Semantic request deduplication
  • Distributed transactions across PostgreSQL and Redis
  • Full infrastructure-level observability

These trade-offs keep the system simpler and more reliable.


What I’d Improve Next

Future improvements could include:

  • Kafka, Redis, and PostgreSQL metrics
  • Database sharding strategies
  • Advanced failure recovery mechanisms

Final Takeaways

  • Ordering simplifies concurrency
  • Correctness matters more than raw speed
  • Distributed systems are about trade-offs, not perfection
  • Designing for correctness early prevents painful rewrites

Scalability isn’t about handling more requests —

it’s about handling them correctly.


🔗 Project Link

https://github.com/plebsicle/microfin
