Financial systems rarely fail because they are slow.
They fail because they are wrong.
A duplicated transaction, an out-of-order withdrawal, or a race condition in a balance update is enough to permanently break user trust. As traffic grows, correctness becomes harder, not easier.
In this post, I’ll walk through how I designed a scalable financial transaction system that prioritizes correctness first, while still achieving high throughput under heavy concurrency.
👉 Source code:
https://github.com/plebsicle/microfin
Why Naive Financial Systems Break at Scale
A typical starting point looks simple:
- A request hits the server
- The server validates it
- A database transaction updates the balance
This works fine at low traffic. Under concurrency, it quickly breaks.
Common failure modes:
- Multiple concurrent updates to the same balance
- Database connection exhaustion
- Lock contention increasing latency
- Race conditions causing incorrect balances
Even with database transactions, allowing concurrent writes to shared financial state is dangerous. Locking more aggressively reduces throughput, and scaling vertically only delays the problem.
Direct concurrent writes to financial state do not scale safely.
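To make the failure mode concrete, here is a minimal sketch (not taken from the repository) of the naive read-then-write pattern that loses updates under concurrency, using the node-postgres client; the table and column names are illustrative:

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from environment variables

// Naive withdrawal: read the balance, then write the new value back.
// Two concurrent calls can both read the same starting balance, both pass the
// check, and both write, so one update is silently lost or the account goes negative.
async function naiveWithdraw(accountId: string, amount: number): Promise<void> {
  const { rows } = await pool.query(
    "SELECT balance FROM accounts WHERE id = $1",
    [accountId]
  );
  const balance = Number(rows[0].balance);

  if (balance < amount) {
    throw new Error("insufficient funds");
  }

  // By the time this UPDATE runs, another request may already have changed
  // the balance: the classic lost-update race.
  await pool.query(
    "UPDATE accounts SET balance = $1 WHERE id = $2",
    [balance - amount, accountId]
  );
}
```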
Defining the Non-Negotiables
Before choosing any technology, I defined the system requirements:
- Strong consistency for account balances
- Sequential execution of transactions per user/account
- High throughput under concurrent load
- Horizontal scalability
- Fault tolerance
- A system that is easy to reason about
Every architectural decision follows from these constraints.
High-Level Architecture
The key idea is to separate request ingestion from state mutation.
At a high level:
- Requests are accepted and validated by application servers
- Valid transactions are queued
- Transactions are processed sequentially per user
- Durable state is updated in the database
- Caching is used purely as a performance optimization
This allows the system to absorb traffic spikes without immediately mutating shared state.
System Architecture Diagram
The diagram below shows the high-level architecture of the system and how requests flow through different components, from request ingestion to durable storage.
Each incoming request is first load-balanced by Nginx, validated by application servers, and then published to Kafka. Kafka enforces strict per-user ordering before workers update PostgreSQL (the system of record) and Redis (the cache).
The Core Insight: Ordering Beats Locking
The most important insight in this design is simple:
Correct ordering is more scalable than locking.
Instead of allowing concurrent updates and then trying to protect state with locks, the system ensures that transactions for a given user are never processed concurrently.
This is achieved through deterministic ordering, not mutual exclusion.
Enforcing Sequential Execution with Kafka Partitioning
Each transaction is associated with a user ID / account number.
A hash of this identifier determines which Kafka partition the transaction is sent to.
This gives us three critical guarantees:
- All transactions for a user always go to the same partition
- Each partition is consumed by exactly one worker
- Kafka guarantees ordering within a partition
As a result:
- No two transactions for the same user are processed concurrently
- Race conditions are eliminated
- Database-level locking is unnecessary
Scaling is achieved by increasing the number of partitions and workers, while preserving per-user ordering.
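A minimal sketch of this keying scheme, assuming the kafkajs client (the repository may use a different library, and the topic and field names here are illustrative):

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "microfin-api", brokers: ["localhost:9092"] });
const producer = kafka.producer();

interface TransactionRequest {
  accountId: string;
  type: "DEPOSIT" | "WITHDRAWAL" | "TRANSFER";
  amount: number;
}

// Connect once at application startup.
export async function startProducer(): Promise<void> {
  await producer.connect();
}

// Using the account identifier as the message key means Kafka's default
// partitioner hashes the key, so every transaction for the same account lands
// on the same partition and is consumed strictly in order.
export async function enqueueTransaction(tx: TransactionRequest): Promise<void> {
  await producer.send({
    topic: "transactions",
    messages: [
      {
        key: tx.accountId, // hash(key) determines the partition
        value: JSON.stringify(tx),
      },
    ],
  });
}
```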
What Happens When a User Double-Clicks?
Users double-click buttons, retry requests, or refresh pages during slow network conditions.
In this system:
- Each request is treated as a distinct transaction
- Multiple similar requests may be enqueued
- They are processed strictly sequentially
Rather than rejecting or deduplicating requests at the application layer, the system relies on deterministic execution order to guarantee correctness.
Why Kafka
Kafka isn’t used here just as a queue; it’s a core correctness mechanism.
Kafka provides:
- Strong ordering guarantees within partitions
- Horizontal scalability through partitioning
- Fault tolerance via replication
Kafka producers are configured with idempotency enabled. This ensures that:
- Duplicate messages are not written during retries
- Ordering is preserved
- Exactly-once production semantics are achieved per producer session
This is producer-side idempotency, not consumer-side deduplication. Correctness comes from ordering, not from rejecting duplicate-looking requests.
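With kafkajs, for example, producer-side idempotency is a configuration flag; this is an assumption about the client, not a snippet from the repository:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "microfin-api", brokers: ["localhost:9092"] });

// idempotent: true makes the broker discard retried duplicates of the same batch,
// so retries cannot double-write messages within the producer session.
// Keeping a single in-flight request preserves ordering during retries.
const producer = kafka.producer({
  idempotent: true,
  maxInFlightRequests: 1,
});
```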
PostgreSQL as the Source of Truth
PostgreSQL acts as the system of record.
Why PostgreSQL:
- Strong ACID guarantees
- Mature transactional behavior
- Predictable correctness under concurrency
PgBouncer is used for connection pooling to prevent database exhaustion under heavy load.
All balance updates occur inside database transactions, ensuring durability and correctness even in failure scenarios.
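Because per-account ordering is already enforced upstream, the worker's balance update can be a plain transaction with no explicit row locks. A minimal sketch using node-postgres; the schema is an assumption for illustration:

```typescript
import { Pool } from "pg";

const pool = new Pool(); // PgBouncer sits between this pool and PostgreSQL

async function applyWithdrawal(accountId: string, amount: number): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");

    // The WHERE clause guards against overdrafts; upstream ordering guarantees
    // no other worker is writing to this row concurrently.
    const result = await client.query(
      `UPDATE accounts
          SET balance = balance - $1
        WHERE id = $2 AND balance >= $1`,
      [amount, accountId]
    );
    if (result.rowCount === 0) {
      throw new Error("insufficient funds or unknown account");
    }

    await client.query(
      "INSERT INTO ledger_entries (account_id, amount, kind) VALUES ($1, $2, 'WITHDRAWAL')",
      [accountId, -amount]
    );

    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```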
Redis: Performance Optimization, Not a Dependency
Redis is used for:
- Session management
- Caching frequently accessed data
- Reducing read load on PostgreSQL
Redis is not part of the correctness path. There is no distributed transaction protocol between PostgreSQL and Redis. If Redis fails, the system remains correct — only performance is affected.
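In code, that separation looks like a read-through cache with best-effort writes: any Redis failure degrades to a PostgreSQL read. A sketch assuming the ioredis client and an illustrative key scheme:

```typescript
import Redis from "ioredis";
import { Pool } from "pg";

const redis = new Redis(); // defaults to localhost:6379
const pool = new Pool();

// Redis is consulted first, but a cache miss or Redis outage simply falls
// back to the source of truth; correctness never depends on the cache.
async function getBalance(accountId: string): Promise<number> {
  try {
    const cached = await redis.get(`balance:${accountId}`);
    if (cached !== null) return Number(cached);
  } catch {
    // Cache unavailable: fall through to PostgreSQL.
  }

  const { rows } = await pool.query(
    "SELECT balance FROM accounts WHERE id = $1",
    [accountId]
  );
  const balance = Number(rows[0].balance);

  // Best-effort cache fill; errors are deliberately ignored.
  redis.set(`balance:${accountId}`, String(balance), "EX", 30).catch(() => {});

  return balance;
}
```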
End-to-End Transaction Flow
- Client sends a transaction request
- Nginx load-balances the request
- Application server validates it
- Transaction is published to Kafka
- Kafka assigns it to a partition via hashing
- A worker processes transactions sequentially
- PostgreSQL updates occur inside a transaction
- Redis is updated opportunistically
- Response is returned to the client
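On the consumer side, the whole flow comes together in a small worker loop. A sketch assuming a kafkajs consumer; processTransaction and refreshCache are hypothetical stand-ins for the PostgreSQL and Redis steps described above:

```typescript
import { Kafka } from "kafkajs";

interface Transaction {
  accountId: string;
  type: string;
  amount: number;
}

// Hypothetical helpers standing in for the durable update and cache refresh.
async function processTransaction(tx: Transaction): Promise<void> { /* PostgreSQL transaction */ }
async function refreshCache(tx: Transaction): Promise<void> { /* best-effort Redis update */ }

const kafka = new Kafka({ clientId: "microfin-worker", brokers: ["localhost:9092"] });
const consumer = kafka.consumer({ groupId: "transaction-workers" });

async function run(): Promise<void> {
  await consumer.connect();
  await consumer.subscribe({ topic: "transactions", fromBeginning: false });

  // Each partition is owned by exactly one consumer in the group, and messages
  // within a partition are handled one at a time, so transactions for a given
  // account are applied sequentially and in order.
  await consumer.run({
    eachMessage: async ({ message }) => {
      const tx: Transaction = JSON.parse(message.value!.toString());
      await processTransaction(tx); // durable update, inside a DB transaction
      await refreshCache(tx);       // opportunistic cache update
    },
  });
}

run().catch(console.error);
```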
Load Testing and Results
The system was stress-tested using k6 with scenarios including the following (a simplified scenario script is sketched after the list):
- Account creation
- User sign-ins
- Deposits, withdrawals, and transfers
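A simplified k6 scenario looks roughly like the sketch below; the endpoint, payload, and load profile are illustrative rather than the exact test configuration:

```typescript
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  vus: 200,        // virtual users; illustrative, not the real test profile
  duration: "2m",
};

export default function () {
  // Hypothetical deposit endpoint; the real tests also cover sign-up,
  // sign-in, withdrawals, and transfers.
  const res = http.post(
    "http://localhost:8080/api/transactions/deposit",
    JSON.stringify({ accountId: "acc-123", amount: 10 }),
    { headers: { "Content-Type": "application/json" } }
  );

  check(res, { "status is 200": (r) => r.status === 200 });
  sleep(1);
}
```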
Results:
- Sustained 3600+ requests per second
- 100% success rate
- Sub-second p95 latencies
Correctness was maintained throughout.
Observability (and Its Limits)
Prometheus and Grafana are used to monitor:
- Request rate
- Latency
- CPU and memory usage of application servers
Infrastructure-level metrics for Kafka, Redis, and PostgreSQL are not instrumented in this version.
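For the application-server metrics, a typical Node.js setup exposes a /metrics endpoint with prom-client; this is a sketch assuming an Express app, not the repository's actual instrumentation:

```typescript
import express from "express";
import client from "prom-client";

const app = express();

// Default process metrics (CPU, memory, event loop lag) plus a request histogram.
client.collectDefaultMetrics();

const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status"],
});

app.use((req, res, next) => {
  const endTimer = httpRequestDuration.startTimer();
  res.on("finish", () => {
    endTimer({ method: req.method, route: req.path, status: String(res.statusCode) });
  });
  next();
});

// Prometheus scrapes this endpoint; Grafana visualizes the results.
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3000);
```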
Trade-offs and Limitations
This system intentionally avoids:
- Semantic request deduplication
- Distributed transactions across PostgreSQL and Redis
- Full infrastructure-level observability
These trade-offs keep the system simpler and more reliable.
What I’d Improve Next
Future improvements could include:
- Kafka, Redis, and PostgreSQL metrics
- Database sharding strategies
- Advanced failure recovery mechanisms
Final Takeaways
- Ordering simplifies concurrency
- Correctness matters more than raw speed
- Distributed systems are about trade-offs, not perfection
- Designing for correctness early prevents painful rewrites
Scalability isn’t about handling more requests —
it’s about handling them correctly.
