Introduction
A stock exchange is one of the most demanding and unforgiving systems in software engineering. A single millisecond of downtime means millions of dollars lost. A single duplicate transaction means regulatory shutdown. A single race condition on a balance means catastrophic financial loss. This post walks through the design challenge by challenge, including the wrong turns and how to navigate out of them.
Challenge 1: The Order Book Data Structure
Interview Question: A stock exchange must match buyers and sellers in strict price and time priority in microseconds. What data structure holds all pending buy and sell orders and efficiently finds the best match when a new order arrives?
Wrong Approach: Store orders in a simple array and scan for best match.
Why It Fails: Finding the best matching order requires scanning the entire array — O(n) time. At 10 million orders per second this is completely unacceptable.
Navigation: The key insight is to bucket orders by price level. Within each price level maintain a queue for time priority. Use a hashmap for O(1) price level lookup. But a hashmap alone cannot efficiently find the next best price without scanning all keys. You need a data structure that always gives you minimum sell price and maximum buy price instantly.
Solution: Order Book using Balanced BST plus Hashmap plus Queue per price level.
Buy Orders (Bids):
- $152 → Order3, Order7
- $151 → Order1, Order9
- $150 → Order2, Order4
Sell Orders (Asks):
- $153 → Order5, Order6
- $154 → Order8, Order10
- $155 → Order11, Order12
BST.max() returns best bid price — O(log n)
BST.min() returns best ask price — O(log n)
Queue per price level maintains FIFO time priority — O(1) enqueue and dequeue
Full Order Book operations:
- Insert new order — add to queue at price level, insert price into BST — O(log n)
- Find best match — BST.min() for sells, BST.max() for buys — O(log n)
- Remove matched order — dequeue from front of price queue — O(1)
- Remove empty price level — delete from BST — O(log n)
This is called Price-Time Priority matching — best price wins, earliest order wins at same price.
Key Insight: The Order Book is three data structures working together — BST for price level ordering, Hashmap for O(1) price lookup, and Queue for time priority within each price level.
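The three structures can be sketched in Python as follows. This is a minimal illustration, not a production engine: two heaps stand in for the balanced BST (Python's standard library has no BST), with empty price levels removed lazily, and all names are illustrative.

```python
import heapq
from collections import deque

class OrderBook:
    """Price-time priority book: dict for O(1) price lookup,
    heaps for best-price ordering, deque per level for FIFO."""

    def __init__(self):
        self.bids = {}        # price -> deque of order ids (FIFO time priority)
        self.asks = {}
        self.bid_heap = []    # max-heap via negated prices
        self.ask_heap = []    # min-heap

    def add_bid(self, price, order_id):
        if not self.bids.get(price):               # new (or emptied) price level
            self.bids[price] = deque()
            heapq.heappush(self.bid_heap, -price)  # O(log n) price insert
        self.bids[price].append(order_id)          # O(1) time-priority enqueue

    def add_ask(self, price, order_id):
        if not self.asks.get(price):
            self.asks[price] = deque()
            heapq.heappush(self.ask_heap, price)
        self.asks[price].append(order_id)

    def best_bid(self):
        while self.bid_heap and not self.bids.get(-self.bid_heap[0]):
            heapq.heappop(self.bid_heap)           # drop empty levels lazily
        return -self.bid_heap[0] if self.bid_heap else None

    def best_ask(self):
        while self.ask_heap and not self.asks.get(self.ask_heap[0]):
            heapq.heappop(self.ask_heap)
        return self.ask_heap[0] if self.ask_heap else None

    def match(self):
        """Pair the earliest orders at the best crossing prices, if any."""
        bid, ask = self.best_bid(), self.best_ask()
        if bid is None or ask is None or bid < ask:
            return None                            # book does not cross
        return (self.bids[bid].popleft(),          # O(1) FIFO dequeue
                self.asks[ask].popleft(), ask)
```

Best price wins because `match()` always reads the top of each heap; earliest order wins at the same price because each level is a FIFO deque.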
Challenge 2: Fault Tolerance Without Sacrificing Microsecond Latency
Interview Question: The Order Book lives in memory on one machine. A crash loses every pending order — billions of dollars of open orders gone. But distributing across machines adds milliseconds of latency, destroying microsecond matching. How do you make a single-machine Order Book fault tolerant?
Wrong Approach 1: Periodic snapshots to disk every few seconds.
Why It Fails: With 5-second snapshots, a crash 4.9 seconds after the last snapshot loses 4.9 seconds of orders. Any gap is unacceptable in a financial system.
Wrong Approach 2: Async write to disk for every order.
Why It Fails: Async writes are fast but data between the write and crash is still lost. Even a tiny gap is catastrophic.
Navigation: Disk writes are slow because of mechanical latency. What if persistent storage operated at RAM speed? And what if replication happened over a network so fast it was equivalent to local memory access?
Solution: NVM RAM plus RDMA replication plus Write Ahead Log.
NVM RAM — Non-Volatile Memory:
- Writes at RAM speed — nanoseconds not milliseconds
- Data survives power loss like a disk
- Combines speed of RAM with durability of disk
RDMA Replication — Remote Direct Memory Access:
- Write to standby machine RAM over ultra fast data center network
- Takes roughly 1 microsecond — equivalent to local memory access
- Standby machine has identical Order Book state at all times
Write Ahead Log — WAL:
- Every order written to NVM RAM log before being applied to Order Book
- On crash — restore the last snapshot, replay the WAL, recover every single order
- Zero data loss guaranteed
Full fault tolerance flow:
- New order arrives
- Written to NVM RAM WAL simultaneously with matching engine processing
- RDMA replication to standby machine in 1 microsecond
- Matching engine never blocked — all persistence happens at memory speed
- Machine crashes — standby takes over instantly — replays any missed WAL entries
Key Insight: NVM RAM for local persistence and RDMA replication for standby failover achieves zero data loss with microsecond failover without ever blocking the matching engine.
Challenge 3: Crash Safe Write Ahead Log
Interview Question: What happens if the system crashes while writing the log entry itself? A partial log entry could corrupt recovery entirely.
Navigation: The WAL itself needs crash safety. A log entry must be either fully written and valid or not written at all. Partial entries must be instantly detectable and discardable on recovery.
Solution: Checksum plus UUID idempotency.
Checksum for entry validity:
Every log entry includes a checksum computed from all fields in the entry. On recovery, recompute the checksum from the entry fields. If it matches, the entry was fully written and is safe to replay. If it does not match, the entry was partially written and must be discarded.
UUID for idempotent replay:
Every log entry also carries a unique UUID. Before executing any step, check whether that UUID has already been processed. If it has, skip it — no duplicate execution. If it has not, execute it and mark it complete in the log.
Combined crash safety covers every failure scenario:
- Crash before log write completes — checksum mismatch — entry discarded — no action taken
- Crash after log written but before execution — both steps PENDING — replay safely from beginning
- Crash mid execution — UUID check — completed steps skipped — pending steps executed
- Zero data loss and zero duplicates regardless of crash timing
Key Insight: Checksum detects corrupt log entries. UUID prevents duplicate execution on replay. Together they make the WAL completely crash safe at every possible failure point.
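The two mechanisms can be sketched together. This is an illustrative recovery loop, assuming JSON-encoded log entries; the entry layout and helper names are my own, not from any particular exchange:

```python
import hashlib
import json
import uuid

def make_entry(payload):
    """Append-ready WAL entry: a UUID plus a checksum over all fields."""
    body = {"uuid": str(uuid.uuid4()), **payload}
    serialized = json.dumps(body, sort_keys=True)
    checksum = hashlib.sha256(serialized.encode()).hexdigest()
    return json.dumps({"body": body, "checksum": checksum})

def replay(log_lines, processed, apply):
    """Recovery: discard torn entries, skip UUIDs that already ran."""
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue                       # partial write: not even valid JSON
        serialized = json.dumps(record["body"], sort_keys=True)
        if hashlib.sha256(serialized.encode()).hexdigest() != record["checksum"]:
            continue                       # checksum mismatch: torn write, discard
        uid = record["body"]["uuid"]
        if uid in processed:
            continue                       # idempotency: already executed, skip
        apply(record["body"])
        processed.add(uid)
```

A truncated entry fails the JSON parse or the checksum and is silently dropped; replaying the same log twice applies each valid entry exactly once.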
Challenge 4: Stop Loss Cascade and Circuit Breakers
Interview Question: Millions of stop loss orders all trigger simultaneously when a price drops sharply. Each converts to a market order instantly flooding the matching engine. This is what caused the Flash Crash of 2010 when the Dow Jones dropped 1000 points in minutes. How do you handle this?
Solution Part 1 — Separate Stop Loss Order Book:
Maintain a dedicated Stop Loss Order Book alongside the main Order Book using the same price bucketing structure. Each bucket contains orders that trigger at that price level. When price drops all orders at triggered price levels are converted to market orders simultaneously and queued for the matching engine.
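A sketch of the trigger-detection side, using the same price-bucketing idea (a dict of FIFO queues keyed by trigger price; the function names are illustrative):

```python
from collections import defaultdict, deque

stop_book = defaultdict(deque)     # trigger price -> stop-loss orders, FIFO

def add_stop_loss(trigger_price, order):
    stop_book[trigger_price].append(order)

def on_price_drop(new_price, market_queue):
    """Convert every stop whose trigger is at or above the new price
    into a market order queued for the matching engine."""
    triggered = sorted((p for p in stop_book if p >= new_price), reverse=True)
    for trigger in triggered:      # higher triggers fire first as price falls
        while stop_book[trigger]:
            market_queue.append(stop_book[trigger].popleft())
        del stop_book[trigger]     # remove the emptied price bucket
```

Detection is a lookup over price buckets rather than a scan of every order, mirroring the main Order Book's structure.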
Solution Part 2 — Circuit Breakers — Lower Circuit and Upper Circuit:
When price moves too fast the exchange temporarily halts trading to prevent cascade.
Indian market NSE/BSE circuit breaker levels:
- Index drops 10 percent — trading halts 45 minutes
- Index drops 15 percent — trading halts 1 hour 45 minutes
- Index drops 20 percent — trading halts rest of day
- Individual stocks have 5, 10, or 20 percent circuit limits
Implementation:
- Price update arrives
- Circuit Breaker Service checks current price versus opening price
- Move exceeds circuit limit — set stock status to HALTED in Redis instantly
- Matching Engine checks Redis status before processing every single order
- Status HALTED — new orders are not matched — they are held in a pending queue with original time priority preserved
- Timer expires — status set back to ACTIVE — pending queue resumes processing in order
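The steps above can be sketched as follows. A plain dict stands in for Redis, the halt-duration labels mirror the NSE/BSE index levels listed earlier, and all names are illustrative:

```python
status = {}       # stands in for Redis: symbol -> "ACTIVE" or "HALTED"
pending = []      # orders held during a halt, arrival order = time priority

# (threshold, halt duration) checked from the widest limit down
LIMITS = [(0.20, "rest_of_day"), (0.15, "105min"), (0.10, "45min")]

def on_price_update(symbol, opening_price, current_price):
    """Trip the breaker when the move from opening exceeds a circuit limit."""
    move = abs(current_price - opening_price) / opening_price
    for limit, halt_duration in LIMITS:
        if move >= limit:
            status[symbol] = "HALTED"   # checked before every order
            return halt_duration
    return None

def submit_order(symbol, order):
    if status.get(symbol, "ACTIVE") == "HALTED":
        pending.append(order)           # held, time priority preserved
        return "QUEUED"
    return "MATCHED"                    # forwarded to the matching engine
```

When the timer expires, setting the status back to ACTIVE lets the pending list drain in its original order.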
Stop loss cascade with circuit breakers in action:
- Apple drops to $144 — 2 million stop losses trigger simultaneously
- First batch hits matching engine
- Price drops 10 percent from opening — circuit breaker trips
- Redis status set to HALTED
- Remaining 1.9 million stop losses held in pending queue
- Trading halts 45 minutes — market stabilizes
- Trading resumes — orders process in controlled manner with original time priority preserved
Key Insight: A separate Stop Loss Order Book handles trigger detection efficiently. Circuit breakers act as the emergency brake preventing cascade failures from destroying market stability.
Challenge 5: Atomic Trade Settlement
Interview Question: Every trade must settle atomically — buyer gets shares and seller gets money together or neither happens. A crash between the two steps leaves one party with nothing. How do you guarantee atomicity?
Solution: Two Phase Commit with Write Ahead Log and UUID idempotency.
Phase 1 — Write intent to WAL before executing anything:
Both settlement steps are written to the WAL as PENDING with a single UUID before any execution begins. This is the commit point — if the system crashes before this write nothing has happened and nothing needs to be undone.
Phase 2 — Execute steps and mark complete one by one:
Execute step 1, mark it COMPLETED in WAL. Execute step 2, mark it COMPLETED in WAL. Transaction fully settled.
Crash recovery at any point:
- Crash before WAL write — checksum mismatch — entry discarded — no action taken
- Crash after WAL written — both steps PENDING — replay from beginning safely
- Crash after step 1 — WAL shows step 1 COMPLETED step 2 PENDING — UUID check skips step 1 — executes step 2 only
- Crash after step 2 — WAL shows both COMPLETED — UUID check skips both — transaction already settled
Note on T+2 settlement: Many markets settle trades 2 business days after execution (some, including the US since 2024, have moved to T+1). The WAL and two phase commit pattern applies identically at settlement time — the same crash safety guarantees apply whether settlement happens in microseconds or days later.
Key Insight: Write intent before acting, mark completion after each step, use UUID to prevent duplicate execution. This three part pattern makes any multi-step financial operation completely crash safe.
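As a minimal sketch of the pattern, with an in-memory dict standing in for the persistent WAL and all names hypothetical: re-running `settle` after a simulated crash executes only the steps still marked PENDING.

```python
def settle(wal, accounts, trade):
    """Phase 1: record both steps as PENDING under one UUID.
    Phase 2: execute each step and mark it COMPLETED.
    Safe to re-run after a crash at any point."""
    tid = trade["uuid"]
    if tid not in wal:                      # Phase 1: write intent before acting
        wal[tid] = {"shares": "PENDING", "funds": "PENDING"}
    entry = wal[tid]
    amount = trade["price"] * trade["qty"]
    if entry["shares"] == "PENDING":        # step 1: buyer gets shares
        accounts[trade["buyer"]]["shares"] += trade["qty"]
        accounts[trade["seller"]]["shares"] -= trade["qty"]
        entry["shares"] = "COMPLETED"
    if entry["funds"] == "PENDING":         # step 2: seller gets money
        accounts[trade["seller"]]["cash"] += amount
        accounts[trade["buyer"]]["cash"] -= amount
        entry["funds"] = "COMPLETED"
```

In a real system the WAL writes and COMPLETED markers would hit durable storage before execution continues; the dict only shows the control flow.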
Challenge 6: Counterparty Risk and Pre-Trade Fund Locking
Interview Question: Settlement happens T+2 days after trade execution. What if the buyer does not have enough money or the seller does not own the shares they sold? How does the exchange protect against counterparty risk?
Solution: Pre-trade risk checks with immediate fund and share locking.
Before any order enters the Order Book:
- Check available balance — does buyer have sufficient funds?
- Lock required funds immediately — frozen for this specific order
- Check share ownership — does seller actually own the shares?
- Lock shares immediately — cannot be sold in any other order
Account state during pending trade:
- Total Balance — 15 million dollars
- Locked Amount — 10 million dollars — frozen for the pending order
- Available Balance — 5 million dollars — usable for new orders only
Lock lifecycle:
- Order placed — funds locked immediately
- Order cancelled — locked funds released immediately
- Order partially filled — a proportional share of the locked amount is released
- Order fully filled — locked funds transferred to counterparty at T+2
In financial terminology these are margin requirements. Exchanges also require traders to maintain a minimum margin balance as additional protection against large adverse price moves during the T+2 window.
Key Insight: Pre-trade risk checks with immediate fund locking eliminate counterparty risk entirely. By the time a trade executes all required funds and shares are already reserved and cannot be used elsewhere.
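The lock lifecycle on a single account can be sketched like this (the class and method names are illustrative, not a real exchange API):

```python
class Account:
    """Illustrative trader account with pre-trade fund locking."""

    def __init__(self, balance):
        self.total = balance
        self.locked = 0

    @property
    def available(self):
        return self.total - self.locked   # only this portion backs new orders

    def lock(self, amount):
        if amount > self.available:
            raise ValueError("insufficient available balance")
        self.locked += amount             # order placed: funds frozen

    def release(self, amount):
        self.locked -= amount             # order cancelled or partially filled

    def settle(self, amount):
        self.locked -= amount             # order filled: transfer at settlement
        self.total -= amount
```

With 15 million total and 10 million locked, a new 6 million order fails the `available` check even though the total balance would cover it: that is the whole point of the lock.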
Challenge 7: Balance Storage — ACID vs Speed
Interview Question: Trader balance checks happen before every order — thousands per second. You need extremely fast reads and strongly consistent writes. Eventual consistency is not acceptable. What storage solution do you use?
Wrong Approach: Store all balance data in Redis only.
Why It Fails: Redis in cluster mode is eventually consistent. Two simultaneous orders can both read the same available balance before either locks funds — allowing a trader to commit more funds than they have. This race condition on financial balance is catastrophic.
Why Eventual Consistency Is Unacceptable Here:
- Twitter feed showing slightly stale tweets — acceptable, nobody loses money
- Uber showing driver location 100 meters off — acceptable, minor inconvenience
- Trader balance showing stale available funds — catastrophic, exchange loses millions instantly
Solution: Relational Database for ACID guarantees plus Redis cache for read speed.
Write path — fund locking via relational database with row level locking:
Begin transaction. Select balance for trader with row level lock. If available balance is greater than or equal to order amount then update locked amount and reduce available balance and commit. Otherwise rollback and reject order. End transaction.
Two simultaneous orders handled safely:
- Order 1 acquires the row level lock — locks 10 million dollars
- Order 2 waits for the row level lock to be released
- Order 1 commits — lock released — balance now reflects the locked funds
- Order 2 reads updated balance — zero available — rejected cleanly
- No double spending possible under any timing scenario
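A sketch of that write path, using SQLite for a self-contained example (SQLite serializes writers at the database level; in PostgreSQL you would add FOR UPDATE to the SELECT for true row-level locking — the transaction shape is the same). Table and function names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
conn.execute(
    "CREATE TABLE balances (trader TEXT PRIMARY KEY, total INTEGER, locked INTEGER)"
)
conn.execute("INSERT INTO balances VALUES ('T1', ?, 0)", (15_000_000,))

def try_lock_funds(conn, trader_id, amount):
    """One atomic transaction: read the balance under a lock,
    then lock the funds or reject the order."""
    cur = conn.cursor()
    cur.execute("BEGIN IMMEDIATE")          # take the write lock before reading
    cur.execute("SELECT total - locked FROM balances WHERE trader = ?",
                (trader_id,))
    available = cur.fetchone()[0]
    if available >= amount:
        cur.execute("UPDATE balances SET locked = locked + ? WHERE trader = ?",
                    (amount, trader_id))
        conn.commit()                       # funds now frozen for this order
        return True
    conn.rollback()                         # insufficient funds: reject
    return False
```

The first 10 million dollar order succeeds; a second simultaneous 10 million dollar order waits on the lock, then reads only 5 million available and is rejected cleanly.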
Read path — balance check via Redis cache:
- Available balance cached in Redis — O(1) read for every order check
- Balance updated in relational DB — Redis cache invalidated immediately
- Next read — cache miss — reload from DB — cache refreshed
- Stale cache never used for actual locking — only the relational DB performs locking
Key Insight: Relational database provides ACID guarantees preventing race conditions on financial balances. Redis cache provides read speed for high throughput balance checks. Never use eventual consistency for financial data.
Full Architecture Summary
Order Book data structure — BST plus Hashmap plus Queue per price level
Fault tolerance — NVM RAM plus RDMA replication plus Write Ahead Log
WAL crash safety — Checksum for entry validity plus UUID for idempotent replay
Stop loss handling — Separate Stop Loss Order Book with price bucketing
Cascade prevention — Circuit breakers with Redis status and pending queue
Trade settlement — Two Phase Commit with WAL replay
Counterparty risk — Pre-trade risk checks with immediate fund locking
Balance storage — Relational DB for ACID plus Redis cache for read speed
Final Thoughts
A stock exchange is where computer science meets finance at the highest possible stakes. Every design decision has a dollar value attached to it. Microsecond latency, zero data loss, atomic settlements, and race condition free balance management are not nice to have — they are regulatory requirements.
The recurring theme throughout this design is that financial systems demand absolute guarantees at every layer. Where other systems tolerate eventual consistency and minor data loss, a stock exchange tolerates neither. Write Ahead Logging, ACID transactions, checksums, UUID idempotency, and circuit breakers are not over-engineering — they are the minimum bar.
Happy building. 🚀