Why another exchange architecture post?

Every few months a new "crypto exchange reference architecture" hits Hacker News, full of colored boxes and microservice buzzwords. Most of them gloss over the boring but regulator-shaped constraints that turn an elegant diagram into a hairball in production.

I spent the last six years keeping real exchanges online while answering auditors at 2 AM. This post is the cheat-sheet I wish I had before writing my first line of matching-engine code.

What you will get:

Concrete patterns (and anti-patterns) for KYC/AML, custody, and auditability.
Pseudo-code that compiles in your head, not on the slide deck.
War stories about things that actually broke — and why they didn’t become front-page news.
If you are building anything that touches digital assets — full exchange, brokerage widget or internal treasury tool — keep reading.

REAL-WORLD CONSTRAINTS
The ugly constraints you can’t diagram away

Before we sketch services and message buses, we need to accept that regulators get a permanent seat in your incident channel. The following constraints shape every design choice:

Licenses: Each jurisdiction demands its own sandbox environment, reporting API, and audit schedule. Your code must be portable across them or you will fork yourself into oblivion.
KYC / AML: Identity verification (Onfido, Jumio), real-time sanctions screening (OFAC, EU, UN lists), behavioural transaction monitoring, and quarterly external audits are non-negotiable.
Travel Rule: For VASP-to-VASP transfers you must send two things — the coins and the counterparties’ identifying data — within seconds.
Immutable audit logs: Write-once, append-only storage with crypto signing and geo-replication. If you cannot replay every balance change, you are out of business.
Custody split: Hot wallets with rate limits, warm wallets for batched outflows, cold storage in air-gapped HSMs. Automated daily sweeps keep hot balances low.
Withdrawal throttling: Per-user, per-asset and global caps with multi-sig unlocks. The risk team will wake you if your math allows 0.1 BTC more than the policy.
Rate limits & abuse prevention: Public APIs face bot armies; admin APIs must survive fat-finger mistakes. Circuit-breakers and RBAC matter as much as TPS numbers.
Design anything that ignores even one of these bullets and you will retrofit it later — during a production incident.

BIRD’S-EYE ARCHITECTURE
Architecture at 10 000 ft

Below is the textual version of the diagram I keep on the whiteboard. Feel free to steal it.

Network & isolation layers

Public zone ➜ REST/WebSocket gateways, rate-limited.
Private zone ➜ Stateless API pods + queues.
Core processing zone ➜ Matching engine, risk engine, wallet service.
Admin / air-gapped zone ➜ Cold wallets, HSMs, reconciliation tools.
Service modules (inside the boxes)

User Management → registration, RBAC, passwordless auth.
KYC Module → calls Onfido/Jumio, sanctions API, behavioural scoring.
Order Matching → in-memory order book with write-ahead log.
Risk Engine → pre-trade checks, withdrawal caps, circuit breakers.
Wallet Service → HD address derivation, hot/warm/cold orchestration.
Notification Service → e-mail, push, Slack for ops.
Reporting & Analytics → daily regulatory exports, proof-of-reserves.
Glue & messaging

A single event bus (Kafka/NATS) carries account-credited, order-filled, kyc-passed events. Producers publish with unique IDs; consumers are idempotent.
Critical paths (matching, balance updates) are strongly consistent within a single service boundary; cross-service propagation is eventually consistent and tolerates seconds of lag.
Where eventual consistency works:

E-mail notifications
Aggregated trading metrics
Where it does not:

Matching engine vs ledger (balances)
Wallet hot-balance tracking vs withdrawal API
Design rule: if a race loses money, make it synchronous; if it only delays an alert, ship it onto the bus. The sketch below shows the consumer half of that rule.
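
To make "consumers are idempotent" concrete, here is a minimal sketch. Event, Ledger, HasEvent and ApplyOnce are illustrative stand-ins, not a specific library API; the only load-bearing idea is that applying a change and recording its event ID happen atomically.

// Illustrative types: your bus payload and ledger API will differ.
type Event struct {
    ID      string // unique ID assigned by the producer
    Kind    string // "account-credited", "order-filled", ...
    Payload []byte
}

type Ledger interface {
    HasEvent(id string) bool                   // O(1) dedupe lookup
    ApplyOnce(id string, payload []byte) error // apply + record ID in one tx
}

func handle(l Ledger, e Event) error {
    if l.HasEvent(e.ID) {
        return nil // at-least-once bus redelivered: ACK and move on
    }
    // Apply the change and record the event ID in the same transaction,
    // so a crash can never apply twice or mark without applying.
    return l.ApplyOnce(e.ID, e.Payload)
}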

DEEP DIVE: MATCHING ENGINE CONSISTENCY
What can go wrong when prices move faster than disk writes?

The matching layer is where milliseconds and dollars intersect. The in-memory order book keeps latency under 10 ms, but regulators will ask: “How do you prove it never lost an order?”

Core pattern

Dual write: every mutation hits RAM and a persistent write-ahead log (WAL) before we ACK to the gateway.
Sequence numbers: each event gets a monotonic seq_id. Consumers detect holes and stall.
Replay on boot: on start-up the engine loads the last snapshot, then replays the WAL to reconstruct the book deterministically (sketched below).
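
A minimal sketch of that boot sequence, assuming WAL entries arrive ordered by seq_id. Snapshot, WalEntry and OrderBook are illustrative types, not a real engine API:

type Snapshot struct {
    Seq  uint64 // last sequence folded into this snapshot
    Book *OrderBook
}

type WalEntry struct {
    Seq uint64
    // ... order mutation payload
}

// Deterministic recovery: fold the snapshot, then replay newer WAL entries.
func recoverBook(snap Snapshot, wal []WalEntry) (*OrderBook, error) {
    book := snap.Book    // book state as of snap.Seq
    next := snap.Seq + 1 // first entry we still need
    for _, e := range wal {
        if e.Seq < next {
            continue // already folded into the snapshot
        }
        if e.Seq != next {
            // A hole means lost writes: refuse to serve rather than
            // silently rebuild a different book than the one that crashed.
            return nil, fmt.Errorf("WAL gap: want %d, got %d", next, e.Seq)
        }
        book.Apply(e) // must be a pure function of (book, entry)
        next++
    }
    return book, nil
}
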
Danger zones

WAL on the same box as RAM — a kernel panic ruins both. Use a replicated log (Kafka with acks=all or Raft) or local NVMe + async shipper.
Cross-exchange latency arbitrage — if your public WebSocket lags behind the engine feed, traders will exploit it. Publish the same internal seq_id so they can audit fairness.
Failover inconsistencies — replica must catch up before it starts matching. A naïve leader election that promotes a stale node tends to cost seven figures.
Minimal idempotent fill handler (pseudo-code)

onMatch(fill):
    if auditLog.exists(fill.id):
        return // duplicate delivery
    ledger.debit(fill.makerId, fill.baseQty)
    ledger.credit(fill.takerId, fill.baseQty)
    auditLog.append(fill)

auditLog.exists is O(1) via a Bloom filter + secondary immutable store. The handler can run twice without breaking balances — the ledger is a strictly monotonic event store itself.
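
A sketch of that two-tier lookup; bloom and store are illustrative handles, and the false-positive rate depends entirely on how you size the filter:

// "Definitely not seen" answered in O(1); only possible hits pay for
// a lookup in the immutable store.
func seenFill(id string) bool {
    if !bloom.MightContain(id) {
        return false // Bloom filters have no false negatives
    }
    // Possible false positive: confirm against the append-only store
    // before treating the fill as a duplicate.
    return store.Exists(id)
}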

DEEP DIVE: WALLET & CUSTODY
Your wallet service is a mini-bank; treat it like one

Custody failures dwarf matching-engine failures in cost and publicity, so the design leans on defense in depth.

Wallet tiers

Hot (in the private zone) — single HSM-backed key, per-asset withdrawal limits, real-time balance monitor. Aim for <1 % of circulating user balances.
Warm (separate VPC) — multi-sig, used for scheduled bulk withdrawals and inter-exchange transfers.
Cold (air-gapped) — multi-party computation (MPC) or classic 3-of-5 hardware wallets, accessible only via escorted procedures.
Automated sweep protocol

New deposits land on hot addresses derived from an HD key.
Cron (or event trigger) moves excess funds to warm if hot balance > threshold.
Daily job writes a manifest, signs it in the air-gapped room, then publishes a “cold-sweep required” ticket (the step-2 sweep trigger is sketched below).
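
A sketch of the step-2 trigger, assuming hypothetical hotWallet/warmWallet handles and a per-asset threshold table; real sweeps also batch by asset and mind on-chain fees:

// Move the excess above the hot-balance policy into the warm tier.
func maybeSweep(asset string) error {
    bal := hotWallet.Balance(asset)
    if bal <= hotThreshold[asset] {
        return nil // within policy: nothing to do
    }
    excess := bal - hotThreshold[asset]
    // The warm destination is multi-sig, so no single key can redirect it.
    return hotWallet.Transfer(asset, warmWallet.Address(asset), excess)
}
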
Withdrawal pipeline

client.requestWithdrawal()
→ API Gateway (idempotency-key)
→ WalletService.validate(user, addr, amt)
→ RiskEngine.checkLimits()
→ Chainalysis.score(addr)
→ queue:withdrawal
→ HotWalletSigner (HSM)
→ broadcast tx
→ event:withdrawal-broadcast
Places we permit eventual consistency: the email that says “your withdrawal is on the way”. Places we don’t: the ledger debit vs on-chain broadcast — those must be in the same atomic unit.

Idempotent withdrawal consumer (Go-ish pseudo-code)

func Handle(msg WithdrawalMsg) error {
    if ledger.HasTx(msg.Id) {
        return nil // already processed: duplicate delivery is a no-op
    }
    if !risk.StillValid(msg) {
        return fmt.Errorf("risk window expired")
    }
    // 1. Debit user in internal ledger
    ledger.Move(msg.UserId, hotWalletId, msg.Amount)
    // 2. Sign & send
    tx := hotWallet.Sign(msg.Address, msg.Amount)
    broadcast(tx)
    // 3. Persist irreversible record (a crash before this line is
    //    caught by the reconciliation daemon, see below)
    audit.Append("withdrawal", msg.Id, tx.Hash)
    return nil
}
If the consumer crashes after step 2 but before step 3, the reconciliation daemon will detect an on-chain tx without an audit record and backfill it. Worst-case outcome: extra log line, not missing funds.
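
A sketch of that daemon's inner loop. chain (an indexer of outgoing transactions), signerLog (the HSM signer's journal) and the audit client are assumptions for illustration:

// Find on-chain withdrawals with no audit record and backfill them.
func reconcile(since time.Time) error {
    for _, tx := range chain.OutgoingSince(hotWalletAddr, since) {
        if audit.HasTxHash(tx.Hash) {
            continue // happy path: step 3 completed before any crash
        }
        // Crash window between broadcast (2) and audit append (3):
        // recover the originating message from the signer's journal.
        msg, ok := signerLog.MsgForTx(tx.Hash)
        if !ok {
            return fmt.Errorf("unknown outgoing tx %s: page a human", tx.Hash)
        }
        audit.Append("withdrawal", msg.Id, tx.Hash) // the extra log line
    }
    return nil
}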

Security checklist (non-exhaustive)

AES-256 at rest, TLS 1.3 in transit.
RBAC with per-asset roles; only the BTC-signer service account can touch BTC keys.
Blockchain analytics exposure score gates high-risk addresses.
Daily proof-of-reserves job reads the ledger event store and cold-wallet balances, then posts the Merkle root publicly (sketched below).
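
A minimal sketch of that Merkle root computation over pre-hashed (account, balance) leaves. The leaf layout is an assumption; production proof-of-reserves schemes typically salt each leaf so balances cannot be enumerated from the published tree.

// merkleRoot folds a list of leaf hashes pairwise with SHA-256.
func merkleRoot(leaves [][]byte) []byte {
    if len(leaves) == 0 {
        return nil
    }
    level := make([][]byte, len(leaves))
    copy(level, leaves)
    for len(level) > 1 {
        var next [][]byte
        for i := 0; i < len(level); i += 2 {
            if i+1 == len(level) {
                next = append(next, level[i]) // odd node carries up unchanged
                continue
            }
            buf := make([]byte, 0, len(level[i])+len(level[i+1]))
            buf = append(buf, level[i]...)
            buf = append(buf, level[i+1]...)
            h := sha256.Sum256(buf)
            next = append(next, h[:])
        }
        level = next
    }
    return level[0]
}
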
DEEP DIVE: COMPLIANCE & AUDIT LOGS
Logs, or it didn’t happen

If you cannot prove an invariant, assume it never held. That is the mindset regulators adopt when the subpoena arrives.

Requirements we have to hit

Immutable, append-only storage (WORM or S3 Object Lock).
Cryptographic signing of every entry; a SHA-256 chain makes tampering detectable.
Geo-replication — at least two fault domains.
User actions, system changes, financial events — everything funnels through the same log schema.
Architecture pattern: immutable audit trail

┌──────────┐  append   ┌─────────┐   async   ┌────────────────┐
│ producer │──────────▶│ log-api │──────────▶│  WORM storage  │
└──────────┘ REST+sig  └─────────┘           │ (S3 + Object   │
                                             │  Lock + KMS)   │
                                             └────────────────┘
log-api computes entryHash = SHA256(prevHash + payload) and writes once.
Kafka holds a triple-replicated copy for low-latency queries; the WORM bucket is the source of truth.
Minimal Go writer showing the hash chain:

func Append(prevHash, payload []byte) (newHash []byte, err error) {
    h := sha256.New()
    h.Write(prevHash)
    h.Write(payload)
    newHash = h.Sum(nil)

    entry := Entry{Prev: prevHash, Data: payload, Hash: newHash}
    if err = wormStore.Put(entry); err != nil {
        return nil, err
    }
    kafkaBus.Publish(entry) // optional fast path
    return newHash, nil
}
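
The matching verifier, sketched under the same assumptions (Entry as defined by the writer, an iterable copy of the log): recompute every hash and check both the stored value and the link to the previous entry. Run it against the WORM copy, not the Kafka mirror, since the bucket is the source of truth.

func VerifyChain(entries []Entry) error {
    var prev []byte // the genesis entry links to a nil hash
    for i, e := range entries {
        if !bytes.Equal(e.Prev, prev) {
            return fmt.Errorf("entry %d: broken link to previous hash", i)
        }
        h := sha256.New()
        h.Write(e.Prev)
        h.Write(e.Data)
        if !bytes.Equal(h.Sum(nil), e.Hash) {
            return fmt.Errorf("entry %d: payload or hash tampered", i)
        }
        prev = e.Hash
    }
    return nil
}
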
Retention & querying

Regulators typically ask for 7 years online, 15 years cold. Glacier Deep Archive is cheaper than lawsuits.
Index only metadata in Elasticsearch. Large binary blobs (e.g. KYC documents) stay in object storage; paths are in the log.
Common failure mode: devs delete a topic to “clear staging data” and the change propagates to production (see incident #3 below). Mitigation: a root guardrail that prevents deletion on prod clusters, enforced by GitOps.

FAILURE STORIES & TRADE-OFFS
Incidents that actually happen (and how to survive them)

Below are three composite incidents compiled from the last few years. Names are changed; the pager noise is real.

1. The phantom fill — WAL disk filled up at midnight

What happened: The matching engine kept matching in RAM but the write-ahead log blocked on fsync. Orders were acknowledged to users but never persisted. A node crash two hours later rewound the book.
Blast radius: 12 % of fills missing, negative balances across 64 accounts.
Why the architecture mattered:
Sequence gaps were detected by the risk engine which halted trading (circuit_breaker.seq_gap=true).
Audit log replay identified missing fills; a reconciliation script replayed them deterministically.
Takeaway: Always put WAL on a volume with its own alert budget. Disk-full is a consistency bug, not an infra ticket.

2. Hot wallet drained — but funds were safe

What happened: A leaked CI token triggered the withdrawal API in a loop. Rate limits allowed 50 BTC before detection.
Why it didn’t bankrupt the exchange:
Per-address exposure scoring blocked transfers to a high-risk address after 10 BTC.
Withdrawal velocity limits on the hot wallet paused the queue automatically.
The remaining 40 BTC sat in warm and cold tiers out of attacker reach.
Design flaw exposed: Incident responders lacked an automated “sweep remaining hot balance to cold” button. It is now one click.

3. Audit log topic deleted in staging — propagated to prod

What happened: A junior engineer testing retention settings deleted the audit.events topic in staging. Terraform re-applied the change in production (same resource ID).
Mitigations in place:
Root guardrail prevented deletion on the production cluster; plan failed.
Immutable WORM bucket held the canonical history anyway.
Cost: 30 minutes of tense Slack messages, zero data loss.
Risk check snippet that caught incident #1 (pseudo-Rust)

fn pre_trade_check(order: &Order, account: &Account) -> Result<()> {
    if account.balance < order.required_margin() {
        bail!("INSUFFICIENT_FUNDS");
    }
    if seq::gap_detected(order.seq_id) {
        circuit_breaker::trip("SEQ_GAP");
        bail!("TRADING_HALTED");
    }
    Ok(())
}
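
The seq::gap_detected call above is doing very little work. A Go sketch of the tracker behind it (one instance per upstream feed; locking omitted for brevity):

type SeqTracker struct {
    last uint64 // zero means "no event seen yet"
}

// GapDetected reports true when seq is not the immediate successor:
// a hole means lost events, a repeat means replay; both halt matching.
func (t *SeqTracker) GapDetected(seq uint64) bool {
    if t.last != 0 && seq != t.last+1 {
        return true
    }
    t.last = seq
    return false
}
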
KEY TAKEAWAYS
Cheat-sheet for your architecture review

Draw the regulatory boundary first, code second. Know which invariants map directly to license clauses (KYC completion, audit log retention) and treat them as unbreakable.
Synchronous where money can disappear, asynchronous everywhere else. Ledger debits, order matching and withdrawal signing are synchronous. Dashboards, metrics and emails can lag.
Every write is an event, every event is immutable. Adopt an event-sourced ledger early; retrofitting it after launch is a refactor few teams finish.
Wallet segregation buys you response time. Hot < Warm < Cold turns a key compromise into an ops problem instead of an existential one.
Sequence numbers are cheap insurance. From the gateway to the matching engine, gaps reveal hidden corruption.
Treat the risk engine as a first-class citizen. It is not “business logic” — it is your last defense when the unexpected happens.
Automate incident playbooks. Humans decide, software executes: pause trading, sweep wallets, trip circuit breakers.
Guardrails over guidelines. Terraform policies, IAM boundaries, and WORM storage are harder to bypass than wiki pages.
If you adopt only two ideas from this post: (1) make every critical path idempotent, and (2) never delete an audit log — even in staging.

Happy building, and may your on-call rotations be quiet.
