Designing a reliable async market data client with ordering guarantees, backpressure awareness, and recovery logic
Real-time trading systems are ingestion systems.
The hard problem is not parsing JSON quickly.
The hard problem is:
- preserving message ordering
- recovering cleanly from disconnects
- preventing silent data corruption
- handling slow consumers
- maintaining predictable latency under load
This project was built to explore those constraints using Rust’s async ecosystem.
System Constraints
Before writing code, I defined explicit design constraints:
- Messages must be processed strictly in order
- WebSocket ownership must be deterministic
- Reconnect must not lose subscription state
- Orderbook must match exchange checksum
- Consumers may be slower than ingestion
- Recovery must be automatic
Every architectural decision flowed from these constraints.
High-Level Runtime Flow
Core idea:
One ingestion loop owns the socket.
Everything else consumes typed events.
No concurrent writers.
No fragmented recovery logic.
Connection Lifecycle
Interviewers care about lifecycle clarity.
The connection state flow: Connect → Subscribe → Stream → Disconnect → Backoff → Reconnect → Resync.
Important points:
- Subscription state is stored separately from the socket, so a reconnect can replay it
On reconnect:
- backoff
- reauthenticate if needed
- resubscribe
- resync orderbook
The system never assumes connection stability.
Failure is a first-class state.
Core Event Loop Design
The WebSocket connection is owned by a single async task using tokio::select!.
Responsibilities:
- read frames
- process outgoing commands
- heartbeat
- trigger reconnect
Why single-loop ownership?
Because:
- concurrent readers introduce nondeterministic ordering
- multiple writers complicate recovery
- state transitions become fragmented
This design behaves like an actor:
one owner, explicit state transitions, deterministic execution.
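The ownership pattern can be sketched without an async runtime: a single thread draining one channel gives the same deterministic ordering that `tokio::select!` provides in the real loop. The `Event` variants and `run_loop` below are illustrative, not the project's actual types:

```rust
use std::sync::mpsc;
use std::thread;

// Everything the owner reacts to, processed strictly in arrival order.
enum Event {
    Frame(String),   // inbound WebSocket frame (stubbed as text)
    Command(String), // outgoing command, e.g. a subscribe request
    Heartbeat,
}

// The ingestion loop: sole owner of connection state. No other task
// mutates `processed`, so ordering is deterministic by construction.
fn run_loop(rx: mpsc::Receiver<Event>) -> Vec<String> {
    let mut processed = Vec::new();
    for ev in rx {
        match ev {
            Event::Frame(f) => processed.push(format!("frame:{f}")),
            Event::Command(c) => processed.push(format!("cmd:{c}")),
            Event::Heartbeat => processed.push("heartbeat".into()),
        }
    }
    processed
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let owner = thread::spawn(move || run_loop(rx));
    tx.send(Event::Command("subscribe book".into())).unwrap();
    tx.send(Event::Frame("snapshot".into())).unwrap();
    tx.send(Event::Heartbeat).unwrap();
    drop(tx); // closing the channel ends the loop cleanly
    let log = owner.join().unwrap();
    assert_eq!(log, ["cmd:subscribe book", "frame:snapshot", "heartbeat"]);
}
```

In the real client, the channel receive arms live inside one `tokio::select!` alongside the socket read, but the invariant is the same: one owner, one ordered stream of state transitions.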
What Broke First (And Why It Matters)
The initial version used multiple reader tasks:
- one for WebSocket frames
- one for parsing
- one for state updates
This worked — until reconnect logic was introduced.
During disconnects:
- tasks raced to update state
- partial orderbook snapshots were applied
- ordering bugs surfaced under load
Fix:
Move to a single ingestion loop that:
- owns the socket
- owns the parser
- owns state mutation
This eliminated race conditions and simplified recovery logic dramatically.
Lesson:
Simplicity beats parallelism in ingestion systems.
Typed Deserialization Strategy
Kraken sends heterogeneous JSON array messages.
Instead of dynamic dispatch:
```rust
use serde::Deserialize;

// Variant order matters with `untagged`: serde tries each in turn
// and keeps the first that deserializes successfully.
#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum WsMessage {
    Trade(TradeData),
    Book(OrderBookData),
    Ticker(TickerData),
    Heartbeat { event: String },
}
```
Benefits:
- Compile-time exhaustiveness
- No runtime reflection
- Deterministic routing
- Clear failure modes
Parsing becomes predictable and measurable.
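The compile-time exhaustiveness can be shown with a total routing match. The `Msg` enum below is a simplified stand-in for `WsMessage`, without the serde payload types:

```rust
// Simplified stand-in for the deserialized message enum; the real
// WsMessage carries TradeData/OrderBookData/TickerData payloads.
enum Msg {
    Trade(f64),
    Book(u64),
    Heartbeat,
}

// Exhaustive match: adding a variant without handling it is a
// compile error, not a runtime surprise.
fn route(msg: &Msg) -> &'static str {
    match msg {
        Msg::Trade(_) => "trade",
        Msg::Book(_) => "book",
        Msg::Heartbeat => "heartbeat",
    }
}

fn main() {
    assert_eq!(route(&Msg::Heartbeat), "heartbeat");
    assert_eq!(route(&Msg::Trade(101.5)), "trade");
}
```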
Orderbook State & Data Structures
Local orderbook uses BTreeMap.
Why?
- ordered price levels
- O(log n) inserts
- stable iteration
- deterministic checksum reconstruction
HashMap would give faster lookup but no ordering guarantee.
For financial systems, ordering matters more than raw speed.
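A minimal sketch of that structure, assuming prices are stored as fixed-point integer ticks so keys compare exactly (avoiding `f64` as a map key):

```rust
use std::collections::BTreeMap;

// Price levels keyed by integer ticks (e.g. price * 10^precision),
// so ordering is total and exact. Value is the resting quantity.
type Book = BTreeMap<u64, f64>;

// Apply a delta: quantity 0 removes the level, otherwise upsert.
fn apply_delta(book: &mut Book, price_ticks: u64, qty: f64) {
    if qty == 0.0 {
        book.remove(&price_ticks);
    } else {
        book.insert(price_ticks, qty);
    }
}

// Best bid = highest price; BTreeMap iterates in ascending key order,
// so the back of the key iterator is the top of the bid book.
fn best_bid(book: &Book) -> Option<u64> {
    book.keys().next_back().copied()
}

fn main() {
    let mut bids = Book::new();
    apply_delta(&mut bids, 500_000, 1.5);
    apply_delta(&mut bids, 499_990, 2.0);
    apply_delta(&mut bids, 500_000, 0.0); // level removed
    assert_eq!(best_bid(&bids), Some(499_990));
}
```

The ordered iteration is also what makes checksum reconstruction deterministic: the top N levels come out in a stable, canonical order every time.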
Checksum Validation
Every snapshot/update:
- Apply delta
- Reconstruct canonical string
- Compute CRC32
- Compare with exchange
If mismatch:
- invalidate local book
- trigger full resync
Integrity is prioritized over throughput.
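A sketch of the validation step. The `crc32` function implements the standard IEEE polynomial bitwise so no crate is needed here (the real client would more likely use something like `crc32fast`); `book_checksum`'s canonical form is a simplified placeholder, since the exchange's exact string rules (decimal stripping, level count, ask/bid order) must match its documentation:

```rust
// Bitwise CRC32 (IEEE polynomial, reflected), crate-free for the sketch.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 {
                (crc >> 1) ^ 0xEDB8_8320
            } else {
                crc >> 1
            };
        }
    }
    !crc
}

// Hypothetical canonical form: concatenate price/qty strings for the
// top levels, then CRC32 the result.
fn book_checksum(levels: &[(&str, &str)]) -> u32 {
    let canon: String = levels.iter().map(|(p, q)| format!("{p}{q}")).collect();
    crc32(canon.as_bytes())
}

fn main() {
    // Standard CRC32 check value for "123456789".
    assert_eq!(crc32(b"123456789"), 0xCBF4_3926);
    let local = book_checksum(&[("3450125", "100"), ("3450124", "250")]);
    let exchange = local; // in the real client this arrives in the frame
    if local != exchange { /* invalidate local book, trigger full resync */ }
}
```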
Backpressure & Consumer Decoupling
Ingestion uses tokio::broadcast.
Benefits:
- multiple strategies subscribe
- ingestion never blocks
- near-zero fanout overhead
Tradeoffs:
- slow consumers can lag
- buffer overflow drops messages
Production additions would include:
- lag metrics
- bounded channels
- backpressure signaling
- optional durable stream (Kafka/NATS)
Fast ingestion without backpressure awareness leads to silent failure.
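One of those additions, bounded channels, can be sketched with the standard library alone: a `sync_channel` turns overflow into an explicit `try_send` error that can be counted and exported, instead of a silent drop:

```rust
use std::sync::mpsc;

// Bounded channel sketch: try_send fails when the consumer lags, so a
// drop becomes an explicit, countable event instead of silent overflow.
fn send_all(tx: &mpsc::SyncSender<u64>, msgs: u64) -> u64 {
    let mut dropped = 0;
    for seq in 0..msgs {
        if tx.try_send(seq).is_err() {
            dropped += 1; // in production, export this as a lag metric
        }
    }
    dropped
}

fn main() {
    let (tx, rx) = mpsc::sync_channel::<u64>(2); // capacity 2, consumer idle
    let dropped = send_all(&tx, 5);
    assert_eq!(dropped, 3);            // 3 of 5 messages rejected
    assert_eq!(rx.recv().unwrap(), 0); // oldest accepted messages survive
}
```

`tokio::broadcast` makes the opposite tradeoff: the sender never blocks, and a lagging subscriber learns about loss after the fact via a `Lagged` receive error.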
Benchmarking Philosophy
The benchmark goal was not peak speed.
The goal was:
deterministic processing under sustained load.
Measured:
- parsing + routing throughput
- allocation behavior
- latency per message
- CPU utilization
Results (local machine):
- ~648k msgs/sec (Rust)
- ~600k msgs/sec (Python reference)
Important context:
- TLS + network latency not included
- Measured using recorded streams
- Focused on processing layer, not transport
Throughput was secondary to:
- stable latency
- no ordering drift
- no state corruption
Runtime Observations (Under Load)
Measured locally under sustained stream replay:
- Latency per message: ~1–2µs parsing + routing
- CPU usage: parsing dominated (~70% of core)
- Peak memory usage: ~10–15MB during normal ingestion
- Allocation spikes: occurred during full orderbook resync
Hotspots:
- JSON array parsing
- temporary allocation during snapshot rebuild
These observations influenced:
- minimizing cloning
- reusing buffers
- reducing intermediate allocations
The system remained stable under sustained load without memory growth.
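The buffer-reuse point can be sketched like this; the uppercasing step is a stand-in for real per-message normalization, but the allocation pattern is the one that mattered:

```rust
// Reuse one scratch buffer across messages instead of allocating per
// message; clear() keeps capacity, so steady state allocates nothing.
fn process_stream(frames: &[&str]) -> (usize, usize) {
    let mut scratch = String::new();
    let mut total = 0;
    for f in frames {
        scratch.clear();                // retains capacity from prior frames
        scratch.push_str(f);
        scratch.make_ascii_uppercase(); // stand-in for real normalization
        total += scratch.len();
    }
    (total, scratch.capacity())
}

fn main() {
    let (total, cap) = process_stream(&["trade", "book", "ticker"]);
    assert_eq!(total, 15);
    assert!(cap >= 6); // capacity at least the longest frame, never shrunk
}
```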
Architectural Tradeoffs
This implementation favors determinism over horizontal scalability.
Benefits
- deterministic ordering
- no socket contention
- simple recovery
- easier debugging
- minimal locking
Tradeoff
Single-core parsing bottleneck at extreme rates.
Production scaling options:
- shard by trading pair
- multiple ingestion loops
- forward frames into Kafka/NATS
- multi-process ingestion layer
Correctness first.
Scale second.
Failure Modes Considered
Designed assuming failure is normal:
- connection drops
- malformed messages
- partial snapshot
- checksum mismatch
- slow consumers
- duplicate subscriptions
Core principle:
ingestion must be recoverable, not fragile.
What I Would Improve Next
Reliability
- persistent event log for replay
- message durability layer
- lag-aware bounded queues
Observability
- Prometheus metrics
- structured tracing
- latency histograms
- reconnect counters
Scalability
- symbol-based sharding
- multi-loop ingestion
- partitioned state per pair
Key Lessons
Deterministic ownership simplifies distributed reasoning.
Backpressure matters more than raw speed.
Recovery logic is not edge-case logic — it is core logic.
Type safety reduces runtime surprises.
The hardest part of ingestion systems is not speed.
It is predictable behavior under failure.
Code
Full implementation:
https://github.com/Nihal-Pandey-2302/kraken-rs

