<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nihal Pandey</title>
    <description>The latest articles on DEV Community by Nihal Pandey (@nihalpandey2302).</description>
    <link>https://dev.to/nihalpandey2302</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1400732%2F56d56ad6-7924-4d2e-a9cb-74b4b9ef7439.jpeg</url>
      <title>DEV Community: Nihal Pandey</title>
      <link>https://dev.to/nihalpandey2302</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nihalpandey2302"/>
    <language>en</language>
    <item>
      <title>Designing a Crash-Safe, Idempotent EVM Indexer in Rust</title>
      <dc:creator>Nihal Pandey</dc:creator>
      <pubDate>Thu, 19 Feb 2026 19:18:35 +0000</pubDate>
      <link>https://dev.to/nihalpandey2302/designing-a-crash-safe-idempotent-evm-indexer-in-rust-3ca8</link>
      <guid>https://dev.to/nihalpandey2302/designing-a-crash-safe-idempotent-evm-indexer-in-rust-3ca8</guid>
      <description>&lt;h2&gt;
  
  
  Building a data pipeline that survives failures without corrupting state
&lt;/h2&gt;

&lt;p&gt;Data pipelines don’t fail because they’re slow.&lt;/p&gt;

&lt;p&gt;They fail because they write partial state, retry blindly, and restart into inconsistency.&lt;/p&gt;

&lt;p&gt;When building an EVM indexer, the real challenge isn’t fetching blocks — it’s answering a harder question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the process crashes halfway through indexing a block, what state does the database end up in?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the answer is “it depends,” the system isn’t safe.&lt;/p&gt;

&lt;p&gt;This article walks through how I designed a Rust-based EVM indexer that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processes blocks atomically&lt;/li&gt;
&lt;li&gt;Is safe to retry&lt;/li&gt;
&lt;li&gt;Never commits partial state&lt;/li&gt;
&lt;li&gt;Recovers deterministically after crashes&lt;/li&gt;
&lt;li&gt;Avoids duplicate data without sacrificing correctness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rust (Tokio)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ethers-rs&lt;/code&gt; for RPC&lt;/li&gt;
&lt;li&gt;PostgreSQL&lt;/li&gt;
&lt;li&gt;SQLx&lt;/li&gt;
&lt;li&gt;Axum for query API&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  The Real Problem: Partial State
&lt;/h1&gt;

&lt;p&gt;Let’s say block &lt;code&gt;N&lt;/code&gt; contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;120 transactions&lt;/li&gt;
&lt;li&gt;350 logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Naively, you might:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Insert block&lt;/li&gt;
&lt;li&gt;Insert transactions&lt;/li&gt;
&lt;li&gt;Insert logs&lt;/li&gt;
&lt;li&gt;Update checkpoint&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But what if the process crashes after inserting transactions but before inserting logs?&lt;/p&gt;

&lt;p&gt;Now your database contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Block exists&lt;/li&gt;
&lt;li&gt;Transactions exist&lt;/li&gt;
&lt;li&gt;Logs missing&lt;/li&gt;
&lt;li&gt;Checkpoint not updated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On restart:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do you retry?&lt;/li&gt;
&lt;li&gt;Do you skip?&lt;/li&gt;
&lt;li&gt;Do you overwrite?&lt;/li&gt;
&lt;li&gt;Do you detect partial writes?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most indexers get this wrong.&lt;/p&gt;




&lt;h1&gt;
  
  
  Design Goals
&lt;/h1&gt;

&lt;p&gt;Before writing code, I defined strict invariants:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A block is either fully written or not written at all&lt;/li&gt;
&lt;li&gt;Restarting must be safe&lt;/li&gt;
&lt;li&gt;Duplicate processing must not corrupt state&lt;/li&gt;
&lt;li&gt;Checkpoint must reflect durable state&lt;/li&gt;
&lt;li&gt;No external consistency assumptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The design centers on one idea:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The database is the source of truth. The process is disposable.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  System Architecture
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy29rl7ahoyat2yzzxik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy29rl7ahoyat2yzzxik.png" alt=" " width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything for a single block happens inside one PostgreSQL transaction.&lt;/p&gt;

&lt;p&gt;Either everything commits.&lt;br&gt;
Or nothing exists.&lt;/p&gt;

&lt;p&gt;No partial state.&lt;/p&gt;


&lt;h1&gt;
  
  
  Atomic Block Processing
&lt;/h1&gt;

&lt;p&gt;The core pattern looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="nf"&gt;.begin&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;store_block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nf"&gt;store_transactions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nf"&gt;store_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nf"&gt;update_checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="nf"&gt;.commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key detail:&lt;/p&gt;

&lt;p&gt;The checkpoint update is inside the same transaction.&lt;/p&gt;

&lt;p&gt;If the transaction rolls back:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The block isn’t stored&lt;/li&gt;
&lt;li&gt;The checkpoint doesn’t move&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recovery becomes trivial.&lt;/p&gt;
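&lt;p&gt;The restart path reduces to a single rule. A minimal sketch, assuming a hypothetical checkpoint query that returns the last committed height:&lt;/p&gt;

```rust
// Restart logic reduced to its essence: the committed checkpoint alone decides
// where to resume. `last_checkpoint` stands in for a hypothetical
// `SELECT last_block FROM checkpoints` query; `genesis` is the configured start.
fn resume_height(last_checkpoint: Option<u64>, genesis: u64) -> u64 {
    match last_checkpoint {
        // Block `n` and its checkpoint committed together, so `n` is complete.
        Some(n) => n + 1,
        // Fresh database: nothing committed yet.
        None => genesis,
    }
}
```

&lt;p&gt;Because block data and checkpoint commit atomically, this function can never point at a half-written block.&lt;/p&gt;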




&lt;h1&gt;
  
  
  What Broke First
&lt;/h1&gt;

&lt;p&gt;Originally, I updated the checkpoint in a separate query after committing block data.&lt;/p&gt;

&lt;p&gt;It worked — until I simulated crashes.&lt;/p&gt;

&lt;p&gt;If the process crashed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Block was stored&lt;/li&gt;
&lt;li&gt;Checkpoint wasn’t updated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On restart:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system reprocessed the same block&lt;/li&gt;
&lt;li&gt;Duplicate insert attempts happened&lt;/li&gt;
&lt;li&gt;Foreign key constraints triggered&lt;/li&gt;
&lt;li&gt;Recovery logic became messy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;p&gt;Move checkpoint update inside the block transaction.&lt;/p&gt;

&lt;p&gt;This changed everything.&lt;/p&gt;

&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Commit guarantees durable block + checkpoint&lt;/li&gt;
&lt;li&gt;Rollback guarantees nothing happened&lt;/li&gt;
&lt;li&gt;Restart logic becomes deterministic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Recovery logic must be part of the write path, not an afterthought.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  Idempotency Strategy
&lt;/h1&gt;

&lt;p&gt;Crashes happen.&lt;br&gt;
Retries happen.&lt;br&gt;
RPC timeouts happen.&lt;/p&gt;

&lt;p&gt;Retrying the same block multiple times must be safe.&lt;/p&gt;

&lt;p&gt;All inserts use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;CONFLICT&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;NOTHING&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If block already exists, ignore&lt;/li&gt;
&lt;li&gt;If transaction already exists, ignore&lt;/li&gt;
&lt;li&gt;If log already exists, ignore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Calling &lt;code&gt;sync_block(N)&lt;/code&gt; 10 times produces identical state to calling it once.&lt;/p&gt;

&lt;p&gt;This dramatically simplifies retry logic.&lt;/p&gt;
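&lt;p&gt;The first-write-wins semantics can be modeled in memory. This is a sketch of the behavior, not the SQL path itself; &lt;code&gt;insert_ignore&lt;/code&gt; is an illustrative stand-in:&lt;/p&gt;

```rust
use std::collections::BTreeMap;

// What `INSERT ... ON CONFLICT DO NOTHING` gives the tables, in miniature:
// replaying the same row is a no-op, not an error and not an overwrite.
fn insert_ignore(store: &mut BTreeMap<String, u64>, tx_hash: &str, block: u64) -> bool {
    if store.contains_key(tx_hash) {
        return false; // conflict: leave the existing row untouched
    }
    store.insert(tx_hash.to_string(), block);
    true
}
```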

&lt;p&gt;Idempotency is not an optimization.&lt;br&gt;
It is survival.&lt;/p&gt;




&lt;h1&gt;
  
  
  Isolation Level Considerations
&lt;/h1&gt;

&lt;p&gt;PostgreSQL defaults to &lt;code&gt;READ COMMITTED&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For this indexer, that’s sufficient because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Blocks are processed sequentially&lt;/li&gt;
&lt;li&gt;No concurrent writers modify the same block&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I parallelized block ingestion, I would evaluate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;REPEATABLE READ&lt;/code&gt; for consistency&lt;/li&gt;
&lt;li&gt;Explicit row-level locking&lt;/li&gt;
&lt;li&gt;Partitioned writes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Atomicity matters more than raw speed.&lt;/p&gt;




&lt;h1&gt;
  
  
  Failure Scenarios Modeled
&lt;/h1&gt;

&lt;p&gt;This system was built assuming failure is normal.&lt;/p&gt;

&lt;p&gt;Handled scenarios:&lt;/p&gt;

&lt;h3&gt;
  
  
  Crash before commit
&lt;/h3&gt;

&lt;p&gt;Entire transaction rolls back.&lt;br&gt;
Checkpoint unchanged.&lt;/p&gt;

&lt;h3&gt;
  
  
  Crash after commit
&lt;/h3&gt;

&lt;p&gt;Checkpoint updated.&lt;br&gt;
Block fully durable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Duplicate processing
&lt;/h3&gt;

&lt;p&gt;Safe due to &lt;code&gt;ON CONFLICT DO NOTHING&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  RPC timeout
&lt;/h3&gt;

&lt;p&gt;Retry with exponential backoff.&lt;br&gt;
Idempotent writes make this safe.&lt;/p&gt;
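&lt;p&gt;The backoff schedule itself can be a pure function. A sketch with illustrative base and cap values; real deployments would also add jitter:&lt;/p&gt;

```rust
use std::time::Duration;

// Exponential backoff with a ceiling: 1s, 2s, 4s, ... capped at 60s.
// The base and cap here are illustrative, not the project's actual tuning.
fn backoff(attempt: u32) -> Duration {
    let secs = 1u64
        .checked_shl(attempt) // 2^attempt; None once the shift would overflow
        .unwrap_or(u64::MAX)
        .min(60);
    Duration::from_secs(secs)
}
```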

&lt;h3&gt;
  
  
  Database lock contention
&lt;/h3&gt;

&lt;p&gt;Transaction scope kept minimal.&lt;br&gt;
No external I/O inside transaction.&lt;/p&gt;

&lt;p&gt;Design principle:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Every block sync must be atomic and idempotent.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  Runtime Observations
&lt;/h1&gt;

&lt;p&gt;Under sustained historical sync:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average block processing time: ~5–15ms (RPC-bound)&lt;/li&gt;
&lt;li&gt;Database time per block: &amp;lt;3ms&lt;/li&gt;
&lt;li&gt;CPU mostly idle (network-bound workload)&lt;/li&gt;
&lt;li&gt;Memory stable (~20–30MB during sync)&lt;/li&gt;
&lt;li&gt;No unbounded growth observed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hotspots:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON decoding of receipts&lt;/li&gt;
&lt;li&gt;Large batch inserts during high-activity blocks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Batching via SQLx &lt;code&gt;QueryBuilder&lt;/code&gt; significantly reduced round-trip overhead.&lt;/p&gt;
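&lt;p&gt;One constraint behind the batching: PostgreSQL caps a single statement at 65,535 bind parameters, so the rows per &lt;code&gt;INSERT&lt;/code&gt; follow from the column count. A sketch with an illustrative column count:&lt;/p&gt;

```rust
// 65_535 is the wire-protocol limit on bind parameters per statement.
// With e.g. 7 columns per log row, one INSERT holds at most 9_362 rows.
fn rows_per_statement(max_binds: usize, columns: usize) -> usize {
    max_binds / columns
}

// Ceiling division: how many statements a batch of `total_rows` needs.
fn chunk_count(total_rows: usize, rows_per_stmt: usize) -> usize {
    (total_rows + rows_per_stmt - 1) / rows_per_stmt
}
```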




&lt;h1&gt;
  
  
  Why Sequential Processing (For Now)
&lt;/h1&gt;

&lt;p&gt;Parallelizing blocks sounds attractive.&lt;/p&gt;

&lt;p&gt;But Ethereum has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;strict ordering&lt;/li&gt;
&lt;li&gt;parent hash relationships&lt;/li&gt;
&lt;li&gt;potential reorgs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sequential processing simplifies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;checkpoint logic&lt;/li&gt;
&lt;li&gt;parent verification&lt;/li&gt;
&lt;li&gt;reorg detection&lt;/li&gt;
&lt;li&gt;rollback handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Correctness first.&lt;br&gt;
Parallelism later.&lt;/p&gt;




&lt;h1&gt;
  
  
  Reorg Handling (Planned)
&lt;/h1&gt;

&lt;p&gt;Current implementation detects duplicates but does not fully support deep reorg rollback.&lt;/p&gt;

&lt;p&gt;Planned approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compare incoming block parent_hash with local block hash&lt;/li&gt;
&lt;li&gt;If mismatch:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;delete orphaned blocks&lt;/li&gt;
&lt;li&gt;rewind checkpoint&lt;/li&gt;
&lt;li&gt;resync from fork point&lt;/li&gt;
&lt;/ul&gt;
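&lt;p&gt;The fork-point search behind step 1 could look like the sketch below, where &lt;code&gt;canonical&lt;/code&gt; stands in for a per-height hash lookup against the node:&lt;/p&gt;

```rust
// Planned rewind logic, sketched: walk local heads from the tip down to the
// highest height whose hash still matches the canonical chain. Everything
// above that height is orphaned: delete it, rewind the checkpoint, resync.
fn fork_point(local: &[(u64, &str)], canonical: impl Fn(u64) -> String) -> Option<u64> {
    local
        .iter()
        .rev() // tip first: real forks are almost always shallow
        .find(|&&(height, hash)| canonical(height) == hash)
        .map(|&(height, _)| height)
}
```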

&lt;p&gt;This will require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;indexed parent hash&lt;/li&gt;
&lt;li&gt;cascading deletes&lt;/li&gt;
&lt;li&gt;careful transactional rollback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reorg handling is not trivial.&lt;br&gt;
It must be deliberate.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why Atomicity &amp;gt; Throughput
&lt;/h1&gt;

&lt;p&gt;Strict per-block transactions add minor overhead.&lt;/p&gt;

&lt;p&gt;Tradeoff:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slightly higher write latency&lt;/li&gt;
&lt;li&gt;Massive increase in correctness guarantees&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In indexing systems:&lt;/p&gt;

&lt;p&gt;Corrupted data is worse than delayed data.&lt;/p&gt;




&lt;h1&gt;
  
  
  What I Would Improve Next
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Parallel historical sync with bounded worker pool&lt;/li&gt;
&lt;li&gt;Reorg-safe rollback logic&lt;/li&gt;
&lt;li&gt;Partitioned block tables&lt;/li&gt;
&lt;li&gt;WAL-based replication for read scaling&lt;/li&gt;
&lt;li&gt;Prometheus metrics for ingestion lag&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Lessons Learned
&lt;/h1&gt;

&lt;p&gt;The hardest part of backend systems is not performance.&lt;/p&gt;

&lt;p&gt;It’s state recovery.&lt;/p&gt;

&lt;p&gt;Making writes atomic simplified:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retry logic&lt;/li&gt;
&lt;li&gt;crash recovery&lt;/li&gt;
&lt;li&gt;reasoning about invariants&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rust helped enforce correctness.&lt;br&gt;
PostgreSQL enforced durability.&lt;br&gt;
Transactions enforced sanity.&lt;/p&gt;

&lt;p&gt;The system is not optimized for speed.&lt;/p&gt;

&lt;p&gt;It is optimized for being correct when things go wrong.&lt;/p&gt;

&lt;p&gt;That is what matters in infrastructure.&lt;/p&gt;




</description>
      <category>rust</category>
      <category>backend</category>
      <category>postgresql</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Building a Deterministic High-Throughput WebSocket Ingestion System in Rust</title>
      <dc:creator>Nihal Pandey</dc:creator>
      <pubDate>Thu, 19 Feb 2026 18:55:51 +0000</pubDate>
      <link>https://dev.to/nihalpandey2302/building-a-deterministic-high-throughput-websocket-ingestion-system-in-rust-38ia</link>
      <guid>https://dev.to/nihalpandey2302/building-a-deterministic-high-throughput-websocket-ingestion-system-in-rust-38ia</guid>
      <description>&lt;h2&gt;
  
  
  Designing a reliable async market data client with ordering guarantees, backpressure awareness, and recovery logic
&lt;/h2&gt;

&lt;p&gt;Real-time trading systems are ingestion systems.&lt;/p&gt;

&lt;p&gt;The hard problem is not parsing JSON quickly.&lt;br&gt;
The hard problem is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;preserving message ordering&lt;/li&gt;
&lt;li&gt;recovering cleanly from disconnects&lt;/li&gt;
&lt;li&gt;preventing silent data corruption&lt;/li&gt;
&lt;li&gt;handling slow consumers&lt;/li&gt;
&lt;li&gt;maintaining predictable latency under load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project was built to explore those constraints using Rust’s async ecosystem.&lt;/p&gt;


&lt;h1&gt;
  
  
  System Constraints
&lt;/h1&gt;

&lt;p&gt;Before writing code, I defined explicit design constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Messages must be processed strictly in order&lt;/li&gt;
&lt;li&gt;WebSocket ownership must be deterministic&lt;/li&gt;
&lt;li&gt;Reconnect must not lose subscription state&lt;/li&gt;
&lt;li&gt;Orderbook must match exchange checksum&lt;/li&gt;
&lt;li&gt;Consumers may be slower than ingestion&lt;/li&gt;
&lt;li&gt;Recovery must be automatic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every architectural decision flowed from these constraints.&lt;/p&gt;


&lt;h1&gt;
  
  
  High-Level Runtime Flow
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy2og1ds85qm6v7bj91g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy2og1ds85qm6v7bj91g.png" alt=" " width="800" height="1075"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Core idea:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;One ingestion loop owns the socket.&lt;br&gt;
Everything else consumes typed events.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No concurrent writers.&lt;br&gt;
No fragmented recovery logic.&lt;/p&gt;


&lt;h1&gt;
  
  
  Connection Lifecycle
&lt;/h1&gt;

&lt;p&gt;Lifecycle clarity matters as much as the happy path.&lt;br&gt;
Here is the full connection state flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jcqkzg8w7jhdicocq6j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jcqkzg8w7jhdicocq6j.png" alt=" " width="800" height="1224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Important points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Subscription state is stored separately from socket&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;On reconnect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;backoff&lt;/li&gt;
&lt;li&gt;reauthenticate if needed&lt;/li&gt;
&lt;li&gt;resubscribe&lt;/li&gt;
&lt;li&gt;resync orderbook&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;System never assumes connection stability&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Failure is a first-class state.&lt;/p&gt;


&lt;h1&gt;
  
  
  Core Event Loop Design
&lt;/h1&gt;

&lt;p&gt;The WebSocket connection is owned by a single async task using &lt;code&gt;tokio::select!&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;read frames&lt;/li&gt;
&lt;li&gt;process outgoing commands&lt;/li&gt;
&lt;li&gt;heartbeat&lt;/li&gt;
&lt;li&gt;trigger reconnect&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why single-loop ownership?&lt;/p&gt;

&lt;p&gt;Because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;concurrent readers introduce nondeterministic ordering&lt;/li&gt;
&lt;li&gt;multiple writers complicate recovery&lt;/li&gt;
&lt;li&gt;state transitions become fragmented&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This design behaves like an actor:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;one owner, explicit state transitions, deterministic execution.&lt;/p&gt;
&lt;/blockquote&gt;
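&lt;p&gt;The ownership model can be reduced to std types. This is a sketch only; the real loop is a tokio task multiplexing the socket and a command channel with &lt;code&gt;tokio::select!&lt;/code&gt;:&lt;/p&gt;

```rust
use std::sync::mpsc;

// One receiver owns all state. Every input funnels through a single queue,
// so processing order is exactly arrival order: deterministic by construction.
enum Event {
    Frame(String),   // inbound WebSocket frame
    Command(String), // outbound command (subscribe, ping, ...)
    Shutdown,
}

fn run_loop(rx: mpsc::Receiver<Event>) -> Vec<String> {
    let mut processed = Vec::new(); // stand-in for all mutable state
    for event in rx {
        match event {
            Event::Frame(f) => processed.push(format!("frame:{f}")),
            Event::Command(c) => processed.push(format!("cmd:{c}")),
            Event::Shutdown => break,
        }
    }
    processed
}
```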


&lt;h1&gt;
  
  
  What Broke First (And Why It Matters)
&lt;/h1&gt;

&lt;p&gt;The initial version used multiple reader tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one for WebSocket frames&lt;/li&gt;
&lt;li&gt;one for parsing&lt;/li&gt;
&lt;li&gt;one for state updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This worked — until reconnect logic was introduced.&lt;/p&gt;

&lt;p&gt;During disconnects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tasks raced to update state&lt;/li&gt;
&lt;li&gt;partial orderbook snapshots were applied&lt;/li&gt;
&lt;li&gt;ordering bugs surfaced under load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;p&gt;Move to a single ingestion loop that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;owns the socket&lt;/li&gt;
&lt;li&gt;owns the parser&lt;/li&gt;
&lt;li&gt;owns state mutation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This eliminated race conditions and simplified recovery logic dramatically.&lt;/p&gt;

&lt;p&gt;Lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Simplicity beats parallelism in ingestion systems.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h1&gt;
  
  
  Typed Deserialization Strategy
&lt;/h1&gt;

&lt;p&gt;Kraken sends heterogeneous JSON array messages.&lt;/p&gt;

&lt;p&gt;Instead of dynamic dispatch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[serde(untagged)]&lt;/span&gt;
&lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;WsMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;Trade&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TradeData&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;Book&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OrderBookData&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;Ticker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TickerData&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;Heartbeat&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compile-time exhaustiveness&lt;/li&gt;
&lt;li&gt;No runtime reflection&lt;/li&gt;
&lt;li&gt;Deterministic routing&lt;/li&gt;
&lt;li&gt;Clear failure modes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Parsing becomes predictable and measurable.&lt;/p&gt;




&lt;h1&gt;
  
  
  Orderbook State &amp;amp; Data Structures
&lt;/h1&gt;

&lt;p&gt;Local orderbook uses &lt;code&gt;BTreeMap&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ordered price levels&lt;/li&gt;
&lt;li&gt;O(log n) inserts&lt;/li&gt;
&lt;li&gt;stable iteration&lt;/li&gt;
&lt;li&gt;deterministic checksum reconstruction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;HashMap&lt;/code&gt; would give faster lookup but no ordering guarantee.&lt;/p&gt;

&lt;p&gt;For financial systems, ordering matters more than raw speed.&lt;/p&gt;
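&lt;p&gt;A minimal sketch of the bid side. Prices are integer ticks here because &lt;code&gt;f64&lt;/code&gt; is not &lt;code&gt;Ord&lt;/code&gt;; the tick scaling is an assumption, not necessarily the project's representation:&lt;/p&gt;

```rust
use std::collections::BTreeMap;

type Book = BTreeMap<u64, u64>; // price tick -> quantity

// Exchanges signal level removal with quantity 0.
fn apply_delta(book: &mut Book, price: u64, qty: u64) {
    if qty == 0 {
        book.remove(&price);
    } else {
        book.insert(price, qty);
    }
}

// Highest bid is the last key of the ordered map: no scan, stable iteration.
fn best_bid(book: &Book) -> Option<(u64, u64)> {
    book.iter().next_back().map(|(p, q)| (*p, *q))
}
```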




&lt;h1&gt;
  
  
  Checksum Validation
&lt;/h1&gt;

&lt;p&gt;Every snapshot/update:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Apply delta&lt;/li&gt;
&lt;li&gt;Reconstruct canonical string&lt;/li&gt;
&lt;li&gt;Compute CRC32&lt;/li&gt;
&lt;li&gt;Compare with exchange&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If mismatch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;invalidate local book&lt;/li&gt;
&lt;li&gt;trigger full resync&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Integrity is prioritized over throughput.&lt;/p&gt;
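&lt;p&gt;For illustration, the CRC32 in step 3 can be computed without dependencies. A real build would use the &lt;code&gt;crc32fast&lt;/code&gt; crate, and Kraken's canonical string construction is not reproduced here:&lt;/p&gt;

```rust
// Bitwise CRC32 (IEEE reflected polynomial), table-free for clarity.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFF_u32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // Branch-free conditional XOR with the polynomial.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}
```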




&lt;h1&gt;
  
  
  Backpressure &amp;amp; Consumer Decoupling
&lt;/h1&gt;

&lt;p&gt;Ingestion uses &lt;code&gt;tokio::broadcast&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multiple strategies subscribe&lt;/li&gt;
&lt;li&gt;ingestion never blocks&lt;/li&gt;
&lt;li&gt;near-zero fanout overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tradeoffs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;slow consumers can lag&lt;/li&gt;
&lt;li&gt;buffer overflow drops messages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Production additions would include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;lag metrics&lt;/li&gt;
&lt;li&gt;bounded channels&lt;/li&gt;
&lt;li&gt;backpressure signaling&lt;/li&gt;
&lt;li&gt;optional durable stream (Kafka/NATS)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fast ingestion without backpressure awareness leads to silent failure.&lt;/p&gt;
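&lt;p&gt;What a bounded alternative looks like, sketched with std's &lt;code&gt;sync_channel&lt;/code&gt;. &lt;code&gt;tokio::broadcast&lt;/code&gt; makes the opposite choice: it evicts the oldest message and reports the loss to the receiver as &lt;code&gt;Lagged&lt;/code&gt;:&lt;/p&gt;

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

// With a bounded queue, a lagging consumer surfaces immediately: `try_send`
// fails fast instead of blocking ingestion, and the drop can be counted.
fn offer(tx: &SyncSender<u64>, seq: u64) -> bool {
    match tx.try_send(seq) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) => false, // consumer lagging: drop and record
        Err(TrySendError::Disconnected(_)) => false,
    }
}
```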




&lt;h1&gt;
  
  
  Benchmarking Philosophy
&lt;/h1&gt;

&lt;p&gt;The benchmark goal was not peak speed.&lt;/p&gt;

&lt;p&gt;The goal was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;deterministic processing under sustained load.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Measured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;parsing + routing throughput&lt;/li&gt;
&lt;li&gt;allocation behavior&lt;/li&gt;
&lt;li&gt;latency per message&lt;/li&gt;
&lt;li&gt;CPU utilization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Results (local machine):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~648k msgs/sec (Rust)&lt;/li&gt;
&lt;li&gt;~600k msgs/sec (Python reference)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Important context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TLS + network latency not included&lt;/li&gt;
&lt;li&gt;Measured using recorded streams&lt;/li&gt;
&lt;li&gt;Focused on processing layer, not transport&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Throughput was secondary to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stable latency&lt;/li&gt;
&lt;li&gt;no ordering drift&lt;/li&gt;
&lt;li&gt;no state corruption&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Runtime Observations (Under Load)
&lt;/h1&gt;

&lt;p&gt;Measured locally under sustained stream replay:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency per message&lt;/strong&gt;: ~1–2µs parsing + routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU usage&lt;/strong&gt;: parsing dominated (~70% of core)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Peak memory usage&lt;/strong&gt;: ~10–15MB during normal ingestion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Allocation spikes&lt;/strong&gt;: occurred during full orderbook resync&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hotspots:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON array parsing&lt;/li&gt;
&lt;li&gt;temporary allocation during snapshot rebuild&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These observations influenced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;minimizing cloning&lt;/li&gt;
&lt;li&gt;reusing buffers&lt;/li&gt;
&lt;li&gt;reducing intermediate allocations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system remained stable under sustained load without memory growth.&lt;/p&gt;




&lt;h1&gt;
  
  
  Architectural Tradeoffs
&lt;/h1&gt;

&lt;p&gt;This implementation favors determinism over horizontal scalability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;deterministic ordering&lt;/li&gt;
&lt;li&gt;no socket contention&lt;/li&gt;
&lt;li&gt;simple recovery&lt;/li&gt;
&lt;li&gt;easier debugging&lt;/li&gt;
&lt;li&gt;minimal locking&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tradeoff
&lt;/h3&gt;

&lt;p&gt;Single-core parsing bottleneck at extreme rates.&lt;/p&gt;

&lt;p&gt;Production scaling options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;shard by trading pair&lt;/li&gt;
&lt;li&gt;multiple ingestion loops&lt;/li&gt;
&lt;li&gt;forward frames into Kafka/NATS&lt;/li&gt;
&lt;li&gt;multi-process ingestion layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Correctness first.&lt;br&gt;
Scale second.&lt;/p&gt;




&lt;h1&gt;
  
  
  Failure Modes Considered
&lt;/h1&gt;

&lt;p&gt;Designed assuming failure is normal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;connection drops&lt;/li&gt;
&lt;li&gt;malformed messages&lt;/li&gt;
&lt;li&gt;partial snapshot&lt;/li&gt;
&lt;li&gt;checksum mismatch&lt;/li&gt;
&lt;li&gt;slow consumers&lt;/li&gt;
&lt;li&gt;duplicate subscriptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Core principle:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ingestion must be recoverable, not fragile.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  What I Would Improve Next
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Reliability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;persistent event log for replay&lt;/li&gt;
&lt;li&gt;message durability layer&lt;/li&gt;
&lt;li&gt;lag-aware bounded queues&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus metrics&lt;/li&gt;
&lt;li&gt;structured tracing&lt;/li&gt;
&lt;li&gt;latency histograms&lt;/li&gt;
&lt;li&gt;reconnect counters&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scalability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;symbol-based sharding&lt;/li&gt;
&lt;li&gt;multi-loop ingestion&lt;/li&gt;
&lt;li&gt;partitioned state per pair&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Key Lessons
&lt;/h1&gt;

&lt;p&gt;Deterministic ownership simplifies distributed reasoning.&lt;br&gt;
Backpressure matters more than raw speed.&lt;br&gt;
Recovery logic is not edge-case logic — it is core logic.&lt;br&gt;
Type safety reduces runtime surprises.&lt;/p&gt;

&lt;p&gt;The hardest part of ingestion systems is not speed.&lt;/p&gt;

&lt;p&gt;It is &lt;strong&gt;predictable behavior under failure&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  Code
&lt;/h1&gt;

&lt;p&gt;Full implementation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Nihal-Pandey-2302/kraken-rs" rel="noopener noreferrer"&gt;https://github.com/Nihal-Pandey-2302/kraken-rs&lt;/a&gt;&lt;/p&gt;




</description>
      <category>rust</category>
      <category>websockets</category>
      <category>backend</category>
      <category>distributedsystems</category>
    </item>
  </channel>
</rss>
