Building a data pipeline that survives failures without corrupting state
Data pipelines don’t fail because they’re slow.
They fail because they write partial state, retry blindly, and restart into inconsistency.
When building an EVM indexer, the real challenge isn’t fetching blocks — it’s answering a harder question:
If the process crashes halfway through indexing a block, what state does the database end up in?
If the answer is “it depends,” the system isn’t safe.
This article walks through how I designed a Rust-based EVM indexer that:
- Processes blocks atomically
- Is safe to retry
- Never commits partial state
- Recovers deterministically after crashes
- Avoids duplicate data without sacrificing correctness
Stack:
- Rust (Tokio)
- ethers-rs for RPC
- PostgreSQL
- SQLx
- Axum for query API
The Real Problem: Partial State
Let’s say block N contains:
- 120 transactions
- 350 logs
Naively, you might:
- Insert block
- Insert transactions
- Insert logs
- Update checkpoint
But what if the process crashes after inserting transactions but before inserting logs?
Now your database contains:
- Block exists
- Transactions exist
- Logs missing
- Checkpoint not updated
On restart:
- Do you retry?
- Do you skip?
- Do you overwrite?
- Do you detect partial writes?
Most indexers get this wrong.
Design Goals
Before writing code, I defined strict invariants:
- A block is either fully written or not written at all
- Restarting must be safe
- Duplicate processing must not corrupt state
- Checkpoint must reflect durable state
- No external consistency assumptions
The design centers on one idea:
The database is the source of truth. The process is disposable.
System Architecture
Everything for a single block happens inside one PostgreSQL transaction.
Either everything commits.
Or nothing exists.
No partial state.
Atomic Block Processing
The core pattern looks like:
```rust
let mut tx = pool.begin().await?;

store_block(&mut tx, &block).await?;
store_transactions(&mut tx, &block).await?;
store_logs(&mut tx, &block).await?;
update_checkpoint(&mut tx, block_number).await?;

tx.commit().await?;
```
Key detail:
The checkpoint update is inside the same transaction.
If the transaction rolls back:
- The block isn’t stored
- The checkpoint doesn’t move
Recovery becomes trivial.
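Because the checkpoint commits atomically with block data, restart logic reduces to reading the checkpoint and resuming one block later. A minimal sketch of that decision (the function name and `start_block` parameter are illustrative, not the article's actual code):

```rust
// Resuming after a crash: the only durable fact we need is the checkpoint.
// If no checkpoint row exists yet, start from the configured first block.
fn next_block_to_sync(checkpoint: Option<u64>, start_block: u64) -> u64 {
    match checkpoint {
        // The checkpoint commits in the same transaction as block data,
        // so it is guaranteed to point at a fully written block.
        Some(last_durable) => last_durable + 1,
        None => start_block,
    }
}
```

Since commit and checkpoint move together, this function can never resume into a half-written block.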
What Broke First
Originally, I updated the checkpoint in a separate query after committing block data.
It worked — until I simulated crashes.
If the process crashed:
- Block was stored
- Checkpoint wasn’t updated
On restart:
- The system reprocessed the same block
- Duplicate insert attempts happened
- Foreign key constraints triggered
- Recovery logic became messy
Fix:
Move checkpoint update inside the block transaction.
This changed everything.
Now:
- Commit guarantees durable block + checkpoint
- Rollback guarantees nothing happened
- Restart logic becomes deterministic
Lesson:
Recovery logic must be part of the write path, not an afterthought.
Idempotency Strategy
Crashes happen.
Retries happen.
RPC timeouts happen.
The system must be safe to retry the same block multiple times.
All inserts use:
ON CONFLICT DO NOTHING
Why?
Because:
- If block already exists, ignore
- If transaction already exists, ignore
- If log already exists, ignore
Calling sync_block(N) 10 times produces the same state as calling it once.
This dramatically simplifies retry logic.
Idempotency is not an optimization.
It is survival.
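The effect of ON CONFLICT DO NOTHING can be modeled in memory as insert-if-absent keyed on the primary key. A sketch of that semantics (the u64 key and String row are a simplification of the real schema):

```rust
use std::collections::HashMap;

/// In-memory analogue of `INSERT ... ON CONFLICT DO NOTHING`:
/// an existing row under the same key is left untouched,
/// and the insert silently becomes a no-op.
fn insert_ignore(table: &mut HashMap<u64, String>, key: u64, row: String) -> bool {
    if table.contains_key(&key) {
        false // conflict: row already exists, nothing written
    } else {
        table.insert(key, row);
        true // first write wins
    }
}
```

Replaying the same block through this function any number of times leaves the table in exactly the state one pass produces, which is the whole point.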
Isolation Level Considerations
PostgreSQL defaults to READ COMMITTED.
For this indexer, that’s sufficient because:
- Blocks are processed sequentially
- No concurrent writers modify the same block
If I parallelized block ingestion, I would evaluate:
- REPEATABLE READ for consistency
- Explicit row-level locking
- Partitioned writes
Atomicity matters more than raw speed.
Failure Scenarios Modeled
This system was built assuming failure is normal.
Handled scenarios:
Crash before commit
Entire transaction rolls back.
Checkpoint unchanged.
Crash after commit
Checkpoint updated.
Block fully durable.
Duplicate processing
Safe due to ON CONFLICT DO NOTHING.
RPC timeout
Retry with exponential backoff.
Idempotent writes make this safe.
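The retry schedule can be as simple as a capped exponential; the base and cap values below are illustrative, not the article's actual configuration:

```rust
use std::time::Duration;

/// Delay before the `attempt`-th retry (0-indexed): base * 2^attempt,
/// capped so repeated failures don't stall the pipeline indefinitely.
fn backoff_delay(attempt: u32, base_ms: u64, cap_ms: u64) -> Duration {
    // checked_shl returns None once the shift would overflow (attempt >= 64).
    let exp = base_ms.saturating_mul(1u64.checked_shl(attempt).unwrap_or(u64::MAX));
    Duration::from_millis(exp.min(cap_ms))
}
```

Because every write is idempotent, the caller can simply sleep for this delay and re-run the whole block sync from the top.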
Database lock contention
Transaction scope kept minimal.
No external I/O inside transaction.
Design principle:
Every block sync must be atomic and idempotent.
Runtime Observations
Under sustained historical sync:
- Average block processing time: ~5–15ms (RPC-bound)
- Database time per block: <3ms
- CPU mostly idle (network-bound workload)
- Memory stable (~20–30MB during sync)
- No unbounded growth observed
Hotspots:
- JSON decoding of receipts
- Large batch inserts during high-activity blocks
Batching via SQLx's QueryBuilder significantly reduced round-trip overhead.
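One concrete constraint on batch size: PostgreSQL's extended query protocol allows at most 65,535 bind parameters per statement, so the number of rows per multi-row INSERT is bounded by the column count. A sketch of that chunking arithmetic (the column count in the test is illustrative):

```rust
/// PostgreSQL's wire protocol caps a statement at u16::MAX (65535)
/// bind parameters, so multi-row inserts must be chunked.
const MAX_BIND_PARAMS: usize = u16::MAX as usize;

/// Largest number of rows that fits in one INSERT for `cols` columns.
fn max_rows_per_insert(cols: usize) -> usize {
    MAX_BIND_PARAMS / cols
}

/// Split `n_rows` rows into chunk sizes that each fit in one statement.
fn chunk_sizes(n_rows: usize, cols: usize) -> Vec<usize> {
    let cap = max_rows_per_insert(cols);
    let mut sizes = Vec::new();
    let mut remaining = n_rows;
    while remaining > 0 {
        let take = remaining.min(cap);
        sizes.push(take);
        remaining -= take;
    }
    sizes
}
```

Each chunk still goes through the same per-block transaction, so batching changes round-trip count, not atomicity.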
Why Sequential Processing (For Now)
Parallelizing blocks sounds attractive.
But Ethereum has:
- strict ordering
- parent hash relationships
- potential reorgs
Sequential processing simplifies:
- checkpoint logic
- parent verification
- reorg detection
- rollback handling
Correctness first.
Parallelism later.
Reorg Handling (Planned)
Current implementation detects duplicates but does not fully support deep reorg rollback.
Planned approach:
- Compare incoming block parent_hash with local block hash
- If mismatch:
- delete orphaned blocks
- rewind checkpoint
- resync from fork point
This will require:
- indexed parent hash
- cascading deletes
- careful transactional rollback
Reorg handling is not trivial.
It must be deliberate.
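The planned parent-hash check can be expressed as a pure decision function before any deletion logic gets involved. A sketch, with hashes as plain strings for illustration:

```rust
/// Outcome of comparing an incoming block against local state.
#[derive(Debug, PartialEq)]
enum SyncAction {
    /// Parent hash matches the locally stored tip: extend the chain.
    Extend,
    /// Mismatch: a reorg happened upstream; delete orphaned blocks,
    /// rewind the checkpoint, and resync from the fork point.
    RewindAndResync,
}

/// Compare the incoming block's parent_hash with the hash stored
/// for the previous block. (String hashes are a simplification.)
fn check_parent(local_tip_hash: &str, incoming_parent_hash: &str) -> SyncAction {
    if local_tip_hash == incoming_parent_hash {
        SyncAction::Extend
    } else {
        SyncAction::RewindAndResync
    }
}
```

Keeping the detection pure makes the eventual rollback path easy to test in isolation from the database.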
Why Atomicity > Throughput
Strict per-block transactions add minor overhead.
Tradeoff:
- Slightly higher write latency
- Massive increase in correctness guarantees
In indexing systems:
Corrupted data is worse than delayed data.
What I Would Improve Next
- Parallel historical sync with bounded worker pool
- Reorg-safe rollback logic
- Partitioned block tables
- WAL-based replication for read scaling
- Prometheus metrics for ingestion lag
Lessons Learned
The hardest part of backend systems is not performance.
It’s state recovery.
Making writes atomic simplified:
- retry logic
- crash recovery
- reasoning about invariants
Rust helped enforce correctness.
PostgreSQL enforced durability.
Transactions enforced sanity.
The system is not optimized for speed.
It is optimized for being correct when things go wrong.
That is what matters in infrastructure.
