<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nihal Pandey</title>
    <description>The latest articles on DEV Community by Nihal Pandey (@nihalpandey2302).</description>
    <link>https://dev.to/nihalpandey2302</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1400732%2F56d56ad6-7924-4d2e-a9cb-74b4b9ef7439.jpeg</url>
      <title>DEV Community: Nihal Pandey</title>
      <link>https://dev.to/nihalpandey2302</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nihalpandey2302"/>
    <language>en</language>
    <item>
      <title>Designing a Crash-Safe, Idempotent EVM Indexer in Rust</title>
      <dc:creator>Nihal Pandey</dc:creator>
      <pubDate>Thu, 19 Feb 2026 19:18:35 +0000</pubDate>
      <link>https://dev.to/nihalpandey2302/designing-a-crash-safe-idempotent-evm-indexer-in-rust-3ca8</link>
      <guid>https://dev.to/nihalpandey2302/designing-a-crash-safe-idempotent-evm-indexer-in-rust-3ca8</guid>
      <description>&lt;h2&gt;
  
  
  Building a data pipeline that survives failures without corrupting state
&lt;/h2&gt;

&lt;p&gt;Data pipelines don’t fail because they’re slow.&lt;/p&gt;

&lt;p&gt;They fail because they write partial state, retry blindly, and restart into inconsistency.&lt;/p&gt;

&lt;p&gt;When building an EVM indexer, the real challenge isn’t fetching blocks — it’s answering a harder question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the process crashes halfway through indexing a block, what state does the database end up in?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the answer is “it depends,” the system isn’t safe.&lt;/p&gt;

&lt;p&gt;This article walks through how I designed a Rust-based EVM indexer that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processes blocks atomically&lt;/li&gt;
&lt;li&gt;Is safe to retry&lt;/li&gt;
&lt;li&gt;Never commits partial state&lt;/li&gt;
&lt;li&gt;Recovers deterministically after crashes&lt;/li&gt;
&lt;li&gt;Avoids duplicate data without sacrificing correctness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rust (Tokio)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ethers-rs&lt;/code&gt; for RPC&lt;/li&gt;
&lt;li&gt;PostgreSQL&lt;/li&gt;
&lt;li&gt;SQLx&lt;/li&gt;
&lt;li&gt;Axum for query API&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  The Real Problem: Partial State
&lt;/h1&gt;

&lt;p&gt;Let’s say block &lt;code&gt;N&lt;/code&gt; contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;120 transactions&lt;/li&gt;
&lt;li&gt;350 logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Naively, you might:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Insert block&lt;/li&gt;
&lt;li&gt;Insert transactions&lt;/li&gt;
&lt;li&gt;Insert logs&lt;/li&gt;
&lt;li&gt;Update checkpoint&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But what if the process crashes after inserting transactions but before inserting logs?&lt;/p&gt;

&lt;p&gt;Now your database contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Block exists&lt;/li&gt;
&lt;li&gt;Transactions exist&lt;/li&gt;
&lt;li&gt;Logs missing&lt;/li&gt;
&lt;li&gt;Checkpoint not updated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On restart:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do you retry?&lt;/li&gt;
&lt;li&gt;Do you skip?&lt;/li&gt;
&lt;li&gt;Do you overwrite?&lt;/li&gt;
&lt;li&gt;Do you detect partial writes?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most indexers get this wrong.&lt;/p&gt;




&lt;h1&gt;
  
  
  Design Goals
&lt;/h1&gt;

&lt;p&gt;Before writing code, I defined strict invariants:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A block is either fully written or not written at all&lt;/li&gt;
&lt;li&gt;Restarting must be safe&lt;/li&gt;
&lt;li&gt;Duplicate processing must not corrupt state&lt;/li&gt;
&lt;li&gt;Checkpoint must reflect durable state&lt;/li&gt;
&lt;li&gt;No external consistency assumptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The design centers on one idea:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The database is the source of truth. The process is disposable.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  System Architecture
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy29rl7ahoyat2yzzxik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy29rl7ahoyat2yzzxik.png" alt=" " width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything for a single block happens inside one PostgreSQL transaction.&lt;/p&gt;

&lt;p&gt;Either everything commits.&lt;br&gt;
Or nothing exists.&lt;/p&gt;

&lt;p&gt;No partial state.&lt;/p&gt;


&lt;h1&gt;
  
  
  Atomic Block Processing
&lt;/h1&gt;

&lt;p&gt;The core pattern looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="nf"&gt;.begin&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;store_block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nf"&gt;store_transactions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nf"&gt;store_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nf"&gt;update_checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="nf"&gt;.commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key detail:&lt;/p&gt;

&lt;p&gt;The checkpoint update is inside the same transaction.&lt;/p&gt;

&lt;p&gt;If the transaction rolls back:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The block isn’t stored&lt;/li&gt;
&lt;li&gt;The checkpoint doesn’t move&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recovery becomes trivial.&lt;/p&gt;
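&lt;p&gt;The restart path reduces to a single rule. A minimal sketch, assuming a hypothetical checkpoint query that returns the last committed height:&lt;/p&gt;

```rust
// Restart logic reduced to its essence: the committed checkpoint alone decides
// where to resume. `last_checkpoint` stands in for a hypothetical
// `SELECT last_block FROM checkpoints` query; `genesis` is the configured start.
fn resume_height(last_checkpoint: Option<u64>, genesis: u64) -> u64 {
    match last_checkpoint {
        // Block `n` and its checkpoint committed together, so `n` is complete.
        Some(n) => n + 1,
        // Fresh database: nothing committed yet.
        None => genesis,
    }
}
```

&lt;p&gt;Because block data and checkpoint commit atomically, this function can never point at a half-written block.&lt;/p&gt;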




&lt;h1&gt;
  
  
  What Broke First
&lt;/h1&gt;

&lt;p&gt;Originally, I updated the checkpoint in a separate query after committing block data.&lt;/p&gt;

&lt;p&gt;It worked — until I simulated crashes.&lt;/p&gt;

&lt;p&gt;If the process crashed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Block was stored&lt;/li&gt;
&lt;li&gt;Checkpoint wasn’t updated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On restart:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system reprocessed the same block&lt;/li&gt;
&lt;li&gt;Duplicate insert attempts happened&lt;/li&gt;
&lt;li&gt;Foreign key constraints triggered&lt;/li&gt;
&lt;li&gt;Recovery logic became messy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;p&gt;Move checkpoint update inside the block transaction.&lt;/p&gt;

&lt;p&gt;This changed everything.&lt;/p&gt;

&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Commit guarantees durable block + checkpoint&lt;/li&gt;
&lt;li&gt;Rollback guarantees nothing happened&lt;/li&gt;
&lt;li&gt;Restart logic becomes deterministic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Recovery logic must be part of the write path, not an afterthought.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  Idempotency Strategy
&lt;/h1&gt;

&lt;p&gt;Crashes happen.&lt;br&gt;
Retries happen.&lt;br&gt;
RPC timeouts happen.&lt;/p&gt;

&lt;p&gt;Retrying the same block multiple times must be safe.&lt;/p&gt;

&lt;p&gt;All inserts use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;CONFLICT&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;NOTHING&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If block already exists, ignore&lt;/li&gt;
&lt;li&gt;If transaction already exists, ignore&lt;/li&gt;
&lt;li&gt;If log already exists, ignore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Calling &lt;code&gt;sync_block(N)&lt;/code&gt; 10 times produces identical state to calling it once.&lt;/p&gt;

&lt;p&gt;This dramatically simplifies retry logic.&lt;/p&gt;
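&lt;p&gt;The first-write-wins semantics can be modeled in memory. This is a sketch of the behavior, not the SQL path itself; &lt;code&gt;insert_ignore&lt;/code&gt; is an illustrative stand-in:&lt;/p&gt;

```rust
use std::collections::BTreeMap;

// What `INSERT ... ON CONFLICT DO NOTHING` gives the tables, in miniature:
// replaying the same row is a no-op, not an error and not an overwrite.
fn insert_ignore(store: &mut BTreeMap<String, u64>, tx_hash: &str, block: u64) -> bool {
    if store.contains_key(tx_hash) {
        return false; // conflict: leave the existing row untouched
    }
    store.insert(tx_hash.to_string(), block);
    true
}
```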

&lt;p&gt;Idempotency is not an optimization.&lt;br&gt;
It is survival.&lt;/p&gt;




&lt;h1&gt;
  
  
  Isolation Level Considerations
&lt;/h1&gt;

&lt;p&gt;PostgreSQL defaults to &lt;code&gt;READ COMMITTED&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For this indexer, that’s sufficient because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Blocks are processed sequentially&lt;/li&gt;
&lt;li&gt;No concurrent writers modify the same block&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I parallelized block ingestion, I would evaluate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;REPEATABLE READ&lt;/code&gt; for consistency&lt;/li&gt;
&lt;li&gt;Explicit row-level locking&lt;/li&gt;
&lt;li&gt;Partitioned writes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Atomicity matters more than raw speed.&lt;/p&gt;




&lt;h1&gt;
  
  
  Failure Scenarios Modeled
&lt;/h1&gt;

&lt;p&gt;This system was built assuming failure is normal.&lt;/p&gt;

&lt;p&gt;Handled scenarios:&lt;/p&gt;

&lt;h3&gt;
  
  
  Crash before commit
&lt;/h3&gt;

&lt;p&gt;Entire transaction rolls back.&lt;br&gt;
Checkpoint unchanged.&lt;/p&gt;

&lt;h3&gt;
  
  
  Crash after commit
&lt;/h3&gt;

&lt;p&gt;Checkpoint updated.&lt;br&gt;
Block fully durable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Duplicate processing
&lt;/h3&gt;

&lt;p&gt;Safe due to &lt;code&gt;ON CONFLICT DO NOTHING&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  RPC timeout
&lt;/h3&gt;

&lt;p&gt;Retry with exponential backoff.&lt;br&gt;
Idempotent writes make this safe.&lt;/p&gt;
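&lt;p&gt;The backoff schedule itself can be a pure function. A sketch with illustrative base and cap values; real deployments would also add jitter:&lt;/p&gt;

```rust
use std::time::Duration;

// Exponential backoff with a ceiling: 1s, 2s, 4s, ... capped at 60s.
// The base and cap here are illustrative, not the project's actual tuning.
fn backoff(attempt: u32) -> Duration {
    let secs = 1u64
        .checked_shl(attempt) // 2^attempt; None once the shift would overflow
        .unwrap_or(u64::MAX)
        .min(60);
    Duration::from_secs(secs)
}
```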

&lt;h3&gt;
  
  
  Database lock contention
&lt;/h3&gt;

&lt;p&gt;Transaction scope kept minimal.&lt;br&gt;
No external I/O inside transaction.&lt;/p&gt;

&lt;p&gt;Design principle:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Every block sync must be atomic and idempotent.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  Runtime Observations
&lt;/h1&gt;

&lt;p&gt;Under sustained historical sync:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average block processing time: ~5–15ms (RPC-bound)&lt;/li&gt;
&lt;li&gt;Database time per block: &amp;lt;3ms&lt;/li&gt;
&lt;li&gt;CPU mostly idle (network-bound workload)&lt;/li&gt;
&lt;li&gt;Memory stable (~20–30MB during sync)&lt;/li&gt;
&lt;li&gt;No unbounded growth observed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hotspots:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON decoding of receipts&lt;/li&gt;
&lt;li&gt;Large batch inserts during high-activity blocks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Batching via SQLx &lt;code&gt;QueryBuilder&lt;/code&gt; significantly reduced round-trip overhead.&lt;/p&gt;
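&lt;p&gt;One constraint behind the batching: PostgreSQL caps a single statement at 65,535 bind parameters, so the rows per &lt;code&gt;INSERT&lt;/code&gt; follow from the column count. A sketch with an illustrative column count:&lt;/p&gt;

```rust
// 65_535 is the wire-protocol limit on bind parameters per statement.
// With e.g. 7 columns per log row, one INSERT holds at most 9_362 rows.
fn rows_per_statement(max_binds: usize, columns: usize) -> usize {
    max_binds / columns
}

// Ceiling division: how many statements a batch of `total_rows` needs.
fn chunk_count(total_rows: usize, rows_per_stmt: usize) -> usize {
    (total_rows + rows_per_stmt - 1) / rows_per_stmt
}
```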




&lt;h1&gt;
  
  
  Why Sequential Processing (For Now)
&lt;/h1&gt;

&lt;p&gt;Parallelizing blocks sounds attractive.&lt;/p&gt;

&lt;p&gt;But Ethereum has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;strict ordering&lt;/li&gt;
&lt;li&gt;parent hash relationships&lt;/li&gt;
&lt;li&gt;potential reorgs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sequential processing simplifies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;checkpoint logic&lt;/li&gt;
&lt;li&gt;parent verification&lt;/li&gt;
&lt;li&gt;reorg detection&lt;/li&gt;
&lt;li&gt;rollback handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Correctness first.&lt;br&gt;
Parallelism later.&lt;/p&gt;




&lt;h1&gt;
  
  
  Reorg Handling (Planned)
&lt;/h1&gt;

&lt;p&gt;Current implementation detects duplicates but does not fully support deep reorg rollback.&lt;/p&gt;

&lt;p&gt;Planned approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compare incoming block parent_hash with local block hash&lt;/li&gt;
&lt;li&gt;If mismatch:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;delete orphaned blocks&lt;/li&gt;
&lt;li&gt;rewind checkpoint&lt;/li&gt;
&lt;li&gt;resync from fork point&lt;/li&gt;
&lt;/ul&gt;
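&lt;p&gt;The fork-point search behind step 1 could look like the sketch below, where &lt;code&gt;canonical&lt;/code&gt; stands in for a per-height hash lookup against the node:&lt;/p&gt;

```rust
// Planned rewind logic, sketched: walk local heads from the tip down to the
// highest height whose hash still matches the canonical chain. Everything
// above that height is orphaned: delete it, rewind the checkpoint, resync.
fn fork_point(local: &[(u64, &str)], canonical: impl Fn(u64) -> String) -> Option<u64> {
    local
        .iter()
        .rev() // tip first: real forks are almost always shallow
        .find(|&&(height, hash)| canonical(height) == hash)
        .map(|&(height, _)| height)
}
```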

&lt;p&gt;This will require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;indexed parent hash&lt;/li&gt;
&lt;li&gt;cascading deletes&lt;/li&gt;
&lt;li&gt;careful transactional rollback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reorg handling is not trivial.&lt;br&gt;
It must be deliberate.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why Atomicity &amp;gt; Throughput
&lt;/h1&gt;

&lt;p&gt;Strict per-block transactions add minor overhead.&lt;/p&gt;

&lt;p&gt;Tradeoff:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slightly higher write latency&lt;/li&gt;
&lt;li&gt;Massive increase in correctness guarantees&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In indexing systems:&lt;/p&gt;

&lt;p&gt;Corrupted data is worse than delayed data.&lt;/p&gt;




&lt;h1&gt;
  
  
  What I Would Improve Next
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Parallel historical sync with bounded worker pool&lt;/li&gt;
&lt;li&gt;Reorg-safe rollback logic&lt;/li&gt;
&lt;li&gt;Partitioned block tables&lt;/li&gt;
&lt;li&gt;WAL-based replication for read scaling&lt;/li&gt;
&lt;li&gt;Prometheus metrics for ingestion lag&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Lessons Learned
&lt;/h1&gt;

&lt;p&gt;The hardest part of backend systems is not performance.&lt;/p&gt;

&lt;p&gt;It’s state recovery.&lt;/p&gt;

&lt;p&gt;Making writes atomic simplified:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retry logic&lt;/li&gt;
&lt;li&gt;crash recovery&lt;/li&gt;
&lt;li&gt;reasoning about invariants&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rust helped enforce correctness.&lt;br&gt;
PostgreSQL enforced durability.&lt;br&gt;
Transactions enforced sanity.&lt;/p&gt;

&lt;p&gt;The system is not optimized for speed.&lt;/p&gt;

&lt;p&gt;It is optimized for being correct when things go wrong.&lt;/p&gt;

&lt;p&gt;That is what matters in infrastructure.&lt;/p&gt;




</description>
      <category>rust</category>
      <category>backend</category>
      <category>postgresql</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Building a Deterministic High-Throughput WebSocket Ingestion System in Rust</title>
      <dc:creator>Nihal Pandey</dc:creator>
      <pubDate>Thu, 19 Feb 2026 18:55:51 +0000</pubDate>
      <link>https://dev.to/nihalpandey2302/building-a-deterministic-high-throughput-websocket-ingestion-system-in-rust-38ia</link>
      <guid>https://dev.to/nihalpandey2302/building-a-deterministic-high-throughput-websocket-ingestion-system-in-rust-38ia</guid>
      <description>&lt;h2&gt;
  
  
  Designing a reliable async market data client with ordering guarantees, backpressure awareness, and recovery logic
&lt;/h2&gt;

&lt;p&gt;Real-time trading systems are ingestion systems.&lt;/p&gt;

&lt;p&gt;The hard problem is not parsing JSON quickly.&lt;br&gt;
The hard problem is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;preserving message ordering&lt;/li&gt;
&lt;li&gt;recovering cleanly from disconnects&lt;/li&gt;
&lt;li&gt;preventing silent data corruption&lt;/li&gt;
&lt;li&gt;handling slow consumers&lt;/li&gt;
&lt;li&gt;maintaining predictable latency under load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project was built to explore those constraints using Rust’s async ecosystem.&lt;/p&gt;


&lt;h1&gt;
  
  
  System Constraints
&lt;/h1&gt;

&lt;p&gt;Before writing code, I defined explicit design constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Messages must be processed strictly in order&lt;/li&gt;
&lt;li&gt;WebSocket ownership must be deterministic&lt;/li&gt;
&lt;li&gt;Reconnect must not lose subscription state&lt;/li&gt;
&lt;li&gt;Orderbook must match exchange checksum&lt;/li&gt;
&lt;li&gt;Consumers may be slower than ingestion&lt;/li&gt;
&lt;li&gt;Recovery must be automatic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every architectural decision flowed from these constraints.&lt;/p&gt;


&lt;h1&gt;
  
  
  High-Level Runtime Flow
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy2og1ds85qm6v7bj91g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy2og1ds85qm6v7bj91g.png" alt=" " width="800" height="1075"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Core idea:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;One ingestion loop owns the socket.&lt;br&gt;
Everything else consumes typed events.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No concurrent writers.&lt;br&gt;
No fragmented recovery logic.&lt;/p&gt;


&lt;h1&gt;
  
  
  Connection Lifecycle
&lt;/h1&gt;

&lt;p&gt;Lifecycle clarity matters as much as the happy path.&lt;br&gt;
Here is the full connection state flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jcqkzg8w7jhdicocq6j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jcqkzg8w7jhdicocq6j.png" alt=" " width="800" height="1224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Important points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Subscription state is stored separately from socket&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;On reconnect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;backoff&lt;/li&gt;
&lt;li&gt;reauthenticate if needed&lt;/li&gt;
&lt;li&gt;resubscribe&lt;/li&gt;
&lt;li&gt;resync orderbook&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;System never assumes connection stability&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Failure is a first-class state.&lt;/p&gt;


&lt;h1&gt;
  
  
  Core Event Loop Design
&lt;/h1&gt;

&lt;p&gt;The WebSocket connection is owned by a single async task using &lt;code&gt;tokio::select!&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;read frames&lt;/li&gt;
&lt;li&gt;process outgoing commands&lt;/li&gt;
&lt;li&gt;heartbeat&lt;/li&gt;
&lt;li&gt;trigger reconnect&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why single-loop ownership?&lt;/p&gt;

&lt;p&gt;Because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;concurrent readers introduce nondeterministic ordering&lt;/li&gt;
&lt;li&gt;multiple writers complicate recovery&lt;/li&gt;
&lt;li&gt;state transitions become fragmented&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This design behaves like an actor:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;one owner, explicit state transitions, deterministic execution.&lt;/p&gt;
&lt;/blockquote&gt;
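&lt;p&gt;The ownership model can be reduced to std types. This is a sketch only; the real loop is a tokio task multiplexing the socket and a command channel with &lt;code&gt;tokio::select!&lt;/code&gt;:&lt;/p&gt;

```rust
use std::sync::mpsc;

// One receiver owns all state. Every input funnels through a single queue,
// so processing order is exactly arrival order: deterministic by construction.
enum Event {
    Frame(String),   // inbound WebSocket frame
    Command(String), // outbound command (subscribe, ping, ...)
    Shutdown,
}

fn run_loop(rx: mpsc::Receiver<Event>) -> Vec<String> {
    let mut processed = Vec::new(); // stand-in for all mutable state
    for event in rx {
        match event {
            Event::Frame(f) => processed.push(format!("frame:{f}")),
            Event::Command(c) => processed.push(format!("cmd:{c}")),
            Event::Shutdown => break,
        }
    }
    processed
}
```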


&lt;h1&gt;
  
  
  What Broke First (And Why It Matters)
&lt;/h1&gt;

&lt;p&gt;The initial version used multiple reader tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one for WebSocket frames&lt;/li&gt;
&lt;li&gt;one for parsing&lt;/li&gt;
&lt;li&gt;one for state updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This worked — until reconnect logic was introduced.&lt;/p&gt;

&lt;p&gt;During disconnects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tasks raced to update state&lt;/li&gt;
&lt;li&gt;partial orderbook snapshots were applied&lt;/li&gt;
&lt;li&gt;ordering bugs surfaced under load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;p&gt;Move to a single ingestion loop that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;owns the socket&lt;/li&gt;
&lt;li&gt;owns the parser&lt;/li&gt;
&lt;li&gt;owns state mutation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This eliminated race conditions and simplified recovery logic dramatically.&lt;/p&gt;

&lt;p&gt;Lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Simplicity beats parallelism in ingestion systems.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h1&gt;
  
  
  Typed Deserialization Strategy
&lt;/h1&gt;

&lt;p&gt;Kraken sends heterogeneous JSON array messages.&lt;/p&gt;

&lt;p&gt;Instead of dynamic dispatch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[serde(untagged)]&lt;/span&gt;
&lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;WsMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;Trade&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TradeData&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;Book&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OrderBookData&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;Ticker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TickerData&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;Heartbeat&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compile-time exhaustiveness&lt;/li&gt;
&lt;li&gt;No runtime reflection&lt;/li&gt;
&lt;li&gt;Deterministic routing&lt;/li&gt;
&lt;li&gt;Clear failure modes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Parsing becomes predictable and measurable.&lt;/p&gt;




&lt;h1&gt;
  
  
  Orderbook State &amp;amp; Data Structures
&lt;/h1&gt;

&lt;p&gt;Local orderbook uses &lt;code&gt;BTreeMap&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ordered price levels&lt;/li&gt;
&lt;li&gt;O(log n) inserts&lt;/li&gt;
&lt;li&gt;stable iteration&lt;/li&gt;
&lt;li&gt;deterministic checksum reconstruction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;HashMap&lt;/code&gt; would give faster lookup but no ordering guarantee.&lt;/p&gt;

&lt;p&gt;For financial systems, ordering matters more than raw speed.&lt;/p&gt;
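&lt;p&gt;A minimal sketch of the bid side. Prices are integer ticks here because &lt;code&gt;f64&lt;/code&gt; is not &lt;code&gt;Ord&lt;/code&gt;; the tick scaling is an assumption, not necessarily the project's representation:&lt;/p&gt;

```rust
use std::collections::BTreeMap;

type Book = BTreeMap<u64, u64>; // price tick -> quantity

// Exchanges signal level removal with quantity 0.
fn apply_delta(book: &mut Book, price: u64, qty: u64) {
    if qty == 0 {
        book.remove(&price);
    } else {
        book.insert(price, qty);
    }
}

// Highest bid is the last key of the ordered map: no scan, stable iteration.
fn best_bid(book: &Book) -> Option<(u64, u64)> {
    book.iter().next_back().map(|(p, q)| (*p, *q))
}
```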




&lt;h1&gt;
  
  
  Checksum Validation
&lt;/h1&gt;

&lt;p&gt;Every snapshot/update:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Apply delta&lt;/li&gt;
&lt;li&gt;Reconstruct canonical string&lt;/li&gt;
&lt;li&gt;Compute CRC32&lt;/li&gt;
&lt;li&gt;Compare with exchange&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If mismatch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;invalidate local book&lt;/li&gt;
&lt;li&gt;trigger full resync&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Integrity is prioritized over throughput.&lt;/p&gt;
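&lt;p&gt;For illustration, the CRC32 in step 3 can be computed without dependencies. A real build would use the &lt;code&gt;crc32fast&lt;/code&gt; crate, and Kraken's canonical string construction is not reproduced here:&lt;/p&gt;

```rust
// Bitwise CRC32 (IEEE reflected polynomial), table-free for clarity.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFF_u32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // Branch-free conditional XOR with the polynomial.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}
```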




&lt;h1&gt;
  
  
  Backpressure &amp;amp; Consumer Decoupling
&lt;/h1&gt;

&lt;p&gt;Ingestion uses &lt;code&gt;tokio::broadcast&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multiple strategies subscribe&lt;/li&gt;
&lt;li&gt;ingestion never blocks&lt;/li&gt;
&lt;li&gt;near-zero fanout overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tradeoffs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;slow consumers can lag&lt;/li&gt;
&lt;li&gt;buffer overflow drops messages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Production additions would include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;lag metrics&lt;/li&gt;
&lt;li&gt;bounded channels&lt;/li&gt;
&lt;li&gt;backpressure signaling&lt;/li&gt;
&lt;li&gt;optional durable stream (Kafka/NATS)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fast ingestion without backpressure awareness leads to silent failure.&lt;/p&gt;
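&lt;p&gt;What a bounded alternative looks like, sketched with std's &lt;code&gt;sync_channel&lt;/code&gt;. &lt;code&gt;tokio::broadcast&lt;/code&gt; makes the opposite choice: it evicts the oldest message and reports the loss to the receiver as &lt;code&gt;Lagged&lt;/code&gt;:&lt;/p&gt;

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

// With a bounded queue, a lagging consumer surfaces immediately: `try_send`
// fails fast instead of blocking ingestion, and the drop can be counted.
fn offer(tx: &SyncSender<u64>, seq: u64) -> bool {
    match tx.try_send(seq) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) => false, // consumer lagging: drop and record
        Err(TrySendError::Disconnected(_)) => false,
    }
}
```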




&lt;h1&gt;
  
  
  Benchmarking Philosophy
&lt;/h1&gt;

&lt;p&gt;The benchmark goal was not peak speed.&lt;/p&gt;

&lt;p&gt;The goal was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;deterministic processing under sustained load.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Measured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;parsing + routing throughput&lt;/li&gt;
&lt;li&gt;allocation behavior&lt;/li&gt;
&lt;li&gt;latency per message&lt;/li&gt;
&lt;li&gt;CPU utilization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Results (local machine):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~648k msgs/sec (Rust)&lt;/li&gt;
&lt;li&gt;~600k msgs/sec (Python reference)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Important context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TLS + network latency not included&lt;/li&gt;
&lt;li&gt;Measured using recorded streams&lt;/li&gt;
&lt;li&gt;Focused on processing layer, not transport&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Throughput was secondary to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stable latency&lt;/li&gt;
&lt;li&gt;no ordering drift&lt;/li&gt;
&lt;li&gt;no state corruption&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Runtime Observations (Under Load)
&lt;/h1&gt;

&lt;p&gt;Measured locally under sustained stream replay:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency per message&lt;/strong&gt;: ~1–2µs parsing + routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU usage&lt;/strong&gt;: parsing dominated (~70% of core)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Peak memory usage&lt;/strong&gt;: ~10–15MB during normal ingestion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Allocation spikes&lt;/strong&gt;: occurred during full orderbook resync&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hotspots:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON array parsing&lt;/li&gt;
&lt;li&gt;temporary allocation during snapshot rebuild&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These observations influenced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;minimizing cloning&lt;/li&gt;
&lt;li&gt;reusing buffers&lt;/li&gt;
&lt;li&gt;reducing intermediate allocations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system remained stable under sustained load without memory growth.&lt;/p&gt;




&lt;h1&gt;
  
  
  Architectural Tradeoffs
&lt;/h1&gt;

&lt;p&gt;This implementation favors determinism over horizontal scalability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;deterministic ordering&lt;/li&gt;
&lt;li&gt;no socket contention&lt;/li&gt;
&lt;li&gt;simple recovery&lt;/li&gt;
&lt;li&gt;easier debugging&lt;/li&gt;
&lt;li&gt;minimal locking&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tradeoff
&lt;/h3&gt;

&lt;p&gt;Single-core parsing bottleneck at extreme rates.&lt;/p&gt;

&lt;p&gt;Production scaling options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;shard by trading pair&lt;/li&gt;
&lt;li&gt;multiple ingestion loops&lt;/li&gt;
&lt;li&gt;forward frames into Kafka/NATS&lt;/li&gt;
&lt;li&gt;multi-process ingestion layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Correctness first.&lt;br&gt;
Scale second.&lt;/p&gt;




&lt;h1&gt;
  
  
  Failure Modes Considered
&lt;/h1&gt;

&lt;p&gt;Designed assuming failure is normal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;connection drops&lt;/li&gt;
&lt;li&gt;malformed messages&lt;/li&gt;
&lt;li&gt;partial snapshot&lt;/li&gt;
&lt;li&gt;checksum mismatch&lt;/li&gt;
&lt;li&gt;slow consumers&lt;/li&gt;
&lt;li&gt;duplicate subscriptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Core principle:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ingestion must be recoverable, not fragile.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  What I Would Improve Next
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Reliability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;persistent event log for replay&lt;/li&gt;
&lt;li&gt;message durability layer&lt;/li&gt;
&lt;li&gt;lag-aware bounded queues&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus metrics&lt;/li&gt;
&lt;li&gt;structured tracing&lt;/li&gt;
&lt;li&gt;latency histograms&lt;/li&gt;
&lt;li&gt;reconnect counters&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scalability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;symbol-based sharding&lt;/li&gt;
&lt;li&gt;multi-loop ingestion&lt;/li&gt;
&lt;li&gt;partitioned state per pair&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Key Lessons
&lt;/h1&gt;

&lt;p&gt;Deterministic ownership simplifies distributed reasoning.&lt;br&gt;
Backpressure matters more than raw speed.&lt;br&gt;
Recovery logic is not edge-case logic — it is core logic.&lt;br&gt;
Type safety reduces runtime surprises.&lt;/p&gt;

&lt;p&gt;The hardest part of ingestion systems is not speed.&lt;/p&gt;

&lt;p&gt;It is &lt;strong&gt;predictable behavior under failure&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  Code
&lt;/h1&gt;

&lt;p&gt;Full implementation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Nihal-Pandey-2302/kraken-rs" rel="noopener noreferrer"&gt;https://github.com/Nihal-Pandey-2302/kraken-rs&lt;/a&gt;&lt;/p&gt;




</description>
      <category>rust</category>
      <category>websockets</category>
      <category>backend</category>
      <category>distributedsystems</category>
    </item>
  </channel>
</rss>
