<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sergei Karasev</title>
    <description>The latest articles on DEV Community by Sergei Karasev (@ahiipsa).</description>
    <link>https://dev.to/ahiipsa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3834134%2Ff58e337d-4873-4902-b048-ded15818d2cf.png</url>
      <title>DEV Community: Sergei Karasev</title>
      <link>https://dev.to/ahiipsa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ahiipsa"/>
    <language>en</language>
    <item>
      <title>How I solved Ethereum RPC rate limits with traffic engineering instead of paying $250/month</title>
      <dc:creator>Sergei Karasev</dc:creator>
      <pubDate>Wed, 25 Mar 2026 22:30:52 +0000</pubDate>
      <link>https://dev.to/ahiipsa/how-i-solved-ethereum-rpc-rate-limits-with-traffic-engineering-instead-of-paying-250month-30ed</link>
      <guid>https://dev.to/ahiipsa/how-i-solved-ethereum-rpc-rate-limits-with-traffic-engineering-instead-of-paying-250month-30ed</guid>
      <description>&lt;h2&gt;
  
  
  A production engineering story about rate limits, retries, failure behavior, and building RPC traffic control
&lt;/h2&gt;

&lt;p&gt;At some point our backend started failing.&lt;/p&gt;

&lt;p&gt;Not completely.&lt;/p&gt;

&lt;p&gt;Not catastrophically.&lt;/p&gt;

&lt;p&gt;Just small strange things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a cron job running longer than usual&lt;/li&gt;
&lt;li&gt;random RPC failures&lt;/li&gt;
&lt;li&gt;occasional timeouts&lt;/li&gt;
&lt;li&gt;rare stuck executions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing dramatic.&lt;/p&gt;

&lt;p&gt;But enough to feel dangerous.&lt;/p&gt;

&lt;p&gt;If you've worked with distributed systems, you know this pattern.&lt;/p&gt;

&lt;p&gt;Systems rarely explode.&lt;/p&gt;

&lt;p&gt;They &lt;strong&gt;slowly become unreliable first.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And this story is about how a simple wallet balance collector turned into an RPC infrastructure problem… and why the real solution wasn't buying a bigger plan.&lt;/p&gt;




&lt;h2&gt;
  
  
  The original problem was simple
&lt;/h2&gt;

&lt;p&gt;We needed to collect wallet balances to display user positions.&lt;/p&gt;

&lt;p&gt;Nothing unusual.&lt;/p&gt;

&lt;p&gt;The architecture was basic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cron (every 15 min)
    ↓
fetch balances from blockchain
    ↓
store results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each execution made around &lt;strong&gt;30–50 RPC calls&lt;/strong&gt;, mostly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eth_call
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NestJS backend&lt;/li&gt;
&lt;li&gt;ethers.js&lt;/li&gt;
&lt;li&gt;standard RPC providers&lt;/li&gt;
&lt;/ul&gt;
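&lt;p&gt;A minimal sketch of that collector (names are illustrative; the real service is NestJS with ethers.js, but the request shape is what matters):&lt;/p&gt;

```javascript
// Minimal sketch of the collector, with the balance fetch injected
// so the shape is visible without a network.
async function collectBalances(wallets, fetchBalance) {
  // Promise.all fires every request in the same tick: with 30-50
  // wallets this is exactly the burst shape that later triggers 429s.
  const balances = await Promise.all(wallets.map((w) => fetchBalance(w)));
  return Object.fromEntries(wallets.map((w, i) => [w, balances[i]]));
}
```

&lt;p&gt;Note the &lt;code&gt;Promise.all&lt;/code&gt; fan-out. It is the seed of everything below.&lt;/p&gt;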

&lt;p&gt;Everything worked perfectly.&lt;/p&gt;

&lt;p&gt;Until growth started.&lt;/p&gt;




&lt;h2&gt;
  
  
  When things started breaking
&lt;/h2&gt;

&lt;p&gt;As the number of wallets increased, we started seeing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;random RPC failures&lt;/li&gt;
&lt;li&gt;unstable execution time&lt;/li&gt;
&lt;li&gt;occasional timeouts&lt;/li&gt;
&lt;li&gt;jobs sometimes hanging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At first nothing clearly indicated the cause.&lt;/p&gt;

&lt;p&gt;Then logs revealed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTP 429&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rate limits.&lt;/p&gt;

&lt;p&gt;Expected.&lt;/p&gt;

&lt;p&gt;But then something more interesting appeared.&lt;/p&gt;




&lt;h2&gt;
  
  
  The RPC behavior nobody warns you about
&lt;/h2&gt;

&lt;p&gt;I expected providers to reject requests when overloaded.&lt;/p&gt;

&lt;p&gt;Instead some providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;accepted connections&lt;/li&gt;
&lt;li&gt;delayed responses&lt;/li&gt;
&lt;li&gt;eventually timed out&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Which is much worse.&lt;/p&gt;

&lt;p&gt;Fast failure is manageable.&lt;/p&gt;

&lt;p&gt;Slow failure kills systems.&lt;/p&gt;

&lt;p&gt;Because backend workers stay blocked.&lt;/p&gt;

&lt;p&gt;And the ethers default request timeout is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5 minutes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Which means under pressure:&lt;/p&gt;

&lt;p&gt;Requests didn't fail.&lt;/p&gt;

&lt;p&gt;They accumulated.&lt;/p&gt;

&lt;p&gt;This is exactly how cascading failures begin.&lt;/p&gt;
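&lt;p&gt;One mitigation is a hard deadline around every call. This is a generic sketch, not an ethers API (ethers itself also lets you lower its transport timeout); it races the RPC promise against a timer so a stalled provider fails fast instead of holding a worker:&lt;/p&gt;

```javascript
// Guard any RPC promise with a hard deadline so a slow provider
// fails fast instead of blocking a worker for minutes.
function withTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`RPC timeout after ${ms}ms`)), ms);
  });
  // Clearing the timer keeps the event loop from being held open.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```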




&lt;h2&gt;
  
  
  First fix: add more RPC providers
&lt;/h2&gt;

&lt;p&gt;The obvious fix:&lt;/p&gt;

&lt;p&gt;Add redundancy.&lt;/p&gt;

&lt;p&gt;We added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infura free plan&lt;/li&gt;
&lt;li&gt;Quicknode free plan&lt;/li&gt;
&lt;li&gt;public RPC endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ethers.FallbackProvider
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Logic looked clean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Provider 1 → fail
Provider 2 → success
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Problem solved?&lt;/p&gt;

&lt;p&gt;Not really.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why FallbackProvider doesn't solve infrastructure problems
&lt;/h2&gt;

&lt;p&gt;Important detail:&lt;/p&gt;

&lt;p&gt;FallbackProvider does &lt;strong&gt;not penalize failing providers.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If provider 1 constantly returns 429:&lt;/p&gt;

&lt;p&gt;Every request becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;try provider 1
fail

try provider 2
success
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which creates hidden costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;guaranteed failed request&lt;/li&gt;
&lt;li&gt;extra latency&lt;/li&gt;
&lt;li&gt;wasted credits&lt;/li&gt;
&lt;li&gt;retry overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Under load this becomes dangerous.&lt;/p&gt;

&lt;p&gt;Because retries increase pressure.&lt;/p&gt;

&lt;p&gt;This isn't failover.&lt;/p&gt;

&lt;p&gt;This is failure amplification.&lt;/p&gt;




&lt;h2&gt;
  
  
  The realization moment
&lt;/h2&gt;

&lt;p&gt;At some point it became obvious:&lt;/p&gt;

&lt;p&gt;This wasn't an RPC provider problem.&lt;/p&gt;

&lt;p&gt;This wasn't a pricing problem.&lt;/p&gt;

&lt;p&gt;This was a:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traffic control problem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RPS control&lt;/li&gt;
&lt;li&gt;concurrency control&lt;/li&gt;
&lt;li&gt;routing&lt;/li&gt;
&lt;li&gt;provider health awareness&lt;/li&gt;
&lt;li&gt;failure isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words:&lt;/p&gt;

&lt;p&gt;We didn't need better providers.&lt;/p&gt;

&lt;p&gt;We needed better infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Optimizations I tried before building infrastructure
&lt;/h2&gt;

&lt;p&gt;Before writing infrastructure you should try optimization.&lt;/p&gt;

&lt;p&gt;I did.&lt;/p&gt;




&lt;h3&gt;
  
  
  Multicall
&lt;/h3&gt;

&lt;p&gt;Reduced number of requests.&lt;/p&gt;

&lt;p&gt;Helped efficiency.&lt;/p&gt;

&lt;p&gt;Didn't solve RPS spikes.&lt;/p&gt;




&lt;h3&gt;
  
  
  ethers v6 batching
&lt;/h3&gt;

&lt;p&gt;Same result.&lt;/p&gt;

&lt;p&gt;Better efficiency.&lt;/p&gt;

&lt;p&gt;Same burst pressure.&lt;/p&gt;




&lt;h3&gt;
  
  
  Bigger plans
&lt;/h3&gt;

&lt;p&gt;Plans that satisfied our requirements:&lt;/p&gt;

&lt;p&gt;About &lt;strong&gt;$250/month&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Technically valid.&lt;/p&gt;

&lt;p&gt;Architecturally lazy.&lt;/p&gt;

&lt;p&gt;Because we didn't lack compute.&lt;/p&gt;

&lt;p&gt;We lacked control.&lt;/p&gt;




&lt;h2&gt;
  
  
  The real bottleneck
&lt;/h2&gt;

&lt;p&gt;The problem wasn't total requests.&lt;/p&gt;

&lt;p&gt;It was request distribution.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10 requests in 1 second → fail
10 requests across 5 seconds → OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same work.&lt;/p&gt;

&lt;p&gt;Different shape.&lt;/p&gt;

&lt;p&gt;Infrastructure cares about shape.&lt;/p&gt;

&lt;p&gt;Not totals.&lt;/p&gt;
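&lt;p&gt;Shaping can start very small. This sketch (illustrative, not the production code) computes evenly spaced send offsets so the same work goes out as a flow instead of a burst:&lt;/p&gt;

```javascript
// Same total work, different shape: compute send offsets (in ms)
// that spread n requests evenly across a window instead of one burst.
function pacingOffsets(n, windowMs) {
  const gap = windowMs / n;
  return Array.from({ length: n }, (_, i) => Math.round(i * gap));
}
```

&lt;p&gt;With &lt;code&gt;pacingOffsets(10, 5000)&lt;/code&gt;, the ten requests leave at 0, 500, 1000, and so on up to 4500 ms.&lt;/p&gt;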




&lt;h2&gt;
  
  
  Architecture evolution
&lt;/h2&gt;

&lt;p&gt;The architecture evolved naturally.&lt;/p&gt;

&lt;p&gt;Reality forced it.&lt;/p&gt;




&lt;h3&gt;
  
  
  Phase 1 — naive design
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Backend
   │
Single RPC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;single point of failure&lt;/li&gt;
&lt;li&gt;hard RPS ceiling&lt;/li&gt;
&lt;li&gt;timeout risk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Worked until growth.&lt;/p&gt;




&lt;h3&gt;
  
  
  Phase 2 — redundancy design
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Backend
   │
FallbackProvider
   │
RPC1 RPC2 RPC3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Improved availability.&lt;/p&gt;

&lt;p&gt;Still unstable.&lt;/p&gt;

&lt;p&gt;Problems remained:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no RPS control&lt;/li&gt;
&lt;li&gt;retry overhead&lt;/li&gt;
&lt;li&gt;no provider isolation&lt;/li&gt;
&lt;li&gt;unpredictable latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We improved redundancy.&lt;/p&gt;

&lt;p&gt;Not resilience.&lt;/p&gt;




&lt;h3&gt;
  
  
  Phase 3 — infrastructure design
&lt;/h3&gt;

&lt;p&gt;Final architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Backend
   │
RPCPoolProvider
   │
Traffic control layer
   │
RPC1 RPC2 RPC3 RPC4 RPC5 RPC6 RPC7 RPC8 RPC9 RPC10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key difference:&lt;/p&gt;

&lt;p&gt;We stopped reacting to failures.&lt;/p&gt;

&lt;p&gt;We started preventing them.&lt;/p&gt;

&lt;p&gt;This is the difference between redundancy and infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building RPC traffic control
&lt;/h2&gt;

&lt;p&gt;Requirements became clear:&lt;/p&gt;

&lt;p&gt;The provider must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;never exceed RPS&lt;/li&gt;
&lt;li&gt;control concurrency&lt;/li&gt;
&lt;li&gt;distribute requests&lt;/li&gt;
&lt;li&gt;isolate bad providers&lt;/li&gt;
&lt;li&gt;retry intelligently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I built a provider that does exactly that.&lt;/p&gt;

&lt;p&gt;Instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;App → RPC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;App → RPC Pool → RPC endpoints
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RPC became:&lt;/p&gt;

&lt;p&gt;A managed resource.&lt;/p&gt;

&lt;p&gt;Not a dependency.&lt;/p&gt;




&lt;h2&gt;
  
  
  Production design decisions
&lt;/h2&gt;

&lt;p&gt;These decisions made the system stable.&lt;/p&gt;




&lt;h3&gt;
  
  
  Rate limiting per endpoint
&lt;/h3&gt;

&lt;p&gt;Each RPC gets independent RPS control.&lt;/p&gt;

&lt;p&gt;A token bucket model ensures smooth traffic shaping.&lt;/p&gt;

&lt;p&gt;Instead of burst → fail:&lt;/p&gt;

&lt;p&gt;We get:&lt;/p&gt;

&lt;p&gt;Flow → stability.&lt;/p&gt;

&lt;p&gt;This alone removed most rate limits.&lt;/p&gt;
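&lt;p&gt;A token bucket is only a few lines. This sketch takes an injected clock so its behavior is deterministic; the capacity and refill numbers are illustrative:&lt;/p&gt;

```javascript
// Token bucket with an injected clock (testable without real time).
// capacity = allowed burst size, refillPerSec = sustained RPS.
class TokenBucket {
  constructor(capacity, refillPerSec, now = () => Date.now()) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity;
    this.now = now;
    this.last = now();
  }
  tryRemove() {
    const t = this.now();
    const refill = ((t - this.last) / 1000) * this.refillPerSec;
    this.tokens = Math.min(this.capacity, this.tokens + refill);
    this.last = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // request may go out now
    }
    return false; // caller should wait, not burst
  }
}
```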




&lt;h3&gt;
  
  
  Concurrency isolation
&lt;/h3&gt;

&lt;p&gt;Hidden killer:&lt;/p&gt;

&lt;p&gt;Concurrency spikes.&lt;/p&gt;

&lt;p&gt;Even if RPS is correct.&lt;/p&gt;

&lt;p&gt;Solution:&lt;/p&gt;

&lt;p&gt;Semaphore per endpoint.&lt;/p&gt;

&lt;p&gt;Prevents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provider overload&lt;/li&gt;
&lt;li&gt;retry cascades&lt;/li&gt;
&lt;li&gt;self-inflicted pressure&lt;/li&gt;
&lt;/ul&gt;
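&lt;p&gt;The semaphore itself is small. A sketch (illustrative, not the library internals):&lt;/p&gt;

```javascript
// Per-endpoint semaphore: caps in-flight requests regardless of RPS.
class Semaphore {
  constructor(limit) {
    this.free = limit;
    this.waiters = [];
  }
  acquire() {
    if (this.free > 0) {
      this.free -= 1;
      return Promise.resolve();
    }
    // Over the limit: park the caller instead of hitting the provider.
    return new Promise((resolve) => this.waiters.push(resolve));
  }
  release() {
    const next = this.waiters.shift();
    if (next) next(); // hand the slot directly to a waiter
    else this.free += 1;
  }
}
```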




&lt;h3&gt;
  
  
  Smart routing
&lt;/h3&gt;

&lt;p&gt;Naive round-robin assumes all providers are equal.&lt;/p&gt;

&lt;p&gt;Reality:&lt;/p&gt;

&lt;p&gt;Providers degrade.&lt;/p&gt;

&lt;p&gt;Providers throttle.&lt;/p&gt;

&lt;p&gt;Providers fail.&lt;/p&gt;

&lt;p&gt;Router avoids providers in cooldown.&lt;/p&gt;

&lt;p&gt;Bad nodes are temporarily removed.&lt;/p&gt;

&lt;p&gt;This prevents failure amplification.&lt;/p&gt;
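&lt;p&gt;A sketch of cooldown-aware routing (names and the cooldown window are illustrative):&lt;/p&gt;

```javascript
// Round-robin over healthy endpoints only; failing endpoints sit out
// a cooldown window instead of eating one doomed request per call.
class CooldownRouter {
  constructor(endpoints, cooldownMs, now = () => Date.now()) {
    this.endpoints = endpoints;
    this.cooldownMs = cooldownMs;
    this.now = now;
    this.blockedUntil = new Map();
    this.idx = 0;
  }
  reportFailure(endpoint) {
    this.blockedUntil.set(endpoint, this.now() + this.cooldownMs);
  }
  next() {
    for (let i = 0; i !== this.endpoints.length; i += 1) {
      const candidate = this.endpoints[this.idx % this.endpoints.length];
      this.idx += 1;
      const until = this.blockedUntil.get(candidate) ?? 0;
      if (this.now() >= until) return candidate;
    }
    return null; // every endpoint is cooling down
  }
}
```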




&lt;h3&gt;
  
  
  Failure-aware retry strategy
&lt;/h3&gt;

&lt;p&gt;Retries only happen for infrastructure failures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;timeouts&lt;/li&gt;
&lt;li&gt;rate limits&lt;/li&gt;
&lt;li&gt;5xx errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Never for logical errors.&lt;/p&gt;

&lt;p&gt;Prevents retry storms.&lt;/p&gt;

&lt;p&gt;One of the most dangerous distributed-systems bugs.&lt;/p&gt;
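&lt;p&gt;Classification can be a single function. The error shapes here (&lt;code&gt;status&lt;/code&gt;, &lt;code&gt;code&lt;/code&gt;) are illustrative; real code has to match what your HTTP client and ethers actually throw:&lt;/p&gt;

```javascript
// Retry only infrastructure failures; logical errors (reverts, bad
// params) are returned to the caller immediately.
function isRetryable(err) {
  if (err.code === 'TIMEOUT') return true;                     // stalled node
  if (err.status === 429) return true;                         // rate limited
  if (Math.floor((err.status ?? 0) / 100) === 5) return true;  // 5xx
  return false; // revert, bad request, anything logical
}
```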




&lt;h3&gt;
  
  
  Observability (the most often missing piece)
&lt;/h3&gt;

&lt;p&gt;Most RPC wrappers are blind.&lt;/p&gt;

&lt;p&gt;Production systems cannot be blind.&lt;/p&gt;

&lt;p&gt;Every request emits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;request event&lt;/li&gt;
&lt;li&gt;response event&lt;/li&gt;
&lt;li&gt;error event&lt;/li&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;metrics&lt;/li&gt;
&lt;li&gt;monitoring&lt;/li&gt;
&lt;li&gt;provider comparison&lt;/li&gt;
&lt;li&gt;anomaly detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Observability is what turns code into infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Engineering mistakes I made
&lt;/h2&gt;

&lt;p&gt;Looking back, several mistakes were obvious.&lt;/p&gt;

&lt;p&gt;Common ones.&lt;/p&gt;




&lt;h3&gt;
  
  
  Mistake 1 — assuming providers fail cleanly
&lt;/h3&gt;

&lt;p&gt;I assumed providers reject overload.&lt;/p&gt;

&lt;p&gt;Reality:&lt;/p&gt;

&lt;p&gt;Some delay instead.&lt;/p&gt;

&lt;p&gt;Much worse.&lt;/p&gt;

&lt;p&gt;Lesson:&lt;/p&gt;

&lt;p&gt;External systems rarely fail nicely.&lt;/p&gt;




&lt;h3&gt;
  
  
  Mistake 2 — believing retries increase reliability
&lt;/h3&gt;

&lt;p&gt;Retries without shaping increase pressure.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;failure → retry → more load → more failure → more retry
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Feedback loop.&lt;/p&gt;

&lt;p&gt;Lesson:&lt;/p&gt;

&lt;p&gt;Retries must be controlled.&lt;/p&gt;




&lt;h3&gt;
  
  
  Mistake 3 — optimizing totals instead of distribution
&lt;/h3&gt;

&lt;p&gt;I optimized:&lt;/p&gt;

&lt;p&gt;Total requests.&lt;/p&gt;

&lt;p&gt;Real problem:&lt;/p&gt;

&lt;p&gt;Requests per second.&lt;/p&gt;

&lt;p&gt;Lesson:&lt;/p&gt;

&lt;p&gt;Time distribution matters more than totals.&lt;/p&gt;




&lt;h3&gt;
  
  
  Mistake 4 — treating RPC as just an API
&lt;/h3&gt;

&lt;p&gt;RPC is not just an API.&lt;/p&gt;

&lt;p&gt;It is infrastructure.&lt;/p&gt;

&lt;p&gt;Which means it needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;routing&lt;/li&gt;
&lt;li&gt;shaping&lt;/li&gt;
&lt;li&gt;monitoring&lt;/li&gt;
&lt;li&gt;failure strategy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lesson:&lt;/p&gt;

&lt;p&gt;If it can break production, it's infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  What can still go wrong
&lt;/h2&gt;

&lt;p&gt;No solution is perfect.&lt;/p&gt;

&lt;p&gt;Remaining risks:&lt;/p&gt;




&lt;h3&gt;
  
  
  Block consistency
&lt;/h3&gt;

&lt;p&gt;Different providers may lag blocks.&lt;/p&gt;

&lt;p&gt;Possible solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sticky routing&lt;/li&gt;
&lt;li&gt;blockTag consistency&lt;/li&gt;
&lt;li&gt;session affinity&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Public RPC instability
&lt;/h3&gt;

&lt;p&gt;Public endpoints can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;disappear&lt;/li&gt;
&lt;li&gt;throttle&lt;/li&gt;
&lt;li&gt;change limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mitigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;large provider pool&lt;/li&gt;
&lt;li&gt;health scoring&lt;/li&gt;
&lt;li&gt;adaptive routing&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Retry pressure under mass failure
&lt;/h3&gt;

&lt;p&gt;If many providers fail:&lt;/p&gt;

&lt;p&gt;Retries can still create pressure.&lt;/p&gt;

&lt;p&gt;Future improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;circuit breakers&lt;/li&gt;
&lt;li&gt;global backoff&lt;/li&gt;
&lt;li&gt;adaptive retry limits&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Latency variance
&lt;/h3&gt;

&lt;p&gt;Providers have different speeds.&lt;/p&gt;

&lt;p&gt;Future improvement:&lt;/p&gt;

&lt;p&gt;Latency-weighted routing.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I would build differently today
&lt;/h2&gt;

&lt;p&gt;With hindsight:&lt;/p&gt;

&lt;p&gt;Several improvements are obvious.&lt;/p&gt;




&lt;h3&gt;
  
  
  Health scoring instead of cooldown
&lt;/h3&gt;

&lt;p&gt;Instead of simple cooldown:&lt;/p&gt;

&lt;p&gt;Use dynamic scoring based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;success rate&lt;/li&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;li&gt;timeout ratio&lt;/li&gt;
&lt;li&gt;rate limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Route based on health.&lt;/p&gt;
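&lt;p&gt;A hypothetical scoring function (the weights and thresholds are invented for illustration):&lt;/p&gt;

```javascript
// Hypothetical health score in [0, 1]: penalize failures, timeouts
// and slow endpoints. All weights here are illustrative.
function healthScore(stats) {
  const total = stats.ok + stats.failed;
  if (total === 0) return 1; // no data yet: assume healthy
  const successRate = stats.ok / total;
  const timeoutPenalty = 0.5 * (stats.timeouts / total);
  const latencyPenalty = Math.min(0.3, stats.p95LatencyMs / 10000);
  return Math.max(0, successRate - timeoutPenalty - latencyPenalty);
}
```

&lt;p&gt;The router then prefers endpoints with the highest score instead of a binary in/out cooldown.&lt;/p&gt;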




&lt;h3&gt;
  
  
  Latency-aware routing
&lt;/h3&gt;

&lt;p&gt;Fast providers should get more traffic.&lt;/p&gt;

&lt;p&gt;Weighted routing improves stability.&lt;/p&gt;




&lt;h3&gt;
  
  
  Adaptive RPS
&lt;/h3&gt;

&lt;p&gt;Instead of static limits:&lt;/p&gt;

&lt;p&gt;Adaptive limits based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;failures&lt;/li&gt;
&lt;li&gt;latency growth&lt;/li&gt;
&lt;li&gt;retry-after headers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Infrastructure should adapt automatically.&lt;/p&gt;
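&lt;p&gt;One well-known shape for this is AIMD (additive increase, multiplicative decrease), the same idea TCP congestion control uses. A sketch with illustrative constants:&lt;/p&gt;

```javascript
// AIMD-style adaptive RPS: creep up while requests succeed, back off
// multiplicatively on 429s and timeouts. Constants are illustrative.
function nextRpsLimit(current, outcome, opts = {}) {
  const floor = opts.floor ?? 1;
  const ceiling = opts.ceiling ?? 50;
  const step = opts.step ?? 0.5;       // additive increase per clean window
  const backoff = opts.backoff ?? 0.5; // multiplicative decrease on pressure
  if (outcome === 'ok') return Math.min(ceiling, current + step);
  return Math.max(floor, current * backoff);
}
```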




&lt;h3&gt;
  
  
  Request deduplication
&lt;/h3&gt;

&lt;p&gt;Multiple identical calls could merge.&lt;/p&gt;

&lt;p&gt;(singleflight pattern)&lt;/p&gt;

&lt;p&gt;Reduces pressure dramatically.&lt;/p&gt;

&lt;p&gt;Especially for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eth_call
eth_blockNumber
eth_gasPrice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
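&lt;p&gt;A sketch of the singleflight pattern (the &lt;code&gt;send&lt;/code&gt; signature is illustrative):&lt;/p&gt;

```javascript
// Singleflight: concurrent identical calls share one in-flight
// request. Keyed by method plus params; dropped once it settles.
function createSingleflight() {
  const inflight = new Map();
  return function dedupe(method, params, send) {
    const key = `${method}:${JSON.stringify(params)}`;
    if (inflight.has(key)) return inflight.get(key);
    const p = Promise.resolve()
      .then(() => send(method, params))
      .finally(() => inflight.delete(key));
    inflight.set(key, p);
    return p;
  };
}
```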






&lt;h3&gt;
  
  
  Sticky routing
&lt;/h3&gt;

&lt;p&gt;Some workloads benefit from provider affinity.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Vault calculations.&lt;/p&gt;

&lt;p&gt;Session stickiness could help.&lt;/p&gt;




&lt;h2&gt;
  
  
  Production outcome
&lt;/h2&gt;

&lt;p&gt;Today we run:&lt;/p&gt;

&lt;p&gt;~10 RPC endpoints per network.&lt;/p&gt;

&lt;p&gt;Mix of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;free plans&lt;/li&gt;
&lt;li&gt;public RPC&lt;/li&gt;
&lt;li&gt;basic tiers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No premium plans required.&lt;/p&gt;

&lt;p&gt;Backend became:&lt;/p&gt;

&lt;p&gt;Stable.&lt;/p&gt;

&lt;p&gt;Predictable.&lt;/p&gt;

&lt;p&gt;Boring.&lt;/p&gt;

&lt;p&gt;Which is exactly what infrastructure should be.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I open sourced it
&lt;/h2&gt;

&lt;p&gt;After running this in production I realized:&lt;/p&gt;

&lt;p&gt;Many teams probably hit this exact stage:&lt;/p&gt;

&lt;p&gt;Single RPC → instability → bigger plan.&lt;/p&gt;

&lt;p&gt;Few build traffic control.&lt;/p&gt;

&lt;p&gt;So I extracted the provider into:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ethers-rpc-pool&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A drop-in JsonRpcProvider replacement adding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;load balancing&lt;/li&gt;
&lt;li&gt;RPS limiting&lt;/li&gt;
&lt;li&gt;concurrency control&lt;/li&gt;
&lt;li&gt;retry strategy&lt;/li&gt;
&lt;li&gt;instrumentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal wasn't to build a library.&lt;/p&gt;

&lt;p&gt;The goal was to make RPC predictable.&lt;/p&gt;




&lt;h2&gt;
  
  
  What makes this different from typical RPC wrappers
&lt;/h2&gt;

&lt;p&gt;Typical wrappers add:&lt;/p&gt;

&lt;p&gt;Load balancing.&lt;/p&gt;

&lt;p&gt;This adds:&lt;/p&gt;

&lt;p&gt;Traffic engineering.&lt;/p&gt;

&lt;p&gt;Which includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;request shaping&lt;/li&gt;
&lt;li&gt;failure classification&lt;/li&gt;
&lt;li&gt;provider cooldown&lt;/li&gt;
&lt;li&gt;retry discipline&lt;/li&gt;
&lt;li&gt;concurrency isolation&lt;/li&gt;
&lt;li&gt;observability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Less helper.&lt;/p&gt;

&lt;p&gt;More infrastructure layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The real lesson
&lt;/h2&gt;

&lt;p&gt;Most scaling problems are not scale problems.&lt;/p&gt;

&lt;p&gt;They are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Control problems.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not:&lt;/p&gt;

&lt;p&gt;How many requests you send.&lt;/p&gt;

&lt;p&gt;But:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How you shape them.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  If you hit RPC limits
&lt;/h2&gt;

&lt;p&gt;Before buying bigger plans ask:&lt;/p&gt;

&lt;p&gt;Do I need more capacity?&lt;/p&gt;

&lt;p&gt;Or better engineering?&lt;/p&gt;

&lt;p&gt;Often the cheaper solution is also the more senior one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;Good engineers solve problems.&lt;/p&gt;

&lt;p&gt;Senior engineers prevent them.&lt;/p&gt;

&lt;p&gt;Staff engineers reshape systems so problems cannot easily appear.&lt;/p&gt;

&lt;p&gt;This started as:&lt;/p&gt;

&lt;p&gt;RPC failures.&lt;/p&gt;

&lt;p&gt;It ended as:&lt;/p&gt;

&lt;p&gt;Traffic engineering.&lt;/p&gt;

&lt;p&gt;And that's usually how infrastructure stories go.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project
&lt;/h2&gt;

&lt;p&gt;I extracted this solution into a reusable provider:&lt;/p&gt;

&lt;p&gt;ethers-rpc-pool&lt;/p&gt;

&lt;p&gt;GitHub:&lt;br&gt;
&lt;a href="https://github.com/ahiipsa/ethers-rpc-pool" rel="noopener noreferrer"&gt;https://github.com/ahiipsa/ethers-rpc-pool&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;npm:&lt;br&gt;
&lt;a href="https://www.npmjs.com/package/ethers-rpc-pool" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/ethers-rpc-pool&lt;/a&gt;&lt;/p&gt;

</description>
      <category>blockchain</category>
      <category>ethereum</category>
      <category>javascript</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
