DEV Community

Sergei Karasev


How I solved Ethereum RPC rate limits with traffic engineering instead of paying $250/month

A production engineering story about rate limits, retries, failure behavior, and building RPC traffic control

At some point our backend started failing.

Not completely.

Not catastrophically.

Just small strange things:

  • a cron job running longer than usual
  • random RPC failures
  • occasional timeouts
  • rare stuck executions

Nothing dramatic.

But enough to feel dangerous.

If you've worked with distributed systems, you know this pattern.

Systems rarely explode.

They slowly become unreliable first.

And this story is about how a simple wallet balance collector turned into an RPC infrastructure problem… and why the real solution wasn't buying a bigger plan.


The original problem was simple

We needed to collect wallet balances to display user positions.

Nothing unusual.

Architecture was basic:

cron (every 15 min)
    ↓
fetch balances from blockchain
    ↓
store results

Each execution made around 30–50 RPC calls, mostly:

eth_call

Stack:

  • NestJS backend
  • ethers.js
  • standard RPC providers

Everything worked perfectly.

Until growth started.


When things started breaking

As the number of wallets increased, we started seeing:

  • random RPC failures
  • unstable execution time
  • occasional timeouts
  • jobs sometimes hanging

At first nothing clearly indicated the cause.

Then logs revealed:

HTTP 429

Rate limits.

Expected.

But then something more interesting appeared.


The RPC behavior nobody warns you about

I expected providers to reject requests when overloaded.

Instead some providers:

  • accepted connections
  • delayed responses
  • eventually timed out

Which is much worse.

Fast failure is manageable.

Slow failure kills systems.

Because backend workers stay blocked.

And the default ethers.js timeout is:

5 minutes

Which means under pressure:

Requests didn't fail.

They accumulated.

This is exactly how cascading failures begin.
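The fix pattern is simple: make hung calls fail fast. Here is a minimal, generic sketch in TypeScript (illustrative, not our production code) that wraps any promise-returning RPC call with a hard deadline:

```typescript
// Fail-fast wrapper: a hung RPC call rejects after `ms` instead of
// blocking a backend worker for minutes.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`RPC timeout after ${ms}ms`)),
      ms,
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```

With ethers v6 specifically, the timeout can also be tightened at the transport level (e.g. by building the provider from a `FetchRequest` with a smaller `timeout`), though the exact knob depends on the ethers version you run.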


First fix: add more RPC providers

The obvious fix:

Add redundancy.

We added:

  • Infura free plan
  • Quicknode free plan
  • public RPC endpoints

And used:

ethers.FallbackProvider

Logic looked clean:

Provider 1 → fail
Provider 2 → success

Problem solved?

Not really.


Why FallbackProvider doesn't solve infrastructure problems

Important detail:

FallbackProvider does not penalize failing providers.

If provider 1 constantly returns 429:

Every request becomes:

try provider 1
fail

try provider 2
success

Which creates hidden costs:

  • guaranteed failed request
  • extra latency
  • wasted credits
  • retry overhead

Under load this becomes dangerous.

Because retries increase pressure.

This isn't failover.

This is failure amplification.


The realization moment

At some point it became obvious:

This wasn't an RPC provider problem.

This wasn't a pricing problem.

This was a:

Traffic control problem.

We needed:

  • RPS control
  • concurrency control
  • routing
  • provider health awareness
  • failure isolation

In other words:

We didn't need better providers.

We needed better infrastructure.


Optimizations I tried before building infrastructure

Before writing infrastructure you should try optimization.

I did.


Multicall

Reduced number of requests.

Helped efficiency.

Didn't solve RPS spikes.


ethers v6 batching

Same result.

Better efficiency.

Same burst pressure.


Bigger plans

Plans that satisfied our requirements:

About $250/month

Technically valid.

Architecturally lazy.

Because we didn't lack compute.

We lacked control.


The real bottleneck

The problem wasn't total requests.

It was request distribution.

Example:

10 requests in 1 second → fail
10 requests across 5 seconds → OK

Same work.

Different shape.

Infrastructure cares about shape.

Not totals.
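The same idea in code: keep the total work identical but change its shape. A toy pacing helper (the name `paceOffsets` is mine, TypeScript) that computes when each request should fire so a burst becomes a steady flow:

```typescript
// Spread `n` requests evenly across `windowMs` instead of firing
// them all at t=0. Returns dispatch offsets in milliseconds.
function paceOffsets(n: number, windowMs: number): number[] {
  const step = windowMs / n;
  return Array.from({ length: n }, (_, i) => Math.round(i * step));
}
```

`paceOffsets(10, 5000)` yields offsets 0, 500, …, 4500 ms: the same 10 requests, now one every half second.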


Architecture evolution

The architecture evolved naturally.

Reality forced it.


Phase 1 — naive design

Backend
   │
Single RPC

Problems:

  • single point of failure
  • hard RPS ceiling
  • timeout risk

Worked until growth.


Phase 2 — redundancy design

Backend
   │
FallbackProvider
   │
RPC1 RPC2 RPC3

Improved availability.

Still unstable.

Problems remained:

  • no RPS control
  • retry overhead
  • no provider isolation
  • unpredictable latency

We improved redundancy.

Not resilience.


Phase 3 — infrastructure design

Final architecture:

Backend
   │
RPCPoolProvider
   │
Traffic control layer
   │
RPC1 RPC2 RPC3 RPC4 RPC5 RPC6 RPC7 RPC8 RPC9 RPC10

Key difference:

We stopped reacting to failures.

We started preventing them.

This is the difference between redundancy and infrastructure.


Building RPC traffic control

Requirements became clear:

The provider must:

  • never exceed RPS
  • control concurrency
  • distribute requests
  • isolate bad providers
  • retry intelligently

So I built a provider that does exactly that.

Instead of:

App → RPC

We now have:

App → RPC Pool → RPC endpoints

RPC became:

A managed resource.

Not a dependency.


Production design decisions

These decisions made the system stable.


Rate limiting per endpoint

Each RPC gets independent RPS control.

Token bucket model ensures smooth traffic shaping.

Instead of burst → fail:

We get:

Flow → stability.

This alone removed most rate limits.
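A token bucket is small enough to show in full. This is an illustrative TypeScript sketch, not the library's actual internals; the clock is passed in so behavior is deterministic:

```typescript
// Per-endpoint token bucket: tokens refill at `ratePerSec`, capped
// at `burst`. A request may fire only if a whole token is available.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly ratePerSec: number, // sustained RPS
    private readonly burst: number,      // max back-to-back requests
    now: number = Date.now(),
  ) {
    this.tokens = burst;
    this.lastRefill = now;
  }

  // Returns true if a request may be sent at time `now` (ms).
  tryAcquire(now: number = Date.now()): boolean {
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

`ratePerSec` caps the sustained flow; `burst` caps how many requests may fire back-to-back after an idle period.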


Concurrency isolation

Hidden killer:

Concurrency spikes.

Even if RPS is correct.

Solution:

Semaphore per endpoint.

Prevents:

  • provider overload
  • retry cascades
  • self-inflicted pressure
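A per-endpoint semaphore is similarly compact. The sketch below (TypeScript, illustrative; names like `withSlot` are mine, not the library's) hands a freed slot directly to the next waiter, so the cap is never exceeded:

```typescript
// Async semaphore: at most `limit` calls in flight per endpoint.
class Semaphore {
  private inFlight = 0;
  private readonly waiters: Array<() => void> = [];

  constructor(private readonly limit: number) {}

  async acquire(): Promise<void> {
    if (this.inFlight < this.limit) {
      this.inFlight++;
      return;
    }
    // Wait until release() hands this waiter the slot directly.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // slot passes straight to the next waiter
    else this.inFlight--;
  }
}

// Convenience wrapper: run `fn` inside a slot, always releasing.
async function withSlot<T>(sem: Semaphore, fn: () => Promise<T>): Promise<T> {
  await sem.acquire();
  try {
    return await fn();
  } finally {
    sem.release();
  }
}
```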

Smart routing

Naive round robin assumes equal providers.

Reality:

Providers degrade.

Providers throttle.

Providers fail.

Router avoids providers in cooldown.

Bad nodes temporarily removed.

This prevents failure amplification.
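A cooldown-aware router can be sketched as round-robin that skips anything still cooling down. Illustrative TypeScript with an injected clock; the real routing logic may differ:

```typescript
interface Endpoint {
  url: string;
  cooldownUntil: number; // ms timestamp; 0 = healthy
}

// Round-robin over endpoints, skipping any endpoint in cooldown.
class Router {
  private cursor = 0;

  constructor(private readonly endpoints: Endpoint[]) {}

  pick(now: number = Date.now()): Endpoint | undefined {
    for (let i = 0; i < this.endpoints.length; i++) {
      const ep = this.endpoints[(this.cursor + i) % this.endpoints.length];
      if (ep.cooldownUntil <= now) {
        this.cursor = (this.cursor + i + 1) % this.endpoints.length;
        return ep;
      }
    }
    return undefined; // every endpoint is cooling down
  }

  reportFailure(ep: Endpoint, now: number = Date.now(), cooldownMs = 30_000): void {
    ep.cooldownUntil = now + cooldownMs; // temporarily remove bad node
  }
}
```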


Failure-aware retry strategy

Retries only happen for infrastructure failures:

  • timeouts
  • rate limits
  • 5xx errors

Never for logical errors.

Prevents retry storms.

One of the most dangerous distributed system bugs.
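The classification itself is the important part. A sketch (TypeScript; the error shape here is illustrative): retry timeouts, 429s and 5xx with jittered backoff, and give up immediately on everything else:

```typescript
type FailureKind = "retryable" | "fatal";

// Only infrastructure failures are worth retrying.
function classify(err: { status?: number; code?: string }): FailureKind {
  if (err.code === "TIMEOUT") return "retryable";
  if (err.status === 429) return "retryable";                          // rate limit
  if (err.status !== undefined && err.status >= 500) return "retryable"; // 5xx
  return "fatal"; // e.g. execution reverted, invalid params
}

async function callWithRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      if (attempt >= maxAttempts || classify(err) === "fatal") throw err;
      // Exponential backoff with jitter, so retries don't synchronize.
      const delay = baseDelayMs * 2 ** (attempt - 1) * (0.5 + Math.random());
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```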


Observability (the most often missing piece)

Most RPC wrappers are blind.

Production systems cannot be blind.

Every request emits:

  • request event
  • response event
  • error event
  • latency

This enables:

  • metrics
  • monitoring
  • provider comparison
  • anomaly detection

Observability is what turns code into infrastructure.
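Node's built-in `EventEmitter` is enough to sketch the idea: wrap every call once, and metrics, monitoring and anomaly detection all become subscribers. Illustrative TypeScript; these are not the library's actual event names:

```typescript
import { EventEmitter } from "node:events";

const rpcEvents = new EventEmitter();

// Instrumented call wrapper: every request emits request/response/error
// events with latency, without touching call sites.
async function instrumented<T>(
  endpoint: string,
  method: string,
  fn: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  rpcEvents.emit("request", { endpoint, method });
  try {
    const result = await fn();
    rpcEvents.emit("response", { endpoint, method, latencyMs: Date.now() - start });
    return result;
  } catch (err) {
    rpcEvents.emit("error", { endpoint, method, latencyMs: Date.now() - start, err });
    throw err;
  }
}
```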


Engineering mistakes I made

Looking back, several mistakes were obvious.

Common ones.


Mistake 1 — assuming providers fail cleanly

I assumed providers reject overload.

Reality:

Some delay instead.

Much worse.

Lesson:

External systems rarely fail nicely.


Mistake 2 — believing retries increase reliability

Retries without shaping increase pressure.

Example:

failure → retry → more load → more failure → more retry

Feedback loop.

Lesson:

Retries must be controlled.


Mistake 3 — optimizing totals instead of distribution

I optimized:

Total requests.

Real problem:

Requests per second.

Lesson:

Time distribution matters more than totals.


Mistake 4 — treating RPC as just an API

RPC is not just an API.

It is infrastructure.

Which means it needs:

  • routing
  • shaping
  • monitoring
  • failure strategy

Lesson:

If it can break production — it's infrastructure.


What can still go wrong

No solution is perfect.

Remaining risks:


Block consistency

Different providers may lag blocks.

Possible solutions:

  • sticky routing
  • blockTag consistency
  • session affinity

Public RPC instability

Public endpoints can:

  • disappear
  • throttle
  • change limits

Mitigation:

  • large provider pool
  • health scoring
  • adaptive routing

Retry pressure under mass failure

If many providers fail:

Retries can still create pressure.

Future improvements:

  • circuit breakers
  • global backoff
  • adaptive retry limits

Latency variance

Providers have different speeds.

Future improvement:

Latency weighted routing.


What I would build differently today

With hindsight:

Several improvements are obvious.


Health scoring instead of cooldown

Instead of simple cooldown:

Use dynamic scoring based on:

  • success rate
  • latency
  • timeout ratio
  • rate limits

Route based on health.
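One way to fold those signals into a single number (TypeScript; the weights here are made up for illustration):

```typescript
interface EndpointStats {
  successRate: number;    // 0..1
  p95LatencyMs: number;
  timeoutRatio: number;   // 0..1
  rateLimitRatio: number; // 0..1, share of 429 responses
}

// Combine success rate, latency, timeouts and rate limits into a
// score in [0, 1]; route more traffic to higher-scoring endpoints.
function healthScore(s: EndpointStats, latencyBudgetMs = 2000): number {
  const latencyScore = Math.max(0, 1 - s.p95LatencyMs / latencyBudgetMs);
  const score =
    0.5 * s.successRate +
    0.2 * latencyScore +
    0.2 * (1 - s.timeoutRatio) +
    0.1 * (1 - s.rateLimitRatio);
  return Math.min(1, Math.max(0, score));
}
```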


Latency aware routing

Fast providers should get more traffic.

Weighted routing improves stability.


Adaptive RPS

Instead of static limits:

Adaptive limits based on:

  • failures
  • latency growth
  • retry-after headers

Infrastructure should adapt automatically.
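The classic shape for this is AIMD, the same idea TCP congestion control uses: grow the limit slowly while requests succeed, cut it sharply on a 429. A hypothetical sketch in TypeScript:

```typescript
// Adaptive RPS limit: additive increase, multiplicative decrease.
// A Retry-After header, when the provider sends one, could override
// this directly.
class AdaptiveRps {
  constructor(
    private limit: number,
    private readonly min = 1,
    private readonly max = 100,
  ) {}

  get current(): number { return this.limit; }

  onSuccess(): void {
    this.limit = Math.min(this.max, this.limit + 0.1); // additive increase
  }

  onRateLimited(): void {
    this.limit = Math.max(this.min, this.limit / 2);   // multiplicative decrease
  }
}
```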


Request deduplication

Multiple identical calls could merge.

(singleflight pattern)

Reduces pressure dramatically.

Especially for:

eth_call
eth_blockNumber
eth_gasPrice

Sticky routing

Some workloads benefit from provider affinity.

Example:

Vault calculations.

Session stickiness could help.


Production outcome

Today we run:

~10 RPC endpoints per network.

Mix of:

  • free plans
  • public RPC
  • basic tiers

No premium plans required.

Backend became:

Stable.

Predictable.

Boring.

Which is exactly what infrastructure should be.


Why I open sourced it

After running this in production I realized:

Many teams probably hit this exact stage:

Single RPC → instability → bigger plan.

Few build traffic control.

So I extracted the provider into:

ethers-rpc-pool

A drop-in JsonRpcProvider replacement adding:

  • load balancing
  • RPS limiting
  • concurrency control
  • retry strategy
  • instrumentation

The goal wasn't to build a library.

The goal was to make RPC predictable.


What makes this different from typical RPC wrappers

Typical wrappers add:

Load balancing.

This adds:

Traffic engineering.

Which includes:

  • request shaping
  • failure classification
  • provider cooldown
  • retry discipline
  • concurrency isolation
  • observability

Less helper.

More infrastructure layer.


The real lesson

Most scaling problems are not scale problems.

They are:

Control problems.

Not:

How many requests you send.

But:

How you shape them.


If you hit RPC limits

Before buying bigger plans ask:

Do I need more capacity?

Or better engineering?

Often the cheaper solution is also the more senior one.


Final thought

Good engineers solve problems.

Senior engineers prevent them.

Staff engineers reshape systems so problems cannot easily appear.

This started as:

RPC failures.

It ended as:

Traffic engineering.

And that's usually how infrastructure stories go.

Project

I extracted this solution into a reusable provider:

ethers-rpc-pool

GitHub:
https://github.com/ahiipsa/ethers-rpc-pool

npm:
https://www.npmjs.com/package/ethers-rpc-pool
