A production engineering story about rate limits, retries, failure behavior, and building RPC traffic control
At some point our backend started failing.
Not completely.
Not catastrophically.
Just small strange things:
- a cron job running longer than usual
- random RPC failures
- occasional timeouts
- rare stuck executions
Nothing dramatic.
But enough to feel dangerous.
If you've worked with distributed systems, you know this pattern.
Systems rarely explode.
They slowly become unreliable first.
And this story is about how a simple wallet balance collector turned into an RPC infrastructure problem… and why the real solution wasn't buying a bigger plan.
The original problem was simple
We needed to collect wallet balances to display user positions.
Nothing unusual.
Architecture was basic:
cron (every 15 min)
↓
fetch balances from blockchain
↓
store results
Each execution made around 30–50 RPC calls, mostly:
eth_call
Stack:
- NestJS backend
- ethers.js
- standard RPC providers
Everything worked perfectly.
Until growth started.
When things started breaking
As the number of wallets increased, we started seeing:
- random RPC failures
- unstable execution time
- occasional timeouts
- jobs sometimes hanging
At first nothing clearly indicated the cause.
Then logs revealed:
HTTP 429
Rate limits.
Expected.
But then something more interesting appeared.
The RPC behavior nobody warns you about
I expected providers to reject requests when overloaded.
Instead some providers:
- accepted connections
- delayed responses
- eventually timed out
Which is much worse.
Fast failure is manageable.
Slow failure kills systems.
Because backend workers stay blocked.
And the default ethers timeout is:
5 minutes
Which means under pressure:
Requests didn't fail.
They accumulated.
This is exactly how cascading failures begin.
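A cheap defense is a hard deadline at the call site, so a slow call fails fast instead of holding a worker for minutes. A minimal sketch (the helper name and the timeout budget are illustrative, not from our codebase):

```typescript
// Wrap any promise with a hard deadline so a slow RPC call
// fails fast instead of blocking a worker for minutes.
// The timer is cleared on settle to avoid stray rejections.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`RPC call timed out after ${ms}ms`)),
      ms,
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```

With ethers, the same idea applies: give each call a budget of a few seconds rather than inheriting the 5-minute default.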
First fix: add more RPC providers
The obvious fix:
Add redundancy.
We added:
- Infura free plan
- QuickNode free plan
- public RPC endpoints
And used:
ethers.FallbackProvider
Logic looked clean:
Provider 1 → fail
Provider 2 → success
Problem solved?
Not really.
Why FallbackProvider doesn't solve infrastructure problems
Important detail:
FallbackProvider does not penalize failing providers.
If provider 1 constantly returns 429:
Every request becomes:
try provider 1
fail
try provider 2
success
Which creates hidden costs:
- guaranteed failed request
- extra latency
- wasted credits
- retry overhead
Under load this becomes dangerous.
Because retries increase pressure.
This isn't failover.
This is failure amplification.
The realization moment
At some point it became obvious:
This wasn't an RPC provider problem.
This wasn't a pricing problem.
This was a:
Traffic control problem.
We needed:
- RPS control
- concurrency control
- routing
- provider health awareness
- failure isolation
In other words:
We didn't need better providers.
We needed better infrastructure.
Optimizations I tried before building infrastructure
Before building infrastructure, you should try optimization.
I did.
Multicall
Reduced number of requests.
Helped efficiency.
Didn't solve RPS spikes.
ethers v6 batching
Same result.
Better efficiency.
Same burst pressure.
Bigger plans
Plans that satisfied requirements:
About $250/month
Technically valid.
Architecturally lazy.
Because we didn't lack compute.
We lacked control.
The real bottleneck
The problem wasn't total requests.
It was request distribution.
Example:
10 requests in 1 second → fail
10 requests across 5 seconds → OK
Same work.
Different shape.
Infrastructure cares about shape.
Not totals.
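The same point in code. A hypothetical helper that computes when each request should fire for a target RPS: same total work, different shape.

```typescript
// Given a number of requests and a target requests-per-second,
// return the millisecond offset at which each request should fire.
// Spreading 10 requests at 2 RPS yields offsets 0, 500, 1000, ...
function scheduleOffsets(count: number, rps: number): number[] {
  const gapMs = 1000 / rps;
  return Array.from({ length: count }, (_, i) => Math.round(i * gapMs));
}
```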
Architecture evolution
The architecture evolved naturally.
Reality forced it.
Phase 1 — naive design
Backend
│
Single RPC
Problems:
- single point of failure
- hard RPS ceiling
- timeout risk
Worked until growth.
Phase 2 — redundancy design
Backend
│
FallbackProvider
│
RPC1 RPC2 RPC3
Improved availability.
Still unstable.
Problems remained:
- no RPS control
- retry overhead
- no provider isolation
- unpredictable latency
We improved redundancy.
Not resilience.
Phase 3 — infrastructure design
Final architecture:
Backend
│
RPCPoolProvider
│
Traffic control layer
│
RPC1 RPC2 RPC3 RPC4 RPC5 RPC6 RPC7 RPC8 RPC9 RPC10
Key difference:
We stopped reacting to failures.
We started preventing them.
This is the difference between redundancy and infrastructure.
Building RPC traffic control
Requirements became clear:
The provider must:
- never exceed RPS
- control concurrency
- distribute requests
- isolate bad providers
- retry intelligently
So I built a provider that does exactly that.
Instead of:
App → RPC
We now have:
App → RPC Pool → RPC endpoints
RPC became:
A managed resource.
Not a dependency.
Production design decisions
These decisions made the system stable.
Rate limiting per endpoint
Each RPC gets independent RPS control.
Token bucket model ensures smooth traffic shaping.
Instead of burst → fail:
We get:
Flow → stability.
This alone removed most rate limits.
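A minimal token bucket sketch of that idea (class and parameter names are illustrative, and the clock is injected for testability; a real implementation would also queue callers instead of just refusing):

```typescript
// Token bucket: `rps` tokens refill per second, up to `capacity`.
// A request proceeds only if a token is available, which turns
// bursts into a smooth flow at the configured rate.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly rps: number,
    private readonly capacity: number = rps,
    private readonly now: () => number = Date.now,
  ) {
    this.tokens = capacity;
    this.lastRefill = this.now();
  }

  tryAcquire(): boolean {
    const t = this.now();
    const elapsedSec = (t - this.lastRefill) / 1000;
    // Refill proportionally to elapsed time, capped at capacity.
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.rps);
    this.lastRefill = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```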
Concurrency isolation
Hidden killer:
Concurrency spikes.
Even if RPS is correct.
Solution:
Semaphore per endpoint.
Prevents:
- provider overload
- retry cascades
- self-inflicted pressure
Smart routing
Naive round robin assumes equal providers.
Reality:
Providers degrade.
Providers throttle.
Providers fail.
Router avoids providers in cooldown.
Bad nodes temporarily removed.
This prevents failure amplification.
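A sketch of cooldown-aware routing, assuming a simple round-robin cursor and a per-endpoint ban timestamp (names and the injected clock are illustrative):

```typescript
// Router with cooldown: an endpoint that just failed is skipped
// for `cooldownMs`, so traffic flows to healthy endpoints instead
// of repeatedly hitting a throttled one.
class CooldownRouter {
  private readonly bannedUntil = new Map<string, number>();
  private cursor = 0;

  constructor(
    private readonly endpoints: string[],
    private readonly cooldownMs: number,
    private readonly now: () => number = Date.now,
  ) {}

  reportFailure(endpoint: string): void {
    this.bannedUntil.set(endpoint, this.now() + this.cooldownMs);
  }

  pick(): string | undefined {
    const t = this.now();
    for (let i = 0; i < this.endpoints.length; i++) {
      const candidate = this.endpoints[(this.cursor + i) % this.endpoints.length];
      const banned = this.bannedUntil.get(candidate) ?? 0;
      if (banned <= t) {
        this.cursor = (this.cursor + i + 1) % this.endpoints.length;
        return candidate;
      }
    }
    return undefined; // every endpoint is cooling down
  }
}
```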
Failure-aware retry strategy
Retries only happen for infrastructure failures:
- timeouts
- rate limits
- 5xx errors
Never for logical errors.
Prevents retry storms.
One of the most dangerous distributed system bugs.
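The classification step can be as simple as this sketch (the error shape here is simplified; real ethers errors carry richer codes):

```typescript
type ErrorKind = "retryable" | "fatal";

// Only infrastructure failures (timeouts, 429s, 5xx responses)
// are worth retrying on another endpoint. Logical errors (a
// revert, a bad parameter) will fail everywhere, and retrying
// them just multiplies load.
function classifyError(err: { status?: number; code?: string }): ErrorKind {
  if (err.code === "TIMEOUT") return "retryable";
  if (err.status === 429) return "retryable";
  if (err.status !== undefined && err.status >= 500) return "retryable";
  return "fatal";
}
```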
Observability (the most often missing piece)
Most RPC wrappers are blind.
Production systems cannot be blind.
Every request emits:
- request event
- response event
- error event
- latency
This enables:
- metrics
- monitoring
- provider comparison
- anomaly detection
Observability is what turns code into infrastructure.
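One way to wire this in is a hook interface around every call (the interface and names are illustrative):

```typescript
// Instrumentation hooks: every request reports its start, its
// outcome, and its latency, so per-endpoint metrics and anomaly
// detection become possible.
interface RpcEvents {
  onRequest(endpoint: string, method: string): void;
  onResponse(endpoint: string, method: string, latencyMs: number): void;
  onError(endpoint: string, method: string, err: unknown): void;
}

async function instrumentedCall<T>(
  endpoint: string,
  method: string,
  call: () => Promise<T>,
  events: RpcEvents,
): Promise<T> {
  events.onRequest(endpoint, method);
  const started = Date.now();
  try {
    const result = await call();
    events.onResponse(endpoint, method, Date.now() - started);
    return result;
  } catch (err) {
    events.onError(endpoint, method, err);
    throw err;
  }
}
```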
Engineering mistakes I made
Looking back, several mistakes were obvious.
Common ones.
Mistake 1 — assuming providers fail cleanly
I assumed overloaded providers would reject requests.
Reality:
Some delay instead.
Much worse.
Lesson:
External systems rarely fail nicely.
Mistake 2 — believing retries increase reliability
Retries without shaping increase pressure.
Example:
failure → retry → more load → more failure → more retry
Feedback loop.
Lesson:
Retries must be controlled.
Mistake 3 — optimizing totals instead of distribution
I optimized:
Total requests.
Real problem:
Requests per second.
Lesson:
Time distribution matters more than totals.
Mistake 4 — treating RPC as just an API
RPC is not just an API.
It is infrastructure.
Which means it needs:
- routing
- shaping
- monitoring
- failure strategy
Lesson:
If it can break production — it's infrastructure.
What can still go wrong
No solution is perfect.
Remaining risks:
Block consistency
Different providers may lag blocks.
Possible solutions:
- sticky routing
- blockTag consistency
- session affinity
Public RPC instability
Public endpoints can:
- disappear
- throttle
- change limits
Mitigation:
- large provider pool
- health scoring
- adaptive routing
Retry pressure under mass failure
If many providers fail:
Retries can still create pressure.
Future improvements:
- circuit breakers
- global backoff
- adaptive retry limits
Latency variance
Providers have different speeds.
Future improvement:
Latency weighted routing.
What I would build differently today
With hindsight:
Several improvements are obvious.
Health scoring instead of cooldown
Instead of simple cooldown:
Use dynamic scoring based on:
- success rate
- latency
- timeout ratio
- rate limits
Route based on health.
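A hypothetical scoring function, with weights that would need tuning against real traffic:

```typescript
interface EndpointStats {
  successRate: number;    // 0..1
  p95LatencyMs: number;   // observed 95th-percentile latency
  rateLimitRatio: number; // fraction of responses that were 429s, 0..1
}

// Health score in [0, 1] combining success rate, latency, and
// rate-limit pressure. The weights and the 5s "worst latency"
// anchor are illustrative.
function healthScore(s: EndpointStats): number {
  const latencyPenalty = Math.min(1, s.p95LatencyMs / 5000);
  const score =
    0.6 * s.successRate +
    0.25 * (1 - latencyPenalty) +
    0.15 * (1 - s.rateLimitRatio);
  return Math.max(0, Math.min(1, score));
}
```

The router would then weight traffic by score instead of flipping endpoints between "banned" and "allowed".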
Latency aware routing
Fast providers should get more traffic.
Weighted routing improves stability.
Adaptive RPS
Instead of static limits:
Adaptive limits based on:
- failures
- latency growth
- retry-after headers
Infrastructure should adapt automatically.
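One well-known shape for this is AIMD, the additive-increase/multiplicative-decrease rule from TCP congestion control. A sketch with illustrative constants:

```typescript
// AIMD: grow the RPS limit additively while requests succeed,
// cut it multiplicatively when a rate-limit signal appears.
// The increase step, backoff factor, and bounds are illustrative.
function nextRpsLimit(
  current: number,
  sawRateLimit: boolean,
  opts = { increase: 1, backoffFactor: 0.5, min: 1, max: 100 },
): number {
  const next = sawRateLimit
    ? current * opts.backoffFactor
    : current + opts.increase;
  return Math.max(opts.min, Math.min(opts.max, next));
}
```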
Request deduplication
Multiple identical calls could merge.
(singleflight pattern)
Reduces pressure dramatically.
Especially for:
eth_call
eth_blockNumber
eth_gasPrice
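A minimal singleflight sketch: callers asking for the same key while a request is in flight share one promise (the class name and keying scheme are illustrative):

```typescript
// Singleflight: identical in-flight calls share one underlying
// request. The entry is removed once the request settles, so
// later calls trigger a fresh fetch.
class Singleflight {
  private readonly inFlight = new Map<string, Promise<unknown>>();

  run<T>(key: string, fn: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    if (existing) return existing as Promise<T>;
    const p = fn().finally(() => this.inFlight.delete(key));
    this.inFlight.set(key, p);
    return p;
  }
}
```

In practice the key would encode method plus parameters, e.g. `eth_call:<to>:<data>:<blockTag>`.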
Sticky routing
Some workloads benefit from provider affinity.
Example:
Vault calculations.
Session stickiness could help.
Production outcome
Today we run:
~10 RPC endpoints per network.
Mix of:
- free plans
- public RPC
- basic tiers
No premium plans required.
Backend became:
Stable.
Predictable.
Boring.
Which is exactly what infrastructure should be.
Why I open sourced it
After running this in production I realized:
Many teams probably hit this exact stage:
Single RPC → instability → bigger plan.
Few build traffic control.
So I extracted the provider into:
ethers-rpc-pool
A drop-in JsonRpcProvider replacement adding:
- load balancing
- RPS limiting
- concurrency control
- retry strategy
- instrumentation
The goal wasn't to build a library.
The goal was to make RPC predictable.
What makes this different from typical RPC wrappers
Typical wrappers add:
Load balancing.
This adds:
Traffic engineering.
Which includes:
- request shaping
- failure classification
- provider cooldown
- retry discipline
- concurrency isolation
- observability
Less helper.
More infrastructure layer.
The real lesson
Most scaling problems are not scale problems.
They are:
Control problems.
Not:
How many requests you send.
But:
How you shape them.
If you hit RPC limits
Before buying bigger plans ask:
Do I need more capacity?
Or better engineering?
Often the cheaper solution is also the more senior one.
Final thought
Good engineers solve problems.
Senior engineers prevent them.
Staff engineers reshape systems so problems cannot easily appear.
This started as:
RPC failures.
It ended as:
Traffic engineering.
And that's usually how infrastructure stories go.
Project
I extracted this solution into a reusable provider:
ethers-rpc-pool