A production engineering story about rate limits, retries, failure behavior, and building RPC traffic control
At some point our backend started failing.
Not completely.
Not catastrophically.
Just small strange things:
- a cron job running longer than usual
- random RPC failures
- occasional timeouts
- rare stuck executions
Nothing dramatic.
But enough to feel dangerous.
If you've worked with distributed systems, you know this pattern.
Systems rarely explode.
They slowly become unreliable first.
And this story is about how a simple wallet balance collector turned into an RPC infrastructure problem… and why the real solution wasn't buying a bigger plan.
The original problem was simple
We needed to collect wallet balances to display user positions.
Nothing unusual.
Architecture was basic:
cron (every 15 min)
↓
fetch balances from blockchain
↓
store results
Each execution made around 30–50 RPC calls, mostly:
eth_call
Stack:
- NestJS backend
- ethers.js
- standard RPC providers
Everything worked perfectly.
Until growth started.
When things started breaking
As the number of wallets increased, we started seeing:
- random RPC failures
- unstable execution time
- occasional timeouts
- jobs sometimes hanging
At first nothing clearly indicated the cause.
Then logs revealed:
HTTP 429
Rate limits.
Expected.
But then something more interesting appeared.
The RPC behavior nobody warns you about
I expected providers to reject requests when overloaded.
Instead some providers:
- accepted connections
- delayed responses
- eventually timed out
Which is much worse.
Fast failure is manageable.
Slow failure kills systems.
Because backend workers stay blocked.
And the default ethers timeout is:
5 minutes
Which means under pressure:
Requests didn't fail.
They accumulated.
This is exactly how cascading failures begin.
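A cheap defense is a hard deadline at the call site, so a slow call fails fast instead of holding a worker for minutes. A minimal sketch (the helper name and the timeout budget are illustrative, not from our codebase):

```typescript
// Wrap any promise with a hard deadline so a slow RPC call
// fails fast instead of blocking a worker for minutes.
// The timer is cleared on settle to avoid stray rejections.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`RPC call timed out after ${ms}ms`)),
      ms,
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```

With ethers, the same idea applies: give each call a budget of a few seconds rather than inheriting the 5-minute default.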
First fix: add more RPC providers
The obvious fix:
Add redundancy.
We added:
- Infura free plan
- QuickNode free plan
- public RPC endpoints
And used:
ethers.FallbackProvider
Logic looked clean:
Provider 1 → fail
Provider 2 → success
Problem solved?
Not really.
Why FallbackProvider doesn't solve infrastructure problems
Important detail:
FallbackProvider does not penalize failing providers.
If provider 1 constantly returns 429:
Every request becomes:
try provider 1
fail
try provider 2
success
Which creates hidden costs:
- guaranteed failed request
- extra latency
- wasted credits
- retry overhead
Under load this becomes dangerous.
Because retries increase pressure.
This isn't failover.
This is failure amplification.
The realization moment
At some point it became obvious:
This wasn't an RPC provider problem.
This wasn't a pricing problem.
This was a:
Traffic control problem.
We needed:
- RPS control
- concurrency control
- routing
- provider health awareness
- failure isolation
In other words:
We didn't need better providers.
We needed better infrastructure.
Optimizations I tried before building infrastructure
Before building infrastructure, you should try optimization.
I did.
Multicall
Reduced number of requests.
Helped efficiency.
Didn't solve RPS spikes.
ethers v6 batching
Same result.
Better efficiency.
Same burst pressure.
Bigger plans
Plans that satisfied requirements:
About $250/month
Technically valid.
Architecturally lazy.
Because we didn't lack compute.
We lacked control.
The real bottleneck
The problem wasn't total requests.
It was request distribution.
Example:
10 requests in 1 second → fail
10 requests across 5 seconds → OK
Same work.
Different shape.
Infrastructure cares about shape.
Not totals.
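The same point in code. A hypothetical helper that computes when each request should fire for a target RPS: same total work, different shape.

```typescript
// Given a number of requests and a target requests-per-second,
// return the millisecond offset at which each request should fire.
// Spreading 10 requests at 2 RPS yields offsets 0, 500, 1000, ...
function scheduleOffsets(count: number, rps: number): number[] {
  const gapMs = 1000 / rps;
  return Array.from({ length: count }, (_, i) => Math.round(i * gapMs));
}
```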
Architecture evolution
The architecture evolved naturally.
Reality forced it.
Phase 1 — naive design
Backend
│
Single RPC
Problems:
- single point of failure
- hard RPS ceiling
- timeout risk
Worked until growth.
Phase 2 — redundancy design
Backend
│
FallbackProvider
│
RPC1 RPC2 RPC3
Improved availability.
Still unstable.
Problems remained:
- no RPS control
- retry overhead
- no provider isolation
- unpredictable latency
We improved redundancy.
Not resilience.
Phase 3 — infrastructure design
Final architecture:
Backend
│
RPCPoolProvider
│
Traffic control layer
│
RPC1 RPC2 RPC3 RPC4 RPC5 RPC6 RPC7 RPC8 RPC9 RPC10
Key difference:
We stopped reacting to failures.
We started preventing them.
This is the difference between redundancy and infrastructure.
Building RPC traffic control
Requirements became clear:
The provider must:
- never exceed RPS
- control concurrency
- distribute requests
- isolate bad providers
- retry intelligently
So I built a provider that does exactly that.
Instead of:
App → RPC
We now have:
App → RPC Pool → RPC endpoints
RPC became:
A managed resource.
Not a dependency.
Production design decisions
These decisions made the system stable.
Rate limiting per endpoint
Each RPC gets independent RPS control.
Token bucket model ensures smooth traffic shaping.
Instead of burst → fail:
We get:
Flow → stability.
This alone removed most rate limits.
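A minimal token bucket sketch of that idea (class and parameter names are illustrative, and the clock is injected for testability; a real implementation would also queue callers instead of just refusing):

```typescript
// Token bucket: `rps` tokens refill per second, up to `capacity`.
// A request proceeds only if a token is available, which turns
// bursts into a smooth flow at the configured rate.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly rps: number,
    private readonly capacity: number = rps,
    private readonly now: () => number = Date.now,
  ) {
    this.tokens = capacity;
    this.lastRefill = this.now();
  }

  tryAcquire(): boolean {
    const t = this.now();
    const elapsedSec = (t - this.lastRefill) / 1000;
    // Refill proportionally to elapsed time, capped at capacity.
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.rps);
    this.lastRefill = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```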
Concurrency isolation
Hidden killer:
Concurrency spikes.
Even if RPS is correct.
Solution:
Semaphore per endpoint.
Prevents:
- provider overload
- retry cascades
- self-inflicted pressure
Smart routing
Naive round robin assumes equal providers.
Reality:
Providers degrade.
Providers throttle.
Providers fail.
Router avoids providers in cooldown.
Bad nodes temporarily removed.
This prevents failure amplification.
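A sketch of cooldown-aware routing, assuming a simple round-robin cursor and a per-endpoint ban timestamp (names and the injected clock are illustrative):

```typescript
// Router with cooldown: an endpoint that just failed is skipped
// for `cooldownMs`, so traffic flows to healthy endpoints instead
// of repeatedly hitting a throttled one.
class CooldownRouter {
  private readonly bannedUntil = new Map<string, number>();
  private cursor = 0;

  constructor(
    private readonly endpoints: string[],
    private readonly cooldownMs: number,
    private readonly now: () => number = Date.now,
  ) {}

  reportFailure(endpoint: string): void {
    this.bannedUntil.set(endpoint, this.now() + this.cooldownMs);
  }

  pick(): string | undefined {
    const t = this.now();
    for (let i = 0; i < this.endpoints.length; i++) {
      const candidate = this.endpoints[(this.cursor + i) % this.endpoints.length];
      const banned = this.bannedUntil.get(candidate) ?? 0;
      if (banned <= t) {
        this.cursor = (this.cursor + i + 1) % this.endpoints.length;
        return candidate;
      }
    }
    return undefined; // every endpoint is cooling down
  }
}
```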
Failure-aware retry strategy
Retries only happen for infrastructure failures:
- timeouts
- rate limits
- 5xx errors
Never for logical errors.
Prevents retry storms.
One of the most dangerous distributed system bugs.
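The classification step can be as simple as this sketch (the error shape here is simplified; real ethers errors carry richer codes):

```typescript
type ErrorKind = "retryable" | "fatal";

// Only infrastructure failures (timeouts, 429s, 5xx responses)
// are worth retrying on another endpoint. Logical errors (a
// revert, a bad parameter) will fail everywhere, and retrying
// them just multiplies load.
function classifyError(err: { status?: number; code?: string }): ErrorKind {
  if (err.code === "TIMEOUT") return "retryable";
  if (err.status === 429) return "retryable";
  if (err.status !== undefined && err.status >= 500) return "retryable";
  return "fatal";
}
```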
Observability (the most often missing piece)
Most RPC wrappers are blind.
Production systems cannot be blind.
Every request emits:
- request event
- response event
- error event
- latency
This enables:
- metrics
- monitoring
- provider comparison
- anomaly detection
Observability is what turns code into infrastructure.
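One way to wire this in is a hook interface around every call (the interface and names are illustrative):

```typescript
// Instrumentation hooks: every request reports its start, its
// outcome, and its latency, so per-endpoint metrics and anomaly
// detection become possible.
interface RpcEvents {
  onRequest(endpoint: string, method: string): void;
  onResponse(endpoint: string, method: string, latencyMs: number): void;
  onError(endpoint: string, method: string, err: unknown): void;
}

async function instrumentedCall<T>(
  endpoint: string,
  method: string,
  call: () => Promise<T>,
  events: RpcEvents,
): Promise<T> {
  events.onRequest(endpoint, method);
  const started = Date.now();
  try {
    const result = await call();
    events.onResponse(endpoint, method, Date.now() - started);
    return result;
  } catch (err) {
    events.onError(endpoint, method, err);
    throw err;
  }
}
```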
Engineering mistakes I made
Looking back, several mistakes were obvious.
Common ones.
Mistake 1 — assuming providers fail cleanly
I assumed overloaded providers would reject requests.
Reality:
Some delay instead.
Much worse.
Lesson:
External systems rarely fail nicely.
Mistake 2 — believing retries increase reliability
Retries without shaping increase pressure.
Example:
failure → retry → more load → more failure → more retry
Feedback loop.
Lesson:
Retries must be controlled.
Mistake 3 — optimizing totals instead of distribution
I optimized:
Total requests.
Real problem:
Requests per second.
Lesson:
Time distribution matters more than totals.
Mistake 4 — treating RPC as just an API
RPC is not just an API.
It is infrastructure.
Which means it needs:
- routing
- shaping
- monitoring
- failure strategy
Lesson:
If it can break production — it's infrastructure.
What can still go wrong
No solution is perfect.
Remaining risks:
Block consistency
Different providers may lag blocks.
Possible solutions:
- sticky routing
- blockTag consistency
- session affinity
Public RPC instability
Public endpoints can:
- disappear
- throttle
- change limits
Mitigation:
- large provider pool
- health scoring
- adaptive routing
Retry pressure under mass failure
If many providers fail:
Retries can still create pressure.
Future improvements:
- circuit breakers
- global backoff
- adaptive retry limits
Latency variance
Providers have different speeds.
Future improvement:
Latency weighted routing.
What I would build differently today
With hindsight:
Several improvements are obvious.
Health scoring instead of cooldown
Instead of simple cooldown:
Use dynamic scoring based on:
- success rate
- latency
- timeout ratio
- rate limits
Route based on health.
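A hypothetical scoring function, with weights that would need tuning against real traffic:

```typescript
interface EndpointStats {
  successRate: number;    // 0..1
  p95LatencyMs: number;   // observed 95th-percentile latency
  rateLimitRatio: number; // fraction of responses that were 429s, 0..1
}

// Health score in [0, 1] combining success rate, latency, and
// rate-limit pressure. The weights and the 5s "worst latency"
// anchor are illustrative.
function healthScore(s: EndpointStats): number {
  const latencyPenalty = Math.min(1, s.p95LatencyMs / 5000);
  const score =
    0.6 * s.successRate +
    0.25 * (1 - latencyPenalty) +
    0.15 * (1 - s.rateLimitRatio);
  return Math.max(0, Math.min(1, score));
}
```

The router would then weight traffic by score instead of flipping endpoints between "banned" and "allowed".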
Latency aware routing
Fast providers should get more traffic.
Weighted routing improves stability.
Adaptive RPS
Instead of static limits:
Adaptive limits based on:
- failures
- latency growth
- retry-after headers
Infrastructure should adapt automatically.
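One well-known shape for this is AIMD, the additive-increase/multiplicative-decrease rule from TCP congestion control. A sketch with illustrative constants:

```typescript
// AIMD: grow the RPS limit additively while requests succeed,
// cut it multiplicatively when a rate-limit signal appears.
// The increase step, backoff factor, and bounds are illustrative.
function nextRpsLimit(
  current: number,
  sawRateLimit: boolean,
  opts = { increase: 1, backoffFactor: 0.5, min: 1, max: 100 },
): number {
  const next = sawRateLimit
    ? current * opts.backoffFactor
    : current + opts.increase;
  return Math.max(opts.min, Math.min(opts.max, next));
}
```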
Request deduplication
Multiple identical calls could merge.
(singleflight pattern)
Reduces pressure dramatically.
Especially for:
eth_call
eth_blockNumber
eth_gasPrice
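A minimal singleflight sketch: callers asking for the same key while a request is in flight share one promise (the class name and keying scheme are illustrative):

```typescript
// Singleflight: identical in-flight calls share one underlying
// request. The entry is removed once the request settles, so
// later calls trigger a fresh fetch.
class Singleflight {
  private readonly inFlight = new Map<string, Promise<unknown>>();

  run<T>(key: string, fn: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    if (existing) return existing as Promise<T>;
    const p = fn().finally(() => this.inFlight.delete(key));
    this.inFlight.set(key, p);
    return p;
  }
}
```

In practice the key would encode method plus parameters, e.g. `eth_call:<to>:<data>:<blockTag>`.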
Sticky routing
Some workloads benefit from provider affinity.
Example:
Vault calculations.
Session stickiness could help.
Production outcome
Today we run:
~10 RPC endpoints per network.
Mix of:
- free plans
- public RPC
- basic tiers
No premium plans required.
Backend became:
Stable.
Predictable.
Boring.
Which is exactly what infrastructure should be.
Why I open sourced it
After running this in production I realized:
Many teams probably hit this exact stage:
Single RPC → instability → bigger plan.
Few build traffic control.
So I extracted the provider into:
ethers-rpc-pool
A drop-in JsonRpcProvider replacement adding:
- load balancing
- RPS limiting
- concurrency control
- retry strategy
- instrumentation
The goal wasn't to build a library.
The goal was to make RPC predictable.
What makes this different from typical RPC wrappers
Typical wrappers add:
Load balancing.
This adds:
Traffic engineering.
Which includes:
- request shaping
- failure classification
- provider cooldown
- retry discipline
- concurrency isolation
- observability
Less helper.
More infrastructure layer.
The real lesson
Most scaling problems are not scale problems.
They are:
Control problems.
Not:
How many requests you send.
But:
How you shape them.
If you hit RPC limits
Before buying bigger plans ask:
Do I need more capacity?
Or better engineering?
Often the cheaper solution is also the more senior one.
Final thought
Good engineers solve problems.
Senior engineers prevent them.
Staff engineers reshape systems so problems cannot easily appear.
This started as:
RPC failures.
It ended as:
Traffic engineering.
And that's usually how infrastructure stories go.
Project
I extracted this solution into a reusable provider:
ethers-rpc-pool