TL;DR
We finished migrating from in-house RPC to eRPC. Our homegrown stack handled scale, but it was noisy and hard to tune. eRPC—the open-source EVM RPC proxy—gave us a pragmatic RPC load balancer with JSON-RPC caching, hedging, and clean failure modes. Result: fewer spikes, clearer scaling signals, and latency mostly dominated by providers (not our proxy).
Why touch a “working” system?
Running wallets across many chains means lots of JSON-RPC reads and writes, multiple providers per chain, ERC-4337/bundler methods, and strict uptime requirements. Our custom stack of Go services, caches, and upstream-selection logic worked… until tail-latency spikes and flaky providers made it painful to operate and evolve.
What we wanted in one box:
- Multi-upstream per chain with smart selection
- RPC load balancer behavior: hedging, retries, backoff (a retry sketch follows this list)
- JSON-RPC caching that’s reorg-aware
- Clean observability and failure modes
- Easy k8s deploy
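Of those load-balancer behaviors, retries with backoff are the simplest to picture. Here is a minimal Go sketch of the pattern we mean, with illustrative parameters (attempt count, base delay) rather than anything eRPC-specific:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff runs call up to maxAttempts times, sleeping an
// exponentially growing, jittered delay between failed attempts.
func retryWithBackoff(maxAttempts int, base time.Duration, call func() error) error {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err := call()
		if err == nil {
			return nil
		}
		lastErr = err
		if attempt < maxAttempts-1 {
			// Exponential backoff with jitter: base, 2*base, 4*base, ...
			delay := base<<attempt + time.Duration(rand.Int63n(int64(base)))
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	i := 0
	err := retryWithBackoff(3, 100*time.Millisecond, func() error {
		i++
		if i < 3 {
			return errors.New("upstream timeout") // simulate two flaky responses
		}
		return nil
	})
	fmt.Println("result:", err)
}
```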
Why eRPC?
eRPC is an EVM RPC proxy that fronts your upstream providers and adds:
- Hedged requests to cut tail latency (sketched after this list)
- Permanent/reorg-aware caching and multiplexing to slash duplicate reads
- Failover + circuit breakers so provider blips don’t page you
- Simple config, Kubernetes-friendly
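To make the hedged-request idea concrete, here is a minimal Go sketch of the technique: send the call to the primary upstream, and if it hasn't answered within a small delay, fire the same call at a backup and take whichever responds first. This illustrates the behavior, not eRPC's internals; the endpoints, delay, and method are placeholders.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// hedgedPost sends the same JSON-RPC body to the primary upstream and, if no
// response arrives within hedgeDelay, also to the backup. First success wins.
func hedgedPost(ctx context.Context, primary, backup string, body []byte, hedgeDelay time.Duration) ([]byte, error) {
	type result struct {
		data []byte
		err  error
	}
	results := make(chan result, 2)

	post := func(url string) {
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
		if err != nil {
			results <- result{nil, err}
			return
		}
		req.Header.Set("Content-Type", "application/json")
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			results <- result{nil, err}
			return
		}
		defer resp.Body.Close()
		data, err := io.ReadAll(resp.Body)
		results <- result{data, err}
	}

	go post(primary)

	timer := time.NewTimer(hedgeDelay)
	defer timer.Stop()

	var firstErr error
	launched := 1
	for received := 0; received < launched; {
		select {
		case <-timer.C:
			// Primary is slow: hedge with the backup upstream.
			launched++
			go post(backup)
		case r := <-results:
			received++
			if r.err == nil {
				return r.data, nil
			}
			firstErr = r.err
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, fmt.Errorf("all upstreams failed: %w", firstErr)
}

func main() {
	body := []byte(`{"jsonrpc":"2.0","id":1,"method":"eth_blockNumber","params":[]}`)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	// Placeholder endpoints; swap in real provider URLs.
	out, err := hedgedPost(ctx, "https://rpc-a.example", "https://rpc-b.example", body, 300*time.Millisecond)
	fmt.Println(string(out), err)
}
```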
Migration story (short & honest)
We rolled it out per-chain:
- Start with one chain and a couple of upstreams
- Verify standard and 4337/bundler calls
- Run in parallel with our old system for a while (see the comparison sketch after this list)
- Tune cache TTLs, hedging, and selection policy
- Promote to more chains
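One way to do the parallel run is a tiny harness that mirrors each request to both proxies and compares responses and latency. A minimal sketch with placeholder URLs; a real comparison should normalize responses (request IDs, field order) before diffing:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"time"
)

// mirror sends the same JSON-RPC body to both proxies and reports whether
// the bodies match and how long each path took.
func mirror(oldURL, newURL string, body []byte) {
	type reply struct {
		data    []byte
		latency time.Duration
		err     error
	}
	call := func(url string) reply {
		start := time.Now()
		resp, err := http.Post(url, "application/json", bytes.NewReader(body))
		if err != nil {
			return reply{err: err, latency: time.Since(start)}
		}
		defer resp.Body.Close()
		data, err := io.ReadAll(resp.Body)
		return reply{data: data, latency: time.Since(start), err: err}
	}

	oldR, newR := call(oldURL), call(newURL)
	match := oldR.err == nil && newR.err == nil && bytes.Equal(oldR.data, newR.data)
	fmt.Printf("match=%v old=%v new=%v oldErr=%v newErr=%v\n",
		match, oldR.latency, newR.latency, oldR.err, newR.err)
}

func main() {
	body := []byte(`{"jsonrpc":"2.0","id":1,"method":"eth_chainId","params":[]}`)
	// Placeholder endpoints for the legacy proxy and the eRPC deployment.
	mirror("http://legacy-proxy.internal:8545", "http://erpc.internal:4000", body)
}
```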
One gotcha: the “failsafe” designation only kicks in when the other upstreams are unhealthy. We wanted “A, then B, then a true last-resort C.” Marking the last one as a normal upstream, just with low priority, did the trick.
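Conceptually, the ordering we wanted is just a priority walk: always pick the healthiest upstream with the best priority, so the last-resort entry is only chosen when everything above it is down. A toy Go sketch of that selection rule (not eRPC's actual scoring; the names and priorities are made up):

```go
package main

import (
	"errors"
	"fmt"
)

// upstream is a provider endpoint with a priority; lower numbers are tried first.
type upstream struct {
	name     string
	priority int
	healthy  bool
}

// pick returns the healthy upstream with the best (lowest) priority.
// The "last resort" entry simply carries the worst priority, so it is only
// chosen when everything above it is unhealthy.
func pick(ups []upstream) (upstream, error) {
	var best *upstream
	for i := range ups {
		u := &ups[i]
		if !u.healthy {
			continue
		}
		if best == nil || u.priority < best.priority {
			best = u
		}
	}
	if best == nil {
		return upstream{}, errors.New("no healthy upstream")
	}
	return *best, nil
}

func main() {
	ups := []upstream{
		{name: "provider-a", priority: 0, healthy: false},
		{name: "provider-b", priority: 1, healthy: true},
		{name: "last-resort-c", priority: 9, healthy: true},
	}
	chosen, _ := pick(ups)
	fmt.Println("routing to:", chosen.name) // provider-b; C is only used if A and B are down
}
```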
Results that mattered
- Stability: CPU/memory stopped yo-yoing. Spikes turned into smooth plateaus.
- Latency: After hedging + cache tuning, p95 mostly tracked provider health, not our proxy.
- Operational clarity: Easier to attribute errors (provider vs proxy), calmer on-call.
- Costs: Caching/multiplexing reduced duplicate reads.
If you run multi-chain infra, try this
- Ship small: 1 chain, a few upstreams, and explicit routing for writes.
- Run both paths for a week; compare logs/metrics before cutting over.
- Be intentional with “last resort” upstreams (don’t rely on magic).
- Tune TTLs per method; eth_getLogs and eth_getBlockByNumber benefit from short, reorg-aware caches near the head (see the sketch after this list).
- Watch tail latency—hedging is your friend, but set limits.
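On the TTL point, the shape to aim for is a per-method policy: long or permanent caching for immutable reads, and short, reorg-aware TTLs for head-sensitive methods like eth_getLogs and eth_getBlockByNumber. A rough Go sketch of such a policy table; the values are illustrative, not our production numbers:

```go
package main

import (
	"fmt"
	"time"
)

// cachePolicy describes how long a cached JSON-RPC response may be reused.
type cachePolicy struct {
	ttl       time.Duration
	finalized bool // if true, only cache results for finalized blocks
}

// ttlByMethod is an illustrative per-method policy: immutable reads cache for
// a long time, head-sensitive reads get short, reorg-aware TTLs.
var ttlByMethod = map[string]cachePolicy{
	"eth_chainId":               {ttl: 24 * time.Hour},
	"eth_getTransactionReceipt": {ttl: time.Hour, finalized: true},
	"eth_getLogs":               {ttl: 5 * time.Second},
	"eth_getBlockByNumber":      {ttl: 2 * time.Second},
	"eth_call":                  {ttl: 0}, // don't cache by default
}

func policyFor(method string) cachePolicy {
	if p, ok := ttlByMethod[method]; ok {
		return p
	}
	return cachePolicy{ttl: 0} // unknown methods are not cached
}

func main() {
	for _, m := range []string{"eth_getLogs", "eth_chainId", "eth_sendRawTransaction"} {
		fmt.Printf("%s -> %+v\n", m, policyFor(m))
	}
}
```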
Openfort overview (how we wire this into wallets):
Docs → Overview
Open to Discussion:
- How would you tune hedging for indexers vs. interactive user flows?
- What’s your policy for eth_getLogs (batch-and-split vs single call)?
- Any clever upstream selection strategies we should test?