TL;DR
We finished migrating from in-house RPC to eRPC. Our homegrown stack handled scale, but it was noisy and hard to tune. eRPC—the open-source EVM RPC proxy—gave us a pragmatic RPC load balancer with JSON-RPC caching, hedging, and clean failure modes. Result: fewer spikes, clearer scaling signals, and latency mostly dominated by providers (not our proxy).
Why touch a “working” system?
Running wallets across many chains means lots of JSON-RPC reads and writes, multiple providers per chain, ERC-4337/bundler methods, and strict uptime requirements. Our custom stack of Go services, caches, and upstream-selection logic worked… until tail-latency spikes and flaky providers made it painful to operate and evolve.
What we wanted in one box:
- Multi-upstream per chain with smart selection
- RPC load balancer behavior: hedging, retries, backoff (a retry sketch follows this list)
- JSON-RPC caching that’s reorg-aware
- Clean observability and failure modes
- Easy k8s deploy
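Of those load-balancer behaviors, retries with backoff are the simplest to picture. Here is a minimal Go sketch of the pattern we mean, with illustrative parameters (attempt count, base delay) rather than anything eRPC-specific:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff runs call up to maxAttempts times, sleeping an
// exponentially growing, jittered delay between failed attempts.
func retryWithBackoff(maxAttempts int, base time.Duration, call func() error) error {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err := call()
		if err == nil {
			return nil
		}
		lastErr = err
		if attempt < maxAttempts-1 {
			// Exponential backoff with jitter: base, 2*base, 4*base, ...
			delay := base<<attempt + time.Duration(rand.Int63n(int64(base)))
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	i := 0
	err := retryWithBackoff(3, 100*time.Millisecond, func() error {
		i++
		if i < 3 {
			return errors.New("upstream timeout") // simulate two flaky responses
		}
		return nil
	})
	fmt.Println("result:", err)
}
```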
Why eRPC?
eRPC is an EVM RPC proxy that fronts your upstream providers and adds:
- Hedged requests to cut tail latency (sketched after this list)
- Permanent/reorg-aware caching and multiplexing to slash duplicate reads
- Failover + circuit breakers so provider blips don’t page you
- Simple config, Kubernetes-friendly
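To make the hedged-request idea concrete, here is a minimal Go sketch of the technique: send the call to the primary upstream, and if it hasn't answered within a small delay, fire the same call at a backup and take whichever responds first. This illustrates the behavior, not eRPC's internals; the endpoints, delay, and method are placeholders.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// hedgedPost sends the same JSON-RPC body to the primary upstream and, if no
// response arrives within hedgeDelay, also to the backup. First success wins.
func hedgedPost(ctx context.Context, primary, backup string, body []byte, hedgeDelay time.Duration) ([]byte, error) {
	type result struct {
		data []byte
		err  error
	}
	results := make(chan result, 2)

	post := func(url string) {
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
		if err != nil {
			results <- result{nil, err}
			return
		}
		req.Header.Set("Content-Type", "application/json")
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			results <- result{nil, err}
			return
		}
		defer resp.Body.Close()
		data, err := io.ReadAll(resp.Body)
		results <- result{data, err}
	}

	go post(primary)

	timer := time.NewTimer(hedgeDelay)
	defer timer.Stop()

	var firstErr error
	launched := 1
	for received := 0; received < launched; {
		select {
		case <-timer.C:
			// Primary is slow: hedge with the backup upstream.
			launched++
			go post(backup)
		case r := <-results:
			received++
			if r.err == nil {
				return r.data, nil
			}
			firstErr = r.err
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, fmt.Errorf("all upstreams failed: %w", firstErr)
}

func main() {
	body := []byte(`{"jsonrpc":"2.0","id":1,"method":"eth_blockNumber","params":[]}`)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	// Placeholder endpoints; swap in real provider URLs.
	out, err := hedgedPost(ctx, "https://rpc-a.example", "https://rpc-b.example", body, 300*time.Millisecond)
	fmt.Println(string(out), err)
}
```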
Migration story (short & honest)
We rolled it out per-chain:
- Start with one chain and a couple of upstreams
- Verify standard and 4337/bundler calls
- Run in parallel with our old system for a while (see the comparison sketch after this list)
- Tune cache TTLs, hedging, and selection policy
- Promote to more chains
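One way to do the parallel run is a tiny harness that mirrors each request to both proxies and compares responses and latency. A minimal sketch with placeholder URLs; a real comparison should normalize responses (request IDs, field order) before diffing:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"time"
)

// mirror sends the same JSON-RPC body to both proxies and reports whether
// the bodies match and how long each path took.
func mirror(oldURL, newURL string, body []byte) {
	type reply struct {
		data    []byte
		latency time.Duration
		err     error
	}
	call := func(url string) reply {
		start := time.Now()
		resp, err := http.Post(url, "application/json", bytes.NewReader(body))
		if err != nil {
			return reply{err: err, latency: time.Since(start)}
		}
		defer resp.Body.Close()
		data, err := io.ReadAll(resp.Body)
		return reply{data: data, latency: time.Since(start), err: err}
	}

	oldR, newR := call(oldURL), call(newURL)
	match := oldR.err == nil && newR.err == nil && bytes.Equal(oldR.data, newR.data)
	fmt.Printf("match=%v old=%v new=%v oldErr=%v newErr=%v\n",
		match, oldR.latency, newR.latency, oldR.err, newR.err)
}

func main() {
	body := []byte(`{"jsonrpc":"2.0","id":1,"method":"eth_chainId","params":[]}`)
	// Placeholder endpoints for the legacy proxy and the eRPC deployment.
	mirror("http://legacy-proxy.internal:8545", "http://erpc.internal:4000", body)
}
```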
One gotcha: the “failsafe” designation only kicks in when the other upstreams are unhealthy. We wanted “A, then B, then a true last-resort C.” Marking the last one as a normal upstream, just with low priority, did the trick.
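Conceptually, the ordering we wanted is just a priority walk: always pick the healthiest upstream with the best priority, so the last-resort entry is only chosen when everything above it is down. A toy Go sketch of that selection rule (not eRPC's actual scoring; the names and priorities are made up):

```go
package main

import (
	"errors"
	"fmt"
)

// upstream is a provider endpoint with a priority; lower numbers are tried first.
type upstream struct {
	name     string
	priority int
	healthy  bool
}

// pick returns the healthy upstream with the best (lowest) priority.
// The "last resort" entry simply carries the worst priority, so it is only
// chosen when everything above it is unhealthy.
func pick(ups []upstream) (upstream, error) {
	var best *upstream
	for i := range ups {
		u := &ups[i]
		if !u.healthy {
			continue
		}
		if best == nil || u.priority < best.priority {
			best = u
		}
	}
	if best == nil {
		return upstream{}, errors.New("no healthy upstream")
	}
	return *best, nil
}

func main() {
	ups := []upstream{
		{name: "provider-a", priority: 0, healthy: false},
		{name: "provider-b", priority: 1, healthy: true},
		{name: "last-resort-c", priority: 9, healthy: true},
	}
	chosen, _ := pick(ups)
	fmt.Println("routing to:", chosen.name) // provider-b; C is only used if A and B are down
}
```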
Results that mattered
- Stability: CPU/memory stopped yo-yoing. Spikes turned into smooth plateaus.
- Latency: After hedging + cache tuning, p95 mostly tracked provider health, not our proxy.
- Operational clarity: Easier to attribute errors (provider vs proxy), calmer on-call.
- Costs: Caching/multiplexing reduced duplicate reads.
If you run multi-chain infra, try this
- Ship small: 1 chain, a few upstreams, and explicit routing for writes.
- Run both paths for a week; compare logs/metrics before cutting over.
- Be intentional with “last resort” upstreams (don’t rely on magic).
- Tune TTLs per method; eth_getLogs and eth_getBlockByNumber benefit from short, reorg-aware caches near the head (see the sketch after this list).
- Watch tail latency—hedging is your friend, but set limits.
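On the TTL point, the shape to aim for is a per-method policy: long or permanent caching for immutable reads, and short, reorg-aware TTLs for head-sensitive methods like eth_getLogs and eth_getBlockByNumber. A rough Go sketch of such a policy table; the values are illustrative, not our production numbers:

```go
package main

import (
	"fmt"
	"time"
)

// cachePolicy describes how long a cached JSON-RPC response may be reused.
type cachePolicy struct {
	ttl       time.Duration
	finalized bool // if true, only cache results for finalized blocks
}

// ttlByMethod is an illustrative per-method policy: immutable reads cache for
// a long time, head-sensitive reads get short, reorg-aware TTLs.
var ttlByMethod = map[string]cachePolicy{
	"eth_chainId":               {ttl: 24 * time.Hour},
	"eth_getTransactionReceipt": {ttl: time.Hour, finalized: true},
	"eth_getLogs":               {ttl: 5 * time.Second},
	"eth_getBlockByNumber":      {ttl: 2 * time.Second},
	"eth_call":                  {ttl: 0}, // don't cache by default
}

func policyFor(method string) cachePolicy {
	if p, ok := ttlByMethod[method]; ok {
		return p
	}
	return cachePolicy{ttl: 0} // unknown methods are not cached
}

func main() {
	for _, m := range []string{"eth_getLogs", "eth_chainId", "eth_sendRawTransaction"} {
		fmt.Printf("%s -> %+v\n", m, policyFor(m))
	}
}
```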
Openfort overview (how we wire this into wallets):
Docs → Overview
Open to Discussion:
- How would you tune hedging for indexers vs. interactive user flows?
- What’s your policy for eth_getLogs (batch-and-split vs single call)?
- Any clever upstream selection strategies we should test?