
Jack Morris

Most WebRTC Projects Don't Fail at Scale. They Fail at 300 Users.

I've lost count of how many WebRTC projects I've seen hit a wall long before they hit "scale."
Not at 10,000 concurrent users. Not at 1,000. At 300.

And almost every time, the root cause is the same thing: the team built for the demo, not for the third user who tried to join from a hotel Wi-Fi in Manila while the second user was tethering off a phone in a parking lot.

The "It works on my machine" trap is worse in WebRTC
Every backend engineer knows the "works on my machine" meme. In WebRTC, it's weaponized.

Your demo works because:

  • Everyone's on the same LAN
  • Nobody's behind a symmetric NAT
  • Your STUN server is reachable
  • The codec negotiation happens to pick something both browsers agree on
  • There's no packet loss because you're all sitting in the same office

Then you ship. And suddenly half your users can connect, a quarter can connect but have one-way audio, and the rest get stuck in a "connecting..." state that never resolves.

The infrastructure didn't fail. Your assumptions did.

Why 10,000 concurrent users isn't actually the hard part
Here's the thing nobody tells you early: scaling WebRTC to 10,000 concurrent users is mostly a solved problem. Deploy an SFU. Cluster it. Put a load balancer in front. Geographic distribution. Done, basically.

The hard part is the sizable fraction of your users who can't establish a connection at all because:

  • Their corporate firewall blocks UDP
  • They're behind a CGNAT
  • Their ISP is rate-limiting STUN requests
  • ICE candidate gathering times out before it finds a working path
  • Your TURN server is in a region that adds 400ms of latency

You can have the most beautifully architected SFU cluster in the world. If your TURN infrastructure is an afterthought, it's a $50,000/month decoration.
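One cheap way to find out whether your TURN infrastructure actually works before your users do is to force relay-only connections in a test client. A minimal sketch of that smoke test; the TURN URLs and credentials here are placeholders, not a real deployment:

```javascript
// Build an RTCConfiguration that forces every connection through TURN.
// With iceTransportPolicy "relay", host and srflx candidates are discarded,
// so if the test call still connects, your TURN path works.
function relayOnlyConfig(turnUrls, username, credential) {
  return {
    iceTransportPolicy: "relay",
    iceServers: [{ urls: turnUrls, username, credential }],
  };
}

// In a browser test client you would then do:
//   const pc = new RTCPeerConnection(relayOnlyConfig(
//     ["turn:turn.example.com:3478?transport=udp",
//      "turns:turn.example.com:443?transport=tcp"], // TLS on 443 for strict firewalls
//     "test-user", "test-secret"));
```

Running this regularly from the networks you care about (corporate, mobile, hotel Wi-Fi) tells you whether the relay path is healthy independently of everything else.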

The three things people skip that actually matter
In my experience, the teams that quietly scale WebRTC to meaningful numbers without drama do three things differently from day one:

1. They treat TURN as a first-class service, not an add-on
Most teams install coturn on one box, set `turn:your-server:3478`, and move on. Production teams deploy TURN clusters across multiple regions, with their own monitoring, their own capacity planning, and their own SLAs. TURN traffic isn't 5% of your load. For some user populations, it's 40%.
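Each node in such a cluster is still just a config file; what changes is that you run many of them, monitor them, and mint short-lived credentials from your app servers. A per-node sketch, assuming coturn with shared-secret auth (hostnames, realm, and secret are placeholders):

```ini
# /etc/turnserver.conf -- one node in a regional TURN cluster (sketch)
listening-port=3478
tls-listening-port=5349
realm=turn.example.com

# Shared-secret auth so app servers can mint short-lived per-user credentials
use-auth-secret
static-auth-secret=replace-with-long-random-secret

# Give the cluster its own monitoring, not just "is the process up"
prometheus
```

The shared-secret scheme matters: it means you never ship long-lived TURN credentials to clients, and you can rotate the secret per region without redeploying anything else.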

2. They instrument ICE, not just call quality
Everyone measures MOS scores, packet loss, and jitter. Fewer teams measure ICE gathering time, ICE candidate selection patterns, or how many users fall back to relay candidates. But those metrics tell you whether users are even successfully connecting. Quality metrics only matter if the connection happens.
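In the browser, `RTCPeerConnection.getStats()` already exposes the candidate-pair data you need; the missing piece is usually the aggregation. A sketch of that step, run here against plain stats objects (the field names follow the standard `candidate-pair` and `local-candidate` stats, but verify the exact shape against your target browsers):

```javascript
// Given an array of WebRTC stats entries, report whether a connection was
// established and whether the selected path is relayed, i.e. whether this
// user fell back to TURN.
function summarizeIce(statsEntries) {
  const byId = new Map(statsEntries.map((s) => [s.id, s]));
  const selected = statsEntries.find(
    (s) => s.type === "candidate-pair" && s.nominated && s.state === "succeeded"
  );
  if (!selected) return { connected: false, relayed: false };
  const local = byId.get(selected.localCandidateId);
  return {
    connected: true,
    // candidateType "relay" means media is flowing through your TURN server
    relayed: local?.candidateType === "relay",
  };
}

// In a browser you'd feed it real stats:
//   const report = await pc.getStats();
//   sendToMetricsBackend(summarizeIce([...report.values()]));
```

Shipping this one summary per call gives you the connect rate and relay-fallback rate per network, region, and client version, which is exactly the data the quality metrics can't give you.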

3. They stop thinking of browsers as the only client
WebRTC starts in the browser for most teams. Then someone wants it on iOS. Then Android. Then a desktop app. Each of these has different media stack quirks, different codec support, and different edge cases with network transitions. Teams that plan for this from the start build abstraction layers. Teams that don't rewrite their signaling three times.
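The abstraction doesn't need to be elaborate: a transport-agnostic signaling layer that browser, iOS, Android, and desktop clients all speak is often enough. A hypothetical sketch of that layer (the message shape and transport interface are mine, not a standard):

```javascript
// Minimal transport-agnostic signaling layer: client code depends on this
// API, not on WebSockets, HTTP long-polling, or whatever carries the bytes.
class SignalingChannel {
  constructor(transport) {
    // transport is anything with send(string) and an onMessage hook
    this.transport = transport;
    this.handlers = new Map(); // message type -> handler
    transport.onMessage = (raw) => {
      const msg = JSON.parse(raw);
      const handler = this.handlers.get(msg.type);
      if (handler) handler(msg.payload);
    };
  }
  on(type, handler) { this.handlers.set(type, handler); }
  send(type, payload) { this.transport.send(JSON.stringify({ type, payload })); }
}
```

Swapping transports later means swapping only the constructor argument, so the inevitable signaling rewrite never reaches client code on any platform.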

The uncomfortable truth about WebRTC scaling
Most teams building on WebRTC think the challenge is technical. It's actually operational.

Can you spin up a new media server in under 10 minutes when traffic spikes? Do you have observability into individual RTP streams? Can your on-call engineer trace why a specific user had one-way audio last Tuesday at 3:47 PM?

The teams that succeed at 10K concurrent users aren't smarter than the teams that fail at 300. They just took the operational side seriously before they needed to.

If you're building toward real scale
Here's my honest advice: don't optimize for 10,000 users until you can reliably support 1,000. And don't worry about 1,000 until the first 100 work every time, across every network, on every client.

The architecture matters. The infrastructure matters. But the thing that matters most is the willingness to build for the conditions your users actually have, not the ones your development network gives you.

If you're thinking through what a production-grade WebRTC architecture actually looks like at 10K concurrent users, here's a more detailed breakdown: How to Architect WebRTC Systems for 10K Concurrent Users. It covers SFU selection, TURN cluster design, signaling patterns, and the operational bits most articles skip.

I work on VoIP and real-time communications infrastructure. If you're running into WebRTC scaling issues and want to trade war stories, drop a comment; I'm happy to dig into specifics.
