Anton Bukarev

Posted on Mar 20

I Built a 700-Person Video Call Without Zoom. Here's Every Mistake I Made

#webrtc #tutorial #selfhosted #opensource

TL;DR: I ditched Zoom for a self-hosted solution using Janus WebRTC Gateway. It works, but I hit three painful production issues: a signaling storm that crashed the client, silent session timeouts that confused users, and a ulimit that killed me at participant #200. This post covers the full architecture, the API flow that actually matters, and a pre-flight checklist before you try this at scale.

My company needed to host a 700-person all-hands call (on-premises, no data leaving the corporate network, recordings on our own disks). The usual answer ("just use Zoom") wasn't on the table. Compliance requirements were strict, and this landed on me.

So I had to own the stack.

Under the hood: Janus WebRTC Gateway, an open-source C-based media server from Meetecho, GPLv3, on GitHub since 2014. What I write below applies to any WebRTC infrastructure; Janus is just what I ran.

Why a Media Server at All?

In a naive P2P mesh, each participant sends a video stream to every other participant. Connections grow as N×(N−1)/2. At 700 people, no client can hold 699 simultaneous outgoing streams at 1.5 Mbps. Your laptop would overheat before the first frame arrives. You need a central server.
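To put numbers on that, here is a quick back-of-the-envelope calculation. It's a sketch that assumes every participant publishes one 1.5 Mbps stream to every other peer, the figure used above; the function names are mine.

```python
def mesh_connections(n: int) -> int:
    """Number of peer-to-peer connections in a full mesh of n participants."""
    return n * (n - 1) // 2


def mesh_uplink_mbps(n: int, stream_mbps: float = 1.5) -> float:
    """Upload bandwidth one client needs to send its stream to every other peer."""
    return (n - 1) * stream_mbps


if __name__ == "__main__":
    n = 700
    print(mesh_connections(n))   # 244650 connections in total
    print(mesh_uplink_mbps(n))   # 1048.5 Mbps of upload per client
```

That is roughly 1 Gbps of upload per laptop, versus a single 1.5 Mbps uplink when a server does the fan-out.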

Janus works as an SFU (Selective Forwarding Unit): each participant sends one stream to the server, and the server forwards it to subscribers. No decoding, no re-encoding. Janus terminates DTLS-SRTP on each leg, so packets are decrypted and re-encrypted per hop, but the media payload itself is relayed untouched. For audio, it ships an MCU (AudioBridge plugin) that mixes all Opus streams into one, so each client receives a single audio track regardless of room size.

The Plugin Architecture

Janus's design principle: the core handles WebRTC connections (ICE, DTLS, SRTP) and JSON message transport (REST, WebSocket, RabbitMQ, MQTT). All application logic lives in plugins. You don't get a monolithic server. You get the plugins you actually need:

VideoRoom: SFU room with publish/subscribe model
AudioBridge: MCU audio mixing
Streaming: for broadcasting pre-recorded or live streams
Admin API (part of the core rather than a plugin): diagnostics, pcap dumps, live draining

The API: Three Steps to Media

Interaction with Janus is a strict state machine. Skip a step, get an unhelpful error.

Step 1: Create a session

POST /janus

{ "janus": "create", "transaction": "tx-001" }

// Response: {"janus": "success", "data": {"id": 3718046820721403}}

Step 2: Attach to a plugin

POST /janus/{session_id}

{
  "janus": "attach",
  "plugin": "janus.plugin.videoroom",
  "transaction": "tx-002"
}

Step 3: Send plugin messages. Here's creating a VideoRoom with sane defaults for large calls:

{
  "janus": "message",
  "transaction": "tx-010",
  "body": {
    "request": "create",
    "room": 1234,
    "publishers": 12,
    "bitrate": 1500000,
    "notify_joining": false,
    "record": true,
    "lock_record": true
  }
}

That notify_joining: false line? I'll come back to it. It's the one I ignored and paid for.
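The three steps chain together like this. It's a minimal sketch, not my production code: `post` is a hypothetical function that POSTs JSON to a URL and returns the decoded reply, and `create_room` is a name I made up for illustration.

```python
import uuid
from typing import Callable

# Hypothetical transport: any function that POSTs JSON to a URL and returns
# the decoded JSON response (e.g. a thin wrapper around your HTTP client).
Post = Callable[[str, dict], dict]


def tx() -> str:
    """Fresh transaction ID so responses can be matched to requests."""
    return uuid.uuid4().hex


def create_room(post: Post, base_url: str, room: int) -> dict:
    # Step 1: create a session.
    reply = post(base_url, {"janus": "create", "transaction": tx()})
    session_id = reply["data"]["id"]

    # Step 2: attach a VideoRoom handle to that session.
    session_url = f"{base_url}/{session_id}"
    reply = post(session_url, {
        "janus": "attach",
        "plugin": "janus.plugin.videoroom",
        "transaction": tx(),
    })
    handle_id = reply["data"]["id"]

    # Step 3: send the plugin message to the handle.
    return post(f"{session_url}/{handle_id}", {
        "janus": "message",
        "transaction": tx(),
        "body": {
            "request": "create",
            "room": room,
            "publishers": 12,
            "bitrate": 1500000,
            "notify_joining": False,
            "record": True,
            "lock_record": True,
        },
    })
```

Real code should also check that each reply has "janus": "success" before reading the ID out of it; I left that out to keep the flow visible.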

Once a participant publishes their stream with an SDP offer, Janus returns an SDP answer and ICE candidates flow via trickle requests. After ICE and DTLS complete, Janus starts emitting state events you must handle:

webrtcup: media channel established. If this never arrives, look at NAT/firewall.
media: Janus started (or stopped) receiving media. receiving: false means the publisher dropped.
slowlink: packet loss detected via RTCP NACKs. The loss counter in the event (lost in Janus 1.x, nacks in older releases) tells you how bad. Act before the user complains.
hangup: connection torn down. reason explains why (ICE failed, DTLS alert, close).

I built a real-time dashboard from these events. At 700 people, flying blind is not an option.
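The core of such a dashboard is just folding events into counters. Here's a simplified sketch, not the dashboard I actually ran: `summarize` and the counter names are hypothetical, and the loss field is read defensively because its name depends on the Janus version.

```python
from collections import Counter


def summarize(events: list[dict]) -> Counter:
    """Fold a batch of Janus events into counters a dashboard can plot."""
    stats: Counter = Counter()
    for ev in events:
        kind = ev.get("janus")
        if kind == "webrtcup":
            stats["connected"] += 1
        elif kind == "media" and ev.get("receiving") is False:
            stats["media_stopped"] += 1
        elif kind == "slowlink":
            # Assumption: the loss counter is "lost" in Janus 1.x and
            # "nacks" in older releases; check your version's docs.
            stats["lost_packets"] += ev.get("lost", ev.get("nacks", 0))
        elif kind == "hangup":
            stats[f"hangup:{ev.get('reason', 'unknown')}"] += 1
    return stats
```

Grouping hangups by reason is what makes the output useful: a spike in ICE failures points at the network, while a spike in DTLS alerts points somewhere else entirely.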

Mistake #1: The Signaling Storm

I turned on notify_joining: true so the UI could show a participant list: every join and leave event, not just active publishers.

The docs literally warn: "in large rooms this can be overly verbose and chatty."

I ignored it. At 700 participants, the signaling channel became a flood of join/leave events that overwhelmed the client-side event handler. The UI froze. Users thought the call was broken. I got paged.

Fix: Keep notify_joining: false. If you need a participant list, fetch it on demand from your app server. Don't fan-out every presence event through Janus.
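One way to do the on-demand fetch is a small cache on the app server in front of the VideoRoom listparticipants request. This is a sketch under assumptions: `ParticipantCache` and `fetch` are hypothetical names, and `fetch` stands in for whatever code sends the request body to Janus and returns the participants array.

```python
import time
from typing import Callable


def list_participants_request(room: int) -> dict:
    """Body of the VideoRoom `listparticipants` request."""
    return {"request": "listparticipants", "room": room}


class ParticipantCache:
    """Serve the participant list from the app server, refreshing at most
    once per `ttl` seconds no matter how many clients ask."""

    def __init__(self, fetch: Callable[[dict], list], ttl: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch    # hypothetical: sends the body to Janus
        self._ttl = ttl
        self._clock = clock
        self._cached: dict[int, tuple[float, list]] = {}

    def get(self, room: int) -> list:
        now = self._clock()
        hit = self._cached.get(room)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        participants = self._fetch(list_participants_request(room))
        self._cached[room] = (now, participants)
        return participants
```

With a 5-second TTL, 700 clients opening the participant panel at once cost Janus one request instead of 700.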

Two other VideoRoom parameters that matter at scale:

threads: number of forwarding threads for publisher-to-subscriber fan-out. Increase this when the plugin starts falling behind under heavy load.
bitrate: enforced via RTCP REMB. Set it at room level, override individually for screen-share presenters who need more bandwidth.

Mistake #2: Sessions Silently Dying

The most common complaint I heard from users: "the call just dropped." In 80% of cases, the session had expired because I broke the keepalive cycle somewhere in the reconnect logic.

Janus silently destroys a session if it stops receiving activity signals for longer than session_timeout. The mechanism depends on transport.

REST: You must maintain a continuous long-poll loop:

GET /janus/{session_id}?maxev=5

If no events arrive within 30 seconds, Janus returns {"janus": "keepalive"} and you must immediately open the next long-poll. Breaking that cycle for longer than session_timeout destroys the session.
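The loop itself is short. Here's a sketch, assuming a hypothetical `get` function that performs the HTTP GET and returns decoded JSON; the names `poll_events` and `should_stop` are mine.

```python
from typing import Callable, Iterator

# Hypothetical transport: performs the GET and returns decoded JSON.
# With maxev set, Janus may return a list of events instead of one object.
Get = Callable[[str], object]


def poll_events(get: Get, session_url: str, should_stop: Callable[[], bool],
                maxev: int = 5) -> Iterator[dict]:
    """Yield real events, re-opening the long poll immediately after
    every response, including bare keepalives."""
    url = f"{session_url}?maxev={maxev}"
    while not should_stop():
        reply = get(url)
        batch = reply if isinstance(reply, list) else [reply]
        for event in batch:
            if event.get("janus") == "keepalive":
                continue  # nothing happened for 30 s; just poll again
            yield event
```

The important property is that nothing sits between one response and the next request. Any slow work on an event belongs in the consumer, ideally on another thread or task, so it can never stall the poll past session_timeout.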

WebSocket: No long-poll needed, but the client must periodically send:

{
  "janus": "keepalive",
  "session_id": 3718046820721403,
  "transaction": "tx-ka-001"
}

Mobile tabs going into the background and JavaScript timers frozen by power saving: I hit both of these on the first load test. They're boring bugs that cost you real user trust.

There's a reclaim_session_timeout parameter that gives a short window to reclaim a session after its transport (for example, a dropped WebSocket) goes away, but don't rely on it as a primary mechanism. Implement proper keepalive with reconnect logic from day one.
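A timer that "fires every 25 seconds" is exactly what gets frozen, so the logic should look at elapsed time instead of counting ticks. Below is a sketch of that idea; `KeepaliveTracker` is a hypothetical helper, and the 25 s and 60 s defaults are assumptions that must be matched to session_timeout in your Janus config. I wrote it in Python for consistency, but the same logic ports directly to the browser client.

```python
import time
from typing import Callable, Optional


class KeepaliveTracker:
    """Decides what to do on each tick, based on elapsed time rather
    than on how many timer callbacks actually fired."""

    def __init__(self, interval: float = 25.0, session_timeout: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.interval = interval                # assumed: well below session_timeout
        self.session_timeout = session_timeout  # must match the Janus config
        self._clock = clock
        self._last_sent = clock()

    def next_action(self) -> Optional[str]:
        """Return "reconnect", "keepalive", or None (nothing to do yet)."""
        idle = self._clock() - self._last_sent
        if idle >= self.session_timeout:
            return "reconnect"   # session is probably gone: build a new one
        if idle >= self.interval:
            return "keepalive"
        return None

    def mark_sent(self) -> None:
        """Call after a keepalive was sent successfully."""
        self._last_sent = self._clock()
```

When a tab wakes up after two minutes in the background, the first tick returns "reconnect" straight away, so the client rebuilds the session instead of sending a keepalive into one that no longer exists.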

Authentication: Don't Ship the Dev Default

By default, Janus API is open to everyone. That's an intentional choice to simplify development. Deploying it that way to production is the equivalent of SSH with no password.

Three real options:

A. Stored tokens: every request carries a token field. Tokens are created/deleted via Admin API and can be scoped to specific plugins. Works well, but requires synchronizing token state across all Janus instances if you run multiple.

B. HMAC-signed tokens: your app server signs short-lived tokens with a shared HMAC secret; Janus validates the signature without external storage. Stateless and clean, but only available for VideoRoom, and tokens can't be revoked before TTL expires. Keep TTL in minutes, not hours.

C. Janus hidden behind your app server: external clients never see Janus at all. Only your backend talks to Janus using a static api_secret. The Janus docs explicitly describe this pattern as "useful when wrapping the Janus API." This is what I went with. For a corporate environment with strict security requirements, it's the right answer.
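In practice option C boils down to one function on the backend that rewrites every outbound payload. A minimal sketch, assuming the secret travels in the apisecret request field; `to_janus` and the forbidden-field list are my own illustration:

```python
import uuid

# Client-supplied fields that must never be forwarded to Janus as-is.
_FORBIDDEN = {"apisecret", "token", "admin_secret"}


def to_janus(client_payload: dict, api_secret: str) -> dict:
    """Build the payload the backend sends to Janus on a client's behalf."""
    payload = {k: v for k, v in client_payload.items() if k not in _FORBIDDEN}
    payload["transaction"] = uuid.uuid4().hex  # backend owns transaction IDs
    payload["apisecret"] = api_secret          # checked against api_secret in the config
    return payload
```

Stripping credential fields from client input means a client can never smuggle its own guess at the secret through the wrapper, and the real one never leaves the backend.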

Scaling: Sharding and Cascading

Janus is stateful at the room level. Participants, subscriptions, and transport state all live inside a specific instance. Horizontal scaling means room sharding: "room X lives on instance Y." Your signaling server maintains a registry and routes join requests accordingly.

For storage, I used a PostgreSQL table; Redis or etcd work just as well. The important part is consistency at room creation time and handling the case where an instance crashes.
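The registry logic is small enough to show in full. This is an in-memory sketch, not my actual PostgreSQL schema; `RoomRegistry` and the instance names are hypothetical, and "least rooms" is just one possible placement policy.

```python
class RoomRegistry:
    """In-memory stand-in for the "room X lives on instance Y" table."""

    def __init__(self, instances: list[str]):
        self._load = {name: 0 for name in instances}  # rooms per live instance
        self._rooms: dict[int, str] = {}

    def assign(self, room: int) -> str:
        """Return the room's instance, placing new rooms on the least-loaded one."""
        if room in self._rooms:
            return self._rooms[room]
        if not self._load:
            raise RuntimeError("no live Janus instances")
        instance = min(self._load, key=self._load.get)
        self._rooms[room] = instance
        self._load[instance] += 1
        return instance

    def mark_dead(self, instance: str) -> list[int]:
        """Drop a crashed instance and return the rooms that must be recreated."""
        self._load.pop(instance, None)
        orphaned = [r for r, i in self._rooms.items() if i == instance]
        for room in orphaned:
            del self._rooms[room]
        return orphaned
```

In a real database, a unique constraint on the room ID gives you the consistency at creation time: two concurrent joins for a new room cannot place it on two different instances.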

Cascading for rooms that outgrow one instance:

When a room exceeds single-instance capacity (by traffic or geography), VideoRoom supports cascading via remote publishers:

Publisher connects to Janus A
Your signaling server calls add_remote_publisher on Janus B, then publish_remotely on Janus A, to project them into the room on Janus B
Subscribers on Janus B see the remote publisher exactly like a local one

The instances exchange RTP packets directly. Unlike the older rtp_forward mechanism, remote publishers integrate transparently with VideoRoom: subscriptions, presence events, and SDP negotiation all work through the standard API.

One thing that surprised me coming from a Kubernetes mindset: this is not auto-clustering. Janus gives you the primitives: add_remote_publisher, publish_remotely, update_remote_publisher, remove_remote_publisher. Deciding when to cascade, which instances to use, and monitoring load is entirely your signaling server's responsibility. There's no control plane magic.
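As an illustration of what "your signaling server's responsibility" means, here is the simplest possible planning step: working out which instance pairs need a projection. It's a hypothetical sketch; `plan_projections` is my own name, and the Janus requests themselves are left out because their parameters depend on your version and network setup.

```python
def plan_projections(publisher_home: str,
                     subscribers_by_instance: dict[str, int]) -> list[tuple[str, str]]:
    """Return (source, target) instance pairs that need a remote publisher.

    A projection is needed for every instance that has subscribers for this
    publisher but is not the instance the publisher is connected to.
    """
    return [
        (publisher_home, instance)
        for instance, count in sorted(subscribers_by_instance.items())
        if count > 0 and instance != publisher_home
    ]
```

For each pair returned, the signaling server would then issue the remote-publisher requests against the two instances and remember the projection so it can be removed when the publisher leaves.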

Recording: MJR, janus-pp-rec, ffmpeg

Janus records in MJR format (Meetecho Janus Recording): a structured dump of raw RTP packets with metadata headers (codec, SSRC). The key property: recording adds no CPU overhead beyond normal forwarding. No decoding, no re-encoding.

I enabled it at room creation with record: true and lock_record: true (prevents toggling without room secret, which my compliance team cared about). You can also toggle it for the whole room on demand:

{
  "request": "enable_recording",
  "room": 1234,
  "secret": "room-secret-XYZ",
  "record": true
}

MJR files aren't directly playable. Run them through janus-pp-rec to get standard containers:

# Audio: MJR to Opus
janus-pp-rec /recordings/room1234-user42-audio.mjr /output/user42-audio.opus

# Video: MJR to WebM
janus-pp-rec /recordings/room1234-user42-video.mjr /output/user42-video.webm

From there, ffmpeg handles multi-track mixing, sync, and final encoding. Do not run post-processing on the same server handling live calls. I learned this the hard way; janus-pp-rec and ffmpeg will spike CPU and I/O at exactly the wrong moment.
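For batch conversion on the post-processing machine, something like the following builds the command list. It's a sketch that assumes the file naming shown above (names ending in -audio.mjr or -video.mjr) and a video codec that fits in WebM; `pp_rec_commands` is a hypothetical helper.

```python
from pathlib import Path


def pp_rec_commands(recordings: list[Path], out_dir: Path) -> list[list[str]]:
    """Build one janus-pp-rec invocation per MJR file.

    Assumes the naming used above: files end in -audio.mjr or -video.mjr.
    """
    commands = []
    for mjr in sorted(recordings):
        if mjr.name.endswith("-audio.mjr"):
            target = out_dir / (mjr.stem + ".opus")
        elif mjr.name.endswith("-video.mjr"):
            target = out_dir / (mjr.stem + ".webm")
        else:
            continue  # unknown kind: skip rather than guess a container
        commands.append(["janus-pp-rec", str(mjr), str(target)])
    return commands
```

Each entry can be passed to subprocess.run(cmd, check=True). Building the list first makes it easy to log or dry-run the whole batch before any CPU gets burned.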

From .opus output you can pipe into any ASR engine (Whisper, Vosk, etc.) and get a timestamped transcript. Useful if your compliance team wants searchable call records linked to business objects.

Mistake #3: The ulimit That Killed Me at Participant 200

Each WebRTC session in Janus consumes several file descriptors: RTP/RTCP sockets, DTLS state. At 700 participants with audio and video, you're in the tens of thousands.

Default ulimit -n on most Linux distros: 1024.

I found this on my first load test with 300 emulated participants: at around participant 200, Janus started silently rejecting new connections. It took me an embarrassing amount of time to diagnose because nothing in the Janus logs pointed obviously at the OS limit.

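Set ulimit -n 65536 (or higher) and update /etc/security/limits.conf so it persists across reboots. If Janus runs as a systemd service, limits.conf does not apply to it; set LimitNOFILE=65536 in the unit file instead.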
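A pre-flight script can catch this before the load test does. Here's a sketch using Python's standard resource module (Unix only); the per-participant descriptor count is a guess you should replace with a number measured on your own setup, and `enough_descriptors` is a hypothetical helper.

```python
import resource


def nofile_limits() -> tuple[int, int]:
    """Soft and hard RLIMIT_NOFILE of the current process (Unix only)."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def enough_descriptors(soft: int, participants: int, fds_per_participant: int) -> bool:
    """True if the soft limit covers the estimated need with 2x headroom."""
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= 2 * participants * fds_per_participant


if __name__ == "__main__":
    soft, hard = nofile_limits()
    # fds_per_participant=10 is an assumption, not a measured Janus figure.
    print(soft, hard, enough_descriptors(soft, 700, 10))
```

Note that this reports the limit of the process running the script. To see what the running Janus process actually got, read /proc/<pid>/limits for its PID.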

Pre-Flight Checklist for 700 People

Network: UDP range for RTP (typically 10000-60000) must be reachable between clients and media server. The Janus signaling API (HTTP/WebSocket) should not be directly reachable from outside: put a reverse proxy (nginx, HAProxy) in front and terminate TLS on 443. Admin API: never exposed to the internet.
TURN: Verify ICE/STUN/TURN reachability from every network segment your users will connect from. Set per-username allocation limits and bandwidth caps on your TURN server.
Keepalive: Implement a robust loop with reconnect handling. Test with mobile clients, background tabs, and power-saving modes.
Room params: notify_joining: false, publishers set to your actual speaker count, bitrate via REMB. Increase threads in VideoRoom config for heavy fan-out.
Recording I/O: 10 publishers recording audio + video = 20 simultaneous write streams. NVMe handles it; spinning disk may not. Run post-processing on separate machines.
File descriptors: ulimit -n 65536 before anything else.
Load test: Not optional. Your first test will find something you missed. GStreamer with WebRTC support can simulate hundreds of fake clients.

Admin API for production incidents (keep it behind VPN or allow-list):

handle_info: snapshot of WebRTC/ICE/DTLS/RTCP stats for a specific handle. Use it when one participant has frozen video or no audio.
start_pcap / stop_pcap: targeted RTP dump for a single handle, opens in Wireshark. Invaluable for debugging individual participants without capturing everyone's traffic.
accept_new_sessions with accept: false: puts an instance in draining mode. Existing sessions finish, new joins go elsewhere. Use this before upgrades.
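For an incident runbook, it helps to have these requests ready as code. The sketch below builds the payloads only; the field names (admin_secret, folder, filename, accept) are as I understand the Admin API, so verify them against the docs for your Janus version. The helper names are hypothetical.

```python
import uuid


def _admin(request: str, admin_secret: str, **fields) -> dict:
    """Common envelope for Admin API requests."""
    return {"janus": request, "transaction": uuid.uuid4().hex,
            "admin_secret": admin_secret, **fields}


def handle_info(admin_secret: str) -> dict:
    # Sent to {admin_base}/{session_id}/{handle_id}
    return _admin("handle_info", admin_secret)


def start_pcap(admin_secret: str, folder: str, filename: str) -> dict:
    # Sent to {admin_base}/{session_id}/{handle_id}
    return _admin("start_pcap", admin_secret, folder=folder, filename=filename)


def drain(admin_secret: str) -> dict:
    # Sent to {admin_base}; existing sessions keep running.
    return _admin("accept_new_sessions", admin_secret, accept=False)
```

Keeping these as plain payload builders means the same functions work whether the runbook talks to the Admin API over HTTP or WebSocket.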

References

Janus Documentation:

General Docs: architecture overview, installation, configuration
REST API: sessions, attach, message, trickle, keepalive
VideoRoom Plugin: SFU rooms, publish/subscribe, remote publishers, cascading
AudioBridge Plugin: MCU audio mixing
Admin/Monitor API: handle_info, pcap dumps, draining, Event Handlers
Authentication: stored tokens, HMAC-signed tokens, api_secret
Recording (MJR): MJR format, janus-pp-rec, mjr2pcap
GitHub: source code, issues, demos

What's your current self-hosted video stack? Drop a comment. Curious what others are running.
