Let me be honest with you.
The first time I built a calling feature, I thought I was done in a weekend. Two users, a couple of socket events, WebRTC offer/answer — done. Ship it.
Three weeks later I was staring at a bug report that said: "Call ended for me but my friend was still in it for five minutes."
That was the moment I realized I had built an illusion, not a system.
The Core Problem Nobody Talks About
Everyone shows you the happy path. User A calls User B. B accepts. They talk. They hang up. Tutorial over.
But real users don't do that.
They decline and then try again. They lose connection mid-call. They get invited into a call that's already active. Their phone dies and they rejoin. And when any of that happens on a system built around ad-hoc socket messages and a couple of boolean flags in memory — it falls apart fast.
The root cause is almost always the same: the call state lives in your head and in server memory, not in your database.
The Shift That Changed Everything
The moment things clicked for me was when I stopped thinking about a call as an event and started thinking about it as a session with participants.
This sounds obvious in retrospect. But when you're early in the build, it's tempting to model a call like this:
call_id, caller_id, callee_id, status
That works for 1:1. Until you need to invite someone mid-call. Or handle a reconnect. Or show a participant roster. Then you're hacking around a model that was never designed for reality.
The better model:
video_calls (call_id, status, created_at)
call_participants (call_id, user_id, status, joined_at)
Now your call is just a container. Participants join, leave, get invited, reconnect — all as rows in a table. Your socket layer reads from that truth, not from its own memory.
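Here is a minimal sketch of that model as TypeScript types, with arrays standing in for the two tables. The status values and the rosterFor helper are my assumptions for illustration, not a fixed schema.

```typescript
// Status values are assumed for this sketch; the article names only some of them.
type CallStatus = "ringing" | "active" | "ended" | "canceled";
type ParticipantStatus = "invited" | "connected" | "declined" | "left";

// One row of video_calls.
interface VideoCall {
  callId: string;
  status: CallStatus;
  createdAt: Date;
}

// One row of call_participants.
interface CallParticipant {
  callId: string;
  userId: string;
  status: ParticipantStatus;
  joinedAt: Date | null; // null until the user actually connects
}

// In-memory stand-ins for the two tables.
const videoCalls: VideoCall[] = [
  { callId: "c1", status: "active", createdAt: new Date(0) },
];
const callParticipants: CallParticipant[] = [
  { callId: "c1", userId: "alice", status: "connected", joinedAt: new Date(0) },
  { callId: "c1", userId: "bob", status: "connected", joinedAt: new Date(0) },
];

// The roster is always derived from rows, never from a list kept in socket memory.
function rosterFor(callId: string): CallParticipant[] {
  return callParticipants.filter((p) => p.callId === callId);
}

// Inviting someone mid-call is just one more row; the call row itself does not change.
callParticipants.push({ callId: "c1", userId: "carol", status: "invited", joinedAt: null });
```

Notice that the mid-call invite needed no new column and no special case, which is the whole point of treating the call as a container.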
Two Modules, One Clear Boundary
Once I had the data model right, I split the code into two distinct responsibilities.
Call lifecycle handles the hard stuff: start, accept, reject, end, restart, invite, and WebRTC signal relay. These events touch the database. They have side effects. They are strict.
In-call state sync handles the soft stuff: what's happening during a call between participants. Things like a shared file being opened, a page changing, a cursor moving. These are fast, ephemeral, and need to evolve quickly without touching core calling logic.
Keeping these separate isn't just clean code philosophy — it's practical. The lifecycle code is high-risk and changes rarely. The sync code is low-risk and changes constantly. Mixing them means every new collaboration feature becomes a liability to your call stability.
Walking Through What Actually Happens
Starting a Call
When a user hits "call," a few things happen before any socket event fires:
- Payload is normalized. Who are the targets? What type of call?
- Block checks run. Is anyone blocking anyone?
- A call_id is generated and persisted.
- Participant rows are created: caller is CONNECTED, invitees are INVITED.
Only then does the ring event go out.
The caller immediately gets a CALL.JOINED confirmation. The callees get the incoming ring. Everyone gets a CALL.STARTED broadcast. And a timeout watchdog starts in the background — if nobody answers in N seconds, the call is automatically cleaned up.
This is the difference between "ring and hope" and a reliable call initiation flow.
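The flow above could be sketched roughly like this. It is a simplified version under stated assumptions: the Map and array stand in for the database, the CALL.INCOMING event name, the 30-second timeout, and the isBlocked helper are all hypothetical, and the watchdog is written as a plain function of the current time instead of a real timer.

```typescript
type ParticipantStatus = "INVITED" | "CONNECTED";
interface Participant { callId: string; userId: string; status: ParticipantStatus; }
interface Call {
  callId: string;
  callerId: string;
  status: "RINGING" | "ACTIVE" | "CANCELED";
  ringDeadline: number; // epoch ms after which an unanswered call is cleaned up
}
interface OutboundEvent { to: string; event: string; callId: string; }

// Hypothetical in-memory stand-ins for the database and the block list.
const calls = new Map<string, Call>();
const participants: Participant[] = [];
const blocks = new Set<string>(["mallory>alice"]); // "blocker>blocked"

const RING_TIMEOUT_MS = 30000; // the "N seconds" from the text; the value is assumed
let nextId = 1;

function isBlocked(a: string, b: string): boolean {
  return blocks.has(`${a}>${b}`) || blocks.has(`${b}>${a}`);
}

function startCall(callerId: string, rawTargets: string[], now: number): OutboundEvent[] {
  // 1. Normalize: drop duplicates and the caller themselves.
  const targets = rawTargets.filter((id, i) => id !== callerId && rawTargets.indexOf(id) === i);
  if (targets.length === 0) throw new Error("NO_TARGETS");

  // 2. Block checks run before anything is persisted.
  if (targets.some((t) => isBlocked(callerId, t))) throw new Error("BLOCKED");

  // 3. Persist the call and its participant rows.
  const callId = `call-${nextId++}`;
  calls.set(callId, { callId, callerId, status: "RINGING", ringDeadline: now + RING_TIMEOUT_MS });
  participants.push({ callId, userId: callerId, status: "CONNECTED" });
  targets.forEach((t) => participants.push({ callId, userId: t, status: "INVITED" }));

  // 4. Only now build the events to send.
  const events: OutboundEvent[] = [{ to: callerId, event: "CALL.JOINED", callId }];
  targets.forEach((t) => events.push({ to: t, event: "CALL.INCOMING", callId }));
  [callerId].concat(targets).forEach((id) => events.push({ to: id, event: "CALL.STARTED", callId }));
  return events;
}

// Watchdog check. A real server would run this from a timer; here it is a pure function of "now".
function expireIfUnanswered(callId: string, now: number): boolean {
  const call = calls.get(callId);
  if (!call || call.status !== "RINGING" || now < call.ringDeadline) return false;
  call.status = "CANCELED";
  return true;
}
```

Returning the events instead of emitting them directly keeps the function easy to test; the socket handler can loop over the result and send each one.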
Accepting a Call
On accept, we don't just flip a boolean. We:
- Validate that this user is actually an invited participant (not just anyone who got the call ID)
- Move their status to connected
- Transition the call status to active
- Rebuild the full roster: who's connected, who's invited, who's declined
- Fan out CALL.ACCEPTED and CALL.JOINED to everyone relevant
That last step is what makes every client converge on the same view. Nobody is staring at stale state.
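A rough sketch of the accept path, assuming the call and its participants are already loaded into one object. Which event goes to whom is my assumption here: the accepter gets CALL.JOINED and everyone else still attached to the call gets CALL.ACCEPTED, each carrying the same rebuilt roster.

```typescript
type ParticipantStatus = "invited" | "connected" | "declined";
interface Participant { userId: string; status: ParticipantStatus; }
interface Call { callId: string; status: "ringing" | "active"; participants: Participant[]; }
interface Roster { connected: string[]; invited: string[]; declined: string[]; }
interface OutboundEvent { to: string; event: "CALL.ACCEPTED" | "CALL.JOINED"; roster: Roster; }

// Rebuild the roster from current participant rows.
function buildRoster(call: Call): Roster {
  const ids = (s: ParticipantStatus): string[] =>
    call.participants.filter((p) => p.status === s).map((p) => p.userId);
  return { connected: ids("connected"), invited: ids("invited"), declined: ids("declined") };
}

function acceptCall(call: Call, userId: string): OutboundEvent[] {
  // Knowing the call ID is not enough: the user must hold an "invited" row.
  const me = call.participants.find((p) => p.userId === userId);
  if (!me || me.status !== "invited") throw new Error("NOT_INVITED");

  me.status = "connected";
  call.status = "active";

  // Everyone still attached to the call receives the same roster snapshot.
  const roster = buildRoster(call);
  return call.participants
    .filter((p) => p.status !== "declined")
    .map((p): OutboundEvent => ({
      to: p.userId,
      event: p.userId === userId ? "CALL.JOINED" : "CALL.ACCEPTED",
      roster,
    }));
}
```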
Rejecting and Ending Calls
Rejection is more nuanced than it looks. When someone declines:
- Their status becomes declined
- The service checks: are there any active or invited participants left?
  - If yes, the call continues for them
  - If no, the call is canceled and cleaned up
This is what allows you to build group-call behavior even when you started with a direct call. The model supports it without special-casing.
Ending a call is similar. It's participant-aware. A user disconnecting doesn't kill the call for everyone — it just removes them. The call ends when the last active participant leaves. This alone fixes a whole class of "call dropped for everyone when my connection hiccupped" bugs.
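Both rules fit in a few lines once participant rows drive the decision. This sketch makes one assumption the text leaves open: after a decline, the call keeps going only if at least two people are still connected or invited, so a 1:1 call with a declined callee is canceled. The "left" status is also a hypothetical name.

```typescript
type ParticipantStatus = "invited" | "connected" | "declined" | "left";
type CallStatus = "ringing" | "active" | "canceled" | "ended";
interface Participant { userId: string; status: ParticipantStatus; }
interface Call { status: CallStatus; participants: Participant[]; }

function declineCall(call: Call, userId: string): CallStatus {
  const me = call.participants.find((p) => p.userId === userId);
  if (!me || me.status !== "invited") throw new Error("NOT_INVITED");
  me.status = "declined";

  // Assumed rule: a call needs at least two people who are connected or still ringing.
  const remaining = call.participants.filter(
    (p) => p.status === "connected" || p.status === "invited",
  );
  if (remaining.length < 2) call.status = "canceled";
  return call.status;
}

function leaveCall(call: Call, userId: string): CallStatus {
  const me = call.participants.find((p) => p.userId === userId);
  if (!me || me.status !== "connected") throw new Error("NOT_CONNECTED");
  me.status = "left"; // removes only this participant

  // The call ends when the last connected participant has left.
  const anyoneConnected = call.participants.some((p) => p.status === "connected");
  if (!anyoneConnected) call.status = "ended";
  return call.status;
}
```

A dropped connection can go through the same leaveCall path, which is why one person's network hiccup no longer ends the call for everyone else.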
WebRTC Signaling
Offer, answer, ICE candidate — none of these get relayed unless:
- The call exists and is valid
- Both sender and target are confirmed participants
No membership check = ghost peers. I've seen signaling relayed to users who were never part of the call. It creates confusing peer connection failures that are almost impossible to debug. Strict validation at every relay point prevents it.
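A sketch of what that guard might look like in front of the relay. The types and error codes are hypothetical, and I'm assuming that only connected members count as confirmed participants; your rules may differ.

```typescript
type SignalType = "offer" | "answer" | "ice-candidate";
type MemberStatus = "invited" | "connected" | "declined" | "left";

interface Signal { callId: string; from: string; to: string; type: SignalType; payload: unknown; }
interface Call { status: "ringing" | "active" | "ended"; members: Map<string, MemberStatus>; }

type RelayResult =
  | { ok: true; deliverTo: string; signal: Signal }
  | { ok: false; error: "CALL_NOT_FOUND" | "NOT_A_PARTICIPANT" };

// Which statuses count as "confirmed" is an assumption; here only connected members qualify.
function isConfirmed(call: Call, userId: string): boolean {
  return call.members.get(userId) === "connected";
}

function relaySignal(calls: Map<string, Call>, signal: Signal): RelayResult {
  // The call must exist and still be live.
  const call = calls.get(signal.callId);
  if (!call || call.status === "ended") return { ok: false, error: "CALL_NOT_FOUND" };

  // Both ends must be confirmed participants, otherwise nothing is relayed.
  if (!isConfirmed(call, signal.from) || !isConfirmed(call, signal.to)) {
    return { ok: false, error: "NOT_A_PARTICIPANT" };
  }
  return { ok: true, deliverTo: signal.to, signal };
}
```

The same check runs for offers, answers, and ICE candidates alike, so there is no relay path that skips it.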
The State Sync Layer
During an active call, you often need participants to share more than just video.
In our case, one action type is FILE_OPEN — one participant opens a file and the other participants see it too.
The sync flow is deliberately simple:
- Validate the payload (call ID, action type, file ID)
- Confirm the sender is an active participant
- Resolve who should receive this (explicit peer or everyone else)
- For FILE_OPEN, resolve the file metadata and a shareable URL
- Emit a normalized payload to recipients
The security piece matters: the file URL is only resolved if the file owner is among the allowed participant IDs. You don't want sync convenience to become a file-sharing leak.
Adding new action types — CURSOR_MOVE, PAGE_CHANGE, ANNOTATION_ADD — is now additive. New cases in a handler, new resolution logic. Core call lifecycle is untouched.
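To make the sync flow concrete, here is a minimal handler for FILE_OPEN only. Everything outside the steps described above is an assumption: the message shape, the error codes, the in-memory file store, and the placeholder URL (a real system would presumably issue a signed link).

```typescript
interface SyncMessage { callId: string; action: string; fileId?: string; to?: string; }
interface FileRecord { fileId: string; ownerId: string; name: string; }
interface ActiveCall { callId: string; connected: string[]; } // IDs of active participants
interface FileOpenPayload {
  action: "FILE_OPEN";
  from: string;
  file: { id: string; name: string; url: string };
}
interface SyncOut { recipients: string[]; payload: FileOpenPayload; }

// Hypothetical file store; a real system would query storage and sign the URL.
const files: Record<string, FileRecord> = {
  f1: { fileId: "f1", ownerId: "alice", name: "spec.pdf" },
  f9: { fileId: "f9", ownerId: "zed", name: "private.pdf" },
};

function handleSync(call: ActiveCall, senderId: string, msg: SyncMessage): SyncOut {
  const inCall = (id: string): boolean => call.connected.indexOf(id) !== -1;

  // 1. Validate the payload and the sender.
  if (msg.callId !== call.callId) throw new Error("WRONG_CALL");
  if (!inCall(senderId)) throw new Error("NOT_A_PARTICIPANT");

  // 2. Resolve recipients: an explicit peer, or everyone else in the call.
  const recipients =
    msg.to !== undefined ? [msg.to] : call.connected.filter((id) => id !== senderId);
  if (!recipients.every(inCall)) throw new Error("INVALID_RECIPIENT");

  // 3. Per-action resolution. New action types become new cases here.
  switch (msg.action) {
    case "FILE_OPEN": {
      const file = msg.fileId !== undefined ? files[msg.fileId] : undefined;
      if (!file) throw new Error("FILE_NOT_FOUND");
      // Only resolve a URL when the file's owner is one of the call's participants.
      if (!inCall(file.ownerId)) throw new Error("FILE_NOT_SHAREABLE");
      const url = `https://files.example.invalid/${file.fileId}`; // placeholder URL
      return {
        recipients,
        payload: { action: "FILE_OPEN", from: senderId, file: { id: file.fileId, name: file.name, url } },
      };
    }
    default:
      throw new Error("UNKNOWN_ACTION");
  }
}
```

Supporting PAGE_CHANGE or CURSOR_MOVE would mean widening the payload type and adding a case to the switch; the sender and recipient checks at the top stay as they are.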
Patterns That Made This Production-Ready
A few things I'd call out specifically:
A centralized domain service. CallService owns all participant and call mutations. Socket handlers call it and do nothing more, so business rules don't leak into event handlers.
Filter-based participant resolution. When you need to fan out to "all connected participants" or "everyone who was invited," you derive that from the database state — not from a list you assembled at call start and hoped stayed accurate.
Structured error emission. When something goes wrong on the server, the client gets a structured error event it can display. Not a silent failure, not a console log nobody sees.
History recording on real transitions. Call history is written when actual state changes happen, not speculatively. If a call never became active, there's no misleading "call ended" record.
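Two of these patterns are small enough to sketch together: filter-based recipient resolution and structured errors. The CALL.ERROR event name, the error codes, and the CallError class are hypothetical; the shape is what matters.

```typescript
type Status = "invited" | "connected" | "declined";
interface Row { userId: string; status: Status; }

// Domain error with a machine-readable code, thrown by the service layer.
class CallError extends Error {
  constructor(public readonly code: string, message: string) {
    super(message);
    // Keeps instanceof working when compiled to older JavaScript targets.
    Object.setPrototypeOf(this, CallError.prototype);
  }
}

interface CallErrorEvent { event: "CALL.ERROR"; code: string; message: string; }

// Filter-based fan-out: recipients come from current rows,
// not from a list captured when the call started.
function resolveRecipients(rows: Row[], statuses: Status[], exceptUserId?: string): string[] {
  return rows
    .filter((r) => statuses.indexOf(r.status) !== -1 && r.userId !== exceptUserId)
    .map((r) => r.userId);
}

// Turn any failure into an event the client can display.
// Unknown errors get a generic message so internals are not leaked.
function toErrorEvent(err: unknown): CallErrorEvent {
  if (err instanceof CallError) {
    return { event: "CALL.ERROR", code: err.code, message: err.message };
  }
  return { event: "CALL.ERROR", code: "INTERNAL", message: "Something went wrong." };
}
```

A socket handler would wrap its call into the service in a try/catch and emit the result of toErrorEvent back to the sender.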
What I'd Tell My Past Self
Don't model calls as fire-and-forget messages. Model them as a state machine where participant records drive every decision.
If your fanout logic is using variables you assembled at call start rather than querying current participant state — that's a bug waiting to surface under reconnects.
If your end-call handler terminates the call for everyone when one person leaves — you'll get bug reports the day someone's laptop battery dies.
And if your WebRTC relay doesn't check membership — you'll spend days chasing peer connection errors that make no sense until you realize you're signaling to ghosts.
Get the data model right first. Everything else follows.
The demo will always work. Production is where the design gets tested.
Build for the second.