Devanshu Biswas

Posted on Jun 7

I Built a Browser-to-Browser Video Chat in 250 Lines — Zero Backend, Zero SDKs, Zero Cost

#webrtc #nextjs #javascript #beginners

🌐 Live demo: https://webrtc-from-zero.vercel.app
🔗 Full code: https://github.com/dev48v/webrtc-from-zero

Day 42 of my TechFromZero series. One new technology every day, real working project, no Hello World.

Today: WebRTC. The thing that powers Google Meet, Discord voice, Zoom Web, Twitter Spaces and most of the "video chat in the browser" apps you have ever used. Three years ago, doing this yourself meant signaling servers, Janus or Jitsi, two days of YouTube tutorials and a TURN bill at the end. This article gets you to a working two-tab video call in about 250 lines with zero backend.

What WebRTC actually is

WebRTC is three pieces glued together: two browser APIs and one handshake you wire up yourself. That's it.

| Piece | One-line job |
|---|---|
| `navigator.mediaDevices.getUserMedia()` | "Browser, please ask the user for the webcam." Returns a `MediaStream`. |
| `RTCPeerConnection` | The actual peer-to-peer pipe. You stuff your tracks in; the other side pulls them out. |
| Signaling (SDP + ICE) | Each peer describes itself in a JSON blob. They swap blobs somehow. WebRTC doesn't care how. |

That last point is the part everyone overcomplicates. WebRTC has no opinion on how the two peers find each other. The browsers happily talk peer-to-peer once they know each other's network coordinates. Getting the coordinates from A to B is your problem, not WebRTC's.

In this article we use the simplest signaling mechanism known to humans: the user copies a JSON blob from one tab and pastes it into the other tab. No server. No WebSocket. No Firebase. Once you understand the handshake at that level, replacing the human with a WebSocket is trivial. But understand it without the WebSocket first.

The handshake in 3 steps

It looks complicated. It is not. It is two messages plus one final local step.

```
Tab A                                    Tab B
  │                                        │
  │  1. "Here is my offer SDP +            │
  │      every ICE candidate I found"      │
  │ ───────────────────────────────────►   │
  │                                        │
  │  2. "Here is my answer SDP +           │
  │      every ICE candidate I found"      │
  │  ◄───────────────────────────────────  │
  │                                        │
  │  3. (Tab A confirms by setting         │
  │      Tab B's answer)                   │
  │                                        │
  │  ════════ frames flow both ways ═════  │
```

SDP ("Session Description Protocol") = a multi-line text blob describing what codecs each side supports, what tracks it has, and where to send them.
ICE ("Interactive Connectivity Establishment") = the procedure that gathers candidate network addresses (your LAN IP, your public IP, sometimes a TURN relay) and tests them in pairs until one works. Each address is called an "ICE candidate".

Each side runs ICE gathering, mashes the candidates into the SDP, hands you one blob. The other side does the same.
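You can see the "mashed in" candidates yourself by looking at the sdp field of the JSON blob. Each candidate is one line starting with a=candidate:. The sketch below uses a hypothetical helper, listCandidates, which is mine and not from the repo. It pulls those lines out following the candidate grammar from the ICE RFCs. The sample SDP is made up and uses documentation-range addresses.

```typescript
// ICE candidate types (RFC 8445): host = local interface, srflx = public address
// discovered via STUN, prflx = learned during connectivity checks, relay = TURN.
type CandidateType = "host" | "srflx" | "prflx" | "relay";

interface ParsedCandidate {
  transport: string; // "udp" or "tcp"
  address: string;   // an IP, or possibly an mDNS hostname ending in ".local"
  port: number;
  type: CandidateType;
}

const CANDIDATE_TYPES: ReadonlySet<string> = new Set(["host", "srflx", "prflx", "relay"]);

// Hypothetical helper: list the candidates baked into a non-trickle SDP.
// Line format: a=candidate:<foundation> <component> <transport> <priority> <address> <port> typ <type> ...
function listCandidates(sdp: string): ParsedCandidate[] {
  const found: ParsedCandidate[] = [];
  for (const rawLine of sdp.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (!line.startsWith("a=candidate:")) continue;
    const parts = line.slice("a=candidate:".length).split(" ");
    // parts[6] must be the literal "typ"; skip anything malformed.
    if (parts.length < 8 || parts[6] !== "typ" || !CANDIDATE_TYPES.has(parts[7])) continue;
    found.push({
      transport: parts[2].toLowerCase(),
      address: parts[4],
      port: Number(parts[5]),
      type: parts[7] as CandidateType,
    });
  }
  return found;
}

// Usage with a made-up SDP fragment:
const sampleSdp = [
  "v=0",
  "m=video 9 UDP/TLS/RTP/SAVPF 96",
  "a=candidate:1 1 udp 2122260223 4f1c7a2e-0d3b-4c5e-9a1f-2b6c8d9e0f11.local 54400 typ host generation 0",
  "a=candidate:2 1 udp 1686052607 203.0.113.7 54400 typ srflx raddr 192.168.1.20 rport 54400 generation 0",
  "a=end-of-candidates",
].join("\r\n");

for (const c of listCandidates(sampleSdp)) {
  console.log(`${c.type.padEnd(5)} ${c.transport} ${c.address}:${c.port}`);
}
```

A host candidate is what a same-Wi-Fi call uses. An srflx candidate is the public address that Google's STUN server reported back.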

Build it: 8 step-by-step commits

The repo is structured so each commit adds exactly one idea. Walk the history one commit at a time.

Step 1 — Next.js scaffold

```bash
npx create-next-app@latest webrtc-from-zero --typescript --tailwind --app
```

Default landing page. No WebRTC yet.

Step 2 — Local webcam preview

The simplest browser API in the world.

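```typescript
// localVideo is a React ref to the local preview: useRef<HTMLVideoElement>(null)
async function startCamera() {
  const stream = await navigator.mediaDevices.getUserMedia({
    video: true,
    audio: true,
  });
  localVideo.current!.srcObject = stream;
}
```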

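The permission prompt fires the first time. Assign srcObject = stream and the frames render in real time, provided the <video> element has autoPlay (plus playsInline for iOS Safari). Mark the local preview muted too, or you'll hear your own microphone. No <source> tag, no MIME type, no nothing.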
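getUserMedia can also fail, and the reason is carried in the rejected error's name. It also needs a secure context: navigator.mediaDevices is undefined on plain http:// pages other than localhost. That is why the local dev server works, while opening it via a LAN IP over http won't. The sketch below turns the common error names into something a user can act on. The function name and messages are hypothetical, not the repo's.

```typescript
// Hypothetical helper: map getUserMedia rejections (DOMException names) to UI text.
function describeCameraError(err: unknown): string {
  const name =
    typeof err === "object" && err !== null && "name" in err
      ? String((err as { name: unknown }).name)
      : "";
  switch (name) {
    case "NotAllowedError":
      return "Camera/mic permission was denied. Allow it from the address bar and try again.";
    case "NotFoundError":
      return "No camera or microphone was found.";
    case "NotReadableError":
      return "The camera is busy (often another app is using it).";
    case "OverconstrainedError":
      return "No device satisfies the requested constraints.";
    default:
      return "Could not start the camera.";
  }
}

// Usage inside the component (setError is a hypothetical state setter):
// try { await startCamera(); } catch (err) { setError(describeCameraError(err)); }
console.log(describeCameraError({ name: "NotAllowedError" }));
```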

Step 3 — RTCPeerConnection + remote pane

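```typescript
const pc = new RTCPeerConnection({
  iceServers: [{ urls: "stun:stun.l.google.com:19302" }], // free public STUN
});

// Incoming media: show the remote stream in the second <video>.
pc.ontrack = (event) => {
  remoteVideo.current!.srcObject = event.streams[0];
};

// Outgoing media: attach every local track (audio + video) from the step 2 stream.
stream.getTracks().forEach((t) => pc.addTrack(t, stream));
```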

Three things happen here:

new RTCPeerConnection(...) creates the peer-to-peer pipe. Each side has its own.
pc.ontrack fires once per incoming track (here: one audio, one video) when the remote description is applied. We grab the incoming MediaStream and feed it to a <video> exactly like we did for the local camera. Assigning the same stream twice is harmless.
addTrack(track, stream) attaches each of our outgoing tracks to the pipe. Each one gets its own media section (an m= line) in the SDP offer on the next step.

Note the iceServers config: Google's public STUN server. STUN tells your browser its public IP/port (the one the world sees, not your LAN IP). It costs nothing and is fine for dev. Production also needs a TURN server for users behind symmetric NATs or strict firewalls, where no direct path exists and the media has to be relayed.
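When you do add TURN, it goes in the same iceServers array, together with credentials. The sketch below assumes your TURN provider (or your own coturn) gives you URLs plus a username/credential pair. turn.example.com, the credentials and the helper name are placeholders. IceServer mirrors the fields of the browser's RTCIceServer dictionary, so the snippet also runs outside a browser.

```typescript
// Mirrors the browser's RTCIceServer dictionary (urls, username, credential).
interface IceServer {
  urls: string | string[];
  username?: string;
  credential?: string;
}

// Hypothetical shape of what your TURN provider hands you.
interface TurnCredentials {
  urls: string[];
  username: string;
  credential: string;
}

// STUN is always included; TURN is appended only when credentials exist.
function buildIceServers(turn?: TurnCredentials): IceServer[] {
  const servers: IceServer[] = [{ urls: "stun:stun.l.google.com:19302" }];
  if (turn) {
    servers.push({ urls: turn.urls, username: turn.username, credential: turn.credential });
  }
  return servers;
}

// Usage (placeholder host and credentials):
const iceServers = buildIceServers({
  urls: ["turn:turn.example.com:3478?transport=udp", "turns:turn.example.com:5349"],
  username: "demo-user",
  credential: "demo-secret",
});
console.log(JSON.stringify(iceServers, null, 2));
// In the browser: new RTCPeerConnection({ iceServers });
```

To confirm the TURN path really works, you can temporarily add iceTransportPolicy: "relay" to the RTCPeerConnection config. That forces the call through the relay, so a successful connection proves TURN is reachable.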

Step 4 — Caller creates the offer

```typescript
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

// Wait for ICE gathering to finish so the SDP contains every candidate.
await waitForIceGatheringComplete(pc);

// pc.localDescription is now the FINAL SDP: offer + every candidate baked in.
setOfferSdp(JSON.stringify(pc.localDescription));
```

createOffer() generates our offer SDP. setLocalDescription() tells the pc "yes, that's me" and starts ICE gathering. We then wait for gathering to complete. That's the "non-trickle" approach, simpler for a tutorial. (Production usually trickles the candidates to the other side as they come in to save handshake time, but that's just an optimization on top of the same idea.)

When ICE gathering finishes, pc.localDescription contains the offer AND every candidate, merged. One blob. Drop it in a textarea. User copies it to tab B.
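The step 4 snippet calls waitForIceGatheringComplete(pc) without defining it. One common way to write it is to listen for the icegatheringstatechange event until iceGatheringState becomes "complete". This is a sketch, not necessarily the repo's version. The timeout is my addition, so a slow or unreachable STUN server can't hang the UI forever; when it fires, you paste whatever candidates were found. The helper is typed against a minimal interface, which a real RTCPeerConnection should satisfy structurally, so it also runs outside a browser.

```typescript
// The subset of RTCPeerConnection this helper needs.
interface IceGatheringSource {
  iceGatheringState: string; // "new" | "gathering" | "complete"
  addEventListener(type: "icegatheringstatechange", listener: () => void): void;
  removeEventListener(type: "icegatheringstatechange", listener: () => void): void;
}

// Resolves once ICE gathering is complete, or after timeoutMs as a safety net.
function waitForIceGatheringComplete(pc: IceGatheringSource, timeoutMs = 5000): Promise<void> {
  if (pc.iceGatheringState === "complete") return Promise.resolve();

  return new Promise<void>((resolve) => {
    const onChange = () => {
      if (pc.iceGatheringState === "complete") finish();
    };
    const finish = () => {
      clearTimeout(timer);
      pc.removeEventListener("icegatheringstatechange", onChange);
      resolve();
    };
    const timer = setTimeout(finish, timeoutMs);
    pc.addEventListener("icegatheringstatechange", onChange);
  });
}
```

An alternative is to watch pc.onicecandidate for the event whose candidate is null, which also signals the end of gathering.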

Step 5 — Callee accepts offer, creates answer

```typescript
// Tab B receives the offer.
await pc.setRemoteDescription(JSON.parse(pastedOffer));

// Now generate our own SDP describing what WE'll send.
const answer = await pc.createAnswer();
await pc.setLocalDescription(answer);

await waitForIceGatheringComplete(pc);
setAnswerSdp(JSON.stringify(pc.localDescription));
```

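Same shape as the offer side, mirrored. setRemoteDescription tells our pc "here's what the other side wants." createAnswer is the symmetric companion to createOffer. Tab B must have started its camera and called addTrack (step 3) before createAnswer; otherwise its answer only receives, and tab A never sees tab B's video. We wait for ICE again, dump the answer SDP into a textarea, and the user copies it back to tab A.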
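JSON.parse(pastedOffer) trusts the user to paste the right thing into the right box. The sketch below is a slightly more defensive version using a hypothetical helper, not one from the repo. It checks the blob before setRemoteDescription sees it, so a half-copied blob or an answer pasted into the offer box produces a readable message instead of a cryptic exception. The only SDP fact it relies on is that every SDP document starts with a v=0 line.

```typescript
// The two fields RTCSessionDescriptionInit carries in this app.
interface PastedDescription {
  type: "offer" | "answer";
  sdp: string;
}

// Hypothetical helper: validate a pasted blob before setRemoteDescription sees it.
function parsePastedDescription(text: string, expected: "offer" | "answer"): PastedDescription {
  let value: unknown;
  try {
    value = JSON.parse(text.trim());
  } catch {
    throw new Error("That isn't valid JSON. Copy the whole blob, braces included.");
  }
  if (typeof value !== "object" || value === null) {
    throw new Error('Expected a JSON object with "type" and "sdp" fields.');
  }
  const { type, sdp } = value as { type?: unknown; sdp?: unknown };
  if (type !== expected) {
    throw new Error(`Expected an ${expected} but got ${String(type)}. Wrong box?`);
  }
  // Every SDP document begins with the protocol version line "v=0".
  if (typeof sdp !== "string" || !sdp.startsWith("v=0")) {
    throw new Error('The "sdp" field is missing or is not SDP text.');
  }
  return { type: expected, sdp };
}

// Usage in tab B (step 5):
// await pc.setRemoteDescription(parsePastedDescription(pastedOffer, "offer"));
const blob = JSON.stringify({ type: "offer", sdp: "v=0\r\no=- 4611731400430051336 2 IN IP4 127.0.0.1\r\n" });
console.log(parsePastedDescription(blob, "offer").type);
```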

Step 6 — Caller accepts answer

```typescript
await pc.setRemoteDescription(JSON.parse(pastedAnswer));
```

One line. The handshake is over. ICE picks the best candidate pair (usually a direct LAN connection on the same Wi-Fi, or a STUN-discovered public-IP connection across the internet), and frames start flowing in both directions.

On tab A, the pc.ontrack handler from step 3 fires for the first time. The remote <video> lights up with the other tab's webcam. You see your own face in two browsers. You wave at yourself. The lag is typically around 50-100 ms because no server sits in the middle of the media path.
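Copy-paste signaling has a classic failure mode: clicking "accept answer" twice. setRemoteDescription with an answer is only valid while the pc is in the "have-local-offer" signaling state. After the first answer, the state is back to "stable", and a second call rejects with an InvalidStateError. The guard below uses a hypothetical name. It is typed against just the two members it touches, so it runs without a browser.

```typescript
// The subset of RTCPeerConnection this guard touches.
interface AnswerTarget {
  signalingState: string;
  setRemoteDescription(desc: { type: "answer"; sdp: string }): Promise<void>;
}

// Applies the answer only while we're actually waiting for one.
// Returns false (instead of throwing) if the answer was already applied.
async function acceptAnswerOnce(pc: AnswerTarget, sdp: string): Promise<boolean> {
  // This app never uses provisional answers, so "have-local-offer" is the only valid state.
  if (pc.signalingState !== "have-local-offer") return false;
  await pc.setRemoteDescription({ type: "answer", sdp });
  return true;
}

// Usage in tab A's click handler (step 6):
// const applied = await acceptAnswerOnce(pc, JSON.parse(pastedAnswer).sdp);
```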

Step 7 — Live connection-state badge

```typescript
pc.onconnectionstatechange = () => {
  setConnState(pc.connectionState);
};
```

pc.connectionState moves through new → connecting → connected, and from there to disconnected (which can recover on its own), failed, or closed. Show this in the UI as a colored dot. Students stop guessing whether their handshake worked.
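Here is one way to turn those states into the colored dot. The Tailwind class names and labels are my picks, not the repo's. Typing the map as a Record over all six RTCPeerConnectionState values makes the compiler complain if you forget one.

```typescript
// The six values of RTCPeerConnectionState, spelled out so this runs without DOM typings.
type ConnState = "new" | "connecting" | "connected" | "disconnected" | "failed" | "closed";

interface Badge {
  dot: string;   // Tailwind background class for the dot
  label: string; // text next to the dot
}

// Record<ConnState, Badge> forces an entry for every state.
const BADGES: Record<ConnState, Badge> = {
  new: { dot: "bg-gray-400", label: "Not connected" },
  connecting: { dot: "bg-yellow-400", label: "Connecting" },
  connected: { dot: "bg-green-500", label: "Connected" },
  disconnected: { dot: "bg-orange-500", label: "Interrupted (may recover)" },
  failed: { dot: "bg-red-500", label: "Failed (TURN needed?)" },
  closed: { dot: "bg-gray-700", label: "Call ended" },
};

function badgeFor(state: string): Badge {
  // hasOwnProperty, not `in`, so inherited names like "toString" don't match.
  return Object.prototype.hasOwnProperty.call(BADGES, state)
    ? BADGES[state as ConnState]
    : BADGES.new;
}

// In the component: <span className={`h-3 w-3 rounded-full ${badgeFor(connState).dot}`} />
console.log(badgeFor("connected").label);
```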

Step 8 — Mute, camera, hang up

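```typescript
const [audioTrack] = stream.getAudioTracks(); // stream from step 2
const [videoTrack] = stream.getVideoTracks();
// Toggle without renegotiating the call:
audioTrack.enabled = !audioTrack.enabled; // mute / unmute
videoTrack.enabled = !videoTrack.enabled; // camera off / on
```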

.enabled is cheap: flipping it pauses the track but keeps the pc alive. No new SDP exchange. The other side just sees black video or hears silence.

Hang up is the real teardown:

```typescript
pc.close();
stream.getTracks().forEach((t) => t.stop());
```

pc.close() releases the connection. track.stop() releases the hardware (the green camera light goes off).
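Two details are worth knowing about hang-up. First, the tab that calls pc.close() gets no connectionstatechange event, because the spec sets the state to "closed" without firing one. The step 7 badge therefore stays green unless you update it yourself. Second, a closed RTCPeerConnection can't be reused, so calling again means creating a fresh one. Below is a sketch of a hang-up helper that does the teardown plus the UI cleanup. The interfaces are minimal stand-ins for the real DOM types, and hangUp and setConnState are hypothetical names.

```typescript
// Minimal stand-ins for RTCPeerConnection, MediaStream and HTMLVideoElement.
interface ClosablePeer { close(): void }
interface StoppableStream { getTracks(): Array<{ stop(): void }> }
interface VideoPane { srcObject: unknown }

// Hypothetical helper: full hang-up on the local side.
function hangUp(
  pc: ClosablePeer,
  localStream: StoppableStream | null,
  panes: VideoPane[],
  setConnState: (state: "closed") => void,
): void {
  pc.close();                                          // release ICE/DTLS transports
  localStream?.getTracks().forEach((t) => t.stop());   // release camera + mic
  panes.forEach((pane) => { pane.srcObject = null; }); // blank both <video> elements
  setConnState("closed");                              // close() fires no event locally
}
// A new call afterwards needs `new RTCPeerConnection(...)`; the closed one can't be reused.
```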

What's NOT in this article (on purpose)

I cut a lot to keep the file under 250 lines. Here's what a real product adds, in rough order of importance:

| Concern | What you add |
|---|---|
| Skip the copy-paste step | Any signaling channel: WebSocket, Firebase Realtime Database, Ably, Pusher. The server only ever relays SDP + ICE blobs. It never sees the video. |
| Reliability across NATs | A TURN server (coturn, Twilio NTS, Cloudflare Calls). STUN alone fails behind symmetric NATs and strict firewalls, which are common on corporate and cellular networks (an often-quoted estimate is ~15% of connections). |
| 3+ participants | Either a mesh (each peer holds N-1 connections, fine up to ~5 people) or an SFU (LiveKit, mediasoup, Cloudflare Calls) that forwards streams. |
| Screen share | `navigator.mediaDevices.getDisplayMedia()` returns a `MediaStream` of the screen. Swap its video track into the existing sender with `RTCRtpSender.replaceTrack()` (no renegotiation), or `addTrack` it (which needs a fresh offer/answer). |
| Recording | `MediaRecorder` on the local stream, or pipe SFU output to S3. |
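The claim that "the server only relays blobs" is easiest to see in code. Below is an in-memory sketch of the relay logic a WebSocket server would wrap. Everything in it is hypothetical: SignalingRoom, the message shape, and the deliver callbacks, which would write to each peer's socket in a real server. The relay never parses the SDP; it forwards each message untouched to the other seat.

```typescript
// Hypothetical wire format: the same blobs the copy-paste version shuttles by hand.
type SignalMessage =
  | { kind: "offer"; sdp: string }
  | { kind: "answer"; sdp: string };

type Deliver = (msg: SignalMessage) => void;

// A 1:1 room. In a real server, `deliver` would be socket.send(JSON.stringify(msg)).
class SignalingRoom {
  private readonly peers = new Map<string, Deliver>();

  join(peerId: string, deliver: Deliver): void {
    if (!this.peers.has(peerId) && this.peers.size >= 2) {
      throw new Error("Room is full: this sketch only handles two peers.");
    }
    this.peers.set(peerId, deliver);
  }

  leave(peerId: string): void {
    this.peers.delete(peerId);
  }

  // Forward a message, untouched, to whoever else is in the room.
  send(fromId: string, msg: SignalMessage): boolean {
    for (const [id, deliver] of this.peers) {
      if (id !== fromId) {
        deliver(msg);
        return true;
      }
    }
    return false; // nobody to relay to yet
  }
}

// Usage: two peers, one offer, one answer.
const room = new SignalingRoom();
const inboxA: SignalMessage[] = [];
const inboxB: SignalMessage[] = [];
room.join("A", (m) => inboxA.push(m));
room.join("B", (m) => inboxB.push(m));
room.send("A", { kind: "offer", sdp: "v=0\r\n" });
room.send("B", { kind: "answer", sdp: "v=0\r\n" });
console.log(inboxB[0].kind, inboxA[0].kind);
```

With trickle ICE you would add a third message kind for individual candidates; the relay logic itself would not change.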

The 250 lines above are the floor. Everything else is layers on top of the same three APIs.

Why I made you do it this way

Every WebRTC tutorial I read in 2022 either:

Used PeerJS / simple-peer and never explained the handshake, or
Required spinning up a Node + WebSocket signaling server, which obscures the actual WebRTC part.

Stripping signaling down to copy-paste isolates the part the browser does for you (the entire P2P media pipe) from the part you're responsible for (relaying two JSON blobs). Once you've seen the handshake work with copy-paste, you'll never again be confused about your signaling server's job: it relays two blobs. That's it.

Try it now

```bash
git clone https://github.com/dev48v/webrtc-from-zero
cd webrtc-from-zero
npm install
npm run dev
```

Open http://localhost:3000 in two browser windows. Both click ▶ Start camera. Tab A creates an offer, tab B answers, tab A accepts the answer, the dot turns green, you see yourself.

Or just open the live demo on Vercel:
https://webrtc-from-zero.vercel.app