
I WebRTC you - building a video chat in JavaScript

Michael Neu ・ 11 min read

For a recent university project, our team was tasked with delivering a video calling feature for both our iOS and web app. There are many solutions out there that promise video calling, but only a few are free, and most work for just one platform. As we had to build it for iOS and the web, we decided to use plain WebRTC, 'cause "can't be that hard, right ¯\_(ツ)_/¯"

tl;dr

I remember skimming through blog posts and tutorials, trying to find the minimum required steps, eventually even reading through the Signal iOS repository. So here's the bare gist of what you need to know to get going with WebRTC (or at least to search for the things that don't work in your project):

  • STUN is similar to traceroute: it collects the "hops" between you and a STUN server; those hops are then called ICE candidates
  • ICE candidates are basically IP:port pairs; the other party can "contact" your app using these candidates
  • you'll need a duplex connection to exchange data between the calling parties. Consider using a WebSocket server, since it's the easiest way to achieve this
  • when one party "discovers" an ICE candidate, send it to the other party via the WebSocket/your duplex channel
  • get your device's media tracks and add them to your local RTCPeerConnection
  • create a WebRTC offer on your RTCPeerConnection, and send it to the other party
  • receive and use the offer, then reply with your answer to it

If this didn't help you with your problems, or you're generally interested in WebRTC, keep on reading. We'll first look at what WebRTC is and then we'll build ourselves a small video chat.

What is WebRTC?

I'll just borrow the "about" section from the official website:

WebRTC is a free, open project that provides browsers and mobile applications with Real-Time Communications (RTC) capabilities via simple APIs. The WebRTC components have been optimized to best serve this purpose.
webrtc.org

In a nutshell, WebRTC allows you to build apps that exchange data in real-time using a peer-to-peer connection. The data can be audio, video, or anything you want. For instance, Signal calls are done over pure WebRTC and, due to the peer-to-peer nature, mostly work without routing your call data through a third party, as e.g. Skype does nowadays.

STUN

To establish the peer-to-peer connection between two calling parties, they need to know how to connect to each other. This is where STUN comes in. As mentioned above, it's similar to traceroute.

When you create a WebRTC client object in JavaScript, you need to provide iceServers, which essentially point to STUN servers via their URLs. The client then goes through all hops until it reaches the STUN server. The following sequence diagram shows how it works in a simplified way:

sequence diagram of STUN hops

The "further" a candidate is away from Alice (the more hops it takes to reach her), the higher its network cost is. localhost:12345 is closer to her than public_ip:45678, so the localhost cost could be 10, whereas the public_ip one could be 100. WebRTC tries to establish a connection with the lowest network cost, to ensure a high bandwidth.

Offers, answers and tracks

If you want to FaceTime with a friend, they might be interested in knowing how you're calling them, i.e. they want to see whether you're using only audio or video, or even if you're not using FaceTime at all and just call them from your landline.

WebRTC offers are similar to this: you specify what you'll be sending in the upcoming connection. So when you peer.createOffer(), it checks which tracks, e.g. video or audio, are present and includes them in the offer. Once the called party receives an offer, it replies with peer.createAnswer(), specifying its own capabilities, e.g. whether it'll also send audio and video.

Signalling

An important part of WebRTC is exchanging information before the peer-to-peer connection is established. Both parties need to exchange an offer and answer, and they need to know the other side's ICE candidates, otherwise they won't know where to send their audio and video streams.

That's where signalling comes in: you need to send said information to both parties. You can use anything you want to do this, but it's easiest to use a duplex connection that e.g. WebSockets provide. Using WebSockets, you'll be "notified" whenever there's an update from your signalling server.

A typical WebRTC handshake looks something like this:

WebRTC handshake

First, Alice signals she wants to call Bob, so both parties initiate the WebRTC "handshake". They both acquire their ICE candidates, which they send to the other party via the signalling server. At some point, Alice creates an offer and sends it to Bob. It doesn't matter who creates the offer first (i.e. Alice or Bob), but the other party must create the answer to the offer. As both Alice and Bob know how to contact each other and what data will be sent, the peer-to-peer connection is established and they can have their conversation.

Building it

Now that we know how WebRTC works, we "just" have to build it. This post focuses on web clients only; if there's interest in an iOS version in the comments, I'll summarise the pitfalls in a new post. I've also implemented the web client as a React hook useWebRTC, which I might create a post for as well.

The server will be in TypeScript, whereas the webapp will be plain JavaScript, so we don't need a separate build process. Both will use only plain WebSockets and WebRTC - no magic there. You can find the sources to this post on GitHub.

Server

We'll use express, express-ws and a bunch of other libraries, which you can find in the package.json.

WebSocket channels

Many WebSocket libraries allow sending data in channels. At its core, a channel is just a field in the message (e.g. { channel: "foo", data: ... }), allowing the server and app to tell where a message belongs.

We'll need 5 channels:

  • start_call: signals that the call should be started
  • webrtc_ice_candidate: exchange ICE candidates
  • webrtc_offer: send the WebRTC offer
  • webrtc_answer: send the WebRTC answer
  • login: let the server know who you are

The browser implementation of WebSockets lacks the ability to tell the server who you are, e.g. adding an Authorization header with your token isn't possible. We could pass our token through the WebSocket's URL as a query parameter, but that implies it'll be logged on the web server and potentially cached by the browser - we don't want this.

Instead, we'll use a separate login channel, where we'll just send our name. This could be a token or anything else, but for simplicity we'll assume our name is secure and unique enough.

As we're using TypeScript, we can easily define interfaces for our messages, so we can safely exchange messages without worrying about typos:

interface LoginWebSocketMessage {
  channel: "login";
  name: string;
}

interface StartCallWebSocketMessage {
  channel: "start_call";
  otherPerson: string;
}

interface WebRTCIceCandidateWebSocketMessage {
  channel: "webrtc_ice_candidate";
  candidate: RTCIceCandidate;
  otherPerson: string;
}

interface WebRTCOfferWebSocketMessage {
  channel: "webrtc_offer";
  offer: RTCSessionDescription;
  otherPerson: string;
}

interface WebRTCAnswerWebSocketMessage {
  channel: "webrtc_answer";
  answer: RTCSessionDescription;
  otherPerson: string;
}

// these 4 messages are related to the call itself, so we can
// bundle them in this type union; maybe we'll need that later
type WebSocketCallMessage =
  StartCallWebSocketMessage
  | WebRTCIceCandidateWebSocketMessage
  | WebRTCOfferWebSocketMessage
  | WebRTCAnswerWebSocketMessage;

// our overall type union for websocket messages in our backend spans
// both login and call messages
type WebSocketMessage = LoginWebSocketMessage | WebSocketCallMessage;

As we're using union types here, we can later use the TypeScript compiler to identify which message we received from just inspecting the channel property. If message.channel === "start_call", the compiler will infer that the message must be of type StartCallWebSocketMessage. Neat.

Exposing a WebSocket

We'll use express-ws to expose a WebSocket from our server, which happens to be an express app, served via http.createServer():

const app = express();
const server = createServer(app);

// serve our webapp from the public folder
app.use("/", express.static("public"));

const wsApp = expressWs(app, server).app;

// expose websocket under /ws
// handleSocketConnection is explained later
wsApp.ws("/ws", handleSocketConnection);

const port = process.env.PORT || 3000;
server.listen(port, () => {
  console.log(`server started on http://localhost:${port}`);
});

Our app will now run on port 3000 (or whatever we provide via PORT), expose a WebSocket on /ws and serve our webapp from the public directory.

User management

As video calling usually requires > 1 person, we also need to keep track of currently connected users. To do so, we can introduce an array connectedUsers, which we update every time someone connects to the WebSocket:

interface User {
  socket: WebSocket;
  name: string;
}

let connectedUsers: User[] = [];

Additionally, we should add helper functions to find users by their name or socket, for our own convenience:

function findUserBySocket(socket: WebSocket): User | undefined {
  return connectedUsers.find((user) => user.socket === socket);
}

function findUserByName(name: string): User | undefined {
  return connectedUsers.find((user) => user.name === name);
}

For this post we'll just assume there are no bad actors. So whenever a socket connects, it's a person trying to call someone soon. Our handleSocketConnection looks something like this:

function handleSocketConnection(socket: WebSocket): void {
  socket.addEventListener("message", (event) => {
    const json = JSON.parse(event.data.toString());

    // handleMessage will be explained later
    handleMessage(socket, json);
  });

  socket.addEventListener("close", () => {
    // remove the user from our user list
    connectedUsers = connectedUsers.filter((user) => {
      if (user.socket === socket) {
        console.log(`${user.name} disconnected`);
        return false;
      }

      return true;
    });
  });
}

WebSocket messages can be strings or Buffers, so we need to parse them first. If it's a Buffer, calling toString() will convert it to a string.

Forwarding messages

Our signalling server essentially forwards messages between both calling parties, as shown in the sequence diagram above. To do this, we can create another convenience function forwardMessageToOtherPerson, which sends the incoming message to the otherPerson specified in the message. We also replace the otherPerson field with the sender's name, so the receiving side knows who the message came from:

function forwardMessageToOtherPerson(sender: User, message: WebSocketCallMessage): void {
  const receiver = findUserByName(message.otherPerson);
  if (!receiver) {
    // in case this user doesn't exist, don't do anything
    return;
  }

  const json = JSON.stringify({
    ...message,
    otherPerson: sender.name,
  });

  receiver.socket.send(json);
}

In our handleMessage, we can log in our user and potentially forward their messages to the other person. Note that all call-related messages could be combined under the default statement, but for the sake of more meaningful logging, I listed each channel explicitly:

function handleMessage(socket: WebSocket, message: WebSocketMessage): void {
  const sender = findUserBySocket(socket) || {
    name: "[unknown]",
    socket,
  };

  switch (message.channel) {
    case "login":
      console.log(`${message.name} joined`);
      connectedUsers.push({ socket, name: message.name });
      break;

    case "start_call":
      console.log(`${sender.name} started a call with ${message.otherPerson}`);
      forwardMessageToOtherPerson(sender, message);
      break;

    case "webrtc_ice_candidate":
      console.log(`received ice candidate from ${sender.name}`);
      forwardMessageToOtherPerson(sender, message);
      break;

    case "webrtc_offer":
      console.log(`received offer from ${sender.name}`);
      forwardMessageToOtherPerson(sender, message);
      break;

    case "webrtc_answer":
      console.log(`received answer from ${sender.name}`);
      forwardMessageToOtherPerson(sender, message);
      break;

    default:
      console.log("unknown message", message);
      break;
  }
}

That's it for the server. When someone connects to the socket, they can login and as soon as they start the WebRTC handshake, messages will be forwarded to the person they're calling.

Web app

The web app consists of the index.html and a JavaScript file web.js. Both are served from the public directory of the app, as shown above. The most important parts of the web app are the two <video /> tags, which will be used to display the local and remote video streams. To get a consistent video feed, autoplay needs to be set on the video, or it'll be stuck on the initial frame:

<!DOCTYPE html>
<html>
  <body>
    <button id="call-button">Call someone</button>

    <div id="video-container">
      <div id="videos">
        <video id="remote-video" autoplay></video>
        <video id="local-video" autoplay></video>
      </div>
    </div>

    <script type="text/javascript" src="web.js"></script>
  </body>
</html>

Connecting to the signalling server

Our WebSocket is listening on the same server as our web app, so we can leverage location.host, which includes both hostname and port, to build our socket url. Once connected, we need to login, as WebSockets don't provide additional authentication possibilities:

// generates a username like "user42"
const randomUsername = `user${Math.floor(Math.random() * 100)}`;
const username = prompt("What's your name?", randomUsername);
const socketUrl = `ws://${location.host}/ws`;
const socket = new WebSocket(socketUrl);

// convenience method for sending json without calling JSON.stringify everytime
function sendMessageToSignallingServer(message) {
  const json = JSON.stringify(message);
  socket.send(json);
}

socket.addEventListener("open", () => {
  console.log("websocket connected");
  sendMessageToSignallingServer({
    channel: "login",
    name: username,
  });
});

socket.addEventListener("message", (event) => {
  const message = JSON.parse(event.data.toString());
  handleMessage(message);
});

Setting up WebRTC

Now this is what we've been waiting for: WebRTC. In JavaScript, there's an RTCPeerConnection class, which we can use to create WebRTC connections. We need to provide servers for ICE candidate discovery, for instance stun.stunprotocol.org:

const webrtc = new RTCPeerConnection({
  iceServers: [
    {
      urls: [
        "stun:stun.stunprotocol.org",
      ],
    },
  ],
});

webrtc.addEventListener("icecandidate", (event) => {
  if (!event.candidate) {
    return;
  }

  // when we discover a candidate, send it to the other
  // party through the signalling server
  sendMessageToSignallingServer({
    channel: "webrtc_ice_candidate",
    candidate: event.candidate,
    otherPerson,
  });
});

Sending and receiving media tracks

Video calling works best when there's video, so we need to send our video stream somehow. Here, the user media API comes in handy: it provides a function to retrieve the user's webcam stream. Note that navigator.mediaDevices is only available in secure contexts, i.e. pages served over https:// or from localhost.

navigator
  .mediaDevices
  .getUserMedia({ video: true })
  .then((localStream) => {
    // display our local video in the respective tag
    const localVideo = document.getElementById("local-video");
    localVideo.srcObject = localStream;

    // our local stream can provide different tracks, e.g. audio and
    // video. even though we're just using the video track, we should
    // add all tracks to the webrtc connection
    for (const track of localStream.getTracks()) {
      webrtc.addTrack(track, localStream);
    }
  });

webrtc.addEventListener("track", (event) => {
  // we received a media stream from the other person. as we're sure 
  // we're sending only video streams, we can safely use the first
  // stream we got. by assigning it to srcObject, it'll be rendered
  // in our video tag, just like a normal video
  const remoteVideo = document.getElementById("remote-video");
  remoteVideo.srcObject = event.streams[0];
});

Performing the WebRTC handshake

Our handleMessage function closely follows the sequence diagram above: When Bob receives a start_call message, he sends a WebRTC offer to the signalling server. Alice receives this and replies with her WebRTC answer, which Bob also receives through the signalling server. Once this is done, both exchange ICE candidates.

The WebRTC API is built around Promises, so it's easiest to declare an async function and await inside it:

// we'll need to remember the other person we're calling,
// thus we'll store it in a global variable
let otherPerson;

async function handleMessage(message) {
  switch (message.channel) {
    case "start_call":
      // done by Bob: create a webrtc offer for Alice
      otherPerson = message.otherPerson;
      console.log(`receiving call from ${otherPerson}`);

      const offer = await webrtc.createOffer();
      await webrtc.setLocalDescription(offer);
      sendMessageToSignallingServer({
        channel: "webrtc_offer",
        offer,
        otherPerson,
      });
      break;

    case "webrtc_offer":
      // done by Alice: react to Bob's webrtc offer
      console.log("received webrtc offer");
      // we might want to create a new RTCSessionDescription
      // from the incoming offer, but as JavaScript doesn't
      // care about types anyway, this works just fine:
      await webrtc.setRemoteDescription(message.offer);

      const answer = await webrtc.createAnswer();
      await webrtc.setLocalDescription(answer);

      sendMessageToSignallingServer({
        channel: "webrtc_answer",
        answer,
        otherPerson,
      });
      break;

    case "webrtc_answer":
      // done by Bob: use Alice's webrtc answer
      console.log("received webrtc answer");
      await webrtc.setRemoteDescription(message.answer);
      break;

    case "webrtc_ice_candidate":
      // done by both Alice and Bob: add the other one's
      // ice candidates
      console.log("received ice candidate");
      // we could also "revive" this as a new RTCIceCandidate
      await webrtc.addIceCandidate(message.candidate);
      break;

    default:
      console.log("unknown message", message);
      break;
  }
}

Starting a call from a button

The main thing we're still missing is starting the call from the "Call someone" button. All we need to do is send a start_call message to our signalling server; everything else will be handled by our WebSocket and handleMessage:

const callButton = document.getElementById("call-button");
callButton.addEventListener("click", () => {
  otherPerson = prompt("Who you gonna call?");
  sendMessageToSignallingServer({
    channel: "start_call",
    otherPerson,
  });
});

Conclusion

If we open the app on Chrome and Safari at the same time, we can call ourselves on different browsers. That's kinda cool!

But besides calling, there's a lot more to do that wasn't covered by this post, e.g. cleaning up our connection, which I might cover in a future post (i.e. using React hooks for WebRTC and WebSockets). Feel free to check out the repo, where you can retrace everything presented in this post. Thanks for reading!
