For a recent university project, our team was tasked with delivering a video calling feature for both our iOS and web app. There are many solutions out there that promise video calling, but only a few are free, and most work for just one platform. As we had to build it for iOS and the web, we decided to use plain WebRTC, 'cause "can't be that hard, right ¯\_(ツ)_/¯"
tl;dr
I remember myself skimming through blog posts and tutorials, trying to find the minimum required steps, eventually even reading through the Signal iOS repository. So here's the bare gist of what you need to know to get going with WebRTC (or at least search for the things that don't work in your project):
- STUN is similar to `traceroute`: it collects the "hops" between you and a STUN server; those hops are then called ICE candidates
- ICE candidates are basically `ip:port` pairs; you can "contact" your app using these candidates
- you'll need a duplex connection to exchange data between the calling parties. Consider using a WebSocket server, since it's the easiest way to achieve this
- when one party "discovers" an ICE candidate, send it to the other party via the WebSocket/your duplex channel
- get your device's media tracks and add them to your local `RTCPeerConnection`
- create a WebRTC offer on your `RTCPeerConnection`, and send it to the other party
- receive and use the offer, then reply with your answer to it
If this didn't help you with your problems, or you're generally interested in WebRTC, keep on reading. We'll first look at what WebRTC is and then we'll build ourselves a small video chat.
What is WebRTC?
I'll just borrow the "about" section from the official website:
WebRTC is a free, open project that provides browsers and mobile applications with Real-Time Communications (RTC) capabilities via simple APIs. The WebRTC components have been optimized to best serve this purpose.
— webrtc.org
In a nutshell, WebRTC allows you to build apps that exchange data in real time using a peer-to-peer connection. The data can be audio, video, or anything you want. For instance, Signal calls are done over pure WebRTC and, due to the peer-to-peer nature, mostly work without routing your call data through a third party, as e.g. Skype does nowadays.
STUN
To establish the peer-to-peer connection between two calling parties, they need to know how to connect to each other. This is where STUN comes in. As mentioned above, it's similar to `traceroute`.

When you create a WebRTC client object in JavaScript, you need to provide `iceServers`, which essentially contain URLs of STUN servers. The client then goes through all hops until it reaches the STUN server. The following sequence diagram shows how it works in a simplified way:
The "further" a candidate is away from Alice (the more hops it takes to reach her), the higher its network cost is. `localhost:12345` is closer to her than `public_ip:45678`, so the `localhost` cost could be 10, whereas the `public_ip` one could be 100. WebRTC tries to establish a connection with the lowest network cost, to ensure high bandwidth.
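On the wire, an ICE candidate is a single line of text. The following sketch parses a hypothetical host candidate to pull out the address, port, and type; the exact candidate string here is made up, but the space-separated layout (foundation, component, transport, priority, ip, port, `typ`, type) is how candidates look:

```javascript
// a hypothetical host candidate, as discovered during ICE gathering
const candidateString =
  "candidate:842163049 1 udp 2122260223 192.168.1.2 56143 typ host";

// candidate lines are space-separated:
// foundation, component, transport, priority, ip, port, "typ", type, ...
function parseCandidate(line) {
  const [foundation, component, transport, priority, ip, port, , type] =
    line.split(" ");
  return { transport, priority: Number(priority), ip, port: Number(port), type };
}

console.log(parseCandidate(candidateString));
// { transport: "udp", priority: 2122260223, ip: "192.168.1.2", port: 56143, type: "host" }
```

Candidates of type `host` are local addresses, while `srflx` (server-reflexive) candidates are the public addresses discovered via the STUN server.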
Offers, answers and tracks
If you want to FaceTime with a friend, they might be interested in knowing how you're calling them, i.e. they want to see whether you're using only audio or video, or even if you're not using FaceTime at all and just call them from your landline.
WebRTC offers are similar to this: you specify what you'll be sending in the upcoming connection. So when you call `peer.createOffer()`, it checks which tracks, e.g. video or audio, are present and includes them in the offer. Once the called party receives the offer, it calls `peer.createAnswer()`, specifying its own capabilities, e.g. whether it'll also send audio and video.
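Under the hood, an offer is just a session description in the SDP format. A heavily trimmed, hypothetical excerpt of an offer carrying one audio and one video track might look like this (real offers contain many more attributes):

```
v=0
o=- 4611731400430051336 2 IN IP4 127.0.0.1
s=-
m=audio 9 UDP/TLS/RTP/SAVPF 111
a=sendrecv
m=video 9 UDP/TLS/RTP/SAVPF 96
a=sendrecv
```

Each `m=` line describes one media section, and `a=sendrecv` indicates that the party wants to both send and receive that kind of media.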
Signalling
An important part of WebRTC is exchanging information before the peer-to-peer connection is established. Both parties need to exchange an offer and answer, and they need to know the other side's ICE candidates; otherwise they won't know where to send their audio and video streams.
That's where signalling comes in: you need to send said information to both parties. You can use anything you want to do this, but it's easiest to use a duplex connection that e.g. WebSockets provide. Using WebSockets, you'll be "notified" whenever there's an update from your signalling server.
A typical WebRTC handshake looks something like this:
First, Alice signals she wants to call Bob, so both parties initiate the WebRTC "handshake". They both acquire their ICE candidates, which they send to the other party via the signalling server. At some point, Alice creates an offer and sends it to Bob. It doesn't matter who creates the offer first (i.e. Alice or Bob), but the other party must create the answer to the offer. As both Alice and Bob know how to contact each other and what data will be sent, the peer-to-peer connection is established and they can have their conversation.
Building it
Now that we know how WebRTC works, we "just" have to build it. This post will focus only on web clients; if there's interest in an iOS version in the comments, I'll summarise the pitfalls in a new post. I also implemented the web client as a React hook `useWebRTC`, which I might create a post for as well.
The server will be in TypeScript, whereas the webapp will be plain JavaScript to not have a separate build process. Both will use only plain WebSockets and WebRTC - no magic there. You can find the sources to this post on GitHub.
Server
We'll use `express`, `express-ws` and a bunch of other libraries, which you can find in the package.json.
WebSocket channels
Many WebSocket libraries allow sending data in channels. At its core, a channel is just a field in the message (e.g. `{ channel: "foo", data: ... }`), allowing the server and app to distinguish which part of the app a message belongs to.
We'll need 5 channels:
- `start_call`: signals that the call should be started
- `webrtc_ice_candidate`: exchange ICE candidates
- `webrtc_offer`: send the WebRTC offer
- `webrtc_answer`: send the WebRTC answer
- `login`: let the server know who you are
The browser implementation of WebSockets lacks the ability to send who you are, e.g. adding an `Authorization` header with your token isn't possible. We could pass our token through the WebSocket's URL as a query parameter, but that implies it'll be logged on the web server and potentially cached by the browser - we don't want this.
Instead, we'll use a separate `login` channel, where we'll just send our name. This could be a token or anything else, but for simplicity we'll assume our name is secure and unique enough.
As we're using TypeScript, we can easily define interfaces for our messages, so we can safely exchange messages without worrying about typos:
interface LoginWebSocketMessage {
  channel: "login";
  name: string;
}

interface StartCallWebSocketMessage {
  channel: "start_call";
  otherPerson: string;
}

interface WebRTCIceCandidateWebSocketMessage {
  channel: "webrtc_ice_candidate";
  candidate: RTCIceCandidate;
  otherPerson: string;
}

interface WebRTCOfferWebSocketMessage {
  channel: "webrtc_offer";
  offer: RTCSessionDescription;
  otherPerson: string;
}

interface WebRTCAnswerWebSocketMessage {
  channel: "webrtc_answer";
  answer: RTCSessionDescription;
  otherPerson: string;
}

// these 4 messages are related to the call itself, thus we can
// bundle them in this type union, maybe we need that later
type WebSocketCallMessage =
  | StartCallWebSocketMessage
  | WebRTCIceCandidateWebSocketMessage
  | WebRTCOfferWebSocketMessage
  | WebRTCAnswerWebSocketMessage;

// our overall type union for websocket messages in our backend spans
// both login and call messages
type WebSocketMessage = LoginWebSocketMessage | WebSocketCallMessage;
As we're using union types here, we can later use the TypeScript compiler to identify which message we received just from inspecting the `channel` property. If `message.channel === "start_call"`, the compiler will infer that the message must be of type `StartCallWebSocketMessage`. Neat.
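To illustrate the narrowing, here's a minimal, self-contained sketch with just two of the message types; the `describeMessage` helper is made up for this example and not part of the server:

```typescript
interface LoginWebSocketMessage {
  channel: "login";
  name: string;
}

interface StartCallWebSocketMessage {
  channel: "start_call";
  otherPerson: string;
}

type WebSocketMessage = LoginWebSocketMessage | StartCallWebSocketMessage;

function describeMessage(message: WebSocketMessage): string {
  if (message.channel === "start_call") {
    // narrowed to StartCallWebSocketMessage: otherPerson is accessible
    return `call with ${message.otherPerson}`;
  }
  // narrowed to LoginWebSocketMessage: name is accessible
  return `login as ${message.name}`;
}

console.log(describeMessage({ channel: "login", name: "alice" }));
// "login as alice"
```

Accessing `message.name` inside the `start_call` branch would be a compile error, which is exactly the typo protection we're after.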
Exposing a WebSocket
We'll use `express-ws` to expose a WebSocket from our server, which happens to be an express app, served via `http.createServer()`:
const app = express();
const server = createServer(app);

// serve our webapp from the public folder
app.use("/", express.static("public"));

const wsApp = expressWs(app, server).app;
// expose websocket under /ws
// handleSocketConnection is explained later
wsApp.ws("/ws", handleSocketConnection);

const port = process.env.PORT || 3000;
server.listen(port, () => {
  console.log(`server started on http://localhost:${port}`);
});
Our app will now run on port 3000 (or whatever we provide via `PORT`), expose a WebSocket on `/ws` and serve our webapp from the `public` directory.
User management
As video calling usually requires > 1 person, we also need to keep track of currently connected users. To do so, we can introduce an array `connectedUsers`, which we update every time someone connects to the WebSocket:
interface User {
  socket: WebSocket;
  name: string;
}

let connectedUsers: User[] = [];
Additionally, we should add helper functions to find users by their name or socket, for our own convenience:
function findUserBySocket(socket: WebSocket): User | undefined {
  return connectedUsers.find((user) => user.socket === socket);
}

function findUserByName(name: string): User | undefined {
  return connectedUsers.find((user) => user.name === name);
}
For this post we'll just assume there are no bad actors, so whenever a socket connects, it's a person trying to call someone soon. Our `handleSocketConnection` looks somewhat like this:
function handleSocketConnection(socket: WebSocket): void {
  socket.addEventListener("message", (event) => {
    const json = JSON.parse(event.data.toString());
    // handleMessage will be explained later
    handleMessage(socket, json);
  });

  socket.addEventListener("close", () => {
    // remove the user from our user list
    connectedUsers = connectedUsers.filter((user) => {
      if (user.socket === socket) {
        console.log(`${user.name} disconnected`);
        return false;
      }
      return true;
    });
  });
}
WebSocket messages can be strings or `Buffer`s, so we need to parse them first. If it's a `Buffer`, calling `toString()` will convert it to a string.
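Since `toString()` on a string is a no-op, one normalisation line handles both cases; a quick sketch (runnable in Node, where `Buffer` is a global):

```javascript
// works for both string payloads and Buffer payloads
function parseSocketData(data) {
  return JSON.parse(data.toString());
}

const fromString = parseSocketData('{"channel":"login","name":"alice"}');
const fromBuffer = parseSocketData(Buffer.from('{"channel":"login","name":"alice"}'));

console.log(fromString.name); // "alice"
console.log(fromBuffer.name); // "alice"
```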
Forwarding messages
Our signalling server essentially forwards messages between both calling parties, as shown in the sequence diagram above. To do this, we can create another convenience function `forwardMessageToOtherPerson`, which sends the incoming message to the `otherPerson` specified in the message. For debugging, we also replace the `otherPerson` field with the name of the original sender:
function forwardMessageToOtherPerson(sender: User, message: WebSocketCallMessage): void {
  const receiver = findUserByName(message.otherPerson);
  if (!receiver) {
    // in case this user doesn't exist, don't do anything
    return;
  }

  const json = JSON.stringify({
    ...message,
    otherPerson: sender.name,
  });
  receiver.socket.send(json);
}
In our `handleMessage`, we log our user in and potentially forward their messages to the other person. Note that all call-related messages could be combined under the `default` statement, but for the sake of more meaningful logging, I explicitly list each channel:
function handleMessage(socket: WebSocket, message: WebSocketMessage): void {
  const sender = findUserBySocket(socket) || {
    name: "[unknown]",
    socket,
  };

  switch (message.channel) {
    case "login":
      console.log(`${message.name} joined`);
      connectedUsers.push({ socket, name: message.name });
      break;
    case "start_call":
      console.log(`${sender.name} started a call with ${message.otherPerson}`);
      forwardMessageToOtherPerson(sender, message);
      break;
    case "webrtc_ice_candidate":
      console.log(`received ice candidate from ${sender.name}`);
      forwardMessageToOtherPerson(sender, message);
      break;
    case "webrtc_offer":
      console.log(`received offer from ${sender.name}`);
      forwardMessageToOtherPerson(sender, message);
      break;
    case "webrtc_answer":
      console.log(`received answer from ${sender.name}`);
      forwardMessageToOtherPerson(sender, message);
      break;
    default:
      console.log("unknown message", message);
      break;
  }
}
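To see the login and forwarding logic in action without a browser, we can drive a plain-JavaScript version of it with stub sockets that only record what's sent to them. This is just a sketch: the stubs are made up and implement nothing but `send`:

```javascript
// minimal stand-ins for real WebSockets: they just record outgoing messages
function makeStubSocket() {
  return { sent: [], send(json) { this.sent.push(json); } };
}

let connectedUsers = [];

function findUserByName(name) {
  return connectedUsers.find((user) => user.name === name);
}

function findUserBySocket(socket) {
  return connectedUsers.find((user) => user.socket === socket);
}

function forwardMessageToOtherPerson(sender, message) {
  const receiver = findUserByName(message.otherPerson);
  if (!receiver) return;
  receiver.socket.send(JSON.stringify({ ...message, otherPerson: sender.name }));
}

function handleMessage(socket, message) {
  const sender = findUserBySocket(socket) || { name: "[unknown]", socket };
  switch (message.channel) {
    case "login":
      connectedUsers.push({ socket, name: message.name });
      break;
    default:
      // every other channel is a call message and just gets forwarded
      forwardMessageToOtherPerson(sender, message);
  }
}

const alice = makeStubSocket();
const bob = makeStubSocket();
handleMessage(alice, { channel: "login", name: "alice" });
handleMessage(bob, { channel: "login", name: "bob" });
handleMessage(alice, { channel: "start_call", otherPerson: "bob" });

// bob received the start_call, with otherPerson rewritten to "alice"
console.log(JSON.parse(bob.sent[0]));
```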
That's it for the server. When someone connects to the socket, they can log in, and as soon as they start the WebRTC handshake, messages will be forwarded to the person they're calling.
Web app
The web app consists of an `index.html` and a JavaScript file `web.js`. Both are served from the `public` directory of the app, as shown above. The most important part of the web app are the two `<video />` tags, which will be used to display the local and remote video streams. To get a consistent video feed, `autoplay` needs to be set on the video, or it'll be stuck on the initial frame:
<!DOCTYPE html>
<html>
  <body>
    <button id="call-button">Call someone</button>
    <div id="video-container">
      <div id="videos">
        <video id="remote-video" autoplay></video>
        <video id="local-video" autoplay></video>
      </div>
    </div>
    <script type="text/javascript" src="web.js"></script>
  </body>
</html>
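One pitfall worth hedging against: most browsers only honour `autoplay` for videos that are muted, and muting the local preview also prevents audio feedback once audio tracks are added. A safer variant of the local video tag could therefore look like this (`playsinline` additionally keeps iOS Safari from forcing the video fullscreen):

```html
<video id="local-video" autoplay muted playsinline></video>
```

The remote video must stay unmuted, or you wouldn't hear the other person; browsers generally allow its autoplay once the user has interacted with the page, e.g. by clicking the call button.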
Connecting to the signalling server
Our WebSocket is listening on the same server as our web app, so we can leverage `location.host`, which includes both hostname and port, to build our socket URL. Once connected, we need to log in, as WebSockets don't provide additional authentication possibilities:
// generates a username like "user42"
const randomUsername = `user${Math.floor(Math.random() * 100)}`;
const username = prompt("What's your name?", randomUsername);

const socketUrl = `ws://${location.host}/ws`;
const socket = new WebSocket(socketUrl);

// convenience method for sending json without calling JSON.stringify every time
function sendMessageToSignallingServer(message) {
  const json = JSON.stringify(message);
  socket.send(json);
}

socket.addEventListener("open", () => {
  console.log("websocket connected");
  sendMessageToSignallingServer({
    channel: "login",
    name: username,
  });
});

socket.addEventListener("message", (event) => {
  const message = JSON.parse(event.data.toString());
  handleMessage(message);
});
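One caveat: a page served over HTTPS isn't allowed to open an insecure `ws://` socket. A small helper that picks the scheme based on the page protocol avoids this; it's a sketch, with `location` passed in as a plain object so it can be exercised outside the browser:

```javascript
// choose ws:// for http pages and wss:// for https pages
function buildSocketUrl(location) {
  const scheme = location.protocol === "https:" ? "wss" : "ws";
  return `${scheme}://${location.host}/ws`;
}

console.log(buildSocketUrl({ protocol: "https:", host: "example.com" }));
// "wss://example.com/ws"
```

In the browser you'd simply call `buildSocketUrl(window.location)` when constructing the `WebSocket`.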
Setting up WebRTC
Now this is what we've been waiting for: WebRTC. In JavaScript, there's an `RTCPeerConnection` class, which we can use to create WebRTC connections. We need to provide servers for ICE candidate discovery, for instance `stun.stunprotocol.org`:
const webrtc = new RTCPeerConnection({
  iceServers: [
    {
      urls: [
        "stun:stun.stunprotocol.org",
      ],
    },
  ],
});

webrtc.addEventListener("icecandidate", (event) => {
  if (!event.candidate) {
    return;
  }

  // when we discover a candidate, send it to the other
  // party through the signalling server
  sendMessageToSignallingServer({
    channel: "webrtc_ice_candidate",
    candidate: event.candidate,
    otherPerson,
  });
});
Sending and receiving media tracks
Video calling works best when there's video, so we need to send our video stream somehow. Here, the `getUserMedia` API comes in handy, which retrieves the user's webcam stream. Note that it's only available in secure contexts, i.e. pages served via HTTPS or from localhost.
navigator
  .mediaDevices
  .getUserMedia({ video: true })
  .then((localStream) => {
    // display our local video in the respective tag
    const localVideo = document.getElementById("local-video");
    localVideo.srcObject = localStream;

    // our local stream can provide different tracks, e.g. audio and
    // video. even though we're just using the video track, we should
    // add all tracks to the webrtc connection
    for (const track of localStream.getTracks()) {
      webrtc.addTrack(track, localStream);
    }
  });

webrtc.addEventListener("track", (event) => {
  // we received a media stream from the other person. as we're sure
  // we're sending only video streams, we can safely use the first
  // stream we got. by assigning it to srcObject, it'll be rendered
  // in our video tag, just like a normal video
  const remoteVideo = document.getElementById("remote-video");
  remoteVideo.srcObject = event.streams[0];
});
Performing the WebRTC handshake
Our `handleMessage` function closely follows the sequence diagram above: when Bob receives a `start_call` message, he sends a WebRTC offer to the signalling server. Alice receives this and replies with her WebRTC answer, which Bob also receives through the signalling server. Once this is done, both exchange ICE candidates.
The WebRTC API is built around `Promise`s, thus it's easiest to declare an `async` function and `await` inside it:
// we'll need to remember the other person we're calling,
// thus we'll store it in a global variable
let otherPerson;

async function handleMessage(message) {
  switch (message.channel) {
    case "start_call":
      // done by Bob: create a webrtc offer for Alice
      otherPerson = message.otherPerson;
      console.log(`receiving call from ${otherPerson}`);

      const offer = await webrtc.createOffer();
      await webrtc.setLocalDescription(offer);
      sendMessageToSignallingServer({
        channel: "webrtc_offer",
        offer,
        otherPerson,
      });
      break;
    case "webrtc_offer":
      // done by Alice: react to Bob's webrtc offer
      console.log("received webrtc offer");
      // we might want to create a new RTCSessionDescription
      // from the incoming offer, but as JavaScript doesn't
      // care about types anyway, this works just fine:
      await webrtc.setRemoteDescription(message.offer);

      const answer = await webrtc.createAnswer();
      await webrtc.setLocalDescription(answer);
      sendMessageToSignallingServer({
        channel: "webrtc_answer",
        answer,
        otherPerson,
      });
      break;
    case "webrtc_answer":
      // done by Bob: use Alice's webrtc answer
      console.log("received webrtc answer");
      await webrtc.setRemoteDescription(message.answer);
      break;
    case "webrtc_ice_candidate":
      // done by both Alice and Bob: add the other one's
      // ice candidates
      console.log("received ice candidate");
      // we could also "revive" this as a new RTCIceCandidate
      await webrtc.addIceCandidate(message.candidate);
      break;
    default:
      console.log("unknown message", message);
      break;
  }
}
Starting a call from a button
The main thing we're still missing, is starting the call from the "Call someone" button. All we need to do, is send a start_call
message to our signalling server, everything else will be handled by our WebSocket and handleMessage
:
const callButton = document.getElementById("call-button");

callButton.addEventListener("click", () => {
  otherPerson = prompt("Who you gonna call?");
  sendMessageToSignallingServer({
    channel: "start_call",
    otherPerson,
  });
});
Conclusion
If we open the app on Chrome and Safari at the same time, we can call ourselves on different browsers. That's kinda cool!
But besides calling, there's a lot more to do that wasn't covered by this post, e.g. cleaning up our connection, which I might cover in a future post (i.e. using React Hooks for WebRTC and WebSockets). Feel free to check out the repo, where you can retrace everything presented in this post. Thanks for reading!