Olzhas Kurikov
How we build a-la Google Meet with NextJS, PeerJS and SocketIO

Hi there 🙂 

We are two friends, and we are going to share what our journey into the world of WebRTC and WebSockets was like. We had a lot of struggles and disappointment, but also fun. We hope this post will be helpful, or at least informative, for whoever is reading.

We are both frontend engineers, hence to build the app we went for the well-known NextJS as our core framework.

For styling we chose TailwindCSS, since we hadn't had any experience with it and wanted to play around with it.

After reading a couple of articles and watching some tutorial videos, we realised that dealing with WebRTC natively is quite cumbersome. PeerJS came in handy here to abstract away some of the configuration around WebRTC. When implementing peer-to-peer communication, there must be a signalling server to keep the state (muted, camera off, etc.) in sync between peers; SocketIO played the role of that signalling server. A basic authentication layer was integrated via Auth0 to support certain features.
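The server side of this signalling layer is not shown in the article. As a rough pseudocode sketch (the event names match the ones used throughout this post; the relay logic itself is our assumption), a SocketIO signalling server mostly forwards room events:

```js
// signalling server
// pseudocode — a minimal sketch, not the actual server code

io.on('connection', (socket) => {
  socket.on('room:join', ({ roomId, user }) => {
    socket.join(roomId);
    // notify everyone already in the room
    socket.to(roomId).emit('user:joined', user);
  });

  // stream-state changes are simply relayed to the rest of the room
  socket.on('user:toggle-audio', (userId) => {
    socket.to(/* roomId of this socket */).emit('user:toggle-audio', userId);
  });
});
```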

In this article we are going to develop the following features:

  • a lobby page to set up your initial stream settings
  • creation of a room to join
  • meeting host abilities
  • sharing your screen with others
  • turning off your device's light indicator when the video stream is off
  • visual indication of the active speaker in the room
  • messaging
  • a list of participants and their statuses

Excited? Let’s dive into it

Initial setup

Before diving into the implementation details of each feature, we would like to give you a visual representation of the folder structure. It should help you navigate while reading.

|- app
|----| index.tsx
|- components
|----| lobby.tsx
|----| control-panel.tsx
|----| chat.tsx
|----| status.tsx
|- contexts
|----| users-connection.tsx
|----| users-settings.tsx
|- hooks
|----| use-is-audio-active.ts
|----| use-media-stream.ts
|----| use-peer.ts
|----| use-screen.ts
|- pages
|----| index.tsx
|----| room
|--------| [roomId].tsx

Lobby Page

Once the user has landed on the home page, they have two options: create a new room or join an existing one. Either way, they go through the lobby page to set up their initial stream settings, such as muting their mic or turning their video off before entering.

// pages/room/[roomId].tsx

export default function RoomPage() {
  const [isLobby, setIsLobby] = useState(true);
  const { stream } = useMediaStream();

  return isLobby ? (
    <Lobby stream={stream} onJoinRoom={() => setIsLobby(false)} />
  ) : (
    <Room stream={stream} />
  );
}

As you may have noticed, we now have a stream. A stream can contain audio and/or video, and it is a sequence of media data over time. The Media Capture and Streams API allows us to create and manipulate streams. The stream itself consists of multiple tracks, mainly audio and video.
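The stream comes from the useMediaStream hook, whose acquisition step is not shown in this post. At its core (pseudocode; the hook's internals are our assumption) it boils down to a getUserMedia call:

```js
// hooks/use-media-stream.ts
// pseudocode — acquiring the initial stream

const stream = await navigator.mediaDevices.getUserMedia({
  audio: true,
  video: true,
});

stream.getTracks(); // [audio track, video track]
```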

Here is the Lobby component:

// components/lobby.tsx
// pseudocode

const Lobby = ({
  stream,
  onJoinRoom,
}: {
  stream: MediaStream;
  onJoinRoom: () => void;
}) => {
  const { toggleAudio, toggleVideo } = useMediaStream(stream);

  return (
    <>
      {/* in real code srcObject is set via a ref, not as a JSX attribute */}
      <video srcObject={stream} />
      <button onClick={toggleVideo}>Toggle video</button>
      <button onClick={toggleAudio}>Toggle audio</button>
      <button onClick={onJoinRoom}>Join</button>
    </>
  );
};

Note (from MDN):
The enabled property on the [MediaStreamTrack](https://developer.mozilla.org/en-US/docs/Web/API/MediaStreamTrack) interface is a Boolean value which is true if the track is allowed to render the source stream or false if it is not. This can be used to intentionally mute a track. When enabled, a track's data is output from the source to the destination; otherwise, empty frames are output.

With that said, toggling the enabled property does not require any syncing process between peers, as it happens automatically.
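As a tiny illustration (not the article's actual hook code), muting boils down to flipping enabled on a track:

```typescript
// flip the enabled flag; while false, the track keeps producing empty frames,
// so the peer connection does not need to be renegotiated
function toggleTrack(track: { enabled: boolean }): boolean {
  track.enabled = !track.enabled;
  return track.enabled;
}

// works on a real MediaStreamTrack or any object with an `enabled` field
const track = { enabled: true };
toggleTrack(track); // muted: enabled is now false
toggleTrack(track); // unmuted: enabled is true again
```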

Going inside the room

To make our lives easier, let's imagine the user entered with both audio and video on. Right after entering the room, Peer and Socket entities are created: the Peer to connect and share a stream with other users, and the Socket to transport the stream's state.

To create a peer, we are going to need roomId (from useRouter) and user (from useUser).

// hooks/use-peer.ts
// core part of the code

// open connection
peer.on('open', (id: PeerId) => {
  // tell others new user joined the room
  socket.emit('room:join', {
    roomId, // which room to connect to
    user: { id, name: user.name, muted, visible } // joining user's data
  });
});

Below you can find a pseudocode version of the Room page/component:

// app/index.tsx
// pseudocode

// stream comes from Lobby page
export default function App({ stream }: { stream: MediaStream }) {
  const socket = useContext(SocketContext);
  const peer = usePeer(stream);

  return (
    <UsersStateContext.Provider>
      <UsersConnectionContext.Provider value={{ peer, stream }}>
        <MyStream />
        <OthersStream />
        <ControlPanel />
      </UsersConnectionContext.Provider>
    </UsersStateContext.Provider>
  );
}

UsersStateContext takes responsibility for changing and broadcasting each user's state. UsersConnectionContext is all about communication: entering a room, setting up a connection between peers, leaving a room and screen sharing. Yep, screen sharing is part of communication, because sharing the user's screen creates a brand new stream. We will talk about it in more detail a bit later.

So, we are inside the room. Now all the users who were already here have to greet the newcomer, providing their stream and their name to display.

// contexts/users-connection.tsx

// event is listened on users who are already in the room
socket.on('user:joined', ({ id, name }: UserConfig) => {
  // call to newly joined user's id with my stream and my name
  const call = peer.call(id, stream, {
    metadata: {
      username: user.name,
    },
  });
});

In here, our socket basically says, "Yoyo, here we have a new guest in da house" via the user:joined event, and once it is triggered, every user already in the room calls the new guest to welcome them with their stream and name. In response, they receive the name, stream and id of the guest.

// contexts/users-connection.tsx

// action below happens on the newly joined user's device
peer.on('call', (call) => {
  const { peer: callerId, metadata } = call;
  const { username } = metadata;

  // answer the incoming call with our stream
  call.answer(stream);

  // stream, name and id of the user who was already in the room
  call.on('stream', (remoteStream) =>
    appendVideoStream({ id: callerId, name: username })(remoteStream)
  );
});

Success! We have established connection between peers, and now they can see and hear each other 🙂

Panel of control buttons

All good, but at this point no one can manipulate their stream. So it is time to dig into what can be changed for a given stream:

  • toggle audio
  • toggle video
  • leave the room
  • share screen
// app/index.tsx

<ControlPanel
  visible={visible}
  muted={muted}
  onLeave={() => router.push('/')}
  onToggle={onToggle}
/>

There is nothing special about "leaving the room": we just redirect to the home page, and the cleanup function returned from useEffect takes care of destroying the connection. The interesting bit, however, is the onToggle method.

// app/index.tsx

// only related part of the code
const { toggleAudio, toggleVideo } = useMediaStream(stream);
const { myId } = usePeer(stream);
const { startShare, stopShare, screenTrack } = useScreen(stream);

async function onToggle(
  kind: Kind,
  users?: MediaConnection[]
) {
  switch (kind) {
    case 'audio': {
      toggleAudio();
      socket.emit('user:toggle-audio', myId);
      return;
    }
    case 'video': {
      toggleVideo((newVideoTrack: MediaStreamTrack) =>
        users?.forEach((user) => replaceTrack(newVideoTrack)(user))
      );
      socket.emit('user:toggle-video', myId);
      return;
    }
    case 'screen': {
      if (screenTrack) {
        stopShare(screenTrack);
        socket.emit('user:stop-share-screen');
      } else {
        await startShare(
          () => socket.emit('user:share-screen'),
          () => socket.emit('user:stop-share-screen')
        );
      }
      return;
    }
    default:
      break;
  }
}

The toggleAudio and toggleVideo functions act in a similar way, with one tiny difference in toggleVideo that is described further below.

Screen sharing

// hooks/use-screen.ts

async function startShare(
  onstarted: () => void,
  onended: () => void
) {

  const screenStream = await navigator.mediaDevices.getDisplayMedia({
    video: true,
    audio: false,
  });
  const [screenTrack] = screenStream.getTracks();
  setScreenTrack(screenTrack);
  stream.addTrack(screenTrack);

  onstarted();

  // once screen is shared, tiny popup will appear with two buttons - Stop sharing, Hide
  // they are NOT custom, and come as they are
  // so .onended is triggered when user clicks "Stop sharing"
  screenTrack.onended = () => {
    stopShare(screenTrack);
    onended();
  };
}

To start sharing the screen, we need to create a new stream, take out its video track and extend our current stream with it. Eventually, we end up with three tracks: audio, the webcam video track and the screen track. Next, we notify the other users with the user:shared-screen event so they can reset their peer connection to receive the additional video track.

// contexts/users-connection.tsx

socket.on('user:shared-screen', () => {
  // peer connection reset
  peer.disconnect();
  peer.reconnect();
});

To stop sharing, we would need to stop the video track and remove it.

// hooks/use-screen.ts

function stopShare(screenTrack: MediaStreamTrack) {
  screenTrack.stop();
  stream.removeTrack(screenTrack);
}

Control actions of host user

The host user has permission to mute and disconnect other users. Those actions are visible once the user hovers over the video stream.

// components/video-container/index.tsx
// pseudocode

// wrapper around the stream takes a responsibility to render
// corresponding component or icon depending on the state of stream
function VideoContainer({
  children,
  id,
  onMutePeer,
  onRemovePeer
}: {
  children: React.ReactNode,
  id: PeerId,
  onMutePeer: (id: PeerId) => void,
  onRemovePeer: (id: PeerId) => void
}) {
  return (
    <>
      <div>
        {/* here goes video stream component */}
        {children}
      </div>

      {/* show host control panel if I created the room */}
      {isHost && (myId !== id) && (
        <HostControlPanel
          onMutePeer={() => onMutePeer && onMutePeer(id)}
          onRemovePeer={() => onRemovePeer && onRemovePeer(id)}
          isMuted={muted}
        />
      )}
    </>
  )
}

Muting another user is trivial, since the MediaStreamTrack API handles that for us; but in order to visually represent that the host has muted someone, we trigger a socket event with the muted user's id as payload.

However, there are multiple actions that get called once onRemovePeer is executed:

  • send the removed peer's id to the others, so they can show a respective icon or toast
  • remove the user from "my" room and update the streams state
  • close the peer connection
// contexts/users-connection.tsx

function leaveRoom(id: PeerId) {
  // notify everyone
  socket.emit('user:leave', id);

  // closing a peer connection
  users[id].close();

  // remove the user in ui
  setStreams((streams) => {
    const copy = {...streams};
    delete copy[id];
    return copy;
  });
}

Turning off web-cam light indicator

Here comes the tiny difference between toggleAudio and toggleVideo.

While turning the video stream off, we have to make sure that the indicator light goes off as well. That is what guarantees the web camera is actually switched off.

// hooks/use-media-stream.ts

// @param onTurnVideoOn - optional callback that receives the newly created video track
async function toggleVideo(onTurnVideoOn?: (track: MediaStreamTrack) => void) {
  const videoTrack = stream.getVideoTracks()[0];

  if (videoTrack.readyState === 'live') {
    videoTrack.enabled = false;
    videoTrack.stop(); // turns off web cam light indicator
  } else {
    const newStream = await navigator.mediaDevices.getUserMedia({
      video: true,
      audio: false,
    });
    const newVideoTrack = newStream.getVideoTracks()[0];

    if (typeof onTurnVideoOn === 'function') onTurnVideoOn(newVideoTrack);

    stream.removeTrack(videoTrack);
    stream.addTrack(newVideoTrack);

    setStream(stream);
  }
}

The enabled property knows nothing about the indicator light, so setting it to false does not turn the light off. The MediaStreamTrack interface has a stop method, which tells the browser that the track is no longer needed and changes its readyState to ended. Unfortunately, MediaStreamTrack does not have a start or restart method, as you might have expected. Therefore, to turn the camera back on, we create a new stream, take the video track from it and insert it into the old stream.

Wait, we are changing video track back and forth here without notifying other users in the room. Relax, replaceTrack got your back.

// app/index.tsx

// @param track - new track to replace old track
function replaceTrack(track: MediaStreamTrack) {
  return (peer: MediaConnection) => {
    const sender = peer.peerConnection
      .getSenders()
      .find((s) => s.track?.kind === track.kind);

    sender?.replaceTrack(track);
  }
}

Say our webcam is off. Now what happens when we turn it back on? An optional callback is passed into toggleVideo, which takes a single parameter: the new video track. In the callback's body we replace the old track with the new one for every user in the room. To achieve this, we use the getSenders method of the RTCPeerConnection interface, which returns a list of RTCRtpSender objects. An RTCRtpSender lets us manipulate the media track being sent to the other users.

Indicator of active speaker

You may have noticed in Google Meet that when a user is speaking, a small icon appears in the corner of their video container, indicating that the person is currently speaking. All of this logic is encapsulated inside the custom useIsAudioActive hook.

Since we are dealing with a stream of media data here, which is hard to visualise directly, we think of it as nodes via the Web Audio API's AudioContext.

// hooks/use-is-audio-active.ts

const audioContext = new AudioContext();
const analyser = new AnalyserNode(audioContext, { fftSize });

// source is a stream (MediaStream)
const audioSource = audioContext.createMediaStreamSource(source);

// connect the audio source to the analyser node so we can inspect the signal in the time domain
audioSource.connect(analyser);

Depending on the passed FFT (Fast Fourier Transform) size and using requestAnimationFrame, we determine on each frame whether a person is speaking, returning a boolean value. A more detailed explanation of FFT and AnalyserNode can be found on MDN.

// hooks/use-is-audio-active.ts

// buffer length gives us how many different frequencies we are going to be measuring
const bufferLength = analyser.frequencyBinCount;

// array with 512 length (half of FFT) and filled with 0-s
const dataArray = new Uint8Array(bufferLength);
update();

function update() {
  // fills up dataArray with ~128 samples for each index
  analyser.getByteTimeDomainData(dataArray);

  const sum = dataArray.reduce((a, b) => a + b, 0);

  if (sum / dataArray.length / 128.0 >= 1) {
    setIsSpeaking(true);
    setTimeout(() => setIsSpeaking(false), 1000);
  }

  requestAnimationFrame(update);
}
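To isolate the arithmetic in the update loop: getByteTimeDomainData fills the array with unsigned bytes where 128 represents zero amplitude, so dividing the average sample by 128 normalizes it, and the check fires whenever a frame's average lands at or above that midpoint. A pure version of the computation:

```typescript
// normalized average of time-domain samples; 128 is the zero-amplitude midpoint
function normalizedAverage(samples: number[]): number {
  const sum = samples.reduce((a, b) => a + b, 0);
  return sum / samples.length / 128;
}

normalizedAverage([128, 128, 128, 128]); // 1
normalizedAverage([160, 160, 160, 160]); // 1.25
```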

Chat

There is no magic behind the chat feature; we simply leverage socket events such as chat:post, carrying the message payload, and chat:get to receive a new message and append it to the list.

// components/chat/index.tsx

function Chat() {
  const [text, setText] = useState('');
  const [messages, setMessages] = useState<UserMessage[]>([]);

  useEffect(() => {
    socket.on('chat:get', (message: UserMessage) =>
      setMessages(append(message))
    );

    // unsubscribe on unmount so handlers don't stack up
    return () => {
      socket.off('chat:get');
    };
  }, []);

  return (
    <>
      <MessagesContainer messages={messages}/>
      <Input
        value={text}
        onChange={(e) => setText(e.target.value)}
        onKeyDown={sendMessage}
      />
    </>
  );
}
// components/chat/index.tsx

function sendMessage(e: React.KeyboardEvent<HTMLInputElement>) {
  if (e.key === 'Enter' && text) {
    const message = {
      user: username,
      text,
      time: formatTimeHHMM(Date.now()),
    };

    socket.emit('chat:post', message);
    setMessages(append(message));
    setText('');
  }
}
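The formatTimeHHMM helper is used above but never defined in the post; a minimal version (our guess at its behavior: format a millisecond timestamp as zero-padded HH:MM in local time) could look like this:

```typescript
// format a millisecond timestamp as "HH:MM" in local time, zero-padded
function formatTimeHHMM(timestamp: number): string {
  const date = new Date(timestamp);
  const hh = String(date.getHours()).padStart(2, '0');
  const mm = String(date.getMinutes()).padStart(2, '0');
  return `${hh}:${mm}`;
}

formatTimeHHMM(new Date(2023, 0, 1, 9, 5).getTime()); // "09:05"
```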

List of users with their statuses

As a bonus feature, we implemented a sidebar component that shows each user's status in real time.

// components/status/index.tsx

const Status = () => {
  const { avatars, muted, visible, names } = useContext(UsersStateContext);
  const usersIds = Object.keys(names);

  return (
    <>
      {usersIds.map((id) => (
        <div key={id}>
          <img src={avatars[id]} alt="User image" />
          <span>{names[id]}</span>
          <Icon variant={muted[id] ? 'muted' : 'not-muted'} />
          <Icon variant={visible[id] ? 'visible' : 'not-visible'} />
        </div>
      ))}
    </>
  );
};

That is it! We covered the core features of a standard video chat application. We hope you enjoyed it and got some knowledge out of it.

Conclusion

After finishing the app, we came to realise that we had just scratched the surface of WebRTC and WebSockets. Nevertheless, the core features are done, and now we have our own playground to experiment further. The source code is here.

Thank you

P.S. The app is a little laggy, and it has some bugs that we are aware of. We are going to fix them :)
