DEV Community

Cover image for Video Conferencing Systems Architecture: P2P vs MCU vs SFU 🥊
Fora Soft
Fora Soft

Posted on • Originally published at forasoft.com

Video Conferencing Systems Architecture: P2P vs MCU vs SFU 🥊

Even though WebRTC is a protocol developed to do one job, to establish low-ping high-security multimedia connections, one of its best features is flexibility. Even if you are up for a pretty articulated task of creating a video conferencing app, the options are open.

You can go full p2p, deploy a media server backend (there is quite a variety of those), or combine these approaches to your liking. You can pick desired features and find a handful of ways to implement them. Finally, you can freely choose between a solid backend or a scalable media server grid created by one of the many patterns. 

With all of these freedoms at your disposal, picking the best option might - and will - be tricky. Let us clear the fight P2P vs MCU vs SFU a bit for you.

Video Conferencing Systems types

What is P2P?

Let’s imagine it’s Christmas. Or any other mass gift giving opportunity of your choice. You’ve got quite a bunch of friends scattered across town, but everyone is too busy to throw a gift exchange party. So, you and your besties agree everyone will get their presents once they drop by each other.

When you want to exchange something with your peers, no matter whether it’s a Christmas gift or a live video feed, the obvious way is definitely peer to peer. Each of your friends comes to knock on your door and get their box, and then you visit everyone in return. Plain and simple.

For a WebRTC chat, that means all call parties are directly connected to each other, with the host only serving as a meeting point (or an address book, in our Christmas example).

This pattern works great, while:

  • your group is rather small
  • everyone is physically able to reach each other

Every gift from our example requires some time and effort to be dealt with: at least, you have to drive to a location (or to wait for someone to come to you), open the door, give the box and say merry Christmas.

  • If there are 4 members in a group, each of you needs time to handle 6 gifts - 3 to give, 3 to take.
  • When there are 5 of you, 8 gifts per person are to be taken care of.
  • Once your group increases to 6 members, your  Christmas to-do list now features 10 gifts.

At one point, there will be too many balls in the air: the amount of incoming and outgoing gifts will be too massive to handle comfortably.

Same for video calls: every single P2P stream should be encoded and sent or decoded and displayed in real time  - each operation requiring a fraction of your system’s performance, network bandwidth, and battery capacity. This fraction might be quite sensible for higher-quality video: if a 2-on-2 or even a 5-on-5 conference will work decently on any relatively up-to-date device, 10-on-10 peer to peer FullHD  call would eat up give or take 50 Mbps in bandwidth and put quite of a load even on a mid-to-high tier CPU.

peer-to-peer architecture

Now regarding the physical ability to reach. Imagine one of your friends has recently moved to an upscale gated community. They are free to drive in and out – so, they’ll get to your front door for their presents, but your chances to reach their home for your gift are scarce.

WebRTC-wise, we are talking about corporate networks with NATs and\or VPNs. You can reach most hosts from inside the network, while being as good as unreachable from outside. In any case, your peers might be unable to see you, or vice versa, or both.

And finally – if all of you decide to pile up the gifts for a fancy Instagram photo, everyone will have to create both the box heap and the picture themselves: the presents are at the recipients’ homes.

WebRTC: peer-to-peer means no server side recording (or any other once-per-call features). At all.

Peer-to-Peer applications examples

Google Meet and secure mobile video calling apps without any server-side features like video recording.

That’s where media servers come to save the day.

WebRTC media servers: MCU and SFU

Back to the imaginary Christmas. Your bunch of friends is huge, so you figure out you’ll spend the whole holiday season waiting for someone or driving somewhere. To save your time, you pay your local coffee shop to serve as a gift distribution node. 

From now on, everyone in your group needs to reach a single location to leave or get gifts – the coffee shop.

That’s how the WebRTC media servers work. They accept calling parties’ multimedia streams and deliver them to everyone in a conference room.

A while ago, WebRTC media servers used to come in two flavors: SFU (selective forwarding unit) and MCU (Multipoint conferencing \ Multipoint control unit). As of today, most commercial and open-source solutions offer both SFU and MCU features, so both terms now describe features and usage patterns instead of product types.

What are those?

SFU / Selective Forwarding Unit

What SFU is?

SFU sends separate video streams of everyone to everyone.

The bartender at the coffee shop keeps track of all the gifts arriving at the place, and calls their recipients if there’s something new waiting for them. Once you receive such a call, you drop by the shop, have a ‘chino, get your box and head back home.

The bad news is: the bartender gives you calls about one gift at a time. So, if there are three new presents, you’ll have to hit the road three times in a row. If there are twenty… you probably got the point. Alternatively, you can visit the place at a certain periodicity, checking for new arrivals yourself.

Also, as your gift marathon flourishes, coffee quality degrades: the more people join in, the more time and effort the bartender dedicates to distributing gifts instead of caffeine. Remember: one gift – one call from the shop.

Media Server working as a Selective Forwarding Unit allows call participants to send their video streams once only – to the server itself. The backend will clone this stream and deliver it to every party involved in a call.

With SFU, every client consumes almost two times less bandwidth, CPU capacity, and battery power than it would in a peer-to-peer call:

  • for a 4-user call: 1 outgoing stream, 3 incoming (instead of 3 in, 3 out for p2p)
  • for a 5-user call: 1 outgoing stream, 4 incoming (would be 4 and 4 in p2p)
  • for a 10-user call: 1 out, 9 in (9 in, 9 out – p2p)

SFU architecture

The drawback kicks in with users per call ratio approaching 20. “Selective” in SFU stands for the fact that this unit doesn’t forward media in bulk – it delivers media on a per-request basis. And, since WebRTC is always a p2p protocol, even if there is a server involved, every concurrent stream is a separate connection. So, for a 10-user video meetup a server has to maintain 10 ingest (“video receiving”) and 90 outgoing connections, each requiring computing power, bandwidth and, ultimately, money. But…

SFU Scalability

Once the coffee shop owner grows angry with the gift exchange intensity, you can take the next step, and pay some more shops in the neighborhood to join in.

Depending on a particular shop’s load, some of the gift givers or receivers can be routed to another, less crowded one. The grid might grow almost infinitely, since every shop can either forward a package to their addressee, or to an alternative pick up location.

Forwarding rules are perfectly flexible. Like, Johnson’s coffee keeps gifts for your friends with first names starting A – F, and Smartducks is dedicated to parcels for downtown residents, while Randy’s Cappuccino will forward your merry Christmas to anyone who sent their own first gift since last Thursday.

One stream – one connection approach provided by SFU pattern has a feature to beat almost all of its cons. The feature is scalability software development.

Just like you forward a user’s stream to another participant, you can forward it to another server. With this in mind, the back end WebRTC architecture can be adjusted to grow and shrink depending on the number of users, conferences and traffic intensity.

E.g., if too many clients request a particular stream from one host, you can spawn a new one, clone the stream there and distribute it from a new unoccupied location.

Or, in case if you expect rush entrance on a massive conference (e.g., some 20-30 streaming users, and hundreds of view-only subscribers), you can assign two separate media server groups, one to handle incoming streams and the other to deliver it to subscribers. In this case, any load spikes on the viewing side will have zero effect on video ingest, and vice versa.

SFU applications examples

Skype and almost every other mobile messenger with a video conference and video call recording capabilities employs SFU pattern on the backend.

Receiving other users’ video as separate streams provides capabilities for adaptive UX, allows per-stream quality adjustment and improves overall call stability in a volatile environment of a cellular network.

MCU / Multipoint Conferencing Unit

What MCU is?

MCU unites all streams into 1 and sends just 1 stream to each participant.

Giving gifts is making friends, right? Now almost everyone in town is your buddy and participates in this gift exchange. The coffee shop hosting the exchange comes up with a great idea: why don’t we put all the presents for a particular person in a huge crate with their name on it. Moreover, some holiday magic is now involved: once there are new gifts for anyone, they appear in their respective boxes on their own.

Still, making Christmas magic seems to be harder work than making coffee: they might even need to hire more people to cast spells on the gift crates. And even with extra wizards on duty, there is zero chance you can rearrange the crates’ content order for your significant other to see your gift first – everyone gets the same pattern.

mcu architecture

Well, some of the MCU-related features do really ask for puns over an acronym shared with Marvel Cinematic Universe. Something marvelous is definitely involved. Media server in an MCU role has to keep only 20 connections for a 10 user conference, instead of 100 links of an SFU – one ingest, one output per user. How come? It merges all the videos and audios a user needs to receive into a single stream, and delivers it to a particular client. That’s how Zoom’s conferences are made: with MCU, even a lower-tier computer is capable of handling a 100-user live call.

Magic obviously comes at a price, though. Compositing multiple video and audio streams in real time is much more of a performance guzzler than any forwarding pattern. Even more, if you have to somehow exclude one’s own voice and picture from the merged grid they receive – for each of the users.

Another drawback, though mitigable, is that a composited grid is the same for everyone who receives the video – no matter what is their screen resolution, or aspect ratio, or whatever. If there is a need to have different layouts for mobile and desktop devices – you’ll have to composite the video twice.

MCU scalability

In WebRTC video call, compared to SFU, MCU pattern has considerably less scaling potential: video compositing with sub-second delays does not allow on-the-fly load redistribution for a particular conference. Still, one can autospawn additional server instances for new calls in a virtualized environment, or, for even better efficiency, assign an additional SFU unit to redistribute composited video.

MCU applications examples

Zoom and a majority of its alternatives for massive video conferencing run off MCU-like backends. Otherwise, WebRTC video calls for 25+ participants would only be available for high-end devices.

TL;DR: what and when do I use?

video-conferencing-system-architecture

~1-4 users per call – P2P

Pros:

  • lowest idling costs
  • easiest scaling
  • shortest TTM (time to market)
  • potentially the most secure

Cons:

  • for 5+ user calls – quality might deteriorate on weaker devices
  • highest bandwidth usage (may be critical for mobile users)
  • no server side recording, video analytics or other advanced features

Applications:

  • private \ group calls
  • video assistance and sales

5-20 users per call – SFU

Pros:

  • easily scalable with simultaneous call number growth
  • retains UX flexibility while providing server side features
  • can have node redundancy by design: thus, most rush-proof

Cons:

  • pretty traffic- and performance intensive on the client side
  • might still require a compositing MCU-like service to record calls
    Use cases:

  • E-learning: workshops and virtual classrooms

  • Corporate communications: meeting and pressrooms

20+ users per call – MCU \ MCU + SFU

Pros:

  • least load on client side devices
  • capable of serving the biggest audiences
  • easily recordable (server side / client side)

Cons:

  • biggest idling and running costs
  • one call capacity is limited to performance of a particular server
  • least customizable layout

Use cases:

  • Large event streaming
  • Social networking
  • Online media

Conclusion

P2P, MCU, and SFU are parts of WebRTC. We have more articles about this technology on our blog.
Got another question not covered here? Drop a comment here ⬇️

Top comments (0)