Jatin Gupta

Posted on Jun 12

Have you ever wondered how Zoom works?

#systemdesign #techtalks #coding #interview

If only two people are talking, does Zoom still send the video through its own servers?
And if 100 people join a meeting, how can everyone see each other without your laptop sending 99 separate videos?
Let's understand the complete architecture (Peer-to-Peer (P2P), WebRTC, and SFU) in a simple way.

The Basic Idea

Whenever you join a Zoom meeting, there are actually multiple types of data traveling over the internet:

🎤 Audio
🎥 Video
💬 Chat messages
📄 Live transcript
🖥️ Screen sharing

These are called Real-Time Media Streams because they need to reach everyone almost instantly.

You
   │
   ├── Audio
   ├── Video
   ├── Chat
   ├── Screen Share
   └── Transcript
        │
        ▼
      Zoom
        │
        ▼
Other Participants

What happens in a 1-on-1 call?

Suppose:

You
Your friend

Only two participants exist.
In this case, Zoom can establish a Peer-to-Peer (P2P) connection.

You  ------------------> Friend
        Direct Connection

Your laptop sends one video stream directly to your friend's laptop.
Similarly,

Friend -------------> You

One stream each.
No complicated routing is required.

How do two devices connect?
Many calling apps use WebRTC (Web Real-Time Communication) for this.

WebRTC allows browsers and applications to exchange:

Video
Audio
Data

All of this happens in real time. However, before sending media, both devices need to discover each other.

This process is called Signaling.

Example:

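You
   │
Signaling
   │
Signaling Server
   │
Signaling
   │
Friend

The signaling server only relays small setup messages. Each device also asks a STUN server for its public address and shares that address through signaling.
After exchanging this connection information, media can flow directly.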

You  ==========> Friend
        Video

Friend ========> You
        Video
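
To make the signaling step concrete, here is a minimal sketch of a relay that passes setup messages between two peers. It is a toy in-memory model, not Zoom's or WebRTC's actual API: the `SignalingServer` class, the message fields, and the example addresses are all assumptions made for illustration.

```python
from collections import defaultdict


class SignalingServer:
    """Toy relay (hypothetical): passes small setup messages, never media."""

    def __init__(self):
        self.inboxes = defaultdict(list)

    def send(self, sender, recipient, kind, payload):
        # Queue a message for the recipient to pick up later.
        self.inboxes[recipient].append(
            {"from": sender, "kind": kind, "payload": payload}
        )

    def receive(self, peer):
        # Hand over all pending messages and empty the inbox.
        messages = self.inboxes[peer]
        self.inboxes[peer] = []
        return messages


server = SignalingServer()

# "you" proposes a session and shares an address the friend could reach.
server.send("you", "friend", "offer", {"codecs": ["VP8", "opus"]})
server.send("you", "friend", "candidate", {"address": "203.0.113.5:50000"})

# "friend" reads the messages and answers through the same relay.
friend_inbox = server.receive("friend")
server.send("friend", "you", "answer", {"codecs": ["VP8", "opus"]})
server.send("friend", "you", "candidate", {"address": "198.51.100.7:40000"})
your_inbox = server.receive("you")

print([m["kind"] for m in friend_inbox])  # ['offer', 'candidate']
print([m["kind"] for m in your_inbox])    # ['answer', 'candidate']
```

Notice that no video passes through this relay. Once both sides hold each other's connection details, the media takes the direct path.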

What is a STUN Server?
A STUN Server helps devices discover their public IP address and determine how they can communicate through NAT (Network Address Translation).
Think of it like this:

You ask,

"What address does the outside world see for me?"

The STUN server replies with that information so the two peers can attempt a direct connection.
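
The idea can be sketched with a toy model. This is not the real STUN protocol (which runs over the network, typically UDP); the `Nat` class, the `stun_reply` function, and the addresses are hypothetical and exist only to show why the server sees a different address than the one your laptop knows.

```python
class Nat:
    """Toy NAT (hypothetical): rewrites a private address to a public one."""

    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.next_port = 50000
        self.mappings = {}

    def translate(self, private_addr):
        # Reuse the mapping if this private address was seen before.
        if private_addr not in self.mappings:
            self.mappings[private_addr] = (self.public_ip, self.next_port)
            self.next_port += 1
        return self.mappings[private_addr]


def stun_reply(source_addr):
    """Toy STUN server: report back the address the request came from."""
    return {"your_public_address": source_addr}


nat = Nat(public_ip="203.0.113.5")
private_addr = ("192.168.1.20", 54321)  # what your laptop thinks its address is

# The request leaves the home network, so the server sees the NAT's address.
seen_by_server = nat.translate(private_addr)
reply = stun_reply(seen_by_server)

print(reply)  # {'your_public_address': ('203.0.113.5', 50000)}
```

Your laptop only knows `192.168.1.20`, which is useless to your friend. The reply gives it the public address it can share through signaling.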

Why P2P is great

Advantages:

Very low latency
Lower server cost
Fast communication
Direct transmission

You ---> Friend

Simple and efficient.

But what happens when more people join?
Imagine:

You
Person B
Person C

Now you must send your video to both people.

You now upload 2 streams.

Suppose there are 10 participants.

        You
      / / | \ \
     A B C D E F G H I

Now your laptop must upload 9 different video streams.
That is a huge problem.

Why is this impractical?

Because your:

Internet upload bandwidth is limited
CPU is limited
Memory is limited
Battery is limited

Suppose each stream is 2 Mbps. For 10 people:

2 Mbps × 9

= 18 Mbps upload

Most users cannot continuously upload this much.
If there are 100 participants:

2 Mbps × 99

= 198 Mbps upload

Practically impossible.
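
The arithmetic above fits in a few lines of code. This is a minimal sketch that assumes every stream costs the same 2 Mbps used in the example; the function name `mesh_upload_mbps` is made up for illustration.

```python
def mesh_upload_mbps(participants, stream_mbps=2.0):
    """Upload needed by ONE participant when everyone sends to everyone."""
    if participants < 1:
        raise ValueError("a meeting needs at least one participant")
    # You send one copy of your stream to every other person.
    return stream_mbps * (participants - 1)


for n in (2, 3, 10, 100):
    print(n, "participants ->", mesh_upload_mbps(n), "Mbps upload")
```

For 10 participants this gives 18.0 Mbps and for 100 it gives 198.0 Mbps, matching the numbers above. The upload grows with every person who joins.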

So how does Zoom solve this?
Instead of sending separate streams to everyone, you upload only one stream.

You
   │
   │
   ▼
 Zoom Server

The Zoom server then distributes that stream to everyone else.

             You
              │
              ▼
        Zoom Server
        /    |     \
       /     |      \
      B      C       D

Your laptop uploads only once.
The server does the rest.

This server is called an SFU
SFU stands for Selective Forwarding Unit.

It is a media server.
Its job is simple:

Receive your stream
Forward it to other participants

It does not create new videos. It simply forwards them intelligently.

How SFU works
Step 1

You upload one stream.

You
   │
   ▼
 SFU

Step 2

SFU receives it.

You
   │
   ▼
 SFU

Step 3

SFU forwards it.

        SFU
      /  |  \
     /   |   \
    B    C    D

Now each participant does the same.

Everyone uploads only one stream.
The SFU distributes everything.
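
The three steps can be sketched as a tiny simulation. This is a toy model under simple assumptions (one packet per participant, no quality selection yet); the `Sfu` class and its methods are hypothetical names, not Zoom's implementation.

```python
class Sfu:
    """Toy SFU (hypothetical): receives one stream and forwards copies."""

    def __init__(self):
        self.participants = []

    def join(self, name):
        if name not in self.participants:
            self.participants.append(name)

    def forward(self, sender, packet):
        # Everyone except the sender gets the same packet, unchanged.
        return {p: packet for p in self.participants if p != sender}


sfu = Sfu()
for name in ("You", "B", "C", "D"):
    sfu.join(name)

uploads = {name: 0 for name in sfu.participants}
downloads = {name: 0 for name in sfu.participants}

# Every participant uploads exactly one packet to the SFU.
for sender in sfu.participants:
    uploads[sender] += 1
    for receiver in sfu.forward(sender, f"frame from {sender}"):
        downloads[receiver] += 1

print(uploads)    # {'You': 1, 'B': 1, 'C': 1, 'D': 1}
print(downloads)  # {'You': 3, 'B': 3, 'C': 3, 'D': 3}
```

Each device uploads once no matter how many people are in the meeting. The fan-out work moves to the server, which usually has far more bandwidth than a home connection.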

Why is it called "Selective" Forwarding?
Because it forwards only what is needed.
Example:

If your screen shows only 9 participants, the SFU may send only those visible videos in high quality.
If someone is hidden or inactive, the server can reduce quality or stop forwarding temporarily.

This saves:

Bandwidth
CPU
Battery
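
Here is a minimal sketch of that decision for a single receiver. It assumes each sender makes more than one quality version of their video available (a technique often called simulcast). The policy itself, and the names `choose_layer` and `plan_forwarding`, are assumptions for illustration; a real SFU weighs many more signals.

```python
def choose_layer(visible, bandwidth_ok=True):
    """Pick which version of one sender's video to forward to one receiver."""
    if not visible:
        return None      # hidden tile: stop forwarding for now
    if bandwidth_ok:
        return "high"    # visible tile and a healthy connection
    return "low"         # visible tile but a struggling connection


def plan_forwarding(senders, visible_senders, bandwidth_ok=True):
    """Build the forwarding plan for one receiver's screen."""
    return {s: choose_layer(s in visible_senders, bandwidth_ok) for s in senders}


plan = plan_forwarding(["A", "B", "C", "D"], visible_senders={"A", "B"})
print(plan)  # {'A': 'high', 'B': 'high', 'C': None, 'D': None}
```

Only A and B are on screen, so only their video is forwarded. C and D cost the receiver nothing until they become visible.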

Does SFU mix videos together?
No.
It simply forwards.

You ---> SFU ---> B

You ---> SFU ---> C

You ---> SFU ---> D

No video mixing happens.

Then what is MCU?
Another architecture is MCU (Multipoint Control Unit).
MCU receives all streams,

A
 \
  \
B ---> MCU
  /
 /
C

It decodes everything,
mixes videos,
creates one combined video,
then sends that back.

        MCU
          │
          ▼
     Combined Video
          │
          ▼
     All Participants

SFU vs MCU

| SFU                              | MCU                              |
| -------------------------------- | -------------------------------- |
| Simply forwards streams          | Mixes streams                    |
| Faster                           | More processing                  |
| Lower latency                    | Higher latency                   |
| Less CPU on server               | Heavy server computation         |
| Scales well                      | More expensive                   |
| Used by many modern meeting apps | Used in some specialized systems |
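
A rough way to see the difference is to count the work each server does per round of video. This sketch assumes the MCU builds a single combined video shared by everyone, as described above, and that the SFU forwards every stream to every other participant. The function names and dictionary keys are made up for illustration.

```python
def sfu_load(n):
    """Server work for n participants when streams are only forwarded."""
    return {
        "streams_in": n,
        "decodes": 0,                # packets are passed along untouched
        "encodes": 0,
        "streams_out": n * (n - 1),  # each stream goes to everyone else
    }


def mcu_load(n):
    """Server work for n participants when streams are mixed into one video."""
    return {
        "streams_in": n,
        "decodes": n,                # every incoming video must be decoded
        "encodes": 1,                # one combined video is encoded
        "streams_out": n,            # the same mix goes to each participant
    }


print("SFU:", sfu_load(10))
print("MCU:", mcu_load(10))
```

With 10 participants, the SFU sends out 90 streams but never decodes or encodes, while the MCU sends out only 10 but has to decode 10 videos and encode a new one. The SFU spends bandwidth and the MCU spends CPU.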

Why Zoom uses SFU
Because it allows:

Millions of users
Lower latency
Better scalability
Lower upload requirements
Efficient bandwidth usage

Instead of every laptop sending many streams,

❌ You → B
❌ You → C
❌ You → D
❌ You → E

it becomes

✅ You
      │
      ▼
     SFU
   / / | \ \
  B C D E F

Only one upload from your device.

Complete Flow

User
   │
Capture Camera
   │
Encode Video
   │
Upload One Stream
   │
──────────────
       SFU
──────────────
Receives Stream
Forwards Stream
──────────────
Participants
B  C  D  E  F

Final Takeaway

1-on-1 calls can often use Peer-to-Peer (P2P) with WebRTC, where devices communicate directly after signaling and STUN-based connection setup.
Group calls cannot rely on P2P because each participant would need to upload many separate streams, quickly exhausting bandwidth and device resources.
To solve this, Zoom uses an SFU (Selective Forwarding Unit). Each participant uploads one media stream to the SFU, and the SFU intelligently forwards that stream to the other participants.
This architecture keeps latency low, reduces upload bandwidth requirements, and allows Zoom to scale to meetings with many participants efficiently.

In one line:
"P2P works well for 1-to-1 calls, but for group meetings Zoom scales by using an SFU, where you upload one stream and the server forwards it to everyone else."
