DEV Community

Jatin Gupta

Have you ever wondered how Zoom works?

If only two people are talking, does Zoom send the video through its own servers?
And if 100 people join a meeting, how can everyone see each other without your laptop sending 99 separate videos?

Let's walk through the complete architecture, covering Peer-to-Peer (P2P), WebRTC, and SFU, in a simple way.

Zoom architecture explained in detail

The Basic Idea

Whenever you join a Zoom meeting, there are actually multiple types of data traveling over the internet:

  • 🎤 Audio

  • 🎥 Video

  • 💬 Chat messages

  • 📄 Live transcript

  • 🖥️ Screen sharing

These are called Real-Time Media Streams because they need to reach everyone almost instantly.

You
   │
   ├── Audio
   ├── Video
   ├── Chat
   ├── Screen Share
   └── Transcript
        │
        ▼
      Zoom
        │
        ▼
Other Participants

What happens in a 1-on-1 call?

Suppose:

You
Your friend

Only two participants exist.
In this case, Zoom can establish a Peer-to-Peer (P2P) connection.

You  ------------------> Friend
        Direct Connection

Your laptop sends one video stream directly to your friend's laptop.
Similarly,

Friend -------------> You

One stream each.
No complicated routing is required.

How do two devices connect?
They usually use WebRTC (Web Real-Time Communication).

WebRTC allows browsers and applications to exchange:

  • Video
  • Audio
  • Data

All three are exchanged in real time. However, before sending media, both devices need to discover each other and agree on how to connect.

This exchange of connection information is called signaling.

Example:

You
   │
Signaling
   │
STUN Server
   │
Signaling
   │
Friend

After exchanging connection information, media can flow directly.
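
The exchange above can be sketched as a tiny in-memory relay. This is only an illustration under assumptions: `SignalingRelay`, `SignalMessage`, and `exchange` are hypothetical names, the addresses are documentation-range placeholders standing in for what a STUN lookup would return, and a real signaling channel would run over the network (for example over a WebSocket) rather than inside one process.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class SignalMessage:
    sender: str
    kind: str      # "offer" or "answer"
    payload: str   # connection details; the relay never looks inside


class SignalingRelay:
    """Hypothetical relay: keeps one inbox per peer and passes messages along."""

    def __init__(self) -> None:
        self._inboxes = defaultdict(deque)

    def send(self, to: str, message: SignalMessage) -> None:
        self._inboxes[to].append(message)

    def receive(self, peer: str) -> list:
        inbox = self._inboxes[peer]
        messages = list(inbox)
        inbox.clear()
        return messages


def exchange(relay: SignalingRelay) -> dict:
    """Run one offer/answer round; return the remote address each peer learned."""
    learned = {}
    # You send an offer that carries your public address (placeholder value).
    relay.send("friend", SignalMessage("you", "offer", "203.0.113.7:54321"))
    for msg in relay.receive("friend"):
        learned["friend"] = msg.payload  # friend now knows how to reach you
        relay.send(msg.sender, SignalMessage("friend", "answer", "198.51.100.9:40000"))
    for msg in relay.receive("you"):
        learned["you"] = msg.payload     # you now know how to reach friend
    return learned


if __name__ == "__main__":
    print(exchange(SignalingRelay()))
```

Once both sides hold each other's address, the relay is no longer involved and the media itself can travel directly between the peers.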

You  ==========> Friend
        Video

Friend ========> You
        Video

What is a STUN Server?
A STUN Server helps devices discover their public IP address and determine how they can communicate through NAT (Network Address Translation).
Think of it like this:

You ask,

"What address does the outside world see for me?"

The STUN server replies with that information so the two peers can attempt a direct connection.
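
To make that question and answer concrete, here is a minimal sketch of the two messages, following the STUN message layout from RFC 5389. It does not contact any server: `fake_binding_response` is a hypothetical stand-in that builds the reply a server would send, and the IP address is a documentation-range placeholder. Only IPv4 is handled.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442        # fixed value defined by the STUN specification
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request() -> bytes:
    """20-byte header: type, body length (0), magic cookie, random transaction ID."""
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + os.urandom(12)


def parse_xor_mapped_address(response: bytes) -> tuple:
    """Return the public (IPv4 address, port) carried in a Binding Success response."""
    msg_type, body_len, cookie = struct.unpack("!HHI", response[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        raise ValueError("not a STUN Binding Success response")
    offset, end = 20, 20 + body_len
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", response[offset:offset + 4])
        value = response[offset + 4:offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and value[1] == 0x01:  # 0x01 means IPv4
            x_port, x_addr = struct.unpack("!HI", value[2:8])
            port = x_port ^ (MAGIC_COOKIE >> 16)
            addr = socket.inet_ntoa(struct.pack("!I", x_addr ^ MAGIC_COOKIE))
            return addr, port
        offset += 4 + attr_len + (-attr_len % 4)  # attributes are padded to 4 bytes
    raise ValueError("no IPv4 XOR-MAPPED-ADDRESS attribute found")


def fake_binding_response(request: bytes, ip: str, port: int) -> bytes:
    """Hypothetical stand-in for a server that saw the request arrive from ip:port."""
    x_port = port ^ (MAGIC_COOKIE >> 16)
    x_addr = struct.unpack("!I", socket.inet_aton(ip))[0] ^ MAGIC_COOKIE
    attr = struct.pack("!HHBBHI", XOR_MAPPED_ADDRESS, 8, 0, 0x01, x_port, x_addr)
    header = struct.pack("!HHI", BINDING_SUCCESS, len(attr), MAGIC_COOKIE)
    return header + request[8:20] + attr  # echo the request's transaction ID


if __name__ == "__main__":
    request = build_binding_request()
    reply = fake_binding_response(request, "203.0.113.7", 54321)
    print(parse_xor_mapped_address(reply))
```

In a real client the request would be sent over UDP to a STUN server, and the parsed address would be shared with the other peer during signaling.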

Why P2P is great

Advantages:

  • Very low latency
  • Less server cost
  • Fast communication
  • Direct transmission
You ---> Friend

Simple and efficient.

But what happens when more people join?
Imagine:

You
Person B
Person C

Now you must send your video to both people.

      You
     /   \
    /     \
   B       C

You now upload 2 streams.

Suppose there are 10 participants.

        You
      / / | \ \
     A B C D E F G H I

Now your laptop must upload 9 different video streams.
That is a huge problem.

Why is this impossible?

Because your:

  • Internet upload bandwidth is limited
  • CPU is limited
  • Memory is limited
  • Battery is limited

If each stream is 2 Mbps, then for 10 people:

2 Mbps × 9

= 18 Mbps upload

Most users cannot continuously upload this much.
If 100 participants exist:

2 Mbps × 99

= 198 Mbps upload

Practically impossible.
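
The same arithmetic as a small function, in case you want to try other meeting sizes. The 2 Mbps per stream figure is the one used above; `mesh_upload_mbps` is just an illustrative name.

```python
def mesh_upload_mbps(participants: int, stream_mbps: float = 2.0) -> float:
    """Upload bandwidth (Mbps) one device needs when it sends directly to everyone else."""
    if participants < 1:
        raise ValueError("a meeting needs at least one participant")
    # One separate copy of the stream goes to each of the other participants.
    return stream_mbps * (participants - 1)


if __name__ == "__main__":
    for size in (2, 10, 100):
        print(f"{size} participants: {mesh_upload_mbps(size):g} Mbps upload")
```

The cost grows with every person who joins, which is why direct connections stop being practical for larger meetings.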

So how does Zoom solve this?
Instead of sending a separate stream to each participant, you upload only one stream.

You
   │
   │
   ▼
 Zoom Server

The Zoom server then distributes that stream to everyone else.

             You
              │
              ▼
        Zoom Server
        /    |     \
       /     |      \
      B      C       D

Your laptop uploads only once.
The server does the rest.

This server is called an SFU
SFU stands for Selective Forwarding Unit.

It is a media server.
Its job is simple:

  • Receive your stream
  • Forward it to other participants

It does not create new videos. It simply forwards them intelligently.

How SFU works
Step 1

You upload one stream.

You
   │
   ▼
 SFU

Step 2

SFU receives it.

You
   │
   ▼
 SFU

Step 3

SFU forwards it.

        SFU
      /  |  \
     /   |   \
    B    C    D

Now each participant does the same.

        A
         \
          \
           SFU
         / |  \
        B  C   D

Everyone uploads only one stream.
The SFU distributes everything.
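
The three steps can be sketched as a toy class. This is not Zoom's implementation, only a minimal model of the idea: `Sfu`, `join`, and `publish` are hypothetical names, and a "packet" here is just a bytes object rather than real encoded video.

```python
from collections import defaultdict


class Sfu:
    """Toy SFU: forwards each packet unchanged to every participant except its sender."""

    def __init__(self) -> None:
        self.participants = set()
        self.delivered = defaultdict(list)  # receiver -> list of (sender, packet)

    def join(self, name: str) -> None:
        self.participants.add(name)

    def publish(self, sender: str, packet: bytes) -> int:
        """Accept one uploaded packet and fan it out; return the number of copies sent."""
        if sender not in self.participants:
            raise KeyError(f"{sender} has not joined")
        receivers = self.participants - {sender}
        for receiver in receivers:
            # No decoding or mixing: the same bytes go out as came in.
            self.delivered[receiver].append((sender, packet))
        return len(receivers)


if __name__ == "__main__":
    sfu = Sfu()
    for name in ("A", "B", "C", "D"):
        sfu.join(name)
    copies = sfu.publish("A", b"video-frame")
    print(f"A uploaded once, SFU sent {copies} copies")
```

The sender's cost stays at one upload no matter how many people join; the fan-out work moves to the server.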

Why is it called "Selective" Forwarding?
Because it forwards only what is needed.
Example:

If your screen shows only 9 participants, the SFU may send only those visible videos in high quality.
If someone is hidden or inactive, the server can reduce quality or stop forwarding temporarily.

This saves:

  • Bandwidth
  • CPU
  • Battery
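
Here is a rough sketch of that decision for a single viewer. The rule and the numbers are assumptions made for illustration: the function names are hypothetical, 2 Mbps for high quality reuses the figure from earlier in the article, and 0.3 Mbps for low quality is an invented placeholder. How a real SFU chooses what to forward may differ.

```python
def plan_forwarding(senders, visible, pause_hidden=True):
    """Decide, for one viewer, which quality the SFU forwards from each sender."""
    plan = {}
    for sender in senders:
        if sender in visible:
            plan[sender] = "high"
        else:
            # Hidden participants are either paused or kept at low quality.
            plan[sender] = "paused" if pause_hidden else "low"
    return plan


def downlink_mbps(plan, rates=None):
    """Total download for the viewer under the given plan (rates are assumed values)."""
    rates = rates or {"high": 2.0, "low": 0.3, "paused": 0.0}
    return sum(rates[quality] for quality in plan.values())


if __name__ == "__main__":
    senders = [f"P{i}" for i in range(1, 13)]  # 12 other people in the meeting
    visible = set(senders[:9])                 # only 9 tiles fit on screen
    plan = plan_forwarding(senders, visible)
    print(plan)
    print(f"{downlink_mbps(plan):g} Mbps download")
```

Without this selection, the viewer would download all 12 streams at full quality even though 3 of them are never shown.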

Does SFU mix videos together?
No.
It simply forwards.

You ---> SFU ---> B

You ---> SFU ---> C

You ---> SFU ---> D

No video mixing happens.

Then what is MCU?
Another architecture is MCU (Multipoint Control Unit).
MCU receives all streams,

A
 \
  \
B ---> MCU
  /
 /
C

It decodes everything,
mixes videos,
creates one combined video,
then sends that back.

        MCU
          │
          ▼
     Combined Video
          │
          ▼
     All Participants

SFU vs MCU

| Aspect           | SFU                      | MCU                       |
| ---------------- | ------------------------ | ------------------------- |
| Stream handling  | Simply forwards streams  | Decodes and mixes streams |
| Speed            | Faster                   | Slower, more processing   |
| Latency          | Lower                    | Higher                    |
| Server load      | Less CPU                 | Heavy computation         |
| Scaling and cost | Scales well              | More expensive to scale   |
| Typical use      | Many modern meeting apps | Some specialized systems  |
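
To put rough numbers on the comparison, the sketch below counts streams per participant and the server's video work for P2P, SFU, and MCU. It is a simplified model under assumptions: `stream_counts` is a hypothetical helper, the SFU forwards every stream with no selection, and the MCU produces exactly one combined video for everyone.

```python
def stream_counts(topology: str, participants: int) -> dict:
    """Streams each participant sends/receives, plus server decode/encode work."""
    others = participants - 1
    if topology == "p2p":
        # Everyone sends to and receives from everyone else; no media server.
        return {"uploads": others, "downloads": others,
                "server_decodes": 0, "server_encodes": 0}
    if topology == "sfu":
        # One upload each; the server forwards without decoding.
        return {"uploads": 1, "downloads": others,
                "server_decodes": 0, "server_encodes": 0}
    if topology == "mcu":
        # One upload and one combined download each; the server decodes
        # every incoming stream and encodes the mixed result.
        return {"uploads": 1, "downloads": 1,
                "server_decodes": participants, "server_encodes": 1}
    raise ValueError(f"unknown topology: {topology}")


if __name__ == "__main__":
    for topology in ("p2p", "sfu", "mcu"):
        print(topology, stream_counts(topology, 10))
```

The SFU keeps uploads at one per participant without taking on the decode and encode work that makes the MCU heavy.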

Why Zoom uses SFU
Because it allows:

  • Millions of users
  • Lower latency
  • Better scalability
  • Lower upload requirements
  • Efficient bandwidth usage

Instead of every laptop sending many streams,
❌ You → B
❌ You → C
❌ You → D
❌ You → E

it becomes

✅ You
      │
      ▼
     SFU
   / / | \ \
  B C D E F

Only one upload from your device.

Complete Flow

User
   │
Capture Camera
   │
Encode Video
   │
Upload One Stream
   │
──────────────
       SFU
──────────────
Receives Stream
Forwards Stream
──────────────
Participants
B
C
D
E
F

Final Takeaway

  • 1-on-1 calls can often use Peer-to-Peer (P2P) with WebRTC, where devices communicate directly after signaling and STUN-based connection setup.
  • Group calls cannot rely on P2P because each participant would need to upload many separate streams, quickly exhausting bandwidth and device resources.
  • To solve this, Zoom uses an SFU (Selective Forwarding Unit). Each participant uploads one media stream to the SFU, and the SFU intelligently forwards that stream to the other participants.
  • This architecture keeps latency low, reduces upload bandwidth requirements, and allows Zoom to scale to meetings with many participants efficiently.

In one line:
"P2P works well for 1-to-1 calls, but for group meetings Zoom scales by using an SFU, where you upload one stream and the server forwards it to everyone else."
