If only two people are talking, does the video even need to go through Zoom's servers?
And if 100 people join a meeting, how can everyone see each other without your laptop sending 99 separate videos?
Let's understand the complete architecture, from Peer-to-Peer (P2P) and WebRTC to the SFU, in a simple way.
The Basic Idea
Whenever you join a Zoom meeting, there are actually multiple types of data traveling over the internet:
🎤 Audio
🎥 Video
💬 Chat messages
📄 Live transcript
🖥️ Screen sharing
These are called Real-Time Media Streams because they need to reach everyone almost instantly.
You
│
├── Audio
├── Video
├── Chat
├── Screen Share
└── Transcript
│
▼
Zoom
│
▼
Other Participants
What happens in a 1-on-1 call?
Suppose only two participants exist:
- You
- Your friend
In this case, Zoom can establish a Peer-to-Peer (P2P) connection.
You ------------------> Friend
Direct Connection
Your laptop sends one video stream directly to your friend's laptop.
Similarly,
Friend -------------> You
One stream each.
No complicated routing is required.
How do two devices connect?
They usually use WebRTC (Web Real-Time Communication).
WebRTC allows browsers and applications to exchange the following in real time:
- Video
- Audio
- Data
However, before sending media, both devices need to discover each other.
This process is called Signaling.
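Example:
You
│
Signaling
│
Signaling Server
│
Signaling
│
Friend
Along the way, each device also asks a STUN server for its public address and shares it as part of this connection information.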
After exchanging connection information, media can flow directly.
You ==========> Friend
Video
Friend ========> You
Video
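To make the signaling step more concrete, here is a minimal Python sketch of an offer/answer exchange. It is a toy simulation, not WebRTC itself: `SignalingChannel`, `negotiate`, and the message fields are hypothetical names chosen for illustration, and a real application would relay these messages over something like a WebSocket.

```python
from collections import defaultdict, deque


class SignalingChannel:
    """Toy in-memory relay standing in for a real signaling service."""

    def __init__(self):
        self.inboxes = defaultdict(deque)

    def send(self, to, message):
        self.inboxes[to].append(message)

    def receive(self, who):
        return self.inboxes[who].popleft()


def negotiate(channel, caller, callee):
    # 1. The caller describes the media it wants to send (the "offer").
    channel.send(callee, {"type": "offer", "from": caller,
                          "media": ["audio", "video"]})
    offer = channel.receive(callee)
    # 2. The callee replies with what it accepts (the "answer").
    channel.send(caller, {"type": "answer", "from": callee,
                          "media": offer["media"]})
    answer = channel.receive(caller)
    # 3. Both sides now hold each other's description; media can flow directly.
    return offer, answer


offer, answer = negotiate(SignalingChannel(), "you", "friend")
print(offer["type"], "->", answer["type"])
```

Notice that the channel only carries small setup messages. The audio and video themselves never pass through it.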
What is a STUN Server?
A STUN Server helps devices discover their public IP address and determine how they can communicate through NAT (Network Address Translation).
Think of it like this:
You ask,
"What address does the outside world see for me?"
The STUN server replies with that information so the two peers can attempt a direct connection.
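For the curious, the question and the reply are tiny binary messages. The sketch below builds a STUN Binding Request header and decodes an IPv4 XOR-MAPPED-ADDRESS value, following the layout in RFC 5389. It does not open a socket or talk to any real server, and the function names are my own, so treat it as an illustration rather than a full STUN client.

```python
import os
import socket
import struct
from typing import Tuple

MAGIC_COOKIE = 0x2112A442  # fixed constant defined by RFC 5389


def build_binding_request() -> bytes:
    """Return a 20-byte Binding Request header with no attributes."""
    transaction_id = os.urandom(12)
    # Fields: message type 0x0001, attribute length 0, magic cookie.
    return struct.pack("!HHI", 0x0001, 0, MAGIC_COOKIE) + transaction_id


def decode_xor_mapped_address(value: bytes) -> Tuple[str, int]:
    """Decode the value of an IPv4 XOR-MAPPED-ADDRESS attribute."""
    _reserved, family, x_port, x_addr = struct.unpack("!BBHI", value[:8])
    if family != 0x01:
        raise ValueError("only IPv4 is handled in this sketch")
    # Port and address are XOR-ed with the magic cookie on the wire.
    port = x_port ^ (MAGIC_COOKIE >> 16)
    addr = x_addr ^ MAGIC_COOKIE
    return socket.inet_ntoa(struct.pack("!I", addr)), port
```

The decoded pair is the "address the outside world sees", which a peer would then pass along during signaling.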
Why P2P is great
Advantages:
- Very low latency
- Less server cost
- Fast communication
- Direct transmission
You ---> Friend
Simple and efficient.
But what happens when more people join?
Imagine three participants:
- You
- Person B
- Person C
Now you must send your video to both people.
You
/ \
/ \
B C
You now upload 2 streams.
Suppose there are 10 participants.
You
/ / | \ \
A B C D E F G H I
Now your laptop must upload 9 different video streams.
That is a huge problem.
Why is this impossible?
Because your:
- Internet upload bandwidth is limited
- CPU is limited
- Memory is limited
- Battery is limited
Suppose each stream is 2 Mbps.
For 10 people:
2 Mbps × 9
= 18 Mbps upload
Most users cannot continuously upload this much.
If 100 participants exist:
2 Mbps × 99
= 198 Mbps upload
Practically impossible.
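The arithmetic above can be checked with a few lines of Python. The function name `mesh_upload_mbps` and the 2 Mbps default are illustrative assumptions taken from this example, not measured Zoom figures.

```python
def mesh_upload_mbps(participants: int, stream_mbps: float = 2.0) -> float:
    """Upload needed by ONE participant in a full P2P mesh."""
    if participants < 1:
        raise ValueError("a meeting needs at least one participant")
    # One copy of your stream goes to every other participant.
    return stream_mbps * (participants - 1)


for n in (2, 10, 100):
    print(n, "participants need", mesh_upload_mbps(n), "Mbps upload each")
```

The cost grows linearly with the number of participants for each person, which is why a plain mesh stops being practical so quickly.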
So how does Zoom solve this?
Instead of sending separate streams to everyone, you upload only one stream.
You
│
│
▼
Zoom Server
The Zoom server then distributes that stream to everyone else.
You
│
▼
Zoom Server
/ | \
/ | \
B C D
Your laptop uploads only once.
The server does the rest.
This server is called an SFU
SFU stands for Selective Forwarding Unit.
It is a media server.
Its job is simple:
- Receive your stream
- Forward it to other participants
It does not create new videos. It simply forwards them intelligently.
How SFU works
Step 1
You upload one stream.
You
│
▼
SFU
Step 2
SFU receives it.
You
│
▼
SFU
Step 3
SFU forwards it.
SFU
/ | \
/ | \
B C D
Now each participant does the same.
A
\
\
SFU
/ | \
B C D
Everyone uploads only one stream.
The SFU distributes everything.
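The three steps can be sketched as a toy forwarder in Python. `ToySfu` and its methods are hypothetical names, and packets here are just bytes in a list. A real SFU forwards encrypted RTP packets over the network, so this only illustrates the fan-out idea.

```python
class ToySfu:
    """Minimal fan-out: each published packet goes to every other member."""

    def __init__(self):
        self.inboxes = {}

    def join(self, name):
        self.inboxes[name] = []

    def publish(self, sender, packet):
        forwarded = 0
        for name, inbox in self.inboxes.items():
            if name != sender:  # never echo a packet back to its sender
                inbox.append((sender, packet))
                forwarded += 1
        return forwarded


sfu = ToySfu()
for person in ("A", "B", "C", "D"):
    sfu.join(person)

copies = sfu.publish("A", b"frame-1")
print("A uploaded 1 packet, SFU forwarded", copies, "copies")
```

The sender uploads once and the copying happens on the server, where bandwidth is far less constrained than on a laptop.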
Why is it called "Selective" Forwarding?
Because it forwards only what is needed.
Example:
If your screen shows only 9 participants, the SFU may send only those visible videos in high quality.
If someone is hidden or inactive, the server can reduce quality or stop forwarding temporarily.
This saves:
- Bandwidth
- CPU
- Battery
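Here is one way the "selective" decision could look in code. This is a sketch under my own assumptions: `plan_forwarding`, the `pause_hidden` flag, and the labels "high", "low", and "paused" are made up for illustration and do not describe Zoom's actual logic.

```python
def plan_forwarding(publishers, visible, pause_hidden=True):
    """Decide, for ONE viewer, what to forward from each publisher."""
    plan = {}
    for name in publishers:
        if name in visible:
            plan[name] = "high"    # on screen: send full quality
        elif pause_hidden:
            plan[name] = "paused"  # off screen: stop forwarding for now
        else:
            plan[name] = "low"     # off screen: keep a cheap stream flowing
    return plan


print(plan_forwarding(["B", "C", "D"], visible={"B", "C"}))
```

The server would recompute a plan like this whenever the viewer scrolls, switches layout, or the active speaker changes.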
Does SFU mix videos together?
No.
It simply forwards.
You ---> SFU ---> B
You ---> SFU ---> C
You ---> SFU ---> D
No video mixing happens.
Then what is MCU?
Another architecture is MCU (Multipoint Control Unit).
MCU receives all streams,
A
\
\
B ---> MCU
/
/
C
It decodes everything,
mixes videos,
creates one combined video,
then sends that back.
MCU
│
▼
Combined Video
│
▼
All Participants
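To get a feel for what "mixing" means, here is a toy Python sketch that places already-decoded frames side by side. Frames are plain lists of pixel rows and `compose_side_by_side` is a hypothetical helper. A real MCU would also decode the incoming video first and re-encode the result afterwards, which is where most of the cost comes from.

```python
def compose_side_by_side(frames):
    """Combine equally tall frames into one wide frame."""
    if not frames:
        raise ValueError("need at least one frame")
    height = len(frames[0])
    if any(len(frame) != height for frame in frames):
        raise ValueError("all frames must have the same height")
    combined = []
    for row in range(height):
        combined_row = []
        for frame in frames:
            combined_row.extend(frame[row])  # append this frame's pixels
        combined.append(combined_row)
    return combined


frame_a = [[1, 1], [1, 1]]
frame_b = [[2, 2], [2, 2]]
print(compose_side_by_side([frame_a, frame_b]))
```

Every participant then receives that single combined frame instead of one frame per person.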
SFU vs MCU
| SFU | MCU |
| -------------------------------- | -------------------------------- |
| Simply forwards streams | Mixes streams |
| Faster | More processing |
| Lower latency | Higher latency |
| Less CPU on server | Heavy server computation |
| Scales well | More expensive |
| Used by many modern meeting apps | Used in some specialized systems |
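The difference also shows up if you count streams per participant. The sketch below is a simplification under stated assumptions: one stream per person, nothing paused by selective forwarding, and `streams_per_participant` is an illustrative name.

```python
def streams_per_participant(topology: str, n: int) -> dict:
    """Streams one participant uploads and downloads in an n-person call."""
    others = n - 1
    if topology == "mesh":
        return {"upload": others, "download": others}
    if topology == "sfu":
        # Upper bound on download: selective forwarding can send fewer.
        return {"upload": 1, "download": others}
    if topology == "mcu":
        return {"upload": 1, "download": 1}
    raise ValueError(f"unknown topology: {topology}")


for topology in ("mesh", "sfu", "mcu"):
    print(topology, streams_per_participant(topology, 10))
```

An MCU is the lightest for clients but the heaviest for the server, while an SFU keeps the server cheap and leaves decoding to each client.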
Why Zoom uses SFU
Because it allows:
- Millions of users
- Lower latency
- Better scalability
- Lower upload requirements
- Efficient bandwidth usage
Instead of every laptop sending many streams,
❌ You → B
❌ You → C
❌ You → D
❌ You → E
it becomes
✅ You
│
▼
SFU
/ / | \ \
B C D E F
Only one upload from your device.
Complete Flow
User
│
Capture Camera
│
Encode Video
│
Upload One Stream
│
──────────────
SFU
──────────────
Receives Stream
Forwards Stream
──────────────
Participants
B
C
D
E
F
Final Takeaway
- 1-on-1 calls can often use Peer-to-Peer (P2P) with WebRTC, where devices communicate directly after signaling and STUN-based connection setup.
- Group calls cannot rely on P2P because each participant would need to upload many separate streams, quickly exhausting bandwidth and device resources.
- To solve this, Zoom uses an SFU (Selective Forwarding Unit). Each participant uploads one media stream to the SFU, and the SFU intelligently forwards that stream to the other participants.
- This architecture keeps latency low, reduces upload bandwidth requirements, and allows Zoom to scale to meetings with many participants efficiently.
In one line:
"P2P works well for 1-to-1 calls, but for group meetings Zoom scales by using an SFU, where you upload one stream and the server forwards it to everyone else."
