Video calling has become an essential part of modern applications, from casual social apps to enterprise communication platforms. Building a basic video calling app locally can look simple, thanks to WebRTC and platforms like LiveKit, Agora, or Twilio, but deploying it to production introduces a whole new set of challenges. I ran into them myself while building a production-grade video calling application, and just deploying that app taught me a lot.
In this blog, we’ll explore:
- How video streaming works at its core
- Deployment challenges in production
- STUN servers
- TURN servers
- STUN vs TURN servers
- Bandwidth and bitrate considerations
- Local vs. Production video calling app comparison
But before we start, let's go over a few definitions that will be useful throughout this blog.
Bandwidth:
Alright, let’s keep it very simple
Bandwidth is like the width of a water pipe.
- A thicker pipe = more water can flow at once.
- A thinner pipe = less water can flow, so it takes longer to fill a bucket.
In the internet world:
- Bandwidth = how much data (audio, video, files) your internet can send or receive per second.
- It’s usually measured in Mbps (megabits per second).
Example:
- If your video call needs 2 Mbps and your internet bandwidth is only 1 Mbps, the video will lag or look blurry.
- If you have 50 Mbps bandwidth, you can easily run calls, stream movies, and browse at the same time.
In short: Bandwidth = how much data your internet pipe can carry at one time.
Now you may be thinking, "OHH MAANNN, so my internet speed is my bandwidth!!" and the answer is "NO".
Let me clear up the difference between the two:
Bandwidth vs Internet Speed
- Bandwidth → The maximum capacity of your internet connection.
  - Like the width of the road (how many cars can go at once).
  - Measured in Mbps (Megabits per second).
- Internet Speed → How fast data actually moves at a given moment.
  - Like the speed of cars on that road.
  - Affected by bandwidth, network congestion, server distance, etc.
Example:
- Your ISP plan = 100 Mbps bandwidth (the road can carry 100 cars).
- But if the website’s server is slow or many people in your house are streaming, you might only get 40 Mbps internet speed at that time (cars are moving slower).
In short:
- Bandwidth = maximum lane capacity.
- Internet speed = how fast traffic is actually moving now.
High bandwidth allows many devices to use the internet simultaneously without slowing down, and high speed makes individual tasks, such as downloading a file, complete faster.
Bitrate:
Bitrate means how much data is being sent or received every second in a video/audio stream.
- It’s measured in kbps (kilobits per second) or Mbps (megabits per second).
- Higher bitrate = better quality (clearer video/audio), but it needs more bandwidth.
- Lower bitrate = smaller data, works on slow internet, but quality looks worse.
Example with Video Call
- 360p video → ~300–600 kbps
- 720p video (HD) → ~1.2–1.5 Mbps
- 1080p video (Full HD) → ~2–3 Mbps
- 4K video → ~10–15 Mbps
If your internet bandwidth is less than the required bitrate, your call will lag, blur, or drop.
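To make the bitrate list concrete, here is a minimal sketch that picks the highest video quality a connection can carry. The tier values are the upper ends of the ranges above; the function name `pickQualityTier` and the 25% headroom (left free for audio and network fluctuation) are assumptions for illustration, not part of any library.

```typescript
// Upper-end bitrates from the list above, in kbps, ordered best-first.
interface QualityTier {
  label: string;
  requiredKbps: number;
}

const QUALITY_TIERS: QualityTier[] = [
  { label: "4K", requiredKbps: 15000 },
  { label: "1080p", requiredKbps: 3000 },
  { label: "720p", requiredKbps: 1500 },
  { label: "360p", requiredKbps: 600 },
];

// Returns the best tier whose bitrate fits into the available bandwidth.
// `headroom` is an assumed safety margin: 0.25 keeps 25% of the pipe free.
function pickQualityTier(availableKbps: number, headroom: number = 0.25): string {
  const usableKbps = availableKbps * (1 - headroom);
  for (const tier of QUALITY_TIERS) {
    if (tier.requiredKbps <= usableKbps) {
      return tier.label;
    }
  }
  // Nothing fits: the bitrate demand exceeds the bandwidth supply.
  return "audio-only";
}
```

With these assumptions, `pickQualityTier(2500)` returns `"720p"` and `pickQualityTier(500)` returns `"audio-only"`.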
Simple Analogy
Bitrate is like the amount of water actually flowing through the pipe each second.
- Bandwidth = pipe size (maximum it can carry).
- Bitrate = how much water you’re actually pouring in.
If your bitrate demand is more than your bandwidth supply → overflow (lag).
Bandwidth vs Bitrate
Term | What it Means | Measured In | Analogy | Example in Video Calls |
---|---|---|---|---|
Bandwidth | Maximum capacity of your internet connection (how much data can flow at once). | Mbps (Megabits per second) | Pipe size (wide pipe = more flow) | Your Wi-Fi plan gives you 50 Mbps bandwidth. |
Bitrate | Actual data used by video/audio per second. | Kbps / Mbps | Water flowing inside the pipe | A 720p video call needs ~1.5 Mbps bitrate. |
In short:
- Bandwidth = supply (how much your internet can handle).
- Bitrate = demand (how much your video/audio needs).
If bitrate > bandwidth, you’ll see lag, buffering, or blurry video.
NAT:
NAT (Network Address Translation) is a method used by routers to allow multiple devices in a private network (like your home Wi-Fi) to share a single public IP address on the internet.
It is a process done by your router that:
- Takes the private IP of your device (like your phone/laptop on your home Wi-Fi).
- Converts it into the public IP of your internet connection when you go online.
- When a reply comes back, converts that public IP back into your device’s private IP.
This way, many devices in your home can share one internet connection and one public IP.
In other words, NAT hides private IPs from the outside world: it translates them into the public IP when sending data out, and back into private IPs when replies come in. (If you are thinking that this means your private IP is not exposed to the internet when you sit behind something like a Wi-Fi router, you are mostly right.)
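The translation steps can be sketched as a small in-memory table. This is a simplified model, not how any particular router is implemented: the class name `NatRouter`, the starting port 40000, and the one-port-per-private-endpoint mapping are all assumptions for illustration.

```typescript
interface NetEndpoint {
  ip: string;
  port: number;
}

// Toy NAT: maps each private ip:port to a port on the single public IP.
class NatRouter {
  private readonly publicIp: string;
  private nextPort = 40000; // hypothetical first public port
  private readonly outbound = new Map<string, number>(); // "privateIp:port" -> public port
  private readonly inbound = new Map<number, NetEndpoint>(); // public port -> private endpoint

  constructor(publicIp: string) {
    this.publicIp = publicIp;
  }

  // Outgoing packet: replace the private source address with the public one.
  translateOutbound(src: NetEndpoint): NetEndpoint {
    const key = `${src.ip}:${src.port}`;
    let publicPort = this.outbound.get(key);
    if (publicPort === undefined) {
      publicPort = this.nextPort++;
      this.outbound.set(key, publicPort);
      this.inbound.set(publicPort, src);
    }
    return { ip: this.publicIp, port: publicPort };
  }

  // Incoming reply: look up which private device the public port belongs to.
  // Returns undefined when no device opened that mapping (packet is dropped).
  translateInbound(publicPort: number): NetEndpoint | undefined {
    return this.inbound.get(publicPort);
  }
}
```

The `undefined` case in `translateInbound` is the reason peers behind NAT cannot simply be dialed from outside: nothing reaches a device until that device has sent something out first.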
SFU = Selective Forwarding Unit
It’s a special type of server used in video calling apps.
In a normal peer-to-peer call (mesh): every participant sends their video to everyone else directly. If 5 people are in a call, each person sends 4 copies of their video → heavy on upload bandwidth.
With an SFU:
- Each participant sends only one copy of their video to the SFU server.
- The SFU then forwards (routes) that video to all the other participants.
- The SFU doesn’t decode/re-encode video (it just forwards it), so it’s efficient.
Benefits of SFU:
- Saves a lot of upload bandwidth.
- Works better for group calls.
- Supports features like simulcast (sending multiple quality levels so others can get 720p, 360p, etc., based on their internet).
Downside:
- Still needs a server (extra cost).
- Each participant still has to download multiple streams (one for each person in the call).
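The forwarding idea can be modeled in a few lines. This is a toy sketch, not a real SFU (which deals with RTP, congestion control, and much more); the class `TinySfu` and its methods are hypothetical names.

```typescript
type PacketHandler = (fromId: string, packet: Uint8Array) => void;

// Toy SFU: receives one copy from the sender and forwards it to everyone else.
class TinySfu {
  private readonly participants = new Map<string, PacketHandler>();

  join(id: string, onPacket: PacketHandler): void {
    this.participants.set(id, onPacket);
  }

  leave(id: string): void {
    this.participants.delete(id);
  }

  // Forwards the packet untouched (no decode/re-encode) to every participant
  // except the sender, and returns how many copies were sent out.
  publish(fromId: string, packet: Uint8Array): number {
    let forwarded = 0;
    this.participants.forEach((handler, id) => {
      if (id === fromId) {
        return;
      }
      handler(fromId, packet);
      forwarded += 1;
    });
    return forwarded;
  }
}
```

In a 5-person call, one `publish` results in 4 forwarded copies: the fan-out happens on the server's bandwidth instead of the sender's upload.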
How Video Streaming Works in Video Calls
At the core, a video calling app involves:
- Capturing Media → Device camera and microphone capture audio/video.
- Encoding & Compression → Using codecs like VP8, H.264, or Opus (audio) to reduce bandwidth usage.
- Packetization → Splitting data into small packets for network transmission.
- Transport Layer → Typically WebRTC, which carries the media as encrypted RTP (SRTP), usually over UDP, for real-time peer-to-peer communication.
- Rendering → On the receiving side, packets are reassembled, decoded, and rendered in `<video>` tags.
This flow is straightforward on a local network but faces network, scaling, and reliability challenges in production.
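The packetization and reassembly steps from the list above can be sketched like this. WebRTC does this for you inside the browser; the code only illustrates the idea. The 1200-byte payload size is an assumption (a commonly used value that stays under a typical 1500-byte MTU), and `packetize` / `reassemble` are hypothetical helper names.

```typescript
interface MediaPacket {
  seq: number; // sequence number so the receiver can restore the order
  payload: Uint8Array;
}

// Splits one encoded frame into packets of at most `maxPayloadBytes` each.
function packetize(frame: Uint8Array, maxPayloadBytes: number = 1200): MediaPacket[] {
  const packets: MediaPacket[] = [];
  let seq = 0;
  for (let offset = 0; offset < frame.length; offset += maxPayloadBytes) {
    packets.push({ seq, payload: frame.slice(offset, offset + maxPayloadBytes) });
    seq += 1;
  }
  return packets;
}

// Puts packets back in sequence order and joins their payloads into one frame.
// Assumes no packet was lost; real stacks also handle loss and retransmission.
function reassemble(packets: MediaPacket[]): Uint8Array {
  const ordered = packets.slice().sort((a, b) => a.seq - b.seq);
  const totalBytes = ordered.reduce((sum, p) => sum + p.payload.length, 0);
  const frame = new Uint8Array(totalBytes);
  let offset = 0;
  for (const p of ordered) {
    frame.set(p.payload, offset);
    offset += p.payload.length;
  }
  return frame;
}
```

Sorting by `seq` matters because packets sent over UDP may arrive out of order.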
Deployment Challenges in Production
You may have seen plenty of YouTube tutorials that show how to build a video calling application with Node.js and similar tools, and they make it look quite simple. But is it really that simple?
When moving from a local test app to a production-ready system, you face:
- NAT and Firewall Traversal
  - Local: Devices on the same Wi-Fi often connect directly.
  - Production: Users are behind NATs, corporate firewalls, or symmetric NATs. Direct peer-to-peer connections often fail.
- Scalability
  - Local: 1:1 calls are manageable.
  - Production: Group calls (5, 10, 50+) require SFUs (Selective Forwarding Units) or MCUs (Multipoint Control Units) to efficiently handle streams.
- Reliability & Quality
  - Packet loss, jitter, and fluctuating bandwidth affect call quality.
  - Adaptive bitrate algorithms (Simulcast, SVC) are required for a smooth experience.
- Security & Privacy
  - End-to-end encryption must be enforced.
  - Compliance (HIPAA, GDPR) may apply for enterprise or healthcare apps.
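To give a feel for what "adaptive bitrate" means, here is a minimal loss-based sketch. WebRTC stacks already do this internally, so you would not normally write it yourself. The thresholds (back off above 10% loss, probe upward below 2%) are assumptions loosely modeled on the loss-based rule described for Google Congestion Control, and the function name and min/max limits are hypothetical.

```typescript
// Computes the next target video bitrate from the current one and the
// fraction of packets lost (0.0 to 1.0) reported for the last interval.
function nextTargetBitrateKbps(
  currentKbps: number,
  lossFraction: number,
  minKbps: number = 300,
  maxKbps: number = 3000
): number {
  let next = currentKbps;
  if (lossFraction > 0.1) {
    // Heavy loss: the network is overloaded, back off proportionally.
    next = currentKbps * (1 - 0.5 * lossFraction);
  } else if (lossFraction < 0.02) {
    // Almost no loss: carefully probe for more bandwidth (+5%).
    next = currentKbps * 1.05;
  }
  // Between 2% and 10% loss the bitrate is held steady.
  return Math.min(maxKbps, Math.max(minKbps, Math.round(next)));
}
```

For example, at 1500 kbps with 20% loss the target drops to 1350 kbps, while at 0% loss it rises to 1575 kbps.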
What is a STUN Server?
STUN = Session Traversal Utilities for NAT
- A STUN server helps your device find out its public IP and port when it’s behind a NAT (like your Wi-Fi router).
- With this info, two devices (peers) can try to connect directly to each other.
- Example: Your laptop asks the STUN server, “What’s my public address?” The server replies, “It’s 49.35.21.10:54555.” Now your laptop can share this with another peer for direct communication.
In short: STUN is like asking Google Maps “What’s my current location?” so you can tell your friend where to meet you.
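Under the hood, the "What's my public address?" question is a tiny binary message. The sketch below builds a STUN Binding Request header and decodes the IPv4 `XOR-MAPPED-ADDRESS` value from a response, following the message layout in the STUN specification (RFC 5389). It leaves out sending the packet over UDP and parsing the full response; the function names are hypothetical, and the browser normally handles all of this for you.

```typescript
const STUN_MAGIC_COOKIE = 0x2112a442;

// Builds the 20-byte header of a Binding Request with no attributes.
// The caller supplies a random 12-byte transaction ID.
function buildBindingRequest(transactionId: Uint8Array): Uint8Array {
  if (transactionId.length !== 12) {
    throw new Error("transaction ID must be 12 bytes");
  }
  const message = new Uint8Array(20);
  const view = new DataView(message.buffer);
  view.setUint16(0, 0x0001); // message type: Binding Request
  view.setUint16(2, 0); // message length: no attributes follow
  view.setUint32(4, STUN_MAGIC_COOKIE);
  message.set(transactionId, 8);
  return message;
}

// Decodes the value of an IPv4 XOR-MAPPED-ADDRESS attribute:
// [reserved, family, x-port (2 bytes), x-address (4 bytes)].
function decodeXorMappedAddressV4(value: Uint8Array): string {
  const view = new DataView(value.buffer, value.byteOffset, value.byteLength);
  if (value.length < 8 || view.getUint8(1) !== 0x01) {
    throw new Error("expected an IPv4 XOR-MAPPED-ADDRESS value");
  }
  // Port is XORed with the top 16 bits of the magic cookie, address with all 32.
  const port = view.getUint16(2) ^ (STUN_MAGIC_COOKIE >>> 16);
  const addr = (view.getUint32(4) ^ STUN_MAGIC_COOKIE) >>> 0;
  const octets = [addr >>> 24, (addr >>> 16) & 0xff, (addr >>> 8) & 0xff, addr & 0xff];
  return `${octets.join(".")}:${port}`;
}
```

Feeding it the bytes `00 01 F4 09 10 31 B1 48` yields `49.35.21.10:54555`, the example address used above.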
What is a TURN Server?
TURN = Traversal Using Relays around NAT
- Sometimes, direct connection fails (for example, in offices with strict firewalls or symmetric NAT).
- In that case, a TURN server acts like a post office (relay).
- Instead of sending video/audio directly to the other peer, your data goes through the TURN server → TURN forwards it to the other person.
- This guarantees the connection works, but it increases latency and server cost because all media traffic flows through TURN.
In short: TURN is like saying, “We can’t meet directly, so let’s send all our letters through the post office.”
Key difference:
- STUN → Helps peers connect directly.
- TURN → Relays traffic when direct connection is impossible.
STUN vs TURN: What’s the Difference?
When peers try to connect, they need help to “discover” each other’s public IPs and establish connectivity. That’s where STUN and TURN servers come in.
Feature | STUN (Session Traversal Utilities for NAT) | TURN (Traversal Using Relays around NAT) |
---|---|---|
Purpose | Finds your public IP and port | Relays traffic if direct P2P fails |
Usage | Lightweight, used in 70-80% of cases | Heavy, used when strict firewalls/NATs block P2P |
Latency | Low (direct P2P connection) | Higher (adds relay hop) |
Cost | Cheap / Free (Google STUN servers available) | Expensive (requires high bandwidth relay servers) |
When Needed | Simple NATs; may still be blocked by strict firewalls | Enterprise networks, corporate firewalls, symmetric NATs |
- In local apps: STUN often works fine.
- In production apps: You must deploy TURN servers (like Coturn) for guaranteed connectivity.
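In code, both servers end up in the ICE configuration of the peer connection. The sketch below builds such a configuration in the shape the browser's `RTCPeerConnection` expects (`iceServers`, `iceTransportPolicy`). The local interfaces mirror that shape so the snippet is self-contained; the TURN host name, the credentials, and the helper name `buildPeerConfig` are placeholders, and ports 3478/5349 are the conventional TURN/TURNS ports that your own server may or may not use.

```typescript
// Local mirrors of the browser's ICE configuration shape.
interface IceServerEntry {
  urls: string | string[];
  username?: string;
  credential?: string;
}

interface PeerConfig {
  iceServers: IceServerEntry[];
  iceTransportPolicy: "all" | "relay";
}

// Builds a config with a public STUN server plus your own TURN server.
// `forceRelay` sends everything through TURN, which is handy for testing
// that the relay actually works before real users depend on it.
function buildPeerConfig(
  turnHost: string,
  username: string,
  credential: string,
  forceRelay: boolean = false
): PeerConfig {
  return {
    iceServers: [
      { urls: "stun:stun.l.google.com:19302" },
      {
        urls: [
          `turn:${turnHost}:3478?transport=udp`,
          `turn:${turnHost}:3478?transport=tcp`,
          `turns:${turnHost}:5349?transport=tcp`,
        ],
        username,
        credential,
      },
    ],
    iceTransportPolicy: forceRelay ? "relay" : "all",
  };
}
```

In a browser, the result could be passed straight to the constructor, for example `new RTCPeerConnection(buildPeerConfig("turn.example.com", "user", "secret"))`. Offering TCP and TLS variants alongside UDP is a common choice because some corporate firewalls block UDP entirely.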
Simple Analogy
- STUN = Asking a friend, “Where am I on the map?” → so you can tell others how to reach you.
- TURN = Post office → if you and your friend can’t meet directly, you both send letters to the post office, and it forwards them.
Flow of STUN/TURN in a Video Call
- Device captures media
  - Your app captures video/audio from the camera and mic.
- Signaling phase
  - Both users (peers) exchange “offers” and “answers” (SDP – Session Description Protocol) through a signaling server (e.g., WebSocket, REST).
  - This describes codecs, capabilities, and candidate connection options.
- ICE Candidate Gathering
  - Each device tries to figure out how it can be reached over the internet.
  - Here, the STUN server is contacted:
    - Peer asks: “What’s my public IP & port?”
    - STUN replies with: “Your public address is 49.35.21.10:54555.”
  - This info is added as an ICE candidate.
- Peer-to-Peer Connection Attempt
  - Both peers exchange their ICE candidates (possible network paths).
  - They try to connect directly using these candidates (NAT traversal).
  - If successful → direct P2P connection established.
- Fallback to TURN (if direct fails)
  - If peers cannot connect directly (because of strict NAT/firewall), they use a TURN server.
  - In this case:
    - Peer A sends data to TURN server.
    - TURN relays it to Peer B (and vice versa).
  - This increases latency and server cost but guarantees connection.
Flow Summary in One Line
App → Signaling → STUN (find public IP) → Try direct P2P → If fail, use TURN relay.
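When debugging this flow, it helps to know which kind of path a call ended up on. Each ICE candidate line carries a `typ` field: `host` (local address), `srflx` (public address learned via STUN), `prflx` (discovered during connectivity checks), or `relay` (allocated on a TURN server). The sketch below reads that field; the helper names are hypothetical.

```typescript
type CandidateKind = "host" | "srflx" | "prflx" | "relay";

// Extracts the candidate type from an ICE candidate line,
// or returns undefined if the line has no recognizable "typ" field.
function candidateKind(candidateLine: string): CandidateKind | undefined {
  const match = /\btyp (host|srflx|prflx|relay)\b/.exec(candidateLine);
  return match ? (match[1] as CandidateKind) : undefined;
}

// A call is relayed through TURN if either side of the selected
// candidate pair is a relay candidate.
function usesTurnRelay(localLine: string, remoteLine: string): boolean {
  return candidateKind(localLine) === "relay" || candidateKind(remoteLine) === "relay";
}
```

For instance, a line such as `candidate:2 1 udp 1686052607 49.35.21.10 54555 typ srflx raddr 192.168.1.5 rport 54555` is classified as `srflx`, meaning the address came from STUN. Tracking how many calls end up on `relay` gives a rough idea of how much TURN capacity you need.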
Bandwidth and Bitrate Calculations
Video quality directly impacts bandwidth usage. Approximate bitrates:
- Audio (Opus codec) → 30–100 kbps
- Video (VP8 / H.264):
  - 360p (low) → 300–600 kbps
  - 720p (HD) → 1.2–1.5 Mbps
  - 1080p (Full HD) → 2–3 Mbps
  - 4K → 10–15 Mbps
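Example: A 4-person call at 720p (~1.5 Mbps per stream)
- Without SFU (mesh): Each peer sends 3 outgoing streams → ~4.5 Mbps upstream, and receives 3 streams → ~4.5 Mbps downstream.
- With SFU: Each peer sends 1 stream → ~1.5 Mbps upstream, and receives 3 streams → ~4.5 Mbps downstream.
SFUs cut each participant's upload bandwidth dramatically (the download side stays the same) and are practically necessary for group calls in production.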
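The same arithmetic can be written as a small estimator. It is a rough sketch that assumes every stream uses the same bitrate in both topologies and ignores audio, simulcast, and protocol overhead; the function and field names are hypothetical.

```typescript
interface CallBandwidth {
  uploadMbpsPerPeer: number;
  downloadMbpsPerPeer: number;
  serverEgressMbps: number; // traffic the SFU itself has to send out
}

// Estimates bandwidth for a call where every participant publishes one
// video stream of `bitrateMbps` and watches everyone else.
function estimateCallBandwidth(
  participants: number,
  bitrateMbps: number,
  topology: "mesh" | "sfu"
): CallBandwidth {
  const others = Math.max(0, participants - 1);
  const downloadMbpsPerPeer = others * bitrateMbps;
  if (topology === "mesh") {
    // Mesh: one outgoing copy per other participant, no server involved.
    return { uploadMbpsPerPeer: others * bitrateMbps, downloadMbpsPerPeer, serverEgressMbps: 0 };
  }
  // SFU: one outgoing copy per peer; the server fans it out to the others.
  return {
    uploadMbpsPerPeer: bitrateMbps,
    downloadMbpsPerPeer,
    serverEgressMbps: participants * others * bitrateMbps,
  };
}
```

For 4 participants at 1.5 Mbps, mesh needs 4.5 Mbps of upload per peer, while the SFU setup needs 1.5 Mbps of upload per peer and about 18 Mbps of outgoing traffic on the server. That server-side number grows roughly with the square of the participant count, which is worth keeping in mind when sizing infrastructure.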
Local vs Production Video Calling App
Aspect | Local Setup (Testing) | Production Setup (Real-world) |
---|---|---|
Connectivity | Direct peer-to-peer (via STUN) | Needs TURN for reliability |
Scale | Works for 1:1 or small calls | Needs SFU/MCU for group calls |
Latency | Low (same Wi-Fi) | Variable (depends on networks) |
Bandwidth Handling | Fixed, no adaptive bitrate | Adaptive bitrate + Simulcast |
Security | Basic encryption (WebRTC default) | Enterprise-grade, compliance |
Cost | Minimal | Higher (TURN servers, infra) |
Key Takeaways
- Building a local video calling app is easy, but production requires TURN, SFU/MCU, adaptive bitrate, and strong security.
- STUN helps 70–80% of cases, but TURN is mandatory for full reliability.
- Bandwidth grows quickly with participants; SFU saves bandwidth and improves performance.
- Always calculate expected bitrate × participants to estimate infrastructure needs.