A developer's deep dive into what actually happens when you make a phone call
So you're building a voice call feature in your app. You pick up a library, maybe WebRTC or a third-party SDK, and things just... work. But then a question hits you mid-implementation:
"Wait β how is voice data actually being sent? And how is this different from a regular phone call?"
That exact thought led me down a rabbit hole. This article breaks it all down β in plain English, with real technical depth underneath.
The Big Picture First
When you speak into a phone, your voice is just air vibrations (an analog signal). Before it can travel anywhere, through towers or the internet, it must be converted into digital data. Both call types do this. The difference is how that data travels afterward.
Your Voice (Analog)
        ↓
Digitize + Compress
        ↓
┌─────────────┐   ┌────────────────┐
│ No Internet │   │ With Internet  │
│  GSM/VoLTE  │   │  WebRTC/VoIP   │
└─────────────┘   └────────────────┘
Part 1: Normal Phone Calls (No Internet) 📞
What's happening under the hood?
A regular phone call uses your telecom operator's infrastructure (towers, cables, switching centers), completely independent of the internet.
Step-by-Step Flow
You speak 🎤
   ↓
Microphone captures analog audio
   ↓
ADC (Analog-to-Digital Converter) → digital signal
   ↓
Codec compresses it (AMR / AMR-WB / EVS)
   ↓
Sent to nearest Cell Tower 📡
   ↓
Telecom Core Network (routes the call)
   ↓
Receiver's Cell Tower 📡
   ↓
Receiver's phone decodes → plays audio 🔊
The Codec: AMR (Adaptive Multi-Rate)
This is the compression algorithm used in traditional calls. It's smart: it adapts its bitrate to network conditions.
| AMR Mode | Bitrate | Quality |
|---|---|---|
| AMR 4.75 | 4.75 kbps | Low (weak signal) |
| AMR 12.2 | 12.2 kbps | High (strong signal) |
| AMR-WB (HD Voice) | 23.85 kbps | HD quality |
What does the data look like?
Under the hood, voice is not sent as one big audio file. It's split into tiny chunks; each chunk represents about 20 milliseconds of audio.
[20ms chunk] → [20ms chunk] → [20ms chunk] → [20ms chunk] → ...
     #1             #2             #3             #4
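The 20 ms framing makes the numbers easy to sanity-check: at a given AMR bitrate, each frame carries bitrate × 0.02 s of compressed audio. A quick sketch (byte counts here ignore channel coding and transport headers):

```javascript
// Back-of-the-envelope frame sizes for 20 ms speech frames, using the AMR
// mode bitrates from the table above.
function frameBytes(bitrateKbps, frameMs = 20) {
  const bits = bitrateKbps * 1000 * (frameMs / 1000); // bits per frame
  return Math.ceil(bits / 8);                         // round up to whole bytes
}

console.log(frameBytes(4.75));  // AMR 4.75  -> 12 bytes per 20 ms frame
console.log(frameBytes(12.2));  // AMR 12.2  -> 31 bytes
console.log(frameBytes(23.85)); // AMR-WB    -> 60 bytes
console.log(1000 / 20);         // 50 frames per second of speech
```

So even "HD Voice" is only a few dozen bytes of speech data fifty times a second.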
Each frame looks something like this conceptually:
{
"type": "voice_frame",
"codec": "AMR",
"sequence": 101,
"timestamp": 2003400,
"payload": "<compressed binary audio bytes>"
}
⚠️ In reality it's binary, not JSON, but this structure represents what's inside each packet.
Circuit Switching vs VoLTE
Old GSM (2G/3G): Circuit Switching
- A dedicated "pipe" is reserved just for your call
- Like booking a private road: no one else uses it during your call
- Very stable, but inefficient (resources wasted during silence)
VoLTE (4G/5G): Packet Switching (but controlled)
- Voice is broken into packets like internet data
- But the network gives it priority (QoS, Quality of Service)
- Lower latency, HD quality, still uses telecom infrastructure
Part 2: Internet Calls (WhatsApp, WebRTC) 🌐
What's happening under the hood?
Apps like WhatsApp, Google Meet, and Discord use the internet to carry voice. The key technology here is WebRTC (Web Real-Time Communication), an open standard built into browsers and mobile OSes.
Step-by-Step Flow
You speak 🎤
   ↓
Microphone captures analog audio
   ↓
ADC → digital signal
   ↓
Opus Codec compresses it
   ↓
Packetized into UDP packets
   ↓
Sent via Internet (WiFi / 4G / 5G)
   ↓
STUN/TURN Server (for NAT traversal)
   ↓
Peer-to-Peer connection (WebRTC)
   ↓
Receiver reassembles packets → decodes → plays audio 🔊
The Codec: Opus
Opus is the go-to codec for internet voice/audio. It's open-source, low-latency, and adaptive.
| Feature | Opus |
|---|---|
| Bitrate range | 6–510 kbps |
| Latency | ~20 ms |
| Handles packet loss? | ✅ Yes (built-in FEC) |
| Quality at low bitrate | Excellent |
| Used by | WhatsApp, Discord, Zoom, WebRTC |
Opus has Forward Error Correction (FEC) built in: the encoder can embed a low-bitrate copy of the previous frame in each packet, so if a packet is lost the decoder can still reconstruct that audio. That's why internet calls still sound okay despite minor packet loss.
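To make the FEC idea concrete, here is a toy sketch (not the real Opus bitstream): each packet carries a low-quality copy of the previous frame, so a single lost packet can be patched from its successor.

```javascript
// Toy forward-error-correction illustration, NOT the real Opus format:
// each packet carries its own frame plus a low-quality copy of the previous one.
function sendWithFec(frames) {
  return frames.map((frame, i) => ({
    seq: i,
    main: frame,
    fecPrev: i > 0 ? `lq(${frames[i - 1]})` : null, // redundant copy of frame i-1
  }));
}

// Decode in order, patching a lost frame from the next packet's redundant copy.
function decode(packets, lostSeqs) {
  const received = packets.filter(p => !lostSeqs.has(p.seq));
  const out = [];
  for (let i = 0; i < packets.length; i++) {
    const pkt = received.find(p => p.seq === i);
    if (pkt) {
      out.push(pkt.main);
    } else {
      const next = received.find(p => p.seq === i + 1);
      out.push(next && next.fecPrev ? next.fecPrev : '<gap>'); // conceal or gap
    }
  }
  return out;
}

const packets = sendWithFec(['f0', 'f1', 'f2', 'f3']);
console.log(decode(packets, new Set([2])));
// -> ['f0', 'f1', 'lq(f2)', 'f3']  (lost frame recovered at lower quality)
```

The listener hears a slightly muffled 20 ms instead of a dropout, which is usually imperceptible.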
Why UDP and not TCP?
This is one of the most important decisions in real-time audio.
TCP (used in HTTP, file downloads):
- Guarantees delivery: if a packet is lost, it resends it
- Problem: resending takes time → delay → unacceptable in real-time voice
UDP (used in WebRTC voice):
- No guarantee of delivery
- No resending lost packets
- But it's fast: packets go out and don't wait
In voice calls, an audio packet that arrives 200 ms late is useless anyway. Better to skip it and keep playing forward than to wait for a retry.
TCP mindset: "Wait, I need packet #47 before I continue" ❌ (for voice)
UDP mindset: "Packet #47 is gone? Fine, move on." ✅ (for voice)
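That trade-off can be sketched as a playout-deadline check: a packet re-sent TCP-style costs roughly one round trip per retry, which pushes it past the point where the audio is still useful. The 200 ms deadline and 120 ms RTT below are illustrative numbers, not protocol constants.

```javascript
// Why retransmission (TCP-style) doesn't work for live voice: a lost packet
// re-sent after a round trip or two arrives past the playout deadline.
const PLAYOUT_DEADLINE_MS = 200; // illustrative: older audio is useless
const RTT_MS = 120;              // assumed round-trip time

function arrivesInTime(oneWayMs, retransmits = 0) {
  const arrival = oneWayMs + retransmits * RTT_MS; // each retry costs ~1 RTT
  return arrival <= PLAYOUT_DEADLINE_MS;
}

console.log(arrivesInTime(60));    // true  - normal UDP delivery makes it
console.log(arrivesInTime(60, 2)); // false - two TCP-style retries arrive too late
```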
How WebRTC Establishes Connection (Simplified)
- Signaling: both peers exchange metadata (IP, codec support) via a server
- ICE (Interactive Connectivity Establishment): finding the best network path
- STUN Server: figures out your public IP (you're usually behind a router/NAT)
- TURN Server: relays traffic if direct P2P fails (firewall situations)
- DTLS Handshake: encrypted connection established
- SRTP: voice packets flow securely, peer-to-peer
Caller                    Signaling Server                   Receiver
  |                              |                               |
  |----offer (SDP)------------->|                               |
  |                              |-------offer (SDP)----------->|
  |                              |<------answer (SDP)-----------|
  |<---answer (SDP)-------------|                               |
  |                              |                               |
  |<===============ICE Candidates exchanged=====================>|
  |                                                              |
  |<===============P2P Voice (SRTP/UDP)=========================>|
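One piece WebRTC deliberately leaves to you is the signaling channel itself. As a sketch, here is an in-memory stand-in for what a WebSocket signaling server does: forward SDP offers/answers and ICE candidates between named peers. All names and message shapes here are illustrative, not part of any WebRTC API.

```javascript
// Minimal in-memory signaling relay: forwards messages between named peers.
// A real deployment would do the same over WebSockets with auth attached.
class SignalingRelay {
  constructor() { this.peers = new Map(); } // peerId -> message handler
  register(peerId, onMessage) { this.peers.set(peerId, onMessage); }
  send(from, to, message) {
    const handler = this.peers.get(to);
    if (handler) handler({ from, ...message }); // tag sender, deliver
  }
}

const relay = new SignalingRelay();
const log = [];
relay.register('caller',   msg => log.push(`caller got ${msg.type} from ${msg.from}`));
relay.register('receiver', msg => log.push(`receiver got ${msg.type} from ${msg.from}`));

relay.send('caller', 'receiver', { type: 'offer',  sdp: '<sdp offer>' });
relay.send('receiver', 'caller', { type: 'answer', sdp: '<sdp answer>' });
relay.send('caller', 'receiver', { type: 'ice-candidate', candidate: '<candidate>' });
console.log(log);
// -> ['receiver got offer from caller',
//     'caller got answer from receiver',
//     'receiver got ice-candidate from caller']
```

Note the relay never touches the voice itself; once ICE succeeds, media flows peer-to-peer.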
What does the data look like?
{
"type": "audio_packet",
"codec": "opus",
"ssrc": 3892741023,
"sequence": 4821,
"timestamp": 96000,
"payload": "<opus encoded binary>"
}
This is an RTP (Real-time Transport Protocol) packet. WebRTC wraps it in SRTP (Secure RTP) for encryption.
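The JSON above is conceptual; on the wire, RTP has a fixed 12-byte binary header (RFC 3550). A sketch that packs the same example values (payload type 111 is a commonly used dynamic mapping for Opus, not a fixed constant):

```javascript
// Build the fixed 12-byte RTP header from RFC 3550 with the example values
// shown in the JSON above. Payload type 111 is an assumed dynamic PT for Opus.
function buildRtpHeader({ payloadType, sequence, timestamp, ssrc }) {
  const buf = new DataView(new ArrayBuffer(12));
  buf.setUint8(0, 0x80);               // version 2, no padding/extension/CSRC
  buf.setUint8(1, payloadType & 0x7f); // marker bit 0 + 7-bit payload type
  buf.setUint16(2, sequence);          // 16-bit sequence number
  buf.setUint32(4, timestamp);         // media timestamp (48 kHz clock for Opus)
  buf.setUint32(8, ssrc);              // stream identifier
  return buf;
}

const h = buildRtpHeader({ payloadType: 111, sequence: 4821, timestamp: 96000, ssrc: 3892741023 });
console.log(h.getUint16(2)); // 4821  - sequence number round-trips
console.log(h.getUint32(8)); // 3892741023 - ssrc round-trips
```

The Opus payload bytes follow immediately after these 12 header bytes; SRTP then encrypts the payload and authenticates the whole packet.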
Part 3: Side-by-Side Comparison
| Feature | Normal Call 📞 | Internet Call 🌐 |
|---|---|---|
| Network | Telecom (Jio, Airtel) | Internet (WiFi / Mobile data) |
| Protocol | GSM / VoLTE | WebRTC (RTP over UDP) |
| Codec | AMR / AMR-WB / EVS | Opus |
| Latency | ~100–150 ms | ~150–300 ms (network-dependent) |
| Data path | Operator controlled | Peer-to-peer (mostly) |
| Delivery | Guaranteed (circuit/priority) | Best-effort (UDP) |
| Encryption | Limited (operator can see) | Encrypted (DTLS + SRTP; E2E when P2P) |
| Packet loss handling | Network-level QoS | Opus FEC + NACK |
| Works without data? | ✅ Yes | ❌ No |
| Cost | Per minute or bundled | Uses ~0.3–0.5 MB/min |
| Emergency calls | ✅ Works | ❌ Cannot call 112/911 |
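The data-usage row can be sanity-checked with quick arithmetic: codec payload plus per-packet RTP/UDP/IP headers, at 50 packets per second. The 24 and 48 kbps rates below are assumed typical Opus voice bitrates, not measured values from any particular app.

```javascript
// Rough data usage per minute of an Opus call: payload + per-packet overhead.
// Header sizes: RTP (12) + UDP (8) + IPv4 (20) = 40 bytes per packet.
function megabytesPerMinute(codecKbps, packetsPerSec = 50, headerBytes = 40) {
  const codecBytesPerSec = (codecKbps * 1000) / 8;    // compressed audio
  const overheadPerSec = packetsPerSec * headerBytes; // protocol headers
  return ((codecBytesPerSec + overheadPerSec) * 60) / 1e6;
}

console.log(megabytesPerMinute(24).toFixed(2)); // 0.30 MB/min at 24 kbps
console.log(megabytesPerMinute(48).toFixed(2)); // 0.48 MB/min at 48 kbps
```

That lands right in the table's 0.3–0.5 MB/min range, which is reassuring.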
Part 4: Why Voice Sometimes Breaks on Internet Calls 🤖
Ever heard someone sound like a robot during a WhatsApp call? Here's exactly why:
1. Packet Loss
Some UDP packets don't arrive. If too many are lost in a row, the audio decoder has gaps → robotic or stuttering sound.
2. Jitter
Packets arrive out of order or unevenly spaced. WebRTC uses a jitter buffer to smooth this out, but if jitter is too high, the buffer overflows or the audio gets chopped.
Sent:     [P1]--[P2]--[P3]--[P4]--[P5]
Received: [P1]------[P3][P2]----[P5]   ← P4 lost, P2 P3 swapped
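A jitter buffer's job can be sketched in a few lines: hold incoming packets briefly, release them in sequence order, and flag anything that never arrived so the decoder can conceal it.

```javascript
// Minimal jitter-buffer sketch: play packets back in sequence order and mark
// frames that never arrived so the decoder can conceal them.
function playout(received, firstSeq, lastSeq) {
  const bySeq = new Map(received.map(p => [p.seq, p])); // index by sequence
  const out = [];
  for (let seq = firstSeq; seq <= lastSeq; seq++) {
    out.push(bySeq.has(seq) ? bySeq.get(seq).audio : `<conceal ${seq}>`);
  }
  return out;
}

// Matches the diagram above: P2/P3 arrive swapped, P4 never arrives.
const received = [
  { seq: 1, audio: 'P1' }, { seq: 3, audio: 'P3' },
  { seq: 2, audio: 'P2' }, { seq: 5, audio: 'P5' },
];
console.log(playout(received, 1, 5));
// -> ['P1', 'P2', 'P3', '<conceal 4>', 'P5']
```

Real jitter buffers also adapt their depth to measured jitter; deeper buffering smooths more reordering at the cost of added latency.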
3. Network Handoff
When you're moving (driving, walking), your phone switches between towers or WiFi ↔ 4G. During the handoff, packets drop → brief audio glitch.
4. Congestion
Your internet connection is shared. If someone starts a big download in parallel, your voice packets compete for bandwidth → delay spikes.
Part 5: As a Developer β What Should You Know?
If you're building a voice feature, here are the key decisions:
Choosing your approach
Use WebRTC if:
- Building for web/mobile app
- Need P2P, low cost at scale
- Want E2E encryption
- Don't need emergency call support
Use VoIP / SIP if:
- Need PSTN (real phone number) integration
- Need to call regular phones
- Enterprise telephony
Use a managed SDK if:
- Fast shipping matters
- Examples: Twilio, Agora, Daily.co, Vonage
Key WebRTC APIs to know
// Get user's microphone
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
// Create peer connection
const pc = new RTCPeerConnection({
iceServers: [{ urls: 'stun:stun.l.google.com:19302' }]
});
// Add audio track to connection
stream.getTracks().forEach(track => pc.addTrack(track, stream));
// Create and send offer
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
// → Send offer to other peer via your signaling server
// When you receive their answer:
await pc.setRemoteDescription(new RTCSessionDescription(answer));
Monitor call quality in real time
// Get audio stats
const stats = await pc.getStats();
stats.forEach(report => {
  if (report.type === 'inbound-rtp' && report.kind === 'audio') {
    console.log('Packets lost:', report.packetsLost);
    console.log('Jitter:', report.jitter);
  }
  // Round-trip time lives on remote-inbound-rtp reports, not inbound-rtp
  if (report.type === 'remote-inbound-rtp' && report.kind === 'audio') {
    console.log('Round trip time:', report.roundTripTime);
  }
});
Quick Summary
Both call types:
Voice → Digitize → Compress → Send in 20 ms chunks → Decode → Play
Without Internet (Normal Call):
Codec: AMR | Path: Telecom towers | Protocol: GSM/VoLTE | Stable + Guaranteed
With Internet (WhatsApp/WebRTC):
Codec: Opus | Path: Internet P2P | Protocol: RTP over UDP | Flexible + Encrypted
The biggest conceptual difference:
- Normal call = a dedicated pipe reserved just for you (like booking a private road)
- Internet call = many small packets racing through shared roads, reassembled on arrival
Further Reading
- WebRTC Official Docs
- RFC 3550 β RTP Specification
- Opus Codec
- How NAT Traversal Works (STUN/TURN/ICE)
- MDN β RTCPeerConnection
If this helped you understand what's actually happening under the hood when you make a call, drop a ❤️. And if you're building something with WebRTC, feel free to ask questions in the comments!
Tags: #webrtc #voip #networking #javascript #webdev #beginners
