“SDP made the rules, RTP plays the game.”
In the previous episode of SIP GAMES, we peeked inside the SDP offer that tells your opponent how you'd like to play: which codecs, which ports, and which IPs. But who actually carries the media?
🎮 Enter RTP — Real-time Transport Protocol.
🧳 What is RTP?
Think of RTP as the courier that carries your voice across the network — broken into little time-stamped, sequence-numbered packages.
- SIP sets up the call
- SDP describes the media setup
- RTP sends the actual media (voice/video)
RTP usually runs on top of UDP (User Datagram Protocol) because UDP adds no retransmission delay, and real-time audio can tolerate the occasional lost packet, just like a real conversation.
🧬 RTP Packet Structure
Here’s the basic layout of an RTP packet:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       Sequence Number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Synchronization Source (SSRC) Identifier            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       Contributing Source (CSRC) Identifiers (optional)       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Payload (audio/video)                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Let’s decode this header:
🔍 RTP Header Fields
Field | What It Means |
---|---|
V | Version (always 2) |
P | Padding (if extra bytes added) |
X | Extension header present |
CC | CSRC count (used in conferencing) |
M | Marker bit (e.g. start of a talkspurt) |
PT | Payload type (codec, e.g., 0 = PCMU, 96 = dynamic) |
Sequence Number | Increments by 1 per packet, used to detect loss |
Timestamp | Used for media playback timing |
SSRC | Sender’s unique ID |
CSRC | IDs of other contributing streams (optional) |
Payload | Actual audio or video data |
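To make the field table concrete, here is a minimal sketch of pulling those fields out of raw bytes with Python's `struct` module. The names `RtpHeader` and `parse_rtp_header` are made up for this post, and the sample bytes are hand-built, not taken from a real capture. It ignores header extensions and padding, so treat it as a starting point rather than a full parser.

```python
import struct
from typing import NamedTuple, Tuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int
    csrcs: Tuple[int, ...]


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the 12-byte fixed RTP header plus any CSRC entries."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Network byte order: two single bytes, 16-bit seq, 32-bit timestamp, 32-bit SSRC
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F
    if len(packet) < 12 + 4 * cc:
        raise ValueError("packet truncated inside the CSRC list")
    csrcs = struct.unpack(f"!{cc}I", packet[12:12 + 4 * cc])
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=cc,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
        csrcs=csrcs,
    )


# Hand-built sample: V=2, PT=0, seq=10567, timestamp=160000, SSRC=0x789ABC
sample = bytes.fromhex("8000294700027100" "00789abc") + b"\xff" * 160
header = parse_rtp_header(sample)
print(header)
```

Everything after the fixed header and CSRC list (here, `sample[12:]`) is the payload.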
🕒 Packetization Time (a.k.a. ptime)
What is packetization time?
It’s the duration of audio in each RTP packet, often advertised in SDP using a=ptime:20 (meaning 20 ms per packet).
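If you want to read that attribute programmatically, a rough sketch could look like this. The function name `ptime_from_sdp`, the sample SDP body, and the 20 ms fallback are all assumptions for illustration; real SDP can carry several media sections, each with its own ptime, which this sketch does not handle.

```python
def ptime_from_sdp(sdp: str, default_ms: float = 20.0) -> float:
    """Return the first a=ptime value found, or a fallback if none is present."""
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("a=ptime:"):
            return float(line.split(":", 1)[1])
    # Assumed fallback; the real default depends on the codec in use
    return default_ms


# Hypothetical SDP body for a single PCMU audio stream
example_sdp = (
    "v=0\r\n"
    "m=audio 49170 RTP/AVP 0\r\n"
    "a=rtpmap:0 PCMU/8000\r\n"
    "a=ptime:20\r\n"
)
print(ptime_from_sdp(example_sdp))
```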
Common values:
Codec | Typical ptime | Notes |
---|---|---|
PCMU | 20 ms | 50 packets/sec, 160-byte payload |
Opus | Variable (20 ms is common) | Frame sizes from 2.5 to 60 ms |
G.729 | 20 ms | 50 packets/sec, small 20-byte compressed payload |
🧮 Frequency of RTP Transmission
The number of RTP packets per second depends on the codec’s ptime.
Example:
- If ptime is 20ms, that’s 50 packets/second
- If it’s 30ms, ~33.3 packets/sec
- Higher ptime = fewer packets = less overhead
- Lower ptime = lower latency and less audio lost per dropped packet, but more packets
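The trade-off in that list is easy to put numbers on. The sketch below is a back-of-the-envelope calculator, not a measurement tool: the function names are invented here, and the 40-byte overhead assumes a plain IPv4 (20) + UDP (8) + RTP (12) header stack with no options, extensions, or link-layer framing.

```python
def packets_per_second(ptime_ms: float) -> float:
    """Packet rate implied by a given packetization time."""
    return 1000.0 / ptime_ms


def payload_bytes(codec_bitrate_bps: float, ptime_ms: float) -> float:
    """Bytes of encoded audio carried in one packet."""
    return codec_bitrate_bps / 8 * ptime_ms / 1000


def ip_bandwidth_bps(codec_bitrate_bps: float, ptime_ms: float,
                     header_bytes: int = 40) -> float:
    """On-the-wire bit rate at the IP layer, headers included (assumed 40 bytes)."""
    per_packet = payload_bytes(codec_bitrate_bps, ptime_ms) + header_bytes
    return per_packet * 8 * packets_per_second(ptime_ms)


# G.711 is a 64 kbit/s codec
for ptime in (10, 20, 30):
    print(ptime, "ms:",
          round(packets_per_second(ptime), 1), "pkt/s,",
          round(ip_bandwidth_bps(64_000, ptime)), "bit/s")
```

With G.711 at 20 ms this works out to 80 kbit/s at the IP layer, compared with 64 kbit/s of actual audio: the headers are a quarter of the payload size.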
🧠 Why Do I Care?
If you're implementing RTP or trying to debug call quality:
- Jitter? Check packet arrival times and timestamps
- Audio out of sync? Sequence or timestamp mismatch
- Silence or gaps? Packets lost or arriving too late
- Wrong codec? Check the Payload Type (PT) field
RTP is everywhere in VoIP — and understanding this header lets you trace, debug, and build your own media streamers.
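Two of the checks above, loss and jitter, can be sketched in a few lines. The function names and input shapes are assumptions made for this post. The loss counter handles 16-bit sequence wraparound but never "un-counts" a packet that shows up late, and the jitter estimate follows the running-average formula from RFC 3550 (J += (|D| - J) / 16) while ignoring 32-bit timestamp wraparound.

```python
from typing import Iterable, Tuple


def count_lost(seqs: Iterable[int]) -> int:
    """Count gaps in arrival-ordered sequence numbers (mod 2**16)."""
    lost = 0
    prev = None
    for s in seqs:
        if prev is None:
            prev = s
            continue
        gap = (s - prev) & 0xFFFF
        if 0 < gap < 0x8000:
            # Forward jump: anything beyond +1 means packets were skipped
            lost += gap - 1
            prev = s
        # gap == 0 (duplicate) or gap >= 0x8000 (old/reordered) is ignored
    return lost


def interarrival_jitter(packets: Iterable[Tuple[float, int]],
                        clock_rate: int = 8000) -> float:
    """Jitter in timestamp units from (arrival_seconds, rtp_timestamp) pairs."""
    jitter = 0.0
    prev_transit = None
    for arrival, ts in packets:
        # Relative transit time: arrival converted to RTP clock units minus timestamp
        transit = arrival * clock_rate - ts
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16
        prev_transit = transit
    return jitter


print(count_lost([10, 11, 13, 14]))
print(interarrival_jitter([(0.00, 0), (0.03, 160)]))
```

Divide the jitter value by the clock rate (8000 for G.711) to get seconds.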
🛠️ Example: A Real RTP Packet (with G.711)
Let’s say we're using G.711 with 20ms ptime.
- Payload Type: 0 (PCMU)
- Sequence Number: 10567
- Timestamp: 160000
- SSRC: 0x789ABC
- Payload: 160 bytes of G.711 data (8-bit PCM at 8000 Hz)
That’s 8000 samples/sec × 0.020 sec = 160 samples, and at 1 byte per sample that makes 160 bytes.
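Here is a small sketch that builds this example packet, plus the two that would follow it, so you can see the sequence number step by 1 and the timestamp step by 160. `build_rtp_packet` is a name invented for this post, and the payload is just a run of 0xFF bytes standing in for silence, not real encoded audio.

```python
import struct


def build_rtp_packet(payload_type: int, seq: int, timestamp: int,
                     ssrc: int, payload: bytes, marker: bool = False) -> bytes:
    """Build a minimal RTP packet: no padding, no extension, no CSRCs."""
    b0 = 2 << 6  # V=2, P=0, X=0, CC=0
    b1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", b0, b1,
                         seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    return header + payload


SAMPLE_RATE = 8000   # Hz, G.711
PTIME_MS = 20
samples_per_packet = SAMPLE_RATE * PTIME_MS // 1000  # 160 samples, 1 byte each

# Placeholder payload (assumption): 0xFF bytes standing in for mu-law silence
silence = b"\xff" * samples_per_packet

packets = [
    build_rtp_packet(0, 10567 + i, 160000 + i * samples_per_packet,
                     0x789ABC, silence)
    for i in range(3)
]

for p in packets:
    _, _, seq, ts, _ = struct.unpack("!BBHII", p[:12])
    print(f"seq={seq} ts={ts} bytes={len(p)}")
```

Each packet is 172 bytes: the 12-byte header plus 160 bytes of payload.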
🎮 TL;DR
- RTP carries media after SIP/SDP sets things up
- Each RTP packet has a header: version, PT, seq, timestamp, etc.
- Ptime defines how much media is in each packet
- Frequency of packets is based on ptime
- Use RTP headers to debug and analyze VoIP issues
📦 Up Next in SIP GAMES:
“Spy Tools for VoIP Agents” 🕵️
We’ll break down the best open-source tools like Wireshark, sipp, and rtpengine, and show you how to capture, simulate, and troubleshoot your VoIP calls like a pro.
Follow @sip_games to keep leveling up your VoIP game.