hariharan

Posted on Jun 15 • Originally published at github.com

Bridging RTSP to WebRTC on iOS: Feeding GStreamer Frames into a WebRTC Pipeline

#rtsp #ios #webrtc #gstreamer

Most WebRTC tutorials on iOS assume your video comes from the device camera. You call RTCCameraVideoCapturer, point it at the front lens, and the framework handles the rest. But a large class of real applications doesn't have a camera as the source at all — it has an IP camera, a dashcam, a drone feed, or some other device speaking RTSP, and the iOS app's job is to ingest that stream and republish it over WebRTC for low-latency delivery (often into a managed SFU like Amazon Kinesis Video Streams).

That problem turns out to be more interesting than it looks, because the two halves of the system have fundamentally different ideas about where pixels come from. RTSP gives you an encoded elementary stream pulled off the network. WebRTC's iOS SDK wants to own the capture path and hand you an encoder. Getting one to feed the other means stepping outside the comfortable parts of both frameworks and meeting in the middle — at the level of raw I420 frame buffers.

This article walks through a working architecture for exactly that bridge: RTSP ingestion and decoding via GStreamer, conversion to raw YUV, and injection into the WebRTC pipeline through a custom capturer. The signaling and ICE negotiation are handled against Kinesis Video Streams, but the core technique is provider-agnostic and applies to any libwebrtc-based stack.

Why you can't just point WebRTC at an RTSP URL

The WebRTC native SDK has no concept of a network video source. Its capture abstraction, RTCVideoCapturer, is designed around the assumption that frames originate locally and synchronously — a camera, the screen, a file. The RTCCameraVideoCapturer subclass that most apps use wires AVCaptureSession output straight into a video source. There is no built-in "open this RTSP URL" path, and there's a good reason for that: RTSP involves session setup, RTP depayloading, jitter buffering, and decoding, none of which belongs inside a real-time media engine whose entire design goal is to push locally captured frames out the door as fast as possible.

So the ingestion side has to be solved separately, and GStreamer is the pragmatic choice. It has mature RTSP support, hardware-accelerated decoders where available, and a clean way to extract decoded frames from the end of a pipeline via appsink. The architecture becomes:

RTSP camera ──▶ GStreamer (depay + decode + convert) ──▶ appsink
                                                            │
                                                     raw I420 frames
                                                            │
                                                            ▼
                                                 custom RTCVideoCapturer ──▶ WebRTC ──▶ SFU/peer

GStreamer handles everything up to and including "I now have a decoded YUV frame in memory." WebRTC handles everything from "here is a frame, encode and transmit it." The bridge between them is the part you have to build, and it's where most of the subtlety lives.

The GStreamer pipeline

The ingestion pipeline is constructed with gst_parse_launch, which lets you describe the whole chain as a string:

rtspsrc location=rtsp://<host>/<path> !
rtph264depay ! avdec_h264 ! videoconvert ! videoscale !
video/x-raw, format=I420, width=640, height=480 !
appsink name=mysink emit-signals=true sync=false max-buffers=1 drop=true

Reading left to right: rtspsrc handles RTSP session negotiation and produces RTP packets. rtph264depay extracts the H.264 elementary stream from RTP. avdec_h264 decodes it to raw frames. videoconvert and videoscale normalize the pixel format and dimensions, and the explicit video/x-raw, format=I420 caps filter forces the output into planar YUV 4:2:0 — which, not coincidentally, is exactly what WebRTC's RTCI420Buffer expects. Pinning the resolution at this stage (here 640×480) means the downstream copy logic can assume fixed plane sizes rather than re-reading caps on every frame.
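A minimal sketch of assembling that description string in C from a URL and a resolution, so the caps filter and the downstream plane arithmetic are driven by the same two numbers. The function name build_pipeline_description is hypothetical and not part of GStreamer; the string it produces is what would be handed to gst_parse_launch.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes a gst_parse_launch-style description into `out`.
 * Returns 0 on success, -1 on invalid arguments or if `out` is too small. */
static int build_pipeline_description(char *out, size_t out_size,
                                      const char *rtsp_url,
                                      int width, int height) {
    if (!out || !rtsp_url || width <= 0 || height <= 0) return -1;

    int n = snprintf(out, out_size,
        "rtspsrc location=%s ! "
        "rtph264depay ! avdec_h264 ! videoconvert ! videoscale ! "
        "video/x-raw, format=I420, width=%d, height=%d ! "
        "appsink name=mysink emit-signals=true sync=false max-buffers=1 drop=true",
        rtsp_url, width, height);

    /* snprintf reports the length it wanted; treat truncation as failure. */
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

Treating truncation as an error matters here: a description cut off mid-element would fail to parse, or worse, parse into a different pipeline. Note that rtspsrc also exposes a latency property (its jitter buffer size in milliseconds, which I believe defaults to 2000); this sketch leaves it untouched, as the pipeline above does.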

The appsink properties are where the real-time character of the system gets enforced, and they deserve individual attention because the defaults are wrong for this use case:

sync=false tells the sink not to throttle to the pipeline clock. For playback you want clock sync so video plays at the correct speed; for a live bridge you want frames out the instant they're decoded, because the WebRTC layer downstream is doing its own pacing.
max-buffers=1 caps the internal queue at a single frame.
drop=true says that when that one-frame buffer is full, throw away the oldest frame rather than blocking the pipeline.

Together, max-buffers=1 drop=true implement a "latest frame wins" policy. This is the single most important latency decision in the whole ingestion path. If you let frames queue up, any momentary stall downstream — a slow encode, a main-thread hiccup — turns into a growing backlog, and the stream drifts further and further behind real time with no way to recover. By dropping rather than queuing, you trade the occasional skipped frame for a hard ceiling on accumulated latency. For a live monitoring or dashcam scenario, that's exactly the right trade: nobody wants a perfectly smooth video that's eight seconds stale.
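To make the policy concrete, here is a small C model of a one-frame slot that overwrites instead of queuing. It is only an illustration of the behavior, not how appsink is implemented: the LatestSlot type and its functions are made up for this sketch, and locking is omitted, so a real producer/consumer pair on different threads would need a mutex around both calls.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-slot mailbox modelling max-buffers=1 drop=true. */
typedef struct {
    const uint8_t *frame;    /* newest unconsumed frame, or NULL if empty */
    uint64_t       seq;      /* sequence number of that frame */
    uint64_t       dropped;  /* frames overwritten before anyone took them */
} LatestSlot;

/* Producer side: never blocks; an unconsumed older frame is discarded. */
static void slot_put(LatestSlot *s, const uint8_t *frame, uint64_t seq) {
    if (s->frame != NULL) s->dropped++;
    s->frame = frame;
    s->seq = seq;
}

/* Consumer side: returns false when no new frame has arrived. */
static bool slot_take(LatestSlot *s, const uint8_t **frame, uint64_t *seq) {
    if (s->frame == NULL) return false;
    *frame = s->frame;
    *seq = s->seq;
    s->frame = NULL;
    return true;
}
```

If the consumer stalls while three frames arrive, it wakes up to the third frame and a drop count of two. The backlog can never exceed one frame, which is the hard ceiling on accumulated latency described above.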

Threading and the GLib main loop

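One non-obvious requirement: GStreamer's bus watches (and the bus signals built on them) are serviced by a GLib main loop, and running that loop blocks the calling thread. You cannot run it on the main thread of an iOS app. (The appsink's new-sample signal is different: it fires on GStreamer's own streaming thread, not from the main loop.) The pipeline is therefore set up and run inside a dispatched background block, with its own GMainContext pushed as the thread-default: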

context = g_main_context_new();
g_main_context_push_thread_default(context);
// ... build pipeline, attach bus watch ...
main_loop = g_main_loop_new(context, FALSE);
g_main_loop_run(main_loop);   // blocks here until quit

Creating a private GMainContext rather than using the default one matters in a mixed app: it keeps this pipeline's event sources isolated, so multiple pipelines (or other GLib-using code) don't fight over a shared context. Teardown then runs in the reverse order — quit the loop, pop and unref the context, set the pipeline to GST_STATE_NULL, and unref it.

A note on dynamic pads

rtspsrc doesn't expose its source pads until it has connected to the server and discovered the stream's contents, so you can't statically link it at construction time. If you build the pipeline element-by-element rather than with gst_parse_launch, you have to connect to its pad-added signal and do the linking in the callback:

static void on_rtsp_pad_added(GstElement *src, GstPad *new_pad, gpointer user_data) {
    GstElement *depay = GST_ELEMENT(user_data);
    GstPad *sink_pad = gst_element_get_static_pad(depay, "sink");
    if (!gst_pad_is_linked(sink_pad)) {
        gst_pad_link(new_pad, sink_pad);
    }
    gst_object_unref(sink_pad);
}

gst_parse_launch hides this for you — it handles the deferred linking internally — which is one reason the string form is worth preferring for a fixed pipeline. The manual approach is only worth the extra code when you need runtime control over individual elements.

The bridge: pulling frames out of GStreamer

Every decoded frame triggers the appsink's new-sample signal, wired to a C callback. The callback's job is to get the frame data out of GStreamer's memory and onto iOS's main queue for handoff to WebRTC, and the lifetime management here is where it's easy to introduce a crash or a subtle corruption bug.

static GstFlowReturn on_new_sample(GstAppSink *sink, gpointer user_data) {
    GStreamerBackend *self = (__bridge GStreamerBackend *)user_data;
    GstSample *sample = gst_app_sink_pull_sample(sink);
    if (!sample) return GST_FLOW_ERROR;

    GstBuffer *buffer = gst_sample_get_buffer(sample);
    GstMapInfo map;
    if (buffer && gst_buffer_map(buffer, &map, GST_MAP_READ)) {
        // Copy the frame out while the mapping is still valid.
        void *copiedData = malloc(map.size);
        if (copiedData) {
            memcpy(copiedData, map.data, map.size);
            // NSData adopts the malloc'd bytes (no second copy) and frees them
            // once the dispatched block has run and released it.
            NSData *frameData = [NSData dataWithBytesNoCopy:copiedData
                                                     length:map.size
                                               freeWhenDone:YES];
            dispatch_async(dispatch_get_main_queue(), ^{
                [self.delegate processRawNSDataFrame:frameData];
            });
        }
        gst_buffer_unmap(buffer, &map);
    }
    gst_sample_unref(sample);
    return GST_FLOW_OK;
}

The critical detail is the defensive copy. gst_app_sink_pull_sample returns a GstSample, and the GstBuffer inside it is only valid until you unref that sample; the mapped memory is only valid between gst_buffer_map and gst_buffer_unmap. But the handoff to WebRTC happens asynchronously on the main queue, and by the time that block runs, the sample is long gone. You cannot hand the mapped pointer across a dispatch_async boundary. So the frame bytes are memcpy'd into a heap allocation whose ownership travels with the dispatched block and which is freed once the frame has been handed off, while the GStreamer buffer is unmapped and unreffed synchronously before the callback returns.

This copy is the one unavoidable allocation per frame in the path. It's tempting to try to eliminate it, but the asynchronous boundary makes it necessary unless you redesign around a frame pool, and at 640×480 I420 the copy is 460,800 bytes (640 × 480 × 3/2, about 450 KiB), cheap relative to the decode that produced it. (If you do find this allocation showing up in profiling under high frame rates, a fixed-size ring of preallocated buffers is the standard answer, but measure before reaching for it.)
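For reference, a fixed-size ring of preallocated buffers might look roughly like this in C. Everything here is an assumption for illustration: the FramePool type, the POOL_SLOTS depth, and the function names are invented, and the sketch has no locking, so a version shared between the streaming thread and the main queue would need a mutex around acquire and release.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define POOL_SLOTS 4  /* hypothetical depth; pick it from profiling */

typedef struct {
    uint8_t *slots[POOL_SLOTS];
    int      in_use[POOL_SLOTS];
    size_t   frame_size;
    size_t   next;  /* index where the next search for a free slot starts */
} FramePool;

static void pool_destroy(FramePool *p) {
    for (size_t i = 0; i < POOL_SLOTS; i++) {
        free(p->slots[i]);
        p->slots[i] = NULL;
    }
}

/* Allocates every slot up front so the per-frame path never calls malloc. */
static int pool_init(FramePool *p, size_t frame_size) {
    memset(p, 0, sizeof *p);
    p->frame_size = frame_size;
    for (size_t i = 0; i < POOL_SLOTS; i++) {
        p->slots[i] = malloc(frame_size);
        if (!p->slots[i]) { pool_destroy(p); return -1; }
    }
    return 0;
}

/* Copies a frame into a free slot. Returns the slot index, or -1 when the
 * size is wrong or every slot is busy (the caller then drops the frame). */
static int pool_acquire_copy(FramePool *p, const uint8_t *src, size_t size) {
    if (size != p->frame_size) return -1;
    for (size_t n = 0; n < POOL_SLOTS; n++) {
        size_t i = (p->next + n) % POOL_SLOTS;
        if (!p->in_use[i]) {
            memcpy(p->slots[i], src, size);
            p->in_use[i] = 1;
            p->next = (i + 1) % POOL_SLOTS;
            return (int)i;
        }
    }
    return -1;
}

/* Called by the consumer once it has finished with the slot. */
static void pool_release(FramePool *p, int index) {
    if (index >= 0 && index < POOL_SLOTS) p->in_use[index] = 0;
}
```

Returning -1 when the pool is exhausted, rather than growing it, keeps the same "drop instead of queue" behavior as the appsink settings: a slow consumer costs frames, not latency.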

Crossing onto the main queue is a deliberate choice rather than an accident: it gives a single, predictable thread on which frames enter the WebRTC layer, which sidesteps a class of data races around the capturer's delegate. It does couple frame delivery to main-thread responsiveness, which is the cost side of that decision — if your UI thread stalls, frame injection stalls with it. For an app whose main job is displaying this video, that coupling is usually acceptable, but it's worth knowing it's there.

The injection side: a synthetic capturer

On the WebRTC side, the trick is to create a video source and a bare RTCVideoCapturer — not the camera subclass — and then feed it frames manually. The capturer normally has a real implementation behind it that produces frames; here it's essentially a handle whose only purpose is to carry the source's delegate.

private func createVideoTrack() -> RTCVideoTrack {
    let videoSource = WebRTCClient.factory.videoSource()
    videoSource.adaptOutputFormat(toWidth: 1280, height: 720, fps: 30)
    videoCapturer = RTCVideoCapturer(delegate: videoSource)
    return WebRTCClient.factory.videoTrack(with: videoSource, trackId: "KvsVideoTrack")
}

The video source is the capturer's delegate. That's the hook. Anything delivered to videoCapturer.delegate as an RTCVideoFrame enters the WebRTC encode-and-transmit path exactly as if a camera had produced it. So frame injection becomes a matter of building an RTCVideoFrame from the raw bytes and calling the delegate:

func processRawNSDataFrame(_ data: NSData) {
    guard let capturer = videoCapturer else { return }

    // Must match the resolution pinned by the GStreamer caps filter.
    let width = 640, height = 480
    let i420Buffer = RTCMutableI420Buffer(width: Int32(width), height: Int32(height))

    let base = data.bytes.assumingMemoryBound(to: UInt8.self)
    let ySize = width * height   // one luma byte per pixel
    let uSize = ySize / 4        // each chroma plane is a quarter of that

    memcpy(i420Buffer.mutableDataY, base, ySize)
    memcpy(i420Buffer.mutableDataU, base + ySize, uSize)
    memcpy(i420Buffer.mutableDataV, base + ySize + uSize, uSize)

    let timestampNs = Int64(Date().timeIntervalSince1970 * Double(NSEC_PER_SEC))
    let rtcFrame = RTCVideoFrame(buffer: i420Buffer, rotation: ._0, timeStampNs: timestampNs)
    capturer.delegate?.capturer(capturer, didCapture: rtcFrame)
}

The I420 memory layout is the thing to get right. A 4:2:0 planar frame stores all the luma (Y) samples first — one per pixel — followed by the two chroma planes (U then V) at quarter resolution, since chroma is subsampled both horizontally and vertically. So for a W×H frame, Y is W*H bytes, and U and V are each W*H/4 bytes. The three memcpy calls split the flat buffer GStreamer produced back into the three plane pointers WebRTC's RTCMutableI420Buffer exposes. Because the GStreamer caps filter already guaranteed I420 at this exact resolution, the offsets are fixed and known; this is the payoff for pinning the format upstream.

A guard on the incoming length is worth adding before any of the memcpys — a frame that's shorter than W*H*3/2 means something upstream changed format or truncated, and copying plane data out of a short buffer reads past the end:

guard data.length >= width * height * 3 / 2 else {
    print("Frame too small — unexpected format or truncation")
    return
}
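The plane arithmetic and the length guard can also be written as a small, framework-free C sketch, which makes the offsets easy to unit test. The I420Layout type and both function names are hypothetical. The sketch assumes a tightly packed buffer with even width and height; as far as I know GStreamer pads I420 rows to a multiple of 4 bytes, so that assumption holds at 640×480 but should be re-checked for other widths.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of a tightly packed I420 frame (no row padding). */
typedef struct {
    size_t y_size, u_size, v_size;  /* bytes in each plane */
    size_t u_offset, v_offset;      /* where U and V start in the flat buffer */
    size_t total;                   /* y_size + u_size + v_size */
} I420Layout;

static bool i420_layout(int width, int height, I420Layout *out) {
    if (width <= 0 || height <= 0 || width % 2 || height % 2) return false;
    size_t w = (size_t)width, h = (size_t)height;
    out->y_size   = w * h;              /* one luma sample per pixel */
    out->u_size   = (w / 2) * (h / 2);  /* chroma halved in both directions */
    out->v_size   = out->u_size;
    out->u_offset = out->y_size;
    out->v_offset = out->y_size + out->u_size;
    out->total    = out->y_size + out->u_size + out->v_size;
    return true;
}

/* Splits a flat I420 buffer into three destination planes.
 * Refuses short input instead of reading past the end of `src`. */
static bool i420_split(const uint8_t *src, size_t src_len, int width, int height,
                       uint8_t *dst_y, uint8_t *dst_u, uint8_t *dst_v) {
    I420Layout l;
    if (!i420_layout(width, height, &l) || src_len < l.total) return false;
    memcpy(dst_y, src, l.y_size);
    memcpy(dst_u, src + l.u_offset, l.u_size);
    memcpy(dst_v, src + l.v_offset, l.v_size);
    return true;
}
```

For 640×480 this gives a 307,200-byte Y plane, two 76,800-byte chroma planes, and a 460,800-byte total, matching the W*H*3/2 bound used in the guard.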

A word on timestamps

The frame timestamp here is derived from wall-clock time at the moment of injection (Date().timeIntervalSince1970). That's the simplest thing that works, and for a live bridge where every frame is "now" it's defensible. But it discards the timing information GStreamer actually had, the PTS on the original buffer, and uses arrival time as a proxy for capture time. If the ingestion path ever introduces variable delay, the resulting uneven stamps can confuse the encoder's rate control and the far end's jitter buffer, and wall-clock time is not even guaranteed to be monotonic, since the system clock can be adjusted underneath you. A more robust version propagates the GStreamer buffer PTS through the bridge and converts it into the WebRTC timebase. For a first working system, wall-clock is fine; it's a known place to tighten up later.
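One way the PTS conversion could work, sketched in C under some assumptions: GStreamer buffer timestamps are unsigned nanoseconds with an all-ones value meaning "no timestamp" (GST_CLOCK_TIME_NONE), and the WebRTC frame wants a signed nanosecond value. The PtsMapper type, the function name, and the idea of anchoring the first PTS to a local monotonic clock reading are my own illustration, not something the original code does.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTS_NONE UINT64_MAX  /* same bit pattern as GST_CLOCK_TIME_NONE */

/* Hypothetical state for rebasing stream PTS onto a local timebase. */
typedef struct {
    bool     anchored;      /* true once the first valid PTS has been seen */
    uint64_t first_pts_ns;  /* PTS of that first frame */
    int64_t  anchor_ns;     /* local clock reading taken at that moment */
} PtsMapper;

/* Returns a timestamp in nanoseconds for the frame.
 * `now_ns` should come from a monotonic clock, and is used only to anchor
 * the first frame or as a fallback when the buffer carries no PTS. */
static int64_t pts_to_capture_ns(PtsMapper *m, uint64_t pts_ns, int64_t now_ns) {
    if (pts_ns == PTS_NONE) return now_ns;
    if (!m->anchored) {
        m->anchored = true;
        m->first_pts_ns = pts_ns;
        m->anchor_ns = now_ns;
    }
    /* Frame spacing follows the stream's own PTS, not arrival jitter. */
    return m->anchor_ns + (int64_t)(pts_ns - m->first_pts_ns);
}
```

After the first frame, the output advances by exactly the PTS delta between frames regardless of when each frame happened to arrive, which is the property wall-clock stamping lacks. A production version would also need to handle PTS discontinuities, for example after a camera reconnect.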

Signaling and connection management

The media bridge is the novel part, but a working system still needs signaling and ICE, and a couple of details in that layer are worth calling out because they bite people.

The connection is established against Kinesis Video Streams, which means resolving a signaling channel ARN (creating the channel if it doesn't exist), fetching the WSS and HTTPS endpoints, signing the WebSocket URL with SigV4 using Cognito credentials, gathering ICE servers, and finally opening the signaling socket. The offer/answer and ICE candidate exchange then follow the standard WebRTC handshake over that socket.

The detail that catches people is ICE candidate ordering. Candidates can arrive over the signaling channel before the SDP offer/answer exchange has completed, that is, before there's a peer connection ready to accept them. Adding a candidate to a peer connection that hasn't had its remote description set yet is an error. The fix is a pending-candidate queue: if no peer connection exists for a given client ID yet, candidates get held in a per-client set; once the SDP exchange completes and the connection is established, the held candidates are drained into it. (A Set doesn't preserve arrival order, which is fine here because ICE candidates are independent of one another and can be added in any order.)

func checkAndAddIceCandidate(remoteCandidate: RTCIceCandidate, clientId: String) {
    if let peerConnection = peerConnectionFoundMap[clientId] {
        // Connection is up: add directly.
        peerConnection.add(remoteCandidate)
    } else {
        // SDP exchange not done yet: hold the candidate for this client.
        pendingIceCandidatesMap[clientId, default: []].insert(remoteCandidate)
    }
}

This is the same problem every WebRTC implementation eventually solves; it just shows up early here because the signaling and media paths run at different speeds. Keying the queue by client ID means the same logic handles a session with multiple remote peers without them stepping on each other.
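The snippet above shows the hold side; the drain side is the other half of the pattern. Here is a language-neutral model of both halves in C, for a single client. The PeerCandidates type, the MAX_PENDING cap, and the function names are all hypothetical, candidates are represented as plain strings, and unlike the Set in the Swift code this version happens to keep arrival order.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16  /* hypothetical per-client cap */

typedef enum { CAND_APPLY_NOW, CAND_HELD, CAND_REJECTED } CandidateAction;

/* Hypothetical per-client state; an app would keep one per client ID. */
typedef struct {
    bool        ready;                 /* remote description has been set */
    const char *pending[MAX_PENDING];  /* candidates held until ready */
    size_t      pending_count;
} PeerCandidates;

/* Decides what to do with a candidate that just arrived over signaling. */
static CandidateAction peer_offer_candidate(PeerCandidates *p, const char *cand) {
    if (p->ready) return CAND_APPLY_NOW;
    if (p->pending_count == MAX_PENDING) return CAND_REJECTED;
    p->pending[p->pending_count++] = cand;
    return CAND_HELD;
}

/* Call once the SDP exchange completes. Moves every held candidate into
 * `out` (which must have room for MAX_PENDING entries) and returns how many
 * there were; the caller then adds each one to the peer connection. */
static size_t peer_mark_ready(PeerCandidates *p, const char **out) {
    size_t n = p->pending_count;
    for (size_t i = 0; i < n; i++) out[i] = p->pending[i];
    p->pending_count = 0;
    p->ready = true;
    return n;
}
```

The important property is that marking the peer ready and emptying the queue happen together, so no candidate can be both held and applied, and none is left behind once the connection exists.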

The peer connection configuration is otherwise conventional, with a few choices worth noting: unifiedPlan semantics (the modern default and required for newer SDKs), gatherContinually so candidates keep flowing as network conditions change, maxBundle to multiplex all media over a single transport, and tcpCandidatePolicy = .enabled so the connection can fall back to TCP candidates when UDP is blocked — relevant for cameras or clients behind restrictive networks.

What this architecture buys you, and what it costs

The strength of this design is separation of concerns. GStreamer is genuinely good at network ingest and decode, with broad codec and transport support you'd spend months reimplementing. WebRTC is genuinely good at adaptive, low-latency, NAT-traversing real-time transport. By meeting them at the raw-frame boundary, each does what it's best at, and the bridge between them is small enough to reason about completely.

The costs are real and worth stating plainly. There's a full decode-then-re-encode cycle: GStreamer decodes H.264 to raw YUV, and WebRTC re-encodes that YUV (typically to VP8, VP9, or H.264) for transport. That's CPU and battery you wouldn't spend if you could forward the encoded stream directly, but forwarding encoded RTSP straight into WebRTC isn't generally possible, because WebRTC controls its own encoder for congestion-responsive bitrate adaptation. The per-frame copies (one out of GStreamer, three into the I420 planes) add memory-bandwidth and allocation pressure that matters at high resolutions or frame rates. And the main-queue handoff couples media flow to UI responsiveness.
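To put a rough number on the copy cost, the arithmetic is simple enough to sketch in C. The function names are hypothetical, and the copy count is my reading of the path described here: one full-frame copy out of GStreamer plus three plane copies that together add up to one more full frame, so two full-frame copies per frame. This counts bytes moved only; it says nothing about decode or encode cost.

```c
#include <stdint.h>

/* Bytes in one tightly packed I420 frame: W*H luma plus two W*H/4 chroma planes. */
static uint64_t i420_frame_bytes(uint32_t width, uint32_t height) {
    return (uint64_t)width * height * 3 / 2;
}

/* Bytes copied per second when each frame is copied `full_frame_copies` times. */
static uint64_t copy_bytes_per_second(uint32_t width, uint32_t height,
                                      uint32_t fps, uint32_t full_frame_copies) {
    return i420_frame_bytes(width, height) * fps * full_frame_copies;
}
```

At 640×480 and 30 fps with two copies per frame that works out to 27,648,000 bytes per second. The same path at 1920×1080 and 30 fps moves 186,624,000 bytes per second, which is where the copies stop being negligible.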

For the live-monitoring and dashcam-style use cases this targets — where the alternative is no low-latency path at all — those costs are the price of admission, and they're manageable. Where they'd start to hurt is high-resolution, high-frame-rate, multi-stream scenarios on battery-constrained hardware; that's the point at which you'd want to profile the copy and re-encode overhead seriously and consider a frame pool, PTS propagation, and moving the injection off the main queue.

Latency: where it actually comes from

A note on performance claims, since this is the question everyone asks. Glass-to-glass latency in this system is the sum of several stages: RTSP/RTP transport and jitter buffering, H.264 decode, the format conversion and copy, the main-queue hop, WebRTC encode, network transport to the far end, and finally far-end decode and render. The max-buffers=1 drop=true policy bounds the accumulated latency on the ingest side, and sync=false removes playback-clock delay, but the absolute number depends heavily on the camera's encoder GOP structure, network conditions, and the far-end client.

I'm not going to quote a figure I haven't measured on a fixed setup, because a latency number without a stated measurement method is noise. If you're reporting this, measure end-to-end with a known method — a millisecond clock or timestamped pattern visible in-frame, captured at both ends — and report the conditions alongside the number.
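If you do collect such readings, it helps to summarize them the same way every time. A minimal C sketch, assuming each sample is a pair of values read from one photo or screenshot: the reference clock as seen in front of the camera, and the same clock as shown on the far-end display at that instant. The LatencySummary type and function names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { int64_t min_ms, median_ms, max_ms; } LatencySummary;

static int cmp_i64(const void *a, const void *b) {
    int64_t x = *(const int64_t *)a, y = *(const int64_t *)b;
    return (x > y) - (x < y);
}

/* source_ms[i]: reference clock reading at the camera.
 * viewer_ms[i]: the clock value visible on the far-end display at that moment.
 * Their difference is one glass-to-glass latency sample.
 * Returns 0 on success, -1 if there are no samples or allocation fails. */
static int summarize_latency(const int64_t *source_ms, const int64_t *viewer_ms,
                             size_t n, LatencySummary *out) {
    if (n == 0) return -1;
    int64_t *d = malloc(n * sizeof *d);
    if (!d) return -1;
    for (size_t i = 0; i < n; i++) d[i] = source_ms[i] - viewer_ms[i];
    qsort(d, n, sizeof *d, cmp_i64);
    out->min_ms = d[0];
    out->max_ms = d[n - 1];
    out->median_ms = (n % 2) ? d[n / 2] : (d[n / 2 - 1] + d[n / 2]) / 2;
    free(d);
    return 0;
}
```

Reporting min, median, and max together, along with the sample count and the conditions, says far more than a single figure, because the spread is often as informative as the midpoint.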

Closing

The reusable insight here is the frame-level bridge pattern: when two media frameworks each insist on owning their end of the pipeline, you can often join them at the raw-buffer boundary by creating a bare RTCVideoCapturer and driving its delegate yourself. The same technique works for screen capture, generated/synthetic video, ML-processed frames, or any source WebRTC doesn't natively support — anywhere you can produce an RTCVideoFrame, you can inject into WebRTC. The RTSP case is just the most common reason to need it.

The parts that took the most care weren't the parts the tutorials cover. They were the buffer lifetime across the async boundary, the I420 plane arithmetic, the appsink queue policy, and the ICE candidate ordering — the unglamorous edges where two systems with different assumptions actually have to agree.
