Lalit Mishra

Posted on Feb 16

Trust No One: Implementing True End-to-End Encryption with Insertable Streams

#python #webrtc #security #encryption

The Broken Trust Model of Standard SFU Encryption

In the standard WebRTC architecture, "encryption" is a transport-level guarantee, not a payload-level guarantee. When we deploy a Selective Forwarding Unit (SFU) like Janus, Mediasoup, or LiveKit, we rely on DTLS-SRTP (Datagram Transport Layer Security - Secure Real-time Transport Protocol). While this protocol effectively secures the media against passive wiretapping on the public internet, it introduces a critical compromise in the trust model: the Privileged Decryption Point.

The topology operates on a hop-by-hop basis:

Client A encrypts media using a key negotiated via DTLS with the SFU.
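SFU terminates SRTP, so the media payload sits as plaintext in its memory while it rewrites sequence numbers or performs simulcast layer selection. (SRTP leaves the RTP header itself unencrypted; it is the payload that gets exposed.)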
SFU re-encrypts the media using a different key negotiated with Client B.
Client B decrypts the media.

In this model, the SFU possesses the keys to the kingdom. If the SFU infrastructure is compromised—whether by a malicious insider, a zero-day vulnerability in the media server, or a lawful interception warrant served to the cloud provider—the audio and video streams are exposed in cleartext.

True End-to-End Encryption (E2EE) in a routed topology necessitates a zero-trust approach to the infrastructure. The SFU must be demoted from a trusted participant to a blind packet courier. It should see routing metadata (sequence numbers, timestamps, SSRCs) to perform congestion control and forwarding, but the media payload itself must remain opaque blobs of high-entropy noise, decryptable only by the authenticated participants.

Insertable Streams and the Encoded Transform API

To achieve true E2EE without abandoning the scalability of SFUs (which would force a regression to unscalable Mesh topologies), browser vendors introduced Insertable Streams (officially the WebRTC Encoded Transform API).

This API exposes a manipulation hook deep within the WebRTC pipeline: after the encoder but before the packetizer.

Standard WebRTC pipeline:
Camera → Encoder → Packetizer (SRTP) → Network

Insertable Streams pipeline:
Camera → Encoder → **[TRANSFORM HOOK]** → Packetizer (SRTP) → Network

By intercepting the frame at this specific point, we operate on complete encoded frames (e.g., a full VP8 keyframe or an Opus audio frame) rather than fragmented RTP packets. This allows us to apply a secondary layer of encryption (payload encryption) that the SRTP layer subsequently wraps in its own transport encryption.

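In Chromium-based browsers, the original, non-standard form of the API exposes the stream via RTCRtpSender.createEncodedStreams(), which is only available when the RTCPeerConnection is created with encodedInsertableStreams: true. The standardized form of the same hook is RTCRtpScriptTransform. createEncodedStreams() returns a ReadableStream (source of encoded frames) and a WritableStream (destination for processed frames). By piping the readable stream through a TransformStream, we can rewrite each frame's payload before it reaches the packetizer.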

Implementation: Injecting the Transform

The following code demonstrates how to intercept the encoded stream on the sender side. Note that we immediately transfer control to a Web Worker to avoid blocking the main thread.

// main.js - Sender Side setup

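// createEncodedStreams() is only available if the connection opts in at construction time
const pc = new RTCPeerConnection({ ...config, encodedInsertableStreams: true });
const sender = pc.addTrack(track, stream);

// 1. Feature-detect the legacy encoded-streams API
if (sender.createEncodedStreams) {
  // Obtain the readable/writable pair of encoded frame streams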
  const streams = sender.createEncodedStreams();

  // 2. Initialize the Crypto Worker
  const worker = new Worker("crypto-worker.js");

  // 3. Define the encryption configuration (e.g., Key ID, algorithm)
  const meta = {
    operation: 'encrypt',
    participantId: 'user-1234',
    keyId: currentKeyId 
  };

  // 4. Transfer the streams to the worker
  // We use postMessage with transferables to zero-copy move the streams
  worker.postMessage({
    operation: 'encrypt',
    readable: streams.readable,
    writable: streams.writable,
    config: meta
  }, [streams.readable, streams.writable]);

} else {
  console.error("Insertable Streams not supported in this browser.");
}
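
The same hook can be reached through the standardized RTCRtpScriptTransform where the browser provides it. The sketch below prefers that path and falls back to createEncodedStreams(); the function name attachTransform and the message shape are assumptions made for this illustration, not part of any library.

```javascript
// Attach an encoded-frame transform to an RTCRtpSender or RTCRtpReceiver.
// `attachTransform` is a hypothetical helper written for this article.
function attachTransform(senderOrReceiver, worker, operation) {
  if (typeof globalThis.RTCRtpScriptTransform === "function") {
    // Standard path: the worker gets an `rtctransform` event whose
    // `transformer` exposes readable, writable, and these options.
    senderOrReceiver.transform =
      new globalThis.RTCRtpScriptTransform(worker, { operation });
    return "script-transform";
  }
  if (typeof senderOrReceiver.createEncodedStreams === "function") {
    // Legacy path: requires { encodedInsertableStreams: true } on the
    // RTCPeerConnection, otherwise this call throws.
    const { readable, writable } = senderOrReceiver.createEncodedStreams();
    worker.postMessage({ operation, readable, writable }, [readable, writable]);
    return "encoded-streams";
  }
  return "unsupported";
}
```

On the standard path the worker would listen with self.onrtctransform and read event.transformer.readable, event.transformer.writable, and event.transformer.options instead of handling a posted message. The same helper can be called with event.receiver inside pc.ontrack and operation set to 'decrypt'.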

The Crypto Worker – Implementing AES-GCM Frame Encryption

Performing cryptographic operations on video frames (30 to 60 times per second) on the main JavaScript thread is architecturally unsound. It competes with UI rendering and garbage collection, leading to visible jitter. We must offload this to a Web Worker.

We use AES-GCM (Galois/Counter Mode) via the SubtleCrypto API. GCM is preferred over CBC because it provides authenticated encryption (AEAD)—ensuring that the payload has not been tampered with—and requires no padding.
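
To make the AEAD property concrete, here is a small standalone sketch (not part of the worker) that encrypts a payload, flips a single ciphertext bit, and checks that decryption accepts the original but rejects the tampered copy. The function name demoTamperDetection is made up for this example.

```javascript
// Web Crypto is a global in browsers and workers; the require() fallback only
// matters on older Node.js releases.
const webCrypto = globalThis.crypto ?? require("node:crypto").webcrypto;

async function demoTamperDetection() {
  const key = await webCrypto.subtle.generateKey(
    { name: "AES-GCM", length: 128 }, false, ["encrypt", "decrypt"]);
  const iv = webCrypto.getRandomValues(new Uint8Array(12));
  const plaintext = new TextEncoder().encode("encoded frame bytes");

  const sealed = new Uint8Array(
    await webCrypto.subtle.encrypt({ name: "AES-GCM", iv }, key, plaintext));

  // No padding: the output is the plaintext length plus the 16-byte tag.
  const tagOverhead = sealed.byteLength - plaintext.byteLength;

  // Resolves to true when decryption succeeds, false when it rejects.
  const tryDecrypt = (bytes) =>
    webCrypto.subtle.decrypt({ name: "AES-GCM", iv }, key, bytes)
      .then(() => true, () => false);

  const intactAccepted = await tryDecrypt(sealed);

  const tampered = sealed.slice();
  tampered[0] ^= 0x01; // flip one bit of the ciphertext
  const tamperedAccepted = await tryDecrypt(tampered);

  return { tagOverhead, intactAccepted, tamperedAccepted };
}
```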

The Payload Structure

We cannot simply replace the frame with ciphertext. The receiver needs to know the Initialization Vector (IV) and potentially the Key ID to decrypt it. A common header format for the encrypted payload is shown below. The worker code in this article uses a simplified variant that omits the Key ID byte.

[ Unencrypted Header (1 byte: KeyID) ] [ IV (12 bytes) ] [ Ciphertext (N bytes) ] [ Auth Tag (16 bytes, appended to the ciphertext by AES-GCM) ]
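
As a rough sketch of how that layout could be serialized and parsed, assuming a 1-byte Key ID and treating the GCM output (ciphertext plus tag) as one blob; the helper names packFrame and unpackFrame are invented for this example:

```javascript
const KEY_ID_BYTES = 1;
const IV_BYTES = 12;
const TAG_BYTES = 16;

// Build [KeyID (1)] [IV (12)] [Ciphertext + Tag] as a single Uint8Array.
function packFrame(keyId, iv, ciphertextWithTag) {
  if (!Number.isInteger(keyId) || keyId < 0 || keyId > 255) {
    throw new RangeError("keyId must fit in one byte");
  }
  if (iv.byteLength !== IV_BYTES) {
    throw new RangeError("IV must be 12 bytes");
  }
  const sealed = new Uint8Array(ciphertextWithTag);
  const out = new Uint8Array(KEY_ID_BYTES + IV_BYTES + sealed.byteLength);
  out[0] = keyId;
  out.set(iv, KEY_ID_BYTES);
  out.set(sealed, KEY_ID_BYTES + IV_BYTES);
  return out;
}

// Split a received payload into its parts; returns null if it is too short
// to contain a header and an authentication tag.
function unpackFrame(payload) {
  const bytes = new Uint8Array(payload);
  if (bytes.byteLength < KEY_ID_BYTES + IV_BYTES + TAG_BYTES) return null;
  return {
    keyId: bytes[0],
    iv: bytes.subarray(KEY_ID_BYTES, KEY_ID_BYTES + IV_BYTES),
    ciphertextWithTag: bytes.subarray(KEY_ID_BYTES + IV_BYTES),
  };
}
```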

Worker Implementation

// crypto-worker.js

let currentKey = null; // CryptoKey object

self.onmessage = async (event) => {
  const { operation, readable, writable, keyBytes } = event.data;

  if (operation === 'setKey') {
    // Import raw key bytes into a usable CryptoKey
    currentKey = await crypto.subtle.importKey(
      "raw",
      keyBytes,
      { name: "AES-GCM", length: 128 },
      false,
      ["encrypt", "decrypt"]
    );
    return;
  }

  if (operation === 'encrypt' || operation === 'decrypt') {
    const transformStream = new TransformStream({
      transform: operation === 'encrypt' ? encryptFrame : decryptFrame
    });
    // Pipe: encoded source -> crypto transform -> packetizer (sender) or decoder (receiver)
    readable.pipeThrough(transformStream).pipeTo(writable);
  }
};

async function encryptFrame(chunk, controller) {
  if (!currentKey) {
    // Fail-open or drop? In E2EE, we usually drop or passthrough if handshake fails.
    // Here we drop to ensure security.
    return; 
  }

  // 1. Construct IV (12 bytes for GCM)
  // A random 96-bit IV keeps this example simple. In production, prefer a
  // counter-based IV (e.g., SSRC plus a frame counter) so that uniqueness
  // under a given key is guaranteed rather than probabilistic.
  const iv = new Uint8Array(12);
  crypto.getRandomValues(iv);

  // 2. Encrypt the payload
  // chunk.data is the raw encoded video/audio bytes
  const ciphertext = await crypto.subtle.encrypt(
    {
      name: "AES-GCM",
      iv: iv,
      additionalData: new Uint8Array([chunk.type === 'key' ? 1 : 0]) // Authenticate frame type
    },
    currentKey,
    chunk.data
  );

  // 3. Construct the new payload
  // Format: [IV (12)] + [Ciphertext (N + 16 tag)]
  const newData = new Uint8Array(12 + ciphertext.byteLength);
  newData.set(iv, 0);
  newData.set(new Uint8Array(ciphertext), 12);

  // 4. Reassign data to the chunk
  chunk.data = newData.buffer;

  // 5. Forward the chunk
  controller.enqueue(chunk);
}
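
The worker above uses a random IV and leaves the counter-based variant as a production note. One possible shape for it, sketched under the assumption that each sender has its own key and that the counter never restarts while that key is in use (the helper names are hypothetical):

```javascript
// Build a 96-bit AES-GCM IV as [SSRC (4 bytes)] [frame counter (8 bytes)],
// both big-endian. `counter` must be a BigInt.
function makeCounterIv(ssrc, counter) {
  const iv = new Uint8Array(12);
  const view = new DataView(iv.buffer);
  view.setUint32(0, ssrc >>> 0);
  view.setBigUint64(4, BigInt.asUintN(64, counter));
  return iv;
}

// Returns a function that yields a fresh IV on every call for one SSRC.
function createIvSequence(ssrc) {
  let counter = 0n;
  return () => makeCounterIv(ssrc, counter++);
}
```

In the worker, the SSRC could be read from chunk.getMetadata().synchronizationSource. If the page reloads, the counter restarts at zero, so the sender should also switch to a fresh key at that point to avoid reusing an IV under the old one.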

Key Management and Secure Distribution

The encryption logic is trivial compared to the complexity of key management. If you hardcode keys or fetch them from a server in cleartext, you have achieved nothing.

In a true E2EE system, the keys must be derived out-of-band or via a secure key exchange mechanism where the signaling server acts only as a relay. We typically use ECDH (Elliptic Curve Diffie-Hellman) to establish shared secrets between participants, or a simplified Per-Sender Key (Ratchet) model.

In the Per-Sender model (similar to MLS or SFrame concepts):

Alice generates a symmetric AES key (Content Key).
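Alice wraps this key for Bob using a wrapping key derived via ECDH from her Private Key and Bob's Public Key (and does the same for Carol, etc.).
Alice broadcasts the wrapped keys via signaling.
Bob derives the same wrapping key from his Private Key and Alice's Public Key, then unwraps the Content Key.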

ECDH Key Exchange Logic

// key-manager.js

// 1. Generate Identity Key Pair (ECDH)
async function generateIdentity() {
  return await crypto.subtle.generateKey(
    { name: "ECDH", namedCurve: "P-256" },
    true,
    ["deriveKey", "deriveBits"]
  );
}

// 2. Derive a shared wrapping key (Alice uses her Private Key and Bob's Public Key)
async function deriveSharedSecret(localPrivateKey, remotePublicKey) {
  return await crypto.subtle.deriveKey(
    { name: "ECDH", public: remotePublicKey },
    localPrivateKey,
    { name: "AES-GCM", length: 256 },
    false, // non-extractable: the derived key never needs to leave SubtleCrypto
    ["encrypt", "decrypt", "wrapKey", "unwrapKey"]
  );
}

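// 3. Rotation Logic
// When a participant leaves, we MUST rotate the room key/sender key
// so the departed participant cannot decrypt media sent after they left.
// (Rotating when someone joins likewise hides earlier media from the newcomer.)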
async function rotateSenderKey() {
  const newKey = await crypto.subtle.generateKey(
    { name: "AES-GCM", length: 128 },
    true,
    ["encrypt", "decrypt"]
  );
  // Distribute newKey wrapped with shared secrets...
  return newKey;
}

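This exchange happens over the WebSocket signaling channel. The signaling server sees these messages, but because each content key is wrapped under a key derived from the sender's and recipient's ECDH key pairs, neither the signaling server nor the SFU can access the underlying AES content keys. This only holds if participants verify each other's public keys (for example, by comparing fingerprints out-of-band); otherwise a malicious signaling server could substitute its own keys.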
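
The "distribute newKey wrapped with shared secrets" step is left open in key-manager.js. A minimal sketch of how it could look with SubtleCrypto's wrapKey and unwrapKey follows; the function names and the { iv, wrapped } message shape are assumptions for illustration.

```javascript
// Web Crypto is a global in browsers and workers; the require() fallback only
// matters on older Node.js releases.
const webCrypto = globalThis.crypto ?? require("node:crypto").webcrypto;
const { subtle } = webCrypto;

// Derive a non-extractable AES-GCM wrapping key from an ECDH agreement.
function deriveWrappingKey(localPrivateKey, remotePublicKey) {
  return subtle.deriveKey(
    { name: "ECDH", public: remotePublicKey },
    localPrivateKey,
    { name: "AES-GCM", length: 256 },
    false,
    ["wrapKey", "unwrapKey"]
  );
}

// Sender: wrap an extractable content key for one recipient.
// The returned object is what would travel over signaling.
async function wrapContentKey(contentKey, wrappingKey) {
  const iv = webCrypto.getRandomValues(new Uint8Array(12));
  const wrapped = await subtle.wrapKey(
    "raw", contentKey, wrappingKey, { name: "AES-GCM", iv });
  return { iv, wrapped: new Uint8Array(wrapped) };
}

// Recipient: recover the content key as a non-extractable CryptoKey.
function unwrapContentKey({ iv, wrapped }, wrappingKey) {
  return subtle.unwrapKey(
    "raw", wrapped, wrappingKey,
    { name: "AES-GCM", iv },
    { name: "AES-GCM", length: 128 },
    false,
    ["encrypt", "decrypt"]
  );
}
```

Alice would call wrapContentKey once per recipient, each time with the wrapping key derived for that recipient, and send each result only to its intended participant.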

Decryption Pipeline on Receiver Side

The receiver's pipeline is the inverse of the sender's. We call createEncodedStreams() on the RTCRtpReceiver, transfer the resulting streams to a worker, and let the worker parse the header, extract the IV, and decrypt the payload.

Crucially, decryption failures must not crash the connection. If a key is missing (e.g., during a rotation race condition), the frame should be dropped or buffered, but the stream must persist.
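
One way to soften the rotation race is to keep the previous key around for a short while and select the right one using the Key ID byte from the payload header described earlier. A minimal sketch, where the KeyRing class and its two-key default are assumptions for illustration:

```javascript
// Keeps the most recent `maxKeys` content keys, indexed by key ID, so frames
// sealed under the previous key still decrypt while a rotation is in flight.
class KeyRing {
  constructor(maxKeys = 2) {
    this.maxKeys = maxKeys;
    this.keys = new Map(); // keyId -> CryptoKey, oldest first
  }

  set(keyId, key) {
    this.keys.delete(keyId); // re-inserting moves the ID to the newest slot
    this.keys.set(keyId, key);
    while (this.keys.size > this.maxKeys) {
      const oldestId = this.keys.keys().next().value;
      this.keys.delete(oldestId);
    }
  }

  // Returns the key for this ID, or null if it is unknown or already evicted.
  get(keyId) {
    return this.keys.get(keyId) ?? null;
  }
}
```

A receiver would look up the key by the frame's Key ID and drop the frame when get() returns null, exactly as it does today when currentKey is missing.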

Receiver Worker Implementation

// crypto-worker.js (Receiver Logic)

async function decryptFrame(chunk, controller) {
  if (!currentKey) return; // Drop frame if no key

  const data = new Uint8Array(chunk.data);

  // 1. Parse Header
  // Assuming our simple format: [IV (12)] [Ciphertext]
  if (data.byteLength < 12) return; // Malformed

  const iv = data.slice(0, 12);
  const ciphertext = data.slice(12);

  try {
    // 2. Decrypt
    const plaintextBuffer = await crypto.subtle.decrypt(
      {
        name: "AES-GCM",
        iv: iv,
        additionalData: new Uint8Array([chunk.type === 'key' ? 1 : 0])
      },
      currentKey,
      ciphertext
    );

    // 3. Restore Payload
    chunk.data = plaintextBuffer;
    controller.enqueue(chunk);

  } catch (err) {
    console.warn("Decryption failed - wrong key or tampered frame?", err);
    // Do NOT enqueue the chunk. Dropping it prevents the decoder 
    // from choking on garbage data.
  }
}

Compatibility with SFUs (Janus, Mediasoup, LiveKit)

One might ask: "If the SFU cannot see the media, how does it route it?"

SFUs generally operate on the RTP header, not the payload.

Routing: The SFU reads the SSRC and RID (RTP Stream ID) to know who sent the packet and where it should go. These remain unencrypted (by us) but are still integrity-protected hop-by-hop by standard DTLS-SRTP.
Congestion Control: Transport-wide sequence numbers and timestamps are visible. The SFU can still estimate bandwidth usage.
Temporal Scalability: If the video encoder uses SVC (Scalable Video Coding) or Simulcast, the layer dependency information is sometimes in the payload descriptor.
- Warning: If the SFU relies on parsing the payload header (e.g., VP8 descriptor) to drop temporal layers, generic payload encryption will break this.
- Solution: We must use "Unencrypted header bytes" or specific dependency descriptor extensions (AV1 Dependency Descriptor) that sit outside the encrypted frame payload but inside the RTP packet, allowing the SFU to make smart dropping decisions without seeing the pixel data.
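
A rough sketch of the "unencrypted header bytes" idea: leave a short prefix of each frame in the clear, encrypt the rest, and pass the prefix as additionalData so it is still authenticated. The byte counts below are illustrative assumptions (similar to what the public WebRTC samples demo uses for VP8), and the function names are invented for this example. The IV is passed in separately to keep the sketch short.

```javascript
// Web Crypto is a global in browsers and workers; the require() fallback only
// matters on older Node.js releases.
const webCrypto = globalThis.crypto ?? require("node:crypto").webcrypto;

const TAG_BYTES = 16;
// Assumed clear-prefix sizes per frame type; tune these per codec.
const CLEAR_BYTES = { key: 10, delta: 3, audio: 1 };

// Encrypt everything after the first `clearLength` bytes. The clear prefix
// is authenticated (via additionalData) but not hidden.
async function sealWithClearPrefix(key, iv, frameBytes, clearLength) {
  const bytes = new Uint8Array(frameBytes);
  const n = Math.min(clearLength, bytes.byteLength);
  const prefix = bytes.slice(0, n);
  const sealed = new Uint8Array(await webCrypto.subtle.encrypt(
    { name: "AES-GCM", iv, additionalData: prefix }, key, bytes.subarray(n)));
  const out = new Uint8Array(n + sealed.byteLength);
  out.set(prefix, 0);
  out.set(sealed, n);
  return out;
}

// Inverse of sealWithClearPrefix. Rejects if prefix or ciphertext was altered.
async function openWithClearPrefix(key, iv, payload, clearLength) {
  const bytes = new Uint8Array(payload);
  // The original frame was TAG_BYTES shorter than the sealed payload.
  const n = Math.max(0, Math.min(clearLength, bytes.byteLength - TAG_BYTES));
  const prefix = bytes.slice(0, n);
  const plain = new Uint8Array(await webCrypto.subtle.decrypt(
    { name: "AES-GCM", iv, additionalData: prefix }, key, bytes.subarray(n)));
  const out = new Uint8Array(n + plain.byteLength);
  out.set(prefix, 0);
  out.set(plain, n);
  return out;
}
```

Whether a given SFU actually needs those clear bytes depends on the codec and on how the SFU detects keyframes and layers, so this should be validated against the specific media server.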

For basic routing, standard SFUs treat the encrypted payload as an opaque blob. They forward it blindly.

What breaks?

Transcoding: The SFU cannot convert VP8 to H.264.
Recording: The SFU can only record encrypted files. Playback requires the keys to be stored separately and securely.
Compositing: The SFU cannot merge audio streams or create video grids (MCU functionality).

Performance and Operational Trade-offs

Security always comes with a cost.

CPU Overhead: AES-GCM is hardware-accelerated on most modern CPUs (AES-NI), but the overhead of structured cloning data between the Main Thread and Web Worker is non-zero. On low-end mobile devices, this can introduce 2-5ms of latency per frame. At 60fps (16ms budget), this is significant.
Bandwidth: Appending a 12-byte IV and a 16-byte Authentication Tag (28 bytes total) to every audio frame (which might only be 100 bytes) introduces roughly 28% overhead for audio. For video, where frames typically run to several kilobytes, the overhead is negligible (<1%).
Start-up Latency: Users cannot see video until the key exchange completes. If the signaling for keys lags behind the media connection, the user sees a black screen.
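
The bandwidth figures are easy to check for your own frame sizes. A small helper, with a hypothetical name and an optional parameter for extra header bytes such as a Key ID:

```javascript
const IV_BYTES = 12;
const TAG_BYTES = 16;

// Per-frame framing overhead (IV + tag + any extra header bytes) expressed
// as a percentage of the original payload size.
function framingOverheadPercent(payloadBytes, extraHeaderBytes = 0) {
  if (payloadBytes <= 0) {
    throw new RangeError("payloadBytes must be positive");
  }
  return ((IV_BYTES + TAG_BYTES + extraHeaderBytes) * 100) / payloadBytes;
}
```

For a 100-byte audio frame, framingOverheadPercent(100) gives 28; for a 5000-byte video frame it gives 0.56.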

Operational Flagging:
Do not enable E2EE globally. Use a feature flag enableE2EE in the room configuration. Use it only when necessary (e.g., doctor-patient consults) and fall back to standard transport encryption for general use cases to save battery life.

Production Architecture Blueprint

To deploy this in production, your architecture must evolve:

Client (Browser):
- Initializes RTCPeerConnection.
- Spawns CryptoWorker.
- Generates Ephemeral Identity Key (ECDH).
- Publishes Public Key to Signaling.
Signaling Server (Node/Go/Rust):
- Acts as a Key Relay.
- Stores Public Keys in memory for the duration of the session.
- Crucially: Does NOT store any Symmetric Content Keys. It only relays encrypted blobs between clients.
SFU (Media Server):
- Configured for "Forwarding Only."
- Disable server-side audio mixing.
- Disable server-side recording (or accept that recordings are encrypted blobs).
Observability:
- You lose server-side quality analysis based on content (e.g., "black screen detection").
- You must rely heavily on client-side getStats() to report decode failures (which now indicate decryption failures).

Real-World Deployment Scenario

Consider a Telehealth platform compliant with HIPAA regulations.

The Workflow:

Dr. Smith and Patient Doe join Room Consult-99.
Signaling: They exchange public keys.
Key Derivation: Both clients derive a shared secret (or exchange rotated content keys).
Media Flow:
- Patient's camera captures video.
- Browser encodes to VP8.
- Worker encrypts VP8 frame with AES-GCM.
- Browser sends RTP packet to AWS-hosted SFU.
- SFU forwards packet to Doctor.
- Doctor's browser receives packet.
- Worker decrypts VP8 frame.
- Decoder renders video.
Threat Mitigation: An attacker gains root access to the AWS EC2 instance running the SFU. They run tcpdump, capture terabytes of traffic, and pull the SRTP keys out of the SFU's memory.
- Result: After stripping the SRTP layer, they are left with AES-GCM ciphertext: useless, high-entropy noise. They cannot view the consult. The content keys never touched the server disk or memory in plaintext.

Conclusion – The Cost of Zero Trust

Implementing Insertable Streams is a paradigm shift. You are voluntarily blinding your infrastructure to protect your users. You lose the ability to transcode, verify content quality on the server, and easily debug stream artifacts. In exchange, you gain cryptographic assurance that your platform is a neutral carrier whose media content is out of reach of data mining and surveillance, even though routing metadata stays visible to it.

For standard video chat, DTLS-SRTP is sufficient. For sensitive infrastructure—finance, health, legal, and defense—Insertable Streams are not optional; they are the only architecture that satisfies the definition of privacy.

DEV Community