TL;DR
Claim:
On a wired LAN with the server and client on separate PCs, dropping all-intra (IDR every frame) in favor of a low-latency IP GOP (B=0) delivers p50 53 ms, p95 71 ms, p99 93 ms end-to-end.
Most frames sit in the 50–80 ms band; exceedance is low (>80 ms 1.69%, >120 ms 0.67%). A rare ~1 s burst (max ~1.08 s) appeared but wasn’t perceptible.
How:
D3D11 Capture → CUDA Resize → NVENC → UDP/FEC → NVDEC → D3D12 Render
Evidence:
Clock-aligned frame-level CSV with percentiles/exceedance/1-s worst
Demo Link:
Why it matters
It feels in control. With p50 53 ms / p95 71 ms / p99 93 ms and most frames in 50–80 ms, common desktop actions (typing, cursor, window drags) stay within a sub-100 ms envelope. That preserves the user’s sense of immediacy on a wired LAN.
Predictability over peaks. Exceedance is low (>80 ms 1.69%, >120 ms 0.67%), and the rare ~1 s burst wasn’t perceptible. Day-to-day, that means fewer micro-stutters that break flow—even under mixed loads.
Engineering trade-off that pays off. Dropping all-intra (IDR every frame) for low-latency IP (B=0) with a small VBV reduces encoder/network burstiness and queue buildup. You get tighter tails without sacrificing typical latency or quality.
Headroom for real networks. A stable 50–80 ms typical path on LAN leaves budget for Wi-Fi/WAN jitter later, while keeping interactions natural. It’s a practical baseline for VDI, remote creation, and light game control.
Evidence you can trust. Results are backed by clock-aligned frame-level CSV (WGC→RenderEnd) with percentiles, exceedance, longest-streak, and 1-second worst-case—metrics that track human perception better than averages alone.
Testbed
Server
OS: Windows 11 Pro 24H2
CPU: Ryzen 5 4500
GPU: NVIDIA RTX 4070 12GB (Driver 32.0.15.8129)
Memory: 32GB
NIC: Realtek RTL8125 2.5GbE Controller (Driver 1125.21.903.2024)
Client
OS: Windows 11 Home 24H2
CPU: Intel Core i5-11400H
GPU: NVIDIA GTX 1650 4GB (Driver 32.0.15.8129)
Intel UHD Graphics (Driver 30.0.101.1340)
Memory: 16GB
NIC: Realtek PCIe GbE Family Controller (Driver 10.72.524.2024)
What “E2E latency” means
Definition (per frame)
We measure E2E latency per frame, from the moment the server finishes capturing a frame to the moment the client’s present call returns after rendering it.
E2E_i = client_present_ts[i] - (server_capture_ts[i] + clock_offset_ms_at_i)
server_capture_ts[i]: the moment the captured frame becomes ready for encoding on the server (e.g., WGC/D3D11 frame acquired).
client_present_ts[i]: the moment right after the swap/present returns on the client (render finished).
What it includes / excludes
Includes: capture → convert → encode → transport → decode → render → present.
Excludes: input device latency (keyboard/mouse), panel scan-out, display pixel response.
Clock alignment across hosts (simple NTP-style)
Server and client run on different PCs, so we estimate a clock offset and correct timestamps.
Client sends ping at t0 (client clock)
Server receives at t1 (server clock), replies at t2 (server clock)
Client receives reply at t3 (client clock)
RTT = (t3 - t0) - (t2 - t1)
offset (server→client) ≈ ((t1 - t0) + (t2 - t3)) / 2
We sample this periodically (e.g., every 0.5–2 s), keep the lowest-RTT samples, and use a median/low-pass to get clock_offset_ms_at_i.
On a wired LAN the residual error is typically small (a few ms), negligible for our reported scales.
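The sampling-and-filtering step above can be sketched in a few lines. This is a minimal illustration; the function name, the keep-lowest-quartile policy, and the example numbers are assumptions, not the shipping code:

```python
import statistics

def estimate_offset(samples):
    """Estimate the server->client clock offset from NTP-style ping samples.

    Each sample is (t0, t1, t2, t3): client send, server receive,
    server reply, client receive (ms). Low-RTT samples are the most
    trustworthy, so keep the best quartile and take the median offset.
    """
    scored = []
    for t0, t1, t2, t3 in samples:
        rtt = (t3 - t0) - (t2 - t1)
        offset = ((t1 - t0) + (t2 - t3)) / 2
        scored.append((rtt, offset))
    scored.sort()                               # lowest RTT first
    keep = scored[:max(1, len(scored) // 4)]    # best quartile
    return statistics.median(off for _, off in keep)

# Illustrative samples where the server clock runs ~10 ms ahead of the client.
samples = [(0.0, 10.5, 10.6, 1.2), (5.0, 15.4, 15.5, 6.0), (9.0, 19.6, 19.7, 10.1)]
offset_ms = estimate_offset(samples)
```

Re-running this periodically and low-pass filtering the result gives the clock_offset_ms_at_i used in the E2E formula.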
Data hygiene
Exclude warm-up frames (e.g., first N frames) and any frames without both timestamps.
Keep ≥20k frames when possible to stabilize tail stats.
Report: p50 / p95 / p99, exceedance rates (e.g., >80 ms, >120 ms), longest over-threshold streak, and worst 1-second window (count and mean).
Why this definition
It matches what users feel: the time it takes for a captured desktop change to actually appear on the remote screen, with host clocks aligned so we can measure it accurately across two PCs.
Measurement Methodology
We do not redefine E2E here (see What “E2E latency” means). We simply take the clock-corrected per-frame E2E latencies already logged (e.g., WGC→RenderEnd in ms, paired by frame_id) and run straightforward statistics:
Preprocessing: drop warm-up frames, rows with missing/mismatched frame_id, and any negative/invalid values.
Statistics: p50 / p95 / p99 / p99.9, min / max / mean; exceedance rates (e.g., >80 ms, >120 ms); longest over-threshold streak; worst 1-second window (count, mean, and max within any 1 s window).
Sampling: at least ~20k frames per run; we report the exact sample count alongside results.
Reporting: a short table of the metrics (plus optional histogram & exceedance plot in the appendix).
In short: we compute descriptive statistics over the already clock-aligned E2E latency log.
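The whole computation fits in a short script. Below is a minimal sketch (nearest-rank percentiles, and the 1-second window approximated as fps consecutive frames; names and defaults are illustrative, not the actual tooling):

```python
def summarize(lat_ms, thresholds=(80.0, 120.0), fps=60):
    """Descriptive statistics over a clock-aligned per-frame E2E latency log."""
    xs = sorted(lat_ms)
    n = len(xs)
    pct = lambda p: xs[min(n - 1, int(p * n / 100))]   # nearest-rank-ish
    stats = {
        "n": n, "min": xs[0], "max": xs[-1], "mean": sum(xs) / n,
        "p50": pct(50), "p95": pct(95), "p99": pct(99),
    }
    # Exceedance rates: fraction of frames above each threshold.
    for thr in thresholds:
        stats[f">{thr:g}ms"] = sum(x > thr for x in lat_ms) / n
    # Longest consecutive run above the first (soft) threshold.
    longest = run = 0
    for x in lat_ms:
        run = run + 1 if x > thresholds[0] else 0
        longest = max(longest, run)
    stats["longest_streak"] = longest
    # Worst sliding 1-second window, approximated as `fps` consecutive frames.
    w = min(fps, n)
    stats["worst_1s_mean"] = max(
        (sum(lat_ms[i:i + w]) / w for i in range(n - w + 1)), default=0.0
    )
    return stats
```

Feeding it the clock-aligned log yields exactly the table of metrics reported above.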
Pipeline Architecture
Overview
Two operating modes share the same transport and render path but select different capture sources based on workload characteristics.
Normal mode: desktop/window capture optimized for general productivity, multi-window, and mixed-DWM scenarios.
Game mode: full-screen / flip-model / high-refresh scenarios where the game’s swapchain pacing dominates.
Fallback path: a CUDA-free conversion/encode path for environments where CUDA interop is unavailable or unstable.
Normal mode
Capture — D3D11 + Windows Graphics Capture (WGC)
Chosen because WGC is a first-party, compositor-aware capture API with low overhead and good isolation (no injection/hooking).
It provides stable frame delivery under DWM, handles HiDPI / scaling / occlusion cleanly, and offers an event-driven frame pool that fits a low-latency loop.
Convert — CUDA interop (~4 ms)
Zero-copy interop: BGRA frames are mapped into CUDA and converted to YUV 4:4:4 in a single GPU pass.
The kernel is tuned for coalesced reads/writes, yielding ~4 ms end-to-end per 1080p frame on typical RTX-class GPUs.
Encode — NVENC (HEVC 4:4:4, Low-Latency, B-frames 0)
Input: YUV 4:4:4.
Settings: Intra QP 25, PQP 25, LowLatency profile, B-frames = 0 to avoid re-ordering delay and keep decoder output latency deterministic.
Game mode
Capture — D3D11 + Desktop Duplication (DXGI Output Duplication)
Selected for game workloads where flip-model / exclusive-full-screen and VRR/high-Hz present patterns benefit from scan-out–aligned duplication.
Offers dirty-rects and refresh-locked cadence, improving predictability when the title drives the GPU hard or when overlays and compositor heuristics would otherwise interfere.
Convert — CUDA interop (~4 ms)
Same CUDA path as normal mode: BGRA → YUV 4:4:4 in ~4 ms using shared resources.
Encode — NVENC (HEVC 4:4:4, Low-Latency, B-frames 0)
Same low-latency configuration to keep encode jitter bounded under sustained GPU load.
Fallback (CUDA-free) path
Used automatically when CUDA interop is unavailable or unstable.
Normal mode (fallback)
Capture: D3D11 + WGC
Convert: ComputeShader path (~4 ms), BGRA → B8G8R8A8 (device-local).
Encode: NVENC takes B8G8R8A8 and produces HEVC (Intra QP 25 / PQP 25 / LowLatency / B=0). Vendor conversion runs inside the encode path.
Game mode (fallback)
Capture: D3D11 + Desktop Duplication
Convert: ComputeShader (~4 ms), BGRA → B8G8R8A8
Encode: NVENC → HEVC (Intra QP 25 / PQP 25 / LowLatency / B=0)
Cross-cutting controls
Dynamic QP nudging
Runtime logic adjusts QP within a small band around the baseline (Intra/PQP 25) based on queue depth, exceedance rate (>80/120 ms), and short-term bitrate headroom.
Goal: trim tail latency without degrading the p50–p95 region.
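A controller of this shape can be sketched as follows. The thresholds, step sizes, and the ±4 band are illustrative assumptions, not the shipping values:

```python
def nudge_qp(qp, exceed_80, exceed_120, queue_depth, headroom,
             base=25, band=4):
    """Toy QP controller: nudge QP within [base-band, base+band].

    Raise QP (cheaper frames) when tails or queues grow; relax it back
    toward the baseline when the link has headroom and tails are quiet.
    """
    if exceed_120 > 0.005 or queue_depth > 3:
        qp += 2       # aggressive trim on hard-tail / deep-queue pressure
    elif exceed_80 > 0.02 or queue_depth > 1:
        qp += 1       # gentle trim on soft-tail pressure
    elif headroom > 0.25 and exceed_80 < 0.005:
        qp -= 1       # recover quality when safely under budget
    return max(base - band, min(base + band, qp))
```

The clamp keeps quality drift bounded, so the p50–p95 region stays near the baseline while the controller works only on the tail.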
Network — UDP + Adaptive FEC (Reed–Solomon)
Transport is UDP with selective retransmit disabled (latency-first).
Adaptive RS parity tunes protection ratio from recent loss/RTT and reorder statistics; a small jitter buffer keeps playout bounded while preferring latest-frame wins under stress.
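The parity-adaptation idea can be illustrated with a small sketch. The 2x safety margin, the reorder weighting, and the shard bounds are assumptions for illustration, not the tuned production constants:

```python
import math

def parity_shards(data_shards, loss_rate, reorder_rate, k_min=1, k_max=8):
    """Pick an RS parity count from recent loss/reorder statistics.

    Cover the observed loss with a safety margin; treat heavy reordering
    as extra effective loss, since packets arriving after the playout
    deadline are as good as lost in a latency-first pipeline.
    """
    effective = loss_rate + 0.5 * reorder_rate
    need = math.ceil(data_shards * effective * 2.0)   # 2x safety margin
    return max(k_min, min(k_max, need))
```

Clamping to k_max keeps worst-case bandwidth overhead bounded even when the loss estimate spikes.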
Client path
Decode & Present
NVDEC decodes HEVC 4:4:4 via interop into GPU memory.
D3D12 Present (waitable swapchain) composites and presents immediately after decode; the timestamp after present is used for E2E accounting.
Notes on timing & budgets
Both modes target a GPU-resident, zero-copy path from capture to present.
The conversion stage is held near ~4 ms per frame; encode is configured to avoid reorder queues (B=0) and minimize VBV accumulation.
Under load, game mode prioritizes cadence predictability; normal mode prioritizes compositor friendliness and windowing hygiene.
Pitfalls & fixes
CUDA × DirectX interop
Symptom
Sparse, inconsistent examples; many code snippets crash or stall under load.
Took months to reach stable zero-copy; occasional tearing or black frames when stressed.
Cause
GPU–GPU sharing needs exact ownership and sync: wrong fence/barrier scope, or mixing D3D11/12 semantics.
Hidden CPU round-trips (staging copies, implicit Map/Unmap) sneaking into the path.
Fix
Keep frames GPU-resident end-to-end; no CPU readbacks.
Use explicit fences/barriers for each hop (WGC/DX → CUDA → NVENC), and verify resource state transitions.
Standardize on a single interop path (e.g., D3D11 <-> CUDA or D3D12 <-> CUDA) and audit every transition with debug layers enabled.
NVENC “traps” in graphics mode vs CUDA mode
Symptom
Mixing D3D12 with NVENC’s D3D11-based graphics mode forced format gymnastics, adding pipeline complexity and latency spikes.
NV12 as a UAV is not available on consumer RTX GPUs; attempts to write NV12 from compute shaders led to dead ends.
Passing a single linear NV12 buffer from CUDA caused NVENC to reject/garble input.
Cause
NVENC D3D mode expectations (resource types/flags) didn’t match the compute path.
Typed UAV for NV12 is not supported on consumer GPUs; direct UAV writes to NV12 aren’t viable.
In CUDA mode, NVENC expects per-plane pointers + correct pitches, not an ad-hoc monolithic layout.
Fix
Switch to NVENC CUDA mode to remove D3D12↔D3D11 impedance.
Produce Y and UV planes separately in CUDA; set exact pitches/strides per plane; hand those to NVENC.
Keep B-frames = 0, low-latency profile, and a small VBV to avoid reorder queues and buffer buildup.
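The per-plane layout NVENC expects in CUDA mode can be sketched for the NV12 case that tripped us up. This is a simplified model assuming the UV plane directly follows the Y plane at the same pitch; real allocations (e.g., from cuMemAllocPitch) may place the planes differently:

```python
def nv12_planes(base_ptr, width, height, pitch):
    """Compute per-plane (pointer, pitch) pairs for a linear NV12 surface.

    NVENC's CUDA input wants the Y plane and the interleaved UV plane as
    separate pointers with explicit pitches: UV is half-height but has
    the same row pitch as Y (U and V bytes interleaved at half the
    horizontal resolution).
    """
    assert pitch >= width and height % 2 == 0
    y_plane = (base_ptr, pitch)
    uv_plane = (base_ptr + pitch * height, pitch)   # UV follows Y
    return y_plane, uv_plane
```

Handing NVENC a single monolithic pointer with no per-plane pitches is exactly the “ad-hoc layout” it rejects; the same pointer-plus-pitch discipline applies to the three planes of YUV 4:4:4.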
Synchronization & stability of the frame pipeline
Symptom
Occasional jitter or micro-stalls despite low averages; bursts when load changes.
“Feels smooth most of the time” but rare clumps raise p99.
Cause
Over- or under-deep pipelines (capture → convert → encode → send) causing queue dilation or starvation.
Blocking calls in hot paths (sync logging, allocations, implicit flushes) and over-wide critical sections.
Fix
Zero-CPU-wait design: move blocking work off the frame thread; async logging; pooled allocators.
Tune queue depths to the minimum that avoids starvation; drop oldest frames under pressure (“latest-wins”).
Track per-stage enqueue/start/finish + queue length and tune the slowest stage first.
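The “latest-wins” drop policy is easy to pin down in a few lines. A minimal sketch, independent of the actual queue implementation:

```python
from collections import deque

class LatestWinsQueue:
    """Bounded frame queue that evicts the oldest frame under pressure."""

    def __init__(self, depth=2):
        self.q = deque()
        self.depth = depth
        self.dropped = 0

    def push(self, frame):
        if len(self.q) >= self.depth:   # full: drop oldest, keep latest
            self.q.popleft()
            self.dropped += 1
        self.q.append(frame)

    def pop(self):
        return self.q.popleft() if self.q else None
```

Keeping depth at the minimum that avoids starvation is what prevents queue dilation from showing up as tail latency.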
Text unreadability after NV12 path
Symptom
After implementing NV12, small text and UI edges became hard to read; users reported blur/blocking.
Cause
NV12 is 4:2:0 chroma subsampling; desktop content is chroma-sensitive (fine color edges, subpixel AA).
Chroma loss + scaling/present can amplify artifacts.
Fix
Switch to YUV 4:4:4 even at a small latency/bitrate cost; prioritize readability for desktop use.
Keep low-latency encode settings (HEVC 4:4:4, Intra QP 25 / PQP 25, B=0) to control tail latency.
Limitations & Next
Limitations
This build and evaluation target wired LAN only. The video pipeline is HEVC-only (4:4:4), and the client path relies on NVDEC/D3D12, so it currently requires an NVIDIA GPU on the client.
Next
We’ll add AV1 support, enable the client on Intel iGPU (Quick Sync) as a first non-NVIDIA target, and implement peer-to-peer (P2P) transport to bypass relays where possible and further reduce end-to-end latency.
Links & Contact
Code & profile
GitHub: https://github.com/YukiOhira0416
Components:
remote_server_capture — D3D11/WGC or Duplication → CUDA → NVENC
remote_client — NVDEC(HEVC 4:4:4) → D3D12 present, frame telemetry
remote_server_tasktray — daemon/control
Contact
Email: xylish.hyper.cool [at] icloud [dot] com
Please use subject: Hiring • Remote Desktop E2E •