<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: YukiOhira0416</title>
    <description>The latest articles on DEV Community by YukiOhira0416 (@yukiohira0416).</description>
    <link>https://dev.to/yukiohira0416</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3563884%2F04131ad5-8056-494f-94ea-e487be24882f.png</url>
      <title>DEV Community: YukiOhira0416</title>
      <link>https://dev.to/yukiohira0416</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yukiohira0416"/>
    <language>en</language>
    <item>
      <title>Under-60ms End-to-End RealTime Remote Desktop on Windows — NVENC/CUDA/FEC</title>
      <dc:creator>YukiOhira0416</dc:creator>
      <pubDate>Tue, 13 Jan 2026 14:10:22 +0000</pubDate>
      <link>https://dev.to/yukiohira0416/under-60ms-end-to-end-realtime-remote-desktop-on-windows-nvenccudafec-de7</link>
      <guid>https://dev.to/yukiohira0416/under-60ms-end-to-end-realtime-remote-desktop-on-windows-nvenccudafec-de7</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Claim:
&lt;/h3&gt;

&lt;p&gt;On a wired LAN with server and client on separate PCs, disabling all-intra (IDR every frame) and using low-latency IP (B=0) delivers p50 53 ms, p95 71 ms, p99 93 ms end-to-end.&lt;br&gt;
Most frames sit in the 50–80 ms band; exceedance is low (&amp;gt;80 ms 1.69%, &amp;gt;120 ms 0.67%). A rare ~1 s burst (max ~1.08 s) appeared but wasn’t perceptible.&lt;/p&gt;

&lt;h3&gt;
  
  
  How:
&lt;/h3&gt;

&lt;p&gt;D3D11 Capture → CUDA Resize → NVENC → UDP/FEC → NVDEC → D3D12 Render&lt;/p&gt;

&lt;h3&gt;
  
  
  Evidence:
&lt;/h3&gt;

&lt;p&gt;Clock-aligned frame-level CSV with percentiles/exceedance/1-s worst&lt;/p&gt;

&lt;h3&gt;
  
  
  Demo Link:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://youtu.be/eH123oOcDY8" rel="noopener noreferrer"&gt;Demo video&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it matters
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It feels in control. With p50 53 ms / p95 71 ms / p99 93 ms and most frames in 50–80 ms, common desktop actions (typing, cursor, window drags) stay within a sub-100 ms envelope. That preserves the user’s sense of immediacy on a wired LAN.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Predictability over peaks. Exceedance is low (&amp;gt;80 ms 1.69%, &amp;gt;120 ms 0.67%), and the rare ~1 s burst wasn’t perceptible. Day-to-day, that means fewer micro-stutters that break flow—even under mixed loads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Engineering trade-off that pays off. Dropping all-intra (IDR every frame) for low-latency IP (B=0) with a small VBV reduces encoder/network burstiness and queue buildup. You get tighter tails without sacrificing typical latency or quality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Headroom for real networks. A stable 50–80 ms typical path on LAN leaves budget for Wi-Fi/WAN jitter later, while keeping interactions natural. It’s a practical baseline for VDI, remote creation, and light game control.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Evidence you can trust. Results are backed by clock-aligned frame-level CSV (WGC→RenderEnd) with percentiles, exceedance, longest-streak, and 1-second worst-case—metrics that track human perception better than averages alone.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Testbed
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Server&lt;br&gt;
OS: Windows 11 Pro 24H2&lt;br&gt;
CPU: Ryzen 5 4500&lt;br&gt;
GPU: NVIDIA RTX 4070 12GB (Driver 32.0.15.8129)&lt;br&gt;
Memory: 32GB&lt;br&gt;
NIC: Realtek RTL8125 2.5GbE Controller (Driver 1125.21.903.2024)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Client&lt;br&gt;
OS: Windows 11 Home 24H2&lt;br&gt;
CPU: Intel Core i5-11400H&lt;br&gt;
GPU: NVIDIA GTX 1650 4GB (Driver 32.0.15.8129)&lt;br&gt;
iGPU: Intel UHD Graphics (Driver 30.0.101.1340)&lt;br&gt;
Memory: 16GB&lt;br&gt;
NIC: Realtek PCIe GbE Family Controller (Driver 10.72.524.2024)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What “E2E latency” means
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Definition (per frame)
&lt;/h3&gt;

&lt;p&gt;We measure E2E latency from server capture complete to client present just after rendering.&lt;br&gt;
&lt;code&gt;E2E_i = client_present_ts[i] - (server_capture_ts[i] + clock_offset_ms_at_i)&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;server_capture_ts[i]: the moment the captured frame becomes ready for encoding on the server (e.g., WGC/D3D11 frame acquired).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;client_present_ts[i]: the moment right after the swap/present returns on the client (render finished).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What it includes / excludes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Includes:&lt;/strong&gt; capture → convert → encode → transport → decode → render → present.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Excludes:&lt;/strong&gt; input device latency (keyboard/mouse), panel scan-out, display pixel response.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Clock alignment across hosts (simple NTP-style)
&lt;/h3&gt;

&lt;p&gt;Server and client run on different PCs, so we estimate a clock offset and correct timestamps.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Client sends ping at t0 (client clock)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Server receives at t1 (server clock), replies at t2 (server clock)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Client receives reply at t3 (client clock)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;RTT = (t3 - t0) - (t2 - t1)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;offset (server→client) ≈ ((t1 - t0) + (t2 - t3)) / 2&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We sample this periodically (e.g., every 0.5–2 s), keep the lowest-RTT samples, and use a median/low-pass to get clock_offset_ms_at_i.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On a wired LAN the residual error is typically small (a few ms), negligible for our reported scales.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
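
&lt;p&gt;The ping exchange above can be sketched in a few lines (the keep-ratio and tuple layout are illustrative assumptions, not the app’s exact code):&lt;/p&gt;

```python
from statistics import median

def offset_from_samples(samples):
    """Estimate the server-to-client clock offset from (t0, t1, t2, t3) pings.

    t0: client send, t1: server receive, t2: server reply, t3: client receive.
    Keeps the lowest-RTT quarter of samples and takes the median offset,
    mirroring the filtering described above.
    """
    scored = []
    for t0, t1, t2, t3 in samples:
        rtt = (t3 - t0) - (t2 - t1)           # round trip minus server hold time
        off = ((t1 - t0) + (t2 - t3)) / 2     # symmetric-delay offset estimate
        scored.append((rtt, off))
    scored.sort()                             # lowest RTT first
    keep = max(1, len(scored) // 4)           # keep the best 25%
    return median(off for _, off in scored[:keep])
```

&lt;p&gt;Filtering to low-RTT samples is what keeps the residual error small on a wired LAN.&lt;/p&gt;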

&lt;h3&gt;
  
  
  Data hygiene
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Exclude warm-up frames (e.g., first N frames) and any frames without both timestamps.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keep ≥20k frames when possible to stabilize tail stats.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Report: p50 / p95 / p99, exceedance rates (e.g., &amp;gt;80 ms, &amp;gt;120 ms), longest over-threshold streak, and worst 1-second window (count and mean).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why this definition
&lt;/h3&gt;

&lt;p&gt;It matches what users feel: the time it takes for a captured desktop change to actually appear on the remote screen, with host clocks aligned so we can measure it accurately across two PCs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measurement Methodology
&lt;/h2&gt;

&lt;p&gt;We do not redefine E2E here (see “What E2E latency means”). We simply take the clock-corrected per-frame E2E latencies already logged (e.g., WGC→RenderEnd in ms, paired by frame_id) and run straightforward statistics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Preprocessing: drop warm-up frames, rows with missing/mismatched frame_id, and any negative/invalid values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Statistics: p50 / p95 / p99 / p99.9, min / max / mean; exceedance rates (e.g., &amp;gt;80 ms, &amp;gt;120 ms); longest over-threshold streak; worst 1-second window (count, mean, and max within any 1 s window).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sampling: at least ~20k frames per run; we report the exact sample count alongside results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reporting: a short table of the metrics (plus optional histogram &amp;amp; exceedance plot in the appendix).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short: we compute descriptive statistics over the already clock-aligned E2E latency log.&lt;/p&gt;
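
&lt;p&gt;A minimal sketch of those statistics over a latency log (the rank-based percentile and the frame-count 1-second window are simplifying assumptions; the real analysis reads the CSV):&lt;/p&gt;

```python
def e2e_stats(lat_ms, fps=60, thresholds=(80.0, 120.0)):
    """Descriptive stats over a clock-aligned per-frame E2E latency log (ms)."""
    xs = sorted(lat_ms)
    n = len(xs)

    def pct(p):  # simple percentile by rank
        idx = min(n - 1, int(p / 100.0 * n))
        return xs[idx]

    # share of frames above each latency threshold
    exceed = {t: sum(1 for v in lat_ms if v > t) / n for t in thresholds}

    # longest consecutive run above the first threshold
    longest = run = 0
    for v in lat_ms:
        run = run + 1 if v > thresholds[0] else 0
        longest = max(longest, run)

    # worst 1-second window, approximated as fps consecutive frames
    win = min(fps, n)
    worst = max(sum(lat_ms[i:i + win]) / win for i in range(n - win + 1))

    return {"p50": pct(50), "p95": pct(95), "p99": pct(99),
            "exceed": exceed, "longest_streak": longest, "worst_1s_mean": worst}
```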

&lt;h2&gt;
  
  
  Pipeline Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;Two operating modes share the same transport and render path but select different capture sources based on workload characteristics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Normal mode: desktop/window capture optimized for general productivity, multi-window, and mixed-DWM scenarios.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Game mode: full-screen / flip-model / high-refresh scenarios where the game’s swapchain pacing dominates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fallback path: a CUDA-free conversion/encode path for environments where CUDA interop is unavailable or unstable.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Normal mode
&lt;/h4&gt;

&lt;p&gt;Capture — D3D11 + Windows Graphics Capture (WGC)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Chosen because WGC is a first-party, compositor-aware capture API with low overhead and good isolation (no injection/hooking).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It provides stable frame delivery under DWM, handles HiDPI / scaling / occlusion cleanly, and offers an event-driven frame pool that fits a low-latency loop.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Convert — CUDA interop (~4 ms)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Zero-copy interop: BGRA frames are mapped into CUDA and converted to YUV 4:4:4 in a single GPU pass.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The kernel is tuned for coalesced reads/writes, yielding ~4 ms end-to-end per 1080p frame on typical RTX-class GPUs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
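
&lt;p&gt;For reference, the per-pixel math behind a BGRA → YUV 4:4:4 conversion looks like this (full-range BT.709 is an assumption here; the post doesn’t state which matrix or range the CUDA kernel uses):&lt;/p&gt;

```python
def bgra_to_yuv444(b, g, r):
    """One pixel of BGRA -> YUV 4:4:4, full-range BT.709 (assumed matrix).

    The CUDA kernel applies the same arithmetic per pixel in a single pass;
    4:4:4 keeps every chroma sample, which is why text stays sharp.
    """
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b   # BT.709 luma weights
    u = (b - y) / 1.8556 + 128.0               # Cb scaled into 0..255
    v = (r - y) / 1.5748 + 128.0               # Cr scaled into 0..255
    clamp = lambda x: max(0, min(255, int(round(x))))
    return clamp(y), clamp(u), clamp(v)
```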

&lt;p&gt;Encode — NVENC (HEVC 4:4:4, Low-Latency, B-frames 0)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Input: YUV 4:4:4.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Settings: Intra QP 25, PQP 25, LowLatency profile, B-frames = 0 to avoid re-ordering delay and keep decoder output latency deterministic.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Game mode
&lt;/h4&gt;

&lt;p&gt;Capture — D3D11 + Desktop Duplication (DXGI Output Duplication)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Selected for game workloads where flip-model / exclusive-full-screen and VRR/high-Hz present patterns benefit from scan-out–aligned duplication.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Offers dirty-rects and refresh-locked cadence, improving predictability when the title drives the GPU hard or when overlays and compositor heuristics would otherwise interfere.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Convert — CUDA interop (~4 ms)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same CUDA path as normal mode: BGRA → YUV 4:4:4 in ~4 ms using shared resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Encode — NVENC (HEVC 4:4:4, Low-Latency, B-frames 0)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same low-latency configuration to keep encode jitter bounded under sustained GPU load.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Fallback (CUDA-free) path
&lt;/h4&gt;

&lt;p&gt;Used automatically when CUDA interop is unavailable or unstable.&lt;/p&gt;

&lt;h5&gt;
  
  
  Normal mode (fallback)
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Capture: D3D11 + WGC&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Convert: ComputeShader path (~4 ms), BGRA → B8G8R8A8 (device-local).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Encode: NVENC takes B8G8R8A8 and produces HEVC (Intra QP 25 / PQP 25 / LowLatency / B=0). Vendor conversion runs inside the encode path.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Game mode (fallback)
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Capture: D3D11 + Desktop Duplication&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Convert: ComputeShader (~4 ms), BGRA → B8G8R8A8&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Encode: NVENC → HEVC (Intra QP 25 / PQP 25 / LowLatency / B=0)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cross-cutting controls
&lt;/h4&gt;

&lt;p&gt;Dynamic QP nudging&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Runtime logic adjusts QP within a small band around the baseline (Intra/PQP 25) based on queue depth, exceedance rate (&amp;gt;80/120 ms), and short-term bitrate headroom.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Goal: trim tail latency without degrading the p50–p95 region.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
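
&lt;p&gt;A minimal sketch of such a controller (the inputs, step size, and ±3 band are assumptions; the post only states that QP moves within a small band around 25):&lt;/p&gt;

```python
def nudge_qp(qp, exceed_80, queue_depth, base=25, band=3):
    """Bounded QP adjustment: raise QP when the tail grows, relax toward
    the baseline when the pipeline is healthy. Thresholds are illustrative."""
    if exceed_80 > 0.02 or queue_depth > 2:
        qp = qp + 1          # cheaper frames help drain queues faster
    elif qp > base:
        qp = qp - 1          # recover quality once the tail settles
    # never drift outside base +/- band, so p50-p95 quality is preserved
    return max(base - band, min(base + band, qp))
```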

&lt;p&gt;Network — UDP + Adaptive FEC (Reed–Solomon)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Transport is UDP with selective retransmit disabled (latency-first).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adaptive RS parity tunes protection ratio from recent loss/RTT and reorder statistics; a small jitter buffer keeps playout bounded while preferring latest-frame wins under stress.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
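
&lt;p&gt;Sizing the parity can be as simple as covering the recent loss rate with some headroom (the headroom factor and cap below are assumptions; with k data shards and m parity shards, Reed–Solomon recovers the block from any k of the k+m packets):&lt;/p&gt;

```python
import math

def parity_count(data_shards, loss_rate, headroom=2.0, max_parity=16):
    """Pick a Reed-Solomon parity count for the next FEC block from the
    recently observed packet-loss rate. m parity shards tolerate up to m
    lost packets per block, so m is sized to expected loss plus headroom."""
    expected_lost = loss_rate * data_shards
    m = math.ceil(expected_lost * headroom)
    return max(1, min(max_parity, m))     # always send some parity, capped
```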

&lt;h4&gt;
  
  
  Client path
&lt;/h4&gt;

&lt;h5&gt;
  
  
  Decode &amp;amp; Present
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;NVDEC decodes HEVC 4:4:4 via interop into GPU memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;D3D12 Present (waitable swapchain) composites and presents immediately after decode; the timestamp after present is used for E2E accounting.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Notes on timing &amp;amp; budgets
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Both modes target a GPU-resident, zero-copy path from capture to present.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The conversion stage is held near ~4 ms per frame; encode is configured to avoid reorder queues (B=0) and minimize VBV accumulation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Under load, game mode prioritizes cadence predictability; normal mode prioritizes compositor friendliness and windowing hygiene.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pitfalls &amp;amp; fixes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CUDA × DirectX interop
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Symptom
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Sparse, inconsistent examples; many code snippets crash or stall under load.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Took months to reach stable zero-copy; occasional tearing or black frames when stressed.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cause
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU–GPU sharing needs exact ownership and sync: wrong fence/barrier scope, or mixing D3D11/12 semantics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hidden CPU round-trips (staging copies, implicit Map/Unmap) sneaking into the path.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Fix
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Keep frames GPU-resident end-to-end; no CPU readbacks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use explicit fences/barriers for each hop (WGC/DX → CUDA → NVENC), and verify resource state transitions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Standardize on a single interop path (e.g., D3D11 &amp;lt;-&amp;gt; CUDA or D3D12 &amp;lt;-&amp;gt; CUDA) and audit every transition with debug layers enabled.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  NVENC “traps” in graphics mode vs CUDA mode
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Symptom
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;D3D12 → (need D3D11 for NVENC) → format gymnastics; pipeline complexity and latency spikes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NV12 as a UAV not available on consumer RTX; attempts to write NV12 from compute led to dead ends.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Passing a single linear NV12 buffer from CUDA caused NVENC to reject/garble input.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cause
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;NVENC D3D mode expectations (resource types/flags) didn’t match the compute path.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Typed UAV for NV12 is not supported on consumer GPUs; direct UAV writes to NV12 aren’t viable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In CUDA mode, NVENC expects per-plane pointers + correct pitches, not an ad-hoc monolithic layout.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Fix
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Switch to NVENC CUDA mode to remove D3D12↔D3D11 impedance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Produce Y and UV planes separately in CUDA; set exact pitches/strides per plane; hand those to NVENC.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keep B-frames = 0, low-latency profile, and a small VBV to avoid reorder queues and buffer buildup.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
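
&lt;p&gt;The per-plane layout NVENC expects in CUDA mode can be sketched for the NV12 case that caused the trouble (the 256-byte pitch alignment is an illustrative assumption):&lt;/p&gt;

```python
def nv12_plane_layout(width, height, align=256):
    """Per-plane sizes and pitch for an NV12 frame. NVENC in CUDA mode
    wants a pointer + pitch per plane, not one ad-hoc linear blob."""
    def align_up(x):
        return (x + align - 1) // align * align
    pitch = align_up(width)            # both planes share the luma pitch
    y_size = pitch * height            # full-resolution luma plane
    uv_size = pitch * (height // 2)    # interleaved UV at half vertical res
    return {"pitch": pitch, "y_size": y_size, "uv_size": uv_size}
```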

&lt;h3&gt;
  
  
  Synchronization &amp;amp; stability of the frame pipeline
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Symptom
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Occasional jitter or micro-stalls despite low averages; bursts when load changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Feels smooth most of the time” but rare clumps raise p99.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cause
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Over- or under-deep pipelines (capture → convert → encode → send) causing queue dilation or starvation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Blocking calls in hot paths (sync logging, allocations, implicit flushes) and over-wide critical sections.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Fix
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Zero-CPU-wait design: move blocking work off the frame thread; async logging; pooled allocators.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tune queue depths to the minimum that avoids starvation; drop oldest frames under pressure (“latest-wins”).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Track per-stage enqueue/start/finish + queue length and tune the slowest stage first.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
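
&lt;p&gt;The “latest-wins” drop policy can be sketched with a bounded deque (the default depth of 3 is an illustrative choice, not the app’s tuned value):&lt;/p&gt;

```python
from collections import deque

class LatestWinsQueue:
    """Bounded frame queue: under pressure, evict the oldest frame rather
    than delay the newest, trading a dropped frame for lower latency."""
    def __init__(self, depth=3):
        self.q = deque(maxlen=depth)
        self.dropped = 0

    def push(self, frame):
        if len(self.q) == self.q.maxlen:
            self.dropped += 1          # deque(maxlen) evicts the oldest
        self.q.append(frame)

    def pop(self):
        return self.q.popleft() if self.q else None
```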

&lt;h3&gt;
  
  
  Text unreadability after NV12 path
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Symptom
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;After implementing NV12, small text and UI edges became hard to read; users reported blur/blocking.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cause
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;NV12 is 4:2:0 chroma subsampling; desktop content is chroma-sensitive (fine color edges, subpixel AA).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Chroma loss + scaling/present can amplify artifacts.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Fix
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Switch to YUV 4:4:4 even at a small latency/bitrate cost; prioritize readability for desktop use.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keep low-latency encode settings (HEVC 4:4:4, Intra QP 25 / PQP 25, B=0) to control tail latency.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Limitations &amp;amp; Next
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;p&gt;This build and evaluation target wired LAN only. The video pipeline is HEVC-only (4:4:4), and the client path relies on NVDEC/D3D12, so it currently requires an NVIDIA GPU on the client.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next
&lt;/h3&gt;

&lt;p&gt;We’ll add AV1 support, enable the client on Intel iGPU (Quick Sync) as a first non-NVIDIA target, and implement peer-to-peer (P2P) transport to bypass relays where possible and further reduce end-to-end latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links &amp;amp; Contact
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Code &amp;amp; profile
&lt;/h3&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/YukiOhira0416" rel="noopener noreferrer"&gt;https://github.com/YukiOhira0416&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Components:&lt;/p&gt;

&lt;p&gt;remote_server_capture — D3D11/WGC or Duplication → CUDA → NVENC&lt;/p&gt;

&lt;p&gt;remote_client — NVDEC (HEVC 4:4:4) → D3D12 present, frame telemetry&lt;/p&gt;

&lt;p&gt;remote_server_tasktray — daemon/control&lt;/p&gt;

&lt;h3&gt;
  
  
  Contact
&lt;/h3&gt;

&lt;p&gt;Email: xylish.hyper.cool [at] icloud [dot] com&lt;br&gt;
Please use subject: Hiring • Remote Desktop E2E •&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>networking</category>
      <category>performance</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Development of Real-Time Remote Desktop App</title>
      <dc:creator>YukiOhira0416</dc:creator>
      <pubDate>Tue, 14 Oct 2025 13:35:41 +0000</pubDate>
      <link>https://dev.to/yukiohira0416/development-of-real-time-remote-desktop-app-4eak</link>
      <guid>https://dev.to/yukiohira0416/development-of-real-time-remote-desktop-app-4eak</guid>
      <description>&lt;h2&gt;
  
  
  Achieving a Real-Time Gaming Remote Desktop
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;On Windows, I combined CUDA / DirectX / NVENC to build a home-grown remote desktop that’s fast enough for gaming (currently at the test-rig stage).&lt;br&gt;
Perceived end-to-end latency is ~40 ms for non-gaming use and ~50–60 ms while running games.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I Built It
&lt;/h2&gt;

&lt;p&gt;“I want to play PC games away from home—like at a café—without lugging around a heavy gaming laptop.”&lt;br&gt;
Gaming laptops are expensive, bulky, and come with theft risk. So the idea was to turn my home gaming PC into a server and play comfortably from a lightweight client elsewhere. Off-the-shelf remote desktops are generally too laggy for games—they’re built for general office use, not for fast-changing, high-motion content like games.&lt;br&gt;
That’s why I set out to build a remote desktop purpose-built for gaming from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Goals and Design Principles
&lt;/h2&gt;

&lt;p&gt;Latency goal: Use 100 ms as the threshold where humans start to feel something’s “off,” and aim for total latency under 50 ms (on LAN).&lt;/p&gt;

&lt;p&gt;Server (game machine): Target ~15 ms from capture → process → send. Keep work off the CPU; finish on the GPU.&lt;/p&gt;

&lt;p&gt;Client: Lightweight rendering that runs even on typical PCs / iGPUs. It’s NVIDIA-only for now, but I’m adapting it for Intel iGPU (Quick Sync / D3D).&lt;/p&gt;

&lt;p&gt;Premise: Leverage the high-end GPU in a gaming PC; prioritize real-time over “lightweight processing.”&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack &amp;amp; System Overview
&lt;/h2&gt;

&lt;p&gt;OS: Windows&lt;/p&gt;

&lt;p&gt;GPU compute: CUDA&lt;/p&gt;

&lt;p&gt;Display/Sharing: DirectX (mainly D3D11/12)&lt;/p&gt;

&lt;p&gt;Encoding: NVENC (using CUDA mode)&lt;/p&gt;

&lt;p&gt;Objective: Keep everything—capture → processing → compression—inside GPU memory to avoid CPU round-trips.&lt;/p&gt;

&lt;h2&gt;
  
  
  Processing Flow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Server
&lt;/h3&gt;

&lt;p&gt;Screen capture (GPU)&lt;/p&gt;

&lt;p&gt;CUDA image processing (color space conversion, scaling, etc.)&lt;/p&gt;

&lt;p&gt;Encode with NVENC in CUDA mode&lt;/p&gt;

&lt;p&gt;Network transmission&lt;/p&gt;

&lt;h3&gt;
  
  
  Client
&lt;/h3&gt;

&lt;p&gt;Decode + render (currently NVIDIA-focused, iGPU support in progress)&lt;/p&gt;

&lt;h2&gt;
  
  
  Server Architecture for Real-Time Performance
&lt;/h2&gt;

&lt;p&gt;GPU-only pipeline: From capture to NVENC submission, keep everything in GPU memory; minimize CPU copies to near zero.&lt;/p&gt;

&lt;p&gt;CUDA × DirectX interop: Hand off CUDA kernel results directly to DirectX without returning to the CPU.&lt;/p&gt;

&lt;p&gt;Synchronization design: No CPU blocking. Use queuing and timeline control so the GPU work queue stays smooth and ordered.&lt;/p&gt;

&lt;h3&gt;
  
  
  Analogy
&lt;/h3&gt;

&lt;p&gt;“CUDA = paint-mixing artisan, CPU = paint tube, DirectX = painter.”&lt;br&gt;
If the artisan has to stuff paint into a tube (CPU) every time before handing it to the painter, it’s slow.&lt;br&gt;
If they work from the same palette (GPU memory), the artisan mixes and the painter paints directly—that’s the value of interop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Client Side: Lightweight and Compatible
&lt;/h2&gt;

&lt;p&gt;Policy: Must run on laptops and iGPUs when out and about.&lt;/p&gt;

&lt;p&gt;Current status: Initially built on a single machine, so early dev was NVIDIA-assumed.&lt;/p&gt;

&lt;p&gt;In progress: Redesigning to support Intel iGPU for decode + rendering.&lt;/p&gt;

&lt;p&gt;Demonstration: Ran FF14 at 4K / max settings on the server (RTX 4070) and rendered successfully on the client (GTX 1650).&lt;br&gt;
A standalone GTX 1650 can’t realistically run FF14 at 4K max, but it’s feasible over remote rendering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pain Points During Development (Details)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1) CUDA × DirectX Interop
&lt;/h3&gt;

&lt;p&gt;Sparse references: Almost no production-grade samples in NVIDIA’s official examples, books, or the web.&lt;/p&gt;

&lt;p&gt;Implementation difficulty: I tried AI-assisted “vibe coding,” but many generated examples were wrong; it took ~3 months to reach a stable implementation.&lt;/p&gt;

&lt;p&gt;Crux: Don’t bounce through the CPU. Accurate sharing and synchronization in GPU memory (fences/barriers/ownership transitions) makes or breaks it.&lt;/p&gt;

&lt;h3&gt;
  
  
  2) NVENC “Traps” and Workarounds
&lt;/h3&gt;

&lt;p&gt;Initially I planned to use NVENC in DirectX (D3D) mode, but hit these walls:&lt;/p&gt;

&lt;p&gt;D3D12 → D3D11 bridging needed: With a CUDA → D3D12 → NVENC path, you often need to convert to D3D11 for NVENC submission.&lt;/p&gt;

&lt;p&gt;NV12 cannot be UAV: Making a standard NV12 resource a UAV (Unordered Access View) isn’t allowed on consumer RTX cards. This can force enterprise GPUs and thus made DirectX mode a non-starter for me.&lt;/p&gt;

&lt;p&gt;Switch to CUDA mode: I pivoted to running NVENC in CUDA mode.&lt;br&gt;
But CUDA mode had its own pitfall:&lt;/p&gt;

&lt;p&gt;“Write NV12 into a single linear buffer → NVENC rejects”:&lt;br&gt;
Writing NV12 directly into one linear buffer from a CUDA kernel led to NVENC refusing the input.&lt;/p&gt;

&lt;h4&gt;
  
  
  What worked:
&lt;/h4&gt;

&lt;p&gt;Generate Y plane and UV plane separately.&lt;/p&gt;

&lt;p&gt;Using the correct computed pitch, copy each plane into a linear GPU buffer.&lt;/p&gt;

&lt;p&gt;Submit those plane buffers to NVENC.&lt;br&gt;
→ Stable operation achieved. This wasn’t clearly documented; took ~2 months to figure out.&lt;/p&gt;
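
&lt;p&gt;The pitched copy can be illustrated in plain Python (a stand-in for the cudaMemcpy2D-style, row-by-row copy described above):&lt;/p&gt;

```python
def copy_plane_with_pitch(src_rows, width, pitch, fill=0):
    """Copy a tightly packed plane into a pitched linear buffer row by row.

    Each source row of `width` samples is padded out to `pitch` samples,
    which is why NVENC must be told the pitch, not just the width.
    """
    assert pitch >= width
    dst = []
    for row in src_rows:
        dst.extend(row[:width])
        dst.extend([fill] * (pitch - width))   # pad each row out to pitch
    return dst
```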

&lt;h3&gt;
  
  
  3) Synchronization and Stabilization
&lt;/h3&gt;

&lt;p&gt;Zero CPU-wait design: Avoid GPU queue stalls and unnecessary syncs—let everything that can progress keep progressing.&lt;/p&gt;

&lt;p&gt;Frame pipeline depth: Tune the capture → processing → encode → send stages so it’s neither too deep nor too shallow. Find the sweet spot between latency and stability.&lt;/p&gt;

&lt;p&gt;Result: Smooth perceived motion and robust, glitch-free rendering.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Standout Feature: Multi-Monitor
&lt;/h2&gt;

&lt;p&gt;Up to 4 monitors captured simultaneously, with instant switching.&lt;/p&gt;

&lt;p&gt;Great for gaming on one display while browsing/reading guides/using streaming tools on another—strong differentiation for gaming use.&lt;/p&gt;

&lt;p&gt;Off-the-shelf products have limited multi-monitor support, so this is very practical.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;The power of a “GPU-complete” pipeline: Eliminating CPU round-trips dramatically improves both latency and stability.&lt;/p&gt;

&lt;p&gt;Undocumented behaviors are real: The version that “just works in practice” doesn’t always match the official samples.&lt;/p&gt;

&lt;p&gt;AI is a great partner, but not gospel: You still need your own testing. Persistence is required to squash inaccuracies.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;Intel iGPU support: Broaden compatibility for decode &amp;amp; rendering across many laptops.&lt;/p&gt;

&lt;p&gt;Further latency cuts:&lt;/p&gt;

&lt;p&gt;Optimize capture path and memory layout&lt;/p&gt;

&lt;p&gt;Adjust frame pipeline depth&lt;/p&gt;

&lt;p&gt;Simplify render passes and push zero-copy further&lt;/p&gt;

&lt;p&gt;Network resilience: Refine VBR, jitter absorption, and flow control.&lt;/p&gt;

&lt;p&gt;User features: Quality/latency presets, hotkeys, streaming mode, recording, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Goal: Play comfortably from lightweight PCs outside the home.&lt;/p&gt;

&lt;p&gt;Method: Build a GPU-complete pipeline with CUDA × DirectX × NVENC.&lt;/p&gt;

&lt;p&gt;Result: ~40–60 ms perceived latency, 4-display capture with instant switching.&lt;/p&gt;

&lt;p&gt;Challenges: Interop and NVENC info is scarce, and specs can be tricky—but experimentation revealed the path through.&lt;/p&gt;

&lt;p&gt;Next: iGPU support and more latency reduction to move from “usable” into truly comfortable territory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test Environment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Server
&lt;/h3&gt;

&lt;p&gt;CPU: Ryzen 5 4500 (6 cores)&lt;/p&gt;

&lt;p&gt;Memory: 32 GB&lt;/p&gt;

&lt;p&gt;GPU: NVIDIA RTX 4070 12 GB&lt;/p&gt;

&lt;p&gt;SSD: 1 TB&lt;/p&gt;

&lt;h3&gt;
  
  
  Client
&lt;/h3&gt;

&lt;p&gt;CPU: Core i5-11400H&lt;/p&gt;

&lt;p&gt;Memory: 16 GB&lt;/p&gt;

&lt;p&gt;GPU: GTX 1650&lt;/p&gt;

&lt;p&gt;SSD: 500 GB&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo video of the app running Final Fantasy XIV (FFXIV) at high quality on the server.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://youtu.be/eH123oOcDY8" rel="noopener noreferrer"&gt;Demo video&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>productivity</category>
      <category>architecture</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
