<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: YukiOhira0416</title>
    <description>The latest articles on DEV Community by YukiOhira0416 (@yukiohira0416).</description>
    <link>https://dev.to/yukiohira0416</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3563884%2F04131ad5-8056-494f-94ea-e487be24882f.png</url>
      <title>DEV Community: YukiOhira0416</title>
      <link>https://dev.to/yukiohira0416</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yukiohira0416"/>
    <language>en</language>
    <item>
      <title>Under-60ms End-to-End RealTime Remote Desktop on Windows — NVENC/CUDA/FEC</title>
      <dc:creator>YukiOhira0416</dc:creator>
      <pubDate>Tue, 13 Jan 2026 14:10:22 +0000</pubDate>
      <link>https://dev.to/yukiohira0416/under-60ms-end-to-end-realtime-remote-desktop-on-windows-nvenccudafec-de7</link>
      <guid>https://dev.to/yukiohira0416/under-60ms-end-to-end-realtime-remote-desktop-on-windows-nvenccudafec-de7</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Claim:
&lt;/h3&gt;

&lt;p&gt;On a wired LAN with server and client on separate PCs, disabling all-intra (IDR every frame) and using low-latency IP (B=0) delivers p50 53 ms, p95 71 ms, p99 93 ms end-to-end.&lt;br&gt;
Most frames sit in the 50–80 ms band; exceedance is low (&amp;gt;80 ms 1.69%, &amp;gt;120 ms 0.67%). A rare ~1 s burst (max ~1.08 s) appeared but wasn’t perceptible.&lt;/p&gt;

&lt;h3&gt;
  
  
  How:
&lt;/h3&gt;

&lt;p&gt;D3D11 Capture → CUDA Resize → NVENC → UDP/FEC → NVDEC → D3D12 Render&lt;/p&gt;

&lt;h3&gt;
  
  
  Evidence:
&lt;/h3&gt;

&lt;p&gt;Clock-aligned frame-level CSV with percentiles/exceedance/1-s worst&lt;/p&gt;

&lt;h3&gt;
  
  
  Demo Link:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://youtu.be/eH123oOcDY8" rel="noopener noreferrer"&gt;Demo video&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it matters
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It feels in control. With p50 53 ms / p95 71 ms / p99 93 ms and most frames in 50–80 ms, common desktop actions (typing, cursor, window drags) stay within a sub-100 ms envelope. That preserves the user’s sense of immediacy on a wired LAN.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Predictability over peaks. Exceedance is low (&amp;gt;80 ms 1.69%, &amp;gt;120 ms 0.67%), and the rare ~1 s burst wasn’t perceptible. Day-to-day, that means fewer micro-stutters that break flow—even under mixed loads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Engineering trade-off that pays off. Dropping all-intra (IDR every frame) for low-latency IP (B=0) with a small VBV reduces encoder/network burstiness and queue buildup. You get tighter tails without sacrificing typical latency or quality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Headroom for real networks. A stable 50–80 ms typical path on LAN leaves budget for Wi-Fi/WAN jitter later, while keeping interactions natural. It’s a practical baseline for VDI, remote creation, and light game control.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Evidence you can trust. Results are backed by clock-aligned frame-level CSV (WGC→RenderEnd) with percentiles, exceedance, longest-streak, and 1-second worst-case—metrics that track human perception better than averages alone.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Testbed
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Server&lt;br&gt;
OS: Windows 11 Pro 24H2&lt;br&gt;
CPU: Ryzen 5 4500&lt;br&gt;
GPU: NVIDIA RTX 4070 12GB (Driver 32.0.15.8129)&lt;br&gt;
Memory: 32GB&lt;br&gt;
NIC: Realtek RTL8125 2.5GbE Controller (Driver 1125.21.903.2024)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Client&lt;br&gt;
OS: Windows 11 Home 24H2&lt;br&gt;
CPU: Intel Core i5-11400H&lt;br&gt;
GPU: NVIDIA GTX 1650 4GB (Driver 32.0.15.8129)&lt;br&gt;
iGPU: Intel UHD Graphics (Driver 30.0.101.1340)&lt;br&gt;
Memory: 16GB&lt;br&gt;
NIC: Realtek PCIe GbE Family Controller (Driver 10.72.524.2024)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What “E2E latency” means
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Definition (per frame)
&lt;/h3&gt;

&lt;p&gt;We measure E2E latency from server capture complete to client present just after rendering.&lt;br&gt;
&lt;code&gt;E2E_i = client_present_ts[i] - (server_capture_ts[i] + clock_offset_ms_at_i)&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;server_capture_ts[i]: the moment the captured frame becomes ready for encoding on the server (e.g., WGC/D3D11 frame acquired).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;client_present_ts[i]: the moment right after the swap/present returns on the client (render finished).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What it includes / excludes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Includes:&lt;/strong&gt; capture → convert → encode → transport → decode → render → present.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Excludes:&lt;/strong&gt; input device latency (keyboard/mouse), panel scan-out, display pixel response.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Clock alignment across hosts (simple NTP-style)
&lt;/h3&gt;

&lt;p&gt;Server and client run on different PCs, so we estimate a clock offset and correct timestamps.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Client sends ping at t0 (client clock)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Server receives at t1 (server clock), replies at t2 (server clock)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Client receives reply at t3 (client clock)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;RTT = (t3 - t0) - (t2 - t1)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;offset (server→client) ≈ ((t1 - t0) + (t2 - t3)) / 2&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We sample this periodically (e.g., every 0.5–2 s), keep the lowest-RTT samples, and use a median/low-pass to get clock_offset_ms_at_i.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On a wired LAN the residual error is typically small (a few ms), negligible for our reported scales.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
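
&lt;p&gt;The ping exchange above can be sketched in a few lines (the keep-ratio and tuple layout are illustrative assumptions, not the app’s exact code):&lt;/p&gt;

```python
from statistics import median

def offset_from_samples(samples):
    """Estimate the server-to-client clock offset from (t0, t1, t2, t3) pings.

    t0: client send, t1: server receive, t2: server reply, t3: client receive.
    Keeps the lowest-RTT quarter of samples and takes the median offset,
    mirroring the filtering described above.
    """
    scored = []
    for t0, t1, t2, t3 in samples:
        rtt = (t3 - t0) - (t2 - t1)           # round trip minus server hold time
        off = ((t1 - t0) + (t2 - t3)) / 2     # symmetric-delay offset estimate
        scored.append((rtt, off))
    scored.sort()                             # lowest RTT first
    keep = max(1, len(scored) // 4)           # keep the best 25%
    return median(off for _, off in scored[:keep])
```

&lt;p&gt;Filtering to low-RTT samples is what keeps the residual error small on a wired LAN.&lt;/p&gt;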

&lt;h3&gt;
  
  
  Data hygiene
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Exclude warm-up frames (e.g., first N frames) and any frames without both timestamps.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keep ≥20k frames when possible to stabilize tail stats.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Report: p50 / p95 / p99, exceedance rates (e.g., &amp;gt;80 ms, &amp;gt;120 ms), longest over-threshold streak, and worst 1-second window (count and mean).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why this definition
&lt;/h3&gt;

&lt;p&gt;It matches what users feel: the time it takes for a captured desktop change to actually appear on the remote screen, with host clocks aligned so we can measure it accurately across two PCs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measurement Methodology
&lt;/h2&gt;

&lt;p&gt;We do not redefine E2E here (see “What E2E latency means”). We simply take the clock-corrected per-frame E2E latencies already logged (e.g., WGC→RenderEnd in ms, paired by frame_id) and run straightforward statistics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Preprocessing: drop warm-up frames, rows with missing/mismatched frame_id, and any negative/invalid values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Statistics: p50 / p95 / p99 / p99.9, min / max / mean; exceedance rates (e.g., &amp;gt;80 ms, &amp;gt;120 ms); longest over-threshold streak; worst 1-second window (count, mean, and max within any 1 s window).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sampling: at least ~20k frames per run; we report the exact sample count alongside results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reporting: a short table of the metrics (plus optional histogram &amp;amp; exceedance plot in the appendix).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short: we compute descriptive statistics over the already clock-aligned E2E latency log.&lt;/p&gt;
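
&lt;p&gt;A minimal sketch of those statistics over a latency log (the rank-based percentile and the frame-count 1-second window are simplifying assumptions; the real analysis reads the CSV):&lt;/p&gt;

```python
def e2e_stats(lat_ms, fps=60, thresholds=(80.0, 120.0)):
    """Descriptive stats over a clock-aligned per-frame E2E latency log (ms)."""
    xs = sorted(lat_ms)
    n = len(xs)

    def pct(p):  # simple percentile by rank
        idx = min(n - 1, int(p / 100.0 * n))
        return xs[idx]

    # share of frames above each latency threshold
    exceed = {t: sum(1 for v in lat_ms if v > t) / n for t in thresholds}

    # longest consecutive run above the first threshold
    longest = run = 0
    for v in lat_ms:
        run = run + 1 if v > thresholds[0] else 0
        longest = max(longest, run)

    # worst 1-second window, approximated as fps consecutive frames
    win = min(fps, n)
    worst = max(sum(lat_ms[i:i + win]) / win for i in range(n - win + 1))

    return {"p50": pct(50), "p95": pct(95), "p99": pct(99),
            "exceed": exceed, "longest_streak": longest, "worst_1s_mean": worst}
```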

&lt;h2&gt;
  
  
  Pipeline Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;Two operating modes share the same transport and render path but select different capture sources based on workload characteristics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Normal mode: desktop/window capture optimized for general productivity, multi-window, and mixed-DWM scenarios.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Game mode: full-screen / flip-model / high-refresh scenarios where the game’s swapchain pacing dominates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fallback path: a CUDA-free conversion/encode path for environments where CUDA interop is unavailable or unstable.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Normal mode
&lt;/h4&gt;

&lt;p&gt;Capture — D3D11 + Windows Graphics Capture (WGC)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Chosen because WGC is a first-party, compositor-aware capture API with low overhead and good isolation (no injection/hooking).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It provides stable frame delivery under DWM, handles HiDPI / scaling / occlusion cleanly, and offers an event-driven frame pool that fits a low-latency loop.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Convert — CUDA interop (~4 ms)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Zero-copy interop: BGRA frames are mapped into CUDA and converted to YUV 4:4:4 in a single GPU pass.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The kernel is tuned for coalesced reads/writes, yielding ~4 ms end-to-end per 1080p frame on typical RTX-class GPUs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
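
&lt;p&gt;For reference, the per-pixel math behind a BGRA → YUV 4:4:4 conversion looks like this (full-range BT.709 is an assumption here; the post doesn’t state which matrix or range the CUDA kernel uses):&lt;/p&gt;

```python
def bgra_to_yuv444(b, g, r):
    """One pixel of BGRA -> YUV 4:4:4, full-range BT.709 (assumed matrix).

    The CUDA kernel applies the same arithmetic per pixel in a single pass;
    4:4:4 keeps every chroma sample, which is why text stays sharp.
    """
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b   # BT.709 luma weights
    u = (b - y) / 1.8556 + 128.0               # Cb scaled into 0..255
    v = (r - y) / 1.5748 + 128.0               # Cr scaled into 0..255
    clamp = lambda x: max(0, min(255, int(round(x))))
    return clamp(y), clamp(u), clamp(v)
```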

&lt;p&gt;Encode — NVENC (HEVC 4:4:4, Low-Latency, B-frames 0)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Input: YUV 4:4:4.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Settings: Intra QP 25, PQP 25, LowLatency profile, B-frames = 0 to avoid re-ordering delay and keep decoder output latency deterministic.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Game mode
&lt;/h4&gt;

&lt;p&gt;Capture — D3D11 + Desktop Duplication (DXGI Output Duplication)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Selected for game workloads where flip-model / exclusive-full-screen and VRR/high-Hz present patterns benefit from scan-out–aligned duplication.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Offers dirty-rects and refresh-locked cadence, improving predictability when the title drives the GPU hard or when overlays and compositor heuristics would otherwise interfere.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Convert — CUDA interop (~4 ms)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same CUDA path as normal mode: BGRA → YUV 4:4:4 in ~4 ms using shared resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Encode — NVENC (HEVC 4:4:4, Low-Latency, B-frames 0)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same low-latency configuration to keep encode jitter bounded under sustained GPU load.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Fallback (CUDA-free) path
&lt;/h4&gt;

&lt;p&gt;Used automatically when CUDA interop is unavailable or unstable.&lt;/p&gt;

&lt;h5&gt;
  
  
  Normal mode (fallback)
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Capture: D3D11 + WGC&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Convert: ComputeShader path (~4 ms), BGRA → B8G8R8A8 (device-local).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Encode: NVENC takes B8G8R8A8 and produces HEVC (Intra QP 25 / PQP 25 / LowLatency / B=0). Vendor conversion runs inside the encode path.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Game mode (fallback)
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Capture: D3D11 + Desktop Duplication&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Convert: ComputeShader (~4 ms), BGRA → B8G8R8A8&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Encode: NVENC → HEVC (Intra QP 25 / PQP 25 / LowLatency / B=0)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cross-cutting controls
&lt;/h4&gt;

&lt;p&gt;Dynamic QP nudging&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Runtime logic adjusts QP within a small band around the baseline (Intra/PQP 25) based on queue depth, exceedance rate (&amp;gt;80/120 ms), and short-term bitrate headroom.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Goal: trim tail latency without degrading the p50–p95 region.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
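
&lt;p&gt;A minimal sketch of such a controller (the inputs, step size, and ±3 band are assumptions; the post only states that QP moves within a small band around 25):&lt;/p&gt;

```python
def nudge_qp(qp, exceed_80, queue_depth, base=25, band=3):
    """Bounded QP adjustment: raise QP when the tail grows, relax toward
    the baseline when the pipeline is healthy. Thresholds are illustrative."""
    if exceed_80 > 0.02 or queue_depth > 2:
        qp = qp + 1          # cheaper frames help drain queues faster
    elif qp > base:
        qp = qp - 1          # recover quality once the tail settles
    # never drift outside base +/- band, so p50-p95 quality is preserved
    return max(base - band, min(base + band, qp))
```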

&lt;p&gt;Network — UDP + Adaptive FEC (Reed–Solomon)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Transport is UDP with selective retransmit disabled (latency-first).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adaptive RS parity tunes protection ratio from recent loss/RTT and reorder statistics; a small jitter buffer keeps playout bounded while preferring latest-frame wins under stress.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
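
&lt;p&gt;Sizing the parity can be as simple as covering the recent loss rate with some headroom (the headroom factor and cap below are assumptions; with k data shards and m parity shards, Reed–Solomon recovers the block from any k of the k+m packets):&lt;/p&gt;

```python
import math

def parity_count(data_shards, loss_rate, headroom=2.0, max_parity=16):
    """Pick a Reed-Solomon parity count for the next FEC block from the
    recently observed packet-loss rate. m parity shards tolerate up to m
    lost packets per block, so m is sized to expected loss plus headroom."""
    expected_lost = loss_rate * data_shards
    m = math.ceil(expected_lost * headroom)
    return max(1, min(max_parity, m))     # always send some parity, capped
```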

&lt;h4&gt;
  
  
  Client path
&lt;/h4&gt;

&lt;h5&gt;
  
  
  Decode &amp;amp; Present
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;NVDEC decodes HEVC 4:4:4 via interop into GPU memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;D3D12 Present (waitable swapchain) composites and presents immediately after decode; the timestamp after present is used for E2E accounting.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Notes on timing &amp;amp; budgets
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Both modes target a GPU-resident, zero-copy path from capture to present.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The conversion stage is held near ~4 ms per frame; encode is configured to avoid reorder queues (B=0) and minimize VBV accumulation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Under load, game mode prioritizes cadence predictability; normal mode prioritizes compositor friendliness and windowing hygiene.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pitfalls &amp;amp; fixes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CUDA × DirectX interop
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Symptom
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Sparse, inconsistent examples; many code snippets crash or stall under load.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Took months to reach stable zero-copy; occasional tearing or black frames when stressed.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cause
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU–GPU sharing needs exact ownership and sync: wrong fence/barrier scope, or mixing D3D11/12 semantics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hidden CPU round-trips (staging copies, implicit Map/Unmap) sneaking into the path.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Fix
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Keep frames GPU-resident end-to-end; no CPU readbacks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use explicit fences/barriers for each hop (WGC/DX → CUDA → NVENC), and verify resource state transitions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Standardize on a single interop path (e.g., D3D11 &amp;lt;-&amp;gt; CUDA or D3D12 &amp;lt;-&amp;gt; CUDA) and audit every transition with debug layers enabled.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  NVENC “traps” in graphics mode vs CUDA mode
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Symptom
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;D3D12 → (need D3D11 for NVENC) → format gymnastics; pipeline complexity and latency spikes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NV12 as a UAV not available on consumer RTX; attempts to write NV12 from compute led to dead ends.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Passing a single linear NV12 buffer from CUDA caused NVENC to reject/garble input.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cause
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;NVENC D3D mode expectations (resource types/flags) didn’t match the compute path.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Typed UAV for NV12 is not supported on consumer GPUs; direct UAV writes to NV12 aren’t viable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In CUDA mode, NVENC expects per-plane pointers + correct pitches, not an ad-hoc monolithic layout.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Fix
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Switch to NVENC CUDA mode to remove D3D12↔D3D11 impedance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Produce Y and UV planes separately in CUDA; set exact pitches/strides per plane; hand those to NVENC.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keep B-frames = 0, low-latency profile, and a small VBV to avoid reorder queues and buffer buildup.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
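
&lt;p&gt;The per-plane layout NVENC expects in CUDA mode can be sketched for the NV12 case that caused the trouble (the 256-byte pitch alignment is an illustrative assumption):&lt;/p&gt;

```python
def nv12_plane_layout(width, height, align=256):
    """Per-plane sizes and pitch for an NV12 frame. NVENC in CUDA mode
    wants a pointer + pitch per plane, not one ad-hoc linear blob."""
    def align_up(x):
        return (x + align - 1) // align * align
    pitch = align_up(width)            # both planes share the luma pitch
    y_size = pitch * height            # full-resolution luma plane
    uv_size = pitch * (height // 2)    # interleaved UV at half vertical res
    return {"pitch": pitch, "y_size": y_size, "uv_size": uv_size}
```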

&lt;h3&gt;
  
  
  Synchronization &amp;amp; stability of the frame pipeline
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Symptom
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Occasional jitter or micro-stalls despite low averages; bursts when load changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Feels smooth most of the time” but rare clumps raise p99.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cause
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Over- or under-deep pipelines (capture → convert → encode → send) causing queue dilation or starvation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Blocking calls in hot paths (sync logging, allocations, implicit flushes) and over-wide critical sections.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Fix
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Zero-CPU-wait design: move blocking work off the frame thread; async logging; pooled allocators.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tune queue depths to the minimum that avoids starvation; drop oldest frames under pressure (“latest-wins”).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Track per-stage enqueue/start/finish + queue length and tune the slowest stage first.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
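
&lt;p&gt;The “latest-wins” drop policy can be sketched with a bounded deque (the default depth of 3 is an illustrative choice, not the app’s tuned value):&lt;/p&gt;

```python
from collections import deque

class LatestWinsQueue:
    """Bounded frame queue: under pressure, evict the oldest frame rather
    than delay the newest, trading a dropped frame for lower latency."""
    def __init__(self, depth=3):
        self.q = deque(maxlen=depth)
        self.dropped = 0

    def push(self, frame):
        if len(self.q) == self.q.maxlen:
            self.dropped += 1          # deque(maxlen) evicts the oldest
        self.q.append(frame)

    def pop(self):
        return self.q.popleft() if self.q else None
```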

&lt;h3&gt;
  
  
  Text unreadability after NV12 path
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Symptom
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;After implementing NV12, small text and UI edges became hard to read; users reported blur/blocking.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cause
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;NV12 is 4:2:0 chroma subsampling; desktop content is chroma-sensitive (fine color edges, subpixel AA).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Chroma loss + scaling/present can amplify artifacts.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Fix
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Switch to YUV 4:4:4 even at a small latency/bitrate cost; prioritize readability for desktop use.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keep low-latency encode settings (HEVC 4:4:4, Intra QP 25 / PQP 25, B=0) to control tail latency.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Limitations &amp;amp; Next
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;p&gt;This build and evaluation target wired LAN only. The video pipeline is HEVC-only (4:4:4), and the client path relies on NVDEC/D3D12, so it currently requires an NVIDIA GPU on the client.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next
&lt;/h3&gt;

&lt;p&gt;We’ll add AV1 support, enable the client on Intel iGPU (Quick Sync) as a first non-NVIDIA target, and implement peer-to-peer (P2P) transport to bypass relays where possible and further reduce end-to-end latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links &amp;amp; Contact
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Code &amp;amp; profile
&lt;/h3&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/YukiOhira0416" rel="noopener noreferrer"&gt;https://github.com/YukiOhira0416&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Components:&lt;/p&gt;

&lt;p&gt;remote_server_capture — D3D11/WGC or Duplication → CUDA → NVENC&lt;/p&gt;

&lt;p&gt;remote_client — NVDEC (HEVC 4:4:4) → D3D12 present, frame telemetry&lt;/p&gt;

&lt;p&gt;remote_server_tasktray — daemon/control&lt;/p&gt;

&lt;h3&gt;
  
  
  Contact
&lt;/h3&gt;

&lt;p&gt;Email: xylish.hyper.cool [at] icloud [dot] com&lt;br&gt;
Please use subject: Hiring • Remote Desktop E2E •&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>networking</category>
      <category>performance</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Development of Real-Time Remote Desktop App</title>
      <dc:creator>YukiOhira0416</dc:creator>
      <pubDate>Tue, 14 Oct 2025 13:35:41 +0000</pubDate>
      <link>https://dev.to/yukiohira0416/development-of-real-time-remote-desktop-app-4eak</link>
      <guid>https://dev.to/yukiohira0416/development-of-real-time-remote-desktop-app-4eak</guid>
      <description>&lt;h2&gt;
  
  
  Achieving a Real-Time Gaming Remote Desktop
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;On Windows, I combined CUDA / DirectX / NVENC to build a home-grown remote desktop that’s fast enough for gaming (currently at the test-rig stage).&lt;br&gt;
Perceived end-to-end latency is ~40 ms for non-gaming use and ~50–60 ms while running games.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I Built It
&lt;/h2&gt;

&lt;p&gt;“I want to play PC games away from home—like at a café—without lugging around a heavy gaming laptop.”&lt;br&gt;
Gaming laptops are expensive, bulky, and come with theft risk. So the idea was to turn my home gaming PC into a server and play comfortably from a lightweight client elsewhere. Off-the-shelf remote desktops are generally too laggy for games—they’re built for general office use, not for fast-changing, high-motion content like games.&lt;br&gt;
That’s why I set out to build a remote desktop purpose-built for gaming from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Goals and Design Principles
&lt;/h2&gt;

&lt;p&gt;Latency goal: Use 100 ms as the threshold where humans start to feel something’s “off,” and aim for total latency under 50 ms (on LAN).&lt;/p&gt;

&lt;p&gt;Server (game machine): Target ~15 ms from capture → process → send. Keep work off the CPU; finish on the GPU.&lt;/p&gt;

&lt;p&gt;Client: Lightweight rendering that runs even on typical PCs / iGPUs. It’s NVIDIA-only for now, but I’m adapting it for Intel iGPU (Quick Sync / D3D).&lt;/p&gt;

&lt;p&gt;Premise: Leverage the high-end GPU in a gaming PC; prioritize real-time over “lightweight processing.”&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack &amp;amp; System Overview
&lt;/h2&gt;

&lt;p&gt;OS: Windows&lt;/p&gt;

&lt;p&gt;GPU compute: CUDA&lt;/p&gt;

&lt;p&gt;Display/Sharing: DirectX (mainly D3D11/12)&lt;/p&gt;

&lt;p&gt;Encoding: NVENC (using CUDA mode)&lt;/p&gt;

&lt;p&gt;Objective: Keep everything—capture → processing → compression—inside GPU memory to avoid CPU round-trips.&lt;/p&gt;

&lt;h2&gt;
  
  
  Processing Flow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Server
&lt;/h3&gt;

&lt;p&gt;Screen capture (GPU)&lt;/p&gt;

&lt;p&gt;CUDA image processing (color space conversion, scaling, etc.)&lt;/p&gt;

&lt;p&gt;Encode with NVENC in CUDA mode&lt;/p&gt;

&lt;p&gt;Network transmission&lt;/p&gt;

&lt;h3&gt;
  
  
  Client
&lt;/h3&gt;

&lt;p&gt;Decode + render (currently NVIDIA-focused, iGPU support in progress)&lt;/p&gt;

&lt;h2&gt;
  
  
  Server Architecture for Real-Time Performance
&lt;/h2&gt;

&lt;p&gt;GPU-only pipeline: From capture to NVENC submission, keep everything in GPU memory; minimize CPU copies to near zero.&lt;/p&gt;

&lt;p&gt;CUDA × DirectX interop: Hand off CUDA kernel results directly to DirectX without returning to the CPU.&lt;/p&gt;

&lt;p&gt;Synchronization design: No CPU blocking. Use queuing and timeline control so the GPU work queue stays smooth and ordered.&lt;/p&gt;

&lt;h3&gt;
  
  
  Analogy
&lt;/h3&gt;

&lt;p&gt;“CUDA = paint-mixing artisan, CPU = paint tube, DirectX = painter.”&lt;br&gt;
If the artisan has to stuff paint into a tube (CPU) every time before handing it to the painter, it’s slow.&lt;br&gt;
If they work from the same palette (GPU memory), the artisan mixes and the painter paints directly—that’s the value of interop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Client Side: Lightweight and Compatible
&lt;/h2&gt;

&lt;p&gt;Policy: Must run on laptops and iGPUs when out and about.&lt;/p&gt;

&lt;p&gt;Current status: Initially built on a single machine, so early dev was NVIDIA-assumed.&lt;/p&gt;

&lt;p&gt;In progress: Redesigning to support Intel iGPU for decode + rendering.&lt;/p&gt;

&lt;p&gt;Demonstration: Ran FF14 at 4K / max settings on the server (RTX 4070) and rendered successfully on the client (GTX 1650).&lt;br&gt;
A standalone GTX 1650 can’t realistically run FF14 at 4K max, but it’s feasible over remote rendering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pain Points During Development (Details)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1) CUDA × DirectX Interop
&lt;/h3&gt;

&lt;p&gt;Sparse references: Almost no production-grade samples in NVIDIA’s official examples, books, or the web.&lt;/p&gt;

&lt;p&gt;Implementation difficulty: I tried AI-assisted “vibe coding,” but many generated examples were wrong; it took ~3 months to reach a stable implementation.&lt;/p&gt;

&lt;p&gt;Crux: Don’t bounce through the CPU. Accurate sharing and synchronization in GPU memory (fences/barriers/ownership transitions) makes or breaks it.&lt;/p&gt;

&lt;h3&gt;
  
  
  2) NVENC “Traps” and Workarounds
&lt;/h3&gt;

&lt;p&gt;Initially I planned to use NVENC in DirectX (D3D) mode, but hit these walls:&lt;/p&gt;

&lt;p&gt;D3D12 → D3D11 bridging needed: With a CUDA → D3D12 → NVENC path, you often need to convert to D3D11 for NVENC submission.&lt;/p&gt;

&lt;p&gt;NV12 cannot be UAV: Making a standard NV12 resource a UAV (Unordered Access View) isn’t allowed on consumer RTX cards. This can force enterprise GPUs and thus made DirectX mode a non-starter for me.&lt;/p&gt;

&lt;p&gt;Switch to CUDA mode: I pivoted to running NVENC in CUDA mode.&lt;br&gt;
But CUDA mode had its own pitfall:&lt;/p&gt;

&lt;p&gt;“Write NV12 into a single linear buffer → NVENC rejects”:&lt;br&gt;
Writing NV12 directly into one linear buffer from a CUDA kernel led to NVENC refusing the input.&lt;/p&gt;

&lt;h4&gt;
  
  
  What worked:
&lt;/h4&gt;

&lt;p&gt;Generate Y plane and UV plane separately.&lt;/p&gt;

&lt;p&gt;Using the correct computed pitch, copy each plane into a linear GPU buffer.&lt;/p&gt;

&lt;p&gt;Submit those plane buffers to NVENC.&lt;br&gt;
→ Stable operation achieved. This wasn’t clearly documented; took ~2 months to figure out.&lt;/p&gt;
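
&lt;p&gt;The pitched copy can be illustrated in plain Python (a stand-in for the cudaMemcpy2D-style, row-by-row copy described above):&lt;/p&gt;

```python
def copy_plane_with_pitch(src_rows, width, pitch, fill=0):
    """Copy a tightly packed plane into a pitched linear buffer row by row.

    Each source row of `width` samples is padded out to `pitch` samples,
    which is why NVENC must be told the pitch, not just the width.
    """
    assert pitch >= width
    dst = []
    for row in src_rows:
        dst.extend(row[:width])
        dst.extend([fill] * (pitch - width))   # pad each row out to pitch
    return dst
```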

&lt;h3&gt;
  
  
  3) Synchronization and Stabilization
&lt;/h3&gt;

&lt;p&gt;Zero CPU-wait design: Avoid GPU queue stalls and unnecessary syncs—let everything that can progress keep progressing.&lt;/p&gt;

&lt;p&gt;Frame pipeline depth: Tune the capture → processing → encode → send stages so it’s neither too deep nor too shallow. Find the sweet spot between latency and stability.&lt;/p&gt;

&lt;p&gt;Result: Smooth perceived motion and robust, glitch-free rendering.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Standout Feature: Multi-Monitor
&lt;/h2&gt;

&lt;p&gt;Up to 4 monitors captured simultaneously, with instant switching.&lt;/p&gt;

&lt;p&gt;Great for gaming on one display while browsing/reading guides/using streaming tools on another—strong differentiation for gaming use.&lt;/p&gt;

&lt;p&gt;Off-the-shelf products have limited multi-monitor support, so this is very practical.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;The power of a “GPU-complete” pipeline: Eliminating CPU round-trips dramatically improves both latency and stability.&lt;/p&gt;

&lt;p&gt;Undocumented behaviors are real: The version that “just works in practice” doesn’t always match the official samples.&lt;/p&gt;

&lt;p&gt;AI is a great partner, but not gospel: You still need your own testing. Persistence is required to squash inaccuracies.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;Intel iGPU support: Broaden compatibility for decode &amp;amp; rendering across many laptops.&lt;/p&gt;

&lt;p&gt;Further latency cuts:&lt;/p&gt;

&lt;p&gt;Optimize capture path and memory layout&lt;/p&gt;

&lt;p&gt;Adjust frame pipeline depth&lt;/p&gt;

&lt;p&gt;Simplify render passes and push zero-copy further&lt;/p&gt;

&lt;p&gt;Network resilience: Refine VBR, jitter absorption, and flow control.&lt;/p&gt;

&lt;p&gt;User features: Quality/latency presets, hotkeys, streaming mode, recording, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Goal: Play comfortably from lightweight PCs outside the home.&lt;/p&gt;

&lt;p&gt;Method: Build a GPU-complete pipeline with CUDA × DirectX × NVENC.&lt;/p&gt;

&lt;p&gt;Result: ~40–60 ms perceived latency, 4-display capture with instant switching.&lt;/p&gt;

&lt;p&gt;Challenges: Interop and NVENC info is scarce, and specs can be tricky—but experimentation revealed the path through.&lt;/p&gt;

&lt;p&gt;Next: iGPU support and more latency reduction to move from “usable” into truly comfortable territory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test Environment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Server
&lt;/h3&gt;

&lt;p&gt;CPU: Ryzen 5 4500 (6 cores)&lt;/p&gt;

&lt;p&gt;Memory: 32 GB&lt;/p&gt;

&lt;p&gt;GPU: NVIDIA RTX 4070 12 GB&lt;/p&gt;

&lt;p&gt;SSD: 1 TB&lt;/p&gt;

&lt;h3&gt;
  
  
  Client
&lt;/h3&gt;

&lt;p&gt;CPU: Core i5-11400H&lt;/p&gt;

&lt;p&gt;Memory: 16 GB&lt;/p&gt;

&lt;p&gt;GPU: GTX 1650&lt;/p&gt;

&lt;p&gt;SSD: 500 GB&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo video of the app running Final Fantasy XIV (FFXIV) at high quality on the server.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://youtu.be/eH123oOcDY8" rel="noopener noreferrer"&gt;Demo video&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>productivity</category>
      <category>architecture</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
