Achieving a Real-Time Gaming Remote Desktop
Overview
On Windows, I combined CUDA / DirectX / NVENC to build a home-grown remote desktop that’s fast enough for gaming (currently at test-rig stage).
Perceived end-to-end latency is roughly 40 ms for non-gaming use, and 50–60 ms while running games.
Why I Built It
“I want to play PC games away from home—like at a café—without lugging around a heavy gaming laptop.”
Gaming laptops are expensive, bulky, and come with theft risk. So the idea was to turn my home gaming PC into a server and play comfortably from a lightweight client elsewhere. Off-the-shelf remote desktops are generally too laggy for this: they're built for general office use, not for fast-changing, high-motion content like games.
That’s why I set out to build a remote desktop purpose-built for gaming from day one.
Goals and Design Principles
Latency goal: Use 100 ms as the threshold where humans start to feel something’s “off,” and aim for total latency under 50 ms (on LAN).
Server (game machine): Target ~15 ms from capture → process → send. Keep work off the CPU; finish on the GPU.
Client: Lightweight rendering that runs even on typical PCs / iGPUs. It’s NVIDIA-only for now, but I’m adapting it for Intel iGPU (Quick Sync / D3D).
Premise: Leverage the high-end GPU in a gaming PC; prioritize real-time over “lightweight processing.”
Tech Stack & System Overview
OS: Windows
GPU compute: CUDA
Display/Sharing: DirectX (mainly D3D11/12)
Encoding: NVENC (using CUDA mode)
Objective: Keep everything—capture → processing → compression—inside GPU memory to avoid CPU round-trips.
Processing Flow
Server
Screen capture (GPU)
CUDA image processing (color space conversion, scaling, etc.; a conversion-kernel sketch follows this list)
Encode with NVENC in CUDA mode
Network transmission
Client
Decode + render (currently NVIDIA-focused, iGPU support in progress)
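To make the CUDA image-processing step concrete, here is a minimal BGRA → NV12 conversion kernel sketch (BT.709 coefficients, limited range). It assumes the captured frame has already been made visible to CUDA as a linear uchar4 buffer; the kernel name and parameters are illustrative, not the project's actual code.

```cpp
#include <cuda_runtime.h>

// Converts a BGRA8 frame into NV12 (Y plane + interleaved UV plane).
// Each thread handles one 2x2 pixel block, i.e. one chroma sample.
__device__ float Luma709(float r, float g, float b)
{
    return 0.2126f * r + 0.7152f * g + 0.0722f * b;
}

__global__ void BgraToNv12(const uchar4* src, int srcPitchPx,
                           unsigned char* yPlane, int yPitch,
                           unsigned char* uvPlane, int uvPitch,
                           int width, int height)
{
    int bx = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
    int by = (blockIdx.y * blockDim.y + threadIdx.y) * 2;
    if (bx >= width || by >= height) return;

    float rSum = 0.f, gSum = 0.f, bSum = 0.f;
    for (int dy = 0; dy < 2; ++dy)
        for (int dx = 0; dx < 2; ++dx) {
            int x = min(bx + dx, width - 1);
            int y = min(by + dy, height - 1);
            uchar4 p = src[y * srcPitchPx + x];   // BGRA byte order
            float r = p.z, g = p.y, b = p.x;
            // Limited-range (16..235) luma.
            yPlane[y * yPitch + x] =
                (unsigned char)(16.f + 219.f / 255.f * Luma709(r, g, b) + 0.5f);
            rSum += r; gSum += g; bSum += b;
        }

    // One averaged, interleaved UV pair per 2x2 block (4:2:0 subsampling).
    float rA = rSum * 0.25f, gA = gSum * 0.25f, bA = bSum * 0.25f;
    float yA = Luma709(rA, gA, bA);
    float cb = 128.f + (224.f / 255.f) * (bA - yA) / 1.8556f;
    float cr = 128.f + (224.f / 255.f) * (rA - yA) / 1.5748f;
    unsigned char* uv = uvPlane + (by / 2) * uvPitch + bx;
    uv[0] = (unsigned char)(cb + 0.5f);
    uv[1] = (unsigned char)(cr + 0.5f);
}
```

The grid only needs to cover (width / 2) × (height / 2) threads, since each thread writes a 2×2 block.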
Server Architecture for Real-Time Performance
GPU-only pipeline: From capture to NVENC submission, keep everything in GPU memory and hold CPU copies at effectively zero (a capture sketch follows this list).
CUDA × DirectX interop: Hand off CUDA kernel results directly to DirectX without returning to the CPU.
Synchronization design: No CPU blocking. Use queuing and timeline control so the GPU work queue stays smooth and ordered.
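As a reference for the capture end of that pipeline, here is a minimal D3D11 desktop-duplication sketch that keeps the frame entirely in VRAM. The function name and the reduced error handling are my own simplifications, not the actual capture code.

```cpp
#include <d3d11.h>
#include <dxgi1_2.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Copies the latest desktop frame into 'dst' (a GPU texture created with
// the same size/format as the output). The CPU never touches the pixels.
bool CaptureInto(IDXGIOutputDuplication* dupl,
                 ID3D11DeviceContext* ctx,
                 ID3D11Texture2D* dst)
{
    DXGI_OUTDUPL_FRAME_INFO info = {};
    ComPtr<IDXGIResource> resource;

    // Wait up to ~16 ms for a new frame (one frame period at 60 Hz).
    HRESULT hr = dupl->AcquireNextFrame(16, &info, &resource);
    if (FAILED(hr)) return false;             // e.g. DXGI_ERROR_WAIT_TIMEOUT

    ComPtr<ID3D11Texture2D> frame;
    resource.As(&frame);                      // BGRA8 desktop image in VRAM

    // GPU-to-GPU copy into the texture that is shared with CUDA.
    ctx->CopyResource(dst, frame.Get());

    // The acquired surface is only valid until ReleaseFrame().
    dupl->ReleaseFrame();
    return true;
}
```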
Analogy
“CUDA = paint-mixing artisan, CPU = paint tube, DirectX = painter.”
If the artisan has to stuff paint into a tube (CPU) every time before handing it to the painter, it’s slow.
If they work from the same palette (GPU memory), the artisan mixes and the painter paints directly—that’s the value of interop.
Client Side: Lightweight and Compatible
Policy: Must run on laptops and iGPUs when out and about.
Current status: Initially built on a single machine, so early dev was NVIDIA-assumed.
In progress: Redesigning to support Intel iGPU for decode + rendering.
Demonstration: Ran FF14 at 4K / max settings on the server (RTX 4070) and rendered successfully on the client (GTX 1650).
A GTX 1650 on its own can't realistically run FF14 at 4K max settings, but with remote rendering it becomes feasible.
Pain Points During Development (Details)
1) CUDA × DirectX Interop
Sparse references: Almost no production-grade samples in NVIDIA’s official examples, books, or the web.
Implementation difficulty: I tried AI-assisted "vibe coding," but many of the generated examples were wrong; it took ~3 months to reach stability.
Crux: Don’t bounce through the CPU. Accurate sharing and synchronization in GPU memory (fences/barriers/ownership transitions) makes or breaks it.
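For reference, here is a minimal sketch of the register → map → copy pattern on the D3D11 side of the interop (the pipeline also involves D3D12, and the helper names and stream-ordered copy here are illustrative, not the exact code):

```cpp
#include <d3d11.h>
#include <cuda_runtime.h>
#include <cuda_d3d11_interop.h>

cudaGraphicsResource_t gfxRes = nullptr;

// One-time registration of the shared capture texture with CUDA.
void RegisterCaptureTexture(ID3D11Texture2D* tex)
{
    cudaGraphicsD3D11RegisterResource(&gfxRes, tex,
                                      cudaGraphicsRegisterFlagsNone);
}

// Per-frame: map the texture, copy it (GPU to GPU) into a pitched CUDA
// buffer that the conversion kernel reads, then unmap. No CPU copy occurs.
void CopyFrameToCuda(void* dstDev, size_t dstPitch,
                     int width, int height, cudaStream_t stream)
{
    // Map hands ownership of the texture to CUDA for the duration.
    cudaGraphicsMapResources(1, &gfxRes, stream);

    cudaArray_t array = nullptr;
    cudaGraphicsSubResourceGetMappedArray(&array, gfxRes, 0, 0);

    // GPU-to-GPU copy of the BGRA8 texture into linear, pitched memory.
    cudaMemcpy2DFromArrayAsync(dstDev, dstPitch, array, 0, 0,
                               width * 4, height,
                               cudaMemcpyDeviceToDevice, stream);

    // Unmap returns ownership to D3D11. The work is ordered on 'stream',
    // so no CPU-side wait is needed here.
    cudaGraphicsUnmapResources(1, &gfxRes, stream);
}
```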
2) NVENC “Traps” and Workarounds
Initially I planned to use NVENC in DirectX (D3D) mode, but hit these walls:
D3D12 → D3D11 bridging needed: With a CUDA → D3D12 → NVENC path, you often need to convert to D3D11 for NVENC submission.
NV12 cannot be a UAV: Making a standard NV12 resource an Unordered Access View (UAV) isn't allowed on consumer RTX cards. That effectively pushes you toward enterprise GPUs, which made DirectX mode a non-starter for me.
Switch to CUDA mode: I pivoted to running NVENC in CUDA mode.
But CUDA mode had its own pitfall:
“Write NV12 into a single linear buffer → NVENC rejects”:
Writing NV12 directly into one linear buffer from a CUDA kernel led to NVENC refusing the input.
What worked:
Generate Y plane and UV plane separately.
Using the correct computed pitch, copy each plane into a linear GPU buffer.
Submit those plane buffers to NVENC.
→ Stable operation achieved. This wasn't clearly documented; it took ~2 months to figure out (a sketch of the plane copy follows below).
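One way to do the plane-by-plane pitched copy described above looks roughly like this: allocate one pitched NV12 buffer (Y rows followed by half-height interleaved UV rows) and copy each separately generated plane into it using the pitch the allocator reports. The helper names are illustrative, and the NVENC registration step itself is omitted.

```cpp
#include <cuda_runtime.h>

struct Nv12Buffer {
    unsigned char* base = nullptr;  // Y plane starts here
    unsigned char* uv   = nullptr;  // UV plane starts 'height' rows later
    size_t pitch = 0;               // same pitch for both planes
};

Nv12Buffer AllocNv12(int width, int height)
{
    Nv12Buffer b;
    // 3/2 * height rows: 'height' rows of Y, 'height/2' rows of interleaved UV.
    cudaMallocPitch((void**)&b.base, &b.pitch, width, height * 3 / 2);
    b.uv = b.base + b.pitch * height;
    return b;
}

void PackPlanes(const Nv12Buffer& b,
                const unsigned char* yPlane, size_t yPitch,
                const unsigned char* uvPlane, size_t uvPitch,
                int width, int height, cudaStream_t stream)
{
    // Pitched copies keep every row aligned to b.pitch, which is the
    // pitch the encoder is told about when the buffer is registered.
    cudaMemcpy2DAsync(b.base, b.pitch, yPlane, yPitch,
                      width, height, cudaMemcpyDeviceToDevice, stream);
    cudaMemcpy2DAsync(b.uv, b.pitch, uvPlane, uvPitch,
                      width, height / 2, cudaMemcpyDeviceToDevice, stream);
}
```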
3) Synchronization and Stabilization
Zero CPU-wait design: Avoid GPU queue stalls and unnecessary syncs—let everything that can progress keep progressing.
Frame pipeline depth: Tune the capture → processing → encode → send pipeline so it's neither too deep nor too shallow, finding the sweet spot between latency and stability (a sketch of this follows below).
Result: Smooth perceived motion and robust, glitch-free rendering.
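To illustrate the zero CPU-wait idea, here is a minimal sketch of a small ring of in-flight frames where the CPU only records and queries CUDA events rather than waiting on them. The depth constant and struct are illustrative, and the D3D-side fences are left out.

```cpp
#include <cuda_runtime.h>

constexpr int kDepth = 3;          // pipeline depth: latency vs. stability

struct FrameSlot {
    cudaEvent_t done;              // signalled when GPU work for the slot ends
    // ... per-slot GPU buffers (capture copy, NV12, bitstream) ...
};

FrameSlot slots[kDepth];

void Init()
{
    for (auto& s : slots)
        cudaEventCreateWithFlags(&s.done, cudaEventDisableTiming);
}

// Returns a free slot index, or -1 if every slot is still in flight.
// The caller then drops this capture instead of stalling the queue.
int TryAcquireSlot()
{
    for (int i = 0; i < kDepth; ++i)
        if (cudaEventQuery(slots[i].done) == cudaSuccess)  // non-blocking
            return i;
    return -1;
}

void SubmitSlot(int i, cudaStream_t stream)
{
    // ... enqueue convert + encode work for slot i on 'stream' ...
    cudaEventRecord(slots[i].done, stream);   // completion marked on the GPU
}
```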
A Standout Feature: Multi-Monitor
Up to 4 monitors captured simultaneously, with instant switching (a setup sketch follows this list).
Great for gaming on one display while browsing/reading guides/using streaming tools on another—strong differentiation for gaming use.
Off-the-shelf products have limited multi-monitor support, so this is very practical.
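For reference, a minimal sketch of the multi-monitor setup with DXGI: enumerate the adapter's outputs and create one desktop-duplication object per monitor. The function name and the hard-coded limit of four are illustrative; switching monitors then just means reading frames from a different duplication object.

```cpp
#include <d3d11.h>
#include <dxgi1_2.h>
#include <wrl/client.h>
#include <vector>
using Microsoft::WRL::ComPtr;

std::vector<ComPtr<IDXGIOutputDuplication>>
CreateDuplications(ID3D11Device* device, IDXGIAdapter* adapter)
{
    std::vector<ComPtr<IDXGIOutputDuplication>> dups;
    for (UINT i = 0; i < 4; ++i) {                 // capture up to 4 monitors
        ComPtr<IDXGIOutput> output;
        if (adapter->EnumOutputs(i, &output) == DXGI_ERROR_NOT_FOUND)
            break;

        ComPtr<IDXGIOutput1> output1;
        output.As(&output1);                       // duplication needs IDXGIOutput1

        ComPtr<IDXGIOutputDuplication> dup;
        if (SUCCEEDED(output1->DuplicateOutput(device, &dup)))
            dups.push_back(dup);
    }
    return dups;
}
```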
What I Learned
The power of a “GPU-complete” pipeline: Eliminating CPU round-trips dramatically improves both latency and stability.
Undocumented behaviors are real: The version that “just works in practice” doesn’t always match the official samples.
AI is a great partner, but not gospel: You still need your own testing. Persistence is required to squash inaccuracies.
What’s Next
Intel iGPU support: Broaden compatibility for decode & rendering across many laptops.
Further latency cuts:
Optimize capture path and memory layout
Adjust frame pipeline depth
Simplify render passes and push zero-copy further
Network resilience: Refine VBR, jitter absorption, and flow control.
User features: Quality/latency presets, hotkeys, streaming mode, recording, and more.
Summary
Goal: Play comfortably from lightweight PCs outside the home.
Method: Build a GPU-complete pipeline with CUDA × DirectX × NVENC.
Result: ~40–60 ms perceived latency, 4-display capture with instant switching.
Challenges: Interop and NVENC info is scarce, and specs can be tricky—but experimentation revealed the path through.
Next: iGPU support and more latency reduction to move from “usable” into truly comfortable territory.
Test Environment
Server
CPU: Ryzen 5 4500 (6 cores)
Memory: 32 GB
GPU: NVIDIA RTX 4070 12 GB
SSD: 1 TB
Client
CPU: Core i5-11400H
Memory: 16 GB
GPU: GTX 1650
SSD: 500 GB