Achieving a Real-Time Gaming Remote Desktop
Overview
On Windows, I combined CUDA / DirectX / NVENC to build a home-grown remote desktop that’s fast enough for gaming (currently at test-rig stage).
Perceived end-to-end latency is roughly 40 ms for non-gaming use, and 50–60 ms while running games.
Why I Built It
“I want to play PC games away from home—like at a café—without lugging around a heavy gaming laptop.”
Gaming laptops are expensive, bulky, and come with theft risk. So the idea was to turn my home gaming PC into a server and play comfortably from a lightweight client elsewhere. Off-the-shelf remote desktops are generally too laggy for this: they're built for general office use, not for fast-changing, high-motion content like games.
That’s why I set out to build a remote desktop purpose-built for gaming from day one.
Goals and Design Principles
Latency goal: Use 100 ms as the threshold where humans start to feel something’s “off,” and aim for total latency under 50 ms (on LAN).
Server (game machine): Target ~15 ms from capture → process → send. Keep work off the CPU; finish on the GPU.
Client: Lightweight rendering that runs even on typical PCs / iGPUs. It’s NVIDIA-only for now, but I’m adapting it for Intel iGPU (Quick Sync / D3D).
Premise: Leverage the high-end GPU in a gaming PC; prioritize real-time over “lightweight processing.”
Tech Stack & System Overview
OS: Windows
GPU compute: CUDA
Display/Sharing: DirectX (mainly D3D11/12)
Encoding: NVENC (using CUDA mode)
Objective: Keep everything—capture → processing → compression—inside GPU memory to avoid CPU round-trips.
Processing Flow
Server
Screen capture (GPU)
CUDA image processing (color space conversion, scaling, etc.; a conversion-kernel sketch follows this list)
Encode with NVENC in CUDA mode
Network transmission
Client
Decode + render (currently NVIDIA-focused, iGPU support in progress)
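To make the CUDA image-processing step concrete, here is a minimal BGRA → NV12 conversion kernel sketch (BT.709 coefficients, limited range). It assumes the captured frame has already been made visible to CUDA as a linear uchar4 buffer; the kernel name and parameters are illustrative, not the project's actual code.

```cpp
#include <cuda_runtime.h>

// Converts a BGRA8 frame into NV12 (Y plane + interleaved UV plane).
// Each thread handles one 2x2 pixel block, i.e. one chroma sample.
__device__ float Luma709(float r, float g, float b)
{
    return 0.2126f * r + 0.7152f * g + 0.0722f * b;
}

__global__ void BgraToNv12(const uchar4* src, int srcPitchPx,
                           unsigned char* yPlane, int yPitch,
                           unsigned char* uvPlane, int uvPitch,
                           int width, int height)
{
    int bx = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
    int by = (blockIdx.y * blockDim.y + threadIdx.y) * 2;
    if (bx >= width || by >= height) return;

    float rSum = 0.f, gSum = 0.f, bSum = 0.f;
    for (int dy = 0; dy < 2; ++dy)
        for (int dx = 0; dx < 2; ++dx) {
            int x = min(bx + dx, width - 1);
            int y = min(by + dy, height - 1);
            uchar4 p = src[y * srcPitchPx + x];   // BGRA byte order
            float r = p.z, g = p.y, b = p.x;
            // Limited-range (16..235) luma.
            yPlane[y * yPitch + x] =
                (unsigned char)(16.f + 219.f / 255.f * Luma709(r, g, b) + 0.5f);
            rSum += r; gSum += g; bSum += b;
        }

    // One averaged, interleaved UV pair per 2x2 block (4:2:0 subsampling).
    float rA = rSum * 0.25f, gA = gSum * 0.25f, bA = bSum * 0.25f;
    float yA = Luma709(rA, gA, bA);
    float cb = 128.f + (224.f / 255.f) * (bA - yA) / 1.8556f;
    float cr = 128.f + (224.f / 255.f) * (rA - yA) / 1.5748f;
    unsigned char* uv = uvPlane + (by / 2) * uvPitch + bx;
    uv[0] = (unsigned char)(cb + 0.5f);
    uv[1] = (unsigned char)(cr + 0.5f);
}
```

The grid only needs to cover (width / 2) × (height / 2) threads, since each thread writes a 2×2 block.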
Server Architecture for Real-Time Performance
GPU-only pipeline: From capture to NVENC submission, keep everything in GPU memory and hold CPU copies at effectively zero (a capture sketch follows this list).
CUDA × DirectX interop: Hand off CUDA kernel results directly to DirectX without returning to the CPU.
Synchronization design: No CPU blocking. Use queuing and timeline control so the GPU work queue stays smooth and ordered.
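As a reference for the capture end of that pipeline, here is a minimal D3D11 desktop-duplication sketch that keeps the frame entirely in VRAM. The function name and the reduced error handling are my own simplifications, not the actual capture code.

```cpp
#include <d3d11.h>
#include <dxgi1_2.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Copies the latest desktop frame into 'dst' (a GPU texture created with
// the same size/format as the output). The CPU never touches the pixels.
bool CaptureInto(IDXGIOutputDuplication* dupl,
                 ID3D11DeviceContext* ctx,
                 ID3D11Texture2D* dst)
{
    DXGI_OUTDUPL_FRAME_INFO info = {};
    ComPtr<IDXGIResource> resource;

    // Wait up to ~16 ms for a new frame (one frame period at 60 Hz).
    HRESULT hr = dupl->AcquireNextFrame(16, &info, &resource);
    if (FAILED(hr)) return false;             // e.g. DXGI_ERROR_WAIT_TIMEOUT

    ComPtr<ID3D11Texture2D> frame;
    resource.As(&frame);                      // BGRA8 desktop image in VRAM

    // GPU-to-GPU copy into the texture that is shared with CUDA.
    ctx->CopyResource(dst, frame.Get());

    // The acquired surface is only valid until ReleaseFrame().
    dupl->ReleaseFrame();
    return true;
}
```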
Analogy
“CUDA = paint-mixing artisan, CPU = paint tube, DirectX = painter.”
If the artisan has to stuff paint into a tube (CPU) every time before handing it to the painter, it’s slow.
If they work from the same palette (GPU memory), the artisan mixes and the painter paints directly—that’s the value of interop.
Client Side: Lightweight and Compatible
Policy: Must run on laptops and iGPUs when out and about.
Current status: Initially built on a single machine, so early dev was NVIDIA-assumed.
In progress: Redesigning to support Intel iGPU for decode + rendering.
Demonstration: Ran FF14 at 4K / max settings on the server (RTX 4070) and rendered successfully on the client (GTX 1650).
A GTX 1650 on its own can't realistically run FF14 at 4K max settings, but with remote rendering it becomes feasible.
Pain Points During Development (Details)
1) CUDA × DirectX Interop
Sparse references: Almost no production-grade samples in NVIDIA’s official examples, books, or the web.
Implementation difficulty: I tried AI-assisted "vibe coding," but many of the generated examples were wrong; it took ~3 months to reach stability.
Crux: Don’t bounce through the CPU. Accurate sharing and synchronization in GPU memory (fences/barriers/ownership transitions) makes or breaks it.
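For reference, here is a minimal sketch of the register → map → copy pattern on the D3D11 side of the interop (the pipeline also involves D3D12, and the helper names and stream-ordered copy here are illustrative, not the exact code):

```cpp
#include <d3d11.h>
#include <cuda_runtime.h>
#include <cuda_d3d11_interop.h>

cudaGraphicsResource_t gfxRes = nullptr;

// One-time registration of the shared capture texture with CUDA.
void RegisterCaptureTexture(ID3D11Texture2D* tex)
{
    cudaGraphicsD3D11RegisterResource(&gfxRes, tex,
                                      cudaGraphicsRegisterFlagsNone);
}

// Per-frame: map the texture, copy it (GPU to GPU) into a pitched CUDA
// buffer that the conversion kernel reads, then unmap. No CPU copy occurs.
void CopyFrameToCuda(void* dstDev, size_t dstPitch,
                     int width, int height, cudaStream_t stream)
{
    // Map hands ownership of the texture to CUDA for the duration.
    cudaGraphicsMapResources(1, &gfxRes, stream);

    cudaArray_t array = nullptr;
    cudaGraphicsSubResourceGetMappedArray(&array, gfxRes, 0, 0);

    // GPU-to-GPU copy of the BGRA8 texture into linear, pitched memory.
    cudaMemcpy2DFromArrayAsync(dstDev, dstPitch, array, 0, 0,
                               width * 4, height,
                               cudaMemcpyDeviceToDevice, stream);

    // Unmap returns ownership to D3D11. The work is ordered on 'stream',
    // so no CPU-side wait is needed here.
    cudaGraphicsUnmapResources(1, &gfxRes, stream);
}
```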
2) NVENC “Traps” and Workarounds
Initially I planned to use NVENC in DirectX (D3D) mode, but hit these walls:
D3D12 → D3D11 bridging needed: With a CUDA → D3D12 → NVENC path, you often need to convert to D3D11 for NVENC submission.
NV12 cannot be a UAV: Making a standard NV12 resource an Unordered Access View (UAV) isn't allowed on consumer RTX cards. That effectively pushes you toward enterprise GPUs, which made DirectX mode a non-starter for me.
Switch to CUDA mode: I pivoted to running NVENC in CUDA mode.
But CUDA mode had its own pitfall:
“Write NV12 into a single linear buffer → NVENC rejects”:
Writing NV12 directly into one linear buffer from a CUDA kernel led to NVENC refusing the input.
What worked:
Generate Y plane and UV plane separately.
Using the correct computed pitch, copy each plane into a linear GPU buffer.
Submit those plane buffers to NVENC.
→ Stable operation achieved. This wasn't clearly documented; it took ~2 months to figure out (a sketch of the plane copy follows below).
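One way to do the plane-by-plane pitched copy described above looks roughly like this: allocate one pitched NV12 buffer (Y rows followed by half-height interleaved UV rows) and copy each separately generated plane into it using the pitch the allocator reports. The helper names are illustrative, and the NVENC registration step itself is omitted.

```cpp
#include <cuda_runtime.h>

struct Nv12Buffer {
    unsigned char* base = nullptr;  // Y plane starts here
    unsigned char* uv   = nullptr;  // UV plane starts 'height' rows later
    size_t pitch = 0;               // same pitch for both planes
};

Nv12Buffer AllocNv12(int width, int height)
{
    Nv12Buffer b;
    // 3/2 * height rows: 'height' rows of Y, 'height/2' rows of interleaved UV.
    cudaMallocPitch((void**)&b.base, &b.pitch, width, height * 3 / 2);
    b.uv = b.base + b.pitch * height;
    return b;
}

void PackPlanes(const Nv12Buffer& b,
                const unsigned char* yPlane, size_t yPitch,
                const unsigned char* uvPlane, size_t uvPitch,
                int width, int height, cudaStream_t stream)
{
    // Pitched copies keep every row aligned to b.pitch, which is the
    // pitch the encoder is told about when the buffer is registered.
    cudaMemcpy2DAsync(b.base, b.pitch, yPlane, yPitch,
                      width, height, cudaMemcpyDeviceToDevice, stream);
    cudaMemcpy2DAsync(b.uv, b.pitch, uvPlane, uvPitch,
                      width, height / 2, cudaMemcpyDeviceToDevice, stream);
}
```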
3) Synchronization and Stabilization
Zero CPU-wait design: Avoid GPU queue stalls and unnecessary syncs—let everything that can progress keep progressing.
Frame pipeline depth: Tune the capture → processing → encode → send pipeline so it's neither too deep nor too shallow, finding the sweet spot between latency and stability (a sketch of this follows below).
Result: Smooth perceived motion and robust, glitch-free rendering.
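To illustrate the zero CPU-wait idea, here is a minimal sketch of a small ring of in-flight frames where the CPU only records and queries CUDA events rather than waiting on them. The depth constant and struct are illustrative, and the D3D-side fences are left out.

```cpp
#include <cuda_runtime.h>

constexpr int kDepth = 3;          // pipeline depth: latency vs. stability

struct FrameSlot {
    cudaEvent_t done;              // signalled when GPU work for the slot ends
    // ... per-slot GPU buffers (capture copy, NV12, bitstream) ...
};

FrameSlot slots[kDepth];

void Init()
{
    for (auto& s : slots)
        cudaEventCreateWithFlags(&s.done, cudaEventDisableTiming);
}

// Returns a free slot index, or -1 if every slot is still in flight.
// The caller then drops this capture instead of stalling the queue.
int TryAcquireSlot()
{
    for (int i = 0; i < kDepth; ++i)
        if (cudaEventQuery(slots[i].done) == cudaSuccess)  // non-blocking
            return i;
    return -1;
}

void SubmitSlot(int i, cudaStream_t stream)
{
    // ... enqueue convert + encode work for slot i on 'stream' ...
    cudaEventRecord(slots[i].done, stream);   // completion marked on the GPU
}
```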
A Standout Feature: Multi-Monitor
Up to 4 monitors captured simultaneously, with instant switching (a setup sketch follows this list).
Great for gaming on one display while browsing/reading guides/using streaming tools on another—strong differentiation for gaming use.
Off-the-shelf products have limited multi-monitor support, so this is very practical.
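For reference, a minimal sketch of the multi-monitor setup with DXGI: enumerate the adapter's outputs and create one desktop-duplication object per monitor. The function name and the hard-coded limit of four are illustrative; switching monitors then just means reading frames from a different duplication object.

```cpp
#include <d3d11.h>
#include <dxgi1_2.h>
#include <wrl/client.h>
#include <vector>
using Microsoft::WRL::ComPtr;

std::vector<ComPtr<IDXGIOutputDuplication>>
CreateDuplications(ID3D11Device* device, IDXGIAdapter* adapter)
{
    std::vector<ComPtr<IDXGIOutputDuplication>> dups;
    for (UINT i = 0; i < 4; ++i) {                 // capture up to 4 monitors
        ComPtr<IDXGIOutput> output;
        if (adapter->EnumOutputs(i, &output) == DXGI_ERROR_NOT_FOUND)
            break;

        ComPtr<IDXGIOutput1> output1;
        output.As(&output1);                       // duplication needs IDXGIOutput1

        ComPtr<IDXGIOutputDuplication> dup;
        if (SUCCEEDED(output1->DuplicateOutput(device, &dup)))
            dups.push_back(dup);
    }
    return dups;
}
```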
What I Learned
The power of a “GPU-complete” pipeline: Eliminating CPU round-trips dramatically improves both latency and stability.
Undocumented behaviors are real: The version that “just works in practice” doesn’t always match the official samples.
AI is a great partner, but not gospel: You still need your own testing. Persistence is required to squash inaccuracies.
What’s Next
Intel iGPU support: Broaden compatibility for decode & rendering across many laptops.
Further latency cuts:
Optimize capture path and memory layout
Adjust frame pipeline depth
Simplify render passes and push zero-copy further
Network resilience: Refine VBR, jitter absorption, and flow control.
User features: Quality/latency presets, hotkeys, streaming mode, recording, and more.
Summary
Goal: Play comfortably from lightweight PCs outside the home.
Method: Build a GPU-complete pipeline with CUDA × DirectX × NVENC.
Result: ~40–60 ms perceived latency, 4-display capture with instant switching.
Challenges: Interop and NVENC info is scarce, and specs can be tricky—but experimentation revealed the path through.
Next: iGPU support and more latency reduction to move from “usable” into truly comfortable territory.
Test Environment
Server
CPU: Ryzen 5 4500 (6 cores)
Memory: 32 GB
GPU: NVIDIA RTX 4070 12 GB
SSD: 1 TB
Client
CPU: Core i5-11400H
Memory: 16 GB
GPU: GTX 1650
SSD: 500 GB