soy

Posted on • Originally published at media.patentllm.org

Personal AI Development Environment Built with RTX 5090 + WSL2 — A Practical Setup Fully Utilizing 32GB GPU

Why RTX 5090 + WSL2?

The RTX 5090's 32GB of VRAM makes it a practical choice for local inference of large language models. Compared with the RTX 4090 (24GB), capacity is up 33%, leaving more headroom for larger models. Combined with vLLM's batched inference, parallel requests can fully utilize the 32GB.

CUDA 12.8 is the latest toolkit at the time of writing and is fully compatible with PyTorch and Triton. Under WSL2, the GPU driver installed on the Windows host is exposed directly to the Linux guest, so you get Linux toolchains (vLLM, TensorRT, llama.cpp, etc.) without a separate Linux driver install.
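Before installing anything heavy, it is worth confirming that the GPU is actually visible from inside WSL2 (the driver lives on the Windows host; only the CUDA toolkit is installed in the guest):

```shell
# Verify GPU passthrough from inside WSL2.
nvidia-smi        # should list the RTX 5090 plus driver/CUDA version
# If the CUDA toolkit is installed in the guest:
nvcc --version    # should report the installed toolkit (here, 12.8)
```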

Overall System Configuration

vLLM Server (Resident Process)

```shell
systemctl --user enable vllm.service
systemctl --user start vllm.service
```

  • Model: serves models such as Nemotron 9B in FP8.
  • VRAM usage is capped with the --gpu-memory-utilization option.
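A minimal user-level unit file for the resident server might look like the following (the paths, port, and utilization value are assumptions for illustration, not the original setup):

```ini
# ~/.config/systemd/user/vllm.service (example; adjust paths and options)
[Unit]
Description=vLLM OpenAI-compatible server
After=network.target

[Service]
Environment=CUDA_VISIBLE_DEVICES=0
ExecStart=/home/user/.venv/bin/python -m vllm.entrypoints.openai.api_server \
    --model nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese \
    --gpu-memory-utilization 0.75 \
    --max-model-len 32768
Restart=on-failure

[Install]
WantedBy=default.target
```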

TensorRT Shogi AI

Optimizes FP8-quantized models with TensorRT to achieve high-speed inference.
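As a sketch, an ONNX-exported network can be turned into an optimized engine with trtexec (the file names here are hypothetical, and FP8 support depends on your TensorRT version):

```shell
# Build an optimized inference engine from an ONNX export (hypothetical file names).
trtexec --onnx=shogi_policy.onnx \
        --saveEngine=shogi_policy.engine \
        --fp16    # switch to FP8 precision on TensorRT releases that support it
```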

Streamlit App

Provides UI for displaying LLM inference results, search forms, and more.

GPU Sharing in Practice

The vLLM server runs as a resident process, pinned to a specific GPU via CUDA_VISIBLE_DEVICES. When the Shogi AI is launched, vLLM's share of VRAM is capped with the --gpu-memory-utilization parameter so both workloads can share the card.

The switching procedure is as follows:

  • Check vLLM's memory usage.
  • Restart the vLLM service as needed to adjust memory allocation.
  • Launch the TensorRT process.
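The number that has to be right is vLLM's memory fraction. A small sketch of the budget calculation (the card size is the RTX 5090's 32GB; the amount reserved for the TensorRT process is an assumption):

```shell
# Compute the --gpu-memory-utilization fraction for vLLM,
# reserving part of the 32GB card for the TensorRT process.
TOTAL_MB=32768      # RTX 5090 VRAM
RESERVE_MB=8192     # assumed headroom for the Shogi AI engine
UTIL=$(awk -v t="$TOTAL_MB" -v r="$RESERVE_MB" 'BEGIN { printf "%.2f", (t - r) / t }')
echo "$UTIL"        # pass this to vLLM as --gpu-memory-utilization
```

Restart the vLLM service with the new value before launching the TensorRT process.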

WSL2-Specific Pitfalls

Setting Memory Limits

WSL2's default memory limits may be insufficient.

```ini
# ~/.wslconfig (on Windows)
[wsl2]
memory=16GB
```

After changing settings, apply them with wsl --shutdown.

Disk I/O Latency

I/O performance degrades when accessing the Windows filesystem (/mnt/c/...) from WSL2. By placing data files within the WSL2 distribution (/home/...), you can leverage the performance of the native Linux filesystem.
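A quick way to see the difference is to write a test file on each filesystem. The snippet below only exercises the native side; pointing the same command at /mnt/c typically shows a large slowdown (the file size is arbitrary):

```shell
# Write 64 MB to the native ext4 filesystem inside WSL2.
TMPFILE=$(mktemp)                 # lands on the Linux filesystem
dd if=/dev/zero of="$TMPFILE" bs=1M count=64 status=none
SIZE=$(stat -c %s "$TMPFILE")
echo "wrote $SIZE bytes"
rm -f "$TMPFILE"
# For comparison, set of=/mnt/c/... and watch the throughput drop.
```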

systemd Service Configuration

If you use systemd in WSL2, add the following to /etc/wsl.conf.

```ini
[boot]
systemd=true
```

To auto-start user services without an interactive login, loginctl enable-linger is required.
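Concretely, lingering is enabled once per user, after which --user services start when the distribution boots (vllm.service is the unit name from the vLLM section):

```shell
# Allow this user's services to run without an active login session.
loginctl enable-linger "$USER"
# Then enable the service so it starts when WSL2 boots.
systemctl --user enable vllm.service
```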

Example Workloads

LLM Inference (vLLM)

```shell
python -m vllm.entrypoints.openai.api_server \
  --model nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese \
  --dtype auto \
  --max-model-len 32768
```
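Once up, the server speaks an OpenAI-compatible HTTP API. A minimal request payload looks like this (port 8000 is vLLM's default; the prompt and max_tokens are arbitrary):

```shell
# Build a request payload for the OpenAI-compatible /v1/completions endpoint.
cat > /tmp/req.json <<'JSON'
{"model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese", "prompt": "こんにちは", "max_tokens": 64}
JSON
# With the server running (default port 8000):
#   curl -s http://localhost:8000/v1/completions -H 'Content-Type: application/json' -d @/tmp/req.json
python3 -m json.tool /tmp/req.json > /dev/null && echo "payload OK"
```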

Shogi AI (TensorRT Optimized)

FP8 quantization enables high-speed inference while significantly saving VRAM.

SQLite FTS5 Search

Full-text search with SQLite FTS5 runs on the CPU, so fast data search can operate concurrently with the GPU workloads.
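A minimal FTS5 sketch with the sqlite3 CLI (the table and column names are examples; most distribution builds of SQLite ship with FTS5 enabled):

```shell
# Create an FTS5 virtual table, insert two rows, and run a full-text query.
DB=$(mktemp)
sqlite3 "$DB" "CREATE VIRTUAL TABLE docs USING fts5(title, body);
INSERT INTO docs VALUES ('vLLM', 'batched LLM inference server'),
                        ('TensorRT', 'optimized FP8 inference engine');"
HITS=$(sqlite3 "$DB" "SELECT count(*) FROM docs WHERE docs MATCH 'inference';")
echo "matched $HITS documents"
rm -f "$DB"
```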

Summary

The combination of RTX 5090 + WSL2 is a practical setup that allows dedicating the full 32GB VRAM capacity to AI development. WSL2 challenges (memory limits, disk I/O) can be resolved by adjusting configuration files, enabling full utilization of the latest vLLM and TensorRT features. Placing data files on WSL2's native Linux filesystem is key to performance.

This article was generated by Nemotron-Nano-9B-v2-Japanese and formatted/verified by Gemini 2.5 Flash.
