Why RTX 5090 + WSL2?
The RTX 5090's 32GB of VRAM makes it a practical choice for local LLM inference. Compared with the RTX 4090 (24GB), capacity is up roughly 33%, leaving more headroom for larger models. With vLLM's batched, parallel inference, the full 32GB of VRAM can be put to use.
CUDA 12.8 is the latest toolkit and is fully compatible with PyTorch and Triton. In a WSL2 environment, the Windows host's GPU driver is exposed directly to the Linux guest, so you can use the GPU while benefiting from Linux toolchains (vLLM, TensorRT, llama.cpp, etc.).
Overall System Configuration
vLLM Server (Resident Process)
systemctl --user enable vllm.service
systemctl --user start vllm.service
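For the commands above to work, a user-level unit file is needed. A minimal sketch, e.g. at ~/.config/systemd/user/vllm.service, is shown below; the Python path (%h/.venv/bin/python) is an assumption, and the flags mirror the launch command later in this article:

```ini
[Unit]
Description=vLLM OpenAI-compatible API server
After=network.target

[Service]
# %h expands to the user's home directory; adjust the venv path as needed
ExecStart=%h/.venv/bin/python -m vllm.entrypoints.openai.api_server \
    --model nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese \
    --dtype auto \
    --max-model-len 32768
Restart=on-failure

[Install]
WantedBy=default.target
```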
- Model: serves models such as Nemotron 9B in FP8.
- VRAM usage is capped with the gpu-memory-utilization option.
TensorRT Shogi AI
Optimizes FP8 quantized models with TensorRT to achieve high-speed inference.
Streamlit App
Provides UI for displaying LLM inference results, search forms, and more.
GPU Sharing in Practice
The vLLM server starts as a resident process and specifies a particular GPU using CUDA_VISIBLE_DEVICES. When launching the Shogi AI, the gpu-memory-utilization parameter is used to limit vLLM's usage, thereby sharing resources.
The switching procedure is as follows:
- Check vLLM's memory usage.
- Restart the vLLM service as needed to adjust memory allocation.
- Launch the TensorRT process.
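The three steps above can be sketched as a shell sequence. This is a defensive sketch: the shogi engine invocation is hypothetical (left as a comment), and each step is guarded so the script degrades gracefully on machines without a GPU or systemd:

```shell
# 1. Check vLLM's current GPU memory usage (skipped cleanly when no GPU is visible)
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader
fi

# 2. Restart the vLLM service if its memory fraction needs to shrink
#    (e.g. relaunch with --gpu-memory-utilization 0.5 instead of 0.9)
if command -v systemctl >/dev/null 2>&1; then
    systemctl --user restart vllm.service 2>/dev/null || true
fi

# 3. Launch the TensorRT process on the freed VRAM (hypothetical binary/engine names)
# ./shogi_engine --engine model_fp8.trt

SWITCH_STEPS_DONE=3
```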
WSL2-Specific Pitfalls
Setting Memory Limits
WSL2's default memory limits may be insufficient.
# ~/.wslconfig (on Windows)
[wsl2]
memory=16GB
After changing settings, apply them with wsl --shutdown.
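After restarting, you can confirm from inside WSL2 that the new limit took effect by reading the kernel's reported total memory:

```shell
# MemTotal should reflect the memory= value from ~/.wslconfig (minus kernel overhead)
MEMTOTAL=$(grep MemTotal /proc/meminfo)
echo "$MEMTOTAL"
```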
Disk I/O Latency
I/O performance degrades when accessing the Windows filesystem (/mnt/c/...) from WSL2. By placing data files within the WSL2 distribution (/home/...), you can leverage the performance of the native Linux filesystem.
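A rough way to see the gap is a sequential-write test with dd on both filesystems. The /mnt/c/temp path is an assumption; point it at any writable directory under /mnt/c, and the loop simply skips paths that don't exist:

```shell
# Compare sequential write throughput: native Linux FS vs. the Windows mount
for dir in "$HOME" /mnt/c/temp; do
    [ -d "$dir" ] && [ -w "$dir" ] || continue
    f="$dir/io_test.bin"
    echo "== $dir =="
    # 64 MiB write, flushed to disk so the cache doesn't mask the result
    dd if=/dev/zero of="$f" bs=1M count=64 conv=fdatasync 2>&1 | tail -n 1
    rm -f "$f"
done
```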
systemd Service Configuration
If you use systemd in WSL2, add the following to /etc/wsl.conf.
[boot]
systemd=true
To auto-start user services, loginctl enable-linger is required.
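Enabling lingering is a one-liner; the guard below just lets the snippet no-op on systems without systemd-logind:

```shell
# Keep user services running even without an active login session
if command -v loginctl >/dev/null 2>&1; then
    USER_NAME="${USER:-$(id -un)}"
    loginctl enable-linger "$USER_NAME"
    # Verify: should print "Linger=yes"
    loginctl show-user "$USER_NAME" --property=Linger
fi
LINGER_STEP_DONE=1
```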
Example Workloads
LLM Inference (vLLM)
python -m vllm.entrypoints.openai.api_server \
--model nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese \
--dtype auto \
--max-model-len 32768
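Once the server is up, it speaks the OpenAI-compatible API on port 8000 by default. A quick smoke test with curl (the prompt text is just an example, and --max-time keeps the call from hanging if the server isn't running):

```shell
# Query the OpenAI-compatible completions endpoint
curl -s --max-time 5 http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese",
         "prompt": "WSL2とは",
         "max_tokens": 64}' \
    || echo "server not reachable"
CURL_STEP_DONE=1
```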
Shogi AI (TensorRT Optimized)
FP8 quantization enables high-speed inference while significantly saving VRAM.
SQLite FTS5 Search
Fast full-text search with SQLite's FTS5 engine can also run alongside the GPU workloads, since it is CPU-bound.
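A minimal FTS5 demo with the sqlite3 CLI (the table and sample rows are made up for illustration): build an in-memory index and run a MATCH query.

```shell
# Create an FTS5 virtual table, insert two rows, and search for "inference"
FTS_RESULT=$(sqlite3 :memory: <<'SQL'
CREATE VIRTUAL TABLE docs USING fts5(title, body);
INSERT INTO docs VALUES ('vLLM', 'high-throughput LLM serving engine');
INSERT INTO docs VALUES ('TensorRT', 'optimized low-latency GPU inference');
SELECT title FROM docs WHERE docs MATCH 'inference';
SQL
)
echo "$FTS_RESULT"
```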
Summary
The combination of RTX 5090 + WSL2 is a practical setup that lets you dedicate the full 32GB of VRAM to AI development. WSL2's known pain points (memory limits, disk I/O) can be resolved through configuration files, enabling full use of the latest vLLM and TensorRT features. Placing data files on WSL2's native Linux filesystem is the key to performance.
This article was generated by Nemotron-Nano-9B-v2-Japanese and formatted/verified by Gemini 2.5 Flash.