What is dlshogi?
dlshogi is a Shogi engine incorporating deep learning, consisting of a C++ implementation and a Python wrapper. It operates with ONNX Runtime and DirectML on Windows, and TensorRT and CUDA on Linux. In this project, we implemented it leveraging TensorRT in a WSL2 Ubuntu 24.04 environment equipped with an RTX 5090.
Key features are as follows:
- Evaluation value generation by neural network
- Hybrid approach with traditional αβ search
- Coordinated control of multiple engines via wrapper scripts
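The wrapper-script coordination works over the USI protocol (the standard text protocol Shogi engines speak over stdin/stdout). The following is a minimal illustrative sketch of such a wrapper, not dlshogi's actual wrapper code; the class and method names are hypothetical.

```python
import subprocess

class UsiEngine:
    """Minimal wrapper around a USI engine process (illustrative sketch,
    not dlshogi's actual wrapper code)."""

    def __init__(self, cmd):
        # cmd is the engine command line, e.g. ["./dlshogi"] (path hypothetical)
        self.proc = subprocess.Popen(
            cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

    def send(self, command):
        # USI commands are newline-terminated text lines
        self.proc.stdin.write(command + "\n")
        self.proc.stdin.flush()

    def read_until(self, token):
        # Collect engine output until a line starting with `token` (e.g. "usiok")
        lines = []
        for line in self.proc.stdout:
            lines.append(line.strip())
            if line.startswith(token):
                break
        return lines
```

A coordinating script would hold one `UsiEngine` per engine and route `position`/`go` commands to whichever engine is active.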
Fuka40B Model Architecture
Fuka40B is a model adopting the ResNet40x384 architecture with 107.2M parameters. It uses the ReLU activation function and is designed to accurately evaluate Shogi board positions.
- Training Data: Distilled dataset
- Optimization Algorithm: AdamW (learning rate 0.00005, weight decay 0.01)
- Batch Size: 4096
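To make the optimizer settings concrete, here is a single AdamW update step on a scalar parameter using the reported hyperparameters. This is a didactic sketch of AdamW's decoupled weight decay, not training code from the project; the beta and epsilon defaults are assumed, as the article does not state them.

```python
import math

# Reported Fuka40B training hyperparameters.
LR, WEIGHT_DECAY, BATCH_SIZE = 5e-5, 0.01, 4096
# Standard AdamW defaults (assumed, not stated in the article).
BETA1, BETA2, EPS = 0.9, 0.999, 1e-8

def adamw_step(param, grad, m, v, t):
    """One AdamW update on a scalar parameter (decoupled weight decay)."""
    m = BETA1 * m + (1 - BETA1) * grad          # first-moment estimate
    v = BETA2 * v + (1 - BETA2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - BETA1 ** t)                # bias correction
    v_hat = v / (1 - BETA2 ** t)
    # Weight decay is applied directly to the parameter, not folded into the gradient.
    param = param - LR * (m_hat / (math.sqrt(v_hat) + EPS) + WEIGHT_DECAY * param)
    return param, m, v
```

The decoupled decay term (`WEIGHT_DECAY * param`) is what distinguishes AdamW from plain Adam with L2 regularization.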
Effects of FP8 Quantization
TensorRT's FP8 quantization is a technique that improves inference speed while reducing VRAM load. Compared to INT4, FP8 has smaller quantization error, offering an excellent balance of accuracy and performance.
Actual Measurement Results (RTX 5090 + 32GB VRAM):
- FP16: VRAM 12.4GB, 35k NPS (Nodes Per Second), 82% Match Rate
- FP8: VRAM 7.8GB, 90k NPS, 81% Match Rate
- INT4: VRAM 6.2GB, 75k NPS, 73% Match Rate
FP8 demonstrates higher accuracy than INT4 while also delivering higher NPS. The reduced VRAM usage makes it easier to run alongside other processes.
TensorRT Optimization Considerations
TensorRT's --best option can sometimes degrade inference quality for specific model and quantization settings. With the 40B model, using --best reduced both NPS and match rate.
The correct setting is to explicitly specify FP8.
```shell
trtexec \
  --onnx=models/eval/model_fp8.onnx \
  --fp8 \
  --minShapes=input1:1x62x9x9,input2:1x57x9x9 \
  --optShapes=input1:256x62x9x9,input2:256x57x9x9 \
  --maxShapes=input1:256x62x9x9,input2:256x57x9x9 \
  --saveEngine=model_fp8_trt
```
Performance on RTX 5090
We achieved 90k NPS with optimized settings.
```python
# Excerpt from floodgate_client.py
UCT_THREADS = 4
DNN_BATCH_SIZE = 256
GPU_MEMORY_FRACTION = 0.8  # use 80% of VRAM
```
- At batch size 256: 90,200 NPS
- At batch size 512: 87,500 NPS (decreased due to memory pressure)
- VRAM Usage: 7.8GB (with FP8 quantization)
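These figures can be sanity-checked with a back-of-the-envelope relation: NPS is roughly the DNN batch size times the number of batch evaluations per second. This is a simple derivation from the numbers above, not project code.

```python
def nps(batch_size, batches_per_sec):
    """Nodes per second as DNN batch size times batch evaluations per second."""
    return batch_size * batches_per_sec

# Implied throughput at the measured operating points (derived from the figures above):
# 90,200 / 256 ≈ 352 batch evaluations/s at batch 256
# 87,500 / 512 ≈ 171 batch evaluations/s at batch 512
```

Doubling the batch size here less than halves the evaluation rate, which is consistent with the reported memory-pressure slowdown at batch 512.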
Floodgate Match Results
In actual Floodgate matches, the Time_Margin setting (the safety margin subtracted from the byoyomi time each move) significantly influenced wins and losses.
- Time_Margin 0ms setting: 3-2 victory (won with precise endgame evaluation)
- Time_Margin 1000ms setting: 0-5 defeat (lost all games due to misconfiguration of thinking time allocation)
It is crucial to always set Time_Margin to 0 for short byoyomi times.
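The effect of the margin on per-move thinking time can be sketched as follows. The helper function is hypothetical, illustrating the arithmetic rather than dlshogi's actual time-management code.

```python
def think_time_ms(byoyomi_ms, time_margin_ms):
    """Effective thinking time per move: byoyomi minus the safety margin
    (hypothetical helper, not dlshogi's actual time-management code)."""
    return max(byoyomi_ms - time_margin_ms, 0)

# With a 10-second byoyomi (illustrative values):
# Time_Margin 0    -> full 10,000 ms available each move
# Time_Margin 1000 -> only 9,000 ms available, i.e. 10% of thinking time lost per move
```

With short byoyomi, a fixed 1000 ms margin eats a large fraction of every move's budget, which matches the 0-5 result reported above.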
3-Phase Hybrid System
We constructed a 3-phase system that coordinates Fuka40B and Suisho11.
- Phase 1 (Opening): Fuka40B evaluates a wide range of positions.
- Phase 2 (Middlegame): Suisho11 specializes in tactical positions.
- Phase 3 (Endgame): Fuka40B resumes deep search.
When Fuka40B resumes in the endgame, it inherits the search tree from the previous phase, shortening the resumption time.
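The phase dispatch described above can be sketched as a simple move-number switch. The thresholds here are illustrative assumptions; the article does not specify where the phase boundaries lie.

```python
def select_engine(move_number, opening_end=30, middlegame_end=80):
    """Pick the engine for the current phase.
    Phase boundaries (30/80 moves) are illustrative, not from the article."""
    if move_number <= opening_end:
        return "Fuka40B"      # Phase 1: broad positional evaluation
    elif move_number <= middlegame_end:
        return "Suisho11"     # Phase 2: tactical middlegame search
    else:
        return "Fuka40B"      # Phase 3: deep endgame search (reuses earlier tree)
```

In practice the transition back to Fuka40B in Phase 3 is where the search-tree inheritance mentioned above pays off, since the tree built in Phase 1 need not be rebuilt from scratch.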
Summary
This report details the practical results of a Shogi AI leveraging the RTX 5090 and TensorRT FP8 quantization. FP8 quantization reduced VRAM load while maintaining accuracy, achieving high-speed inference of 90k NPS. In Floodgate matches, the Time_Margin setting was a critical factor in wins and losses, and the 3-phase hybrid system delivered a stable win rate.
This article was generated by Nemotron-Nano-9B-v2-Japanese and formatted/verified by Gemini 2.5 Flash.