What is dlshogi?
dlshogi is a Shogi engine incorporating deep learning, consisting of a C++ implementation and a Python wrapper. It operates with ONNX Runtime and DirectML on Windows, and TensorRT and CUDA on Linux. In this project, we implemented it leveraging TensorRT in a WSL2 Ubuntu 24.04 environment equipped with an RTX 5090.
Key features are as follows:
- Evaluation value generation by neural network
- Hybrid approach with traditional αβ search
- Coordinated control of multiple engines via wrapper scripts
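The wrapper-script coordination works over the USI protocol (the standard text protocol Shogi engines speak over stdin/stdout). The following is a minimal illustrative sketch of such a wrapper, not dlshogi's actual wrapper code; the class and method names are hypothetical.

```python
import subprocess

class UsiEngine:
    """Minimal wrapper around a USI engine process (illustrative sketch,
    not dlshogi's actual wrapper code)."""

    def __init__(self, cmd):
        # cmd is the engine command line, e.g. ["./dlshogi"] (path hypothetical)
        self.proc = subprocess.Popen(
            cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

    def send(self, command):
        # USI commands are newline-terminated text lines
        self.proc.stdin.write(command + "\n")
        self.proc.stdin.flush()

    def read_until(self, token):
        # Collect engine output until a line starting with `token` (e.g. "usiok")
        lines = []
        for line in self.proc.stdout:
            lines.append(line.strip())
            if line.startswith(token):
                break
        return lines
```

A coordinating script would hold one `UsiEngine` per engine and route `position`/`go` commands to whichever engine is active.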
Fuka40B Model Architecture
Fuka40B is a model adopting the ResNet40x384 architecture with 107.2M parameters. It uses the ReLU activation function and is designed to accurately evaluate Shogi board positions.
- Training Data: Distilled dataset
- Optimization Algorithm: AdamW (learning rate 0.00005, weight decay 0.01)
- Batch Size: 4096
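To make the optimizer settings concrete, here is a single AdamW update step on a scalar parameter using the reported hyperparameters. This is a didactic sketch of AdamW's decoupled weight decay, not training code from the project; the beta and epsilon defaults are assumed, as the article does not state them.

```python
import math

# Reported Fuka40B training hyperparameters.
LR, WEIGHT_DECAY, BATCH_SIZE = 5e-5, 0.01, 4096
# Standard AdamW defaults (assumed, not stated in the article).
BETA1, BETA2, EPS = 0.9, 0.999, 1e-8

def adamw_step(param, grad, m, v, t):
    """One AdamW update on a scalar parameter (decoupled weight decay)."""
    m = BETA1 * m + (1 - BETA1) * grad          # first-moment estimate
    v = BETA2 * v + (1 - BETA2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - BETA1 ** t)                # bias correction
    v_hat = v / (1 - BETA2 ** t)
    # Weight decay is applied directly to the parameter, not folded into the gradient.
    param = param - LR * (m_hat / (math.sqrt(v_hat) + EPS) + WEIGHT_DECAY * param)
    return param, m, v
```

The decoupled decay term (`WEIGHT_DECAY * param`) is what distinguishes AdamW from plain Adam with L2 regularization.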
Effects of FP8 Quantization
TensorRT's FP8 quantization is a technique that improves inference speed while reducing VRAM load. Compared to INT4, FP8 has smaller quantization error, offering an excellent balance of accuracy and performance.
Actual Measurement Results (RTX 5090 + 32GB VRAM):
- FP16: VRAM 12.4GB, 35k NPS (Nodes Per Second), 82% Match Rate
- FP8: VRAM 7.8GB, 90k NPS, 81% Match Rate
- INT4: VRAM 6.2GB, 75k NPS, 73% Match Rate
FP8 demonstrates higher accuracy than INT4 while also delivering higher NPS. The reduced VRAM usage makes it easier to run alongside other processes.
TensorRT Optimization Considerations
TensorRT's --best option can sometimes degrade inference quality for specific model and quantization settings. With the 40B model, using --best reduced both NPS and match rate.
The correct setting is to explicitly specify FP8.
```shell
trtexec \
  --onnx=models/eval/model_fp8.onnx \
  --fp8 \
  --minShapes=input1:1x62x9x9,input2:1x57x9x9 \
  --optShapes=input1:256x62x9x9,input2:256x57x9x9 \
  --maxShapes=input1:256x62x9x9,input2:256x57x9x9 \
  --saveEngine=model_fp8_trt
```
Performance on RTX 5090
We achieved 90k NPS with optimized settings.
```python
# Excerpt from floodgate_client.py
UCT_THREADS = 4
DNN_BATCH_SIZE = 256
GPU_MEMORY_FRACTION = 0.8  # use 80% of VRAM
```
- At batch size 256: 90,200 NPS
- At batch size 512: 87,500 NPS (decreased due to memory pressure)
- VRAM Usage: 7.8GB (with FP8 quantization)
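These figures can be sanity-checked with a back-of-the-envelope relation: NPS is roughly the DNN batch size times the number of batch evaluations per second. This is a simple derivation from the numbers above, not project code.

```python
def nps(batch_size, batches_per_sec):
    """Nodes per second as DNN batch size times batch evaluations per second."""
    return batch_size * batches_per_sec

# Implied throughput at the measured operating points (derived from the figures above):
# 90,200 / 256 ≈ 352 batch evaluations/s at batch 256
# 87,500 / 512 ≈ 171 batch evaluations/s at batch 512
```

Doubling the batch size here less than halves the evaluation rate, which is consistent with the reported memory-pressure slowdown at batch 512.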
Floodgate Match Results
In actual Floodgate matches, the Time_Margin setting (the safety margin subtracted from the byoyomi time each move) significantly influenced wins and losses.
- Time_Margin 0ms setting: 3-2 victory (won with precise endgame evaluation)
- Time_Margin 1000ms setting: 0-5 defeat (lost all games due to misconfiguration of thinking time allocation)
It is crucial to always set Time_Margin to 0 for short byoyomi times.
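The effect of the margin on per-move thinking time can be sketched as follows. The helper function is hypothetical, illustrating the arithmetic rather than dlshogi's actual time-management code.

```python
def think_time_ms(byoyomi_ms, time_margin_ms):
    """Effective thinking time per move: byoyomi minus the safety margin
    (hypothetical helper, not dlshogi's actual time-management code)."""
    return max(byoyomi_ms - time_margin_ms, 0)

# With a 10-second byoyomi (illustrative values):
# Time_Margin 0    -> full 10,000 ms available each move
# Time_Margin 1000 -> only 9,000 ms available, i.e. 10% of thinking time lost per move
```

With short byoyomi, a fixed 1000 ms margin eats a large fraction of every move's budget, which matches the 0-5 result reported above.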
3-Phase Hybrid System
We constructed a 3-phase system that coordinates Fuka40B and Suisho11.
- Phase 1 (Opening): Fuka40B evaluates a wide range of positions.
- Phase 2 (Middlegame): Suisho11 specializes in tactical positions.
- Phase 3 (Endgame): Fuka40B resumes deep search.
When Fuka40B resumes in the endgame, it inherits the search tree from the previous phase, shortening the resumption time.
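The phase dispatch described above can be sketched as a simple move-number switch. The thresholds here are illustrative assumptions; the article does not specify where the phase boundaries lie.

```python
def select_engine(move_number, opening_end=30, middlegame_end=80):
    """Pick the engine for the current phase.
    Phase boundaries (30/80 moves) are illustrative, not from the article."""
    if move_number <= opening_end:
        return "Fuka40B"      # Phase 1: broad positional evaluation
    elif move_number <= middlegame_end:
        return "Suisho11"     # Phase 2: tactical middlegame search
    else:
        return "Fuka40B"      # Phase 3: deep endgame search (reuses earlier tree)
```

In practice the transition back to Fuka40B in Phase 3 is where the search-tree inheritance mentioned above pays off, since the tree built in Phase 1 need not be rebuilt from scratch.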
Summary
This report details the practical results of a Shogi AI leveraging the RTX 5090 and TensorRT FP8 quantization. FP8 quantization reduced VRAM load while maintaining accuracy, achieving high-speed inference of 90k NPS. In Floodgate matches, the Time_Margin setting was a critical factor in wins and losses, and the 3-phase hybrid system delivered a stable win rate.
This article was generated by Nemotron-Nano-9B-v2-Japanese and formatted/verified by Gemini 2.5 Flash.