As we develop askpaul.ai, a prediction market application requiring accurate AI-powered event outcome forecasting, we needed a reliable and high-performance model to power our prediction engine. After evaluating several options, we chose MiroMind's MiroThinker model for its exceptional predictive capabilities. This blog outlines our deployment process on a CentOS system with GPU acceleration.
Infrastructure Setup
Our deployment environment consists of:
- CentOS 8.3 operating system
- NVIDIA H20 GPU for accelerated computing
Prerequisites Installation
1. Python 3.12 Installation
We started by installing Python 3.12, which provides the necessary runtime environment for our application:
# Installation commands for Python 3.12 on CentOS 8.3
sudo dnf install -y gcc make openssl-devel bzip2-devel libffi-devel zlib-devel
wget https://www.python.org/ftp/python/3.12.0/Python-3.12.0.tgz
tar xzf Python-3.12.0.tgz
cd Python-3.12.0
./configure --enable-optimizations
make -j "$(nproc)"
sudo make altinstall
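A quick check that the build landed correctly; altinstall deliberately installs versioned binaries (python3.12, pip3.12), so the system Python is left untouched:
# Verify the new interpreter and its bundled pip
python3.12 --version
python3.12 -m pip --version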
2. NVCC Installation
To leverage GPU acceleration, we installed the NVIDIA CUDA Compiler (nvcc):
# Install CUDA toolkit containing nvcc
sudo dnf config-manager --add-repo=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo dnf install -y cuda-toolkit
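The toolkit installs under /usr/local/cuda by default, so nvcc may not be on the PATH until that bin directory is added. A short sanity check, assuming the NVIDIA driver is already installed (the H20 would not be usable otherwise):
# Make nvcc visible, then confirm the toolkit and driver are healthy
export PATH=/usr/local/cuda/bin:$PATH
nvcc --version
nvidia-smi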
Note about CentOS 8.3 package management: CentOS 8 and later use dnf as the default package manager, an enhanced successor to yum. While yum commands still work as aliases to dnf, we recommend using dnf directly for better performance and dependency resolution.
Required Dependencies
We installed the following Python packages to ensure proper functionality:
pip3.12 install sglang pybase64 pydantic orjson uvicorn uvloop fastapi torch \
    psutil zmq packaging Pillow openai partial_json_parser huggingface_hub \
    transformers sentencepiece sgl_kernel dill compressed_tensors einops \
    msgspec python-multipart pynvml torchao xgrammar openai_harmony
These packages provide essential functionality including:
- Web serving capabilities (uvicorn, fastapi)
- GPU-accelerated tensor operations (torch, torchao)
- Model management and inference (sglang, transformers, huggingface_hub)
- Data processing and serialization (orjson, msgspec, pybase64)
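Before moving on, a one-liner like the following (our own sanity check, not part of the sglang docs) confirms the heavyweight imports resolve and that torch can see the GPU:
# Confirm sglang and torch import cleanly and CUDA is visible
python3.12 -c "import sglang, torch; print(torch.__version__, torch.cuda.is_available())"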
Deploying MiroThinker
With all prerequisites in place, we deployed the MiroThinker-32B-DPO-v0.2 model using sglang's server:
nohup python3.12 -m sglang.launch_server \
    --model-path miromind-ai/MiroThinker-32B-DPO-v0.2 \
    --tp 1 \
    --dp 1 \
    --host 0.0.0.0 \
    --port 6666 \
    --trust-remote-code \
    --chat-template qwen3_nonthinking.jinja > miromind.log 2>&1 &
This command starts the server in the background with nohup, so it keeps running after logout, and redirects both stdout and stderr to miromind.log for later inspection. The model is deployed with:
- Tensor parallelism (tp) set to 1
- Data parallelism (dp) set to 1
These settings are appropriate for our single GPU setup.
For the nonthinking mode required by our prediction use case, we used the specialized template available at:
https://github.com/MiroMindAI/MiroThinker/blob/main/assets/qwen3_nonthinking.jinja
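Because --chat-template points at a local file, the template needs to be in the launch directory before the server starts. It can be fetched with the usual raw.githubusercontent.com form of the link above:
# Download the chat template into the directory the server is launched from
wget https://raw.githubusercontent.com/MiroMindAI/MiroThinker/main/assets/qwen3_nonthinking.jinja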
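Once miromind.log shows the server is ready, sglang's OpenAI-compatible API makes a smoke test straightforward. This is a minimal sketch: the prompt is purely illustrative, and the model field simply mirrors the served model path:
# Liveness probe, then a minimal chat completion against the local server
curl http://localhost:6666/health
curl http://localhost:6666/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "miromind-ai/MiroThinker-32B-DPO-v0.2",
        "messages": [{"role": "user", "content": "Will it rain in Singapore tomorrow? Answer yes or no with a probability."}],
        "max_tokens": 128
    }'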
Conclusion
Deploying MiroThinker on our CentOS 8.3 system with an H20 GPU has significantly enhanced the prediction capabilities of askpaul.ai. The model's performance meets our expectations for accuracy and response time, making it an excellent fit for our prediction market application.
The sglang framework provided a straightforward deployment path, and the MiroThinker model has proven to be reliable and efficient in our production environment. We're excited to continue leveraging this powerful combination as we expand the capabilities of askpaul.ai.