
soy

Posted on • Originally published at media.patentllm.org

Operational Techniques for Automatically Starting vLLM, Flask, and cron with systemd Services in WSL2

WSL2 systemd Support

To enable systemd in WSL2, configure /etc/wsl.conf.

# Add to /etc/wsl.conf
[boot]
systemd=true

To apply the changes, restart WSL2.

wsl --shutdown

After the restart, you can confirm that systemd is active by listing units with systemctl list-units --all. To have user services start automatically when WSL2 launches, enable lingering with loginctl enable-linger; without it, user services run only while a login session is open.
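As a sketch, assuming a standard WSL2 distribution, the systemd check and the linger setup look like this:

```shell
# systemd should be PID 1 after the wsl --shutdown restart
ps -p 1 -o comm=

# Let user services run without an open login session
loginctl enable-linger "$USER"

# Confirm lingering is on (should report Linger=yes)
loginctl show-user "$USER" --property=Linger
```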

vLLM systemd Unit Files

Startup Script

#!/bin/bash
set -e
export CUDA_VISIBLE_DEVICES=0

python3 -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --tensor-parallel-size 1

systemd Unit File (~/.config/systemd/user/vllm.service)

[Unit]
Description=vLLM Inference Server
After=network.target

[Service]
Type=simple
WorkingDirectory=%h/vllm
ExecStart=%h/vllm_server.sh
Restart=always
RestartSec=5s

[Install]
WantedBy=default.target

Key points of the configuration are as follows:

  • Specify the GPU to use with CUDA_VISIBLE_DEVICES=0.
  • --trust-remote-code allows the execution of custom code for Hugging Face models. Use this only for trusted models.
  • Restart=always enables automatic recovery.

Activation Commands

systemctl --user daemon-reload
systemctl --user enable vllm.service
systemctl --user start vllm.service
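After activation, the service and the endpoint can be verified; the /v1/models route below is the standard one exposed by vLLM's OpenAI-compatible server:

```shell
# Current state and recent log lines
systemctl --user status vllm.service --no-pager

# List the models the server has loaded
curl -s http://localhost:8000/v1/models | python3 -m json.tool
```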

Running the Flask API as a Service

Next, we run a Flask application that wraps the vLLM API as a systemd user service.

from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

VLLM_URL = 'http://localhost:8000/v1/completions'
MODEL = 'nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese'

@app.route('/generate', methods=['POST'])
def generate():
    data = request.get_json(silent=True)
    if not data or 'prompt' not in data:
        return jsonify({'error': 'prompt is required'}), 400
    # Forward to the vLLM OpenAI-compatible completions endpoint
    response = requests.post(
        VLLM_URL,
        json={'model': MODEL, 'prompt': data['prompt'], 'max_tokens': 200},
        timeout=120,
    )
    return response.json(), response.status_code

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8510)
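Once both services are running, the wrapper can be exercised with a request like the following (port 8510 as configured above):

```shell
curl -s -X POST http://localhost:8510/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Hello"}'
```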

systemd Unit File

[Unit]
Description=Flask API for vLLM
Wants=vllm.service
After=vllm.service

[Service]
Type=simple
WorkingDirectory=%h/flask_api
ExecStart=%h/.venv/bin/python app.py
Restart=always
RestartSec=3s

[Install]
WantedBy=default.target

Note that After=vllm.service only controls ordering: when both units start in the same transaction, the Flask API is launched after vLLM's process starts. It does not pull vLLM in by itself (add Wants=vllm.service for that), and because vLLM runs as Type=simple, systemd does not wait for the model to finish loading before starting the Flask API.
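If the Flask API must not come up before the model is actually loaded, one sketch (assuming the unit is saved as flask-api.service, and using vLLM's standard /v1/models route) is to gate startup with an ExecStartPre wait loop:

```ini
# ~/.config/systemd/user/flask-api.service (excerpt)
[Service]
# Poll the vLLM endpoint for up to ~120 s before starting Flask
ExecStartPre=/bin/sh -c 'for i in $(seq 1 60); do curl -sf http://localhost:8000/v1/models >/dev/null && exit 0; sleep 2; done; exit 1'
```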

Replacing cron

For scheduled tasks, we use systemd timers instead of cron.

Timer Configuration (~/.config/systemd/user/daily-report.timer)

[Unit]
Description=Daily report timer

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target

Enable and start the timer:

systemctl --user enable --now daily-report.timer
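By default, a timer activates the service unit with the same name, which is not shown above; a minimal companion sketch (the report script path is a placeholder) might be:

```ini
# ~/.config/systemd/user/daily-report.service
[Unit]
Description=Generate daily report

[Service]
Type=oneshot
ExecStart=%h/scripts/daily_report.sh
```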

The advantages of systemd timers compared to cron are as follows:

  • Centralized log management with journalctl.
  • Explicit specification of dependencies.
  • Catch-up execution of runs missed while the machine was off, with Persistent=true.

Startup Order and Dependencies

  • vLLM: The foundational inference engine.
  • Flask API: Depends on vLLM (After=vllm.service).
  • Daily Report Generation: References logs from both.

Dependency trees can be inspected with:

systemctl --user list-dependencies vllm.service

Log Confirmation (journalctl)

# Follow a service's logs in real time
journalctl --user -u vllm.service -f

# Show the last 100 log lines
journalctl --user -u vllm.service -n 100

# Extract only error logs
journalctl --user -u vllm.service --since "24 hours ago" | grep -i "error\|fail"

Summary

This post explained how to build an operational environment in WSL2 that leverages systemd to seamlessly integrate vLLM, Flask, and scheduled tasks.

  • Running vLLM as a systemd service makes full use of CUDA 12.8 and the RTX 5090.
  • Ordering services with After= (together with Wants=) prevents startup race failures.
  • Real-time monitoring with journalctl supports stable operations.

This configuration can be utilized as infrastructure to support the practical application of Japanese inference models based on Nemotron.

This article was generated by Nemotron-Nano-9B-v2-Japanese, and Gemini 2.5 Flash performed formatting and verification.
