
soy

Posted on • Originally published at media.patentllm.org

Operational Techniques for Automatically Starting vLLM, Flask, and cron with systemd Services in WSL2

WSL2 systemd Support

To enable systemd in WSL2, configure /etc/wsl.conf.

# Add to /etc/wsl.conf
[boot]
systemd=true

To apply the changes, restart WSL2.

wsl --shutdown

After the restart, you can confirm that systemd is active by listing units with systemctl list-units --all. To have user services start automatically when WSL2 launches, enable lingering with loginctl enable-linger; without it, user services run only while a login session is open.
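As a sketch, assuming a standard WSL2 distribution, the systemd check and the linger setup look like this:

```shell
# systemd should be PID 1 after the wsl --shutdown restart
ps -p 1 -o comm=

# Let user services run without an open login session
loginctl enable-linger "$USER"

# Confirm lingering is on (should report Linger=yes)
loginctl show-user "$USER" --property=Linger
```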

vLLM systemd Unit Files

Startup Script

#!/bin/bash
set -e
export CUDA_VISIBLE_DEVICES=0

python3 -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --tensor-parallel-size 1

systemd Unit File (~/.config/systemd/user/vllm.service)

[Unit]
Description=vLLM Inference Server
After=network.target

[Service]
Type=simple
WorkingDirectory=%h/vllm
ExecStart=%h/vllm_server.sh
Restart=always
RestartSec=5s

[Install]
WantedBy=default.target

Key points of the configuration are as follows:

  • Specify the GPU to use with CUDA_VISIBLE_DEVICES=0.
  • --trust-remote-code allows the execution of custom code for Hugging Face models. Use this only for trusted models.
  • Restart=always enables automatic recovery.

Activation Commands

systemctl --user daemon-reload
systemctl --user enable vllm.service
systemctl --user start vllm.service
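After activation, the service and the endpoint can be verified; the /v1/models route below is the standard one exposed by vLLM's OpenAI-compatible server:

```shell
# Current state and recent log lines
systemctl --user status vllm.service --no-pager

# List the models the server has loaded
curl -s http://localhost:8000/v1/models | python3 -m json.tool
```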

Running the Flask API as a Service

Next, we run a Flask application that wraps the vLLM API as a systemd user service.

from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

VLLM_URL = 'http://localhost:8000/v1/completions'
MODEL = 'nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese'

@app.route('/generate', methods=['POST'])
def generate():
    data = request.get_json(silent=True)
    if not data or 'prompt' not in data:
        return jsonify({'error': 'prompt is required'}), 400
    # Forward to the vLLM OpenAI-compatible completions endpoint
    response = requests.post(
        VLLM_URL,
        json={'model': MODEL, 'prompt': data['prompt'], 'max_tokens': 200},
        timeout=120,
    )
    return response.json(), response.status_code

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8510)
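Once both services are running, the wrapper can be exercised with a request like the following (port 8510 as configured above):

```shell
curl -s -X POST http://localhost:8510/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Hello"}'
```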

systemd Unit File

[Unit]
Description=Flask API for vLLM
Wants=vllm.service
After=vllm.service

[Service]
Type=simple
WorkingDirectory=%h/flask_api
ExecStart=%h/.venv/bin/python app.py
Restart=always
RestartSec=3s

[Install]
WantedBy=default.target

Note that After=vllm.service only controls ordering: when both units start in the same transaction, the Flask API is launched after vLLM's process starts. It does not pull vLLM in by itself (add Wants=vllm.service for that), and because vLLM runs as Type=simple, systemd does not wait for the model to finish loading before starting the Flask API.
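If the Flask API must not come up before the model is actually loaded, one sketch (assuming the unit is saved as flask-api.service, and using vLLM's standard /v1/models route) is to gate startup with an ExecStartPre wait loop:

```ini
# ~/.config/systemd/user/flask-api.service (excerpt)
[Service]
# Poll the vLLM endpoint for up to ~120 s before starting Flask
ExecStartPre=/bin/sh -c 'for i in $(seq 1 60); do curl -sf http://localhost:8000/v1/models >/dev/null && exit 0; sleep 2; done; exit 1'
```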

Replacing cron

For scheduled tasks, we use systemd timers instead of cron.

Timer Configuration (~/.config/systemd/user/daily-report.timer)

[Unit]
Description=Daily report timer

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target

Enable and start the timer:

systemctl --user enable --now daily-report.timer
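By default, a timer activates the service unit with the same name, which is not shown above; a minimal companion sketch (the report script path is a placeholder) might be:

```ini
# ~/.config/systemd/user/daily-report.service
[Unit]
Description=Generate daily report

[Service]
Type=oneshot
ExecStart=%h/scripts/daily_report.sh
```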

The advantages of systemd timers compared to cron are as follows:

  • Centralized log management with journalctl.
  • Explicit specification of dependencies.
  • Catch-up execution of runs missed while the machine was off, with Persistent=true.

Startup Order and Dependencies

  • vLLM: The foundational inference engine.
  • Flask API: Depends on vLLM (After=vllm.service).
  • Daily Report Generation: References logs from both.

Dependency trees can be inspected with:

systemctl --user list-dependencies vllm.service

Log Confirmation (journalctl)

# Follow a service's logs in real time
journalctl --user -u vllm.service -f

# Show the last 100 log lines
journalctl --user -u vllm.service -n 100

# Extract only error logs
journalctl --user -u vllm.service --since "24 hours ago" | grep -i "error\|fail"

Summary

This post explained how to build an operational environment in WSL2 that leverages systemd to seamlessly integrate vLLM, Flask, and scheduled tasks.

  • Running vLLM as a systemd service makes full use of CUDA 12.8 and the RTX 5090.
  • Ordering services with After= (together with Wants=) prevents startup race failures.
  • Real-time monitoring with journalctl supports stable operations.

This configuration can be utilized as infrastructure to support the practical application of Japanese inference models based on Nemotron.

This article was generated by Nemotron-Nano-9B-v2-Japanese, and Gemini 2.5 Flash performed formatting and verification.
