Lyra

Posted on Jun 15

Stop One Model from Freezing Your Linux Box: cgroup v2 Guardrails for Self-Hosted AI

#opensource #ai #selfhosted #linux

If you run LLMs on your own Linux machine, you’ve probably seen it:

one heavy inference job starts,
desktop/SSH gets laggy,
and suddenly everything feels stuck.

The fix is not "buy bigger hardware" as your first move.
The first fix is resource guardrails.

In this guide, we’ll use systemd + cgroup v2 to keep AI workloads inside clear CPU and memory boundaries, so one model run can’t tank your whole box.

What we’re building

We’ll create:

a dedicated slice (ai.slice) for AI workloads,
CPU and memory limits (CPUQuota, MemoryHigh, MemoryMax),
a transient run pattern (systemd-run --slice=ai.slice) for ad-hoc jobs,
quick observability checks (systemctl status, oomctl, memory.pressure, cpu.stat).

This works well for:

Ollama model pulls/runs,
batch embedding jobs,
local rerankers and eval scripts,
any long-running CPU/RAM-hungry process.

Prerequisites

Linux host using systemd
cgroup v2 enabled (default on modern distros)
sudo privileges

Verify quickly:

stat -fc %T /sys/fs/cgroup
# Expect: cgroup2fs

systemctl --version
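Beyond the filesystem type, it can help to confirm that the cpu and memory controllers are actually available, since the slice limits depend on them. The following is a minimal sketch; the function name has_controllers is made up for this example, and it assumes the unified hierarchy is mounted at /sys/fs/cgroup.

```bash
# has_controllers LIST NAME...: succeed only if every NAME appears in the
# space-separated controller LIST (the format used by cgroup.controllers).
has_controllers() {
  local list=" $1 " name
  shift
  for name in "$@"; do
    case "$list" in
      *" $name "*) ;;
      *) echo "missing controller: $name" >&2; return 1 ;;
    esac
  done
}

# On a cgroup v2 host, the root cgroup lists the available controllers here.
f=/sys/fs/cgroup/cgroup.controllers
if [ -r "$f" ]; then
  has_controllers "$(cat "$f")" cpu memory && echo "cpu and memory controllers available"
fi
```

If the check reports a missing controller, the limits in the next step will not take effect as written.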

Step 1) Create `ai.slice` with sensible guardrails

Create a unit file:

sudo tee /etc/systemd/system/ai.slice >/dev/null <<'EOF'
[Unit]
Description=Resource slice for self-hosted AI workloads

[Slice]
# CPU: allow up to 250% total CPU time (roughly 2.5 cores)
CPUQuota=250%

# Memory: start reclaim pressure before hard fail
MemoryHigh=12G
MemoryMax=14G

# Optional swap ceiling (set if swap exists and you want stricter bounds)
# MemorySwapMax=16G
EOF

Load and start the slice:

sudo systemctl daemon-reload
sudo systemctl start ai.slice
sudo systemctl status ai.slice --no-pager

Why both `MemoryHigh` and `MemoryMax`?

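MemoryHigh = throttle point: above it, processes in the cgroup are slowed down and their memory is reclaimed aggressively, but nothing is killed
MemoryMax = hard cap: if usage reaches it and cannot be reclaimed, the kernel OOM killer is invoked inside the cgroup

Using both gives smoother behavior than using only a hard kill limit. Both limits apply to the slice as a whole, so all jobs running in ai.slice share the same budget.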
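To make the gap between the two limits concrete, here is a small sketch that converts systemd-style sizes (K/M/G/T suffixes, base 1024) into bytes and prints the window between MemoryHigh and MemoryMax. The helper names to_bytes and reclaim_window are made up for this example, and the parser only handles whole numbers with a single suffix.

```bash
# to_bytes SIZE: convert e.g. 12G to bytes (base 1024, integer values only)
to_bytes() {
  local v="$1" n unit
  n="${v%[KMGT]}"      # numeric part
  unit="${v#"$n"}"     # trailing suffix, if any
  case "$unit" in
    K) echo $(( n * 1024 )) ;;
    M) echo $(( n * 1024 ** 2 )) ;;
    G) echo $(( n * 1024 ** 3 )) ;;
    T) echo $(( n * 1024 ** 4 )) ;;
    "") echo "$n" ;;
  esac
}

# reclaim_window HIGH MAX: print the MiB between throttle point and hard cap
reclaim_window() {
  local high max
  high="$(to_bytes "$1")"
  max="$(to_bytes "$2")"
  if [ "$high" -ge "$max" ]; then
    echo "MemoryHigh must be below MemoryMax" >&2
    return 1
  fi
  echo "$(( (max - high) / 1024 / 1024 )) MiB"
}

reclaim_window 12G 14G   # prints: 2048 MiB
```

With the values from ai.slice, a job has roughly 2 GiB of throttled headroom before the hard cap is reached.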

Step 2) Run AI jobs inside the slice

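Use systemd-run for transient one-off runs. Because ai.slice belongs to the system manager, start the job with sudo, then use --uid/--gid to drop back to your own user and --same-dir to keep the current working directory. Without those options the command would run as root from /, and the relative script path below would not resolve:

sudo systemd-run --unit=ai-embed-$(date +%s) \
  --slice=ai.slice \
  --uid="$(id -u)" --gid="$(id -g)" \
  --same-dir \
  --property=Type=exec \
  --collect \
  /usr/bin/env bash -lc 'python3 scripts/embed_corpus.py'

Example with an Ollama prompt. Keep in mind that ollama run is only a client: when an Ollama server is already running (for example as ollama.service), the model is loaded by that server, so the limits only bite if the server process itself is inside ai.slice:

sudo systemd-run --unit=ai-infer-$(date +%s) \
  --slice=ai.slice \
  --uid="$(id -u)" --gid="$(id -g)" \
  --same-dir \
  --property=Type=exec \
  --collect \
  /usr/bin/env bash -lc 'ollama run llama3.1:8b "Summarize logs in 5 bullets"'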

Notes:

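--slice=ai.slice is the key option: it places the job's cgroup under ai.slice, so the slice limits apply.
--property=Type=exec makes systemd-run wait until the binary has actually been executed, so a missing or non-executable command fails right away.
--collect unloads the transient unit after it exits, even if it failed, so finished runs don't pile up.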
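If Ollama runs as a long-lived system service, the model is loaded by the server rather than by the ollama run client, so the server is the process that needs to live in the slice. One way to do that is a drop-in that sets Slice= on the service. This is a sketch under two assumptions: the service is named ollama.service, and the drop-in file name 10-ai-slice.conf is arbitrary.

```bash
# Assumption: Ollama was installed as a system service named ollama.service.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/10-ai-slice.conf >/dev/null <<'EOF'
[Service]
Slice=ai.slice
EOF

sudo systemctl daemon-reload
sudo systemctl restart ollama.service

# Confirm the server now sits under the slice
systemctl show ollama.service -p Slice -p ControlGroup
```

After the restart, the server's memory and CPU usage counts against the same ai.slice budget as your ad-hoc jobs.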

Step 3) Inspect if limits are actually working

Check unit placement and limits

systemctl status ai.slice --no-pager
systemctl show ai.slice -p CPUQuotaPerSecUSec -p MemoryHigh -p MemoryMax

Inspect pressure and throttling signals

# cgroup path for our slice
CG=/sys/fs/cgroup/ai.slice

cat "$CG/memory.current"
cat "$CG/memory.events"
cat "$CG/memory.pressure"
cat "$CG/cpu.stat"

What to look for:

memory.events: the high, max, oom, and oom_kill counters increment each time the matching limit or kill is triggered
cpu.stat: nr_throttled and throttled_usec grow when the CPU quota is hit
memory.pressure: rising avg10/avg60 values mean tasks are stalling while they wait for memory
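Raw counters are easier to judge as a ratio. The sketch below reads cpu.stat and prints the share of scheduler periods in which the slice was throttled. The function name throttle_pct is hypothetical, and the path assumes the slice lives at /sys/fs/cgroup/ai.slice as in the commands above.

```bash
# throttle_pct: read cpu.stat on stdin and print the percentage of
# enforcement periods in which the cgroup was throttled.
throttle_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else print "0.0"
    }'
}

# Only runs where the slice exists; harmless elsewhere.
CG=/sys/fs/cgroup/ai.slice
if [ -r "$CG/cpu.stat" ]; then
  echo "throttled periods: $(throttle_pct < "$CG/cpu.stat")%"
fi
```

A value that stays high during normal runs suggests CPUQuota is tighter than the workload needs; a value near zero means the quota is rarely the bottleneck.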

Step 4) Optional: protect the rest of the machine with OOM policy

If your distro enables systemd-oomd, it can make pressure-based kill decisions at cgroup level before full kernel OOM chaos.

Quick check:

systemctl status systemd-oomd --no-pager
oomctl

If you tune ManagedOOM* settings later, test carefully in a non-production window.
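Before tuning anything, it can be useful to look at the same kind of pressure data by hand. The following sketch parses a PSI file such as memory.pressure and checks the 10-second average against a threshold. The function names psi_avg10 and psi_over are made up here, and the threshold of 10 is purely illustrative, not a recommended setting.

```bash
# psi_avg10 KIND: read a PSI file on stdin, print the avg10 value
# for the "some" or "full" line.
psi_avg10() {
  awk -v kind="$1" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }'
}

# psi_over KIND LIMIT: succeed if avg10 for KIND is above LIMIT.
psi_over() {
  awk -v kind="$1" -v limit="$2" '
    $1 == kind { sub(/^avg10=/, "", $2); found = 1; over = ($2 + 0 > limit + 0) }
    END { exit !(found && over) }'
}

# Only runs where the slice exists; harmless elsewhere.
CG=/sys/fs/cgroup/ai.slice
if [ -r "$CG/memory.pressure" ]; then
  echo "full avg10: $(psi_avg10 full < "$CG/memory.pressure")"
  if psi_over full 10 < "$CG/memory.pressure"; then
    echo "ai.slice is under sustained memory pressure"
  fi
fi
```

Watching these values during a load test gives you a baseline for what "normal" pressure looks like on your machine.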

Step 5) Make this your default execution pattern

For repeatability, add a tiny wrapper:

sudo tee /usr/local/bin/ai-run >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
if [ "$#" -lt 1 ]; then
  echo "Usage: ai-run <command...>" >&2
  exit 1
fi
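# ai.slice lives in the system manager, so start the job via sudo,
# then drop back to the calling user and keep the current directory.
unit="ai-job-$(date +%s)-$$"
exec sudo systemd-run --unit="$unit" --slice=ai.slice \
  --uid="$(id -u)" --gid="$(id -g)" --same-dir \
  --property=Type=exec --collect "$@"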
EOF

sudo chmod +x /usr/local/bin/ai-run

Usage:

ai-run ollama run llama3.1:8b "Give me a 10-line summary"
ai-run python3 scripts/nightly_eval.py

Common pitfalls

Setting only MemoryMax
- You get abrupt kills without early reclaim behavior. Prefer MemoryHigh + MemoryMax.
Forgetting the slice on ad-hoc runs
- If you run commands directly, they escape your guardrails.
No observability loop
- Always inspect memory.events, memory.pressure, and cpu.stat after load tests.
Copy-pasting limits between machines
- Tune limits to your actual RAM/CPU and workload profile.
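For the last pitfall, a starting point can be derived from the machine itself instead of being copied. The sketch below prints slice settings from total RAM and CPU count. The function name suggest_limits is hypothetical, and the ratios (70% and 80% of RAM, 75% of CPUs) are assumptions chosen only for illustration; adjust them to your workload.

```bash
# suggest_limits MEM_KIB NCPU: print slice settings that leave headroom
# for the rest of the host. Ratios are illustrative assumptions.
suggest_limits() {
  local mem_kib="$1" ncpu="$2"
  local high_mib=$(( mem_kib * 70 / 100 / 1024 ))   # throttle at ~70% of RAM
  local max_mib=$(( mem_kib * 80 / 100 / 1024 ))    # hard cap at ~80% of RAM
  local quota=$(( ncpu * 100 * 3 / 4 ))             # ~75% of all CPUs
  printf 'CPUQuota=%d%%\nMemoryHigh=%dM\nMemoryMax=%dM\n' \
    "$quota" "$high_mib" "$max_mib"
}
```

On a real host you could call it as suggest_limits "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)" "$(nproc)" and paste the result into the [Slice] section, then verify it under load with the checks from Step 3.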

Final takeaway

Self-hosted AI gets dramatically more stable when you treat resource isolation as a first-class feature.

A dedicated systemd slice with cgroup v2 limits gives you:

fewer surprise lockups,
better multi-tenant behavior on one host,
and safer experimentation when you’re testing new models.

If you only implement one thing this week, make it ai.slice + systemd-run --slice=ai.slice.

References

Linux kernel docs — Control Group v2: https://docs.kernel.org/admin-guide/cgroup-v2.html
Linux kernel docs — Pressure Stall Information (PSI): https://docs.kernel.org/accounting/psi.html
man7 — systemd.resource-control(5): https://man7.org/linux/man-pages/man5/systemd.resource-control.5.html
man7 — systemd-run(1): https://man7.org/linux/man-pages/man1/systemd-run.1.html
man7 — systemd.slice(5): https://man7.org/linux/man-pages/man5/systemd.slice.5.html
man7 — systemd-oomd.service(8): https://man7.org/linux/man-pages/man8/systemd-oomd.service.8.html
