If you run LLMs on your own Linux machine, you’ve probably seen it:
- one heavy inference job starts,
- desktop/SSH gets laggy,
- and suddenly everything feels stuck.
The fix is not "buy bigger hardware" as your first move.
The first fix is resource guardrails.
In this guide, we’ll use systemd + cgroup v2 to keep AI workloads inside clear CPU and memory boundaries, so one model run can’t tank your whole box.
What we’re building
We’ll create:
- a dedicated slice (ai.slice) for AI workloads,
- CPU and memory limits (CPUQuota, MemoryHigh, MemoryMax),
- a transient run pattern (systemd-run --slice=ai.slice) for ad-hoc jobs,
- quick observability checks (systemctl status, oomctl, memory.pressure, cpu.stat).
This works well for:
- Ollama model pulls/runs,
- batch embedding jobs,
- local rerankers and eval scripts,
- any long-running CPU/RAM-hungry process.
Prerequisites
- Linux host using systemd
- cgroup v2 enabled (default on modern distros)
- sudo privileges
Verify quickly:
stat -fc %T /sys/fs/cgroup
# Expect: cgroup2fs
systemctl --version
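It is also worth confirming that the cpu and memory controllers are actually available, since the limits in this guide depend on both. A minimal sketch: missing_controllers is a hypothetical helper (not a systemd tool) that reads a cgroup.controllers file and prints any controller from your list that is absent.

```shell
# Print each requested controller that is NOT listed in the given
# cgroup.controllers file (a single space-separated line).
missing_controllers() {
  file="$1"; shift
  # Pad with spaces so "cpu" cannot match inside "cpuset".
  available=" $(cat "$file") "
  for want in "$@"; do
    case "$available" in
      *" $want "*) ;;                 # controller is present
      *) printf '%s\n' "$want" ;;     # controller is missing
    esac
  done
}

# Usage on a live host (no output means both are available):
#   missing_controllers /sys/fs/cgroup/cgroup.controllers cpu memory
```

If this prints anything, fix that before moving on; the slice limits will not take effect for a controller the kernel does not expose.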
Step 1) Create ai.slice with sensible guardrails
Create a unit file:
sudo tee /etc/systemd/system/ai.slice >/dev/null <<'EOF'
[Unit]
Description=Resource slice for self-hosted AI workloads
[Slice]
# CPU: allow up to 250% total CPU time (roughly 2.5 cores)
CPUQuota=250%
# Memory: start reclaim pressure before hard fail
MemoryHigh=12G
MemoryMax=14G
# Optional swap ceiling (set if swap exists and you want stricter bounds)
# MemorySwapMax=16G
EOF
Load and start the slice:
sudo systemctl daemon-reload
sudo systemctl start ai.slice
sudo systemctl status ai.slice --no-pager
Why both MemoryHigh and MemoryMax?
- MemoryHigh = throttle limit: above it the kernel slows the cgroup's processes and reclaims memory aggressively (early warning boundary)
- MemoryMax = hard cap (the OOM killer is invoked if usage cannot be reclaimed below the limit)
Using both gives smoother behavior than using only a hard kill limit.
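The 12G/14G values only make sense on a machine with enough RAM to spare, so it can help to derive a starting point from the host itself. A rough sketch: suggest_limits is a hypothetical helper, and its ratios (7/8 of RAM for MemoryMax, 6/7 of that for MemoryHigh, all cores but one for CPUQuota) are assumptions picked so that exactly 16 GiB yields 12G/14G. They are not recommendations.

```shell
# Print candidate slice settings from total RAM (KiB) and CPU count.
# All ratios below are illustrative assumptions; tune them per workload.
suggest_limits() {
  total_kib="$1"
  ncpu="$2"
  total_mib=$(( total_kib / 1024 ))
  max_mib=$(( total_mib * 7 / 8 ))    # hard cap: leave 1/8 of RAM for the rest of the box
  high_mib=$(( max_mib * 6 / 7 ))     # reclaim pressure starts below the hard cap
  quota=$(( (ncpu - 1) * 100 ))       # keep one core free for desktop/SSH
  if [ "$quota" -lt 100 ]; then
    quota=100                         # never go below one full core
  fi
  printf 'CPUQuota=%s%%\nMemoryHigh=%sM\nMemoryMax=%sM\n' "$quota" "$high_mib" "$max_mib"
}

# Usage on a live host:
#   suggest_limits "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)" "$(nproc)"
```

Treat the output as a first guess to paste into ai.slice, then adjust after watching a real model run.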
Step 2) Run AI jobs inside the slice
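Use systemd-run for transient one-off runs. Without --user it talks to the system service manager, so expect a polkit password prompt (or prefix the command with sudo):
# --uid and --same-dir keep the job running as you, in the current directory;
# otherwise it starts as root in / and the relative script path does not resolve.
systemd-run --unit=ai-embed-$(date +%s) \
  --slice=ai.slice \
  --uid="$(id -u)" \
  --same-dir \
  --property=Type=exec \
  --collect \
  /usr/bin/env bash -lc 'python3 scripts/embed_corpus.py'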
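Example with an Ollama inference command. Note that ollama run is only a client: if an Ollama server is already running as its own service, the model executes in that server's cgroup, not in this transient unit: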
systemd-run --unit=ai-infer-$(date +%s) \
--slice=ai.slice \
--property=Type=exec \
--collect \
/usr/bin/env bash -lc 'ollama run llama3.1:8b "Summarize logs in 5 bullets"'
Notes:
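- --slice=ai.slice is the key line.
- --property=Type=exec makes systemd-run report a failure if the binary cannot be executed, instead of returning as soon as the process is forked.
- --collect garbage-collects the transient unit after exit, even if it failed.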
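Long-running daemons need a different approach: a service that is already running stays in its own slice no matter where its client was launched. For an Ollama server, the usual fix is a drop-in that sets Slice= on the service itself. A minimal sketch, assuming the server unit is named ollama.service (check with systemctl list-units); slice_dropin and the 10-slice.conf file name are hypothetical.

```shell
# Print a drop-in that places a service in the given slice.
slice_dropin() {
  slice="${1:-ai.slice}"
  printf '[Service]\nSlice=%s\n' "$slice"
}

# Usage (assumes the unit is ollama.service; adjust to your install):
#   sudo mkdir -p /etc/systemd/system/ollama.service.d
#   slice_dropin ai.slice | sudo tee /etc/systemd/system/ollama.service.d/10-slice.conf >/dev/null
#   sudo systemctl daemon-reload
#   sudo systemctl restart ollama.service
#   systemctl show ollama.service -p Slice   # should now report ai.slice
```

After the restart, model loads and inference inside the server count against the ai.slice limits alongside your transient jobs.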
Step 3) Inspect if limits are actually working
Check unit placement and limits
systemctl status ai.slice --no-pager
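# CPUQuotaPerSecUSec is the quota expressed as CPU time per second (250% = 2.5s);
# MemoryHigh and MemoryMax are reported in bytes.
systemctl show ai.slice -p CPUQuotaPerSecUSec -p MemoryHigh -p MemoryMax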
Inspect pressure and throttling signals
# cgroup path for our slice
CG=/sys/fs/cgroup/ai.slice
cat "$CG/memory.current"
cat "$CG/memory.events"
cat "$CG/memory.pressure"
cat "$CG/cpu.stat"
What to look for:
- memory.events counters (high, max, oom, oom_kill) increment during stress
- cpu.stat shows nr_throttled and throttled_usec growing when the CPU quota is hit
- rising values in memory.pressure mean tasks are stalling while they wait for memory
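Reading four files by eye gets old quickly, so here is one way to boil them down to a few numbers. A sketch, assuming the standard cgroup v2 file formats; cg_summary and its output keys are hypothetical names made up for this example.

```shell
# Summarize throttling and memory signals for one cgroup directory.
cg_summary() {
  cg="${1:-/sys/fs/cgroup/ai.slice}"

  # Share of CPU enforcement periods in which the cgroup was throttled.
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      pct = (periods > 0) ? 100 * throttled / periods : 0
      printf "cpu_throttled_pct=%.1f\n", pct
    }' "$cg/cpu.stat"

  # How often the MemoryHigh / MemoryMax boundaries were hit, and OOM kills.
  awk '$1 == "high" || $1 == "max" || $1 == "oom_kill" {
         printf "mem_%s=%s\n", $1, $2
       }' "$cg/memory.events"

  # PSI "some" line: share of the last 10s in which at least one task stalled.
  awk '$1 == "some" { split($2, kv, "="); printf "mem_some_avg10=%s\n", kv[2] }' \
    "$cg/memory.pressure"
}

# Usage: cg_summary /sys/fs/cgroup/ai.slice
```

Run it before and after a load test: a growing mem_high with mem_oom_kill at zero suggests MemoryHigh is doing its job, while a high cpu_throttled_pct means the CPU quota is the bottleneck.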
Step 4) Optional: protect the rest of the machine with OOM policy
If your distro enables systemd-oomd, it can make pressure-based kill decisions at cgroup level before full kernel OOM chaos.
Quick check:
systemctl status systemd-oomd --no-pager
oomctl
If you tune ManagedOOM* settings later, test carefully in a non-production window.
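If you do get to that point, the tuning lives in a drop-in for the slice. A hedged sketch: oomd_dropin is a hypothetical helper, the 60 is only a placeholder percentage, and the file name 10-oomd.conf is made up. ManagedOOMMemoryPressure= and ManagedOOMMemoryPressureLimit= are the settings documented in systemd.resource-control(5).

```shell
# Print a drop-in asking systemd-oomd to act on memory pressure in the slice.
# The argument is a whole-number percentage between 1 and 100.
oomd_dropin() {
  limit="${1:-60}"
  case "$limit" in
    ''|*[!0-9]*)
      echo "limit must be a whole-number percentage" >&2
      return 1
      ;;
  esac
  if [ "$limit" -lt 1 ] || [ "$limit" -gt 100 ]; then
    echo "limit must be between 1 and 100" >&2
    return 1
  fi
  printf '[Slice]\nManagedOOMMemoryPressure=kill\nManagedOOMMemoryPressureLimit=%s%%\n' "$limit"
}

# Usage:
#   sudo mkdir -p /etc/systemd/system/ai.slice.d
#   oomd_dropin 60 | sudo tee /etc/systemd/system/ai.slice.d/10-oomd.conf >/dev/null
#   sudo systemctl daemon-reload
#   oomctl   # ai.slice should now appear among the monitored cgroups
```

With this in place, systemd-oomd picks a cgroup under ai.slice to kill when pressure stays above the limit, so expect whole jobs to disappear rather than single processes.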
Step 5) Make this your default execution pattern
For repeatability, add a tiny wrapper:
sudo tee /usr/local/bin/ai-run >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
if [ "$#" -lt 1 ]; then
echo "Usage: ai-run <command...>" >&2
exit 1
fi
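# Append the PID so two jobs started in the same second get distinct unit names.
unit="ai-job-$(date +%s)-$$"
# --uid and --same-dir run the job as the calling user in the current directory;
# otherwise it starts as root in / and relative paths do not resolve.
exec systemd-run --unit="$unit" --slice=ai.slice --uid="$(id -u)" --same-dir --property=Type=exec --collect "$@"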
EOF
sudo chmod +x /usr/local/bin/ai-run
Usage:
ai-run ollama run llama3.1:8b "Give me a 10-line summary"
ai-run python3 scripts/nightly_eval.py
Common pitfalls
- Setting only MemoryMax
  - You get abrupt kills without early reclaim behavior. Prefer MemoryHigh + MemoryMax.
- Forgetting the slice on ad-hoc runs
  - If you run commands directly, they escape your guardrails.
- No observability loop
  - Always inspect memory.events, memory.pressure, and cpu.stat after load tests.
- Copy-pasting limits between machines
  - Tune limits to your actual RAM/CPU and workload profile.
Final takeaway
Self-hosted AI gets dramatically more stable when you treat resource isolation as a first-class feature.
A dedicated systemd slice with cgroup v2 limits gives you:
- fewer surprise lockups,
- better multi-tenant behavior on one host,
- and safer experimentation when you’re testing new models.
If you only implement one thing this week, make it ai.slice + systemd-run --slice=ai.slice.
References
- Linux kernel docs — Control Group v2: https://docs.kernel.org/admin-guide/cgroup-v2.html
- Linux kernel docs — Pressure Stall Information (PSI): https://docs.kernel.org/accounting/psi.html
- man7 — systemd.resource-control(5): https://man7.org/linux/man-pages/man5/systemd.resource-control.5.html
- man7 — systemd-run(1): https://man7.org/linux/man-pages/man1/systemd-run.1.html
- man7 — systemd.slice(5): https://man7.org/linux/man-pages/man5/systemd.slice.5.html
- man7 — systemd-oomd.service(8): https://man7.org/linux/man-pages/man8/systemd-oomd.service.8.html