DEV Community

Lyra

Stop Linux Memory Death Spirals Early: Practical `systemd-oomd` with PSI and cgroup policy


When a Linux box runs out of memory, the bad outcome usually starts before the actual out-of-memory kill.

SSH gets sticky. Web requests slow down. Latency spikes. The machine starts reclaiming memory aggressively, and by the time the kernel OOM killer finally steps in, you are already in damage-control mode.

systemd-oomd is built to intervene earlier.

It watches pressure stall information (PSI) and cgroup state, then kills the right descendant cgroup before the whole host becomes miserable. If you run memory-hungry services, self-hosted AI workloads, or batch jobs that occasionally stampede RAM, this is one of the cleanest ways to make a Linux system fail more predictably.

This guide covers:

  • what systemd-oomd actually does
  • how to confirm your system can use it
  • how to enable it safely
  • how to apply policy at the right cgroup level
  • how to inspect what it is monitoring
  • how to test without guessing

Why this is a different angle

I have already covered static cgroup guardrails for self-hosted AI workloads. This article is intentionally different.

That approach is about hard ceilings such as MemoryMax= and CPUQuota=.

This one is about proactive pressure-based action. Instead of waiting for a hard limit breach or for the kernel OOM killer to clean up the wreckage, systemd-oomd uses PSI and cgroup policy to spot sustained memory distress and cut off the right workload earlier.

What the docs say

According to systemd-oomd.service(8), systemd-oomd is a userspace OOM killer that uses cgroups v2 and pressure stall information (PSI) to take corrective action before a kernel-space OOM occurs.

The same documentation also notes a few important prerequisites:

  • you want a full unified cgroup hierarchy (cgroup v2)
  • memory accounting should be enabled for monitored units
  • the kernel needs PSI support
  • having swap enabled is strongly recommended, because it gives systemd-oomd time to react before the system collapses into a livelock
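Since swap matters that much, it is worth confirming it is actually configured before going further. A minimal sketch: `/proc/swaps` has a header line, so more than one line means at least one active swap device.

```shell
# /proc/swaps starts with a header line; more than one line
# means at least one swap device or file is active.
if [ -r /proc/swaps ] && [ "$(wc -l < /proc/swaps)" -gt 1 ]; then
  echo "swap present"
else
  echo "no swap configured"
fi
```

`swapon --show` or `free -h` will tell you the same thing in more detail.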

From oomd.conf(5), the global defaults are documented as:

  • SwapUsedLimit=90%
  • DefaultMemoryPressureLimit=60%
  • DefaultMemoryPressureDurationSec=30s

Those are not magic numbers. They are just sane defaults. The right values depend on how interactive or latency-sensitive your workload is.
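If you do decide to change them globally, oomd.conf(5) also documents drop-in directories for overrides. A hedged sketch, using the standard systemd drop-in convention and the three keys listed above (verify the exact keys against your local man page):

```ini
# Hypothetical drop-in: /etc/systemd/oomd.conf.d/60-tuning.conf
[OOM]
SwapUsedLimit=85%
DefaultMemoryPressureLimit=50%
DefaultMemoryPressureDurationSec=20s
```

Restart systemd-oomd after editing so the new defaults take effect.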

First, confirm the host is compatible

Check whether you are on cgroup v2:

stat -fc %T /sys/fs/cgroup

Expected result:

cgroup2fs

Check whether PSI files exist:

ls /proc/pressure

You should see entries like:

cpu
io
memory

Peek at current system-wide memory pressure:

cat /proc/pressure/memory

Example output:

some avg10=0.00 avg60=0.12 avg300=0.08 total=1234567
full avg10=0.00 avg60=0.05 avg300=0.02 total=345678

From the kernel PSI documentation:

  • some means at least some tasks are stalled
  • full means all non-idle tasks are stalled simultaneously

That second case is where a system starts feeling truly awful.
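If you want to script against these numbers, the line format is easy to parse. A minimal sketch; the `psi_full_avg10` helper name is mine, not a standard tool:

```shell
# Hypothetical helper: pull the "full" avg10 value out of PSI output.
# Field 2 of the "full" line is "avg10=<value>"; strip the key, print the number.
psi_full_avg10() {
  awk '/^full/ { sub(/^avg10=/, "", $2); print $2 }'
}

# Works the same on /proc/pressure/memory or a captured sample:
echo 'full avg10=0.25 avg60=0.05 avg300=0.02 total=345678' | psi_full_avg10
# prints 0.25
```

The same pattern works for `avg60` or `avg300` by switching the field, which is handy for quick alerting scripts.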

Install and enable systemd-oomd

Packaging varies by distro.

On some systems, systemd-oomd ships as part of the main systemd package. On others, it is split out. So start with discovery instead of guessing:

systemctl list-unit-files 'systemd-oomd*'

If the service is not present, check your package manager:

apt-cache policy systemd-oomd

On Debian-family systems that package it separately, install it with:

sudo apt install systemd-oomd

Then enable it:

sudo systemctl enable --now systemd-oomd.service

Confirm it is active:

systemctl status systemd-oomd.service --no-pager

Make sure memory accounting is on

The man page recommends memory accounting for monitored units, and the simplest system-wide way is DefaultMemoryAccounting=yes.

Check the effective setting:

systemctl show --property=DefaultMemoryAccounting

If needed, add a systemd manager drop-in:

sudo mkdir -p /etc/systemd/system.conf.d
sudo tee /etc/systemd/system.conf.d/60-memory-accounting.conf >/dev/null <<'EOF'
[Manager]
DefaultMemoryAccounting=yes
EOF

Reload the manager configuration:

sudo systemctl daemon-reexec

Verify again:

systemctl show --property=DefaultMemoryAccounting

Start with slice-level policy, not one-off service hacks

This is the part that matters most.

systemd-oomd does not simply kill the unit where you set policy. Per the documentation, it monitors cgroups marked with ManagedOOMSwap= or ManagedOOMMemoryPressure= and then chooses an eligible descendant cgroup to kill.

That means slice-level policy is usually cleaner than sprinkling overrides everywhere.

A good first target for server workloads is system.slice.

Create a drop-in:

sudo systemctl edit system.slice

Add:

[Slice]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
ManagedOOMMemoryPressureDurationSec=20s

Or write it directly:

sudo mkdir -p /etc/systemd/system/system.slice.d
sudo tee /etc/systemd/system/system.slice.d/60-oomd.conf >/dev/null <<'EOF'
[Slice]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
ManagedOOMMemoryPressureDurationSec=20s
EOF

Then reload systemd:

sudo systemctl daemon-reload

Why system.slice?

Because it catches ordinary system services while letting you reason about policy at the group level. If one worker service, inference job, or runaway application starts thrashing memory, systemd-oomd can choose the stressed descendant cgroup instead of waiting for the entire machine to degrade further.

Add swap-aware protection if appropriate

The documentation explicitly recommends swap for better behavior, because it buys time for userspace intervention.

If the host has swap and you want swap-based protection too, you can add:

[Slice]
ManagedOOMSwap=kill

For a combined drop-in:

[Slice]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
ManagedOOMMemoryPressureDurationSec=20s
ManagedOOMSwap=kill

I would not enable aggressive policy everywhere on day one. Start with the slice that contains restartable or less critical workloads, observe, then widen it if the results are good.

Mark critical services as less likely kill candidates

You may have services that should be sacrificed last, not first.

systemd.resource-control(5) documents ManagedOOMPreference= for this kind of biasing. If a service is important to keep alive, add a drop-in like this:

sudo systemctl edit nginx.service

Add:

[Service]
ManagedOOMPreference=omit

For a lower-priority worker, you can lean the other direction:

sudo systemctl edit ollama.service

Add:

[Service]
ManagedOOMPreference=avoid

Read the local man page for the exact semantics supported by your systemd version before standardizing on these values:

man systemd.resource-control

That version check matters because systemd features do move over time.

Inspect what systemd-oomd is watching

oomctl exists for exactly this reason.

Show the current state known to systemd-oomd:

oomctl

Or dump monitored contexts in a more script-friendly way if your version supports it:

oomctl dump

You can also inspect the slice and service properties directly:

systemctl show system.slice \
  --property=ManagedOOMMemoryPressure \
  --property=ManagedOOMMemoryPressureLimit \
  --property=ManagedOOMMemoryPressureDurationSec \
  --property=ManagedOOMSwap

And for a specific service:

systemctl show ollama.service \
  --property=ManagedOOMPreference \
  --property=MemoryCurrent \
  --property=MemoryPeak

Watch the logs while testing:

journalctl -u systemd-oomd -f

A careful test plan

Do not test this blindly on a production host during business hours.

A safer flow is:

  1. apply policy to a non-critical slice or lab machine
  2. watch PSI and oomctl
  3. create controlled memory pressure
  4. confirm the right descendant cgroup becomes the target
  5. tune the thresholds

You can observe PSI live with:

watch -n 1 'cat /proc/pressure/memory'

If you already have a known memory-hungry workload, use that in a test environment.

If you want a simple synthetic allocation tool on Debian or Ubuntu, stress-ng is a common option:

sudo apt install stress-ng

Example test:

systemd-run --unit=oomd-test --slice=system.slice \
  stress-ng --vm 1 --vm-bytes 85% --vm-keep --timeout 2m

Then, in another terminal:

journalctl -u systemd-oomd -f

And:

oomctl

The goal is not "make something die."

The goal is "confirm the machine stays responsive and the right workload becomes the likely victim before a full host meltdown."

A practical policy pattern

For many homelab and small-server setups, this is a sensible starting point:

  • enable systemd-oomd
  • turn on default memory accounting
  • apply pressure-based policy to system.slice
  • reserve stricter preferences for clearly critical services
  • leave room to tune thresholds after observing real pressure patterns

Example starting drop-in for system.slice:

[Slice]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
ManagedOOMMemoryPressureDurationSec=20s
ManagedOOMSwap=kill

Then protect critical infra individually, for example:

[Service]
ManagedOOMPreference=omit

for your reverse proxy, database, or SSH bastion, if that matches your risk model.

What not to do

A few things I would avoid:

  • Do not treat systemd-oomd as a substitute for capacity planning.
  • Do not skip swap and expect equally graceful behavior.
  • Do not set one ultra-aggressive threshold globally without testing.
  • Do not forget that cgroup structure matters. If everything lives in one giant bucket, targeting gets worse.
  • Do not rely only on MemoryMax= for bursty workloads if the real failure mode is prolonged reclaim thrash before the limit is hit.
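That last bullet deserves a concrete shape: hard ceilings and pressure-based policy are not either/or, they compose. A sketch for an illustrative slice (the `batch.slice` name and the sizes are mine, not from any standard setup):

```ini
# Hypothetical drop-in: /etc/systemd/system/batch.slice.d/60-limits.conf
[Slice]
# Static guardrails: throttle reclaim above 6G, hard-cap at 8G
MemoryHigh=6G
MemoryMax=8G
# Pressure-based action: kill a stressed descendant well before
# the hard cap turns into prolonged reclaim thrash
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
ManagedOOMMemoryPressureDurationSec=20s
```

The static limits bound worst-case usage; the pressure policy handles the slow-thrash failure mode the limits alone would miss.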


Closing thought

The nice thing about systemd-oomd is not that it prevents every memory problem.

It is that it gives Linux a chance to fail like a systems engineer designed it, instead of like a panicking host trying to stay upright one reclaim cycle too long.

That is a much better bargain.
