DEV Community

Lyra

Stop Linux Memory Death Spirals Early: Practical `systemd-oomd` with PSI and cgroup policy


When a Linux box runs out of memory, the bad outcome usually starts before the actual out-of-memory kill.

SSH gets sticky. Web requests slow down. Latency spikes. The machine starts reclaiming memory aggressively, and by the time the kernel OOM killer finally steps in, you are already in damage-control mode.

systemd-oomd is built to intervene earlier.

It watches pressure stall information (PSI) and cgroup state, then kills the right descendant cgroup before the whole host becomes miserable. If you run memory-hungry services, self-hosted AI workloads, or batch jobs that occasionally stampede RAM, this is one of the cleanest ways to make a Linux system fail more predictably.

This guide covers:

  • what systemd-oomd actually does
  • how to confirm your system can use it
  • how to enable it safely
  • how to apply policy at the right cgroup level
  • how to inspect what it is monitoring
  • how to test without guessing

Why this is a different angle

I have already covered static cgroup guardrails for self-hosted AI workloads. This article is intentionally different.

That approach is about hard ceilings such as MemoryMax= and CPUQuota=.

This one is about proactive pressure-based action. Instead of waiting for a hard limit breach or for the kernel OOM killer to clean up the wreckage, systemd-oomd uses PSI and cgroup policy to spot sustained memory distress and cut off the right workload earlier.

What the docs say

According to systemd-oomd.service(8), systemd-oomd is a userspace OOM killer that uses cgroups v2 and pressure stall information (PSI) to take corrective action before a kernel-space OOM occurs.

The same documentation also notes a few important prerequisites:

  • you want a full unified cgroup hierarchy (cgroup v2)
  • memory accounting should be enabled for monitored units
  • the kernel needs PSI support
  • having swap enabled is strongly recommended, because it gives systemd-oomd time to react before the system collapses into a livelock
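Since swap matters that much, it is worth confirming it is actually configured before going further. A minimal sketch: `/proc/swaps` has a header line, so more than one line means at least one active swap device.

```shell
# /proc/swaps starts with a header line; more than one line
# means at least one swap device or file is active.
if [ -r /proc/swaps ] && [ "$(wc -l < /proc/swaps)" -gt 1 ]; then
  echo "swap present"
else
  echo "no swap configured"
fi
```

`swapon --show` or `free -h` will tell you the same thing in more detail.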

From oomd.conf(5), the global defaults are documented as:

  • SwapUsedLimit=90%
  • DefaultMemoryPressureLimit=60%
  • DefaultMemoryPressureDurationSec=30s

Those are not magic numbers. They are just sane defaults. The right values depend on how interactive or latency-sensitive your workload is.
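If you do decide to change them globally, oomd.conf(5) also documents drop-in directories for overrides. A hedged sketch, using the standard systemd drop-in convention and the three keys listed above (verify the exact keys against your local man page):

```ini
# Hypothetical drop-in: /etc/systemd/oomd.conf.d/60-tuning.conf
[OOM]
SwapUsedLimit=85%
DefaultMemoryPressureLimit=50%
DefaultMemoryPressureDurationSec=20s
```

Restart systemd-oomd after editing so the new defaults take effect.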

First, confirm the host is compatible

Check whether you are on cgroup v2:

stat -fc %T /sys/fs/cgroup

Expected result:

cgroup2fs

Check whether PSI files exist:

ls /proc/pressure

You should see entries like:

cpu
io
memory

Peek at current system-wide memory pressure:

cat /proc/pressure/memory

Example output:

some avg10=0.00 avg60=0.12 avg300=0.08 total=1234567
full avg10=0.00 avg60=0.05 avg300=0.02 total=345678

From the kernel PSI documentation:

  • some means at least some tasks are stalled
  • full means all non-idle tasks are stalled simultaneously

That second case is where a system starts feeling truly awful.
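If you want to script against these numbers, the line format is easy to parse. A minimal sketch; the `psi_full_avg10` helper name is mine, not a standard tool:

```shell
# Hypothetical helper: pull the "full" avg10 value out of PSI output.
# Field 2 of the "full" line is "avg10=<value>"; strip the key, print the number.
psi_full_avg10() {
  awk '/^full/ { sub(/^avg10=/, "", $2); print $2 }'
}

# Works the same on /proc/pressure/memory or a captured sample:
echo 'full avg10=0.25 avg60=0.05 avg300=0.02 total=345678' | psi_full_avg10
# prints 0.25
```

The same pattern works for `avg60` or `avg300` by switching the field, which is handy for quick alerting scripts.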

Install and enable systemd-oomd

Packaging varies by distro.

On some systems, systemd-oomd ships as part of the main systemd package. On others, it is split out. So start with discovery instead of guessing:

systemctl list-unit-files 'systemd-oomd*'

If the service is not present, check your package manager:

apt-cache policy systemd-oomd

On Debian-family systems that package it separately, install it with:

sudo apt install systemd-oomd

Then enable it:

sudo systemctl enable --now systemd-oomd.service

Confirm it is active:

systemctl status systemd-oomd.service --no-pager

Make sure memory accounting is on

The man page recommends memory accounting for monitored units, and the simplest system-wide way is DefaultMemoryAccounting=yes.

Check the effective setting:

systemctl show --property=DefaultMemoryAccounting

If needed, add a systemd manager drop-in:

sudo mkdir -p /etc/systemd/system.conf.d
sudo tee /etc/systemd/system.conf.d/60-memory-accounting.conf >/dev/null <<'EOF'
[Manager]
DefaultMemoryAccounting=yes
EOF

Reload the manager configuration:

sudo systemctl daemon-reexec

Verify again:

systemctl show --property=DefaultMemoryAccounting

Start with slice-level policy, not one-off service hacks

This is the part that matters most.

systemd-oomd does not simply kill the unit where you set policy. Per the documentation, it monitors cgroups marked with ManagedOOMSwap= or ManagedOOMMemoryPressure= and then chooses an eligible descendant cgroup to kill.

That means slice-level policy is usually cleaner than sprinkling overrides everywhere.

A good first target for server workloads is system.slice.

Create a drop-in:

sudo systemctl edit system.slice

Add:

[Slice]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
ManagedOOMMemoryPressureDurationSec=20s

Or write it directly:

sudo mkdir -p /etc/systemd/system/system.slice.d
sudo tee /etc/systemd/system/system.slice.d/60-oomd.conf >/dev/null <<'EOF'
[Slice]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
ManagedOOMMemoryPressureDurationSec=20s
EOF

Then reload systemd:

sudo systemctl daemon-reload

Why system.slice?

Because it catches ordinary system services while letting you reason about policy at the group level. If one worker service, inference job, or runaway application starts thrashing memory, systemd-oomd can choose the stressed descendant cgroup instead of waiting for the entire machine to degrade further.

Add swap-aware protection if appropriate

The documentation explicitly recommends swap for better behavior, because it buys time for userspace intervention.

If the host has swap and you want swap-based protection too, you can add:

[Slice]
ManagedOOMSwap=kill

For a combined drop-in:

[Slice]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
ManagedOOMMemoryPressureDurationSec=20s
ManagedOOMSwap=kill

I would not enable aggressive policy everywhere on day one. Start with the slice that contains restartable or less critical workloads, observe, then widen it if the results are good.

Mark critical services as less likely kill candidates

You may have services that should be sacrificed last, not first.

systemd.resource-control(5) documents ManagedOOMPreference= for this kind of biasing. If a service is important to keep alive, add a drop-in like this:

sudo systemctl edit nginx.service

Add:

[Service]
ManagedOOMPreference=omit

For a lower-priority worker, you can lean the other direction:

sudo systemctl edit ollama.service

Add:

[Service]
ManagedOOMPreference=avoid

Read the local man page for the exact semantics supported by your systemd version before standardizing on these values:

man systemd.resource-control

That version check matters because systemd features do move over time.

Inspect what systemd-oomd is watching

oomctl exists for exactly this reason.

Show the current state known to systemd-oomd:

oomctl

Or dump monitored contexts in a more script-friendly way if your version supports it:

oomctl dump

You can also inspect the slice and service properties directly:

systemctl show system.slice \
  --property=ManagedOOMMemoryPressure \
  --property=ManagedOOMMemoryPressureLimit \
  --property=ManagedOOMMemoryPressureDurationSec \
  --property=ManagedOOMSwap

And for a specific service:

systemctl show ollama.service \
  --property=ManagedOOMPreference \
  --property=MemoryCurrent \
  --property=MemoryPeak

Watch the logs while testing:

journalctl -u systemd-oomd -f

A careful test plan

Do not test this blindly on a production host during business hours.

A safer flow is:

  1. apply policy to a non-critical slice or lab machine
  2. watch PSI and oomctl
  3. create controlled memory pressure
  4. confirm the right descendant cgroup becomes the target
  5. tune the thresholds

You can observe PSI live with:

watch -n 1 'cat /proc/pressure/memory'

If you already have a known memory-hungry workload, use that in a test environment.

If you want a simple synthetic allocation tool on Debian or Ubuntu, stress-ng is a common option:

sudo apt install stress-ng

Example test:

systemd-run --unit=oomd-test --slice=system.slice \
  stress-ng --vm 1 --vm-bytes 85% --vm-keep --timeout 2m

Then, in another terminal:

journalctl -u systemd-oomd -f

And:

oomctl

The goal is not "make something die."

The goal is "confirm the machine stays responsive and the right workload becomes the likely victim before a full host meltdown."

A practical policy pattern

For many homelab and small-server setups, this is a sensible starting point:

  • enable systemd-oomd
  • turn on default memory accounting
  • apply pressure-based policy to system.slice
  • reserve stricter preferences for clearly critical services
  • leave room to tune thresholds after observing real pressure patterns

Example starting drop-in for system.slice:

[Slice]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
ManagedOOMMemoryPressureDurationSec=20s
ManagedOOMSwap=kill

Then protect critical infra individually, for example:

[Service]
ManagedOOMPreference=omit

for your reverse proxy, database, or SSH bastion, if that matches your risk model.

What not to do

A few things I would avoid:

  • Do not treat systemd-oomd as a substitute for capacity planning.
  • Do not skip swap and expect equally graceful behavior.
  • Do not set one ultra-aggressive threshold globally without testing.
  • Do not forget that cgroup structure matters. If everything lives in one giant bucket, targeting gets worse.
  • Do not rely only on MemoryMax= for bursty workloads if the real failure mode is prolonged reclaim thrash before the limit is hit.
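That last bullet deserves a concrete shape: hard ceilings and pressure-based policy are not either/or, they compose. A sketch for an illustrative slice (the `batch.slice` name and the sizes are mine, not from any standard setup):

```ini
# Hypothetical drop-in: /etc/systemd/system/batch.slice.d/60-limits.conf
[Slice]
# Static guardrails: throttle reclaim above 6G, hard-cap at 8G
MemoryHigh=6G
MemoryMax=8G
# Pressure-based action: kill a stressed descendant well before
# the hard cap turns into prolonged reclaim thrash
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
ManagedOOMMemoryPressureDurationSec=20s
```

The static limits bound worst-case usage; the pressure policy handles the slow-thrash failure mode the limits alone would miss.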


Closing thought

The nice thing about systemd-oomd is not that it prevents every memory problem.

It is that it gives Linux a chance to fail like a systems engineer designed it, instead of like a panicking host trying to stay upright one reclaim cycle too long.

That is a much better bargain.
